From 29a50257a9d6d16320bc5e2fc1b15e0f2b2eb7cf Mon Sep 17 00:00:00 2001 From: Bjorn Andersson Date: Tue, 28 May 2019 17:57:09 -0700 Subject: dt-bindings: PCI: qcom: Add QCS404 to the binding The Qualcomm QCS404 platform contains a PCIe controller, add this to the Qualcomm PCI binding document. The controller is the same version as the one used in IPQ4019, but the PHY part is described separately, hence the difference in clocks and resets. Signed-off-by: Bjorn Andersson Signed-off-by: Lorenzo Pieralisi Reviewed-by: Rob Herring Reviewed-by: Vinod Koul --- .../devicetree/bindings/pci/qcom,pcie.txt | 25 ++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pci/qcom,pcie.txt b/Documentation/devicetree/bindings/pci/qcom,pcie.txt index 1fd703bd73e0..ada80b01bf0c 100644 --- a/Documentation/devicetree/bindings/pci/qcom,pcie.txt +++ b/Documentation/devicetree/bindings/pci/qcom,pcie.txt @@ -10,6 +10,7 @@ - "qcom,pcie-msm8996" for msm8996 or apq8096 - "qcom,pcie-ipq4019" for ipq4019 - "qcom,pcie-ipq8074" for ipq8074 + - "qcom,pcie-qcs404" for qcs404 - reg: Usage: required @@ -116,6 +117,15 @@ - "ahb" AHB clock - "aux" Auxiliary clock +- clock-names: + Usage: required for qcs404 + Value type: + Definition: Should contain the following entries + - "iface" AHB clock + - "aux" Auxiliary clock + - "master_bus" AXI Master clock + - "slave_bus" AXI Slave clock + - resets: Usage: required Value type: @@ -167,6 +177,17 @@ - "ahb" AHB Reset - "axi_m_sticky" AXI Master Sticky reset +- reset-names: + Usage: required for qcs404 + Value type: + Definition: Should contain the following entries + - "axi_m" AXI Master reset + - "axi_s" AXI Slave reset + - "axi_m_sticky" AXI Master Sticky reset + - "pipe_sticky" PIPE sticky reset + - "pwr" PWR reset + - "ahb" AHB reset + - power-domains: Usage: required for apq8084 and msm8996/apq8096 Value type: @@ -195,12 +216,12 @@ Definition: A phandle to the PCIe 
endpoint power supply - phys: - Usage: required for apq8084 + Usage: required for apq8084 and qcs404 Value type: Definition: List of phandle(s) as listed in phy-names property - phy-names: - Usage: required for apq8084 + Usage: required for apq8084 and qcs404 Value type: Definition: Should contain "pciephy" -- cgit From c42eaffa16568a538f12dfebd99624659992913a Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:23 +0800 Subject: Documentation: add Linux PCI to Sphinx TOC tree Add index.rst for PCI subsystem. More docs will be added later. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas --- Documentation/PCI/index.rst | 9 +++++++++ Documentation/index.rst | 1 + 2 files changed, 10 insertions(+) create mode 100644 Documentation/PCI/index.rst (limited to 'Documentation') diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst new file mode 100644 index 000000000000..c2f8728d11cf --- /dev/null +++ b/Documentation/PCI/index.rst @@ -0,0 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Linux PCI Bus Subsystem +======================= + +.. toctree:: + :maxdepth: 2 + :numbered: diff --git a/Documentation/index.rst b/Documentation/index.rst index a7566ef62411..4afa431d9b1f 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -101,6 +101,7 @@ needed). filesystems/index vm/index bpf/index + PCI/index misc-devices/index Architecture-specific documentation -- cgit From 229b4e0728e0a6ddca2645e73696d5b104fbbbfb Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:24 +0800 Subject: Documentation: PCI: convert pci.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Move the description of struct pci_driver and struct pci_device_id into in-source comments. 
Signed-off-by: Changbin Du [bhelgaas: fix kernel-doc warnings related to moving descriptions to linux/pci.h, fix "space tab" whitespace errors in mod_devicetable.h] Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/index.rst | 2 + Documentation/PCI/pci.rst | 578 ++++++++++++++++++++++++++++++++++++++++ Documentation/PCI/pci.txt | 636 -------------------------------------------- 3 files changed, 580 insertions(+), 636 deletions(-) create mode 100644 Documentation/PCI/pci.rst delete mode 100644 Documentation/PCI/pci.txt (limited to 'Documentation') diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index c2f8728d11cf..7babf43709b0 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -7,3 +7,5 @@ Linux PCI Bus Subsystem .. toctree:: :maxdepth: 2 :numbered: + + pci diff --git a/Documentation/PCI/pci.rst b/Documentation/PCI/pci.rst new file mode 100644 index 000000000000..6864f9a70f5f --- /dev/null +++ b/Documentation/PCI/pci.rst @@ -0,0 +1,578 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================== +How To Write Linux PCI Drivers +============================== + +:Authors: - Martin Mares + - Grant Grundler + +The world of PCI is vast and full of (mostly unpleasant) surprises. +Since each CPU architecture implements different chip-sets and PCI devices +have different requirements (erm, "features"), the result is the PCI support +in the Linux kernel is not as trivial as one would wish. This short paper +tries to introduce all potential driver authors to Linux APIs for +PCI device drivers. + +A more complete resource is the third edition of "Linux Device Drivers" +by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. +LDD3 is available for free (under Creative Commons License) from: +http://lwn.net/Kernel/LDD3/. + +However, keep in mind that all documents are subject to "bit rot". +Refer to the source code if things are not working as described here. 
+ +Please send questions/comments/patches about Linux PCI API to the +"Linux PCI" mailing list. + + +Structure of PCI drivers +======================== +PCI drivers "discover" PCI devices in a system via pci_register_driver(). +Actually, it's the other way around. When the PCI generic code discovers +a new device, the driver with a matching "description" will be notified. +Details on this below. + +pci_register_driver() leaves most of the probing for devices to +the PCI layer and supports online insertion/removal of devices [thus +supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver]. +pci_register_driver() call requires passing in a table of function +pointers and thus dictates the high level structure of a driver. + +Once the driver knows about a PCI device and takes ownership, the +driver generally needs to perform the following initialization: + + - Enable the device + - Request MMIO/IOP resources + - Set the DMA mask size (for both coherent and streaming DMA) + - Allocate and initialize shared control data (pci_allocate_coherent()) + - Access device configuration space (if needed) + - Register IRQ handler (request_irq()) + - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) + - Enable DMA/processing engines + +When done using the device, and perhaps the module needs to be unloaded, +the driver needs to take the follow steps: + + - Disable the device from generating IRQs + - Release the IRQ (free_irq()) + - Stop all DMA activity + - Release DMA buffers (both streaming and coherent) + - Unregister from other subsystems (e.g. scsi or netdev) + - Release MMIO/IOP resources + - Disable the device + +Most of these topics are covered in the following sections. +For the rest look at LDD3 or . 
+
+If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
+the PCI functions described below are defined as inline functions either
+completely empty or just returning an appropriate error codes to avoid
+lots of ifdefs in the drivers.
+
+
+pci_register_driver() call
+==========================
+
+PCI device drivers call ``pci_register_driver()`` during their
+initialization with a pointer to a structure describing the driver
+(``struct pci_driver``):
+
+.. kernel-doc:: include/linux/pci.h
+   :functions: pci_driver
+
+The ID table is an array of ``struct pci_device_id`` entries ending with an
+all-zero entry. Definitions with static const are generally preferred.
+
+.. kernel-doc:: include/linux/mod_devicetable.h
+   :functions: pci_device_id
+
+Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
+a pci_device_id table.
+
+New PCI IDs may be added to a device driver pci_ids table at runtime
+as shown below::
+
+  echo "vendor device subvendor subdevice class class_mask driver_data" > \
+  /sys/bus/pci/drivers/{driver}/new_id
+
+All fields are passed in as hexadecimal values (no leading 0x).
+The vendor and device fields are mandatory, the others are optional. Users
+need pass only as many optional fields as necessary:
+
+  - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
+  - class and classmask fields default to 0
+  - driver_data defaults to 0UL.
+
+Note that driver_data must match the value used by any of the pci_device_id
+entries defined in the driver. This makes the driver_data field mandatory
+if all the pci_device_id entries have a non-zero driver_data value.
+
+Once added, the driver probe routine will be invoked for any unclaimed
+PCI devices listed in its (newly updated) pci_ids list.
+
+When the driver exits, it just calls pci_unregister_driver() and the PCI layer
+automatically calls the remove hook for all devices handled by the driver.
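Editorial illustration (not part of the patch): the ID-table fields and the new_id defaults described above can be pictured with a small user-space model of the matching rule. This is a sketch under stated assumptions, not the kernel implementation: `struct model_id`, `struct model_dev`, `MODEL_ANY_ID`, `field_matches()` and `model_match()` are hypothetical names, and the comparison is assumed to follow the usual pattern (a field matches if it is the wildcard or equal, and the class is compared only under class_mask).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PCI_ANY_ID (all ones), used as a wildcard. */
#define MODEL_ANY_ID 0xffffffffu

/* Simplified model of one ID-table entry. */
struct model_id {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t dev_class, class_mask;
	unsigned long driver_data;
};

/* Simplified model of the identity fields of a device. */
struct model_dev {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t dev_class;
};

/* A table field matches if it is the wildcard or equals the device field. */
static int field_matches(uint32_t want, uint32_t have)
{
	return want == MODEL_ANY_ID || want == have;
}

/*
 * Walk a table that ends with an all-zero entry and return the first
 * entry matching dev, or NULL if no entry matches.
 */
static const struct model_id *model_match(const struct model_id *table,
					   const struct model_dev *dev)
{
	const struct model_id *id;

	for (id = table; id->vendor || id->subvendor || id->class_mask; id++) {
		if (field_matches(id->vendor, dev->vendor) &&
		    field_matches(id->device, dev->device) &&
		    field_matches(id->subvendor, dev->subvendor) &&
		    field_matches(id->subdevice, dev->subdevice) &&
		    !((id->dev_class ^ dev->dev_class) & id->class_mask))
			return id;
	}
	return NULL;
}
```

In this model a class_mask of 0 makes the class comparison always succeed, which is consistent with the statement above that class and classmask default to 0 when they are omitted from a new_id write.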
+
+
+"Attributes" for driver functions/data
+--------------------------------------
+
+Please mark the initialization and cleanup functions where appropriate
+(the corresponding macros are defined in ):
+
+	====== =================================================
+	__init Initialization code. Thrown away after the driver
+	       initializes.
+	__exit Exit code. Ignored for non-modular drivers.
+	====== =================================================
+
+Tips on when/where to use the above attributes:
+	- The module_init()/module_exit() functions (and all
+	  initialization functions called _only_ from these)
+	  should be marked __init/__exit.
+
+	- Do not mark the struct pci_driver.
+
+	- Do NOT mark a function if you are not sure which mark to use.
+	  Better to not mark the function than mark the function wrong.
+
+
+How to find PCI devices manually
+================================
+
+PCI drivers should have a really good reason for not using the
+pci_register_driver() interface to search for PCI devices.
+The main reason PCI devices are controlled by multiple drivers
+is because one PCI device implements several different HW services.
+E.g. combined serial/parallel port/floppy controller.
+
+A manual search may be performed using the following constructs:
+
+Searching by vendor and device ID::
+
+	struct pci_dev *dev = NULL;
+	while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev))
+		configure_device(dev);
+
+Searching by class ID (iterate in a similar way)::
+
+	pci_get_class(CLASS_ID, dev)
+
+Searching by both vendor/device and subsystem vendor/device ID::
+
+	pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).
+
+You can use the constant PCI_ANY_ID as a wildcard replacement for
+VENDOR_ID or DEVICE_ID. This allows searching for any device from a
+specific vendor, for example.
+
+These functions are hotplug-safe. They increment the reference count on
+the pci_dev that they return.
You must eventually (possibly at module unload) +decrement the reference count on these devices by calling pci_dev_put(). + + +Device Initialization Steps +=========================== + +As noted in the introduction, most PCI drivers need the following steps +for device initialization: + + - Enable the device + - Request MMIO/IOP resources + - Set the DMA mask size (for both coherent and streaming DMA) + - Allocate and initialize shared control data (pci_allocate_coherent()) + - Access device configuration space (if needed) + - Register IRQ handler (request_irq()) + - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) + - Enable DMA/processing engines. + +The driver can access PCI config space registers at any time. +(Well, almost. When running BIST, config space can go away...but +that will just result in a PCI Bus Master Abort and config reads +will return garbage). + + +Enable the PCI device +--------------------- +Before touching any device registers, the driver needs to enable +the PCI device by calling pci_enable_device(). This will: + + - wake up the device if it was in suspended state, + - allocate I/O and memory regions of the device (if BIOS did not), + - allocate an IRQ (if BIOS did not). + +.. note:: + pci_enable_device() can fail! Check the return value. + +.. warning:: + OS BUG: we don't check resource allocations before enabling those + resources. The sequence would make more sense if we called + pci_request_resources() before calling pci_enable_device(). + Currently, the device drivers can't detect the bug when when two + devices have been allocated the same range. This is not a common + problem and unlikely to get fixed soon. + + This has been discussed before but not changed as of 2.6.19: + http://lkml.org/lkml/2006/3/2/194 + + +pci_set_master() will enable DMA by setting the bus master bit +in the PCI_COMMAND register. It also fixes the latency timer value if +it's set to something bogus by the BIOS. 
pci_clear_master() will +disable DMA by clearing the bus master bit. + +If the PCI device can use the PCI Memory-Write-Invalidate transaction, +call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval +and also ensures that the cache line size register is set correctly. +Check the return value of pci_set_mwi() as not all architectures +or chip-sets may support Memory-Write-Invalidate. Alternatively, +if Mem-Wr-Inval would be nice to have but is not required, call +pci_try_set_mwi() to have the system do its best effort at enabling +Mem-Wr-Inval. + + +Request MMIO/IOP resources +-------------------------- +Memory (MMIO), and I/O port addresses should NOT be read directly +from the PCI device config space. Use the values in the pci_dev structure +as the PCI "bus address" might have been remapped to a "host physical" +address by the arch/chip-set specific kernel support. + +See Documentation/io-mapping.txt for how to access device registers +or device memory. + +The device driver needs to call pci_request_region() to verify +no other device is already using the same address resource. +Conversely, drivers should call pci_release_region() AFTER +calling pci_disable_device(). +The idea is to prevent two devices colliding on the same address range. + +.. tip:: + See OS BUG comment above. Currently (2.6.19), The driver can only + determine MMIO and IO Port resource availability _after_ calling + pci_enable_device(). + +Generic flavors of pci_request_region() are request_mem_region() +(for MMIO ranges) and request_region() (for IO Port ranges). +Use these for address resources that are not described by "normal" PCI +BARs. + +Also see pci_request_selected_regions() below. + + +Set the DMA mask size +--------------------- +.. note:: + If anything below doesn't make sense, please refer to + Documentation/DMA-API.txt. 
This section is just a reminder that + drivers need to indicate DMA capabilities of the device and is not + an authoritative source for DMA interfaces. + +While all drivers should explicitly indicate the DMA capability +(e.g. 32 or 64 bit) of the PCI bus master, devices with more than +32-bit bus master capability for streaming data need the driver +to "register" this capability by calling pci_set_dma_mask() with +appropriate parameters. In general this allows more efficient DMA +on systems where System RAM exists above 4G _physical_ address. + +Drivers for all PCI-X and PCIe compliant devices must call +pci_set_dma_mask() as they are 64-bit DMA devices. + +Similarly, drivers must also "register" this capability if the device +can directly address "consistent memory" in System RAM above 4G physical +address by calling pci_set_consistent_dma_mask(). +Again, this includes drivers for all PCI-X and PCIe compliant devices. +Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are +64-bit DMA capable for payload ("streaming") data but not control +("consistent") data. + + +Setup shared control data +------------------------- +Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) +memory. See Documentation/DMA-API.txt for a full description of +the DMA APIs. This section is just a reminder that it needs to be done +before enabling DMA on the device. + + +Initialize device registers +--------------------------- +Some drivers will need specific "capability" fields programmed +or other "vendor specific" register initialized or reset. +E.g. clearing pending interrupts. + + +Register IRQ handler +-------------------- +While calling request_irq() is the last step described here, +this is often just another intermediate step to initialize a device. +This step can often be deferred until the device is opened for use. 
+ +All interrupt handlers for IRQ lines should be registered with IRQF_SHARED +and use the devid to map IRQs to devices (remember that all PCI IRQ lines +can be shared). + +request_irq() will associate an interrupt handler and device handle +with an interrupt number. Historically interrupt numbers represent +IRQ lines which run from the PCI device to the Interrupt controller. +With MSI and MSI-X (more below) the interrupt number is a CPU "vector". + +request_irq() also enables the interrupt. Make sure the device is +quiesced and does not have any interrupts pending before registering +the interrupt handler. + +MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" +which deliver interrupts to the CPU via a DMA write to a Local APIC. +The fundamental difference between MSI and MSI-X is how multiple +"vectors" get allocated. MSI requires contiguous blocks of vectors +while MSI-X can allocate several individual ones. + +MSI capability can be enabled by calling pci_alloc_irq_vectors() with the +PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This +causes the PCI support to program CPU vector data into the PCI device +capability registers. Many architectures, chip-sets, or BIOSes do NOT +support MSI or MSI-X and a call to pci_alloc_irq_vectors with just +the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always +specify PCI_IRQ_LEGACY as well. + +Drivers that have different interrupt handlers for MSI/MSI-X and +legacy INTx should chose the right one based on the msi_enabled +and msix_enabled flags in the pci_dev structure after calling +pci_alloc_irq_vectors. + +There are (at least) two really good reasons for using MSI: + +1) MSI is an exclusive interrupt vector by definition. + This means the interrupt handler doesn't have to verify + its device caused the interrupt. + +2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed + to be visible to the host CPU(s) when the MSI is delivered. 
This + is important for both data coherency and avoiding stale control data. + This guarantee allows the driver to omit MMIO reads to flush + the DMA stream. + +See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples +of MSI/MSI-X usage. + + +PCI device shutdown +=================== + +When a PCI device driver is being unloaded, most of the following +steps need to be performed: + + - Disable the device from generating IRQs + - Release the IRQ (free_irq()) + - Stop all DMA activity + - Release DMA buffers (both streaming and consistent) + - Unregister from other subsystems (e.g. scsi or netdev) + - Disable device from responding to MMIO/IO Port addresses + - Release MMIO/IO Port resource(s) + + +Stop IRQs on the device +----------------------- +How to do this is chip/device specific. If it's not done, it opens +the possibility of a "screaming interrupt" if (and only if) +the IRQ is shared with another device. + +When the shared IRQ handler is "unhooked", the remaining devices +using the same IRQ line will still need the IRQ enabled. Thus if the +"unhooked" device asserts IRQ line, the system will respond assuming +it was one of the remaining devices asserted the IRQ line. Since none +of the other devices will handle the IRQ, the system will "hang" until +it decides the IRQ isn't going to get handled and masks the IRQ (100,000 +iterations later). Once the shared IRQ is masked, the remaining devices +will stop functioning properly. Not a nice situation. + +This is another reason to use MSI or MSI-X if it's available. +MSI and MSI-X are defined to be exclusive interrupts and thus +are not susceptible to the "screaming interrupt" problem. + + +Release the IRQ +--------------- +Once the device is quiesced (no more IRQs), one can call free_irq(). +This function will return control once any pending IRQs are handled, +"unhook" the drivers IRQ handler from that IRQ, and finally release +the IRQ if no one else is using it. 
+ + +Stop all DMA activity +--------------------- +It's extremely important to stop all DMA operations BEFORE attempting +to deallocate DMA control data. Failure to do so can result in memory +corruption, hangs, and on some chip-sets a hard crash. + +Stopping DMA after stopping the IRQs can avoid races where the +IRQ handler might restart DMA engines. + +While this step sounds obvious and trivial, several "mature" drivers +didn't get this step right in the past. + + +Release DMA buffers +------------------- +Once DMA is stopped, clean up streaming DMA first. +I.e. unmap data buffers and return buffers to "upstream" +owners if there is one. + +Then clean up "consistent" buffers which contain the control data. + +See Documentation/DMA-API.txt for details on unmapping interfaces. + + +Unregister from other subsystems +-------------------------------- +Most low level PCI device drivers support some other subsystem +like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your +driver isn't losing resources from that other subsystem. +If this happens, typically the symptom is an Oops (panic) when +the subsystem attempts to call into a driver that has been unloaded. + + +Disable Device from responding to MMIO/IO Port addresses +-------------------------------------------------------- +io_unmap() MMIO or IO Port resources and then call pci_disable_device(). +This is the symmetric opposite of pci_enable_device(). +Do not access device registers after calling pci_disable_device(). + + +Release MMIO/IO Port Resource(s) +-------------------------------- +Call pci_release_region() to mark the MMIO or IO Port range as available. +Failure to do so usually results in the inability to reload the driver. + + +How to access PCI config space +============================== + +You can use `pci_(read|write)_config_(byte|word|dword)` to access the config +space of a device represented by `struct pci_dev *`. 
All these functions return +0 when successful or an error code (`PCIBIOS_...`) which can be translated to a +text string by pcibios_strerror. Most drivers expect that accesses to valid PCI +devices don't fail. + +If you don't have a struct pci_dev available, you can call +`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device +and function on that bus. + +If you access fields in the standard portion of the config header, please +use symbolic names of locations and bits declared in . + +If you need to access Extended PCI Capability registers, just call +pci_find_capability() for the particular capability and it will find the +corresponding register block for you. + + +Other interesting functions +=========================== + +============================= ================================================ +pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain, + bus and slot and number. If the device is + found, its reference count is increased. +pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) +pci_find_capability() Find specified capability in device's capability + list. +pci_resource_start() Returns bus start address for a given PCI region +pci_resource_end() Returns bus end address for a given PCI region +pci_resource_len() Returns the byte length of a PCI region +pci_set_drvdata() Set private driver data pointer for a pci_dev +pci_get_drvdata() Return private driver data pointer for a pci_dev +pci_set_mwi() Enable Memory-Write-Invalidate transactions. +pci_clear_mwi() Disable Memory-Write-Invalidate transactions. +============================= ================================================ + + +Miscellaneous hints +=================== + +When displaying PCI device names to the user (for example when a driver wants +to tell the user what card has it found), please use pci_name(pci_dev). + +Always refer to the PCI devices by a pointer to the pci_dev structure. 
+All PCI layer functions use this identification and it's the only
+reasonable one. Don't use bus/slot/function numbers except for very
+special purposes -- on systems with multiple primary buses their semantics
+can be pretty complex.
+
+Don't try to turn on Fast Back to Back writes in your driver. All devices
+on the bus need to be capable of doing it, so this is something which needs
+to be handled by platform and generic code, not individual drivers.
+
+
+Vendor and device identifications
+=================================
+
+Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
+are shared across multiple drivers. You can add private definitions in
+your driver if they're helpful, or just use plain hex constants.
+
+The device IDs are arbitrary hex numbers (vendor controlled) and normally used
+only in a single location, the pci_device_id table.
+
+Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/.
+There are mirrors of the pci.ids file at http://pciids.sourceforge.net/
+and https://github.com/pciutils/pciids.
+
+
+Obsolete functions
+==================
+
+There are several functions which you might come across when trying to
+port an old driver to the new PCI interface. They are no longer present
+in the kernel as they aren't compatible with hotplug or PCI domains or
+having sane locking.
+
+================= ===========================================
+pci_find_device() Superseded by pci_get_device()
+pci_find_subsys() Superseded by pci_get_subsys()
+pci_find_slot()   Superseded by pci_get_domain_bus_and_slot()
+pci_get_slot()    Superseded by pci_get_domain_bus_and_slot()
+================= ===========================================
+
+The alternative is the traditional PCI device driver that walks PCI
+device lists. This is still possible but discouraged.
+
+
+MMIO Space and "Write Posting"
+==============================
+
+Converting a driver from using I/O Port space to using MMIO space
+often requires some additional changes. Specifically, "write posting"
+needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
+already do this. I/O Port space guarantees write transactions reach the PCI
+device before the CPU can continue. Writes to MMIO space allow the CPU
+to continue before the transaction reaches the PCI device. HW weenies
+call this "Write Posting" because the write completion is "posted" to
+the CPU before the transaction has reached its destination.
+
+Thus, timing sensitive code should add readl() where the CPU is
+expected to wait before doing other work. The classic "bit banging"
+sequence works fine for I/O Port space::
+
+	for (i = 8; --i; val >>= 1) {
+		outb(val & 1, ioport_reg);	/* write bit */
+		udelay(10);
+	}
+
+The same sequence for MMIO space should be::
+
+	for (i = 8; --i; val >>= 1) {
+		writeb(val & 1, mmio_reg);	/* write bit */
+		readb(safe_mmio_reg);		/* flush posted write */
+		udelay(10);
+	}
+
+It is important that "safe_mmio_reg" not have any side effects that
+interferes with the correct operation of the device.
+
+Another case to watch out for is when resetting a PCI device. Use PCI
+Configuration space reads to flush the writel(). This will gracefully
+handle the PCI master abort on all platforms if the PCI device is
+expected to not respond to a readl(). Most x86 platforms will allow
+MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
+(e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail").
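Editorial illustration (not part of the patch): the write-posting behaviour described in the section above can be modelled in user space. The sketch below is a toy model under stated assumptions, not driver code: `struct model_bridge`, `model_writeb()`, `model_readb()`, `model_flush()` and `model_bitbang()` are hypothetical names, and the bridge is assumed to hold writes in a small buffer until a read forces them out. The loop here is written to send all eight bits explicitly; as quoted in the patch, the `for (i = 8; --i; ...)` loops run their body seven times.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_POSTED_MAX 16

/* Hypothetical model of a bridge that buffers ("posts") MMIO writes. */
struct model_bridge {
	uint8_t pending[MODEL_POSTED_MAX]; /* accepted but not yet delivered */
	size_t npending;
	uint8_t device_reg;                /* last value the device has seen */
	unsigned int delivered;            /* writes that reached the device */
};

/* Deliver every buffered write to the device, in order. */
static void model_flush(struct model_bridge *b)
{
	size_t i;

	for (i = 0; i < b->npending; i++) {
		b->device_reg = b->pending[i];
		b->delivered++;
	}
	b->npending = 0;
}

/* Posted write: returns at once, the device has not seen the value yet. */
static void model_writeb(struct model_bridge *b, uint8_t val)
{
	if (b->npending == MODEL_POSTED_MAX)
		model_flush(b);	/* buffer full: the bridge has to drain */
	b->pending[b->npending++] = val;
}

/* A read cannot complete until earlier posted writes have been delivered. */
static uint8_t model_readb(struct model_bridge *b)
{
	model_flush(b);
	return b->device_reg;
}

/*
 * Bit-bang the 8 bits of val, least significant bit first. Returns how
 * many bits had already reached the device when the per-bit delay would
 * have started.
 */
static unsigned int model_bitbang(struct model_bridge *b, uint8_t val, int flush)
{
	unsigned int on_time = 0;
	int i;

	for (i = 0; i < 8; i++, val >>= 1) {
		model_writeb(b, val & 1);
		if (flush)
			(void)model_readb(b);	/* flush posted write */
		if (b->npending == 0)
			on_time++;
		/* a driver would call udelay(10) here */
	}
	return on_time;
}
```

Without the read-back, none of the eight writes in this model has reached the device by the time its delay begins; with the read-back, every one has. That is the timing property the `readb(safe_mmio_reg)` call in the MMIO loop is there to provide.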
diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt deleted file mode 100644 index badb26ac33dc..000000000000 --- a/Documentation/PCI/pci.txt +++ /dev/null @@ -1,636 +0,0 @@ - - How To Write Linux PCI Drivers - - by Martin Mares on 07-Feb-2000 - updated by Grant Grundler on 23-Dec-2006 - -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The world of PCI is vast and full of (mostly unpleasant) surprises. -Since each CPU architecture implements different chip-sets and PCI devices -have different requirements (erm, "features"), the result is the PCI support -in the Linux kernel is not as trivial as one would wish. This short paper -tries to introduce all potential driver authors to Linux APIs for -PCI device drivers. - -A more complete resource is the third edition of "Linux Device Drivers" -by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. -LDD3 is available for free (under Creative Commons License) from: - - http://lwn.net/Kernel/LDD3/ - -However, keep in mind that all documents are subject to "bit rot". -Refer to the source code if things are not working as described here. - -Please send questions/comments/patches about Linux PCI API to the -"Linux PCI" mailing list. - - - -0. Structure of PCI drivers -~~~~~~~~~~~~~~~~~~~~~~~~~~~ -PCI drivers "discover" PCI devices in a system via pci_register_driver(). -Actually, it's the other way around. When the PCI generic code discovers -a new device, the driver with a matching "description" will be notified. -Details on this below. - -pci_register_driver() leaves most of the probing for devices to -the PCI layer and supports online insertion/removal of devices [thus -supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver]. -pci_register_driver() call requires passing in a table of function -pointers and thus dictates the high level structure of a driver. 
- -Once the driver knows about a PCI device and takes ownership, the -driver generally needs to perform the following initialization: - - Enable the device - Request MMIO/IOP resources - Set the DMA mask size (for both coherent and streaming DMA) - Allocate and initialize shared control data (pci_allocate_coherent()) - Access device configuration space (if needed) - Register IRQ handler (request_irq()) - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) - Enable DMA/processing engines - -When done using the device, and perhaps the module needs to be unloaded, -the driver needs to take the follow steps: - Disable the device from generating IRQs - Release the IRQ (free_irq()) - Stop all DMA activity - Release DMA buffers (both streaming and coherent) - Unregister from other subsystems (e.g. scsi or netdev) - Release MMIO/IOP resources - Disable the device - -Most of these topics are covered in the following sections. -For the rest look at LDD3 or . - -If the PCI subsystem is not configured (CONFIG_PCI is not set), most of -the PCI functions described below are defined as inline functions either -completely empty or just returning an appropriate error codes to avoid -lots of ifdefs in the drivers. - - - -1. pci_register_driver() call -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -PCI device drivers call pci_register_driver() during their -initialization with a pointer to a structure describing the driver -(struct pci_driver): - - field name Description - ---------- ------------------------------------------------------ - id_table Pointer to table of device ID's the driver is - interested in. Most drivers should export this - table using MODULE_DEVICE_TABLE(pci,...). - - probe This probing function gets called (during execution - of pci_register_driver() for already existing - devices or later if a new device gets inserted) for - all PCI devices which match the ID table and are not - "owned" by the other drivers yet. 
This function gets - passed a "struct pci_dev *" for each device whose - entry in the ID table matches the device. The probe - function returns zero when the driver chooses to - take "ownership" of the device or an error code - (negative number) otherwise. - The probe function always gets called from process - context, so it can sleep. - - remove The remove() function gets called whenever a device - being handled by this driver is removed (either during - deregistration of the driver or when it's manually - pulled out of a hot-pluggable slot). - The remove function always gets called from process - context, so it can sleep. - - suspend Put device into low power state. - suspend_late Put device into low power state. - - resume_early Wake device from low power state. - resume Wake device from low power state. - - (Please see Documentation/power/pci.txt for descriptions - of PCI Power Management and the related functions.) - - shutdown Hook into reboot_notifier_list (kernel/sys.c). - Intended to stop any idling DMA operations. - Useful for enabling wake-on-lan (NIC) or changing - the power state of a device before reboot. - e.g. drivers/net/e100.c. - - err_handler See Documentation/PCI/pci-error-recovery.txt - - -The ID table is an array of struct pci_device_id entries ending with an -all-zero entry. Definitions with static const are generally preferred. - -Each entry consists of: - - vendor,device Vendor and device ID to match (or PCI_ANY_ID) - - subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID) - subdevice, - - class Device class, subclass, and "interface" to match. - See Appendix D of the PCI Local Bus Spec or - include/linux/pci_ids.h for a full list of classes. - Most drivers do not need to specify class/class_mask - as vendor/device is normally sufficient. - - class_mask limit which sub-fields of the class field are compared. - See drivers/scsi/sym53c8xx_2/ for example of usage. - - driver_data Data private to the driver. 
- Most drivers don't need to use driver_data field. - Best practice is to use driver_data as an index - into a static list of equivalent device types, - instead of using it as a pointer. - - -Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up -a pci_device_id table. - -New PCI IDs may be added to a device driver pci_ids table at runtime -as shown below: - -echo "vendor device subvendor subdevice class class_mask driver_data" > \ -/sys/bus/pci/drivers/{driver}/new_id - -All fields are passed in as hexadecimal values (no leading 0x). -The vendor and device fields are mandatory, the others are optional. Users -need pass only as many optional fields as necessary: - o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF) - o class and class_mask fields default to 0 - o driver_data defaults to 0UL. - -Note that driver_data must match the value used by any of the pci_device_id -entries defined in the driver. This makes the driver_data field mandatory -if all the pci_device_id entries have a non-zero driver_data value. - -Once added, the driver probe routine will be invoked for any unclaimed -PCI devices listed in its (newly updated) pci_ids list. - -When the driver exits, it just calls pci_unregister_driver() and the PCI layer -automatically calls the remove hook for all devices handled by the driver. - - -1.1 "Attributes" for driver functions/data - -Please mark the initialization and cleanup functions where appropriate -(the corresponding macros are defined in <linux/init.h>): - - __init Initialization code. Thrown away after the driver - initializes. - __exit Exit code. Ignored for non-modular drivers. - -Tips on when/where to use the above attributes: - o The module_init()/module_exit() functions (and all - initialization functions called _only_ from these) - should be marked __init/__exit. - - o Do not mark the struct pci_driver. - - o Do NOT mark a function if you are not sure which mark to use.
- Better to not mark the function than mark the function wrong. - - - -2. How to find PCI devices manually -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -PCI drivers should have a really good reason for not using the -pci_register_driver() interface to search for PCI devices. -The main reason PCI devices are controlled by multiple drivers -is because one PCI device implements several different HW services. -E.g. combined serial/parallel port/floppy controller. - -A manual search may be performed using the following constructs: - -Searching by vendor and device ID: - - struct pci_dev *dev = NULL; - while (dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) - configure_device(dev); - -Searching by class ID (iterate in a similar way): - - pci_get_class(CLASS_ID, dev) - -Searching by both vendor/device and subsystem vendor/device ID: - - pci_get_subsys(VENDOR_ID,DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev). - -You can use the constant PCI_ANY_ID as a wildcard replacement for -VENDOR_ID or DEVICE_ID. This allows searching for any device from a -specific vendor, for example. - -These functions are hotplug-safe. They increment the reference count on -the pci_dev that they return. You must eventually (possibly at module unload) -decrement the reference count on these devices by calling pci_dev_put(). - - - -3. Device Initialization Steps -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -As noted in the introduction, most PCI drivers need the following steps -for device initialization: - - Enable the device - Request MMIO/IOP resources - Set the DMA mask size (for both coherent and streaming DMA) - Allocate and initialize shared control data (pci_allocate_coherent()) - Access device configuration space (if needed) - Register IRQ handler (request_irq()) - Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip) - Enable DMA/processing engines. - -The driver can access PCI config space registers at any time. -(Well, almost. 
When running BIST, config space can go away...but -that will just result in a PCI Bus Master Abort and config reads -will return garbage). - - -3.1 Enable the PCI device -~~~~~~~~~~~~~~~~~~~~~~~~~ -Before touching any device registers, the driver needs to enable -the PCI device by calling pci_enable_device(). This will: - o wake up the device if it was in suspended state, - o allocate I/O and memory regions of the device (if BIOS did not), - o allocate an IRQ (if BIOS did not). - -NOTE: pci_enable_device() can fail! Check the return value. - -[ OS BUG: we don't check resource allocations before enabling those - resources. The sequence would make more sense if we called - pci_request_resources() before calling pci_enable_device(). - Currently, the device drivers can't detect the bug when two - devices have been allocated the same range. This is not a common - problem and unlikely to get fixed soon. - - This has been discussed before but not changed as of 2.6.19: - http://lkml.org/lkml/2006/3/2/194 -] - -pci_set_master() will enable DMA by setting the bus master bit -in the PCI_COMMAND register. It also fixes the latency timer value if -it's set to something bogus by the BIOS. pci_clear_master() will -disable DMA by clearing the bus master bit. - -If the PCI device can use the PCI Memory-Write-Invalidate transaction, -call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval -and also ensures that the cache line size register is set correctly. -Check the return value of pci_set_mwi() as not all architectures -or chip-sets may support Memory-Write-Invalidate. Alternatively, -if Mem-Wr-Inval would be nice to have but is not required, call -pci_try_set_mwi() to have the system do its best effort at enabling -Mem-Wr-Inval. - - -3.2 Request MMIO/IOP resources -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Memory (MMIO), and I/O port addresses should NOT be read directly -from the PCI device config space.
Use the values in the pci_dev structure -as the PCI "bus address" might have been remapped to a "host physical" -address by the arch/chip-set specific kernel support. - -See Documentation/io-mapping.txt for how to access device registers -or device memory. - -The device driver needs to call pci_request_region() to verify -no other device is already using the same address resource. -Conversely, drivers should call pci_release_region() AFTER -calling pci_disable_device(). -The idea is to prevent two devices colliding on the same address range. - -[ See OS BUG comment above. Currently (2.6.19), The driver can only - determine MMIO and IO Port resource availability _after_ calling - pci_enable_device(). ] - -Generic flavors of pci_request_region() are request_mem_region() -(for MMIO ranges) and request_region() (for IO Port ranges). -Use these for address resources that are not described by "normal" PCI -BARs. - -Also see pci_request_selected_regions() below. - - -3.3 Set the DMA mask size -~~~~~~~~~~~~~~~~~~~~~~~~~ -[ If anything below doesn't make sense, please refer to - Documentation/DMA-API.txt. This section is just a reminder that - drivers need to indicate DMA capabilities of the device and is not - an authoritative source for DMA interfaces. ] - -While all drivers should explicitly indicate the DMA capability -(e.g. 32 or 64 bit) of the PCI bus master, devices with more than -32-bit bus master capability for streaming data need the driver -to "register" this capability by calling pci_set_dma_mask() with -appropriate parameters. In general this allows more efficient DMA -on systems where System RAM exists above 4G _physical_ address. - -Drivers for all PCI-X and PCIe compliant devices must call -pci_set_dma_mask() as they are 64-bit DMA devices. - -Similarly, drivers must also "register" this capability if the device -can directly address "consistent memory" in System RAM above 4G physical -address by calling pci_set_consistent_dma_mask(). 
-Again, this includes drivers for all PCI-X and PCIe compliant devices. -Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are -64-bit DMA capable for payload ("streaming") data but not control -("consistent") data. - - -3.4 Setup shared control data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared) -memory. See Documentation/DMA-API.txt for a full description of -the DMA APIs. This section is just a reminder that it needs to be done -before enabling DMA on the device. - - -3.5 Initialize device registers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Some drivers will need specific "capability" fields programmed -or other "vendor specific" register initialized or reset. -E.g. clearing pending interrupts. - - -3.6 Register IRQ handler -~~~~~~~~~~~~~~~~~~~~~~~~ -While calling request_irq() is the last step described here, -this is often just another intermediate step to initialize a device. -This step can often be deferred until the device is opened for use. - -All interrupt handlers for IRQ lines should be registered with IRQF_SHARED -and use the devid to map IRQs to devices (remember that all PCI IRQ lines -can be shared). - -request_irq() will associate an interrupt handler and device handle -with an interrupt number. Historically interrupt numbers represent -IRQ lines which run from the PCI device to the Interrupt controller. -With MSI and MSI-X (more below) the interrupt number is a CPU "vector". - -request_irq() also enables the interrupt. Make sure the device is -quiesced and does not have any interrupts pending before registering -the interrupt handler. - -MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts" -which deliver interrupts to the CPU via a DMA write to a Local APIC. -The fundamental difference between MSI and MSI-X is how multiple -"vectors" get allocated. MSI requires contiguous blocks of vectors -while MSI-X can allocate several individual ones. 
- -MSI capability can be enabled by calling pci_alloc_irq_vectors() with the -PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This -causes the PCI support to program CPU vector data into the PCI device -capability registers. Many architectures, chip-sets, or BIOSes do NOT -support MSI or MSI-X and a call to pci_alloc_irq_vectors with just -the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always -specify PCI_IRQ_LEGACY as well. - -Drivers that have different interrupt handlers for MSI/MSI-X and -legacy INTx should choose the right one based on the msi_enabled -and msix_enabled flags in the pci_dev structure after calling -pci_alloc_irq_vectors. - -There are (at least) two really good reasons for using MSI: -1) MSI is an exclusive interrupt vector by definition. - This means the interrupt handler doesn't have to verify - its device caused the interrupt. - -2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed - to be visible to the host CPU(s) when the MSI is delivered. This - is important for both data coherency and avoiding stale control data. - This guarantee allows the driver to omit MMIO reads to flush - the DMA stream. - -See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples -of MSI/MSI-X usage. - - - -4. PCI device shutdown -~~~~~~~~~~~~~~~~~~~~~~~ - -When a PCI device driver is being unloaded, most of the following -steps need to be performed: - - Disable the device from generating IRQs - Release the IRQ (free_irq()) - Stop all DMA activity - Release DMA buffers (both streaming and consistent) - Unregister from other subsystems (e.g. scsi or netdev) - Disable device from responding to MMIO/IO Port addresses - Release MMIO/IO Port resource(s) - - -4.1 Stop IRQs on the device -~~~~~~~~~~~~~~~~~~~~~~~~~~~ -How to do this is chip/device specific. If it's not done, it opens -the possibility of a "screaming interrupt" if (and only if) -the IRQ is shared with another device.
- -When the shared IRQ handler is "unhooked", the remaining devices -using the same IRQ line will still need the IRQ enabled. Thus if the -"unhooked" device asserts IRQ line, the system will respond assuming -it was one of the remaining devices asserted the IRQ line. Since none -of the other devices will handle the IRQ, the system will "hang" until -it decides the IRQ isn't going to get handled and masks the IRQ (100,000 -iterations later). Once the shared IRQ is masked, the remaining devices -will stop functioning properly. Not a nice situation. - -This is another reason to use MSI or MSI-X if it's available. -MSI and MSI-X are defined to be exclusive interrupts and thus -are not susceptible to the "screaming interrupt" problem. - - -4.2 Release the IRQ -~~~~~~~~~~~~~~~~~~~ -Once the device is quiesced (no more IRQs), one can call free_irq(). -This function will return control once any pending IRQs are handled, -"unhook" the drivers IRQ handler from that IRQ, and finally release -the IRQ if no one else is using it. - - -4.3 Stop all DMA activity -~~~~~~~~~~~~~~~~~~~~~~~~~ -It's extremely important to stop all DMA operations BEFORE attempting -to deallocate DMA control data. Failure to do so can result in memory -corruption, hangs, and on some chip-sets a hard crash. - -Stopping DMA after stopping the IRQs can avoid races where the -IRQ handler might restart DMA engines. - -While this step sounds obvious and trivial, several "mature" drivers -didn't get this step right in the past. - - -4.4 Release DMA buffers -~~~~~~~~~~~~~~~~~~~~~~~ -Once DMA is stopped, clean up streaming DMA first. -I.e. unmap data buffers and return buffers to "upstream" -owners if there is one. - -Then clean up "consistent" buffers which contain the control data. - -See Documentation/DMA-API.txt for details on unmapping interfaces. 
- - -4.5 Unregister from other subsystems -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Most low level PCI device drivers support some other subsystem -like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your -driver isn't losing resources from that other subsystem. -If this happens, typically the symptom is an Oops (panic) when -the subsystem attempts to call into a driver that has been unloaded. - - -4.6 Disable Device from responding to MMIO/IO Port addresses -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -io_unmap() MMIO or IO Port resources and then call pci_disable_device(). -This is the symmetric opposite of pci_enable_device(). -Do not access device registers after calling pci_disable_device(). - - -4.7 Release MMIO/IO Port Resource(s) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Call pci_release_region() to mark the MMIO or IO Port range as available. -Failure to do so usually results in the inability to reload the driver. - - - -5. How to access PCI config space -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can use pci_(read|write)_config_(byte|word|dword) to access the config -space of a device represented by struct pci_dev *. All these functions return 0 -when successful or an error code (PCIBIOS_...) which can be translated to a text -string by pcibios_strerror. Most drivers expect that accesses to valid PCI -devices don't fail. - -If you don't have a struct pci_dev available, you can call -pci_bus_(read|write)_config_(byte|word|dword) to access a given device -and function on that bus. - -If you access fields in the standard portion of the config header, please -use symbolic names of locations and bits declared in <linux/pci.h>. - -If you need to access Extended PCI Capability registers, just call -pci_find_capability() for the particular capability and it will find the -corresponding register block for you. - - - -6.
Other interesting functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain, - bus and slot and number. If the device is - found, its reference count is increased. -pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3) -pci_find_capability() Find specified capability in device's capability - list. -pci_resource_start() Returns bus start address for a given PCI region -pci_resource_end() Returns bus end address for a given PCI region -pci_resource_len() Returns the byte length of a PCI region -pci_set_drvdata() Set private driver data pointer for a pci_dev -pci_get_drvdata() Return private driver data pointer for a pci_dev -pci_set_mwi() Enable Memory-Write-Invalidate transactions. -pci_clear_mwi() Disable Memory-Write-Invalidate transactions. - - - -7. Miscellaneous hints -~~~~~~~~~~~~~~~~~~~~~~ - -When displaying PCI device names to the user (for example when a driver wants -to tell the user what card has it found), please use pci_name(pci_dev). - -Always refer to the PCI devices by a pointer to the pci_dev structure. -All PCI layer functions use this identification and it's the only -reasonable one. Don't use bus/slot/function numbers except for very -special purposes -- on systems with multiple primary buses their semantics -can be pretty complex. - -Don't try to turn on Fast Back to Back writes in your driver. All devices -on the bus need to be capable of doing it, so this is something which needs -to be handled by platform and generic code, not individual drivers. - - - -8. Vendor and device identifications -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Do not add new device or vendor IDs to include/linux/pci_ids.h unless they -are shared across multiple drivers. You can add private definitions in -your driver if they're helpful, or just use plain hex constants. 
- -The device IDs are arbitrary hex numbers (vendor controlled) and normally used -only in a single location, the pci_device_id table. - -Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/. -There are mirrors of the pci.ids file at http://pciids.sourceforge.net/ -and https://github.com/pciutils/pciids. - - - -9. Obsolete functions -~~~~~~~~~~~~~~~~~~~~~ - -There are several functions which you might come across when trying to -port an old driver to the new PCI interface. They are no longer present -in the kernel as they aren't compatible with hotplug or PCI domains or -having sane locking. - -pci_find_device() Superseded by pci_get_device() -pci_find_subsys() Superseded by pci_get_subsys() -pci_find_slot() Superseded by pci_get_domain_bus_and_slot() -pci_get_slot() Superseded by pci_get_domain_bus_and_slot() - - -The alternative is the traditional PCI device driver that walks PCI -device lists. This is still possible but discouraged. - - - -10. MMIO Space and "Write Posting" -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Converting a driver from using I/O Port space to using MMIO space -often requires some additional changes. Specifically, "write posting" -needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2) -already do this. I/O Port space guarantees write transactions reach the PCI -device before the CPU can continue. Writes to MMIO space allow the CPU -to continue before the transaction reaches the PCI device. HW weenies -call this "Write Posting" because the write completion is "posted" to -the CPU before the transaction has reached its destination. - -Thus, timing sensitive code should add readl() where the CPU is -expected to wait before doing other work. 
The classic "bit banging" -sequence works fine for I/O Port space: - - for (i = 8; --i; val >>= 1) { - outb(val & 1, ioport_reg); /* write bit */ - udelay(10); - } - -The same sequence for MMIO space should be: - - for (i = 8; --i; val >>= 1) { - writeb(val & 1, mmio_reg); /* write bit */ - readb(safe_mmio_reg); /* flush posted write */ - udelay(10); - } - -It is important that "safe_mmio_reg" not have any side effects that -interferes with the correct operation of the device. - -Another case to watch out for is when resetting a PCI device. Use PCI -Configuration space reads to flush the writel(). This will gracefully -handle the PCI master abort on all platforms if the PCI device is -expected to not respond to a readl(). Most x86 platforms will allow -MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage -(e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail"). - -- cgit From 2e6422444894685a8a3135f7b982aa026dc0f74c Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:25 +0800 Subject: Documentation: PCI: convert PCIEBUS-HOWTO.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. 
Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/PCIEBUS-HOWTO.txt | 198 -------------------------------- Documentation/PCI/index.rst | 1 + Documentation/PCI/picebus-howto.rst | 220 ++++++++++++++++++++++++++++++++++++ 3 files changed, 221 insertions(+), 198 deletions(-) delete mode 100644 Documentation/PCI/PCIEBUS-HOWTO.txt create mode 100644 Documentation/PCI/picebus-howto.rst (limited to 'Documentation') diff --git a/Documentation/PCI/PCIEBUS-HOWTO.txt b/Documentation/PCI/PCIEBUS-HOWTO.txt deleted file mode 100644 index 15f0bb3b5045..000000000000 --- a/Documentation/PCI/PCIEBUS-HOWTO.txt +++ /dev/null @@ -1,198 +0,0 @@ - The PCI Express Port Bus Driver Guide HOWTO - Tom L Nguyen tom.l.nguyen@intel.com - 11/03/2004 - -1. About this guide - -This guide describes the basics of the PCI Express Port Bus driver -and provides information on how to enable the service drivers to -register/unregister with the PCI Express Port Bus Driver. - -2. Copyright 2004 Intel Corporation - -3. What is the PCI Express Port Bus Driver - -A PCI Express Port is a logical PCI-PCI Bridge structure. There -are two types of PCI Express Port: the Root Port and the Switch -Port. The Root Port originates a PCI Express link from a PCI Express -Root Complex and the Switch Port connects PCI Express links to -internal logical PCI buses. The Switch Port, which has its secondary -bus representing the switch's internal routing logic, is called the -switch's Upstream Port. The switch's Downstream Port is bridging from -switch's internal routing bus to a bus representing the downstream -PCI Express link from the PCI Express Switch. - -A PCI Express Port can provide up to four distinct functions, -referred to in this document as services, depending on its port type. 
-PCI Express Port's services include native hotplug support (HP), -power management event support (PME), advanced error reporting -support (AER), and virtual channel support (VC). These services may -be handled by a single complex driver or be individually distributed -and handled by corresponding service drivers. - -4. Why use the PCI Express Port Bus Driver? - -In existing Linux kernels, the Linux Device Driver Model allows a -physical device to be handled by only a single driver. The PCI -Express Port is a PCI-PCI Bridge device with multiple distinct -services. To maintain a clean and simple solution each service -may have its own software service driver. In this case several -service drivers will compete for a single PCI-PCI Bridge device. -For example, if the PCI Express Root Port native hotplug service -driver is loaded first, it claims a PCI-PCI Bridge Root Port. The -kernel therefore does not load other service drivers for that Root -Port. In other words, it is impossible to have multiple service -drivers load and run on a PCI-PCI Bridge device simultaneously -using the current driver model. - -To enable multiple service drivers running simultaneously requires -having a PCI Express Port Bus driver, which manages all populated -PCI Express Ports and distributes all provided service requests -to the corresponding service drivers as required. Some key -advantages of using the PCI Express Port Bus driver are listed below: - - - Allow multiple service drivers to run simultaneously on - a PCI-PCI Bridge Port device. - - - Allow service drivers implemented in an independent - staged approach. - - - Allow one service driver to run on multiple PCI-PCI Bridge - Port devices. - - - Manage and distribute resources of a PCI-PCI Bridge Port - device to requested service drivers. - -5. Configuring the PCI Express Port Bus Driver vs. 
Service Drivers - -5.1 Including the PCI Express Port Bus Driver Support into the Kernel - -Including the PCI Express Port Bus driver depends on whether the PCI -Express support is included in the kernel config. The kernel will -automatically include the PCI Express Port Bus driver as a kernel -driver when the PCI Express support is enabled in the kernel. - -5.2 Enabling Service Driver Support - -PCI device drivers are implemented based on Linux Device Driver Model. -All service drivers are PCI device drivers. As discussed above, it is -impossible to load any service driver once the kernel has loaded the -PCI Express Port Bus Driver. To meet the PCI Express Port Bus Driver -Model requires some minimal changes on existing service drivers that -imposes no impact on the functionality of existing service drivers. - -A service driver is required to use the two APIs shown below to -register its service with the PCI Express Port Bus driver (see -section 5.2.1 & 5.2.2). It is important that a service driver -initializes the pcie_port_service_driver data structure, included in -header file /include/linux/pcieport_if.h, before calling these APIs. -Failure to do so will result an identity mismatch, which prevents -the PCI Express Port Bus driver from loading a service driver. - -5.2.1 pcie_port_service_register - -int pcie_port_service_register(struct pcie_port_service_driver *new) - -This API replaces the Linux Driver Model's pci_register_driver API. A -service driver should always calls pcie_port_service_register at -module init. Note that after service driver being loaded, calls -such as pci_enable_device(dev) and pci_set_master(dev) are no longer -necessary since these calls are executed by the PCI Port Bus driver. - -5.2.2 pcie_port_service_unregister - -void pcie_port_service_unregister(struct pcie_port_service_driver *new) - -pcie_port_service_unregister replaces the Linux Driver Model's -pci_unregister_driver. It's always called by service driver when a -module exits. 
- -5.2.3 Sample Code - -Below is sample service driver code to initialize the port service -driver data structure. - -static struct pcie_port_service_id service_id[] = { { - .vendor = PCI_ANY_ID, - .device = PCI_ANY_ID, - .port_type = PCIE_RC_PORT, - .service_type = PCIE_PORT_SERVICE_AER, - }, { /* end: all zeroes */ } -}; - -static struct pcie_port_service_driver root_aerdrv = { - .name = (char *)device_name, - .id_table = &service_id[0], - - .probe = aerdrv_load, - .remove = aerdrv_unload, - - .suspend = aerdrv_suspend, - .resume = aerdrv_resume, -}; - -Below is a sample code for registering/unregistering a service -driver. - -static int __init aerdrv_service_init(void) -{ - int retval = 0; - - retval = pcie_port_service_register(&root_aerdrv); - if (!retval) { - /* - * FIX ME - */ - } - return retval; -} - -static void __exit aerdrv_service_exit(void) -{ - pcie_port_service_unregister(&root_aerdrv); -} - -module_init(aerdrv_service_init); -module_exit(aerdrv_service_exit); - -6. Possible Resource Conflicts - -Since all service drivers of a PCI-PCI Bridge Port device are -allowed to run simultaneously, below lists a few of possible resource -conflicts with proposed solutions. - -6.1 MSI and MSI-X Vector Resource - -Once MSI or MSI-X interrupts are enabled on a device, it stays in this -mode until they are disabled again. Since service drivers of the same -PCI-PCI Bridge port share the same physical device, if an individual -service driver enables or disables MSI/MSI-X mode it may result -unpredictable behavior. - -To avoid this situation all service drivers are not permitted to -switch interrupt mode on its device. The PCI Express Port Bus driver -is responsible for determining the interrupt mode and this should be -transparent to service drivers. Service drivers need to know only -the vector IRQ assigned to the field irq of struct pcie_device, which -is passed in when the PCI Express Port Bus driver probes each service -driver. 
Service drivers should use (struct pcie_device*)dev->irq to -call request_irq/free_irq. In addition, the interrupt mode is stored -in the field interrupt_mode of struct pcie_device. - -6.3 PCI Memory/IO Mapped Regions - -Service drivers for PCI Express Power Management (PME), Advanced -Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access -PCI configuration space on the PCI Express port. In all cases the -registers accessed are independent of each other. This patch assumes -that all service drivers will be well behaved and not overwrite -other service driver's configuration settings. - -6.4 PCI Config Registers - -Each service driver runs its PCI config operations on its own -capability structure except the PCI Express capability structure, in -which Root Control register and Device Control register are shared -between PME and AER. This patch assumes that all service drivers -will be well behaved and not overwrite other service driver's -configuration settings. diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index 7babf43709b0..79d6d75bbf28 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -9,3 +9,4 @@ Linux PCI Bus Subsystem :numbered: pci + picebus-howto diff --git a/Documentation/PCI/picebus-howto.rst b/Documentation/PCI/picebus-howto.rst new file mode 100644 index 000000000000..f882ff62c51f --- /dev/null +++ b/Documentation/PCI/picebus-howto.rst @@ -0,0 +1,220 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +=========================================== +The PCI Express Port Bus Driver Guide HOWTO +=========================================== + +:Author: Tom L Nguyen tom.l.nguyen@intel.com 11/03/2004 +:Copyright: |copy| 2004 Intel Corporation + +About this guide +================ + +This guide describes the basics of the PCI Express Port Bus driver +and provides information on how to enable the service drivers to +register/unregister with the PCI Express Port Bus Driver.
+ + +What is the PCI Express Port Bus Driver +======================================= + +A PCI Express Port is a logical PCI-PCI Bridge structure. There +are two types of PCI Express Port: the Root Port and the Switch +Port. The Root Port originates a PCI Express link from a PCI Express +Root Complex and the Switch Port connects PCI Express links to +internal logical PCI buses. The Switch Port, which has its secondary +bus representing the switch's internal routing logic, is called the +switch's Upstream Port. The switch's Downstream Port is bridging from +switch's internal routing bus to a bus representing the downstream +PCI Express link from the PCI Express Switch. + +A PCI Express Port can provide up to four distinct functions, +referred to in this document as services, depending on its port type. +PCI Express Port's services include native hotplug support (HP), +power management event support (PME), advanced error reporting +support (AER), and virtual channel support (VC). These services may +be handled by a single complex driver or be individually distributed +and handled by corresponding service drivers. + +Why use the PCI Express Port Bus Driver? +======================================== + +In existing Linux kernels, the Linux Device Driver Model allows a +physical device to be handled by only a single driver. The PCI +Express Port is a PCI-PCI Bridge device with multiple distinct +services. To maintain a clean and simple solution each service +may have its own software service driver. In this case several +service drivers will compete for a single PCI-PCI Bridge device. +For example, if the PCI Express Root Port native hotplug service +driver is loaded first, it claims a PCI-PCI Bridge Root Port. The +kernel therefore does not load other service drivers for that Root +Port. In other words, it is impossible to have multiple service +drivers load and run on a PCI-PCI Bridge device simultaneously +using the current driver model. 
+ +To enable multiple service drivers running simultaneously requires +having a PCI Express Port Bus driver, which manages all populated +PCI Express Ports and distributes all provided service requests +to the corresponding service drivers as required. Some key +advantages of using the PCI Express Port Bus driver are listed below: + + - Allow multiple service drivers to run simultaneously on + a PCI-PCI Bridge Port device. + + - Allow service drivers implemented in an independent + staged approach. + + - Allow one service driver to run on multiple PCI-PCI Bridge + Port devices. + + - Manage and distribute resources of a PCI-PCI Bridge Port + device to requested service drivers. + +Configuring the PCI Express Port Bus Driver vs. Service Drivers +=============================================================== + +Including the PCI Express Port Bus Driver Support into the Kernel +----------------------------------------------------------------- + +Including the PCI Express Port Bus driver depends on whether the PCI +Express support is included in the kernel config. The kernel will +automatically include the PCI Express Port Bus driver as a kernel +driver when the PCI Express support is enabled in the kernel. + +Enabling Service Driver Support +------------------------------- + +PCI device drivers are implemented based on Linux Device Driver Model. +All service drivers are PCI device drivers. As discussed above, it is +impossible to load any service driver once the kernel has loaded the +PCI Express Port Bus Driver. To meet the PCI Express Port Bus Driver +Model requires some minimal changes on existing service drivers that +imposes no impact on the functionality of existing service drivers. + +A service driver is required to use the two APIs shown below to +register its service with the PCI Express Port Bus driver (see +section 5.2.1 & 5.2.2). 
It is important that a service driver +initializes the pcie_port_service_driver data structure, included in +header file /include/linux/pcieport_if.h, before calling these APIs. +Failure to do so will result in an identity mismatch, which prevents +the PCI Express Port Bus driver from loading a service driver. + +pcie_port_service_register +~~~~~~~~~~~~~~~~~~~~~~~~~~ +:: + + int pcie_port_service_register(struct pcie_port_service_driver *new) + +This API replaces the Linux Driver Model's pci_register_driver API. A +service driver should always call pcie_port_service_register at +module init. Note that after the service driver is loaded, calls +such as pci_enable_device(dev) and pci_set_master(dev) are no longer +necessary since these calls are executed by the PCI Port Bus driver. + +pcie_port_service_unregister +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +:: + + void pcie_port_service_unregister(struct pcie_port_service_driver *new) + +pcie_port_service_unregister replaces the Linux Driver Model's +pci_unregister_driver. It's always called by the service driver when a +module exits. + +Sample Code +~~~~~~~~~~~ + +Below is sample service driver code to initialize the port service +driver data structure. +:: + + static struct pcie_port_service_id service_id[] = { { + .vendor = PCI_ANY_ID, + .device = PCI_ANY_ID, + .port_type = PCIE_RC_PORT, + .service_type = PCIE_PORT_SERVICE_AER, + }, { /* end: all zeroes */ } + }; + + static struct pcie_port_service_driver root_aerdrv = { + .name = (char *)device_name, + .id_table = &service_id[0], + + .probe = aerdrv_load, + .remove = aerdrv_unload, + + .suspend = aerdrv_suspend, + .resume = aerdrv_resume, + }; + +Below is sample code for registering/unregistering a service +driver.
+:: + + static int __init aerdrv_service_init(void) + { + int retval = 0; + + retval = pcie_port_service_register(&root_aerdrv); + if (!retval) { + /* + * FIX ME + */ + } + return retval; + } + + static void __exit aerdrv_service_exit(void) + { + pcie_port_service_unregister(&root_aerdrv); + } + + module_init(aerdrv_service_init); + module_exit(aerdrv_service_exit); + +Possible Resource Conflicts +=========================== + +Since all service drivers of a PCI-PCI Bridge Port device are +allowed to run simultaneously, below lists a few of possible resource +conflicts with proposed solutions. + +MSI and MSI-X Vector Resource +----------------------------- + +Once MSI or MSI-X interrupts are enabled on a device, it stays in this +mode until they are disabled again. Since service drivers of the same +PCI-PCI Bridge port share the same physical device, if an individual +service driver enables or disables MSI/MSI-X mode it may result +unpredictable behavior. + +To avoid this situation all service drivers are not permitted to +switch interrupt mode on its device. The PCI Express Port Bus driver +is responsible for determining the interrupt mode and this should be +transparent to service drivers. Service drivers need to know only +the vector IRQ assigned to the field irq of struct pcie_device, which +is passed in when the PCI Express Port Bus driver probes each service +driver. Service drivers should use (struct pcie_device*)dev->irq to +call request_irq/free_irq. In addition, the interrupt mode is stored +in the field interrupt_mode of struct pcie_device. + +PCI Memory/IO Mapped Regions +---------------------------- + +Service drivers for PCI Express Power Management (PME), Advanced +Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access +PCI configuration space on the PCI Express port. In all cases the +registers accessed are independent of each other. 
This patch assumes +that all service drivers will be well behaved and not overwrite +other service driver's configuration settings. + +PCI Config Registers +-------------------- + +Each service driver runs its PCI config operations on its own +capability structure except the PCI Express capability structure, in +which Root Control register and Device Control register are shared +between PME and AER. This patch assumes that all service drivers +will be well behaved and not overwrite other service driver's +configuration settings. -- cgit From 4d2c729c62328d6841111d98396374476367ae83 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:26 +0800 Subject: Documentation: PCI: convert pci-iov-howto.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/index.rst | 1 + Documentation/PCI/pci-iov-howto.rst | 172 ++++++++++++++++++++++++++++++++++++ Documentation/PCI/pci-iov-howto.txt | 147 ------------------------------ 3 files changed, 173 insertions(+), 147 deletions(-) create mode 100644 Documentation/PCI/pci-iov-howto.rst delete mode 100644 Documentation/PCI/pci-iov-howto.txt (limited to 'Documentation') diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index 79d6d75bbf28..0d9390298c4a 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -10,3 +10,4 @@ Linux PCI Bus Subsystem pci picebus-howto + pci-iov-howto diff --git a/Documentation/PCI/pci-iov-howto.rst b/Documentation/PCI/pci-iov-howto.rst new file mode 100644 index 000000000000..b9fd003206f1 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.rst @@ -0,0 +1,172 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. 
include:: + +==================================== +PCI Express I/O Virtualization Howto +==================================== + +:Copyright: |copy| 2009 Intel Corporation +:Authors: - Yu Zhao + - Donald Dutile + +Overview +======== + +What is SR-IOV +-------------- + +Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended +capability which makes one physical device appear as multiple virtual +devices. The physical device is referred to as Physical Function (PF) +while the virtual devices are referred to as Virtual Functions (VF). +Allocation of the VF can be dynamically controlled by the PF via +registers encapsulated in the capability. By default, this feature is +not enabled and the PF behaves as traditional PCIe device. Once it's +turned on, each VF's PCI configuration space can be accessed by its own +Bus, Device and Function Number (Routing ID). And each VF also has PCI +Memory Space, which is used to map its register set. VF device driver +operates on the register set so it can be functional and appear as a +real existing PCI device. + +User Guide +========== + +How can I enable SR-IOV capability +---------------------------------- + +Multiple methods are available for SR-IOV enablement. +In the first method, the device driver (PF driver) will control the +enabling and disabling of the capability via API provided by SR-IOV core. +If the hardware has SR-IOV capability, loading its PF driver would +enable it and all VFs associated with the PF. Some PF drivers require +a module parameter to be set to determine the number of VFs to enable. +In the second method, a write to the sysfs file sriov_numvfs will +enable and disable the VFs associated with a PCIe PF. This method +enables per-PF, VF enable/disable values versus the first method, +which applies to all PFs of the same device. 
Additionally, the +PCI SRIOV core support ensures that enable/disable operations are +valid to reduce duplication in multiple drivers for the same +checks, e.g., check numvfs == 0 if enabling VFs, ensure +numvfs <= totalvfs. +The second method is the recommended method for new/future VF devices. + +How can I use the Virtual Functions +----------------------------------- + +The VF is treated as hot-plugged PCI devices in the kernel, so they +should be able to work in the same way as real PCI devices. The VF +requires device driver that is same as a normal PCI device's. + +Developer Guide +=============== + +SR-IOV API +---------- + +To enable SR-IOV capability: + +(a) For the first method, in the driver:: + + int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); + +'nr_virtfn' is number of VFs to be enabled. + +(b) For the second method, from sysfs:: + + echo 'nr_virtfn' > \ + /sys/bus/pci/devices//sriov_numvfs + +To disable SR-IOV capability: + +(a) For the first method, in the driver:: + + void pci_disable_sriov(struct pci_dev *dev); + +(b) For the second method, from sysfs:: + + echo 0 > \ + /sys/bus/pci/devices//sriov_numvfs + +To enable auto probing VFs by a compatible driver on the host, run +command below before enabling SR-IOV capabilities. This is the +default behavior. +:: + + echo 1 > \ + /sys/bus/pci/devices//sriov_drivers_autoprobe + +To disable auto probing VFs by a compatible driver on the host, run +command below before enabling SR-IOV capabilities. Updating this +entry will not affect VFs which are already probed. +:: + + echo 0 > \ + /sys/bus/pci/devices//sriov_drivers_autoprobe + +Usage example +------------- + +Following piece of code illustrates the usage of the SR-IOV API. +:: + + static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id) + { + pci_enable_sriov(dev, NR_VIRTFN); + + ... + + return 0; + } + + static void dev_remove(struct pci_dev *dev) + { + pci_disable_sriov(dev); + + ... 
+ } + + static int dev_suspend(struct pci_dev *dev, pm_message_t state) + { + ... + + return 0; + } + + static int dev_resume(struct pci_dev *dev) + { + ... + + return 0; + } + + static void dev_shutdown(struct pci_dev *dev) + { + ... + } + + static int dev_sriov_configure(struct pci_dev *dev, int numvfs) + { + if (numvfs > 0) { + ... + pci_enable_sriov(dev, numvfs); + ... + return numvfs; + } + if (numvfs == 0) { + .... + pci_disable_sriov(dev); + ... + return 0; + } + } + + static struct pci_driver dev_driver = { + .name = "SR-IOV Physical Function driver", + .id_table = dev_id_table, + .probe = dev_probe, + .remove = dev_remove, + .suspend = dev_suspend, + .resume = dev_resume, + .shutdown = dev_shutdown, + .sriov_configure = dev_sriov_configure, + }; diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt deleted file mode 100644 index d2a84151e99c..000000000000 --- a/Documentation/PCI/pci-iov-howto.txt +++ /dev/null @@ -1,147 +0,0 @@ - PCI Express I/O Virtualization Howto - Copyright (C) 2009 Intel Corporation - Yu Zhao - - Update: November 2012 - -- sysfs-based SRIOV enable-/disable-ment - Donald Dutile - -1. Overview - -1.1 What is SR-IOV - -Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended -capability which makes one physical device appear as multiple virtual -devices. The physical device is referred to as Physical Function (PF) -while the virtual devices are referred to as Virtual Functions (VF). -Allocation of the VF can be dynamically controlled by the PF via -registers encapsulated in the capability. By default, this feature is -not enabled and the PF behaves as traditional PCIe device. Once it's -turned on, each VF's PCI configuration space can be accessed by its own -Bus, Device and Function Number (Routing ID). And each VF also has PCI -Memory Space, which is used to map its register set. VF device driver -operates on the register set so it can be functional and appear as a -real existing PCI device. 
- -2. User Guide - -2.1 How can I enable SR-IOV capability - -Multiple methods are available for SR-IOV enablement. -In the first method, the device driver (PF driver) will control the -enabling and disabling of the capability via API provided by SR-IOV core. -If the hardware has SR-IOV capability, loading its PF driver would -enable it and all VFs associated with the PF. Some PF drivers require -a module parameter to be set to determine the number of VFs to enable. -In the second method, a write to the sysfs file sriov_numvfs will -enable and disable the VFs associated with a PCIe PF. This method -enables per-PF, VF enable/disable values versus the first method, -which applies to all PFs of the same device. Additionally, the -PCI SRIOV core support ensures that enable/disable operations are -valid to reduce duplication in multiple drivers for the same -checks, e.g., check numvfs == 0 if enabling VFs, ensure -numvfs <= totalvfs. -The second method is the recommended method for new/future VF devices. - -2.2 How can I use the Virtual Functions - -The VF is treated as hot-plugged PCI devices in the kernel, so they -should be able to work in the same way as real PCI devices. The VF -requires device driver that is same as a normal PCI device's. - -3. Developer Guide - -3.1 SR-IOV API - -To enable SR-IOV capability: -(a) For the first method, in the driver: - int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); - 'nr_virtfn' is number of VFs to be enabled. -(b) For the second method, from sysfs: - echo 'nr_virtfn' > \ - /sys/bus/pci/devices//sriov_numvfs - -To disable SR-IOV capability: -(a) For the first method, in the driver: - void pci_disable_sriov(struct pci_dev *dev); -(b) For the second method, from sysfs: - echo 0 > \ - /sys/bus/pci/devices//sriov_numvfs - -To enable auto probing VFs by a compatible driver on the host, run -command below before enabling SR-IOV capabilities. This is the -default behavior. 
- echo 1 > \ - /sys/bus/pci/devices//sriov_drivers_autoprobe - -To disable auto probing VFs by a compatible driver on the host, run -command below before enabling SR-IOV capabilities. Updating this -entry will not affect VFs which are already probed. - echo 0 > \ - /sys/bus/pci/devices//sriov_drivers_autoprobe - -3.2 Usage example - -Following piece of code illustrates the usage of the SR-IOV API. - -static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id) -{ - pci_enable_sriov(dev, NR_VIRTFN); - - ... - - return 0; -} - -static void dev_remove(struct pci_dev *dev) -{ - pci_disable_sriov(dev); - - ... -} - -static int dev_suspend(struct pci_dev *dev, pm_message_t state) -{ - ... - - return 0; -} - -static int dev_resume(struct pci_dev *dev) -{ - ... - - return 0; -} - -static void dev_shutdown(struct pci_dev *dev) -{ - ... -} - -static int dev_sriov_configure(struct pci_dev *dev, int numvfs) -{ - if (numvfs > 0) { - ... - pci_enable_sriov(dev, numvfs); - ... - return numvfs; - } - if (numvfs == 0) { - .... - pci_disable_sriov(dev); - ... - return 0; - } -} - -static struct pci_driver dev_driver = { - .name = "SR-IOV Physical Function driver", - .id_table = dev_id_table, - .probe = dev_probe, - .remove = dev_remove, - .suspend = dev_suspend, - .resume = dev_resume, - .shutdown = dev_shutdown, - .sriov_configure = dev_sriov_configure, -}; -- cgit From 3b9bae029b60ee0fa6d6205e0debfad4482434a7 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:27 +0800 Subject: Documentation: PCI: convert MSI-HOWTO.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. 
Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/MSI-HOWTO.txt | 270 ------------------------------------- Documentation/PCI/index.rst | 1 + Documentation/PCI/msi-howto.rst | 287 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 288 insertions(+), 270 deletions(-) delete mode 100644 Documentation/PCI/MSI-HOWTO.txt create mode 100644 Documentation/PCI/msi-howto.rst (limited to 'Documentation') diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt deleted file mode 100644 index 618e13d5e276..000000000000 --- a/Documentation/PCI/MSI-HOWTO.txt +++ /dev/null @@ -1,270 +0,0 @@ - The MSI Driver Guide HOWTO - Tom L Nguyen tom.l.nguyen@intel.com - 10/03/2003 - Revised Feb 12, 2004 by Martine Silbermann - email: Martine.Silbermann@hp.com - Revised Jun 25, 2004 by Tom L Nguyen - Revised Jul 9, 2008 by Matthew Wilcox - Copyright 2003, 2008 Intel Corporation - -1. About this guide - -This guide describes the basics of Message Signaled Interrupts (MSIs), -the advantages of using MSI over traditional interrupt mechanisms, how -to change your driver to use MSI or MSI-X and some basic diagnostics to -try if a device doesn't support MSIs. - - -2. What are MSIs? - -A Message Signaled Interrupt is a write from the device to a special -address which causes an interrupt to be received by the CPU. - -The MSI capability was first specified in PCI 2.2 and was later enhanced -in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X -capability was also introduced with PCI 3.0. It supports more interrupts -per device than MSI and allows interrupts to be independently configured. - -Devices may support both MSI and MSI-X, but only one can be enabled at -a time. - - -3. Why use MSIs? - -There are three reasons why using MSIs can give an advantage over -traditional pin-based interrupts. - -Pin-based PCI interrupts are often shared amongst several devices. 
-To support this, the kernel must call each interrupt handler associated -with an interrupt, which leads to reduced performance for the system as -a whole. MSIs are never shared, so this problem cannot arise. - -When a device writes data to memory, then raises a pin-based interrupt, -it is possible that the interrupt may arrive before all the data has -arrived in memory (this becomes more likely with devices behind PCI-PCI -bridges). In order to ensure that all the data has arrived in memory, -the interrupt handler must read a register on the device which raised -the interrupt. PCI transaction ordering rules require that all the data -arrive in memory before the value may be returned from the register. -Using MSIs avoids this problem as the interrupt-generating write cannot -pass the data writes, so by the time the interrupt is raised, the driver -knows that all the data has arrived in memory. - -PCI devices can only support a single pin-based interrupt per function. -Often drivers have to query the device to find out what event has -occurred, slowing down interrupt handling for the common case. With -MSIs, a device can support more interrupts, allowing each interrupt -to be specialised to a different purpose. One possible design gives -infrequent conditions (such as errors) their own interrupt which allows -the driver to handle the normal interrupt handling path more efficiently. -Other possible designs include giving one interrupt to each packet queue -in a network card or each port in a storage controller. - - -4. How to use MSIs - -PCI devices are initialised to use pin-based interrupts. The device -driver has to set up the device to use MSI or MSI-X. Not all machines -support MSIs correctly, and for those machines, the APIs described below -will simply fail and the device will continue to use pin-based interrupts. - -4.1 Include kernel support for MSIs - -To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI -option enabled. 
This option is only available on some architectures, -and it may depend on some other options also being set. For example, -on x86, you must also enable X86_UP_APIC or SMP in order to see the -CONFIG_PCI_MSI option. - -4.2 Using MSI - -Most of the hard work is done for the driver in the PCI layer. The driver -simply has to request that the PCI layer set up the MSI capability for this -device. - -To automatically use MSI or MSI-X interrupt vectors, use the following -function: - - int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, - unsigned int max_vecs, unsigned int flags); - -which allocates up to max_vecs interrupt vectors for a PCI device. It -returns the number of vectors allocated or a negative error. If the device -has a requirements for a minimum number of vectors the driver can pass a -min_vecs argument set to this limit, and the PCI core will return -ENOSPC -if it can't meet the minimum number of vectors. - -The flags argument is used to specify which type of interrupt can be used -by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX). -A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for -any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set, -pci_alloc_irq_vectors() will spread the interrupts around the available CPUs. - -To get the Linux IRQ numbers passed to request_irq() and free_irq() and the -vectors, use the following function: - - int pci_irq_vector(struct pci_dev *dev, unsigned int nr); - -Any allocated resources should be freed before removing the device using -the following function: - - void pci_free_irq_vectors(struct pci_dev *dev); - -If a device supports both MSI-X and MSI capabilities, this API will use the -MSI-X facilities in preference to the MSI facilities. MSI-X supports any -number of interrupts between 1 and 2048. In contrast, MSI is restricted to -a maximum of 32 interrupts (and must be a power of two). 
In addition, the -MSI interrupt vectors must be allocated consecutively, so the system might -not be able to allocate as many vectors for MSI as it could for MSI-X. On -some platforms, MSI interrupts must all be targeted at the same set of CPUs -whereas MSI-X interrupts can all be targeted at different CPUs. - -If a device supports neither MSI-X or MSI it will fall back to a single -legacy IRQ vector. - -The typical usage of MSI or MSI-X interrupts is to allocate as many vectors -as possible, likely up to the limit supported by the device. If nvec is -larger than the number supported by the device it will automatically be -capped to the supported limit, so there is no need to query the number of -vectors supported beforehand: - - nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES) - if (nvec < 0) - goto out_err; - -If a driver is unable or unwilling to deal with a variable number of MSI -interrupts it can request a particular number of interrupts by passing that -number to pci_alloc_irq_vectors() function as both 'min_vecs' and -'max_vecs' parameters: - - ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES); - if (ret < 0) - goto out_err; - -The most notorious example of the request type described above is enabling -the single MSI mode for a device. 
It could be done by passing two 1s as -'min_vecs' and 'max_vecs': - - ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES); - if (ret < 0) - goto out_err; - -Some devices might not support using legacy line interrupts, in which case -the driver can specify that only MSI or MSI-X is acceptable: - - nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX); - if (nvec < 0) - goto out_err; - -4.3 Legacy APIs - -The following old APIs to enable and disable MSI or MSI-X interrupts should -not be used in new code: - - pci_enable_msi() /* deprecated */ - pci_disable_msi() /* deprecated */ - pci_enable_msix_range() /* deprecated */ - pci_enable_msix_exact() /* deprecated */ - pci_disable_msix() /* deprecated */ - -Additionally there are APIs to provide the number of supported MSI or MSI-X -vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these -should be avoided in favor of letting pci_alloc_irq_vectors() cap the -number of vectors. If you have a legitimate special use case for the count -of vectors we might have to revisit that decision and add a -pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently. - -4.4 Considerations when using MSIs - -4.4.1 Spinlocks - -Most device drivers have a per-device spinlock which is taken in the -interrupt handler. With pin-based interrupts or a single MSI, it is not -necessary to disable interrupts (Linux guarantees the same interrupt will -not be re-entered). If a device uses multiple interrupts, the driver -must disable interrupts while the lock is held. If the device sends -a different interrupt, the driver will deadlock trying to recursively -acquire the spinlock. Such deadlocks can be avoided by using -spin_lock_irqsave() or spin_lock_irq() which disable local interrupts -and acquire the lock (see Documentation/kernel-hacking/locking.rst). 
- -4.5 How to tell whether MSI/MSI-X is enabled on a device - -Using 'lspci -v' (as root) may show some devices with "MSI", "Message -Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities -has an 'Enable' flag which is followed with either "+" (enabled) -or "-" (disabled). - - -5. MSI quirks - -Several PCI chipsets or devices are known not to support MSIs. -The PCI stack provides three ways to disable MSIs: - -1. globally -2. on all devices behind a specific bridge -3. on a single device - -5.1. Disabling MSIs globally - -Some host chipsets simply don't support MSIs properly. If we're -lucky, the manufacturer knows this and has indicated it in the ACPI -FADT table. In this case, Linux automatically disables MSIs. -Some boards don't include this information in the table and so we have -to detect them ourselves. The complete list of these is found near the -quirk_disable_all_msi() function in drivers/pci/quirks.c. - -If you have a board which has problems with MSIs, you can pass pci=nomsi -on the kernel command line to disable MSIs on all devices. It would be -in your best interests to report the problem to linux-pci@vger.kernel.org -including a full 'lspci -v' so we can add the quirks to the kernel. - -5.2. Disabling MSIs below a bridge - -Some PCI bridges are not able to route MSIs between busses properly. -In this case, MSIs must be disabled on all devices behind the bridge. - -Some bridges allow you to enable MSIs by changing some bits in their -PCI configuration space (especially the Hypertransport chipsets such -as the nVidia nForce and Serverworks HT2000). As with host chipsets, -Linux mostly knows about them and automatically enables MSIs if it can. 
-If you have a bridge unknown to Linux, you can enable -MSIs in configuration space using whatever method you know works, then -enable MSIs on that bridge by doing: - - echo 1 > /sys/bus/pci/devices/$bridge/msi_bus - -where $bridge is the PCI address of the bridge you've enabled (eg -0000:00:0e.0). - -To disable MSIs, echo 0 instead of 1. Changing this value should be -done with caution as it could break interrupt handling for all devices -below this bridge. - -Again, please notify linux-pci@vger.kernel.org of any bridges that need -special handling. - -5.3. Disabling MSIs on a single device - -Some devices are known to have faulty MSI implementations. Usually this -is handled in the individual device driver, but occasionally it's necessary -to handle this with a quirk. Some drivers have an option to disable use -of MSI. While this is a convenient workaround for the driver author, -it is not good practice, and should not be emulated. - -5.4. Finding why MSIs are disabled on a device - -From the above three sections, you can see that there are many reasons -why MSIs may not be enabled for a given device. Your first step should -be to examine your dmesg carefully to determine whether MSIs are enabled -for your machine. You should also check your .config to be sure you -have enabled CONFIG_PCI_MSI. - -Then, 'lspci -t' gives the list of bridges above a device. Reading -/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1) -or disabled (0). If 0 is found in any of the msi_bus files belonging -to bridges between the PCI root and the device, MSIs are disabled. - -It is also worth checking the device driver to see whether it supports MSIs. -For example, it may contain calls to pci_irq_alloc_vectors() with the -PCI_IRQ_MSI or PCI_IRQ_MSIX flags. 
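The msi_bus check described in the text above can be sketched in userspace C. This is a minimal sketch and not part of the patch: the helper names read_msi_bus() and msi_allowed_on_path() are hypothetical, and the caller is assumed to have already collected the msi_bus paths of the bridges between the PCI root and the device (for example by reading the output of 'lspci -t').

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a single msi_bus file.
 * Returns 1 if it starts with '1', 0 if it starts with '0',
 * and -1 if the file cannot be opened or holds anything else.
 */
int read_msi_bus(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);

	if (c == '1')
		return 1;
	if (c == '0')
		return 0;
	return -1;
}

/*
 * Hypothetical helper: paths[] lists the msi_bus files of every bridge
 * between the PCI root and the device of interest.
 * Returns 0 as soon as any bridge reports 0 (MSIs disabled),
 * -1 if no bridge reports 0 but at least one file was unreadable,
 * and 1 if every bridge reports 1.
 */
int msi_allowed_on_path(const char *const *paths, size_t n)
{
	int unknown = 0;
	size_t i;

	assert(paths != NULL || n == 0);

	for (i = 0; i < n; i++) {
		int v = read_msi_bus(paths[i]);

		if (v == 0)
			return 0;
		if (v < 0)
			unknown = 1;
	}
	return unknown ? -1 : 1;
}
```

A caller would pass paths of the form /sys/bus/pci/devices/$bridge/msi_bus, one per bridge. A result of 0 only says that some bridge on the path blocks MSIs; it says nothing about whether the device driver itself requests them.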
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index 0d9390298c4a..458354daac47 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -11,3 +11,4 @@ Linux PCI Bus Subsystem pci picebus-howto pci-iov-howto + msi-howto diff --git a/Documentation/PCI/msi-howto.rst b/Documentation/PCI/msi-howto.rst new file mode 100644 index 000000000000..994cbb660ade --- /dev/null +++ b/Documentation/PCI/msi-howto.rst @@ -0,0 +1,287 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +========================== +The MSI Driver Guide HOWTO +========================== + +:Authors: Tom L Nguyen; Martine Silbermann; Matthew Wilcox + +:Copyright: 2003, 2008 Intel Corporation + +About this guide +================ + +This guide describes the basics of Message Signaled Interrupts (MSIs), +the advantages of using MSI over traditional interrupt mechanisms, how +to change your driver to use MSI or MSI-X and some basic diagnostics to +try if a device doesn't support MSIs. + + +What are MSIs? +============== + +A Message Signaled Interrupt is a write from the device to a special +address which causes an interrupt to be received by the CPU. + +The MSI capability was first specified in PCI 2.2 and was later enhanced +in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X +capability was also introduced with PCI 3.0. It supports more interrupts +per device than MSI and allows interrupts to be independently configured. + +Devices may support both MSI and MSI-X, but only one can be enabled at +a time. + + +Why use MSIs? +============= + +There are three reasons why using MSIs can give an advantage over +traditional pin-based interrupts. + +Pin-based PCI interrupts are often shared amongst several devices. +To support this, the kernel must call each interrupt handler associated +with an interrupt, which leads to reduced performance for the system as +a whole. MSIs are never shared, so this problem cannot arise. 
+ +When a device writes data to memory, then raises a pin-based interrupt, +it is possible that the interrupt may arrive before all the data has +arrived in memory (this becomes more likely with devices behind PCI-PCI +bridges). In order to ensure that all the data has arrived in memory, +the interrupt handler must read a register on the device which raised +the interrupt. PCI transaction ordering rules require that all the data +arrive in memory before the value may be returned from the register. +Using MSIs avoids this problem as the interrupt-generating write cannot +pass the data writes, so by the time the interrupt is raised, the driver +knows that all the data has arrived in memory. + +PCI devices can only support a single pin-based interrupt per function. +Often drivers have to query the device to find out what event has +occurred, slowing down interrupt handling for the common case. With +MSIs, a device can support more interrupts, allowing each interrupt +to be specialised to a different purpose. One possible design gives +infrequent conditions (such as errors) their own interrupt which allows +the driver to handle the normal interrupt handling path more efficiently. +Other possible designs include giving one interrupt to each packet queue +in a network card or each port in a storage controller. + + +How to use MSIs +=============== + +PCI devices are initialised to use pin-based interrupts. The device +driver has to set up the device to use MSI or MSI-X. Not all machines +support MSIs correctly, and for those machines, the APIs described below +will simply fail and the device will continue to use pin-based interrupts. + +Include kernel support for MSIs +------------------------------- + +To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI +option enabled. This option is only available on some architectures, +and it may depend on some other options also being set. 
For example, +on x86, you must also enable X86_UP_APIC or SMP in order to see the +CONFIG_PCI_MSI option. + +Using MSI +--------- + +Most of the hard work is done for the driver in the PCI layer. The driver +simply has to request that the PCI layer set up the MSI capability for this +device. + +To automatically use MSI or MSI-X interrupt vectors, use the following +function:: + + int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs, + unsigned int max_vecs, unsigned int flags); + +which allocates up to max_vecs interrupt vectors for a PCI device. It +returns the number of vectors allocated or a negative error. If the device +has a requirement for a minimum number of vectors, the driver can pass a +min_vecs argument set to this limit, and the PCI core will return -ENOSPC +if it can't meet the minimum number of vectors. + +The flags argument is used to specify which type of interrupt can be used +by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX). +A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for +any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set, +pci_alloc_irq_vectors() will spread the interrupts around the available CPUs. + +To get the Linux IRQ numbers passed to request_irq() and free_irq() and the +vectors, use the following function:: + + int pci_irq_vector(struct pci_dev *dev, unsigned int nr); + +Any allocated resources should be freed before removing the device using +the following function:: + + void pci_free_irq_vectors(struct pci_dev *dev); + +If a device supports both MSI-X and MSI capabilities, this API will use the +MSI-X facilities in preference to the MSI facilities. MSI-X supports any +number of interrupts between 1 and 2048. In contrast, MSI is restricted to +a maximum of 32 interrupts (and must be a power of two).
In addition, the +MSI interrupt vectors must be allocated consecutively, so the system might +not be able to allocate as many vectors for MSI as it could for MSI-X. On +some platforms, MSI interrupts must all be targeted at the same set of CPUs +whereas MSI-X interrupts can all be targeted at different CPUs. + +If a device supports neither MSI-X nor MSI it will fall back to a single +legacy IRQ vector. + +The typical usage of MSI or MSI-X interrupts is to allocate as many vectors +as possible, likely up to the limit supported by the device. If nvec is +larger than the number supported by the device it will automatically be +capped to the supported limit, so there is no need to query the number of +vectors supported beforehand:: + + nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES); + if (nvec < 0) + goto out_err; + +If a driver is unable or unwilling to deal with a variable number of MSI +interrupts it can request a particular number of interrupts by passing that +number to the pci_alloc_irq_vectors() function as both 'min_vecs' and +'max_vecs' parameters:: + + ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES); + if (ret < 0) + goto out_err; + +The most notorious example of the request type described above is enabling +the single MSI mode for a device.
It could be done by passing two 1s as +'min_vecs' and 'max_vecs':: + + ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES); + if (ret < 0) + goto out_err; + +Some devices might not support using legacy line interrupts, in which case +the driver can specify that only MSI or MSI-X is acceptable:: + + nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX); + if (nvec < 0) + goto out_err; + +Legacy APIs +----------- + +The following old APIs to enable and disable MSI or MSI-X interrupts should +not be used in new code:: + + pci_enable_msi() /* deprecated */ + pci_disable_msi() /* deprecated */ + pci_enable_msix_range() /* deprecated */ + pci_enable_msix_exact() /* deprecated */ + pci_disable_msix() /* deprecated */ + +Additionally there are APIs to provide the number of supported MSI or MSI-X +vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these +should be avoided in favor of letting pci_alloc_irq_vectors() cap the +number of vectors. If you have a legitimate special use case for the count +of vectors we might have to revisit that decision and add a +pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently. + +Considerations when using MSIs +------------------------------ + +Spinlocks +~~~~~~~~~ + +Most device drivers have a per-device spinlock which is taken in the +interrupt handler. With pin-based interrupts or a single MSI, it is not +necessary to disable interrupts (Linux guarantees the same interrupt will +not be re-entered). If a device uses multiple interrupts, the driver +must disable interrupts while the lock is held. If the device sends +a different interrupt, the driver will deadlock trying to recursively +acquire the spinlock. Such deadlocks can be avoided by using +spin_lock_irqsave() or spin_lock_irq() which disable local interrupts +and acquire the lock (see Documentation/kernel-hacking/locking.rst). 
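Editorial sketch (not part of the patch): the counting rules from the "Using MSI" section above can be modelled in a few lines of userspace C. This is only a toy under stated assumptions, not the kernel implementation. The function name toy_alloc_vectors() and the 'supported' parameter are hypothetical. The model covers only what the text states: a request above the device limit is capped, and a minimum that cannot be met yields -ENOSPC. The real PCI core may grant fewer vectors for other reasons, which this sketch ignores.

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy model of the vector-count rules described under "Using MSI".
 * NOT kernel code: toy_alloc_vectors() and 'supported' are hypothetical
 * names used for illustration only.
 *
 * supported: number of vectors the device advertises
 * min_vecs:  smallest count the driver can work with
 * max_vecs:  largest count the driver would like
 *
 * Returns the number of vectors granted, or -ENOSPC when fewer than
 * min_vecs can be provided.
 */
int toy_alloc_vectors(unsigned int supported,
                      unsigned int min_vecs, unsigned int max_vecs)
{
    unsigned int granted;

    /* The sketch assumes a sane request; it does not model bad input. */
    assert(min_vecs >= 1 && min_vecs <= max_vecs);

    /* A request above the device limit is capped, not rejected. */
    granted = max_vecs < supported ? max_vecs : supported;

    /* The minimum is a hard requirement. */
    if (granted < min_vecs)
        return -ENOSPC;

    return (int)granted;
}
```

Under this model, a driver that passes the same value as min_vecs and max_vecs gets either exactly that many vectors or an error, which matches the fixed-count usage shown earlier.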
+ +How to tell whether MSI/MSI-X is enabled on a device +---------------------------------------------------- + +Using 'lspci -v' (as root) may show some devices with "MSI", "Message +Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities +has an 'Enable' flag which is followed with either "+" (enabled) +or "-" (disabled). + + +MSI quirks +========== + +Several PCI chipsets or devices are known not to support MSIs. +The PCI stack provides three ways to disable MSIs: + +1. globally +2. on all devices behind a specific bridge +3. on a single device + +Disabling MSIs globally +----------------------- + +Some host chipsets simply don't support MSIs properly. If we're +lucky, the manufacturer knows this and has indicated it in the ACPI +FADT table. In this case, Linux automatically disables MSIs. +Some boards don't include this information in the table and so we have +to detect them ourselves. The complete list of these is found near the +quirk_disable_all_msi() function in drivers/pci/quirks.c. + +If you have a board which has problems with MSIs, you can pass pci=nomsi +on the kernel command line to disable MSIs on all devices. It would be +in your best interests to report the problem to linux-pci@vger.kernel.org +including a full 'lspci -v' so we can add the quirks to the kernel. + +Disabling MSIs below a bridge +----------------------------- + +Some PCI bridges are not able to route MSIs between busses properly. +In this case, MSIs must be disabled on all devices behind the bridge. + +Some bridges allow you to enable MSIs by changing some bits in their +PCI configuration space (especially the Hypertransport chipsets such +as the nVidia nForce and Serverworks HT2000). As with host chipsets, +Linux mostly knows about them and automatically enables MSIs if it can. 
+If you have a bridge unknown to Linux, you can enable +MSIs in configuration space using whatever method you know works, then +enable MSIs on that bridge by doing:: + + echo 1 > /sys/bus/pci/devices/$bridge/msi_bus + +where $bridge is the PCI address of the bridge you've enabled (e.g. +0000:00:0e.0). + +To disable MSIs, echo 0 instead of 1. Changing this value should be +done with caution as it could break interrupt handling for all devices +below this bridge. + +Again, please notify linux-pci@vger.kernel.org of any bridges that need +special handling. + +Disabling MSIs on a single device +--------------------------------- + +Some devices are known to have faulty MSI implementations. Usually this +is handled in the individual device driver, but occasionally it's necessary +to handle this with a quirk. Some drivers have an option to disable use +of MSI. While this is a convenient workaround for the driver author, +it is not good practice, and should not be emulated. + +Finding why MSIs are disabled on a device +----------------------------------------- + +From the above three sections, you can see that there are many reasons +why MSIs may not be enabled for a given device. Your first step should +be to examine your dmesg carefully to determine whether MSIs are enabled +for your machine. You should also check your .config to be sure you +have enabled CONFIG_PCI_MSI. + +Then, 'lspci -t' gives the list of bridges above a device. Reading +`/sys/bus/pci/devices/*/msi_bus` will tell you whether MSIs are enabled (1) +or disabled (0). If 0 is found in any of the msi_bus files belonging +to bridges between the PCI root and the device, MSIs are disabled. + +It is also worth checking the device driver to see whether it supports MSIs. +For example, it may contain calls to pci_alloc_irq_vectors() with the +PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
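Editorial sketch (not part of the patch): the msi_bus check described above can be expressed as a small userspace C helper. The function names read_msi_bus() and first_bridge_blocking_msi() are hypothetical, and the sketch assumes the caller has already worked out which bridges sit between the PCI root and the device (for example from 'lspci -t'). It only applies the rule from the text: a 0 in any of those msi_bus files means MSIs are disabled for the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read one msi_bus attribute file (hypothetical helper).
 * Returns 1 or 0 as found in the file, or -1 if the file cannot be
 * opened or does not contain 0 or 1.
 */
int read_msi_bus(const char *path)
{
    FILE *f = fopen(path, "r");
    int val;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &val) != 1)
        val = -1;
    fclose(f);

    return (val == 0 || val == 1) ? val : -1;
}

/*
 * Apply the rule from the text to the msi_bus values of the bridges
 * between the PCI root and the device, ordered from root to device.
 * Returns the index of the first bridge reporting 0, or -1 if no
 * bridge on the path disables MSIs.
 */
int first_bridge_blocking_msi(const int *msi_bus, size_t n)
{
    assert(msi_bus != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (msi_bus[i] == 0)
            return (int)i;
    }
    return -1;
}
```

A caller would fill the array by calling read_msi_bus() on each bridge's /sys/bus/pci/devices/$bridge/msi_bus path, then report the bridge at the returned index as the one to investigate.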
-- cgit From b66357f32fb9a68bb9f2126a894d3b9bbc4e821c Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:28 +0800 Subject: Documentation: PCI: convert acpi-info.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Cc: Mauro Carvalho Chehab --- Documentation/PCI/acpi-info.rst | 192 ++++++++++++++++++++++++++++++++++++++++ Documentation/PCI/acpi-info.txt | 187 -------------------------------------- Documentation/PCI/index.rst | 1 + 3 files changed, 193 insertions(+), 187 deletions(-) create mode 100644 Documentation/PCI/acpi-info.rst delete mode 100644 Documentation/PCI/acpi-info.txt (limited to 'Documentation') diff --git a/Documentation/PCI/acpi-info.rst b/Documentation/PCI/acpi-info.rst new file mode 100644 index 000000000000..060217081c79 --- /dev/null +++ b/Documentation/PCI/acpi-info.rst @@ -0,0 +1,192 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +ACPI considerations for PCI host bridges +======================================== + +The general rule is that the ACPI namespace should describe everything the +OS might use unless there's another way for the OS to find it [1, 2]. + +For example, there's no standard hardware mechanism for enumerating PCI +host bridges, so the ACPI namespace must describe each host bridge, the +method for accessing PCI config space below it, the address space windows +the host bridge forwards to PCI (using _CRS), and the routing of legacy +INTx interrupts (using _PRT). + +PCI devices, which are below the host bridge, generally do not need to be +described via ACPI. The OS can discover them via the standard PCI +enumeration mechanism, using config accesses to discover and identify +devices and read and size their BARs. 
However, ACPI may describe PCI +devices if it provides power management or hotplug functionality for them +or if the device has INTx interrupts connected by platform interrupt +controllers and a _PRT is needed to describe those connections. + +ACPI resource description is done via _CRS objects of devices in the ACPI +namespace [2].   The _CRS is like a generalized PCI BAR: the OS can read +_CRS and figure out what resource is being consumed even if it doesn't have +a driver for the device [3].  That's important because it means an old OS +can work correctly even on a system with new devices unknown to the OS. +The new devices might not do anything, but the OS can at least make sure no +resources conflict with them. + +Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for +reserving address space. The static tables are for things the OS needs to +know early in boot, before it can parse the ACPI namespace. If a new table +is defined, an old OS needs to operate correctly even though it ignores the +table. _CRS allows that because it is generic and understood by the old +OS; a static table does not. + +If the OS is expected to manage a non-discoverable device described via +ACPI, that device will have a specific _HID/_CID that tells the OS what +driver to bind to it, and the _CRS tells the OS and the driver where the +device's registers are. + +PCI host bridges are PNP0A03 or PNP0A08 devices.  Their _CRS should +describe all the address space they consume.  This includes all the windows +they forward down to the PCI bus, as well as registers of the host bridge +itself that are not forwarded to PCI.  The host bridge registers include +things like secondary/subordinate bus registers that determine the bus +range below the bridge, window registers that describe the apertures, etc. +These are all device-specific, non-architected things, so the only way a +PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain +the device-specific details. 
 The host bridge registers also include ECAM +space, since it is consumed by the host bridge. + +ACPI defines a Consumer/Producer bit to distinguish the bridge registers +("Consumer") from the bridge apertures ("Producer") [4, 5], but early +BIOSes didn't use that bit correctly. The result is that the current ACPI +spec defines Consumer/Producer only for the Extended Address Space +descriptors; the bit should be ignored in the older QWord/DWord/Word +Address Space descriptors. Consequently, OSes have to assume all +QWord/DWord/Word descriptors are windows. + +Prior to the addition of Extended Address Space descriptors, the failure of +Consumer/Producer meant there was no way to describe bridge registers in +the PNP0A03/PNP0A08 device itself. The workaround was to describe the +bridge registers (including ECAM space) in PNP0C02 catch-all devices [6]. +With the exception of ECAM, the bridge register space is device-specific +anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to +know about it.   + +New architectures should be able to use "Consumer" Extended Address Space +descriptors in the PNP0A03 device for bridge registers, including ECAM, +although a strict interpretation of [6] might prohibit this. Old x86 and +ia64 kernels assume all address space descriptors, including "Consumer" +Extended Address Space ones, are windows, so it would not be safe to +describe bridge registers this way on those architectures. + +PNP0C02 "motherboard" devices are basically a catch-all.  There's no +programming model for them other than "don't use these resources for +anything else."  So a PNP0C02 _CRS should claim any address space that is +(1) not claimed by _CRS under any other device object in the ACPI namespace +and (2) should not be assigned by the OS to something else. + +The PCIe spec requires the Enhanced Configuration Access Method (ECAM) +unless there's a standard firmware interface for config access, e.g., the +ia64 SAL interface [7]. 
A host bridge consumes ECAM memory address space +and converts memory accesses into PCI configuration accesses. The spec +defines the ECAM address space layout and functionality; only the base of +the address space is device-specific. An ACPI OS learns the base address +from either the static MCFG table or a _CBA method in the PNP0A03 device. + +The MCFG table must describe the ECAM space of non-hot pluggable host +bridges [8]. Since MCFG is a static table and can't be updated by hotplug, +a _CBA method in the PNP0A03 device describes the ECAM space of a +hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base +address always corresponds to bus 0, even if the bus range below the bridge +(which is reported via _CRS) doesn't start at 0. + + +[1] ACPI 6.2, sec 6.1: + For any device that is on a non-enumerable type of bus (for example, an + ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI + system firmware must supply an _HID object ... for each device to + enable OSPM to do that. + +[2] ACPI 6.2, sec 3.7: + The OS enumerates motherboard devices simply by reading through the + ACPI Namespace looking for devices with hardware IDs. + + Each device enumerated by ACPI includes ACPI-defined objects in the + ACPI Namespace that report the hardware resources the device could + occupy [_PRS], an object that reports the resources that are currently + used by the device [_CRS], and objects for configuring those resources + [_SRS]. The information is used by the Plug and Play OS (OSPM) to + configure the devices. + +[3] ACPI 6.2, sec 6.2: + OSPM uses device configuration objects to configure hardware resources + for devices enumerated via ACPI. Device configuration objects provide + information about current and possible resource requirements, the + relationship between shared resources, and methods for configuring + hardware resources. + + When OSPM enumerates a device, it calls _PRS to determine the resource + requirements of the device. 
It may also call _CRS to find the current + resource settings for the device. Using this information, the Plug and + Play system determines what resources the device should consume and + sets those resources by calling the device’s _SRS control method. + + In ACPI, devices can consume resources (for example, legacy keyboards), + provide resources (for example, a proprietary PCI bridge), or do both. + Unless otherwise specified, resources for a device are assumed to be + taken from the nearest matching resource above the device in the device + hierarchy. + +[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4: + QWord/DWord/Word Address Space Descriptor (.1, .2, .3) + General Flags: Bit [0] Ignored + + Extended Address Space Descriptor (.4) + General Flags: Bit [0] Consumer/Producer: + + * 1 – This device consumes this resource + * 0 – This device produces and consumes this resource + +[5] ACPI 6.2, sec 19.6.43: + ResourceUsage specifies whether the Memory range is consumed by + this device (ResourceConsumer) or passed on to child devices + (ResourceProducer). If nothing is specified, then + ResourceConsumer is assumed. + +[6] PCI Firmware 3.2, sec 4.1.2: + If the operating system does not natively comprehend reserving the + MMCFG region, the MMCFG region must be reserved by firmware. The + address range reported in the MCFG table or by _CBA method (see Section + 4.1.3) must be reserved by declaring a motherboard resource. For most + systems, the motherboard resource would appear at the root of the ACPI + namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and + the resources in this case should not be claimed in the root PCI bus’s + _CRS. The resources can optionally be returned in Int15 E820 or + EFIGetMemoryMap as reserved memory but must always be reported through + ACPI as a motherboard resource. 
+ +[7] PCI Express 4.0, sec 7.2.2: + For systems that are PC-compatible, or that do not implement a + processor-architecture-specific firmware interface standard that allows + access to the Configuration Space, the ECAM is required as defined in + this section. + +[8] PCI Firmware 3.2, sec 4.1.2: + The MCFG table is an ACPI table that is used to communicate the base + addresses corresponding to the non-hot removable PCI Segment Groups + range within a PCI Segment Group available to the operating system at + boot. This is required for the PC-compatible systems. + + The MCFG table is only used to communicate the base addresses + corresponding to the PCI Segment Groups available to the system at + boot. + +[9] PCI Firmware 3.2, sec 4.1.3: + The _CBA (Memory mapped Configuration Base Address) control method is + an optional ACPI object that returns the 64-bit memory mapped + configuration base address for the hot plug capable host bridge. The + base address returned by _CBA is processor-relative address. The _CBA + control method evaluates to an Integer. + + This control method appears under a host bridge object. When the _CBA + method appears under an active host bridge object, the operating system + evaluates this structure to identify the memory mapped configuration + base address corresponding to the PCI Segment Group for the bus number + range specified in _CRS method. An ACPI name space object that contains + the _CBA method must also contain a corresponding _SEG method. diff --git a/Documentation/PCI/acpi-info.txt b/Documentation/PCI/acpi-info.txt deleted file mode 100644 index 3ffa3b03970e..000000000000 --- a/Documentation/PCI/acpi-info.txt +++ /dev/null @@ -1,187 +0,0 @@ - ACPI considerations for PCI host bridges - -The general rule is that the ACPI namespace should describe everything the -OS might use unless there's another way for the OS to find it [1, 2]. 
- -For example, there's no standard hardware mechanism for enumerating PCI -host bridges, so the ACPI namespace must describe each host bridge, the -method for accessing PCI config space below it, the address space windows -the host bridge forwards to PCI (using _CRS), and the routing of legacy -INTx interrupts (using _PRT). - -PCI devices, which are below the host bridge, generally do not need to be -described via ACPI. The OS can discover them via the standard PCI -enumeration mechanism, using config accesses to discover and identify -devices and read and size their BARs. However, ACPI may describe PCI -devices if it provides power management or hotplug functionality for them -or if the device has INTx interrupts connected by platform interrupt -controllers and a _PRT is needed to describe those connections. - -ACPI resource description is done via _CRS objects of devices in the ACPI -namespace [2].   The _CRS is like a generalized PCI BAR: the OS can read -_CRS and figure out what resource is being consumed even if it doesn't have -a driver for the device [3].  That's important because it means an old OS -can work correctly even on a system with new devices unknown to the OS. -The new devices might not do anything, but the OS can at least make sure no -resources conflict with them. - -Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for -reserving address space. The static tables are for things the OS needs to -know early in boot, before it can parse the ACPI namespace. If a new table -is defined, an old OS needs to operate correctly even though it ignores the -table. _CRS allows that because it is generic and understood by the old -OS; a static table does not. - -If the OS is expected to manage a non-discoverable device described via -ACPI, that device will have a specific _HID/_CID that tells the OS what -driver to bind to it, and the _CRS tells the OS and the driver where the -device's registers are. 
- -PCI host bridges are PNP0A03 or PNP0A08 devices.  Their _CRS should -describe all the address space they consume.  This includes all the windows -they forward down to the PCI bus, as well as registers of the host bridge -itself that are not forwarded to PCI.  The host bridge registers include -things like secondary/subordinate bus registers that determine the bus -range below the bridge, window registers that describe the apertures, etc. -These are all device-specific, non-architected things, so the only way a -PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain -the device-specific details.  The host bridge registers also include ECAM -space, since it is consumed by the host bridge. - -ACPI defines a Consumer/Producer bit to distinguish the bridge registers -("Consumer") from the bridge apertures ("Producer") [4, 5], but early -BIOSes didn't use that bit correctly. The result is that the current ACPI -spec defines Consumer/Producer only for the Extended Address Space -descriptors; the bit should be ignored in the older QWord/DWord/Word -Address Space descriptors. Consequently, OSes have to assume all -QWord/DWord/Word descriptors are windows. - -Prior to the addition of Extended Address Space descriptors, the failure of -Consumer/Producer meant there was no way to describe bridge registers in -the PNP0A03/PNP0A08 device itself. The workaround was to describe the -bridge registers (including ECAM space) in PNP0C02 catch-all devices [6]. -With the exception of ECAM, the bridge register space is device-specific -anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to -know about it.   - -New architectures should be able to use "Consumer" Extended Address Space -descriptors in the PNP0A03 device for bridge registers, including ECAM, -although a strict interpretation of [6] might prohibit this. 
Old x86 and -ia64 kernels assume all address space descriptors, including "Consumer" -Extended Address Space ones, are windows, so it would not be safe to -describe bridge registers this way on those architectures. - -PNP0C02 "motherboard" devices are basically a catch-all.  There's no -programming model for them other than "don't use these resources for -anything else."  So a PNP0C02 _CRS should claim any address space that is -(1) not claimed by _CRS under any other device object in the ACPI namespace -and (2) should not be assigned by the OS to something else. - -The PCIe spec requires the Enhanced Configuration Access Method (ECAM) -unless there's a standard firmware interface for config access, e.g., the -ia64 SAL interface [7]. A host bridge consumes ECAM memory address space -and converts memory accesses into PCI configuration accesses. The spec -defines the ECAM address space layout and functionality; only the base of -the address space is device-specific. An ACPI OS learns the base address -from either the static MCFG table or a _CBA method in the PNP0A03 device. - -The MCFG table must describe the ECAM space of non-hot pluggable host -bridges [8]. Since MCFG is a static table and can't be updated by hotplug, -a _CBA method in the PNP0A03 device describes the ECAM space of a -hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base -address always corresponds to bus 0, even if the bus range below the bridge -(which is reported via _CRS) doesn't start at 0. - - -[1] ACPI 6.2, sec 6.1: - For any device that is on a non-enumerable type of bus (for example, an - ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI - system firmware must supply an _HID object ... for each device to - enable OSPM to do that. - -[2] ACPI 6.2, sec 3.7: - The OS enumerates motherboard devices simply by reading through the - ACPI Namespace looking for devices with hardware IDs. 
- - Each device enumerated by ACPI includes ACPI-defined objects in the - ACPI Namespace that report the hardware resources the device could - occupy [_PRS], an object that reports the resources that are currently - used by the device [_CRS], and objects for configuring those resources - [_SRS]. The information is used by the Plug and Play OS (OSPM) to - configure the devices. - -[3] ACPI 6.2, sec 6.2: - OSPM uses device configuration objects to configure hardware resources - for devices enumerated via ACPI. Device configuration objects provide - information about current and possible resource requirements, the - relationship between shared resources, and methods for configuring - hardware resources. - - When OSPM enumerates a device, it calls _PRS to determine the resource - requirements of the device. It may also call _CRS to find the current - resource settings for the device. Using this information, the Plug and - Play system determines what resources the device should consume and - sets those resources by calling the device’s _SRS control method. - - In ACPI, devices can consume resources (for example, legacy keyboards), - provide resources (for example, a proprietary PCI bridge), or do both. - Unless otherwise specified, resources for a device are assumed to be - taken from the nearest matching resource above the device in the device - hierarchy. - -[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4: - QWord/DWord/Word Address Space Descriptor (.1, .2, .3) - General Flags: Bit [0] Ignored - - Extended Address Space Descriptor (.4) - General Flags: Bit [0] Consumer/Producer: - 1–This device consumes this resource - 0–This device produces and consumes this resource - -[5] ACPI 6.2, sec 19.6.43: - ResourceUsage specifies whether the Memory range is consumed by - this device (ResourceConsumer) or passed on to child devices - (ResourceProducer). If nothing is specified, then - ResourceConsumer is assumed. 
- -[6] PCI Firmware 3.2, sec 4.1.2: - If the operating system does not natively comprehend reserving the - MMCFG region, the MMCFG region must be reserved by firmware. The - address range reported in the MCFG table or by _CBA method (see Section - 4.1.3) must be reserved by declaring a motherboard resource. For most - systems, the motherboard resource would appear at the root of the ACPI - namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and - the resources in this case should not be claimed in the root PCI bus’s - _CRS. The resources can optionally be returned in Int15 E820 or - EFIGetMemoryMap as reserved memory but must always be reported through - ACPI as a motherboard resource. - -[7] PCI Express 4.0, sec 7.2.2: - For systems that are PC-compatible, or that do not implement a - processor-architecture-specific firmware interface standard that allows - access to the Configuration Space, the ECAM is required as defined in - this section. - -[8] PCI Firmware 3.2, sec 4.1.2: - The MCFG table is an ACPI table that is used to communicate the base - addresses corresponding to the non-hot removable PCI Segment Groups - range within a PCI Segment Group available to the operating system at - boot. This is required for the PC-compatible systems. - - The MCFG table is only used to communicate the base addresses - corresponding to the PCI Segment Groups available to the system at - boot. - -[9] PCI Firmware 3.2, sec 4.1.3: - The _CBA (Memory mapped Configuration Base Address) control method is - an optional ACPI object that returns the 64-bit memory mapped - configuration base address for the hot plug capable host bridge. The - base address returned by _CBA is processor-relative address. The _CBA - control method evaluates to an Integer. - - This control method appears under a host bridge object. 
When the _CBA - method appears under an active host bridge object, the operating system - evaluates this structure to identify the memory mapped configuration - base address corresponding to the PCI Segment Group for the bus number - range specified in _CRS method. An ACPI name space object that contains - the _CBA method must also contain a corresponding _SEG method. diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index 458354daac47..6f573f3df993 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -12,3 +12,4 @@ Linux PCI Bus Subsystem picebus-howto pci-iov-howto msi-howto + acpi-info -- cgit From 8a01fa64348aaaf54b3eef9728bfe2654e7bdd88 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:29 +0800 Subject: Documentation: PCI: convert pci-error-recovery.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/index.rst | 1 + Documentation/PCI/pci-error-recovery.rst | 424 +++++++++++++++++++++++++++++++ Documentation/PCI/pci-error-recovery.txt | 413 ------------------------------ 3 files changed, 425 insertions(+), 413 deletions(-) create mode 100644 Documentation/PCI/pci-error-recovery.rst delete mode 100644 Documentation/PCI/pci-error-recovery.txt (limited to 'Documentation') diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index 6f573f3df993..92e62d0fc9e6 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -13,3 +13,4 @@ Linux PCI Bus Subsystem pci-iov-howto msi-howto acpi-info + pci-error-recovery diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst new file mode 100644 index 000000000000..83db42092935 --- /dev/null +++ b/Documentation/PCI/pci-error-recovery.rst @@ -0,0 +1,424 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +================== +PCI Error Recovery +================== + + +:Authors: - Linas Vepstas + - Richard Lary + - Mike Mason + + +Many PCI bus controllers are able to detect a variety of hardware +PCI errors on the bus, such as parity errors on the data and address +buses, as well as SERR and PERR errors. Some of the more advanced +chipsets are able to deal with these errors; these include PCI-E chipsets, +and the PCI-host bridges found on IBM Power4, Power5 and Power6-based +pSeries boxes. A typical action taken is to disconnect the affected device, +halting all I/O to it. The goal of a disconnection is to avoid system +corruption; for example, to halt system memory corruption due to DMA's +to "wild" addresses. Typically, a reconnection mechanism is also +offered, so that the affected PCI device(s) are reset and put back +into working condition. The reset phase requires coordination +between the affected device drivers and the PCI controller chip. +This document describes a generic API for notifying device drivers +of a bus disconnection, and then performing error recovery. +This API is currently implemented in the 2.6.16 and later kernels. + +Reporting and recovery is performed in several steps. First, when +a PCI hardware error has resulted in a bus disconnect, that event +is reported as soon as possible to all affected device drivers, +including multiple instances of a device driver on multi-function +cards. This allows device drivers to avoid deadlocking in spinloops, +waiting for some i/o-space register to change, when it never will. +It also gives the drivers a chance to defer incoming I/O as +needed. + +Next, recovery is performed in several stages. Most of the complexity +is forced by the need to handle multi-function devices, that is, +devices that have multiple device drivers associated with them. 
+In the first stage, each driver is allowed to indicate what type +of reset it desires, the choices being a simple re-enabling of I/O +or requesting a slot reset. + +If any driver requests a slot reset, that is what will be done. + +After a reset and/or a re-enabling of I/O, all drivers are +again notified, so that they may then perform any device setup/config +that may be required. After these have all completed, a final +"resume normal operations" event is sent out. + +The biggest reason for choosing a kernel-based implementation rather +than a user-space implementation was the need to deal with bus +disconnects of PCI devices attached to storage media, and, in particular, +disconnects from devices holding the root file system. If the root +file system is disconnected, a user-space mechanism would have to go +through a large number of contortions to complete recovery. Almost all +of the current Linux file systems are not tolerant of disconnection +from/reconnection to their underlying block device. By contrast, +bus errors are easy to manage in the device driver. Indeed, most +device drivers already handle very similar recovery procedures; +for example, the SCSI-generic layer already provides significant +mechanisms for dealing with SCSI bus errors and SCSI bus resets. + + +Detailed Design +=============== + +Design and implementation details below, based on a chain of +public email discussions with Ben Herrenschmidt, circa 5 April 2005. + +The error recovery API support is exposed to the driver in the form of +a structure of function pointers pointed to by a new field in struct +pci_driver. A driver that fails to provide the structure is "non-aware", +and the actual recovery steps taken are platform dependent. The +arch/powerpc implementation will simulate a PCI hotplug remove/add. 
+ +This structure has the form:: + + struct pci_error_handlers + { + int (*error_detected)(struct pci_dev *dev, enum pci_channel_state); + int (*mmio_enabled)(struct pci_dev *dev); + int (*slot_reset)(struct pci_dev *dev); + void (*resume)(struct pci_dev *dev); + }; + +The possible channel states are:: + + enum pci_channel_state { + pci_channel_io_normal, /* I/O channel is in normal state */ + pci_channel_io_frozen, /* I/O to channel is blocked */ + pci_channel_io_perm_failure, /* PCI card is dead */ + }; + +Possible return values are:: + + enum pci_ers_result { + PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ + PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ + PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ + PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ + PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ + }; + +A driver does not have to implement all of these callbacks; however, +if it implements any, it must implement error_detected(). If a callback +is not implemented, the corresponding feature is considered unsupported. +For example, if mmio_enabled() and resume() aren't there, then it +is assumed that the driver is not doing any direct recovery and requires +a slot reset. Typically a driver will want to know about +a slot_reset(). + +The actual steps taken by a platform to recover from a PCI error +event will be platform-dependent, but will follow the general +sequence described below. + +STEP 0: Error Event +------------------- +A PCI bus error is detected by the PCI hardware. On powerpc, the slot +is isolated, in that all I/O is blocked: all reads return 0xffffffff, +all writes are ignored. + + +STEP 1: Notification +-------------------- +Platform calls the error_detected() callback on every instance of +every driver affected by the error. 
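As a rough illustration of the structure and enums quoted above, the following user-space sketch models a driver's error_detected() callback. The struct pci_dev stand-in, the demo_ names and the locally copied types are assumptions made so the example builds outside the kernel; a real driver would take its definitions from the kernel headers. The mapping from channel state to result code is one plausible policy, not a required one.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the enums quoted above, so the sketch builds in user space. */
enum pci_channel_state {
	pci_channel_io_normal,        /* I/O channel is in normal state */
	pci_channel_io_frozen,        /* I/O to channel is blocked */
	pci_channel_io_perm_failure,  /* PCI card is dead */
};

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct pci_dev {
	int io_stopped;  /* set once the driver has stopped issuing I/O */
};

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, enum pci_channel_state state);
	int (*mmio_enabled)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
	void (*resume)(struct pci_dev *dev);
};

/* Hypothetical callback: quiesce first, then pick a result from the state. */
int demo_error_detected(struct pci_dev *dev, enum pci_channel_state state)
{
	assert(dev != NULL);
	dev->io_stopped = 1;  /* no new I/O from here on */

	switch (state) {
	case pci_channel_io_perm_failure:
		return PCI_ERS_RESULT_DISCONNECT;   /* nothing left to recover */
	case pci_channel_io_frozen:
		return PCI_ERS_RESULT_NEED_RESET;   /* assumed policy: ask for a slot reset */
	default:
		return PCI_ERS_RESULT_CAN_RECOVER;  /* try the MMIO-enabled path */
	}
}

/* Only error_detected() is provided; the other callbacks stay NULL. */
const struct pci_error_handlers demo_err_handlers = {
	.error_detected = demo_error_detected,
};
```

Leaving mmio_enabled, slot_reset and resume unset in this sketch mirrors the rule that a driver need not implement every callback, as long as error_detected() is present.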
+ +At this point, the device might not be accessible anymore, depending on +the platform (the slot will be isolated on powerpc). The driver may +already have "noticed" the error because of a failing I/O, but this +is the proper "synchronization point", that is, it gives the driver +a chance to clean up, waiting for pending stuff (timers, whatever, etc...) +to complete; it can take semaphores, schedule, etc... everything but +touch the device. Within this function and after it returns, the driver +shouldn't do any new IOs. Called in task context. This is sort of a +"quiesce" point. See note about interrupts at the end of this doc. + +All drivers participating in this system must implement this call. +The driver must return one of the following result codes: + + - PCI_ERS_RESULT_CAN_RECOVER + Driver returns this if it thinks it might be able to recover + the HW by just banging IOs or if it wants to be given + a chance to extract some diagnostic information (see + mmio_enabled, below). + - PCI_ERS_RESULT_NEED_RESET + Driver returns this if it can't recover without a + slot reset. + - PCI_ERS_RESULT_DISCONNECT + Driver returns this if it doesn't want to recover at all. + +The next step taken will depend on the result codes returned by the +drivers. + +If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER, +then the platform should re-enable IOs on the slot (or do nothing in +particular, if the platform doesn't isolate slots), and recovery +proceeds to STEP 2 (MMIO Enabled). + +If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET), +then recovery proceeds to STEP 4 (Slot Reset). + +If the platform is unable to recover the slot, the next step +is STEP 6 (Permanent Failure). + +.. 
note:: + + The current powerpc implementation assumes that a device driver will + *not* schedule or semaphore in this routine; the current powerpc + implementation uses one kernel thread to notify all devices; + thus, if one device sleeps/schedules, all devices are affected. + Doing better requires complex multi-threaded logic in the error + recovery implementation (e.g. waiting for all notification threads + to "join" before proceeding with recovery.) This seems excessively + complex and not worth implementing. + + The current powerpc implementation doesn't much care if the device + attempts I/O at this point, or not. I/O's will fail, returning + a value of 0xff on read, and writes will be dropped. If more than + EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH + assumes that the device driver has gone into an infinite loop + and prints an error to syslog. A reboot is then required to + get the device working again. + +STEP 2: MMIO Enabled +-------------------- +The platform re-enables MMIO to the device (but typically not the +DMA), and then calls the mmio_enabled() callback on all affected +device drivers. + +This is the "early recovery" call. IOs are allowed again, but DMA is +not, with some restrictions. This is NOT a callback for the driver to +start operations again, only to peek/poke at the device, extract diagnostic +information, if any, and eventually do things like trigger a device local +reset or some such, but not restart operations. This callback is made if +all drivers on a segment agree that they can try to recover and if no automatic +link reset was performed by the HW. If the platform can't just re-enable IOs +without a slot reset or a link reset, it will not call this callback, and +instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) + +.. 
note:: + + The following is proposed; no platform implements this yet: + Proposal: All I/O's should be done _synchronously_ from within + this callback, errors triggered by them will be returned via + the normal pci_check_whatever() API, no new error_detected() + callback will be issued due to an error happening here. However, + such an error might cause IOs to be re-blocked for the whole + segment, and thus invalidate the recovery that other devices + on the same segment might have done, forcing the whole segment + into one of the next states, that is, link reset or slot reset. + +The driver should return one of the following result codes: + - PCI_ERS_RESULT_RECOVERED + Driver returns this if it thinks the device is fully + functional and thinks it is ready to start + normal driver operations again. There is no + guarantee that the driver will actually be + allowed to proceed, as another driver on the + same segment might have failed and thus triggered a + slot reset on platforms that support it. + + - PCI_ERS_RESULT_NEED_RESET + Driver returns this if it thinks the device is not + recoverable in its current state and it needs a slot + reset to proceed. + + - PCI_ERS_RESULT_DISCONNECT + Same as above. Total failure, no recovery even after + reset; driver dead. (To be defined more precisely) + +The next step taken depends on the results returned by the drivers. +If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform +proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations). + +If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform +proceeds to STEP 4 (Slot Reset). + +STEP 3: Link Reset +------------------ +The platform resets the link. This is a PCI-Express specific step +and is done whenever a fatal error has been detected that can be +"solved" by resetting the link. 
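The STEP 2 outcome rules above amount to combining the result codes returned by every driver on the segment. The sketch below is one way to express that in user-space C. The demo_ names and the local enum copy are hypothetical, and the DEMO_UNSPECIFIED value marks the combinations the text does not spell out (for example a DISCONNECT with no NEED_RESET); it does not claim to match any platform's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the result codes, so the sketch builds in user space. */
enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical labels for where recovery goes after mmio_enabled(). */
enum demo_next_step {
	DEMO_RESUME,       /* STEP 5, possibly by way of STEP 3 (Link Reset) */
	DEMO_SLOT_RESET,   /* STEP 4 */
	DEMO_UNSPECIFIED,  /* a combination the text does not cover */
};

/* Combine the mmio_enabled() results of all drivers on one segment. */
enum demo_next_step demo_after_mmio_enabled(const enum pci_ers_result *results,
					    size_t n)
{
	size_t i;
	int all_recovered = 1;

	assert(results != NULL && n > 0);

	for (i = 0; i < n; i++) {
		/* Any single request for a reset decides the outcome. */
		if (results[i] == PCI_ERS_RESULT_NEED_RESET)
			return DEMO_SLOT_RESET;
		if (results[i] != PCI_ERS_RESULT_RECOVERED)
			all_recovered = 0;
	}
	return all_recovered ? DEMO_RESUME : DEMO_UNSPECIFIED;
}
```

Returning early on the first NEED_RESET reflects the rule that one driver asking for a slot reset is enough to force it for the whole segment.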
+ +STEP 4: Slot Reset +------------------ + +In response to a return value of PCI_ERS_RESULT_NEED_RESET, the +platform will perform a slot reset on the requesting PCI device(s). +The actual steps taken by a platform to perform a slot reset +will be platform-dependent. Upon completion of slot reset, the +platform will call the device slot_reset() callback. + +Powerpc platforms implement two levels of slot reset: +soft reset (default) and fundamental (optional) reset. + +Powerpc soft reset consists of asserting the adapter #RST line and then +restoring the PCI BAR's and PCI configuration header to a state +that is equivalent to what it would be after a fresh system +power-on followed by power-on BIOS/system firmware initialization. +Soft reset is also known as hot-reset. + +Powerpc fundamental reset is supported by PCI Express cards only +and results in the device's state machines, hardware logic, port states and +configuration registers being initialized to their default conditions. + +For most PCI devices, a soft reset will be sufficient for recovery. +Optional fundamental reset is provided to support a limited number +of PCI Express devices for which a soft reset is not sufficient +for recovery. + +If the platform supports PCI hotplug, then the reset might be +performed by toggling the slot electrical power off/on. + +It is important for the platform to restore the PCI config space +to the "fresh poweron" state, rather than the "last state". After +a slot reset, the device driver will almost always use its standard +device initialization routines, and an unusual config space setup +may result in hung devices, kernel panics, or silent data corruption. + +This call gives drivers the chance to re-initialize the hardware +(re-download firmware, etc.). At this point, the driver may assume +that the card is in a fresh state and is fully functional. The slot +is unfrozen and the driver has full access to PCI config space, +memory mapped I/O space and DMA. 
Interrupts (Legacy, MSI, or MSI-X) +will also be available. + +Drivers should not restart normal I/O processing operations +at this point. If all device drivers report success on this +callback, the platform will call resume() to complete the sequence, +and let the driver restart normal I/O processing. + +A driver can still return a critical failure for this function if +it can't get the device operational after reset. If the platform +previously tried a soft reset, it might now try a hard reset (power +cycle) and then call slot_reset() again. If the device still can't +be recovered, there is nothing more that can be done; the platform +will typically report a "permanent failure" in such a case. The +device will be considered "dead" in this case. + +Drivers for multi-function cards will need to coordinate among +themselves as to which driver instance will perform any "one-shot" +or global device initialization. For example, the Symbios sym53c8xx_2 +driver performs device init only from PCI function 0:: + + + if (PCI_FUNC(pdev->devfn) == 0) + + sym_reset_scsi_bus(np, 0); + +Result codes: + - PCI_ERS_RESULT_DISCONNECT + Same as above. + +Drivers for PCI Express cards that require a fundamental reset must +set the needs_freset bit in the pci_dev structure in their probe function. +For example, the QLogic qla2xxx driver sets the needs_freset bit for certain +PCI card types:: + + + /* Set EEH reset type to fundamental if required by hba */ + + if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha)) + + pdev->needs_freset = 1; + + + +Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent +Failure). + +.. note:: + + The current powerpc implementation does not try a power-cycle + reset if the driver returned PCI_ERS_RESULT_DISCONNECT. + However, it probably should. 
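To make the multi-function coordination in STEP 4 concrete, here is a small user-space sketch of a slot_reset() callback in which only PCI function 0 performs the one-shot initialization. The two macros use the same bit layout as the kernel's PCI_DEVFN() and PCI_FUNC(); the struct pci_dev stand-in, the demo_ names and the counter are assumptions for illustration and are not taken from the sym53c8xx_2 driver.

```c
#include <assert.h>
#include <stddef.h>

/* Same encoding as the kernel macros: 5-bit slot number, 3-bit function. */
#define PCI_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_FUNC(devfn)		((devfn) & 0x07)

/* Local copy of the result codes, so the sketch builds in user space. */
enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct pci_dev {
	unsigned int devfn;
};

/* Counts how often the card-wide ("one-shot") init has run. */
int demo_global_inits;

/* Hypothetical slot_reset(): every function is called, function 0 does the global work. */
int demo_slot_reset(struct pci_dev *pdev)
{
	assert(pdev != NULL);

	if (PCI_FUNC(pdev->devfn) == 0)
		demo_global_inits++;  /* stands in for a bus or firmware reset */

	/* Per-function re-initialization would go here. */
	return PCI_ERS_RESULT_RECOVERED;
}
```

Keying the global work on the function number avoids needing a lock or shared flag between driver instances, at the cost of assuming function 0 is always present and bound to the driver.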
+ + +STEP 5: Resume Operations +------------------------- +The platform will call the resume() callback on all affected device +drivers if all drivers on the segment have returned +PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks. +The goal of this callback is to tell the driver to restart activity, +that everything is back and running. This callback does not return +a result code. + +At this point, if a new error happens, the platform will restart +a new error recovery sequence. + +STEP 6: Permanent Failure +------------------------- +A "permanent failure" has occurred, and the platform cannot recover +the device. The platform will call error_detected() with a +pci_channel_state value of pci_channel_io_perm_failure. + +The device driver should, at this point, assume the worst. It should +cancel all pending I/O, refuse all new I/O, returning -EIO to +higher layers. The device driver should then clean up all of its +memory and remove itself from kernel operations, much as it would +during system shutdown. + +The platform will typically notify the system operator of the +permanent failure in some way. If the device is hotplug-capable, +the operator will probably want to remove and replace the device. +Note, however, not all failures are truly "permanent". Some are +caused by over-heating, some by a poorly seated card. Many +PCI error events are caused by software bugs, e.g. DMA's to +wild addresses or bogus split transactions due to programming +errors. See the discussion in powerpc/eeh-pci-error-recovery.txt +for additional detail on real-life experience of the causes of +software errors. + + +Conclusion; General Remarks +--------------------------- +The way the callbacks are called is platform policy. A platform with +no slot reset capability may want to just "ignore" drivers that can't +recover (disconnect them) and try to let other cards on the same segment +recover. 
Keep in mind that in most real life cases, though, there will +be only one driver per segment. + +Now, a note about interrupts. If you get an interrupt and your +device is dead or has been isolated, there is a problem :) +The current policy is to turn this into a platform policy. +That is, the recovery API only requires that: + + - There is no guarantee that interrupt delivery can proceed from any + device on the segment starting from the error detection and until the + slot_reset callback is called, at which point interrupts are expected + to be fully operational. + + - There is no guarantee that interrupt delivery is stopped, that is, + a driver that gets an interrupt after detecting an error, or that detects + an error within the interrupt handler such that it prevents proper + ack'ing of the interrupt (and thus removal of the source) should just + return IRQ_NOTHANDLED. It's up to the platform to deal with that + condition, typically by masking the IRQ source during the duration of + the error handling. It is expected that the platform "knows" which + interrupts are routed to error-management capable slots and can deal + with temporarily disabling that IRQ number during error processing (this + isn't terribly complex). That means some IRQ latency for other devices + sharing the interrupt, but there is simply no other way. High end + platforms aren't supposed to share interrupts between many devices + anyway :) + +.. note:: + + Implementation details for the powerpc platform are discussed in + the file Documentation/powerpc/eeh-pci-error-recovery.txt + + As of this writing, there is a growing list of device drivers with + patches implementing error recovery. Not all of these patches are in + mainline yet. 
These may be used as "examples": + + - drivers/scsi/ipr + - drivers/scsi/sym53c8xx_2 + - drivers/scsi/qla2xxx + - drivers/scsi/lpfc + - drivers/net/bnx2.c + - drivers/net/e100.c + - drivers/net/e1000 + - drivers/net/e1000e + - drivers/net/ixgb + - drivers/net/ixgbe + - drivers/net/cxgb3 + - drivers/net/s2io.c + - drivers/net/qlge diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt deleted file mode 100644 index 0b6bb3ef449e..000000000000 --- a/Documentation/PCI/pci-error-recovery.txt +++ /dev/null @@ -1,413 +0,0 @@ - - PCI Error Recovery - ------------------ - February 2, 2006 - - Current document maintainer: - Linas Vepstas - updated by Richard Lary - and Mike Mason on 27-Jul-2009 - - -Many PCI bus controllers are able to detect a variety of hardware -PCI errors on the bus, such as parity errors on the data and address -buses, as well as SERR and PERR errors. Some of the more advanced -chipsets are able to deal with these errors; these include PCI-E chipsets, -and the PCI-host bridges found on IBM Power4, Power5 and Power6-based -pSeries boxes. A typical action taken is to disconnect the affected device, -halting all I/O to it. The goal of a disconnection is to avoid system -corruption; for example, to halt system memory corruption due to DMA's -to "wild" addresses. Typically, a reconnection mechanism is also -offered, so that the affected PCI device(s) are reset and put back -into working condition. The reset phase requires coordination -between the affected device drivers and the PCI controller chip. -This document describes a generic API for notifying device drivers -of a bus disconnection, and then performing error recovery. -This API is currently implemented in the 2.6.16 and later kernels. - -Reporting and recovery is performed in several steps. 
First, when -a PCI hardware error has resulted in a bus disconnect, that event -is reported as soon as possible to all affected device drivers, -including multiple instances of a device driver on multi-function -cards. This allows device drivers to avoid deadlocking in spinloops, -waiting for some i/o-space register to change, when it never will. -It also gives the drivers a chance to defer incoming I/O as -needed. - -Next, recovery is performed in several stages. Most of the complexity -is forced by the need to handle multi-function devices, that is, -devices that have multiple device drivers associated with them. -In the first stage, each driver is allowed to indicate what type -of reset it desires, the choices being a simple re-enabling of I/O -or requesting a slot reset. - -If any driver requests a slot reset, that is what will be done. - -After a reset and/or a re-enabling of I/O, all drivers are -again notified, so that they may then perform any device setup/config -that may be required. After these have all completed, a final -"resume normal operations" event is sent out. - -The biggest reason for choosing a kernel-based implementation rather -than a user-space implementation was the need to deal with bus -disconnects of PCI devices attached to storage media, and, in particular, -disconnects from devices holding the root file system. If the root -file system is disconnected, a user-space mechanism would have to go -through a large number of contortions to complete recovery. Almost all -of the current Linux file systems are not tolerant of disconnection -from/reconnection to their underlying block device. By contrast, -bus errors are easy to manage in the device driver. Indeed, most -device drivers already handle very similar recovery procedures; -for example, the SCSI-generic layer already provides significant -mechanisms for dealing with SCSI bus errors and SCSI bus resets. 
- - -Detailed Design ---------------- -Design and implementation details below, based on a chain of -public email discussions with Ben Herrenschmidt, circa 5 April 2005. - -The error recovery API support is exposed to the driver in the form of -a structure of function pointers pointed to by a new field in struct -pci_driver. A driver that fails to provide the structure is "non-aware", -and the actual recovery steps taken are platform dependent. The -arch/powerpc implementation will simulate a PCI hotplug remove/add. - -This structure has the form: -struct pci_error_handlers -{ - int (*error_detected)(struct pci_dev *dev, enum pci_channel_state); - int (*mmio_enabled)(struct pci_dev *dev); - int (*slot_reset)(struct pci_dev *dev); - void (*resume)(struct pci_dev *dev); -}; - -The possible channel states are: -enum pci_channel_state { - pci_channel_io_normal, /* I/O channel is in normal state */ - pci_channel_io_frozen, /* I/O to channel is blocked */ - pci_channel_io_perm_failure, /* PCI card is dead */ -}; - -Possible return values are: -enum pci_ers_result { - PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ - PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ - PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ - PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ - PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ -}; - -A driver does not have to implement all of these callbacks; however, -if it implements any, it must implement error_detected(). If a callback -is not implemented, the corresponding feature is considered unsupported. -For example, if mmio_enabled() and resume() aren't there, then it -is assumed that the driver is not doing any direct recovery and requires -a slot reset. Typically a driver will want to know about -a slot_reset(). 
- -The actual steps taken by a platform to recover from a PCI error -event will be platform-dependent, but will follow the general -sequence described below. - -STEP 0: Error Event -------------------- -A PCI bus error is detected by the PCI hardware. On powerpc, the slot -is isolated, in that all I/O is blocked: all reads return 0xffffffff, -all writes are ignored. - - -STEP 1: Notification --------------------- -Platform calls the error_detected() callback on every instance of -every driver affected by the error. - -At this point, the device might not be accessible anymore, depending on -the platform (the slot will be isolated on powerpc). The driver may -already have "noticed" the error because of a failing I/O, but this -is the proper "synchronization point", that is, it gives the driver -a chance to cleanup, waiting for pending stuff (timers, whatever, etc...) -to complete; it can take semaphores, schedule, etc... everything but -touch the device. Within this function and after it returns, the driver -shouldn't do any new IOs. Called in task context. This is sort of a -"quiesce" point. See note about interrupts at the end of this doc. - -All drivers participating in this system must implement this call. -The driver must return one of the following result codes: - - PCI_ERS_RESULT_CAN_RECOVER: - Driver returns this if it thinks it might be able to recover - the HW by just banging IOs or if it wants to be given - a chance to extract some diagnostic information (see - mmio_enable, below). - - PCI_ERS_RESULT_NEED_RESET: - Driver returns this if it can't recover without a - slot reset. - - PCI_ERS_RESULT_DISCONNECT: - Driver returns this if it doesn't want to recover at all. - -The next step taken will depend on the result codes returned by the -drivers. 
- -If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER, -then the platform should re-enable IOs on the slot (or do nothing in -particular, if the platform doesn't isolate slots), and recovery -proceeds to STEP 2 (MMIO Enable). - -If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET), -then recovery proceeds to STEP 4 (Slot Reset). - -If the platform is unable to recover the slot, the next step -is STEP 6 (Permanent Failure). - ->>> The current powerpc implementation assumes that a device driver will ->>> *not* schedule or semaphore in this routine; the current powerpc ->>> implementation uses one kernel thread to notify all devices; ->>> thus, if one device sleeps/schedules, all devices are affected. ->>> Doing better requires complex multi-threaded logic in the error ->>> recovery implementation (e.g. waiting for all notification threads ->>> to "join" before proceeding with recovery.) This seems excessively ->>> complex and not worth implementing. - ->>> The current powerpc implementation doesn't much care if the device ->>> attempts I/O at this point, or not. I/O's will fail, returning ->>> a value of 0xff on read, and writes will be dropped. If more than ->>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH ->>> assumes that the device driver has gone into an infinite loop ->>> and prints an error to syslog. A reboot is then required to ->>> get the device working again. - -STEP 2: MMIO Enabled -------------------- -The platform re-enables MMIO to the device (but typically not the -DMA), and then calls the mmio_enabled() callback on all affected -device drivers. - -This is the "early recovery" call. IOs are allowed again, but DMA is -not, with some restrictions. This is NOT a callback for the driver to -start operations again, only to peek/poke at the device, extract diagnostic -information, if any, and eventually do things like trigger a device local -reset or some such, but not restart operations. 
This callback is made if -all drivers on a segment agree that they can try to recover and if no automatic -link reset was performed by the HW. If the platform can't just re-enable IOs -without a slot reset or a link reset, it will not call this callback, and -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) - ->>> The following is proposed; no platform implements this yet: ->>> Proposal: All I/O's should be done _synchronously_ from within ->>> this callback, errors triggered by them will be returned via ->>> the normal pci_check_whatever() API, no new error_detected() ->>> callback will be issued due to an error happening here. However, ->>> such an error might cause IOs to be re-blocked for the whole ->>> segment, and thus invalidate the recovery that other devices ->>> on the same segment might have done, forcing the whole segment ->>> into one of the next states, that is, link reset or slot reset. - -The driver should return one of the following result codes: - - PCI_ERS_RESULT_RECOVERED - Driver returns this if it thinks the device is fully - functional and thinks it is ready to start - normal driver operations again. There is no - guarantee that the driver will actually be - allowed to proceed, as another driver on the - same segment might have failed and thus triggered a - slot reset on platforms that support it. - - - PCI_ERS_RESULT_NEED_RESET - Driver returns this if it thinks the device is not - recoverable in its current state and it needs a slot - reset to proceed. - - - PCI_ERS_RESULT_DISCONNECT - Same as above. Total failure, no recovery even after - reset driver dead. (To be defined more precisely) - -The next step taken depends on the results returned by the drivers. -If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). 
- -If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform -proceeds to STEP 4 (Slot Reset) - -STEP 3: Link Reset ------------------- -The platform resets the link. This is a PCI-Express specific step -and is done whenever a fatal error has been detected that can be -"solved" by resetting the link. - -STEP 4: Slot Reset ------------------- - -In response to a return value of PCI_ERS_RESULT_NEED_RESET, the -the platform will perform a slot reset on the requesting PCI device(s). -The actual steps taken by a platform to perform a slot reset -will be platform-dependent. Upon completion of slot reset, the -platform will call the device slot_reset() callback. - -Powerpc platforms implement two levels of slot reset: -soft reset(default) and fundamental(optional) reset. - -Powerpc soft reset consists of asserting the adapter #RST line and then -restoring the PCI BAR's and PCI configuration header to a state -that is equivalent to what it would be after a fresh system -power-on followed by power-on BIOS/system firmware initialization. -Soft reset is also known as hot-reset. - -Powerpc fundamental reset is supported by PCI Express cards only -and results in device's state machines, hardware logic, port states and -configuration registers to initialize to their default conditions. - -For most PCI devices, a soft reset will be sufficient for recovery. -Optional fundamental reset is provided to support a limited number -of PCI Express devices for which a soft reset is not sufficient -for recovery. - -If the platform supports PCI hotplug, then the reset might be -performed by toggling the slot electrical power off/on. - -It is important for the platform to restore the PCI config space -to the "fresh poweron" state, rather than the "last state". After -a slot reset, the device driver will almost always use its standard -device initialization routines, and an unusual config space setup -may result in hung devices, kernel panics, or silent data corruption. 
- -This call gives drivers the chance to re-initialize the hardware -(re-download firmware, etc.). At this point, the driver may assume -that the card is in a fresh state and is fully functional. The slot -is unfrozen and the driver has full access to PCI config space, -memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X) -will also be available. - -Drivers should not restart normal I/O processing operations -at this point. If all device drivers report success on this -callback, the platform will call resume() to complete the sequence, -and let the driver restart normal I/O processing. - -A driver can still return a critical failure for this function if -it can't get the device operational after reset. If the platform -previously tried a soft reset, it might now try a hard reset (power -cycle) and then call slot_reset() again. It the device still can't -be recovered, there is nothing more that can be done; the platform -will typically report a "permanent failure" in such a case. The -device will be considered "dead" in this case. - -Drivers for multi-function cards will need to coordinate among -themselves as to which driver instance will perform any "one-shot" -or global device initialization. For example, the Symbios sym53cxx2 -driver performs device init only from PCI function 0: - -+ if (PCI_FUNC(pdev->devfn) == 0) -+ sym_reset_scsi_bus(np, 0); - - Result codes: - - PCI_ERS_RESULT_DISCONNECT - Same as above. - -Drivers for PCI Express cards that require a fundamental reset must -set the needs_freset bit in the pci_dev structure in their probe function. -For example, the QLogic qla2xxx driver sets the needs_freset bit for certain -PCI card types: - -+ /* Set EEH reset type to fundamental if required by hba */ -+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha)) -+ pdev->needs_freset = 1; -+ - -Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent -Failure). 
- ->>> The current powerpc implementation does not try a power-cycle ->>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT. ->>> However, it probably should. - - -STEP 5: Resume Operations -------------------------- -The platform will call the resume() callback on all affected device -drivers if all drivers on the segment have returned -PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks. -The goal of this callback is to tell the driver to restart activity, -that everything is back and running. This callback does not return -a result code. - -At this point, if a new error happens, the platform will restart -a new error recovery sequence. - -STEP 6: Permanent Failure -------------------------- -A "permanent failure" has occurred, and the platform cannot recover -the device. The platform will call error_detected() with a -pci_channel_state value of pci_channel_io_perm_failure. - -The device driver should, at this point, assume the worst. It should -cancel all pending I/O, refuse all new I/O, returning -EIO to -higher layers. The device driver should then clean up all of its -memory and remove itself from kernel operations, much as it would -during system shutdown. - -The platform will typically notify the system operator of the -permanent failure in some way. If the device is hotplug-capable, -the operator will probably want to remove and replace the device. -Note, however, not all failures are truly "permanent". Some are -caused by over-heating, some by a poorly seated card. Many -PCI error events are caused by software bugs, e.g. DMA's to -wild addresses or bogus split transactions due to programming -errors. See the discussion in powerpc/eeh-pci-error-recovery.txt -for additional detail on real-life experience of the causes of -software errors. - - -Conclusion; General Remarks ---------------------------- -The way the callbacks are called is platform policy. 
A platform with -no slot reset capability may want to just "ignore" drivers that can't -recover (disconnect them) and try to let other cards on the same segment -recover. Keep in mind that in most real life cases, though, there will -be only one driver per segment. - -Now, a note about interrupts. If you get an interrupt and your -device is dead or has been isolated, there is a problem :) -The current policy is to turn this into a platform policy. -That is, the recovery API only requires that: - - - There is no guarantee that interrupt delivery can proceed from any -device on the segment starting from the error detection and until the -slot_reset callback is called, at which point interrupts are expected -to be fully operational. - - - There is no guarantee that interrupt delivery is stopped, that is, -a driver that gets an interrupt after detecting an error, or that detects -an error within the interrupt handler such that it prevents proper -ack'ing of the interrupt (and thus removal of the source) should just -return IRQ_NOTHANDLED. It's up to the platform to deal with that -condition, typically by masking the IRQ source during the duration of -the error handling. It is expected that the platform "knows" which -interrupts are routed to error-management capable slots and can deal -with temporarily disabling that IRQ number during error processing (this -isn't terribly complex). That means some IRQ latency for other devices -sharing the interrupt, but there is simply no other way. High end -platforms aren't supposed to share interrupts between many devices -anyway :) - ->>> Implementation details for the powerpc platform are discussed in ->>> the file Documentation/powerpc/eeh-pci-error-recovery.txt - ->>> As of this writing, there is a growing list of device drivers with ->>> patches implementing error recovery. Not all of these patches are in ->>> mainline yet. 
These may be used as "examples": ->>> ->>> drivers/scsi/ipr ->>> drivers/scsi/sym53c8xx_2 ->>> drivers/scsi/qla2xxx ->>> drivers/scsi/lpfc ->>> drivers/next/bnx2.c ->>> drivers/next/e100.c ->>> drivers/net/e1000 ->>> drivers/net/e1000e ->>> drivers/net/ixgb ->>> drivers/net/ixgbe ->>> drivers/net/cxgb3 ->>> drivers/net/s2io.c ->>> drivers/net/qlge - -The End -------- -- cgit From 4e37f055a92e4a813b29fb196a05a6a826abb790 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:30 +0800 Subject: Documentation: PCI: convert pcieaer-howto.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/index.rst | 1 + Documentation/PCI/pcieaer-howto.rst | 311 ++++++++++++++++++++++++++++++++++++ Documentation/PCI/pcieaer-howto.txt | 267 ------------------------------- 3 files changed, 312 insertions(+), 267 deletions(-) create mode 100644 Documentation/PCI/pcieaer-howto.rst delete mode 100644 Documentation/PCI/pcieaer-howto.txt (limited to 'Documentation') diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index 92e62d0fc9e6..f54b65b1ca5f 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -14,3 +14,4 @@ Linux PCI Bus Subsystem msi-howto acpi-info pci-error-recovery + pcieaer-howto diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst new file mode 100644 index 000000000000..18bdefaafd1a --- /dev/null +++ b/Documentation/PCI/pcieaer-howto.rst @@ -0,0 +1,311 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +=========================================================== +The PCI Express Advanced Error Reporting Driver Guide HOWTO +=========================================================== + +:Authors: - T.
Long Nguyen + - Yanmin Zhang + +:Copyright: |copy| 2006 Intel Corporation + +Overview +=========== + +About this guide +---------------- + +This guide describes the basics of the PCI Express Advanced Error +Reporting (AER) driver and provides information on how to use it, as +well as how to enable the drivers of endpoint devices to conform with +PCI Express AER driver. + + +What is the PCI Express AER Driver? +----------------------------------- + +PCI Express error signaling can occur on the PCI Express link itself +or on behalf of transactions initiated on the link. PCI Express +defines two error reporting paradigms: the baseline capability and +the Advanced Error Reporting capability. The baseline capability is +required of all PCI Express components providing a minimum defined +set of error reporting requirements. Advanced Error Reporting +capability is implemented with a PCI Express advanced error reporting +extended capability structure providing more robust error reporting. + +The PCI Express AER driver provides the infrastructure to support PCI +Express Advanced Error Reporting capability. The PCI Express AER +driver provides three basic functions: + + - Gathers the comprehensive error information if errors occurred. + - Reports error to the users. + - Performs error recovery actions. + +AER driver only attaches root ports which support PCI-Express AER +capability. + + +User Guide +========== + +Include the PCI Express AER Root Driver into the Linux Kernel +------------------------------------------------------------- + +The PCI Express AER Root driver is a Root Port service driver attached +to the PCI Express Port Bus driver. If a user wants to use it, the driver +has to be compiled. Option CONFIG_PCIEAER supports this capability. It +depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and +CONFIG_PCIEAER = y. + +Load PCI Express AER Root Driver +-------------------------------- + +Some systems have AER support in firmware. 
Enabling Linux AER support at +the same time the firmware handles AER may result in unpredictable +behavior. Therefore, Linux does not handle AER events unless the firmware +grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 +Specification for details regarding _OSC usage. + +AER error output +---------------- + +When a PCIe AER error is captured, an error message will be output to +console. If it's a correctable error, it is output as a warning. +Otherwise, it is printed as an error. So users could choose different +log level to filter out correctable error messages. + +Below shows an example:: + + 0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID) + 0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000 + 0000:50:00.0: [20] Unsupported Request (First) + 0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100 + +In the example, 'Requester ID' means the ID of the device who sends +the error message to root port. Pls. refer to pci express specs for +other fields. + +AER Statistics / Counters +------------------------- + +When PCIe AER errors are captured, the counters / statistics are also exposed +in the form of sysfs attributes which are documented at +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats + +Developer Guide +=============== + +To enable AER aware support requires a software driver to configure +the AER capability structure within its device and to provide callbacks. + +To support AER better, developers need understand how AER does work +firstly. + +PCI Express errors are classified into two types: correctable errors +and uncorrectable errors. This classification is based on the impacts +of those errors, which may result in degraded performance or function +failure. + +Correctable errors pose no impacts on the functionality of the +interface. The PCI Express protocol can recover without any software +intervention or any loss of data. 
These errors are detected and +corrected by hardware. Unlike correctable errors, uncorrectable +errors impact functionality of the interface. Uncorrectable errors +can cause a particular transaction or a particular PCI Express link +to be unreliable. Depending on those error conditions, uncorrectable +errors are further classified into non-fatal errors and fatal errors. +Non-fatal errors cause the particular transaction to be unreliable, +but the PCI Express link itself is fully functional. Fatal errors, on +the other hand, cause the link to be unreliable. + +When AER is enabled, a PCI Express device will automatically send an +error message to the PCIe root port above it when the device captures +an error. The Root Port, upon receiving an error reporting message, +internally processes and logs the error message in its PCI Express +capability structure. Error information being logged includes storing +the error reporting agent's requestor ID into the Error Source +Identification Registers and setting the error bits of the Root Error +Status Register accordingly. If AER error reporting is enabled in Root +Error Command Register, the Root Port generates an interrupt if an +error is detected. + +Note that the errors as described above are related to the PCI Express +hierarchy and links. These errors do not include any device specific +errors because device specific errors will still get sent directly to +the device driver. + +Configure the AER capability structure +-------------------------------------- + +AER aware drivers of PCI Express component need change the device +control registers to enable AER. They also could change AER registers, +including mask and severity registers. Helper function +pci_enable_pcie_error_reporting could be used to enable AER. See +section 3.3. 
+ +Provide callbacks +----------------- + +callback reset_link to reset pci express link +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This callback is used to reset the pci express physical link when a +fatal error happens. The root port aer service driver provides a +default reset_link function, but different upstream ports might +have different specifications to reset pci express link, so all +upstream ports should provide their own reset_link functions. + +In struct pcie_port_service_driver, a new pointer, reset_link, is +added. +:: + + pci_ers_result_t (*reset_link) (struct pci_dev *dev); + +Section 3.2.2.2 provides more detailed info on when to call +reset_link. + +PCI error-recovery callbacks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The PCI Express AER Root driver uses error callbacks to coordinate +with downstream device drivers associated with a hierarchy in question +when performing error recovery actions. + +Data struct pci_driver has a pointer, err_handler, to point to +pci_error_handlers who consists of a couple of callback function +pointers. AER driver follows the rules defined in +pci-error-recovery.txt except pci express specific parts (e.g. +reset_link). Pls. refer to pci-error-recovery.txt for detailed +definitions of the callbacks. + +Below sections specify when to call the error callback functions. + +Correctable errors +~~~~~~~~~~~~~~~~~~ + +Correctable errors pose no impacts on the functionality of +the interface. The PCI Express protocol can recover without any +software intervention or any loss of data. These errors do not +require any recovery actions. The AER driver clears the device's +correctable error status register accordingly and logs these errors. + +Non-correctable (non-fatal and fatal) errors +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If an error message indicates a non-fatal error, performing link reset +at upstream is not required. 
The AER driver calls error_detected(dev, +pci_channel_io_normal) to all drivers associated within a hierarchy in +question. for example:: + + EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort + +If Upstream port A captures an AER error, the hierarchy consists of +Downstream port B and EndPoint. + +A driver may return PCI_ERS_RESULT_CAN_RECOVER, +PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on +whether it can recover or the AER driver calls mmio_enabled as next. + +If an error message indicates a fatal error, kernel will broadcast +error_detected(dev, pci_channel_io_frozen) to all drivers within +a hierarchy in question. Then, performing link reset at upstream is +necessary. As different kinds of devices might use different approaches +to reset link, AER port service driver is required to provide the +function to reset link. Firstly, kernel looks for if the upstream +component has an aer driver. If it has, kernel uses the reset_link +callback of the aer driver. If the upstream component has no aer driver +and the port is downstream port, we will perform a hot reset as the +default by setting the Secondary Bus Reset bit of the Bridge Control +register associated with the downstream port. As for upstream ports, +they should provide their own aer service drivers with reset_link +function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and +reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes +to mmio_enabled. + +helper functions +---------------- +:: + + int pci_enable_pcie_error_reporting(struct pci_dev *dev); + +pci_enable_pcie_error_reporting enables the device to send error +messages to root port when an error is detected. Note that devices +don't enable the error reporting by default, so device drivers need +call this function to enable it. 
+ +:: + + int pci_disable_pcie_error_reporting(struct pci_dev *dev); + +pci_disable_pcie_error_reporting disables the device to send error +messages to root port when an error is detected. + +:: + + int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);` + +pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable +error status register. + +Frequent Asked Questions +------------------------ + +Q: + What happens if a PCI Express device driver does not provide an + error recovery handler (pci_driver->err_handler is equal to NULL)? + +A: + The devices attached with the driver won't be recovered. If the + error is fatal, kernel will print out warning messages. Please refer + to section 3 for more information. + +Q: + What happens if an upstream port service driver does not provide + callback reset_link? + +A: + Fatal error recovery will fail if the errors are reported by the + upstream ports who are attached by the service driver. + +Q: + How does this infrastructure deal with driver that is not PCI + Express aware? + +A: + This infrastructure calls the error callback functions of the + driver when an error happens. But if the driver is not aware of + PCI Express, the device might not report its own errors to root + port. + +Q: + What modifications will that driver need to make it compatible + with the PCI Express AER Root driver? + +A: + It could call the helper functions to enable AER in devices and + cleanup uncorrectable status register. Pls. refer to section 3.3. + + +Software error injection +======================== + +Debugging PCIe AER error recovery code is quite difficult because it +is hard to trigger real hardware errors. Software based error +injection can be used to fake various kinds of PCIe errors. + +First you should enable PCIe AER software error injection in kernel +configuration, that is, following item should be in your .config. 
+ +CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m + +After reboot with new kernel or insert the module, a device file named +/dev/aer_inject should be created. + +Then, you need a user space tool named aer-inject, which can be gotten +from: + + https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ + +More information about aer-inject can be found in the document comes +with its source code. diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt deleted file mode 100644 index 48ce7903e3c6..000000000000 --- a/Documentation/PCI/pcieaer-howto.txt +++ /dev/null @@ -1,267 +0,0 @@ - The PCI Express Advanced Error Reporting Driver Guide HOWTO - T. Long Nguyen - Yanmin Zhang - 07/29/2006 - - -1. Overview - -1.1 About this guide - -This guide describes the basics of the PCI Express Advanced Error -Reporting (AER) driver and provides information on how to use it, as -well as how to enable the drivers of endpoint devices to conform with -PCI Express AER driver. - -1.2 Copyright (C) Intel Corporation 2006. - -1.3 What is the PCI Express AER Driver? - -PCI Express error signaling can occur on the PCI Express link itself -or on behalf of transactions initiated on the link. PCI Express -defines two error reporting paradigms: the baseline capability and -the Advanced Error Reporting capability. The baseline capability is -required of all PCI Express components providing a minimum defined -set of error reporting requirements. Advanced Error Reporting -capability is implemented with a PCI Express advanced error reporting -extended capability structure providing more robust error reporting. - -The PCI Express AER driver provides the infrastructure to support PCI -Express Advanced Error Reporting capability. The PCI Express AER -driver provides three basic functions: - -- Gathers the comprehensive error information if errors occurred. -- Reports error to the users. -- Performs error recovery actions. 
- -AER driver only attaches root ports which support PCI-Express AER -capability. - - -2. User Guide - -2.1 Include the PCI Express AER Root Driver into the Linux Kernel - -The PCI Express AER Root driver is a Root Port service driver attached -to the PCI Express Port Bus driver. If a user wants to use it, the driver -has to be compiled. Option CONFIG_PCIEAER supports this capability. It -depends on CONFIG_PCIEPORTBUS, so pls. set CONFIG_PCIEPORTBUS=y and -CONFIG_PCIEAER = y. - -2.2 Load PCI Express AER Root Driver - -Some systems have AER support in firmware. Enabling Linux AER support at -the same time the firmware handles AER may result in unpredictable -behavior. Therefore, Linux does not handle AER events unless the firmware -grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0 -Specification for details regarding _OSC usage. - -2.3 AER error output - -When a PCIe AER error is captured, an error message will be output to -console. If it's a correctable error, it is output as a warning. -Otherwise, it is printed as an error. So users could choose different -log level to filter out correctable error messages. - -Below shows an example: -0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID) -0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000 -0000:50:00.0: [20] Unsupported Request (First) -0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100 - -In the example, 'Requester ID' means the ID of the device who sends -the error message to root port. Pls. refer to pci express specs for -other fields. - -2.4 AER Statistics / Counters - -When PCIe AER errors are captured, the counters / statistics are also exposed -in the form of sysfs attributes which are documented at -Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats - -3. 
Developer Guide - -To enable AER aware support requires a software driver to configure -the AER capability structure within its device and to provide callbacks. - -To support AER better, developers need understand how AER does work -firstly. - -PCI Express errors are classified into two types: correctable errors -and uncorrectable errors. This classification is based on the impacts -of those errors, which may result in degraded performance or function -failure. - -Correctable errors pose no impacts on the functionality of the -interface. The PCI Express protocol can recover without any software -intervention or any loss of data. These errors are detected and -corrected by hardware. Unlike correctable errors, uncorrectable -errors impact functionality of the interface. Uncorrectable errors -can cause a particular transaction or a particular PCI Express link -to be unreliable. Depending on those error conditions, uncorrectable -errors are further classified into non-fatal errors and fatal errors. -Non-fatal errors cause the particular transaction to be unreliable, -but the PCI Express link itself is fully functional. Fatal errors, on -the other hand, cause the link to be unreliable. - -When AER is enabled, a PCI Express device will automatically send an -error message to the PCIe root port above it when the device captures -an error. The Root Port, upon receiving an error reporting message, -internally processes and logs the error message in its PCI Express -capability structure. Error information being logged includes storing -the error reporting agent's requestor ID into the Error Source -Identification Registers and setting the error bits of the Root Error -Status Register accordingly. If AER error reporting is enabled in Root -Error Command Register, the Root Port generates an interrupt if an -error is detected. - -Note that the errors as described above are related to the PCI Express -hierarchy and links. 
These errors do not include any device specific -errors because device specific errors will still get sent directly to -the device driver. - -3.1 Configure the AER capability structure - -AER aware drivers of PCI Express component need change the device -control registers to enable AER. They also could change AER registers, -including mask and severity registers. Helper function -pci_enable_pcie_error_reporting could be used to enable AER. See -section 3.3. - -3.2. Provide callbacks - -3.2.1 callback reset_link to reset pci express link - -This callback is used to reset the pci express physical link when a -fatal error happens. The root port aer service driver provides a -default reset_link function, but different upstream ports might -have different specifications to reset pci express link, so all -upstream ports should provide their own reset_link functions. - -In struct pcie_port_service_driver, a new pointer, reset_link, is -added. - -pci_ers_result_t (*reset_link) (struct pci_dev *dev); - -Section 3.2.2.2 provides more detailed info on when to call -reset_link. - -3.2.2 PCI error-recovery callbacks - -The PCI Express AER Root driver uses error callbacks to coordinate -with downstream device drivers associated with a hierarchy in question -when performing error recovery actions. - -Data struct pci_driver has a pointer, err_handler, to point to -pci_error_handlers who consists of a couple of callback function -pointers. AER driver follows the rules defined in -pci-error-recovery.txt except pci express specific parts (e.g. -reset_link). Pls. refer to pci-error-recovery.txt for detailed -definitions of the callbacks. - -Below sections specify when to call the error callback functions. - -3.2.2.1 Correctable errors - -Correctable errors pose no impacts on the functionality of -the interface. The PCI Express protocol can recover without any -software intervention or any loss of data. These errors do not -require any recovery actions. 
The AER driver clears the device's -correctable error status register accordingly and logs these errors. - -3.2.2.2 Non-correctable (non-fatal and fatal) errors - -If an error message indicates a non-fatal error, performing link reset -at upstream is not required. The AER driver calls error_detected(dev, -pci_channel_io_normal) to all drivers associated within a hierarchy in -question. for example, -EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. -If Upstream port A captures an AER error, the hierarchy consists of -Downstream port B and EndPoint. - -A driver may return PCI_ERS_RESULT_CAN_RECOVER, -PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on -whether it can recover or the AER driver calls mmio_enabled as next. - -If an error message indicates a fatal error, kernel will broadcast -error_detected(dev, pci_channel_io_frozen) to all drivers within -a hierarchy in question. Then, performing link reset at upstream is -necessary. As different kinds of devices might use different approaches -to reset link, AER port service driver is required to provide the -function to reset link. Firstly, kernel looks for if the upstream -component has an aer driver. If it has, kernel uses the reset_link -callback of the aer driver. If the upstream component has no aer driver -and the port is downstream port, we will perform a hot reset as the -default by setting the Secondary Bus Reset bit of the Bridge Control -register associated with the downstream port. As for upstream ports, -they should provide their own aer service drivers with reset_link -function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and -reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes -to mmio_enabled. - -3.3 helper functions - -3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev); -pci_enable_pcie_error_reporting enables the device to send error -messages to root port when an error is detected. 
Note that devices -don't enable the error reporting by default, so device drivers need -call this function to enable it. - -3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev); -pci_disable_pcie_error_reporting disables the device to send error -messages to root port when an error is detected. - -3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); -pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable -error status register. - -3.4 Frequent Asked Questions - -Q: What happens if a PCI Express device driver does not provide an -error recovery handler (pci_driver->err_handler is equal to NULL)? - -A: The devices attached with the driver won't be recovered. If the -error is fatal, kernel will print out warning messages. Please refer -to section 3 for more information. - -Q: What happens if an upstream port service driver does not provide -callback reset_link? - -A: Fatal error recovery will fail if the errors are reported by the -upstream ports who are attached by the service driver. - -Q: How does this infrastructure deal with driver that is not PCI -Express aware? - -A: This infrastructure calls the error callback functions of the -driver when an error happens. But if the driver is not aware of -PCI Express, the device might not report its own errors to root -port. - -Q: What modifications will that driver need to make it compatible -with the PCI Express AER Root driver? - -A: It could call the helper functions to enable AER in devices and -cleanup uncorrectable status register. Pls. refer to section 3.3. - - -4. Software error injection - -Debugging PCIe AER error recovery code is quite difficult because it -is hard to trigger real hardware errors. Software based error -injection can be used to fake various kinds of PCIe errors. - -First you should enable PCIe AER software error injection in kernel -configuration, that is, following item should be in your .config. 
- -CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m - -After reboot with new kernel or insert the module, a device file named -/dev/aer_inject should be created. - -Then, you need a user space tool named aer-inject, which can be gotten -from: - https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ - -More information about aer-inject can be found in the document comes -with its source code. -- cgit From d8946fc38517755a2d9535a04c3f6a8d3a331eee Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:31 +0800 Subject: Documentation: PCI: convert endpoint/pci-endpoint.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/endpoint/index.rst | 10 ++ Documentation/PCI/endpoint/pci-endpoint.rst | 231 ++++++++++++++++++++++++++++ Documentation/PCI/endpoint/pci-endpoint.txt | 215 -------------------------- Documentation/PCI/index.rst | 1 + 4 files changed, 242 insertions(+), 215 deletions(-) create mode 100644 Documentation/PCI/endpoint/index.rst create mode 100644 Documentation/PCI/endpoint/pci-endpoint.rst delete mode 100644 Documentation/PCI/endpoint/pci-endpoint.txt (limited to 'Documentation') diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst new file mode 100644 index 000000000000..0db4f2fcd7f0 --- /dev/null +++ b/Documentation/PCI/endpoint/index.rst @@ -0,0 +1,10 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +PCI Endpoint Framework +====================== + +.. toctree:: + :maxdepth: 2 + + pci-endpoint diff --git a/Documentation/PCI/endpoint/pci-endpoint.rst b/Documentation/PCI/endpoint/pci-endpoint.rst new file mode 100644 index 000000000000..0e2311b5617b --- /dev/null +++ b/Documentation/PCI/endpoint/pci-endpoint.rst @@ -0,0 +1,231 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +:Author: Kishon Vijay Abraham I + +This document is a guide to use the PCI Endpoint Framework in order to create +endpoint controller driver, endpoint function driver, and using configfs +interface to bind the function driver to the controller driver. + +Introduction +============ + +Linux has a comprehensive PCI subsystem to support PCI controllers that +operates in Root Complex mode. The subsystem has capability to scan PCI bus, +assign memory resources and IRQ resources, load PCI driver (based on +vendor ID, device ID), support other services like hot-plug, power management, +advanced error reporting and virtual channels. + +However the PCI controller IP integrated in some SoCs is capable of operating +either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will +add endpoint mode support in Linux. This will help to run Linux in an +EP system which can have a wide variety of use cases from testing or +validation, co-processor accelerator, etc. + +PCI Endpoint Core +================= + +The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller +library, the Endpoint Function library, and the configfs layer to bind the +endpoint function with the endpoint controller. + +PCI Endpoint Controller(EPC) Library +------------------------------------ + +The EPC library provides APIs to be used by the controller that can operate +in endpoint mode. It also provides APIs to be used by function driver/library +in order to implement a particular endpoint function. + +APIs for the PCI controller Driver +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section lists the APIs that the PCI Endpoint core provides to be used +by the PCI controller driver. 
+ +* devm_pci_epc_create()/pci_epc_create() + + The PCI controller driver should implement the following ops: + + * write_header: ops to populate configuration space header + * set_bar: ops to configure the BAR + * clear_bar: ops to reset the BAR + * alloc_addr_space: ops to allocate in PCI controller address space + * free_addr_space: ops to free the allocated address space + * raise_irq: ops to raise a legacy, MSI or MSI-X interrupt + * start: ops to start the PCI link + * stop: ops to stop the PCI link + + The PCI controller driver can then create a new EPC device by invoking + devm_pci_epc_create()/pci_epc_create(). + +* devm_pci_epc_destroy()/pci_epc_destroy() + + The PCI controller driver can destroy the EPC device created by either + devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or + pci_epc_destroy(). + +* pci_epc_linkup() + + In order to notify all the function devices that the EPC device to which + they are linked has established a link with the host, the PCI controller + driver should invoke pci_epc_linkup(). + +* pci_epc_mem_init() + + Initialize the pci_epc_mem structure used for allocating EPC addr space. + +* pci_epc_mem_exit() + + Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init(). + + +APIs for the PCI Endpoint Function Driver +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section lists the APIs that the PCI Endpoint core provides to be used +by the PCI endpoint function driver. + +* pci_epc_write_header() + + The PCI endpoint function driver should use pci_epc_write_header() to + write the standard configuration header to the endpoint controller. + +* pci_epc_set_bar() + + The PCI endpoint function driver should use pci_epc_set_bar() to configure + the Base Address Register in order for the host to assign PCI addr space. + Register space of the function driver is usually configured + using this API. 
+ +* pci_epc_clear_bar() + + The PCI endpoint function driver should use pci_epc_clear_bar() to reset + the BAR. + +* pci_epc_raise_irq() + + The PCI endpoint function driver should use pci_epc_raise_irq() to raise + Legacy Interrupt, MSI or MSI-X Interrupt. + +* pci_epc_mem_alloc_addr() + + The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to + allocate memory address from EPC addr space which is required to access + RC's buffer + +* pci_epc_mem_free_addr() + + The PCI endpoint function driver should use pci_epc_mem_free_addr() to + free the memory space allocated using pci_epc_mem_alloc_addr(). + +Other APIs +~~~~~~~~~~ + +There are other APIs provided by the EPC library. These are used for binding +the EPF device with EPC device. pci-ep-cfs.c can be used as reference for +using these APIs. + +* pci_epc_get() + + Get a reference to the PCI endpoint controller based on the device name of + the controller. + +* pci_epc_put() + + Release the reference to the PCI endpoint controller obtained using + pci_epc_get() + +* pci_epc_add_epf() + + Add a PCI endpoint function to a PCI endpoint controller. A PCIe device + can have up to 8 functions according to the specification. + +* pci_epc_remove_epf() + + Remove the PCI endpoint function from PCI endpoint controller. + +* pci_epc_start() + + The PCI endpoint function driver should invoke pci_epc_start() once it + has configured the endpoint function and wants to start the PCI link. + +* pci_epc_stop() + + The PCI endpoint function driver should invoke pci_epc_stop() to stop + the PCI LINK. + + +PCI Endpoint Function(EPF) Library +---------------------------------- + +The EPF library provides APIs to be used by the function driver and the EPC +library to provide endpoint mode functionality. + +APIs for the PCI Endpoint Function Driver +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section lists the APIs that the PCI Endpoint core provides to be used +by the PCI endpoint function driver. 
+ +* pci_epf_register_driver() + + The PCI Endpoint Function driver should implement the following ops: + * bind: ops to perform when an EPC device has been bound to an EPF device + * unbind: ops to perform when a binding has been lost between an EPC + device and an EPF device + * linkup: ops to perform when the EPC device has established a + connection with a host system + + The PCI Function driver can then register the PCI EPF driver by using + pci_epf_register_driver(). + +* pci_epf_unregister_driver() + + The PCI Function driver can unregister the PCI EPF driver by using + pci_epf_unregister_driver(). + +* pci_epf_alloc_space() + + The PCI Function driver can allocate space for a particular BAR using + pci_epf_alloc_space(). + +* pci_epf_free_space() + + The PCI Function driver can free the allocated space + (using pci_epf_alloc_space) by invoking pci_epf_free_space(). + +APIs for the PCI Endpoint Controller Library +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section lists the APIs that the PCI Endpoint core provides to be used +by the PCI endpoint controller library. + +* pci_epf_linkup() + + The PCI endpoint controller library invokes pci_epf_linkup() when the + EPC device has established the connection to the host. + +Other APIs +~~~~~~~~~~ + +There are other APIs provided by the EPF library. These are used to notify +the function driver when the EPF device is bound to the EPC device. +pci-ep-cfs.c can be used as reference for using these APIs. + +* pci_epf_create() + + Create a new PCI EPF device by passing the name of the PCI EPF device. + This name will be used to bind the EPF device to an EPF driver. + +* pci_epf_destroy() + + Destroy the created PCI EPF device. + +* pci_epf_bind() + + pci_epf_bind() should be invoked when the EPF device has been bound to + an EPC device. + +* pci_epf_unbind() + + pci_epf_unbind() should be invoked when the binding between EPC device + and EPF device is lost.
diff --git a/Documentation/PCI/endpoint/pci-endpoint.txt b/Documentation/PCI/endpoint/pci-endpoint.txt deleted file mode 100644 index e86a96b66a6a..000000000000 --- a/Documentation/PCI/endpoint/pci-endpoint.txt +++ /dev/null @@ -1,215 +0,0 @@ - PCI ENDPOINT FRAMEWORK - Kishon Vijay Abraham I - -This document is a guide to use the PCI Endpoint Framework in order to create -endpoint controller driver, endpoint function driver, and using configfs -interface to bind the function driver to the controller driver. - -1. Introduction - -Linux has a comprehensive PCI subsystem to support PCI controllers that -operates in Root Complex mode. The subsystem has capability to scan PCI bus, -assign memory resources and IRQ resources, load PCI driver (based on -vendor ID, device ID), support other services like hot-plug, power management, -advanced error reporting and virtual channels. - -However the PCI controller IP integrated in some SoCs is capable of operating -either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will -add endpoint mode support in Linux. This will help to run Linux in an -EP system which can have a wide variety of use cases from testing or -validation, co-processor accelerator, etc. - -2. PCI Endpoint Core - -The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller -library, the Endpoint Function library, and the configfs layer to bind the -endpoint function with the endpoint controller. - -2.1 PCI Endpoint Controller(EPC) Library - -The EPC library provides APIs to be used by the controller that can operate -in endpoint mode. It also provides APIs to be used by function driver/library -in order to implement a particular endpoint function. - -2.1.1 APIs for the PCI controller Driver - -This section lists the APIs that the PCI Endpoint core provides to be used -by the PCI controller driver. 
- -*) devm_pci_epc_create()/pci_epc_create() - - The PCI controller driver should implement the following ops: - * write_header: ops to populate configuration space header - * set_bar: ops to configure the BAR - * clear_bar: ops to reset the BAR - * alloc_addr_space: ops to allocate in PCI controller address space - * free_addr_space: ops to free the allocated address space - * raise_irq: ops to raise a legacy, MSI or MSI-X interrupt - * start: ops to start the PCI link - * stop: ops to stop the PCI link - - The PCI controller driver can then create a new EPC device by invoking - devm_pci_epc_create()/pci_epc_create(). - -*) devm_pci_epc_destroy()/pci_epc_destroy() - - The PCI controller driver can destroy the EPC device created by either - devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or - pci_epc_destroy(). - -*) pci_epc_linkup() - - In order to notify all the function devices that the EPC device to which - they are linked has established a link with the host, the PCI controller - driver should invoke pci_epc_linkup(). - -*) pci_epc_mem_init() - - Initialize the pci_epc_mem structure used for allocating EPC addr space. - -*) pci_epc_mem_exit() - - Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init(). - -2.1.2 APIs for the PCI Endpoint Function Driver - -This section lists the APIs that the PCI Endpoint core provides to be used -by the PCI endpoint function driver. - -*) pci_epc_write_header() - - The PCI endpoint function driver should use pci_epc_write_header() to - write the standard configuration header to the endpoint controller. - -*) pci_epc_set_bar() - - The PCI endpoint function driver should use pci_epc_set_bar() to configure - the Base Address Register in order for the host to assign PCI addr space. - Register space of the function driver is usually configured - using this API. - -*) pci_epc_clear_bar() - - The PCI endpoint function driver should use pci_epc_clear_bar() to reset - the BAR. 
- -*) pci_epc_raise_irq() - - The PCI endpoint function driver should use pci_epc_raise_irq() to raise - Legacy Interrupt, MSI or MSI-X Interrupt. - -*) pci_epc_mem_alloc_addr() - - The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to - allocate memory address from EPC addr space which is required to access - RC's buffer - -*) pci_epc_mem_free_addr() - - The PCI endpoint function driver should use pci_epc_mem_free_addr() to - free the memory space allocated using pci_epc_mem_alloc_addr(). - -2.1.3 Other APIs - -There are other APIs provided by the EPC library. These are used for binding -the EPF device with EPC device. pci-ep-cfs.c can be used as reference for -using these APIs. - -*) pci_epc_get() - - Get a reference to the PCI endpoint controller based on the device name of - the controller. - -*) pci_epc_put() - - Release the reference to the PCI endpoint controller obtained using - pci_epc_get() - -*) pci_epc_add_epf() - - Add a PCI endpoint function to a PCI endpoint controller. A PCIe device - can have up to 8 functions according to the specification. - -*) pci_epc_remove_epf() - - Remove the PCI endpoint function from PCI endpoint controller. - -*) pci_epc_start() - - The PCI endpoint function driver should invoke pci_epc_start() once it - has configured the endpoint function and wants to start the PCI link. - -*) pci_epc_stop() - - The PCI endpoint function driver should invoke pci_epc_stop() to stop - the PCI LINK. - -2.2 PCI Endpoint Function(EPF) Library - -The EPF library provides APIs to be used by the function driver and the EPC -library to provide endpoint mode functionality. - -2.2.1 APIs for the PCI Endpoint Function Driver - -This section lists the APIs that the PCI Endpoint core provides to be used -by the PCI endpoint function driver. 
- -*) pci_epf_register_driver() - - The PCI Endpoint Function driver should implement the following ops: - * bind: ops to perform when a EPC device has been bound to EPF device - * unbind: ops to perform when a binding has been lost between a EPC - device and EPF device - * linkup: ops to perform when the EPC device has established a - connection with a host system - - The PCI Function driver can then register the PCI EPF driver by using - pci_epf_register_driver(). - -*) pci_epf_unregister_driver() - - The PCI Function driver can unregister the PCI EPF driver by using - pci_epf_unregister_driver(). - -*) pci_epf_alloc_space() - - The PCI Function driver can allocate space for a particular BAR using - pci_epf_alloc_space(). - -*) pci_epf_free_space() - - The PCI Function driver can free the allocated space - (using pci_epf_alloc_space) by invoking pci_epf_free_space(). - -2.2.2 APIs for the PCI Endpoint Controller Library -This section lists the APIs that the PCI Endpoint core provides to be used -by the PCI endpoint controller library. - -*) pci_epf_linkup() - - The PCI endpoint controller library invokes pci_epf_linkup() when the - EPC device has established the connection to the host. - -2.2.2 Other APIs -There are other APIs provided by the EPF library. These are used to notify -the function driver when the EPF device is bound to the EPC device. -pci-ep-cfs.c can be used as reference for using these APIs. - -*) pci_epf_create() - - Create a new PCI EPF device by passing the name of the PCI EPF device. - This name will be used to bind the the EPF device to a EPF driver. - -*) pci_epf_destroy() - - Destroy the created PCI EPF device. - -*) pci_epf_bind() - - pci_epf_bind() should be invoked when the EPF device has been bound to - a EPC device. - -*) pci_epf_unbind() - - pci_epf_unbind() should be invoked when the binding between EPC device - and EPF device is lost. 
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst index f54b65b1ca5f..f4c6121868c3 100644 --- a/Documentation/PCI/index.rst +++ b/Documentation/PCI/index.rst @@ -15,3 +15,4 @@ Linux PCI Bus Subsystem acpi-info pci-error-recovery pcieaer-howto + endpoint/index -- cgit From d4518e4ac64cae18f953f8a433359ea1face4b52 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:32 +0800 Subject: Documentation: PCI: convert endpoint/pci-endpoint-cfs.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/endpoint/index.rst | 1 + Documentation/PCI/endpoint/pci-endpoint-cfs.rst | 118 ++++++++++++++++++++++++ Documentation/PCI/endpoint/pci-endpoint-cfs.txt | 105 --------------------- 3 files changed, 119 insertions(+), 105 deletions(-) create mode 100644 Documentation/PCI/endpoint/pci-endpoint-cfs.rst delete mode 100644 Documentation/PCI/endpoint/pci-endpoint-cfs.txt (limited to 'Documentation') diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst index 0db4f2fcd7f0..3951de9f923c 100644 --- a/Documentation/PCI/endpoint/index.rst +++ b/Documentation/PCI/endpoint/index.rst @@ -8,3 +8,4 @@ PCI Endpoint Framework :maxdepth: 2 pci-endpoint + pci-endpoint-cfs diff --git a/Documentation/PCI/endpoint/pci-endpoint-cfs.rst b/Documentation/PCI/endpoint/pci-endpoint-cfs.rst new file mode 100644 index 000000000000..b6d39cdec56e --- /dev/null +++ b/Documentation/PCI/endpoint/pci-endpoint-cfs.rst @@ -0,0 +1,118 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +======================================= +Configuring PCI Endpoint Using CONFIGFS +======================================= + +:Author: Kishon Vijay Abraham I + +The PCI Endpoint Core exposes a configfs entry (pci_ep) to configure the +PCI endpoint function and to bind the endpoint function +with the endpoint controller. (For introducing other mechanisms to +configure the PCI Endpoint Function refer to [1]). + +Mounting configfs +================= + +The PCI Endpoint Core layer creates the pci_ep directory in the mounted configfs +directory. configfs can be mounted using the following command:: + + mount -t configfs none /sys/kernel/config + +Directory Structure +=================== + +The pci_ep configfs has two directories at its root: controllers and +functions. Every EPC device present in the system will have an entry in +the *controllers* directory and every EPF driver present in the system +will have an entry in the *functions* directory. +:: + + /sys/kernel/config/pci_ep/ + .. controllers/ + .. functions/ + +Creating EPF Device +=================== + +Every registered EPF driver will be listed in the functions directory. The +entries corresponding to EPF driver will be created by the EPF core. +:: + + /sys/kernel/config/pci_ep/functions/ + .. / + ... / + ... / + .. / + ... / + ... / + +In order to create a of the type probed by , the +user has to create a directory inside . + +Every directory consists of the following entries that can be +used to configure the standard configuration header of the endpoint function. +(These entries are created by the framework when any new is +created) +:: + + .. / + ... / + ... vendorid + ... deviceid + ... revid + ... progif_code + ... subclass_code + ... baseclass_code + ... cache_line_size + ... subsys_vendor_id + ... subsys_id + ... interrupt_pin + +EPC Device +========== + +Every registered EPC device will be listed in controllers directory.
The +entries corresponding to EPC device will be created by the EPC core. +:: + + /sys/kernel/config/pci_ep/controllers/ + .. / + ... / + ... / + ... start + .. / + ... / + ... / + ... start + +The directory will have a list of symbolic links to +. These symbolic links should be created by the user to +represent the functions present in the endpoint device. + +The directory will also have a *start* field. Once +"1" is written to this field, the endpoint device will be ready to +establish the link with the host. This is usually done after +all the EPF devices are created and linked with the EPC device. +:: + + | controllers/ + | / + | + | start + | functions/ + | / + | / + | vendorid + | deviceid + | revid + | progif_code + | subclass_code + | baseclass_code + | cache_line_size + | subsys_vendor_id + | subsys_id + | interrupt_pin + | function + +[1] :doc:`pci-endpoint` diff --git a/Documentation/PCI/endpoint/pci-endpoint-cfs.txt b/Documentation/PCI/endpoint/pci-endpoint-cfs.txt deleted file mode 100644 index d740f29960a4..000000000000 --- a/Documentation/PCI/endpoint/pci-endpoint-cfs.txt +++ /dev/null @@ -1,105 +0,0 @@ - CONFIGURING PCI ENDPOINT USING CONFIGFS - Kishon Vijay Abraham I - -The PCI Endpoint Core exposes configfs entry (pci_ep) to configure the -PCI endpoint function and to bind the endpoint function -with the endpoint controller. (For introducing other mechanisms to -configure the PCI Endpoint Function refer to [1]). - -*) Mounting configfs - -The PCI Endpoint Core layer creates pci_ep directory in the mounted configfs -directory. configfs can be mounted using the following command. - - mount -t configfs none /sys/kernel/config - -*) Directory Structure - -The pci_ep configfs has two directories at its root: controllers and -functions. Every EPC device present in the system will have an entry in -the *controllers* directory and and every EPF driver present in the system -will have an entry in the *functions* directory. 
- -/sys/kernel/config/pci_ep/ - .. controllers/ - .. functions/ - -*) Creating EPF Device - -Every registered EPF driver will be listed in controllers directory. The -entries corresponding to EPF driver will be created by the EPF core. - -/sys/kernel/config/pci_ep/functions/ - .. / - ... / - ... / - .. / - ... / - ... / - -In order to create a of the type probed by , the -user has to create a directory inside . - -Every directory consists of the following entries that can be -used to configure the standard configuration header of the endpoint function. -(These entries are created by the framework when any new is -created) - - .. / - ... / - ... vendorid - ... deviceid - ... revid - ... progif_code - ... subclass_code - ... baseclass_code - ... cache_line_size - ... subsys_vendor_id - ... subsys_id - ... interrupt_pin - -*) EPC Device - -Every registered EPC device will be listed in controllers directory. The -entries corresponding to EPC device will be created by the EPC core. - -/sys/kernel/config/pci_ep/controllers/ - .. / - ... / - ... / - ... start - .. / - ... / - ... / - ... start - -The directory will have a list of symbolic links to -. These symbolic links should be created by the user to -represent the functions present in the endpoint device. - -The directory will also have a *start* field. Once -"1" is written to this field, the endpoint device will be ready to -establish the link with the host. This is usually done after -all the EPF devices are created and linked with the EPC device. 
- - - | controllers/ - | / - | - | start - | functions/ - | / - | / - | vendorid - | deviceid - | revid - | progif_code - | subclass_code - | baseclass_code - | cache_line_size - | subsys_vendor_id - | subsys_id - | interrupt_pin - | function - -[1] -> Documentation/PCI/endpoint/pci-endpoint.txt -- cgit From bf2c2658d4b6baed13c274da7091428772b5cb03 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:33 +0800 Subject: Documentation: PCI: convert endpoint/pci-test-function.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/endpoint/index.rst | 1 + Documentation/PCI/endpoint/pci-test-function.rst | 103 +++++++++++++++++++++++ Documentation/PCI/endpoint/pci-test-function.txt | 87 ------------------- 3 files changed, 104 insertions(+), 87 deletions(-) create mode 100644 Documentation/PCI/endpoint/pci-test-function.rst delete mode 100644 Documentation/PCI/endpoint/pci-test-function.txt (limited to 'Documentation') diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst index 3951de9f923c..b680a3fc4fec 100644 --- a/Documentation/PCI/endpoint/index.rst +++ b/Documentation/PCI/endpoint/index.rst @@ -9,3 +9,4 @@ PCI Endpoint Framework pci-endpoint pci-endpoint-cfs + pci-test-function diff --git a/Documentation/PCI/endpoint/pci-test-function.rst b/Documentation/PCI/endpoint/pci-test-function.rst new file mode 100644 index 000000000000..3c8521d7aa31 --- /dev/null +++ b/Documentation/PCI/endpoint/pci-test-function.rst @@ -0,0 +1,103 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +PCI Test Function +================= + +:Author: Kishon Vijay Abraham I + +Traditionally PCI RC has always been validated by using standard +PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards. 
+However with the addition of EP-core in linux kernel, it is possible +to configure a PCI controller that can operate in EP mode to work as +a test device. + +The PCI endpoint test device is a virtual device (defined in software) +used to test the endpoint functionality and serve as a sample driver +for other PCI endpoint devices (to use the EP framework). + +The PCI endpoint test device has the following registers: + + 1) PCI_ENDPOINT_TEST_MAGIC + 2) PCI_ENDPOINT_TEST_COMMAND + 3) PCI_ENDPOINT_TEST_STATUS + 4) PCI_ENDPOINT_TEST_SRC_ADDR + 5) PCI_ENDPOINT_TEST_DST_ADDR + 6) PCI_ENDPOINT_TEST_SIZE + 7) PCI_ENDPOINT_TEST_CHECKSUM + 8) PCI_ENDPOINT_TEST_IRQ_TYPE + 9) PCI_ENDPOINT_TEST_IRQ_NUMBER + +* PCI_ENDPOINT_TEST_MAGIC + +This register will be used to test BAR0. A known pattern will be written +and read back from MAGIC register to verify BAR0. + +* PCI_ENDPOINT_TEST_COMMAND + +This register will be used by the host driver to indicate the function +that the endpoint device must perform. + +======== ================================================================ +Bitfield Description +======== ================================================================ +Bit 0 raise legacy IRQ +Bit 1 raise MSI IRQ +Bit 2 raise MSI-X IRQ +Bit 3 read command (read data from RC buffer) +Bit 4 write command (write data to RC buffer) +Bit 5 copy command (copy data from one RC buffer to another RC buffer) +======== ================================================================ + +* PCI_ENDPOINT_TEST_STATUS + +This register reflects the status of the PCI endpoint device. 
+ +======== ============================== +Bitfield Description +======== ============================== +Bit 0 read success +Bit 1 read fail +Bit 2 write success +Bit 3 write fail +Bit 4 copy success +Bit 5 copy fail +Bit 6 IRQ raised +Bit 7 source address is invalid +Bit 8 destination address is invalid +======== ============================== + +* PCI_ENDPOINT_TEST_SRC_ADDR + +This register contains the source address (RC buffer address) for the +COPY/READ command. + +* PCI_ENDPOINT_TEST_DST_ADDR + +This register contains the destination address (RC buffer address) for +the COPY/WRITE command. + +* PCI_ENDPOINT_TEST_IRQ_TYPE + +This register contains the interrupt type (Legacy/MSI) triggered +for the READ/WRITE/COPY and raise IRQ (Legacy/MSI) commands. + +Possible types: + +====== == +Legacy 0 +MSI 1 +MSI-X 2 +====== == + +* PCI_ENDPOINT_TEST_IRQ_NUMBER + +This register contains the triggered ID interrupt. + +Admissible values: + +====== =========== +Legacy 0 +MSI [1 .. 32] +MSI-X [1 .. 2048] +====== =========== diff --git a/Documentation/PCI/endpoint/pci-test-function.txt b/Documentation/PCI/endpoint/pci-test-function.txt deleted file mode 100644 index 5916f1f592bb..000000000000 --- a/Documentation/PCI/endpoint/pci-test-function.txt +++ /dev/null @@ -1,87 +0,0 @@ - PCI TEST - Kishon Vijay Abraham I - -Traditionally PCI RC has always been validated by using standard -PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards. -However with the addition of EP-core in linux kernel, it is possible -to configure a PCI controller that can operate in EP mode to work as -a test device. - -The PCI endpoint test device is a virtual device (defined in software) -used to test the endpoint functionality and serve as a sample driver -for other PCI endpoint devices (to use the EP framework). 
- -The PCI endpoint test device has the following registers: - - 1) PCI_ENDPOINT_TEST_MAGIC - 2) PCI_ENDPOINT_TEST_COMMAND - 3) PCI_ENDPOINT_TEST_STATUS - 4) PCI_ENDPOINT_TEST_SRC_ADDR - 5) PCI_ENDPOINT_TEST_DST_ADDR - 6) PCI_ENDPOINT_TEST_SIZE - 7) PCI_ENDPOINT_TEST_CHECKSUM - 8) PCI_ENDPOINT_TEST_IRQ_TYPE - 9) PCI_ENDPOINT_TEST_IRQ_NUMBER - -*) PCI_ENDPOINT_TEST_MAGIC - -This register will be used to test BAR0. A known pattern will be written -and read back from MAGIC register to verify BAR0. - -*) PCI_ENDPOINT_TEST_COMMAND: - -This register will be used by the host driver to indicate the function -that the endpoint device must perform. - -Bitfield Description: - Bit 0 : raise legacy IRQ - Bit 1 : raise MSI IRQ - Bit 2 : raise MSI-X IRQ - Bit 3 : read command (read data from RC buffer) - Bit 4 : write command (write data to RC buffer) - Bit 5 : copy command (copy data from one RC buffer to another - RC buffer) - -*) PCI_ENDPOINT_TEST_STATUS - -This register reflects the status of the PCI endpoint device. - -Bitfield Description: - Bit 0 : read success - Bit 1 : read fail - Bit 2 : write success - Bit 3 : write fail - Bit 4 : copy success - Bit 5 : copy fail - Bit 6 : IRQ raised - Bit 7 : source address is invalid - Bit 8 : destination address is invalid - -*) PCI_ENDPOINT_TEST_SRC_ADDR - -This register contains the source address (RC buffer address) for the -COPY/READ command. - -*) PCI_ENDPOINT_TEST_DST_ADDR - -This register contains the destination address (RC buffer address) for -the COPY/WRITE command. - -*) PCI_ENDPOINT_TEST_IRQ_TYPE - -This register contains the interrupt type (Legacy/MSI) triggered -for the READ/WRITE/COPY and raise IRQ (Legacy/MSI) commands. - -Possible types: - - Legacy : 0 - - MSI : 1 - - MSI-X : 2 - -*) PCI_ENDPOINT_TEST_IRQ_NUMBER - -This register contains the triggered ID interrupt. - -Admissible values: - - Legacy : 0 - - MSI : [1 .. 32] - - MSI-X : [1 .. 
2048] -- cgit From 9595aee2a389be5dfa9a0121a14e8fba70f17278 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Tue, 14 May 2019 22:47:34 +0800 Subject: Documentation: PCI: convert endpoint/pci-test-howto.txt to reST Convert plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Bjorn Helgaas Reviewed-by: Mauro Carvalho Chehab --- Documentation/PCI/endpoint/index.rst | 1 + Documentation/PCI/endpoint/pci-test-howto.rst | 235 ++++++++++++++++++++++++++ Documentation/PCI/endpoint/pci-test-howto.txt | 206 ---------------------- 3 files changed, 236 insertions(+), 206 deletions(-) create mode 100644 Documentation/PCI/endpoint/pci-test-howto.rst delete mode 100644 Documentation/PCI/endpoint/pci-test-howto.txt (limited to 'Documentation') diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst index b680a3fc4fec..d114ea74b444 100644 --- a/Documentation/PCI/endpoint/index.rst +++ b/Documentation/PCI/endpoint/index.rst @@ -10,3 +10,4 @@ PCI Endpoint Framework pci-endpoint pci-endpoint-cfs pci-test-function + pci-test-howto diff --git a/Documentation/PCI/endpoint/pci-test-howto.rst b/Documentation/PCI/endpoint/pci-test-howto.rst new file mode 100644 index 000000000000..909f770a07d6 --- /dev/null +++ b/Documentation/PCI/endpoint/pci-test-howto.rst @@ -0,0 +1,235 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +PCI Test User Guide +=================== + +:Author: Kishon Vijay Abraham I + +This document is a guide to help users use pci-epf-test function driver +and pci_endpoint_test host driver for testing PCI. The list of steps to +be followed in the host side and EP side is given below. 
+ +Endpoint Device +=============== + +Endpoint Controller Devices +--------------------------- + +To find the list of endpoint controller devices in the system:: + + # ls /sys/class/pci_epc/ + 51000000.pcie_ep + +If PCI_ENDPOINT_CONFIGFS is enabled:: + + # ls /sys/kernel/config/pci_ep/controllers + 51000000.pcie_ep + + +Endpoint Function Drivers +------------------------- + +To find the list of endpoint function drivers in the system:: + + # ls /sys/bus/pci-epf/drivers + pci_epf_test + +If PCI_ENDPOINT_CONFIGFS is enabled:: + + # ls /sys/kernel/config/pci_ep/functions + pci_epf_test + + +Creating pci-epf-test Device +---------------------------- + +PCI endpoint function device can be created using the configfs. To create +pci-epf-test device, the following commands can be used:: + + # mount -t configfs none /sys/kernel/config + # cd /sys/kernel/config/pci_ep/ + # mkdir functions/pci_epf_test/func1 + +The "mkdir func1" above creates the pci-epf-test function device that will +be probed by pci_epf_test driver. + +The PCI endpoint framework populates the directory with the following +configurable fields:: + + # ls functions/pci_epf_test/func1 + baseclass_code interrupt_pin progif_code subsys_id + cache_line_size msi_interrupts revid subsys_vendorid + deviceid msix_interrupts subclass_code vendorid + +The PCI endpoint function driver populates these entries with default values +when the device is bound to the driver. The pci-epf-test driver populates +vendorid with 0xffff and interrupt_pin with 0x0001:: + + # cat functions/pci_epf_test/func1/vendorid + 0xffff + # cat functions/pci_epf_test/func1/interrupt_pin + 0x0001 + + +Configuring pci-epf-test Device +------------------------------- + +The user can configure the pci-epf-test device using configfs entry. 
In order +to change the vendorid, the deviceid, and the number of MSI and MSI-X interrupts +used by the function device, the following commands can be used:: + + # echo 0x104c > functions/pci_epf_test/func1/vendorid + # echo 0xb500 > functions/pci_epf_test/func1/deviceid + # echo 16 > functions/pci_epf_test/func1/msi_interrupts + # echo 8 > functions/pci_epf_test/func1/msix_interrupts + + +Binding pci-epf-test Device to EP Controller +-------------------------------------------- + +In order for the endpoint function device to be useful, it has to be bound to +a PCI endpoint controller driver. Use the configfs to bind the function +device to one of the controller drivers present in the system:: + + # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/ + +Once the above step is completed, the PCI endpoint is ready to establish a link +with the host. + + +Start the Link +-------------- + +In order for the endpoint device to establish a link with the host, the _start_ +field should be populated with '1':: + + # echo 1 > controllers/51000000.pcie_ep/start + + +RootComplex Device +================== + +lspci Output +------------ + +Note that the devices listed here correspond to the values populated in the +"Configuring pci-epf-test Device" section above:: + + 00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01) + 01:00.0 Unassigned class [ff00]: Texas Instruments Device b500 + + +Using Endpoint Test function Device +----------------------------------- + +pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint +tests.
To compile this tool the following commands should be used:: + + # cd + # make -C tools/pci + +or if you desire to compile and install in your system:: + + # cd + # make -C tools/pci install + +The tool and script will be located in /usr/bin/ + + +pcitest.sh Output +~~~~~~~~~~~~~~~~~ +:: + + # pcitest.sh + BAR tests + + BAR0: OKAY + BAR1: OKAY + BAR2: OKAY + BAR3: OKAY + BAR4: NOT OKAY + BAR5: NOT OKAY + + Interrupt tests + + SET IRQ TYPE TO LEGACY: OKAY + LEGACY IRQ: NOT OKAY + SET IRQ TYPE TO MSI: OKAY + MSI1: OKAY + MSI2: OKAY + MSI3: OKAY + MSI4: OKAY + MSI5: OKAY + MSI6: OKAY + MSI7: OKAY + MSI8: OKAY + MSI9: OKAY + MSI10: OKAY + MSI11: OKAY + MSI12: OKAY + MSI13: OKAY + MSI14: OKAY + MSI15: OKAY + MSI16: OKAY + MSI17: NOT OKAY + MSI18: NOT OKAY + MSI19: NOT OKAY + MSI20: NOT OKAY + MSI21: NOT OKAY + MSI22: NOT OKAY + MSI23: NOT OKAY + MSI24: NOT OKAY + MSI25: NOT OKAY + MSI26: NOT OKAY + MSI27: NOT OKAY + MSI28: NOT OKAY + MSI29: NOT OKAY + MSI30: NOT OKAY + MSI31: NOT OKAY + MSI32: NOT OKAY + SET IRQ TYPE TO MSI-X: OKAY + MSI-X1: OKAY + MSI-X2: OKAY + MSI-X3: OKAY + MSI-X4: OKAY + MSI-X5: OKAY + MSI-X6: OKAY + MSI-X7: OKAY + MSI-X8: OKAY + MSI-X9: NOT OKAY + MSI-X10: NOT OKAY + MSI-X11: NOT OKAY + MSI-X12: NOT OKAY + MSI-X13: NOT OKAY + MSI-X14: NOT OKAY + MSI-X15: NOT OKAY + MSI-X16: NOT OKAY + [...] 
+ MSI-X2047: NOT OKAY + MSI-X2048: NOT OKAY + + Read Tests + + SET IRQ TYPE TO MSI: OKAY + READ ( 1 bytes): OKAY + READ ( 1024 bytes): OKAY + READ ( 1025 bytes): OKAY + READ (1024000 bytes): OKAY + READ (1024001 bytes): OKAY + + Write Tests + + WRITE ( 1 bytes): OKAY + WRITE ( 1024 bytes): OKAY + WRITE ( 1025 bytes): OKAY + WRITE (1024000 bytes): OKAY + WRITE (1024001 bytes): OKAY + + Copy Tests + + COPY ( 1 bytes): OKAY + COPY ( 1024 bytes): OKAY + COPY ( 1025 bytes): OKAY + COPY (1024000 bytes): OKAY + COPY (1024001 bytes): OKAY diff --git a/Documentation/PCI/endpoint/pci-test-howto.txt b/Documentation/PCI/endpoint/pci-test-howto.txt deleted file mode 100644 index 040479f437a5..000000000000 --- a/Documentation/PCI/endpoint/pci-test-howto.txt +++ /dev/null @@ -1,206 +0,0 @@ - PCI TEST USERGUIDE - Kishon Vijay Abraham I - -This document is a guide to help users use pci-epf-test function driver -and pci_endpoint_test host driver for testing PCI. The list of steps to -be followed in the host side and EP side is given below. - -1. Endpoint Device - -1.1 Endpoint Controller Devices - -To find the list of endpoint controller devices in the system: - - # ls /sys/class/pci_epc/ - 51000000.pcie_ep - -If PCI_ENDPOINT_CONFIGFS is enabled - # ls /sys/kernel/config/pci_ep/controllers - 51000000.pcie_ep - -1.2 Endpoint Function Drivers - -To find the list of endpoint function drivers in the system: - - # ls /sys/bus/pci-epf/drivers - pci_epf_test - -If PCI_ENDPOINT_CONFIGFS is enabled - # ls /sys/kernel/config/pci_ep/functions - pci_epf_test - -1.3 Creating pci-epf-test Device - -PCI endpoint function device can be created using the configfs. To create -pci-epf-test device, the following commands can be used - - # mount -t configfs none /sys/kernel/config - # cd /sys/kernel/config/pci_ep/ - # mkdir functions/pci_epf_test/func1 - -The "mkdir func1" above creates the pci-epf-test function device that will -be probed by pci_epf_test driver. 
- -The PCI endpoint framework populates the directory with the following -configurable fields. - - # ls functions/pci_epf_test/func1 - baseclass_code interrupt_pin progif_code subsys_id - cache_line_size msi_interrupts revid subsys_vendorid - deviceid msix_interrupts subclass_code vendorid - -The PCI endpoint function driver populates these entries with default values -when the device is bound to the driver. The pci-epf-test driver populates -vendorid with 0xffff and interrupt_pin with 0x0001 - - # cat functions/pci_epf_test/func1/vendorid - 0xffff - # cat functions/pci_epf_test/func1/interrupt_pin - 0x0001 - -1.4 Configuring pci-epf-test Device - -The user can configure the pci-epf-test device using configfs entry. In order -to change the vendorid and the number of MSI interrupts used by the function -device, the following commands can be used. - - # echo 0x104c > functions/pci_epf_test/func1/vendorid - # echo 0xb500 > functions/pci_epf_test/func1/deviceid - # echo 16 > functions/pci_epf_test/func1/msi_interrupts - # echo 8 > functions/pci_epf_test/func1/msix_interrupts - -1.5 Binding pci-epf-test Device to EP Controller - -In order for the endpoint function device to be useful, it has to be bound to -a PCI endpoint controller driver. Use the configfs to bind the function -device to one of the controller driver present in the system. - - # ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/ - -Once the above step is completed, the PCI endpoint is ready to establish a link -with the host. - -1.6 Start the Link - -In order for the endpoint device to establish a link with the host, the _start_ -field should be populated with '1'. - - # echo 1 > controllers/51000000.pcie_ep/start - -2. 
RootComplex Device - -2.1 lspci Output - -Note that the devices listed here correspond to the value populated in 1.4 above - - 00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01) - 01:00.0 Unassigned class [ff00]: Texas Instruments Device b500 - -2.2 Using Endpoint Test function Device - -pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint -tests. To compile this tool the following commands should be used: - - # cd - # make -C tools/pci - -or if you desire to compile and install in your system: - - # cd - # make -C tools/pci install - -The tool and script will be located in /usr/bin/ - -2.2.1 pcitest.sh Output - # pcitest.sh - BAR tests - - BAR0: OKAY - BAR1: OKAY - BAR2: OKAY - BAR3: OKAY - BAR4: NOT OKAY - BAR5: NOT OKAY - - Interrupt tests - - SET IRQ TYPE TO LEGACY: OKAY - LEGACY IRQ: NOT OKAY - SET IRQ TYPE TO MSI: OKAY - MSI1: OKAY - MSI2: OKAY - MSI3: OKAY - MSI4: OKAY - MSI5: OKAY - MSI6: OKAY - MSI7: OKAY - MSI8: OKAY - MSI9: OKAY - MSI10: OKAY - MSI11: OKAY - MSI12: OKAY - MSI13: OKAY - MSI14: OKAY - MSI15: OKAY - MSI16: OKAY - MSI17: NOT OKAY - MSI18: NOT OKAY - MSI19: NOT OKAY - MSI20: NOT OKAY - MSI21: NOT OKAY - MSI22: NOT OKAY - MSI23: NOT OKAY - MSI24: NOT OKAY - MSI25: NOT OKAY - MSI26: NOT OKAY - MSI27: NOT OKAY - MSI28: NOT OKAY - MSI29: NOT OKAY - MSI30: NOT OKAY - MSI31: NOT OKAY - MSI32: NOT OKAY - SET IRQ TYPE TO MSI-X: OKAY - MSI-X1: OKAY - MSI-X2: OKAY - MSI-X3: OKAY - MSI-X4: OKAY - MSI-X5: OKAY - MSI-X6: OKAY - MSI-X7: OKAY - MSI-X8: OKAY - MSI-X9: NOT OKAY - MSI-X10: NOT OKAY - MSI-X11: NOT OKAY - MSI-X12: NOT OKAY - MSI-X13: NOT OKAY - MSI-X14: NOT OKAY - MSI-X15: NOT OKAY - MSI-X16: NOT OKAY - [...] 
- MSI-X2047: NOT OKAY - MSI-X2048: NOT OKAY - - Read Tests - - SET IRQ TYPE TO MSI: OKAY - READ ( 1 bytes): OKAY - READ ( 1024 bytes): OKAY - READ ( 1025 bytes): OKAY - READ (1024000 bytes): OKAY - READ (1024001 bytes): OKAY - - Write Tests - - WRITE ( 1 bytes): OKAY - WRITE ( 1024 bytes): OKAY - WRITE ( 1025 bytes): OKAY - WRITE (1024000 bytes): OKAY - WRITE (1024001 bytes): OKAY - - Copy Tests - - COPY ( 1 bytes): OKAY - COPY ( 1024 bytes): OKAY - COPY ( 1025 bytes): OKAY - COPY (1024000 bytes): OKAY - COPY (1024001 bytes): OKAY -- cgit From 151f4e2bdc7a04020ae5c533896fb91a16e1f501 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 13 Jun 2019 07:10:36 -0300 Subject: docs: power: convert docs to ReST and rename to *.rst Convert the PM documents to ReST, in order to allow them to build with Sphinx. The conversion is actually: - add blank lines and indentation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Bjorn Helgaas Acked-by: Mark Brown Acked-by: Srivatsa S. 
Bhat (VMware) --- Documentation/ABI/testing/sysfs-class-powercap | 2 +- Documentation/admin-guide/kernel-parameters.txt | 6 +- Documentation/cpu-freq/core.txt | 2 +- Documentation/driver-api/pm/devices.rst | 6 +- Documentation/driver-api/usb/power-management.rst | 2 +- Documentation/power/apm-acpi.rst | 36 + Documentation/power/apm-acpi.txt | 32 - Documentation/power/basic-pm-debugging.rst | 269 +++++ Documentation/power/basic-pm-debugging.txt | 254 ----- Documentation/power/charger-manager.rst | 205 ++++ Documentation/power/charger-manager.txt | 200 ---- Documentation/power/drivers-testing.rst | 51 + Documentation/power/drivers-testing.txt | 46 - Documentation/power/energy-model.rst | 147 +++ Documentation/power/energy-model.txt | 144 --- Documentation/power/freezing-of-tasks.rst | 244 +++++ Documentation/power/freezing-of-tasks.txt | 231 ---- Documentation/power/index.rst | 46 + Documentation/power/interface.rst | 79 ++ Documentation/power/interface.txt | 77 -- Documentation/power/opp.rst | 379 +++++++ Documentation/power/opp.txt | 342 ------ Documentation/power/pci.rst | 1135 ++++++++++++++++++++ Documentation/power/pci.txt | 1094 ------------------- Documentation/power/pm_qos_interface.rst | 225 ++++ Documentation/power/pm_qos_interface.txt | 212 ---- Documentation/power/power_supply_class.rst | 282 +++++ Documentation/power/power_supply_class.txt | 231 ---- Documentation/power/powercap/powercap.rst | 257 +++++ Documentation/power/powercap/powercap.txt | 236 ---- Documentation/power/regulator/consumer.rst | 229 ++++ Documentation/power/regulator/consumer.txt | 218 ---- Documentation/power/regulator/design.rst | 38 + Documentation/power/regulator/design.txt | 33 - Documentation/power/regulator/machine.rst | 97 ++ Documentation/power/regulator/machine.txt | 96 -- Documentation/power/regulator/overview.rst | 178 +++ Documentation/power/regulator/overview.txt | 171 --- Documentation/power/regulator/regulator.rst | 32 + Documentation/power/regulator/regulator.txt | 
30 - Documentation/power/runtime_pm.rst | 940 ++++++++++++++++ Documentation/power/runtime_pm.txt | 928 ---------------- Documentation/power/s2ram.rst | 87 ++ Documentation/power/s2ram.txt | 85 -- Documentation/power/suspend-and-cpuhotplug.rst | 286 +++++ Documentation/power/suspend-and-cpuhotplug.txt | 274 ----- Documentation/power/suspend-and-interrupts.rst | 137 +++ Documentation/power/suspend-and-interrupts.txt | 135 --- Documentation/power/swsusp-and-swap-files.rst | 63 ++ Documentation/power/swsusp-and-swap-files.txt | 60 -- Documentation/power/swsusp-dmcrypt.rst | 140 +++ Documentation/power/swsusp-dmcrypt.txt | 138 --- Documentation/power/swsusp.rst | 501 +++++++++ Documentation/power/swsusp.txt | 446 -------- Documentation/power/tricks.rst | 29 + Documentation/power/tricks.txt | 27 - Documentation/power/userland-swsusp.rst | 191 ++++ Documentation/power/userland-swsusp.txt | 170 --- Documentation/power/video.rst | 213 ++++ Documentation/power/video.txt | 185 ---- Documentation/process/submitting-drivers.rst | 2 +- Documentation/scheduler/sched-energy.txt | 6 +- Documentation/trace/coresight-cpu-debug.txt | 2 +- .../zh_CN/process/submitting-drivers.rst | 2 +- 64 files changed, 6531 insertions(+), 6110 deletions(-) create mode 100644 Documentation/power/apm-acpi.rst delete mode 100644 Documentation/power/apm-acpi.txt create mode 100644 Documentation/power/basic-pm-debugging.rst delete mode 100644 Documentation/power/basic-pm-debugging.txt create mode 100644 Documentation/power/charger-manager.rst delete mode 100644 Documentation/power/charger-manager.txt create mode 100644 Documentation/power/drivers-testing.rst delete mode 100644 Documentation/power/drivers-testing.txt create mode 100644 Documentation/power/energy-model.rst delete mode 100644 Documentation/power/energy-model.txt create mode 100644 Documentation/power/freezing-of-tasks.rst delete mode 100644 Documentation/power/freezing-of-tasks.txt create mode 100644 Documentation/power/index.rst create 
mode 100644 Documentation/power/interface.rst delete mode 100644 Documentation/power/interface.txt create mode 100644 Documentation/power/opp.rst delete mode 100644 Documentation/power/opp.txt create mode 100644 Documentation/power/pci.rst delete mode 100644 Documentation/power/pci.txt create mode 100644 Documentation/power/pm_qos_interface.rst delete mode 100644 Documentation/power/pm_qos_interface.txt create mode 100644 Documentation/power/power_supply_class.rst delete mode 100644 Documentation/power/power_supply_class.txt create mode 100644 Documentation/power/powercap/powercap.rst delete mode 100644 Documentation/power/powercap/powercap.txt create mode 100644 Documentation/power/regulator/consumer.rst delete mode 100644 Documentation/power/regulator/consumer.txt create mode 100644 Documentation/power/regulator/design.rst delete mode 100644 Documentation/power/regulator/design.txt create mode 100644 Documentation/power/regulator/machine.rst delete mode 100644 Documentation/power/regulator/machine.txt create mode 100644 Documentation/power/regulator/overview.rst delete mode 100644 Documentation/power/regulator/overview.txt create mode 100644 Documentation/power/regulator/regulator.rst delete mode 100644 Documentation/power/regulator/regulator.txt create mode 100644 Documentation/power/runtime_pm.rst delete mode 100644 Documentation/power/runtime_pm.txt create mode 100644 Documentation/power/s2ram.rst delete mode 100644 Documentation/power/s2ram.txt create mode 100644 Documentation/power/suspend-and-cpuhotplug.rst delete mode 100644 Documentation/power/suspend-and-cpuhotplug.txt create mode 100644 Documentation/power/suspend-and-interrupts.rst delete mode 100644 Documentation/power/suspend-and-interrupts.txt create mode 100644 Documentation/power/swsusp-and-swap-files.rst delete mode 100644 Documentation/power/swsusp-and-swap-files.txt create mode 100644 Documentation/power/swsusp-dmcrypt.rst delete mode 100644 Documentation/power/swsusp-dmcrypt.txt create mode 
100644 Documentation/power/swsusp.rst delete mode 100644 Documentation/power/swsusp.txt create mode 100644 Documentation/power/tricks.rst delete mode 100644 Documentation/power/tricks.txt create mode 100644 Documentation/power/userland-swsusp.rst delete mode 100644 Documentation/power/userland-swsusp.txt create mode 100644 Documentation/power/video.rst delete mode 100644 Documentation/power/video.txt (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-class-powercap b/Documentation/ABI/testing/sysfs-class-powercap index db3b3ff70d84..742dfd966592 100644 --- a/Documentation/ABI/testing/sysfs-class-powercap +++ b/Documentation/ABI/testing/sysfs-class-powercap @@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org Description: The powercap/ class sub directory belongs to the power cap subsystem. Refer to - Documentation/power/powercap/powercap.txt for details. + Documentation/power/powercap/powercap.rst for details. What: /sys/class/powercap/ Date: September 2013 diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 138f6664b2e2..7f5ca6e7c4d3 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -13,7 +13,7 @@ For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force" are available - See also Documentation/power/runtime_pm.txt, pci=noacpi + See also Documentation/power/runtime_pm.rst, pci=noacpi acpi_apic_instance= [ACPI, IOAPIC] Format: @@ -223,7 +223,7 @@ acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, old_ordering, nonvs, sci_force_enable, nobl } - See Documentation/power/video.txt for information on + See Documentation/power/video.rst for information on s3_bios and s3_mode. s3_beep is for debugging; it makes the PC's speaker beep as soon as the kernel's real-mode entry point is called. 
@@ -4108,7 +4108,7 @@ Specify the offset from the beginning of the partition given by "resume=" at which the swap header is located, in units (needed only for swap files). - See Documentation/power/swsusp-and-swap-files.txt + See Documentation/power/swsusp-and-swap-files.rst resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to read the resume files diff --git a/Documentation/cpu-freq/core.txt b/Documentation/cpu-freq/core.txt index 073f128af5a7..55193e680250 100644 --- a/Documentation/cpu-freq/core.txt +++ b/Documentation/cpu-freq/core.txt @@ -95,7 +95,7 @@ flags - flags of the cpufreq driver 3. CPUFreq Table Generation with Operating Performance Point (OPP) ================================================================== -For details about OPP, see Documentation/power/opp.txt +For details about OPP, see Documentation/power/opp.rst dev_pm_opp_init_cpufreq_table - This function provides a ready to use conversion routine to translate diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 30835683616a..f66c7b9126ea 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -225,7 +225,7 @@ system-wide transition to a sleep state even though its :c:member:`runtime_auto` flag is clear. For more information about the runtime power management framework, refer to -:file:`Documentation/power/runtime_pm.txt`. +:file:`Documentation/power/runtime_pm.rst`. Calling Drivers to Enter and Leave System Sleep States @@ -728,7 +728,7 @@ it into account in any way. Devices may be defined as IRQ-safe which indicates to the PM core that their runtime PM callbacks may be invoked with disabled interrupts (see -:file:`Documentation/power/runtime_pm.txt` for more information). If an +:file:`Documentation/power/runtime_pm.rst` for more information). 
If an IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be disallowed, unless the domain itself is defined as IRQ-safe. However, it makes sense to define a PM domain as IRQ-safe only if all the devices in it @@ -795,7 +795,7 @@ so on) and the final state of the device must reflect the "active" runtime PM status in that case. During system-wide resume from a sleep state it's easiest to put devices into -the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`. +the full-power state, as explained in :file:`Documentation/power/runtime_pm.rst`. [Refer to that document for more information regarding this particular issue as well as for information on the device runtime power management framework in general.] diff --git a/Documentation/driver-api/usb/power-management.rst b/Documentation/driver-api/usb/power-management.rst index 4a74cf6f2797..2525c3622cae 100644 --- a/Documentation/driver-api/usb/power-management.rst +++ b/Documentation/driver-api/usb/power-management.rst @@ -46,7 +46,7 @@ device is turned off while the system as a whole remains running, we call it a "dynamic suspend" (also known as a "runtime suspend" or "selective suspend"). This document concentrates mostly on how dynamic PM is implemented in the USB subsystem, although system PM is -covered to some extent (see ``Documentation/power/*.txt`` for more +covered to some extent (see ``Documentation/power/*.rst`` for more information about system PM). System PM support is present only if the kernel was built with diff --git a/Documentation/power/apm-acpi.rst b/Documentation/power/apm-acpi.rst new file mode 100644 index 000000000000..5b90d947126d --- /dev/null +++ b/Documentation/power/apm-acpi.rst @@ -0,0 +1,36 @@ +============ +APM or ACPI? +============ + +If you have a relatively recent x86 mobile, desktop, or server system, +odds are it supports either Advanced Power Management (APM) or +Advanced Configuration and Power Interface (ACPI). 
ACPI is the newer +of the two technologies and puts power management in the hands of the +operating system, allowing for more intelligent power management than +is possible with BIOS controlled APM. + +The best way to determine which, if either, your system supports is to +build a kernel with both ACPI and APM enabled (as of 2.3.x ACPI is +enabled by default). If a working ACPI implementation is found, the +ACPI driver will override and disable APM, otherwise the APM driver +will be used. + +No, sorry, you cannot have both ACPI and APM enabled and running at +once. Some people with broken ACPI or broken APM implementations +would like to use both to get a full set of working features, but you +simply cannot mix and match the two. Only one power management +interface can be in control of the machine at once. Think about it.. + +User-space Daemons +------------------ +Both APM and ACPI rely on user-space daemons, apmd and acpid +respectively, to be completely functional. Obtain both of these +daemons from your Linux distribution or from the Internet (see below) +and be sure that they are started sometime in the system boot process. +Go ahead and start both. If ACPI or APM is not available on your +system the associated daemon will exit gracefully. + + ===== ======================================= + apmd http://ftp.debian.org/pool/main/a/apmd/ + acpid http://acpid.sf.net/ + ===== ======================================= diff --git a/Documentation/power/apm-acpi.txt b/Documentation/power/apm-acpi.txt deleted file mode 100644 index 6cc423d3662e..000000000000 --- a/Documentation/power/apm-acpi.txt +++ /dev/null @@ -1,32 +0,0 @@ -APM or ACPI? ------------- -If you have a relatively recent x86 mobile, desktop, or server system, -odds are it supports either Advanced Power Management (APM) or -Advanced Configuration and Power Interface (ACPI). 
ACPI is the newer -of the two technologies and puts power management in the hands of the -operating system, allowing for more intelligent power management than -is possible with BIOS controlled APM. - -The best way to determine which, if either, your system supports is to -build a kernel with both ACPI and APM enabled (as of 2.3.x ACPI is -enabled by default). If a working ACPI implementation is found, the -ACPI driver will override and disable APM, otherwise the APM driver -will be used. - -No, sorry, you cannot have both ACPI and APM enabled and running at -once. Some people with broken ACPI or broken APM implementations -would like to use both to get a full set of working features, but you -simply cannot mix and match the two. Only one power management -interface can be in control of the machine at once. Think about it.. - -User-space Daemons ------------------- -Both APM and ACPI rely on user-space daemons, apmd and acpid -respectively, to be completely functional. Obtain both of these -daemons from your Linux distribution or from the Internet (see below) -and be sure that they are started sometime in the system boot process. -Go ahead and start both. If ACPI or APM is not available on your -system the associated daemon will exit gracefully. - - apmd: http://ftp.debian.org/pool/main/a/apmd/ - acpid: http://acpid.sf.net/ diff --git a/Documentation/power/basic-pm-debugging.rst b/Documentation/power/basic-pm-debugging.rst new file mode 100644 index 000000000000..69862e759c30 --- /dev/null +++ b/Documentation/power/basic-pm-debugging.rst @@ -0,0 +1,269 @@ +================================= +Debugging hibernation and suspend +================================= + + (C) 2007 Rafael J. Wysocki , GPL + +1. 
Testing hibernation (aka suspend to disk or STD) +=================================================== + +To check if hibernation works, you can try to hibernate in the "reboot" mode:: + + # echo reboot > /sys/power/disk + # echo disk > /sys/power/state + +and the system should create a hibernation image, reboot, resume and get back to +the command prompt where you have started the transition. If that happens, +hibernation is most likely to work correctly. Still, you need to repeat the +test at least a couple of times in a row for confidence. [This is necessary, +because some problems only show up on a second attempt at suspending and +resuming the system.] Moreover, hibernating in the "reboot" and "shutdown" +modes causes the PM core to skip some platform-related callbacks which on ACPI +systems might be necessary to make hibernation work. Thus, if your machine +fails to hibernate or resume in the "reboot" mode, you should try the +"platform" mode:: + + # echo platform > /sys/power/disk + # echo disk > /sys/power/state + +which is the default and recommended mode of hibernation. + +Unfortunately, the "platform" mode of hibernation does not work on some systems +with broken BIOSes. In such cases the "shutdown" mode of hibernation might +work:: + + # echo shutdown > /sys/power/disk + # echo disk > /sys/power/state + +(it is similar to the "reboot" mode, but it requires you to press the power +button to make the system resume). + +If neither "platform" nor "shutdown" hibernation mode works, you will need to +identify what goes wrong. + +a) Test modes of hibernation +---------------------------- + +To find out why hibernation fails on your system, you can use a special testing +facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then, +there is the file /sys/power/pm_test that can be used to make the hibernation +core run in a test mode. 
There are 5 test modes available: + +freezer + - test the freezing of processes + +devices + - test the freezing of processes and suspending of devices + +platform + - test the freezing of processes, suspending of devices and platform + global control methods [1]_ + +processors + - test the freezing of processes, suspending of devices, platform + global control methods [1]_ and the disabling of nonboot CPUs + +core + - test the freezing of processes, suspending of devices, platform global + control methods\ [1]_, the disabling of nonboot CPUs and suspending + of platform/system devices + +.. [1] + + the platform global control methods are only available on ACPI systems + and are only tested if the hibernation mode is set to "platform" + +To use one of them it is necessary to write the corresponding string to +/sys/power/pm_test (eg. "devices" to test the freezing of processes and +suspending devices) and issue the standard hibernation commands. For example, +to use the "devices" test mode along with the "platform" mode of hibernation, +you should do the following:: + + # echo devices > /sys/power/pm_test + # echo platform > /sys/power/disk + # echo disk > /sys/power/state + +Then, the kernel will try to freeze processes, suspend devices, wait a few +seconds (5 by default, but configurable by the suspend.pm_test_delay module +parameter), resume devices and thaw processes. If "platform" is written to +/sys/power/pm_test , then after suspending devices the kernel will additionally +invoke the global control methods (eg. ACPI global control methods) used to +prepare the platform firmware for hibernation. Next, it will wait a +configurable number of seconds and invoke the platform (eg. ACPI) global +methods used to cancel hibernation etc. + +Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal +hibernation/suspend operations. 
Also, when open for reading, /sys/power/pm_test +contains a space-separated list of all available tests (including "none" that +represents the normal functionality) in which the current test level is +indicated by square brackets. + +Generally, as you can see, each test level is more "invasive" than the previous +one and the "core" level tests the hardware and drivers as deeply as possible +without creating a hibernation image. Obviously, if the "devices" test fails, +the "platform" test will fail as well and so on. Thus, as a rule of thumb, you +should try the test modes starting from "freezer", through "devices", "platform" +and "processors" up to "core" (repeat the test on each level a couple of times +to make sure that any random factors are avoided). + +If the "freezer" test fails, there is a task that cannot be frozen (in that case +it usually is possible to identify the offending task by analysing the output of +dmesg obtained after the failing test). Failure at this level usually means +that there is a problem with the tasks freezer subsystem that should be +reported. + +If the "devices" test fails, most likely there is a driver that cannot suspend +or resume its device (in the latter case the system may hang or become unstable +after the test, so please take that into consideration). To find this driver, +you can carry out a binary search according to the rules: + +- if the test fails, unload a half of the drivers currently loaded and repeat + (that would probably involve rebooting the system, so always note what drivers + have been loaded before the test), +- if the test succeeds, load a half of the drivers you have unloaded most + recently and repeat. + +Once you have found the failing driver (there can be more than just one of +them), you have to unload it every time before hibernation. In that case please +make sure to report the problem with the driver. 
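
The level-by-level procedure described above (from "freezer" up to "core",
repeating each level a couple of times) can be sketched as a small shell
helper. This is only an illustration, not part of the patch: ``PM_DIR``,
``REPEAT`` and ``run_levels`` are hypothetical names, and the sketch assumes
that a failed test is reported as a write error on the ``state`` file. A
failure that hangs the machine cannot be caught this way.

```shell
#!/bin/sh
# Sketch only: step through the pm_test levels in increasing order of
# invasiveness. PM_DIR and REPEAT are assumptions made for illustration;
# PM_DIR defaults to /sys/power and may point at a scratch directory for a
# dry run.
PM_DIR="${PM_DIR:-/sys/power}"
LEVELS="freezer devices platform processors core"
REPEAT="${REPEAT:-2}"

run_levels() {
    for level in $LEVELS; do
        i=1
        while [ "$i" -le "$REPEAT" ]; do
            # Select the test level, then issue the standard hibernation
            # commands using the "platform" mode.
            echo "$level" > "$PM_DIR/pm_test" || return 1
            echo platform > "$PM_DIR/disk" || return 1
            if ! echo disk > "$PM_DIR/state"; then
                echo "pm_test level '$level' failed on attempt $i" >&2
                # Leave the system in normal (non-test) operation.
                echo none > "$PM_DIR/pm_test"
                return 1
            fi
            i=$((i + 1))
        done
    done
    # Switch back to the normal hibernation/suspend operations.
    echo none > "$PM_DIR/pm_test"
}
```

Run as root on a kernel built with CONFIG_PM_DEBUG, calling ``run_levels``
stops at the first level that reports an error, which is the level to
investigate with the steps described above.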
+ +It is also possible that the "devices" test will still fail after you have +unloaded all modules. In that case, you may want to look in your kernel +configuration for the drivers that can be compiled as modules (and test again +with these drivers compiled as modules). You may also try to use some special +kernel command line options such as "noapic", "noacpi" or even "acpi=off". + +If the "platform" test fails, there is a problem with the handling of the +platform (eg. ACPI) firmware on your system. In that case the "platform" mode +of hibernation is not likely to work. You can try the "shutdown" mode, but that +is rather a poor man's workaround. + +If the "processors" test fails, the disabling/enabling of nonboot CPUs does not +work (of course, this only may be an issue on SMP systems) and the problem +should be reported. In that case you can also try to switch the nonboot CPUs +off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and +see if that works. + +If the "core" test fails, which means that suspending of the system/platform +devices has failed (these devices are suspended on one CPU with interrupts off), +the problem is most probably hardware-related and serious, so it should be +reported. + +A failure of any of the "platform", "processors" or "core" tests may cause your +system to hang or become unstable, so please beware. Such a failure usually +indicates a serious problem that very well may be related to the hardware, but +please report it anyway. + +b) Testing minimal configuration +-------------------------------- + +If all of the hibernation test modes work, you can boot the system with the +"init=/bin/bash" command line parameter and attempt to hibernate in the +"reboot", "shutdown" and "platform" modes. If that does not work, there +probably is a problem with a driver statically compiled into the kernel and you +can try to compile more drivers as modules, so that they can be tested +individually. 
Otherwise, there is a problem with a modular driver and you can +find it by loading a half of the modules you normally use and binary searching +in accordance with the algorithm: +- if there are n modules loaded and the attempt to suspend and resume fails, +unload n/2 of the modules and try again (that would probably involve rebooting +the system), +- if there are n modules loaded and the attempt to suspend and resume succeeds, +load n/2 modules more and try again. + +Again, if you find the offending module(s), it(they) must be unloaded every time +before hibernation, and please report the problem with it(them). + +c) Using the "test_resume" hibernation option +--------------------------------------------- + +/sys/power/disk generally tells the kernel what to do after creating a +hibernation image. One of the available options is "test_resume" which +causes the just created image to be used for immediate restoration. Namely, +after doing:: + + # echo test_resume > /sys/power/disk + # echo disk > /sys/power/state + +a hibernation image will be created and a resume from it will be triggered +immediately without involving the platform firmware in any way. + +That test can be used to check if failures to resume from hibernation are +related to bad interactions with the platform firmware. That is, if the above +works every time, but resume from actual hibernation does not work or is +unreliable, the platform firmware may be responsible for the failures. + +On architectures and platforms that support using different kernels to restore +hibernation images (that is, the kernel used to read the image from storage and +load it into memory is different from the one included in the image) or support +kernel address space randomization, it also can be used to check if failures +to resume may be related to the differences between the restore and image +kernels. 
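
Before relying on "test_resume" it may be worth checking that the running
kernel offers it. The following is a minimal sketch under one assumption that
the text above does not state: that reading /sys/power/disk returns a
space-separated list of the available modes with the current one in square
brackets, in the same way as described for /sys/power/pm_test. ``DISK_FILE``
and ``has_mode`` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch only: check whether a hibernation mode is listed in /sys/power/disk.
# DISK_FILE is an assumption for illustration; it can point at a scratch
# file for a dry run.
DISK_FILE="${DISK_FILE:-/sys/power/disk}"

has_mode() {
    # Strip the brackets around the current mode, put one mode per line,
    # then look for an exact match of the requested mode.
    sed 's/[][]//g' "$DISK_FILE" | tr ' ' '\n' | grep -qxF -- "$1"
}
```

For example, ``has_mode test_resume && echo test_resume > /sys/power/disk``
writes the option only when it is listed, after which ``echo disk >
/sys/power/state`` starts the test as described above.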
+ +d) Advanced debugging +--------------------- + +In case that hibernation does not work on your system even in the minimal +configuration and compiling more drivers as modules is not practical or some +modules cannot be unloaded, you can use one of the more advanced debugging +techniques to find the problem. First, if there is a serial port in your box, +you can boot the kernel with the 'no_console_suspend' parameter and try to log +kernel messages using the serial console. This may provide you with some +information about the reasons of the suspend (resume) failure. Alternatively, +it may be possible to use a FireWire port for debugging with firescope +(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to +use the PM_TRACE mechanism documented in Documentation/power/s2ram.rst . + +2. Testing suspend to RAM (STR) +=============================== + +To verify that the STR works, it is generally more convenient to use the s2ram +tool available from http://suspend.sf.net and documented at +http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK). + +Namely, after writing "freezer", "devices", "platform", "processors", or "core" +into /sys/power/pm_test (available if the kernel is compiled with +CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding +to given string. The STR test modes are defined in the same way as for +hibernation, so please refer to Section 1 for more information about them. In +particular, the "core" test allows you to test everything except for the actual +invocation of the platform firmware in order to put the system into the sleep +state. + +Among other things, the testing with the help of /sys/power/pm_test may allow +you to identify drivers that fail to suspend or resume their devices. They +should be unloaded every time before an STR transition. 
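
The STR test modes can be driven in the same way as the hibernation ones. The
sketch below is illustrative only: ``PM_TEST_FILE``, ``STATE_FILE``,
``current_pm_test`` and ``str_test`` are hypothetical names, and writing "mem"
to /sys/power/state is the usual way to request suspend to RAM rather than
something taken from the text above. The first helper reads the current test
level, which pm_test marks with square brackets as described in Section 1.

```shell
#!/bin/sh
# Sketch only: select a pm_test level and trigger a suspend-to-RAM test.
# PM_TEST_FILE and STATE_FILE are assumptions for illustration; they can
# point at scratch files for a dry run.
PM_TEST_FILE="${PM_TEST_FILE:-/sys/power/pm_test}"
STATE_FILE="${STATE_FILE:-/sys/power/state}"

current_pm_test() {
    # Print the entry enclosed in square brackets, i.e. the current level.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$PM_TEST_FILE"
}

str_test() {
    # $1 is one of: freezer devices platform processors core
    echo "$1" > "$PM_TEST_FILE" || return 1
    # With a test level selected, the suspend code runs in the matching
    # test mode.
    echo mem > "$STATE_FILE"
}
```

For instance, ``str_test devices`` exercises the freezing of processes and the
suspending of devices; writing "none" to pm_test afterwards restores normal
operation.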
+ +Next, you can follow the instructions at S2RAM_LINK to test the system, but if +it does not work "out of the box", you may need to boot it with +"init=/bin/bash" and test s2ram in the minimal configuration. In that case, +you may be able to search for failing drivers by following the procedure +analogous to the one described in section 1. If you find some failing drivers, +you will have to unload them every time before an STR transition (ie. before +you run s2ram), and please report the problems with them. + +There is a debugfs entry which shows the suspend to RAM statistics. Here is an +example of its output:: + + # mount -t debugfs none /sys/kernel/debug + # cat /sys/kernel/debug/suspend_stats + success: 20 + fail: 5 + failed_freeze: 0 + failed_prepare: 0 + failed_suspend: 5 + failed_suspend_noirq: 0 + failed_resume: 0 + failed_resume_noirq: 0 + failures: + last_failed_dev: alarm + adc + last_failed_errno: -16 + -16 + last_failed_step: suspend + suspend + +Field success means the success number of suspend to RAM, and field fail means +the failure number. Others are the failure number of different steps of suspend +to RAM. suspend_stats just lists the last 2 failed devices, error number and +failed step of suspend. diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt deleted file mode 100644 index 708f87f78a75..000000000000 --- a/Documentation/power/basic-pm-debugging.txt +++ /dev/null @@ -1,254 +0,0 @@ -Debugging hibernation and suspend - (C) 2007 Rafael J. Wysocki , GPL - -1. Testing hibernation (aka suspend to disk or STD) - -To check if hibernation works, you can try to hibernate in the "reboot" mode: - -# echo reboot > /sys/power/disk -# echo disk > /sys/power/state - -and the system should create a hibernation image, reboot, resume and get back to -the command prompt where you have started the transition. If that happens, -hibernation is most likely to work correctly. 
Still, you need to repeat the -test at least a couple of times in a row for confidence. [This is necessary, -because some problems only show up on a second attempt at suspending and -resuming the system.] Moreover, hibernating in the "reboot" and "shutdown" -modes causes the PM core to skip some platform-related callbacks which on ACPI -systems might be necessary to make hibernation work. Thus, if your machine fails -to hibernate or resume in the "reboot" mode, you should try the "platform" mode: - -# echo platform > /sys/power/disk -# echo disk > /sys/power/state - -which is the default and recommended mode of hibernation. - -Unfortunately, the "platform" mode of hibernation does not work on some systems -with broken BIOSes. In such cases the "shutdown" mode of hibernation might -work: - -# echo shutdown > /sys/power/disk -# echo disk > /sys/power/state - -(it is similar to the "reboot" mode, but it requires you to press the power -button to make the system resume). - -If neither "platform" nor "shutdown" hibernation mode works, you will need to -identify what goes wrong. - -a) Test modes of hibernation - -To find out why hibernation fails on your system, you can use a special testing -facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then, -there is the file /sys/power/pm_test that can be used to make the hibernation -core run in a test mode. 
There are 5 test modes available: - -freezer -- test the freezing of processes - -devices -- test the freezing of processes and suspending of devices - -platform -- test the freezing of processes, suspending of devices and platform - global control methods(*) - -processors -- test the freezing of processes, suspending of devices, platform - global control methods(*) and the disabling of nonboot CPUs - -core -- test the freezing of processes, suspending of devices, platform global - control methods(*), the disabling of nonboot CPUs and suspending of - platform/system devices - -(*) the platform global control methods are only available on ACPI systems - and are only tested if the hibernation mode is set to "platform" - -To use one of them it is necessary to write the corresponding string to -/sys/power/pm_test (eg. "devices" to test the freezing of processes and -suspending devices) and issue the standard hibernation commands. For example, -to use the "devices" test mode along with the "platform" mode of hibernation, -you should do the following: - -# echo devices > /sys/power/pm_test -# echo platform > /sys/power/disk -# echo disk > /sys/power/state - -Then, the kernel will try to freeze processes, suspend devices, wait a few -seconds (5 by default, but configurable by the suspend.pm_test_delay module -parameter), resume devices and thaw processes. If "platform" is written to -/sys/power/pm_test , then after suspending devices the kernel will additionally -invoke the global control methods (eg. ACPI global control methods) used to -prepare the platform firmware for hibernation. Next, it will wait a -configurable number of seconds and invoke the platform (eg. ACPI) global -methods used to cancel hibernation etc. - -Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal -hibernation/suspend operations. 
Also, when open for reading, /sys/power/pm_test -contains a space-separated list of all available tests (including "none" that -represents the normal functionality) in which the current test level is -indicated by square brackets. - -Generally, as you can see, each test level is more "invasive" than the previous -one and the "core" level tests the hardware and drivers as deeply as possible -without creating a hibernation image. Obviously, if the "devices" test fails, -the "platform" test will fail as well and so on. Thus, as a rule of thumb, you -should try the test modes starting from "freezer", through "devices", "platform" -and "processors" up to "core" (repeat the test on each level a couple of times -to make sure that any random factors are avoided). - -If the "freezer" test fails, there is a task that cannot be frozen (in that case -it usually is possible to identify the offending task by analysing the output of -dmesg obtained after the failing test). Failure at this level usually means -that there is a problem with the tasks freezer subsystem that should be -reported. - -If the "devices" test fails, most likely there is a driver that cannot suspend -or resume its device (in the latter case the system may hang or become unstable -after the test, so please take that into consideration). To find this driver, -you can carry out a binary search according to the rules: -- if the test fails, unload a half of the drivers currently loaded and repeat -(that would probably involve rebooting the system, so always note what drivers -have been loaded before the test), -- if the test succeeds, load a half of the drivers you have unloaded most -recently and repeat. - -Once you have found the failing driver (there can be more than just one of -them), you have to unload it every time before hibernation. In that case please -make sure to report the problem with the driver. - -It is also possible that the "devices" test will still fail after you have -unloaded all modules. 
In that case, you may want to look in your kernel -configuration for the drivers that can be compiled as modules (and test again -with these drivers compiled as modules). You may also try to use some special -kernel command line options such as "noapic", "noacpi" or even "acpi=off". - -If the "platform" test fails, there is a problem with the handling of the -platform (eg. ACPI) firmware on your system. In that case the "platform" mode -of hibernation is not likely to work. You can try the "shutdown" mode, but that -is rather a poor man's workaround. - -If the "processors" test fails, the disabling/enabling of nonboot CPUs does not -work (of course, this only may be an issue on SMP systems) and the problem -should be reported. In that case you can also try to switch the nonboot CPUs -off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and -see if that works. - -If the "core" test fails, which means that suspending of the system/platform -devices has failed (these devices are suspended on one CPU with interrupts off), -the problem is most probably hardware-related and serious, so it should be -reported. - -A failure of any of the "platform", "processors" or "core" tests may cause your -system to hang or become unstable, so please beware. Such a failure usually -indicates a serious problem that very well may be related to the hardware, but -please report it anyway. - -b) Testing minimal configuration - -If all of the hibernation test modes work, you can boot the system with the -"init=/bin/bash" command line parameter and attempt to hibernate in the -"reboot", "shutdown" and "platform" modes. If that does not work, there -probably is a problem with a driver statically compiled into the kernel and you -can try to compile more drivers as modules, so that they can be tested -individually. 
Otherwise, there is a problem with a modular driver and you can -find it by loading a half of the modules you normally use and binary searching -in accordance with the algorithm: -- if there are n modules loaded and the attempt to suspend and resume fails, -unload n/2 of the modules and try again (that would probably involve rebooting -the system), -- if there are n modules loaded and the attempt to suspend and resume succeeds, -load n/2 modules more and try again. - -Again, if you find the offending module(s), it(they) must be unloaded every time -before hibernation, and please report the problem with it(them). - -c) Using the "test_resume" hibernation option - -/sys/power/disk generally tells the kernel what to do after creating a -hibernation image. One of the available options is "test_resume" which -causes the just created image to be used for immediate restoration. Namely, -after doing: - -# echo test_resume > /sys/power/disk -# echo disk > /sys/power/state - -a hibernation image will be created and a resume from it will be triggered -immediately without involving the platform firmware in any way. - -That test can be used to check if failures to resume from hibernation are -related to bad interactions with the platform firmware. That is, if the above -works every time, but resume from actual hibernation does not work or is -unreliable, the platform firmware may be responsible for the failures. - -On architectures and platforms that support using different kernels to restore -hibernation images (that is, the kernel used to read the image from storage and -load it into memory is different from the one included in the image) or support -kernel address space randomization, it also can be used to check if failures -to resume may be related to the differences between the restore and image -kernels. 
- -d) Advanced debugging - -In case that hibernation does not work on your system even in the minimal -configuration and compiling more drivers as modules is not practical or some -modules cannot be unloaded, you can use one of the more advanced debugging -techniques to find the problem. First, if there is a serial port in your box, -you can boot the kernel with the 'no_console_suspend' parameter and try to log -kernel messages using the serial console. This may provide you with some -information about the reasons of the suspend (resume) failure. Alternatively, -it may be possible to use a FireWire port for debugging with firescope -(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to -use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt . - -2. Testing suspend to RAM (STR) - -To verify that the STR works, it is generally more convenient to use the s2ram -tool available from http://suspend.sf.net and documented at -http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK). - -Namely, after writing "freezer", "devices", "platform", "processors", or "core" -into /sys/power/pm_test (available if the kernel is compiled with -CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding -to given string. The STR test modes are defined in the same way as for -hibernation, so please refer to Section 1 for more information about them. In -particular, the "core" test allows you to test everything except for the actual -invocation of the platform firmware in order to put the system into the sleep -state. - -Among other things, the testing with the help of /sys/power/pm_test may allow -you to identify drivers that fail to suspend or resume their devices. They -should be unloaded every time before an STR transition. - -Next, you can follow the instructions at S2RAM_LINK to test the system, but if -it does not work "out of the box", you may need to boot it with -"init=/bin/bash" and test s2ram in the minimal configuration. 
In that case, -you may be able to search for failing drivers by following the procedure -analogous to the one described in section 1. If you find some failing drivers, -you will have to unload them every time before an STR transition (ie. before -you run s2ram), and please report the problems with them. - -There is a debugfs entry which shows the suspend to RAM statistics. Here is an -example of its output. - # mount -t debugfs none /sys/kernel/debug - # cat /sys/kernel/debug/suspend_stats - success: 20 - fail: 5 - failed_freeze: 0 - failed_prepare: 0 - failed_suspend: 5 - failed_suspend_noirq: 0 - failed_resume: 0 - failed_resume_noirq: 0 - failures: - last_failed_dev: alarm - adc - last_failed_errno: -16 - -16 - last_failed_step: suspend - suspend -Field success means the success number of suspend to RAM, and field fail means -the failure number. Others are the failure number of different steps of suspend -to RAM. suspend_stats just lists the last 2 failed devices, error number and -failed step of suspend. diff --git a/Documentation/power/charger-manager.rst b/Documentation/power/charger-manager.rst new file mode 100644 index 000000000000..84fab9376792 --- /dev/null +++ b/Documentation/power/charger-manager.rst @@ -0,0 +1,205 @@ +=============== +Charger Manager +=============== + + (C) 2011 MyungJoo Ham , GPL + +Charger Manager provides in-kernel battery charger management that +requires temperature monitoring during suspend-to-RAM state +and where each battery may have multiple chargers attached and the userland +wants to look at the aggregated information of the multiple chargers. + +Charger Manager is a platform_driver with power-supply-class entries. +An instance of Charger Manager (a platform-device created with Charger-Manager) +represents an independent battery with chargers. If there are multiple +batteries with their own chargers acting independently in a system, +the system may need multiple instances of Charger Manager. + +1. 
Introduction +=============== + +Charger Manager supports the following: + +* Support for multiple chargers (e.g., a device with USB, AC, and solar panels) + A system may have multiple chargers (or power sources) and some of + they may be activated at the same time. Each charger may have its + own power-supply-class and each power-supply-class can provide + different information about the battery status. This framework + aggregates charger-related information from multiple sources and + shows combined information as a single power-supply-class. + +* Support for in suspend-to-RAM polling (with suspend_again callback) + While the battery is being charged and the system is in suspend-to-RAM, + we may need to monitor the battery health by looking at the ambient or + battery temperature. We can accomplish this by waking up the system + periodically. However, such a method wakes up devices unnecessarily for + monitoring the battery health and tasks, and user processes that are + supposed to be kept suspended. That, in turn, incurs unnecessary power + consumption and slow down charging process. Or even, such peak power + consumption can stop chargers in the middle of charging + (external power input < device power consumption), which not + only affects the charging time, but the lifespan of the battery. + + Charger Manager provides a function "cm_suspend_again" that can be + used as suspend_again callback of platform_suspend_ops. If the platform + requires tasks other than cm_suspend_again, it may implement its own + suspend_again callback that calls cm_suspend_again in the middle. + Normally, the platform will need to resume and suspend some devices + that are used by Charger Manager. + +* Support for premature full-battery event handling + If the battery voltage drops by "fullbatt_vchkdrop_uV" after + "fullbatt_vchkdrop_ms" from the full-battery event, the framework + restarts charging. 
This check is also performed while suspended by + setting wakeup time accordingly and using suspend_again. + +* Support for uevent-notify + With the charger-related events, the device sends + notification to users with UEVENT. + +2. Global Charger-Manager Data related with suspend_again +========================================================= +In order to setup Charger Manager with suspend-again feature +(in-suspend monitoring), the user should provide charger_global_desc +with setup_charger_manager(`struct charger_global_desc *`). +This charger_global_desc data for in-suspend monitoring is global +as the name suggests. Thus, the user needs to provide only once even +if there are multiple batteries. If there are multiple batteries, the +multiple instances of Charger Manager share the same charger_global_desc +and it will manage in-suspend monitoring for all instances of Charger Manager. + +The user needs to provide all the three entries to `struct charger_global_desc` +properly in order to activate in-suspend monitoring: + +`char *rtc_name;` + The name of rtc (e.g., "rtc0") used to wakeup the system from + suspend for Charger Manager. The alarm interrupt (AIE) of the rtc + should be able to wake up the system from suspend. Charger Manager + saves and restores the alarm value and use the previously-defined + alarm if it is going to go off earlier than Charger Manager so that + Charger Manager does not interfere with previously-defined alarms. + +`bool (*rtc_only_wakeup)(void);` + This callback should let CM know whether + the wakeup-from-suspend is caused only by the alarm of "rtc" in the + same struct. If there is any other wakeup source triggered the + wakeup, it should return false. If the "rtc" is the only wakeup + reason, it should return true. + +`bool assume_timer_stops_in_suspend;` + if true, Charger Manager assumes that + the timer (CM uses jiffies as timer) stops during suspend. Then, CM + assumes that the suspend-duration is same as the alarm length. 
+ + +3. How to setup suspend_again +============================= +Charger Manager provides a function "extern bool cm_suspend_again(void)". +When cm_suspend_again is called, it monitors every battery. The suspend_ops +callback of the system's platform_suspend_ops can call cm_suspend_again +function to know whether Charger Manager wants to suspend again or not. +If there are no other devices or tasks that want to use suspend_again +feature, the platform_suspend_ops may directly refer to cm_suspend_again +for its suspend_again callback. + +The cm_suspend_again() returns true (meaning "I want to suspend again") +if the system was woken up by Charger Manager and the polling +(in-suspend monitoring) results in "normal". + +4. Charger-Manager Data (struct charger_desc) +============================================= +For each battery charged independently from other batteries (if a series of +batteries are charged by a single charger, they are counted as one independent +battery), an instance of Charger Manager is attached to it. The following + +struct charger_desc elements: + +`char *psy_name;` + The power-supply-class name of the battery. Default is + "battery" if psy_name is NULL. Users can access the psy entries + at "/sys/class/power_supply/[psy_name]/". + +`enum polling_modes polling_mode;` + CM_POLL_DISABLE: + do not poll this battery. + CM_POLL_ALWAYS: + always poll this battery. + CM_POLL_EXTERNAL_POWER_ONLY: + poll this battery if and only if an external power + source is attached. + CM_POLL_CHARGING_ONLY: + poll this battery if and only if the battery is being charged. + +`unsigned int fullbatt_vchkdrop_ms; / unsigned int fullbatt_vchkdrop_uV;` + If both have non-zero values, Charger Manager will check the + battery voltage drop fullbatt_vchkdrop_ms after the battery is fully + charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger + Manager will try to recharge the battery by disabling and enabling + chargers. 
Recharge with voltage drop condition only (without delay + condition) is needed to be implemented with hardware interrupts from + fuel gauges or charger devices/chips. + +`unsigned int fullbatt_uV;` + If specified with a non-zero value, Charger Manager assumes + that the battery is full (capacity = 100) if the battery is not being + charged and the battery voltage is equal to or greater than + fullbatt_uV. + +`unsigned int polling_interval_ms;` + Required polling interval in ms. Charger Manager will poll + this battery every polling_interval_ms or more frequently. + +`enum data_source battery_present;` + CM_BATTERY_PRESENT: + assume that the battery exists. + CM_NO_BATTERY: + assume that the battery does not exists. + CM_FUEL_GAUGE: + get battery presence information from fuel gauge. + CM_CHARGER_STAT: + get battery presence from chargers. + +`char **psy_charger_stat;` + An array ending with NULL that has power-supply-class names of + chargers. Each power-supply-class should provide "PRESENT" (if + battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an + external power source is attached or not), and "STATUS" (shows whether + the battery is {"FULL" or not FULL} or {"FULL", "Charging", + "Discharging", "NotCharging"}). + +`int num_charger_regulators; / struct regulator_bulk_data *charger_regulators;` + Regulators representing the chargers in the form for + regulator framework's bulk functions. + +`char *psy_fuel_gauge;` + Power-supply-class name of the fuel gauge. + +`int (*temperature_out_of_range)(int *mC); / bool measure_battery_temp;` + This callback returns 0 if the temperature is safe for charging, + a positive number if it is too hot to charge, and a negative number + if it is too cold to charge. With the variable mC, the callback returns + the temperature in 1/1000 of centigrade. + The source of temperature can be battery or ambient one according to + the value of measure_battery_temp. + + +5. 
Notify Charger-Manager of charger events: cm_notify_event() +============================================================== +If there is an charger event is required to notify +Charger Manager, a charger device driver that triggers the event can call +cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager. +In the function, psy is the charger driver's power_supply pointer, which is +associated with Charger-Manager. The parameter "type" +is the same as irq's type (enum cm_event_types). The event message "msg" is +optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS". + +6. Other Considerations +======================= + +At the charger/battery-related events such as battery-pulled-out, +charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped, +and others critical to chargers, the system should be configured to wake up. +At least the following should wake up the system from a suspend: +a) charger-on/off b) external-power-in/out c) battery-in/out (while charging) + +It is usually accomplished by configuring the PMIC as a wakeup source. diff --git a/Documentation/power/charger-manager.txt b/Documentation/power/charger-manager.txt deleted file mode 100644 index 9ff1105e58d6..000000000000 --- a/Documentation/power/charger-manager.txt +++ /dev/null @@ -1,200 +0,0 @@ -Charger Manager - (C) 2011 MyungJoo Ham , GPL - -Charger Manager provides in-kernel battery charger management that -requires temperature monitoring during suspend-to-RAM state -and where each battery may have multiple chargers attached and the userland -wants to look at the aggregated information of the multiple chargers. - -Charger Manager is a platform_driver with power-supply-class entries. -An instance of Charger Manager (a platform-device created with Charger-Manager) -represents an independent battery with chargers. 
If there are multiple -batteries with their own chargers acting independently in a system, -the system may need multiple instances of Charger Manager. - -1. Introduction -=============== - -Charger Manager supports the following: - -* Support for multiple chargers (e.g., a device with USB, AC, and solar panels) - A system may have multiple chargers (or power sources) and some of - they may be activated at the same time. Each charger may have its - own power-supply-class and each power-supply-class can provide - different information about the battery status. This framework - aggregates charger-related information from multiple sources and - shows combined information as a single power-supply-class. - -* Support for in suspend-to-RAM polling (with suspend_again callback) - While the battery is being charged and the system is in suspend-to-RAM, - we may need to monitor the battery health by looking at the ambient or - battery temperature. We can accomplish this by waking up the system - periodically. However, such a method wakes up devices unnecessarily for - monitoring the battery health and tasks, and user processes that are - supposed to be kept suspended. That, in turn, incurs unnecessary power - consumption and slow down charging process. Or even, such peak power - consumption can stop chargers in the middle of charging - (external power input < device power consumption), which not - only affects the charging time, but the lifespan of the battery. - - Charger Manager provides a function "cm_suspend_again" that can be - used as suspend_again callback of platform_suspend_ops. If the platform - requires tasks other than cm_suspend_again, it may implement its own - suspend_again callback that calls cm_suspend_again in the middle. - Normally, the platform will need to resume and suspend some devices - that are used by Charger Manager. 
- -* Support for premature full-battery event handling - If the battery voltage drops by "fullbatt_vchkdrop_uV" after - "fullbatt_vchkdrop_ms" from the full-battery event, the framework - restarts charging. This check is also performed while suspended by - setting wakeup time accordingly and using suspend_again. - -* Support for uevent-notify - With the charger-related events, the device sends - notification to users with UEVENT. - -2. Global Charger-Manager Data related with suspend_again -======================================================== -In order to setup Charger Manager with suspend-again feature -(in-suspend monitoring), the user should provide charger_global_desc -with setup_charger_manager(struct charger_global_desc *). -This charger_global_desc data for in-suspend monitoring is global -as the name suggests. Thus, the user needs to provide only once even -if there are multiple batteries. If there are multiple batteries, the -multiple instances of Charger Manager share the same charger_global_desc -and it will manage in-suspend monitoring for all instances of Charger Manager. - -The user needs to provide all the three entries properly in order to activate -in-suspend monitoring: - -struct charger_global_desc { - -char *rtc_name; - : The name of rtc (e.g., "rtc0") used to wakeup the system from - suspend for Charger Manager. The alarm interrupt (AIE) of the rtc - should be able to wake up the system from suspend. Charger Manager - saves and restores the alarm value and use the previously-defined - alarm if it is going to go off earlier than Charger Manager so that - Charger Manager does not interfere with previously-defined alarms. - -bool (*rtc_only_wakeup)(void); - : This callback should let CM know whether - the wakeup-from-suspend is caused only by the alarm of "rtc" in the - same struct. If there is any other wakeup source triggered the - wakeup, it should return false. If the "rtc" is the only wakeup - reason, it should return true. 
- -bool assume_timer_stops_in_suspend; - : if true, Charger Manager assumes that - the timer (CM uses jiffies as timer) stops during suspend. Then, CM - assumes that the suspend-duration is same as the alarm length. -}; - -3. How to setup suspend_again -============================= -Charger Manager provides a function "extern bool cm_suspend_again(void)". -When cm_suspend_again is called, it monitors every battery. The suspend_ops -callback of the system's platform_suspend_ops can call cm_suspend_again -function to know whether Charger Manager wants to suspend again or not. -If there are no other devices or tasks that want to use suspend_again -feature, the platform_suspend_ops may directly refer to cm_suspend_again -for its suspend_again callback. - -The cm_suspend_again() returns true (meaning "I want to suspend again") -if the system was woken up by Charger Manager and the polling -(in-suspend monitoring) results in "normal". - -4. Charger-Manager Data (struct charger_desc) -============================================= -For each battery charged independently from other batteries (if a series of -batteries are charged by a single charger, they are counted as one independent -battery), an instance of Charger Manager is attached to it. - -struct charger_desc { - -char *psy_name; - : The power-supply-class name of the battery. Default is - "battery" if psy_name is NULL. Users can access the psy entries - at "/sys/class/power_supply/[psy_name]/". - -enum polling_modes polling_mode; - : CM_POLL_DISABLE: do not poll this battery. - CM_POLL_ALWAYS: always poll this battery. - CM_POLL_EXTERNAL_POWER_ONLY: poll this battery if and only if - an external power source is attached. - CM_POLL_CHARGING_ONLY: poll this battery if and only if the - battery is being charged. 
- -unsigned int fullbatt_vchkdrop_ms; -unsigned int fullbatt_vchkdrop_uV; - : If both have non-zero values, Charger Manager will check the - battery voltage drop fullbatt_vchkdrop_ms after the battery is fully - charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger - Manager will try to recharge the battery by disabling and enabling - chargers. Recharge with voltage drop condition only (without delay - condition) is needed to be implemented with hardware interrupts from - fuel gauges or charger devices/chips. - -unsigned int fullbatt_uV; - : If specified with a non-zero value, Charger Manager assumes - that the battery is full (capacity = 100) if the battery is not being - charged and the battery voltage is equal to or greater than - fullbatt_uV. - -unsigned int polling_interval_ms; - : Required polling interval in ms. Charger Manager will poll - this battery every polling_interval_ms or more frequently. - -enum data_source battery_present; - : CM_BATTERY_PRESENT: assume that the battery exists. - CM_NO_BATTERY: assume that the battery does not exists. - CM_FUEL_GAUGE: get battery presence information from fuel gauge. - CM_CHARGER_STAT: get battery presence from chargers. - -char **psy_charger_stat; - : An array ending with NULL that has power-supply-class names of - chargers. Each power-supply-class should provide "PRESENT" (if - battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an - external power source is attached or not), and "STATUS" (shows whether - the battery is {"FULL" or not FULL} or {"FULL", "Charging", - "Discharging", "NotCharging"}). - -int num_charger_regulators; -struct regulator_bulk_data *charger_regulators; - : Regulators representing the chargers in the form for - regulator framework's bulk functions. - -char *psy_fuel_gauge; - : Power-supply-class name of the fuel gauge. 
- -int (*temperature_out_of_range)(int *mC); -bool measure_battery_temp; - : This callback returns 0 if the temperature is safe for charging, - a positive number if it is too hot to charge, and a negative number - if it is too cold to charge. With the variable mC, the callback returns - the temperature in 1/1000 of centigrade. - The source of temperature can be battery or ambient one according to - the value of measure_battery_temp. -}; - -5. Notify Charger-Manager of charger events: cm_notify_event() -========================================================= -If there is an charger event is required to notify -Charger Manager, a charger device driver that triggers the event can call -cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager. -In the function, psy is the charger driver's power_supply pointer, which is -associated with Charger-Manager. The parameter "type" -is the same as irq's type (enum cm_event_types). The event message "msg" is -optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS". - -6. Other Considerations -======================= - -At the charger/battery-related events such as battery-pulled-out, -charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped, -and others critical to chargers, the system should be configured to wake up. -At least the following should wake up the system from a suspend: -a) charger-on/off b) external-power-in/out c) battery-in/out (while charging) - -It is usually accomplished by configuring the PMIC as a wakeup source. diff --git a/Documentation/power/drivers-testing.rst b/Documentation/power/drivers-testing.rst new file mode 100644 index 000000000000..e53f1999fc39 --- /dev/null +++ b/Documentation/power/drivers-testing.rst @@ -0,0 +1,51 @@ +==================================================== +Testing suspend and resume support in device drivers +==================================================== + + (C) 2007 Rafael J. Wysocki , GPL + +1. 
Preparing the test system +============================ + +Unfortunately, to effectively test the support for the system-wide suspend and +resume transitions in a driver, it is necessary to suspend and resume a fully +functional system with this driver loaded. Moreover, that should be done +several times, preferably several times in a row, and separately for hibernation +(aka suspend to disk or STD) and suspend to RAM (STR), because each of these +cases involves slightly different operations and different interactions with +the machine's BIOS. + +Of course, for this purpose the test system has to be known to suspend and +resume without the driver being tested. Thus, if possible, you should first +resolve all suspend/resume-related problems in the test system before you start +testing the new driver. Please see Documentation/power/basic-pm-debugging.rst +for more information about the debugging of suspend/resume functionality. + +2. Testing the driver +===================== + +Once you have resolved the suspend/resume-related problems with your test system +without the new driver, you are ready to test it: + +a) Build the driver as a module, load it and try the test modes of hibernation + (see: Documentation/power/basic-pm-debugging.rst, 1). + +b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and + "platform" modes (see: Documentation/power/basic-pm-debugging.rst, 1). + +c) Compile the driver directly into the kernel and try the test modes of + hibernation. + +d) Attempt to hibernate with the driver compiled directly into the kernel + in the "reboot", "shutdown" and "platform" modes. + +e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.rst, + 2). [As far as the STR tests are concerned, it should not matter whether or + not the driver is built as a module.] + +f) Attempt to suspend to RAM using the s2ram tool with the driver loaded + (see: Documentation/power/basic-pm-debugging.rst, 2). 
+ +Each of the above tests should be repeated several times and the STD tests +should be mixed with the STR tests. If any of them fails, the driver cannot be +regarded as suspend/resume-safe. diff --git a/Documentation/power/drivers-testing.txt b/Documentation/power/drivers-testing.txt deleted file mode 100644 index 638afdf4d6b8..000000000000 --- a/Documentation/power/drivers-testing.txt +++ /dev/null @@ -1,46 +0,0 @@ -Testing suspend and resume support in device drivers - (C) 2007 Rafael J. Wysocki , GPL - -1. Preparing the test system - -Unfortunately, to effectively test the support for the system-wide suspend and -resume transitions in a driver, it is necessary to suspend and resume a fully -functional system with this driver loaded. Moreover, that should be done -several times, preferably several times in a row, and separately for hibernation -(aka suspend to disk or STD) and suspend to RAM (STR), because each of these -cases involves slightly different operations and different interactions with -the machine's BIOS. - -Of course, for this purpose the test system has to be known to suspend and -resume without the driver being tested. Thus, if possible, you should first -resolve all suspend/resume-related problems in the test system before you start -testing the new driver. Please see Documentation/power/basic-pm-debugging.txt -for more information about the debugging of suspend/resume functionality. - -2. Testing the driver - -Once you have resolved the suspend/resume-related problems with your test system -without the new driver, you are ready to test it: - -a) Build the driver as a module, load it and try the test modes of hibernation - (see: Documentation/power/basic-pm-debugging.txt, 1). - -b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and - "platform" modes (see: Documentation/power/basic-pm-debugging.txt, 1). - -c) Compile the driver directly into the kernel and try the test modes of - hibernation. 
- -d) Attempt to hibernate with the driver compiled directly into the kernel - in the "reboot", "shutdown" and "platform" modes. - -e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.txt, - 2). [As far as the STR tests are concerned, it should not matter whether or - not the driver is built as a module.] - -f) Attempt to suspend to RAM using the s2ram tool with the driver loaded - (see: Documentation/power/basic-pm-debugging.txt, 2). - -Each of the above tests should be repeated several times and the STD tests -should be mixed with the STR tests. If any of them fails, the driver cannot be -regarded as suspend/resume-safe. diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst new file mode 100644 index 000000000000..90a345d57ae9 --- /dev/null +++ b/Documentation/power/energy-model.rst @@ -0,0 +1,147 @@ +==================== +Energy Model of CPUs +==================== + +1. Overview +----------- + +The Energy Model (EM) framework serves as an interface between drivers knowing +the power consumed by CPUs at various performance levels, and the kernel +subsystems willing to use that information to make energy-aware decisions. + +The source of the information about the power consumed by CPUs can vary greatly +from one platform to another. These power costs can be estimated using +devicetree data in some cases. In others, the firmware will know better. +Alternatively, userspace might be best positioned. And so on. In order to avoid +each and every client subsystem to re-implement support for each and every +possible source of information on its own, the EM framework intervenes as an +abstraction layer which standardizes the format of power cost tables in the +kernel, hence enabling to avoid redundant work. 
+ +The figure below depicts an example of drivers (Arm-specific here, but the +approach is applicable to any architecture) providing power costs to the EM +framework, and interested clients reading the data from it:: + + +---------------+ +-----------------+ +---------------+ + | Thermal (IPA) | | Scheduler (EAS) | | Other | + +---------------+ +-----------------+ +---------------+ + | | em_pd_energy() | + | | em_cpu_get() | + +---------+ | +---------+ + | | | + v v v + +---------------------+ + | Energy Model | + | Framework | + +---------------------+ + ^ ^ ^ + | | | em_register_perf_domain() + +----------+ | +---------+ + | | | + +---------------+ +---------------+ +--------------+ + | cpufreq-dt | | arm_scmi | | Other | + +---------------+ +---------------+ +--------------+ + ^ ^ ^ + | | | + +--------------+ +---------------+ +--------------+ + | Device Tree | | Firmware | | ? | + +--------------+ +---------------+ +--------------+ + +The EM framework manages power cost tables per 'performance domain' in the +system. A performance domain is a group of CPUs whose performance is scaled +together. Performance domains generally have a 1-to-1 mapping with CPUFreq +policies. All CPUs in a performance domain are required to have the same +micro-architecture. CPUs in different performance domains can have different +micro-architectures. + + +2. Core APIs +------------ + +2.1 Config options +^^^^^^^^^^^^^^^^^^ + +CONFIG_ENERGY_MODEL must be enabled to use the EM framework. + + +2.2 Registration of performance domains +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Drivers are expected to register performance domains into the EM framework by +calling the following API:: + + int em_register_perf_domain(cpumask_t *span, unsigned int nr_states, + struct em_data_callback *cb); + +Drivers must specify the CPUs of the performance domains using the cpumask +argument, and provide a callback function returning <frequency, power> tuples +for each capacity state.
The callback function provided by the driver is free +to fetch data from any relevant location (DT, firmware, ...), and by any means +deemed necessary. See Section 3. for an example of a driver implementing this +callback, and kernel/power/energy_model.c for further documentation on this +API. + + +2.3 Accessing performance domains +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Subsystems interested in the energy model of a CPU can retrieve it using the +em_cpu_get() API. The energy model tables are allocated once upon creation of +the performance domains, and kept in memory untouched. + +The energy consumed by a performance domain can be estimated using the +em_pd_energy() API. The estimation is performed assuming that the schedutil +CPUfreq governor is in use. + +More details about the above APIs can be found in include/linux/energy_model.h. + + +3. Example driver +----------------- + +This section provides a simple example of a CPUFreq driver registering a +performance domain in the Energy Model framework using the (fake) 'foo' +protocol. The driver implements an est_power() function to be provided to the +EM framework:: + + -> drivers/cpufreq/foo_cpufreq.c + + 01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu) + 02 { + 03 long freq, power; + 04 + 05 /* Use the 'foo' protocol to ceil the frequency */ + 06 freq = foo_get_freq_ceil(cpu, *KHz); + 07 if (freq < 0) + 08 return freq; + 09 + 10 /* Estimate the power cost for the CPU at the relevant freq. */ + 11 power = foo_estimate_power(cpu, freq); + 12 if (power < 0) + 13 return power; + 14 + 15 /* Return the values to the EM framework */ + 16 *mW = power; + 17 *KHz = freq; + 18 + 19 return 0; + 20 } + 21 + 22 static int foo_cpufreq_init(struct cpufreq_policy *policy) + 23 { + 24 struct em_data_callback em_cb = EM_DATA_CB(est_power); + 25 int nr_opp, ret; + 26 + 27 /* Do the actual CPUFreq init work ...
*/ + 28 ret = do_foo_cpufreq_init(policy); + 29 if (ret) + 30 return ret; + 31 + 32 /* Find the number of OPPs for this policy */ + 33 nr_opp = foo_get_nr_opp(policy); + 34 + 35 /* And register the new performance domain */ + 36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb); + 37 + 38 return 0; + 39 } diff --git a/Documentation/power/energy-model.txt b/Documentation/power/energy-model.txt deleted file mode 100644 index a2b0ae4c76bd..000000000000 --- a/Documentation/power/energy-model.txt +++ /dev/null @@ -1,144 +0,0 @@ - ==================== - Energy Model of CPUs - ==================== - -1. Overview ------------ - -The Energy Model (EM) framework serves as an interface between drivers knowing -the power consumed by CPUs at various performance levels, and the kernel -subsystems willing to use that information to make energy-aware decisions. - -The source of the information about the power consumed by CPUs can vary greatly -from one platform to another. These power costs can be estimated using -devicetree data in some cases. In others, the firmware will know better. -Alternatively, userspace might be best positioned. And so on. In order to avoid -each and every client subsystem to re-implement support for each and every -possible source of information on its own, the EM framework intervenes as an -abstraction layer which standardizes the format of power cost tables in the -kernel, hence enabling to avoid redundant work. - -The figure below depicts an example of drivers (Arm-specific here, but the -approach is applicable to any architecture) providing power costs to the EM -framework, and interested clients reading the data from it. 
- - +---------------+ +-----------------+ +---------------+ - | Thermal (IPA) | | Scheduler (EAS) | | Other | - +---------------+ +-----------------+ +---------------+ - | | em_pd_energy() | - | | em_cpu_get() | - +---------+ | +---------+ - | | | - v v v - +---------------------+ - | Energy Model | - | Framework | - +---------------------+ - ^ ^ ^ - | | | em_register_perf_domain() - +----------+ | +---------+ - | | | - +---------------+ +---------------+ +--------------+ - | cpufreq-dt | | arm_scmi | | Other | - +---------------+ +---------------+ +--------------+ - ^ ^ ^ - | | | - +--------------+ +---------------+ +--------------+ - | Device Tree | | Firmware | | ? | - +--------------+ +---------------+ +--------------+ - -The EM framework manages power cost tables per 'performance domain' in the -system. A performance domain is a group of CPUs whose performance is scaled -together. Performance domains generally have a 1-to-1 mapping with CPUFreq -policies. All CPUs in a performance domain are required to have the same -micro-architecture. CPUs in different performance domains can have different -micro-architectures. - - -2. Core APIs ------------- - - 2.1 Config options - -CONFIG_ENERGY_MODEL must be enabled to use the EM framework. - - - 2.2 Registration of performance domains - -Drivers are expected to register performance domains into the EM framework by -calling the following API: - - int em_register_perf_domain(cpumask_t *span, unsigned int nr_states, - struct em_data_callback *cb); - -Drivers must specify the CPUs of the performance domains using the cpumask -argument, and provide a callback function returning tuples -for each capacity state. The callback function provided by the driver is free -to fetch data from any relevant location (DT, firmware, ...), and by any mean -deemed necessary. See Section 3. for an example of driver implementing this -callback, and kernel/power/energy_model.c for further documentation on this -API. 
- - - 2.3 Accessing performance domains - -Subsystems interested in the energy model of a CPU can retrieve it using the -em_cpu_get() API. The energy model tables are allocated once upon creation of -the performance domains, and kept in memory untouched. - -The energy consumed by a performance domain can be estimated using the -em_pd_energy() API. The estimation is performed assuming that the schedutil -CPUfreq governor is in use. - -More details about the above APIs can be found in include/linux/energy_model.h. - - -3. Example driver ------------------ - -This section provides a simple example of a CPUFreq driver registering a -performance domain in the Energy Model framework using the (fake) 'foo' -protocol. The driver implements an est_power() function to be provided to the -EM framework. - - -> drivers/cpufreq/foo_cpufreq.c - -01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu) -02 { -03 long freq, power; -04 -05 /* Use the 'foo' protocol to ceil the frequency */ -06 freq = foo_get_freq_ceil(cpu, *KHz); -07 if (freq < 0); -08 return freq; -09 -10 /* Estimate the power cost for the CPU at the relevant freq. */ -11 power = foo_estimate_power(cpu, freq); -12 if (power < 0); -13 return power; -14 -15 /* Return the values to the EM framework */ -16 *mW = power; -17 *KHz = freq; -18 -19 return 0; -20 } -21 -22 static int foo_cpufreq_init(struct cpufreq_policy *policy) -23 { -24 struct em_data_callback em_cb = EM_DATA_CB(est_power); -25 int nr_opp, ret; -26 -27 /* Do the actual CPUFreq init work ... 
*/ -28 ret = do_foo_cpufreq_init(policy); -29 if (ret) -30 return ret; -31 -32 /* Find the number of OPPs for this policy */ -33 nr_opp = foo_get_nr_opp(policy); -34 -35 /* And register the new performance domain */ -36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb); -37 -38 return 0; -39 } diff --git a/Documentation/power/freezing-of-tasks.rst b/Documentation/power/freezing-of-tasks.rst new file mode 100644 index 000000000000..ef110fe55e82 --- /dev/null +++ b/Documentation/power/freezing-of-tasks.rst @@ -0,0 +1,244 @@ +================= +Freezing of tasks +================= + +(C) 2007 Rafael J. Wysocki , GPL + +I. What is the freezing of tasks? +================================= + +The freezing of tasks is a mechanism by which user space processes and some +kernel threads are controlled during hibernation or system-wide suspend (on some +architectures). + +II. How does it work? +===================== + +There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN +and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have +PF_NOFREEZE unset (all user space processes and some kernel threads) are +regarded as 'freezable' and treated in a special way before the system enters a +suspend state as well as before a hibernation image is created (in what follows +we only consider hibernation, but the description also applies to suspend). + +Namely, as the first step of the hibernation procedure the function +freeze_processes() (defined in kernel/power/process.c) is called. A system-wide +variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate +whether the system is to undergo a freezing operation. And freeze_processes() +sets this variable. After this, it executes try_to_freeze_tasks() that sends a +fake signal to all user space processes, and wakes up all the kernel threads. 
+All freezable tasks must react to that by calling try_to_freeze(), which +results in a call to __refrigerator() (defined in kernel/freezer.c), which sets +the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes +it loop until PF_FROZEN is cleared for it. Then, we say that the task is +'frozen' and therefore the set of functions handling this mechanism is referred +to as 'the freezer' (these functions are defined in kernel/power/process.c, +kernel/freezer.c & include/linux/freezer.h). User space processes are generally +frozen before kernel threads. + +__refrigerator() must not be called directly. Instead, use the +try_to_freeze() function (defined in include/linux/freezer.h), that checks +if the task is to be frozen and makes the task enter __refrigerator(). + +For user space processes try_to_freeze() is called automatically from the +signal-handling code, but the freezable kernel threads need to call it +explicitly in suitable places or use the wait_event_freezable() or +wait_event_freezable_timeout() macros (defined in include/linux/freezer.h) +that combine interruptible sleep with checking if the task is to be frozen and +calling try_to_freeze(). The main loop of a freezable kernel thread may look +like the following one:: + + set_freezable(); + do { + hub_events(); + wait_event_freezable(khubd_wait, + !list_empty(&hub_event_list) || + kthread_should_stop()); + } while (!kthread_should_stop() || !list_empty(&hub_event_list)); + +(from drivers/usb/core/hub.c::hub_thread()). + +If a freezable kernel thread fails to call try_to_freeze() after the freezer has +initiated a freezing operation, the freezing of tasks will fail and the entire +hibernation operation will be cancelled. For this reason, freezable kernel +threads must call try_to_freeze() somewhere or use one of the +wait_event_freezable() and wait_event_freezable_timeout() macros. 
+ +After the system memory state has been restored from a hibernation image and +devices have been reinitialized, the function thaw_processes() is called in +order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that +have been frozen leave __refrigerator() and continue running. + + +Rationale behind the functions dealing with freezing and thawing of tasks +------------------------------------------------------------------------- + +freeze_processes(): + - freezes only userspace tasks + +freeze_kernel_threads(): + - freezes all tasks (including kernel threads) because we can't freeze + kernel threads without freezing userspace tasks + +thaw_kernel_threads(): + - thaws only kernel threads; this is particularly useful if we need to do + anything special in between thawing of kernel threads and thawing of + userspace tasks, or if we want to postpone the thawing of userspace tasks + +thaw_processes(): + - thaws all tasks (including kernel threads) because we can't thaw userspace + tasks without thawing kernel threads + + +III. Which kernel threads are freezable? +======================================== + +Kernel threads are not freezable by default. However, a kernel thread may clear +PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE +directly is not allowed). From this point it is regarded as freezable +and must call try_to_freeze() in a suitable place. + +IV. Why do we do that? +====================== + +Generally speaking, there is a couple of reasons to use the freezing of tasks: + +1. The principal reason is to prevent filesystems from being damaged after + hibernation. At the moment we have no simple means of checkpointing + filesystems, so if there are any modifications made to filesystem data and/or + metadata on disks, we cannot bring them back to the state from before the + modifications. 
At the same time each hibernation image contains some + filesystem-related information that must be consistent with the state of the + on-disk data and metadata after the system memory state has been restored + from the image (otherwise the filesystems will be damaged in a nasty way, + usually making them almost impossible to repair). We therefore freeze + tasks that might cause the on-disk filesystems' data and metadata to be + modified after the hibernation image has been created and before the + system is finally powered off. The majority of these are user space + processes, but if any of the kernel threads may cause something like this + to happen, they have to be freezable. + +2. Next, to create the hibernation image we need to free a sufficient amount of + memory (approximately 50% of available RAM) and we need to do that before + devices are deactivated, because we generally need them for swapping out. + Then, after the memory for the image has been freed, we don't want tasks + to allocate additional memory and we prevent them from doing that by + freezing them earlier. [Of course, this also means that device drivers + should not allocate substantial amounts of memory from their .suspend() + callbacks before hibernation, but this is a separate issue.] + +3. The third reason is to prevent user space processes and some kernel threads + from interfering with the suspending and resuming of devices. A user space + process running on a second CPU while we are suspending devices may, for + example, be troublesome and without the freezing of tasks we would need some + safeguards against race conditions that might occur in such a case. + +Although Linus Torvalds doesn't like the freezing of tasks, he said this in one +of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608): + +"RJW:> Why we freeze tasks at all or why we freeze kernel threads? + +Linus: In many ways, 'at all'. 
+ +I **do** realize the IO request queue issues, and that we cannot actually do +s2ram with some devices in the middle of a DMA. So we want to be able to +avoid *that*, there's no question about that. And I suspect that stopping +user threads and then waiting for a sync is practically one of the easier +ways to do so. + +So in practice, the 'at all' may become a 'why freeze kernel threads?' and +freezing user threads I don't find really objectionable." + +Still, there are kernel threads that may want to be freezable. For example, if +a kernel thread that belongs to a device driver accesses the device directly, it +in principle needs to know when the device is suspended, so that it doesn't try +to access it at that time. However, if the kernel thread is freezable, it will +be frozen before the driver's .suspend() callback is executed and it will be +thawed after the driver's .resume() callback has run, so it won't be accessing +the device while it's suspended. + +4. Another reason for freezing tasks is to prevent user space processes from + realizing that hibernation (or suspend) operation takes place. Ideally, user + space processes should not notice that such a system-wide operation has + occurred and should continue running without any problems after the restore + (or resume from suspend). Unfortunately, in the most general case this + is quite difficult to achieve without the freezing of tasks. Consider, + for example, a process that depends on all CPUs being online while it's + running. Since we need to disable nonboot CPUs during the hibernation, + if this process is not frozen, it may notice that the number of CPUs has + changed and may start to work incorrectly because of that. + +V. Are there any problems related to the freezing of tasks? +=========================================================== + +Yes, there are. + +First of all, the freezing of kernel threads may be tricky if they depend one +on another. 
For example, if kernel thread A waits for a completion (in the +TASK_UNINTERRUPTIBLE state) that needs to be done by freezable kernel thread B +and B is frozen in the meantime, then A will be blocked until B is thawed, which +may be undesirable. That's why kernel threads are not freezable by default. + +Second, there are the following two problems related to the freezing of user +space processes: + +1. Putting processes into an uninterruptible sleep distorts the load average. +2. Now that we have FUSE, plus the framework for doing device drivers in + userspace, it gets even more complicated because some userspace processes are + now doing the sorts of things that kernel threads do + (https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html). + +The problem 1. seems to be fixable, although it hasn't been fixed so far. The +other one is more serious, but it seems that we can work around it by using +hibernation (and suspend) notifiers (in that case, though, we won't be able to +avoid the realization by the user space processes that the hibernation is taking +place). + +There are also problems that the freezing of tasks tends to expose, although +they are not directly related to it. For example, if request_firmware() is +called from a device driver's .resume() routine, it will timeout and eventually +fail, because the user land process that should respond to the request is frozen +at this point. So, seemingly, the failure is due to the freezing of tasks. +Suppose, however, that the firmware file is located on a filesystem accessible +only through another device that hasn't been resumed yet. In that case, +request_firmware() will fail regardless of whether or not the freezing of tasks +is used. Consequently, the problem is not really related to the freezing of +tasks, since it generally exists anyway. + +A driver must have all firmwares it may need in RAM before suspend() is called. 
+If keeping them is not practical, for example due to their size, they must be +requested early enough using the suspend notifier API described in +Documentation/driver-api/pm/notifiers.rst. + +VI. Are there any precautions to be taken to prevent freezing failures? +======================================================================= + +Yes, there are. + +First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a piece of code +from system-wide sleep such as suspend/hibernation is not encouraged. +If possible, that piece of code must instead hook onto the suspend/hibernation +notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code +(kernel/cpu.c) for an example. + +However, if that is not feasible, and grabbing 'system_transition_mutex' is deemed necessary, +it is strongly discouraged to directly call mutex_[un]lock(&system_transition_mutex) since +that could lead to freezing failures, because if the suspend/hibernate code +successfully acquired the 'system_transition_mutex' lock, and hence that other entity failed +to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE +state. As a consequence, the freezer would not be able to freeze that task, +leading to freezing failure. + +However, the [un]lock_system_sleep() APIs are safe to use in this scenario, +since they ask the freezer to skip freezing this task, since it is anyway +"frozen enough" as it is blocked on 'system_transition_mutex', which will be released +only after the entire suspend/hibernation sequence is complete. +So, to summarize, use [un]lock_system_sleep() instead of directly using +mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures. + +VII. Miscellaneous +================== + +/sys/power/pm_freeze_timeout sets the maximum time allowed for freezing +all user space processes or all freezable kernel threads, in milliseconds. +The default value is 20000, and the valid range is that of an unsigned integer.
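The freeze timeout described at the end of freezing-of-tasks.rst can be inspected from user space. What follows is a minimal sketch and not part of the patch. It assumes the sysfs file reads back as a decimal number of milliseconds followed by a newline. The helper names parse_freeze_timeout() and read_freeze_timeout() are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Path named in the documentation above. */
#define PM_FREEZE_TIMEOUT_PATH "/sys/power/pm_freeze_timeout"

/*
 * Hypothetical helper: parse the text form of the timeout (decimal
 * milliseconds, optionally followed by one newline) into *ms.
 * Returns 0 on success, -EINVAL on malformed input and -ERANGE when
 * the value does not fit in an unsigned int.
 */
int parse_freeze_timeout(const char *text, unsigned int *ms)
{
	char *end;
	unsigned long val;

	/* Reject empty strings, signs and leading whitespace up front. */
	if (text[0] < '0' || text[0] > '9')
		return -EINVAL;

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno == ERANGE || val > UINT_MAX)
		return -ERANGE;

	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*ms = (unsigned int)val;
	return 0;
}

/*
 * Hypothetical helper: read the current timeout from the given path.
 * Returns 0 on success or a negative errno value.
 */
int read_freeze_timeout(const char *path, unsigned int *ms)
{
	char buf[32];
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -errno;

	ret = fgets(buf, sizeof(buf), f) ? parse_freeze_timeout(buf, ms) : -EIO;
	fclose(f);
	return ret;
}
```

A caller would typically use read_freeze_timeout(PM_FREEZE_TIMEOUT_PATH, &ms) and treat a negative return as "timeout unknown". The parsing is kept separate from the file access so that it can be exercised without a running kernel that exposes the file.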
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt deleted file mode 100644 index cd283190855a..000000000000 --- a/Documentation/power/freezing-of-tasks.txt +++ /dev/null @@ -1,231 +0,0 @@ -Freezing of tasks - (C) 2007 Rafael J. Wysocki , GPL - -I. What is the freezing of tasks? - -The freezing of tasks is a mechanism by which user space processes and some -kernel threads are controlled during hibernation or system-wide suspend (on some -architectures). - -II. How does it work? - -There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN -and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have -PF_NOFREEZE unset (all user space processes and some kernel threads) are -regarded as 'freezable' and treated in a special way before the system enters a -suspend state as well as before a hibernation image is created (in what follows -we only consider hibernation, but the description also applies to suspend). - -Namely, as the first step of the hibernation procedure the function -freeze_processes() (defined in kernel/power/process.c) is called. A system-wide -variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate -whether the system is to undergo a freezing operation. And freeze_processes() -sets this variable. After this, it executes try_to_freeze_tasks() that sends a -fake signal to all user space processes, and wakes up all the kernel threads. -All freezable tasks must react to that by calling try_to_freeze(), which -results in a call to __refrigerator() (defined in kernel/freezer.c), which sets -the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes -it loop until PF_FROZEN is cleared for it. Then, we say that the task is -'frozen' and therefore the set of functions handling this mechanism is referred -to as 'the freezer' (these functions are defined in kernel/power/process.c, -kernel/freezer.c & include/linux/freezer.h). 
User space processes are generally -frozen before kernel threads. - -__refrigerator() must not be called directly. Instead, use the -try_to_freeze() function (defined in include/linux/freezer.h), that checks -if the task is to be frozen and makes the task enter __refrigerator(). - -For user space processes try_to_freeze() is called automatically from the -signal-handling code, but the freezable kernel threads need to call it -explicitly in suitable places or use the wait_event_freezable() or -wait_event_freezable_timeout() macros (defined in include/linux/freezer.h) -that combine interruptible sleep with checking if the task is to be frozen and -calling try_to_freeze(). The main loop of a freezable kernel thread may look -like the following one: - - set_freezable(); - do { - hub_events(); - wait_event_freezable(khubd_wait, - !list_empty(&hub_event_list) || - kthread_should_stop()); - } while (!kthread_should_stop() || !list_empty(&hub_event_list)); - -(from drivers/usb/core/hub.c::hub_thread()). - -If a freezable kernel thread fails to call try_to_freeze() after the freezer has -initiated a freezing operation, the freezing of tasks will fail and the entire -hibernation operation will be cancelled. For this reason, freezable kernel -threads must call try_to_freeze() somewhere or use one of the -wait_event_freezable() and wait_event_freezable_timeout() macros. - -After the system memory state has been restored from a hibernation image and -devices have been reinitialized, the function thaw_processes() is called in -order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that -have been frozen leave __refrigerator() and continue running. 
- - -Rationale behind the functions dealing with freezing and thawing of tasks: -------------------------------------------------------------------------- - -freeze_processes(): - - freezes only userspace tasks - -freeze_kernel_threads(): - - freezes all tasks (including kernel threads) because we can't freeze - kernel threads without freezing userspace tasks - -thaw_kernel_threads(): - - thaws only kernel threads; this is particularly useful if we need to do - anything special in between thawing of kernel threads and thawing of - userspace tasks, or if we want to postpone the thawing of userspace tasks - -thaw_processes(): - - thaws all tasks (including kernel threads) because we can't thaw userspace - tasks without thawing kernel threads - - -III. Which kernel threads are freezable? - -Kernel threads are not freezable by default. However, a kernel thread may clear -PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE -directly is not allowed). From this point it is regarded as freezable -and must call try_to_freeze() in a suitable place. - -IV. Why do we do that? - -Generally speaking, there is a couple of reasons to use the freezing of tasks: - -1. The principal reason is to prevent filesystems from being damaged after -hibernation. At the moment we have no simple means of checkpointing -filesystems, so if there are any modifications made to filesystem data and/or -metadata on disks, we cannot bring them back to the state from before the -modifications. At the same time each hibernation image contains some -filesystem-related information that must be consistent with the state of the -on-disk data and metadata after the system memory state has been restored from -the image (otherwise the filesystems will be damaged in a nasty way, usually -making them almost impossible to repair). 
We therefore freeze tasks that might -cause the on-disk filesystems' data and metadata to be modified after the -hibernation image has been created and before the system is finally powered off. -The majority of these are user space processes, but if any of the kernel threads -may cause something like this to happen, they have to be freezable. - -2. Next, to create the hibernation image we need to free a sufficient amount of -memory (approximately 50% of available RAM) and we need to do that before -devices are deactivated, because we generally need them for swapping out. Then, -after the memory for the image has been freed, we don't want tasks to allocate -additional memory and we prevent them from doing that by freezing them earlier. -[Of course, this also means that device drivers should not allocate substantial -amounts of memory from their .suspend() callbacks before hibernation, but this -is a separate issue.] - -3. The third reason is to prevent user space processes and some kernel threads -from interfering with the suspending and resuming of devices. A user space -process running on a second CPU while we are suspending devices may, for -example, be troublesome and without the freezing of tasks we would need some -safeguards against race conditions that might occur in such a case. - -Although Linus Torvalds doesn't like the freezing of tasks, he said this in one -of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608): - -"RJW:> Why we freeze tasks at all or why we freeze kernel threads? - -Linus: In many ways, 'at all'. - -I _do_ realize the IO request queue issues, and that we cannot actually do -s2ram with some devices in the middle of a DMA. So we want to be able to -avoid *that*, there's no question about that. And I suspect that stopping -user threads and then waiting for a sync is practically one of the easier -ways to do so. - -So in practice, the 'at all' may become a 'why freeze kernel threads?' 
and -freezing user threads I don't find really objectionable." - -Still, there are kernel threads that may want to be freezable. For example, if -a kernel thread that belongs to a device driver accesses the device directly, it -in principle needs to know when the device is suspended, so that it doesn't try -to access it at that time. However, if the kernel thread is freezable, it will -be frozen before the driver's .suspend() callback is executed and it will be -thawed after the driver's .resume() callback has run, so it won't be accessing -the device while it's suspended. - -4. Another reason for freezing tasks is to prevent user space processes from -realizing that hibernation (or suspend) operation takes place. Ideally, user -space processes should not notice that such a system-wide operation has occurred -and should continue running without any problems after the restore (or resume -from suspend). Unfortunately, in the most general case this is quite difficult -to achieve without the freezing of tasks. Consider, for example, a process -that depends on all CPUs being online while it's running. Since we need to -disable nonboot CPUs during the hibernation, if this process is not frozen, it -may notice that the number of CPUs has changed and may start to work incorrectly -because of that. - -V. Are there any problems related to the freezing of tasks? - -Yes, there are. - -First of all, the freezing of kernel threads may be tricky if they depend one -on another. For example, if kernel thread A waits for a completion (in the -TASK_UNINTERRUPTIBLE state) that needs to be done by freezable kernel thread B -and B is frozen in the meantime, then A will be blocked until B is thawed, which -may be undesirable. That's why kernel threads are not freezable by default. - -Second, there are the following two problems related to the freezing of user -space processes: -1. Putting processes into an uninterruptible sleep distorts the load average. -2. 
Now that we have FUSE, plus the framework for doing device drivers in -userspace, it gets even more complicated because some userspace processes are -now doing the sorts of things that kernel threads do -(https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html). - -The problem 1. seems to be fixable, although it hasn't been fixed so far. The -other one is more serious, but it seems that we can work around it by using -hibernation (and suspend) notifiers (in that case, though, we won't be able to -avoid the realization by the user space processes that the hibernation is taking -place). - -There are also problems that the freezing of tasks tends to expose, although -they are not directly related to it. For example, if request_firmware() is -called from a device driver's .resume() routine, it will timeout and eventually -fail, because the user land process that should respond to the request is frozen -at this point. So, seemingly, the failure is due to the freezing of tasks. -Suppose, however, that the firmware file is located on a filesystem accessible -only through another device that hasn't been resumed yet. In that case, -request_firmware() will fail regardless of whether or not the freezing of tasks -is used. Consequently, the problem is not really related to the freezing of -tasks, since it generally exists anyway. - -A driver must have all firmwares it may need in RAM before suspend() is called. -If keeping them is not practical, for example due to their size, they must be -requested early enough using the suspend notifier API described in -Documentation/driver-api/pm/notifiers.rst. - -VI. Are there any precautions to be taken to prevent freezing failures? - -Yes, there are. - -First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a piece of code -from system-wide sleep such as suspend/hibernation is not encouraged. 
-If possible, that piece of code must instead hook onto the suspend/hibernation -notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code -(kernel/cpu.c) for an example. - -However, if that is not feasible, and grabbing 'system_transition_mutex' is deemed necessary, -it is strongly discouraged to directly call mutex_[un]lock(&system_transition_mutex) since -that could lead to freezing failures, because if the suspend/hibernate code -successfully acquired the 'system_transition_mutex' lock, and hence that other entity failed -to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE -state. As a consequence, the freezer would not be able to freeze that task, -leading to freezing failure. - -However, the [un]lock_system_sleep() APIs are safe to use in this scenario, -since they ask the freezer to skip freezing this task, since it is anyway -"frozen enough" as it is blocked on 'system_transition_mutex', which will be released -only after the entire suspend/hibernation sequence is complete. -So, to summarize, use [un]lock_system_sleep() instead of directly using -mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures. - -V. Miscellaneous -/sys/power/pm_freeze_timeout controls how long it will cost at most to freeze -all user space processes or all freezable kernel threads, in unit of millisecond. -The default value is 20000, with range of unsigned integer. diff --git a/Documentation/power/index.rst b/Documentation/power/index.rst new file mode 100644 index 000000000000..20415f21e48a --- /dev/null +++ b/Documentation/power/index.rst @@ -0,0 +1,46 @@ +:orphan: + +================ +Power Management +================ + +.. 
toctree:: + :maxdepth: 1 + + apm-acpi + basic-pm-debugging + charger-manager + drivers-testing + energy-model + freezing-of-tasks + interface + opp + pci + pm_qos_interface + power_supply_class + runtime_pm + s2ram + suspend-and-cpuhotplug + suspend-and-interrupts + swsusp-and-swap-files + swsusp-dmcrypt + swsusp + video + tricks + + userland-swsusp + + powercap/powercap + + regulator/consumer + regulator/design + regulator/machine + regulator/overview + regulator/regulator + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/power/interface.rst b/Documentation/power/interface.rst new file mode 100644 index 000000000000..8d270ed27228 --- /dev/null +++ b/Documentation/power/interface.rst @@ -0,0 +1,79 @@ +=========================================== +Power Management Interface for System Sleep +=========================================== + +Copyright (c) 2016 Intel Corp., Rafael J. Wysocki + +The power management subsystem provides userspace with a unified sysfs interface +for system sleep regardless of the underlying system architecture or platform. +The interface is located in the /sys/power/ directory (assuming that sysfs is +mounted at /sys). + +/sys/power/state is the system sleep state control file. + +Reading from it returns a list of supported sleep states, encoded as: + +- 'freeze' (Suspend-to-Idle) +- 'standby' (Power-On Suspend) +- 'mem' (Suspend-to-RAM) +- 'disk' (Suspend-to-Disk) + +Suspend-to-Idle is always supported. Suspend-to-Disk is always supported +too as long as the kernel has been configured to support hibernation at all +(i.e. CONFIG_HIBERNATION is set in the kernel configuration file). Support +for Suspend-to-RAM and Power-On Suspend depends on the capabilities of the +platform. + +If one of the strings listed in /sys/power/state is written to it, the system +will attempt to transition into the corresponding sleep state.
Refer to +Documentation/admin-guide/pm/sleep-states.rst for a description of each of +those states. + +/sys/power/disk controls the operating mode of hibernation (Suspend-to-Disk). +Specifically, it tells the kernel what to do after creating a hibernation image. + +Reading from it returns a list of supported options encoded as: + +- 'platform' (put the system into sleep using a platform-provided method) +- 'shutdown' (shut the system down) +- 'reboot' (reboot the system) +- 'suspend' (trigger a Suspend-to-RAM transition) +- 'test_resume' (resume-after-hibernation test mode) + +The currently selected option is printed in square brackets. + +The 'platform' option is only available if the platform provides a special +mechanism to put the system to sleep after creating a hibernation image (ACPI +does that, for example). The 'suspend' option is available if Suspend-to-RAM +is supported. Refer to Documentation/power/basic-pm-debugging.rst for the +description of the 'test_resume' option. + +To select an option, write the string representing it to /sys/power/disk. + +/sys/power/image_size controls the size of hibernation images. + +It can be written a string representing a non-negative integer that will be +used as a best-effort upper limit of the image size, in bytes. The hibernation +core will do its best to ensure that the image size will not exceed that number. +However, if that turns out to be impossible to achieve, a hibernation image will +still be created and its size will be as small as possible. In particular, +writing '0' to this file will enforce hibernation images to be as small as +possible. + +Reading from this file returns the current image size limit, which is set to +around 2/5 of available RAM by default. + +/sys/power/pm_trace controls the PM trace mechanism saving the last suspend +or resume event point in the RTC across reboots. 
+ +It helps to debug hard lockups or reboots due to device driver failures that +occur during system suspend or resume (which is more common) more effectively. + +If /sys/power/pm_trace contains '1', the fingerprint of each suspend/resume +event point in turn will be stored in the RTC memory (overwriting the actual +RTC information), so it will survive a system crash if one occurs right after +storing it and it can be used later to identify the driver that caused the crash +to happen (see Documentation/power/s2ram.rst for more information). + +Initially it contains '0' which may be changed to '1' by writing a string +representing a nonzero integer into it. diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt deleted file mode 100644 index 27df7f98668a..000000000000 --- a/Documentation/power/interface.txt +++ /dev/null @@ -1,77 +0,0 @@ -Power Management Interface for System Sleep - -Copyright (c) 2016 Intel Corp., Rafael J. Wysocki - -The power management subsystem provides userspace with a unified sysfs interface -for system sleep regardless of the underlying system architecture or platform. -The interface is located in the /sys/power/ directory (assuming that sysfs is -mounted at /sys). - -/sys/power/state is the system sleep state control file. - -Reading from it returns a list of supported sleep states, encoded as: - -'freeze' (Suspend-to-Idle) -'standby' (Power-On Suspend) -'mem' (Suspend-to-RAM) -'disk' (Suspend-to-Disk) - -Suspend-to-Idle is always supported. Suspend-to-Disk is always supported -too as long the kernel has been configured to support hibernation at all -(ie. CONFIG_HIBERNATION is set in the kernel configuration file). Support -for Suspend-to-RAM and Power-On Suspend depends on the capabilities of the -platform. - -If one of the strings listed in /sys/power/state is written to it, the system -will attempt to transition into the corresponding sleep state. 
Refer to -Documentation/admin-guide/pm/sleep-states.rst for a description of each of -those states. - -/sys/power/disk controls the operating mode of hibernation (Suspend-to-Disk). -Specifically, it tells the kernel what to do after creating a hibernation image. - -Reading from it returns a list of supported options encoded as: - -'platform' (put the system into sleep using a platform-provided method) -'shutdown' (shut the system down) -'reboot' (reboot the system) -'suspend' (trigger a Suspend-to-RAM transition) -'test_resume' (resume-after-hibernation test mode) - -The currently selected option is printed in square brackets. - -The 'platform' option is only available if the platform provides a special -mechanism to put the system to sleep after creating a hibernation image (ACPI -does that, for example). The 'suspend' option is available if Suspend-to-RAM -is supported. Refer to Documentation/power/basic-pm-debugging.txt for the -description of the 'test_resume' option. - -To select an option, write the string representing it to /sys/power/disk. - -/sys/power/image_size controls the size of hibernation images. - -It can be written a string representing a non-negative integer that will be -used as a best-effort upper limit of the image size, in bytes. The hibernation -core will do its best to ensure that the image size will not exceed that number. -However, if that turns out to be impossible to achieve, a hibernation image will -still be created and its size will be as small as possible. In particular, -writing '0' to this file will enforce hibernation images to be as small as -possible. - -Reading from this file returns the current image size limit, which is set to -around 2/5 of available RAM by default. - -/sys/power/pm_trace controls the PM trace mechanism saving the last suspend -or resume event point in the RTC across reboots. 
- -It helps to debug hard lockups or reboots due to device driver failures that -occur during system suspend or resume (which is more common) more effectively. - -If /sys/power/pm_trace contains '1', the fingerprint of each suspend/resume -event point in turn will be stored in the RTC memory (overwriting the actual -RTC information), so it will survive a system crash if one occurs right after -storing it and it can be used later to identify the driver that caused the crash -to happen (see Documentation/power/s2ram.txt for more information). - -Initially it contains '0' which may be changed to '1' by writing a string -representing a nonzero integer into it. diff --git a/Documentation/power/opp.rst b/Documentation/power/opp.rst new file mode 100644 index 000000000000..b3cf1def9dee --- /dev/null +++ b/Documentation/power/opp.rst @@ -0,0 +1,379 @@ +========================================== +Operating Performance Points (OPP) Library +========================================== + +(C) 2009-2010 Nishanth Menon , Texas Instruments Incorporated + +.. Contents + + 1. Introduction + 2. Initial OPP List Registration + 3. OPP Search Functions + 4. OPP Availability Control Functions + 5. OPP Data Retrieval Functions + 6. Data Structures + +1. Introduction +=============== + +1.1 What is an Operating Performance Point (OPP)? +------------------------------------------------- + +Complex SoCs of today consist of multiple sub-modules working in conjunction. +In an operational system executing varied use cases, not all modules in the SoC +need to function at their highest performing frequency all the time. To +facilitate this, sub-modules in a SoC are grouped into domains, allowing some +domains to run at lower voltage and frequency while other domains run at +voltage/frequency pairs that are higher. + +The set of discrete tuples consisting of frequency and voltage pairs that +the device will support per domain are called Operating Performance Points or +OPPs.
+ +As an example: + +Let us consider an MPU device which supports the following: +{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V}, +{1GHz at minimum voltage of 1.3V} + +We can represent these as three OPPs as the following {Hz, uV} tuples: + +- {300000000, 1000000} +- {800000000, 1200000} +- {1000000000, 1300000} + +1.2 Operating Performance Points Library +---------------------------------------- + +OPP library provides a set of helper functions to organize and query the OPP +information. The library is located in drivers/base/power/opp.c and the header +is located in include/linux/pm_opp.h. OPP library can be enabled by enabling +CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on +CONFIG_PM as certain SoCs such as Texas Instruments' OMAP framework allow +optionally booting at a certain OPP without needing cpufreq. + +Typical usage of the OPP library is as follows:: + + (users) -> registers a set of default OPPs -> (library) + SoC framework -> modifies on required cases certain OPPs -> OPP layer + -> queries to search/retrieve information -> + +OPP layer expects each domain to be represented by a unique device pointer. SoC +framework registers a set of initial OPPs per device with the OPP layer. This +list is expected to be optimally small, typically around 5 OPPs per device. +This initial list contains a set of OPPs that the framework expects to be safely +enabled by default in the system. + +Note on OPP Availability +^^^^^^^^^^^^^^^^^^^^^^^^ + +As the system proceeds to operate, SoC framework may choose to make certain +OPPs available or not available on each device based on various external +factors. Example usage: Thermal management or other exceptional situations where +SoC framework might choose to disable a higher frequency OPP to safely continue +operations until that OPP could be re-enabled if possible. + +OPP library facilitates this concept in its implementation.
The following +operational functions operate only on available opps: +dev_pm_opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count + +dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then +be used for dev_pm_opp_enable/disable functions to make an opp available as required. + +WARNING: Users of OPP library should refresh their availability count using +dev_pm_opp_get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device. The +exact mechanism to trigger these or the notification mechanism to other +dependent subsystems such as cpufreq is left to the discretion of the SoC +specific framework which uses the OPP library. Similar care needs to be taken +to refresh the cpufreq table in cases of these operations. + +2. Initial OPP List Registration +================================ +The SoC implementation calls dev_pm_opp_add function iteratively to add OPPs per +device. It is expected that the SoC framework will register the OPP entries +optimally; typical numbers are expected to be less than 5. The list generated by +registering the OPPs is maintained by OPP library throughout the device +operation. The SoC framework can subsequently control the availability of the +OPPs dynamically using the dev_pm_opp_enable / disable functions. + +dev_pm_opp_add + Add a new OPP for a specific domain represented by the device pointer. + The OPP is defined using the frequency and voltage. Once added, the OPP + is assumed to be available and control of its availability can be done + with the dev_pm_opp_enable/disable functions. OPP library internally stores + and manages this information in the opp struct. This function may be + used by SoC framework to define an optimal list as per the demands of + SoC usage environment. + + WARNING: + Do not use this function in interrupt context.
+ + Example:: + + soc_pm_init() + { + /* Do things */ + r = dev_pm_opp_add(mpu_dev, 1000000, 900000); + if (r) { + pr_err("%s: unable to register mpu opp(%d)\n", __func__, r); + goto no_cpufreq; + } + /* Do cpufreq things */ + no_cpufreq: + /* Do remaining things */ + } + +3. OPP Search Functions +======================= +High level framework such as cpufreq operates on frequencies. To map the +frequency back to the corresponding OPP, OPP library provides handy functions +to search the OPP list that OPP library internally manages. These search +functions return the matching pointer representing the opp if a match is +found, else return an error. These errors are expected to be handled by standard +error checks such as IS_ERR() and appropriate actions taken by the caller. + +Callers of these functions shall call dev_pm_opp_put() after they have used the +OPP. Otherwise the memory for the OPP will never get freed and result in +a memory leak. + +dev_pm_opp_find_freq_exact + Search for an OPP based on an *exact* frequency and + availability. This function is especially useful to enable an OPP which + is not available by default. + Example: In a case when SoC framework detects a situation where a + higher frequency could be made available, it can use this function to + find the OPP prior to calling dev_pm_opp_enable to actually make + it available:: + + opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); + /* don't operate on the pointer.. just do a sanity check.. */ + if (IS_ERR(opp)) { + pr_err("frequency not disabled!\n"); + /* trigger appropriate actions.. */ + } else { + /* drop the reference only once the lookup is known to have succeeded */ + dev_pm_opp_put(opp); + dev_pm_opp_enable(dev,1000000000); + } + + NOTE: + This is the only search function that operates on OPPs which are + not available. + +dev_pm_opp_find_freq_floor + Search for an available OPP which is *at most* the + provided frequency. This function is useful while searching for a lesser + match OR operating on OPP information in the order of decreasing + frequency.
+ Example: To find the highest opp for a device:: + + freq = ULONG_MAX; + opp = dev_pm_opp_find_freq_floor(dev, &freq); + dev_pm_opp_put(opp); + +dev_pm_opp_find_freq_ceil + Search for an available OPP which is *at least* the + provided frequency. This function is useful while searching for a + higher match OR operating on OPP information in the order of increasing + frequency. + Example 1: To find the lowest opp for a device:: + + freq = 0; + opp = dev_pm_opp_find_freq_ceil(dev, &freq); + dev_pm_opp_put(opp); + + Example 2: A simplified implementation of a SoC cpufreq_driver->target:: + + soc_cpufreq_target(..) + { + /* Do stuff like policy checks etc. */ + /* Find the best frequency match for the req */ + opp = dev_pm_opp_find_freq_ceil(dev, &freq); + if (!IS_ERR(opp)) { + /* release the reference only after a successful lookup */ + dev_pm_opp_put(opp); + soc_switch_to_freq_voltage(freq); + } else { + /* do something when we can't satisfy the req */ + } + /* do other stuff */ + } + +4. OPP Availability Control Functions +===================================== +A default OPP list registered with the OPP library may not cater to all possible +situations. The OPP library provides a set of functions to modify the +availability of an OPP within the OPP list. This allows SoC frameworks to have +fine grained dynamic control of which sets of OPPs are operationally available. +These functions are intended to *temporarily* remove an OPP in conditions such +as thermal considerations (e.g. don't use OPPx until the temperature drops). + +WARNING: + Do not use these functions in interrupt context. + +dev_pm_opp_enable + Make an OPP available for operation. + Example: Let's say that the 1GHz OPP is to be made available only if the + SoC temperature is lower than a certain threshold.
The SoC framework + implementation might choose to do something as follows:: + + if (cur_temp < temp_low_thresh) { + /* Enable 1GHz if it was disabled */ + opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); + /* just error check */ + if (IS_ERR(opp)) + goto try_something_else; + dev_pm_opp_put(opp); + ret = dev_pm_opp_enable(dev, 1000000000); + } + +dev_pm_opp_disable + Make an OPP unavailable for operation. + Example: Let's say that the 1GHz OPP is to be disabled if the temperature + exceeds a threshold value. The SoC framework implementation might + choose to do something as follows:: + + if (cur_temp > temp_high_thresh) { + /* Disable 1GHz if it was enabled */ + opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true); + /* just error check */ + if (IS_ERR(opp)) + goto try_something_else; + dev_pm_opp_put(opp); + ret = dev_pm_opp_disable(dev, 1000000000); + } + +5. OPP Data Retrieval Functions +=============================== +Since OPP library abstracts away the OPP information, a set of functions to pull +information from the OPP structure is necessary. Once an OPP pointer is +retrieved using the search functions, the following functions can be used by SoC +framework to retrieve the information represented inside the OPP layer. + +dev_pm_opp_get_voltage + Retrieve the voltage represented by the opp pointer. + Example: At a cpufreq transition to a different frequency, the SoC + framework needs to set the voltage represented by the OPP, using + the regulator framework, on the Power Management chip providing the + voltage:: + + soc_switch_to_freq_voltage(freq) + { + /* do things */ + opp = dev_pm_opp_find_freq_ceil(dev, &freq); + v = dev_pm_opp_get_voltage(opp); + dev_pm_opp_put(opp); + if (v) + regulator_set_voltage(.., v); + /* do other things */ + } + +dev_pm_opp_get_freq + Retrieve the freq represented by the opp pointer.
+ Example: Let's say the SoC framework uses a couple of helper functions; + we could pass opp pointers instead of passing additional parameters to + handle quite a bit of data:: + + soc_cpufreq_target(..) + { + /* do things.. */ + max_freq = ULONG_MAX; + max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq); + requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq); + if (!IS_ERR(max_opp) && !IS_ERR(requested_opp)) + r = soc_test_validity(max_opp, requested_opp); + /* only put the OPPs that were actually found */ + if (!IS_ERR(max_opp)) + dev_pm_opp_put(max_opp); + if (!IS_ERR(requested_opp)) + dev_pm_opp_put(requested_opp); + /* do other things */ + } + soc_test_validity(..) + { + if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp)) + return -EINVAL; + if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp)) + return -EINVAL; + /* do things.. */ + } + +dev_pm_opp_get_opp_count + Retrieve the number of available opps for a device + Example: Let's say a co-processor in the SoC needs to know the available + frequencies in a table; the main processor can notify it as follows:: + + soc_notify_coproc_available_frequencies() + { + /* Do things */ + num_available = dev_pm_opp_get_opp_count(dev); + speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL); + /* populate the table in increasing order */ + freq = 0; + while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) { + speeds[i] = freq; + freq++; + i++; + dev_pm_opp_put(opp); + } + + soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available); + /* Do other things */ + } + +6. Data Structures +================== +Typically an SoC contains multiple voltage domains which are variable. Each +domain is represented by a device pointer. The relationship to OPP can be +represented as follows:: + + SoC + |- device 1 + | |- opp 1 (availability, freq, voltage) + | |- opp 2 .. + ... ... + | `- opp n .. + |- device 2 + ... + `- device m + +OPP library maintains an internal list that the SoC framework populates and +that is accessed by various functions as described above.
However, the structures +representing the actual OPPs and domains are internal to the OPP library itself +to allow for suitable abstraction reusable across systems. + +struct dev_pm_opp + The internal data structure of the OPP library which is used to + represent an OPP. In addition to the freq, voltage, availability + information, it also contains internal bookkeeping information required + for the OPP library to operate on. A pointer to this structure is + provided back to users such as the SoC framework to be used as an + identifier for the OPP in interactions with the OPP layer. + + WARNING: + The struct dev_pm_opp pointer should not be parsed or modified by the + users. The defaults for an instance are populated by + dev_pm_opp_add, but the availability of the OPP can be modified + by dev_pm_opp_enable/disable functions. + +struct device + This is used to identify a domain to the OPP layer. The + nature of the device and its implementation is left to the user of the + OPP library such as the SoC framework. + +Overall, in a simplistic view, the data structure operations are represented as +follows:: + + Initialization / modification: + +-----+ /- dev_pm_opp_enable + dev_pm_opp_add --> | opp | <------- + | +-----+ \- dev_pm_opp_disable + \-------> domain_info(device) + + Search functions: + /-- dev_pm_opp_find_freq_ceil ---\ +-----+ + domain_info<---- dev_pm_opp_find_freq_exact -----> | opp | + \-- dev_pm_opp_find_freq_floor ---/ +-----+ + + Retrieval functions: + +-----+ /- dev_pm_opp_get_voltage + | opp | <--- + +-----+ \- dev_pm_opp_get_freq + + domain_info <- dev_pm_opp_get_opp_count diff --git a/Documentation/power/opp.txt b/Documentation/power/opp.txt deleted file mode 100644 index 0c007e250cd1..000000000000 --- a/Documentation/power/opp.txt +++ /dev/null @@ -1,342 +0,0 @@ -Operating Performance Points (OPP) Library -========================================== - -(C) 2009-2010 Nishanth Menon , Texas Instruments Incorporated - -Contents --------- -1. Introduction -2.
Initial OPP List Registration -3. OPP Search Functions -4. OPP Availability Control Functions -5. OPP Data Retrieval Functions -6. Data Structures - -1. Introduction -=============== -1.1 What is an Operating Performance Point (OPP)? - -Complex SoCs of today consists of a multiple sub-modules working in conjunction. -In an operational system executing varied use cases, not all modules in the SoC -need to function at their highest performing frequency all the time. To -facilitate this, sub-modules in a SoC are grouped into domains, allowing some -domains to run at lower voltage and frequency while other domains run at -voltage/frequency pairs that are higher. - -The set of discrete tuples consisting of frequency and voltage pairs that -the device will support per domain are called Operating Performance Points or -OPPs. - -As an example: -Let us consider an MPU device which supports the following: -{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V}, -{1GHz at minimum voltage of 1.3V} - -We can represent these as three OPPs as the following {Hz, uV} tuples: -{300000000, 1000000} -{800000000, 1200000} -{1000000000, 1300000} - -1.2 Operating Performance Points Library - -OPP library provides a set of helper functions to organize and query the OPP -information. The library is located in drivers/base/power/opp.c and the header -is located in include/linux/pm_opp.h. OPP library can be enabled by enabling -CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on -CONFIG_PM as certain SoCs such as Texas Instrument's OMAP framework allows to -optionally boot at a certain OPP without needing cpufreq. - -Typical usage of the OPP library is as follows: -(users) -> registers a set of default OPPs -> (library) -SoC framework -> modifies on required cases certain OPPs -> OPP layer - -> queries to search/retrieve information -> - -OPP layer expects each domain to be represented by a unique device pointer. 
SoC -framework registers a set of initial OPPs per device with the OPP layer. This -list is expected to be an optimally small number typically around 5 per device. -This initial list contains a set of OPPs that the framework expects to be safely -enabled by default in the system. - -Note on OPP Availability: ------------------------- -As the system proceeds to operate, SoC framework may choose to make certain -OPPs available or not available on each device based on various external -factors. Example usage: Thermal management or other exceptional situations where -SoC framework might choose to disable a higher frequency OPP to safely continue -operations until that OPP could be re-enabled if possible. - -OPP library facilitates this concept in it's implementation. The following -operational functions operate only on available opps: -opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count - -dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then -be used for dev_pm_opp_enable/disable functions to make an opp available as required. - -WARNING: Users of OPP library should refresh their availability count using -get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device, the -exact mechanism to trigger these or the notification mechanism to other -dependent subsystems such as cpufreq are left to the discretion of the SoC -specific framework which uses the OPP library. Similar care needs to be taken -care to refresh the cpufreq table in cases of these operations. - -2. Initial OPP List Registration -================================ -The SoC implementation calls dev_pm_opp_add function iteratively to add OPPs per -device. It is expected that the SoC framework will register the OPP entries -optimally- typical numbers range to be less than 5. The list generated by -registering the OPPs is maintained by OPP library throughout the device -operation. 
The SoC framework can subsequently control the availability of the -OPPs dynamically using the dev_pm_opp_enable / disable functions. - -dev_pm_opp_add - Add a new OPP for a specific domain represented by the device pointer. - The OPP is defined using the frequency and voltage. Once added, the OPP - is assumed to be available and control of it's availability can be done - with the dev_pm_opp_enable/disable functions. OPP library internally stores - and manages this information in the opp struct. This function may be - used by SoC framework to define a optimal list as per the demands of - SoC usage environment. - - WARNING: Do not use this function in interrupt context. - - Example: - soc_pm_init() - { - /* Do things */ - r = dev_pm_opp_add(mpu_dev, 1000000, 900000); - if (!r) { - pr_err("%s: unable to register mpu opp(%d)\n", r); - goto no_cpufreq; - } - /* Do cpufreq things */ - no_cpufreq: - /* Do remaining things */ - } - -3. OPP Search Functions -======================= -High level framework such as cpufreq operates on frequencies. To map the -frequency back to the corresponding OPP, OPP library provides handy functions -to search the OPP list that OPP library internally manages. These search -functions return the matching pointer representing the opp if a match is -found, else returns error. These errors are expected to be handled by standard -error checks such as IS_ERR() and appropriate actions taken by the caller. - -Callers of these functions shall call dev_pm_opp_put() after they have used the -OPP. Otherwise the memory for the OPP will never get freed and result in -memleak. - -dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and - availability. This function is especially useful to enable an OPP which - is not available by default. 
- Example: In a case when SoC framework detects a situation where a - higher frequency could be made available, it can use this function to - find the OPP prior to call the dev_pm_opp_enable to actually make it available. - opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); - dev_pm_opp_put(opp); - /* dont operate on the pointer.. just do a sanity check.. */ - if (IS_ERR(opp)) { - pr_err("frequency not disabled!\n"); - /* trigger appropriate actions.. */ - } else { - dev_pm_opp_enable(dev,1000000000); - } - - NOTE: This is the only search function that operates on OPPs which are - not available. - -dev_pm_opp_find_freq_floor - Search for an available OPP which is *at most* the - provided frequency. This function is useful while searching for a lesser - match OR operating on OPP information in the order of decreasing - frequency. - Example: To find the highest opp for a device: - freq = ULONG_MAX; - opp = dev_pm_opp_find_freq_floor(dev, &freq); - dev_pm_opp_put(opp); - -dev_pm_opp_find_freq_ceil - Search for an available OPP which is *at least* the - provided frequency. This function is useful while searching for a - higher match OR operating on OPP information in the order of increasing - frequency. - Example 1: To find the lowest opp for a device: - freq = 0; - opp = dev_pm_opp_find_freq_ceil(dev, &freq); - dev_pm_opp_put(opp); - Example 2: A simplified implementation of a SoC cpufreq_driver->target: - soc_cpufreq_target(..) - { - /* Do stuff like policy checks etc. */ - /* Find the best frequency match for the req */ - opp = dev_pm_opp_find_freq_ceil(dev, &freq); - dev_pm_opp_put(opp); - if (!IS_ERR(opp)) - soc_switch_to_freq_voltage(freq); - else - /* do something when we can't satisfy the req */ - /* do other stuff */ - } - -4. OPP Availability Control Functions -===================================== -A default OPP list registered with the OPP library may not cater to all possible -situation. 
The OPP library provides a set of functions to modify the -availability of a OPP within the OPP list. This allows SoC frameworks to have -fine grained dynamic control of which sets of OPPs are operationally available. -These functions are intended to *temporarily* remove an OPP in conditions such -as thermal considerations (e.g. don't use OPPx until the temperature drops). - -WARNING: Do not use these functions in interrupt context. - -dev_pm_opp_enable - Make a OPP available for operation. - Example: Lets say that 1GHz OPP is to be made available only if the - SoC temperature is lower than a certain threshold. The SoC framework - implementation might choose to do something as follows: - if (cur_temp < temp_low_thresh) { - /* Enable 1GHz if it was disabled */ - opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); - dev_pm_opp_put(opp); - /* just error check */ - if (!IS_ERR(opp)) - ret = dev_pm_opp_enable(dev, 1000000000); - else - goto try_something_else; - } - -dev_pm_opp_disable - Make an OPP to be not available for operation - Example: Lets say that 1GHz OPP is to be disabled if the temperature - exceeds a threshold value. The SoC framework implementation might - choose to do something as follows: - if (cur_temp > temp_high_thresh) { - /* Disable 1GHz if it was enabled */ - opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true); - dev_pm_opp_put(opp); - /* just error check */ - if (!IS_ERR(opp)) - ret = dev_pm_opp_disable(dev, 1000000000); - else - goto try_something_else; - } - -5. OPP Data Retrieval Functions -=============================== -Since OPP library abstracts away the OPP information, a set of functions to pull -information from the OPP structure is necessary. Once an OPP pointer is -retrieved using the search functions, the following functions can be used by SoC -framework to retrieve the information represented inside the OPP layer. - -dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer. 
- Example: At a cpufreq transition to a different frequency, SoC - framework requires to set the voltage represented by the OPP using - the regulator framework to the Power Management chip providing the - voltage. - soc_switch_to_freq_voltage(freq) - { - /* do things */ - opp = dev_pm_opp_find_freq_ceil(dev, &freq); - v = dev_pm_opp_get_voltage(opp); - dev_pm_opp_put(opp); - if (v) - regulator_set_voltage(.., v); - /* do other things */ - } - -dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer. - Example: Lets say the SoC framework uses a couple of helper functions - we could pass opp pointers instead of doing additional parameters to - handle quiet a bit of data parameters. - soc_cpufreq_target(..) - { - /* do things.. */ - max_freq = ULONG_MAX; - max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq); - requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq); - if (!IS_ERR(max_opp) && !IS_ERR(requested_opp)) - r = soc_test_validity(max_opp, requested_opp); - dev_pm_opp_put(max_opp); - dev_pm_opp_put(requested_opp); - /* do other things */ - } - soc_test_validity(..) - { - if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp)) - return -EINVAL; - if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp)) - return -EINVAL; - /* do things.. */ - } - -dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device - Example: Lets say a co-processor in the SoC needs to know the available - frequencies in a table, the main processor can notify as following: - soc_notify_coproc_available_frequencies() - { - /* Do things */ - num_available = dev_pm_opp_get_opp_count(dev); - speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL); - /* populate the table in increasing order */ - freq = 0; - while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) { - speeds[i] = freq; - freq++; - i++; - dev_pm_opp_put(opp); - } - - soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available); - /* Do other things */ - } - -6. 
Data Structures -================== -Typically an SoC contains multiple voltage domains which are variable. Each -domain is represented by a device pointer. The relationship to OPP can be -represented as follows: -SoC - |- device 1 - | |- opp 1 (availability, freq, voltage) - | |- opp 2 .. - ... ... - | `- opp n .. - |- device 2 - ... - `- device m - -OPP library maintains a internal list that the SoC framework populates and -accessed by various functions as described above. However, the structures -representing the actual OPPs and domains are internal to the OPP library itself -to allow for suitable abstraction reusable across systems. - -struct dev_pm_opp - The internal data structure of OPP library which is used to - represent an OPP. In addition to the freq, voltage, availability - information, it also contains internal book keeping information required - for the OPP library to operate on. Pointer to this structure is - provided back to the users such as SoC framework to be used as a - identifier for OPP in the interactions with OPP layer. - - WARNING: The struct dev_pm_opp pointer should not be parsed or modified by the - users. The defaults of for an instance is populated by dev_pm_opp_add, but the - availability of the OPP can be modified by dev_pm_opp_enable/disable functions. - -struct device - This is used to identify a domain to the OPP layer. The - nature of the device and it's implementation is left to the user of - OPP library such as the SoC framework. 
- -Overall, in a simplistic view, the data structure operations is represented as -following: - -Initialization / modification: - +-----+ /- dev_pm_opp_enable -dev_pm_opp_add --> | opp | <------- - | +-----+ \- dev_pm_opp_disable - \-------> domain_info(device) - -Search functions: - /-- dev_pm_opp_find_freq_ceil ---\ +-----+ -domain_info<---- dev_pm_opp_find_freq_exact -----> | opp | - \-- dev_pm_opp_find_freq_floor ---/ +-----+ - -Retrieval functions: -+-----+ /- dev_pm_opp_get_voltage -| opp | <--- -+-----+ \- dev_pm_opp_get_freq - -domain_info <- dev_pm_opp_get_opp_count diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst new file mode 100644 index 000000000000..0e2ef7429304 --- /dev/null +++ b/Documentation/power/pci.rst @@ -0,0 +1,1135 @@ +==================== +PCI Power Management +==================== + +Copyright (c) 2010 Rafael J. Wysocki , Novell Inc. + +An overview of concepts and the Linux kernel's interfaces related to PCI power +management. Based on previous work by Patrick Mochel +(and others). + +This document only covers the aspects of power management specific to PCI +devices. For general description of the kernel's interfaces related to device +power management refer to Documentation/driver-api/pm/devices.rst and +Documentation/power/runtime_pm.rst. + +.. contents: + + 1. Hardware and Platform Support for PCI Power Management + 2. PCI Subsystem and Device Power Management + 3. PCI Device Drivers and Power Management + 4. Resources + + +1. Hardware and Platform Support for PCI Power Management +========================================================= + +1.1. Native and Platform-Based Power Management +----------------------------------------------- + +In general, power management is a feature allowing one to save energy by putting +devices into states in which they draw less power (low-power states) at the +price of reduced functionality or performance. 
+ +Usually, a device is put into a low-power state when it is underutilized or +completely inactive. However, when it is necessary to use the device once +again, it has to be put back into the "fully functional" state (full-power +state). This may happen when there are some data for the device to handle or +as a result of an external event requiring the device to be active, which may +be signaled by the device itself. + +PCI devices may be put into low-power states in two ways, by using the device +capabilities introduced by the PCI Bus Power Management Interface Specification, +or with the help of platform firmware, such as an ACPI BIOS. In the first +approach, that is referred to as the native PCI power management (native PCI PM) +in what follows, the device power state is changed as a result of writing a +specific value into one of its standard configuration registers. The second +approach requires the platform firmware to provide special methods that may be +used by the kernel to change the device's power state. + +Devices supporting the native PCI PM usually can generate wakeup signals called +Power Management Events (PMEs) to let the kernel know about external events +requiring the device to be active. After receiving a PME the kernel is supposed +to put the device that sent it into the full-power state. However, the PCI Bus +Power Management Interface Specification doesn't define any standard method of +delivering the PME from the device to the CPU and the operating system kernel. +It is assumed that the platform firmware will perform this task and therefore, +even though a PCI device is set up to generate PMEs, it also may be necessary to +prepare the platform firmware for notifying the CPU of the PMEs coming from the +device (e.g. by generating interrupts). 
+ +In turn, if the methods provided by the platform firmware are used for changing +the power state of a device, usually the platform also provides a method for +preparing the device to generate wakeup signals. In that case, however, it +often also is necessary to prepare the device for generating PMEs using the +native PCI PM mechanism, because the method provided by the platform depends on +that. + +Thus in many situations both the native and the platform-based power management +mechanisms have to be used simultaneously to obtain the desired result. + +1.2. Native PCI Power Management +-------------------------------- + +The PCI Bus Power Management Interface Specification (PCI PM Spec) was +introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a +standard interface for performing various operations related to power +management. + +The implementation of the PCI PM Spec is optional for conventional PCI devices, +but it is mandatory for PCI Express devices. If a device supports the PCI PM +Spec, it has an 8 byte power management capability field in its PCI +configuration space. This field is used to describe and control the standard +features related to the native PCI power management. + +The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses +(B0-B3). The higher the number, the less power is drawn by the device or bus +in that state. However, the higher the number, the longer the latency for +the device or bus to return to the full-power state (D0 or B0, respectively). + +There are two variants of the D3 state defined by the specification. The first +one is D3hot, referred to as the software accessible D3, because devices can be +programmed to go into it. The second one, D3cold, is the state that PCI devices +are in when the supply voltage (Vcc) is removed from them. 
It is not possible +to program a PCI device to go into D3cold, although there may be a programmable +interface for putting the bus the device is on into a state in which Vcc is +removed from all devices on the bus. + +PCI bus power management, however, is not supported by the Linux kernel at the +time of this writing and therefore it is not covered by this document. + +Note that every PCI device can be in the full-power state (D0) or in D3cold, +regardless of whether or not it implements the PCI PM Spec. In addition to +that, if the PCI PM Spec is implemented by the device, it must support D3hot +as well as D0. The support for the D1 and D2 power states is optional. + +PCI devices supporting the PCI PM Spec can be programmed to go to any of the +supported low-power states (except for D3cold). While in D1-D3hot the +standard configuration registers of the device must be accessible to software +(i.e. the device is required to respond to PCI configuration accesses), although +its I/O and memory spaces are then disabled. This allows the device to be +programmatically put into D0. Thus the kernel can switch the device back and +forth between D0 and the supported low-power states (except for D3cold) and the +possible power state transitions the device can undergo are the following: + ++----------------------------+ +| Current State | New State | ++----------------------------+ +| D0 | D1, D2, D3 | ++----------------------------+ +| D1 | D2, D3 | ++----------------------------+ +| D2 | D3 | ++----------------------------+ +| D1, D2, D3 | D0 | ++----------------------------+ + +The transition from D3cold to D0 occurs when the supply voltage is provided to +the device (i.e. power is restored). In that case the device returns to D0 with +a full power-on reset sequence and the power-on defaults are restored to the +device by hardware just as at initial power up. 
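The programmable transitions listed in the table earlier in this section can be modeled in a few lines of C. What follows is a minimal userspace sketch, not kernel code: the enum and the function name pm_transition_allowed() are hypothetical and exist only for this illustration, and D3 is taken to mean D3hot because D3cold cannot be entered by programming the device.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the software-programmable states (not pci_power_t). */
enum pm_state { PM_D0 = 0, PM_D1 = 1, PM_D2 = 2, PM_D3HOT = 3 };

/*
 * Return true if the transition table lists from -> to:
 * either a move to a deeper (higher-numbered) state, or a
 * return to D0 from any low-power state.
 */
bool pm_transition_allowed(enum pm_state from, enum pm_state to)
{
	/* Both arguments must name one of the four modeled states. */
	assert((int)from >= PM_D0 && (int)from <= PM_D3HOT);
	assert((int)to >= PM_D0 && (int)to <= PM_D3HOT);

	if (to > from)
		return true;	/* D0->D1/D2/D3, D1->D2/D3, D2->D3 */

	return to == PM_D0 && from != PM_D0;	/* D1/D2/D3 -> D0 */
}
```

Under this model a device in D2 cannot be moved directly to D1; it would first have to return to D0.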
+ +PCI devices supporting the PCI PM Spec can be programmed to generate PMEs +while in a low-power state (D1-D3), but they are not required to be capable +of generating PMEs from all supported low-power states. In particular, the +capability of generating PMEs from D3cold is optional and depends on the +presence of additional voltage (3.3Vaux) allowing the device to remain +sufficiently active to generate a wakeup signal. + +1.3. ACPI Device Power Management +--------------------------------- + +The platform firmware support for the power management of PCI devices is +system-specific. However, if the system in question is compliant with the +Advanced Configuration and Power Interface (ACPI) Specification, like the +majority of x86-based systems, it is supposed to implement device power +management interfaces defined by the ACPI standard. + +For this purpose the ACPI BIOS provides special functions called "control +methods" that may be executed by the kernel to perform specific tasks, such as +putting a device into a low-power state. These control methods are encoded +using special byte-code language called the ACPI Machine Language (AML) and +stored in the machine's BIOS. The kernel loads them from the BIOS and executes +them as needed using an AML interpreter that translates the AML byte code into +computations and memory or I/O space accesses. This way, in theory, a BIOS +writer can provide the kernel with a means to perform actions depending +on the system design in a system-specific fashion. + +ACPI control methods may be divided into global control methods, that are not +associated with any particular devices, and device control methods, that have +to be defined separately for each device supposed to be handled with the help of +the platform. This means, in particular, that ACPI device control methods can +only be used to handle devices that the BIOS writer knew about in advance. The +ACPI methods used for device power management fall into that category. 
+ +The ACPI specification assumes that devices can be in one of four power states +labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM +D0-D3 states (although the difference between D3hot and D3cold is not taken +into account by ACPI). Moreover, for each power state of a device there is a +set of power resources that have to be enabled for the device to be put into +that state. These power resources are controlled (i.e. enabled or disabled) +with the help of their own control methods, _ON and _OFF, that have to be +defined individually for each of them. + +To put a device into the ACPI power state Dx (where x is a number between 0 and +3 inclusive) the kernel is supposed to (1) enable the power resources required +by the device in this state using their _ON control methods and (2) execute the +_PSx control method defined for the device. In addition to that, if the device +is going to be put into a low-power state (D1-D3) and is supposed to generate +wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI +3.0) control method defined for it has to be executed before _PSx. Power +resources that are not required by the device in the target power state and are +not required any more by any other device should be disabled (by executing their +_OFF control methods). If the current power state of the device is D3, it can +only be put into D0 this way. + +However, quite often the power states of devices are changed during a +system-wide transition into a sleep state or back into the working state. ACPI +defines four system sleep states, S1, S2, S3, and S4, and denotes the system +working state as S0. In general, the target system sleep (or working) state +determines the highest power (lowest number) state the device can be put +into and the kernel is supposed to obtain this information by executing the +device's _SxD control method (where x is a number between 0 and 4 inclusive). 
+If the device is required to wake up the system from the target sleep state, the +lowest power (highest number) state it can be put into is also determined by the +target state of the system. The kernel is then supposed to use the device's +_SxW control method to obtain the number of that state. It also is supposed to +use the device's _PRW control method to learn which power resources need to be +enabled for the device to be able to generate wakeup signals. + +1.4. Wakeup Signaling +--------------------- + +Wakeup signals generated by PCI devices, either as native PCI PMEs, or as +a result of the execution of the _DSW (or _PSW) ACPI control method before +putting the device into a low-power state, have to be caught and handled as +appropriate. If they are sent while the system is in the working state +(ACPI S0), they should be translated into interrupts so that the kernel can +put the devices generating them into the full-power state and take care of the +events that triggered them. In turn, if they are sent while the system is +sleeping, they should cause the system's core logic to trigger wakeup. + +On ACPI-based systems wakeup signals sent by conventional PCI devices are +converted into ACPI General-Purpose Events (GPEs) which are hardware signals +from the system core logic generated in response to various events that need to +be acted upon. Every GPE is associated with one or more sources of potentially +interesting events. In particular, a GPE may be associated with a PCI device +capable of signaling wakeup. The information on the connections between GPEs +and event sources is recorded in the system's ACPI BIOS from where it can be +read by the kernel. + +If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE +associated with it (if there is one) is triggered. 
The GPEs associated with PCI +bridges may also be triggered in response to a wakeup signal from one of the +devices below the bridge (this also is the case for root bridges) and, for +example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be +handled this way. + +A GPE may be triggered when the system is sleeping (i.e. when it is in one of +the ACPI S1-S4 states), in which case system wakeup is started by its core logic +(the device that was the source of the signal causing the system wakeup to occur +may be identified later). The GPEs used in such situations are referred to as +wakeup GPEs. + +Usually, however, GPEs are also triggered when the system is in the working +state (ACPI S0) and in that case the system's core logic generates a System +Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI +handler identifies the GPE that caused the interrupt to be generated which, +in turn, allows the kernel to identify the source of the event (that may be +a PCI device signaling wakeup). The GPEs used for notifying the kernel of +events occurring while the system is in the working state are referred to as +runtime GPEs. + +Unfortunately, there is no standard way of handling wakeup signals sent by +conventional PCI devices on systems that are not ACPI-based, but there is one +for PCI Express devices. Namely, the PCI Express Base Specification introduced +a native mechanism for converting native PCI PMEs into interrupts generated by +root ports. For conventional PCI devices native PMEs are out-of-band, so they +are routed separately and they need not pass through bridges (in principle they +may be routed directly to the system's core logic), but for PCI Express devices +they are in-band messages that have to pass through the PCI Express hierarchy, +including the root port on the path from the device to the Root Complex. 
Thus +it was possible to introduce a mechanism by which a root port generates an +interrupt whenever it receives a PME message from one of the devices below it. +The PCI Express Requester ID of the device that sent the PME message is then +recorded in one of the root port's configuration registers from where it may be +read by the interrupt handler allowing the device to be identified. [PME +messages sent by PCI Express endpoints integrated with the Root Complex don't +pass through root ports, but instead they cause a Root Complex Event Collector +(if there is one) to generate interrupts.] + +In principle the native PCI Express PME signaling may also be used on ACPI-based +systems along with the GPEs, but to use it the kernel has to ask the system's +ACPI BIOS to release control of root port configuration registers. The ACPI +BIOS, however, is not required to allow the kernel to control these registers +and if it doesn't do that, the kernel must not modify their contents. Of course +the native PCI Express PME signaling cannot be used by the kernel in that case. + + +2. PCI Subsystem and Device Power Management +============================================ + +2.1. Device Power Management Callbacks +-------------------------------------- + +The PCI Subsystem participates in the power management of PCI devices in a +number of ways. First of all, it provides an intermediate code layer between +the device power management core (PM core) and PCI device drivers. 
+Specifically, the pm field of the PCI subsystem's struct bus_type object, +pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing +pointers to several device power management callbacks:: + + const struct dev_pm_ops pci_dev_pm_ops = { + .prepare = pci_pm_prepare, + .complete = pci_pm_complete, + .suspend = pci_pm_suspend, + .resume = pci_pm_resume, + .freeze = pci_pm_freeze, + .thaw = pci_pm_thaw, + .poweroff = pci_pm_poweroff, + .restore = pci_pm_restore, + .suspend_noirq = pci_pm_suspend_noirq, + .resume_noirq = pci_pm_resume_noirq, + .freeze_noirq = pci_pm_freeze_noirq, + .thaw_noirq = pci_pm_thaw_noirq, + .poweroff_noirq = pci_pm_poweroff_noirq, + .restore_noirq = pci_pm_restore_noirq, + .runtime_suspend = pci_pm_runtime_suspend, + .runtime_resume = pci_pm_runtime_resume, + .runtime_idle = pci_pm_runtime_idle, + }; + +These callbacks are executed by the PM core in various situations related to +device power management and they, in turn, execute power management callbacks +provided by PCI device drivers. They also perform power management operations +involving some standard configuration registers of PCI devices that device +drivers need not know or care about. + +The structure representing a PCI device, struct pci_dev, contains several fields +that these callbacks operate on:: + + struct pci_dev { + ... + pci_power_t current_state; /* Current operating state. */ + int pm_cap; /* PM capability offset in the + configuration space */ + unsigned int pme_support:5; /* Bitmask of states from which PME# + can be generated */ + unsigned int pme_interrupt:1;/* Is native PCIe PME signaling used? */ + unsigned int d1_support:1; /* Low power state D1 is supported */ + unsigned int d2_support:1; /* Low power state D2 is supported */ + unsigned int no_d1d2:1; /* D1 and D2 are forbidden */ + unsigned int wakeup_prepared:1; /* Device prepared for wake up */ + unsigned int d3_delay; /* D3->D0 transition time in ms */ + ... 
+ }; + +They also indirectly use some fields of the struct device that is embedded in +struct pci_dev. + +2.2. Device Initialization +-------------------------- + +The PCI subsystem's first task related to device power management is to +prepare the device for power management and initialize the fields of struct +pci_dev used for this purpose. This happens in two functions defined in +drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init(). + +The first of these functions checks if the device supports native PCI PM +and if that's the case the offset of its power management capability structure +in the configuration space is stored in the pm_cap field of the device's struct +pci_dev object. Next, the function checks which PCI low-power states are +supported by the device and from which low-power states the device can generate +native PCI PMEs. The power management fields of the device's struct pci_dev and +the struct device embedded in it are updated accordingly and the generation of +PMEs by the device is disabled. + +The second function checks if the device can be prepared to signal wakeup with +the help of the platform firmware, such as the ACPI BIOS. If that is the case, +the function updates the wakeup fields in struct device embedded in the +device's struct pci_dev and uses the firmware-provided method to prevent the +device from signaling wakeup. + +At this point the device is ready for power management. For driverless devices, +however, this functionality is limited to a few basic operations carried out +during system-wide transitions to a sleep state and back to the working state. + +2.3. Runtime Device Power Management +------------------------------------ + +The PCI subsystem plays a vital role in the runtime power management of PCI +devices. For this purpose it uses the general runtime power management +(runtime PM) framework described in Documentation/power/runtime_pm.rst. 
+Namely, it provides subsystem-level callbacks:: + + pci_pm_runtime_suspend() + pci_pm_runtime_resume() + pci_pm_runtime_idle() + +that are executed by the core runtime PM routines. It also implements the +entire mechanics necessary for handling runtime wakeup signals from PCI devices +in low-power states, which at the time of this writing works for both the native +PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in +Section 1. + +First, a PCI device is put into a low-power state, or suspended, with the help +of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call +pci_pm_runtime_suspend() to do the actual job. For this to work, the device's +driver has to provide a pm->runtime_suspend() callback (see below), which is +run by pci_pm_runtime_suspend() as the first action. If the driver's callback +returns successfully, the device's standard configuration registers are saved, +the device is prepared to generate wakeup signals and, finally, it is put into +the target low-power state. + +The low-power state to put the device into is the lowest-power (highest number) +state from which it can signal wakeup. The exact method of signaling wakeup is +system-dependent and is determined by the PCI subsystem on the basis of the +reported capabilities of the device and the platform firmware. To prepare the +device for signaling wakeup and put it into the selected low-power state, the +PCI subsystem can use the platform firmware as well as the device's native PCI +PM capabilities, if supported. + +It is expected that the device driver's pm->runtime_suspend() callback will +not attempt to prepare the device for signaling wakeup or to put it into a +low-power state. The driver ought to leave these tasks to the PCI subsystem +that has all of the information necessary to perform them. 
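The ordering described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every name below (struct model_dev, model_runtime_suspend() and so on) is hypothetical and is not a kernel API, and the error value returned when no driver callback exists is a choice made for this model only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum model_state { MODEL_D0 = 0, MODEL_D1, MODEL_D2, MODEL_D3HOT };

struct model_dev {
	int (*drv_runtime_suspend)(struct model_dev *dev); /* driver's callback */
	enum model_state target;  /* low-power state chosen by the subsystem */
	enum model_state state;   /* current power state */
	bool state_saved;         /* standard config registers saved? */
	bool wakeup_armed;        /* prepared to generate wakeup signals? */
};

/* Models the subsystem-level step order; returns 0 or an error code. */
static int model_runtime_suspend(struct model_dev *dev)
{
	int err;

	if (!dev->drv_runtime_suspend)
		return -1;	/* model-only error: a driver callback is required */

	err = dev->drv_runtime_suspend(dev);	/* 1. driver callback runs first */
	if (err)
		return err;	/* on failure the device is left untouched */

	dev->state_saved = true;	/* 2. save standard config registers */
	dev->wakeup_armed = true;	/* 3. prepare wakeup signaling */
	dev->state = dev->target;	/* 4. enter the target low-power state */
	return 0;
}

/* Two toy driver callbacks: one succeeds, one refuses to suspend. */
static int model_drv_ok(struct model_dev *dev) { (void)dev; return 0; }
static int model_drv_busy(struct model_dev *dev) { (void)dev; return -16; }
```

Note that the driver callbacks in the model never touch the power state or the wakeup setting; those steps belong to the subsystem-level routine, which mirrors the division of labor described in the text.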
+ +A suspended device is brought back into the "active" state, or resumed, +with the help of pm_request_resume() or pm_runtime_resume() which both call +pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's +driver provides a pm->runtime_resume() callback (see below). However, before +the driver's callback is executed, pci_pm_runtime_resume() brings the device +back into the full-power state, prevents it from signaling wakeup while in that +state and restores its standard configuration registers. Thus the driver's +callback need not worry about the PCI-specific aspects of the device resume. + +Note that generally pci_pm_runtime_resume() may be called in two different +situations. First, it may be called at the request of the device's driver, for +example if there are some data for it to process. Second, it may be called +as a result of a wakeup signal from the device itself (this sometimes is +referred to as "remote wakeup"). Of course, for this purpose the wakeup signal +is handled in one of the ways described in Section 1 and finally converted into +a notification for the PCI subsystem after the source device has been +identified. + +The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle() +and pm_request_idle(), executes the device driver's pm->runtime_idle() +callback, if defined, and if that callback doesn't return error code (or is not +present at all), suspends the device with the help of pm_runtime_suspend(). +Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for +example, it is called right after the device has just been resumed), in which +cases it is expected to suspend the device if that makes sense. Usually, +however, the PCI subsystem doesn't really know if the device really can be +suspended, so it lets the device's driver decide by running its +pm->runtime_idle() callback. + +2.4. 
System-Wide Power Transitions +---------------------------------- +There are a few different types of system-wide power transitions, described in +Documentation/driver-api/pm/devices.rst. Each of them requires devices to be handled +in a specific way and the PM core executes subsystem-level power management +callbacks for this purpose. They are executed in phases such that each phase +involves executing the same subsystem-level callback for every device belonging +to the given subsystem before the next phase begins. These phases always run +after tasks have been frozen. + +2.4.1. System Suspend +^^^^^^^^^^^^^^^^^^^^^ + +When the system is going into a sleep state in which the contents of memory will +be preserved, such as one of the ACPI sleep states S1-S3, the phases are: + + prepare, suspend, suspend_noirq. + +The following PCI bus type's callbacks, respectively, are used in these phases:: + + pci_pm_prepare() + pci_pm_suspend() + pci_pm_suspend_noirq() + +The pci_pm_prepare() routine first puts the device into the "fully functional" +state with the help of pm_runtime_resume(). Then, it executes the device +driver's pm->prepare() callback if defined (i.e. if the driver's struct +dev_pm_ops object is present and the prepare pointer in that object is valid). + +The pci_pm_suspend() routine first checks if the device's driver implements +legacy PCI suspend routines (see Section 3), in which case the driver's legacy +suspend callback is executed, if present, and its result is returned. Next, if +the device's driver doesn't provide a struct dev_pm_ops object (containing +pointers to the driver's callbacks), pci_pm_default_suspend() is called, which +simply turns off the device's bus master capability and runs +pcibios_disable_device() to disable it, unless the device is a bridge (PCI +bridges are ignored by this routine). Next, the device driver's pm->suspend() +callback is executed, if defined, and its result is returned if it fails. 
+Finally, pci_fixup_device() is called to apply hardware suspend quirks related +to the device if necessary. + +Note that the suspend phase is carried out asynchronously for PCI devices, so +the pci_pm_suspend() callback may be executed in parallel for any pair of PCI +devices that don't depend on each other in a known way (i.e. none of the paths +in the device tree from the root bridge to a leaf device contains both of them). + +The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has +been called, which means that the device driver's interrupt handler won't be +invoked while this routine is running. It first checks if the device's driver +implements legacy PCI suspend routines (Section 3), in which case the legacy +late suspend routine is called and its result is returned (the standard +configuration registers of the device are saved if the driver's callback hasn't +done that). Second, if the device driver's struct dev_pm_ops object is not +present, the device's standard configuration registers are saved and the routine +returns success. Otherwise the device driver's pm->suspend_noirq() callback is +executed, if present, and its result is returned if it fails. Next, if the +device's standard configuration registers haven't been saved yet (one of the +device driver's callbacks executed before might have done that), pci_pm_suspend_noirq() +saves them, prepares the device to signal wakeup (if necessary) and puts it into +a low-power state. + +The low-power state to put the device into is the lowest-power (highest number) +state from which it can signal wakeup while the system is in the target sleep +state. Just like in the runtime PM case described above, the mechanism of +signaling wakeup is system-dependent and determined by the PCI subsystem, which +is also responsible for preparing the device to signal wakeup from the system's +target sleep state as appropriate.
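The rule "lowest-power (highest number) state from which the device can signal wakeup" can be sketched as a search over a bitmask like the pme_support field shown earlier. The sketch below is a simplified userspace model, not the kernel's implementation: the names are hypothetical, the mapping of bit n to state Dn is an assumption of the model, the platform firmware's role is ignored, and falling back to D0 when no other state qualifies is a choice made here for illustration.

```c
#include <assert.h>

/* Hypothetical model of power states; higher number means lower power. */
enum model_pstate {
	PS_D0 = 0,
	PS_D1 = 1,
	PS_D2 = 2,
	PS_D3HOT = 3,
	PS_D3COLD = 4
};

/*
 * Assumption of this model: bit n of wake_mask is set if the device can
 * signal wakeup from state n. Search from the deepest allowed state
 * upwards and return the first state that supports wakeup.
 */
static enum model_pstate
model_deepest_wakeup_state(unsigned int wake_mask, enum model_pstate deepest)
{
	int s;

	for (s = deepest; s > PS_D0; s--) {
		if (wake_mask & (1u << s))
			return (enum model_pstate)s;
	}
	return PS_D0;	/* model choice: stay in D0 if nothing deeper qualifies */
}
```

For example, with wakeup possible from D1 and D2 only (mask 0x06) and D3hot as the deepest state allowed, the model picks D2.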
+ +PCI device drivers (that don't implement legacy power management callbacks) are +generally not expected to prepare devices for signaling wakeup or to put them +into low-power states. However, if one of the driver's suspend callbacks +(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration +registers, pci_pm_suspend_noirq() will assume that the device has been prepared +to signal wakeup and put into a low-power state by the driver (the driver is +then assumed to have used the helper functions provided by the PCI subsystem for +this purpose). PCI device drivers are not encouraged to do that, but in some +rare cases doing that in the driver may be the optimum approach. + +2.4.2. System Resume +^^^^^^^^^^^^^^^^^^^^ + +When the system is undergoing a transition from a sleep state in which the +contents of memory have been preserved, such as one of the ACPI sleep states +S1-S3, into the working state (ACPI S0), the phases are: + + resume_noirq, resume, complete. + +The following PCI bus type's callbacks, respectively, are executed in these +phases:: + + pci_pm_resume_noirq() + pci_pm_resume() + pci_pm_complete() + +The pci_pm_resume_noirq() routine first puts the device into the full-power +state, restores its standard configuration registers and applies early resume +hardware quirks related to the device, if necessary. This is done +unconditionally, regardless of whether or not the device's driver implements +legacy PCI power management callbacks (this way all PCI devices are in the +full-power state and their standard configuration registers have been restored +when their interrupt handlers are invoked for the first time during resume, +which allows the kernel to avoid problems with the handling of shared interrupts +by drivers whose devices are still suspended). If legacy PCI power management +callbacks (see Section 3) are implemented by the device's driver, the legacy +early resume callback is executed and its result is returned. 
Otherwise, the +device driver's pm->resume_noirq() callback is executed, if defined, and its +result is returned. + +The pci_pm_resume() routine first checks if the device's standard configuration +registers have been restored and restores them if that's not the case (this +is only necessary in the error path during a failing suspend). Next, resume +hardware quirks related to the device are applied, if necessary, and if the +device's driver implements legacy PCI power management callbacks (see +Section 3), the driver's legacy resume callback is executed and its result is +returned. Otherwise, the device's wakeup signaling mechanisms are blocked and +its driver's pm->resume() callback is executed, if defined (the callback's +result is then returned). + +The resume phase is carried out asynchronously for PCI devices, like the +suspend phase described above, which means that if two PCI devices don't depend +on each other in a known way, the pci_pm_resume() routine may be executed for +both of them in parallel. + +The pci_pm_complete() routine only executes the device driver's pm->complete() +callback, if defined. + +2.4.3. System Hibernation +^^^^^^^^^^^^^^^^^^^^^^^^^ + +System hibernation is more complicated than system suspend, because it requires +a system image to be created and written into a persistent storage medium. The +image is created atomically and all devices are quiesced, or frozen, before that +happens. + +The freezing of devices is carried out after enough memory has been freed (at +the time of this writing the image creation requires at least 50% of system RAM +to be free) in the following three phases: + + prepare, freeze, freeze_noirq + +that correspond to the PCI bus type's callbacks:: + + pci_pm_prepare() + pci_pm_freeze() + pci_pm_freeze_noirq() + +This means that the prepare phase is exactly the same as for system suspend. +The other two phases, however, are different.
+ +The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs +the device driver's pm->freeze() callback, if defined, instead of pm->suspend(), +and it doesn't apply the suspend-related hardware quirks. It is executed +asynchronously for different PCI devices that don't depend on each other in a +known way. + +The pci_pm_freeze_noirq() routine, in turn, is similar to +pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq() +routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the +device for signaling wakeup and put it into a low-power state. Still, it saves +the device's standard configuration registers if they haven't been saved by one +of the driver's callbacks. + +Once the image has been created, it has to be saved. However, at this point all +devices are frozen and they cannot handle I/O, while their ability to handle +I/O is obviously necessary for the image saving. Thus they have to be brought +back to the fully functional state and this is done in the following phases: + + thaw_noirq, thaw, complete + +using the following PCI bus type's callbacks:: + + pci_pm_thaw_noirq() + pci_pm_thaw() + pci_pm_complete() + +respectively. + +The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(), +but it doesn't put the device into the full power state and doesn't attempt to +restore its standard configuration registers. It also executes the device +driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq(). + +The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device +driver's pm->thaw() callback instead of pm->resume(). It is executed +asynchronously for different PCI devices that don't depend on each other in a +known way. + +The complete phase is the same as for system resume. + +After saving the image, devices need to be powered down before the system can +enter the target sleep state (ACPI S4 for ACPI-based systems).
This is done in +three phases: + + prepare, poweroff, poweroff_noirq + +where the prepare phase is exactly the same as for system suspend. The other +two phases are analogous to the suspend and suspend_noirq phases, respectively. +The PCI subsystem-level callbacks they correspond to:: + + pci_pm_poweroff() + pci_pm_poweroff_noirq() + +work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively, +although they don't attempt to save the device's standard configuration +registers. + +2.4.4. System Restore +^^^^^^^^^^^^^^^^^^^^^ + +System restore requires a hibernation image to be loaded into memory and the +pre-hibernation memory contents to be restored before the pre-hibernation system +activity can be resumed. + +As described in Documentation/driver-api/pm/devices.rst, the hibernation image is loaded +into memory by a fresh instance of the kernel, called the boot kernel, which in +turn is loaded and run by a boot loader in the usual way. After the boot kernel +has loaded the image, it needs to replace its own code and data with the code +and data of the "hibernated" kernel stored within the image, called the image +kernel. For this purpose all devices are frozen just like before creating +the image during hibernation, in the + + prepare, freeze, freeze_noirq + +phases described above. However, the devices affected by these phases are only +those having drivers in the boot kernel; other devices will still be in whatever +state the boot loader left them. + +Should the restoration of the pre-hibernation memory contents fail, the boot +kernel would go through the "thawing" procedure described above, using the +thaw_noirq, thaw, and complete phases (that will only affect the devices having +drivers in the boot kernel), and then continue running normally.
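The phase sequences named so far can be collected into a small lookup table, which may help when checking which PCI bus type callback runs in which phase. This is a userspace sketch only: the struct, the function names and the transition labels ("freeze", "thaw" and so on) are hypothetical, and the only fact taken from the text is that the bus type callback for a phase is named pci_pm_ followed by the phase name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table type: one transition step and its three phases. */
struct model_transition {
	const char *name;
	const char *phases[3];
};

/* Phase orders as listed in the text for each step described so far. */
static const struct model_transition model_transitions[] = {
	{ "suspend",  { "prepare", "suspend", "suspend_noirq" } },
	{ "resume",   { "resume_noirq", "resume", "complete" } },
	{ "freeze",   { "prepare", "freeze", "freeze_noirq" } },
	{ "thaw",     { "thaw_noirq", "thaw", "complete" } },
	{ "poweroff", { "prepare", "poweroff", "poweroff_noirq" } },
};

/* Look up a transition by its (model-specific) label; NULL if unknown. */
static const struct model_transition *model_find(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(model_transitions) / sizeof(model_transitions[0]); i++) {
		if (strcmp(model_transitions[i].name, name) == 0)
			return &model_transitions[i];
	}
	return NULL;
}

/* Build the PCI bus type callback name for a phase, e.g. "pci_pm_thaw". */
static void model_pci_callback(const char *phase, char *buf, size_t len)
{
	snprintf(buf, len, "pci_pm_%s", phase);
}
```

A failed restore, for instance, maps to the "thaw" entry of this table, run only for devices that have drivers in the boot kernel.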
+ +If the pre-hibernation memory contents are restored successfully, which is the +usual situation, control is passed to the image kernel, which then becomes +responsible for bringing the system back to the working state. To achieve this, +it must restore the devices' pre-hibernation functionality, which is done much +like waking up from the memory sleep state, although it involves different +phases: + + restore_noirq, restore, complete + +The first two of these are analogous to the resume_noirq and resume phases +described above, respectively, and correspond to the following PCI subsystem +callbacks:: + + pci_pm_restore_noirq() + pci_pm_restore() + +These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(), +respectively, but they execute the device driver's pm->restore_noirq() and +pm->restore() callbacks, if available. + +The complete phase is carried out in exactly the same way as during system +resume. + + +3. PCI Device Drivers and Power Management +========================================== + +3.1. Power Management Callbacks +------------------------------- + +PCI device drivers participate in power management by providing callbacks to be +executed by the PCI subsystem's power management routines described above and by +controlling the runtime power management of their devices. + +At the time of this writing there are two ways to define power management +callbacks for a PCI device driver, the recommended one, based on using a +dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and the +"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and +.resume() callbacks from struct pci_driver are used. The legacy approach, +however, doesn't allow one to define runtime power management callbacks and is +not really suitable for any new drivers. Therefore it is not covered by this +document (refer to the source code to learn more about it). 
+ +It is recommended that all PCI device drivers define a struct dev_pm_ops object +containing pointers to power management (PM) callbacks that will be executed by +the PCI subsystem's PM routines in various circumstances. A pointer to the +driver's struct dev_pm_ops object has to be assigned to the driver.pm field in +its struct pci_driver object. Once that has happened, the "legacy" PM callbacks +in struct pci_driver are ignored (even if they are not NULL). + +The PM callbacks in struct dev_pm_ops are not mandatory and if they are not +defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI +subsystem will handle the device in a simplified default manner. If they are +defined, though, they are expected to behave as described in the following +subsections. + +3.1.1. prepare() +^^^^^^^^^^^^^^^^ + +The prepare() callback is executed during system suspend, during hibernation +(when a hibernation image is about to be created), during power-off after +saving a hibernation image and during system restore, when a hibernation image +has just been loaded into memory. + +This callback is only necessary if the driver's device has children that in +general may be registered at any time. In that case the role of the prepare() +callback is to prevent new children of the device from being registered until +one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run. + +In addition to that the prepare() callback may carry out some operations +preparing the device to be suspended, although it should not allocate memory +(if additional memory is required to suspend the device, it has to be +preallocated earlier, for example in a suspend/hibernate notifier as described +in Documentation/driver-api/pm/notifiers.rst). + +3.1.2. suspend() +^^^^^^^^^^^^^^^^ + +The suspend() callback is only executed during system suspend, after prepare() +callbacks have been executed for all devices in the system. 
+ +This callback is expected to quiesce the device and prepare it to be put into a +low-power state by the PCI subsystem. It is not required (in fact it is not +even recommended) that a PCI driver's suspend() callback save the standard +configuration registers of the device, prepare it for waking up the system, or +put it into a low-power state. All of these operations can very well be taken +care of by the PCI subsystem, without the driver's participation. + +However, in some rare cases it is convenient to carry out these operations in +a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and +pci_set_power_state() should be used to save the device's standard configuration +registers, to prepare it for system wakeup (if necessary), and to put it into a +low-power state, respectively. Moreover, if the driver calls pci_save_state(), +the PCI subsystem will not execute either pci_prepare_to_sleep() or +pci_set_power_state() for its device, so the driver is then responsible for +handling the device as appropriate. + +While the suspend() callback is being executed, the driver's interrupt handler +can be invoked to handle an interrupt from the device, so all suspend-related +operations relying on the driver's ability to handle interrupts should be +carried out in this callback. + +3.1.3. suspend_noirq() +^^^^^^^^^^^^^^^^^^^^^^ + +The suspend_noirq() callback is only executed during system suspend, after +suspend() callbacks have been executed for all devices in the system and +after device interrupts have been disabled by the PM core. + +The difference between suspend_noirq() and suspend() is that the driver's +interrupt handler will not be invoked while suspend_noirq() is running. Thus +suspend_noirq() can carry out operations that would cause race conditions to +arise if they were performed in suspend(). + +3.1.4.
freeze() +^^^^^^^^^^^^^^^ + +The freeze() callback is hibernation-specific and is executed in two situations: +during hibernation, after prepare() callbacks have been executed for all devices +in preparation for the creation of a system image, and during restore, +after a system image has been loaded into memory from persistent storage and the +prepare() callbacks have been executed for all devices. + +The role of this callback is analogous to the role of the suspend() callback +described above. In fact, they only need to be different in the rare cases when +the driver takes the responsibility for putting the device into a low-power +state. + +In those cases the freeze() callback should not prepare the device for system wakeup +or put it into a low-power state. Still, either it or freeze_noirq() should +save the device's standard configuration registers using pci_save_state(). + +3.1.5. freeze_noirq() +^^^^^^^^^^^^^^^^^^^^^ + +The freeze_noirq() callback is hibernation-specific. It is executed during +hibernation, after prepare() and freeze() callbacks have been executed for all +devices in preparation for the creation of a system image, and during restore, +after a system image has been loaded into memory and after prepare() and +freeze() callbacks have been executed for all devices. It is always executed +after device interrupts have been disabled by the PM core. + +The role of this callback is analogous to the role of the suspend_noirq() +callback described above and it is very rarely necessary to define +freeze_noirq(). + +The difference between freeze_noirq() and freeze() is analogous to the +difference between suspend_noirq() and suspend(). + +3.1.6. poweroff() +^^^^^^^^^^^^^^^^^ + +The poweroff() callback is hibernation-specific. It is executed when the system +is about to be powered off after saving a hibernation image to persistent +storage. prepare() callbacks are executed for all devices before poweroff() is +called.
+ +The role of this callback is analogous to the role of the suspend() and freeze() +callbacks described above, although it does not need to save the contents of +the device's registers. In particular, if the driver wants to put the device +into a low-power state itself instead of allowing the PCI subsystem to do that, +the poweroff() callback should use pci_prepare_to_sleep() and +pci_set_power_state() to prepare the device for system wakeup and to put it +into a low-power state, respectively, but it need not save the device's standard +configuration registers. + +3.1.7. poweroff_noirq() +^^^^^^^^^^^^^^^^^^^^^^^ + +The poweroff_noirq() callback is hibernation-specific. It is executed after +poweroff() callbacks have been executed for all devices in the system. + +The role of this callback is analogous to the role of the suspend_noirq() and +freeze_noirq() callbacks described above, but it does not need to save the +contents of the device's registers. + +The difference between poweroff_noirq() and poweroff() is analogous to the +difference between suspend_noirq() and suspend(). + +3.1.8. resume_noirq() +^^^^^^^^^^^^^^^^^^^^^ + +The resume_noirq() callback is only executed during system resume, after the +PM core has enabled the non-boot CPUs. The driver's interrupt handler will not +be invoked while resume_noirq() is running, so this callback can carry out +operations that might race with the interrupt handler. + +Since the PCI subsystem unconditionally puts all devices into the full power +state in the resume_noirq phase of system resume and restores their standard +configuration registers, resume_noirq() is usually not necessary. In general +it should only be used for performing operations that would lead to race +conditions if carried out by resume(). + +3.1.9. 
resume() +^^^^^^^^^^^^^^^ + +The resume() callback is only executed during system resume, after +resume_noirq() callbacks have been executed for all devices in the system and +device interrupts have been enabled by the PM core. + +This callback is responsible for restoring the pre-suspend configuration of the +device and bringing it back to the fully functional state. The device should be +able to process I/O in a usual way after resume() has returned. + +3.1.10. thaw_noirq() +^^^^^^^^^^^^^^^^^^^^ + +The thaw_noirq() callback is hibernation-specific. It is executed after a +system image has been created and the non-boot CPUs have been enabled by the PM +core, in the thaw_noirq phase of hibernation. It also may be executed if the +loading of a hibernation image fails during system restore (it is then executed +after enabling the non-boot CPUs). The driver's interrupt handler will not be +invoked while thaw_noirq() is running. + +The role of this callback is analogous to the role of resume_noirq(). The +difference between these two callbacks is that thaw_noirq() is executed after +freeze() and freeze_noirq(), so in general it does not need to modify the +contents of the device's registers. + +3.1.11. thaw() +^^^^^^^^^^^^^^ + +The thaw() callback is hibernation-specific. It is executed after thaw_noirq() +callbacks have been executed for all devices in the system and after device +interrupts have been enabled by the PM core. + +This callback is responsible for restoring the pre-freeze configuration of +the device, so that it will work in a usual way after thaw() has returned. + +3.1.12. restore_noirq() +^^^^^^^^^^^^^^^^^^^^^^^ + +The restore_noirq() callback is hibernation-specific. It is executed in the +restore_noirq phase of hibernation, when the boot kernel has passed control to +the image kernel and the non-boot CPUs have been enabled by the image kernel's +PM core. 
+ +This callback is analogous to resume_noirq() with the exception that it cannot +make any assumption on the previous state of the device, even if the BIOS (or +generally the platform firmware) is known to preserve that state over a +suspend-resume cycle. + +For the vast majority of PCI device drivers there is no difference between +resume_noirq() and restore_noirq(). + +3.1.13. restore() +^^^^^^^^^^^^^^^^^ + +The restore() callback is hibernation-specific. It is executed after +restore_noirq() callbacks have been executed for all devices in the system and +after the PM core has enabled device drivers' interrupt handlers to be invoked. + +This callback is analogous to resume(), just like restore_noirq() is analogous +to resume_noirq(). Consequently, the difference between restore_noirq() and +restore() is analogous to the difference between resume_noirq() and resume(). + +For the vast majority of PCI device drivers there is no difference between +resume() and restore(). + +3.1.14. complete() +^^^^^^^^^^^^^^^^^^ + +The complete() callback is executed in the following situations: + + - during system resume, after resume() callbacks have been executed for all + devices, + - during hibernation, before saving the system image, after thaw() callbacks + have been executed for all devices, + - during system restore, when the system is going back to its pre-hibernation + state, after restore() callbacks have been executed for all devices. + +It also may be executed if the loading of a hibernation image into memory fails +(in that case it is run after thaw() callbacks have been executed for all +devices that have drivers in the boot kernel). + +This callback is entirely optional, although it may be necessary if the +prepare() callback performs operations that need to be reversed. + +3.1.15. runtime_suspend() +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The runtime_suspend() callback is specific to device runtime power management +(runtime PM). 
It is executed by the PM core's runtime PM framework when the +device is about to be suspended (i.e. quiesced and put into a low-power state) +at run time. + +This callback is responsible for freezing the device and preparing it to be +put into a low-power state, but it must allow the PCI subsystem to perform all +of the PCI-specific actions necessary for suspending the device. + +3.1.16. runtime_resume() +^^^^^^^^^^^^^^^^^^^^^^^^ + +The runtime_resume() callback is specific to device runtime PM. It is executed +by the PM core's runtime PM framework when the device is about to be resumed +(i.e. put into the full-power state and programmed to process I/O normally) at +run time. + +This callback is responsible for restoring the normal functionality of the +device after it has been put into the full-power state by the PCI subsystem. +The device is expected to be able to process I/O in the usual way after +runtime_resume() has returned. + +3.1.17. runtime_idle() +^^^^^^^^^^^^^^^^^^^^^^ + +The runtime_idle() callback is specific to device runtime PM. It is executed +by the PM core's runtime PM framework whenever it may be desirable to suspend +the device according to the PM core's information. In particular, it is +automatically executed right after runtime_resume() has returned in case the +resume of the device has happened as a result of a spurious event. + +This callback is optional, but if it is not implemented or if it returns 0, the +PCI subsystem will call pm_runtime_suspend() for the device, which in turn will +cause the driver's runtime_suspend() callback to be executed. + +3.1.18. Pointing Multiple Callback Pointers to One Routine +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Although in principle each of the callbacks described in the previous +subsections can be defined as a separate function, it often is convenient to +point two or more members of struct dev_pm_ops to the same routine. 
There are +a few convenience macros that can be used for this purpose. + +The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one +suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() +members and one resume routine pointed to by the .resume(), .thaw(), and +.restore() members. The other function pointers in this struct dev_pm_ops are +unset. + +The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it +additionally sets the .runtime_resume() pointer to the same value as +.resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to +the same value as .suspend() (and .freeze() and .poweroff()). + +The SET_SYSTEM_SLEEP_PM_OPS can be used inside of a declaration of struct +dev_pm_ops to indicate that one suspend routine is to be pointed to by the +.suspend(), .freeze(), and .poweroff() members and one resume routine is to +be pointed to by the .resume(), .thaw(), and .restore() members. + +3.1.19. Driver Flags for Power Management +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The PM core allows device drivers to set flags that influence the handling of +power management for the devices by the core itself and by middle layer code +including the PCI bus type. The flags should be set once at the driver probe +time with the help of the dev_pm_set_driver_flags() function and they should not +be updated directly afterwards. + +The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete +mechanism allowing device suspend/resume callbacks to be skipped if the device +is in runtime suspend when the system suspend starts. That also affects all of +the ancestors of the device, so this flag should only be used if absolutely +necessary. + +The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a +positive value from pci_pm_prepare() if the ->prepare callback provided by the +driver of the device returns a positive value. 
That allows the driver to opt +out from using the direct-complete mechanism dynamically. + +The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's +perspective the device can be safely left in runtime suspend during system +suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() +to skip resuming the device from runtime suspend unless there are PCI-specific +reasons for doing that. Also, it causes pci_pm_suspend_late/noirq(), +pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early +if the device remains in runtime suspend in the beginning of the "late" phase +of the system-wide transition under way. Moreover, if the device is in +runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime +power management status will be changed to "active" (as it is going to be put +into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), +the function will set the power.direct_complete flag for it (to make the PM core +skip the subsequent "thaw" callbacks for it) and return. + +Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the +device to be left in suspend after system-wide transitions to the working state. +This flag is checked by the PM core, but the PCI bus type informs the PM core +which devices may be left in suspend from its perspective (that happens during +the "noirq" phase of system-wide suspend and analogous transitions) and next it +uses the dev_pm_may_skip_resume() helper to decide whether or not to return from +pci_pm_resume_noirq() early, as the PM core will skip the remaining resume +callbacks for the device during the transition under way and will set its +runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for +it. + +3.2. 
Device Runtime Power Management +------------------------------------ + +In addition to providing device power management callbacks PCI device drivers +are responsible for controlling the runtime power management (runtime PM) of +their devices. + +The PCI device runtime PM is optional, but it is recommended that PCI device +drivers implement it at least in the cases where there is a reliable way of +verifying that the device is not used (like when the network cable is detached +from an Ethernet adapter or there are no devices attached to a USB controller). + +To support the PCI runtime PM the driver first needs to implement the +runtime_suspend() and runtime_resume() callbacks. It also may need to implement +the runtime_idle() callback to prevent the device from being suspended again +every time right after the runtime_resume() callback has returned +(alternatively, the runtime_suspend() callback will have to check if the +device should really be suspended and return -EAGAIN if that is not the case). + +The runtime PM of PCI devices is enabled by default by the PCI core. PCI +device drivers do not need to enable it and should not attempt to do so. +However, it is blocked by pci_pm_init() that runs the pm_runtime_forbid() +helper function. In addition to that, the runtime PM usage counter of +each PCI device is incremented by local_pci_probe() before executing the +probe callback provided by the device's driver. + +If a PCI driver implements the runtime PM callbacks and intends to use the +runtime PM framework provided by the PM core and the PCI subsystem, it needs +to decrement the device's runtime PM usage counter in its probe callback +function. If it doesn't do that, the counter will always be different from +zero for the device and it will never be runtime-suspended. 
The simplest +way to do that is by calling pm_runtime_put_noidle(), but if the driver +wants to schedule an autosuspend right away, for example, it may call +pm_runtime_put_autosuspend() instead for this purpose. Generally, it +just needs to call a function that decrements the device's usage counter +from its probe routine to make runtime PM work for the device. + +It is important to remember that the driver's runtime_suspend() callback +may be executed right after the usage counter has been decremented, because +user space may already have caused the pm_runtime_allow() helper function +unblocking the runtime PM of the device to run via sysfs, so the driver must +be prepared to cope with that. + +The driver itself should not call pm_runtime_allow(), though. Instead, it +should let user space or some platform-specific code do that (user space can +do it via sysfs as stated above), but it must be prepared to handle the +runtime PM of the device correctly as soon as pm_runtime_allow() is called +(which may happen at any time, even before the driver is loaded). + +When the driver's remove callback runs, it has to balance the decrementation +of the device's runtime PM usage counter at the probe time. For this reason, +if it has decremented the counter in its probe callback, it must run +pm_runtime_get_noresume() in its remove callback. [Since the core carries +out a runtime resume of the device and bumps up the device's usage counter +before running the driver's remove callback, the runtime PM of the device +is effectively disabled for the duration of the remove execution and all +runtime PM helper functions incrementing the device's usage counter are +then effectively equivalent to pm_runtime_get_noresume().] + +The runtime PM framework works by processing requests to suspend or resume +devices, or to check if they are idle (in which cases it is reasonable to +subsequently request that they be suspended).
These requests are represented +by work items put into the power management workqueue, pm_wq. Although there +are a few situations in which power management requests are automatically +queued by the PM core (for example, after processing a request to resume a +device the PM core automatically queues a request to check if the device is +idle), device drivers are generally responsible for queuing power management +requests for their devices. For this purpose they should use the runtime PM +helper functions provided by the PM core, discussed in +Documentation/power/runtime_pm.rst. + +Devices can also be suspended and resumed synchronously, without placing a +request into pm_wq. In the majority of cases this also is done by their +drivers that use helper functions provided by the PM core for this purpose. + +For more information on the runtime PM of devices refer to +Documentation/power/runtime_pm.rst. + + +4. Resources +============ + +PCI Local Bus Specification, Rev. 3.0 + +PCI Bus Power Management Interface Specification, Rev. 1.2 + +Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b + +PCI Express Base Specification, Rev. 2.0 + +Documentation/driver-api/pm/devices.rst + +Documentation/power/runtime_pm.rst diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.txt deleted file mode 100644 index 8eaf9ee24d43..000000000000 --- a/Documentation/power/pci.txt +++ /dev/null @@ -1,1094 +0,0 @@ -PCI Power Management - -Copyright (c) 2010 Rafael J. Wysocki , Novell Inc. - -An overview of concepts and the Linux kernel's interfaces related to PCI power -management. Based on previous work by Patrick Mochel -(and others). - -This document only covers the aspects of power management specific to PCI -devices. For general description of the kernel's interfaces related to device -power management refer to Documentation/driver-api/pm/devices.rst and -Documentation/power/runtime_pm.txt. 
- ---------------------------------------------------------------------------- - -1. Hardware and Platform Support for PCI Power Management -2. PCI Subsystem and Device Power Management -3. PCI Device Drivers and Power Management -4. Resources - - -1. Hardware and Platform Support for PCI Power Management -========================================================= - -1.1. Native and Platform-Based Power Management ------------------------------------------------ -In general, power management is a feature allowing one to save energy by putting -devices into states in which they draw less power (low-power states) at the -price of reduced functionality or performance. - -Usually, a device is put into a low-power state when it is underutilized or -completely inactive. However, when it is necessary to use the device once -again, it has to be put back into the "fully functional" state (full-power -state). This may happen when there are some data for the device to handle or -as a result of an external event requiring the device to be active, which may -be signaled by the device itself. - -PCI devices may be put into low-power states in two ways, by using the device -capabilities introduced by the PCI Bus Power Management Interface Specification, -or with the help of platform firmware, such as an ACPI BIOS. In the first -approach, that is referred to as the native PCI power management (native PCI PM) -in what follows, the device power state is changed as a result of writing a -specific value into one of its standard configuration registers. The second -approach requires the platform firmware to provide special methods that may be -used by the kernel to change the device's power state. - -Devices supporting the native PCI PM usually can generate wakeup signals called -Power Management Events (PMEs) to let the kernel know about external events -requiring the device to be active. 
After receiving a PME the kernel is supposed -to put the device that sent it into the full-power state. However, the PCI Bus -Power Management Interface Specification doesn't define any standard method of -delivering the PME from the device to the CPU and the operating system kernel. -It is assumed that the platform firmware will perform this task and therefore, -even though a PCI device is set up to generate PMEs, it also may be necessary to -prepare the platform firmware for notifying the CPU of the PMEs coming from the -device (e.g. by generating interrupts). - -In turn, if the methods provided by the platform firmware are used for changing -the power state of a device, usually the platform also provides a method for -preparing the device to generate wakeup signals. In that case, however, it -often also is necessary to prepare the device for generating PMEs using the -native PCI PM mechanism, because the method provided by the platform depends on -that. - -Thus in many situations both the native and the platform-based power management -mechanisms have to be used simultaneously to obtain the desired result. - -1.2. Native PCI Power Management --------------------------------- -The PCI Bus Power Management Interface Specification (PCI PM Spec) was -introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a -standard interface for performing various operations related to power -management. - -The implementation of the PCI PM Spec is optional for conventional PCI devices, -but it is mandatory for PCI Express devices. If a device supports the PCI PM -Spec, it has an 8 byte power management capability field in its PCI -configuration space. This field is used to describe and control the standard -features related to the native PCI power management. - -The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses -(B0-B3). The higher the number, the less power is drawn by the device or bus -in that state. 
However, the higher the number, the longer the latency for -the device or bus to return to the full-power state (D0 or B0, respectively). - -There are two variants of the D3 state defined by the specification. The first -one is D3hot, referred to as the software accessible D3, because devices can be -programmed to go into it. The second one, D3cold, is the state that PCI devices -are in when the supply voltage (Vcc) is removed from them. It is not possible -to program a PCI device to go into D3cold, although there may be a programmable -interface for putting the bus the device is on into a state in which Vcc is -removed from all devices on the bus. - -PCI bus power management, however, is not supported by the Linux kernel at the -time of this writing and therefore it is not covered by this document. - -Note that every PCI device can be in the full-power state (D0) or in D3cold, -regardless of whether or not it implements the PCI PM Spec. In addition to -that, if the PCI PM Spec is implemented by the device, it must support D3hot -as well as D0. The support for the D1 and D2 power states is optional. - -PCI devices supporting the PCI PM Spec can be programmed to go to any of the -supported low-power states (except for D3cold). While in D1-D3hot the -standard configuration registers of the device must be accessible to software -(i.e. the device is required to respond to PCI configuration accesses), although -its I/O and memory spaces are then disabled. This allows the device to be -programmatically put into D0. 
Thus the kernel can switch the device back and -forth between D0 and the supported low-power states (except for D3cold) and the -possible power state transitions the device can undergo are the following: - -+----------------------------+ -| Current State | New State | -+----------------------------+ -| D0 | D1, D2, D3 | -+----------------------------+ -| D1 | D2, D3 | -+----------------------------+ -| D2 | D3 | -+----------------------------+ -| D1, D2, D3 | D0 | -+----------------------------+ - -The transition from D3cold to D0 occurs when the supply voltage is provided to -the device (i.e. power is restored). In that case the device returns to D0 with -a full power-on reset sequence and the power-on defaults are restored to the -device by hardware just as at initial power up. - -PCI devices supporting the PCI PM Spec can be programmed to generate PMEs -while in a low-power state (D1-D3), but they are not required to be capable -of generating PMEs from all supported low-power states. In particular, the -capability of generating PMEs from D3cold is optional and depends on the -presence of additional voltage (3.3Vaux) allowing the device to remain -sufficiently active to generate a wakeup signal. - -1.3. ACPI Device Power Management ---------------------------------- -The platform firmware support for the power management of PCI devices is -system-specific. However, if the system in question is compliant with the -Advanced Configuration and Power Interface (ACPI) Specification, like the -majority of x86-based systems, it is supposed to implement device power -management interfaces defined by the ACPI standard. - -For this purpose the ACPI BIOS provides special functions called "control -methods" that may be executed by the kernel to perform specific tasks, such as -putting a device into a low-power state. These control methods are encoded -using special byte-code language called the ACPI Machine Language (AML) and -stored in the machine's BIOS. 
The kernel loads them from the BIOS and executes -them as needed using an AML interpreter that translates the AML byte code into -computations and memory or I/O space accesses. This way, in theory, a BIOS -writer can provide the kernel with a means to perform actions depending -on the system design in a system-specific fashion. - -ACPI control methods may be divided into global control methods, that are not -associated with any particular devices, and device control methods, that have -to be defined separately for each device supposed to be handled with the help of -the platform. This means, in particular, that ACPI device control methods can -only be used to handle devices that the BIOS writer knew about in advance. The -ACPI methods used for device power management fall into that category. - -The ACPI specification assumes that devices can be in one of four power states -labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM -D0-D3 states (although the difference between D3hot and D3cold is not taken -into account by ACPI). Moreover, for each power state of a device there is a -set of power resources that have to be enabled for the device to be put into -that state. These power resources are controlled (i.e. enabled or disabled) -with the help of their own control methods, _ON and _OFF, that have to be -defined individually for each of them. - -To put a device into the ACPI power state Dx (where x is a number between 0 and -3 inclusive) the kernel is supposed to (1) enable the power resources required -by the device in this state using their _ON control methods and (2) execute the -_PSx control method defined for the device. In addition to that, if the device -is going to be put into a low-power state (D1-D3) and is supposed to generate -wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI -3.0) control method defined for it has to be executed before _PSx. 
Power -resources that are not required by the device in the target power state and are -not required any more by any other device should be disabled (by executing their -_OFF control methods). If the current power state of the device is D3, it can -only be put into D0 this way. - -However, quite often the power states of devices are changed during a -system-wide transition into a sleep state or back into the working state. ACPI -defines four system sleep states, S1, S2, S3, and S4, and denotes the system -working state as S0. In general, the target system sleep (or working) state -determines the highest power (lowest number) state the device can be put -into and the kernel is supposed to obtain this information by executing the -device's _SxD control method (where x is a number between 0 and 4 inclusive). -If the device is required to wake up the system from the target sleep state, the -lowest power (highest number) state it can be put into is also determined by the -target state of the system. The kernel is then supposed to use the device's -_SxW control method to obtain the number of that state. It also is supposed to -use the device's _PRW control method to learn which power resources need to be -enabled for the device to be able to generate wakeup signals. - -1.4. Wakeup Signaling ---------------------- -Wakeup signals generated by PCI devices, either as native PCI PMEs, or as -a result of the execution of the _DSW (or _PSW) ACPI control method before -putting the device into a low-power state, have to be caught and handled as -appropriate. If they are sent while the system is in the working state -(ACPI S0), they should be translated into interrupts so that the kernel can -put the devices generating them into the full-power state and take care of the -events that triggered them. In turn, if they are sent while the system is -sleeping, they should cause the system's core logic to trigger wakeup. 
- -On ACPI-based systems wakeup signals sent by conventional PCI devices are -converted into ACPI General-Purpose Events (GPEs) which are hardware signals -from the system core logic generated in response to various events that need to -be acted upon. Every GPE is associated with one or more sources of potentially -interesting events. In particular, a GPE may be associated with a PCI device -capable of signaling wakeup. The information on the connections between GPEs -and event sources is recorded in the system's ACPI BIOS from where it can be -read by the kernel. - -If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE -associated with it (if there is one) is triggered. The GPEs associated with PCI -bridges may also be triggered in response to a wakeup signal from one of the -devices below the bridge (this also is the case for root bridges) and, for -example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be -handled this way. - -A GPE may be triggered when the system is sleeping (i.e. when it is in one of -the ACPI S1-S4 states), in which case system wakeup is started by its core logic -(the device that was the source of the signal causing the system wakeup to occur -may be identified later). The GPEs used in such situations are referred to as -wakeup GPEs. - -Usually, however, GPEs are also triggered when the system is in the working -state (ACPI S0) and in that case the system's core logic generates a System -Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI -handler identifies the GPE that caused the interrupt to be generated which, -in turn, allows the kernel to identify the source of the event (that may be -a PCI device signaling wakeup). The GPEs used for notifying the kernel of -events occurring while the system is in the working state are referred to as -runtime GPEs. 
- -Unfortunately, there is no standard way of handling wakeup signals sent by -conventional PCI devices on systems that are not ACPI-based, but there is one -for PCI Express devices. Namely, the PCI Express Base Specification introduced -a native mechanism for converting native PCI PMEs into interrupts generated by -root ports. For conventional PCI devices native PMEs are out-of-band, so they -are routed separately and they need not pass through bridges (in principle they -may be routed directly to the system's core logic), but for PCI Express devices -they are in-band messages that have to pass through the PCI Express hierarchy, -including the root port on the path from the device to the Root Complex. Thus -it was possible to introduce a mechanism by which a root port generates an -interrupt whenever it receives a PME message from one of the devices below it. -The PCI Express Requester ID of the device that sent the PME message is then -recorded in one of the root port's configuration registers from where it may be -read by the interrupt handler allowing the device to be identified. [PME -messages sent by PCI Express endpoints integrated with the Root Complex don't -pass through root ports, but instead they cause a Root Complex Event Collector -(if there is one) to generate interrupts.] - -In principle the native PCI Express PME signaling may also be used on ACPI-based -systems along with the GPEs, but to use it the kernel has to ask the system's -ACPI BIOS to release control of root port configuration registers. The ACPI -BIOS, however, is not required to allow the kernel to control these registers -and if it doesn't do that, the kernel must not modify their contents. Of course -the native PCI Express PME signaling cannot be used by the kernel in that case. - - -2. PCI Subsystem and Device Power Management -============================================ - -2.1. 
Device Power Management Callbacks --------------------------------------- -The PCI Subsystem participates in the power management of PCI devices in a -number of ways. First of all, it provides an intermediate code layer between -the device power management core (PM core) and PCI device drivers. -Specifically, the pm field of the PCI subsystem's struct bus_type object, -pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing -pointers to several device power management callbacks: - -const struct dev_pm_ops pci_dev_pm_ops = { - .prepare = pci_pm_prepare, - .complete = pci_pm_complete, - .suspend = pci_pm_suspend, - .resume = pci_pm_resume, - .freeze = pci_pm_freeze, - .thaw = pci_pm_thaw, - .poweroff = pci_pm_poweroff, - .restore = pci_pm_restore, - .suspend_noirq = pci_pm_suspend_noirq, - .resume_noirq = pci_pm_resume_noirq, - .freeze_noirq = pci_pm_freeze_noirq, - .thaw_noirq = pci_pm_thaw_noirq, - .poweroff_noirq = pci_pm_poweroff_noirq, - .restore_noirq = pci_pm_restore_noirq, - .runtime_suspend = pci_pm_runtime_suspend, - .runtime_resume = pci_pm_runtime_resume, - .runtime_idle = pci_pm_runtime_idle, -}; - -These callbacks are executed by the PM core in various situations related to -device power management and they, in turn, execute power management callbacks -provided by PCI device drivers. They also perform power management operations -involving some standard configuration registers of PCI devices that device -drivers need not know or care about. - -The structure representing a PCI device, struct pci_dev, contains several fields -that these callbacks operate on: - -struct pci_dev { - ... - pci_power_t current_state; /* Current operating state. */ - int pm_cap; /* PM capability offset in the - configuration space */ - unsigned int pme_support:5; /* Bitmask of states from which PME# - can be generated */ - unsigned int pme_interrupt:1;/* Is native PCIe PME signaling used? 
*/ - unsigned int d1_support:1; /* Low power state D1 is supported */ - unsigned int d2_support:1; /* Low power state D2 is supported */ - unsigned int no_d1d2:1; /* D1 and D2 are forbidden */ - unsigned int wakeup_prepared:1; /* Device prepared for wake up */ - unsigned int d3_delay; /* D3->D0 transition time in ms */ - ... -}; - -They also indirectly use some fields of the struct device that is embedded in -struct pci_dev. - -2.2. Device Initialization --------------------------- -The PCI subsystem's first task related to device power management is to -prepare the device for power management and initialize the fields of struct -pci_dev used for this purpose. This happens in two functions defined in -drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init(). - -The first of these functions checks if the device supports native PCI PM -and if that's the case the offset of its power management capability structure -in the configuration space is stored in the pm_cap field of the device's struct -pci_dev object. Next, the function checks which PCI low-power states are -supported by the device and from which low-power states the device can generate -native PCI PMEs. The power management fields of the device's struct pci_dev and -the struct device embedded in it are updated accordingly and the generation of -PMEs by the device is disabled. - -The second function checks if the device can be prepared to signal wakeup with -the help of the platform firmware, such as the ACPI BIOS. If that is the case, -the function updates the wakeup fields in struct device embedded in the -device's struct pci_dev and uses the firmware-provided method to prevent the -device from signaling wakeup. - -At this point the device is ready for power management. For driverless devices, -however, this functionality is limited to a few basic operations carried out -during system-wide transitions to a sleep state and back to the working state. - -2.3. 
Runtime Device Power Management ------------------------------------- -The PCI subsystem plays a vital role in the runtime power management of PCI -devices. For this purpose it uses the general runtime power management -(runtime PM) framework described in Documentation/power/runtime_pm.txt. -Namely, it provides subsystem-level callbacks: - - pci_pm_runtime_suspend() - pci_pm_runtime_resume() - pci_pm_runtime_idle() - -that are executed by the core runtime PM routines. It also implements the -entire mechanics necessary for handling runtime wakeup signals from PCI devices -in low-power states, which at the time of this writing works for both the native -PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in -Section 1. - -First, a PCI device is put into a low-power state, or suspended, with the help -of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call -pci_pm_runtime_suspend() to do the actual job. For this to work, the device's -driver has to provide a pm->runtime_suspend() callback (see below), which is -run by pci_pm_runtime_suspend() as the first action. If the driver's callback -returns successfully, the device's standard configuration registers are saved, -the device is prepared to generate wakeup signals and, finally, it is put into -the target low-power state. - -The low-power state to put the device into is the lowest-power (highest number) -state from which it can signal wakeup. The exact method of signaling wakeup is -system-dependent and is determined by the PCI subsystem on the basis of the -reported capabilities of the device and the platform firmware. To prepare the -device for signaling wakeup and put it into the selected low-power state, the -PCI subsystem can use the platform firmware as well as the device's native PCI -PM capabilities, if supported. 
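As a rough illustration of the selection rule described above, the following self-contained C sketch picks the deepest device state from which wakeup can still be signaled. The names used here (enum dstate, pick_wakeup_state() and the bit layout of the mask) are hypothetical and exist only for this example; the PCI subsystem's real logic also takes the platform firmware into account and is not reproduced here.

```c
/* Hypothetical model of the PCI PM device states, shallowest first. */
enum dstate { D0, D1, D2, D3HOT, D3COLD };

/*
 * Return the lowest-power (highest-numbered) state whose bit is set in
 * pme_mask, i.e. the deepest state from which the device can still signal
 * wakeup.  Fall back to D0 if no low-power state can signal wakeup.
 * Bit n of pme_mask corresponds to enum value n; this layout is an
 * assumption of the sketch, loosely modeled on the pme_support bitmask.
 */
static enum dstate pick_wakeup_state(unsigned int pme_mask)
{
	int s;

	/* Walk from the deepest state towards D0 and stop at the first match. */
	for (s = D3COLD; s > D0; s--)
		if (pme_mask & (1u << s))
			return (enum dstate)s;
	return D0;
}
```

For a device that can signal wakeup from D1 and D3hot only, the function returns D3HOT, because that is the highest-numbered state present in the mask.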
- -It is expected that the device driver's pm->runtime_suspend() callback will -not attempt to prepare the device for signaling wakeup or to put it into a -low-power state. The driver ought to leave these tasks to the PCI subsystem -that has all of the information necessary to perform them. - -A suspended device is brought back into the "active" state, or resumed, -with the help of pm_request_resume() or pm_runtime_resume() which both call -pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's -driver provides a pm->runtime_resume() callback (see below). However, before -the driver's callback is executed, pci_pm_runtime_resume() brings the device -back into the full-power state, prevents it from signaling wakeup while in that -state and restores its standard configuration registers. Thus the driver's -callback need not worry about the PCI-specific aspects of the device resume. - -Note that generally pci_pm_runtime_resume() may be called in two different -situations. First, it may be called at the request of the device's driver, for -example if there are some data for it to process. Second, it may be called -as a result of a wakeup signal from the device itself (this sometimes is -referred to as "remote wakeup"). Of course, for this purpose the wakeup signal -is handled in one of the ways described in Section 1 and finally converted into -a notification for the PCI subsystem after the source device has been -identified. - -The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle() -and pm_request_idle(), executes the device driver's pm->runtime_idle() -callback, if defined, and if that callback doesn't return error code (or is not -present at all), suspends the device with the help of pm_runtime_suspend(). 
-Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for -example, it is called right after the device has just been resumed), in which -cases it is expected to suspend the device if that makes sense. Usually, -however, the PCI subsystem doesn't really know if the device really can be -suspended, so it lets the device's driver decide by running its -pm->runtime_idle() callback. - -2.4. System-Wide Power Transitions ----------------------------------- -There are a few different types of system-wide power transitions, described in -Documentation/driver-api/pm/devices.rst. Each of them requires devices to be handled -in a specific way and the PM core executes subsystem-level power management -callbacks for this purpose. They are executed in phases such that each phase -involves executing the same subsystem-level callback for every device belonging -to the given subsystem before the next phase begins. These phases always run -after tasks have been frozen. - -2.4.1. System Suspend - -When the system is going into a sleep state in which the contents of memory will -be preserved, such as one of the ACPI sleep states S1-S3, the phases are: - - prepare, suspend, suspend_noirq. - -The following PCI bus type's callbacks, respectively, are used in these phases: - - pci_pm_prepare() - pci_pm_suspend() - pci_pm_suspend_noirq() - -The pci_pm_prepare() routine first puts the device into the "fully functional" -state with the help of pm_runtime_resume(). Then, it executes the device -driver's pm->prepare() callback if defined (i.e. if the driver's struct -dev_pm_ops object is present and the prepare pointer in that object is valid). - -The pci_pm_suspend() routine first checks if the device's driver implements -legacy PCI suspend routines (see Section 3), in which case the driver's legacy -suspend callback is executed, if present, and its result is returned. 
Next, if -the device's driver doesn't provide a struct dev_pm_ops object (containing -pointers to the driver's callbacks), pci_pm_default_suspend() is called, which -simply turns off the device's bus master capability and runs -pcibios_disable_device() to disable it, unless the device is a bridge (PCI -bridges are ignored by this routine). Next, the device driver's pm->suspend() -callback is executed, if defined, and its result is returned if it fails. -Finally, pci_fixup_device() is called to apply hardware suspend quirks related -to the device if necessary. - -Note that the suspend phase is carried out asynchronously for PCI devices, so -the pci_pm_suspend() callback may be executed in parallel for any pair of PCI -devices that don't depend on each other in a known way (i.e. none of the paths -in the device tree from the root bridge to a leaf device contains both of them). - -The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has -been called, which means that the device driver's interrupt handler won't be -invoked while this routine is running. It first checks if the device's driver -implements legacy PCI suspend routines (Section 3), in which case the legacy -late suspend routine is called and its result is returned (the standard -configuration registers of the device are saved if the driver's callback hasn't -done that). Second, if the device driver's struct dev_pm_ops object is not -present, the device's standard configuration registers are saved and the routine -returns success. Otherwise the device driver's pm->suspend_noirq() callback is -executed, if present, and its result is returned if it fails. Next, if the -device's standard configuration registers haven't been saved yet (one of the -device driver's callbacks executed before might do that), pci_pm_suspend_noirq() -saves them, prepares the device to signal wakeup (if necessary) and puts it into -a low-power state.
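The decision flow described for pci_pm_suspend_noirq() can be sketched as a small userspace model. This is not the kernel's code: struct model_pci_dev, model_suspend_noirq() and the example callbacks are hypothetical stand-ins, and saving the registers only after a successful legacy callback is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_pci_dev;
typedef int (*model_cb)(struct model_pci_dev *);

/* Hypothetical stand-in for a PCI device and its driver's callbacks. */
struct model_pci_dev {
	model_cb legacy_suspend_late; /* non-NULL: driver uses legacy callbacks */
	bool     has_pm_ops;          /* driver provides a struct dev_pm_ops */
	model_cb pm_suspend_noirq;    /* may be NULL even if has_pm_ops is set */
	bool     may_wakeup;          /* wakeup signaling is wanted */
	bool     state_saved;         /* standard config registers saved */
	bool     wakeup_armed;
	bool     low_power;
};

static void model_save_state(struct model_pci_dev *d)
{
	d->state_saved = true;
}

/* Model of the flow in the text, one branch per sentence. */
static int model_suspend_noirq(struct model_pci_dev *d)
{
	int ret;

	assert(d != NULL);

	/* Legacy drivers: run the late suspend routine and return its result. */
	if (d->legacy_suspend_late) {
		ret = d->legacy_suspend_late(d);
		if (!ret && !d->state_saved) /* assumption: save only on success */
			model_save_state(d);
		return ret;
	}

	/* No dev_pm_ops at all: save the registers and report success. */
	if (!d->has_pm_ops) {
		model_save_state(d);
		return 0;
	}

	/* Run the driver's callback, if present; propagate a failure. */
	if (d->pm_suspend_noirq) {
		ret = d->pm_suspend_noirq(d);
		if (ret)
			return ret;
	}

	/* Only if nobody saved the registers does the subsystem take over. */
	if (!d->state_saved) {
		model_save_state(d);
		d->wakeup_armed = d->may_wakeup;
		d->low_power = true;
	}
	return 0;
}

/* Example callbacks: one saves the registers itself, one fails. */
static int cb_saves_state(struct model_pci_dev *d)
{
	model_save_state(d);
	return 0;
}

static int cb_fails(struct model_pci_dev *d)
{
	(void)d;
	return -1;
}
```

In this model a driver callback that saves the registers leaves low_power unset, which mirrors the statement that the subsystem then assumes the driver has handled wakeup and the power state itself.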
- -The low-power state to put the device into is the lowest-power (highest number) -state from which it can signal wakeup while the system is in the target sleep -state. Just like in the runtime PM case described above, the mechanism of -signaling wakeup is system-dependent and determined by the PCI subsystem, which -is also responsible for preparing the device to signal wakeup from the system's -target sleep state as appropriate. - -PCI device drivers (that don't implement legacy power management callbacks) are -generally not expected to prepare devices for signaling wakeup or to put them -into low-power states. However, if one of the driver's suspend callbacks -(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration -registers, pci_pm_suspend_noirq() will assume that the device has been prepared -to signal wakeup and put into a low-power state by the driver (the driver is -then assumed to have used the helper functions provided by the PCI subsystem for -this purpose). PCI device drivers are not encouraged to do that, but in some -rare cases doing that in the driver may be the optimum approach. - -2.4.2. System Resume - -When the system is undergoing a transition from a sleep state in which the -contents of memory have been preserved, such as one of the ACPI sleep states -S1-S3, into the working state (ACPI S0), the phases are: - - resume_noirq, resume, complete. - -The following PCI bus type's callbacks, respectively, are executed in these -phases: - - pci_pm_resume_noirq() - pci_pm_resume() - pci_pm_complete() - -The pci_pm_resume_noirq() routine first puts the device into the full-power -state, restores its standard configuration registers and applies early resume -hardware quirks related to the device, if necessary. 
This is done -unconditionally, regardless of whether or not the device's driver implements -legacy PCI power management callbacks (this way all PCI devices are in the -full-power state and their standard configuration registers have been restored -when their interrupt handlers are invoked for the first time during resume, -which allows the kernel to avoid problems with the handling of shared interrupts -by drivers whose devices are still suspended). If legacy PCI power management -callbacks (see Section 3) are implemented by the device's driver, the legacy -early resume callback is executed and its result is returned. Otherwise, the -device driver's pm->resume_noirq() callback is executed, if defined, and its -result is returned. - -The pci_pm_resume() routine first checks if the device's standard configuration -registers have been restored and restores them if that's not the case (this -is only necessary in the error path during a failing suspend). Next, resume -hardware quirks related to the device are applied, if necessary, and if the -device's driver implements legacy PCI power management callbacks (see -Section 3), the driver's legacy resume callback is executed and its result is -returned. Otherwise, the device's wakeup signaling mechanisms are blocked and -its driver's pm->resume() callback is executed, if defined (the callback's -result is then returned). - -The resume phase is carried out asynchronously for PCI devices, like the -suspend phase described above, which means that if two PCI devices don't depend -on each other in a known way, the pci_pm_resume() routine may be executed for -both of them in parallel. - -The pci_pm_complete() routine only executes the device driver's pm->complete() -callback, if defined. - -2.4.3. System Hibernation - -System hibernation is more complicated than system suspend, because it requires -a system image to be created and written into a persistent storage medium.
The -image is created atomically and all devices are quiesced, or frozen, before that -happens. - -The freezing of devices is carried out after enough memory has been freed (at -the time of this writing the image creation requires at least 50% of system RAM -to be free) in the following three phases: - - prepare, freeze, freeze_noirq - -that correspond to the PCI bus type's callbacks: - - pci_pm_prepare() - pci_pm_freeze() - pci_pm_freeze_noirq() - -This means that the prepare phase is exactly the same as for system suspend. -The other two phases, however, are different. - -The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs -the device driver's pm->freeze() callback, if defined, instead of pm->suspend(), -and it doesn't apply the suspend-related hardware quirks. It is executed -asynchronously for different PCI devices that don't depend on each other in a -known way. - -The pci_pm_freeze_noirq() routine, in turn, is similar to -pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq() -routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the -device for signaling wakeup and put it into a low-power state. Still, it saves -the device's standard configuration registers if they haven't been saved by one -of the driver's callbacks. - -Once the image has been created, it has to be saved. However, at this point all -devices are frozen and they cannot handle I/O, while their ability to handle -I/O is obviously necessary for the image saving. Thus they have to be brought -back to the fully functional state and this is done in the following phases: - - thaw_noirq, thaw, complete - -using the following PCI bus type's callbacks: - - pci_pm_thaw_noirq() - pci_pm_thaw() - pci_pm_complete() - -respectively. - -The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(), -but it doesn't put the device into the full power state and doesn't attempt to -restore its standard configuration registers. 
It also executes the device -driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq(). - -The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device -driver's pm->thaw() callback instead of pm->resume(). It is executed -asynchronously for different PCI devices that don't depend on each other in a -known way. - -The complete phase is the same as for system resume. - -After saving the image, devices need to be powered down before the system can -enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in -three phases: - - prepare, poweroff, poweroff_noirq - -where the prepare phase is exactly the same as for system suspend. The other -two phases are analogous to the suspend and suspend_noirq phases, respectively. -The PCI subsystem-level callbacks they correspond to - - pci_pm_poweroff() - pci_pm_poweroff_noirq() - -work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively, -although they don't attempt to save the device's standard configuration -registers. - -2.4.4. System Restore - -System restore requires a hibernation image to be loaded into memory and the -pre-hibernation memory contents to be restored before the pre-hibernation system -activity can be resumed. - -As described in Documentation/driver-api/pm/devices.rst, the hibernation image is loaded -into memory by a fresh instance of the kernel, called the boot kernel, which in -turn is loaded and run by a boot loader in the usual way. After the boot kernel -has loaded the image, it needs to replace its own code and data with the code -and data of the "hibernated" kernel stored within the image, called the image -kernel. For this purpose all devices are frozen just like before creating -the image during hibernation, in the - - prepare, freeze, freeze_noirq - -phases described above.
However, the devices affected by these phases are only -those having drivers in the boot kernel; other devices will still be in whatever -state the boot loader left them. - -Should the restoration of the pre-hibernation memory contents fail, the boot -kernel would go through the "thawing" procedure described above, using the -thaw_noirq, thaw, and complete phases (that will only affect the devices having -drivers in the boot kernel), and then continue running normally. - -If the pre-hibernation memory contents are restored successfully, which is the -usual situation, control is passed to the image kernel, which then becomes -responsible for bringing the system back to the working state. To achieve this, -it must restore the devices' pre-hibernation functionality, which is done much -like waking up from the memory sleep state, although it involves different -phases: - - restore_noirq, restore, complete - -The first two of these are analogous to the resume_noirq and resume phases -described above, respectively, and correspond to the following PCI subsystem -callbacks: - - pci_pm_restore_noirq() - pci_pm_restore() - -These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(), -respectively, but they execute the device driver's pm->restore_noirq() and -pm->restore() callbacks, if available. - -The complete phase is carried out in exactly the same way as during system -resume. - - -3. PCI Device Drivers and Power Management -========================================== - -3.1. Power Management Callbacks -------------------------------- -PCI device drivers participate in power management by providing callbacks to be -executed by the PCI subsystem's power management routines described above and by -controlling the runtime power management of their devices. 
- -At the time of this writing there are two ways to define power management -callbacks for a PCI device driver, the recommended one, based on using a -dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and the -"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and -.resume() callbacks from struct pci_driver are used. The legacy approach, -however, doesn't allow one to define runtime power management callbacks and is -not really suitable for any new drivers. Therefore it is not covered by this -document (refer to the source code to learn more about it). - -It is recommended that all PCI device drivers define a struct dev_pm_ops object -containing pointers to power management (PM) callbacks that will be executed by -the PCI subsystem's PM routines in various circumstances. A pointer to the -driver's struct dev_pm_ops object has to be assigned to the driver.pm field in -its struct pci_driver object. Once that has happened, the "legacy" PM callbacks -in struct pci_driver are ignored (even if they are not NULL). - -The PM callbacks in struct dev_pm_ops are not mandatory and if they are not -defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI -subsystem will handle the device in a simplified default manner. If they are -defined, though, they are expected to behave as described in the following -subsections. - -3.1.1. prepare() - -The prepare() callback is executed during system suspend, during hibernation -(when a hibernation image is about to be created), during power-off after -saving a hibernation image and during system restore, when a hibernation image -has just been loaded into memory. - -This callback is only necessary if the driver's device has children that in -general may be registered at any time. 
In that case the role of the prepare() -callback is to prevent new children of the device from being registered until -one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run. - -In addition to that the prepare() callback may carry out some operations -preparing the device to be suspended, although it should not allocate memory -(if additional memory is required to suspend the device, it has to be -preallocated earlier, for example in a suspend/hibernate notifier as described -in Documentation/driver-api/pm/notifiers.rst). - -3.1.2. suspend() - -The suspend() callback is only executed during system suspend, after prepare() -callbacks have been executed for all devices in the system. - -This callback is expected to quiesce the device and prepare it to be put into a -low-power state by the PCI subsystem. It is not required (in fact it even is -not recommended) that a PCI driver's suspend() callback save the standard -configuration registers of the device, prepare it for waking up the system, or -put it into a low-power state. All of these operations can very well be taken -care of by the PCI subsystem, without the driver's participation. - -However, in some rare cases it is convenient to carry out these operations in -a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and -pci_set_power_state() should be used to save the device's standard configuration -registers, to prepare it for system wakeup (if necessary), and to put it into a -low-power state, respectively. Moreover, if the driver calls pci_save_state(), -the PCI subsystem will not execute either pci_prepare_to_sleep(), or -pci_set_power_state() for its device, so the driver is then responsible for -handling the device as appropriate.
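The two styles of suspend() callback described above can be contrasted in a small userspace sketch. Everything here is a hypothetical model: the model_* helpers only mimic the roles the text gives to pci_save_state(), pci_prepare_to_sleep() and pci_set_power_state(), and they do not reproduce those functions' real signatures or behavior.

```c
#include <assert.h>
#include <stdbool.h>

enum model_power_state { MODEL_D0 = 0, MODEL_D3HOT = 3 };

/* Hypothetical stand-in for the device state a driver cares about. */
struct model_pci_dev {
	bool quiesced;      /* I/O stopped by the driver */
	bool state_saved;   /* standard config registers saved */
	bool wakeup_armed;  /* prepared for system wakeup */
	enum model_power_state power_state;
};

/* Stubs playing the roles of the helpers named in the text. */
static int model_save_state(struct model_pci_dev *d)
{
	d->state_saved = true;
	return 0;
}

static int model_prepare_to_sleep(struct model_pci_dev *d)
{
	d->wakeup_armed = true;
	return 0;
}

static int model_set_power_state(struct model_pci_dev *d,
				 enum model_power_state s)
{
	d->power_state = s;
	return 0;
}

/* Recommended style: quiesce only, leave the rest to the PCI subsystem. */
static int recommended_suspend(struct model_pci_dev *d)
{
	d->quiesced = true;
	return 0;
}

/* Rare case: the driver performs the PCI-specific steps itself. */
static int self_managed_suspend(struct model_pci_dev *d)
{
	int ret;

	d->quiesced = true;
	ret = model_save_state(d);
	if (ret)
		return ret;
	ret = model_prepare_to_sleep(d);
	if (ret)
		return ret;
	return model_set_power_state(d, MODEL_D3HOT);
}

/* Per the text, a saved state tells the subsystem to keep its hands off. */
static bool subsystem_handles_power(const struct model_pci_dev *d)
{
	return !d->state_saved;
}
```

The point of the sketch is the last helper: once the driver has saved the registers, responsibility for wakeup and the power state has moved to the driver, so a callback that saves state but skips the other two steps would leave the device at full power.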
- -While the suspend() callback is being executed, the driver's interrupt handler -can be invoked to handle an interrupt from the device, so all suspend-related -operations relying on the driver's ability to handle interrupts should be -carried out in this callback. - -3.1.3. suspend_noirq() - -The suspend_noirq() callback is only executed during system suspend, after -suspend() callbacks have been executed for all devices in the system and -after device interrupts have been disabled by the PM core. - -The difference between suspend_noirq() and suspend() is that the driver's -interrupt handler will not be invoked while suspend_noirq() is running. Thus -suspend_noirq() can carry out operations that would cause race conditions to -arise if they were performed in suspend(). - -3.1.4. freeze() - -The freeze() callback is hibernation-specific and is executed in two situations, -during hibernation, after prepare() callbacks have been executed for all devices -in preparation for the creation of a system image, and during restore, -after a system image has been loaded into memory from persistent storage and the -prepare() callbacks have been executed for all devices. - -The role of this callback is analogous to the role of the suspend() callback -described above. In fact, they only need to be different in the rare cases when -the driver takes the responsibility for putting the device into a low-power -state. - -In those cases the freeze() callback should not prepare the device for system wakeup -or put it into a low-power state. Still, either it or freeze_noirq() should -save the device's standard configuration registers using pci_save_state(). - -3.1.5. freeze_noirq() - -The freeze_noirq() callback is hibernation-specific.
It is executed during -hibernation, after prepare() and freeze() callbacks have been executed for all -devices in preparation for the creation of a system image, and during restore, -after a system image has been loaded into memory and after prepare() and -freeze() callbacks have been executed for all devices. It is always executed -after device interrupts have been disabled by the PM core. - -The role of this callback is analogous to the role of the suspend_noirq() -callback described above and it very rarely is necessary to define -freeze_noirq(). - -The difference between freeze_noirq() and freeze() is analogous to the -difference between suspend_noirq() and suspend(). - -3.1.6. poweroff() - -The poweroff() callback is hibernation-specific. It is executed when the system -is about to be powered off after saving a hibernation image to a persistent -storage. prepare() callbacks are executed for all devices before poweroff() is -called. - -The role of this callback is analogous to the role of the suspend() and freeze() -callbacks described above, although it does not need to save the contents of -the device's registers. In particular, if the driver wants to put the device -into a low-power state itself instead of allowing the PCI subsystem to do that, -the poweroff() callback should use pci_prepare_to_sleep() and -pci_set_power_state() to prepare the device for system wakeup and to put it -into a low-power state, respectively, but it need not save the device's standard -configuration registers. - -3.1.7. poweroff_noirq() - -The poweroff_noirq() callback is hibernation-specific. It is executed after -poweroff() callbacks have been executed for all devices in the system. - -The role of this callback is analogous to the role of the suspend_noirq() and -freeze_noirq() callbacks described above, but it does not need to save the -contents of the device's registers. 
- -The difference between poweroff_noirq() and poweroff() is analogous to the -difference between suspend_noirq() and suspend(). - -3.1.8. resume_noirq() - -The resume_noirq() callback is only executed during system resume, after the -PM core has enabled the non-boot CPUs. The driver's interrupt handler will not -be invoked while resume_noirq() is running, so this callback can carry out -operations that might race with the interrupt handler. - -Since the PCI subsystem unconditionally puts all devices into the full power -state in the resume_noirq phase of system resume and restores their standard -configuration registers, resume_noirq() is usually not necessary. In general -it should only be used for performing operations that would lead to race -conditions if carried out by resume(). - -3.1.9. resume() - -The resume() callback is only executed during system resume, after -resume_noirq() callbacks have been executed for all devices in the system and -device interrupts have been enabled by the PM core. - -This callback is responsible for restoring the pre-suspend configuration of the -device and bringing it back to the fully functional state. The device should be -able to process I/O in a usual way after resume() has returned. - -3.1.10. thaw_noirq() - -The thaw_noirq() callback is hibernation-specific. It is executed after a -system image has been created and the non-boot CPUs have been enabled by the PM -core, in the thaw_noirq phase of hibernation. It also may be executed if the -loading of a hibernation image fails during system restore (it is then executed -after enabling the non-boot CPUs). The driver's interrupt handler will not be -invoked while thaw_noirq() is running. - -The role of this callback is analogous to the role of resume_noirq(). The -difference between these two callbacks is that thaw_noirq() is executed after -freeze() and freeze_noirq(), so in general it does not need to modify the -contents of the device's registers. - -3.1.11. 
thaw() - -The thaw() callback is hibernation-specific. It is executed after thaw_noirq() -callbacks have been executed for all devices in the system and after device -interrupts have been enabled by the PM core. - -This callback is responsible for restoring the pre-freeze configuration of -the device, so that it will work in a usual way after thaw() has returned. - -3.1.12. restore_noirq() - -The restore_noirq() callback is hibernation-specific. It is executed in the -restore_noirq phase of hibernation, when the boot kernel has passed control to -the image kernel and the non-boot CPUs have been enabled by the image kernel's -PM core. - -This callback is analogous to resume_noirq() with the exception that it cannot -make any assumption on the previous state of the device, even if the BIOS (or -generally the platform firmware) is known to preserve that state over a -suspend-resume cycle. - -For the vast majority of PCI device drivers there is no difference between -resume_noirq() and restore_noirq(). - -3.1.13. restore() - -The restore() callback is hibernation-specific. It is executed after -restore_noirq() callbacks have been executed for all devices in the system and -after the PM core has enabled device drivers' interrupt handlers to be invoked. - -This callback is analogous to resume(), just like restore_noirq() is analogous -to resume_noirq(). Consequently, the difference between restore_noirq() and -restore() is analogous to the difference between resume_noirq() and resume(). - -For the vast majority of PCI device drivers there is no difference between -resume() and restore(). - -3.1.14. 
complete() - -The complete() callback is executed in the following situations: - - during system resume, after resume() callbacks have been executed for all - devices, - - during hibernation, before saving the system image, after thaw() callbacks - have been executed for all devices, - - during system restore, when the system is going back to its pre-hibernation - state, after restore() callbacks have been executed for all devices. -It also may be executed if the loading of a hibernation image into memory fails -(in that case it is run after thaw() callbacks have been executed for all -devices that have drivers in the boot kernel). - -This callback is entirely optional, although it may be necessary if the -prepare() callback performs operations that need to be reversed. - -3.1.15. runtime_suspend() - -The runtime_suspend() callback is specific to device runtime power management -(runtime PM). It is executed by the PM core's runtime PM framework when the -device is about to be suspended (i.e. quiesced and put into a low-power state) -at run time. - -This callback is responsible for freezing the device and preparing it to be -put into a low-power state, but it must allow the PCI subsystem to perform all -of the PCI-specific actions necessary for suspending the device. - -3.1.16. runtime_resume() - -The runtime_resume() callback is specific to device runtime PM. It is executed -by the PM core's runtime PM framework when the device is about to be resumed -(i.e. put into the full-power state and programmed to process I/O normally) at -run time. - -This callback is responsible for restoring the normal functionality of the -device after it has been put into the full-power state by the PCI subsystem. -The device is expected to be able to process I/O in the usual way after -runtime_resume() has returned. - -3.1.17. runtime_idle() - -The runtime_idle() callback is specific to device runtime PM. 
It is executed -by the PM core's runtime PM framework whenever it may be desirable to suspend -the device according to the PM core's information. In particular, it is -automatically executed right after runtime_resume() has returned in case the -resume of the device has happened as a result of a spurious event. - -This callback is optional, but if it is not implemented or if it returns 0, the -PCI subsystem will call pm_runtime_suspend() for the device, which in turn will -cause the driver's runtime_suspend() callback to be executed. - -3.1.18. Pointing Multiple Callback Pointers to One Routine - -Although in principle each of the callbacks described in the previous -subsections can be defined as a separate function, it often is convenient to -point two or more members of struct dev_pm_ops to the same routine. There are -a few convenience macros that can be used for this purpose. - -The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one -suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() -members and one resume routine pointed to by the .resume(), .thaw(), and -.restore() members. The other function pointers in this struct dev_pm_ops are -unset. - -The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it -additionally sets the .runtime_resume() pointer to the same value as -.resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to -the same value as .suspend() (and .freeze() and .poweroff()). - -The SET_SYSTEM_SLEEP_PM_OPS can be used inside of a declaration of struct -dev_pm_ops to indicate that one suspend routine is to be pointed to by the -.suspend(), .freeze(), and .poweroff() members and one resume routine is to -be pointed to by the .resume(), .thaw(), and .restore() members. - -3.1.19. 
Driver Flags for Power Management - -The PM core allows device drivers to set flags that influence the handling of -power management for the devices by the core itself and by middle layer code -including the PCI bus type. The flags should be set once at the driver probe -time with the help of the dev_pm_set_driver_flags() function and they should not -be updated directly afterwards. - -The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete -mechanism allowing device suspend/resume callbacks to be skipped if the device -is in runtime suspend when the system suspend starts. That also affects all of -the ancestors of the device, so this flag should only be used if absolutely -necessary. - -The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a -positive value from pci_pm_prepare() if the ->prepare callback provided by the -driver of the device returns a positive value. That allows the driver to opt -out from using the direct-complete mechanism dynamically. - -The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's -perspective the device can be safely left in runtime suspend during system -suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() -to skip resuming the device from runtime suspend unless there are PCI-specific -reasons for doing that. Also, it causes pci_pm_suspend_late/noirq(), -pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early -if the device remains in runtime suspend in the beginning of the "late" phase -of the system-wide transition under way. 
Moreover, if the device is in -runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime -power management status will be changed to "active" (as it is going to be put -into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), -the function will set the power.direct_complete flag for it (to make the PM core -skip the subsequent "thaw" callbacks for it) and return. - -Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the -device to be left in suspend after system-wide transitions to the working state. -This flag is checked by the PM core, but the PCI bus type informs the PM core -which devices may be left in suspend from its perspective (that happens during -the "noirq" phase of system-wide suspend and analogous transitions) and next it -uses the dev_pm_may_skip_resume() helper to decide whether or not to return from -pci_pm_resume_noirq() early, as the PM core will skip the remaining resume -callbacks for the device during the transition under way and will set its -runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for -it. - -3.2. Device Runtime Power Management ------------------------------------- -In addition to providing device power management callbacks PCI device drivers -are responsible for controlling the runtime power management (runtime PM) of -their devices. - -The PCI device runtime PM is optional, but it is recommended that PCI device -drivers implement it at least in the cases where there is a reliable way of -verifying that the device is not used (like when the network cable is detached -from an Ethernet adapter or there are no devices attached to a USB controller). - -To support the PCI runtime PM the driver first needs to implement the -runtime_suspend() and runtime_resume() callbacks. 
It also may need to implement -the runtime_idle() callback to prevent the device from being suspended again -every time right after the runtime_resume() callback has returned -(alternatively, the runtime_suspend() callback will have to check if the -device should really be suspended and return -EAGAIN if that is not the case). - -The runtime PM of PCI devices is enabled by default by the PCI core. PCI -device drivers do not need to enable it and should not attempt to do so. -However, it is blocked by pci_pm_init() that runs the pm_runtime_forbid() -helper function. In addition to that, the runtime PM usage counter of -each PCI device is incremented by local_pci_probe() before executing the -probe callback provided by the device's driver. - -If a PCI driver implements the runtime PM callbacks and intends to use the -runtime PM framework provided by the PM core and the PCI subsystem, it needs -to decrement the device's runtime PM usage counter in its probe callback -function. If it doesn't do that, the counter will always be different from -zero for the device and it will never be runtime-suspended. The simplest -way to do that is by calling pm_runtime_put_noidle(), but if the driver -wants to schedule an autosuspend right away, for example, it may call -pm_runtime_put_autosuspend() instead for this purpose. Generally, it -just needs to call a function that decrements the device's usage counter -from its probe routine to make runtime PM work for the device. - -It is important to remember that the driver's runtime_suspend() callback -may be executed right after the usage counter has been decremented, because -user space may already have caused the pm_runtime_allow() helper function -unblocking the runtime PM of the device to run via sysfs, so the driver must -be prepared to cope with that. - -The driver itself should not call pm_runtime_allow(), though.
Instead, it -should let user space or some platform-specific code do that (user space can -do it via sysfs as stated above), but it must be prepared to handle the -runtime PM of the device correctly as soon as pm_runtime_allow() is called -(which may happen at any time, even before the driver is loaded). - -When the driver's remove callback runs, it has to balance the decrementation -of the device's runtime PM usage counter at the probe time. For this reason, -if it has decremented the counter in its probe callback, it must run -pm_runtime_get_noresume() in its remove callback. [Since the core carries -out a runtime resume of the device and bumps up the device's usage counter -before running the driver's remove callback, the runtime PM of the device -is effectively disabled for the duration of the remove execution and all -runtime PM helper functions incrementing the device's usage counter are -then effectively equivalent to pm_runtime_get_noresume().] - -The runtime PM framework works by processing requests to suspend or resume -devices, or to check if they are idle (in which cases it is reasonable to -subsequently request that they be suspended). These requests are represented -by work items put into the power management workqueue, pm_wq. Although there -are a few situations in which power management requests are automatically -queued by the PM core (for example, after processing a request to resume a -device the PM core automatically queues a request to check if the device is -idle), device drivers are generally responsible for queuing power management -requests for their devices. For this purpose they should use the runtime PM -helper functions provided by the PM core, discussed in -Documentation/power/runtime_pm.txt. - -Devices can also be suspended and resumed synchronously, without placing a -request into pm_wq. In the majority of cases this also is done by their -drivers that use helper functions provided by the PM core for this purpose. 
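The usage counter bookkeeping around probe and remove can be illustrated with a tiny userspace model. The names below are hypothetical, nothing here calls the real runtime PM helpers, and the model assumes that the core drops both of its own references once the remove callback has returned, which the text above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device; only the usage counter is modelled. */
struct model_dev {
	int usage_count;
};

static void model_get(struct model_dev *d) { d->usage_count++; }
static void model_put(struct model_dev *d) { d->usage_count--; }

/* Runtime suspend is only possible while the usage counter is zero. */
static bool model_may_suspend(const struct model_dev *d)
{
	return d->usage_count == 0;
}

/* Driver that follows the rules in the text. */
static void good_probe(struct model_dev *d)
{
	model_put(d); /* role of pm_runtime_put_noidle() */
}

static void good_remove(struct model_dev *d)
{
	model_get(d); /* role of pm_runtime_get_noresume() */
}

/* Driver that does not take part in runtime PM at all. */
static void passive_probe(struct model_dev *d)  { (void)d; }
static void passive_remove(struct model_dev *d) { (void)d; }

/* The core bumps the counter before calling the driver's probe. */
static void model_core_probe(struct model_dev *d,
			     void (*probe)(struct model_dev *))
{
	model_get(d);
	probe(d);
}

/* The core bumps the counter again around the driver's remove. */
static void model_core_remove(struct model_dev *d,
			      void (*remove)(struct model_dev *))
{
	model_get(d);
	remove(d);
	model_put(d); /* assumption: drops the reference taken for remove */
	model_put(d); /* assumption: drops the reference taken at probe */
}
```

With good_probe() the counter reaches zero after probing, so the device may be runtime-suspended. With passive_probe() it stays at one for the whole lifetime of the binding. Pairing good_probe() with passive_remove() would drive the counter below zero in this model, which is why the decrement at probe time has to be balanced at remove time.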
- -For more information on the runtime PM of devices refer to -Documentation/power/runtime_pm.txt. - - -4. Resources -============ - -PCI Local Bus Specification, Rev. 3.0 -PCI Bus Power Management Interface Specification, Rev. 1.2 -Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b -PCI Express Base Specification, Rev. 2.0 -Documentation/driver-api/pm/devices.rst -Documentation/power/runtime_pm.txt diff --git a/Documentation/power/pm_qos_interface.rst b/Documentation/power/pm_qos_interface.rst new file mode 100644 index 000000000000..945fc6d760c9 --- /dev/null +++ b/Documentation/power/pm_qos_interface.rst @@ -0,0 +1,225 @@ +=============================== +PM Quality Of Service Interface +=============================== + +This interface provides a kernel and user mode interface for registering +performance expectations by drivers, subsystems and user space applications on +one of the parameters. + +Two different PM QoS frameworks are available: +1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput, +memory_bandwidth. +2. the per-device PM QoS framework provides the API to manage the per-device latency +constraints and PM QoS flags. + +Each parameters have defined units: + + * latency: usec + * timeout: usec + * throughput: kbs (kilo bit / sec) + * memory bandwidth: mbs (mega bit / sec) + + +1. PM QoS framework +=================== + +The infrastructure exposes multiple misc device nodes one per implemented +parameter. The set of parameters implement is defined by pm_qos_power_init() +and pm_qos_params.h. This is done because having the available parameters +being runtime configurable or changeable from a driver was seen as too easy to +abuse. + +For each parameter a list of performance requests is maintained along with +an aggregated target value. The aggregated target value is updated with +changes to the request list or elements of the list. 
Typically the +aggregated target value is simply the max or min of the request values held +in the parameter list elements. +Note: the aggregated target value is implemented as an atomic variable so that +reading the aggregated value does not require any locking mechanism. + + +From kernel mode the use of this interface is simple: + +void pm_qos_add_request(handle, param_class, target_value): + Will insert an element into the list for that identified PM QoS class with the + target value. Upon change to this list the new target is recomputed and any + registered notifiers are called only if the target value is now different. + Clients of pm_qos need to save the returned handle for future use in other + pm_qos API functions. + +void pm_qos_update_request(handle, new_target_value): + Will update the list element pointed to by the handle with the new target value + and recompute the new aggregated target, calling the notification tree if the + target is changed. + +void pm_qos_remove_request(handle): + Will remove the element. After removal it will update the aggregate target and + call the notification tree if the target was changed as a result of removing + the request. + +int pm_qos_request(param_class): + Returns the aggregated value for a given PM QoS class. + +int pm_qos_request_active(handle): + Returns if the request is still active, i.e. it has not been removed from a + PM QoS class constraints list. + +int pm_qos_add_notifier(param_class, notifier): + Adds a notification callback function to the PM QoS class. The callback is + called when the aggregated value for the PM QoS class is changed. + +int pm_qos_remove_notifier(int param_class, notifier): + Removes the notification callback function for the PM QoS class. + + +From user mode: + +Only processes can register a pm_qos request. 
To provide for automatic +cleanup of a process, the interface requires the process to register its +parameter requests in the following way: + +To register the default pm_qos target for the specific parameter, the process +must open one of /dev/[cpu_dma_latency, network_latency, network_throughput] + +As long as the device node is held open that process has a registered +request on the parameter. + +To change the requested target value the process needs to write an s32 value to +the open device node. Alternatively the user mode program could write a hex +string for the value using 10 char long format e.g. "0x12345678". This +translates to a pm_qos_update_request call. + +To remove the user mode request for a target value simply close the device +node. + + +2. PM QoS per-device latency and flags framework +================================================ + +For each device, there are three lists of PM QoS requests. Two of them are +maintained along with the aggregated targets of resume latency and active +state latency tolerance (in microseconds) and the third one is for PM QoS flags. +Values are updated in response to changes of the request list. + +The target values of resume latency and active state latency tolerance are +simply the minimum of the request values held in the parameter list elements. +The PM QoS flags aggregate value is a gather (bitwise OR) of all list elements' +values. One device PM QoS flag is defined currently: PM_QOS_FLAG_NO_POWER_OFF. + +Note: The aggregated target values are implemented in such a way that reading +the aggregated value does not require any locking mechanism. + + +From kernel mode the use of this interface is the following: + +int dev_pm_qos_add_request(device, handle, type, value): + Will insert an element into the list for that identified device with the + target value. Upon change to this list the new target is recomputed and any + registered notifiers are called only if the target value is now different. 
+ Clients of dev_pm_qos need to save the handle for future use in other + dev_pm_qos API functions. + +int dev_pm_qos_update_request(handle, new_value): + Will update the list element pointed to by the handle with the new target + value and recompute the new aggregated target, calling the notification + trees if the target is changed. + +int dev_pm_qos_remove_request(handle): + Will remove the element. After removal it will update the aggregate target + and call the notification trees if the target was changed as a result of + removing the request. + +s32 dev_pm_qos_read_value(device): + Returns the aggregated value for a given device's constraints list. + +enum pm_qos_flags_status dev_pm_qos_flags(device, mask) + Check PM QoS flags of the given device against the given mask of flags. + The meaning of the return values is as follows: + + PM_QOS_FLAGS_ALL: + All flags from the mask are set + PM_QOS_FLAGS_SOME: + Some flags from the mask are set + PM_QOS_FLAGS_NONE: + No flags from the mask are set + PM_QOS_FLAGS_UNDEFINED: + The device's PM QoS structure has not been initialized + or the list of requests is empty. + +int dev_pm_qos_add_ancestor_request(dev, handle, type, value) + Add a PM QoS request for the first direct ancestor of the given device whose + power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests) + or whose power.set_latency_tolerance callback pointer is not NULL (for + DEV_PM_QOS_LATENCY_TOLERANCE requests). + +int dev_pm_qos_expose_latency_limit(device, value) + Add a request to the device's PM QoS list of resume latency constraints and + create a sysfs attribute pm_qos_resume_latency_us under the device's power + directory allowing user space to manipulate that request. + +void dev_pm_qos_hide_latency_limit(device) + Drop the request added by dev_pm_qos_expose_latency_limit() from the device's + PM QoS list of resume latency constraints and remove sysfs attribute + pm_qos_resume_latency_us from the device's power directory. 
+ +int dev_pm_qos_expose_flags(device, value) + Add a request to the device's PM QoS list of flags and create sysfs attribute + pm_qos_no_power_off under the device's power directory allowing user space to + change the value of the PM_QOS_FLAG_NO_POWER_OFF flag. + +void dev_pm_qos_hide_flags(device) + Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list + of flags and remove sysfs attribute pm_qos_no_power_off from the device's power + directory. + +Notification mechanisms: + +The per-device PM QoS framework has a per-device notification tree. + +int dev_pm_qos_add_notifier(device, notifier): + Adds a notification callback function for the device. + The callback is called when the aggregated value of the device constraints list + is changed (for resume latency device PM QoS only). + +int dev_pm_qos_remove_notifier(device, notifier): + Removes the notification callback function for the device. + + +Active state latency tolerance +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This device PM QoS type is used to support systems in which hardware may switch +to energy-saving operation modes on the fly. In those systems, if the operation +mode chosen by the hardware attempts to save energy in an overly aggressive way, +it may cause excess latencies to be visible to software, causing it to miss +certain protocol requirements or target frame or sample rates etc. + +If there is a latency tolerance control mechanism for a given device available +to software, the .set_latency_tolerance callback in that device's dev_pm_info +structure should be populated. The routine pointed to by it is should implement +whatever is necessary to transfer the effective requirement value to the +hardware. + +Whenever the effective latency tolerance changes for the device, its +.set_latency_tolerance() callback will be executed and the effective value will +be passed to it. 
If that value is negative, which means that the list of +latency tolerance requirements for the device is empty, the callback is expected +to switch the underlying hardware latency tolerance control mechanism to an +autonomous mode if available. If that value is PM_QOS_LATENCY_ANY, in turn, and +the hardware supports a special "no requirement" setting, the callback is +expected to use it. That allows software to prevent the hardware from +automatically updating the device's latency tolerance in response to its power +state changes (e.g. during transitions from D3cold to D0), which generally may +be done in the autonomous latency tolerance control mode. + +If .set_latency_tolerance() is present for the device, sysfs attribute +pm_qos_latency_tolerance_us will be present in the devivce's power directory. +Then, user space can use that attribute to specify its latency tolerance +requirement for the device, if any. Writing "any" to it means "no requirement, +but do not let the hardware control latency tolerance" and writing "auto" to it +allows the hardware to be switched to the autonomous mode if there are no other +requirements from the kernel side in the device's list. + +Kernel code can use the functions described above along with the +DEV_PM_QOS_LATENCY_TOLERANCE device PM QoS type to add, remove and update +latency tolerance requirements for devices. diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt deleted file mode 100644 index 19c5f7b1a7ba..000000000000 --- a/Documentation/power/pm_qos_interface.txt +++ /dev/null @@ -1,212 +0,0 @@ -PM Quality Of Service Interface. - -This interface provides a kernel and user mode interface for registering -performance expectations by drivers, subsystems and user space applications on -one of the parameters. - -Two different PM QoS frameworks are available: -1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput, -memory_bandwidth. -2. 
the per-device PM QoS framework provides the API to manage the per-device latency -constraints and PM QoS flags. - -Each parameters have defined units: - * latency: usec - * timeout: usec - * throughput: kbs (kilo bit / sec) - * memory bandwidth: mbs (mega bit / sec) - - -1. PM QoS framework - -The infrastructure exposes multiple misc device nodes one per implemented -parameter. The set of parameters implement is defined by pm_qos_power_init() -and pm_qos_params.h. This is done because having the available parameters -being runtime configurable or changeable from a driver was seen as too easy to -abuse. - -For each parameter a list of performance requests is maintained along with -an aggregated target value. The aggregated target value is updated with -changes to the request list or elements of the list. Typically the -aggregated target value is simply the max or min of the request values held -in the parameter list elements. -Note: the aggregated target value is implemented as an atomic variable so that -reading the aggregated value does not require any locking mechanism. - - -From kernel mode the use of this interface is simple: - -void pm_qos_add_request(handle, param_class, target_value): -Will insert an element into the list for that identified PM QoS class with the -target value. Upon change to this list the new target is recomputed and any -registered notifiers are called only if the target value is now different. -Clients of pm_qos need to save the returned handle for future use in other -pm_qos API functions. - -void pm_qos_update_request(handle, new_target_value): -Will update the list element pointed to by the handle with the new target value -and recompute the new aggregated target, calling the notification tree if the -target is changed. - -void pm_qos_remove_request(handle): -Will remove the element. After removal it will update the aggregate target and -call the notification tree if the target was changed as a result of removing -the request. 
- -int pm_qos_request(param_class): -Returns the aggregated value for a given PM QoS class. - -int pm_qos_request_active(handle): -Returns if the request is still active, i.e. it has not been removed from a -PM QoS class constraints list. - -int pm_qos_add_notifier(param_class, notifier): -Adds a notification callback function to the PM QoS class. The callback is -called when the aggregated value for the PM QoS class is changed. - -int pm_qos_remove_notifier(int param_class, notifier): -Removes the notification callback function for the PM QoS class. - - -From user mode: -Only processes can register a pm_qos request. To provide for automatic -cleanup of a process, the interface requires the process to register its -parameter requests in the following way: - -To register the default pm_qos target for the specific parameter, the process -must open one of /dev/[cpu_dma_latency, network_latency, network_throughput] - -As long as the device node is held open that process has a registered -request on the parameter. - -To change the requested target value the process needs to write an s32 value to -the open device node. Alternatively the user mode program could write a hex -string for the value using 10 char long format e.g. "0x12345678". This -translates to a pm_qos_update_request call. - -To remove the user mode request for a target value simply close the device -node. - - -2. PM QoS per-device latency and flags framework - -For each device, there are three lists of PM QoS requests. Two of them are -maintained along with the aggregated targets of resume latency and active -state latency tolerance (in microseconds) and the third one is for PM QoS flags. -Values are updated in response to changes of the request list. - -The target values of resume latency and active state latency tolerance are -simply the minimum of the request values held in the parameter list elements. -The PM QoS flags aggregate value is a gather (bitwise OR) of all list elements' -values. 
One device PM QoS flag is defined currently: PM_QOS_FLAG_NO_POWER_OFF. - -Note: The aggregated target values are implemented in such a way that reading -the aggregated value does not require any locking mechanism. - - -From kernel mode the use of this interface is the following: - -int dev_pm_qos_add_request(device, handle, type, value): -Will insert an element into the list for that identified device with the -target value. Upon change to this list the new target is recomputed and any -registered notifiers are called only if the target value is now different. -Clients of dev_pm_qos need to save the handle for future use in other -dev_pm_qos API functions. - -int dev_pm_qos_update_request(handle, new_value): -Will update the list element pointed to by the handle with the new target value -and recompute the new aggregated target, calling the notification trees if the -target is changed. - -int dev_pm_qos_remove_request(handle): -Will remove the element. After removal it will update the aggregate target and -call the notification trees if the target was changed as a result of removing -the request. - -s32 dev_pm_qos_read_value(device): -Returns the aggregated value for a given device's constraints list. - -enum pm_qos_flags_status dev_pm_qos_flags(device, mask) -Check PM QoS flags of the given device against the given mask of flags. -The meaning of the return values is as follows: - PM_QOS_FLAGS_ALL: All flags from the mask are set - PM_QOS_FLAGS_SOME: Some flags from the mask are set - PM_QOS_FLAGS_NONE: No flags from the mask are set - PM_QOS_FLAGS_UNDEFINED: The device's PM QoS structure has not been - initialized or the list of requests is empty. 
- -int dev_pm_qos_add_ancestor_request(dev, handle, type, value) -Add a PM QoS request for the first direct ancestor of the given device whose -power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests) -or whose power.set_latency_tolerance callback pointer is not NULL (for -DEV_PM_QOS_LATENCY_TOLERANCE requests). - -int dev_pm_qos_expose_latency_limit(device, value) -Add a request to the device's PM QoS list of resume latency constraints and -create a sysfs attribute pm_qos_resume_latency_us under the device's power -directory allowing user space to manipulate that request. - -void dev_pm_qos_hide_latency_limit(device) -Drop the request added by dev_pm_qos_expose_latency_limit() from the device's -PM QoS list of resume latency constraints and remove sysfs attribute -pm_qos_resume_latency_us from the device's power directory. - -int dev_pm_qos_expose_flags(device, value) -Add a request to the device's PM QoS list of flags and create sysfs attribute -pm_qos_no_power_off under the device's power directory allowing user space to -change the value of the PM_QOS_FLAG_NO_POWER_OFF flag. - -void dev_pm_qos_hide_flags(device) -Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list -of flags and remove sysfs attribute pm_qos_no_power_off from the device's power -directory. - -Notification mechanisms: -The per-device PM QoS framework has a per-device notification tree. - -int dev_pm_qos_add_notifier(device, notifier): -Adds a notification callback function for the device. -The callback is called when the aggregated value of the device constraints list -is changed (for resume latency device PM QoS only). - -int dev_pm_qos_remove_notifier(device, notifier): -Removes the notification callback function for the device. - - -Active state latency tolerance - -This device PM QoS type is used to support systems in which hardware may switch -to energy-saving operation modes on the fly. 
In those systems, if the operation -mode chosen by the hardware attempts to save energy in an overly aggressive way, -it may cause excess latencies to be visible to software, causing it to miss -certain protocol requirements or target frame or sample rates etc. - -If there is a latency tolerance control mechanism for a given device available -to software, the .set_latency_tolerance callback in that device's dev_pm_info -structure should be populated. The routine pointed to by it is should implement -whatever is necessary to transfer the effective requirement value to the -hardware. - -Whenever the effective latency tolerance changes for the device, its -.set_latency_tolerance() callback will be executed and the effective value will -be passed to it. If that value is negative, which means that the list of -latency tolerance requirements for the device is empty, the callback is expected -to switch the underlying hardware latency tolerance control mechanism to an -autonomous mode if available. If that value is PM_QOS_LATENCY_ANY, in turn, and -the hardware supports a special "no requirement" setting, the callback is -expected to use it. That allows software to prevent the hardware from -automatically updating the device's latency tolerance in response to its power -state changes (e.g. during transitions from D3cold to D0), which generally may -be done in the autonomous latency tolerance control mode. - -If .set_latency_tolerance() is present for the device, sysfs attribute -pm_qos_latency_tolerance_us will be present in the devivce's power directory. -Then, user space can use that attribute to specify its latency tolerance -requirement for the device, if any. Writing "any" to it means "no requirement, -but do not let the hardware control latency tolerance" and writing "auto" to it -allows the hardware to be switched to the autonomous mode if there are no other -requirements from the kernel side in the device's list. 
- -Kernel code can use the functions described above along with the -DEV_PM_QOS_LATENCY_TOLERANCE device PM QoS type to add, remove and update -latency tolerance requirements for devices. diff --git a/Documentation/power/power_supply_class.rst b/Documentation/power/power_supply_class.rst new file mode 100644 index 000000000000..3f2c3fe38a61 --- /dev/null +++ b/Documentation/power/power_supply_class.rst @@ -0,0 +1,282 @@ +======================== +Linux power supply class +======================== + +Synopsis +~~~~~~~~ +Power supply class used to represent battery, UPS, AC or DC power supply +properties to user-space. + +It defines core set of attributes, which should be applicable to (almost) +every power supply out there. Attributes are available via sysfs and uevent +interfaces. + +Each attribute has well defined meaning, up to unit of measure used. While +the attributes provided are believed to be universally applicable to any +power supply, specific monitoring hardware may not be able to provide them +all, so any of them may be skipped. + +Power supply class is extensible, and allows to define drivers own attributes. +The core attribute set is subject to the standard Linux evolution (i.e. +if it will be found that some attribute is applicable to many power supply +types or their drivers, it can be added to the core set). + +It also integrates with LED framework, for the purpose of providing +typically expected feedback of battery charging/fully charged status and +AC/USB power supply online status. (Note that specific details of the +indication (including whether to use it at all) are fully controllable by +user and/or specific machine defaults, per design principles of LED +framework). + + +Attributes/properties +~~~~~~~~~~~~~~~~~~~~~ +Power supply class has predefined set of attributes, this eliminates code +duplication across drivers. Power supply class insist on reusing its +predefined attributes *and* their units. 
+ +So, userspace gets predictable set of attributes and their units for any +kind of power supply, and can process/present them to a user in consistent +manner. Results for different power supplies and machines are also directly +comparable. + +See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c +for the example how to declare and handle attributes. + + +Units +~~~~~ +Quoting include/linux/power_supply.h: + + All voltages, currents, charges, energies, time and temperatures in µV, + µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise + stated. It's driver's job to convert its raw values to units in which + this class operates. + + +Attributes/properties detailed +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++--------------------------------------------------------------------------+ +| **Charge/Energy/Capacity - how to not confuse** | ++--------------------------------------------------------------------------+ +| **Because both "charge" (µAh) and "energy" (µWh) represents "capacity" | +| of battery, this class distinguish these terms. Don't mix them!** | +| | +| - `CHARGE_*` | +| attributes represents capacity in µAh only. | +| - `ENERGY_*` | +| attributes represents capacity in µWh only. | +| - `CAPACITY` | +| attribute represents capacity in *percents*, from 0 to 100. | ++--------------------------------------------------------------------------+ + +Postfixes: + +_AVG + *hardware* averaged value, use it if your hardware is really able to + report averaged values. +_NOW + momentary/instantaneous values. + +STATUS + this attribute represents operating status (charging, full, + discharging (i.e. powering a load), etc.). This corresponds to + `BATTERY_STATUS_*` values, as defined in battery.h. + +CHARGE_TYPE + batteries can typically charge at different rates. + This defines trickle and fast charges. For batteries that + are already charged or discharging, 'n/a' can be displayed (or + 'unknown', if the status is not known). 
+ +AUTHENTIC + indicates the power supply (battery or charger) connected + to the platform is authentic(1) or non authentic(0). + +HEALTH + represents health of the battery, values corresponds to + POWER_SUPPLY_HEALTH_*, defined in battery.h. + +VOLTAGE_OCV + open circuit voltage of the battery. + +VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN + design values for maximal and minimal power supply voltages. + Maximal/minimal means values of voltages when battery considered + "full"/"empty" at normal conditions. Yes, there is no direct relation + between voltage and battery capacity, but some dumb + batteries use voltage for very approximated calculation of capacity. + Battery driver also can use this attribute just to inform userspace + about maximal and minimal voltage thresholds of a given battery. + +VOLTAGE_MAX, VOLTAGE_MIN + same as _DESIGN voltage values except that these ones should be used + if hardware could only guess (measure and retain) the thresholds of a + given power supply. + +VOLTAGE_BOOT + Reports the voltage measured during boot + +CURRENT_BOOT + Reports the current measured during boot + +CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN + design charge values, when battery considered full/empty. + +ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN + same as above but for energy. + +CHARGE_FULL, CHARGE_EMPTY + These attributes means "last remembered value of charge when battery + became full/empty". It also could mean "value of charge when battery + considered full/empty at given conditions (temperature, age)". + I.e. these attributes represents real thresholds, not design values. + +ENERGY_FULL, ENERGY_EMPTY + same as above but for energy. + +CHARGE_COUNTER + the current charge counter (in µAh). This could easily + be negative; there is no empty or full value. It is only useful for + relative, time-based measurements. + +PRECHARGE_CURRENT + the maximum charge current during precharge phase of charge cycle + (typically 20% of battery capacity). 
+ +CHARGE_TERM_CURRENT + Charge termination current. The charge cycle terminates when battery + voltage is above recharge threshold, and charge current is below + this setting (typically 10% of battery capacity). + +CONSTANT_CHARGE_CURRENT + constant charge current programmed by charger. + + +CONSTANT_CHARGE_CURRENT_MAX + maximum charge current supported by the power supply object. + +CONSTANT_CHARGE_VOLTAGE + constant charge voltage programmed by charger. +CONSTANT_CHARGE_VOLTAGE_MAX + maximum charge voltage supported by the power supply object. + +INPUT_CURRENT_LIMIT + input current limit programmed by charger. Indicates + the current drawn from a charging source. + +CHARGE_CONTROL_LIMIT + current charge control limit setting +CHARGE_CONTROL_LIMIT_MAX + maximum charge control limit setting + +CALIBRATE + battery or coulomb counter calibration status + +CAPACITY + capacity in percents. +CAPACITY_ALERT_MIN + minimum capacity alert value in percents. +CAPACITY_ALERT_MAX + maximum capacity alert value in percents. +CAPACITY_LEVEL + capacity level. This corresponds to POWER_SUPPLY_CAPACITY_LEVEL_*. + +TEMP + temperature of the power supply. +TEMP_ALERT_MIN + minimum battery temperature alert. +TEMP_ALERT_MAX + maximum battery temperature alert. +TEMP_AMBIENT + ambient temperature. +TEMP_AMBIENT_ALERT_MIN + minimum ambient temperature alert. +TEMP_AMBIENT_ALERT_MAX + maximum ambient temperature alert. +TEMP_MIN + minimum operatable temperature +TEMP_MAX + maximum operatable temperature + +TIME_TO_EMPTY + seconds left for battery to be considered empty + (i.e. while battery powers a load) +TIME_TO_FULL + seconds left for battery to be considered full + (i.e. while battery is charging) + + +Battery <-> external power supply interaction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Often power supplies are acting as supplies and supplicants at the same +time. Batteries are good example. So, batteries usually care if they're +externally powered or not. 
+ +For that case, power supply class implements notification mechanism for +batteries. + +External power supply (AC) lists supplicants (batteries) names in +"supplied_to" struct member, and each power_supply_changed() call +issued by external power supply will notify supplicants via +external_power_changed callback. + + +Devicetree battery characteristics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Drivers should call power_supply_get_battery_info() to obtain battery +characteristics from a devicetree battery node, defined in +Documentation/devicetree/bindings/power/supply/battery.txt. This is +implemented in drivers/power/supply/bq27xxx_battery.c. + +Properties in struct power_supply_battery_info and their counterparts in the +battery node have names corresponding to elements in enum power_supply_property, +for naming consistency between sysfs attributes and battery node properties. + + +QA +~~ + +Q: + Where is POWER_SUPPLY_PROP_XYZ attribute? +A: + If you cannot find attribute suitable for your driver needs, feel free + to add it and send patch along with your driver. + + The attributes available currently are the ones currently provided by the + drivers written. + + Good candidates to add in future: model/part#, cycle_time, manufacturer, + etc. + + +Q: + I have some very specific attribute (e.g. battery color), should I add + this attribute to standard ones? +A: + Most likely, no. Such attribute can be placed in the driver itself, if + it is useful. Of course, if the attribute in question applicable to + large set of batteries, provided by many drivers, and/or comes from + some general battery specification/standard, it may be a candidate to + be added to the core attribute set. + + +Q: + Suppose, my battery monitoring chip/firmware does not provides capacity + in percents, but provides charge_{now,full,empty}. Should I calculate + percentage capacity manually, inside the driver, and register CAPACITY + attribute? The same question about time_to_empty/time_to_full. 
+A: + Most likely, no. This class is designed to export properties which are + directly measurable by the specific hardware available. + + Inferring not available properties using some heuristics or mathematical + model is not subject of work for a battery driver. Such functionality + should be factored out, and in fact, apm_power, the driver to serve + legacy APM API on top of power supply class, uses a simple heuristic of + approximating remaining battery capacity based on its charge, current, + voltage and so on. But full-fledged battery model is likely not subject + for kernel at all, as it would require floating point calculation to deal + with things like differential equations and Kalman filters. This is + better be handled by batteryd/libbattery, yet to be written. diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt deleted file mode 100644 index 300d37896e51..000000000000 --- a/Documentation/power/power_supply_class.txt +++ /dev/null @@ -1,231 +0,0 @@ -Linux power supply class -======================== - -Synopsis -~~~~~~~~ -Power supply class used to represent battery, UPS, AC or DC power supply -properties to user-space. - -It defines core set of attributes, which should be applicable to (almost) -every power supply out there. Attributes are available via sysfs and uevent -interfaces. - -Each attribute has well defined meaning, up to unit of measure used. While -the attributes provided are believed to be universally applicable to any -power supply, specific monitoring hardware may not be able to provide them -all, so any of them may be skipped. - -Power supply class is extensible, and allows to define drivers own attributes. -The core attribute set is subject to the standard Linux evolution (i.e. -if it will be found that some attribute is applicable to many power supply -types or their drivers, it can be added to the core set). 
- -It also integrates with LED framework, for the purpose of providing -typically expected feedback of battery charging/fully charged status and -AC/USB power supply online status. (Note that specific details of the -indication (including whether to use it at all) are fully controllable by -user and/or specific machine defaults, per design principles of LED -framework). - - -Attributes/properties -~~~~~~~~~~~~~~~~~~~~~ -Power supply class has predefined set of attributes, this eliminates code -duplication across drivers. Power supply class insist on reusing its -predefined attributes *and* their units. - -So, userspace gets predictable set of attributes and their units for any -kind of power supply, and can process/present them to a user in consistent -manner. Results for different power supplies and machines are also directly -comparable. - -See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c -for the example how to declare and handle attributes. - - -Units -~~~~~ -Quoting include/linux/power_supply.h: - - All voltages, currents, charges, energies, time and temperatures in µV, - µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise - stated. It's driver's job to convert its raw values to units in which - this class operates. - - -Attributes/properties detailed -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -~ ~ ~ ~ ~ ~ ~ Charge/Energy/Capacity - how to not confuse ~ ~ ~ ~ ~ ~ ~ -~ ~ -~ Because both "charge" (µAh) and "energy" (µWh) represents "capacity" ~ -~ of battery, this class distinguish these terms. Don't mix them! ~ -~ ~ -~ CHARGE_* attributes represents capacity in µAh only. ~ -~ ENERGY_* attributes represents capacity in µWh only. ~ -~ CAPACITY attribute represents capacity in *percents*, from 0 to 100. ~ -~ ~ -~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ - -Postfixes: -_AVG - *hardware* averaged value, use it if your hardware is really able to -report averaged values. 
-_NOW - momentary/instantaneous values. - -STATUS - this attribute represents operating status (charging, full, -discharging (i.e. powering a load), etc.). This corresponds to -BATTERY_STATUS_* values, as defined in battery.h. - -CHARGE_TYPE - batteries can typically charge at different rates. -This defines trickle and fast charges. For batteries that -are already charged or discharging, 'n/a' can be displayed (or -'unknown', if the status is not known). - -AUTHENTIC - indicates the power supply (battery or charger) connected -to the platform is authentic(1) or non authentic(0). - -HEALTH - represents health of the battery, values corresponds to -POWER_SUPPLY_HEALTH_*, defined in battery.h. - -VOLTAGE_OCV - open circuit voltage of the battery. - -VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and -minimal power supply voltages. Maximal/minimal means values of voltages -when battery considered "full"/"empty" at normal conditions. Yes, there is -no direct relation between voltage and battery capacity, but some dumb -batteries use voltage for very approximated calculation of capacity. -Battery driver also can use this attribute just to inform userspace -about maximal and minimal voltage thresholds of a given battery. - -VOLTAGE_MAX, VOLTAGE_MIN - same as _DESIGN voltage values except that -these ones should be used if hardware could only guess (measure and -retain) the thresholds of a given power supply. - -VOLTAGE_BOOT - Reports the voltage measured during boot - -CURRENT_BOOT - Reports the current measured during boot - -CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN - design charge values, when -battery considered full/empty. - -ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN - same as above but for energy. - -CHARGE_FULL, CHARGE_EMPTY - These attributes means "last remembered value -of charge when battery became full/empty". It also could mean "value of -charge when battery considered full/empty at given conditions (temperature, -age)". I.e. 
these attributes represents real thresholds, not design values. - -ENERGY_FULL, ENERGY_EMPTY - same as above but for energy. - -CHARGE_COUNTER - the current charge counter (in µAh). This could easily -be negative; there is no empty or full value. It is only useful for -relative, time-based measurements. - -PRECHARGE_CURRENT - the maximum charge current during precharge phase -of charge cycle (typically 20% of battery capacity). -CHARGE_TERM_CURRENT - Charge termination current. The charge cycle -terminates when battery voltage is above recharge threshold, and charge -current is below this setting (typically 10% of battery capacity). - -CONSTANT_CHARGE_CURRENT - constant charge current programmed by charger. -CONSTANT_CHARGE_CURRENT_MAX - maximum charge current supported by the -power supply object. - -CONSTANT_CHARGE_VOLTAGE - constant charge voltage programmed by charger. -CONSTANT_CHARGE_VOLTAGE_MAX - maximum charge voltage supported by the -power supply object. - -INPUT_CURRENT_LIMIT - input current limit programmed by charger. Indicates -the current drawn from a charging source. - -CHARGE_CONTROL_LIMIT - current charge control limit setting -CHARGE_CONTROL_LIMIT_MAX - maximum charge control limit setting - -CALIBRATE - battery or coulomb counter calibration status - -CAPACITY - capacity in percents. -CAPACITY_ALERT_MIN - minimum capacity alert value in percents. -CAPACITY_ALERT_MAX - maximum capacity alert value in percents. -CAPACITY_LEVEL - capacity level. This corresponds to -POWER_SUPPLY_CAPACITY_LEVEL_*. - -TEMP - temperature of the power supply. -TEMP_ALERT_MIN - minimum battery temperature alert. -TEMP_ALERT_MAX - maximum battery temperature alert. -TEMP_AMBIENT - ambient temperature. -TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert. -TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert. 
-TEMP_MIN - minimum operatable temperature -TEMP_MAX - maximum operatable temperature - -TIME_TO_EMPTY - seconds left for battery to be considered empty (i.e. -while battery powers a load) -TIME_TO_FULL - seconds left for battery to be considered full (i.e. -while battery is charging) - - -Battery <-> external power supply interaction -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Often power supplies are acting as supplies and supplicants at the same -time. Batteries are good example. So, batteries usually care if they're -externally powered or not. - -For that case, power supply class implements notification mechanism for -batteries. - -External power supply (AC) lists supplicants (batteries) names in -"supplied_to" struct member, and each power_supply_changed() call -issued by external power supply will notify supplicants via -external_power_changed callback. - - -Devicetree battery characteristics -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Drivers should call power_supply_get_battery_info() to obtain battery -characteristics from a devicetree battery node, defined in -Documentation/devicetree/bindings/power/supply/battery.txt. This is -implemented in drivers/power/supply/bq27xxx_battery.c. - -Properties in struct power_supply_battery_info and their counterparts in the -battery node have names corresponding to elements in enum power_supply_property, -for naming consistency between sysfs attributes and battery node properties. - - -QA -~~ -Q: Where is POWER_SUPPLY_PROP_XYZ attribute? -A: If you cannot find attribute suitable for your driver needs, feel free - to add it and send patch along with your driver. - - The attributes available currently are the ones currently provided by the - drivers written. - - Good candidates to add in future: model/part#, cycle_time, manufacturer, - etc. - - -Q: I have some very specific attribute (e.g. battery color), should I add - this attribute to standard ones? -A: Most likely, no. 
Such attribute can be placed in the driver itself, if - it is useful. Of course, if the attribute in question applicable to - large set of batteries, provided by many drivers, and/or comes from - some general battery specification/standard, it may be a candidate to - be added to the core attribute set. - - -Q: Suppose, my battery monitoring chip/firmware does not provides capacity - in percents, but provides charge_{now,full,empty}. Should I calculate - percentage capacity manually, inside the driver, and register CAPACITY - attribute? The same question about time_to_empty/time_to_full. -A: Most likely, no. This class is designed to export properties which are - directly measurable by the specific hardware available. - - Inferring not available properties using some heuristics or mathematical - model is not subject of work for a battery driver. Such functionality - should be factored out, and in fact, apm_power, the driver to serve - legacy APM API on top of power supply class, uses a simple heuristic of - approximating remaining battery capacity based on its charge, current, - voltage and so on. But full-fledged battery model is likely not subject - for kernel at all, as it would require floating point calculation to deal - with things like differential equations and Kalman filters. This is - better be handled by batteryd/libbattery, yet to be written. diff --git a/Documentation/power/powercap/powercap.rst b/Documentation/power/powercap/powercap.rst new file mode 100644 index 000000000000..7ae3b44c7624 --- /dev/null +++ b/Documentation/power/powercap/powercap.rst @@ -0,0 +1,257 @@ +======================= +Power Capping Framework +======================= + +The power capping framework provides a consistent interface between the kernel +and the user space that allows power capping drivers to expose the settings to +user space in a uniform way. 
+ +Terminology +=========== + +The framework exposes power capping devices to user space via sysfs in the +form of a tree of objects. The objects at the root level of the tree represent +'control types', which correspond to different methods of power capping. For +example, the intel-rapl control type represents the Intel "Running Average +Power Limit" (RAPL) technology, whereas the 'idle-injection' control type +corresponds to the use of idle injection for controlling power. + +Power zones represent different parts of the system, which can be controlled and +monitored using the power capping method determined by the control type the +given zone belongs to. They each contain attributes for monitoring power, as +well as controls represented in the form of power constraints. If the parts of +the system represented by different power zones are hierarchical (that is, one +bigger part consists of multiple smaller parts that each have their own power +controls), those power zones may also be organized in a hierarchy with one +parent power zone containing multiple subzones and so on to reflect the power +control topology of the system. In that case, it is possible to apply power +capping to a set of devices together using the parent power zone and if more +fine grained control is required, it can be applied through the subzones. 
+ + +Example sysfs interface tree:: + + /sys/devices/virtual/powercap + └──intel-rapl + ├──intel-rapl:0 + │   ├──constraint_0_name + │   ├──constraint_0_power_limit_uw + │   ├──constraint_0_time_window_us + │   ├──constraint_1_name + │   ├──constraint_1_power_limit_uw + │   ├──constraint_1_time_window_us + │   ├──device -> ../../intel-rapl + │   ├──energy_uj + │   ├──intel-rapl:0:0 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:0 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──intel-rapl:0:1 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:0 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──max_energy_range_uj + │   ├──max_power_range_uw + │   ├──name + │   ├──enabled + │   ├──power + │   │   ├──async + │   │   [] + │   ├──subsystem -> ../../../../../class/power_cap + │   ├──enabled + │   ├──uevent + ├──intel-rapl:1 + │   ├──constraint_0_name + │   ├──constraint_0_power_limit_uw + │   ├──constraint_0_time_window_us + │   ├──constraint_1_name + │   ├──constraint_1_power_limit_uw + │   ├──constraint_1_time_window_us + │   ├──device -> ../../intel-rapl + │   ├──energy_uj + │   ├──intel-rapl:1:0 + │   │   ├──constraint_0_name + │   │   
├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:1 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──intel-rapl:1:1 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:1 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──max_energy_range_uj + │   ├──max_power_range_uw + │   ├──name + │   ├──enabled + │   ├──power + │   │   ├──async + │   │   [] + │   ├──subsystem -> ../../../../../class/power_cap + │   ├──uevent + ├──power + │   ├──async + │   [] + ├──subsystem -> ../../../../class/power_cap + ├──enabled + └──uevent + +The above example illustrates a case in which the Intel RAPL technology, +available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one +control type called intel-rapl which contains two power zones, intel-rapl:0 and +intel-rapl:1, representing CPU packages. Each of these power zones contains +two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the +"core" and the "uncore" parts of the given CPU package, respectively. 
All of +the zones and subzones contain energy monitoring attributes (energy_uj, +max_energy_range_uj) and constraint attributes (constraint_*) allowing controls +to be applied (the constraints in the 'package' power zones apply to the whole +CPU packages and the subzone constraints only apply to the respective parts of +the given package individually). Since Intel RAPL doesn't provide instantaneous +power value, there is no power_uw attribute. + +In addition to that, each power zone contains a name attribute, allowing the +part of the system represented by that zone to be identified. +For example:: + + cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name + +package-0 +--------- + +The Intel RAPL technology allows two constraints, short term and long term, +with two different time windows to be applied to each power zone. Thus for +each zone there are 2 attributes representing the constraint names, 2 power +limits and 2 attributes representing the sizes of the time windows. Such that, +constraint_j_* attributes correspond to the jth constraint (j = 0,1). + +For example:: + + constraint_0_name + constraint_0_power_limit_uw + constraint_0_time_window_us + constraint_1_name + constraint_1_power_limit_uw + constraint_1_time_window_us + +Power Zone Attributes +===================== + +Monitoring attributes +--------------------- + +energy_uj (rw) + Current energy counter in micro joules. Write "0" to reset. + If the counter can not be reset, then this attribute is read only. + +max_energy_range_uj (ro) + Range of the above energy counter in micro-joules. + +power_uw (ro) + Current power in micro watts. + +max_power_range_uw (ro) + Range of the above power value in micro-watts. + +name (ro) + Name of this power zone. + +It is possible that some domains have both power ranges and energy counter ranges; +however, only one is mandatory. 
+ +Constraints +----------- + +constraint_X_power_limit_uw (rw) + Power limit in micro watts, which should be applicable for the + time window specified by "constraint_X_time_window_us". + +constraint_X_time_window_us (rw) + Time window in micro seconds. + +constraint_X_name (ro) + An optional name of the constraint. + +constraint_X_max_power_uw (ro) + Maximum allowed power in micro watts. + +constraint_X_min_power_uw (ro) + Minimum allowed power in micro watts. + +constraint_X_max_time_window_us (ro) + Maximum allowed time window in micro seconds. + +constraint_X_min_time_window_us (ro) + Minimum allowed time window in micro seconds. + +All fields other than power_limit_uw and time_window_us are optional. + +Common zone and control type attributes +--------------------------------------- + +enabled (rw): Enable/Disable controls at zone level or for all zones using +a control type. + +Power Cap Client Driver Interface +================================= + +The API summary: + +Call powercap_register_control_type() to register a control type object. +Call powercap_register_zone() to register a power zone (under a given +control type), either as a top-level power zone or as a subzone of another +power zone registered earlier. +The number of constraints in a power zone and the corresponding callbacks have +to be defined prior to calling powercap_register_zone() to register that zone. + +To free a power zone call powercap_unregister_zone(). +To free a control type object call powercap_unregister_control_type(). +Detailed API can be generated using kernel-doc on include/linux/powercap.h.
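Editorial note (not part of the patch): the powercap text says that Intel RAPL zones expose energy_uj and max_energy_range_uj but no power_uw. A user-space reader can still estimate average power by sampling the energy counter twice. The following is a minimal sketch of that idea and not code from the kernel tree. The helper names are hypothetical, the zone directory is passed in by the caller (for instance the intel-rapl:0 directory used in the `cat` example), and the wraparound handling assumes the counter wraps to zero at most once between samples after reaching max_energy_range_uj. Reading energy_uj may need elevated privileges on some systems.

```python
from pathlib import Path
import time


def read_int(path: Path) -> int:
    """Read a sysfs attribute that holds a single decimal integer."""
    return int(path.read_text().strip())


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Energy used between two energy_uj readings, in microjoules.

    Assumption: the counter wraps to zero at most once between the two
    readings, after reaching max_range (max_energy_range_uj).
    """
    if after >= before:
        return after - before
    # The counter wrapped: add the part counted before the wrap.
    return after + (max_range - before)


def average_power_uw(delta_uj: int, interval_s: float) -> float:
    """Average power in microwatts (microjoules per second)."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return delta_uj / interval_s


def sample_average_power_uw(zone_dir: Path, interval_s: float = 1.0) -> float:
    """Estimate the average power of one power zone over interval_s seconds.

    zone_dir is the sysfs directory of the zone; it must contain the
    energy_uj and max_energy_range_uj attributes described in the text.
    """
    max_range = read_int(zone_dir / "max_energy_range_uj")
    before = read_int(zone_dir / "energy_uj")
    start = time.monotonic()
    time.sleep(interval_s)
    after = read_int(zone_dir / "energy_uj")
    elapsed = time.monotonic() - start
    return average_power_uw(energy_delta_uj(before, after, max_range), elapsed)
```

Keeping the arithmetic in energy_delta_uj() and average_power_uw() separate from the sysfs reads means the wraparound rule can be checked without powercap hardware; only sample_average_power_uw() touches the file system.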
diff --git a/Documentation/power/powercap/powercap.txt b/Documentation/power/powercap/powercap.txt deleted file mode 100644 index 1e6ef164e07a..000000000000 --- a/Documentation/power/powercap/powercap.txt +++ /dev/null @@ -1,236 +0,0 @@ -Power Capping Framework -================================== - -The power capping framework provides a consistent interface between the kernel -and the user space that allows power capping drivers to expose the settings to -user space in a uniform way. - -Terminology -========================= -The framework exposes power capping devices to user space via sysfs in the -form of a tree of objects. The objects at the root level of the tree represent -'control types', which correspond to different methods of power capping. For -example, the intel-rapl control type represents the Intel "Running Average -Power Limit" (RAPL) technology, whereas the 'idle-injection' control type -corresponds to the use of idle injection for controlling power. - -Power zones represent different parts of the system, which can be controlled and -monitored using the power capping method determined by the control type the -given zone belongs to. They each contain attributes for monitoring power, as -well as controls represented in the form of power constraints. If the parts of -the system represented by different power zones are hierarchical (that is, one -bigger part consists of multiple smaller parts that each have their own power -controls), those power zones may also be organized in a hierarchy with one -parent power zone containing multiple subzones and so on to reflect the power -control topology of the system. In that case, it is possible to apply power -capping to a set of devices together using the parent power zone and if more -fine grained control is required, it can be applied through the subzones. - - -Example sysfs interface tree: - -/sys/devices/virtual/powercap -??? intel-rapl - ??? intel-rapl:0 - ?   ??? constraint_0_name - ?   ??? 
constraint_0_power_limit_uw - ?   ??? constraint_0_time_window_us - ?   ??? constraint_1_name - ?   ??? constraint_1_power_limit_uw - ?   ??? constraint_1_time_window_us - ?   ??? device -> ../../intel-rapl - ?   ??? energy_uj - ?   ??? intel-rapl:0:0 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:0 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? intel-rapl:0:1 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:0 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? max_energy_range_uj - ?   ??? max_power_range_uw - ?   ??? name - ?   ??? enabled - ?   ??? power - ?   ?   ??? async - ?   ?   [] - ?   ??? subsystem -> ../../../../../class/power_cap - ?   ??? enabled - ?   ??? uevent - ??? intel-rapl:1 - ?   ??? constraint_0_name - ?   ??? constraint_0_power_limit_uw - ?   ??? constraint_0_time_window_us - ?   ??? constraint_1_name - ?   ??? constraint_1_power_limit_uw - ?   ??? constraint_1_time_window_us - ?   ??? device -> ../../intel-rapl - ?   ??? energy_uj - ?   ??? intel-rapl:1:0 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? 
constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:1 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? intel-rapl:1:1 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:1 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? max_energy_range_uj - ?   ??? max_power_range_uw - ?   ??? name - ?   ??? enabled - ?   ??? power - ?   ?   ??? async - ?   ?   [] - ?   ??? subsystem -> ../../../../../class/power_cap - ?   ??? uevent - ??? power - ?   ??? async - ?   [] - ??? subsystem -> ../../../../class/power_cap - ??? enabled - ??? uevent - -The above example illustrates a case in which the Intel RAPL technology, -available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one -control type called intel-rapl which contains two power zones, intel-rapl:0 and -intel-rapl:1, representing CPU packages. Each of these power zones contains -two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the -"core" and the "uncore" parts of the given CPU package, respectively. 
All of -the zones and subzones contain energy monitoring attributes (energy_uj, -max_energy_range_uj) and constraint attributes (constraint_*) allowing controls -to be applied (the constraints in the 'package' power zones apply to the whole -CPU packages and the subzone constraints only apply to the respective parts of -the given package individually). Since Intel RAPL doesn't provide instantaneous -power value, there is no power_uw attribute. - -In addition to that, each power zone contains a name attribute, allowing the -part of the system represented by that zone to be identified. -For example: - -cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name -package-0 - -The Intel RAPL technology allows two constraints, short term and long term, -with two different time windows to be applied to each power zone. Thus for -each zone there are 2 attributes representing the constraint names, 2 power -limits and 2 attributes representing the sizes of the time windows. Such that, -constraint_j_* attributes correspond to the jth constraint (j = 0,1). - -For example: - constraint_0_name - constraint_0_power_limit_uw - constraint_0_time_window_us - constraint_1_name - constraint_1_power_limit_uw - constraint_1_time_window_us - -Power Zone Attributes -================================= -Monitoring attributes ----------------------- - -energy_uj (rw): Current energy counter in micro joules. Write "0" to reset. -If the counter can not be reset, then this attribute is read only. - -max_energy_range_uj (ro): Range of the above energy counter in micro-joules. - -power_uw (ro): Current power in micro watts. - -max_power_range_uw (ro): Range of the above power value in micro-watts. - -name (ro): Name of this power zone. - -It is possible that some domains have both power ranges and energy counter ranges; -however, only one is mandatory. 
- -Constraints ----------------- -constraint_X_power_limit_uw (rw): Power limit in micro watts, which should be -applicable for the time window specified by "constraint_X_time_window_us". - -constraint_X_time_window_us (rw): Time window in micro seconds. - -constraint_X_name (ro): An optional name of the constraint - -constraint_X_max_power_uw(ro): Maximum allowed power in micro watts. - -constraint_X_min_power_uw(ro): Minimum allowed power in micro watts. - -constraint_X_max_time_window_us(ro): Maximum allowed time window in micro seconds. - -constraint_X_min_time_window_us(ro): Minimum allowed time window in micro seconds. - -Except power_limit_uw and time_window_us other fields are optional. - -Common zone and control type attributes ----------------------------------------- -enabled (rw): Enable/Disable controls at zone level or for all zones using -a control type. - -Power Cap Client Driver Interface -================================== -The API summary: - -Call powercap_register_control_type() to register control type object. -Call powercap_register_zone() to register a power zone (under a given -control type), either as a top-level power zone or as a subzone of another -power zone registered earlier. -The number of constraints in a power zone and the corresponding callbacks have -to be defined prior to calling powercap_register_zone() to register that zone. - -To Free a power zone call powercap_unregister_zone(). -To free a control type object call powercap_unregister_control_type(). -Detailed API can be generated using kernel-doc on include/linux/powercap.h. 
diff --git a/Documentation/power/regulator/consumer.rst b/Documentation/power/regulator/consumer.rst new file mode 100644 index 000000000000..0cd8cc1275a7 --- /dev/null +++ b/Documentation/power/regulator/consumer.rst @@ -0,0 +1,229 @@ +=================================== +Regulator Consumer Driver Interface +=================================== + +This text describes the regulator interface for consumer device drivers. +Please see overview.txt for a description of the terms used in this text. + + +1. Consumer Regulator Access (static & dynamic drivers) +======================================================= + +A consumer driver can get access to its supply regulator by calling :: + + regulator = regulator_get(dev, "Vcc"); + +The consumer passes in its struct device pointer and power supply ID. The core +then finds the correct regulator by consulting a machine-specific lookup table. +If the lookup is successful then this call will return a pointer to the struct +regulator that supplies this consumer. + +To release the regulator the consumer driver should call :: + + regulator_put(regulator); + +Consumers can be supplied by more than one regulator e.g. codec consumer with +analog and digital supplies :: + + digital = regulator_get(dev, "Vcc"); /* digital core */ + analog = regulator_get(dev, "Avdd"); /* analog */ + +The regulator access functions regulator_get() and regulator_put() will +usually be called in your device driver's probe() and remove() respectively. + + +2. Regulator Output Enable & Disable (static & dynamic drivers) +=============================================================== + + +A consumer can enable its power supply by calling:: + + int regulator_enable(regulator); + +NOTE: + The supply may already be enabled before regulator_enable() is called. + This may happen if the consumer shares the regulator or the regulator has been + previously enabled by bootloader or kernel board initialization code.
+ +A consumer can determine if a regulator is enabled by calling:: + + int regulator_is_enabled(regulator); + +This will return a value greater than zero when the regulator is enabled. + + +A consumer can disable its supply when no longer needed by calling:: + + int regulator_disable(regulator); + +NOTE: + This may not disable the supply if it's shared with other consumers. The + regulator will only be disabled when the enabled reference count is zero. + +Finally, a regulator can be forcefully disabled in the case of an emergency:: + + int regulator_force_disable(regulator); + +NOTE: + This will immediately and forcefully shut down the regulator output. All + consumers will be powered off. + + +3. Regulator Voltage Control & Status (dynamic drivers) +======================================================= + +Some consumer drivers need to be able to dynamically change their supply +voltage to match system operating points, e.g. CPUfreq drivers can scale +voltage along with frequency to save power, SD drivers may need to select the +correct card voltage, etc. + +Consumers can control their supply voltage by calling:: + + int regulator_set_voltage(regulator, min_uV, max_uV); + +Where min_uV and max_uV are the minimum and maximum acceptable voltages in +microvolts. + +NOTE: this can be called when the regulator is enabled or disabled. If called +when enabled, then the voltage changes instantly, otherwise the voltage +configuration changes and the voltage is physically set when the regulator is +next enabled. + +The regulator's configured voltage output can be found by calling:: + + int regulator_get_voltage(regulator); + +NOTE: + get_voltage() will return the configured output voltage whether the + regulator is enabled or disabled and should NOT be used to determine regulator + output state. However this can be used in conjunction with is_enabled() to + determine the regulator physical output voltage. + + +4.
Regulator Current Limit Control & Status (dynamic drivers) +============================================================= + +Some consumer drivers need to be able to dynamically change their supply +current limit to match system operating points. e.g. LCD backlight driver can +change the current limit to vary the backlight brightness, USB drivers may want +to set the limit to 500mA when supplying power. + +Consumers can control their supply current limit by calling:: + + int regulator_set_current_limit(regulator, min_uA, max_uA); + +Where min_uA and max_uA are the minimum and maximum acceptable current limit in +microamps. + +NOTE: + this can be called when the regulator is enabled or disabled. If called + when enabled, then the current limit changes instantly, otherwise the current + limit configuration changes and the current limit is physically set when the + regulator is next enabled. + +A regulators current limit can be found by calling:: + + int regulator_get_current_limit(regulator); + +NOTE: + get_current_limit() will return the current limit whether the regulator + is enabled or disabled and should not be used to determine regulator current + load. + + +5. Regulator Operating Mode Control & Status (dynamic drivers) +============================================================== + +Some consumers can further save system power by changing the operating mode of +their supply regulator to be more efficient when the consumers operating state +changes. e.g. consumer driver is idle and subsequently draws less current + +Regulator operating mode can be changed indirectly or directly. + +Indirect operating mode control. 
+-------------------------------- +Consumer drivers can request a change in their supply regulator operating mode +by calling:: + + int regulator_set_load(struct regulator *regulator, int load_uA); + +This will cause the core to recalculate the total load on the regulator (based +on all its consumers) and change operating mode (if necessary and permitted) +to best match the current operating load. + +The load_uA value can be determined from the consumer's datasheet. e.g. most +datasheets have tables showing the maximum current consumed in certain +situations. + +Most consumers will use indirect operating mode control since they have no +knowledge of the regulator or whether the regulator is shared with other +consumers. + +Direct operating mode control. +------------------------------ + +Bespoke or tightly coupled drivers may want to directly control regulator +operating mode depending on their operating point. This can be achieved by +calling:: + + int regulator_set_mode(struct regulator *regulator, unsigned int mode); + unsigned int regulator_get_mode(struct regulator *regulator); + +Direct mode will only be used by consumers that *know* about the regulator and +are not sharing the regulator with other consumers. + + +6. Regulator Events +=================== + +Regulators can notify consumers of external events. Events could be received by +consumers under regulator stress or failure conditions. + +Consumers can register interest in regulator events by calling:: + + int regulator_register_notifier(struct regulator *regulator, + struct notifier_block *nb); + +Consumers can unregister interest by calling:: + + int regulator_unregister_notifier(struct regulator *regulator, + struct notifier_block *nb); + +Regulators use the kernel notifier framework to send event to their interested +consumers. + +7. 
Regulator Direct Register Access +=================================== + +Some kinds of power management hardware or firmware are designed such that +they need to do low-level hardware access to regulators, with no involvement +from the kernel. Examples of such devices are: + +- clocksource with a voltage-controlled oscillator and control logic to change + the supply voltage over I2C to achieve a desired output clock rate +- thermal management firmware that can issue an arbitrary I2C transaction to + perform system poweroff during overtemperature conditions + +To set up such a device/firmware, various parameters like I2C address of the +regulator, addresses of various regulator registers etc. need to be configured +to it. The regulator framework provides the following helpers for querying +these details. + +Bus-specific details, like I2C addresses or transfer rates are handled by the +regmap framework. To get the regulator's regmap (if supported), use:: + + struct regmap *regulator_get_regmap(struct regulator *regulator); + +To obtain the hardware register offset and bitmask for the regulator's voltage +selector register, use:: + + int regulator_get_hardware_vsel_register(struct regulator *regulator, + unsigned *vsel_reg, + unsigned *vsel_mask); + +To convert a regulator framework voltage selector code (used by +regulator_list_voltage) to a hardware-specific voltage selector that can be +directly written to the voltage selector register, use:: + + int regulator_list_hardware_vsel(struct regulator *regulator, + unsigned selector); diff --git a/Documentation/power/regulator/consumer.txt b/Documentation/power/regulator/consumer.txt deleted file mode 100644 index e51564c1a140..000000000000 --- a/Documentation/power/regulator/consumer.txt +++ /dev/null @@ -1,218 +0,0 @@ -Regulator Consumer Driver Interface -=================================== - -This text describes the regulator interface for consumer device drivers. 
-Please see overview.txt for a description of the terms used in this text. - - -1. Consumer Regulator Access (static & dynamic drivers) -======================================================= - -A consumer driver can get access to its supply regulator by calling :- - -regulator = regulator_get(dev, "Vcc"); - -The consumer passes in its struct device pointer and power supply ID. The core -then finds the correct regulator by consulting a machine specific lookup table. -If the lookup is successful then this call will return a pointer to the struct -regulator that supplies this consumer. - -To release the regulator the consumer driver should call :- - -regulator_put(regulator); - -Consumers can be supplied by more than one regulator e.g. codec consumer with -analog and digital supplies :- - -digital = regulator_get(dev, "Vcc"); /* digital core */ -analog = regulator_get(dev, "Avdd"); /* analog */ - -The regulator access functions regulator_get() and regulator_put() will -usually be called in your device drivers probe() and remove() respectively. - - -2. Regulator Output Enable & Disable (static & dynamic drivers) -==================================================================== - -A consumer can enable its power supply by calling:- - -int regulator_enable(regulator); - -NOTE: The supply may already be enabled before regulator_enabled() is called. -This may happen if the consumer shares the regulator or the regulator has been -previously enabled by bootloader or kernel board initialization code. - -A consumer can determine if a regulator is enabled by calling :- - -int regulator_is_enabled(regulator); - -This will return > zero when the regulator is enabled. - - -A consumer can disable its supply when no longer needed by calling :- - -int regulator_disable(regulator); - -NOTE: This may not disable the supply if it's shared with other consumers. The -regulator will only be disabled when the enabled reference count is zero. 
- -Finally, a regulator can be forcefully disabled in the case of an emergency :- - -int regulator_force_disable(regulator); - -NOTE: this will immediately and forcefully shutdown the regulator output. All -consumers will be powered off. - - -3. Regulator Voltage Control & Status (dynamic drivers) -====================================================== - -Some consumer drivers need to be able to dynamically change their supply -voltage to match system operating points. e.g. CPUfreq drivers can scale -voltage along with frequency to save power, SD drivers may need to select the -correct card voltage, etc. - -Consumers can control their supply voltage by calling :- - -int regulator_set_voltage(regulator, min_uV, max_uV); - -Where min_uV and max_uV are the minimum and maximum acceptable voltages in -microvolts. - -NOTE: this can be called when the regulator is enabled or disabled. If called -when enabled, then the voltage changes instantly, otherwise the voltage -configuration changes and the voltage is physically set when the regulator is -next enabled. - -The regulators configured voltage output can be found by calling :- - -int regulator_get_voltage(regulator); - -NOTE: get_voltage() will return the configured output voltage whether the -regulator is enabled or disabled and should NOT be used to determine regulator -output state. However this can be used in conjunction with is_enabled() to -determine the regulator physical output voltage. - - -4. Regulator Current Limit Control & Status (dynamic drivers) -=========================================================== - -Some consumer drivers need to be able to dynamically change their supply -current limit to match system operating points. e.g. LCD backlight driver can -change the current limit to vary the backlight brightness, USB drivers may want -to set the limit to 500mA when supplying power. 
- -Consumers can control their supply current limit by calling :- - -int regulator_set_current_limit(regulator, min_uA, max_uA); - -Where min_uA and max_uA are the minimum and maximum acceptable current limit in -microamps. - -NOTE: this can be called when the regulator is enabled or disabled. If called -when enabled, then the current limit changes instantly, otherwise the current -limit configuration changes and the current limit is physically set when the -regulator is next enabled. - -A regulators current limit can be found by calling :- - -int regulator_get_current_limit(regulator); - -NOTE: get_current_limit() will return the current limit whether the regulator -is enabled or disabled and should not be used to determine regulator current -load. - - -5. Regulator Operating Mode Control & Status (dynamic drivers) -============================================================= - -Some consumers can further save system power by changing the operating mode of -their supply regulator to be more efficient when the consumers operating state -changes. e.g. consumer driver is idle and subsequently draws less current - -Regulator operating mode can be changed indirectly or directly. - -Indirect operating mode control. --------------------------------- -Consumer drivers can request a change in their supply regulator operating mode -by calling :- - -int regulator_set_load(struct regulator *regulator, int load_uA); - -This will cause the core to recalculate the total load on the regulator (based -on all its consumers) and change operating mode (if necessary and permitted) -to best match the current operating load. - -The load_uA value can be determined from the consumer's datasheet. e.g. most -datasheets have tables showing the maximum current consumed in certain -situations. - -Most consumers will use indirect operating mode control since they have no -knowledge of the regulator or whether the regulator is shared with other -consumers. - -Direct operating mode control. 
------------------------------- -Bespoke or tightly coupled drivers may want to directly control regulator -operating mode depending on their operating point. This can be achieved by -calling :- - -int regulator_set_mode(struct regulator *regulator, unsigned int mode); -unsigned int regulator_get_mode(struct regulator *regulator); - -Direct mode will only be used by consumers that *know* about the regulator and -are not sharing the regulator with other consumers. - - -6. Regulator Events -=================== -Regulators can notify consumers of external events. Events could be received by -consumers under regulator stress or failure conditions. - -Consumers can register interest in regulator events by calling :- - -int regulator_register_notifier(struct regulator *regulator, - struct notifier_block *nb); - -Consumers can unregister interest by calling :- - -int regulator_unregister_notifier(struct regulator *regulator, - struct notifier_block *nb); - -Regulators use the kernel notifier framework to send event to their interested -consumers. - -7. Regulator Direct Register Access -=================================== -Some kinds of power management hardware or firmware are designed such that -they need to do low-level hardware access to regulators, with no involvement -from the kernel. Examples of such devices are: - -- clocksource with a voltage-controlled oscillator and control logic to change - the supply voltage over I2C to achieve a desired output clock rate -- thermal management firmware that can issue an arbitrary I2C transaction to - perform system poweroff during overtemperature conditions - -To set up such a device/firmware, various parameters like I2C address of the -regulator, addresses of various regulator registers etc. need to be configured -to it. The regulator framework provides the following helpers for querying -these details. - -Bus-specific details, like I2C addresses or transfer rates are handled by the -regmap framework. 
To get the regulator's regmap (if supported), use :- - -struct regmap *regulator_get_regmap(struct regulator *regulator); - -To obtain the hardware register offset and bitmask for the regulator's voltage -selector register, use :- - -int regulator_get_hardware_vsel_register(struct regulator *regulator, - unsigned *vsel_reg, - unsigned *vsel_mask); - -To convert a regulator framework voltage selector code (used by -regulator_list_voltage) to a hardware-specific voltage selector that can be -directly written to the voltage selector register, use :- - -int regulator_list_hardware_vsel(struct regulator *regulator, - unsigned selector); diff --git a/Documentation/power/regulator/design.rst b/Documentation/power/regulator/design.rst new file mode 100644 index 000000000000..3b09c6841dc4 --- /dev/null +++ b/Documentation/power/regulator/design.rst @@ -0,0 +1,38 @@ +========================== +Regulator API design notes +========================== + +This document provides a brief, partially structured, overview of some +of the design considerations which impact the regulator API design. + +Safety +------ + + - Errors in regulator configuration can have very serious consequences + for the system, potentially including lasting hardware damage. + - It is not possible to automatically determine the power configuration + of the system - software-equivalent variants of the same chip may + have different power requirements, and not all components with power + requirements are visible to software. + +.. note:: + + The API should make no changes to the hardware state unless it has + specific knowledge that these changes are safe to perform on this + particular system. + +Consumer use cases +------------------ + + - The overwhelming majority of devices in a system will have no + requirement to do any runtime configuration of their power beyond + being able to turn it on or off. + + - Many of the power supplies in the system will be shared between many + different consumers. + +.. 
note:: + + The consumer API should be structured so that these use cases are + very easy to handle and so that consumers will work with shared + supplies without any additional effort. diff --git a/Documentation/power/regulator/design.txt b/Documentation/power/regulator/design.txt deleted file mode 100644 index fdd919b96830..000000000000 --- a/Documentation/power/regulator/design.txt +++ /dev/null @@ -1,33 +0,0 @@ -Regulator API design notes -========================== - -This document provides a brief, partially structured, overview of some -of the design considerations which impact the regulator API design. - -Safety ------- - - - Errors in regulator configuration can have very serious consequences - for the system, potentially including lasting hardware damage. - - It is not possible to automatically determine the power configuration - of the system - software-equivalent variants of the same chip may - have different power requirements, and not all components with power - requirements are visible to software. - - => The API should make no changes to the hardware state unless it has - specific knowledge that these changes are safe to perform on this - particular system. - -Consumer use cases ------------------- - - - The overwhelming majority of devices in a system will have no - requirement to do any runtime configuration of their power beyond - being able to turn it on or off. - - - Many of the power supplies in the system will be shared between many - different consumers. - - => The consumer API should be structured so that these use cases are - very easy to handle and so that consumers will work with shared - supplies without any additional effort. 
diff --git a/Documentation/power/regulator/machine.rst b/Documentation/power/regulator/machine.rst new file mode 100644 index 000000000000..22fffefaa3ad --- /dev/null +++ b/Documentation/power/regulator/machine.rst @@ -0,0 +1,97 @@ +================================== +Regulator Machine Driver Interface +================================== + +The regulator machine driver interface is intended for board/machine specific +initialisation code to configure the regulator subsystem. + +Consider the following machine:: + + Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V] + | + +-> [Consumer B @ 3.3V] + +The drivers for consumers A & B must be mapped to the correct regulator in +order to control their power supplies. This mapping can be achieved in machine +initialisation code by creating a struct regulator_consumer_supply for +each regulator:: + + struct regulator_consumer_supply { + const char *dev_name; /* consumer dev_name() */ + const char *supply; /* consumer supply - e.g. "vcc" */ + }; + +e.g. for the machine above:: + + static struct regulator_consumer_supply regulator1_consumers[] = { + REGULATOR_SUPPLY("Vcc", "consumer B"), + }; + + static struct regulator_consumer_supply regulator2_consumers[] = { + REGULATOR_SUPPLY("Vcc", "consumer A"), + }; + +This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2 +to the 'Vcc' supply for Consumer A. + +Constraints can now be registered by defining a struct regulator_init_data +for each regulator power domain. 
This structure also maps the consumers +to their supply regulators:: + + static struct regulator_init_data regulator1_data = { + .constraints = { + .name = "Regulator-1", + .min_uV = 3300000, + .max_uV = 3300000, + .valid_modes_mask = REGULATOR_MODE_NORMAL, + }, + .num_consumer_supplies = ARRAY_SIZE(regulator1_consumers), + .consumer_supplies = regulator1_consumers, + }; + +The name field should be set to something that is usefully descriptive +for the board for configuration of supplies for other regulators and +for use in logging and other diagnostic output. Normally the name +used for the supply rail in the schematic is a good choice. If no +name is provided then the subsystem will choose one. + +Regulator-1 supplies power to Regulator-2. This relationship must be registered +with the core so that Regulator-1 is also enabled when Consumer A enables its +supply (Regulator-2). The supply regulator is set by the supply_regulator +field below and co:: + + static struct regulator_init_data regulator2_data = { + .supply_regulator = "Regulator-1", + .constraints = { + .min_uV = 1800000, + .max_uV = 2000000, + .valid_ops_mask = REGULATOR_CHANGE_VOLTAGE, + .valid_modes_mask = REGULATOR_MODE_NORMAL, + }, + .num_consumer_supplies = ARRAY_SIZE(regulator2_consumers), + .consumer_supplies = regulator2_consumers, + }; + +Finally the regulator devices must be registered in the usual manner:: + + static struct platform_device regulator_devices[] = { + { + .name = "regulator", + .id = DCDC_1, + .dev = { + .platform_data = &regulator1_data, + }, + }, + { + .name = "regulator", + .id = DCDC_2, + .dev = { + .platform_data = &regulator2_data, + }, + }, + }; + /* register regulator 1 device */ + platform_device_register(&regulator_devices[0]); + + /* register regulator 2 device */ + platform_device_register(&regulator_devices[1]); diff --git a/Documentation/power/regulator/machine.txt b/Documentation/power/regulator/machine.txt deleted file mode 100644 index eff4dcaaa252..000000000000 --- 
a/Documentation/power/regulator/machine.txt +++ /dev/null @@ -1,96 +0,0 @@ -Regulator Machine Driver Interface -=================================== - -The regulator machine driver interface is intended for board/machine specific -initialisation code to configure the regulator subsystem. - -Consider the following machine :- - - Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V] - | - +-> [Consumer B @ 3.3V] - -The drivers for consumers A & B must be mapped to the correct regulator in -order to control their power supplies. This mapping can be achieved in machine -initialisation code by creating a struct regulator_consumer_supply for -each regulator. - -struct regulator_consumer_supply { - const char *dev_name; /* consumer dev_name() */ - const char *supply; /* consumer supply - e.g. "vcc" */ -}; - -e.g. for the machine above - -static struct regulator_consumer_supply regulator1_consumers[] = { - REGULATOR_SUPPLY("Vcc", "consumer B"), -}; - -static struct regulator_consumer_supply regulator2_consumers[] = { - REGULATOR_SUPPLY("Vcc", "consumer A"), -}; - -This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2 -to the 'Vcc' supply for Consumer A. - -Constraints can now be registered by defining a struct regulator_init_data -for each regulator power domain. This structure also maps the consumers -to their supply regulators :- - -static struct regulator_init_data regulator1_data = { - .constraints = { - .name = "Regulator-1", - .min_uV = 3300000, - .max_uV = 3300000, - .valid_modes_mask = REGULATOR_MODE_NORMAL, - }, - .num_consumer_supplies = ARRAY_SIZE(regulator1_consumers), - .consumer_supplies = regulator1_consumers, -}; - -The name field should be set to something that is usefully descriptive -for the board for configuration of supplies for other regulators and -for use in logging and other diagnostic output. Normally the name -used for the supply rail in the schematic is a good choice. 
If no -name is provided then the subsystem will choose one. - -Regulator-1 supplies power to Regulator-2. This relationship must be registered -with the core so that Regulator-1 is also enabled when Consumer A enables its -supply (Regulator-2). The supply regulator is set by the supply_regulator -field below and co:- - -static struct regulator_init_data regulator2_data = { - .supply_regulator = "Regulator-1", - .constraints = { - .min_uV = 1800000, - .max_uV = 2000000, - .valid_ops_mask = REGULATOR_CHANGE_VOLTAGE, - .valid_modes_mask = REGULATOR_MODE_NORMAL, - }, - .num_consumer_supplies = ARRAY_SIZE(regulator2_consumers), - .consumer_supplies = regulator2_consumers, -}; - -Finally the regulator devices must be registered in the usual manner. - -static struct platform_device regulator_devices[] = { - { - .name = "regulator", - .id = DCDC_1, - .dev = { - .platform_data = &regulator1_data, - }, - }, - { - .name = "regulator", - .id = DCDC_2, - .dev = { - .platform_data = &regulator2_data, - }, - }, -}; -/* register regulator 1 device */ -platform_device_register(&regulator_devices[0]); - -/* register regulator 2 device */ -platform_device_register(&regulator_devices[1]); diff --git a/Documentation/power/regulator/overview.rst b/Documentation/power/regulator/overview.rst new file mode 100644 index 000000000000..ee494c70a7c4 --- /dev/null +++ b/Documentation/power/regulator/overview.rst @@ -0,0 +1,178 @@ +============================================= +Linux voltage and current regulator framework +============================================= + +About +===== + +This framework is designed to provide a standard kernel interface to control +voltage and current regulators. + +The intention is to allow systems to dynamically control regulator power output +in order to save power and prolong battery life. This applies to both voltage +regulators (where voltage output is controllable) and current sinks (where +current limit is controllable). + +(C) 2008 Wolfson Microelectronics PLC. 
+ +Author: Liam Girdwood + + +Nomenclature +============ + +Some terms used in this document: + + - Regulator + - Electronic device that supplies power to other devices. + Most regulators can enable and disable their output while + some can control their output voltage and or current. + + Input Voltage -> Regulator -> Output Voltage + + + - PMIC + - Power Management IC. An IC that contains numerous + regulators and often contains other subsystems. + + + - Consumer + - Electronic device that is supplied power by a regulator. + Consumers can be classified into two types:- + + Static: consumer does not change its supply voltage or + current limit. It only needs to enable or disable its + power supply. Its supply voltage is set by the hardware, + bootloader, firmware or kernel board initialisation code. + + Dynamic: consumer needs to change its supply voltage or + current limit to meet operation demands. + + + - Power Domain + - Electronic circuit that is supplied its input power by the + output power of a regulator, switch or by another power + domain. + + The supply regulator may be behind a switch(s). i.e.:: + + Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A] + | | + | +-> [Consumer B], [Consumer C] + | + +-> [Consumer D], [Consumer E] + + That is one regulator and three power domains: + + - Domain 1: Switch-1, Consumers D & E. + - Domain 2: Switch-2, Consumers B & C. + - Domain 3: Consumer A. + + and this represents a "supplies" relationship: + + Domain-1 --> Domain-2 --> Domain-3. + + A power domain may have regulators that are supplied power + by other regulators. i.e.:: + + Regulator-1 -+-> Regulator-2 -+-> [Consumer A] + | + +-> [Consumer B] + + This gives us two regulators and two power domains: + + - Domain 1: Regulator-2, Consumer B. + - Domain 2: Consumer A. + + and a "supplies" relationship: + + Domain-1 --> Domain-2 + + + - Constraints + - Constraints are used to define power levels for performance + and hardware protection. 
Constraints exist at three levels: + + Regulator Level: This is defined by the regulator hardware + operating parameters and is specified in the regulator + datasheet. i.e. + + - voltage output is in the range 800mV -> 3500mV. + - regulator current output limit is 20mA @ 5V but is + 10mA @ 10V. + + Power Domain Level: This is defined in software by kernel + level board initialisation code. It is used to constrain a + power domain to a particular power range. i.e. + + - Domain-1 voltage is 3300mV + - Domain-2 voltage is 1400mV -> 1600mV + - Domain-3 current limit is 0mA -> 20mA. + + Consumer Level: This is defined by consumer drivers + dynamically setting voltage or current limit levels. + + e.g. a consumer backlight driver asks for a current increase + from 5mA to 10mA to increase LCD illumination. This passes + to through the levels as follows :- + + Consumer: need to increase LCD brightness. Lookup and + request next current mA value in brightness table (the + consumer driver could be used on several different + personalities based upon the same reference device). + + Power Domain: is the new current limit within the domain + operating limits for this domain and system state (e.g. + battery power, USB power) + + Regulator Domains: is the new current limit within the + regulator operating parameters for input/output voltage. + + If the regulator request passes all the constraint tests + then the new regulator value is applied. + + +Design +====== + +The framework is designed and targeted at SoC based devices but may also be +relevant to non SoC devices and is split into the following four interfaces:- + + + 1. Consumer driver interface. + + This uses a similar API to the kernel clock interface in that consumer + drivers can get and put a regulator (like they can with clocks atm) and + get/set voltage, current limit, mode, enable and disable. This should + allow consumers complete control over their supply voltage and current + limit. 
This also compiles out if not in use so drivers can be reused in + systems with no regulator based power control. + + See Documentation/power/regulator/consumer.rst + + 2. Regulator driver interface. + + This allows regulator drivers to register their regulators and provide + operations to the core. It also has a notifier call chain for propagating + regulator events to clients. + + See Documentation/power/regulator/regulator.rst + + 3. Machine interface. + + This interface is for machine specific code and allows the creation of + voltage/current domains (with constraints) for each regulator. It can + provide regulator constraints that will prevent device damage through + overvoltage or overcurrent caused by buggy client drivers. It also + allows the creation of a regulator tree whereby some regulators are + supplied by others (similar to a clock tree). + + See Documentation/power/regulator/machine.rst + + 4. Userspace ABI. + + The framework also exports a lot of useful voltage/current/opmode data to + userspace via sysfs. This could be used to help monitor device power + consumption and status. + + See Documentation/ABI/testing/sysfs-class-regulator diff --git a/Documentation/power/regulator/overview.txt b/Documentation/power/regulator/overview.txt deleted file mode 100644 index 721b4739ec32..000000000000 --- a/Documentation/power/regulator/overview.txt +++ /dev/null @@ -1,171 +0,0 @@ -Linux voltage and current regulator framework -============================================= - -About -===== - -This framework is designed to provide a standard kernel interface to control -voltage and current regulators. - -The intention is to allow systems to dynamically control regulator power output -in order to save power and prolong battery life. This applies to both voltage -regulators (where voltage output is controllable) and current sinks (where -current limit is controllable). - -(C) 2008 Wolfson Microelectronics PLC. 
-Author: Liam Girdwood - - -Nomenclature -============ - -Some terms used in this document:- - - o Regulator - Electronic device that supplies power to other devices. - Most regulators can enable and disable their output while - some can control their output voltage and or current. - - Input Voltage -> Regulator -> Output Voltage - - - o PMIC - Power Management IC. An IC that contains numerous regulators - and often contains other subsystems. - - - o Consumer - Electronic device that is supplied power by a regulator. - Consumers can be classified into two types:- - - Static: consumer does not change its supply voltage or - current limit. It only needs to enable or disable its - power supply. Its supply voltage is set by the hardware, - bootloader, firmware or kernel board initialisation code. - - Dynamic: consumer needs to change its supply voltage or - current limit to meet operation demands. - - - o Power Domain - Electronic circuit that is supplied its input power by the - output power of a regulator, switch or by another power - domain. - - The supply regulator may be behind a switch(s). i.e. - - Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A] - | | - | +-> [Consumer B], [Consumer C] - | - +-> [Consumer D], [Consumer E] - - That is one regulator and three power domains: - - Domain 1: Switch-1, Consumers D & E. - Domain 2: Switch-2, Consumers B & C. - Domain 3: Consumer A. - - and this represents a "supplies" relationship: - - Domain-1 --> Domain-2 --> Domain-3. - - A power domain may have regulators that are supplied power - by other regulators. i.e. - - Regulator-1 -+-> Regulator-2 -+-> [Consumer A] - | - +-> [Consumer B] - - This gives us two regulators and two power domains: - - Domain 1: Regulator-2, Consumer B. - Domain 2: Consumer A. - - and a "supplies" relationship: - - Domain-1 --> Domain-2 - - - o Constraints - Constraints are used to define power levels for performance - and hardware protection. 
Constraints exist at three levels: - - Regulator Level: This is defined by the regulator hardware - operating parameters and is specified in the regulator - datasheet. i.e. - - - voltage output is in the range 800mV -> 3500mV. - - regulator current output limit is 20mA @ 5V but is - 10mA @ 10V. - - Power Domain Level: This is defined in software by kernel - level board initialisation code. It is used to constrain a - power domain to a particular power range. i.e. - - - Domain-1 voltage is 3300mV - - Domain-2 voltage is 1400mV -> 1600mV - - Domain-3 current limit is 0mA -> 20mA. - - Consumer Level: This is defined by consumer drivers - dynamically setting voltage or current limit levels. - - e.g. a consumer backlight driver asks for a current increase - from 5mA to 10mA to increase LCD illumination. This passes - to through the levels as follows :- - - Consumer: need to increase LCD brightness. Lookup and - request next current mA value in brightness table (the - consumer driver could be used on several different - personalities based upon the same reference device). - - Power Domain: is the new current limit within the domain - operating limits for this domain and system state (e.g. - battery power, USB power) - - Regulator Domains: is the new current limit within the - regulator operating parameters for input/output voltage. - - If the regulator request passes all the constraint tests - then the new regulator value is applied. - - -Design -====== - -The framework is designed and targeted at SoC based devices but may also be -relevant to non SoC devices and is split into the following four interfaces:- - - - 1. Consumer driver interface. - - This uses a similar API to the kernel clock interface in that consumer - drivers can get and put a regulator (like they can with clocks atm) and - get/set voltage, current limit, mode, enable and disable. This should - allow consumers complete control over their supply voltage and current - limit. 
This also compiles out if not in use so drivers can be reused in - systems with no regulator based power control. - - See Documentation/power/regulator/consumer.txt - - 2. Regulator driver interface. - - This allows regulator drivers to register their regulators and provide - operations to the core. It also has a notifier call chain for propagating - regulator events to clients. - - See Documentation/power/regulator/regulator.txt - - 3. Machine interface. - - This interface is for machine specific code and allows the creation of - voltage/current domains (with constraints) for each regulator. It can - provide regulator constraints that will prevent device damage through - overvoltage or overcurrent caused by buggy client drivers. It also - allows the creation of a regulator tree whereby some regulators are - supplied by others (similar to a clock tree). - - See Documentation/power/regulator/machine.txt - - 4. Userspace ABI. - - The framework also exports a lot of useful voltage/current/opmode data to - userspace via sysfs. This could be used to help monitor device power - consumption and status. - - See Documentation/ABI/testing/sysfs-class-regulator diff --git a/Documentation/power/regulator/regulator.rst b/Documentation/power/regulator/regulator.rst new file mode 100644 index 000000000000..794b3256fbb9 --- /dev/null +++ b/Documentation/power/regulator/regulator.rst @@ -0,0 +1,32 @@ +========================== +Regulator Driver Interface +========================== + +The regulator driver interface is relatively simple and designed to allow +regulator drivers to register their services with the core framework. + + +Registration +============ + +Drivers can register a regulator by calling:: + + struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc, + const struct regulator_config *config); + +This will register the regulator's capabilities and operations to the regulator +core. 
+ +Regulators can be unregistered by calling:: + + void regulator_unregister(struct regulator_dev *rdev); + + +Regulator Events +================ + +Regulators can send events (e.g. overtemperature, undervoltage, etc) to +consumer drivers by calling:: + + int regulator_notifier_call_chain(struct regulator_dev *rdev, + unsigned long event, void *data); diff --git a/Documentation/power/regulator/regulator.txt b/Documentation/power/regulator/regulator.txt deleted file mode 100644 index b17e5833ce21..000000000000 --- a/Documentation/power/regulator/regulator.txt +++ /dev/null @@ -1,30 +0,0 @@ -Regulator Driver Interface -========================== - -The regulator driver interface is relatively simple and designed to allow -regulator drivers to register their services with the core framework. - - -Registration -============ - -Drivers can register a regulator by calling :- - -struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc, - const struct regulator_config *config); - -This will register the regulator's capabilities and operations to the regulator -core. - -Regulators can be unregistered by calling :- - -void regulator_unregister(struct regulator_dev *rdev); - - -Regulator Events -================ -Regulators can send events (e.g. overtemperature, undervoltage, etc) to -consumer drivers by calling :- - -int regulator_notifier_call_chain(struct regulator_dev *rdev, - unsigned long event, void *data); diff --git a/Documentation/power/runtime_pm.rst b/Documentation/power/runtime_pm.rst new file mode 100644 index 000000000000..2c2ec99b5088 --- /dev/null +++ b/Documentation/power/runtime_pm.rst @@ -0,0 +1,940 @@ +================================================== +Runtime Power Management Framework for I/O Devices +================================================== + +(C) 2009-2011 Rafael J. Wysocki , Novell Inc. + +(C) 2010 Alan Stern + +(C) 2014 Intel Corp., Rafael J. Wysocki + +1. 
Introduction +=============== + +Support for runtime power management (runtime PM) of I/O devices is provided +at the power management core (PM core) level by means of: + +* The power management workqueue pm_wq in which bus types and device drivers can + put their PM-related work items. It is strongly recommended that pm_wq be + used for queuing all work items related to runtime PM, because this allows + them to be synchronized with system-wide power transitions (suspend to RAM, + hibernation and resume from system sleep states). pm_wq is declared in + include/linux/pm_runtime.h and defined in kernel/power/main.c. + +* A number of runtime PM fields in the 'power' member of 'struct device' (which + is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can + be used for synchronizing runtime PM operations with one another. + +* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in + include/linux/pm.h). + +* A set of helper functions defined in drivers/base/power/runtime.c that can be + used for carrying out runtime PM operations in such a way that the + synchronization between them is taken care of by the PM core. Bus types and + device drivers are encouraged to use these functions. + +The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM +fields of 'struct dev_pm_info' and the core helper functions provided for +runtime PM are described below. + +2. Device Runtime PM Callbacks +============================== + +There are three device runtime PM callbacks defined in 'struct dev_pm_ops':: + + struct dev_pm_ops { + ... + int (*runtime_suspend)(struct device *dev); + int (*runtime_resume)(struct device *dev); + int (*runtime_idle)(struct device *dev); + ... + }; + +The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks +are executed by the PM core for the device's subsystem that may be either of +the following: + + 1. 
PM domain of the device, if the device's PM domain object, dev->pm_domain, + is present. + + 2. Device type of the device, if both dev->type and dev->type->pm are present. + + 3. Device class of the device, if both dev->class and dev->class->pm are + present. + + 4. Bus type of the device, if both dev->bus and dev->bus->pm are present. + +If the subsystem chosen by applying the above rules doesn't provide the relevant +callback, the PM core will invoke the corresponding driver callback stored in +dev->driver->pm directly (if present). + +The PM core always checks which callback to use in the order given above, so the +priority order of callbacks from high to low is: PM domain, device type, class +and bus type. Moreover, the high-priority one will always take precedence over +a low-priority one. The PM domain, bus type, device type and class callbacks +are referred to as subsystem-level callbacks in what follows. + +By default, the callbacks are always invoked in process context with interrupts +enabled. However, the pm_runtime_irq_safe() helper function can be used to tell +the PM core that it is safe to run the ->runtime_suspend(), ->runtime_resume() +and ->runtime_idle() callbacks for the given device in atomic context with +interrupts disabled. This implies that the callback routines in question must +not block or sleep, but it also means that the synchronous helper functions +listed at the end of Section 4 may be used for that device within an interrupt +handler or generally in an atomic context. + +The subsystem-level suspend callback, if present, is _entirely_ _responsible_ +for handling the suspend of the device as appropriate, which may, but need not +include executing the device driver's own ->runtime_suspend() callback (from the +PM core's point of view it is not necessary to implement a ->runtime_suspend() +callback in a device driver as long as the subsystem-level suspend callback +knows what to do to handle the device). 
+ + * Once the subsystem-level suspend callback (or the driver suspend callback, + if invoked directly) has completed successfully for the given device, the PM + core regards the device as suspended, which need not mean that it has been + put into a low power state. It is supposed to mean, however, that the + device will not process data and will not communicate with the CPU(s) and + RAM until the appropriate resume callback is executed for it. The runtime + PM status of a device after successful execution of the suspend callback is + 'suspended'. + + * If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM + status remains 'active', which means that the device _must_ be fully + operational afterwards. + + * If the suspend callback returns an error code different from -EBUSY and + -EAGAIN, the PM core regards this as a fatal error and will refuse to run + the helper functions described in Section 4 for the device until its status + is directly set to either 'active', or 'suspended' (the PM core provides + special helper functions for this purpose). + +In particular, if the driver requires remote wakeup capability (i.e. hardware +mechanism allowing the device to request a change of its power state, such as +PCI PME) for proper functioning and device_can_wakeup() returns 'false' for the +device, then ->runtime_suspend() should return -EBUSY. On the other hand, if +device_can_wakeup() returns 'true' for the device and the device is put into a +low-power state during the execution of the suspend callback, it is expected +that remote wakeup will be enabled for the device. Generally, remote wakeup +should be enabled for all input devices put into low-power states at run time. 
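
The suspend outcomes and the remote wakeup rule described above can be illustrated with a small userspace model. This is only a sketch: the toy_ types and functions are hypothetical stand-ins, not the kernel's 'struct device' or its PM core helpers, and the -EINVAL returned while an error is latched is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-ins for illustration only; NOT kernel definitions. */
enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_SUSPENDED };

struct toy_device {
	enum toy_rpm_status status;
	int runtime_error;        /* nonzero after a fatal callback error */
	bool can_wakeup;          /* models what device_can_wakeup() reports */
	bool needs_remote_wakeup; /* driver cannot work without remote wakeup */
	int (*runtime_suspend)(struct toy_device *dev);
};

/* Driver-style callback: refuse to suspend if remote wakeup is unavailable. */
int toy_driver_suspend(struct toy_device *dev)
{
	if (dev->needs_remote_wakeup && !dev->can_wakeup)
		return -EBUSY;
	return 0;
}

/* Core-style helper applying the three outcomes of the suspend callback. */
int toy_rpm_suspend(struct toy_device *dev)
{
	int ret;

	if (dev->runtime_error)
		return -EINVAL;	/* helpers refuse until the status is reset */
	if (dev->status == TOY_RPM_SUSPENDED)
		return 1;	/* already suspended */

	ret = dev->runtime_suspend(dev);
	if (ret == 0)
		dev->status = TOY_RPM_SUSPENDED;
	else if (ret != -EBUSY && ret != -EAGAIN)
		dev->runtime_error = ret;	/* fatal: latch the error */
	/* on -EBUSY or -EAGAIN the status simply stays 'active' */
	return ret;
}
```

In this model a device that needs remote wakeup but cannot generate it stays 'active' without latching an error, so a later suspend attempt is still allowed.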
+ +The subsystem-level resume callback, if present, is **entirely responsible** for +handling the resume of the device as appropriate, which may, but need not +include executing the device driver's own ->runtime_resume() callback (from the +PM core's point of view it is not necessary to implement a ->runtime_resume() +callback in a device driver as long as the subsystem-level resume callback knows +what to do to handle the device). + + * Once the subsystem-level resume callback (or the driver resume callback, if + invoked directly) has completed successfully, the PM core regards the device + as fully operational, which means that the device _must_ be able to complete + I/O operations as needed. The runtime PM status of the device is then + 'active'. + + * If the resume callback returns an error code, the PM core regards this as a + fatal error and will refuse to run the helper functions described in Section + 4 for the device, until its status is directly set to either 'active', or + 'suspended' (by means of special helper functions provided by the PM core + for this purpose). + +The idle callback (a subsystem-level one, if present, or the driver one) is +executed by the PM core whenever the device appears to be idle, which is +indicated to the PM core by two counters, the device's usage counter and the +counter of 'active' children of the device. + + * If any of these counters is decreased using a helper function provided by + the PM core and it turns out to be equal to zero, the other counter is + checked. If that counter also is equal to zero, the PM core executes the + idle callback with the device as its argument. + +The action performed by the idle callback is totally dependent on the subsystem +(or driver) in question, but the expected and recommended action is to check +if the device can be suspended (i.e. if all of the conditions necessary for +suspending the device are satisfied) and to queue up a suspend request for the +device in that case. 
If there is no idle callback, or if the callback returns +0, then the PM core will attempt to carry out a runtime suspend of the device, +also respecting devices configured for autosuspend. In essence this means a +call to pm_runtime_autosuspend() (do note that drivers need to update the +device's last busy mark, pm_runtime_mark_last_busy(), to control the delay under +this circumstance). To prevent this (for example, if the callback routine has +started a delayed suspend), the routine must return a non-zero value. Negative +error return codes are ignored by the PM core. + +The helper functions provided by the PM core, described in Section 4, guarantee +that the following constraints are met with respect to runtime PM callbacks for +one device: + +(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute + ->runtime_suspend() in parallel with ->runtime_resume() or with another + instance of ->runtime_suspend() for the same device) with the exception that + ->runtime_suspend() or ->runtime_resume() can be executed in parallel with + ->runtime_idle() (although ->runtime_idle() will not be started while any + of the other callbacks is being executed for the same device). + +(2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active' + devices (i.e. the PM core will only execute ->runtime_idle() or + ->runtime_suspend() for the devices the runtime PM status of which is + 'active'). + +(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device + the usage counter of which is equal to zero _and_ either the counter of + 'active' children of which is equal to zero, or the 'power.ignore_children' + flag of which is set. + +(4) ->runtime_resume() can only be executed for 'suspended' devices (i.e. the + PM core will only execute ->runtime_resume() for the devices the runtime + PM status of which is 'suspended').
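
Constraints (2) and (3) above amount to a single predicate over the device's counters and flags. The following userspace sketch models that predicate; the toy_ names are hypothetical and the structure is a simplified stand-in for the relevant 'struct dev_pm_info' fields, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for a few 'struct dev_pm_info' fields. */
struct toy_pm_info {
	int usage_count;      /* models power.usage_count */
	int child_count;      /* models the count of 'active' children */
	bool ignore_children; /* models power.ignore_children */
	bool active;          /* true if the runtime PM status is 'active' */
};

/* Constraints (2) and (3): may ->runtime_idle()/->runtime_suspend() run? */
bool toy_may_idle_or_suspend(const struct toy_pm_info *p)
{
	if (!p->active)
		return false;	/* (2): only 'active' devices */
	if (p->usage_count != 0)
		return false;	/* (3): usage counter must be zero */
	return p->child_count == 0 || p->ignore_children;
}

/* Drop one usage reference and report whether idle could now be invoked. */
bool toy_put(struct toy_pm_info *p)
{
	p->usage_count--;
	return toy_may_idle_or_suspend(p);
}
```

Note that in this model an 'active' child blocks the parent even when the parent's own usage counter has reached zero, unless ignore_children is set.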
+ +Additionally, the helper functions provided by the PM core obey the following +rules: + + * If ->runtime_suspend() is about to be executed or there's a pending request + to execute it, ->runtime_idle() will not be executed for the same device. + + * A request to execute or to schedule the execution of ->runtime_suspend() + will cancel any pending requests to execute ->runtime_idle() for the same + device. + + * If ->runtime_resume() is about to be executed or there's a pending request + to execute it, the other callbacks will not be executed for the same device. + + * A request to execute ->runtime_resume() will cancel any pending or + scheduled requests to execute the other callbacks for the same device, + except for scheduled autosuspends. + +3. Runtime PM Device Fields +=========================== + +The following device runtime PM fields are present in 'struct dev_pm_info', as +defined in include/linux/pm.h: + + `struct timer_list suspend_timer;` + - timer used for scheduling (delayed) suspend and autosuspend requests + + `unsigned long timer_expires;` + - timer expiration time, in jiffies (if this is different from zero, the + timer is running and will expire at that time, otherwise the timer is not + running) + + `struct work_struct work;` + - work structure used for queuing up requests (i.e. work items in pm_wq) + + `wait_queue_head_t wait_queue;` + - wait queue used if any of the helper functions needs to wait for another + one to complete + + `spinlock_t lock;` + - lock used for synchronization + + `atomic_t usage_count;` + - the usage counter of the device + + `atomic_t child_count;` + - the count of 'active' children of the device + + `unsigned int ignore_children;` + - if set, the value of child_count is ignored (but still updated) + + `unsigned int disable_depth;` + - used for disabling the helper functions (they work normally if this is + equal to zero); the initial value of it is 1 (i.e. 
runtime PM is + initially disabled for all devices) + + `int runtime_error;` + - if set, there was a fatal error (one of the callbacks returned an error code + as described in Section 2), so the helper functions will not work until + this flag is cleared; this is the error code returned by the failing + callback + + `unsigned int idle_notification;` + - if set, ->runtime_idle() is being executed + + `unsigned int request_pending;` + - if set, there's a pending request (i.e. a work item queued up into pm_wq) + + `enum rpm_request request;` + - type of request that's pending (valid if request_pending is set) + + `unsigned int deferred_resume;` + - set if ->runtime_resume() is about to be run while ->runtime_suspend() is + being executed for that device and it is not practical to wait for the + suspend to complete; means "start a resume as soon as you've suspended" + + `enum rpm_status runtime_status;` + - the runtime PM status of the device; this field's initial value is + RPM_SUSPENDED, which means that each device is initially regarded by the + PM core as 'suspended', regardless of its real hardware status + + `unsigned int runtime_auto;` + - if set, indicates that the user space has allowed the device driver to + power manage the device at run time via the /sys/devices/.../power/control + interface; it may only be modified with the help of the pm_runtime_allow() + and pm_runtime_forbid() helper functions + + `unsigned int no_callbacks;` + - indicates that the device does not use the runtime PM callbacks (see + Section 8); it may be modified only by the pm_runtime_no_callbacks() + helper function + + `unsigned int irq_safe;` + - indicates that the ->runtime_suspend() and ->runtime_resume() callbacks + will be invoked with the spinlock held and interrupts disabled + + `unsigned int use_autosuspend;` + - indicates that the device's driver supports delayed autosuspend (see + Section 9); it may be modified only by the + pm_runtime{_dont}_use_autosuspend() helper
functions + + `unsigned int timer_autosuspends;` + - indicates that the PM core should attempt to carry out an autosuspend + when the timer expires rather than a normal suspend + + `int autosuspend_delay;` + - the delay time (in milliseconds) to be used for autosuspend + + `unsigned long last_busy;` + - the time (in jiffies) when the pm_runtime_mark_last_busy() helper + function was last called for this device; used in calculating inactivity + periods for autosuspend + +All of the above fields are members of the 'power' member of 'struct device'. + +4. Runtime PM Device Helper Functions +===================================== + +The following runtime PM helper functions are defined in +drivers/base/power/runtime.c and include/linux/pm_runtime.h: + + `void pm_runtime_init(struct device *dev);` + - initialize the device runtime PM fields in 'struct dev_pm_info' + + `void pm_runtime_remove(struct device *dev);` + - make sure that the runtime PM of the device will be disabled after + removing the device from device hierarchy + + `int pm_runtime_idle(struct device *dev);` + - execute the subsystem-level idle callback for the device; returns an + error code on failure, where -EINPROGRESS means that ->runtime_idle() is + already being executed; if there is no callback or the callback returns 0 + then run pm_runtime_autosuspend(dev) and return its result + + `int pm_runtime_suspend(struct device *dev);` + - execute the subsystem-level suspend callback for the device; returns 0 on + success, 1 if the device's runtime PM status was already 'suspended', or + error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt + to suspend the device again in future and -EACCES means that + 'power.disable_depth' is different from 0 + + `int pm_runtime_autosuspend(struct device *dev);` + - same as pm_runtime_suspend() except that the autosuspend delay is taken + into account; if pm_runtime_autosuspend_expiration() says the delay has + not yet expired then an autosuspend
is scheduled for the appropriate time + and 0 is returned + + `int pm_runtime_resume(struct device *dev);` + - execute the subsystem-level resume callback for the device; returns 0 on + success, 1 if the device's runtime PM status was already 'active' or + error code on failure, where -EAGAIN means it may be safe to attempt to + resume the device again in future, but 'power.runtime_error' should be + checked additionally, and -EACCES means that 'power.disable_depth' is + different from 0 + + `int pm_request_idle(struct device *dev);` + - submit a request to execute the subsystem-level idle callback for the + device (the request is represented by a work item in pm_wq); returns 0 on + success or error code if the request has not been queued up + + `int pm_request_autosuspend(struct device *dev);` + - schedule the execution of the subsystem-level suspend callback for the + device when the autosuspend delay has expired; if the delay has already + expired then the work item is queued up immediately + + `int pm_schedule_suspend(struct device *dev, unsigned int delay);` + - schedule the execution of the subsystem-level suspend callback for the + device in future, where 'delay' is the time to wait before queuing up a + suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work + item is queued up immediately); returns 0 on success, 1 if the device's PM + runtime status was already 'suspended', or error code if the request + hasn't been scheduled (or queued up if 'delay' is 0); if the execution of + ->runtime_suspend() is already scheduled and not yet expired, the new + value of 'delay' will be used as the time to wait + + `int pm_request_resume(struct device *dev);` + - submit a request to execute the subsystem-level resume callback for the + device (the request is represented by a work item in pm_wq); returns 0 on + success, 1 if the device's runtime PM status was already 'active', or + error code if the request hasn't been queued up + + `void 
pm_runtime_get_noresume(struct device *dev);` + - increment the device's usage counter + + `int pm_runtime_get(struct device *dev);` + - increment the device's usage counter, run pm_request_resume(dev) and + return its result + + `int pm_runtime_get_sync(struct device *dev);` + - increment the device's usage counter, run pm_runtime_resume(dev) and + return its result + + `int pm_runtime_get_if_in_use(struct device *dev);` + - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the + runtime PM status is RPM_ACTIVE and the runtime PM usage counter is + nonzero, increment the counter and return 1; otherwise return 0 without + changing the counter + + `void pm_runtime_put_noidle(struct device *dev);` + - decrement the device's usage counter + + `int pm_runtime_put(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_request_idle(dev) and return its result + + `int pm_runtime_put_autosuspend(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_request_autosuspend(dev) and return its result + + `int pm_runtime_put_sync(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_runtime_idle(dev) and return its result + + `int pm_runtime_put_sync_suspend(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_runtime_suspend(dev) and return its result + + `int pm_runtime_put_sync_autosuspend(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_runtime_autosuspend(dev) and return its result + + `void pm_runtime_enable(struct device *dev);` + - decrement the device's 'power.disable_depth' field; if that field is equal + to zero, the runtime PM helper functions can execute subsystem-level + callbacks described in Section 2 for the device + + `int pm_runtime_disable(struct device *dev);` + - increment the device's 'power.disable_depth' field (if the value 
of that + field was previously zero, this prevents subsystem-level runtime PM + callbacks from being run for the device), make sure that all of the + pending runtime PM operations on the device are either completed or + canceled; returns 1 if there was a resume request pending and it was + necessary to execute the subsystem-level resume callback for the device + to satisfy that request, otherwise 0 is returned + + `int pm_runtime_barrier(struct device *dev);` + - check if there's a resume request pending for the device and resume it + (synchronously) in that case, cancel any other pending runtime PM requests + regarding it and wait for all runtime PM operations on it in progress to + complete; returns 1 if there was a resume request pending and it was + necessary to execute the subsystem-level resume callback for the device to + satisfy that request, otherwise 0 is returned + + `void pm_suspend_ignore_children(struct device *dev, bool enable);` + - set/unset the power.ignore_children flag of the device + + `int pm_runtime_set_active(struct device *dev);` + - clear the device's 'power.runtime_error' flag, set the device's runtime + PM status to 'active' and update its parent's counter of 'active' + children as appropriate (it is only valid to use this function if + 'power.runtime_error' is set or 'power.disable_depth' is greater than + zero); it will fail and return error code if the device has a parent + which is not active and the 'power.ignore_children' flag of which is unset + + `void pm_runtime_set_suspended(struct device *dev);` + - clear the device's 'power.runtime_error' flag, set the device's runtime + PM status to 'suspended' and update its parent's counter of 'active' + children as appropriate (it is only valid to use this function if + 'power.runtime_error' is set or 'power.disable_depth' is greater than + zero) + + `bool pm_runtime_active(struct device *dev);` + - return true if the device's runtime PM status is 'active' or its + 'power.disable_depth' 
field is not equal to zero, or false otherwise + + `bool pm_runtime_suspended(struct device *dev);` + - return true if the device's runtime PM status is 'suspended' and its + 'power.disable_depth' field is equal to zero, or false otherwise + + `bool pm_runtime_status_suspended(struct device *dev);` + - return true if the device's runtime PM status is 'suspended' + + `void pm_runtime_allow(struct device *dev);` + - set the power.runtime_auto flag for the device and decrease its usage + counter (used by the /sys/devices/.../power/control interface to + effectively allow the device to be power managed at run time) + + `void pm_runtime_forbid(struct device *dev);` + - unset the power.runtime_auto flag for the device and increase its usage + counter (used by the /sys/devices/.../power/control interface to + effectively prevent the device from being power managed at run time) + + `void pm_runtime_no_callbacks(struct device *dev);` + - set the power.no_callbacks flag for the device and remove the runtime + PM attributes from /sys/devices/.../power (or prevent them from being + added when the device is registered) + + `void pm_runtime_irq_safe(struct device *dev);` + - set the power.irq_safe flag for the device, causing the runtime-PM + callbacks to be invoked with interrupts off + + `bool pm_runtime_is_irq_safe(struct device *dev);` + - return true if power.irq_safe flag was set for the device, causing + the runtime-PM callbacks to be invoked with interrupts off + + `void pm_runtime_mark_last_busy(struct device *dev);` + - set the power.last_busy field to the current time + + `void pm_runtime_use_autosuspend(struct device *dev);` + - set the power.use_autosuspend flag, enabling autosuspend delays; call + pm_runtime_get_sync if the flag was previously cleared and + power.autosuspend_delay is negative + + `void pm_runtime_dont_use_autosuspend(struct device *dev);` + - clear the power.use_autosuspend flag, disabling autosuspend delays; + decrement the device's usage counter 
if the flag was previously set and + power.autosuspend_delay is negative; call pm_runtime_idle + + `void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);` + - set the power.autosuspend_delay value to 'delay' (expressed in + milliseconds); if 'delay' is negative then runtime suspends are + prevented; if power.use_autosuspend is set, pm_runtime_get_sync may be + called or the device's usage counter may be decremented and + pm_runtime_idle called depending on if power.autosuspend_delay is + changed to or from a negative value; if power.use_autosuspend is clear, + pm_runtime_idle is called + + `unsigned long pm_runtime_autosuspend_expiration(struct device *dev);` + - calculate the time when the current autosuspend delay period will expire, + based on power.last_busy and power.autosuspend_delay; if the delay time + is 1000 ms or larger then the expiration time is rounded up to the + nearest second; returns 0 if the delay period has already expired or + power.use_autosuspend isn't set, otherwise returns the expiration time + in jiffies + +It is safe to execute the following helper functions from interrupt context: + +- pm_request_idle() +- pm_request_autosuspend() +- pm_schedule_suspend() +- pm_request_resume() +- pm_runtime_get_noresume() +- pm_runtime_get() +- pm_runtime_put_noidle() +- pm_runtime_put() +- pm_runtime_put_autosuspend() +- pm_runtime_enable() +- pm_suspend_ignore_children() +- pm_runtime_set_active() +- pm_runtime_set_suspended() +- pm_runtime_suspended() +- pm_runtime_mark_last_busy() +- pm_runtime_autosuspend_expiration() + +If pm_runtime_irq_safe() has been called for a device then the following helper +functions may also be used in interrupt context: + +- pm_runtime_idle() +- pm_runtime_suspend() +- pm_runtime_autosuspend() +- pm_runtime_resume() +- pm_runtime_get_sync() +- pm_runtime_put_sync() +- pm_runtime_put_sync_suspend() +- pm_runtime_put_sync_autosuspend() + +5. 
Runtime PM Initialization, Device Probing and Removal +======================================================== + +Initially, the runtime PM is disabled for all devices, which means that the +majority of the runtime PM helper functions described in Section 4 will return +-EAGAIN until pm_runtime_enable() is called for the device. + +In addition to that, the initial runtime PM status of all devices is +'suspended', but it need not reflect the actual physical state of the device. +Thus, if the device is initially active (i.e. it is able to process I/O), its +runtime PM status must be changed to 'active', with the help of +pm_runtime_set_active(), before pm_runtime_enable() is called for the device. + +However, if the device has a parent and the parent's runtime PM is enabled, +calling pm_runtime_set_active() for the device will affect the parent, unless +the parent's 'power.ignore_children' flag is set. Namely, in that case the +parent won't be able to suspend at run time, using the PM core's helper +functions, as long as the child's status is 'active', even if the child's +runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for +the child yet or pm_runtime_disable() has been called for it). For this reason, +once pm_runtime_set_active() has been called for the device, pm_runtime_enable() +should be called for it too as soon as reasonably possible or its runtime PM +status should be changed back to 'suspended' with the help of +pm_runtime_set_suspended(). + +If the default initial runtime PM status of the device (i.e. 'suspended') +reflects the actual state of the device, its bus type's or its driver's +->probe() callback will likely need to wake it up using one of the PM core's +helper functions described in Section 4. In that case, pm_runtime_resume() +should be used. Of course, for this purpose the device's runtime PM has to be +enabled earlier by calling pm_runtime_enable(). 
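
The two probe-time orderings described above (mark the device 'active' before enabling runtime PM if the hardware is already on, or enable runtime PM first and then resume if it is not) can be modelled in userspace as follows. This is a sketch only: all toy_ names are hypothetical, and the -1 return values are placeholders rather than the kernel's actual error codes.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the relevant per-device runtime PM state. */
struct toy_dev {
	int disable_depth;  /* models power.disable_depth */
	bool status_active; /* true if the runtime PM status is 'active' */
	bool hw_on;         /* the real hardware state, unknown to the core */
};

/* Initial state: runtime PM disabled, status 'suspended'. */
void toy_pm_init(struct toy_dev *d)
{
	d->disable_depth = 1;
	d->status_active = false;
}

void toy_pm_enable(struct toy_dev *d)
{
	if (d->disable_depth > 0)
		d->disable_depth--;
}

/* Only valid while runtime PM is disabled (error state not modelled). */
int toy_pm_set_active(struct toy_dev *d)
{
	if (d->disable_depth == 0)
		return -1;	/* placeholder error */
	d->status_active = true;
	return 0;
}

/* Refuses to run until runtime PM has been enabled. */
int toy_pm_resume(struct toy_dev *d)
{
	if (d->disable_depth != 0)
		return -1;	/* placeholder error */
	if (d->status_active)
		return 1;	/* already 'active' */
	d->hw_on = true;
	d->status_active = true;
	return 0;
}

/* Probe-style setup choosing the ordering that matches the hardware. */
int toy_probe(struct toy_dev *d)
{
	toy_pm_init(d);
	if (d->hw_on) {
		/* Hardware already running: fix up the status, then enable. */
		if (toy_pm_set_active(d) < 0)
			return -1;
		toy_pm_enable(d);
		return 0;
	}
	/* Hardware really is suspended: enable first, then wake it up. */
	toy_pm_enable(d);
	return toy_pm_resume(d) < 0 ? -1 : 0;
}
```

Either path ends with runtime PM enabled and the recorded status matching the hardware, which is the invariant the text above asks drivers to establish.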
+ +Note, if the device may execute pm_runtime calls during the probe (such as +if it registers with a subsystem that may call back in) then a +pm_runtime_get_sync() call paired with a pm_runtime_put() call will be +appropriate to ensure that the device is not put back to sleep during the +probe. This can happen with systems such as the network device layer. + +It may be desirable to suspend the device once ->probe() has finished. +Therefore the driver core uses the asynchronous pm_request_idle() to submit a +request to execute the subsystem-level idle callback for the device at that +time. A driver that makes use of the runtime autosuspend feature may want to +update the last busy mark before returning from ->probe(). + +Moreover, the driver core prevents runtime PM callbacks from racing with the bus +notifier callback in __device_release_driver(), which is necessary, because the +notifier is used by some subsystems to carry out operations affecting the +runtime PM functionality. It does so by calling pm_runtime_get_sync() before +driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications. This +resumes the device if it's in the suspended state and prevents it from +being suspended again while those routines are being executed. + +To allow bus types and drivers to put devices into the suspended state by +calling pm_runtime_suspend() from their ->remove() routines, the driver core +executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER +notifications in __device_release_driver(). This requires bus types and +drivers to make their ->remove() callbacks avoid races with runtime PM directly, +but it also allows more flexibility in the handling of devices during the +removal of their drivers. + +In their ->remove() callbacks, drivers should undo the runtime PM changes done +in ->probe(). Usually this means calling pm_runtime_disable(), +pm_runtime_dont_use_autosuspend() etc.
+ +The user space can effectively disallow the driver of the device to power manage +it at run time by changing the value of its /sys/devices/.../power/control +attribute to "on", which causes pm_runtime_forbid() to be called. In principle, +this mechanism may also be used by the driver to effectively turn off the +runtime power management of the device until the user space turns it on. +Namely, during the initialization the driver can make sure that the runtime PM +status of the device is 'active' and call pm_runtime_forbid(). It should be +noted, however, that if the user space has already intentionally changed the +value of /sys/devices/.../power/control to "auto" to allow the driver to power +manage the device at run time, the driver may confuse it by using +pm_runtime_forbid() this way. + +6. Runtime PM and System Sleep +============================== + +Runtime PM and system sleep (i.e., system suspend and hibernation, also known +as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of +ways. If a device is active when a system sleep starts, everything is +straightforward. But what should happen if the device is already suspended? + +The device may have different wake-up settings for runtime PM and system sleep. +For example, remote wake-up may be enabled for runtime suspend but disallowed +for system sleep (device_may_wakeup(dev) returns 'false'). When this happens, +the subsystem-level system suspend callback is responsible for changing the +device's wake-up setting (it may leave that to the device driver's system +suspend routine). It may be necessary to resume the device and suspend it again +in order to do so. The same is true if the driver uses different power levels +or other settings for runtime suspend and system sleep. + +During system resume, the simplest approach is to bring all devices back to full +power, even if they had been suspended before the system suspend began. 
There +are several reasons for this, including: + + * The device might need to switch power levels, wake-up settings, etc. + + * Remote wake-up events might have been lost by the firmware. + + * The device's children may need the device to be at full power in order + to resume themselves. + + * The driver's idea of the device state may not agree with the device's + physical state. This can happen during resume from hibernation. + + * The device might need to be reset. + + * Even though the device was suspended, if its usage counter was > 0 then most + likely it would need a runtime resume in the near future anyway. + +If the device had been suspended before the system suspend began and it's +brought back to full power during resume, then its runtime PM status will have +to be updated to reflect the actual post-system sleep status. The way to do +this is: + + - pm_runtime_disable(dev); + - pm_runtime_set_active(dev); + - pm_runtime_enable(dev); + +The PM core always increments the runtime usage counter before calling the +->suspend() callback and decrements it after calling the ->resume() callback. +Hence disabling runtime PM temporarily like this will not cause any runtime +suspend attempts to be permanently lost. If the usage count goes to zero +following the return of the ->resume() callback, the ->runtime_idle() callback +will be invoked as usual. + +On some systems, however, system sleep is not entered through a global firmware +or hardware operation. Instead, all hardware components are put into low-power +states directly by the kernel in a coordinated way. Then, the system sleep +state effectively follows from the states the hardware components end up in +and the system is woken up from that state by a hardware interrupt or a similar +mechanism entirely under the kernel's control. As a result, the kernel never +gives control away and the states of all devices during resume are precisely +known to it. 
If that is the case and none of the situations listed above takes +place (in particular, if the system is not waking up from hibernation), it may +be more efficient to leave the devices that had been suspended before the system +suspend began in the suspended state. + +To this end, the PM core provides a mechanism allowing some coordination between +different levels of device hierarchy. Namely, if a system suspend .prepare() +callback returns a positive number for a device, that indicates to the PM core +that the device appears to be runtime-suspended and its state is fine, so it +may be left in runtime suspend provided that all of its descendants are also +left in runtime suspend. If that happens, the PM core will not execute any +system suspend and resume callbacks for all of those devices, except for the +complete callback, which is then entirely responsible for handling the device +as appropriate. This only applies to system suspend transitions that are not +related to hibernation (see Documentation/driver-api/pm/devices.rst for more +information). + +The PM core does its best to reduce the probability of race conditions between +the runtime PM and system suspend/resume (and hibernation) callbacks by carrying +out the following operations: + + * During system suspend pm_runtime_get_noresume() is called for every device + right before executing the subsystem-level .prepare() callback for it and + pm_runtime_barrier() is called for every device right before executing the + subsystem-level .suspend() callback for it. In addition to that the PM core + calls __pm_runtime_disable() with 'false' as the second argument for every + device right before executing the subsystem-level .suspend_late() callback + for it. + + * During system resume pm_runtime_enable() and pm_runtime_put() are called for + every device right after executing the subsystem-level .resume_early() + callback and right after executing the subsystem-level .complete() callback + for it, respectively. 
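The status update sequence given earlier in this section (disable, set
'active', enable) might look as follows in a system resume callback. This is a
user-space sketch under stated assumptions: 'struct device' is a stand-in with
two fields from Section 3, the pm_runtime_*() functions are simplified models
written for this example (the -EAGAIN returned by the model of
pm_runtime_set_active() while runtime PM is enabled is an assumption of the
model), and foo_resume() and foo_resume_demo() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* User-space stand-in for the runtime PM fields of 'struct device'. */
enum rpm_status { RPM_ACTIVE, RPM_SUSPENDED };

struct device {
	enum rpm_status runtime_status;
	unsigned int disable_depth;	/* 0 means runtime PM is enabled */
};

/* Simplified models of the PM core helpers named in the text. */
static void pm_runtime_disable(struct device *dev)
{
	dev->disable_depth++;
}

static void pm_runtime_enable(struct device *dev)
{
	if (dev->disable_depth > 0)
		dev->disable_depth--;
}

/* Model: the status can be set directly only while runtime PM is disabled. */
static int pm_runtime_set_active(struct device *dev)
{
	if (dev->disable_depth == 0)
		return -EAGAIN;
	dev->runtime_status = RPM_ACTIVE;
	return 0;
}

/* Hypothetical system resume callback of a driver. */
static int foo_resume(struct device *dev)
{
	int ret;

	/* ... bring the hardware back to full power ... */

	/* Make the runtime PM status match the hardware again. */
	pm_runtime_disable(dev);
	ret = pm_runtime_set_active(dev);
	pm_runtime_enable(dev);
	return ret;
}

/* Device was runtime-suspended (and enabled) before the system suspend. */
static int foo_resume_demo(void)
{
	struct device dev = {
		.runtime_status = RPM_SUSPENDED,
		.disable_depth = 0,
	};

	/* Without the temporary disable, the model refuses the change. */
	assert(pm_runtime_set_active(&dev) == -EAGAIN);

	assert(foo_resume(&dev) == 0);
	assert(dev.runtime_status == RPM_ACTIVE);
	assert(dev.disable_depth == 0);
	return 0;
}
```

The temporary disable is what makes the direct status change legal; runtime PM
is enabled again before the callback returns.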
+ +7. Generic subsystem callbacks +============================== + +Subsystems may wish to conserve code space by using the set of generic power +management callbacks provided by the PM core, defined in +drivers/base/power/generic_ops.c: + + `int pm_generic_runtime_suspend(struct device *dev);` + - invoke the ->runtime_suspend() callback provided by the driver of this + device and return its result, or return 0 if not defined + + `int pm_generic_runtime_resume(struct device *dev);` + - invoke the ->runtime_resume() callback provided by the driver of this + device and return its result, or return 0 if not defined + + `int pm_generic_suspend(struct device *dev);` + - if the device has not been suspended at run time, invoke the ->suspend() + callback provided by its driver and return its result, or return 0 if not + defined + + `int pm_generic_suspend_noirq(struct device *dev);` + - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq() + callback provided by the device's driver and return its result, or return + 0 if not defined + + `int pm_generic_resume(struct device *dev);` + - invoke the ->resume() callback provided by the driver of this device and, + if successful, change the device's runtime PM status to 'active' + + `int pm_generic_resume_noirq(struct device *dev);` + - invoke the ->resume_noirq() callback provided by the driver of this device + + `int pm_generic_freeze(struct device *dev);` + - if the device has not been suspended at run time, invoke the ->freeze() + callback provided by its driver and return its result, or return 0 if not + defined + + `int pm_generic_freeze_noirq(struct device *dev);` + - if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq() + callback provided by the device's driver and return its result, or return + 0 if not defined + + `int pm_generic_thaw(struct device *dev);` + - if the device has not been suspended at run time, invoke the ->thaw() + callback provided by its driver and return its result, or return 0 if not +
defined + + `int pm_generic_thaw_noirq(struct device *dev);` + - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq() + callback provided by the device's driver and return its result, or return + 0 if not defined + + `int pm_generic_poweroff(struct device *dev);` + - if the device has not been suspended at run time, invoke the ->poweroff() + callback provided by its driver and return its result, or return 0 if not + defined + + `int pm_generic_poweroff_noirq(struct device *dev);` + - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq() + callback provided by the device's driver and return its result, or return + 0 if not defined + + `int pm_generic_restore(struct device *dev);` + - invoke the ->restore() callback provided by the driver of this device and, + if successful, change the device's runtime PM status to 'active' + + `int pm_generic_restore_noirq(struct device *dev);` + - invoke the ->restore_noirq() callback provided by the device's driver + +These functions are the defaults used by the PM core, if a subsystem doesn't +provide its own callbacks for ->runtime_idle(), ->runtime_suspend(), +->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(), +->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(), +->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the +subsystem-level dev_pm_ops structure. + +Device drivers that wish to use the same function as a system suspend, freeze, +poweroff and runtime suspend callback, and similarly for system resume, thaw, +restore, and runtime resume, can achieve this with the help of the +UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its +last argument to NULL). + +8. "No-Callback" Devices +======================== + +Some "devices" are only logical sub-devices of their parent and cannot be +power-managed on their own. (The prototype example is a USB interface. 
Entire +USB devices can go into low-power mode or send wake-up requests, but neither is +possible for individual interfaces.) The drivers for these devices have no +need of runtime PM callbacks; if the callbacks did exist, ->runtime_suspend() +and ->runtime_resume() would always return 0 without doing anything else and +->runtime_idle() would always call pm_runtime_suspend(). + +Subsystems can tell the PM core about these devices by calling +pm_runtime_no_callbacks(). This should be done after the device structure is +initialized and before it is registered (although after device registration is +also okay). The routine will set the device's power.no_callbacks flag and +prevent the non-debugging runtime PM sysfs attributes from being created. + +When power.no_callbacks is set, the PM core will not invoke the +->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks. +Instead it will assume that suspends and resumes always succeed and that idle +devices should be suspended. + +As a consequence, the PM core will never directly inform the device's subsystem +or driver about runtime power changes. Instead, the driver for the device's +parent must take responsibility for telling the device's driver when the +parent's power state changes. + +9. Autosuspend, or automatically-delayed suspends +================================================= + +Changing a device's power state isn't free; it requires both time and energy. +A device should be put in a low-power state only when there's some reason to +think it will remain in that state for a substantial time. A common heuristic +says that a device which hasn't been used for a while is liable to remain +unused; following this advice, drivers should not allow devices to be suspended +at runtime until they have been inactive for some minimum period. Even when +the heuristic ends up being non-optimal, it will still prevent devices from +"bouncing" too rapidly between low-power and full-power states. 
+ +The term "autosuspend" is an historical remnant. It doesn't mean that the +device is automatically suspended (the subsystem or driver still has to call +the appropriate PM routines); rather it means that runtime suspends will +automatically be delayed until the desired period of inactivity has elapsed. + +Inactivity is determined based on the power.last_busy field. Drivers should +call pm_runtime_mark_last_busy() to update this field after carrying out I/O, +typically just before calling pm_runtime_put_autosuspend(). The desired length +of the inactivity period is a matter of policy. Subsystems can set this length +initially by calling pm_runtime_set_autosuspend_delay(), but after device +registration the length should be controlled by user space, using the +/sys/devices/.../power/autosuspend_delay_ms attribute. + +In order to use autosuspend, subsystems or drivers must call +pm_runtime_use_autosuspend() (preferably before registering the device), and +thereafter they should use the various `*_autosuspend()` helper functions +instead of the non-autosuspend counterparts:: + + Instead of: pm_runtime_suspend use: pm_runtime_autosuspend; + Instead of: pm_schedule_suspend use: pm_request_autosuspend; + Instead of: pm_runtime_put use: pm_runtime_put_autosuspend; + Instead of: pm_runtime_put_sync use: pm_runtime_put_sync_autosuspend. + +Drivers may also continue to use the non-autosuspend helper functions; they +will behave normally, which means sometimes taking the autosuspend delay into +account (see pm_runtime_idle). + +Under some circumstances a driver or subsystem may want to prevent a device +from autosuspending immediately, even though the usage counter is zero and the +autosuspend delay time has expired. 
If the ->runtime_suspend() callback +returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is +in the future (as it normally would be if the callback invoked +pm_runtime_mark_last_busy()), the PM core will automatically reschedule the +autosuspend. The ->runtime_suspend() callback can't do this rescheduling +itself because no suspend requests of any kind are accepted while the device is +suspending (i.e., while the callback is running). + +The implementation is well suited for asynchronous use in interrupt contexts. +However, such use inevitably involves races, because the PM core can't +synchronize ->runtime_suspend() callbacks with the arrival of I/O requests. +This synchronization must be handled by the driver, using its private lock. +Here is a schematic pseudo-code example:: + + foo_read_or_write(struct foo_priv *foo, void *data) + { + lock(&foo->private_lock); + add_request_to_io_queue(foo, data); + if (foo->num_pending_requests++ == 0) + pm_runtime_get(&foo->dev); + if (!foo->is_suspended) + foo_process_next_request(foo); + unlock(&foo->private_lock); + } + + foo_io_completion(struct foo_priv *foo, void *req) + { + lock(&foo->private_lock); + if (--foo->num_pending_requests == 0) { + pm_runtime_mark_last_busy(&foo->dev); + pm_runtime_put_autosuspend(&foo->dev); + } else { + foo_process_next_request(foo); + } + unlock(&foo->private_lock); + /* Send req result back to the user ... */ + } + + int foo_runtime_suspend(struct device *dev) + { + struct foo_priv *foo = container_of(dev, ...); + int ret = 0; + + lock(&foo->private_lock); + if (foo->num_pending_requests > 0) { + ret = -EBUSY; + } else { + /* ... suspend the device ... */ + foo->is_suspended = 1; + } + unlock(&foo->private_lock); + return ret; + } + + int foo_runtime_resume(struct device *dev) + { + struct foo_priv *foo = container_of(dev, ...); + + lock(&foo->private_lock); + /* ... resume the device ...
*/ + foo->is_suspended = 0; + pm_runtime_mark_last_busy(&foo->dev); + if (foo->num_pending_requests > 0) + foo_process_next_request(foo); + unlock(&foo->private_lock); + return 0; + } + +The important point is that after foo_io_completion() asks for an autosuspend, +the foo_runtime_suspend() callback may race with foo_read_or_write(). +Therefore foo_runtime_suspend() has to check whether there are any pending I/O +requests (while holding the private lock) before allowing the suspend to +proceed. + +In addition, the power.autosuspend_delay field can be changed by user space at +any time. If a driver cares about this, it can call +pm_runtime_autosuspend_expiration() from within the ->runtime_suspend() +callback while holding its private lock. If the function returns a nonzero +value then the delay has not yet expired and the callback should return +-EAGAIN. diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt deleted file mode 100644 index 937e33c46211..000000000000 --- a/Documentation/power/runtime_pm.txt +++ /dev/null @@ -1,928 +0,0 @@ -Runtime Power Management Framework for I/O Devices - -(C) 2009-2011 Rafael J. Wysocki , Novell Inc. -(C) 2010 Alan Stern -(C) 2014 Intel Corp., Rafael J. Wysocki - -1. Introduction - -Support for runtime power management (runtime PM) of I/O devices is provided -at the power management core (PM core) level by means of: - -* The power management workqueue pm_wq in which bus types and device drivers can - put their PM-related work items. It is strongly recommended that pm_wq be - used for queuing all work items related to runtime PM, because this allows - them to be synchronized with system-wide power transitions (suspend to RAM, - hibernation and resume from system sleep states). pm_wq is declared in - include/linux/pm_runtime.h and defined in kernel/power/main.c. 
- -* A number of runtime PM fields in the 'power' member of 'struct device' (which - is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can - be used for synchronizing runtime PM operations with one another. - -* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in - include/linux/pm.h). - -* A set of helper functions defined in drivers/base/power/runtime.c that can be - used for carrying out runtime PM operations in such a way that the - synchronization between them is taken care of by the PM core. Bus types and - device drivers are encouraged to use these functions. - -The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM -fields of 'struct dev_pm_info' and the core helper functions provided for -runtime PM are described below. - -2. Device Runtime PM Callbacks - -There are three device runtime PM callbacks defined in 'struct dev_pm_ops': - -struct dev_pm_ops { - ... - int (*runtime_suspend)(struct device *dev); - int (*runtime_resume)(struct device *dev); - int (*runtime_idle)(struct device *dev); - ... -}; - -The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks -are executed by the PM core for the device's subsystem that may be either of -the following: - - 1. PM domain of the device, if the device's PM domain object, dev->pm_domain, - is present. - - 2. Device type of the device, if both dev->type and dev->type->pm are present. - - 3. Device class of the device, if both dev->class and dev->class->pm are - present. - - 4. Bus type of the device, if both dev->bus and dev->bus->pm are present. - -If the subsystem chosen by applying the above rules doesn't provide the relevant -callback, the PM core will invoke the corresponding driver callback stored in -dev->driver->pm directly (if present). - -The PM core always checks which callback to use in the order given above, so the -priority order of callbacks from high to low is: PM domain, device type, class -and bus type. 
Moreover, the high-priority one will always take precedence over -a low-priority one. The PM domain, bus type, device type and class callbacks -are referred to as subsystem-level callbacks in what follows. - -By default, the callbacks are always invoked in process context with interrupts -enabled. However, the pm_runtime_irq_safe() helper function can be used to tell -the PM core that it is safe to run the ->runtime_suspend(), ->runtime_resume() -and ->runtime_idle() callbacks for the given device in atomic context with -interrupts disabled. This implies that the callback routines in question must -not block or sleep, but it also means that the synchronous helper functions -listed at the end of Section 4 may be used for that device within an interrupt -handler or generally in an atomic context. - -The subsystem-level suspend callback, if present, is _entirely_ _responsible_ -for handling the suspend of the device as appropriate, which may, but need not -include executing the device driver's own ->runtime_suspend() callback (from the -PM core's point of view it is not necessary to implement a ->runtime_suspend() -callback in a device driver as long as the subsystem-level suspend callback -knows what to do to handle the device). - - * Once the subsystem-level suspend callback (or the driver suspend callback, - if invoked directly) has completed successfully for the given device, the PM - core regards the device as suspended, which need not mean that it has been - put into a low power state. It is supposed to mean, however, that the - device will not process data and will not communicate with the CPU(s) and - RAM until the appropriate resume callback is executed for it. The runtime - PM status of a device after successful execution of the suspend callback is - 'suspended'. - - * If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM - status remains 'active', which means that the device _must_ be fully - operational afterwards. 
- - * If the suspend callback returns an error code different from -EBUSY and - -EAGAIN, the PM core regards this as a fatal error and will refuse to run - the helper functions described in Section 4 for the device until its status - is directly set to either 'active', or 'suspended' (the PM core provides - special helper functions for this purpose). - -In particular, if the driver requires remote wakeup capability (i.e. hardware -mechanism allowing the device to request a change of its power state, such as -PCI PME) for proper functioning and device_can_wakeup() returns 'false' for the -device, then ->runtime_suspend() should return -EBUSY. On the other hand, if -device_can_wakeup() returns 'true' for the device and the device is put into a -low-power state during the execution of the suspend callback, it is expected -that remote wakeup will be enabled for the device. Generally, remote wakeup -should be enabled for all input devices put into low-power states at run time. - -The subsystem-level resume callback, if present, is _entirely_ _responsible_ for -handling the resume of the device as appropriate, which may, but need not -include executing the device driver's own ->runtime_resume() callback (from the -PM core's point of view it is not necessary to implement a ->runtime_resume() -callback in a device driver as long as the subsystem-level resume callback knows -what to do to handle the device). - - * Once the subsystem-level resume callback (or the driver resume callback, if - invoked directly) has completed successfully, the PM core regards the device - as fully operational, which means that the device _must_ be able to complete - I/O operations as needed. The runtime PM status of the device is then - 'active'. 
- - * If the resume callback returns an error code, the PM core regards this as a - fatal error and will refuse to run the helper functions described in Section - 4 for the device, until its status is directly set to either 'active', or - 'suspended' (by means of special helper functions provided by the PM core - for this purpose). - -The idle callback (a subsystem-level one, if present, or the driver one) is -executed by the PM core whenever the device appears to be idle, which is -indicated to the PM core by two counters, the device's usage counter and the -counter of 'active' children of the device. - - * If any of these counters is decreased using a helper function provided by - the PM core and it turns out to be equal to zero, the other counter is - checked. If that counter also is equal to zero, the PM core executes the - idle callback with the device as its argument. - -The action performed by the idle callback is totally dependent on the subsystem -(or driver) in question, but the expected and recommended action is to check -if the device can be suspended (i.e. if all of the conditions necessary for -suspending the device are satisfied) and to queue up a suspend request for the -device in that case. If there is no idle callback, or if the callback returns -0, then the PM core will attempt to carry out a runtime suspend of the device, -also respecting devices configured for autosuspend. In essence this means a -call to pm_runtime_autosuspend() (do note that drivers needs to update the -device last busy mark, pm_runtime_mark_last_busy(), to control the delay under -this circumstance). To prevent this (for example, if the callback routine has -started a delayed suspend), the routine must return a non-zero value. Negative -error return codes are ignored by the PM core. 
- -The helper functions provided by the PM core, described in Section 4, guarantee -that the following constraints are met with respect to runtime PM callbacks for -one device: - -(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute - ->runtime_suspend() in parallel with ->runtime_resume() or with another - instance of ->runtime_suspend() for the same device) with the exception that - ->runtime_suspend() or ->runtime_resume() can be executed in parallel with - ->runtime_idle() (although ->runtime_idle() will not be started while any - of the other callbacks is being executed for the same device). - -(2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active' - devices (i.e. the PM core will only execute ->runtime_idle() or - ->runtime_suspend() for the devices the runtime PM status of which is - 'active'). - -(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device - the usage counter of which is equal to zero _and_ either the counter of - 'active' children of which is equal to zero, or the 'power.ignore_children' - flag of which is set. - -(4) ->runtime_resume() can only be executed for 'suspended' devices (i.e. the - PM core will only execute ->runtime_resume() for the devices the runtime - PM status of which is 'suspended'). - -Additionally, the helper functions provided by the PM core obey the following -rules: - - * If ->runtime_suspend() is about to be executed or there's a pending request - to execute it, ->runtime_idle() will not be executed for the same device. - - * A request to execute or to schedule the execution of ->runtime_suspend() - will cancel any pending requests to execute ->runtime_idle() for the same - device. - - * If ->runtime_resume() is about to be executed or there's a pending request - to execute it, the other callbacks will not be executed for the same device. 
- - * A request to execute ->runtime_resume() will cancel any pending or - scheduled requests to execute the other callbacks for the same device, - except for scheduled autosuspends. - -3. Runtime PM Device Fields - -The following device runtime PM fields are present in 'struct dev_pm_info', as -defined in include/linux/pm.h: - - struct timer_list suspend_timer; - - timer used for scheduling (delayed) suspend and autosuspend requests - - unsigned long timer_expires; - - timer expiration time, in jiffies (if this is different from zero, the - timer is running and will expire at that time, otherwise the timer is not - running) - - struct work_struct work; - - work structure used for queuing up requests (i.e. work items in pm_wq) - - wait_queue_head_t wait_queue; - - wait queue used if any of the helper functions needs to wait for another - one to complete - - spinlock_t lock; - - lock used for synchronization - - atomic_t usage_count; - - the usage counter of the device - - atomic_t child_count; - - the count of 'active' children of the device - - unsigned int ignore_children; - - if set, the value of child_count is ignored (but still updated) - - unsigned int disable_depth; - - used for disabling the helper functions (they work normally if this is - equal to zero); the initial value of it is 1 (i.e. runtime PM is - initially disabled for all devices) - - int runtime_error; - - if set, there was a fatal error (one of the callbacks returned error code - as described in Section 2), so the helper functions will not work until - this flag is cleared; this is the error code returned by the failing - callback - - unsigned int idle_notification; - - if set, ->runtime_idle() is being executed - - unsigned int request_pending; - - if set, there's a pending request (i.e. 
a work item queued up into pm_wq) - - enum rpm_request request; - - type of request that's pending (valid if request_pending is set) - - unsigned int deferred_resume; - - set if ->runtime_resume() is about to be run while ->runtime_suspend() is - being executed for that device and it is not practical to wait for the - suspend to complete; means "start a resume as soon as you've suspended" - - enum rpm_status runtime_status; - - the runtime PM status of the device; this field's initial value is - RPM_SUSPENDED, which means that each device is initially regarded by the - PM core as 'suspended', regardless of its real hardware status - - unsigned int runtime_auto; - - if set, indicates that the user space has allowed the device driver to - power manage the device at run time via the /sys/devices/.../power/control - interface; it may only be modified with the help of the pm_runtime_allow() - and pm_runtime_forbid() helper functions - - unsigned int no_callbacks; - - indicates that the device does not use the runtime PM callbacks (see - Section 8); it may be modified only by the pm_runtime_no_callbacks() - helper function - - unsigned int irq_safe; - - indicates that the ->runtime_suspend() and ->runtime_resume() callbacks - will be invoked with the spinlock held and interrupts disabled - - unsigned int use_autosuspend; - - indicates that the device's driver supports delayed autosuspend (see - Section 9); it may be modified only by the - pm_runtime{_dont}_use_autosuspend() helper functions - - unsigned int timer_autosuspends; - - indicates that the PM core should attempt to carry out an autosuspend - when the timer expires rather than a normal suspend - - int autosuspend_delay; - - the delay time (in milliseconds) to be used for autosuspend - - unsigned long last_busy; - - the time (in jiffies) when the pm_runtime_mark_last_busy() helper - function was last called for this device; used in calculating inactivity - periods for autosuspend - -All of the above fields are 
members of the 'power' member of 'struct device'. - -4. Runtime PM Device Helper Functions - -The following runtime PM helper functions are defined in -drivers/base/power/runtime.c and include/linux/pm_runtime.h: - - void pm_runtime_init(struct device *dev); - - initialize the device runtime PM fields in 'struct dev_pm_info' - - void pm_runtime_remove(struct device *dev); - - make sure that the runtime PM of the device will be disabled after - removing the device from device hierarchy - - int pm_runtime_idle(struct device *dev); - - execute the subsystem-level idle callback for the device; returns an - error code on failure, where -EINPROGRESS means that ->runtime_idle() is - already being executed; if there is no callback or the callback returns 0 - then run pm_runtime_autosuspend(dev) and return its result - - int pm_runtime_suspend(struct device *dev); - - execute the subsystem-level suspend callback for the device; returns 0 on - success, 1 if the device's runtime PM status was already 'suspended', or - error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt - to suspend the device again in future and -EACCES means that - 'power.disable_depth' is different from 0 - - int pm_runtime_autosuspend(struct device *dev); - - same as pm_runtime_suspend() except that the autosuspend delay is taken - into account; if pm_runtime_autosuspend_expiration() says the delay has - not yet expired then an autosuspend is scheduled for the appropriate time - and 0 is returned - - int pm_runtime_resume(struct device *dev); - - execute the subsystem-level resume callback for the device; returns 0 on - success, 1 if the device's runtime PM status was already 'active' or - error code on failure, where -EAGAIN means it may be safe to attempt to - resume the device again in future, but 'power.runtime_error' should be - checked additionally, and -EACCES means that 'power.disable_depth' is - different from 0 - - int pm_request_idle(struct device *dev); - - submit a 
request to execute the subsystem-level idle callback for the - device (the request is represented by a work item in pm_wq); returns 0 on - success or error code if the request has not been queued up - - int pm_request_autosuspend(struct device *dev); - - schedule the execution of the subsystem-level suspend callback for the - device when the autosuspend delay has expired; if the delay has already - expired then the work item is queued up immediately - - int pm_schedule_suspend(struct device *dev, unsigned int delay); - - schedule the execution of the subsystem-level suspend callback for the - device in future, where 'delay' is the time to wait before queuing up a - suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work - item is queued up immediately); returns 0 on success, 1 if the device's PM - runtime status was already 'suspended', or error code if the request - hasn't been scheduled (or queued up if 'delay' is 0); if the execution of - ->runtime_suspend() is already scheduled and not yet expired, the new - value of 'delay' will be used as the time to wait - - int pm_request_resume(struct device *dev); - - submit a request to execute the subsystem-level resume callback for the - device (the request is represented by a work item in pm_wq); returns 0 on - success, 1 if the device's runtime PM status was already 'active', or - error code if the request hasn't been queued up - - void pm_runtime_get_noresume(struct device *dev); - - increment the device's usage counter - - int pm_runtime_get(struct device *dev); - - increment the device's usage counter, run pm_request_resume(dev) and - return its result - - int pm_runtime_get_sync(struct device *dev); - - increment the device's usage counter, run pm_runtime_resume(dev) and - return its result - - int pm_runtime_get_if_in_use(struct device *dev); - - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the - runtime PM status is RPM_ACTIVE and the runtime PM usage counter is - nonzero, 
increment the counter and return 1; otherwise return 0 without - changing the counter - - void pm_runtime_put_noidle(struct device *dev); - - decrement the device's usage counter - - int pm_runtime_put(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_request_idle(dev) and return its result - - int pm_runtime_put_autosuspend(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_request_autosuspend(dev) and return its result - - int pm_runtime_put_sync(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_runtime_idle(dev) and return its result - - int pm_runtime_put_sync_suspend(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_runtime_suspend(dev) and return its result - - int pm_runtime_put_sync_autosuspend(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_runtime_autosuspend(dev) and return its result - - void pm_runtime_enable(struct device *dev); - - decrement the device's 'power.disable_depth' field; if that field is equal - to zero, the runtime PM helper functions can execute subsystem-level - callbacks described in Section 2 for the device - - int pm_runtime_disable(struct device *dev); - - increment the device's 'power.disable_depth' field (if the value of that - field was previously zero, this prevents subsystem-level runtime PM - callbacks from being run for the device), make sure that all of the - pending runtime PM operations on the device are either completed or - canceled; returns 1 if there was a resume request pending and it was - necessary to execute the subsystem-level resume callback for the device - to satisfy that request, otherwise 0 is returned - - int pm_runtime_barrier(struct device *dev); - - check if there's a resume request pending for the device and resume it - (synchronously) in that case, cancel any other pending 
runtime PM requests - regarding it and wait for all runtime PM operations on it in progress to - complete; returns 1 if there was a resume request pending and it was - necessary to execute the subsystem-level resume callback for the device to - satisfy that request, otherwise 0 is returned - - void pm_suspend_ignore_children(struct device *dev, bool enable); - - set/unset the power.ignore_children flag of the device - - int pm_runtime_set_active(struct device *dev); - - clear the device's 'power.runtime_error' flag, set the device's runtime - PM status to 'active' and update its parent's counter of 'active' - children as appropriate (it is only valid to use this function if - 'power.runtime_error' is set or 'power.disable_depth' is greater than - zero); it will fail and return error code if the device has a parent - which is not active and the 'power.ignore_children' flag of which is unset - - void pm_runtime_set_suspended(struct device *dev); - - clear the device's 'power.runtime_error' flag, set the device's runtime - PM status to 'suspended' and update its parent's counter of 'active' - children as appropriate (it is only valid to use this function if - 'power.runtime_error' is set or 'power.disable_depth' is greater than - zero) - - bool pm_runtime_active(struct device *dev); - - return true if the device's runtime PM status is 'active' or its - 'power.disable_depth' field is not equal to zero, or false otherwise - - bool pm_runtime_suspended(struct device *dev); - - return true if the device's runtime PM status is 'suspended' and its - 'power.disable_depth' field is equal to zero, or false otherwise - - bool pm_runtime_status_suspended(struct device *dev); - - return true if the device's runtime PM status is 'suspended' - - void pm_runtime_allow(struct device *dev); - - set the power.runtime_auto flag for the device and decrease its usage - counter (used by the /sys/devices/.../power/control interface to - effectively allow the device to be power managed at 
run time) - - void pm_runtime_forbid(struct device *dev); - - unset the power.runtime_auto flag for the device and increase its usage - counter (used by the /sys/devices/.../power/control interface to - effectively prevent the device from being power managed at run time) - - void pm_runtime_no_callbacks(struct device *dev); - - set the power.no_callbacks flag for the device and remove the runtime - PM attributes from /sys/devices/.../power (or prevent them from being - added when the device is registered) - - void pm_runtime_irq_safe(struct device *dev); - - set the power.irq_safe flag for the device, causing the runtime-PM - callbacks to be invoked with interrupts off - - bool pm_runtime_is_irq_safe(struct device *dev); - - return true if power.irq_safe flag was set for the device, causing - the runtime-PM callbacks to be invoked with interrupts off - - void pm_runtime_mark_last_busy(struct device *dev); - - set the power.last_busy field to the current time - - void pm_runtime_use_autosuspend(struct device *dev); - - set the power.use_autosuspend flag, enabling autosuspend delays; call - pm_runtime_get_sync if the flag was previously cleared and - power.autosuspend_delay is negative - - void pm_runtime_dont_use_autosuspend(struct device *dev); - - clear the power.use_autosuspend flag, disabling autosuspend delays; - decrement the device's usage counter if the flag was previously set and - power.autosuspend_delay is negative; call pm_runtime_idle - - void pm_runtime_set_autosuspend_delay(struct device *dev, int delay); - - set the power.autosuspend_delay value to 'delay' (expressed in - milliseconds); if 'delay' is negative then runtime suspends are - prevented; if power.use_autosuspend is set, pm_runtime_get_sync may be - called or the device's usage counter may be decremented and - pm_runtime_idle called depending on if power.autosuspend_delay is - changed to or from a negative value; if power.use_autosuspend is clear, - pm_runtime_idle is called - - unsigned 
long pm_runtime_autosuspend_expiration(struct device *dev); - - calculate the time when the current autosuspend delay period will expire, - based on power.last_busy and power.autosuspend_delay; if the delay time - is 1000 ms or larger then the expiration time is rounded up to the - nearest second; returns 0 if the delay period has already expired or - power.use_autosuspend isn't set, otherwise returns the expiration time - in jiffies - -It is safe to execute the following helper functions from interrupt context: - -pm_request_idle() -pm_request_autosuspend() -pm_schedule_suspend() -pm_request_resume() -pm_runtime_get_noresume() -pm_runtime_get() -pm_runtime_put_noidle() -pm_runtime_put() -pm_runtime_put_autosuspend() -pm_runtime_enable() -pm_suspend_ignore_children() -pm_runtime_set_active() -pm_runtime_set_suspended() -pm_runtime_suspended() -pm_runtime_mark_last_busy() -pm_runtime_autosuspend_expiration() - -If pm_runtime_irq_safe() has been called for a device then the following helper -functions may also be used in interrupt context: - -pm_runtime_idle() -pm_runtime_suspend() -pm_runtime_autosuspend() -pm_runtime_resume() -pm_runtime_get_sync() -pm_runtime_put_sync() -pm_runtime_put_sync_suspend() -pm_runtime_put_sync_autosuspend() - -5. Runtime PM Initialization, Device Probing and Removal - -Initially, the runtime PM is disabled for all devices, which means that the -majority of the runtime PM helper functions described in Section 4 will return --EAGAIN until pm_runtime_enable() is called for the device. - -In addition to that, the initial runtime PM status of all devices is -'suspended', but it need not reflect the actual physical state of the device. -Thus, if the device is initially active (i.e. it is able to process I/O), its -runtime PM status must be changed to 'active', with the help of -pm_runtime_set_active(), before pm_runtime_enable() is called for the device. 
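The ordering rule in the paragraph above (correct the status with pm_runtime_set_active() first, then call pm_runtime_enable()) can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct toy_dev` and every `toy_*` function are hypothetical stand-ins invented for illustration, not the kernel's `struct device` or its helpers, and the -EAGAIN return from the toy setter is an assumption of the model.

```c
#include <errno.h>

/* Toy user-space stand-ins; NOT the kernel's struct device or PM helpers. */
enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_SUSPENDED };

struct toy_dev {
	int disable_depth;           /* > 0 means runtime PM is disabled */
	enum toy_rpm_status status;  /* what the PM core believes */
};

/* Mirrors the documented defaults: disabled and regarded as suspended. */
static void toy_pm_runtime_init(struct toy_dev *d)
{
	d->disable_depth = 1;
	d->status = TOY_RPM_SUSPENDED;
}

/* Only accepted while runtime PM is still disabled (model assumption). */
static int toy_pm_runtime_set_active(struct toy_dev *d)
{
	if (d->disable_depth == 0)
		return -EAGAIN;
	d->status = TOY_RPM_ACTIVE;
	return 0;
}

/* Drops one level of "disabled"; at zero, runtime PM is enabled. */
static void toy_pm_runtime_enable(struct toy_dev *d)
{
	if (d->disable_depth > 0)
		d->disable_depth--;
}

/* Probe path for hardware that is already powered up. */
static int toy_probe_powered_device(struct toy_dev *d)
{
	int ret;

	toy_pm_runtime_init(d);
	ret = toy_pm_runtime_set_active(d);  /* fix the status first ... */
	if (ret)
		return ret;
	toy_pm_runtime_enable(d);            /* ... then enable runtime PM */
	return 0;
}
```

In this model, calling the setter after the enable step is rejected, which is why the status has to be corrected while runtime PM is still disabled.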
- -However, if the device has a parent and the parent's runtime PM is enabled, -calling pm_runtime_set_active() for the device will affect the parent, unless -the parent's 'power.ignore_children' flag is set. Namely, in that case the -parent won't be able to suspend at run time, using the PM core's helper -functions, as long as the child's status is 'active', even if the child's -runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for -the child yet or pm_runtime_disable() has been called for it). For this reason, -once pm_runtime_set_active() has been called for the device, pm_runtime_enable() -should be called for it too as soon as reasonably possible or its runtime PM -status should be changed back to 'suspended' with the help of -pm_runtime_set_suspended(). - -If the default initial runtime PM status of the device (i.e. 'suspended') -reflects the actual state of the device, its bus type's or its driver's -->probe() callback will likely need to wake it up using one of the PM core's -helper functions described in Section 4. In that case, pm_runtime_resume() -should be used. Of course, for this purpose the device's runtime PM has to be -enabled earlier by calling pm_runtime_enable(). - -Note, if the device may execute pm_runtime calls during the probe (such as -if it registers with a subsystem that may call back in) then the -pm_runtime_get_sync() call paired with a pm_runtime_put() call will be -appropriate to ensure that the device is not put back to sleep during the -probe. This can happen with systems such as the network device layer. - -It may be desirable to suspend the device once ->probe() has finished. -Therefore the driver core uses the asynchronous pm_request_idle() to submit a -request to execute the subsystem-level idle callback for the device at that -time. A driver that makes use of the runtime autosuspend feature may want to -update the last busy mark before returning from ->probe().
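The note above about pairing pm_runtime_get_sync() with pm_runtime_put() during probe can be illustrated with a toy usage counter. This is a hedged sketch, not kernel code: `struct toy_dev`, `toy_get_sync()`, `toy_put()`, `toy_suspend()` and `toy_register_with_subsystem()` are hypothetical names made up for the illustration, and the model ignores the idle request that the real put helper would queue.

```c
#include <errno.h>

/* Toy user-space model; NOT the kernel's struct device. */
struct toy_dev {
	int usage_count;  /* references that keep the device awake */
	int suspended;    /* 1 once the toy device has been suspended */
};

/* Stand-in for pm_runtime_get_sync(): take a reference and resume. */
static int toy_get_sync(struct toy_dev *d)
{
	d->usage_count++;
	d->suspended = 0;
	return 0;
}

/* Stand-in for a suspend attempt: refused while references are held. */
static int toy_suspend(struct toy_dev *d)
{
	if (d->usage_count > 0)
		return -EAGAIN;
	d->suspended = 1;
	return 0;
}

/* Stand-in for pm_runtime_put(): drop the reference (idle request not modeled). */
static void toy_put(struct toy_dev *d)
{
	d->usage_count--;
}

/* Hypothetical subsystem that calls back into runtime PM while registering. */
static int toy_register_with_subsystem(struct toy_dev *d)
{
	return toy_suspend(d);  /* the callback tries to put the device to sleep */
}

/* Probe holds a reference across registration, so the callback cannot suspend. */
static int toy_probe(struct toy_dev *d, int *callback_result)
{
	toy_get_sync(d);
	*callback_result = toy_register_with_subsystem(d);
	toy_put(d);
	return 0;
}
```

Holding the reference for the whole of the probe is the point of the pairing: any suspend attempted from a callback is refused, and the device becomes eligible for suspend again only after the final put.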
- -Moreover, the driver core prevents runtime PM callbacks from racing with the bus -notifier callback in __device_release_driver(), which is necessary, because the -notifier is used by some subsystems to carry out operations affecting the -runtime PM functionality. It does so by calling pm_runtime_get_sync() before -driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications. This -resumes the device if it's in the suspended state and prevents it from -being suspended again while those routines are being executed. - -To allow bus types and drivers to put devices into the suspended state by -calling pm_runtime_suspend() from their ->remove() routines, the driver core -executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER -notifications in __device_release_driver(). This requires bus types and -drivers to make their ->remove() callbacks avoid races with runtime PM directly, -but also it allows of more flexibility in the handling of devices during the -removal of their drivers. - -Drivers in ->remove() callback should undo the runtime PM changes done -in ->probe(). Usually this means calling pm_runtime_disable(), -pm_runtime_dont_use_autosuspend() etc. - -The user space can effectively disallow the driver of the device to power manage -it at run time by changing the value of its /sys/devices/.../power/control -attribute to "on", which causes pm_runtime_forbid() to be called. In principle, -this mechanism may also be used by the driver to effectively turn off the -runtime power management of the device until the user space turns it on. -Namely, during the initialization the driver can make sure that the runtime PM -status of the device is 'active' and call pm_runtime_forbid(). 
It should be -noted, however, that if the user space has already intentionally changed the -value of /sys/devices/.../power/control to "auto" to allow the driver to power -manage the device at run time, the driver may confuse it by using -pm_runtime_forbid() this way. - -6. Runtime PM and System Sleep - -Runtime PM and system sleep (i.e., system suspend and hibernation, also known -as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of -ways. If a device is active when a system sleep starts, everything is -straightforward. But what should happen if the device is already suspended? - -The device may have different wake-up settings for runtime PM and system sleep. -For example, remote wake-up may be enabled for runtime suspend but disallowed -for system sleep (device_may_wakeup(dev) returns 'false'). When this happens, -the subsystem-level system suspend callback is responsible for changing the -device's wake-up setting (it may leave that to the device driver's system -suspend routine). It may be necessary to resume the device and suspend it again -in order to do so. The same is true if the driver uses different power levels -or other settings for runtime suspend and system sleep. - -During system resume, the simplest approach is to bring all devices back to full -power, even if they had been suspended before the system suspend began. There -are several reasons for this, including: - - * The device might need to switch power levels, wake-up settings, etc. - - * Remote wake-up events might have been lost by the firmware. - - * The device's children may need the device to be at full power in order - to resume themselves. - - * The driver's idea of the device state may not agree with the device's - physical state. This can happen during resume from hibernation. - - * The device might need to be reset. 
- - * Even though the device was suspended, if its usage counter was > 0 then most - likely it would need a runtime resume in the near future anyway. - -If the device had been suspended before the system suspend began and it's -brought back to full power during resume, then its runtime PM status will have -to be updated to reflect the actual post-system sleep status. The way to do -this is: - - pm_runtime_disable(dev); - pm_runtime_set_active(dev); - pm_runtime_enable(dev); - -The PM core always increments the runtime usage counter before calling the -->suspend() callback and decrements it after calling the ->resume() callback. -Hence disabling runtime PM temporarily like this will not cause any runtime -suspend attempts to be permanently lost. If the usage count goes to zero -following the return of the ->resume() callback, the ->runtime_idle() callback -will be invoked as usual. - -On some systems, however, system sleep is not entered through a global firmware -or hardware operation. Instead, all hardware components are put into low-power -states directly by the kernel in a coordinated way. Then, the system sleep -state effectively follows from the states the hardware components end up in -and the system is woken up from that state by a hardware interrupt or a similar -mechanism entirely under the kernel's control. As a result, the kernel never -gives control away and the states of all devices during resume are precisely -known to it. If that is the case and none of the situations listed above takes -place (in particular, if the system is not waking up from hibernation), it may -be more efficient to leave the devices that had been suspended before the system -suspend began in the suspended state. - -To this end, the PM core provides a mechanism allowing some coordination between -different levels of device hierarchy. 
Namely, if a system suspend .prepare() -callback returns a positive number for a device, that indicates to the PM core -that the device appears to be runtime-suspended and its state is fine, so it -may be left in runtime suspend provided that all of its descendants are also -left in runtime suspend. If that happens, the PM core will not execute any -system suspend and resume callbacks for all of those devices, except for the -complete callback, which is then entirely responsible for handling the device -as appropriate. This only applies to system suspend transitions that are not -related to hibernation (see Documentation/driver-api/pm/devices.rst for more -information). - -The PM core does its best to reduce the probability of race conditions between -the runtime PM and system suspend/resume (and hibernation) callbacks by carrying -out the following operations: - - * During system suspend pm_runtime_get_noresume() is called for every device - right before executing the subsystem-level .prepare() callback for it and - pm_runtime_barrier() is called for every device right before executing the - subsystem-level .suspend() callback for it. In addition to that the PM core - calls __pm_runtime_disable() with 'false' as the second argument for every - device right before executing the subsystem-level .suspend_late() callback - for it. - - * During system resume pm_runtime_enable() and pm_runtime_put() are called for - every device right after executing the subsystem-level .resume_early() - callback and right after executing the subsystem-level .complete() callback - for it, respectively. - -7. 
Generic subsystem callbacks - -Subsystems may wish to conserve code space by using the set of generic power -management callbacks provided by the PM core, defined in -drivers/base/power/generic_ops.c: - - int pm_generic_runtime_suspend(struct device *dev); - - invoke the ->runtime_suspend() callback provided by the driver of this - device and return its result, or return 0 if not defined - - int pm_generic_runtime_resume(struct device *dev); - - invoke the ->runtime_resume() callback provided by the driver of this - device and return its result, or return 0 if not defined - - int pm_generic_suspend(struct device *dev); - - if the device has not been suspended at run time, invoke the ->suspend() - callback provided by its driver and return its result, or return 0 if not - defined - - int pm_generic_suspend_noirq(struct device *dev); - - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq() - callback provided by the device's driver and return its result, or return - 0 if not defined - - int pm_generic_resume(struct device *dev); - - invoke the ->resume() callback provided by the driver of this device and, - if successful, change the device's runtime PM status to 'active' - - int pm_generic_resume_noirq(struct device *dev); - - invoke the ->resume_noirq() callback provided by the driver of this device - - int pm_generic_freeze(struct device *dev); - - if the device has not been suspended at run time, invoke the ->freeze() - callback provided by its driver and return its result, or return 0 if not - defined - - int pm_generic_freeze_noirq(struct device *dev); - - if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq() - callback provided by the device's driver and return its result, or return - 0 if not defined - - int pm_generic_thaw(struct device *dev); - - if the device has not been suspended at run time, invoke the ->thaw() - callback provided by its driver and return its result, or return 0 if not - defined - - int
pm_generic_thaw_noirq(struct device *dev); - - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq() - callback provided by the device's driver and return its result, or return - 0 if not defined - - int pm_generic_poweroff(struct device *dev); - - if the device has not been suspended at run time, invoke the ->poweroff() - callback provided by its driver and return its result, or return 0 if not - defined - - int pm_generic_poweroff_noirq(struct device *dev); - - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq() - callback provided by the device's driver and return its result, or return - 0 if not defined - - int pm_generic_restore(struct device *dev); - - invoke the ->restore() callback provided by the driver of this device and, - if successful, change the device's runtime PM status to 'active' - - int pm_generic_restore_noirq(struct device *dev); - - invoke the ->restore_noirq() callback provided by the device's driver - -These functions are the defaults used by the PM core, if a subsystem doesn't -provide its own callbacks for ->runtime_idle(), ->runtime_suspend(), -->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(), -->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(), -->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the -subsystem-level dev_pm_ops structure. - -Device drivers that wish to use the same function as a system suspend, freeze, -poweroff and runtime suspend callback, and similarly for system resume, thaw, -restore, and runtime resume, can achieve this with the help of the -UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its -last argument to NULL). - -8. "No-Callback" Devices - -Some "devices" are only logical sub-devices of their parent and cannot be -power-managed on their own. (The prototype example is a USB interface. 
Entire -USB devices can go into low-power mode or send wake-up requests, but neither is -possible for individual interfaces.) The drivers for these devices have no -need of runtime PM callbacks; if the callbacks did exist, ->runtime_suspend() -and ->runtime_resume() would always return 0 without doing anything else and -->runtime_idle() would always call pm_runtime_suspend(). - -Subsystems can tell the PM core about these devices by calling -pm_runtime_no_callbacks(). This should be done after the device structure is -initialized and before it is registered (although after device registration is -also okay). The routine will set the device's power.no_callbacks flag and -prevent the non-debugging runtime PM sysfs attributes from being created. - -When power.no_callbacks is set, the PM core will not invoke the -->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks. -Instead it will assume that suspends and resumes always succeed and that idle -devices should be suspended. - -As a consequence, the PM core will never directly inform the device's subsystem -or driver about runtime power changes. Instead, the driver for the device's -parent must take responsibility for telling the device's driver when the -parent's power state changes. - -9. Autosuspend, or automatically-delayed suspends - -Changing a device's power state isn't free; it requires both time and energy. -A device should be put in a low-power state only when there's some reason to -think it will remain in that state for a substantial time. A common heuristic -says that a device which hasn't been used for a while is liable to remain -unused; following this advice, drivers should not allow devices to be suspended -at runtime until they have been inactive for some minimum period. Even when -the heuristic ends up being non-optimal, it will still prevent devices from -"bouncing" too rapidly between low-power and full-power states. - -The term "autosuspend" is an historical remnant. 
It doesn't mean that the -device is automatically suspended (the subsystem or driver still has to call -the appropriate PM routines); rather it means that runtime suspends will -automatically be delayed until the desired period of inactivity has elapsed. - -Inactivity is determined based on the power.last_busy field. Drivers should -call pm_runtime_mark_last_busy() to update this field after carrying out I/O, -typically just before calling pm_runtime_put_autosuspend(). The desired length -of the inactivity period is a matter of policy. Subsystems can set this length -initially by calling pm_runtime_set_autosuspend_delay(), but after device -registration the length should be controlled by user space, using the -/sys/devices/.../power/autosuspend_delay_ms attribute. - -In order to use autosuspend, subsystems or drivers must call -pm_runtime_use_autosuspend() (preferably before registering the device), and -thereafter they should use the various *_autosuspend() helper functions instead -of the non-autosuspend counterparts: - - Instead of: pm_runtime_suspend use: pm_runtime_autosuspend; - Instead of: pm_schedule_suspend use: pm_request_autosuspend; - Instead of: pm_runtime_put use: pm_runtime_put_autosuspend; - Instead of: pm_runtime_put_sync use: pm_runtime_put_sync_autosuspend. - -Drivers may also continue to use the non-autosuspend helper functions; they -will behave normally, which means sometimes taking the autosuspend delay into -account (see pm_runtime_idle). - -Under some circumstances a driver or subsystem may want to prevent a device -from autosuspending immediately, even though the usage counter is zero and the -autosuspend delay time has expired. If the ->runtime_suspend() callback -returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is -in the future (as it normally would be if the callback invoked -pm_runtime_mark_last_busy()), the PM core will automatically reschedule the -autosuspend. 
The ->runtime_suspend() callback can't do this rescheduling
-itself because no suspend requests of any kind are accepted while the device is
-suspending (i.e., while the callback is running).
-
-The implementation is well suited for asynchronous use in interrupt contexts.
-However, such use inevitably involves races, because the PM core can't
-synchronize ->runtime_suspend() callbacks with the arrival of I/O requests.
-This synchronization must be handled by the driver, using its private lock.
-Here is a schematic pseudo-code example:
-
-	foo_read_or_write(struct foo_priv *foo, void *data)
-	{
-		lock(&foo->private_lock);
-		add_request_to_io_queue(foo, data);
-		if (foo->num_pending_requests++ == 0)
-			pm_runtime_get(&foo->dev);
-		if (!foo->is_suspended)
-			foo_process_next_request(foo);
-		unlock(&foo->private_lock);
-	}
-
-	foo_io_completion(struct foo_priv *foo, void *req)
-	{
-		lock(&foo->private_lock);
-		if (--foo->num_pending_requests == 0) {
-			pm_runtime_mark_last_busy(&foo->dev);
-			pm_runtime_put_autosuspend(&foo->dev);
-		} else {
-			foo_process_next_request(foo);
-		}
-		unlock(&foo->private_lock);
-		/* Send req result back to the user ... */
-	}
-
-	int foo_runtime_suspend(struct device *dev)
-	{
-		struct foo_priv *foo = container_of(dev, ...);
-		int ret = 0;
-
-		lock(&foo->private_lock);
-		if (foo->num_pending_requests > 0) {
-			ret = -EBUSY;
-		} else {
-			/* ... suspend the device ... */
-			foo->is_suspended = 1;
-		}
-		unlock(&foo->private_lock);
-		return ret;
-	}
-
-	int foo_runtime_resume(struct device *dev)
-	{
-		struct foo_priv *foo = container_of(dev, ...);
-
-		lock(&foo->private_lock);
-		/* ... resume the device ... */
-		foo->is_suspended = 0;
-		pm_runtime_mark_last_busy(&foo->dev);
-		if (foo->num_pending_requests > 0)
-			foo_process_next_request(foo);
-		unlock(&foo->private_lock);
-		return 0;
-	}
-
-The important point is that after foo_io_completion() asks for an autosuspend,
-the foo_runtime_suspend() callback may race with foo_read_or_write().
-Therefore foo_runtime_suspend() has to check whether there are any pending I/O -requests (while holding the private lock) before allowing the suspend to -proceed. - -In addition, the power.autosuspend_delay field can be changed by user space at -any time. If a driver cares about this, it can call -pm_runtime_autosuspend_expiration() from within the ->runtime_suspend() -callback while holding its private lock. If the function returns a nonzero -value then the delay has not yet expired and the callback should return --EAGAIN. diff --git a/Documentation/power/s2ram.rst b/Documentation/power/s2ram.rst new file mode 100644 index 000000000000..d739aa7c742c --- /dev/null +++ b/Documentation/power/s2ram.rst @@ -0,0 +1,87 @@ +======================== +How to get s2ram working +======================== + +2006 Linus Torvalds +2006 Pavel Machek + +1) Check suspend.sf.net, program s2ram there has long whitelist of + "known ok" machines, along with tricks to use on each one. + +2) If that does not help, try reading tricks.txt and + video.txt. Perhaps problem is as simple as broken module, and + simple module unload can fix it. + +3) You can use Linus' TRACE_RESUME infrastructure, described below. + +Using TRACE_RESUME +~~~~~~~~~~~~~~~~~~ + +I've been working at making the machines I have able to STR, and almost +always it's a driver that is buggy. Thank God for the suspend/resume +debugging - the thing that Chuck tried to disable. That's often the _only_ +way to debug these things, and it's actually pretty powerful (but +time-consuming - having to insert TRACE_RESUME() markers into the device +driver that doesn't resume and recompile and reboot). 
+ +Anyway, the way to debug this for people who are interested (have a +machine that doesn't boot) is: + + - enable PM_DEBUG, and PM_TRACE + + - use a script like this:: + + #!/bin/sh + sync + echo 1 > /sys/power/pm_trace + echo mem > /sys/power/state + + to suspend + + - if it doesn't come back up (which is usually the problem), reboot by + holding the power button down, and look at the dmesg output for things + like:: + + Magic number: 4:156:725 + hash matches drivers/base/power/resume.c:28 + hash matches device 0000:01:00.0 + + which means that the last trace event was just before trying to resume + device 0000:01:00.0. Then figure out what driver is controlling that + device (lspci and /sys/devices/pci* is your friend), and see if you can + fix it, disable it, or trace into its resume function. + + If no device matches the hash (or any matches appear to be false positives), + the culprit may be a device from a loadable kernel module that is not loaded + until after the hash is checked. You can check the hash against the current + devices again after more modules are loaded using sysfs:: + + cat /sys/power/pm_trace_dev_match + +For example, the above happens to be the VGA device on my EVO, which I +used to run with "radeonfb" (it's an ATI Radeon mobility). It turns out +that "radeonfb" simply cannot resume that device - it tries to set the +PLL's, and it just _hangs_. Using the regular VGA console and letting X +resume it instead works fine. + +NOTE +==== +pm_trace uses the system's Real Time Clock (RTC) to save the magic number. +Reason for this is that the RTC is the only reliably available piece of +hardware during resume operations where a value can be set that will +survive a reboot. + +pm_trace is not compatible with asynchronous suspend, so it turns +asynchronous suspend off (which may work around timing or +ordering-sensitive bugs). 
+ +Consequence is that after a resume (even if it is successful) your system +clock will have a value corresponding to the magic number instead of the +correct date/time! It is therefore advisable to use a program like ntp-date +or rdate to reset the correct date/time from an external time source when +using this trace option. + +As the clock keeps ticking it is also essential that the reboot is done +quickly after the resume failure. The trace option does not use the seconds +or the low order bits of the minutes of the RTC, but a too long delay will +corrupt the magic value. diff --git a/Documentation/power/s2ram.txt b/Documentation/power/s2ram.txt deleted file mode 100644 index 4685aee197fd..000000000000 --- a/Documentation/power/s2ram.txt +++ /dev/null @@ -1,85 +0,0 @@ - How to get s2ram working - ~~~~~~~~~~~~~~~~~~~~~~~~ - 2006 Linus Torvalds - 2006 Pavel Machek - -1) Check suspend.sf.net, program s2ram there has long whitelist of - "known ok" machines, along with tricks to use on each one. - -2) If that does not help, try reading tricks.txt and - video.txt. Perhaps problem is as simple as broken module, and - simple module unload can fix it. - -3) You can use Linus' TRACE_RESUME infrastructure, described below. - - Using TRACE_RESUME - ~~~~~~~~~~~~~~~~~~ - -I've been working at making the machines I have able to STR, and almost -always it's a driver that is buggy. Thank God for the suspend/resume -debugging - the thing that Chuck tried to disable. That's often the _only_ -way to debug these things, and it's actually pretty powerful (but -time-consuming - having to insert TRACE_RESUME() markers into the device -driver that doesn't resume and recompile and reboot). 
- -Anyway, the way to debug this for people who are interested (have a -machine that doesn't boot) is: - - - enable PM_DEBUG, and PM_TRACE - - - use a script like this: - - #!/bin/sh - sync - echo 1 > /sys/power/pm_trace - echo mem > /sys/power/state - - to suspend - - - if it doesn't come back up (which is usually the problem), reboot by - holding the power button down, and look at the dmesg output for things - like - - Magic number: 4:156:725 - hash matches drivers/base/power/resume.c:28 - hash matches device 0000:01:00.0 - - which means that the last trace event was just before trying to resume - device 0000:01:00.0. Then figure out what driver is controlling that - device (lspci and /sys/devices/pci* is your friend), and see if you can - fix it, disable it, or trace into its resume function. - - If no device matches the hash (or any matches appear to be false positives), - the culprit may be a device from a loadable kernel module that is not loaded - until after the hash is checked. You can check the hash against the current - devices again after more modules are loaded using sysfs: - - cat /sys/power/pm_trace_dev_match - -For example, the above happens to be the VGA device on my EVO, which I -used to run with "radeonfb" (it's an ATI Radeon mobility). It turns out -that "radeonfb" simply cannot resume that device - it tries to set the -PLL's, and it just _hangs_. Using the regular VGA console and letting X -resume it instead works fine. - -NOTE -==== -pm_trace uses the system's Real Time Clock (RTC) to save the magic number. -Reason for this is that the RTC is the only reliably available piece of -hardware during resume operations where a value can be set that will -survive a reboot. - -pm_trace is not compatible with asynchronous suspend, so it turns -asynchronous suspend off (which may work around timing or -ordering-sensitive bugs). 
- -Consequence is that after a resume (even if it is successful) your system -clock will have a value corresponding to the magic number instead of the -correct date/time! It is therefore advisable to use a program like ntp-date -or rdate to reset the correct date/time from an external time source when -using this trace option. - -As the clock keeps ticking it is also essential that the reboot is done -quickly after the resume failure. The trace option does not use the seconds -or the low order bits of the minutes of the RTC, but a too long delay will -corrupt the magic value. diff --git a/Documentation/power/suspend-and-cpuhotplug.rst b/Documentation/power/suspend-and-cpuhotplug.rst new file mode 100644 index 000000000000..7ac8e1f549f4 --- /dev/null +++ b/Documentation/power/suspend-and-cpuhotplug.rst @@ -0,0 +1,286 @@ +==================================================================== +Interaction of Suspend code (S3) with the CPU hotplug infrastructure +==================================================================== + +(C) 2011 - 2014 Srivatsa S. Bhat + + +I. Differences between CPU hotplug and Suspend-to-RAM +====================================================== + +How does the regular CPU hotplug code differ from how the Suspend-to-RAM +infrastructure uses it internally? And where do they share common code? + +Well, a picture is worth a thousand words... So ASCII art follows :-) + +[This depicts the current design in the kernel, and focusses only on the +interactions involving the freezer and CPU hotplug and also tries to explain +the locking involved. It outlines the notifications involved as well. +But please note that here, only the call paths are illustrated, with the aim +of describing where they take different paths and where they share code. +What happens when regular CPU hotplug and Suspend-to-RAM race with each other +is not depicted here.] 
+ +On a high level, the suspend-resume cycle goes like this:: + + |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | + |tasks | | cpus | | | | cpus | |tasks| + + +More details follow:: + + Suspend call path + ----------------- + + Write 'mem' to + /sys/power/state + sysfs file + | + v + Acquire system_transition_mutex lock + | + v + Send PM_SUSPEND_PREPARE + notifications + | + v + Freeze tasks + | + | + v + disable_nonboot_cpus() + /* start */ + | + v + Acquire cpu_add_remove_lock + | + v + Iterate over CURRENTLY + online CPUs + | + | + | ---------- + v | L + ======> _cpu_down() | + | [This takes cpuhotplug.lock | + Common | before taking down the CPU | + code | and releases it when done] | O + | While it is at it, notifications | + | are sent when notable events occur, | + ======> by running all registered callbacks. | + | | O + | | + | | + v | + Note down these cpus in | P + frozen_cpus mask ---------- + | + v + Disable regular cpu hotplug + by increasing cpu_hotplug_disabled + | + v + Release cpu_add_remove_lock + | + v + /* disable_nonboot_cpus() complete */ + | + v + Do suspend + + + +Resuming back is likewise, with the counterparts being (in the order of +execution during resume): + +* enable_nonboot_cpus() which involves:: + + | Acquire cpu_add_remove_lock + | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug + | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] + | Release cpu_add_remove_lock + v + +* thaw tasks +* send PM_POST_SUSPEND notifications +* Release system_transition_mutex lock. + + +It is to be noted here that the system_transition_mutex lock is acquired at the very +beginning, when we are just starting out to suspend, and then released only +after the entire cycle is complete (i.e., suspend + resume). 
+ +:: + + + + Regular CPU hotplug call path + ----------------------------- + + Write 0 (or 1) to + /sys/devices/system/cpu/cpu*/online + sysfs file + | + | + v + cpu_down() + | + v + Acquire cpu_add_remove_lock + | + v + If cpu_hotplug_disabled > 0 + return gracefully + | + | + v + ======> _cpu_down() + | [This takes cpuhotplug.lock + Common | before taking down the CPU + code | and releases it when done] + | While it is at it, notifications + | are sent when notable events occur, + ======> by running all registered callbacks. + | + | + v + Release cpu_add_remove_lock + [That's it!, for + regular CPU hotplug] + + + +So, as can be seen from the two diagrams (the parts marked as "Common code"), +regular CPU hotplug and the suspend code path converge at the _cpu_down() and +_cpu_up() functions. They differ in the arguments passed to these functions, +in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' +argument. But during suspend, since the tasks are already frozen by the time +the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called +with the 'tasks_frozen' argument set to 1. +[See below for some known issues regarding this.] + + +Important files and functions/entry points: +------------------------------------------- + +- kernel/power/process.c : freeze_processes(), thaw_processes() +- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() +- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() + + + +II. What are the issues involved in CPU hotplug? +------------------------------------------------ + +There are some interesting situations involving CPU hotplug and microcode +update on the CPUs, as discussed below: + +[Please bear in mind that the kernel requests the microcode images from +userspace, using the request_firmware() function defined in +drivers/base/firmware_loader/main.c] + + +a. 
When all the CPUs are identical: + + This is the most common situation and it is quite straightforward: we want + to apply the same microcode revision to each of the CPUs. + To give an example of x86, the collect_cpu_info() function defined in + arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU + and thereby in applying the correct microcode revision to it. + But note that the kernel does not maintain a common microcode image for + all the CPUs, in order to handle case 'b' described below. + + +b. When some of the CPUs are different than the rest: + + In this case since we probably need to apply different microcode revisions + to different CPUs, the kernel maintains a copy of the correct microcode + image for each CPU (after appropriate CPU type/model discovery using + functions such as collect_cpu_info()). + + +c. When a CPU is physically hot-unplugged and a new (and possibly different + type of) CPU is hot-plugged into the system: + + In the current design of the kernel, whenever a CPU is taken offline during + a regular CPU hotplug operation, upon receiving the CPU_DEAD notification + (which is sent by the CPU hotplug code), the microcode update driver's + callback for that event reacts by freeing the kernel's copy of the + microcode image for that CPU. + + Hence, when a new CPU is brought online, since the kernel finds that it + doesn't have the microcode image, it does the CPU type/model discovery + afresh and then requests the userspace for the appropriate microcode image + for that CPU, which is subsequently applied. + + For example, in x86, the mc_cpu_callback() function (which is the microcode + update driver's callback registered for CPU hotplug events) calls + microcode_update_cpu() which would call microcode_init_cpu() in this case, + instead of microcode_resume_cpu() when it finds that the kernel doesn't + have a valid microcode image.
This ensures that the CPU type/model + discovery is performed and the right microcode is applied to the CPU after + getting it from userspace. + + +d. Handling microcode update during suspend/hibernate: + + Strictly speaking, during a CPU hotplug operation which does not involve + physically removing or inserting CPUs, the CPUs are not actually powered + off during a CPU offline. They are just put to the lowest C-states possible. + Hence, in such a case, it is not really necessary to re-apply microcode + when the CPUs are brought back online, since they wouldn't have lost the + image during the CPU offline operation. + + This is the usual scenario encountered during a resume after a suspend. + However, in the case of hibernation, since all the CPUs are completely + powered off, during restore it becomes necessary to apply the microcode + images to all the CPUs. + + [Note that we don't expect someone to physically pull out nodes and insert + nodes with a different type of CPUs in-between a suspend-resume or a + hibernate/restore cycle.] + + In the current design of the kernel however, during a CPU offline operation + as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set), + the existing copy of microcode image in the kernel is not freed up. + And during the CPU online operations (during resume/restore), since the + kernel finds that it already has copies of the microcode images for all the + CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU + type/model and the need for validating whether the microcode revisions are + right for the CPUs or not (due to the above assumption that physical CPU + hotplug will not be done in-between suspend/resume or hibernate/restore + cycles). + + +III. Known problems +=================== + +Are there any known problems when regular CPU hotplug and suspend race +with each other? + +Yes, they are listed below: + +1. 
When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to + the _cpu_down() and _cpu_up() functions is *always* 0. + This might not reflect the true current state of the system, since the + tasks could have been frozen by an out-of-band event such as a suspend + operation in progress. Hence, the cpuhp_tasks_frozen variable will not + reflect the frozen state and the CPU hotplug callbacks which evaluate + that variable might execute the wrong code path. + +2. If a regular CPU hotplug stress test happens to race with the freezer due + to a suspend operation in progress at the same time, then we could hit the + situation described below: + + * A regular cpu online operation continues its journey from userspace + into the kernel, since the freezing has not yet begun. + * Then freezer gets to work and freezes userspace. + * If cpu online has not yet completed the microcode update stuff by now, + it will now start waiting on the frozen userspace in the + TASK_UNINTERRUPTIBLE state, in order to get the microcode image. + * Now the freezer continues and tries to freeze the remaining tasks. But + due to this wait mentioned above, the freezer won't be able to freeze + the cpu online hotplug task and hence freezing of tasks fails. + + As a result of this task freezing failure, the suspend operation gets + aborted. diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt deleted file mode 100644 index a8751b8df10e..000000000000 --- a/Documentation/power/suspend-and-cpuhotplug.txt +++ /dev/null @@ -1,274 +0,0 @@ -Interaction of Suspend code (S3) with the CPU hotplug infrastructure - - (C) 2011 - 2014 Srivatsa S. Bhat - - -I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM - infrastructure uses it internally? And where do they share common code? - -Well, a picture is worth a thousand words... 
So ASCII art follows :-) - -[This depicts the current design in the kernel, and focusses only on the -interactions involving the freezer and CPU hotplug and also tries to explain -the locking involved. It outlines the notifications involved as well. -But please note that here, only the call paths are illustrated, with the aim -of describing where they take different paths and where they share code. -What happens when regular CPU hotplug and Suspend-to-RAM race with each other -is not depicted here.] - -On a high level, the suspend-resume cycle goes like this: - -|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | -|tasks | | cpus | | | | cpus | |tasks| - - -More details follow: - - Suspend call path - ----------------- - - Write 'mem' to - /sys/power/state - sysfs file - | - v - Acquire system_transition_mutex lock - | - v - Send PM_SUSPEND_PREPARE - notifications - | - v - Freeze tasks - | - | - v - disable_nonboot_cpus() - /* start */ - | - v - Acquire cpu_add_remove_lock - | - v - Iterate over CURRENTLY - online CPUs - | - | - | ---------- - v | L - ======> _cpu_down() | - | [This takes cpuhotplug.lock | - Common | before taking down the CPU | - code | and releases it when done] | O - | While it is at it, notifications | - | are sent when notable events occur, | - ======> by running all registered callbacks. 
| - | | O - | | - | | - v | - Note down these cpus in | P - frozen_cpus mask ---------- - | - v - Disable regular cpu hotplug - by increasing cpu_hotplug_disabled - | - v - Release cpu_add_remove_lock - | - v - /* disable_nonboot_cpus() complete */ - | - v - Do suspend - - - -Resuming back is likewise, with the counterparts being (in the order of -execution during resume): -* enable_nonboot_cpus() which involves: - | Acquire cpu_add_remove_lock - | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug - | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] - | Release cpu_add_remove_lock - v - -* thaw tasks -* send PM_POST_SUSPEND notifications -* Release system_transition_mutex lock. - - -It is to be noted here that the system_transition_mutex lock is acquired at the very -beginning, when we are just starting out to suspend, and then released only -after the entire cycle is complete (i.e., suspend + resume). - - - - Regular CPU hotplug call path - ----------------------------- - - Write 0 (or 1) to - /sys/devices/system/cpu/cpu*/online - sysfs file - | - | - v - cpu_down() - | - v - Acquire cpu_add_remove_lock - | - v - If cpu_hotplug_disabled > 0 - return gracefully - | - | - v - ======> _cpu_down() - | [This takes cpuhotplug.lock - Common | before taking down the CPU - code | and releases it when done] - | While it is at it, notifications - | are sent when notable events occur, - ======> by running all registered callbacks. - | - | - v - Release cpu_add_remove_lock - [That's it!, for - regular CPU hotplug] - - - -So, as can be seen from the two diagrams (the parts marked as "Common code"), -regular CPU hotplug and the suspend code path converge at the _cpu_down() and -_cpu_up() functions. They differ in the arguments passed to these functions, -in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' -argument. 
But during suspend, since the tasks are already frozen by the time -the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called -with the 'tasks_frozen' argument set to 1. -[See below for some known issues regarding this.] - - -Important files and functions/entry points: ------------------------------------------- - -kernel/power/process.c : freeze_processes(), thaw_processes() -kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() -kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() - - - -II. What are the issues involved in CPU hotplug? - ------------------------------------------- - -There are some interesting situations involving CPU hotplug and microcode -update on the CPUs, as discussed below: - -[Please bear in mind that the kernel requests the microcode images from -userspace, using the request_firmware() function defined in -drivers/base/firmware_loader/main.c] - - -a. When all the CPUs are identical: - - This is the most common situation and it is quite straightforward: we want - to apply the same microcode revision to each of the CPUs. - To give an example of x86, the collect_cpu_info() function defined in - arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU - and thereby in applying the correct microcode revision to it. - But note that the kernel does not maintain a common microcode image for the - all CPUs, in order to handle case 'b' described below. - - -b. When some of the CPUs are different than the rest: - - In this case since we probably need to apply different microcode revisions - to different CPUs, the kernel maintains a copy of the correct microcode - image for each CPU (after appropriate CPU type/model discovery using - functions such as collect_cpu_info()). - - -c. 
When a CPU is physically hot-unplugged and a new (and possibly different - type of) CPU is hot-plugged into the system: - - In the current design of the kernel, whenever a CPU is taken offline during - a regular CPU hotplug operation, upon receiving the CPU_DEAD notification - (which is sent by the CPU hotplug code), the microcode update driver's - callback for that event reacts by freeing the kernel's copy of the - microcode image for that CPU. - - Hence, when a new CPU is brought online, since the kernel finds that it - doesn't have the microcode image, it does the CPU type/model discovery - afresh and then requests the userspace for the appropriate microcode image - for that CPU, which is subsequently applied. - - For example, in x86, the mc_cpu_callback() function (which is the microcode - update driver's callback registered for CPU hotplug events) calls - microcode_update_cpu() which would call microcode_init_cpu() in this case, - instead of microcode_resume_cpu() when it finds that the kernel doesn't - have a valid microcode image. This ensures that the CPU type/model - discovery is performed and the right microcode is applied to the CPU after - getting it from userspace. - - -d. Handling microcode update during suspend/hibernate: - - Strictly speaking, during a CPU hotplug operation which does not involve - physically removing or inserting CPUs, the CPUs are not actually powered - off during a CPU offline. They are just put to the lowest C-states possible. - Hence, in such a case, it is not really necessary to re-apply microcode - when the CPUs are brought back online, since they wouldn't have lost the - image during the CPU offline operation. - - This is the usual scenario encountered during a resume after a suspend. - However, in the case of hibernation, since all the CPUs are completely - powered off, during restore it becomes necessary to apply the microcode - images to all the CPUs. 
- - [Note that we don't expect someone to physically pull out nodes and insert - nodes with a different type of CPUs in-between a suspend-resume or a - hibernate/restore cycle.] - - In the current design of the kernel however, during a CPU offline operation - as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set), - the existing copy of microcode image in the kernel is not freed up. - And during the CPU online operations (during resume/restore), since the - kernel finds that it already has copies of the microcode images for all the - CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU - type/model and the need for validating whether the microcode revisions are - right for the CPUs or not (due to the above assumption that physical CPU - hotplug will not be done in-between suspend/resume or hibernate/restore - cycles). - - -III. Are there any known problems when regular CPU hotplug and suspend race - with each other? - -Yes, they are listed below: - -1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to - the _cpu_down() and _cpu_up() functions is *always* 0. - This might not reflect the true current state of the system, since the - tasks could have been frozen by an out-of-band event such as a suspend - operation in progress. Hence, the cpuhp_tasks_frozen variable will not - reflect the frozen state and the CPU hotplug callbacks which evaluate - that variable might execute the wrong code path. - -2. If a regular CPU hotplug stress test happens to race with the freezer due - to a suspend operation in progress at the same time, then we could hit the - situation described below: - - * A regular cpu online operation continues its journey from userspace - into the kernel, since the freezing has not yet begun. - * Then freezer gets to work and freezes userspace. 
- * If cpu online has not yet completed the microcode update stuff by now, - it will now start waiting on the frozen userspace in the - TASK_UNINTERRUPTIBLE state, in order to get the microcode image. - * Now the freezer continues and tries to freeze the remaining tasks. But - due to this wait mentioned above, the freezer won't be able to freeze - the cpu online hotplug task and hence freezing of tasks fails. - - As a result of this task freezing failure, the suspend operation gets - aborted. diff --git a/Documentation/power/suspend-and-interrupts.rst b/Documentation/power/suspend-and-interrupts.rst new file mode 100644 index 000000000000..4cda6617709a --- /dev/null +++ b/Documentation/power/suspend-and-interrupts.rst @@ -0,0 +1,137 @@ +==================================== +System Suspend and Device Interrupts +==================================== + +Copyright (C) 2014 Intel Corp. +Author: Rafael J. Wysocki + + +Suspending and Resuming Device IRQs +----------------------------------- + +Device interrupt request lines (IRQs) are generally disabled during system +suspend after the "late" phase of suspending devices (that is, after all of the +->prepare, ->suspend and ->suspend_late callbacks have been executed for all +devices). That is done by suspend_device_irqs(). + +The rationale for doing so is that after the "late" phase of device suspend +there is no legitimate reason why any interrupts from suspended devices should +trigger and if any devices have not been suspended properly yet, it is better to +block interrupts from them anyway. Also, in the past we had problems with +interrupt handlers for shared IRQs that device drivers implementing them were +not prepared for interrupts triggering after their devices had been suspended. +In some cases they would attempt to access, for example, memory address spaces +of suspended devices and cause unpredictable behavior to ensue as a result. 
+Unfortunately, such problems are very difficult to debug and the introduction +of suspend_device_irqs(), along with the "noirq" phase of device suspend and +resume, was the only practical way to mitigate them. + +Device IRQs are re-enabled during system resume, right before the "early" phase +of resuming devices (that is, before starting to execute ->resume_early +callbacks for devices). The function doing that is resume_device_irqs(). + + +The IRQF_NO_SUSPEND Flag +------------------------ + +There are interrupts that can legitimately trigger during the entire system +suspend-resume cycle, including the "noirq" phases of suspending and resuming +devices as well as during the time when nonboot CPUs are taken offline and +brought back online. That applies to timer interrupts in the first place, +but also to IPIs and to some other special-purpose interrupts. + +The IRQF_NO_SUSPEND flag is used to indicate that to the IRQ subsystem when +requesting a special-purpose interrupt. It causes suspend_device_irqs() to +leave the corresponding IRQ enabled so as to allow the interrupt to work as +expected during the suspend-resume cycle, but does not guarantee that the +interrupt will wake the system from a suspended state -- for such cases it is +necessary to use enable_irq_wake(). + +Note that the IRQF_NO_SUSPEND flag affects the entire IRQ and not just one +user of it. Thus, if the IRQ is shared, all of the interrupt handlers installed +for it will be executed as usual after suspend_device_irqs(), even if the +IRQF_NO_SUSPEND flag was not passed to request_irq() (or equivalent) by some of +the IRQ's users. For this reason, using IRQF_NO_SUSPEND and IRQF_SHARED at the +same time should be avoided. 
+ + +System Wakeup Interrupts, enable_irq_wake() and disable_irq_wake() +------------------------------------------------------------------ + +System wakeup interrupts generally need to be configured to wake up the system +from sleep states, especially if they are used for different purposes (e.g. as +I/O interrupts) in the working state. + +That may involve turning on a special signal handling logic within the platform +(such as an SoC) so that signals from a given line are routed in a different way +during system sleep so as to trigger a system wakeup when needed. For example, +the platform may include a dedicated interrupt controller used specifically for +handling system wakeup events. Then, if a given interrupt line is supposed to +wake up the system from sleep states, the corresponding input of that interrupt +controller needs to be enabled to receive signals from the line in question. +After wakeup, it generally is better to disable that input to prevent the +dedicated controller from triggering interrupts unnecessarily. + +The IRQ subsystem provides two helper functions to be used by device drivers for +those purposes. Namely, enable_irq_wake() turns on the platform's logic for +handling the given IRQ as a system wakeup interrupt line and disable_irq_wake() +turns that logic off. + +Calling enable_irq_wake() causes suspend_device_irqs() to treat the given IRQ +in a special way. Namely, the IRQ remains enabled, but on the first interrupt +it will be disabled, marked as pending and "suspended" so that it will be +re-enabled by resume_device_irqs() during the subsequent system resume. Also +the PM core is notified about the event which causes the system suspend in +progress to be aborted (that doesn't have to happen immediately, but at one +of the points where the suspend thread looks for pending wakeup events).
+ +This way every interrupt from a wakeup interrupt source will either cause the +system suspend currently in progress to be aborted or wake up the system if +already suspended. However, after suspend_device_irqs() interrupt handlers are +not executed for system wakeup IRQs. They are only executed for IRQF_NO_SUSPEND +IRQs at that time, but those IRQs should not be configured for system wakeup +using enable_irq_wake(). + + +Interrupts and Suspend-to-Idle +------------------------------ + +Suspend-to-idle (also known as the "freeze" sleep state) is a relatively new +system sleep state that works by idling all of the processors and waiting for +interrupts right after the "noirq" phase of suspending devices. + +Of course, this means that all of the interrupts with the IRQF_NO_SUSPEND flag +set will bring CPUs out of idle while in that state, but they will not cause the +IRQ subsystem to trigger a system wakeup. + +System wakeup interrupts, in turn, will trigger wakeup from suspend-to-idle in +analogy with what they do in the full system suspend case. The only difference +is that the wakeup from suspend-to-idle is signaled using the usual working +state interrupt delivery mechanisms and doesn't require the platform to use +any special interrupt handling logic for it to work. + + +IRQF_NO_SUSPEND and enable_irq_wake() +------------------------------------- + +There are very few valid reasons to use both enable_irq_wake() and the +IRQF_NO_SUSPEND flag on the same IRQ, and it is never valid to use both for the +same device. + +First of all, if the IRQ is not shared, the rules for handling IRQF_NO_SUSPEND +interrupts (interrupt handlers are invoked after suspend_device_irqs()) are +directly at odds with the rules for handling system wakeup interrupts (interrupt +handlers are not invoked after suspend_device_irqs()). 
+ +Second, both enable_irq_wake() and IRQF_NO_SUSPEND apply to entire IRQs and not +to individual interrupt handlers, so sharing an IRQ between a system wakeup +interrupt source and an IRQF_NO_SUSPEND interrupt source does not generally +make sense. + +In rare cases an IRQ can be shared between a wakeup device driver and an +IRQF_NO_SUSPEND user. In order for this to be safe, the wakeup device driver +must be able to discern spurious IRQs from genuine wakeup events (signalling +the latter to the core with pm_system_wakeup()), must use enable_irq_wake() to +ensure that the IRQ will function as a wakeup source, and must request the IRQ +with IRQF_COND_SUSPEND to tell the core that it meets these requirements. If +these requirements are not met, it is not valid to use IRQF_COND_SUSPEND. diff --git a/Documentation/power/suspend-and-interrupts.txt b/Documentation/power/suspend-and-interrupts.txt deleted file mode 100644 index 8afb29a8604a..000000000000 --- a/Documentation/power/suspend-and-interrupts.txt +++ /dev/null @@ -1,135 +0,0 @@ -System Suspend and Device Interrupts - -Copyright (C) 2014 Intel Corp. -Author: Rafael J. Wysocki - - -Suspending and Resuming Device IRQs ------------------------------------ - -Device interrupt request lines (IRQs) are generally disabled during system -suspend after the "late" phase of suspending devices (that is, after all of the -->prepare, ->suspend and ->suspend_late callbacks have been executed for all -devices). That is done by suspend_device_irqs(). - -The rationale for doing so is that after the "late" phase of device suspend -there is no legitimate reason why any interrupts from suspended devices should -trigger and if any devices have not been suspended properly yet, it is better to -block interrupts from them anyway. Also, in the past we had problems with -interrupt handlers for shared IRQs that device drivers implementing them were -not prepared for interrupts triggering after their devices had been suspended. 
-In some cases they would attempt to access, for example, memory address spaces -of suspended devices and cause unpredictable behavior to ensue as a result. -Unfortunately, such problems are very difficult to debug and the introduction -of suspend_device_irqs(), along with the "noirq" phase of device suspend and -resume, was the only practical way to mitigate them. - -Device IRQs are re-enabled during system resume, right before the "early" phase -of resuming devices (that is, before starting to execute ->resume_early -callbacks for devices). The function doing that is resume_device_irqs(). - - -The IRQF_NO_SUSPEND Flag ------------------------- - -There are interrupts that can legitimately trigger during the entire system -suspend-resume cycle, including the "noirq" phases of suspending and resuming -devices as well as during the time when nonboot CPUs are taken offline and -brought back online. That applies to timer interrupts in the first place, -but also to IPIs and to some other special-purpose interrupts. - -The IRQF_NO_SUSPEND flag is used to indicate that to the IRQ subsystem when -requesting a special-purpose interrupt. It causes suspend_device_irqs() to -leave the corresponding IRQ enabled so as to allow the interrupt to work as -expected during the suspend-resume cycle, but does not guarantee that the -interrupt will wake the system from a suspended state -- for such cases it is -necessary to use enable_irq_wake(). - -Note that the IRQF_NO_SUSPEND flag affects the entire IRQ and not just one -user of it. Thus, if the IRQ is shared, all of the interrupt handlers installed -for it will be executed as usual after suspend_device_irqs(), even if the -IRQF_NO_SUSPEND flag was not passed to request_irq() (or equivalent) by some of -the IRQ's users. For this reason, using IRQF_NO_SUSPEND and IRQF_SHARED at the -same time should be avoided. 
- - -System Wakeup Interrupts, enable_irq_wake() and disable_irq_wake() ------------------------------------------------------------------- - -System wakeup interrupts generally need to be configured to wake up the system -from sleep states, especially if they are used for different purposes (e.g. as -I/O interrupts) in the working state. - -That may involve turning on a special signal handling logic within the platform -(such as an SoC) so that signals from a given line are routed in a different way -during system sleep so as to trigger a system wakeup when needed. For example, -the platform may include a dedicated interrupt controller used specifically for -handling system wakeup events. Then, if a given interrupt line is supposed to -wake up the system from sleep sates, the corresponding input of that interrupt -controller needs to be enabled to receive signals from the line in question. -After wakeup, it generally is better to disable that input to prevent the -dedicated controller from triggering interrupts unnecessarily. - -The IRQ subsystem provides two helper functions to be used by device drivers for -those purposes. Namely, enable_irq_wake() turns on the platform's logic for -handling the given IRQ as a system wakeup interrupt line and disable_irq_wake() -turns that logic off. - -Calling enable_irq_wake() causes suspend_device_irqs() to treat the given IRQ -in a special way. Namely, the IRQ remains enabled, by on the first interrupt -it will be disabled, marked as pending and "suspended" so that it will be -re-enabled by resume_device_irqs() during the subsequent system resume. Also -the PM core is notified about the event which causes the system suspend in -progress to be aborted (that doesn't have to happen immediately, but at one -of the points where the suspend thread looks for pending wakeup events). 
- -This way every interrupt from a wakeup interrupt source will either cause the -system suspend currently in progress to be aborted or wake up the system if -already suspended. However, after suspend_device_irqs() interrupt handlers are -not executed for system wakeup IRQs. They are only executed for IRQF_NO_SUSPEND -IRQs at that time, but those IRQs should not be configured for system wakeup -using enable_irq_wake(). - - -Interrupts and Suspend-to-Idle ------------------------------- - -Suspend-to-idle (also known as the "freeze" sleep state) is a relatively new -system sleep state that works by idling all of the processors and waiting for -interrupts right after the "noirq" phase of suspending devices. - -Of course, this means that all of the interrupts with the IRQF_NO_SUSPEND flag -set will bring CPUs out of idle while in that state, but they will not cause the -IRQ subsystem to trigger a system wakeup. - -System wakeup interrupts, in turn, will trigger wakeup from suspend-to-idle in -analogy with what they do in the full system suspend case. The only difference -is that the wakeup from suspend-to-idle is signaled using the usual working -state interrupt delivery mechanisms and doesn't require the platform to use -any special interrupt handling logic for it to work. - - -IRQF_NO_SUSPEND and enable_irq_wake() -------------------------------------- - -There are very few valid reasons to use both enable_irq_wake() and the -IRQF_NO_SUSPEND flag on the same IRQ, and it is never valid to use both for the -same device. - -First of all, if the IRQ is not shared, the rules for handling IRQF_NO_SUSPEND -interrupts (interrupt handlers are invoked after suspend_device_irqs()) are -directly at odds with the rules for handling system wakeup interrupts (interrupt -handlers are not invoked after suspend_device_irqs()). 
- -Second, both enable_irq_wake() and IRQF_NO_SUSPEND apply to entire IRQs and not -to individual interrupt handlers, so sharing an IRQ between a system wakeup -interrupt source and an IRQF_NO_SUSPEND interrupt source does not generally -make sense. - -In rare cases an IRQ can be shared between a wakeup device driver and an -IRQF_NO_SUSPEND user. In order for this to be safe, the wakeup device driver -must be able to discern spurious IRQs from genuine wakeup events (signalling -the latter to the core with pm_system_wakeup()), must use enable_irq_wake() to -ensure that the IRQ will function as a wakeup source, and must request the IRQ -with IRQF_COND_SUSPEND to tell the core that it meets these requirements. If -these requirements are not met, it is not valid to use IRQF_COND_SUSPEND. diff --git a/Documentation/power/swsusp-and-swap-files.rst b/Documentation/power/swsusp-and-swap-files.rst new file mode 100644 index 000000000000..a33a2919dbe4 --- /dev/null +++ b/Documentation/power/swsusp-and-swap-files.rst @@ -0,0 +1,63 @@ +=============================================== +Using swap files with software suspend (swsusp) +=============================================== + + (C) 2006 Rafael J. Wysocki + +The Linux kernel handles swap files almost in the same way as it handles swap +partitions and there are only two differences between these two types of swap +areas: +(1) swap files need not be contiguous, +(2) the header of a swap file is not in the first block of the partition that +holds it. From the swsusp's point of view (1) is not a problem, because it is +already taken care of by the swap-handling code, but (2) has to be taken into +consideration. + +In principle the location of a swap file's header may be determined with the +help of appropriate filesystem driver. Unfortunately, however, it requires the +filesystem holding the swap file to be mounted, and if this filesystem is +journaled, it cannot be mounted during resume from disk. 
For this reason to
+identify a swap file swsusp uses the name of the partition that holds the file
+and the offset from the beginning of the partition at which the swap file's
+header is located. For convenience, this offset is expressed in
+units.
+
+In order to use a swap file with swsusp, you need to:
+
+1) Create the swap file and make it active, eg.::
+
+    # dd if=/dev/zero of= bs=1024 count=
+    # mkswap
+    # swapon
+
+2) Use an application that will bmap the swap file with the help of the
+FIBMAP ioctl and determine the location of the file's swap header, as the
+offset, in units, from the beginning of the partition which
+holds the swap file.
+
+3) Add the following parameters to the kernel command line::
+
+    resume= resume_offset=
+
+where is the partition on which the swap file is located
+and is the offset of the swap header determined by the
+application in 2) (of course, this step may be carried out automatically
+by the same application that determines the swap file's header offset using the
+FIBMAP ioctl)
+
+OR
+
+Use a userland suspend application that will set the partition and offset
+with the help of the SNAPSHOT_SET_SWAP_AREA ioctl described in
+Documentation/power/userland-swsusp.rst (this is the only method to suspend
+to a swap file allowing the resume to be initiated from an initrd or initramfs
+image).
+
+Now, swsusp will use the swap file in the same way in which it would use a swap
+partition. In particular, the swap file has to be active (ie. be present in
+/proc/swaps) so that it can be used for suspending.
+
+Note that if the swap file used for suspending is deleted and recreated,
+the location of its header need not be the same as before. Thus every time
+this happens the value of the "resume_offset=" kernel command line parameter
+has to be updated.
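A rough illustration of two checks implied by swsusp-and-swap-files.rst above: that the swap file is active (listed in /proc/swaps), and how the parameters from step 3) are put together. This is only a sketch and not part of the patch. The function names `swap_is_active` and `resume_args` are hypothetical, and the sketch does not compute the offset, which still has to come from a FIBMAP-based tool as described in step 2).

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not part of the patch.

# swap_is_active FILE [TABLE]
# Succeeds if FILE is listed in the first column of TABLE.
# TABLE defaults to /proc/swaps; a different path can be passed for testing.
# Assumption: FILE contains no whitespace or backslashes.
swap_is_active() {
    file=$1
    table=${2:-/proc/swaps}
    # Skip the header row, then compare the first field exactly.
    awk -v f="$file" 'NR > 1 && $1 == f { found = 1 } END { exit !found }' "$table"
}

# resume_args PARTITION OFFSET
# Prints the kernel command line fragment described in step 3).
# OFFSET must already have been determined by a FIBMAP-based application.
resume_args() {
    part=$1
    offset=$2
    case $offset in
        ''|*[!0-9]*)
            echo "resume_args: offset must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf 'resume=%s resume_offset=%s\n' "$part" "$offset"
}
```

For example, `resume_args /dev/sda2 34816` prints `resume=/dev/sda2 resume_offset=34816`; the partition name and offset here are made-up values. As the note above says, the offset has to be determined again whenever the swap file is deleted and recreated.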
diff --git a/Documentation/power/swsusp-and-swap-files.txt b/Documentation/power/swsusp-and-swap-files.txt deleted file mode 100644 index f281886de490..000000000000 --- a/Documentation/power/swsusp-and-swap-files.txt +++ /dev/null @@ -1,60 +0,0 @@ -Using swap files with software suspend (swsusp) - (C) 2006 Rafael J. Wysocki - -The Linux kernel handles swap files almost in the same way as it handles swap -partitions and there are only two differences between these two types of swap -areas: -(1) swap files need not be contiguous, -(2) the header of a swap file is not in the first block of the partition that -holds it. From the swsusp's point of view (1) is not a problem, because it is -already taken care of by the swap-handling code, but (2) has to be taken into -consideration. - -In principle the location of a swap file's header may be determined with the -help of appropriate filesystem driver. Unfortunately, however, it requires the -filesystem holding the swap file to be mounted, and if this filesystem is -journaled, it cannot be mounted during resume from disk. For this reason to -identify a swap file swsusp uses the name of the partition that holds the file -and the offset from the beginning of the partition at which the swap file's -header is located. For convenience, this offset is expressed in -units. - -In order to use a swap file with swsusp, you need to: - -1) Create the swap file and make it active, eg. - -# dd if=/dev/zero of= bs=1024 count= -# mkswap -# swapon - -2) Use an application that will bmap the swap file with the help of the -FIBMAP ioctl and determine the location of the file's swap header, as the -offset, in units, from the beginning of the partition which -holds the swap file. 
- -3) Add the following parameters to the kernel command line: - -resume= resume_offset= - -where is the partition on which the swap file is located -and is the offset of the swap header determined by the -application in 2) (of course, this step may be carried out automatically -by the same application that determines the swap file's header offset using the -FIBMAP ioctl) - -OR - -Use a userland suspend application that will set the partition and offset -with the help of the SNAPSHOT_SET_SWAP_AREA ioctl described in -Documentation/power/userland-swsusp.txt (this is the only method to suspend -to a swap file allowing the resume to be initiated from an initrd or initramfs -image). - -Now, swsusp will use the swap file in the same way in which it would use a swap -partition. In particular, the swap file has to be active (ie. be present in -/proc/swaps) so that it can be used for suspending. - -Note that if the swap file used for suspending is deleted and recreated, -the location of its header need not be the same as before. Thus every time -this happens the value of the "resume_offset=" kernel command line parameter -has to be updated. diff --git a/Documentation/power/swsusp-dmcrypt.rst b/Documentation/power/swsusp-dmcrypt.rst new file mode 100644 index 000000000000..426df59172cd --- /dev/null +++ b/Documentation/power/swsusp-dmcrypt.rst @@ -0,0 +1,140 @@ +======================================= +How to use dm-crypt and swsusp together +======================================= + +Author: Andreas Steinmetz + + + +Some prerequisites: +You know how dm-crypt works. If not, visit the following web page: +http://www.saout.de/misc/dm-crypt/ +You have read Documentation/power/swsusp.rst and understand it. +You did read Documentation/admin-guide/initrd.rst and know how an initrd works. +You know how to create or how to modify an initrd. 
+ +Now your system is properly set up, your disk is encrypted except for +the swap device(s) and the boot partition which may contain a mini +system for crypto setup and/or rescue purposes. You may even have +an initrd that does your current crypto setup already. + +At this point you want to encrypt your swap, too. Still you want to +be able to suspend using swsusp. This, however, means that you +have to be able to either enter a passphrase or that you read +the key(s) from an external device like a pcmcia flash disk +or an usb stick prior to resume. So you need an initrd, that sets +up dm-crypt and then asks swsusp to resume from the encrypted +swap device. + +The most important thing is that you set up dm-crypt in such +a way that the swap device you suspend to/resume from has +always the same major/minor within the initrd as well as +within your running system. The easiest way to achieve this is +to always set up this swap device first with dmsetup, so that +it will always look like the following:: + + brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0 + +Now set up your kernel to use /dev/mapper/swap0 as the default +resume partition, so your kernel .config contains:: + + CONFIG_PM_STD_PARTITION="/dev/mapper/swap0" + +Prepare your boot loader to use the initrd you will create or +modify. For lilo the simplest setup looks like the following +lines:: + + image=/boot/vmlinuz + initrd=/boot/initrd.gz + label=linux + append="root=/dev/ram0 init=/linuxrc rw" + +Finally you need to create or modify your initrd. Lets assume +you create an initrd that reads the required dm-crypt setup +from a pcmcia flash disk card. The card is formatted with an ext2 +fs which resides on /dev/hde1 when the card is inserted. The +card contains at least the encrypted swap setup in a file +named "swapkey". 
/etc/fstab of your initrd contains something
+like the following::
+
+  /dev/hda1   /mnt    ext3      ro                            0 0
+  none        /proc   proc      defaults,noatime,nodiratime   0 0
+  none        /sys    sysfs     defaults,noatime,nodiratime   0 0
+
+/dev/hda1 contains an unencrypted mini system that sets up all
+of your crypto devices, again by reading the setup from the
+pcmcia flash disk. What follows now is a /linuxrc for your
+initrd that allows you to resume from encrypted swap and that
+continues boot with your mini system on /dev/hda1 if resume
+does not happen::
+
+  #!/bin/sh
+  PATH=/sbin:/bin:/usr/sbin:/usr/bin
+  mount /proc
+  mount /sys
+  mapped=0
+  noresume=`grep -c noresume /proc/cmdline`
+  if [ "$*" != "" ]
+  then
+    noresume=1
+  fi
+  dmesg -n 1
+  /sbin/cardmgr -q
+  for i in 1 2 3 4 5 6 7 8 9 0
+  do
+    if [ -f /proc/ide/hde/media ]
+    then
+      usleep 500000
+      mount -t ext2 -o ro /dev/hde1 /mnt
+      if [ -f /mnt/swapkey ]
+      then
+        dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1
+      fi
+      umount /mnt
+      break
+    fi
+    usleep 500000
+  done
+  killproc /sbin/cardmgr
+  dmesg -n 6
+  if [ $mapped = 1 ]
+  then
+    if [ $noresume != 0 ]
+    then
+      mkswap /dev/mapper/swap0 > /dev/null 2>&1
+    fi
+    echo 254:0 > /sys/power/resume
+    dmsetup remove swap0
+  fi
+  umount /sys
+  mount /mnt
+  umount /proc
+  cd /mnt
+  pivot_root . mnt
+  mount /proc
+  umount -l /mnt
+  umount /proc
+  exec chroot . /sbin/init $* < dev/console > dev/console 2>&1
+
+Please don't mind the weird loop above, busybox's msh doesn't know
+the let statement. Now, what is happening in the script?
+First we have to decide if we want to try to resume, or not.
+We will not resume if booting with "noresume" or any parameters
+for init like "single" or "emergency" as boot parameters.
+
+Then we need to set up dmcrypt with the setup data from the
+pcmcia flash disk. If this succeeds we need to reset the swap
+device if we don't want to resume. The line "echo 254:0 > /sys/power/resume"
+then attempts to resume from the first device mapper device.
+Note that it is important to set the device in /sys/power/resume, +regardless if resuming or not, otherwise later suspend will fail. +If resume starts, script execution terminates here. + +Otherwise we just remove the encrypted swap device and leave it to the +mini system on /dev/hda1 to set the whole crypto up (it is up to +you to modify this to your taste). + +What then follows is the well known process to change the root +file system and continue booting from there. I prefer to unmount +the initrd prior to continue booting but it is up to you to modify +this. diff --git a/Documentation/power/swsusp-dmcrypt.txt b/Documentation/power/swsusp-dmcrypt.txt deleted file mode 100644 index b802fbfd95ef..000000000000 --- a/Documentation/power/swsusp-dmcrypt.txt +++ /dev/null @@ -1,138 +0,0 @@ -Author: Andreas Steinmetz - - -How to use dm-crypt and swsusp together: -======================================== - -Some prerequisites: -You know how dm-crypt works. If not, visit the following web page: -http://www.saout.de/misc/dm-crypt/ -You have read Documentation/power/swsusp.txt and understand it. -You did read Documentation/admin-guide/initrd.rst and know how an initrd works. -You know how to create or how to modify an initrd. - -Now your system is properly set up, your disk is encrypted except for -the swap device(s) and the boot partition which may contain a mini -system for crypto setup and/or rescue purposes. You may even have -an initrd that does your current crypto setup already. - -At this point you want to encrypt your swap, too. Still you want to -be able to suspend using swsusp. This, however, means that you -have to be able to either enter a passphrase or that you read -the key(s) from an external device like a pcmcia flash disk -or an usb stick prior to resume. So you need an initrd, that sets -up dm-crypt and then asks swsusp to resume from the encrypted -swap device. 
- -The most important thing is that you set up dm-crypt in such -a way that the swap device you suspend to/resume from has -always the same major/minor within the initrd as well as -within your running system. The easiest way to achieve this is -to always set up this swap device first with dmsetup, so that -it will always look like the following: - -brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0 - -Now set up your kernel to use /dev/mapper/swap0 as the default -resume partition, so your kernel .config contains: - -CONFIG_PM_STD_PARTITION="/dev/mapper/swap0" - -Prepare your boot loader to use the initrd you will create or -modify. For lilo the simplest setup looks like the following -lines: - -image=/boot/vmlinuz -initrd=/boot/initrd.gz -label=linux -append="root=/dev/ram0 init=/linuxrc rw" - -Finally you need to create or modify your initrd. Lets assume -you create an initrd that reads the required dm-crypt setup -from a pcmcia flash disk card. The card is formatted with an ext2 -fs which resides on /dev/hde1 when the card is inserted. The -card contains at least the encrypted swap setup in a file -named "swapkey". /etc/fstab of your initrd contains something -like the following: - -/dev/hda1 /mnt ext3 ro 0 0 -none /proc proc defaults,noatime,nodiratime 0 0 -none /sys sysfs defaults,noatime,nodiratime 0 0 - -/dev/hda1 contains an unencrypted mini system that sets up all -of your crypto devices, again by reading the setup from the -pcmcia flash disk. 
What follows now is a /linuxrc for your -initrd that allows you to resume from encrypted swap and that -continues boot with your mini system on /dev/hda1 if resume -does not happen: - -#!/bin/sh -PATH=/sbin:/bin:/usr/sbin:/usr/bin -mount /proc -mount /sys -mapped=0 -noresume=`grep -c noresume /proc/cmdline` -if [ "$*" != "" ] -then - noresume=1 -fi -dmesg -n 1 -/sbin/cardmgr -q -for i in 1 2 3 4 5 6 7 8 9 0 -do - if [ -f /proc/ide/hde/media ] - then - usleep 500000 - mount -t ext2 -o ro /dev/hde1 /mnt - if [ -f /mnt/swapkey ] - then - dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1 - fi - umount /mnt - break - fi - usleep 500000 -done -killproc /sbin/cardmgr -dmesg -n 6 -if [ $mapped = 1 ] -then - if [ $noresume != 0 ] - then - mkswap /dev/mapper/swap0 > /dev/null 2>&1 - fi - echo 254:0 > /sys/power/resume - dmsetup remove swap0 -fi -umount /sys -mount /mnt -umount /proc -cd /mnt -pivot_root . mnt -mount /proc -umount -l /mnt -umount /proc -exec chroot . /sbin/init $* < dev/console > dev/console 2>&1 - -Please don't mind the weird loop above, busybox's msh doesn't know -the let statement. Now, what is happening in the script? -First we have to decide if we want to try to resume, or not. -We will not resume if booting with "noresume" or any parameters -for init like "single" or "emergency" as boot parameters. - -Then we need to set up dmcrypt with the setup data from the -pcmcia flash disk. If this succeeds we need to reset the swap -device if we don't want to resume. The line "echo 254:0 > /sys/power/resume" -then attempts to resume from the first device mapper device. -Note that it is important to set the device in /sys/power/resume, -regardless if resuming or not, otherwise later suspend will fail. -If resume starts, script execution terminates here. - -Otherwise we just remove the encrypted swap device and leave it to the -mini system on /dev/hda1 to set the whole crypto up (it is up to -you to modify this to your taste). 
- -What then follows is the well known process to change the root -file system and continue booting from there. I prefer to unmount -the initrd prior to continue booting but it is up to you to modify -this. diff --git a/Documentation/power/swsusp.rst b/Documentation/power/swsusp.rst new file mode 100644 index 000000000000..d000312f6965 --- /dev/null +++ b/Documentation/power/swsusp.rst @@ -0,0 +1,501 @@ +============ +Swap suspend +============ + +Some warnings, first. + +.. warning:: + + **BIG FAT WARNING** + + If you touch anything on disk between suspend and resume... + ...kiss your data goodbye. + + If you do resume from initrd after your filesystems are mounted... + ...bye bye root partition. + + [this is actually same case as above] + + If you have unsupported ( ) devices using DMA, you may have some + problems. If your disk driver does not support suspend... (IDE does), + it may cause some problems, too. If you change kernel command line + between suspend and resume, it may do something wrong. If you change + your hardware while system is suspended... well, it was not good idea; + but it will probably only crash. + + ( ) suspend/resume support is needed to make it safe. + + If you have any filesystems on USB devices mounted before software suspend, + they won't be accessible after resume and you may lose data, as though + you have unplugged the USB devices with mounted filesystems on them; + see the FAQ below for details. (This is not true for more traditional + power states like "standby", which normally don't turn USB off.) + +Swap partition: + You need to append resume=/dev/your_swap_partition to kernel command + line or specify it using /sys/power/resume. + +Swap file: + If using a swapfile you can also specify a resume offset using + resume_offset= on the kernel command line or specify it + in /sys/power/resume_offset. 
+
+After preparing then you suspend by::
+
+    echo shutdown > /sys/power/disk; echo disk > /sys/power/state
+
+- If you feel ACPI works pretty well on your system, you might try::
+
+    echo platform > /sys/power/disk; echo disk > /sys/power/state
+
+- If you would like to write hibernation image to swap and then suspend
+  to RAM (provided your platform supports it), you can try::
+
+    echo suspend > /sys/power/disk; echo disk > /sys/power/state
+
+- If you have SATA disks, you'll need recent kernels with SATA suspend
+  support. For suspend and resume to work, make sure your disk drivers
+  are built into kernel -- not modules. [There's way to make
+  suspend/resume with modular disk drivers, see FAQ, but you probably
+  should not do that.]
+
+If you want to limit the suspend image size to N bytes, do::
+
+    echo N > /sys/power/image_size
+
+before suspend (it is limited to around 2/5 of available RAM by default).
+
+- The resume process checks for the presence of the resume device,
+  if found, it then checks the contents for the hibernation image signature.
+  If both are found, it resumes the hibernation image.
+
+- The resume process may be triggered in two ways:
+
+  1) During lateinit: If resume=/dev/your_swap_partition is specified on
+     the kernel command line, lateinit runs the resume process. If the
+     resume device has not been probed yet, the resume process fails and
+     bootup continues.
+  2) Manually from an initrd or initramfs: May be run from
+     the init script by using the /sys/power/resume file. It is vital
+     that this be done prior to remounting any filesystems (even as
+     read-only) otherwise data may be corrupted.
+
+Article about goals and implementation of Software Suspend for Linux
+====================================================================
+
+Author: Gábor Kuti
+Last revised: 2003-10-20 by Pavel Machek
+
+Idea and goals to achieve
+-------------------------
+
+Nowadays it is common in several laptops that they have a suspend button.
It
+saves the state of the machine to a filesystem or to a partition and switches
+to standby mode. Later resuming the machine the saved state is loaded back to
+ram and the machine can continue its work. It has two real benefits. First we
+save ourselves the time machine goes down and later boots up, energy costs
+are real high when running from batteries. The other gain is that we don't have
+to interrupt our programs so processes that are calculating something for a long
+time shouldn't need to be written interruptible.
+
+swsusp saves the state of the machine into active swaps and then reboots or
+powerdowns. You must explicitly specify the swap partition to resume from with
+`resume=` kernel option. If signature is found it loads and restores saved
+state. If the option `noresume` is specified as a boot parameter, it skips
+the resuming. If the option `hibernate=nocompress` is specified as a boot
+parameter, it saves hibernation image without compression.
+
+In the meantime while the system is suspended you should not add/remove any
+of the hardware, write to the filesystems, etc.
+
+Sleep states summary
+====================
+
+There are three different interfaces you can use, /proc/acpi should
+work like this:
+
+In a really perfect world::
+
+  echo 1 > /proc/acpi/sleep       # for standby
+  echo 2 > /proc/acpi/sleep       # for suspend to ram
+  echo 3 > /proc/acpi/sleep       # for suspend to ram, but with more power conservative
+  echo 4 > /proc/acpi/sleep       # for suspend to disk
+  echo 5 > /proc/acpi/sleep       # for shutdown unfriendly the system
+
+and perhaps::
+
+  echo 4b > /proc/acpi/sleep      # for suspend to disk via s4bios
+
+Frequently Asked Questions
+==========================
+
+Q:
+  well, suspending a server is IMHO a really stupid thing,
+  but... (Diego Zuccato):
+
+A:
+  You bought new UPS for your server. How do you install it without
+  bringing machine down? Suspend to disk, rearrange power cables,
+  resume.
+
+  You have your server on UPS.
Power died, and UPS is indicating 30 + seconds to failure. What do you do? Suspend to disk. + + +Q: + Maybe I'm missing something, but why don't the regular I/O paths work? + +A: + We do use the regular I/O paths. However we cannot restore the data + to its original location as we load it. That would create an + inconsistent kernel state which would certainly result in an oops. + Instead, we load the image into unused memory and then atomically copy + it back to it original location. This implies, of course, a maximum + image size of half the amount of memory. + + There are two solutions to this: + + * require half of memory to be free during suspend. That way you can + read "new" data onto free spots, then cli and copy + + * assume we had special "polling" ide driver that only uses memory + between 0-640KB. That way, I'd have to make sure that 0-640KB is free + during suspending, but otherwise it would work... + + suspend2 shares this fundamental limitation, but does not include user + data and disk caches into "used memory" by saving them in + advance. That means that the limitation goes away in practice. + +Q: + Does linux support ACPI S4? + +A: + Yes. That's what echo platform > /sys/power/disk does. + +Q: + What is 'suspend2'? + +A: + suspend2 is 'Software Suspend 2', a forked implementation of + suspend-to-disk which is available as separate patches for 2.4 and 2.6 + kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB + highmem and preemption. It also has a extensible architecture that + allows for arbitrary transformations on the image (compression, + encryption) and arbitrary backends for writing the image (eg to swap + or an NFS share[Work In Progress]). Questions regarding suspend2 + should be sent to the mailing list available through the suspend2 + website, and not to the Linux Kernel Mailing List. We are working + toward merging suspend2 into the mainline kernel. + +Q: + What is the freezing of tasks and why are we using it? 
+ +A: + The freezing of tasks is a mechanism by which user space processes and some + kernel threads are controlled during hibernation or system-wide suspend (on some + architectures). See freezing-of-tasks.txt for details. + +Q: + What is the difference between "platform" and "shutdown"? + +A: + shutdown: + save state in linux, then tell bios to powerdown + + platform: + save state in linux, then tell bios to powerdown and blink + "suspended led" + + "platform" is actually right thing to do where supported, but + "shutdown" is most reliable (except on ACPI systems). + +Q: + I do not understand why you have such strong objections to idea of + selective suspend. + +A: + Do selective suspend during runtime power management, that's okay. But + it's useless for suspend-to-disk. (And I do not see how you could use + it for suspend-to-ram, I hope you do not want that). + + Lets see, so you suggest to + + * SUSPEND all but swap device and parents + * Snapshot + * Write image to disk + * SUSPEND swap device and parents + * Powerdown + + Oh no, that does not work, if swap device or its parents uses DMA, + you've corrupted data. You'd have to do + + * SUSPEND all but swap device and parents + * FREEZE swap device and parents + * Snapshot + * UNFREEZE swap device and parents + * Write + * SUSPEND swap device and parents + + Which means that you still need that FREEZE state, and you get more + complicated code. (And I have not yet introduce details like system + devices). + +Q: + There don't seem to be any generally useful behavioral + distinctions between SUSPEND and FREEZE. + +A: + Doing SUSPEND when you are asked to do FREEZE is always correct, + but it may be unnecessarily slow. If you want your driver to stay simple, + slowness may not matter to you. It can always be fixed later. + + For devices like disk it does matter, you do not want to spindown for + FREEZE. + +Q: + After resuming, system is paging heavily, leading to very bad interactivity. 
+
+A:
+  Try running::
+
+    cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file
+    do
+      test -f "$file" && cat "$file" > /dev/null
+    done
+
+  after resume. swapoff -a; swapon -a may also be useful.
+
+Q:
+  What happens to devices during swsusp? They seem to be resumed
+  during system suspend?
+
+A:
+  That's correct. We need to resume them if we want to write image to
+  disk. Whole sequence goes like
+
+      **Suspend part**
+
+      running system, user asks for suspend-to-disk
+
+      user processes are stopped
+
+      suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
+        with state snapshot
+
+      state snapshot: copy of whole used memory is taken with interrupts disabled
+
+      resume(): devices are woken up so that we can write image to swap
+
+      write image to swap
+
+      suspend(PMSG_SUSPEND): suspend devices so that we can power off
+
+      turn the power off
+
+      **Resume part**
+
+      (is actually pretty similar)
+
+      running system, user asks for suspend-to-disk
+
+      user processes are stopped (in common case there are none,
+        but with resume-from-initrd, no one knows)
+
+      read image from disk
+
+      suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
+        with image restoration
+
+      image restoration: rewrite memory with image
+
+      resume(): devices are woken up so that system can continue
+
+      thaw all user processes
+
+Q:
+  What is this 'Encrypt suspend image' for?
+
+A:
+  First of all: it is not a replacement for dm-crypt encrypted swap.
+  It cannot protect your computer while it is suspended. Instead it does
+  protect from leaking sensitive data after resume from suspend.
+
+  Think of the following: you suspend while an application is running
+  that keeps sensitive data in memory. The application itself prevents
+  the data from being swapped out. Suspend, however, must write these
+  data to swap to be able to resume later on. Without suspend encryption
+  your sensitive data are then stored in plaintext on disk.
This means + that after resume your sensitive data are accessible to all + applications having direct access to the swap device which was used + for suspend. If you don't need swap after resume these data can remain + on disk virtually forever. Thus it can happen that your system gets + broken into weeks later and sensitive data which you thought were + encrypted and protected are retrieved and stolen from the swap device. + To prevent this situation you should use 'Encrypt suspend image'. + + During suspend a temporary key is created and this key is used to + encrypt the data written to disk. When, during resume, the data has been + read back into memory, the temporary key is destroyed, which simply + means that all data written to disk during suspend are then + inaccessible so they can't be stolen later on. The only thing that + you must then take care of is that you call 'mkswap' for the swap + partition used for suspend as early as possible during regular + boot. This ensures that any temporary key from an oopsed suspend or + from a failed or aborted resume is erased from the swap device. + + As a rule of thumb use encrypted swap to protect your data while your + system is shut down or suspended. Additionally use the encrypted + suspend image to prevent sensitive data from being stolen after + resume. + +Q: + Can I suspend to a swap file? + +A: + Generally, yes, you can. However, it requires you to use the "resume=" and + "resume_offset=" kernel command line parameters, so the resume from a swap file + cannot be initiated from an initrd or initramfs image. See + swsusp-and-swap-files.txt for details. + +Q: + Is there a maximum system RAM size that is supported by swsusp? + +A: + It should work okay with highmem. + +Q: + Does swsusp (to disk) use only one swap partition or can it use + multiple swap partitions (aggregate them into one logical space)? + +A: + Only one swap partition, sorry.
+ +Q: + If my application(s) causes lots of memory & swap space to be used + (over half of the total system RAM), is it correct that it is likely + to be useless to try to suspend to disk while that app is running? + +A: + No, it should work okay, as long as your app does not mlock() + it. Just prepare a big enough swap partition. + +Q: + What information is useful for debugging suspend-to-disk problems? + +A: + Well, the last messages on the screen are always useful. If something + is broken, it is usually some kernel driver, therefore trying with as + few modules loaded as possible helps a lot. I also prefer people to + suspend from console, preferably without X running. Booting with + init=/bin/bash, then swapon and starting suspend sequence manually + usually does the trick. Then it is a good idea to try with the latest + vanilla kernel. + +Q: + How can distributions ship a swsusp-supporting kernel with modular + disk drivers (especially SATA)? + +A: + Well, it can be done, load the drivers, then do echo into + /sys/power/resume file from initrd. Be sure not to mount + anything, not even read-only mount, or you are going to lose your + data. + +Q: + How do I make suspend more verbose? + +A: + If you want to see any non-error kernel messages on the virtual + terminal the kernel switches to during suspend, you have to set the + kernel console loglevel to at least 4 (KERN_WARNING), for example by + doing:: + + # save the old loglevel + read LOGLEVEL DUMMY < /proc/sys/kernel/printk + # set the loglevel so we see the progress bar. + # if the level is higher than needed, we leave it alone. + if [ $LOGLEVEL -lt 5 ]; then + echo 5 > /proc/sys/kernel/printk + fi + + IMG_SZ=0 + read IMG_SZ < /sys/power/image_size + echo -n disk > /sys/power/state + RET=$? + # + # the logic here is: + # if image_size > 0 (without kernel support, IMG_SZ will be zero), + # then try again with image_size set to zero.
+ if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size + echo 0 > /sys/power/image_size + echo -n disk > /sys/power/state + RET=$? + fi + + # restore previous loglevel + echo $LOGLEVEL > /proc/sys/kernel/printk + exit $RET + +Q: + Is it true that if I have a mounted filesystem on a USB device and + I suspend to disk, I can lose data unless the filesystem has been mounted + with "sync"? + +A: + That's right ... if you disconnect that device, you may lose data. + In fact, even with "-o sync" you can lose data if your programs have + information in buffers they haven't written out to a disk you disconnect, + or if you disconnect before the device finished saving data you wrote. + + Software suspend normally powers down USB controllers, which is equivalent + to disconnecting all USB devices attached to your system. + + Your system might well support low-power modes for its USB controllers + while the system is asleep, maintaining the connection, using true sleep + modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the + /sys/power/state file; write "standby" or "mem".) We've not seen any + hardware that can use these modes through software suspend, although in + theory some systems might support "platform" modes that won't break the + USB connections. + + Remember that it's always a bad idea to unplug a disk drive containing a + mounted filesystem. That's true even when your system is asleep! The + safest thing is to unmount all filesystems on removable media (such as USB, + Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays) + before suspending; then remount them after resuming. + + There is a work-around for this problem. For more information, see + Documentation/driver-api/usb/persist.rst. + +Q: + Can I suspend-to-disk using a swap partition under LVM? + +A: + Yes and No. You can suspend successfully, but the kernel will not be able + to resume on its own.
You need an initramfs that can recognize the resume + situation, activate the logical volume containing the swap volume (but not + touch any filesystems!), and eventually call:: + + echo -n "$major:$minor" > /sys/power/resume + + where $major and $minor are the respective major and minor device numbers of + the swap volume. + + uswsusp works with LVM, too. See http://suspend.sourceforge.net/ + +Q: + I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were + compiled with the similar configuration files. Anyway I found that + suspend to disk (and resume) is much slower on 2.6.16 compared to + 2.6.15. Any idea for why that might happen or how can I speed it up? + +A: + This is because the size of the suspend image is now greater than + for 2.6.15 (by saving more data we can get more responsive system + after resume). + + There's the /sys/power/image_size knob that controls the size of the + image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as + root), the 2.6.15 behavior should be restored. If it is still too + slow, take a look at suspend.sf.net -- userland suspend is faster and + supports LZF compression to speed it up further. diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt deleted file mode 100644 index 236d1fb13640..000000000000 --- a/Documentation/power/swsusp.txt +++ /dev/null @@ -1,446 +0,0 @@ -Some warnings, first. - - * BIG FAT WARNING ********************************************************* - * - * If you touch anything on disk between suspend and resume... - * ...kiss your data goodbye. - * - * If you do resume from initrd after your filesystems are mounted... - * ...bye bye root partition. - * [this is actually same case as above] - * - * If you have unsupported (*) devices using DMA, you may have some - * problems. If your disk driver does not support suspend... (IDE does), - * it may cause some problems, too. If you change kernel command line - * between suspend and resume, it may do something wrong. 
If you change - * your hardware while system is suspended... well, it was not good idea; - * but it will probably only crash. - * - * (*) suspend/resume support is needed to make it safe. - * - * If you have any filesystems on USB devices mounted before software suspend, - * they won't be accessible after resume and you may lose data, as though - * you have unplugged the USB devices with mounted filesystems on them; - * see the FAQ below for details. (This is not true for more traditional - * power states like "standby", which normally don't turn USB off.) - -Swap partition: -You need to append resume=/dev/your_swap_partition to kernel command -line or specify it using /sys/power/resume. - -Swap file: -If using a swapfile you can also specify a resume offset using -resume_offset= on the kernel command line or specify it -in /sys/power/resume_offset. - -After preparing then you suspend by - -echo shutdown > /sys/power/disk; echo disk > /sys/power/state - -. If you feel ACPI works pretty well on your system, you might try - -echo platform > /sys/power/disk; echo disk > /sys/power/state - -. If you would like to write hibernation image to swap and then suspend -to RAM (provided your platform supports it), you can try - -echo suspend > /sys/power/disk; echo disk > /sys/power/state - -. If you have SATA disks, you'll need recent kernels with SATA suspend -support. For suspend and resume to work, make sure your disk drivers -are built into kernel -- not modules. [There's way to make -suspend/resume with modular disk drivers, see FAQ, but you probably -should not do that.] - -If you want to limit the suspend image size to N bytes, do - -echo N > /sys/power/image_size - -before suspend (it is limited to around 2/5 of available RAM by default). - -. The resume process checks for the presence of the resume device, -if found, it then checks the contents for the hibernation image signature. -If both are found, it resumes the hibernation image. - -. 
The resume process may be triggered in two ways: - 1) During lateinit: If resume=/dev/your_swap_partition is specified on - the kernel command line, lateinit runs the resume process. If the - resume device has not been probed yet, the resume process fails and - bootup continues. - 2) Manually from an initrd or initramfs: May be run from - the init script by using the /sys/power/resume file. It is vital - that this be done prior to remounting any filesystems (even as - read-only) otherwise data may be corrupted. - -Article about goals and implementation of Software Suspend for Linux -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Author: Gábor Kuti -Last revised: 2003-10-20 by Pavel Machek - -Idea and goals to achieve - -Nowadays it is common in several laptops that they have a suspend button. It -saves the state of the machine to a filesystem or to a partition and switches -to standby mode. Later resuming the machine the saved state is loaded back to -ram and the machine can continue its work. It has two real benefits. First we -save ourselves the time machine goes down and later boots up, energy costs -are real high when running from batteries. The other gain is that we don't have to -interrupt our programs so processes that are calculating something for a long -time shouldn't need to be written interruptible. - -swsusp saves the state of the machine into active swaps and then reboots or -powerdowns. You must explicitly specify the swap partition to resume from with -``resume='' kernel option. If signature is found it loads and restores saved -state. If the option ``noresume'' is specified as a boot parameter, it skips -the resuming. If the option ``hibernate=nocompress'' is specified as a boot -parameter, it saves hibernation image without compression. - -In the meantime while the system is suspended you should not add/remove any -of the hardware, write to the filesystems, etc. 
- -Sleep states summary -==================== - -There are three different interfaces you can use, /proc/acpi should -work like this: - -In a really perfect world: -echo 1 > /proc/acpi/sleep # for standby -echo 2 > /proc/acpi/sleep # for suspend to ram -echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power conservative -echo 4 > /proc/acpi/sleep # for suspend to disk -echo 5 > /proc/acpi/sleep # for shutdown unfriendly the system - -and perhaps -echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios - -Frequently Asked Questions -========================== - -Q: well, suspending a server is IMHO a really stupid thing, -but... (Diego Zuccato): - -A: You bought new UPS for your server. How do you install it without -bringing machine down? Suspend to disk, rearrange power cables, -resume. - -You have your server on UPS. Power died, and UPS is indicating 30 -seconds to failure. What do you do? Suspend to disk. - - -Q: Maybe I'm missing something, but why don't the regular I/O paths work? - -A: We do use the regular I/O paths. However we cannot restore the data -to its original location as we load it. That would create an -inconsistent kernel state which would certainly result in an oops. -Instead, we load the image into unused memory and then atomically copy -it back to it original location. This implies, of course, a maximum -image size of half the amount of memory. - -There are two solutions to this: - -* require half of memory to be free during suspend. That way you can -read "new" data onto free spots, then cli and copy - -* assume we had special "polling" ide driver that only uses memory -between 0-640KB. That way, I'd have to make sure that 0-640KB is free -during suspending, but otherwise it would work... - -suspend2 shares this fundamental limitation, but does not include user -data and disk caches into "used memory" by saving them in -advance. That means that the limitation goes away in practice. - -Q: Does linux support ACPI S4? - -A: Yes. 
That's what echo platform > /sys/power/disk does. - -Q: What is 'suspend2'? - -A: suspend2 is 'Software Suspend 2', a forked implementation of -suspend-to-disk which is available as separate patches for 2.4 and 2.6 -kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB -highmem and preemption. It also has a extensible architecture that -allows for arbitrary transformations on the image (compression, -encryption) and arbitrary backends for writing the image (eg to swap -or an NFS share[Work In Progress]). Questions regarding suspend2 -should be sent to the mailing list available through the suspend2 -website, and not to the Linux Kernel Mailing List. We are working -toward merging suspend2 into the mainline kernel. - -Q: What is the freezing of tasks and why are we using it? - -A: The freezing of tasks is a mechanism by which user space processes and some -kernel threads are controlled during hibernation or system-wide suspend (on some -architectures). See freezing-of-tasks.txt for details. - -Q: What is the difference between "platform" and "shutdown"? - -A: - -shutdown: save state in linux, then tell bios to powerdown - -platform: save state in linux, then tell bios to powerdown and blink - "suspended led" - -"platform" is actually right thing to do where supported, but -"shutdown" is most reliable (except on ACPI systems). - -Q: I do not understand why you have such strong objections to idea of -selective suspend. - -A: Do selective suspend during runtime power management, that's okay. But -it's useless for suspend-to-disk. (And I do not see how you could use -it for suspend-to-ram, I hope you do not want that). - -Lets see, so you suggest to - -* SUSPEND all but swap device and parents -* Snapshot -* Write image to disk -* SUSPEND swap device and parents -* Powerdown - -Oh no, that does not work, if swap device or its parents uses DMA, -you've corrupted data. 
You'd have to do - -* SUSPEND all but swap device and parents -* FREEZE swap device and parents -* Snapshot -* UNFREEZE swap device and parents -* Write -* SUSPEND swap device and parents - -Which means that you still need that FREEZE state, and you get more -complicated code. (And I have not yet introduce details like system -devices). - -Q: There don't seem to be any generally useful behavioral -distinctions between SUSPEND and FREEZE. - -A: Doing SUSPEND when you are asked to do FREEZE is always correct, -but it may be unnecessarily slow. If you want your driver to stay simple, -slowness may not matter to you. It can always be fixed later. - -For devices like disk it does matter, you do not want to spindown for -FREEZE. - -Q: After resuming, system is paging heavily, leading to very bad interactivity. - -A: Try running - -cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file -do - test -f "$file" && cat "$file" > /dev/null -done - -after resume. swapoff -a; swapon -a may also be useful. - -Q: What happens to devices during swsusp? They seem to be resumed -during system suspend? - -A: That's correct. We need to resume them if we want to write image to -disk. 
Whole sequence goes like - - Suspend part - ~~~~~~~~~~~~ - running system, user asks for suspend-to-disk - - user processes are stopped - - suspend(PMSG_FREEZE): devices are frozen so that they don't interfere - with state snapshot - - state snapshot: copy of whole used memory is taken with interrupts disabled - - resume(): devices are woken up so that we can write image to swap - - write image to swap - - suspend(PMSG_SUSPEND): suspend devices so that we can power off - - turn the power off - - Resume part - ~~~~~~~~~~~ - (is actually pretty similar) - - running system, user asks for suspend-to-disk - - user processes are stopped (in common case there are none, but with resume-from-initrd, no one knows) - - read image from disk - - suspend(PMSG_FREEZE): devices are frozen so that they don't interfere - with image restoration - - image restoration: rewrite memory with image - - resume(): devices are woken up so that system can continue - - thaw all user processes - -Q: What is this 'Encrypt suspend image' for? - -A: First of all: it is not a replacement for dm-crypt encrypted swap. -It cannot protect your computer while it is suspended. Instead it does -protect from leaking sensitive data after resume from suspend. - -Think of the following: you suspend while an application is running -that keeps sensitive data in memory. The application itself prevents -the data from being swapped out. Suspend, however, must write these -data to swap to be able to resume later on. Without suspend encryption -your sensitive data are then stored in plaintext on disk. This means -that after resume your sensitive data are accessible to all -applications having direct access to the swap device which was used -for suspend. If you don't need swap after resume these data can remain -on disk virtually forever. Thus it can happen that your system gets -broken in weeks later and sensitive data which you thought were -encrypted and protected are retrieved and stolen from the swap device. 
-To prevent this situation you should use 'Encrypt suspend image'. - -During suspend a temporary key is created and this key is used to -encrypt the data written to disk. When, during resume, the data was -read back into memory the temporary key is destroyed which simply -means that all data written to disk during suspend are then -inaccessible so they can't be stolen later on. The only thing that -you must then take care of is that you call 'mkswap' for the swap -partition used for suspend as early as possible during regular -boot. This asserts that any temporary key from an oopsed suspend or -from a failed or aborted resume is erased from the swap device. - -As a rule of thumb use encrypted swap to protect your data while your -system is shut down or suspended. Additionally use the encrypted -suspend image to prevent sensitive data from being stolen after -resume. - -Q: Can I suspend to a swap file? - -A: Generally, yes, you can. However, it requires you to use the "resume=" and -"resume_offset=" kernel command line parameters, so the resume from a swap file -cannot be initiated from an initrd or initramfs image. See -swsusp-and-swap-files.txt for details. - -Q: Is there a maximum system RAM size that is supported by swsusp? - -A: It should work okay with highmem. - -Q: Does swsusp (to disk) use only one swap partition or can it use -multiple swap partitions (aggregate them into one logical space)? - -A: Only one swap partition, sorry. - -Q: If my application(s) causes lots of memory & swap space to be used -(over half of the total system RAM), is it correct that it is likely -to be useless to try to suspend to disk while that app is running? - -A: No, it should work okay, as long as your app does not mlock() -it. Just prepare big enough swap partition. - -Q: What information is useful for debugging suspend-to-disk problems? - -A: Well, last messages on the screen are always useful. 
If something -is broken, it is usually some kernel driver, therefore trying with as -little as possible modules loaded helps a lot. I also prefer people to -suspend from console, preferably without X running. Booting with -init=/bin/bash, then swapon and starting suspend sequence manually -usually does the trick. Then it is good idea to try with latest -vanilla kernel. - -Q: How can distributions ship a swsusp-supporting kernel with modular -disk drivers (especially SATA)? - -A: Well, it can be done, load the drivers, then do echo into -/sys/power/resume file from initrd. Be sure not to mount -anything, not even read-only mount, or you are going to lose your -data. - -Q: How do I make suspend more verbose? - -A: If you want to see any non-error kernel messages on the virtual -terminal the kernel switches to during suspend, you have to set the -kernel console loglevel to at least 4 (KERN_WARNING), for example by -doing - - # save the old loglevel - read LOGLEVEL DUMMY < /proc/sys/kernel/printk - # set the loglevel so we see the progress bar. - # if the level is higher than needed, we leave it alone. - if [ $LOGLEVEL -lt 5 ]; then - echo 5 > /proc/sys/kernel/printk - fi - - IMG_SZ=0 - read IMG_SZ < /sys/power/image_size - echo -n disk > /sys/power/state - RET=$? - # - # the logic here is: - # if image_size > 0 (without kernel support, IMG_SZ will be zero), - # then try again with image_size set to zero. - if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size - echo 0 > /sys/power/image_size - echo -n disk > /sys/power/state - RET=$? - fi - - # restore previous loglevel - echo $LOGLEVEL > /proc/sys/kernel/printk - exit $RET - -Q: Is this true that if I have a mounted filesystem on a USB device and -I suspend to disk, I can lose data unless the filesystem has been mounted -with "sync"? - -A: That's right ... if you disconnect that device, you may lose data. 
-In fact, even with "-o sync" you can lose data if your programs have -information in buffers they haven't written out to a disk you disconnect, -or if you disconnect before the device finished saving data you wrote. - -Software suspend normally powers down USB controllers, which is equivalent -to disconnecting all USB devices attached to your system. - -Your system might well support low-power modes for its USB controllers -while the system is asleep, maintaining the connection, using true sleep -modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the -/sys/power/state file; write "standby" or "mem".) We've not seen any -hardware that can use these modes through software suspend, although in -theory some systems might support "platform" modes that won't break the -USB connections. - -Remember that it's always a bad idea to unplug a disk drive containing a -mounted filesystem. That's true even when your system is asleep! The -safest thing is to unmount all filesystems on removable media (such USB, -Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays) -before suspending; then remount them after resuming. - -There is a work-around for this problem. For more information, see -Documentation/driver-api/usb/persist.rst. - -Q: Can I suspend-to-disk using a swap partition under LVM? - -A: Yes and No. You can suspend successfully, but the kernel will not be able -to resume on its own. You need an initramfs that can recognize the resume -situation, activate the logical volume containing the swap volume (but not -touch any filesystems!), and eventually call - -echo -n "$major:$minor" > /sys/power/resume - -where $major and $minor are the respective major and minor device numbers of -the swap volume. - -uswsusp works with LVM, too. See http://suspend.sourceforge.net/ - -Q: I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were -compiled with the similar configuration files. 
Anyway I found that -suspend to disk (and resume) is much slower on 2.6.16 compared to -2.6.15. Any idea for why that might happen or how can I speed it up? - -A: This is because the size of the suspend image is now greater than -for 2.6.15 (by saving more data we can get more responsive system -after resume). - -There's the /sys/power/image_size knob that controls the size of the -image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as -root), the 2.6.15 behavior should be restored. If it is still too -slow, take a look at suspend.sf.net -- userland suspend is faster and -supports LZF compression to speed it up further. diff --git a/Documentation/power/tricks.rst b/Documentation/power/tricks.rst new file mode 100644 index 000000000000..ca787f142c3f --- /dev/null +++ b/Documentation/power/tricks.rst @@ -0,0 +1,29 @@ +================ +swsusp/S3 tricks +================ + +Pavel Machek + +If you want to trick swsusp/S3 into working, you might want to try: + +* go with minimal config, turn off drivers like USB, AGP you don't + really need + +* turn off APIC and preempt + +* use ext2. At least it has working fsck. [If something seems to go + wrong, force fsck when you have a chance] + +* turn off modules + +* use vga text console, shut down X. [If you really want X, you might + want to try vesafb later] + +* try running as few processes as possible, preferably go to single + user mode. + +* due to video issues, swsusp should be easier to get working than + S3. Try that first. + +When you make it work, try to find out what exactly was it that broke +suspend, and preferably fix that. 
diff --git a/Documentation/power/tricks.txt b/Documentation/power/tricks.txt deleted file mode 100644 index a1b8f7249f4c..000000000000 --- a/Documentation/power/tricks.txt +++ /dev/null @@ -1,27 +0,0 @@ - swsusp/S3 tricks - ~~~~~~~~~~~~~~~~ -Pavel Machek - -If you want to trick swsusp/S3 into working, you might want to try: - -* go with minimal config, turn off drivers like USB, AGP you don't - really need - -* turn off APIC and preempt - -* use ext2. At least it has working fsck. [If something seems to go - wrong, force fsck when you have a chance] - -* turn off modules - -* use vga text console, shut down X. [If you really want X, you might - want to try vesafb later] - -* try running as few processes as possible, preferably go to single - user mode. - -* due to video issues, swsusp should be easier to get working than - S3. Try that first. - -When you make it work, try to find out what exactly was it that broke -suspend, and preferably fix that. diff --git a/Documentation/power/userland-swsusp.rst b/Documentation/power/userland-swsusp.rst new file mode 100644 index 000000000000..a0fa51bb1a4d --- /dev/null +++ b/Documentation/power/userland-swsusp.rst @@ -0,0 +1,191 @@ +===================================================== +Documentation for userland software suspend interface +===================================================== + + (C) 2006 Rafael J. Wysocki + +First, the warnings at the beginning of swsusp.txt still apply. + +Second, you should read the FAQ in swsusp.txt _now_ if you have not +done it already. + +Now, to use the userland interface for software suspend you need special +utilities that will read/write the system memory snapshot from/to the +kernel. Such utilities are available, for example, from +. You may want to have a look at them if you +are going to develop your own suspend/resume utilities. 
+ +The interface consists of a character device providing the open(), +release(), read(), and write() operations as well as several ioctl() +commands defined in include/linux/suspend_ioctls.h . The major and minor +numbers of the device are, respectively, 10 and 231, and they can +be read from /sys/class/misc/snapshot/dev. + +The device can be open either for reading or for writing. If open for +reading, it is considered to be in the suspend mode. Otherwise it is +assumed to be in the resume mode. The device cannot be open for simultaneous +reading and writing. It is also impossible to have the device open more than +once at a time. + +Even opening the device has side effects. Data structures are +allocated, and PM_HIBERNATION_PREPARE / PM_RESTORE_PREPARE chains are +called. + +The ioctl() commands recognized by the device are: + +SNAPSHOT_FREEZE + freeze user space processes (the current process is + not frozen); this is required for SNAPSHOT_CREATE_IMAGE + and SNAPSHOT_ATOMIC_RESTORE to succeed + +SNAPSHOT_UNFREEZE + thaw user space processes frozen by SNAPSHOT_FREEZE + +SNAPSHOT_CREATE_IMAGE + create a snapshot of the system memory; the + last argument of ioctl() should be a pointer to an int variable, + the value of which will indicate whether the call returned after + creating the snapshot (1) or after restoring the system memory state + from it (0) (after resume the system finds itself finishing the + SNAPSHOT_CREATE_IMAGE ioctl() again); after the snapshot + has been created the read() operation can be used to transfer + it out of the kernel + +SNAPSHOT_ATOMIC_RESTORE + restore the system memory state from the + uploaded snapshot image; before calling it you should transfer + the system memory snapshot back to the kernel using the write() + operation; this call will not succeed if the snapshot + image is not available to the kernel + +SNAPSHOT_FREE + free memory allocated for the snapshot image + +SNAPSHOT_PREF_IMAGE_SIZE + set the preferred maximum size of 
the image + (the kernel will do its best to ensure the image size will not exceed + this number, but if it turns out to be impossible, the kernel will + create the smallest image possible) + +SNAPSHOT_GET_IMAGE_SIZE + return the actual size of the hibernation image + +SNAPSHOT_AVAIL_SWAP_SIZE + return the amount of available swap in bytes (the + last argument should be a pointer to an unsigned int variable that will + contain the result if the call is successful). + +SNAPSHOT_ALLOC_SWAP_PAGE + allocate a swap page from the resume partition + (the last argument should be a pointer to a loff_t variable that + will contain the swap page offset if the call is successful) + +SNAPSHOT_FREE_SWAP_PAGES + free all swap pages allocated by + SNAPSHOT_ALLOC_SWAP_PAGE + +SNAPSHOT_SET_SWAP_AREA + set the resume partition and the offset (in + units) from the beginning of the partition at which the swap header is + located (the last ioctl() argument should point to a struct + resume_swap_area, as defined in kernel/power/suspend_ioctls.h, + containing the resume device specification and the offset); for swap + partitions the offset is always 0, but it is different from zero for + swap files (see Documentation/power/swsusp-and-swap-files.rst for + details). + +SNAPSHOT_PLATFORM_SUPPORT + enable/disable the hibernation platform support, + depending on the argument value (enable, if the argument is nonzero) + +SNAPSHOT_POWER_OFF + make the kernel transition the system to the hibernation + state (eg. ACPI S4) using the platform (eg. ACPI) driver + +SNAPSHOT_S2RAM + suspend to RAM; using this call causes the kernel to + immediately enter the suspend-to-RAM state, so this call must always + be preceded by the SNAPSHOT_FREEZE call and it is also necessary + to use the SNAPSHOT_UNFREEZE call after the system wakes up. 
This call + is needed to implement the suspend-to-both mechanism in which the + suspend image is first created, as though the system had been suspended + to disk, and then the system is suspended to RAM (this makes it possible + to resume the system from RAM if there's enough battery power or restore + its state on the basis of the saved suspend image otherwise) + +The device's read() operation can be used to transfer the snapshot image from +the kernel. It has the following limitations: + +- you cannot read() more than one virtual memory page at a time +- read()s across page boundaries are impossible (ie. if you read() 1/2 of + a page in the previous call, you will only be able to read() + **at most** 1/2 of the page in the next call) + +The device's write() operation is used for uploading the system memory snapshot +into the kernel. It has the same limitations as the read() operation. + +The release() operation frees all memory allocated for the snapshot image +and all swap pages allocated with SNAPSHOT_ALLOC_SWAP_PAGE (if any). +Thus it is not necessary to use either SNAPSHOT_FREE or +SNAPSHOT_FREE_SWAP_PAGES before closing the device (in fact it will also +unfreeze user space processes frozen by SNAPSHOT_FREEZE if they are +still frozen when the device is being closed). + +Currently it is assumed that the userland utilities reading/writing the +snapshot image from/to the kernel will use a swap partition, called the resume +partition, or a swap file as storage space (if a swap file is used, the resume +partition is the partition that holds this file). However, this is not really +required, as they can use, for example, a special (blank) suspend partition or +a file on a partition that is unmounted before SNAPSHOT_CREATE_IMAGE and +mounted afterwards. + +These utilities MUST NOT make any assumptions regarding the ordering of +data within the snapshot image.
The contents of the image are entirely owned +by the kernel and its structure may be changed in future kernel releases. + +The snapshot image MUST be written to the kernel unaltered (ie. all of the image +data, metadata and header MUST be written in _exactly_ the same amount, form +and order in which they have been read). Otherwise, the behavior of the +resumed system may be totally unpredictable. + +While executing SNAPSHOT_ATOMIC_RESTORE the kernel checks if the +structure of the snapshot image is consistent with the information stored +in the image header. If any inconsistencies are detected, +SNAPSHOT_ATOMIC_RESTORE will not succeed. Still, this is not a fool-proof +mechanism and the userland utilities using the interface SHOULD use additional +means, such as checksums, to ensure the integrity of the snapshot image. + +The suspending and resuming utilities MUST lock themselves in memory, +preferably using mlockall(), before calling SNAPSHOT_FREEZE. + +The suspending utility MUST check the value stored by SNAPSHOT_CREATE_IMAGE +in the memory location pointed to by the last argument of ioctl() and proceed +in accordance with it: + +1. If the value is 1 (ie. the system memory snapshot has just been + created and the system is ready for saving it): + + (a) The suspending utility MUST NOT close the snapshot device + _unless_ the whole suspend procedure is to be cancelled, in + which case, if the snapshot image has already been saved, the + suspending utility SHOULD destroy it, preferably by zapping + its header. If the suspend is not to be cancelled, the + system MUST be powered off or rebooted after the snapshot + image has been saved. + (b) The suspending utility SHOULD NOT attempt to perform any + file system operations (including reads) on the file systems + that were mounted before SNAPSHOT_CREATE_IMAGE has been + called. However, it MAY mount a file system that was not + mounted at that time and perform some operations on it (eg. 
+ use it for saving the image). + +2. If the value is 0 (ie. the system state has just been restored from + the snapshot image), the suspending utility MUST close the snapshot + device. Afterwards it will be treated as a regular userland process, + so it need not exit. + +The resuming utility SHOULD NOT attempt to mount any file systems that could +be mounted before suspend and SHOULD NOT attempt to perform any operations +involving such file systems. + +For details, please refer to the source code. diff --git a/Documentation/power/userland-swsusp.txt b/Documentation/power/userland-swsusp.txt deleted file mode 100644 index bbfcd1bbedc5..000000000000 --- a/Documentation/power/userland-swsusp.txt +++ /dev/null @@ -1,170 +0,0 @@ -Documentation for userland software suspend interface - (C) 2006 Rafael J. Wysocki - -First, the warnings at the beginning of swsusp.txt still apply. - -Second, you should read the FAQ in swsusp.txt _now_ if you have not -done it already. - -Now, to use the userland interface for software suspend you need special -utilities that will read/write the system memory snapshot from/to the -kernel. Such utilities are available, for example, from -. You may want to have a look at them if you -are going to develop your own suspend/resume utilities. - -The interface consists of a character device providing the open(), -release(), read(), and write() operations as well as several ioctl() -commands defined in include/linux/suspend_ioctls.h . The major and minor -numbers of the device are, respectively, 10 and 231, and they can -be read from /sys/class/misc/snapshot/dev. - -The device can be open either for reading or for writing. If open for -reading, it is considered to be in the suspend mode. Otherwise it is -assumed to be in the resume mode. The device cannot be open for simultaneous -reading and writing. It is also impossible to have the device open more than -once at a time. - -Even opening the device has side effects. 
Data structures are -allocated, and PM_HIBERNATION_PREPARE / PM_RESTORE_PREPARE chains are -called. - -The ioctl() commands recognized by the device are: - -SNAPSHOT_FREEZE - freeze user space processes (the current process is - not frozen); this is required for SNAPSHOT_CREATE_IMAGE - and SNAPSHOT_ATOMIC_RESTORE to succeed - -SNAPSHOT_UNFREEZE - thaw user space processes frozen by SNAPSHOT_FREEZE - -SNAPSHOT_CREATE_IMAGE - create a snapshot of the system memory; the - last argument of ioctl() should be a pointer to an int variable, - the value of which will indicate whether the call returned after - creating the snapshot (1) or after restoring the system memory state - from it (0) (after resume the system finds itself finishing the - SNAPSHOT_CREATE_IMAGE ioctl() again); after the snapshot - has been created the read() operation can be used to transfer - it out of the kernel - -SNAPSHOT_ATOMIC_RESTORE - restore the system memory state from the - uploaded snapshot image; before calling it you should transfer - the system memory snapshot back to the kernel using the write() - operation; this call will not succeed if the snapshot - image is not available to the kernel - -SNAPSHOT_FREE - free memory allocated for the snapshot image - -SNAPSHOT_PREF_IMAGE_SIZE - set the preferred maximum size of the image - (the kernel will do its best to ensure the image size will not exceed - this number, but if it turns out to be impossible, the kernel will - create the smallest image possible) - -SNAPSHOT_GET_IMAGE_SIZE - return the actual size of the hibernation image - -SNAPSHOT_AVAIL_SWAP_SIZE - return the amount of available swap in bytes (the - last argument should be a pointer to an unsigned int variable that will - contain the result if the call is successful). 
- -SNAPSHOT_ALLOC_SWAP_PAGE - allocate a swap page from the resume partition - (the last argument should be a pointer to a loff_t variable that - will contain the swap page offset if the call is successful) - -SNAPSHOT_FREE_SWAP_PAGES - free all swap pages allocated by - SNAPSHOT_ALLOC_SWAP_PAGE - -SNAPSHOT_SET_SWAP_AREA - set the resume partition and the offset (in - units) from the beginning of the partition at which the swap header is - located (the last ioctl() argument should point to a struct - resume_swap_area, as defined in kernel/power/suspend_ioctls.h, - containing the resume device specification and the offset); for swap - partitions the offset is always 0, but it is different from zero for - swap files (see Documentation/power/swsusp-and-swap-files.txt for - details). - -SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support, - depending on the argument value (enable, if the argument is nonzero) - -SNAPSHOT_POWER_OFF - make the kernel transition the system to the hibernation - state (eg. ACPI S4) using the platform (eg. ACPI) driver - -SNAPSHOT_S2RAM - suspend to RAM; using this call causes the kernel to - immediately enter the suspend-to-RAM state, so this call must always - be preceded by the SNAPSHOT_FREEZE call and it is also necessary - to use the SNAPSHOT_UNFREEZE call after the system wakes up. This call - is needed to implement the suspend-to-both mechanism in which the - suspend image is first created, as though the system had been suspended - to disk, and then the system is suspended to RAM (this makes it possible - to resume the system from RAM if there's enough battery power or restore - its state on the basis of the saved suspend image otherwise) - -The device's read() operation can be used to transfer the snapshot image from -the kernel. It has the following limitations: -- you cannot read() more than one virtual memory page at a time -- read()s across page boundaries are impossible (ie. 
if you read() 1/2 of - a page in the previous call, you will only be able to read() - _at_ _most_ 1/2 of the page in the next call) - -The device's write() operation is used for uploading the system memory snapshot -into the kernel. It has the same limitations as the read() operation. - -The release() operation frees all memory allocated for the snapshot image -and all swap pages allocated with SNAPSHOT_ALLOC_SWAP_PAGE (if any). -Thus it is not necessary to use either SNAPSHOT_FREE or -SNAPSHOT_FREE_SWAP_PAGES before closing the device (in fact it will also -unfreeze user space processes frozen by SNAPSHOT_UNFREEZE if they are -still frozen when the device is being closed). - -Currently it is assumed that the userland utilities reading/writing the -snapshot image from/to the kernel will use a swap partition, called the resume -partition, or a swap file as storage space (if a swap file is used, the resume -partition is the partition that holds this file). However, this is not really -required, as they can use, for example, a special (blank) suspend partition or -a file on a partition that is unmounted before SNAPSHOT_CREATE_IMAGE and -mounted afterwards. - -These utilities MUST NOT make any assumptions regarding the ordering of -data within the snapshot image. The contents of the image are entirely owned -by the kernel and its structure may be changed in future kernel releases. - -The snapshot image MUST be written to the kernel unaltered (ie. all of the image -data, metadata and header MUST be written in _exactly_ the same amount, form -and order in which they have been read). Otherwise, the behavior of the -resumed system may be totally unpredictable. - -While executing SNAPSHOT_ATOMIC_RESTORE the kernel checks if the -structure of the snapshot image is consistent with the information stored -in the image header. If any inconsistencies are detected, -SNAPSHOT_ATOMIC_RESTORE will not succeed. 
Still, this is not a fool-proof -mechanism and the userland utilities using the interface SHOULD use additional -means, such as checksums, to ensure the integrity of the snapshot image. - -The suspending and resuming utilities MUST lock themselves in memory, -preferably using mlockall(), before calling SNAPSHOT_FREEZE. - -The suspending utility MUST check the value stored by SNAPSHOT_CREATE_IMAGE -in the memory location pointed to by the last argument of ioctl() and proceed -in accordance with it: -1. If the value is 1 (ie. the system memory snapshot has just been - created and the system is ready for saving it): - (a) The suspending utility MUST NOT close the snapshot device - _unless_ the whole suspend procedure is to be cancelled, in - which case, if the snapshot image has already been saved, the - suspending utility SHOULD destroy it, preferably by zapping - its header. If the suspend is not to be cancelled, the - system MUST be powered off or rebooted after the snapshot - image has been saved. - (b) The suspending utility SHOULD NOT attempt to perform any - file system operations (including reads) on the file systems - that were mounted before SNAPSHOT_CREATE_IMAGE has been - called. However, it MAY mount a file system that was not - mounted at that time and perform some operations on it (eg. - use it for saving the image). -2. If the value is 0 (ie. the system state has just been restored from - the snapshot image), the suspending utility MUST close the snapshot - device. Afterwards it will be treated as a regular userland process, - so it need not exit. - -The resuming utility SHOULD NOT attempt to mount any file systems that could -be mounted before suspend and SHOULD NOT attempt to perform any operations -involving such file systems. - -For details, please refer to the source code. 
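
As a rough illustration of the read()/write() limits stated in the interface description above (at most one page per call, and no call spanning a page boundary), the sketch below plans the sequence of requests a userland utility might issue. It is not taken from the kernel sources or from any existing suspend utility; the names `next_chunk_size` and `plan_transfer`, and the use of `mmap.PAGESIZE` as the page size, are assumptions made for this example only. It does not open the snapshot device.

```python
import mmap

# Assumption: the page size seen by user space matches the one the kernel
# uses for the snapshot device.
PAGE_SIZE = mmap.PAGESIZE


def next_chunk_size(offset, remaining, page_size=PAGE_SIZE):
    """Largest request allowed at `offset` under the two documented limits."""
    # Bytes left before the next page boundary; never more than one page.
    room_in_page = page_size - (offset % page_size)
    return min(remaining, room_in_page)


def plan_transfer(total, start=0, page_size=PAGE_SIZE):
    """Return (offset, length) pairs that move `total` bytes from `start`."""
    chunks = []
    offset, remaining = start, total
    while remaining > 0:
        length = next_chunk_size(offset, remaining, page_size)
        chunks.append((offset, length))
        offset += length
        remaining -= length
    return chunks
```

With 4096-byte pages, `plan_transfer(10000, page_size=4096)` yields three requests of 4096, 4096 and 1808 bytes, and a transfer that starts halfway into a page is first limited to the remaining half of that page.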
diff --git a/Documentation/power/video.rst b/Documentation/power/video.rst new file mode 100644 index 000000000000..337a2ba9f32f --- /dev/null +++ b/Documentation/power/video.rst @@ -0,0 +1,213 @@ +=========================== +Video issues with S3 resume +=========================== + +2003-2006, Pavel Machek + +During S3 resume, hardware needs to be reinitialized. For most +devices, this is easy, and kernel driver knows how to do +it. Unfortunately there's one exception: video card. Those are usually +initialized by BIOS, and kernel does not have enough information to +boot video card. (Kernel usually does not even contain video card +driver -- vesafb and vgacon are widely used). + +This is not problem for swsusp, because during swsusp resume, BIOS is +run normally so video card is normally initialized. It should not be +problem for S1 standby, because hardware should retain its state over +that. + +We either have to run video BIOS during early resume, or interpret it +using vbetool later, or maybe nothing is necessary on particular +system because video state is preserved. Unfortunately different +methods work on different systems, and no known method suits all of +them. + +Userland application called s2ram has been developed; it contains long +whitelist of systems, and automatically selects working method for a +given system. It can be downloaded from CVS at +www.sf.net/projects/suspend . If you get a system that is not in the +whitelist, please try to find a working solution, and submit whitelist +entry so that work does not need to be repeated. + +Currently, VBE_SAVE method (6 below) works on most +systems. Unfortunately, vbetool only runs after userland is resumed, +so it makes debugging of early resume problems +hard/impossible. Methods that do not rely on userland are preferable. + +Details +~~~~~~~ + +There are a few types of systems where video works after S3 resume: + +(1) systems where video state is preserved over S3. 
+ +(2) systems where it is possible to call the video BIOS during S3 + resume. Unfortunately, it is not correct to call the video BIOS at + that point, but it happens to work on some machines. Use + acpi_sleep=s3_bios. + +(3) systems that initialize video card into vga text mode and where + the BIOS works well enough to be able to set video mode. Use + acpi_sleep=s3_mode on these. + +(4) on some systems s3_bios kicks video into text mode, and + acpi_sleep=s3_bios,s3_mode is needed. + +(5) radeon systems, where X can soft-boot your video card. You'll need + a new enough X, and a plain text console (no vesafb or radeonfb). See + http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more information. + Alternatively, you should use vbetool (6) instead. + +(6) other radeon systems, where vbetool is enough to bring system back + to life. It needs text console to be working. Do vbetool vbestate + save > /tmp/delme; echo 3 > /proc/acpi/sleep; vbetool post; vbetool + vbestate restore < /tmp/delme; setfont , and your video + should work. + +(7) on some systems, it is possible to boot most of kernel, and then + POSTing bios works. Ole Rohne has patch to do just that at + http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2. + +(8) on some systems, you can use the video_post utility and or + do echo 3 > /sys/power/state && /usr/sbin/video_post - which will + initialize the display in console mode. If you are in X, you can switch + to a virtual terminal and back to X using CTRL+ALT+F1 - CTRL+ALT+F7 to get + the display working in graphical mode again. + +Now, if you pass acpi_sleep=something, and it does not work with your +bios, you'll get a hard crash during resume. Be careful. Also it is +safest to do your experiments with plain old VGA console. The vesafb +and radeonfb (etc) drivers have a tendency to crash the machine during +resume. + +You may have a system where none of above works. 
At that point you +either invent another ugly hack that works, or write proper driver for +your video card (good luck getting docs :-(). Maybe suspending from X +(proper X, knowing your hardware, not XF68_FBcon) might have better +chance of working. + +Table of known working notebooks: + + +=============================== =============================================== +Model hack (or "how to do it") +=============================== =============================================== +Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI +Acer TM 230 s3_bios (2) +Acer TM 242FX vbetool (6) +Acer TM C110 video_post (8) +Acer TM C300 vga=normal (only suspend on console, not in X), + vbetool (6) or video_post (8) +Acer TM 4052LCi s3_bios (2) +Acer TM 636Lci s3_bios,s3_mode (4) +Acer TM 650 (Radeon M7) vga=normal plus boot-radeon (5) gets text + console back +Acer TM 660 ??? [#f1]_ +Acer TM 800 vga=normal, X patches, see webpage (5) + or vbetool (6) +Acer TM 803 vga=normal, X patches, see webpage (5) + or vbetool (6) +Acer TM 803LCi vga=normal, vbetool (6) +Arima W730a vbetool needed (6) +Asus L2400D s3_mode (3) [#f2]_ (S1 also works OK) +Asus L3350M (SiS 740) (6) +Asus L3800C (Radeon M7) s3_bios (2) (S1 also works OK) +Asus M6887Ne vga=normal, s3_bios (2), use radeon driver + instead of fglrx in x.org +Athlon64 desktop prototype s3_bios (2) +Compal CL-50 ??? [#f1]_ +Compaq Armada E500 - P3-700 none (1) (S1 also works OK) +Compaq Evo N620c vga=normal, s3_bios (2) +Dell 600m, ATI R250 Lf none (1), but needs xorg-x11-6.8.1.902-1 +Dell D600, ATI RV250 vga=normal and X, or try vbestate (6) +Dell D610 vga=normal and X (possibly vbestate (6) too, + but not tested) +Dell Inspiron 4000 ??? [#f1]_ +Dell Inspiron 500m ??? [#f1]_ +Dell Inspiron 510m ??? +Dell Inspiron 5150 vbetool needed (6) +Dell Inspiron 600m ??? [#f1]_ +Dell Inspiron 8200 ??? [#f1]_ +Dell Inspiron 8500 ??? [#f1]_ +Dell Inspiron 8600 ??? 
[#f1]_ +eMachines athlon64 machines vbetool needed (6) (someone please get + me model #s) +HP NC6000 s3_bios, may not use radeonfb (2); + or vbetool (6) +HP NX7000 ??? [#f1]_ +HP Pavilion ZD7000 vbetool post needed, need open-source nv + driver for X +HP Omnibook XE3 athlon version none (1) +HP Omnibook XE3GC none (1), video is S3 Savage/IX-MV +HP Omnibook XE3L-GF vbetool (6) +HP Omnibook 5150 none (1), (S1 also works OK) +IBM TP T20, model 2647-44G none (1), video is S3 Inc. 86C270-294 + Savage/IX-MV, vesafb gets "interesting" + but X work. +IBM TP A31 / Type 2652-M5G s3_mode (3) [works ok with + BIOS 1.04 2002-08-23, but not at all with + BIOS 1.11 2004-11-05 :-(] +IBM TP R32 / Type 2658-MMG none (1) +IBM TP R40 2722B3G ??? [#f1]_ +IBM TP R50p / Type 1832-22U s3_bios (2) +IBM TP R51 none (1) +IBM TP T30 236681A ??? [#f1]_ +IBM TP T40 / Type 2373-MU4 none (1) +IBM TP T40p none (1) +IBM TP R40p s3_bios (2) +IBM TP T41p s3_bios (2), switch to X after resume +IBM TP T42 s3_bios (2) +IBM ThinkPad T42p (2373-GTG) s3_bios (2) +IBM TP X20 ??? [#f1]_ +IBM TP X30 s3_bios, s3_mode (4) +IBM TP X31 / Type 2672-XXH none (1), use radeontool + (http://fdd.com/software/radeon/) to + turn off backlight. +IBM TP X32 none (1), but backlight is on and video is + trashed after long suspend. s3_bios, + s3_mode (4) works too. Perhaps that gets + better results? +IBM Thinkpad X40 Type 2371-7JG s3_bios,s3_mode (4) +IBM TP 600e none(1), but a switch to console and + back to X is needed +Medion MD4220 ??? [#f1]_ +Samsung P35 vbetool needed (6) +Sharp PC-AR10 (ATI rage) none (1), backlight does not switch off +Sony Vaio PCG-C1VRX/K s3_bios (2) +Sony Vaio PCG-F403 ??? [#f1]_ +Sony Vaio PCG-GRT995MP none (1), works with 'nv' X driver +Sony Vaio PCG-GR7/K none (1), but needs radeonfb, use + radeontool (http://fdd.com/software/radeon/) + to turn off backlight. +Sony Vaio PCG-N505SN ??? 
[#f1]_ +Sony Vaio vgn-s260 X or boot-radeon can init it (5) +Sony Vaio vgn-S580BH vga=normal, but suspend from X. Console will + be blank unless you return to X. +Sony Vaio vgn-FS115B s3_bios (2),s3_mode (4) +Toshiba Libretto L5 none (1) +Toshiba Libretto 100CT/110CT vbetool (6) +Toshiba Portege 3020CT s3_mode (3) +Toshiba Satellite 4030CDT s3_mode (3) (S1 also works OK) +Toshiba Satellite 4080XCDT s3_mode (3) (S1 also works OK) +Toshiba Satellite 4090XCDT ??? [#f1]_ +Toshiba Satellite P10-554 s3_bios,s3_mode (4)[#f3]_ +Toshiba M30 (2) xor X with nvidia driver using internal AGP +Uniwill 244IIO ??? [#f1]_ +=============================== =============================================== + +Known working desktop systems +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +=================== ============================= ======================== +Mainboard Graphics card hack (or "how to do it") +=================== ============================= ======================== +Asus A7V8X nVidia RIVA TNT2 model 64 s3_bios,s3_mode (4) +=================== ============================= ======================== + + +.. [#f1] from https://wiki.ubuntu.com/HoaryPMResults, not sure + which options to use. If you know, please tell me. + +.. [#f2] To be tested with a newer kernel. + +.. [#f3] Not with SMP kernel, UP only. diff --git a/Documentation/power/video.txt b/Documentation/power/video.txt deleted file mode 100644 index 3e6272bc4472..000000000000 --- a/Documentation/power/video.txt +++ /dev/null @@ -1,185 +0,0 @@ - - Video issues with S3 resume - ~~~~~~~~~~~~~~~~~~~~~~~~~~~ - 2003-2006, Pavel Machek - -During S3 resume, hardware needs to be reinitialized. For most -devices, this is easy, and kernel driver knows how to do -it. Unfortunately there's one exception: video card. Those are usually -initialized by BIOS, and kernel does not have enough information to -boot video card. (Kernel usually does not even contain video card -driver -- vesafb and vgacon are widely used). 
- -This is not problem for swsusp, because during swsusp resume, BIOS is -run normally so video card is normally initialized. It should not be -problem for S1 standby, because hardware should retain its state over -that. - -We either have to run video BIOS during early resume, or interpret it -using vbetool later, or maybe nothing is necessary on particular -system because video state is preserved. Unfortunately different -methods work on different systems, and no known method suits all of -them. - -Userland application called s2ram has been developed; it contains long -whitelist of systems, and automatically selects working method for a -given system. It can be downloaded from CVS at -www.sf.net/projects/suspend . If you get a system that is not in the -whitelist, please try to find a working solution, and submit whitelist -entry so that work does not need to be repeated. - -Currently, VBE_SAVE method (6 below) works on most -systems. Unfortunately, vbetool only runs after userland is resumed, -so it makes debugging of early resume problems -hard/impossible. Methods that do not rely on userland are preferable. - -Details -~~~~~~~ - -There are a few types of systems where video works after S3 resume: - -(1) systems where video state is preserved over S3. - -(2) systems where it is possible to call the video BIOS during S3 - resume. Unfortunately, it is not correct to call the video BIOS at - that point, but it happens to work on some machines. Use - acpi_sleep=s3_bios. - -(3) systems that initialize video card into vga text mode and where - the BIOS works well enough to be able to set video mode. Use - acpi_sleep=s3_mode on these. - -(4) on some systems s3_bios kicks video into text mode, and - acpi_sleep=s3_bios,s3_mode is needed. - -(5) radeon systems, where X can soft-boot your video card. You'll need - a new enough X, and a plain text console (no vesafb or radeonfb). See - http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more information. 
- Alternatively, you should use vbetool (6) instead. - -(6) other radeon systems, where vbetool is enough to bring system back - to life. It needs text console to be working. Do vbetool vbestate - save > /tmp/delme; echo 3 > /proc/acpi/sleep; vbetool post; vbetool - vbestate restore < /tmp/delme; setfont , and your video - should work. - -(7) on some systems, it is possible to boot most of kernel, and then - POSTing bios works. Ole Rohne has patch to do just that at - http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2. - -(8) on some systems, you can use the video_post utility and or - do echo 3 > /sys/power/state && /usr/sbin/video_post - which will - initialize the display in console mode. If you are in X, you can switch - to a virtual terminal and back to X using CTRL+ALT+F1 - CTRL+ALT+F7 to get - the display working in graphical mode again. - -Now, if you pass acpi_sleep=something, and it does not work with your -bios, you'll get a hard crash during resume. Be careful. Also it is -safest to do your experiments with plain old VGA console. The vesafb -and radeonfb (etc) drivers have a tendency to crash the machine during -resume. - -You may have a system where none of above works. At that point you -either invent another ugly hack that works, or write proper driver for -your video card (good luck getting docs :-(). Maybe suspending from X -(proper X, knowing your hardware, not XF68_FBcon) might have better -chance of working. 
- -Table of known working notebooks: - -Model hack (or "how to do it") ------------------------------------------------------------------------------- -Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI -Acer TM 230 s3_bios (2) -Acer TM 242FX vbetool (6) -Acer TM C110 video_post (8) -Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8) -Acer TM 4052LCi s3_bios (2) -Acer TM 636Lci s3_bios,s3_mode (4) -Acer TM 650 (Radeon M7) vga=normal plus boot-radeon (5) gets text console back -Acer TM 660 ??? (*) -Acer TM 800 vga=normal, X patches, see webpage (5) or vbetool (6) -Acer TM 803 vga=normal, X patches, see webpage (5) or vbetool (6) -Acer TM 803LCi vga=normal, vbetool (6) -Arima W730a vbetool needed (6) -Asus L2400D s3_mode (3)(***) (S1 also works OK) -Asus L3350M (SiS 740) (6) -Asus L3800C (Radeon M7) s3_bios (2) (S1 also works OK) -Asus M6887Ne vga=normal, s3_bios (2), use radeon driver instead of fglrx in x.org -Athlon64 desktop prototype s3_bios (2) -Compal CL-50 ??? (*) -Compaq Armada E500 - P3-700 none (1) (S1 also works OK) -Compaq Evo N620c vga=normal, s3_bios (2) -Dell 600m, ATI R250 Lf none (1), but needs xorg-x11-6.8.1.902-1 -Dell D600, ATI RV250 vga=normal and X, or try vbestate (6) -Dell D610 vga=normal and X (possibly vbestate (6) too, but not tested) -Dell Inspiron 4000 ??? (*) -Dell Inspiron 500m ??? (*) -Dell Inspiron 510m ??? -Dell Inspiron 5150 vbetool needed (6) -Dell Inspiron 600m ??? (*) -Dell Inspiron 8200 ??? (*) -Dell Inspiron 8500 ??? (*) -Dell Inspiron 8600 ??? (*) -eMachines athlon64 machines vbetool needed (6) (someone please get me model #s) -HP NC6000 s3_bios, may not use radeonfb (2); or vbetool (6) -HP NX7000 ??? 
(*) -HP Pavilion ZD7000 vbetool post needed, need open-source nv driver for X -HP Omnibook XE3 athlon version none (1) -HP Omnibook XE3GC none (1), video is S3 Savage/IX-MV -HP Omnibook XE3L-GF vbetool (6) -HP Omnibook 5150 none (1), (S1 also works OK) -IBM TP T20, model 2647-44G none (1), video is S3 Inc. 86C270-294 Savage/IX-MV, vesafb gets "interesting" but X work. -IBM TP A31 / Type 2652-M5G s3_mode (3) [works ok with BIOS 1.04 2002-08-23, but not at all with BIOS 1.11 2004-11-05 :-(] -IBM TP R32 / Type 2658-MMG none (1) -IBM TP R40 2722B3G ??? (*) -IBM TP R50p / Type 1832-22U s3_bios (2) -IBM TP R51 none (1) -IBM TP T30 236681A ??? (*) -IBM TP T40 / Type 2373-MU4 none (1) -IBM TP T40p none (1) -IBM TP R40p s3_bios (2) -IBM TP T41p s3_bios (2), switch to X after resume -IBM TP T42 s3_bios (2) -IBM ThinkPad T42p (2373-GTG) s3_bios (2) -IBM TP X20 ??? (*) -IBM TP X30 s3_bios, s3_mode (4) -IBM TP X31 / Type 2672-XXH none (1), use radeontool (http://fdd.com/software/radeon/) to turn off backlight. -IBM TP X32 none (1), but backlight is on and video is trashed after long suspend. s3_bios,s3_mode (4) works too. Perhaps that gets better results? -IBM Thinkpad X40 Type 2371-7JG s3_bios,s3_mode (4) -IBM TP 600e none(1), but a switch to console and back to X is needed -Medion MD4220 ??? (*) -Samsung P35 vbetool needed (6) -Sharp PC-AR10 (ATI rage) none (1), backlight does not switch off -Sony Vaio PCG-C1VRX/K s3_bios (2) -Sony Vaio PCG-F403 ??? (*) -Sony Vaio PCG-GRT995MP none (1), works with 'nv' X driver -Sony Vaio PCG-GR7/K none (1), but needs radeonfb, use radeontool (http://fdd.com/software/radeon/) to turn off backlight. -Sony Vaio PCG-N505SN ??? (*) -Sony Vaio vgn-s260 X or boot-radeon can init it (5) -Sony Vaio vgn-S580BH vga=normal, but suspend from X. Console will be blank unless you return to X. 
-Sony Vaio vgn-FS115B s3_bios (2),s3_mode (4) -Toshiba Libretto L5 none (1) -Toshiba Libretto 100CT/110CT vbetool (6) -Toshiba Portege 3020CT s3_mode (3) -Toshiba Satellite 4030CDT s3_mode (3) (S1 also works OK) -Toshiba Satellite 4080XCDT s3_mode (3) (S1 also works OK) -Toshiba Satellite 4090XCDT ??? (*) -Toshiba Satellite P10-554 s3_bios,s3_mode (4)(****) -Toshiba M30 (2) xor X with nvidia driver using internal AGP -Uniwill 244IIO ??? (*) - -Known working desktop systems -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Mainboard Graphics card hack (or "how to do it") ------------------------------------------------------------------------------- -Asus A7V8X nVidia RIVA TNT2 model 64 s3_bios,s3_mode (4) - - -(*) from https://wiki.ubuntu.com/HoaryPMResults, not sure - which options to use. If you know, please tell me. - -(***) To be tested with a newer kernel. - -(****) Not with SMP kernel, UP only. diff --git a/Documentation/process/submitting-drivers.rst b/Documentation/process/submitting-drivers.rst index 58bc047e7b95..1acaa14903d6 100644 --- a/Documentation/process/submitting-drivers.rst +++ b/Documentation/process/submitting-drivers.rst @@ -117,7 +117,7 @@ PM support: implemented") error. You should also try to make sure that your driver uses as little power as possible when it's not doing anything. For the driver testing instructions see - Documentation/power/drivers-testing.txt and for a relatively + Documentation/power/drivers-testing.rst and for a relatively complete overview of the power management issues related to drivers see :ref:`Documentation/driver-api/pm/devices.rst `. diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt index 197d81f4b836..d97207b9accb 100644 --- a/Documentation/scheduler/sched-energy.txt +++ b/Documentation/scheduler/sched-energy.txt @@ -22,7 +22,7 @@ the highest. The actual EM used by EAS is _not_ maintained by the scheduler, but by a dedicated framework. 
For details about this framework and what it provides, -please refer to its documentation (see Documentation/power/energy-model.txt). +please refer to its documentation (see Documentation/power/energy-model.rst). 2. Background and Terminology @@ -81,7 +81,7 @@ through the arch_scale_cpu_capacity() callback. The rest of platform knowledge used by EAS is directly read from the Energy Model (EM) framework. The EM of a platform is composed of a power cost table -per 'performance domain' in the system (see Documentation/power/energy-model.txt +per 'performance domain' in the system (see Documentation/power/energy-model.rst for futher details about performance domains). The scheduler manages references to the EM objects in the topology code when the @@ -352,7 +352,7 @@ could be amended in the future if proven otherwise. EAS uses the EM of a platform to estimate the impact of scheduling decisions on energy. So, your platform must provide power cost tables to the EM framework in order to make EAS start. To do so, please refer to documentation of the -independent EM framework in Documentation/power/energy-model.txt. +independent EM framework in Documentation/power/energy-model.rst. Please also note that the scheduling domains need to be re-built after the EM has been registered in order to start EAS. diff --git a/Documentation/trace/coresight-cpu-debug.txt b/Documentation/trace/coresight-cpu-debug.txt index f07e38094b40..1a660a39e3c0 100644 --- a/Documentation/trace/coresight-cpu-debug.txt +++ b/Documentation/trace/coresight-cpu-debug.txt @@ -151,7 +151,7 @@ At the runtime you can disable idle states with below methods: It is possible to disable CPU idle states by way of the PM QoS subsystem, more specifically by using the "/dev/cpu_dma_latency" -interface (see Documentation/power/pm_qos_interface.txt for more +interface (see Documentation/power/pm_qos_interface.rst for more details). 
As specified in the PM QoS documentation the requested parameter will stay in effect until the file descriptor is released. For example: diff --git a/Documentation/translations/zh_CN/process/submitting-drivers.rst b/Documentation/translations/zh_CN/process/submitting-drivers.rst index 72c6cd935821..f1c3906c69a8 100644 --- a/Documentation/translations/zh_CN/process/submitting-drivers.rst +++ b/Documentation/translations/zh_CN/process/submitting-drivers.rst @@ -97,7 +97,7 @@ Linux 2.6: 函数定义成返回 -ENOSYS(功能未实现)错误。你还应该尝试确 保你的驱动在什么都不干的情况下将耗电降到最低。要获得驱动 程序测试的指导,请参阅 - Documentation/power/drivers-testing.txt。有关驱动程序电 + Documentation/power/drivers-testing.rst。有关驱动程序电 源管理问题相对全面的概述,请参阅 Documentation/driver-api/pm/devices.rst。 -- cgit From 5992b044989daf281a2ab78b2e841cd6bd51d93a Mon Sep 17 00:00:00 2001 From: Manikanta Maddireddy Date: Tue, 18 Jun 2019 23:32:01 +0530 Subject: dt-bindings: pci: tegra: Document PCIe DPD pinctrl optional prop Document PCIe DPD pinctrl optional property to put PEX clk & BIAS pads in low power mode. Signed-off-by: Manikanta Maddireddy Signed-off-by: Lorenzo Pieralisi Reviewed-by: Rob Herring Acked-by: Thierry Reding --- Documentation/devicetree/bindings/pci/nvidia,tegra20-pcie.txt | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pci/nvidia,tegra20-pcie.txt b/Documentation/devicetree/bindings/pci/nvidia,tegra20-pcie.txt index 145a4f04194f..7939bca47861 100644 --- a/Documentation/devicetree/bindings/pci/nvidia,tegra20-pcie.txt +++ b/Documentation/devicetree/bindings/pci/nvidia,tegra20-pcie.txt @@ -65,6 +65,14 @@ Required properties: - afi - pcie_x +Optional properties: +- pinctrl-names: A list of pinctrl state names. Must contain the following + entries: + - "default": active state, puts PCIe I/O out of deep power down state + - "idle": puts PCIe I/O into deep power down state +- pinctrl-0: phandle for the default/active state of pin configurations. 
+- pinctrl-1: phandle for the idle state of pin configurations.
+
 Required properties on Tegra124 and later (deprecated):
 - phys: Must contain an entry for each entry in phy-names.
 - phy-names: Must include the following entries:
-- cgit

From 0fc8b82f31c4a7bea4c487d380a10d1271bf8d4d Mon Sep 17 00:00:00 2001
From: Manikanta Maddireddy
Date: Tue, 18 Jun 2019 23:32:04 +0530
Subject: PCI: Add DT binding for "reset-gpios" property

Add DT binding for "reset-gpios" property which supports GPIO based PERST#
signal.

Signed-off-by: Manikanta Maddireddy
Signed-off-by: Lorenzo Pieralisi
Reviewed-by: Rob Herring
Acked-by: Thierry Reding
---
 Documentation/devicetree/bindings/pci/pci.txt | 3 +++
 1 file changed, 3 insertions(+)
(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/pci/pci.txt b/Documentation/devicetree/bindings/pci/pci.txt
index 92c01db610df..2a5d91024059 100644
--- a/Documentation/devicetree/bindings/pci/pci.txt
+++ b/Documentation/devicetree/bindings/pci/pci.txt
@@ -24,6 +24,9 @@ driver implementation may support the following properties:
 unsupported link speed, for instance, trying to do training for
 unsupported link speed, etc. Must be '4' for gen4, '3' for gen3, '2'
 for gen2, and '1' for gen1. Any other values are invalid.
+- reset-gpios:
+ If present this property specifies PERST# GPIO. Host drivers can parse the
+ GPIO and apply fundamental reset to endpoints.
 PCI-PCI Bridge properties
 -------------------------
-- cgit

From 69bc586518e0902493c42b77652fa712fae3480f Mon Sep 17 00:00:00 2001
From: Biju Das
Date: Fri, 7 Jun 2019 08:03:36 +0100
Subject: dt-bindings: PCI: rcar: Add device tree support for r8a774a1

Add PCIe support for the RZ/G2M (a.k.a. R8A774A1).

Signed-off-by: Biju Das
Signed-off-by: Lorenzo Pieralisi
Reviewed-by: Geert Uytterhoeven
Reviewed-by: Simon Horman
Acked-by: Simon Horman
---
 Documentation/devicetree/bindings/pci/rcar-pci.txt | 1 +
 1 file changed, 1 insertion(+)
(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/pci/rcar-pci.txt b/Documentation/devicetree/bindings/pci/rcar-pci.txt
index 6904882a0e94..45bba9f88a51 100644
--- a/Documentation/devicetree/bindings/pci/rcar-pci.txt
+++ b/Documentation/devicetree/bindings/pci/rcar-pci.txt
@@ -3,6 +3,7 @@
 Required properties:
 compatible: "renesas,pcie-r8a7743" for the R8A7743 SoC;
 "renesas,pcie-r8a7744" for the R8A7744 SoC;
+ "renesas,pcie-r8a774a1" for the R8A774A1 SoC;
 "renesas,pcie-r8a774c0" for the R8A774C0 SoC;
 "renesas,pcie-r8a7779" for the R8A7779 SoC;
 "renesas,pcie-r8a7790" for the R8A7790 SoC;
-- cgit

From 93bad0f5d15f3acd6d2ab0aee4a81ce1fcc6300d Mon Sep 17 00:00:00 2001
From: Hou Zhiqiang
Date: Fri, 5 Jul 2019 17:56:40 +0800
Subject: dt-bindings: PCI: mobiveil: Change gpio_slave and apb_csr to optional

Change the "gpio_slave" and "apb_csr" to optional, the "gpio_slave" is
not used in current code, and "apb_csr" is not used by some platforms.

Signed-off-by: Hou Zhiqiang
Signed-off-by: Lorenzo Pieralisi
Acked-by: Subrahmanya Lingappa
Acked-by: Rob Herring
Reviewed-by: Minghuan Lian
Reviewed-by: Subrahmanya Lingappa
---
 Documentation/devicetree/bindings/pci/mobiveil-pcie.txt | 2 ++
 1 file changed, 2 insertions(+)
(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/pci/mobiveil-pcie.txt b/Documentation/devicetree/bindings/pci/mobiveil-pcie.txt
index a618d4787dd7..64156993e052 100644
--- a/Documentation/devicetree/bindings/pci/mobiveil-pcie.txt
+++ b/Documentation/devicetree/bindings/pci/mobiveil-pcie.txt
@@ -10,8 +10,10 @@ Required properties:
 interrupt source. The value must be 1.
 - compatible: Should contain "mbvl,gpex40-pcie"
 - reg: Should contain PCIe registers location and length
+ Mandatory:
 "config_axi_slave": PCIe controller registers
 "csr_axi_slave" : Bridge config registers
+ Optional:
 "gpio_slave" : GPIO registers to control slot power
 "apb_csr" : MSI registers
-- cgit
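
Reviewer note (not part of the patch): the Mandatory/Optional split above suggests that a node may now describe only the two mandatory register regions. A minimal sketch of such a node follows. The node label, unit address, register addresses and sizes are made-up placeholders, the reg layout assumes a parent bus with one address cell and one size cell, and the use of reg-names to carry the region names is an assumption based on how the names are listed in the binding.

```dts
/* Illustrative sketch only: label, addresses and sizes are hypothetical. */
pcie0: pcie@a0000000 {
	compatible = "mbvl,gpex40-pcie";

	/*
	 * Mandatory regions only. The optional "gpio_slave" and "apb_csr"
	 * regions are left out, as the updated binding appears to allow.
	 */
	reg = <0xa0000000 0x00001000>,
	      <0xb0000000 0x00010000>;
	reg-names = "config_axi_slave", "csr_axi_slave";

	/* The remaining required properties of the binding are omitted here. */
};
```

A platform that does use the MSI registers would presumably add a third reg entry together with the "apb_csr" name, keeping the order of reg and reg-names in step.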