diff options
Diffstat (limited to 'Documentation/driver-api')
| -rw-r--r-- | Documentation/driver-api/cxl/allocation/page-allocator.rst | 31 | ||||
| -rw-r--r-- | Documentation/driver-api/firmware/efi/index.rst | 11 | ||||
| -rw-r--r-- | Documentation/driver-api/generic_pt.rst | 137 | ||||
| -rw-r--r-- | Documentation/driver-api/index.rst | 1 | ||||
| -rw-r--r-- | Documentation/driver-api/pci/p2pdma.rst | 97 | ||||
| -rw-r--r-- | Documentation/driver-api/pci/pci.rst | 3 |
6 files changed, 223 insertions, 57 deletions
diff --git a/Documentation/driver-api/cxl/allocation/page-allocator.rst b/Documentation/driver-api/cxl/allocation/page-allocator.rst index 7b8fe1b8d5bb..3fa584a248bd 100644 --- a/Documentation/driver-api/cxl/allocation/page-allocator.rst +++ b/Documentation/driver-api/cxl/allocation/page-allocator.rst @@ -41,37 +41,6 @@ To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over will fallback to allocate from :code:`ZONE_NORMAL`. -Zone and Node Quirks -==================== -Let's consider a configuration where the local DRAM capacity is largely onlined -into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The -CXL capacity has the opposite configuration - all onlined in -:code:`ZONE_MOVABLE`. - -Under the default allocation policy, the page allocator will completely skip -:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of -Linux v6.15, the page allocator does (approximately) the following: :: - - for (each zone in local_node): - - for (each node in fallback_order): - - attempt_allocation(gfp_flags); - -Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is -functionally unreachable for direct allocation. As a result, the only way -for CXL capacity to be used is via `demotion` in the reclaim path. - -This configuration also means that if the DRAM ndoe has :code:`ZONE_MOVABLE` -capacity - when that capacity is depleted, the page allocator will actually -prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages. - -We may wish to invert this priority in future Linux versions. - -If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes -when the DRAM nodes are depleted. See the reclaim section for more details. - - CGroups and CPUSets =================== Finally, assuming CXL memory is reachable via the page allocation (i.e. onlined diff --git a/Documentation/driver-api/firmware/efi/index.rst b/Documentation/driver-api/firmware/efi/index.rst index 4fe8abba9fc6..5a6b6229592c 100644 --- a/Documentation/driver-api/firmware/efi/index.rst +++ b/Documentation/driver-api/firmware/efi/index.rst @@ -1,11 +1,16 @@ .. SPDX-License-Identifier: GPL-2.0 -============ -UEFI Support -============ +==================================================== +Unified Extensible Firmware Interface (UEFI) Support +==================================================== UEFI stub library functions =========================== .. kernel-doc:: drivers/firmware/efi/libstub/mem.c :internal: + +UEFI Common Platform Error Record (CPER) functions +================================================== + +.. kernel-doc:: drivers/firmware/efi/cper.c diff --git a/Documentation/driver-api/generic_pt.rst b/Documentation/driver-api/generic_pt.rst new file mode 100644 index 000000000000..fd29d1b525e5 --- /dev/null +++ b/Documentation/driver-api/generic_pt.rst @@ -0,0 +1,137 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +Generic Radix Page Table +======================== + +.. kernel-doc:: include/linux/generic_pt/common.h + :doc: Generic Radix Page Table + +.. kernel-doc:: drivers/iommu/generic_pt/pt_defs.h + :doc: Generic Page Table Language + +Usage +===== + +Generic PT is structured as a multi-compilation system. Since each format +provides an API using a common set of names there can be only one format active +within a compilation unit. This design avoids function pointers around the low +level API. + +Instead the function pointers can end up at the higher level API (i.e. +map/unmap, etc.) and the per-format code can be directly inlined into the +per-format compilation unit. For something like IOMMU each format will be +compiled into a per-format IOMMU operations kernel module. + +For this to work the .c file for each compilation unit will include both the +format headers and the generic code for the implementation. For instance in an +implementation compilation unit the headers would normally be included as +follows: + +generic_pt/fmt/iommu_amdv1.c:: + + #include <linux/generic_pt/common.h> + #include "defs_amdv1.h" + #include "../pt_defs.h" + #include "amdv1.h" + #include "../pt_common.h" + #include "../pt_iter.h" + #include "../iommu_pt.h" /* The IOMMU implementation */ + +iommu_pt.h includes definitions that will generate the operations functions for +map/unmap/etc. using the definitions provided by AMDv1. The resulting module +will have exported symbols named like pt_iommu_amdv1_init(). + +Refer to drivers/iommu/generic_pt/fmt/iommu_template.h for an example of how the +IOMMU implementation uses multi-compilation to generate per-format ops structs +pointers. + +The format code is written so that the common names arise from #defines to +distinct format specific names. This is intended to aid debuggability by +avoiding symbol clashes across all the different formats. + +Exported symbols and other global names are mangled using a per-format string +via the NS() helper macro. + +The format uses struct pt_common as the top-level struct for the table, +and each format will have its own struct pt_xxx which embeds it to store +format-specific information. + +The implementation will further wrap struct pt_common in its own top-level +struct, such as struct pt_iommu_amdv1. + +Format functions at the struct pt_common level +---------------------------------------------- + +.. kernel-doc:: include/linux/generic_pt/common.h + :identifiers: +.. kernel-doc:: drivers/iommu/generic_pt/pt_common.h + +Iteration Helpers +----------------- + +.. kernel-doc:: drivers/iommu/generic_pt/pt_iter.h + +Writing a Format +---------------- + +It is best to start from a simple format that is similar to the target. x86_64 +is usually a good reference for something simple, and AMDv1 is something fairly +complete. + +The required inline functions need to be implemented in the format header. +These should all follow the standard pattern of:: + + static inline pt_oaddr_t amdv1pt_entry_oa(const struct pt_state *pts) + { + [..] + } + #define pt_entry_oa amdv1pt_entry_oa + +where a uniquely named per-format inline function provides the implementation +and a define maps it to the generic name. This is intended to make debug symbols +work better. inline functions should always be used as the prototypes in +pt_common.h will cause the compiler to validate the function signature to +prevent errors. + +Review pt_fmt_defaults.h to understand some of the optional inlines. + +Once the format compiles then it should be run through the generic page table +kunit test in kunit_generic_pt.h using kunit. For example:: + + $ tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig amdv1_fmt_test.* + [...] + [11:15:08] Testing complete. Ran 9 tests: passed: 9 + [11:15:09] Elapsed time: 3.137s total, 0.001s configuring, 2.368s building, 0.311s running + +The generic tests are intended to prove out the format functions and give +clearer failures to speed up finding the problems. Once those pass then the +entire kunit suite should be run. + +IOMMU Invalidation Features +--------------------------- + +Invalidation is how the page table algorithms synchronize with a HW cache of the +page table memory, typically called the TLB (or IOTLB for IOMMU cases). + +The TLB can store present PTEs, non-present PTEs and table pointers, depending +on its design. Every HW has its own approach on how to describe what has changed +to have changed items removed from the TLB. + +PT_FEAT_FLUSH_RANGE +~~~~~~~~~~~~~~~~~~~ + +PT_FEAT_FLUSH_RANGE is the easiest scheme to understand. It tries to generate a +single range invalidation for each operation, over-invalidating if there are +gaps of VA that don't need invalidation. This trades off impacted VA for number +of invalidation operations. It does not keep track of what is being invalidated; +however, if pages have to be freed then page table pointers have to be cleaned +from the walk cache. The range can start/end at any page boundary. + +PT_FEAT_FLUSH_RANGE_NO_GAPS +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +PT_FEAT_FLUSH_RANGE_NO_GAPS is similar to PT_FEAT_FLUSH_RANGE; however, it tries +to minimize the amount of impacted VA by issuing extra flush operations. This is +useful if the cost of processing VA is very high, for instance because a +hypervisor is processing the page table with a shadowing algorithm. diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 3e2a270bd828..baff96b5cf0b 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -93,6 +93,7 @@ Subsystem-specific APIs frame-buffer aperture generic-counter + generic_pt gpio/index hsi hte/index diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst index d0b241628cf1..280673b50350 100644 --- a/Documentation/driver-api/pci/p2pdma.rst +++ b/Documentation/driver-api/pci/p2pdma.rst @@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth called Peer-to-Peer (or P2P). However, there are a number of issues that make P2P transactions tricky to do in a perfectly safe way. -One of the biggest issues is that PCI doesn't require forwarding -transactions between hierarchy domains, and in PCIe, each Root Port -defines a separate hierarchy domain. To make things worse, there is no -simple way to determine if a given Root Complex supports this or not. -(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel -only supports doing P2P when the endpoints involved are all behind the -same PCI bridge, as such devices are all in the same PCI hierarchy -domain, and the spec guarantees that all transactions within the -hierarchy will be routable, but it does not require routing -between hierarchies. - -The second issue is that to make use of existing interfaces in Linux, -memory that is used for P2P transactions needs to be backed by struct -pages. However, PCI BARs are not typically cache coherent so there are -a few corner case gotchas with these pages so developers need to -be careful about what they do with them. +For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up +until they reach a host bridge or root port. If the path includes PCIe switches +then based on the ACS settings the transaction can route entirely within +the PCIe hierarchy and never reach the root port. The kernel will evaluate +the PCIe topology and always permit P2P in these well-defined cases. + +However, if the P2P transaction reaches the host bridge then it might have to +hairpin back out the same root port, be routed inside the CPU SOC to another +PCIe root port, or routed internally to the SOC. + +The PCIe specification doesn't define the forwarding of transactions between +hierarchy domains and kernel defaults to blocking such routing. There is an +allow list to allow detecting known-good HW, in which case P2P between any +two PCIe devices will be permitted. + +Since P2P inherently is doing transactions between two devices it requires two +drivers to be co-operating inside the kernel. The providing driver has to convey +its MMIO to the consuming driver. To meet the driver model lifecycle rules the +MMIO must have all DMA mapping removed, all CPU accesses prevented, all page +table mappings undone before the providing driver completes remove(). + +This requires the providing and consuming driver to actively work together to +guarantee that the consuming driver has stopped using the MMIO during a removal +cycle. This is done by either a synchronous invalidation shutdown or waiting +for all usage refcounts to reach zero. + +At the lowest level the P2P subsystem offers a naked struct p2p_provider that +delegates lifecycle management to the providing driver. It is expected that +drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF +to provide an invalidation shutdown. These MMIO addresess have no struct page, and +if used with mmap() must create special PTEs. As such there are very few +kernel uAPIs that can accept pointers to them; in particular they cannot be used +with read()/write(), including O_DIRECT. + +Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE +pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of +pgmap ensures that when the pgmap is destroyed all other drivers have stopped +using the MMIO. This option works with O_DIRECT flows, in some cases, if the +underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through +FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap +it also relies on architecture support along with alignment and minimum size +limitations. Driver Writer's Guide @@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory. Struct Page Caveats ------------------- -Driver writers should be very careful about not passing these special -struct pages to code that isn't prepared for it. At this time, the kernel -interfaces do not have any checks for ensuring this. This obviously -precludes passing these pages to userspace. +While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs, +pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set. -P2P memory is also technically IO memory but should never have any side -effects behind it. Thus, the order of loads and stores should not be important -and ioreadX(), iowriteX() and friends should not be necessary. +The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The +KVA is still MMIO and must still be accessed through the normal +readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just +like any other MMIO mapping. While this will actually work on some +architectures, others will experience corruption or just crash in the kernel. +Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU +access happens. + + +Usage With DMABUF +================= + +DMABUF provides an alternative to the above struct page-based +client/provider/orchestrator system and should be used when struct page +doesn't exist. In this mode the exporting driver will wrap +some of its MMIO in a DMABUF and give the DMABUF FD to userspace. + +Userspace can then pass the FD to an importing driver which will ask the +exporting driver to map it to the importer. + +In this case the initiator and target pci_devices are known and the P2P subsystem +is used to determine the mapping type. The phys_addr_t-based DMA API is used to +establish the dma_addr_t. + +Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants +to remove() it must deliver an invalidation shutdown to all DMABUF importing +drivers through move_notify() and synchronously DMA unmap all the MMIO. + +No importing driver can continue to have a DMA map to the MMIO after the +exporting driver has destroyed its p2p_provider. P2P DMA Support Library diff --git a/Documentation/driver-api/pci/pci.rst b/Documentation/driver-api/pci/pci.rst index 59d86e827198..99a1bbaaec5d 100644 --- a/Documentation/driver-api/pci/pci.rst +++ b/Documentation/driver-api/pci/pci.rst @@ -37,6 +37,9 @@ PCI Support Library .. kernel-doc:: drivers/pci/slot.c :export: +.. kernel-doc:: drivers/pci/rebar.c + :export: + .. kernel-doc:: drivers/pci/rom.c :export: |
