Diffstat (limited to 'Documentation/core-api')
23 files changed, 1556 insertions, 310 deletions
diff --git a/Documentation/core-api/dma-api-howto.rst b/Documentation/core-api/dma-api-howto.rst index 0bf31b6c4383..96fce2a9aa90 100644 --- a/Documentation/core-api/dma-api-howto.rst +++ b/Documentation/core-api/dma-api-howto.rst @@ -155,7 +155,7 @@ a device with limitations, it needs to be decreased. Special note about PCI: PCI-X specification requires PCI-X devices to support 64-bit addressing (DAC) for all transactions. And at least one platform (SGI -SN2) requires 64-bit consistent allocations to operate correctly when the IO +SN2) requires 64-bit coherent allocations to operate correctly when the IO bus is in PCI-X mode. For correct operation, you must set the DMA mask to inform the kernel about @@ -174,7 +174,7 @@ used instead: int dma_set_mask(struct device *dev, u64 mask); - The setup for consistent allocations is performed via a call + The setup for coherent allocations is performed via a call to dma_set_coherent_mask():: int dma_set_coherent_mask(struct device *dev, u64 mask); @@ -241,7 +241,7 @@ it would look like this:: The coherent mask will always be able to set the same or a smaller mask as the streaming mask. However for the rare case that a device driver only -uses consistent allocations, one would have to check the return value from +uses coherent allocations, one would have to check the return value from dma_set_coherent_mask(). Finally, if your device can only drive the low 24-bits of @@ -298,20 +298,20 @@ Types of DMA mappings There are two types of DMA mappings: -- Consistent DMA mappings which are usually mapped at driver +- Coherent DMA mappings which are usually mapped at driver initialization, unmapped at the end and for which the hardware should guarantee that the device and the CPU can access the data in parallel and will see updates made by each other without any explicit software flushing. - Think of "consistent" as "synchronous" or "coherent". + Think of "coherent" as "synchronous". - The current default is to return consistent memory in the low 32 + The current default is to return coherent memory in the low 32 bits of the DMA space. However, for future compatibility you should - set the consistent mask even if this default is fine for your + set the coherent mask even if this default is fine for your driver. - Good examples of what to use consistent mappings for are: + Good examples of what to use coherent mappings for are: - Network card DMA ring descriptors. - SCSI adapter mailbox command data structures. @@ -320,13 +320,13 @@ There are two types of DMA mappings: The invariant these examples all require is that any CPU store to memory is immediately visible to the device, and vice - versa. Consistent mappings guarantee this. + versa. Coherent mappings guarantee this. .. important:: - Consistent DMA memory does not preclude the usage of + Coherent DMA memory does not preclude the usage of proper memory barriers. The CPU may reorder stores to - consistent memory just as it may normal memory. Example: + coherent memory just as it may normal memory. Example: if it is important for the device to see the first word of a descriptor updated before the second, you must do something like:: @@ -365,10 +365,10 @@ Also, systems with caches that aren't DMA-coherent will work better when the underlying buffers don't share cache lines with other data. 
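For illustration only, a minimal probe-time sketch of the mask negotiation described in the hunks above, assuming a hypothetical driver (the ``foo_`` name is not part of the patch)::

    #include <linux/dma-mapping.h>

    static int foo_configure_dma(struct device *dev)
    {
            /* Try 64-bit addressing first, then fall back to 32-bit. */
            if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)) &&
                dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32))) {
                    dev_warn(dev, "no suitable DMA addressing available\n");
                    return -ENODEV;
            }
            return 0;
    }

Setting both the streaming and coherent masks in one call matches the common case described above; drivers that only ever use coherent allocations would instead check dma_set_coherent_mask() alone.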
-Using Consistent DMA mappings -============================= +Using Coherent DMA mappings +=========================== -To allocate and map large (PAGE_SIZE or so) consistent DMA regions, +To allocate and map large (PAGE_SIZE or so) coherent DMA regions, you should do:: dma_addr_t dma_handle; @@ -385,10 +385,10 @@ __get_free_pages() (but takes size instead of a page order). If your driver needs regions sized smaller than a page, you may prefer using the dma_pool interface, described below. -The consistent DMA mapping interfaces, will by default return a DMA address +The coherent DMA mapping interfaces, will by default return a DMA address which is 32-bit addressable. Even if the device indicates (via the DMA mask) -that it may address the upper 32-bits, consistent allocation will only -return > 32-bit addresses for DMA if the consistent DMA mask has been +that it may address the upper 32-bits, coherent allocation will only +return > 32-bit addresses for DMA if the coherent DMA mask has been explicitly changed via dma_set_coherent_mask(). This is true of the dma_pool interface as well. @@ -497,7 +497,7 @@ program address space. Such platforms can and do report errors in the kernel logs when the DMA controller hardware detects violation of the permission setting. -Only streaming mappings specify a direction, consistent mappings +Only streaming mappings specify a direction, coherent mappings implicitly have a direction attribute setting of DMA_BIDIRECTIONAL. diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst index 8e3cce3d0a23..3087bea715ed 100644 --- a/Documentation/core-api/dma-api.rst +++ b/Documentation/core-api/dma-api.rst @@ -8,15 +8,15 @@ This document describes the DMA API. For a more gentle introduction of the API (and actual examples), see Documentation/core-api/dma-api-howto.rst. This API is split into two pieces. Part I describes the basic API. -Part II describes extensions for supporting non-consistent memory +Part II describes extensions for supporting non-coherent memory machines. Unless you know that your driver absolutely has to support -non-consistent platforms (this is usually only legacy platforms) you +non-coherent platforms (this is usually only legacy platforms) you should only use the API described in part I. -Part I - dma_API +Part I - DMA API ---------------- -To get the dma_API, you must #include <linux/dma-mapping.h>. This +To get the DMA API, you must #include <linux/dma-mapping.h>. This provides dma_addr_t and the interfaces described below. A dma_addr_t can hold any valid DMA address for the platform. It can be @@ -33,13 +33,13 @@ Part Ia - Using large DMA-coherent buffers dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flag) -Consistent memory is memory for which a write by either the device or +Coherent memory is memory for which a write by either the device or the processor can immediately be read by the processor or device without having to worry about caching effects. (You may however need to make sure to flush the processor's write buffers before telling devices to read that memory.) -This routine allocates a region of <size> bytes of consistent memory. +This routine allocates a region of <size> bytes of coherent memory. It returns a pointer to the allocated region (in the processor's virtual address space) or NULL if the allocation failed. 
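As a hedged usage sketch of the allocator just described (the error handling and surrounding names are illustrative, not taken from the patch)::

    dma_addr_t dma_handle;
    void *cpu_addr;

    cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
    if (!cpu_addr)
            return -ENOMEM;

    /* the CPU uses cpu_addr; the device is programmed with dma_handle */
    ...

    dma_free_coherent(dev, size, cpu_addr, dma_handle);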
@@ -48,15 +48,14 @@ It also returns a <dma_handle> which may be cast to an unsigned integer the same width as the bus and given to the device as the DMA address base of the region. -Note: consistent memory can be expensive on some platforms, and the +Note: coherent memory can be expensive on some platforms, and the minimum allocation length may be as big as a page, so you should -consolidate your requests for consistent memory as much as possible. +consolidate your requests for coherent memory as much as possible. The simplest way to do that is to use the dma_pool calls (see below). -The flag parameter (dma_alloc_coherent() only) allows the caller to -specify the ``GFP_`` flags (see kmalloc()) for the allocation (the -implementation may choose to ignore flags that affect the location of -the returned memory, like GFP_DMA). +The flag parameter allows the caller to specify the ``GFP_`` flags (see +kmalloc()) for the allocation (the implementation may ignore flags that affect +the location of the returned memory, like GFP_DMA). :: @@ -64,19 +63,18 @@ the returned memory, like GFP_DMA). dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_handle) -Free a region of consistent memory you previously allocated. dev, -size and dma_handle must all be the same as those passed into -dma_alloc_coherent(). cpu_addr must be the virtual address returned by -the dma_alloc_coherent(). +Free a previously allocated region of coherent memory. dev, size and dma_handle +must all be the same as those passed into dma_alloc_coherent(). cpu_addr must +be the virtual address returned by dma_alloc_coherent(). -Note that unlike their sibling allocation calls, these routines -may only be called with IRQs enabled. +Note that unlike the sibling allocation call, this routine may only be called +with IRQs enabled. Part Ib - Using small DMA-coherent buffers ------------------------------------------ -To get this part of the dma_API, you must #include <linux/dmapool.h> +To get this part of the DMA API, you must #include <linux/dmapool.h> Many drivers need lots of small DMA-coherent memory regions for DMA descriptors or I/O buffers. Rather than allocating in units of a page @@ -85,78 +83,29 @@ much like a struct kmem_cache, except that they use the DMA-coherent allocator, not __get_free_pages(). Also, they understand common hardware constraints for alignment, like queue heads needing to be aligned on N-byte boundaries. +.. kernel-doc:: mm/dmapool.c + :export: -:: - - struct dma_pool * - dma_pool_create(const char *name, struct device *dev, - size_t size, size_t align, size_t alloc); - -dma_pool_create() initializes a pool of DMA-coherent buffers -for use with a given device. It must be called in a context which -can sleep. - -The "name" is for diagnostics (like a struct kmem_cache name); dev and size -are like what you'd pass to dma_alloc_coherent(). The device's hardware -alignment requirement for this type of data is "align" (which is expressed -in bytes, and must be a power of two). If your device has no boundary -crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated -from this pool must not cross 4KByte boundaries. - -:: - - void * - dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags, - dma_addr_t *handle) - -Wraps dma_pool_alloc() and also zeroes the returned memory if the -allocation attempt succeeded. 
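Although the long-hand descriptions above are being replaced by kernel-doc references, a brief usage sketch of the dma_pool interface may still help; the descriptor size, alignment and the "foo-desc" name are illustrative assumptions::

    struct dma_pool *pool;
    dma_addr_t dma;
    void *vaddr;

    /* 64-byte descriptors, 16-byte aligned, no boundary-crossing limit */
    pool = dma_pool_create("foo-desc", dev, 64, 16, 0);
    if (!pool)
            return -ENOMEM;

    vaddr = dma_pool_zalloc(pool, GFP_KERNEL, &dma);
    if (!vaddr) {
            dma_pool_destroy(pool);
            return -ENOMEM;
    }

    /* ... fill the descriptor via vaddr, hand dma to the device ... */

    dma_pool_free(pool, vaddr, dma);
    dma_pool_destroy(pool);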
- - -:: - - void * - dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags, - dma_addr_t *dma_handle); - -This allocates memory from the pool; the returned memory will meet the -size and alignment requirements specified at creation time. Pass -GFP_ATOMIC to prevent blocking, or if it's permitted (not -in_interrupt, not holding SMP locks), pass GFP_KERNEL to allow -blocking. Like dma_alloc_coherent(), this returns two values: an -address usable by the CPU, and the DMA address usable by the pool's -device. - -:: - - void - dma_pool_free(struct dma_pool *pool, void *vaddr, - dma_addr_t addr); - -This puts memory back into the pool. The pool is what was passed to -dma_pool_alloc(); the CPU (vaddr) and DMA addresses are what -were returned when that routine allocated the memory being freed. - -:: - - void - dma_pool_destroy(struct dma_pool *pool); - -dma_pool_destroy() frees the resources of the pool. It must be -called in a context which can sleep. Make sure you've freed all allocated -memory back to the pool before you destroy it. +.. kernel-doc:: include/linux/dmapool.h Part Ic - DMA addressing limitations ------------------------------------ +DMA mask is a bit mask of the addressable region for the device. In other words, +if applying the DMA mask (a bitwise AND operation) to the DMA address of a +memory region does not clear any bits in the address, then the device can +perform DMA to that memory region. + +All the below functions which set a DMA mask may fail if the requested mask +cannot be used with the device, or if the device is not capable of doing DMA. + :: int dma_set_mask_and_coherent(struct device *dev, u64 mask) -Checks to see if the mask is possible and updates the device -streaming and coherent DMA mask parameters if it is. +Updates both streaming and coherent DMA masks. Returns: 0 if successful and a negative error if not. @@ -165,8 +114,7 @@ Returns: 0 if successful and a negative error if not. int dma_set_mask(struct device *dev, u64 mask) -Checks to see if the mask is possible and updates the device -parameters if it is. +Updates only the streaming DMA mask. Returns: 0 if successful and a negative error if not. @@ -175,8 +123,7 @@ Returns: 0 if successful and a negative error if not. int dma_set_coherent_mask(struct device *dev, u64 mask) -Checks to see if the mask is possible and updates the device -parameters if it is. +Updates only the coherent DMA mask. Returns: 0 if successful and a negative error if not. @@ -231,12 +178,32 @@ transfer memory ownership. Returns %false if those calls can be skipped. unsigned long dma_get_merge_boundary(struct device *dev); -Returns the DMA merge boundary. If the device cannot merge any the DMA address +Returns the DMA merge boundary. If the device cannot merge any DMA address segments, the function returns 0. Part Id - Streaming DMA mappings -------------------------------- +Streaming DMA allows to map an existing buffer for DMA transfers and then +unmap it when finished. Map functions are not guaranteed to succeed, so the +return value must be checked. + +.. note:: + + In particular, mapping may fail for memory not addressable by the + device, e.g. if it is not within the DMA mask of the device and/or a + connecting bus bridge. Streaming DMA functions try to overcome such + addressing constraints, either by using an IOMMU (a device which maps + I/O DMA addresses to physical memory addresses), or by copying the + data to/from a bounce buffer if the kernel is configured with a + :doc:`SWIOTLB <swiotlb>`. 
However, these methods are not always + available, and even if they are, they may still fail for a number of + reasons. + + In short, a device driver may need to be wary of where buffers are + located in physical memory, especially if the DMA mask is less than 32 + bits. + :: dma_addr_t @@ -246,9 +213,7 @@ Part Id - Streaming DMA mappings Maps a piece of processor virtual memory so it can be accessed by the device and returns the DMA address of the memory. -The direction for both APIs may be converted freely by casting. -However the dma_API uses a strongly typed enumerator for its -direction: +The DMA API uses a strongly typed enumerator for its direction: ======================= ============================================= DMA_NONE no direction (used for debugging) @@ -259,31 +224,13 @@ DMA_BIDIRECTIONAL direction isn't known .. note:: - Not all memory regions in a machine can be mapped by this API. - Further, contiguous kernel virtual space may not be contiguous as + Contiguous kernel virtual space may not be contiguous as physical memory. Since this API does not provide any scatter/gather capability, it will fail if the user tries to map a non-physically contiguous piece of memory. For this reason, memory to be mapped by this API should be obtained from sources which guarantee it to be physically contiguous (like kmalloc). - Further, the DMA address of the memory must be within the - dma_mask of the device (the dma_mask is a bit mask of the - addressable region for the device, i.e., if the DMA address of - the memory ANDed with the dma_mask is still equal to the DMA - address, then the device can perform DMA to the memory). To - ensure that the memory allocated by kmalloc is within the dma_mask, - the driver may specify various platform-dependent flags to restrict - the DMA address range of the allocation (e.g., on x86, GFP_DMA - guarantees to be within the first 16MB of available DMA addresses, - as required by ISA devices). - - Note also that the above constraints on physical contiguity and - dma_mask may not apply if the platform has an IOMMU (a device which - maps an I/O DMA address to a physical memory address). However, to be - portable, device driver writers may *not* assume that such an IOMMU - exists. - .. warning:: Memory coherency operates at a granularity called the cache @@ -325,8 +272,7 @@ DMA_BIDIRECTIONAL direction isn't known enum dma_data_direction direction) Unmaps the region previously mapped. All the parameters passed in -must be identical to those passed in (and returned) by the mapping -API. +must be identical to those passed to (and returned by) dma_map_single(). :: @@ -376,10 +322,10 @@ action (e.g. reduce current DMA mapping usage or delay and try again later). dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction) -Returns: the number of DMA address segments mapped (this may be shorter -than <nents> passed in if some elements of the scatter/gather list are -physically or virtually adjacent and an IOMMU maps them with a single -entry). +Maps a scatter/gather list for DMA. Returns the number of DMA address segments +mapped, which may be smaller than <nents> passed in if several consecutive +sglist entries are merged (e.g. with an IOMMU, or if some adjacent segments +just happen to be physically contiguous). Please note that the sg cannot be mapped again if it has been mapped once. The mapping process is allowed to destroy information in the sg. 
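To make the failure-checking requirement above concrete, a minimal streaming-mapping sketch follows; the buffer is assumed to come from a physically contiguous source such as kmalloc(), as required::

    dma_addr_t dma_handle;

    dma_handle = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, dma_handle))
            return -ENOMEM;         /* mapping can fail; always check */

    /* ... start the device transfer using dma_handle ... */

    dma_unmap_single(dev, dma_handle, size, DMA_TO_DEVICE);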
@@ -403,9 +349,8 @@ With scatterlists, you use the resulting mapping like this:: where nents is the number of entries in the sglist. The implementation is free to merge several consecutive sglist entries -into one (e.g. with an IOMMU, or if several pages just happen to be -physically contiguous) and returns the actual number of sg entries it -mapped them to. On failure 0, is returned. +into one. The returned number is the actual number of sg entries it +mapped them to. On failure, 0 is returned. Then you should loop count times (note: this can be less than nents times) and use sg_dma_address() and sg_dma_len() macros where you previously @@ -530,6 +475,77 @@ routines, e.g.::: .... } +Part Ie - IOVA-based DMA mappings +--------------------------------- + +These APIs allow a very efficient mapping when using an IOMMU. They are an +optional path that requires extra code and are only recommended for drivers +where DMA mapping performance, or the space usage for storing the DMA addresses +matter. All the considerations from the previous section apply here as well. + +:: + + bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state, + phys_addr_t phys, size_t size); + +Is used to try to allocate IOVA space for mapping operation. If it returns +false this API can't be used for the given device and the normal streaming +DMA mapping API should be used. The ``struct dma_iova_state`` is allocated +by the driver and must be kept around until unmap time. + +:: + + static inline bool dma_use_iova(struct dma_iova_state *state) + +Can be used by the driver to check if the IOVA-based API is used after a +call to dma_iova_try_alloc. This can be useful in the unmap path. + +:: + + int dma_iova_link(struct device *dev, struct dma_iova_state *state, + phys_addr_t phys, size_t offset, size_t size, + enum dma_data_direction dir, unsigned long attrs); + +Is used to link ranges to the IOVA previously allocated. The start of all +but the first call to dma_iova_link for a given state must be aligned +to the DMA merge boundary returned by ``dma_get_merge_boundary())``, and +the size of all but the last range must be aligned to the DMA merge boundary +as well. + +:: + + int dma_iova_sync(struct device *dev, struct dma_iova_state *state, + size_t offset, size_t size); + +Must be called to sync the IOMMU page tables for IOVA-range mapped by one or +more calls to ``dma_iova_link()``. + +For drivers that use a one-shot mapping, all ranges can be unmapped and the +IOVA freed by calling: + +:: + + void dma_iova_destroy(struct device *dev, struct dma_iova_state *state, + size_t mapped_len, enum dma_data_direction dir, + unsigned long attrs); + +Alternatively drivers can dynamically manage the IOVA space by unmapping +and mapping individual regions. In that case + +:: + + void dma_iova_unlink(struct device *dev, struct dma_iova_state *state, + size_t offset, size_t size, enum dma_data_direction dir, + unsigned long attrs); + +is used to unmap a range previously mapped, and + +:: + + void dma_iova_free(struct device *dev, struct dma_iova_state *state); + +is used to free the IOVA space. All regions must have been unmapped using +``dma_iova_unlink()`` before calling ``dma_iova_free()``. Part II - Non-coherent DMA allocations -------------------------------------- @@ -704,19 +720,19 @@ memory or doing partial flushes. of two for easy alignment. 
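Returning to the IOVA-based interfaces added in Part Ie above, the following single-range sketch is built only from the signatures shown there; the error codes and the fall-back strategy are assumptions::

    struct dma_iova_state state = {};
    int ret;

    if (!dma_iova_try_alloc(dev, &state, phys, size))
            return -EOPNOTSUPP;     /* fall back to dma_map_single()/dma_map_sg() */

    ret = dma_iova_link(dev, &state, phys, 0, size, DMA_TO_DEVICE, 0);
    if (ret) {
            dma_iova_free(dev, &state);     /* nothing linked yet */
            return ret;
    }

    ret = dma_iova_sync(dev, &state, 0, size);
    if (ret) {
            dma_iova_destroy(dev, &state, size, DMA_TO_DEVICE, 0);
            return ret;
    }

    /* ... the device performs DMA on the linked range ... */

    dma_iova_destroy(dev, &state, size, DMA_TO_DEVICE, 0);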
-Part III - Debug drivers use of the DMA-API +Part III - Debug drivers use of the DMA API ------------------------------------------- -The DMA-API as described above has some constraints. DMA addresses must be +The DMA API as described above has some constraints. DMA addresses must be released with the corresponding function with the same size for example. With the advent of hardware IOMMUs it becomes more and more important that drivers do not violate those constraints. In the worst case such a violation can result in data corruption up to destroyed filesystems. -To debug drivers and find bugs in the usage of the DMA-API checking code can +To debug drivers and find bugs in the usage of the DMA API checking code can be compiled into the kernel which will tell the developer about those violations. If your architecture supports it you can select the "Enable -debugging of DMA-API usage" option in your kernel configuration. Enabling this +debugging of DMA API usage" option in your kernel configuration. Enabling this option has a performance impact. Do not enable it in production kernels. If you boot the resulting kernel will contain code which does some bookkeeping @@ -755,7 +771,7 @@ example warning message may look like this:: <EOI> <4>---[ end trace f6435a98e2a38c0e ]--- The driver developer can find the driver and the device including a stacktrace -of the DMA-API call which caused this warning. +of the DMA API call which caused this warning. Per default only the first error will result in a warning message. All other errors will only silently counted. This limitation exist to prevent the code @@ -763,7 +779,7 @@ from flooding your kernel log. To support debugging a device driver this can be disabled via debugfs. See the debugfs interface documentation below for details. -The debugfs directory for the DMA-API debugging code is called dma-api/. In +The debugfs directory for the DMA API debugging code is called dma-api/. In this directory the following files can currently be found: =============================== =============================================== @@ -811,7 +827,7 @@ dma-api/driver_filter You can write a name of a driver into this file If you have this code compiled into your kernel it will be enabled by default. If you want to boot without the bookkeeping anyway you can provide -'dma_debug=off' as a boot parameter. This will disable DMA-API debugging. +'dma_debug=off' as a boot parameter. This will disable DMA API debugging. Notice that you can not enable it again at runtime. You have to reboot to do so. @@ -844,3 +860,9 @@ the driver. When driver does unmap, debug_dma_unmap() checks the flag and if this flag is still set, prints warning message that includes call trace that leads up to the unmap. This interface can be called from dma_mapping_error() routines to enable DMA mapping error check debugging. + +Functions and structures +======================== + +.. kernel-doc:: include/linux/scatterlist.h +.. kernel-doc:: lib/scatterlist.c diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst index a15f9b1767a2..71d8eedc0549 100644 --- a/Documentation/core-api/entry.rst +++ b/Documentation/core-api/entry.rst @@ -105,7 +105,7 @@ has to do extra work between the various steps. In such cases it has to ensure that enter_from_user_mode() is called first on entry and exit_to_user_mode() is called last on exit. -Do not nest syscalls. Nested systcalls will cause RCU and/or context tracking +Do not nest syscalls. 
Nested syscalls will cause RCU and/or context tracking to print a warning. KVM @@ -115,8 +115,8 @@ Entering or exiting guest mode is very similar to syscalls. From the host kernel point of view the CPU goes off into user space when entering the guest and returns to the kernel on exit. -kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode() -and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode(). +guest_state_enter_irqoff() is a KVM-specific variant of exit_to_user_mode() +and guest_state_exit_irqoff() is the KVM variant of enter_from_user_mode(). The state operations have the same ordering. Task work handling is done separately for guest at the boundary of the diff --git a/Documentation/core-api/folio_queue.rst b/Documentation/core-api/folio_queue.rst index 1fe7a9bc4b8d..83cfbc157e49 100644 --- a/Documentation/core-api/folio_queue.rst +++ b/Documentation/core-api/folio_queue.rst @@ -151,19 +151,16 @@ The marks can be set by:: void folioq_mark(struct folio_queue *folioq, unsigned int slot); void folioq_mark2(struct folio_queue *folioq, unsigned int slot); - void folioq_mark3(struct folio_queue *folioq, unsigned int slot); Cleared by:: void folioq_unmark(struct folio_queue *folioq, unsigned int slot); void folioq_unmark2(struct folio_queue *folioq, unsigned int slot); - void folioq_unmark3(struct folio_queue *folioq, unsigned int slot); And the marks can be queried by:: bool folioq_is_marked(const struct folio_queue *folioq, unsigned int slot); bool folioq_is_marked2(const struct folio_queue *folioq, unsigned int slot); - bool folioq_is_marked3(const struct folio_queue *folioq, unsigned int slot); The marks can be used for any purpose and are not interpreted by this API. diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst index 25f94dfd66fa..582bde9bf5a9 100644 --- a/Documentation/core-api/genericirq.rst +++ b/Documentation/core-api/genericirq.rst @@ -410,8 +410,6 @@ which are used in the generic IRQ layer. .. kernel-doc:: include/linux/interrupt.h :internal: -.. kernel-doc:: include/linux/irqdomain.h - Public Functions Provided ========================= diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index e9789bd381d8..a03a99c2cac5 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -54,6 +54,7 @@ Library functionality that is used throughout the kernel. union_find min_heap parser + list Low level entry and exit ======================== @@ -115,6 +116,7 @@ more memory-management documentation in Documentation/mm/index.rst. pin_user_pages boot-time-mm gfp_mask-from-fs-io + kho/index Interfaces for kernel debugging =============================== diff --git a/Documentation/core-api/irq/concepts.rst b/Documentation/core-api/irq/concepts.rst index 4273806a606b..7c4564f3cbdf 100644 --- a/Documentation/core-api/irq/concepts.rst +++ b/Documentation/core-api/irq/concepts.rst @@ -2,23 +2,24 @@ What is an IRQ? =============== -An IRQ is an interrupt request from a device. -Currently they can come in over a pin, or over a packet. -Several devices may be connected to the same pin thus -sharing an IRQ. +An IRQ is an interrupt request from a device. Currently, they can come +in over a pin, or over a packet. Several devices may be connected to +the same pin thus sharing an IRQ. Such as on legacy PCI bus: All devices +typically share 4 lanes/pins. Note that each device can request an +interrupt on each of the lanes. 
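As a small illustration of several devices sharing one interrupt line, a sketch of a shared handler follows; the ``foo_`` names and the status-register check are hypothetical, not part of this patch::

    #include <linux/interrupt.h>

    static irqreturn_t foo_isr(int irq, void *dev_id)
    {
            struct foo_device *foo = dev_id;

            if (!foo_interrupt_pending(foo))        /* hypothetical status check */
                    return IRQ_NONE;                /* another device on the line */

            /* ... acknowledge and handle our interrupt ... */
            return IRQ_HANDLED;
    }

    /* at probe time: the pin may be shared with other devices */
    err = request_irq(irq, foo_isr, IRQF_SHARED, "foo", foo);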
An IRQ number is a kernel identifier used to talk about a hardware -interrupt source. Typically this is an index into the global irq_desc -array, but except for what linux/interrupt.h implements the details -are architecture specific. +interrupt source. Typically, this is an index into the global irq_desc +array or sparse_irqs tree. But except for what linux/interrupt.h +implements, the details are architecture specific. An IRQ number is an enumeration of the possible interrupt sources on a -machine. Typically what is enumerated is the number of input pins on -all of the interrupt controller in the system. In the case of ISA -what is enumerated are the 16 input pins on the two i8259 interrupt -controllers. +machine. Typically, what is enumerated is the number of input pins on +all of the interrupt controllers in the system. In the case of ISA, +what is enumerated are the 8 input pins on each of the two i8259 +interrupt controllers. Architectures can assign additional meaning to the IRQ numbers, and -are encouraged to in the case where there is any manual configuration -of the hardware involved. The ISA IRQs are a classic example of +are encouraged to in the case where there is any manual configuration +of the hardware involved. The ISA IRQs are a classic example of assigning this kind of additional meaning. diff --git a/Documentation/core-api/irq/irq-domain.rst b/Documentation/core-api/irq/irq-domain.rst index f88a6ee67a35..a01c6ead1bc0 100644 --- a/Documentation/core-api/irq/irq-domain.rst +++ b/Documentation/core-api/irq/irq-domain.rst @@ -1,59 +1,77 @@ =============================================== -The irq_domain interrupt number mapping library +The irq_domain Interrupt Number Mapping Library =============================================== The current design of the Linux kernel uses a single large number -space where each separate IRQ source is assigned a different number. -This is simple when there is only one interrupt controller, but in -systems with multiple interrupt controllers the kernel must ensure +space where each separate IRQ source is assigned a unique number. +This is simple when there is only one interrupt controller. But in +systems with multiple interrupt controllers, the kernel must ensure that each one gets assigned non-overlapping allocations of Linux IRQ numbers. The number of interrupt controllers registered as unique irqchips -show a rising tendency: for example subdrivers of different kinds +shows a rising tendency. For example, subdrivers of different kinds such as GPIO controllers avoid reimplementing identical callback mechanisms as the IRQ core system by modelling their interrupt -handlers as irqchips, i.e. in effect cascading interrupt controllers. +handlers as irqchips. I.e. in effect cascading interrupt controllers. -Here the interrupt number loose all kind of correspondence to -hardware interrupt numbers: whereas in the past, IRQ numbers could -be chosen so they matched the hardware IRQ line into the root -interrupt controller (i.e. the component actually fireing the -interrupt line to the CPU) nowadays this number is just a number. +So in the past, IRQ numbers could be chosen so that they match the +hardware IRQ line into the root interrupt controller (i.e. the +component actually firing the interrupt line to the CPU). Nowadays, +this number is just a number and the number loose all kind of +correspondence to hardware interrupt numbers. 
-For this reason we need a mechanism to separate controller-local -interrupt numbers, called hardware irq's, from Linux IRQ numbers. +For this reason, we need a mechanism to separate controller-local +interrupt numbers, called hardware IRQs, from Linux IRQ numbers. The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of -irq numbers, but they don't provide any support for reverse mapping of +IRQ numbers, but they don't provide any support for reverse mapping of the controller-local IRQ (hwirq) number into the Linux IRQ number space. -The irq_domain library adds mapping between hwirq and IRQ numbers on -top of the irq_alloc_desc*() API. An irq_domain to manage mapping is -preferred over interrupt controller drivers open coding their own +The irq_domain library adds a mapping between hwirq and IRQ numbers on +top of the irq_alloc_desc*() API. An irq_domain to manage the mapping +is preferred over interrupt controller drivers open coding their own reverse mapping scheme. -irq_domain also implements translation from an abstract irq_fwspec -structure to hwirq numbers (Device Tree and ACPI GSI so far), and can -be easily extended to support other IRQ topology data sources. +irq_domain also implements a translation from an abstract struct +irq_fwspec to hwirq numbers (Device Tree, non-DT firmware node, ACPI +GSI, and software node so far), and can be easily extended to support +other IRQ topology data sources. The implementation is performed +without any extra platform support code. -irq_domain usage +irq_domain Usage ================ - -An interrupt controller driver creates and registers an irq_domain by -calling one of the irq_domain_add_*() or irq_domain_create_*() functions -(each mapping method has a different allocator function, more on that later). -The function will return a pointer to the irq_domain on success. The caller -must provide the allocator function with an irq_domain_ops structure. +struct irq_domain could be defined as an irq domain controller. That +is, it handles the mapping between hardware and virtual interrupt +numbers for a given interrupt domain. The domain structure is +generally created by the PIC code for a given PIC instance (though a +domain can cover more than one PIC if they have a flat number model). +It is the domain callbacks that are responsible for setting the +irq_chip on a given irq_desc after it has been mapped. + +The host code and data structures use a fwnode_handle pointer to +identify the domain. In some cases, and in order to preserve source +code compatibility, this fwnode pointer is "upgraded" to a DT +device_node. For those firmware infrastructures that do not provide a +unique identifier for an interrupt controller, the irq_domain code +offers a fwnode allocator. + +An interrupt controller driver creates and registers a struct irq_domain +by calling one of the irq_domain_create_*() functions (each mapping +method has a different allocator function, more on that later). The +function will return a pointer to the struct irq_domain on success. The +caller must provide the allocator function with a struct irq_domain_ops +pointer. In most cases, the irq_domain will begin empty without any mappings between hwirq and IRQ numbers. Mappings are added to the irq_domain by calling irq_create_mapping() which accepts the irq_domain and a -hwirq number as arguments. 
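A hedged sketch of the creation and mapping flow described above, using the irq_domain_create_*() naming this patch standardizes on; the ``foo_`` chip, the 32-line count and ``priv`` are assumptions::

    static int foo_irq_map(struct irq_domain *d, unsigned int virq,
                           irq_hw_number_t hwirq)
    {
            irq_set_chip_and_handler(virq, &foo_irq_chip, handle_level_irq);
            irq_set_chip_data(virq, d->host_data);
            return 0;
    }

    static const struct irq_domain_ops foo_domain_ops = {
            .map   = foo_irq_map,
            .xlate = irq_domain_xlate_onecell,
    };

    /* at probe time: a controller with 32 local interrupt lines */
    domain = irq_domain_create_linear(dev_fwnode(dev), 32,
                                      &foo_domain_ops, priv);
    if (!domain)
            return -ENOMEM;

    /* translate a controller-local hwirq into a Linux IRQ number */
    virq = irq_create_mapping(domain, hwirq);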
If a mapping for the hwirq doesn't already -exist then it will allocate a new Linux irq_desc, associate it with -the hwirq, and call the .map() callback so the driver can perform any -required hardware setup. +hwirq number as arguments. If a mapping for the hwirq doesn't already +exist, irq_create_mapping() allocates a new Linux irq_desc, associates +it with the hwirq, and calls the :c:member:`irq_domain_ops.map()` +callback. In there, the driver can perform any required hardware +setup. Once a mapping has been established, it can be retrieved or used via a variety of methods: @@ -63,8 +81,6 @@ variety of methods: mapping. - irq_find_mapping() returns a Linux IRQ number for a given domain and hwirq number, and 0 if there was no mapping -- irq_linear_revmap() is now identical to irq_find_mapping(), and is - deprecated - generic_handle_domain_irq() handles an interrupt described by a domain and a hwirq number @@ -77,9 +93,10 @@ be allocated. If the driver has the Linux IRQ number or the irq_data pointer, and needs to know the associated hwirq number (such as in the irq_chip -callbacks) then it can be directly obtained from irq_data->hwirq. +callbacks) then it can be directly obtained from +:c:member:`irq_data.hwirq`. -Types of irq_domain mappings +Types of irq_domain Mappings ============================ There are several mechanisms available for reverse mapping from hwirq @@ -92,7 +109,6 @@ Linear :: - irq_domain_add_linear() irq_domain_create_linear() The linear reverse map maintains a fixed size table indexed by the @@ -105,19 +121,13 @@ map are fixed time lookup for IRQ numbers, and irq_descs are only allocated for in-use IRQs. The disadvantage is that the table must be as large as the largest possible hwirq number. -irq_domain_add_linear() and irq_domain_create_linear() are functionally -equivalent, except for the first argument is different - the former -accepts an Open Firmware specific 'struct device_node', while the latter -accepts a more general abstraction 'struct fwnode_handle'. - -The majority of drivers should use the linear map. +The majority of drivers should use the Linear map. Tree ---- :: - irq_domain_add_tree() irq_domain_create_tree() The irq_domain maintains a radix tree map from hwirq numbers to Linux @@ -129,11 +139,6 @@ since it doesn't need to allocate a table as large as the largest hwirq number. The disadvantage is that hwirq to IRQ number lookup is dependent on how many entries are in the table. -irq_domain_add_tree() and irq_domain_create_tree() are functionally -equivalent, except for the first argument is different - the former -accepts an Open Firmware specific 'struct device_node', while the latter -accepts a more general abstraction 'struct fwnode_handle'. - Very few drivers should need this mapping. No Map @@ -141,7 +146,7 @@ No Map :: - irq_domain_add_nomap() + irq_domain_create_nomap() The No Map mapping is to be used when the hwirq number is programmable in the hardware. In this case it is best to program the @@ -159,8 +164,6 @@ Legacy :: - irq_domain_add_simple() - irq_domain_add_legacy() irq_domain_create_simple() irq_domain_create_legacy() @@ -189,13 +192,13 @@ supported. For example, ISA controllers would use the legacy map for mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ numbers. -Most users of legacy mappings should use irq_domain_add_simple() or -irq_domain_create_simple() which will use a legacy domain only if an IRQ range -is supplied by the system and will otherwise use a linear domain mapping. 
-The semantics of this call are such that if an IRQ range is specified then -descriptors will be allocated on-the-fly for it, and if no range is -specified it will fall through to irq_domain_add_linear() or -irq_domain_create_linear() which means *no* irq descriptors will be allocated. +Most users of legacy mappings should use irq_domain_create_simple() +which will use a legacy domain only if an IRQ range is supplied by the +system and will otherwise use a linear domain mapping. The semantics of +this call are such that if an IRQ range is specified then descriptors +will be allocated on-the-fly for it, and if no range is specified it +will fall through to irq_domain_create_linear() which means *no* irq +descriptors will be allocated. A typical use case for simple domains is where an irqchip provider is supporting both dynamic and static IRQ assignments. @@ -206,13 +209,7 @@ that the driver using the simple domain call irq_create_mapping() before any irq_find_mapping() since the latter will actually work for the static IRQ assignment case. -irq_domain_add_simple() and irq_domain_create_simple() as well as -irq_domain_add_legacy() and irq_domain_create_legacy() are functionally -equivalent, except for the first argument is different - the former -accepts an Open Firmware specific 'struct device_node', while the latter -accepts a more general abstraction 'struct fwnode_handle'. - -Hierarchy IRQ domain +Hierarchy IRQ Domain -------------------- On some architectures, there may be multiple interrupt controllers @@ -253,20 +250,40 @@ There are four major interfaces to use hierarchy irq_domain: 4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware to stop delivering the interrupt. -Following changes are needed to support hierarchy irq_domain: +The following is needed to support hierarchy irq_domain: -1) a new field 'parent' is added to struct irq_domain; it's used to +1) The :c:member:`parent` field in struct irq_domain is used to maintain irq_domain hierarchy information. -2) a new field 'parent_data' is added to struct irq_data; it's used to - build hierarchy irq_data to match hierarchy irq_domains. The irq_data - is used to store irq_domain pointer and hardware irq number. -3) new callbacks are added to struct irq_domain_ops to support hierarchy - irq_domain operations. - -With support of hierarchy irq_domain and hierarchy irq_data ready, an -irq_domain structure is built for each interrupt controller, and an +2) The :c:member:`parent_data` field in struct irq_data is used to + build hierarchy irq_data to match hierarchy irq_domains. The + irq_data is used to store irq_domain pointer and hardware irq + number. +3) The :c:member:`alloc()`, :c:member:`free()`, and other callbacks in + struct irq_domain_ops to support hierarchy irq_domain operations. + +With the support of hierarchy irq_domain and hierarchy irq_data ready, +an irq_domain structure is built for each interrupt controller, and an irq_data structure is allocated for each irq_domain associated with an -IRQ. Now we could go one step further to support stacked(hierarchy) +IRQ. + +For an interrupt controller driver to support hierarchy irq_domain, it +needs to: + +1) Implement irq_domain_ops.alloc() and irq_domain_ops.free() +2) Optionally, implement irq_domain_ops.activate() and + irq_domain_ops.deactivate(). +3) Optionally, implement an irq_chip to manage the interrupt controller + hardware. +4) There is no need to implement irq_domain_ops.map() and + irq_domain_ops.unmap(). They are unused with hierarchy irq_domain. 
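To make the four driver-side requirements above concrete, a heavily simplified ``.alloc()`` sketch follows; the single-cell fwspec layout, ``foo_irq_chip`` and the pass-through of ``arg`` to the parent domain are all assumptions, not part of this patch::

    static int foo_domain_alloc(struct irq_domain *domain, unsigned int virq,
                                unsigned int nr_irqs, void *arg)
    {
            struct irq_fwspec *fwspec = arg;
            irq_hw_number_t hwirq = fwspec->param[0];   /* assumed one-cell layout */
            int i, ret;

            /* let the parent domain (e.g. the root controller) allocate first */
            ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
            if (ret)
                    return ret;

            for (i = 0; i < nr_irqs; i++)
                    irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
                                                  &foo_irq_chip,
                                                  domain->host_data);
            return 0;
    }

    static const struct irq_domain_ops foo_hierarchy_ops = {
            .alloc = foo_domain_alloc,
            .free  = irq_domain_free_irqs_common,
    };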
+ +Note the hierarchy irq_domain is in no way x86-specific, and is +heavily used to support other architectures, such as ARM, ARM64 etc. + +Stacked irq_chip +~~~~~~~~~~~~~~~~ + +Now, we could go one step further to support stacked (hierarchy) irq_chip. That is, an irq_chip is associated with each irq_data along the hierarchy. A child irq_chip may implement a required action by itself or by cooperating with its parent irq_chip. @@ -276,22 +293,28 @@ with the hardware managed by itself and may ask for services from its parent irq_chip when needed. So we could achieve a much cleaner software architecture. -For an interrupt controller driver to support hierarchy irq_domain, it -needs to: - -1) Implement irq_domain_ops.alloc and irq_domain_ops.free -2) Optionally implement irq_domain_ops.activate and - irq_domain_ops.deactivate. -3) Optionally implement an irq_chip to manage the interrupt controller - hardware. -4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap, - they are unused with hierarchy irq_domain. - -Hierarchy irq_domain is in no way x86 specific, and is heavily used to -support other architectures, such as ARM, ARM64 etc. - Debugging ========= Most of the internals of the IRQ subsystem are exposed in debugfs by turning CONFIG_GENERIC_IRQ_DEBUGFS on. + +Structures and Public Functions Provided +======================================== + +This chapter contains the autogenerated documentation of the structures +and exported kernel API functions which are used for IRQ domains. + +.. kernel-doc:: include/linux/irqdomain.h + +.. kernel-doc:: kernel/irq/irqdomain.c + :export: + +Internal Functions Provided +=========================== + +This chapter contains the autogenerated documentation of the internal +functions. + +.. kernel-doc:: kernel/irq/irqdomain.c + :internal: diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index ae92a2571388..e8211c4ca662 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -3,12 +3,6 @@ The Linux Kernel API ==================== -List Management Functions -========================= - -.. kernel-doc:: include/linux/list.h - :internal: - Basic C Library Functions ========================= @@ -136,26 +130,28 @@ Arithmetic Overflow Checking CRC Functions ------------- -.. kernel-doc:: lib/crc4.c +.. kernel-doc:: lib/crc/crc4.c :export: -.. kernel-doc:: lib/crc7.c +.. kernel-doc:: lib/crc/crc7.c :export: -.. kernel-doc:: lib/crc8.c +.. kernel-doc:: lib/crc/crc8.c :export: -.. kernel-doc:: lib/crc16.c +.. kernel-doc:: lib/crc/crc16.c :export: -.. kernel-doc:: lib/crc32.c - -.. kernel-doc:: lib/crc-ccitt.c +.. kernel-doc:: lib/crc/crc-ccitt.c :export: -.. kernel-doc:: lib/crc-itu-t.c +.. kernel-doc:: lib/crc/crc-itu-t.c :export: +.. kernel-doc:: include/linux/crc32.h + +.. kernel-doc:: include/linux/crc64.h + Base 2 log and power Functions ------------------------------ diff --git a/Documentation/core-api/kho/bindings/kho.yaml b/Documentation/core-api/kho/bindings/kho.yaml new file mode 100644 index 000000000000..11e8ab7b219d --- /dev/null +++ b/Documentation/core-api/kho/bindings/kho.yaml @@ -0,0 +1,43 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Kexec HandOver (KHO) root tree + +maintainers: + - Mike Rapoport <rppt@kernel.org> + - Changyuan Lyu <changyuanl@google.com> + +description: | + System memory preserved by KHO across kexec. 
+ +properties: + compatible: + enum: + - kho-v1 + + preserved-memory-map: + description: | + physical address (u64) of an in-memory structure describing all preserved + folios and memory ranges. + +patternProperties: + "$[0-9a-f_]+^": + $ref: sub-fdt.yaml# + description: physical address of a KHO user's own FDT. + +required: + - compatible + - preserved-memory-map + +additionalProperties: false + +examples: + - | + kho { + compatible = "kho-v1"; + preserved-memory-map = <0xf0be16 0x1000000>; + + memblock { + fdt = <0x80cc16 0x1000000>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/memblock.yaml b/Documentation/core-api/kho/bindings/memblock/memblock.yaml new file mode 100644 index 000000000000..d388c28eb91d --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/memblock.yaml @@ -0,0 +1,39 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory + +maintainers: + - Mike Rapoport <rppt@kernel.org> + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + The post-KHO kernel can then consume these reservations and they are + guaranteed to have the same physical address. + +properties: + compatible: + enum: + - reserve-mem-v1 + +patternProperties: + "$[0-9a-f_]+^": + $ref: reserve-mem.yaml# + description: reserved memory regions + +required: + - compatible + +additionalProperties: false + +examples: + - | + memblock { + compatible = "memblock-v1"; + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml new file mode 100644 index 000000000000..10282d3d1bcd --- /dev/null +++ b/Documentation/core-api/kho/bindings/memblock/reserve-mem.yaml @@ -0,0 +1,40 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: Memblock reserved memory regions + +maintainers: + - Mike Rapoport <rppt@kernel.org> + +description: | + Memblock can serialize its current memory reservations created with + reserve_mem command line option across kexec through KHO. + This object describes each such region. + +properties: + compatible: + enum: + - reserve-mem-v1 + + start: + description: | + physical address (u64) of the reserved memory region. + + size: + description: | + size (u64) of the reserved memory region. + +required: + - compatible + - start + - size + +additionalProperties: false + +examples: + - | + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; diff --git a/Documentation/core-api/kho/bindings/sub-fdt.yaml b/Documentation/core-api/kho/bindings/sub-fdt.yaml new file mode 100644 index 000000000000..b9a3d2d24850 --- /dev/null +++ b/Documentation/core-api/kho/bindings/sub-fdt.yaml @@ -0,0 +1,27 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +title: KHO users' FDT address + +maintainers: + - Mike Rapoport <rppt@kernel.org> + - Changyuan Lyu <changyuanl@google.com> + +description: | + Physical address of an FDT blob registered by a KHO user. + +properties: + fdt: + description: | + physical address (u64) of an FDT blob. 
+ +required: + - fdt + +additionalProperties: false + +examples: + - | + memblock { + fdt = <0x80cc16 0x1000000>; + }; diff --git a/Documentation/core-api/kho/concepts.rst b/Documentation/core-api/kho/concepts.rst new file mode 100644 index 000000000000..36d5c05cfb30 --- /dev/null +++ b/Documentation/core-api/kho/concepts.rst @@ -0,0 +1,74 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later +.. _kho-concepts: + +======================= +Kexec Handover Concepts +======================= + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +It introduces multiple concepts: + +KHO FDT +======= + +Every KHO kexec carries a KHO specific flattened device tree (FDT) blob +that describes preserved memory regions. These regions contain either +serialized subsystem states, or in-memory data that shall not be touched +across kexec. After KHO, subsystems can retrieve and restore preserved +memory regions from KHO FDT. + +KHO only uses the FDT container format and libfdt library, but does not +adhere to the same property semantics that normal device trees do: Properties +are passed in native endianness and standardized properties like ``regs`` and +``ranges`` do not exist, hence there are no ``#...-cells`` properties. + +KHO is still under development. The FDT schema is unstable and would change +in the future. + +Scratch Regions +=============== + +To boot into kexec, we need to have a physically contiguous memory range that +contains no handed over memory. Kexec then places the target kernel and initrd +into that region. The new kernel exclusively uses this region for memory +allocations before during boot up to the initialization of the page allocator. + +We guarantee that we always have such regions through the scratch regions: On +first boot KHO allocates several physically contiguous memory regions. Since +after kexec these regions will be used by early memory allocations, there is a +scratch region per NUMA node plus a scratch region to satisfy allocations +requests that do not require particular NUMA node assignment. +By default, size of the scratch region is calculated based on amount of memory +allocated during boot. The ``kho_scratch`` kernel command line option may be +used to explicitly define size of the scratch regions. +The scratch regions are declared as CMA when page allocator is initialized so +that their memory can be used during system lifetime. CMA gives us the +guarantee that no handover pages land in that region, because handover pages +must be at a static physical memory location and CMA enforces that only +movable pages can be located inside. + +After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and +instead reuse the exact same region that was originally allocated. This allows +us to recursively execute any amount of KHO kexecs. Because we used this region +for boot memory allocations and as target memory for kexec blobs, some parts +of that memory region may be reserved. These reservations are irrelevant for +the next KHO, because kexec can overwrite even the original kernel. + +.. _kho-finalization-phase: + +KHO finalization phase +====================== + +To enable user space based kexec file loader, the kernel needs to be able to +provide the FDT that describes the current kernel's state before +performing the actual kexec. The process of generating that FDT is +called serialization. 
When the FDT is generated, some properties +of the system may become immutable because they are already written down +in the FDT. That state is called the KHO finalization phase. + +Public API +========== +.. kernel-doc:: kernel/kexec_handover.c + :export: diff --git a/Documentation/core-api/kho/fdt.rst b/Documentation/core-api/kho/fdt.rst new file mode 100644 index 000000000000..62505285d60d --- /dev/null +++ b/Documentation/core-api/kho/fdt.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======= +KHO FDT +======= + +KHO uses the flattened device tree (FDT) container format and libfdt +library to create and parse the data that is passed between the +kernels. The properties in KHO FDT are stored in native format. +It includes the physical address of an in-memory structure describing +all preserved memory regions, as well as physical addresses of KHO users' +own FDTs. Interpreting those sub FDTs is the responsibility of KHO users. + +KHO nodes and properties +======================== + +Property ``preserved-memory-map`` +--------------------------------- + +KHO saves a special property named ``preserved-memory-map`` under the root node. +This node contains the physical address of an in-memory structure for KHO to +preserve memory regions across kexec. + +Property ``compatible`` +----------------------- + +The ``compatible`` property determines compatibility between the kernel +that created the KHO FDT and the kernel that attempts to load it. +If the kernel that loads the KHO FDT is not compatible with it, the entire +KHO process will be bypassed. + +Property ``fdt`` +---------------- + +Generally, a KHO user serialize its state into its own FDT and instructs +KHO to preserve the underlying memory, such that after kexec, the new kernel +can recover its state from the preserved FDT. + +A KHO user thus can create a node in KHO root tree and save the physical address +of its own FDT in that node's property ``fdt`` . + +Examples +======== + +The following example demonstrates KHO FDT that preserves two memory +regions created with ``reserve_mem`` kernel command line parameter:: + + /dts-v1/; + + / { + compatible = "kho-v1"; + + preserved-memory-map = <0x40be16 0x1000000>; + + memblock { + fdt = <0x1517 0x1000000>; + }; + }; + +where the ``memblock`` node contains an FDT that is requested by the +subsystem memblock for preservation. The FDT contains the following +serialized data:: + + /dts-v1/; + + / { + compatible = "memblock-v1"; + + n1 { + compatible = "reserve-mem-v1"; + start = <0xc06b 0x4000000>; + size = <0x04 0x00>; + }; + + n2 { + compatible = "reserve-mem-v1"; + start = <0xc067 0x4000000>; + size = <0x04 0x00>; + }; + }; diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst new file mode 100644 index 000000000000..0c63b0c5c143 --- /dev/null +++ b/Documentation/core-api/kho/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +======================== +Kexec Handover Subsystem +======================== + +.. toctree:: + :maxdepth: 1 + + concepts + fdt + +.. only:: subproject and html diff --git a/Documentation/core-api/list.rst b/Documentation/core-api/list.rst new file mode 100644 index 000000000000..86873ce9adbf --- /dev/null +++ b/Documentation/core-api/list.rst @@ -0,0 +1,776 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +===================== +Linked Lists in Linux +===================== + +:Author: Nicolas Frattaroli <nicolas.frattaroli@collabora.com> + +.. 
contents:: + +Introduction +============ + +Linked lists are one of the most basic data structures used in many programs. +The Linux kernel implements several different flavours of linked lists. The +purpose of this document is not to explain linked lists in general, but to show +new kernel developers how to use the Linux kernel implementations of linked +lists. + +Please note that while linked lists certainly are ubiquitous, they are rarely +the best data structure to use in cases where a simple array doesn't already +suffice. In particular, due to their poor data locality, linked lists are a bad +choice in situations where performance may be of consideration. Familiarizing +oneself with other in-kernel generic data structures, especially for concurrent +accesses, is highly encouraged. + +Linux implementation of doubly linked lists +=========================================== + +Linux's linked list implementations can be used by including the header file +``<linux/list.h>``. + +The doubly-linked list will likely be the most familiar to many readers. It's a +list that can efficiently be traversed forwards and backwards. + +The Linux kernel's doubly-linked list is circular in nature. This means that to +get from the head node to the tail, we can just travel one edge backwards. +Similarly, to get from the tail node to the head, we can simply travel forwards +"beyond" the tail and arrive back at the head. + +Declaring a node +---------------- + +A node in a doubly-linked list is declared by adding a struct list_head +member to the data structure you wish to be contained in the list: + +.. code-block:: c + + struct clown { + unsigned long long shoe_size; + const char *name; + struct list_head node; /* the aforementioned member */ + }; + +This may be an unfamiliar approach to some, as the classical explanation of a +linked list is a list node data structure with pointers to the previous and next +list node, as well the payload data. Linux chooses this approach because it +allows for generic list modification code regardless of what data structure is +contained within the list. Since the struct list_head member is not a pointer +but part of the data structure proper, the container_of() pattern can be used by +the list implementation to access the payload data regardless of its type, while +staying oblivious to what said type actually is. + +Declaring and initializing a list +--------------------------------- + +A doubly-linked list can then be declared as just another struct list_head, +and initialized with the LIST_HEAD_INIT() macro during initial assignment, or +with the INIT_LIST_HEAD() function later: + +.. code-block:: c + + struct clown_car { + int tyre_pressure[4]; + struct list_head clowns; /* Looks like a node! */ + }; + + /* ... Somewhere later in our driver ... */ + + static int circus_init(struct circus_priv *circus) + { + struct clown_car other_car = { + .tyre_pressure = {10, 12, 11, 9}, + .clowns = LIST_HEAD_INIT(other_car.clowns) + }; + + INIT_LIST_HEAD(&circus->car.clowns); + + return 0; + } + +A further point of confusion to some may be that the list itself doesn't really +have its own type. The concept of the entire linked list and a +struct list_head member that points to other entries in the list are one and +the same. + +Adding nodes to the list +------------------------ + +Adding a node to the linked list is done through the list_add() macro. + +We'll return to our clown car example to illustrate how nodes get added to the +list: + +.. 
code-block:: c + + static int circus_fill_car(struct circus_priv *circus) + { + struct clown_car *car = &circus->car; + struct clown *grock; + struct clown *dimitri; + + /* State 1 */ + + grock = kzalloc(sizeof(*grock), GFP_KERNEL); + if (!grock) + return -ENOMEM; + grock->name = "Grock"; + grock->shoe_size = 1000; + + /* Note that we're adding the "node" member */ + list_add(&grock->node, &car->clowns); + + /* State 2 */ + + dimitri = kzalloc(sizeof(*dimitri), GFP_KERNEL); + if (!dimitri) + return -ENOMEM; + dimitri->name = "Dimitri"; + dimitri->shoe_size = 50; + + list_add(&dimitri->node, &car->clowns); + + /* State 3 */ + + return 0; + } + +In State 1, our list of clowns is still empty:: + + .------. + v | + .--------. | + | clowns |--' + '--------' + +This diagram shows the singular "clowns" node pointing at itself. In this +diagram, and all following diagrams, only the forward edges are shown, to aid in +clarity. + +In State 2, we've added Grock after the list head:: + + .--------------------. + v | + .--------. .-------. | + | clowns |---->| Grock |--' + '--------' '-------' + +This diagram shows the "clowns" node pointing at a new node labeled "Grock". +The Grock node is pointing back at the "clowns" node. + +In State 3, we've added Dimitri after the list head, resulting in the following:: + + .------------------------------------. + v | + .--------. .---------. .-------. | + | clowns |---->| Dimitri |---->| Grock |--' + '--------' '---------' '-------' + +This diagram shows the "clowns" node pointing at a new node labeled "Dimitri", +which then points at the node labeled "Grock". The "Grock" node still points +back at the "clowns" node. + +If we wanted to have Dimitri inserted at the end of the list instead, we'd use +list_add_tail(). Our code would then look like this: + +.. code-block:: c + + static int circus_fill_car(struct circus_priv *circus) + { + /* ... */ + + list_add_tail(&dimitri->node, &car->clowns); + + /* State 3b */ + + return 0; + } + +This results in the following list:: + + .------------------------------------. + v | + .--------. .-------. .---------. | + | clowns |---->| Grock |---->| Dimitri |--' + '--------' '-------' '---------' + +This diagram shows the "clowns" node pointing at the node labeled "Grock", +which points at the new node labeled "Dimitri". The node labeled "Dimitri" +points back at the "clowns" node. + +Traversing the list +------------------- + +To iterate the list, we can loop through all nodes within the list with +list_for_each(). + +In our clown example, this results in the following somewhat awkward code: + +.. code-block:: c + + static unsigned long long circus_get_max_shoe_size(struct circus_priv *circus) + { + unsigned long long res = 0; + struct clown *e; + struct list_head *cur; + + list_for_each(cur, &circus->car.clowns) { + e = list_entry(cur, struct clown, node); + if (e->shoe_size > res) + res = e->shoe_size; + } + + return res; + } + +The list_entry() macro internally uses the aforementioned container_of() to +retrieve the data structure instance that ``node`` is a member of. + +Note how the additional list_entry() call is a little awkward here. It's only +there because we're iterating through the ``node`` members, but we really want +to iterate through the payload, i.e. the ``struct clown`` that contains each +node's struct list_head. For this reason, there is a second macro: +list_for_each_entry() + +Using it would change our code to something like this: + +.. 
code-block:: c + + static unsigned long long circus_get_max_shoe_size(struct circus_priv *circus) + { + unsigned long long res = 0; + struct clown *e; + + list_for_each_entry(e, &circus->car.clowns, node) { + if (e->shoe_size > res) + res = e->shoe_size; + } + + return res; + } + +This eliminates the need for the list_entry() step, and our loop cursor is now +of the type of our payload. The macro is given the member name that corresponds +to the list's struct list_head within the clown data structure so that it can +still walk the list. + +Removing nodes from the list +---------------------------- + +The list_del() function can be used to remove entries from the list. It not only +removes the given entry from the list, but poisons the entry's ``prev`` and +``next`` pointers, so that unintended use of the entry after removal does not +go unnoticed. + +We can extend our previous example to remove one of the entries: + +.. code-block:: c + + static int circus_fill_car(struct circus_priv *circus) + { + /* ... */ + + list_add(&dimitri->node, &car->clowns); + + /* State 3 */ + + list_del(&dimitri->node); + + /* State 4 */ + + return 0; + } + +The result of this would be this:: + + .--------------------. + v | + .--------. .-------. | .---------. + | clowns |---->| Grock |--' | Dimitri | + '--------' '-------' '---------' + +This diagram shows the "clowns" node pointing at the node labeled "Grock", +which points back at the "clowns" node. Off to the side is a lone node labeled +"Dimitri", which has no arrows pointing anywhere. + +Note how the Dimitri node does not point to itself; its pointers are +intentionally set to a "poison" value that the list code refuses to traverse. + +If we wanted to reinitialize the removed node instead to make it point at itself +again like an empty list head, we can use list_del_init() instead: + +.. code-block:: c + + static int circus_fill_car(struct circus_priv *circus) + { + /* ... */ + + list_add(&dimitri->node, &car->clowns); + + /* State 3 */ + + list_del_init(&dimitri->node); + + /* State 4b */ + + return 0; + } + +This results in the deleted node pointing to itself again:: + + .--------------------. .-------. + v | v | + .--------. .-------. | .---------. | + | clowns |---->| Grock |--' | Dimitri |--' + '--------' '-------' '---------' + +This diagram shows the "clowns" node pointing at the node labeled "Grock", +which points back at the "clowns" node. Off to the side is a lone node labeled +"Dimitri", which points to itself. + +Traversing whilst removing nodes +-------------------------------- + +Deleting entries while we're traversing the list will cause problems if we use +list_for_each() and list_for_each_entry(), as deleting the current entry would +modify the ``next`` pointer of it, which means the traversal can't properly +advance to the next list entry. + +There is a solution to this however: list_for_each_safe() and +list_for_each_entry_safe(). These take an additional parameter of a pointer to +a struct list_head to use as temporary storage for the next entry during +iteration, solving the issue. + +An example of how to use it: + +.. code-block:: c + + static void circus_eject_insufficient_clowns(struct circus_priv *circus) + { + struct clown *e; + struct clown *n; /* temporary storage for safe iteration */ + + list_for_each_entry_safe(e, n, &circus->car.clowns, node) { + if (e->shoe_size < 500) + list_del(&e->node); + } + } + +Proper memory management (i.e. 
freeing the deleted node while making sure +nothing still references it) in this case is left as an exercise to the reader. + +Cutting a list +-------------- + +There are two helper functions to cut lists with. Both take elements from the +list ``head``, and replace the contents of the list ``list``. + +The first such function is list_cut_position(). It removes all list entries from +``head`` up to and including ``entry``, placing them in ``list`` instead. + +In this example, it's assumed we start with the following list:: + + .----------------------------------------------------------------. + v | + .--------. .-------. .---------. .-----. .---------. | + | clowns |---->| Grock |---->| Dimitri |---->| Pic |---->| Alfredo |--' + '--------' '-------' '---------' '-----' '---------' + +With the following code, every clown up to and including "Pic" is moved from +the "clowns" list head to a separate struct list_head initialized at local +stack variable ``retirement``: + +.. code-block:: c + + static void circus_retire_clowns(struct circus_priv *circus) + { + struct list_head retirement = LIST_HEAD_INIT(retirement); + struct clown *grock, *dimitri, *pic, *alfredo; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + list_cut_position(&retirement, &car->clowns, &pic->node); + + /* State 1 */ + } + +The resulting ``car->clowns`` list would be this:: + + .----------------------. + v | + .--------. .---------. | + | clowns |---->| Alfredo |--' + '--------' '---------' + +Meanwhile, the ``retirement`` list is transformed to the following:: + + .--------------------------------------------------. + v | + .------------. .-------. .---------. .-----. | + | retirement |---->| Grock |---->| Dimitri |---->| Pic |--' + '------------' '-------' '---------' '-----' + +The second function, list_cut_before(), is much the same, except it cuts before +the ``entry`` node, i.e. it removes all list entries from ``head`` up to but +excluding ``entry``, placing them in ``list`` instead. This example assumes the +same initial starting list as the previous example: + +.. code-block:: c + + static void circus_retire_clowns(struct circus_priv *circus) + { + struct list_head retirement = LIST_HEAD_INIT(retirement); + struct clown *grock, *dimitri, *pic, *alfredo; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + list_cut_before(&retirement, &car->clowns, &pic->node); + + /* State 1b */ + } + +The resulting ``car->clowns`` list would be this:: + + .----------------------------------. + v | + .--------. .-----. .---------. | + | clowns |---->| Pic |---->| Alfredo |--' + '--------' '-----' '---------' + +Meanwhile, the ``retirement`` list is transformed to the following:: + + .--------------------------------------. + v | + .------------. .-------. .---------. | + | retirement |---->| Grock |---->| Dimitri |--' + '------------' '-------' '---------' + +It should be noted that both functions will destroy links to any existing nodes +in the destination ``struct list_head *list``. + +Moving entries and partial lists +-------------------------------- + +The list_move() and list_move_tail() functions can be used to move an entry +from one list to another, to either the start or end respectively. + +In the following example, we'll assume we start with two lists ("clowns" and +"sidewalk" in the following initial state "State 0":: + + .----------------------------------------------------------------. + v | + .--------. .-------. .---------. 
.-----. .---------. | + | clowns |---->| Grock |---->| Dimitri |---->| Pic |---->| Alfredo |--' + '--------' '-------' '---------' '-----' '---------' + + .-------------------. + v | + .----------. .-----. | + | sidewalk |---->| Pio |--' + '----------' '-----' + +We apply the following example code to the two lists: + +.. code-block:: c + + static void circus_clowns_exit_car(struct circus_priv *circus) + { + struct list_head sidewalk = LIST_HEAD_INIT(sidewalk); + struct clown *grock, *dimitri, *pic, *alfredo, *pio; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + /* State 0 */ + + list_move(&pic->node, &sidewalk); + + /* State 1 */ + + list_move_tail(&dimitri->node, &sidewalk); + + /* State 2 */ + } + +In State 1, we arrive at the following situation:: + + .-----------------------------------------------------. + | | + v | + .--------. .-------. .---------. .---------. | + | clowns |---->| Grock |---->| Dimitri |---->| Alfredo |--' + '--------' '-------' '---------' '---------' + + .-------------------------------. + v | + .----------. .-----. .-----. | + | sidewalk |---->| Pic |---->| Pio |--' + '----------' '-----' '-----' + +In State 2, after we've moved Dimitri to the tail of sidewalk, the situation +changes as follows:: + + .-------------------------------------. + | | + v | + .--------. .-------. .---------. | + | clowns |---->| Grock |---->| Alfredo |--' + '--------' '-------' '---------' + + .-----------------------------------------------. + v | + .----------. .-----. .-----. .---------. | + | sidewalk |---->| Pic |---->| Pio |---->| Dimitri |--' + '----------' '-----' '-----' '---------' + +As long as the source and destination list head are part of the same list, we +can also efficiently bulk move a segment of the list to the tail end of the +list. We continue the previous example by adding a list_bulk_move_tail() after +State 2, moving Pic and Pio to the tail end of the sidewalk list. + +.. code-block:: c + + static void circus_clowns_exit_car(struct circus_priv *circus) + { + struct list_head sidewalk = LIST_HEAD_INIT(sidewalk); + struct clown *grock, *dimitri, *pic, *alfredo, *pio; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + /* State 0 */ + + list_move(&pic->node, &sidewalk); + + /* State 1 */ + + list_move_tail(&dimitri->node, &sidewalk); + + /* State 2 */ + + list_bulk_move_tail(&sidewalk, &pic->node, &pio->node); + + /* State 3 */ + } + +For the sake of brevity, only the altered "sidewalk" list at State 3 is depicted +in the following diagram:: + + .-----------------------------------------------. + v | + .----------. .---------. .-----. .-----. | + | sidewalk |---->| Dimitri |---->| Pic |---->| Pio |--' + '----------' '---------' '-----' '-----' + +Do note that list_bulk_move_tail() does not do any checking as to whether all +three supplied ``struct list_head *`` parameters really do belong to the same +list. If you use it outside the constraints the documentation gives, then the +result is a matter between you and the implementation. + +Rotating entries +---------------- + +A common write operation on lists, especially when using them as queues, is +to rotate it. A list rotation means entries at the front are sent to the back. + +For rotation, Linux provides us with two functions: list_rotate_left() and +list_rotate_to_front(). 
The former can be pictured like a bicycle chain, taking +the entry after the supplied ``struct list_head *`` and moving it to the tail, +which in essence means the entire list, due to its circular nature, rotates by +one position. + +The latter, list_rotate_to_front(), takes the same concept one step further: +instead of advancing the list by one entry, it advances it *until* the specified +entry is the new front. + +In the following example, our starting state, State 0, is the following:: + + .-----------------------------------------------------------------. + v | + .--------. .-------. .---------. .-----. .---------. .-----. | + | clowns |-->| Grock |-->| Dimitri |-->| Pic |-->| Alfredo |-->| Pio |-' + '--------' '-------' '---------' '-----' '---------' '-----' + +The example code being used to demonstrate list rotations is the following: + +.. code-block:: c + + static void circus_clowns_rotate(struct circus_priv *circus) + { + struct clown *grock, *dimitri, *pic, *alfredo, *pio; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + /* State 0 */ + + list_rotate_left(&car->clowns); + + /* State 1 */ + + list_rotate_to_front(&alfredo->node, &car->clowns); + + /* State 2 */ + + } + +In State 1, we arrive at the following situation:: + + .-----------------------------------------------------------------. + v | + .--------. .---------. .-----. .---------. .-----. .-------. | + | clowns |-->| Dimitri |-->| Pic |-->| Alfredo |-->| Pio |-->| Grock |-' + '--------' '---------' '-----' '---------' '-----' '-------' + +Next, after the list_rotate_to_front() call, we arrive in the following +State 2:: + + .-----------------------------------------------------------------. + v | + .--------. .---------. .-----. .-------. .---------. .-----. | + | clowns |-->| Alfredo |-->| Pio |-->| Grock |-->| Dimitri |-->| Pic |-' + '--------' '---------' '-----' '-------' '---------' '-----' + +As is hopefully evident from the diagrams, the entries in front of "Alfredo" +were cycled to the tail end of the list. + +Swapping entries +---------------- + +Another common operation is that two entries need to be swapped with each other. + +For this, Linux provides us with list_swap(). + +In the following example, we have a list with three entries, and swap two of +them. This is our starting state in "State 0":: + + .-----------------------------------------. + v | + .--------. .-------. .---------. .-----. | + | clowns |-->| Grock |-->| Dimitri |-->| Pic |-' + '--------' '-------' '---------' '-----' + +.. code-block:: c + + static void circus_clowns_swap(struct circus_priv *circus) + { + struct clown *grock, *dimitri, *pic; + struct clown_car *car = &circus->car; + + /* ... clown initialization, list adding ... */ + + /* State 0 */ + + list_swap(&dimitri->node, &pic->node); + + /* State 1 */ + } + +The resulting list at State 1 is the following:: + + .-----------------------------------------. + v | + .--------. .-------. .-----. .---------. | + | clowns |-->| Grock |-->| Pic |-->| Dimitri |-' + '--------' '-------' '-----' '---------' + +As is evident by comparing the diagrams, the "Pic" and "Dimitri" nodes have +traded places. + +Splicing two lists together +--------------------------- + +Say we have two lists, in the following example one represented by a list head +we call "knie" and one we call "stey". In a hypothetical circus acquisition, +the two list of clowns should be spliced together. 
The following is our +situation in "State 0":: + + .-----------------------------------------. + | | + v | + .------. .-------. .---------. .-----. | + | knie |-->| Grock |-->| Dimitri |-->| Pic |--' + '------' '-------' '---------' '-----' + + .-----------------------------. + v | + .------. .---------. .-----. | + | stey |-->| Alfredo |-->| Pio |--' + '------' '---------' '-----' + +The function to splice these two lists together is list_splice(). Our example +code is as follows: + +.. code-block:: c + + static void circus_clowns_splice(void) + { + struct clown *grock, *dimitri, *pic, *alfredo, *pio; + struct list_head knie = LIST_HEAD_INIT(knie); + struct list_head stey = LIST_HEAD_INIT(stey); + + /* ... Clown allocation and initialization here ... */ + + list_add_tail(&grock->node, &knie); + list_add_tail(&dimitri->node, &knie); + list_add_tail(&pic->node, &knie); + list_add_tail(&alfredo->node, &stey); + list_add_tail(&pio->node, &stey); + + /* State 0 */ + + list_splice(&stey, &dimitri->node); + + /* State 1 */ + } + +The list_splice() call here adds all the entries in ``stey`` to the list +``dimitri``'s ``node`` list_head is in, after the ``node`` of ``dimitri``. A +somewhat surprising diagram of the resulting "State 1" follows:: + + .-----------------------------------------------------------------. + | | + v | + .------. .-------. .---------. .---------. .-----. .-----. | + | knie |-->| Grock |-->| Dimitri |-->| Alfredo |-->| Pio |-->| Pic |--' + '------' '-------' '---------' '---------' '-----' '-----' + ^ + .-------------------------------' + | + .------. | + | stey |--' + '------' + +Traversing the ``stey`` list no longer results in correct behavior. A call of +list_for_each() on ``stey`` results in an infinite loop, as it never returns +back to the ``stey`` list head. + +This is because list_splice() did not reinitialize the list_head it took +entries from, leaving its pointer pointing into what is now a different list. + +If we want to avoid this situation, list_splice_init() can be used. It does the +same thing as list_splice(), except reinitalizes the donor list_head after the +transplant. + +Concurrency considerations +-------------------------- + +Concurrent access and modification of a list needs to be protected with a lock +in most cases. Alternatively and preferably, one may use the RCU primitives for +lists in read-mostly use-cases, where read accesses to the list are common but +modifications to the list less so. See Documentation/RCU/listRCU.rst for more +details. + +Further reading +--------------- + +* `How does the kernel implements Linked Lists? - KernelNewbies <https://kernelnewbies.org/FAQ/LinkedLists>`_ + +Full List API +============= + +.. kernel-doc:: include/linux/list.h + :internal: diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst index 682259ee633a..8fc97c2379de 100644 --- a/Documentation/core-api/memory-hotplug.rst +++ b/Documentation/core-api/memory-hotplug.rst @@ -9,6 +9,9 @@ Memory hotplug event notifier Hotplugging events are sent to a notification queue. +Memory notifier +---------------- + There are six types of notification defined in ``include/linux/memory.h``: MEM_GOING_ONLINE @@ -56,20 +59,18 @@ The third argument (arg) passes a pointer of struct memory_notify:: struct memory_notify { unsigned long start_pfn; unsigned long nr_pages; - int status_change_nid_normal; - int status_change_nid; } - start_pfn is start_pfn of online/offline memory. - nr_pages is # of pages of online/offline memory. 
-- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask + is (will be) set/clear, if this is -1, then nodemask status is not changed. +- status_change_nid is set node id when N_MEMORY of nodemask is (will be) + set/clear. It means a new(memoryless) node gets new memory by online and a + node loses all memory. If this is -1, then nodemask status is not changed. + If status_changed_nid* >= 0, callback should create/discard structures for the + node if necessary. +It is possible to get notified for MEM_CANCEL_ONLINE without having been notified +for MEM_GOING_ONLINE, and the same applies to MEM_CANCEL_OFFLINE and +MEM_GOING_OFFLINE. +This can happen when a consumer fails, meaning we break the callchain and we +stop calling the remaining consumers of the notifier. +It is then important that users of memory_notify make no assumptions and get +prepared to handle such cases. The callback routine shall return one of the values NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP @@ -83,6 +84,78 @@ further processing of the notification queue. NOTIFY_STOP stops further processing of the notification queue. +NUMA node notifier +------------------ + +There are six types of notification defined in ``include/linux/node.h``: + +NODE_ADDING_FIRST_MEMORY + Generated before memory becomes available to this node for the first time. + +NODE_CANCEL_ADDING_FIRST_MEMORY + Generated if NODE_ADDING_FIRST_MEMORY fails. + +NODE_ADDED_FIRST_MEMORY + Generated when memory has become available for this node for the first time. + +NODE_REMOVING_LAST_MEMORY + Generated when the last memory available to this node is about to be offlined. + +NODE_CANCEL_REMOVING_LAST_MEMORY + Generated if NODE_REMOVING_LAST_MEMORY fails. + +NODE_REMOVED_LAST_MEMORY + Generated when the last memory available to this node has been offlined. + +A callback routine can be registered by calling:: + + hotplug_node_notifier(callback_func, priority) + +Callback functions with higher values of priority are called before callback +functions with lower values. + +A callback function must have the following prototype:: + + int callback_func( + + struct notifier_block *self, unsigned long action, void *arg); + +The first argument of the callback function (self) is a pointer to the block +of the notifier chain that points to the callback function itself. +The second argument (action) is one of the event types described above. +The third argument (arg) passes a pointer of struct node_notify:: + + struct node_notify { + int nid; + } + +- nid is the node we are adding memory to or removing memory from. + +It is possible to get notified for NODE_CANCEL_ADDING_FIRST_MEMORY without +having been notified for NODE_ADDING_FIRST_MEMORY, and the same applies to +NODE_CANCEL_REMOVING_LAST_MEMORY and NODE_REMOVING_LAST_MEMORY. +This can happen when a consumer fails, meaning we break the callchain and we +stop calling the remaining consumers of the notifier. +It is then important that users of node_notify make no assumptions and get +prepared to handle such cases. + +The callback routine shall return one of the values +NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP +defined in ``include/linux/notifier.h`` + +NOTIFY_DONE and NOTIFY_OK have no effect on the further processing. + +NOTIFY_BAD is used as response to the NODE_ADDING_FIRST_MEMORY, +NODE_REMOVING_LAST_MEMORY, NODE_ADDED_FIRST_MEMORY or +NODE_REMOVED_LAST_MEMORY action to cancel hotplugging. +It stops further processing of the notification queue.
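To make the callback description above concrete, here is a minimal sketch of a node notifier and its registration. It relies only on the prototype, events and return values described above; the ``example_*`` function names are hypothetical placeholders for whatever per-node work a real consumer has to do, not an existing kernel API:

.. code-block:: c

    #include <linux/init.h>
    #include <linux/node.h>
    #include <linux/notifier.h>

    /* Hypothetical helper: allocate whatever per-node state we need. */
    static int example_alloc_state_for_node(int nid)
    {
        return 0;
    }

    /*
     * Hypothetical helper: free the per-node state again. It must tolerate
     * nodes that never had state allocated, because a cancel notification
     * can arrive without the corresponding NODE_ADDING_FIRST_MEMORY having
     * reached this consumer.
     */
    static void example_free_state_for_node(int nid)
    {
    }

    static int example_node_callback(struct notifier_block *self,
                                     unsigned long action, void *arg)
    {
        struct node_notify *nn = arg;

        switch (action) {
        case NODE_ADDING_FIRST_MEMORY:
            /* Returning NOTIFY_BAD here cancels the hotplug operation. */
            if (example_alloc_state_for_node(nn->nid))
                return NOTIFY_BAD;
            break;
        case NODE_CANCEL_ADDING_FIRST_MEMORY:
        case NODE_REMOVED_LAST_MEMORY:
            example_free_state_for_node(nn->nid);
            break;
        default:
            break;
        }

        return NOTIFY_OK;
    }

    static int __init example_init(void)
    {
        /* Priority 0; callbacks with higher priority run earlier. */
        hotplug_node_notifier(example_node_callback, 0);
        return 0;
    }

Note that the sketch tears its state down on NODE_CANCEL_ADDING_FIRST_MEMORY as well, since cancel events must be handled without assuming the corresponding "adding" event was delivered to this consumer.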
+ +NOTIFY_STOP stops further processing of the notification queue. + +Please note that we should not fail for NODE_ADDED_FIRST_MEMORY / +NODE_REMOVED_FIRST_MEMORY, as memory_hotplug code cannot rollback at that +point anymore. + Locking Internals ================= diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index af8151db88b2..50cfc7842930 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -91,12 +91,6 @@ Memory pools .. kernel-doc:: mm/mempool.c :export: -DMA pools -========= - -.. kernel-doc:: mm/dmapool.c - :export: - More Memory Management Functions ================================ diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst index 0ce2078c8e13..f68f1e08fef9 100644 --- a/Documentation/core-api/packing.rst +++ b/Documentation/core-api/packing.rst @@ -319,7 +319,7 @@ Here is an example of how to use the fields APIs: #define SIZE 13 - typdef struct __packed { u8 buf[SIZE]; } packed_buf_t; + typedef struct __packed { u8 buf[SIZE]; } packed_buf_t; static const struct packed_field_u8 fields[] = { PACKED_FIELD(100, 90, struct data, field1), diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index 4bdc394e86af..4b7f3646ec6c 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -571,9 +571,8 @@ struct clk :: %pC pll1 - %pCn pll1 -For printing struct clk structures. %pC and %pCn print the name of the clock +For printing struct clk structures. %pC prints the name of the clock (Common Clock Framework) or a unique 32-bit ID (legacy clock framework). Passed by reference. @@ -648,6 +647,38 @@ Examples:: %p4cc Y10 little-endian (0x20303159) %p4cc NV12 big-endian (0xb231564e) +Generic FourCC code +------------------- + +:: + %p4c[h[R]lb] gP00 (0x67503030) + +Print a generic FourCC code, as both ASCII characters and its numerical +value as hexadecimal. + +The generic FourCC code is always printed in the big-endian format, +the most significant byte first. This is the opposite of V4L/DRM FourCCs. + +The additional ``h``, ``hR``, ``l``, and ``b`` specifiers define what +endianness is used to load the stored bytes. The data might be interpreted +using the host, reversed host byte order, little-endian, or big-endian. + +Passed by reference. + +Examples for a little-endian machine, given &(u32)0x67503030:: + + %p4ch gP00 (0x67503030) + %p4chR 00Pg (0x30305067) + %p4cl gP00 (0x67503030) + %p4cb 00Pg (0x30305067) + +Examples for a big-endian machine, given &(u32)0x67503030:: + + %p4ch gP00 (0x67503030) + %p4chR 00Pg (0x30305067) + %p4cl 00Pg (0x30305067) + %p4cb gP00 (0x67503030) + Rust ---- diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst index 06f766a6aab2..32fc73dc5529 100644 --- a/Documentation/core-api/symbol-namespaces.rst +++ b/Documentation/core-api/symbol-namespaces.rst @@ -6,18 +6,8 @@ The following document describes how to use Symbol Namespaces to structure the export surface of in-kernel symbols exported through the family of EXPORT_SYMBOL() macros. -.. Table of Contents - - === 1 Introduction - === 2 How to define Symbol Namespaces - --- 2.1 Using the EXPORT_SYMBOL macros - --- 2.2 Using the DEFAULT_SYMBOL_NAMESPACE define - === 3 How to use Symbols exported in Namespaces - === 4 Loading Modules that use namespaced Symbols - === 5 Automatically creating MODULE_IMPORT_NS statements - -1. 
Introduction -=============== +Introduction +============ Symbol Namespaces have been introduced as a means to structure the export surface of the in-kernel API. It allows subsystem maintainers to partition @@ -28,15 +18,18 @@ kernel. As of today, modules that make use of symbols exported into namespaces, are required to import the namespace. Otherwise the kernel will, depending on its configuration, reject loading the module or warn about a missing import. -2. How to define Symbol Namespaces -================================== +Additionally, it is possible to put symbols into a module namespace, strictly +limiting which modules are allowed to use these symbols. + +How to define Symbol Namespaces +=============================== Symbols can be exported into namespace using different methods. All of them are changing the way EXPORT_SYMBOL and friends are instrumented to create ksymtab entries. -2.1 Using the EXPORT_SYMBOL macros -================================== +Using the EXPORT_SYMBOL macros +------------------------------ In addition to the macros EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(), that allow exporting of kernel symbols to the kernel symbol table, variants of these are @@ -54,8 +47,8 @@ refer to ``NULL``. There is no default namespace if none is defined. ``modpost`` and kernel/module/main.c make use the namespace at build time or module load time, respectively. -2.2 Using the DEFAULT_SYMBOL_NAMESPACE define -============================================= +Using the DEFAULT_SYMBOL_NAMESPACE define +----------------------------------------- Defining namespaces for all symbols of a subsystem can be very verbose and may become hard to maintain. Therefore a default define (DEFAULT_SYMBOL_NAMESPACE) @@ -83,8 +76,24 @@ unit as preprocessor statement. The above example would then read:: within the corresponding compilation unit before the #include for <linux/export.h>. Typically it's placed before the first #include statement. -3. How to use Symbols exported in Namespaces -============================================ +Using the EXPORT_SYMBOL_GPL_FOR_MODULES() macro +----------------------------------------------- + +Symbols exported using this macro are put into a module namespace. This +namespace cannot be imported. + +The macro takes a comma separated list of module names, allowing only those +modules to access this symbol. Simple tail-globs are supported. + +For example:: + + EXPORT_SYMBOL_GPL_FOR_MODULES(preempt_notifier_inc, "kvm,kvm-*") + +will limit usage of this symbol to modules whoes name matches the given +patterns. + +How to use Symbols exported in Namespaces +========================================= In order to use symbols that are exported into namespaces, kernel modules need to explicitly import these namespaces. Otherwise the kernel might reject to @@ -106,11 +115,10 @@ inspected with modinfo:: It is advisable to add the MODULE_IMPORT_NS() statement close to other module -metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE(). Refer to section -5. for a way to create missing import statements automatically. +metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE(). -4. Loading Modules that use namespaced Symbols -============================================== +Loading Modules that use namespaced Symbols +=========================================== At module loading time (e.g. 
``insmod``), the kernel will check each symbol referenced from the module for its availability and whether the namespace it @@ -121,8 +129,8 @@ allow loading of modules that don't satisfy this precondition, a configuration option is available: Setting MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=y will enable loading regardless, but will emit a warning. -5. Automatically creating MODULE_IMPORT_NS statements -===================================================== +Automatically creating MODULE_IMPORT_NS statements +================================================== Missing namespaces imports can easily be detected at build time. In fact, modpost will emit a warning if a module uses a symbol from a namespace @@ -154,3 +162,6 @@ in-tree modules:: You can also run nsdeps for external module builds. A typical usage is:: $ make -C <path_to_kernel_src> M=$PWD nsdeps + +Note: nsdeps will happily generate an import statement for a module namespace, which will not work and will generate build and runtime failures. diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst index e295835fc116..165ca73e8351 100644 --- a/Documentation/core-api/workqueue.rst +++ b/Documentation/core-api/workqueue.rst @@ -183,6 +183,12 @@ resources, scheduled and executed. BH work items cannot sleep. All other features such as delayed queueing, flushing and canceling are supported. +``WQ_PERCPU`` + Work items queued to a per-cpu wq are bound to a specific CPU. + This flag is the right choice when cpu locality is important. + + This flag is the complement of ``WQ_UNBOUND``. + ``WQ_UNBOUND`` Work items queued to an unbound wq are served by the special worker-pools which host workers which are not bound to any |