Patch series "kexec: introduce Kexec HandOver (KHO)", v8.
Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.
However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer-to-peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See the "pkernfs: Persisting guest
memory and kernel/device state safely across kexec" Linux Plumbers
Conference 2023 presentation for details:
https://lpc.events/event/17/contributions/1485/
To start us on the journey to support all the use cases above, this patch
implements basic infrastructure to allow hand over of kernel state across
kexec (Kexec HandOver, aka KHO). As a really simple example target, we
use memblock's reserve_mem.
With this patchset applied, memory that was reserved using the
"reserve_mem" command line option remains intact after kexec, and it is
guaranteed to reside at the same physical address.
== Alternatives ==
There are alternative approaches to (parts of) the problems above:
* Memory Pools [1] - preallocated persistent memory region + allocator
* PRMEM [2] - resizable persistent memory regions with fixed metadata
pointer on the kernel command line + allocator
* Pkernfs [3] - preallocated file system for in-kernel data with fixed
address location on the kernel command line
* PKRAM [4] - handover of user space pages using a fixed metadata page
specified via command line
All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command line
to pass data (including memory reservations) between kexec'ing kernels.
KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as easy versioning for data.
== Overview ==
We introduce a metadata file that the kernels pass between each other.
How they pass it is architecture specific. The file's format is a
Flattened Device Tree (fdt) which has a generator and parser already
included in Linux. KHO is enabled with the `kho=on` kernel command line
parameter. When the root user finalizes KHO through
/sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks on every
KHO user to register preserved memory regions, which contain the drivers'
states.
When the actual kexec happens, the fdt is part of the image set that we
boot into. In addition, we keep "scratch regions" available for kexec:
physically contiguous memory regions that are guaranteed to not have any
memory that KHO would preserve. The new kernel bootstraps itself using
the scratch regions and marks all handed-over memory as in use. When
drivers that support KHO initialize, they introspect the fdt, restore
preserved memory regions, and retrieve their states stored in the
preserved memory.
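
For illustration, a minimal sketch of what a KHO user's save side could
look like, assuming the notifier-based registration interface described in
this series; the names used here (linux/kexec_handover.h,
register_kho_notifier(), KEXEC_KHO_FINALIZE/ABORT, kho_preserve_phys())
are assumptions drawn from that description and may not match the final
API:

    #include <linux/kexec_handover.h>   /* assumed header name */
    #include <linux/notifier.h>
    #include <linux/types.h>
    #include <linux/init.h>

    /* Physical range holding this driver's state that must survive kexec.
     * A real user would also record this address in its KHO FDT subtree so
     * the next kernel can find it again.
     */
    static phys_addr_t my_state_phys;
    static size_t my_state_size;

    static int my_kho_notifier(struct notifier_block *self,
                               unsigned long cmd, void *data)
    {
        switch (cmd) {
        case KEXEC_KHO_FINALIZE:
            /* Ask KHO to preserve the state pages across kexec. */
            return notifier_from_errno(kho_preserve_phys(my_state_phys,
                                                         my_state_size));
        case KEXEC_KHO_ABORT:
            /* Finalization was rolled back; nothing to undo here. */
            return NOTIFY_DONE;
        default:
            return NOTIFY_DONE;
        }
    }

    static struct notifier_block my_kho_nb = {
        .notifier_call = my_kho_notifier,
    };

    static int __init my_driver_kho_init(void)
    {
        return register_kho_notifier(&my_kho_nb);
    }
    late_initcall(my_driver_kho_init);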
== Limitations ==
Currently KHO is only implemented for file-based kexec. The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is not yet implemented in kexec-tools.
== How to Use ==
To use the code, please boot the kernel with the "kho=on" command line
parameter. KHO will automatically create scratch regions. If you want to
set the scratch size explicitly, you can use the "kho_scratch=" command
line parameter. For instance, "kho_scratch=16M,512M,256M" will reserve a
16 MiB low-memory scratch area, a 512 MiB global scratch region, and
256 MiB per-NUMA-node scratch regions at boot.
Make sure to have a reserved memory range requested with the "reserve_mem"
command line option, for example, "reserve_mem=64m:4k:n1".
Then, before you invoke file-based "kexec -l", finalize the KHO FDT:
# echo 1 > /sys/kernel/debug/kho/out/finalize
You can preview the generated FDT using `dtc`,
# dtc /sys/kernel/debug/kho/out/fdt
# dtc /sys/kernel/debug/kho/out/sub_fdts/memblock
`dtc` is available on Ubuntu via `sudo apt-get install device-tree-compiler`.
Now kexec into the new kernel,
# kexec -l Image --initrd=initrd -s
# kexec -e
(The order of KHO finalization and "kexec -l" does not matter.)
The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as the first kernel.
You can also review the FDT passed from the old kernel,
# dtc /sys/kernel/debug/kho/in/fdt
# dtc /sys/kernel/debug/kho/in/sub_fdts/memblock
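
In the new kernel, a driver can keep using the normal reserve_mem lookup;
with KHO the range now also carries the old contents. A minimal sketch
(reserve_mem_find_by_name() is the existing memblock helper; error
handling is trimmed, and mapping the range with memremap() is just one way
to access it):

    #include <linux/mm.h>
    #include <linux/io.h>

    static void *my_map_persistent_region(void)
    {
        phys_addr_t start, size;

        /* "n1" matches the name given in "reserve_mem=64m:4k:n1". */
        if (!reserve_mem_find_by_name("n1", &start, &size))
            return NULL;

        /*
         * With KHO, this range sits at the same physical address and
         * still holds the contents written by the previous kernel.
         */
        return memremap(start, size, MEMREMAP_WB);
    }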
This patch (of 17):
Add a memblock flag to denote areas that were reserved for kernel use,
either directly with memblock_reserve_kern() or via memblock allocations.
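
A rough sketch of how the two kinds of reservations would be told apart
(assuming a memblock_reserve_kern() helper with the same calling
convention as memblock_reserve(); addresses are illustrative):

    #include <linux/memblock.h>
    #include <linux/sizes.h>
    #include <linux/init.h>

    static void __init example_reservations(void)
    {
        /* Plain reservation, e.g. a firmware-owned range: no extra marking. */
        memblock_reserve(0x1000000, SZ_1M);

        /*
         * Reserved for kernel use: tagged so that later consumers (such as
         * KHO) can distinguish it from other reservations.  memblock
         * allocations are marked the same way.
         */
        memblock_reserve_kern(0x1100000, SZ_1M);
    }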
Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/
Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/
Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/
Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/
Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com
Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Move variable declarations to a single block at the beginning of each
testing function.
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/e61431e73977f305fdd027bca99d1dc119e96d84.1662264355.git.remckee0@gmail.com
|
|
Update memblock_alloc() tests so that they test either memblock_alloc()
or memblock_alloc_raw() depending on the value of alloc_test_flags. Run
through all the existing tests in memblock_alloc_api twice: once for
memblock_alloc() and once for memblock_alloc_raw().
When the tests run memblock_alloc(), they test that the entire memory
region is zero. When the tests run memblock_alloc_raw(), they test that
the entire memory region is nonzero. The content of the memory region is
initialized to a nonzero pattern, which is expected to remain unchanged
when running memblock_alloc_raw().
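
A hypothetical sketch of such a flag-driven content check (the flag bit
and helper names here are illustrative, not the ones from the test suite):

    #include <assert.h>
    #include <stddef.h>

    #define TEST_F_RAW 0x1              /* illustrative flag bit */

    static int alloc_test_flags;        /* selects alloc() vs alloc_raw() */

    /*
     * memblock_alloc() must return zeroed memory; memblock_alloc_raw() must
     * leave the pre-set nonzero pattern untouched, so no byte may be zero.
     */
    static void check_alloc_content(const unsigned char *mem, size_t size)
    {
        for (size_t i = 0; i < size; i++) {
            if (alloc_test_flags & TEST_F_RAW)
                assert(mem[i] != 0);
            else
                assert(mem[i] == 0);
        }
    }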
Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com>
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/5a7cfb2f807ee2cb53ee77f9f5c910107b253d6e.1661578349.git.remckee0@gmail.com
|
|
Add tests for memblock_add(), memblock_reserve(), memblock_remove(),
memblock_free(), and memblock_alloc() for the following test scenarios.
memblock_add() and memblock_reserve():
- add/reserve a memory block in the gap between two existing memory
blocks, and check that the blocks are merged into one region
- try to add/reserve memblock regions that extend past PHYS_ADDR_MAX
memblock_remove() and memblock_free():
- remove/free a region when it is the only available region
+ These tests ensure that the first region is overwritten with a
"dummy" region when the last remaining region of that type is
removed or freed.
- remove/free a region that overlaps with two existing regions of the
relevant type
- try to remove/free memblock regions that extend past PHYS_ADDR_MAX
memblock_alloc():
- try to allocate a region that is larger than the total size of available
memory (memblock.memory)
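
As an illustration of the first memblock_add()/memblock_reserve() scenario
above, a sketch in the style of the memblock test suite
(reset_memblock_regions() and ASSERT_EQ() come from the test harness;
sizes are illustrative):

    static int memblock_add_between_example(void)
    {
        struct memblock_region *rgn = &memblock.memory.regions[0];

        reset_memblock_regions();

        memblock_add(SZ_1M, SZ_1M);         /* [1M, 2M) */
        memblock_add(SZ_1M * 3, SZ_1M);     /* [3M, 4M) */
        memblock_add(SZ_2M, SZ_1M);         /* fills the gap [2M, 3M) */

        /* The three blocks are now contiguous and must form one region. */
        ASSERT_EQ(memblock.memory.cnt, 1);
        ASSERT_EQ(rgn->base, SZ_1M);
        ASSERT_EQ(rgn->size, SZ_1M * 3);

        return 0;
    }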
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com>
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/c23c0393c5b9a53fe7f676996913c629495e9727.1661578349.git.remckee0@gmail.com
|
|
Generic tests for memblock_alloc*() functions do not use separate
functions for testing top-down and bottom-up allocation directions.
Therefore, the function name that is displayed in the verbose testing
output does not include the allocation direction.
Add an additional prefix, indicating which allocation direction is set,
when running generic tests for memblock_alloc*() functions. The prefix
will be displayed when the tests are run in verbose mode.
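
A sketch of the idea (helper names are assumptions about the test harness
rather than the exact functions added here):

    static int run_top_down(int (*test)(void))
    {
        int ret;

        memblock_set_bottom_up(false);
        prefix_push("top-down");        /* shows up in verbose output */
        ret = test();
        prefix_pop();

        return ret;
    }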
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com>
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/fb76a42253d2a196a7daea29dd8121a69904f58e.1661578349.git.remckee0@gmail.com
|
|
Add an assert in memblock_alloc() tests where allocation is expected to
occur. The assert checks whether the entire chunk of allocated memory is
cleared.
The current memblock_alloc() tests do not check whether the allocated
memory was zeroed. memblock_alloc() should zero the allocated memory since
it is a wrapper for memblock_alloc_try_nid().
Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/83ffb941b65074f40eb14552f8bfe5b71fe50abd.1661578349.git.remckee0@gmail.com
|
|
Add and use functions and macros for printing verbose testing output.
If the Memblock simulator was compiled with VERBOSE=1:
- prefix_push(): appends the given string to a prefix string that will be
printed in test_fail() and test_pass*().
- prefix_pop(): removes the last prefix from the prefix string.
- prefix_reset(): clears the prefix string.
- test_fail(): prints a message after a test fails containing the test
number of the failing test and the prefix.
- test_pass(): prints a message after a test passes containing its test
number and the prefix.
- test_print(): prints the given formatted output string.
- test_pass_pop(): runs test_pass() followed by prefix_pop().
- PREFIX_PUSH(): runs prefix_push(__func__).
If the Memblock simulator was not compiled with VERBOSE=1, these
functions/macros do nothing.
Add the assert wrapper macros ASSERT_EQ(), ASSERT_NE(), and ASSERT_LT().
If the assert condition fails, these macros call test_fail() before
executing assert().
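
As a hypothetical sketch of one of these wrappers (the real test_fail()
and macro bodies differ in detail):

    #include <assert.h>

    /* Provided by the test harness: reports the test number and prefix. */
    void test_fail(void);

    #define ASSERT_EQ(_expected, _seen)             \
        do {                                        \
            if ((_expected) != (_seen))             \
                test_fail();                        \
            assert((_expected) == (_seen));         \
        } while (0)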
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com>
Signed-off-by: Rebecca Mckeever <remckee0@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/f234d443fe154d5ae8d8aa07284aff69edfb6f61.1656907314.git.remckee0@gmail.com
|
|
Add checks for memblock_alloc() for the bottom-up allocation direction.
The tested scenarios are:
- Region can be allocated on the first fit (with and without
region merging)
- Region can be allocated on the second fit (with and without
region merging)
Add test case wrappers to test both directions in the same context.
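
A sketch of such a wrapper (illustrative names; the per-direction check
itself is elided):

    static int alloc_first_fit_check(void)
    {
        /* ... first-fit assertions for the current direction ... */
        return 0;
    }

    /* Run the same scenario once top-down and once bottom-up. */
    static int alloc_first_fit_checks(void)
    {
        memblock_set_bottom_up(false);
        alloc_first_fit_check();

        memblock_set_bottom_up(true);
        alloc_first_fit_check();

        return 0;
    }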
Signed-off-by: Karolina Drobnik <karolinadrobnik@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/426674eee20d99dca49caf1ee0142a83dccbc98d.1646055639.git.karolinadrobnik@gmail.com
|
|
Add checks for memblock_alloc() for the top-down allocation direction.
The tested scenarios are:
- Region can be allocated on the first fit (with and without
region merging)
- Region can be allocated on the second fit (with and without
region merging)
Add checks for both allocation directions:
- Region can be allocated between two already existing entries
- Limited memory available
- All memory is reserved
- No available memory registered with memblock
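
For instance, the "all memory is reserved" case from the list above boils
down to a check of this shape (sketch only; names and sizes are
illustrative, and reset_memblock_regions() comes from the test harness):

    static int alloc_all_reserved_example(void)
    {
        void *ptr;

        reset_memblock_regions();
        memblock_add(0, SZ_16M);
        memblock_reserve(0, SZ_16M);    /* nothing left to hand out */

        ptr = memblock_alloc(SZ_1K, SMP_CACHE_BYTES);
        assert(ptr == NULL);

        return 0;
    }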
Signed-off-by: Karolina Drobnik <karolinadrobnik@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/26ccf409b8ff0394559d38d792b2afb24b55887c.1646055639.git.karolinadrobnik@gmail.com
|