From 05d6f1cc2dc214c1491181be13f37d2a3a26f694 Mon Sep 17 00:00:00 2001 From: Thomas Weißschuh Date: Mon, 13 Oct 2025 11:12:02 +0200 Subject: compiler.h: remove ARCH_SEL() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Its last user was removed in commit 8ea815399c3f ("compiler: remove __ADDRESSABLE_ASM{_STR,}() again"). Link: https://lkml.kernel.org/r/20251013-arch-sel-v1-1-7eef9b22ceb0@linutronix.de Signed-off-by: Thomas Weißschuh Cc: Luc Van Oostenryck Signed-off-by: Andrew Morton --- include/linux/compiler.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'include/linux') diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 5b45ea7dff3e..a9a2f8aae821 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -269,12 +269,6 @@ static inline void *offset_to_ptr(const int *off) #endif /* __ASSEMBLY__ */ -#ifdef CONFIG_64BIT -#define ARCH_SEL(a,b) a -#else -#define ARCH_SEL(a,b) b -#endif - /* * Force the compiler to emit 'sym' as a symbol, so that we can reference * it from inline assembler. Necessary in case 'sym' could be inlined -- cgit From adc15829fb73e402903b7030729263b6ee4a7232 Mon Sep 17 00:00:00 2001 From: Sourabh Jain Date: Thu, 16 Oct 2025 19:58:31 +0530 Subject: crash: let architecture decide crash memory export to iomem_resource With the generic crashkernel reservation, the kernel emits the following warning on powerpc: WARNING: CPU: 0 PID: 1 at arch/powerpc/mm/mem.c:341 add_system_ram_resources+0xfc/0x180 Modules linked in: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-auto-12607-g5472d60c129f #1 VOLUNTARY Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries NIP: c00000000201de3c LR: c00000000201de34 CTR: 0000000000000000 REGS: c000000127cef8a0 TRAP: 0700 Not tainted (6.17.0-auto-12607-g5472d60c129f) MSR: 8000000002029033 CR: 84000840 XER: 20040010 CFAR: c00000000017eed0 IRQMASK: 0 GPR00: c00000000201de34 c000000127cefb40 c0000000016a8100 0000000000000001 GPR04: c00000012005aa00 0000000020000000 c000000002b705c8 0000000000000000 GPR08: 000000007fffffff fffffffffffffff0 c000000002db8100 000000011fffffff GPR12: c00000000201dd40 c000000002ff0000 c0000000000112bc 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 0000000000000000 c0000000015a3808 GPR24: c00000000200468c c000000001699888 0000000000000106 c0000000020d1950 GPR28: c0000000014683f8 0000000081000200 c0000000015c1868 c000000002b9f710 NIP [c00000000201de3c] add_system_ram_resources+0xfc/0x180 LR [c00000000201de34] add_system_ram_resources+0xf4/0x180 Call Trace: add_system_ram_resources+0xf4/0x180 (unreliable) do_one_initcall+0x60/0x36c do_initcalls+0x120/0x220 kernel_init_freeable+0x23c/0x390 kernel_init+0x34/0x26c ret_from_kernel_user_thread+0x14/0x1c This warning occurs due to a conflict between crashkernel and System RAM iomem resources. The generic crashkernel reservation adds the crashkernel memory range to /proc/iomem during early initialization. Later, all memblock ranges are added to /proc/iomem as System RAM. If the crashkernel region overlaps with any memblock range, it causes a conflict while adding those memblock regions as iomem resources, triggering the above warning. The conflicting memblock regions are then omitted from /proc/iomem. For example, if the following crashkernel region is added to /proc/iomem: 20000000-11fffffff : Crash kernel then the following memblock regions System RAM regions fail to be inserted: 00000000-7fffffff : System RAM 80000000-257fffffff : System RAM Fix this by not adding the crashkernel memory to /proc/iomem on powerpc. Introduce an architecture hook to let each architecture decide whether to export the crashkernel region to /proc/iomem. For more info checkout commit c40dd2f766440 ("powerpc: Add System RAM to /proc/iomem") and commit bce074bdbc36 ("powerpc: insert System RAM resource to prevent crashkernel conflict") Note: Before switching to the generic crashkernel reservation, powerpc never exported the crashkernel region to /proc/iomem. Link: https://lkml.kernel.org/r/20251016142831.144515-1-sourabhjain@linux.ibm.com Fixes: e3185ee438c2 ("powerpc/crash: use generic crashkernel reservation"). Signed-off-by: Sourabh Jain Reported-by: Venkat Rao Bagalkote Closes: https://lore.kernel.org/all/90937fe0-2e76-4c82-b27e-7b8a7fe3ac69@linux.ibm.com/ Tested-by: Venkat Rao Bagalkote Cc: Baoquan he Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Vivek Goyal Cc: Dave Young Cc: Mike Rapoport Cc: Signed-off-by: Andrew Morton --- include/linux/crash_reserve.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'include/linux') diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h index 7b44b41d0a20..f0dc03d94ca2 100644 --- a/include/linux/crash_reserve.h +++ b/include/linux/crash_reserve.h @@ -32,6 +32,12 @@ int __init parse_crashkernel(char *cmdline, unsigned long long system_ram, void __init reserve_crashkernel_cma(unsigned long long cma_size); #ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION +#ifndef arch_add_crash_res_to_iomem +static inline bool arch_add_crash_res_to_iomem(void) +{ + return true; +} +#endif #ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE #define DEFAULT_CRASH_KERNEL_LOW_SIZE (128UL << 20) #endif -- cgit From 37ade54f386c829597f74b54bad335c12bd2a698 Mon Sep 17 00:00:00 2001 From: Petr Pavlu Date: Wed, 22 Oct 2025 10:28:04 +0200 Subject: taint/module: remove unnecessary taint_flag.module field The TAINT_RANDSTRUCT and TAINT_FWCTL flags are mistakenly set in the taint_flags table as per-module flags. While this can be trivially corrected, the issue can be avoided altogether by removing the taint_flag.module field. This is possible because, since commit 7fd8329ba502 ("taint/module: Clean up global and module taint flags handling") in 2016, the handling of module taint flags has been fully generic. Specifically, module_flags_taint() can print all flags, and the required output buffer size is properly defined in terms of TAINT_FLAGS_COUNT. The actual per-module flags are always those added to module.taints by calls to add_taint_module(). Link: https://lkml.kernel.org/r/20251022082938.26670-1-petr.pavlu@suse.com Signed-off-by: Petr Pavlu Acked-by: Petr Mladek Reviewed-by: Randy Dunlap Cc: Aaron Tomlin Cc: Luis Chamberalin Cc: Petr Pavlu Cc: Sami Tolvanen Signed-off-by: Andrew Morton --- include/linux/panic.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/panic.h b/include/linux/panic.h index 6f972a66c13e..a00bc0937698 100644 --- a/include/linux/panic.h +++ b/include/linux/panic.h @@ -86,7 +86,6 @@ static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout) struct taint_flag { char c_true; /* character printed when tainted */ char c_false; /* character printed when not tainted */ - bool module; /* also show as a per-module taint flag */ const char *desc; /* verbose description of the set taint flag */ }; -- cgit From d99dc586ca7c7729450af2ed39ca1483c0eb7b5c Mon Sep 17 00:00:00 2001 From: "Yury Norov (NVIDIA)" Date: Thu, 23 Oct 2025 13:16:06 -0400 Subject: uaccess: decouple INLINE_COPY_FROM_USER and CONFIG_RUST MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit 1f9a8286bc0c ("uaccess: always export _copy_[from|to]_user with CONFIG_RUST") exports _copy_{from,to}_user() unconditionally, if RUST is enabled. This pollutes exported symbols namespace, and spreads RUST ifdefery in core files. It's better to declare a corresponding helper under the rust/helpers, similarly to how non-underscored copy_{from,to}_user() is handled. [yury.norov@gmail.com: drop rust part of comment for _copy_from_user(), per Alice] Link: https://lkml.kernel.org/r/20251024154754.99768-1-yury.norov@gmail.com Link: https://lkml.kernel.org/r/20251023171607.1171534-1-yury.norov@gmail.com Signed-off-by: Yury Norov (NVIDIA) Acked-by: Arnd Bergmann Acked-by: Miguel Ojeda Reviewed-by: Alice Ryhl Tested-by: Alice Ryhl Cc: Alex Gaynor Cc: Andreas Hindborg Cc: Björn Roy Baron Cc: Boqun Feng Cc: Danilo Krummrich Cc: Gary Guo Cc: John Hubbard Cc: Trevor Gross Signed-off-by: Andrew Morton --- include/linux/uaccess.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 1beb5b395d81..01cbd7dd0ba3 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -152,8 +152,6 @@ __copy_to_user(void __user *to, const void *from, unsigned long n) * directly in the normal copy_to/from_user(), the other ones go * through an extern _copy_to/from_user(), which expands the same code * here. - * - * Rust code always uses the extern definition. */ static inline __must_check unsigned long _inline_copy_from_user(void *to, const void __user *from, unsigned long n) -- cgit From 6c2e6e2c1af1809d1d9cdbd50ac80f54f5995bdb Mon Sep 17 00:00:00 2001 From: Ye Bin Date: Sat, 25 Oct 2025 16:00:03 +0800 Subject: dynamic_debug: add support for print stack In practical problem diagnosis, especially during the boot phase, it is often desirable to know the call sequence. However, currently, apart from adding print statements and recompiling the kernel, there seems to be no good alternative. If dynamic_debug supported printing the call stack, it would be very helpful for diagnosing issues. This patch add support '+d' for dump stack. Link: https://lkml.kernel.org/r/20251025080003.312536-1-yebin@huaweicloud.com Signed-off-by: Ye Bin Cc: Jason Baron Cc: Jim Cromie Signed-off-by: Andrew Morton --- include/linux/dynamic_debug.h | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h index ff44ec346162..05743900a116 100644 --- a/include/linux/dynamic_debug.h +++ b/include/linux/dynamic_debug.h @@ -38,11 +38,12 @@ struct _ddebug { #define _DPRINTK_FLAGS_INCL_LINENO (1<<3) #define _DPRINTK_FLAGS_INCL_TID (1<<4) #define _DPRINTK_FLAGS_INCL_SOURCENAME (1<<5) +#define _DPRINTK_FLAGS_INCL_STACK (1<<6) #define _DPRINTK_FLAGS_INCL_ANY \ (_DPRINTK_FLAGS_INCL_MODNAME | _DPRINTK_FLAGS_INCL_FUNCNAME |\ _DPRINTK_FLAGS_INCL_LINENO | _DPRINTK_FLAGS_INCL_TID |\ - _DPRINTK_FLAGS_INCL_SOURCENAME) + _DPRINTK_FLAGS_INCL_SOURCENAME | _DPRINTK_FLAGS_INCL_STACK) #if defined DEBUG #define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT @@ -160,6 +161,12 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor, const struct ib_device *ibdev, const char *fmt, ...); +#define __dynamic_dump_stack(desc) \ +{ \ + if (desc.flags & _DPRINTK_FLAGS_INCL_STACK) \ + dump_stack(); \ +} + #define DEFINE_DYNAMIC_DEBUG_METADATA_CLS(name, cls, fmt) \ static struct _ddebug __aligned(8) \ __section("__dyndbg") name = { \ @@ -220,8 +227,10 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor, */ #define __dynamic_func_call_cls(id, cls, fmt, func, ...) do { \ DEFINE_DYNAMIC_DEBUG_METADATA_CLS(id, cls, fmt); \ - if (DYNAMIC_DEBUG_BRANCH(id)) \ + if (DYNAMIC_DEBUG_BRANCH(id)) { \ func(&id, ##__VA_ARGS__); \ + __dynamic_dump_stack(id); \ + } \ } while (0) #define __dynamic_func_call(id, fmt, func, ...) \ __dynamic_func_call_cls(id, _DPRINTK_CLASS_DFLT, fmt, \ @@ -229,8 +238,10 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor, #define __dynamic_func_call_cls_no_desc(id, cls, fmt, func, ...) do { \ DEFINE_DYNAMIC_DEBUG_METADATA_CLS(id, cls, fmt); \ - if (DYNAMIC_DEBUG_BRANCH(id)) \ + if (DYNAMIC_DEBUG_BRANCH(id)) { \ func(__VA_ARGS__); \ + __dynamic_dump_stack(id); \ + } \ } while (0) #define __dynamic_func_call_no_desc(id, fmt, func, ...) \ __dynamic_func_call_cls_no_desc(id, _DPRINTK_CLASS_DFLT, \ -- cgit From a0b8c6af29a4be3ca2ff9a95cf71e54db5d73e65 Mon Sep 17 00:00:00 2001 From: "Dr. David Alan Gilbert" Date: Fri, 24 Oct 2025 21:51:20 +0100 Subject: lib/xxhash: remove more unused xxh functions xxh32_reset() and xxh32_copy_state() are unused, and with those gone, the xxh32_state struct is also unused. xxh64_copy_state() is also unused. Remove them all. (Also fixes a comment above the xxh64_state that referred to it as xxh32_state). Link: https://lkml.kernel.org/r/20251024205120.454508-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert Suggested-by: Christoph Hellwig Reviewed-by: Kuan-Wei Chiu Reviewed-by: Christoph Hellwig Signed-off-by: Andrew Morton --- include/linux/xxhash.h | 46 +--------------------------------------------- 1 file changed, 1 insertion(+), 45 deletions(-) (limited to 'include/linux') diff --git a/include/linux/xxhash.h b/include/linux/xxhash.h index 27f57eca8cb1..587122e2c29c 100644 --- a/include/linux/xxhash.h +++ b/include/linux/xxhash.h @@ -141,21 +141,7 @@ static inline unsigned long xxhash(const void *input, size_t length, */ /** - * struct xxh32_state - private xxh32 state, do not use members directly - */ -struct xxh32_state { - uint32_t total_len_32; - uint32_t large_len; - uint32_t v1; - uint32_t v2; - uint32_t v3; - uint32_t v4; - uint32_t mem32[4]; - uint32_t memsize; -}; - -/** - * struct xxh32_state - private xxh64 state, do not use members directly + * struct xxh64_state - private xxh64 state, do not use members directly */ struct xxh64_state { uint64_t total_len; @@ -167,16 +153,6 @@ struct xxh64_state { uint32_t memsize; }; -/** - * xxh32_reset() - reset the xxh32 state to start a new hashing operation - * - * @state: The xxh32 state to reset. - * @seed: Initialize the hash state with this seed. - * - * Call this function on any xxh32_state to prepare for a new hashing operation. - */ -void xxh32_reset(struct xxh32_state *state, uint32_t seed); - /** * xxh64_reset() - reset the xxh64 state to start a new hashing operation * @@ -210,24 +186,4 @@ int xxh64_update(struct xxh64_state *state, const void *input, size_t length); */ uint64_t xxh64_digest(const struct xxh64_state *state); -/*-************************** - * Utils - ***************************/ - -/** - * xxh32_copy_state() - copy the source state into the destination state - * - * @src: The source xxh32 state. - * @dst: The destination xxh32 state. - */ -void xxh32_copy_state(struct xxh32_state *dst, const struct xxh32_state *src); - -/** - * xxh64_copy_state() - copy the source state into the destination state - * - * @src: The source xxh64 state. - * @dst: The destination xxh64 state. - */ -void xxh64_copy_state(struct xxh64_state *dst, const struct xxh64_state *src); - #endif /* XXHASH_H */ -- cgit From 760fc597c33d5a727507c8bb19d6ab87a8c5885b Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 30 Oct 2025 12:44:18 +0100 Subject: panic: sys_info: align constant definition names with parameters Align constant definition names with parameters to make it easier to map. It's also better to maintain and extend the names while keeping their uniqueness. Link: https://lkml.kernel.org/r/20251030132007.3742368-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Reviewed-by: Feng Tang Reviewed-by: Petr Mladek Signed-off-by: Andrew Morton --- include/linux/sys_info.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/sys_info.h b/include/linux/sys_info.h index 89d77dc4f2ed..a5bc3ea3d44b 100644 --- a/include/linux/sys_info.h +++ b/include/linux/sys_info.h @@ -14,7 +14,7 @@ #define SYS_INFO_LOCKS 0x00000008 #define SYS_INFO_FTRACE 0x00000010 #define SYS_INFO_PANIC_CONSOLE_REPLAY 0x00000020 -#define SYS_INFO_ALL_CPU_BT 0x00000040 +#define SYS_INFO_ALL_BT 0x00000040 #define SYS_INFO_BLOCKED_TASKS 0x00000080 void sys_info(unsigned long si_mask); -- cgit From bd97c976419126ee3e9acd4957f6f16a90316643 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Tue, 4 Nov 2025 19:38:34 +0100 Subject: util_macros.h: fix kernel-doc for u64_to_user_ptr() The added documentation to u64_to_user_ptr() misspelled the function name. Fix it. Link: https://lkml.kernel.org/r/20251104183834.1046584-1-andriy.shevchenko@linux.intel.com Fixes: 029c896c4105 ("kernel.h: move PTR_IF() and u64_to_user_ptr() to util_macros.h") Signed-off-by: Andy Shevchenko Cc: Alexandru Ardelean Signed-off-by: Andrew Morton --- include/linux/util_macros.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/util_macros.h b/include/linux/util_macros.h index 9373962aade9..2eb528058d0d 100644 --- a/include/linux/util_macros.h +++ b/include/linux/util_macros.h @@ -136,10 +136,10 @@ #define PTR_IF(cond, ptr) ((cond) ? (ptr) : NULL) /** - * to_user_ptr - cast a pointer passed as u64 from user space to void __user * + * u64_to_user_ptr - cast a pointer passed as u64 from user space to void __user * * @x: The u64 value from user space, usually via IOCTL * - * to_user_ptr() simply casts a pointer passed as u64 from user space to void + * u64_to_user_ptr() simply casts a pointer passed as u64 from user space to void * __user * correctly. Using this lets us get rid of all the tiresome casts. */ #define u64_to_user_ptr(x) \ -- cgit From 6480241f31f543333ed0c7a209962412461f6e41 Mon Sep 17 00:00:00 2001 From: David Laight Date: Wed, 5 Nov 2025 20:10:30 +0000 Subject: lib: add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The existing mul_u64_u64_div_u64() rounds down, a 'rounding up' variant needs 'divisor - 1' adding in between the multiply and divide so cannot easily be done by a caller. Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d and implement the 'round down' and 'round up' using it. Update the x86-64 asm to optimise for 'c' being a constant zero. Add kerndoc definitions for all three functions. Link: https://lkml.kernel.org/r/20251105201035.64043-5-david.laight.linux@gmail.com Signed-off-by: David Laight Reviewed-by: Nicolas Pitre Cc: Biju Das Cc: Borislav Betkov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jens Axboe Cc: Li RongQing Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Thomas Gleinxer Cc: Uwe Kleine-König Signed-off-by: Andrew Morton --- include/linux/math64.h | 48 +++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 47 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/math64.h b/include/linux/math64.h index 6aaccc1626ab..e889d850b7f1 100644 --- a/include/linux/math64.h +++ b/include/linux/math64.h @@ -282,7 +282,53 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor) } #endif /* mul_u64_u32_div */ -u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div); +/** + * mul_u64_add_u64_div_u64 - unsigned 64bit multiply, add, and divide + * @a: first unsigned 64bit multiplicand + * @b: second unsigned 64bit multiplicand + * @c: unsigned 64bit addend + * @d: unsigned 64bit divisor + * + * Multiply two 64bit values together to generate a 128bit product + * add a third value and then divide by a fourth. + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow. + * Architecture specific code may trap on zero or overflow. + * + * Return: (@a * @b + @c) / @d + */ +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d); + +/** + * mul_u64_u64_div_u64 - unsigned 64bit multiply and divide + * @a: first unsigned 64bit multiplicand + * @b: second unsigned 64bit multiplicand + * @d: unsigned 64bit divisor + * + * Multiply two 64bit values together to generate a 128bit product + * and then divide by a third value. + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow. + * Architecture specific code may trap on zero or overflow. + * + * Return: @a * @b / @d + */ +#define mul_u64_u64_div_u64(a, b, d) mul_u64_add_u64_div_u64(a, b, 0, d) + +/** + * mul_u64_u64_div_u64_roundup - unsigned 64bit multiply and divide rounded up + * @a: first unsigned 64bit multiplicand + * @b: second unsigned 64bit multiplicand + * @d: unsigned 64bit divisor + * + * Multiply two 64bit values together to generate a 128bit product + * and then divide and round up. + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow. + * Architecture specific code may trap on zero or overflow. + * + * Return: (@a * @b + @d - 1) / @d + */ +#define mul_u64_u64_div_u64_roundup(a, b, d) \ + ({ u64 _tmp = (d); mul_u64_add_u64_div_u64(a, b, _tmp - 1, _tmp); }) + /** * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up -- cgit From 630f96a687def5616d6fa7f069adcea158320909 Mon Sep 17 00:00:00 2001 From: David Laight Date: Wed, 5 Nov 2025 20:10:33 +0000 Subject: lib: mul_u64_u64_div_u64(): optimise multiply on 32bit x86 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit gcc generates horrid code for both ((u64)u32_a * u32_b) and (u64_a + u32_b). As well as the extra instructions it can generate a lot of spills to stack (including spills of constant zeros and even multiplies by constant zero). mul_u32_u32() already exists to optimise the multiply. Add a similar add_u64_32() for the addition. Disable both for clang - it generates better code without them. Move the 64x64 => 128 multiply into a static inline helper function for code clarity. No need for the a/b_hi/lo variables, the implicit casts on the function calls do the work for us. Should have minimal effect on the generated code. Use mul_u32_u32() and add_u64_u32() in the 64x64 => 128 multiply in mul_u64_add_u64_div_u64(). Link: https://lkml.kernel.org/r/20251105201035.64043-8-david.laight.linux@gmail.com Signed-off-by: David Laight Reviewed-by: Nicolas Pitre Cc: Biju Das Cc: Borislav Betkov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jens Axboe Cc: Li RongQing Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Thomas Gleinxer Cc: Uwe Kleine-König Signed-off-by: Andrew Morton --- include/linux/math64.h | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'include/linux') diff --git a/include/linux/math64.h b/include/linux/math64.h index e889d850b7f1..cc305206d89f 100644 --- a/include/linux/math64.h +++ b/include/linux/math64.h @@ -158,6 +158,17 @@ static inline u64 mul_u32_u32(u32 a, u32 b) } #endif +#ifndef add_u64_u32 +/* + * Many a GCC version also messes this up. + * Zero extending b and then spilling everything to stack. + */ +static inline u64 add_u64_u32(u64 a, u32 b) +{ + return a + b; +} +#endif + #if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__) #ifndef mul_u64_u32_shr -- cgit From f3fb126fdc9e148da38a6e25d7fc609774a99fc3 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 6 Nov 2025 16:20:51 +0100 Subject: math.h: amend abs() kernel-doc and add a note about signed type limits - amend the kernel-doc so the description is decoupled from the parameter descriptions. - add a note to explain behaviour for the signed types when supplied value is the minimum (e.g., INT_MIN for int type). Link: https://lkml.kernel.org/r/20251106152051.2361551-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Jonathan Cameron Signed-off-by: Andrew Morton --- include/linux/math.h | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/math.h b/include/linux/math.h index 0198c92cbe3e..6dc1d1d32fbc 100644 --- a/include/linux/math.h +++ b/include/linux/math.h @@ -148,11 +148,16 @@ __STRUCT_FRACT(u32) /** * abs - return absolute value of an argument - * @x: the value. If it is unsigned type, it is converted to signed type first. - * char is treated as if it was signed (regardless of whether it really is) - * but the macro's return type is preserved as char. + * @x: the value. * - * Return: an absolute value of x. + * If it is unsigned type, @x is converted to signed type first. + * char is treated as if it was signed (regardless of whether it really is) + * but the macro's return type is preserved as char. + * + * NOTE, for signed type if @x is the minimum, the returned result is undefined + * as there is not enough bits to represent it as a positive number. + * + * Return: an absolute value of @x. */ #define abs(x) __abs_choose_expr(x, long long, \ __abs_choose_expr(x, long, \ -- cgit From 242b872239f6a7deacbc20ab9406ea40cb738ec6 Mon Sep 17 00:00:00 2001 From: Xie Yuanbin Date: Sun, 9 Nov 2025 16:37:15 +0800 Subject: include/linux/once_lite.h: fix judgment in WARN_ONCE with clang For c code: ```c extern int xx; void test(void) { if (WARN_ONCE(xx, "x")) __asm__ volatile ("nop":::); } ``` Clang will generate the following assembly code: ```assemble test: movl xx(%rip), %eax // Assume xx == 0 (likely case) testl %eax, %eax // judge once je .LBB0_3 // jump to .LBB0_3 testb $1, test.__already_done(%rip) je .LBB0_2 .LBB0_3: testl %eax, %eax // judge again je .LBB0_5 // jump to .LBB0_5 .LBB0_4: nop .LBB0_5: retq // omit ``` In the above code, `xx == 0` should be a likely case, but in this case, xx has been judged twice. Test info: 1. kernel source: linux-next commit 9c0826a5d9aa4d52206d ("Add linux-next specific files for 20251107") 2. compiler: clang: Debian clang version 21.1.4 (8) with Debian LLD 21.1.4 (compatible with GNU linkers) 3. config: base on default x86_64_defconfig, and setting: CONFIG_MITIGATION_RETHUNK=n CONFIG_STACKPROTECTOR=n Add unlikely to __ret_cond to help the compiler optimize correctly. [akpm@linux-foundation.org: undo whitespace changes] Link: https://lkml.kernel.org/r/20251109083715.24495-1-qq570070308@gmail.com Signed-off-by: Xie Yuanbin Cc: Bill Wendling Cc: Jan Kara Cc: Justin Stitt Cc: Maninder Singh Cc: Masahiro Yamada Cc: Nathan Chancellor Cc: Will Deacon Signed-off-by: Andrew Morton --- include/linux/once_lite.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/once_lite.h b/include/linux/once_lite.h index 27de7bc32a06..236592c4eeb1 100644 --- a/include/linux/once_lite.h +++ b/include/linux/once_lite.h @@ -16,7 +16,7 @@ bool __ret_cond = !!(condition); \ bool __ret_once = false; \ \ - if (unlikely(__ret_cond && !__already_done)) { \ + if (unlikely(__ret_cond) && unlikely(!__already_done)) {\ __already_done = true; \ __ret_once = true; \ } \ -- cgit From f1e2ca801c54dfc09d6a5540207cec25e8d43f6f Mon Sep 17 00:00:00 2001 From: Kuan-Wei Chiu Date: Fri, 14 Nov 2025 14:00:45 +0800 Subject: lib/base64: add support for multiple variants Patch series " lib/base64: add generic encoder/decoder, migrate users", v5. This series introduces a generic Base64 encoder/decoder to the kernel library, eliminating duplicated implementations and delivering significant performance improvements. The Base64 API has been extended to support multiple variants (Standard, URL-safe, and IMAP) as defined in RFC 4648 and RFC 3501. The API now takes a variant parameter and an option to control padding. As part of this series, users are migrated to the new interface while preserving their specific formats: fscrypt now uses BASE64_URLSAFE, Ceph uses BASE64_IMAP, and NVMe is updated to BASE64_STD. On the encoder side, the implementation processes input in 3-byte blocks, mapping 24 bits directly to 4 output symbols. This avoids bit-by-bit streaming and reduces loop overhead, achieving about a 2.7x speedup compared to previous implementations. On the decoder side, replace strchr() lookups with per-variant reverse tables and process input in 4-character groups. Each group is mapped to numeric values and combined into 3 bytes. Padded and unpadded forms are validated explicitly, rejecting invalid '=' usage and enforcing tail rules. This improves throughput by ~43-52x. This patch (of 6): Extend the base64 API to support multiple variants (standard, URL-safe, and IMAP) as defined in RFC 4648 and RFC 3501. The API now takes a variant parameter and an option to control padding. Update NVMe auth code to use the new interface with BASE64_STD. Link: https://lkml.kernel.org/r/20251114055829.87814-1-409411716@gms.tku.edu.tw Link: https://lkml.kernel.org/r/20251114060045.88792-1-409411716@gms.tku.edu.tw Signed-off-by: Kuan-Wei Chiu Co-developed-by: Guan-Chun Wu <409411716@gms.tku.edu.tw> Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw> Reviewed-by: David Laight Cc: Christoph Hellwig Cc: Eric Biggers Cc: Ilya Dryomov Cc: Jaegeuk Kim Cc: Jens Axboe Cc: Keith Busch Cc: Sagi Grimberg Cc: "Theodore Y. Ts'o" Cc: Viacheslav Dubeyko Cc: Xiubo Li Cc: Yu-Sheng Huang Signed-off-by: Andrew Morton --- include/linux/base64.h | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/base64.h b/include/linux/base64.h index 660d4cb1ef31..a2c6c9222da3 100644 --- a/include/linux/base64.h +++ b/include/linux/base64.h @@ -8,9 +8,15 @@ #include +enum base64_variant { + BASE64_STD, /* RFC 4648 (standard) */ + BASE64_URLSAFE, /* RFC 4648 (base64url) */ + BASE64_IMAP, /* RFC 3501 */ +}; + #define BASE64_CHARS(nbytes) DIV_ROUND_UP((nbytes) * 4, 3) -int base64_encode(const u8 *src, int len, char *dst); -int base64_decode(const char *src, int len, u8 *dst); +int base64_encode(const u8 *src, int len, char *dst, bool padding, enum base64_variant variant); +int base64_decode(const char *src, int len, u8 *dst, bool padding, enum base64_variant variant); #endif /* _LINUX_BASE64_H */ -- cgit From 9031b852c97f1db52180878aed66ca08946eca93 Mon Sep 17 00:00:00 2001 From: Alice Ryhl Date: Tue, 18 Nov 2025 17:32:50 +0000 Subject: uaccess: gate _copy_[to|from]_user on !INLINE_COPY_FROM_USER MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit These methods only exist when INLINE_COPY_FROM_USER is disabled, so update the header file to reflect that. This fixes the following error on builds that enable both RUST and INLINE_COPY_FROM_USER. ERROR: modpost: "_copy_from_user" [samples/rust/rust_misc_device.ko] undefined! ERROR: modpost: "_copy_to_user" [samples/rust/rust_misc_device.ko] undefined! This error is triggered because when a method is available both as a rust_helper_* and normal method, Rust will call the normal method. [akpm@linux-foundation.org: s/INLINE_COPY_FROM_USER/INLINE_COPY_TO_USER/, per Alice] Link: https://lkml.kernel.org/r/20251118173250.2821388-1-aliceryhl@google.com Fixes: d99dc586ca7c ("uaccess: decouple INLINE_COPY_FROM_USER and CONFIG_RUST") Signed-off-by: Alice Ryhl Cc: Alex Gaynor Cc: Andreas Hindborg Cc: Arnd Bergmann Cc: Björn Roy Baron Cc: Boqun Feng Cc: Danilo Krummrich Cc: Gary Guo Cc: John Hubbard Cc: Miguel Ojeda Cc: Trevor Gross Cc: Yury Norov (NVIDIA) Signed-off-by: Andrew Morton --- include/linux/uaccess.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include/linux') diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 01cbd7dd0ba3..5594012160da 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -181,8 +181,10 @@ fail: memset(to + (n - res), 0, res); return res; } +#ifndef INLINE_COPY_FROM_USER extern __must_check unsigned long _copy_from_user(void *, const void __user *, unsigned long); +#endif static inline __must_check unsigned long _inline_copy_to_user(void __user *to, const void *from, unsigned long n) @@ -196,8 +198,10 @@ _inline_copy_to_user(void __user *to, const void *from, unsigned long n) } return n; } +#ifndef INLINE_COPY_TO_USER extern __must_check unsigned long _copy_to_user(void __user *, const void *, unsigned long); +#endif static __always_inline unsigned long __must_check copy_from_user(void *to, const void __user *from, unsigned long n) -- cgit From c2d2dad24503d7e2eb7cba354fcc73f95fa78d7a Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 14 Nov 2025 14:06:45 +0000 Subject: rbtree: inline rb_first() Patch series "rbree: inline rb_first() and rb_last()". Inline these two small helpers, heavily used in TCP and FQ packet scheduler, and in many other places. This reduces kernel text size, and brings an 1.5 % improvement on network TCP stress test. This patch (of 2): This is a very small function, inlining it saves cpu cycles by reducing register pressure and removing call/ret overhead. It also reduces vmlinux text size by 744 bytes on a typical x86_64 build. Before: size vmlinux text data bss dec hex filename 34812525 22177365 5685248 62675138 3bc58c2 vmlinux After: size vmlinux text data bss dec hex filename 34811781 22177365 5685248 62674394 3bc55da vmlinux [ojeda@kernel.org: fix rust build] Link: https://lkml.kernel.org/r/20251120085518.1463498-1-ojeda@kernel.org Link: https://lkml.kernel.org/r/20251114140646.3817319-1-edumazet@google.com Link: https://lkml.kernel.org/r/20251114140646.3817319-2-edumazet@google.com Signed-off-by: Eric Dumazet Signed-off-by: Miguel Ojeda Reviewed-by: Kuan-Wei Chiu Cc: Jakub Kacinski Cc: Neal Cardwell Cc: Paolo Abeni Cc: Alice Ryhl Cc: Stehen Rothwell Signed-off-by: Andrew Morton --- include/linux/rbtree.h | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 8d2ba3749866..484554900f7d 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -43,7 +43,21 @@ extern void rb_erase(struct rb_node *, struct rb_root *); /* Find logical next and previous nodes in a tree */ extern struct rb_node *rb_next(const struct rb_node *); extern struct rb_node *rb_prev(const struct rb_node *); -extern struct rb_node *rb_first(const struct rb_root *); + +/* + * This function returns the first node (in sort order) of the tree. + */ +static inline struct rb_node *rb_first(const struct rb_root *root) +{ + struct rb_node *n; + + n = root->rb_node; + if (!n) + return NULL; + while (n->rb_left) + n = n->rb_left; + return n; +} extern struct rb_node *rb_last(const struct rb_root *); /* Postorder iteration - always visit the parent after its children */ -- cgit From 94984bfed58ca129f7e259ce09973ed0b3f540a8 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 14 Nov 2025 14:06:46 +0000 Subject: rbtree: inline rb_last() This is a very small function, inlining it saves cpu cycles in TCP by reducing register pressure and removing call/ret overhead. It also reduces vmlinux text size by 122 bytes on a typical x86_64 build. Before: size vmlinux text data bss dec hex filename 34811781 22177365 5685248 62674394 3bc55da vmlinux After: size vmlinux text data bss dec hex filename 34811659 22177365 5685248 62674272 3bc5560 vmlinux [ojeda@kernel.org: fix rust build] Link: https://lkml.kernel.org/r/20251120085518.1463498-1-ojeda@kernel.org Link: https://lkml.kernel.org/r/20251114140646.3817319-3-edumazet@google.com Signed-off-by: Eric Dumazet Signed-off-by: Miguel Ojeda Reviewed-by: Kuan-Wei Chiu Cc: Jakub Kacinski Cc: Neal Cardwell Cc: Paolo Abeni Cc: Alice Ryhl Cc: Stehen Rothwell Signed-off-by: Andrew Morton --- include/linux/rbtree.h | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 484554900f7d..4091e978aef2 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -58,7 +58,21 @@ static inline struct rb_node *rb_first(const struct rb_root *root) n = n->rb_left; return n; } -extern struct rb_node *rb_last(const struct rb_root *); + +/* + * This function returns the last node (in sort order) of the tree. + */ +static inline struct rb_node *rb_last(const struct rb_root *root) +{ + struct rb_node *n; + + n = root->rb_node; + if (!n) + return NULL; + while (n->rb_right) + n = n->rb_right; + return n; +} /* Postorder iteration - always visit the parent after its children */ extern struct rb_node *rb_first_postorder(const struct rb_root *); -- cgit From 70f9133096c833922c3b63461480248cefa7bb0f Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Sat, 1 Nov 2025 10:23:18 -0400 Subject: kho: drop notifiers The KHO framework uses a notifier chain as the mechanism for clients to participate in the finalization process. While this works for a single, central state machine, it is too restrictive for kernel-internal components like pstore/reserve_mem or IMA. These components need a simpler, direct way to register their state for preservation (e.g., during their initcall) without being part of a complex, shutdown-time notifier sequence. The notifier model forces all participants into a single finalization flow and makes direct preservation from an arbitrary context difficult. This patch refactors the client participation model by removing the notifier chain and introducing a direct API for managing FDT subtrees. The core kho_finalize() and kho_abort() state machine remains, but clients now register their data with KHO beforehand. Link: https://lkml.kernel.org/r/20251101142325.1326536-3-pasha.tatashin@soleen.com Signed-off-by: Mike Rapoport (Microsoft) Co-developed-by: Pasha Tatashin Signed-off-by: Pasha Tatashin Cc: Alexander Graf Cc: Changyuan Lyu Cc: Christian Brauner Cc: Jason Gunthorpe Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Masahiro Yamada Cc: Miguel Ojeda Cc: Pratyush Yadav Cc: Randy Dunlap Cc: Simon Horman Cc: Tejun Heo Cc: Zhu Yanjun Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 28 +++++----------------------- 1 file changed, 5 insertions(+), 23 deletions(-) (limited to 'include/linux') diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 25042c1d8d54..0d860d793b66 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -10,14 +10,7 @@ struct kho_scratch { phys_addr_t size; }; -/* KHO Notifier index */ -enum kho_event { - KEXEC_KHO_FINALIZE = 0, - KEXEC_KHO_ABORT = 1, -}; - struct folio; -struct notifier_block; struct page; #define DECLARE_KHOSER_PTR(name, type) \ @@ -37,8 +30,6 @@ struct page; (typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \ }) -struct kho_serialization; - struct kho_vmalloc_chunk; struct kho_vmalloc { DECLARE_KHOSER_PTR(first, struct kho_vmalloc_chunk *); @@ -57,12 +48,10 @@ int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); struct folio *kho_restore_folio(phys_addr_t phys); struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); -int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt); +int kho_add_subtree(const char *name, void *fdt); +void kho_remove_subtree(void *fdt); int kho_retrieve_subtree(const char *name, phys_addr_t *phys); -int register_kho_notifier(struct notifier_block *nb); -int unregister_kho_notifier(struct notifier_block *nb); - void kho_memory_init(void); void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, @@ -110,23 +99,16 @@ static inline void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) return NULL; } -static inline int kho_add_subtree(struct kho_serialization *ser, - const char *name, void *fdt) +static inline int kho_add_subtree(const char *name, void *fdt) { return -EOPNOTSUPP; } -static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) +static inline void kho_remove_subtree(void *fdt) { - return -EOPNOTSUPP; } -static inline int register_kho_notifier(struct notifier_block *nb) -{ - return -EOPNOTSUPP; -} - -static inline int unregister_kho_notifier(struct notifier_block *nb) +static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) { return -EOPNOTSUPP; } -- cgit From 36f8f7ef7fd2f238922e9d217e86c69838319d8c Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Sat, 1 Nov 2025 10:23:19 -0400 Subject: kho: add interfaces to unpreserve folios, page ranges, and vmalloc Allow users of KHO to cancel the previous preservation by adding the necessary interfaces to unpreserve folio, pages, and vmallocs. Link: https://lkml.kernel.org/r/20251101142325.1326536-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) Cc: Alexander Graf Cc: Changyuan Lyu Cc: Christian Brauner Cc: Jason Gunthorpe Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Masahiro Yamada Cc: Miguel Ojeda Cc: Randy Dunlap Cc: Simon Horman Cc: Tejun Heo Cc: Zhu Yanjun Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) (limited to 'include/linux') diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 0d860d793b66..80ece4232617 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -43,8 +43,11 @@ bool kho_is_enabled(void); bool is_kho_boot(void); int kho_preserve_folio(struct folio *folio); +int kho_unpreserve_folio(struct folio *folio); int kho_preserve_pages(struct page *page, unsigned int nr_pages); +int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); +int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); struct folio *kho_restore_folio(phys_addr_t phys); struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); @@ -72,17 +75,32 @@ static inline int kho_preserve_folio(struct folio *folio) return -EOPNOTSUPP; } +static inline int kho_unpreserve_folio(struct folio *folio) +{ + return -EOPNOTSUPP; +} + static inline int kho_preserve_pages(struct page *page, unsigned int nr_pages) { return -EOPNOTSUPP; } +static inline int kho_unpreserve_pages(struct page *page, unsigned int nr_pages) +{ + return -EOPNOTSUPP; +} + static inline int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation) { return -EOPNOTSUPP; } +static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) +{ + return -EOPNOTSUPP; +} + static inline struct folio *kho_restore_folio(phys_addr_t phys) { return NULL; -- cgit From 4c205677af2726bd3b51c02ab6a5a2b411efed09 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 14 Nov 2025 13:59:52 -0500 Subject: kho: introduce high-level memory allocation API Currently, clients of KHO must manually allocate memory (e.g., via alloc_pages), calculate the page order, and explicitly call kho_preserve_folio(). Similarly, cleanup requires separate calls to unpreserve and free the memory. Introduce a high-level API to streamline this common pattern: - kho_alloc_preserve(size): Allocates physically contiguous, zeroed memory and immediately marks it for preservation. - kho_unpreserve_free(ptr): Unpreserves and frees the memory in the current kernel. - kho_restore_free(ptr): Restores the struct page state of preserved memory in the new kernel and immediately frees it to the page allocator. [pasha.tatashin@soleen.com: build fixes] Link: https://lkml.kernel.org/r/CA+CK2bBgXDhrHwTVgxrw7YTQ-0=LgW0t66CwPCgG=C85ftz4zw@mail.gmail.com Link: https://lkml.kernel.org/r/20251114190002.3311679-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: Arnd Bergmann Cc: Baoquan He Cc: Coiby Xu Cc: Dave Vasilevsky Cc: Eric Biggers Cc: Kees Cook Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) (limited to 'include/linux') diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 80ece4232617..dde952227b88 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -2,8 +2,9 @@ #ifndef LINUX_KEXEC_HANDOVER_H #define LINUX_KEXEC_HANDOVER_H -#include +#include #include +#include struct kho_scratch { phys_addr_t addr; @@ -48,6 +49,9 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages); int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); +void *kho_alloc_preserve(size_t size); +void kho_unpreserve_free(void *mem); +void kho_restore_free(void *mem); struct folio *kho_restore_folio(phys_addr_t phys); struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); @@ -101,6 +105,14 @@ static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) return -EOPNOTSUPP; } +static inline void *kho_alloc_preserve(size_t size) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +static inline void kho_unpreserve_free(void *mem) { } +static inline void kho_restore_free(void *mem) { } + static inline struct folio *kho_restore_folio(phys_addr_t phys) { return NULL; @@ -122,18 +134,14 @@ static inline int kho_add_subtree(const char *name, void *fdt) return -EOPNOTSUPP; } -static inline void kho_remove_subtree(void *fdt) -{ -} +static inline void kho_remove_subtree(void *fdt) { } static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) { return -EOPNOTSUPP; } -static inline void kho_memory_init(void) -{ -} +static inline void kho_memory_init(void) { } static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, u64 scratch_len) -- cgit From de51999e687c70a41997124b43291f84324c7924 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 14 Nov 2025 14:00:01 -0500 Subject: kho: allow memory preservation state updates after finalization Currently, kho_preserve_* and kho_unpreserve_* return -EBUSY if KHO is finalized. This enforces a rigid "freeze" on the KHO memory state. With the introduction of re-entrant finalization, this restriction is no longer necessary. Users should be allowed to modify the preservation set (e.g., adding new pages or freeing old ones) even after an initial finalization. The intended workflow for updates is now: 1. Modify state (preserve/unpreserve). 2. Call kho_finalize() again to refresh the serialized metadata. Remove the kho_out.finalized checks to enable this dynamic behavior. This also allows to convert kho_unpreserve_* functions to void, as they do not return any error anymore. Link: https://lkml.kernel.org/r/20251114190002.3311679-13-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Cc: Alexander Graf Cc: Arnd Bergmann Cc: Baoquan He Cc: Coiby Xu Cc: Dave Vasilevsky Cc: Eric Biggers Cc: Kees Cook Signed-off-by: Andrew Morton --- include/linux/kexec_handover.h | 21 ++++++--------------- 1 file changed, 6 insertions(+), 15 deletions(-) (limited to 'include/linux') diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index dde952227b88..5f7b9de97e8d 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -44,11 +44,11 @@ bool kho_is_enabled(void); bool is_kho_boot(void); int kho_preserve_folio(struct folio *folio); -int kho_unpreserve_folio(struct folio *folio); +void kho_unpreserve_folio(struct folio *folio); int kho_preserve_pages(struct page *page, unsigned int nr_pages); -int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); +void kho_unpreserve_pages(struct page *page, unsigned int nr_pages); int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); -int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); +void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); void *kho_alloc_preserve(size_t size); void kho_unpreserve_free(void *mem); void kho_restore_free(void *mem); @@ -79,20 +79,14 @@ static inline int kho_preserve_folio(struct folio *folio) return -EOPNOTSUPP; } -static inline int kho_unpreserve_folio(struct folio *folio) -{ - return -EOPNOTSUPP; -} +static inline void kho_unpreserve_folio(struct folio *folio) { } static inline int kho_preserve_pages(struct page *page, unsigned int nr_pages) { return -EOPNOTSUPP; } -static inline int kho_unpreserve_pages(struct page *page, unsigned int nr_pages) -{ - return -EOPNOTSUPP; -} +static inline void kho_unpreserve_pages(struct page *page, unsigned int nr_pages) { } static inline int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation) @@ -100,10 +94,7 @@ static inline int kho_preserve_vmalloc(void *ptr, return -EOPNOTSUPP; } -static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) -{ - return -EOPNOTSUPP; -} +static inline void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { } static inline void *kho_alloc_preserve(size_t size) { -- cgit From 9e2fd062fa1713a33380cc97ef324d086dd45ba5 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Tue, 25 Nov 2025 11:58:31 -0500 Subject: liveupdate: luo_core: Live Update Orchestrator MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "Live Update Orchestrator", v8. This series introduces the Live Update Orchestrator, a kernel subsystem designed to facilitate live kernel updates using a kexec-based reboot. This capability is critical for cloud environments, allowing hypervisors to be updated with minimal downtime for running virtual machines. LUO achieves this by preserving the state of selected resources, such as memory, devices and their dependencies, across the kernel transition. As a key feature, this series includes support for preserving memfd file descriptors, which allows critical in-memory data, such as guest RAM or any other large memory region, to be maintained in RAM across the kexec reboot. The other series that use LUO, are VFIO [1], IOMMU [2], and PCI [3] preservations. Github repo of this series [4]. The core of LUO is a framework for managing the lifecycle of preserved resources through a userspace-driven interface. Key features include: - Session Management Userspace agent (i.e. luod [5]) creates named sessions, each represented by a file descriptor (via centralized agent that controls /dev/liveupdate). The lifecycle of all preserved resources within a session is tied to this FD, ensuring automatic kernel cleanup if the controlling userspace agent crashes or exits unexpectedly. - File Preservation A handler-based framework allows specific file types (demonstrated here with memfd) to be preserved. Handlers manage the serialization, restoration, and lifecycle of their specific file types. - File-Lifecycle-Bound State A new mechanism for managing shared global state whose lifecycle is tied to the preservation of one or more files. This is crucial for subsystems like IOMMU or HugeTLB, where multiple file descriptors may depend on a single, shared underlying resource that must be preserved only once. - KHO Integration LUO drives the Kexec Handover framework programmatically to pass its serialized metadata to the next kernel. The LUO state is finalized and added to the kexec image just before the reboot is triggered. In the future this step will also be removed once stateless KHO is merged [6]. - Userspace Interface Control is provided via ioctl commands on /dev/liveupdate for creating and retrieving sessions, as well as on session file descriptors for managing individual files. - Testing The series includes a set of selftests, including userspace API validation, kexec-based lifecycle tests for various session and file scenarios, and a new in-kernel test module to validate the FLB logic. Introduce LUO, a mechanism intended to facilitate kernel updates while keeping designated devices operational across the transition (e.g., via kexec). The primary use case is updating hypervisors with minimal disruption to running virtual machines. For userspace side of hypervisor update we have copyless migration. LUO is for updating the kernel. This initial patch lays the groundwork for the LUO subsystem. Further functionality, including the implementation of state transition logic, integration with KHO, and hooks for subsystems and file descriptors, will be added in subsequent patches. Create a character device at /dev/liveupdate. A new uAPI header, , will define the necessary structures. The magic number for IOCTL is registered in Documentation/userspace-api/ioctl/ioctl-number.rst. Link: https://lkml.kernel.org/r/20251125165850.3389713-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20251125165850.3389713-2-pasha.tatashin@soleen.com Link: https://lore.kernel.org/all/20251018000713.677779-1-vipinsh@google.com/ [1] Link: https://lore.kernel.org/linux-iommu/20250928190624.3735830-1-skhawaja@google.com [2] Link: https://lore.kernel.org/linux-pci/20250916-luo-pci-v2-0-c494053c3c08@kernel.org [3] Link: https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v8 [4] Link: https://tinyurl.com/luoddesign [5] Link: https://lore.kernel.org/all/20251020100306.2709352-1-jasonmiu@google.com [6] Link: https://lore.kernel.org/all/20251115233409.768044-1-pasha.tatashin@soleen.com [7] Link: https://github.com/soleen/linux/blob/luo/v8b03/diff.v7.v8 [8] Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) Tested-by: David Matlack Cc: Aleksander Lobakin Cc: Alexander Graf Cc: Alice Ryhl Cc: Andriy Shevchenko Cc: anish kumar Cc: Anna Schumaker Cc: Bartosz Golaszewski Cc: Bjorn Helgaas Cc: Borislav Betkov Cc: Chanwoo Choi Cc: Chen Ridong Cc: Chris Li Cc: Christian Brauner Cc: Daniel Wagner Cc: Danilo Krummrich Cc: Dan Williams Cc: David Hildenbrand Cc: David Jeffery Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Guixin Liu Cc: "H. Peter Anvin" Cc: Hugh Dickins Cc: Ilpo Järvinen Cc: Ingo Molnar Cc: Ira Weiny Cc: Jann Horn Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Joanthan Cameron Cc: Joel Granados Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Lennart Poettering Cc: Leon Romanovsky Cc: Leon Romanovsky Cc: Lukas Wunner Cc: Marc Rutland Cc: Masahiro Yamada Cc: Matthew Maurer Cc: Miguel Ojeda Cc: Myugnjoo Ham Cc: Parav Pandit Cc: Randy Dunlap Cc: Roman Gushchin Cc: Saeed Mahameed Cc: Samiullah Khawaja Cc: Song Liu Cc: Steven Rostedt Cc: Stuart Hayes Cc: Tejun Heo Cc: Thomas Gleinxer Cc: Thomas Weißschuh Cc: Vincent Guittot Cc: William Tu Cc: Yoann Congal Cc: Zijun Hu Cc: Pratyush Yadav Cc: Zhu Yanjun Signed-off-by: Andrew Morton --- include/linux/liveupdate.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) create mode 100644 include/linux/liveupdate.h (limited to 'include/linux') diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h new file mode 100644 index 000000000000..c6a1d6bd90cb --- /dev/null +++ b/include/linux/liveupdate.h @@ -0,0 +1,35 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* + * Copyright (c) 2025, Google LLC. + * Pasha Tatashin + */ +#ifndef _LINUX_LIVEUPDATE_H +#define _LINUX_LIVEUPDATE_H + +#include +#include +#include + +#ifdef CONFIG_LIVEUPDATE + +/* Return true if live update orchestrator is enabled */ +bool liveupdate_enabled(void); + +/* Called during kexec to tell LUO that entered into reboot */ +int liveupdate_reboot(void); + +#else /* CONFIG_LIVEUPDATE */ + +static inline bool liveupdate_enabled(void) +{ + return false; +} + +static inline int liveupdate_reboot(void) +{ + return 0; +} + +#endif /* CONFIG_LIVEUPDATE */ +#endif /* _LINUX_LIVEUPDATE_H */ -- cgit From 1aece821004f67f46ef4db7199bbeca87cf22bdd Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Tue, 25 Nov 2025 11:58:32 -0500 Subject: liveupdate: luo_core: integrate with KHO MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Integrate the LUO with the KHO framework to enable passing LUO state across a kexec reboot. This patch implements the lifecycle integration with KHO: 1. Incoming State: During early boot (`early_initcall`), LUO checks if KHO is active. If so, it retrieves the "LUO" subtree, verifies the "luo-v1" compatibility string, and reads the `liveupdate-number` to track the update count. 2. Outgoing State: During late initialization (`late_initcall`), LUO allocates a new FDT for the next kernel, populates it with the basic header (compatible string and incremented update number), and registers it with KHO (`kho_add_subtree`). 3. Finalization: The `liveupdate_reboot()` notifier is updated to invoke `kho_finalize()`. This ensures that all memory segments marked for preservation are properly serialized before the kexec jump. Link: https://lkml.kernel.org/r/20251125165850.3389713-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Tested-by: David Matlack Reviewed-by: Mike Rapoport (Microsoft) Cc: Aleksander Lobakin Cc: Alexander Graf Cc: Alice Ryhl Cc: Andriy Shevchenko Cc: anish kumar Cc: Anna Schumaker Cc: Bartosz Golaszewski Cc: Bjorn Helgaas Cc: Borislav Betkov Cc: Chanwoo Choi Cc: Chen Ridong Cc: Chris Li Cc: Christian Brauner Cc: Daniel Wagner Cc: Danilo Krummrich Cc: Dan Williams Cc: David Hildenbrand Cc: David Jeffery Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Guixin Liu Cc: "H. Peter Anvin" Cc: Hugh Dickins Cc: Ilpo Järvinen Cc: Ingo Molnar Cc: Ira Weiny Cc: Jann Horn Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Joanthan Cameron Cc: Joel Granados Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Lennart Poettering Cc: Leon Romanovsky Cc: Leon Romanovsky Cc: Lukas Wunner Cc: Marc Rutland Cc: Masahiro Yamada Cc: Matthew Maurer Cc: Miguel Ojeda Cc: Myugnjoo Ham Cc: Parav Pandit Cc: Pratyush Yadav Cc: Randy Dunlap Cc: Roman Gushchin Cc: Saeed Mahameed Cc: Samiullah Khawaja Cc: Song Liu Cc: Steven Rostedt Cc: Stuart Hayes Cc: Tejun Heo Cc: Thomas Gleinxer Cc: Thomas Weißschuh Cc: Vincent Guittot Cc: William Tu Cc: Yoann Congal Cc: Zhu Yanjun Cc: Zijun Hu Signed-off-by: Andrew Morton --- include/linux/kho/abi/luo.h | 58 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 58 insertions(+) create mode 100644 include/linux/kho/abi/luo.h (limited to 'include/linux') diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h new file mode 100644 index 000000000000..2099b51929e5 --- /dev/null +++ b/include/linux/kho/abi/luo.h @@ -0,0 +1,58 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* + * Copyright (c) 2025, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: Live Update Orchestrator ABI + * + * This header defines the stable Application Binary Interface used by the + * Live Update Orchestrator to pass state from a pre-update kernel to a + * post-update kernel. The ABI is built upon the Kexec HandOver framework + * and uses a Flattened Device Tree to describe the preserved data. + * + * This interface is a contract. Any modification to the FDT structure, node + * properties, compatible strings, or the layout of the `__packed` serialization + * structures defined here constitutes a breaking change. Such changes require + * incrementing the version number in the relevant `_COMPATIBLE` string to + * prevent a new kernel from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented; + * however, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. + * + * FDT Structure Overview: + * The entire LUO state is encapsulated within a single KHO entry named "LUO". + * This entry contains an FDT with the following layout: + * + * .. code-block:: none + * + * / { + * compatible = "luo-v1"; + * liveupdate-number = <...>; + * }; + * + * Main LUO Node (/): + * + * - compatible: "luo-v1" + * Identifies the overall LUO ABI version. + * - liveupdate-number: u64 + * A counter tracking the number of successful live updates performed. + */ + +#ifndef _LINUX_KHO_ABI_LUO_H +#define _LINUX_KHO_ABI_LUO_H + +/* + * The LUO FDT hooks all LUO state for sessions, fds, etc. + * In the root it also carries "liveupdate-number" 64-bit property that + * corresponds to the number of live-updates performed on this machine. + */ +#define LUO_FDT_SIZE PAGE_SIZE +#define LUO_FDT_KHO_ENTRY_NAME "LUO" +#define LUO_FDT_COMPATIBLE "luo-v1" +#define LUO_FDT_LIVEUPDATE_NUM "liveupdate-number" + +#endif /* _LINUX_KHO_ABI_LUO_H */ -- cgit From 0153094d03df5a2e834a19c59b255649a258ae46 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Tue, 25 Nov 2025 11:58:34 -0500 Subject: liveupdate: luo_session: add sessions support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce concept of "Live Update Sessions" within the LUO framework. LUO sessions provide a mechanism to group and manage `struct file *` instances (representing file descriptors) that need to be preserved across a kexec-based live update. Each session is identified by a unique name and acts as a container for file objects whose state is critical to a userspace workload, such as a virtual machine or a high-performance database, aiming to maintain their functionality across a kernel transition. This groundwork establishes the framework for preserving file-backed state across kernel updates, with the actual file data preservation mechanisms to be implemented in subsequent patches. [dan.carpenter@linaro.org: fix use after free in luo_session_deserialize()] Link: https://lkml.kernel.org/r/c5dd637d7eed3a3be48c5e9fedb881596a3b1f5a.1764163896.git.dan.carpenter@linaro.org Link: https://lkml.kernel.org/r/20251125165850.3389713-5-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Signed-off-by: Dan Carpenter Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Tested-by: David Matlack Cc: Aleksander Lobakin Cc: Alexander Graf Cc: Alice Ryhl Cc: Andriy Shevchenko Cc: anish kumar Cc: Anna Schumaker Cc: Bartosz Golaszewski Cc: Bjorn Helgaas Cc: Borislav Betkov Cc: Chanwoo Choi Cc: Chen Ridong Cc: Chris Li Cc: Christian Brauner Cc: Daniel Wagner Cc: Danilo Krummrich Cc: Dan Williams Cc: David Hildenbrand Cc: David Jeffery Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Guixin Liu Cc: "H. Peter Anvin" Cc: Hugh Dickins Cc: Ilpo Järvinen Cc: Ingo Molnar Cc: Ira Weiny Cc: Jann Horn Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Joanthan Cameron Cc: Joel Granados Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Lennart Poettering Cc: Leon Romanovsky Cc: Leon Romanovsky Cc: Lukas Wunner Cc: Marc Rutland Cc: Masahiro Yamada Cc: Matthew Maurer Cc: Miguel Ojeda Cc: Myugnjoo Ham Cc: Parav Pandit Cc: Pratyush Yadav Cc: Randy Dunlap Cc: Roman Gushchin Cc: Saeed Mahameed Cc: Samiullah Khawaja Cc: Song Liu Cc: Steven Rostedt Cc: Stuart Hayes Cc: Tejun Heo Cc: Thomas Gleinxer Cc: Thomas Weißschuh Cc: Vincent Guittot Cc: William Tu Cc: Yoann Congal Cc: Zhu Yanjun Cc: Zijun Hu Signed-off-by: Andrew Morton --- include/linux/kho/abi/luo.h | 71 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 71 insertions(+) (limited to 'include/linux') diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 2099b51929e5..bf1ab2910959 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -32,6 +32,11 @@ * / { * compatible = "luo-v1"; * liveupdate-number = <...>; + * + * luo-session { + * compatible = "luo-session-v1"; + * luo-session-header = ; + * }; * }; * * Main LUO Node (/): @@ -40,11 +45,37 @@ * Identifies the overall LUO ABI version. * - liveupdate-number: u64 * A counter tracking the number of successful live updates performed. + * + * Session Node (luo-session): + * This node describes all preserved user-space sessions. + * + * - compatible: "luo-session-v1" + * Identifies the session ABI version. + * - luo-session-header: u64 + * The physical address of a `struct luo_session_header_ser`. This structure + * is the header for a contiguous block of memory containing an array of + * `struct luo_session_ser`, one for each preserved session. + * + * Serialization Structures: + * The FDT properties point to memory regions containing arrays of simple, + * `__packed` structures. These structures contain the actual preserved state. + * + * - struct luo_session_header_ser: + * Header for the session array. Contains the total page count of the + * preserved memory block and the number of `struct luo_session_ser` + * entries that follow. + * + * - struct luo_session_ser: + * Metadata for a single session, including its name and a physical pointer + * to another preserved memory block containing an array of + * `struct luo_file_ser` for all files in that session. */ #ifndef _LINUX_KHO_ABI_LUO_H #define _LINUX_KHO_ABI_LUO_H +#include + /* * The LUO FDT hooks all LUO state for sessions, fds, etc. * In the root it also carries "liveupdate-number" 64-bit property that @@ -55,4 +86,44 @@ #define LUO_FDT_COMPATIBLE "luo-v1" #define LUO_FDT_LIVEUPDATE_NUM "liveupdate-number" +/* + * LUO FDT session node + * LUO_FDT_SESSION_HEADER: is a u64 physical address of struct + * luo_session_header_ser + */ +#define LUO_FDT_SESSION_NODE_NAME "luo-session" +#define LUO_FDT_SESSION_COMPATIBLE "luo-session-v1" +#define LUO_FDT_SESSION_HEADER "luo-session-header" + +/** + * struct luo_session_header_ser - Header for the serialized session data block. + * @count: The number of `struct luo_session_ser` entries that immediately + * follow this header in the memory block. + * + * This structure is located at the beginning of a contiguous block of + * physical memory preserved across the kexec. It provides the necessary + * metadata to interpret the array of session entries that follow. + * + * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + */ +struct luo_session_header_ser { + u64 count; +} __packed; + +/** + * struct luo_session_ser - Represents the serialized metadata for a LUO session. + * @name: The unique name of the session, provided by the userspace at + * the time of session creation. + * + * This structure is used to package session-specific metadata for transfer + * between kernels via Kexec Handover. An array of these structures (one per + * session) is created and passed to the new kernel, allowing it to reconstruct + * the session context. + * + * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + */ +struct luo_session_ser { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; +} __packed; + #endif /* _LINUX_KHO_ABI_LUO_H */ -- cgit From 7c722a7f44e0c1f9714084152226bc7bd644b7e3 Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Tue, 25 Nov 2025 11:58:36 -0500 Subject: liveupdate: luo_file: implement file systems callbacks MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This patch implements the core mechanism for managing preserved files throughout the live update lifecycle. It provides the logic to invoke the file handler callbacks (preserve, unpreserve, freeze, unfreeze, retrieve, and finish) at the appropriate stages. During the reboot phase, luo_file_freeze() serializes the final metadata for each file (handler compatible string, token, and data handle) into a memory region preserved by KHO. In the new kernel, luo_file_deserialize() reconstructs the in-memory file list from this data, preparing the session for retrieval. Link: https://lkml.kernel.org/r/20251125165850.3389713-7-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav Tested-by: David Matlack Cc: Aleksander Lobakin Cc: Alexander Graf Cc: Alice Ryhl Cc: Andriy Shevchenko Cc: anish kumar Cc: Anna Schumaker Cc: Bartosz Golaszewski Cc: Bjorn Helgaas Cc: Borislav Betkov Cc: Chanwoo Choi Cc: Chen Ridong Cc: Chris Li Cc: Christian Brauner Cc: Daniel Wagner Cc: Danilo Krummrich Cc: Dan Williams Cc: David Hildenbrand Cc: David Jeffery Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Guixin Liu Cc: "H. Peter Anvin" Cc: Hugh Dickins Cc: Ilpo Järvinen Cc: Ingo Molnar Cc: Ira Weiny Cc: Jann Horn Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Joanthan Cameron Cc: Joel Granados Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Lennart Poettering Cc: Leon Romanovsky Cc: Leon Romanovsky Cc: Lukas Wunner Cc: Marc Rutland Cc: Masahiro Yamada Cc: Matthew Maurer Cc: Miguel Ojeda Cc: Myugnjoo Ham Cc: Parav Pandit Cc: Pratyush Yadav Cc: Randy Dunlap Cc: Roman Gushchin Cc: Saeed Mahameed Cc: Samiullah Khawaja Cc: Song Liu Cc: Steven Rostedt Cc: Stuart Hayes Cc: Tejun Heo Cc: Thomas Gleinxer Cc: Thomas Weißschuh Cc: Vincent Guittot Cc: William Tu Cc: Yoann Congal Cc: Zhu Yanjun Cc: Zijun Hu Signed-off-by: Andrew Morton --- include/linux/kho/abi/luo.h | 39 +++++++++++++++++- include/linux/liveupdate.h | 98 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 136 insertions(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index bf1ab2910959..bb099c92e469 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -69,6 +69,11 @@ * Metadata for a single session, including its name and a physical pointer * to another preserved memory block containing an array of * `struct luo_file_ser` for all files in that session. + * + * - struct luo_file_ser: + * Metadata for a single preserved file. Contains the `compatible` string to + * find the correct handler in the new kernel, a user-provided `token` for + * identification, and an opaque `data` handle for the handler to use. */ #ifndef _LINUX_KHO_ABI_LUO_H @@ -86,13 +91,43 @@ #define LUO_FDT_COMPATIBLE "luo-v1" #define LUO_FDT_LIVEUPDATE_NUM "liveupdate-number" +#define LIVEUPDATE_HNDL_COMPAT_LENGTH 48 + +/** + * struct luo_file_ser - Represents the serialized preserves files. + * @compatible: File handler compatible string. + * @data: Private data + * @token: User provided token for this file + * + * If this structure is modified, LUO_SESSION_COMPATIBLE must be updated. + */ +struct luo_file_ser { + char compatible[LIVEUPDATE_HNDL_COMPAT_LENGTH]; + u64 data; + u64 token; +} __packed; + +/** + * struct luo_file_set_ser - Represents the serialized metadata for file set + * @files: The physical address of a contiguous memory block that holds + * the serialized state of files (array of luo_file_ser) in this file + * set. + * @count: The total number of files that were part of this session during + * serialization. Used for iteration and validation during + * restoration. + */ +struct luo_file_set_ser { + u64 files; + u64 count; +} __packed; + /* * LUO FDT session node * LUO_FDT_SESSION_HEADER: is a u64 physical address of struct * luo_session_header_ser */ #define LUO_FDT_SESSION_NODE_NAME "luo-session" -#define LUO_FDT_SESSION_COMPATIBLE "luo-session-v1" +#define LUO_FDT_SESSION_COMPATIBLE "luo-session-v2" #define LUO_FDT_SESSION_HEADER "luo-session-header" /** @@ -114,6 +149,7 @@ struct luo_session_header_ser { * struct luo_session_ser - Represents the serialized metadata for a LUO session. * @name: The unique name of the session, provided by the userspace at * the time of session creation. + * @file_set_ser: Serialized files belonging to this session, * * This structure is used to package session-specific metadata for transfer * between kernels via Kexec Handover. An array of these structures (one per @@ -124,6 +160,7 @@ struct luo_session_header_ser { */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + struct luo_file_set_ser file_set_ser; } __packed; #endif /* _LINUX_KHO_ABI_LUO_H */ diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h index c6a1d6bd90cb..122ad8f16ff9 100644 --- a/include/linux/liveupdate.h +++ b/include/linux/liveupdate.h @@ -8,8 +8,93 @@ #define _LINUX_LIVEUPDATE_H #include +#include +#include #include #include +#include + +struct liveupdate_file_handler; +struct file; + +/** + * struct liveupdate_file_op_args - Arguments for file operation callbacks. + * @handler: The file handler being called. + * @retrieved: The retrieve status for the 'can_finish / finish' + * operation. + * @file: The file object. For retrieve: [OUT] The callback sets + * this to the new file. For other ops: [IN] The caller sets + * this to the file being operated on. + * @serialized_data: The opaque u64 handle, preserve/prepare/freeze may update + * this field. + * + * This structure bundles all parameters for the file operation callbacks. + * The 'data' and 'file' fields are used for both input and output. + */ +struct liveupdate_file_op_args { + struct liveupdate_file_handler *handler; + bool retrieved; + struct file *file; + u64 serialized_data; +}; + +/** + * struct liveupdate_file_ops - Callbacks for live-updatable files. + * @can_preserve: Required. Lightweight check to see if this handler is + * compatible with the given file. + * @preserve: Required. Performs state-saving for the file. + * @unpreserve: Required. Cleans up any resources allocated by @preserve. + * @freeze: Optional. Final actions just before kernel transition. + * @unfreeze: Optional. Undo freeze operations. + * @retrieve: Required. Restores the file in the new kernel. + * @can_finish: Optional. Check if this FD can finish, i.e. all restoration + * pre-requirements for this FD are satisfied. Called prior to + * finish, in order to do successful finish calls for all + * resources in the session. + * @finish: Required. Final cleanup in the new kernel. + * @owner: Module reference + * + * All operations (except can_preserve) receive a pointer to a + * 'struct liveupdate_file_op_args' containing the necessary context. + */ +struct liveupdate_file_ops { + bool (*can_preserve)(struct liveupdate_file_handler *handler, + struct file *file); + int (*preserve)(struct liveupdate_file_op_args *args); + void (*unpreserve)(struct liveupdate_file_op_args *args); + int (*freeze)(struct liveupdate_file_op_args *args); + void (*unfreeze)(struct liveupdate_file_op_args *args); + int (*retrieve)(struct liveupdate_file_op_args *args); + bool (*can_finish)(struct liveupdate_file_op_args *args); + void (*finish)(struct liveupdate_file_op_args *args); + struct module *owner; +}; + +/** + * struct liveupdate_file_handler - Represents a handler for a live-updatable file type. + * @ops: Callback functions + * @compatible: The compatibility string (e.g., "memfd-v1", "vfiofd-v1") + * that uniquely identifies the file type this handler + * supports. This is matched against the compatible string + * associated with individual &struct file instances. + * + * Modules that want to support live update for specific file types should + * register an instance of this structure. LUO uses this registration to + * determine if a given file can be preserved and to find the appropriate + * operations to manage its state across the update. + */ +struct liveupdate_file_handler { + const struct liveupdate_file_ops *ops; + const char compatible[LIVEUPDATE_HNDL_COMPAT_LENGTH]; + + /* private: */ + + /* + * Used for linking this handler instance into a global list of + * registered file handlers. + */ + struct list_head __private list; +}; #ifdef CONFIG_LIVEUPDATE @@ -19,6 +104,9 @@ bool liveupdate_enabled(void); /* Called during kexec to tell LUO that entered into reboot */ int liveupdate_reboot(void); +int liveupdate_register_file_handler(struct liveupdate_file_handler *fh); +int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh); + #else /* CONFIG_LIVEUPDATE */ static inline bool liveupdate_enabled(void) @@ -31,5 +119,15 @@ static inline int liveupdate_reboot(void) return 0; } +static inline int liveupdate_register_file_handler(struct liveupdate_file_handler *fh) +{ + return -EOPNOTSUPP; +} + +static inline int liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) +{ + return -EOPNOTSUPP; +} + #endif /* CONFIG_LIVEUPDATE */ #endif /* _LINUX_LIVEUPDATE_H */ -- cgit From 6ff1610ced5689c9af4c28a1798e04b74128a703 Mon Sep 17 00:00:00 2001 From: Pratyush Yadav Date: Tue, 25 Nov 2025 11:58:40 -0500 Subject: mm: shmem: use SHMEM_F_* flags instead of VM_* flags MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit shmem_inode_info::flags can have the VM flags VM_NORESERVE and VM_LOCKED. These are used to suppress pre-accounting or to lock the pages in the inode respectively. Using the VM flags directly makes it difficult to add shmem-specific flags that are unrelated to VM behavior since one would need to find a VM flag not used by shmem and re-purpose it. Introduce SHMEM_F_NORESERVE and SHMEM_F_LOCKED which represent the same information, but their bits are independent of the VM flags. Callers can still pass VM_NORESERVE to shmem_get_inode(), but it gets transformed to the shmem-specific flag internally. No functional changes intended. Link: https://lkml.kernel.org/r/20251125165850.3389713-11-pasha.tatashin@soleen.com Signed-off-by: Pratyush Yadav Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Tested-by: David Matlack Cc: Aleksander Lobakin Cc: Alexander Graf Cc: Alice Ryhl Cc: Andriy Shevchenko Cc: anish kumar Cc: Anna Schumaker Cc: Bartosz Golaszewski Cc: Bjorn Helgaas Cc: Borislav Betkov Cc: Chanwoo Choi Cc: Chen Ridong Cc: Chris Li Cc: Christian Brauner Cc: Daniel Wagner Cc: Danilo Krummrich Cc: Dan Williams Cc: David Hildenbrand Cc: David Jeffery Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Guixin Liu Cc: "H. Peter Anvin" Cc: Hugh Dickins Cc: Ilpo Järvinen Cc: Ingo Molnar Cc: Ira Weiny Cc: Jann Horn Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Joanthan Cameron Cc: Joel Granados Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Lennart Poettering Cc: Leon Romanovsky Cc: Leon Romanovsky Cc: Lukas Wunner Cc: Marc Rutland Cc: Masahiro Yamada Cc: Matthew Maurer Cc: Miguel Ojeda Cc: Myugnjoo Ham Cc: Parav Pandit Cc: Pratyush Yadav Cc: Randy Dunlap Cc: Roman Gushchin Cc: Saeed Mahameed Cc: Samiullah Khawaja Cc: Song Liu Cc: Steven Rostedt Cc: Stuart Hayes Cc: Tejun Heo Cc: Thomas Gleinxer Cc: Thomas Weißschuh Cc: Vincent Guittot Cc: William Tu Cc: Yoann Congal Cc: Zhu Yanjun Cc: Zijun Hu Signed-off-by: Andrew Morton --- include/linux/shmem_fs.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'include/linux') diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 0e47465ef0fd..650874b400b5 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -10,6 +10,7 @@ #include #include #include +#include struct swap_iocb; @@ -19,6 +20,11 @@ struct swap_iocb; #define SHMEM_MAXQUOTAS 2 #endif +/* Suppress pre-accounting of the entire object size. */ +#define SHMEM_F_NORESERVE BIT(0) +/* Disallow swapping. */ +#define SHMEM_F_LOCKED BIT(1) + struct shmem_inode_info { spinlock_t lock; unsigned int seals; /* shmem seals */ -- cgit From e165e2a2577b048664be09c074a10304290055f0 Mon Sep 17 00:00:00 2001 From: Pratyush Yadav Date: Tue, 25 Nov 2025 11:58:41 -0500 Subject: mm: shmem: allow freezing inode mapping MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To prepare a shmem inode for live update, its index -> folio mappings must be serialized. Once the mappings are serialized, they cannot change since it would cause the serialized data to become inconsistent. This can be done by pinning the folios to avoid migration, and by making sure no folios can be added to or removed from the inode. While mechanisms to pin folios already exist, the only way to stop folios being added or removed are the grow and shrink file seals. But file seals come with their own semantics, one of which is that they can't be removed. This doesn't work with liveupdate since it can be cancelled or error out, which would need the seals to be removed and the file's normal functionality to be restored. Introduce SHMEM_F_MAPPING_FROZEN to indicate this instead. It is internal to shmem and is not directly exposed to userspace. It functions similar to F_SEAL_GROW | F_SEAL_SHRINK, but additionally disallows hole punching, and can be removed. Link: https://lkml.kernel.org/r/20251125165850.3389713-12-pasha.tatashin@soleen.com Signed-off-by: Pratyush Yadav Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Tested-by: David Matlack Cc: Aleksander Lobakin Cc: Alexander Graf Cc: Alice Ryhl Cc: Andriy Shevchenko Cc: anish kumar Cc: Anna Schumaker Cc: Bartosz Golaszewski Cc: Bjorn Helgaas Cc: Borislav Betkov Cc: Chanwoo Choi Cc: Chen Ridong Cc: Chris Li Cc: Christian Brauner Cc: Daniel Wagner Cc: Danilo Krummrich Cc: Dan Williams Cc: David Hildenbrand Cc: David Jeffery Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Guixin Liu Cc: "H. Peter Anvin" Cc: Hugh Dickins Cc: Ilpo Järvinen Cc: Ingo Molnar Cc: Ira Weiny Cc: Jann Horn Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Joanthan Cameron Cc: Joel Granados Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Lennart Poettering Cc: Leon Romanovsky Cc: Leon Romanovsky Cc: Lukas Wunner Cc: Marc Rutland Cc: Masahiro Yamada Cc: Matthew Maurer Cc: Miguel Ojeda Cc: Myugnjoo Ham Cc: Parav Pandit Cc: Pratyush Yadav Cc: Randy Dunlap Cc: Roman Gushchin Cc: Saeed Mahameed Cc: Samiullah Khawaja Cc: Song Liu Cc: Steven Rostedt Cc: Stuart Hayes Cc: Tejun Heo Cc: Thomas Gleinxer Cc: Thomas Weißschuh Cc: Vincent Guittot Cc: William Tu Cc: Yoann Congal Cc: Zhu Yanjun Cc: Zijun Hu Signed-off-by: Andrew Morton --- include/linux/shmem_fs.h | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'include/linux') diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 650874b400b5..d34a64eafe60 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -24,6 +24,14 @@ struct swap_iocb; #define SHMEM_F_NORESERVE BIT(0) /* Disallow swapping. */ #define SHMEM_F_LOCKED BIT(1) +/* + * Disallow growing, shrinking, or hole punching in the inode. Combined with + * folio pinning, makes sure the inode's mapping stays fixed. + * + * In some ways similar to F_SEAL_GROW | F_SEAL_SHRINK, but can be removed and + * isn't directly visible to userspace. + */ +#define SHMEM_F_MAPPING_FROZEN BIT(2) struct shmem_inode_info { spinlock_t lock; @@ -186,6 +194,15 @@ static inline bool shmem_file(struct file *file) return shmem_mapping(file->f_mapping); } +/* Must be called with inode lock taken exclusive. */ +static inline void shmem_freeze(struct inode *inode, bool freeze) +{ + if (freeze) + SHMEM_I(inode)->flags |= SHMEM_F_MAPPING_FROZEN; + else + SHMEM_I(inode)->flags &= ~SHMEM_F_MAPPING_FROZEN; +} + /* * If fallocate(FALLOC_FL_KEEP_SIZE) has been used, there may be pages * beyond i_size's notion of EOF, which fallocate has committed to reserving: -- cgit From 8def18633e8df54a05cf7d323d0df24c21b320d6 Mon Sep 17 00:00:00 2001 From: Pratyush Yadav Date: Tue, 25 Nov 2025 11:58:43 -0500 Subject: liveupdate: luo_file: add private argument to store runtime state MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently file handlers only get the serialized_data field to store their state. This field has a pointer to the serialized state of the file, and it becomes a part of LUO file's serialized state. File handlers can also need some runtime state to track information that shouldn't make it in the serialized data. One such example is a vmalloc pointer. While kho_preserve_vmalloc() preserves the memory backing a vmalloc allocation, it does not store the original vmap pointer, since that has no use being passed to the next kernel. The pointer is needed to free the memory in case the file is unpreserved. Provide a private field in struct luo_file and pass it to all the callbacks. The field's can be set by preserve, and must be freed by unpreserve. Link: https://lkml.kernel.org/r/20251125165850.3389713-14-pasha.tatashin@soleen.com Signed-off-by: Pratyush Yadav Co-developed-by: Pasha Tatashin Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Tested-by: David Matlack Cc: Aleksander Lobakin Cc: Alexander Graf Cc: Alice Ryhl Cc: Andriy Shevchenko Cc: anish kumar Cc: Anna Schumaker Cc: Bartosz Golaszewski Cc: Bjorn Helgaas Cc: Borislav Betkov Cc: Chanwoo Choi Cc: Chen Ridong Cc: Chris Li Cc: Christian Brauner Cc: Daniel Wagner Cc: Danilo Krummrich Cc: Dan Williams Cc: David Hildenbrand Cc: David Jeffery Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Guixin Liu Cc: "H. Peter Anvin" Cc: Hugh Dickins Cc: Ilpo Järvinen Cc: Ingo Molnar Cc: Ira Weiny Cc: Jann Horn Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Joanthan Cameron Cc: Joel Granados Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Lennart Poettering Cc: Leon Romanovsky Cc: Leon Romanovsky Cc: Lukas Wunner Cc: Marc Rutland Cc: Masahiro Yamada Cc: Matthew Maurer Cc: Miguel Ojeda Cc: Myugnjoo Ham Cc: Parav Pandit Cc: Pratyush Yadav Cc: Randy Dunlap Cc: Roman Gushchin Cc: Saeed Mahameed Cc: Samiullah Khawaja Cc: Song Liu Cc: Steven Rostedt Cc: Stuart Hayes Cc: Tejun Heo Cc: Thomas Gleinxer Cc: Thomas Weißschuh Cc: Vincent Guittot Cc: William Tu Cc: Yoann Congal Cc: Zhu Yanjun Cc: Zijun Hu Signed-off-by: Andrew Morton --- include/linux/liveupdate.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include/linux') diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h index 122ad8f16ff9..a7f6ee5b6771 100644 --- a/include/linux/liveupdate.h +++ b/include/linux/liveupdate.h @@ -27,6 +27,10 @@ struct file; * this to the file being operated on. * @serialized_data: The opaque u64 handle, preserve/prepare/freeze may update * this field. + * @private_data: Private data for the file used to hold runtime state that + * is not preserved. Set by the handler's .preserve() + * callback, and must be freed in the handler's + * .unpreserve() callback. * * This structure bundles all parameters for the file operation callbacks. * The 'data' and 'file' fields are used for both input and output. @@ -36,6 +40,7 @@ struct liveupdate_file_op_args { bool retrieved; struct file *file; u64 serialized_data; + void *private_data; }; /** -- cgit From b3749f174d686627f702234e64bad976dc432dbc Mon Sep 17 00:00:00 2001 From: Pratyush Yadav Date: Tue, 25 Nov 2025 11:58:44 -0500 Subject: mm: memfd_luo: allow preserving memfd MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The ability to preserve a memfd allows userspace to use KHO and LUO to transfer its memory contents to the next kernel. This is useful in many ways. For one, it can be used with IOMMUFD as the backing store for IOMMU page tables. Preserving IOMMUFD is essential for performing a hypervisor live update with passthrough devices. memfd support provides the first building block for making that possible. For another, applications with a large amount of memory that takes time to reconstruct, reboots to consume kernel upgrades can be very expensive. memfd with LUO gives those applications reboot-persistent memory that they can use to quickly save and reconstruct that state. While memfd is backed by either hugetlbfs or shmem, currently only support on shmem is added. To be more precise, support for anonymous shmem files is added. The handover to the next kernel is not transparent. All the properties of the file are not preserved; only its memory contents, position, and size. The recreated file gets the UID and GID of the task doing the restore, and the task's cgroup gets charged with the memory. Once preserved, the file cannot grow or shrink, and all its pages are pinned to avoid migrations and swapping. The file can still be read from or written to. Use vmalloc to get the buffer to hold the folios, and preserve it using kho_preserve_vmalloc(). This doesn't have the size limit. Link: https://lkml.kernel.org/r/20251125165850.3389713-15-pasha.tatashin@soleen.com Signed-off-by: Pratyush Yadav Co-developed-by: Pasha Tatashin Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Tested-by: David Matlack Cc: Aleksander Lobakin Cc: Alexander Graf Cc: Alice Ryhl Cc: Andriy Shevchenko Cc: anish kumar Cc: Anna Schumaker Cc: Bartosz Golaszewski Cc: Bjorn Helgaas Cc: Borislav Betkov Cc: Chanwoo Choi Cc: Chen Ridong Cc: Chris Li Cc: Christian Brauner Cc: Daniel Wagner Cc: Danilo Krummrich Cc: Dan Williams Cc: David Hildenbrand Cc: David Jeffery Cc: David Rientjes Cc: Greg Kroah-Hartman Cc: Guixin Liu Cc: "H. Peter Anvin" Cc: Hugh Dickins Cc: Ilpo Järvinen Cc: Ingo Molnar Cc: Ira Weiny Cc: Jann Horn Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Joanthan Cameron Cc: Joel Granados Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Lennart Poettering Cc: Leon Romanovsky Cc: Leon Romanovsky Cc: Lukas Wunner Cc: Marc Rutland Cc: Masahiro Yamada Cc: Matthew Maurer Cc: Miguel Ojeda Cc: Myugnjoo Ham Cc: Parav Pandit Cc: Pratyush Yadav Cc: Randy Dunlap Cc: Roman Gushchin Cc: Saeed Mahameed Cc: Samiullah Khawaja Cc: Song Liu Cc: Steven Rostedt Cc: Stuart Hayes Cc: Tejun Heo Cc: Thomas Gleinxer Cc: Thomas Weißschuh Cc: Vincent Guittot Cc: William Tu Cc: Yoann Congal Cc: Zhu Yanjun Cc: Zijun Hu Signed-off-by: Andrew Morton --- include/linux/kho/abi/memfd.h | 77 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 77 insertions(+) create mode 100644 include/linux/kho/abi/memfd.h (limited to 'include/linux') diff --git a/include/linux/kho/abi/memfd.h b/include/linux/kho/abi/memfd.h new file mode 100644 index 000000000000..da7d063474a1 --- /dev/null +++ b/include/linux/kho/abi/memfd.h @@ -0,0 +1,77 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* + * Copyright (c) 2025, Google LLC. + * Pasha Tatashin + * + * Copyright (C) 2025 Amazon.com Inc. or its affiliates. + * Pratyush Yadav + */ + +#ifndef _LINUX_KHO_ABI_MEMFD_H +#define _LINUX_KHO_ABI_MEMFD_H + +#include +#include + +/** + * DOC: memfd Live Update ABI + * + * This header defines the ABI for preserving the state of a memfd across a + * kexec reboot using the LUO. + * + * The state is serialized into a packed structure `struct memfd_luo_ser` + * which is handed over to the next kernel via the KHO mechanism. + * + * This interface is a contract. Any modification to the structure layout + * constitutes a breaking change. Such changes require incrementing the + * version number in the MEMFD_LUO_FH_COMPATIBLE string. + */ + +/** + * MEMFD_LUO_FOLIO_DIRTY - The folio is dirty. + * + * This flag indicates the folio contains data from user. A non-dirty folio is + * one that was allocated (say using fallocate(2)) but not written to. + */ +#define MEMFD_LUO_FOLIO_DIRTY BIT(0) + +/** + * MEMFD_LUO_FOLIO_UPTODATE - The folio is up-to-date. + * + * An up-to-date folio has been zeroed out. shmem zeroes out folios on first + * use. This flag tracks which folios need zeroing. + */ +#define MEMFD_LUO_FOLIO_UPTODATE BIT(1) + +/** + * struct memfd_luo_folio_ser - Serialized state of a single folio. + * @pfn: The page frame number of the folio. + * @flags: Flags to describe the state of the folio. + * @index: The page offset (pgoff_t) of the folio within the original file. + */ +struct memfd_luo_folio_ser { + u64 pfn:52; + u64 flags:12; + u64 index; +} __packed; + +/** + * struct memfd_luo_ser - Main serialization structure for a memfd. + * @pos: The file's current position (f_pos). + * @size: The total size of the file in bytes (i_size). + * @nr_folios: Number of folios in the folios array. + * @folios: KHO vmalloc descriptor pointing to the array of + * struct memfd_luo_folio_ser. + */ +struct memfd_luo_ser { + u64 pos; + u64 size; + u64 nr_folios; + struct kho_vmalloc folios; +} __packed; + +/* The compatibility string for memfd file handler */ +#define MEMFD_LUO_FH_COMPATIBLE "memfd-v1" + +#endif /* _LINUX_KHO_ABI_MEMFD_H */ -- cgit From 3fa805c37dd4d3e72ae5c58800f3f46ab3ca1f70 Mon Sep 17 00:00:00 2001 From: Breno Leitao Date: Fri, 10 Oct 2025 03:36:50 -0700 Subject: vmcoreinfo: track and log recoverable hardware errors MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce a generic infrastructure for tracking recoverable hardware errors (HW errors that are visible to the OS but does not cause a panic) and record them for vmcore consumption. This aids post-mortem crash analysis tools by preserving a count and timestamp for the last occurrence of such errors. On the other side, correctable errors, which the OS typically remains unaware of because the underlying hardware handles them transparently, are less relevant for crash dump and therefore are NOT tracked in this infrastructure. Add centralized logging for sources of recoverable hardware errors based on the subsystem it has been notified. hwerror_data is write-only at kernel runtime, and it is meant to be read from vmcore using tools like crash/drgn. For example, this is how it looks like when opening the crashdump from drgn. >>> prog['hwerror_data'] (struct hwerror_info[1]){ { .count = (int)844, .timestamp = (time64_t)1752852018, }, ... This helps fleet operators quickly triage whether a crash may be influenced by hardware recoverable errors (which executes a uncommon code path in the kernel), especially when recoverable errors occurred shortly before a panic, such as the bug fixed by commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them when destroying the pool") This is not intended to replace full hardware diagnostics but provides a fast way to correlate hardware events with kernel panics quickly. Rare machine check exceptions—like those indicated by mce_flags.p5 or mce_flags.winchip—are not accounted for in this method, as they fall outside the intended usage scope for this feature's user base. [leitao@debian.org: add hw-recoverable-errors to toctree] Link: https://lkml.kernel.org/r/20251127-vmcoreinfo_fix-v1-1-26f5b1c43da9@debian.org Link: https://lkml.kernel.org/r/20251010-vmcore_hw_error-v5-1-636ede3efe44@debian.org Signed-off-by: Breno Leitao Suggested-by: Tony Luck Suggested-by: Shuai Xue Reviewed-by: Shuai Xue Reviewed-by: Hanjun Guo [APEI] Cc: Bjorn Helgaas Cc: Bob Moore Cc: Borislav Betkov Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: James Morse Cc: Konrad Rzessutek Wilk Cc: Len Brown Cc: Mahesh Salgaonkar Cc: Mauro Carvalho Chehab Cc: "Oliver O'Halloran" Cc: Omar Sandoval Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- include/linux/vmcore_info.h | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'include/linux') diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h index 37e003ae5262..e71518caacdf 100644 --- a/include/linux/vmcore_info.h +++ b/include/linux/vmcore_info.h @@ -5,6 +5,7 @@ #include #include #include +#include #define CRASH_CORE_NOTE_HEAD_BYTES ALIGN(sizeof(struct elf_note), 4) #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(NN_PRSTATUS), 4) @@ -77,4 +78,11 @@ extern u32 *vmcoreinfo_note; Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type, void *data, size_t data_len); void final_note(Elf_Word *buf); + +#ifdef CONFIG_VMCORE_INFO +void hwerr_log_error_type(enum hwerr_error_type src); +#else +static inline void hwerr_log_error_type(enum hwerr_error_type src) {}; +#endif + #endif /* LINUX_VMCORE_INFO_H */ -- cgit