summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/fpu
AgeCommit message (Collapse)Author
2025-05-26Merge tag 'x86-core-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: "Boot code changes: - A large series of changes to reorganize the x86 boot code into a better isolated and easier to maintain base of PIC early startup code in arch/x86/boot/startup/, by Ard Biesheuvel. Motivation & background: | Since commit | | c88d71508e36 ("x86/boot/64: Rewrite startup_64() in C") | | dated Jun 6 2017, we have been using C code on the boot path in a way | that is not supported by the toolchain, i.e., to execute non-PIC C | code from a mapping of memory that is different from the one provided | to the linker. It should have been obvious at the time that this was a | bad idea, given the need to sprinkle fixup_pointer() calls left and | right to manipulate global variables (including non-pointer variables) | without crashing. | | This C startup code has been expanding, and in particular, the SEV-SNP | startup code has been expanding over the past couple of years, and | grown many of these warts, where the C code needs to use special | annotations or helpers to access global objects. This tree includes the first phase of this work-in-progress x86 boot code reorganization. Scalability enhancements and micro-optimizations: - Improve code-patching scalability (Eric Dumazet) - Remove MFENCEs for X86_BUG_CLFLUSH_MONITOR (Andrew Cooper) CPU features enumeration updates: - Thorough reorganization and cleanup of CPUID parsing APIs (Ahmed S. Darwish) - Fix, refactor and clean up the cacheinfo code (Ahmed S. Darwish, Thomas Gleixner) - Update CPUID bitfields to x86-cpuid-db v2.3 (Ahmed S. Darwish) Memory management changes: - Allow temporary MMs when IRQs are on (Andy Lutomirski) - Opt-in to IRQs-off activate_mm() (Andy Lutomirski) - Simplify choose_new_asid() and generate better code (Borislav Petkov) - Simplify 32-bit PAE page table handling (Dave Hansen) - Always use dynamic memory layout (Kirill A. Shutemov) - Make SPARSEMEM_VMEMMAP the only memory model (Kirill A. Shutemov) - Make 5-level paging support unconditional (Kirill A. Shutemov) - Stop prefetching current->mm->mmap_lock on page faults (Mateusz Guzik) - Predict valid_user_address() returning true (Mateusz Guzik) - Consolidate initmem_init() (Mike Rapoport) FPU support and vector computing: - Enable Intel APX support (Chang S. Bae) - Reorgnize and clean up the xstate code (Chang S. Bae) - Make task_struct::thread constant size (Ingo Molnar) - Restore fpu_thread_struct_whitelist() to fix CONFIG_HARDENED_USERCOPY=y (Kees Cook) - Simplify the switch_fpu_prepare() + switch_fpu_finish() logic (Oleg Nesterov) - Always preserve non-user xfeatures/flags in __state_perm (Sean Christopherson) Microcode loader changes: - Help users notice when running old Intel microcode (Dave Hansen) - AMD: Do not return error when microcode update is not necessary (Annie Li) - AMD: Clean the cache if update did not load microcode (Boris Ostrovsky) Code patching (alternatives) changes: - Simplify, reorganize and clean up the x86 text-patching code (Ingo Molnar) - Make smp_text_poke_batch_process() subsume smp_text_poke_batch_finish() (Nikolay Borisov) - Refactor the {,un}use_temporary_mm() code (Peter Zijlstra) Debugging support: - Add early IDT and GDT loading to debug relocate_kernel() bugs (David Woodhouse) - Print the reason for the last reset on modern AMD CPUs (Yazen Ghannam) - Add AMD Zen debugging document (Mario Limonciello) - Fix opcode map (!REX2) superscript tags (Masami Hiramatsu) - Stop decoding i64 instructions in x86-64 mode at opcode (Masami Hiramatsu) CPU bugs and bug mitigations: - Remove X86_BUG_MMIO_UNKNOWN (Borislav Petkov) - Fix SRSO reporting on Zen1/2 with SMT disabled (Borislav Petkov) - Restructure and harmonize the various CPU bug mitigation methods (David Kaplan) - Fix spectre_v2 mitigation default on Intel (Pawan Gupta) MSR API: - Large MSR code and API cleanup (Xin Li) - In-kernel MSR API type cleanups and renames (Ingo Molnar) PKEYS: - Simplify PKRU update in signal frame (Chang S. Bae) NMI handling code: - Clean up, refactor and simplify the NMI handling code (Sohil Mehta) - Improve NMI duration console printouts (Sohil Mehta) Paravirt guests interface: - Restrict PARAVIRT_XXL to 64-bit only (Kirill A. Shutemov) SEV support: - Share the sev_secrets_pa value again (Tom Lendacky) x86 platform changes: - Introduce the <asm/amd/> header namespace (Ingo Molnar) - i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h> (Mario Limonciello) Fixes and cleanups: - x86 assembly code cleanups and fixes (Uros Bizjak) - Misc fixes and cleanups (Andi Kleen, Andy Lutomirski, Andy Shevchenko, Ard Biesheuvel, Bagas Sanjaya, Baoquan He, Borislav Petkov, Chang S. Bae, Chao Gao, Dan Williams, Dave Hansen, David Kaplan, David Woodhouse, Eric Biggers, Ingo Molnar, Josh Poimboeuf, Juergen Gross, Malaya Kumar Rout, Mario Limonciello, Nathan Chancellor, Oleg Nesterov, Pawan Gupta, Peter Zijlstra, Shivank Garg, Sohil Mehta, Thomas Gleixner, Uros Bizjak, Xin Li)" * tag 'x86-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (331 commits) x86/bugs: Fix spectre_v2 mitigation default on Intel x86/bugs: Restructure ITS mitigation x86/xen/msr: Fix uninitialized variable 'err' x86/msr: Remove a superfluous inclusion of <asm/asm.h> x86/paravirt: Restrict PARAVIRT_XXL to 64-bit only x86/mm/64: Make 5-level paging support unconditional x86/mm/64: Make SPARSEMEM_VMEMMAP the only memory model x86/mm/64: Always use dynamic memory layout x86/bugs: Fix indentation due to ITS merge x86/cpuid: Rename hypervisor_cpuid_base()/for_each_possible_hypervisor_cpuid_base() to cpuid_base_hypervisor()/for_each_possible_cpuid_base_hypervisor() x86/cpu/intel: Rename CPUID(0x2) descriptors iterator parameter x86/cacheinfo: Rename CPUID(0x2) descriptors iterator parameter x86/cpuid: Rename cpuid_get_leaf_0x2_regs() to cpuid_leaf_0x2() x86/cpuid: Rename have_cpuid_p() to cpuid_feature() x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header x86/cpuid: Move CPUID(0x2) APIs into <cpuid/api.h> x86/msr: Add rdmsrl_on_cpu() compatibility wrapper x86/mm: Fix kernel-doc descriptions of various pgtable methods x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too x86/boot: Defer initialization of VM space related global variables ...
2025-05-26x86/fpu: Fix irq_fpu_usable() to return false during CPU onliningEric Biggers
irq_fpu_usable() incorrectly returned true before the FPU is initialized. The x86 CPU onlining code can call sha256() to checksum AMD microcode images, before the FPU is initialized. Since sha256() recently gained a kernel-mode FPU optimized code path, a crash occurred in kernel_fpu_begin_mask() during hotplug CPU onlining. (The crash did not occur during boot-time CPU onlining, since the optimized sha256() code is not enabled until subsys_initcalls run.) Fix this by making irq_fpu_usable() return false before fpu__init_cpu() has run. To do this without adding any additional overhead to irq_fpu_usable(), replace the existing per-CPU bool in_kernel_fpu with kernel_fpu_allowed which tracks both initialization and usage rather than just usage. The initial state is false; FPU initialization sets it to true; kernel-mode FPU sections toggle it to false and then back to true; and CPU offlining restores it to the initial state of false. Fixes: 11d7956d526f ("crypto: x86/sha256 - implement library instead of shash") Reported-by: Ayush Jain <Ayush.Jain3@amd.com> Closes: https://lore.kernel.org/r/20250516112217.GBaCcf6Yoc6LkIIryP@fat_crate.local Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ayush Jain <Ayush.Jain3@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-15x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID headerAhmed S. Darwish
The main CPUID header <asm/cpuid.h> was originally a storefront for the headers: <asm/cpuid/api.h> <asm/cpuid/leaf_0x2_api.h> Now that the latter CPUID(0x2) header has been merged into the former, there is no practical difference between <asm/cpuid.h> and <asm/cpuid/api.h>. Migrate all users to the <asm/cpuid/api.h> header, in preparation of the removal of <asm/cpuid.h>. Don't remove <asm/cpuid.h> just yet, in case some new code in -next started using it. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: x86-cpuid@lists.linux.dev Link: https://lore.kernel.org/r/20250508150240.172915-3-darwi@linutronix.de
2025-05-13Merge branch 'x86/msr' into x86/core, to resolve conflictsIngo Molnar
Conflicts: arch/x86/boot/startup/sme.c arch/x86/coco/sev/core.c arch/x86/kernel/fpu/core.c arch/x86/kernel/fpu/xstate.c Semantic conflict: arch/x86/include/asm/sev-internal.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-06x86/fpu: Drop @perm from guest pseudo FPU containerChao Gao
Remove @perm from the guest pseudo FPU container. The field is initialized during allocation and never used later. Rename fpu_init_guest_permissions() to show that its sole purpose is to lock down guest permissions. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biggers <ebiggers@google.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mitchell Levy <levymitchell0@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/kvm/af972fe5981b9e7101b64de43c7be0a8cc165323.camel@redhat.com/ Link: https://lore.kernel.org/r/20250506093740.2864458-3-chao.gao@intel.com
2025-05-06x86/fpu/xstate: Always preserve non-user xfeatures/flags in __state_permSean Christopherson
When granting userspace or a KVM guest access to an xfeature, preserve the entity's existing supervisor and software-defined permissions as tracked by __state_perm, i.e. use __state_perm to track *all* permissions even though all supported supervisor xfeatures are granted to all FPUs and FPU_GUEST_PERM_LOCKED disallows changing permissions. Effectively clobbering supervisor permissions results in inconsistent behavior, as xstate_get_group_perm() will report supervisor features for process that do NOT request access to dynamic user xfeatures, whereas any and all supervisor features will be absent from the set of permissions for any process that is granted access to one or more dynamic xfeatures (which right now means AMX). The inconsistency isn't problematic because fpu_xstate_prctl() already strips out everything except user xfeatures: case ARCH_GET_XCOMP_PERM: /* * Lockless snapshot as it can also change right after the * dropping the lock. */ permitted = xstate_get_host_group_perm(); permitted &= XFEATURE_MASK_USER_SUPPORTED; return put_user(permitted, uptr); case ARCH_GET_XCOMP_GUEST_PERM: permitted = xstate_get_guest_group_perm(); permitted &= XFEATURE_MASK_USER_SUPPORTED; return put_user(permitted, uptr); and similarly KVM doesn't apply the __state_perm to supervisor states (kvm_get_filtered_xcr0() incorporates xstate_get_guest_group_perm()): case 0xd: { u64 permitted_xcr0 = kvm_get_filtered_xcr0(); u64 permitted_xss = kvm_caps.supported_xss; But if KVM in particular were to ever change, dropping supervisor permissions would result in subtle bugs in KVM's reporting of supported CPUID settings. And the above behavior also means that having supervisor xfeatures in __state_perm is correctly handled by all users. Dropping supervisor permissions also creates another landmine for KVM. If more dynamic user xfeatures are ever added, requesting access to multiple xfeatures in separate ARCH_REQ_XCOMP_GUEST_PERM calls will result in the second invocation of __xstate_request_perm() computing the wrong ksize, as as the mask passed to xstate_calculate_size() would not contain *any* supervisor features. Commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions") fudged around the size issue for userspace FPUs, but for reasons unknown skipped guest FPUs. Lack of a fix for KVM "works" only because KVM doesn't yet support virtualizing features that have supervisor xfeatures, i.e. as of today, KVM guest FPUs will never need the relevant xfeatures. Simply extending the hack-a-fix for guests would temporarily solve the ksize issue, but wouldn't address the inconsistency issue and would leave another lurking pitfall for KVM. KVM support for virtualizing CET will likely add CET_KERNEL as a guest-only xfeature, i.e. CET_KERNEL will not be set in xfeatures_mask_supervisor() and would again be dropped when granting access to dynamic xfeatures. Note, the existing clobbering behavior is rather subtle. The @permitted parameter to __xstate_request_perm() comes from: permitted = xstate_get_group_perm(guest); which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm, where __state_perm is initialized to: fpu->perm.__state_perm = fpu_kernel_cfg.default_features; and copied to the guest side of things: /* Same defaults for guests */ fpu->guest_perm = fpu->perm; fpu_kernel_cfg.default_features contains everything except the dynamic xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA: fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features; fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC; When __xstate_request_perm() restricts the local "mask" variable to compute the user state size: mask &= XFEATURE_MASK_USER_SUPPORTED; usize = xstate_calculate_size(mask, false); it subtly overwrites the target __state_perm with "mask" containing only user xfeatures: perm = guest ? &fpu->guest_perm : &fpu->perm; /* Pairs with the READ_ONCE() in xstate_get_group_perm() */ WRITE_ONCE(perm->__state_perm, mask); Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: John Allen <john.allen@amd.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mitchell Levy <levymitchell0@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Vignesh Balasubramanian <vigbalas@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xin Li <xin3.li@intel.com> Cc: kvm@vger.kernel.org Link: https://lore.kernel.org/all/ZTqgzZl-reO1m01I@google.com Link: https://lore.kernel.org/r/20250506093740.2864458-2-chao.gao@intel.com
2025-05-05x86/fpu: Restore fpu_thread_struct_whitelist() to fix ↵Kees Cook
CONFIG_HARDENED_USERCOPY=y crash Borislav Petkov reported the following boot crash on x86-32, with CONFIG_HARDENED_USERCOPY=y: | usercopy: Kernel memory overwrite attempt detected to SLUB object 'task_struct' (offset 2112, size 160)! | ... | kernel BUG at mm/usercopy.c:102! So the useroffset and usersize arguments are what control the allowed window of copying in/out of the "task_struct" kmem cache: /* create a slab on which task_structs can be allocated */ task_struct_whitelist(&useroffset, &usersize); task_struct_cachep = kmem_cache_create_usercopy("task_struct", arch_task_struct_size, align, SLAB_PANIC|SLAB_ACCOUNT, useroffset, usersize, NULL); task_struct_whitelist() positions this window based on the location of the thread_struct within task_struct, and gets the arch-specific details via arch_thread_struct_whitelist(offset, size): static void __init task_struct_whitelist(unsigned long *offset, unsigned long *size) { /* Fetch thread_struct whitelist for the architecture. */ arch_thread_struct_whitelist(offset, size); /* * Handle zero-sized whitelist or empty thread_struct, otherwise * adjust offset to position of thread_struct in task_struct. */ if (unlikely(*size == 0)) *offset = 0; else *offset += offsetof(struct task_struct, thread); } Commit cb7ca40a3882 ("x86/fpu: Make task_struct::thread constant size") removed the logic for the window, leaving: static inline void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size) { *offset = 0; *size = 0; } So now there is no window that usercopy hardening will allow to be copied in/out of task_struct. But as reported above, there *is* a copy in copy_uabi_to_xstate(). (It seems there are several, actually.) int copy_sigframe_from_user_to_xstate(struct task_struct *tsk, const void __user *ubuf) { return copy_uabi_to_xstate(x86_task_fpu(tsk)->fpstate, NULL, ubuf, &tsk->thread.pkru); } This appears to be writing into x86_task_fpu(tsk)->fpstate. With or without CONFIG_X86_DEBUG_FPU, this resolves to: ((struct fpu *)((void *)(task) + sizeof(*(task)))) i.e. the memory "after task_struct" is cast to "struct fpu", and the uses the "fpstate" pointer. How that pointer gets set looks to be variable, but I think the one we care about here is: fpu->fpstate = &fpu->__fpstate; And struct fpu::__fpstate says: struct fpstate __fpstate; /* * WARNING: '__fpstate' is dynamically-sized. Do not put * anything after it here. */ So we're still dealing with a dynamically sized thing, even if it's not within the literal struct task_struct -- it's still in the kmem cache, though. Looking at the kmem cache size, it has allocated "arch_task_struct_size" bytes, which is calculated in fpu__init_task_struct_size(): int task_size = sizeof(struct task_struct); task_size += sizeof(struct fpu); /* * Subtract off the static size of the register state. * It potentially has a bunch of padding. */ task_size -= sizeof(union fpregs_state); /* * Add back the dynamically-calculated register state * size. */ task_size += fpu_kernel_cfg.default_size; /* * We dynamically size 'struct fpu', so we require that * 'state' be at the end of 'it: */ CHECK_MEMBER_AT_END_OF(struct fpu, __fpstate); arch_task_struct_size = task_size; So, this is still copying out of the kmem cache for task_struct, and the window seems unchanged (still fpu regs). This is what the window was before: void fpu_thread_struct_whitelist(unsigned long *offset, unsigned long *size) { *offset = offsetof(struct thread_struct, fpu.__fpstate.regs); *size = fpu_kernel_cfg.default_size; } And the same commit I mentioned above removed it. I think the misunderstanding is here: | The fpu_thread_struct_whitelist() quirk to hardened usercopy can be removed, | now that the FPU structure is not embedded in the task struct anymore, which | reduces text footprint a bit. Yes, FPU is no longer in task_struct, but it IS in the kmem cache named "task_struct", since the fpstate is still being allocated there. Partially revert the earlier mentioned commit, along with a recalculation of the fpstate regs location. Fixes: cb7ca40a3882 ("x86/fpu: Make task_struct::thread constant size") Reported-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/all/20250409211127.3544993-1-mingo@kernel.org/ # Discussion #1 Link: https://lore.kernel.org/r/202505041418.F47130C4C8@keescook # Discussion #2
2025-05-04x86/fpu: Check TIF_NEED_FPU_LOAD instead of PF_KTHREAD|PF_USER_WORKER in ↵Oleg Nesterov
fpu__drop() PF_KTHREAD|PF_USER_WORKER tasks should never clear TIF_NEED_FPU_LOAD, so the TIF_NEED_FPU_LOAD check should equally filter them out. And this way an exiting userspace task can avoid the unnecessary "fwait" if it does context_switch() at least once on its way to exit_thread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chang S . Bae <chang.seok.bae@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Brian Gerst <brgerst@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250503143856.GA9009@redhat.com
2025-05-04x86/fpu: Remove x86_init_fpuOleg Nesterov
It is not actually used after: 55bc30f2e34d ("x86/fpu: Remove the thread::fpu pointer") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chang S . Bae <chang.seok.bae@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Brian Gerst <brgerst@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250503143837.GA8985@redhat.com
2025-05-02x86/msr: Add explicit includes of <asm/msr.h>Xin Li (Intel)
For historic reasons there are some TSC-related functions in the <asm/msr.h> header, even though there's an <asm/tsc.h> header. To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h> to <asm/tsc.h> and to eventually eliminate the inclusion of <asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency to the source files that reference definitions from <asm/msr.h>. [ mingo: Clarified the changelog. ] Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
2025-04-16x86/fpu: Rename fpu_reset_fpregs() to fpu_reset_fpstate_regs()Chang S. Bae
The original function name came from an overly compressed form of 'fpstate_regs' by commit: e61d6310a0f8 ("x86/fpu: Reset permission and fpstate on exec()") However, the term 'fpregs' typically refers to physical FPU registers. In contrast, this function copies the init values to fpu->fpstate->regs, not hardware registers. Rename the function to better reflect what it actually does. No functional change. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250416021720.12305-11-chang.seok.bae@intel.com
2025-04-16x86/fpu: Remove export of mxcsr_feature_maskChang S. Bae
The variable was previously referenced in KVM code but the last usage was removed by: ea4d6938d4c0 ("x86/fpu: Replace KVMs home brewed FPU copy from user") Remove its export symbol. Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250416021720.12305-10-chang.seok.bae@intel.com
2025-04-16x86/pkeys: Simplify PKRU update in signal frameChang S. Bae
The signal delivery logic was modified to always set the PKRU bit in xregs_state->header->xfeatures by this commit: ae6012d72fa6 ("x86/pkeys: Ensure updated PKRU value is XRSTOR'd") However, the change derives the bitmask value using XGETBV(1), rather than simply updating the buffer that already holds the value. Thus, this approach induces an unnecessary dependency on XGETBV1 for PKRU handling. Eliminate the dependency by using the established helper function. Subsequently, remove the now-unused 'mask' argument. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250416021720.12305-9-chang.seok.bae@intel.com
2025-04-16x86/fpu: Refactor xfeature bitmask update code for sigframe XSAVEChang S. Bae
Currently, saving register states in the signal frame, the legacy feature bits are always set in xregs_state->header->xfeatures. This code sequence can be generalized for reuse in similar cases. Refactor the logic to ensure a consistent approach across similar usages. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250416021720.12305-8-chang.seok.bae@intel.com
2025-04-16x86/fpu: Log XSAVE disablement consistentlyChang S. Bae
Not all paths that lead to fpu__init_disable_system_xstate() currently emit a message indicating that XSAVE has been disabled. Move the print statement into the function to ensure the message in all cases. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250416021720.12305-7-chang.seok.bae@intel.com
2025-04-16x86/fpu/apx: Enable APX state supportChang S. Bae
With securing APX against conflicting MPX, it is now ready to be enabled. Include APX in the enabled xfeature set. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250416021720.12305-5-chang.seok.bae@intel.com
2025-04-16x86/fpu/apx: Disallow conflicting MPX presenceChang S. Bae
XSTATE components are architecturally independent. There is no rule requiring their offsets in the non-compacted format to be strictly ascending or mutually non-overlapping. However, in practice, such overlaps have not occurred -- until now. APX is introduced as xstate component 19, following AMX. In the non-compacted XSAVE format, its offset overlaps with the space previously occupied by the now-deprecated MPX feature: 45fc24e89b7c ("x86/mpx: remove MPX from arch/x86") To prevent conflicts, the kernel must ensure the CPU never expose both features at the same time. If so, it indicates unreliable hardware. In such cases, XSAVE should be disabled entirely as a precautionary measure. Add a sanity check to detect this condition and disable XSAVE if an invalid hardware configuration is identified. Note: MPX state components remain enabled on legacy systems solely for KVM guest support. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250416021720.12305-4-chang.seok.bae@intel.com
2025-04-16x86/fpu/apx: Define APX state componentChang S. Bae
Advanced Performance Extensions (APX) is associated with a new state component number 19. To support saving and restoring of the corresponding registers via the XSAVE mechanism, introduce the component definition along with the necessary sanity checks. Define the new component number, state name, and those register data type. Then, extend the size checker to validate the register data type and explicitly list the APX feature flag as a dependency for the new component in xsave_cpuid_features[]. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250416021720.12305-3-chang.seok.bae@intel.com
2025-04-14x86/fpu: Clarify FPU context cacheline alignmentIngo Molnar
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/Z_ejggklB5-IWB5W@gmail.com
2025-04-14x86/fpu: Use 'fpstate' variable names consistentlyIngo Molnar
A few uses of 'fps' snuck in, which is rather confusing (to me) as it suggests frames-per-second. ;-) Rename them to the canonical 'fpstate' name. No change in functionality. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250409211127.3544993-9-mingo@kernel.org
2025-04-14x86/fpu: Remove init_task FPU state dependencies, add debugging warning for ↵Ingo Molnar
PF_KTHREAD tasks init_task's FPU state initialization was a bit of a hack: __x86_init_fpu_begin = .; . = __x86_init_fpu_begin + 128*PAGE_SIZE; __x86_init_fpu_end = .; But the init task isn't supposed to be using the FPU context in any case, so remove the hack and add in some debug warnings. As Linus noted in the discussion, the init task (and other PF_KTHREAD tasks) *can* use the FPU via kernel_fpu_begin()/_end(), but they don't need the context area because their FPU use is not preemptible or reentrant, and they don't return to user-space. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250409211127.3544993-8-mingo@kernel.org
2025-04-14x86/fpu: Make sure x86_task_fpu() doesn't get called for ↵Ingo Molnar
PF_KTHREAD|PF_USER_WORKER tasks during exit fpu__drop() and arch_release_task_struct() calls x86_task_fpu() unconditionally, while the FPU context area will not be present if it's the init task, and should not be in use when it's some other type of kthread. Return early for PF_KTHREAD or PF_USER_WORKER tasks. The debug warning in x86_task_fpu() will catch any kthreads attempting to use the FPU save area. Fixed-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250409211127.3544993-7-mingo@kernel.org
2025-04-14x86/fpu: Push 'fpu' pointer calculation into the fpu__drop() callIngo Molnar
This encapsulates the fpu__drop() functionality better, and it will also enable other changes that want to check a task for PF_KTHREAD before calling x86_task_fpu(). Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250409211127.3544993-6-mingo@kernel.org
2025-04-14x86/fpu: Remove the thread::fpu pointerIngo Molnar
As suggested by Oleg, remove the thread::fpu pointer, as we can calculate it via x86_task_fpu() at compile-time. This improves code generation a bit: kepler:~/tip> size vmlinux.before vmlinux.after text data bss dec hex filename 26475405 10435342 1740804 38651551 24dc69f vmlinux.before 26475339 10959630 1216516 38651485 24dc65d vmlinux.after Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250409211127.3544993-5-mingo@kernel.org
2025-04-14x86/fpu: Make task_struct::thread constant sizeIngo Molnar
Turn thread.fpu into a pointer. Since most FPU code internals work by passing around the FPU pointer already, the code generation impact is small. This allows us to remove the old kludge of task_struct being variable size: struct task_struct { ... /* * New fields for task_struct should be added above here, so that * they are included in the randomized portion of task_struct. */ randomized_struct_fields_end /* CPU-specific state of this task: */ struct thread_struct thread; /* * WARNING: on x86, 'thread_struct' contains a variable-sized * structure. It *MUST* be at the end of 'task_struct'. * * Do not put anything below here! */ }; ... which creates a number of problems, such as requiring thread_struct to be the last member of the struct - not allowing it to be struct-randomized, etc. But the primary motivation is to allow the decoupling of task_struct from hardware details (<asm/processor.h> in particular), and to eventually allow the per-task infrastructure: DECLARE_PER_TASK(type, name); ... per_task(current, name) = val; ... which requires task_struct to be a constant size struct. The fpu_thread_struct_whitelist() quirk to hardened usercopy can be removed, now that the FPU structure is not embedded in the task struct anymore, which reduces text footprint a bit. Fixed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250409211127.3544993-4-mingo@kernel.org
2025-04-14x86/fpu: Convert task_struct::thread.fpu accesses to use x86_task_fpu()Ingo Molnar
This will make the removal of the task_struct::thread.fpu array easier. No change in functionality - code generated before and after this commit is identical on x86-defconfig: kepler:~/tip> diff -up vmlinux.before.asm vmlinux.after.asm kepler:~/tip> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250409211127.3544993-3-mingo@kernel.org
2025-04-14x86/fpu/xstate: Adjust xstate copying logic for user ABIChang S. Bae
== Background == As feature positions in the userspace XSAVE buffer do not always align with their feature numbers, the XSAVE format conversion needs to be reconsidered to align with the revised xstate size calculation logic. * For signal handling, XSAVE and XRSTOR are used directly to save and restore extended registers. * For ptrace, KVM, and signal returns (for 32-bit frame), the kernel copies data between its internal buffer and the userspace XSAVE buffer. If memcpy() were used for these cases, existing offset helpers — such as __raw_xsave_addr() or xstate_offsets[] — would be sufficient to handle the format conversion. == Problem == When copying data from the compacted in-kernel buffer to the non-compacted userspace buffer, the function follows the user_regset_get2_fn() prototype. This means it utilizes struct membuf helpers for the destination buffer. As defined in regset.h, these helpers update the memory pointer during the copy process, enforcing sequential writes within the loop. Since xstate components are processed sequentially, any component whose buffer position does not align with its feature number has an issue. == Solution == Replace for_each_extended_xfeature() with the newly introduced for_each_extended_xfeature_in_order(). This macro ensures xstate components are handled in the correct order based on their actual positions in the destination buffer, rather than their feature numbers. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250320234301.8342-5-chang.seok.bae@intel.com
2025-04-14x86/fpu/xstate: Adjust XSAVE buffer size calculationChang S. Bae
The current xstate size calculation assumes that the highest-numbered xstate feature has the highest offset in the buffer, determining the size based on the topmost bit in the feature mask. However, this assumption is not architecturally guaranteed -- higher-numbered features may have lower offsets. With the introduction of the xfeature order table and its helper macro, xstate components can now be traversed in their positional order. Update the non-compacted format handling to iterate through the table to determine the last-positioned feature. Then, set the offset accordingly. Since size calculation primarily occurs during initialization or in non-critical paths, looping to find the last feature is not expected to have a meaningful performance impact. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250320234301.8342-4-chang.seok.bae@intel.com
2025-04-14x86/fpu/xstate: Introduce xfeature order table and accessor macroChang S. Bae
The kernel has largely assumed that higher xstate component numbers correspond to later offsets in the buffer. However, this assumption no longer holds for the non-compacted format, where a newer state component may have a lower offset. When iterating over xstate components in offset order, using the feature number as an index may be misleading. At the same time, the CPU exposes each component’s size and offset based on its feature number, making it a key for state information. To provide flexibility in handling xstate ordering, introduce a mapping table: feature order -> feature number. The table is dynamically populated based on the CPU-exposed features and is sorted in offset order at boot time. Additionally, add an accessor macro to facilitate sequential traversal of xstate components based on their actual buffer positions, given a feature bitmask. This accessor macro will be particularly useful for computing custom non-compacted format sizes and iterating over xstate offsets in non-compacted buffers. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250320234301.8342-3-chang.seok.bae@intel.com
2025-04-14x86/fpu/xstate: Remove xstate offset checkChang S. Bae
Traditionally, new xstate components have been assigned sequentially, aligning feature numbers with their offsets in the XSAVE buffer. However, this ordering is not architecturally mandated in the non-compacted format, where a component's offset may not correspond to its feature number. The kernel caches CPUID-reported xstate component details, including size and offset in the non-compacted format. As part of this process, a sanity check is also conducted to ensure alignment between feature numbers and offsets. This check was likely intended as a general guideline rather than a strict requirement. Upcoming changes will support out-of-order offsets. Remove the check as becoming obsolete. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250320234301.8342-2-chang.seok.bae@intel.com
2025-04-10x86/msr: Rename 'wrmsrl()' to 'wrmsrq()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-25x86/fpu: Update the outdated comment above fpstate_init_user()Chao Gao
fpu_init_fpstate_user() was removed in: commit 582b01b6ab27 ("x86/fpu: Remove old KVM FPU interface"). Update that comment to accurately reflect the current state regarding its callers. Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250324131931.2097905-1-chao.gao@intel.com
2025-03-17x86/fpu/xstate: Fix inconsistencies in guest FPU xfeaturesChao Gao
Guest FPUs manage vCPU FPU states. They are allocated via fpu_alloc_guest_fpstate() and are resized in fpstate_realloc() when XFD features are enabled. Since the introduction of guest FPUs, there have been inconsistencies in the kernel buffer size and xfeatures: 1. fpu_alloc_guest_fpstate() uses fpu_user_cfg since its introduction. See: 69f6ed1d14c6 ("x86/fpu: Provide infrastructure for KVM FPU cleanup") 36487e6228c4 ("x86/fpu: Prepare guest FPU for dynamically enabled FPU features") 2. __fpstate_reset() references fpu_kernel_cfg to set storage attributes. 3. fpu->guest_perm uses fpu_kernel_cfg, affecting fpstate_realloc(). A recent commit in the tip:x86/fpu tree partially addressed the inconsistency between (1) and (3) by using fpu_kernel_cfg for size calculation in (1), but left fpu_guest->xfeatures and fpu_guest->perm still referencing fpu_user_cfg: https://lore.kernel.org/all/20250218141045.85201-1-stanspas@amazon.de/ 1937e18cc3cf ("x86/fpu: Fix guest FPU state buffer allocation size") The inconsistencies within fpu_alloc_guest_fpstate() and across the mentioned functions cause confusion. Fix them by using fpu_kernel_cfg consistently in fpu_alloc_guest_fpstate(), except for fields related to the UABI buffer. Referencing fpu_kernel_cfg won't impact functionalities, as: 1. fpu_guest->perm is overwritten shortly in fpu_init_guest_permissions() with fpstate->guest_perm, which already uses fpu_kernel_cfg. 2. fpu_guest->xfeatures is solely used to check if XFD features are enabled. Including supervisor xfeatures doesn't affect the check. Fixes: 36487e6228c4 ("x86/fpu: Prepare guest FPU for dynamically enabled FPU features") Suggested-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Woodhouse <dwmw2@infradead.org> Link: https://lore.kernel.org/r/20250317140613.1761633-1-chao.gao@intel.com
2025-03-17x86/fpu: Clarify the "xa" symbolic name used in the XSTATE* macrosBorislav Petkov (AMD)
Tie together the %[xa] in the XSAVE/XRSTOR definitions with the respective usage in the asm macros so that it is perfectly clear. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-03-13x86/fpu: Use XSAVE{,OPT,C,S} and XRSTOR{,S} mnemonics in xstate.hUros Bizjak
Current minimum required version of binutils is 2.25, which supports XSAVE{,OPT,C,S} and XRSTOR{,S} instruction mnemonics. Replace the byte-wise specification of XSAVE{,OPT,C,S} and XRSTOR{,S} with these proper mnemonics. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250313130251.383204-1-ubizjak@gmail.com
2025-03-06x86/fpu: Improve crypto performance by making kernel-mode FPU reliably ↵Eric Biggers
usable in softirqs Background: =========== Currently kernel-mode FPU is not always usable in softirq context on x86, since softirqs can nest inside a kernel-mode FPU section in task context, and nested use of kernel-mode FPU is not supported. Therefore, x86 SIMD-optimized code that can be called in softirq context has to sometimes fall back to non-SIMD code. There are two options for the fallback, both of which are pretty terrible: (a) Use a scalar fallback. This can be 10-100x slower than vectorized code because it cannot use specialized instructions like AES, SHA, or carryless multiplication. (b) Execute the request asynchronously using a kworker. In other words, use the "crypto SIMD helper" in crypto/simd.c. Currently most of the x86 en/decryption code (skcipher and aead algorithms) uses option (b), since this avoids the slow scalar fallback and it is easier to wire up. But option (b) is still really bad for its own reasons: - Punting the request to a kworker is bad for performance too. - It forces the algorithm to be marked as asynchronous (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API users who request a synchronous algorithm. That's another huge performance problem, which is especially unfortunate for users who don't even do en/decryption in softirq context. - It makes all en/decryption operations take a detour through crypto/simd.c. That involves additional checks and an additional indirect call, which slow down en/decryption for *everyone*. Fortunately, the skcipher and aead APIs are only usable in task and softirq context in the first place. Thus, if kernel-mode FPU were to be reliably usable in softirq context, no fallback would be needed. Indeed, other architectures such as arm, arm64, and riscv have already done this. Changes implemented: ==================== Therefore, this patch updates x86 accordingly to reliably support kernel-mode FPU in softirqs. This is done by just disabling softirq processing in kernel-mode FPU sections (when hardirqs are not already disabled), as that prevents the nesting that was problematic. This will delay some softirqs slightly, but only ones that would have otherwise been nested inside a task context kernel-mode FPU section. Any such softirqs would have taken the slow fallback path before if they tried to do any en/decryption. Now these softirqs will just run at the end of the task context kernel-mode FPU section (since local_bh_enable() runs pending softirqs) and will no longer take the slow fallback path. Alternatives considered: ======================== - Make kernel-mode FPU sections fully preemptible. This would require growing task_struct by another struct fpstate which is more than 2K. - Make softirqs save/restore the kernel-mode FPU state to a per-CPU struct fpstate when nested use is detected. Somewhat interesting, but seems unnecessary when a simpler solution exists. Performance results: ==================== I did some benchmarks with AES-XTS encryption of 16-byte messages (which is unrealistically small, but this makes it easier to see the overhead of kernel-mode FPU...). The baseline was 384 MB/s. Removing the use of crypto/simd.c, which this work makes possible, increases it to 487 MB/s, a +27% improvement in throughput. CPU was AMD Ryzen 9 9950X (Zen 5). No debugging options were enabled. [ mingo: Prettified the changelog and added performance results. ] Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250304204954.3901-1-ebiggers@kernel.org
2025-02-27x86/fpu/xstate: Simplify print_xstate_features()Chang S. Bae
print_xstate_features() currently invokes print_xstate_feature() multiple times on separate lines, which can be simplified in a loop. print_xstate_feature() already checks the feature's enabled status and is only called within print_xstate_features(). Inline print_xstate_feature() and iterate over features in a loop to streamline the enabling message. No functional changes. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250227184502.10288-2-chang.seok.bae@intel.com
2025-02-27x86/fpu: Refine and simplify the magic number check during signal returnChang S. Bae
Before restoring xstate from the user space buffer, the kernel performs sanity checks on these magic numbers: magic1 in the software reserved area, and magic2 at the end of XSAVE region. The position of magic2 is calculated based on the xstate size derived from the user space buffer. But, the in-kernel record is directly available and reliable for this purpose. This reliance on user space data is also inconsistent with the recent fix in: d877550eaf2d ("x86/fpu: Stop relying on userspace for info to fault in xsave buffer") Simply use fpstate->user_size, and then get rid of unnecessary size-evaluation code. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20241211014500.3738-1-chang.seok.bae@intel.com
2025-02-21x86/fpu: Fix guest FPU state buffer allocation sizeStanislav Spassov
Ongoing work on an optimization to batch-preallocate vCPU state buffers for KVM revealed a mismatch between the allocation sizes used in fpu_alloc_guest_fpstate() and fpstate_realloc(). While the former allocates a buffer sized to fit the default set of XSAVE features in UABI form (as per fpu_user_cfg), the latter uses its ksize argument derived (for the requested set of features) in the same way as the sizes found in fpu_kernel_cfg, i.e. using the compacted in-kernel representation. The correct size to use for guest FPU state should indeed be the kernel one as seen in fpstate_realloc(). The original issue likely went unnoticed through a combination of UABI size typically being larger than or equal to kernel size, and/or both amounting to the same number of allocated 4K pages. Fixes: 69f6ed1d14c6 ("x86/fpu: Provide infrastructure for KVM FPU cleanup") Signed-off-by: Stanislav Spassov <stanspas@amazon.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250218141045.85201-1-stanspas@amazon.de
2025-02-10x86/fpu: Fully optimize out WARN_ON_FPU()Eric Biggers
Currently WARN_ON_FPU evaluates its argument even if CONFIG_X86_DEBUG_FPU is disabled, which adds unnecessary instructions to several functions, for example kernel_fpu_begin(). Fix this by using BUILD_BUG_ON_INVALID(x) in the no-debug case rather than (void)(x). Fixes: 83242c515881 ("x86/fpu: Make WARN_ON_FPU() more robust in the !CONFIG_X86_DEBUG_FPU case") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20250127224523.94300-1-ebiggers%40kernel.org
2025-01-21Merge tag 'x86_cpu_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Remove the less generic CPU matching infra around struct x86_cpu_desc and use the generic struct x86_cpu_id thing - Remove magic naked numbers for CPUID functions and use proper defines of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around the tree - Smaller cleanups and improvements * tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Make all all CPUID leaf names consistent x86/fpu: Remove unnecessary CPUID level check x86/fpu: Move CPUID leaf definitions to common code x86/tsc: Remove CPUID "frequency" leaf magic numbers. x86/tsc: Move away from TSC leaf magic numbers x86/cpu: Move TSC CPUID leaf definition x86/cpu: Refresh DCA leaf reading code x86/cpu: Remove unnecessary MwAIT leaf checks x86/cpu: Use MWAIT leaf definition x86/cpu: Move MWAIT leaf definition to common header x86/cpu: Remove 'x86_cpu_desc' infrastructure x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id' x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id' x86/cpu: Expose only stepping min/max interface x86/cpu: Introduce new microcode matching helper x86/cpufeature: Document cpu_feature_enabled() as the default to use x86/paravirt: Remove the WBINVD callback x86/cpufeatures: Free up unused feature bits
2025-01-07x86/fpu: Ensure shadow stack is active before "getting" registersRick Edgecombe
The x86 shadow stack support has its own set of registers. Those registers are XSAVE-managed, but they are "supervisor state components" which means that userspace can not touch them with XSAVE/XRSTOR. It also means that they are not accessible from the existing ptrace ABI for XSAVE state. Thus, there is a new ptrace get/set interface for it. The regset code that ptrace uses provides an ->active() handler in addition to the get/set ones. For shadow stack this ->active() handler verifies that shadow stack is enabled via the ARCH_SHSTK_SHSTK bit in the thread struct. The ->active() handler is checked from some call sites of the regset get/set handlers, but not the ptrace ones. This was not understood when shadow stack support was put in place. As a result, both the set/get handlers can be called with XFEATURE_CET_USER in its init state, which would cause get_xsave_addr() to return NULL and trigger a WARN_ON(). The ssp_set() handler luckily has an ssp_active() check to avoid surprising the kernel with shadow stack behavior when the kernel is not ready for it (ARCH_SHSTK_SHSTK==0). That check just happened to avoid the warning. But the ->get() side wasn't so lucky. It can be called with shadow stacks disabled, triggering the warning in practice, as reported by Christina Schimpe: WARNING: CPU: 5 PID: 1773 at arch/x86/kernel/fpu/regset.c:198 ssp_get+0x89/0xa0 [...] Call Trace: <TASK> ? show_regs+0x6e/0x80 ? ssp_get+0x89/0xa0 ? __warn+0x91/0x150 ? ssp_get+0x89/0xa0 ? report_bug+0x19d/0x1b0 ? handle_bug+0x46/0x80 ? exc_invalid_op+0x1d/0x80 ? asm_exc_invalid_op+0x1f/0x30 ? __pfx_ssp_get+0x10/0x10 ? ssp_get+0x89/0xa0 ? ssp_get+0x52/0xa0 __regset_get+0xad/0xf0 copy_regset_to_user+0x52/0xc0 ptrace_regset+0x119/0x140 ptrace_request+0x13c/0x850 ? wait_task_inactive+0x142/0x1d0 ? do_syscall_64+0x6d/0x90 arch_ptrace+0x102/0x300 [...] Ensure that shadow stacks are active in a thread before looking them up in the XSAVE buffer. Since ARCH_SHSTK_SHSTK and user_ssp[SHSTK_EN] are set at the same time, the active check ensures that there will be something to find in the XSAVE buffer. [ dhansen: changelog/subject tweaks ] Fixes: 2fab02b25ae7 ("x86: Add PTRACE interface for shadow stack") Reported-by: Christina Schimpe <christina.schimpe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Christina Schimpe <christina.schimpe@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250107233056.235536-1-rick.p.edgecombe%40intel.com
2024-12-18x86/cpu: Make all all CPUID leaf names consistentDave Hansen
The leaf names are not consistent. Give them all a CPUID_LEAF_ prefix for consistency and vertical alignment. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Jiang <dave.jiang@intel.com> # for ioatdma bits Link: https://lore.kernel.org/all/20241213205040.7B0C3241%40davehans-spike.ostc.intel.com
2024-12-18x86/fpu: Remove unnecessary CPUID level checkDave Hansen
The CPUID level dependency table will entirely zap X86_FEATURE_XSAVE if the CPUID level is too low. This code is unreachable. Kill it. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Link: https://lore.kernel.org/all/20241213205038.6E71F9A4%40davehans-spike.ostc.intel.com
2024-12-18x86/fpu: Move CPUID leaf definitions to common codeDave Hansen
Move the XSAVE-related CPUID leaf definitions to common code. Then, use the new definition to remove the last magic number from the CPUID level dependency table. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/all/20241213205037.43C57CDE%40davehans-spike.ostc.intel.com
2024-12-02x86/pkeys: Ensure updated PKRU value is XRSTOR'dAruna Ramakrishna
When XSTATE_BV[i] is 0, and XRSTOR attempts to restore state component 'i' it ignores any value in the XSAVE buffer and instead restores the state component's init value. This means that if XSAVE writes XSTATE_BV[PKRU]=0 then XRSTOR will ignore the value that update_pkru_in_sigframe() writes to the XSAVE buffer. XSTATE_BV[PKRU] only gets written as 0 if PKRU is in its init state. On Intel CPUs, basically never happens because the kernel usually overwrites the init value (aside: this is why we didn't notice this bug until now). But on AMD, the init tracker is more aggressive and will track PKRU as being in its init state upon any wrpkru(0x0). Unfortunately, sig_prepare_pkru() does just that: wrpkru(0x0). This writes XSTATE_BV[PKRU]=0 which makes XRSTOR ignore the PKRU value in the sigframe. To fix this, always overwrite the sigframe XSTATE_BV with a value that has XSTATE_BV[PKRU]==1. This ensures that XRSTOR will not ignore what update_pkru_in_sigframe() wrote. The problematic sequence of events is something like this: Userspace does: * wrpkru(0xffff0000) (or whatever) * Hardware sets: XINUSE[PKRU]=1 Signal happens, kernel is entered: * sig_prepare_pkru() => wrpkru(0x00000000) * Hardware sets: XINUSE[PKRU]=0 (aggressive AMD init tracker) * XSAVE writes most of XSAVE buffer, including XSTATE_BV[PKRU]=XINUSE[PKRU]=0 * update_pkru_in_sigframe() overwrites PKRU in XSAVE buffer ... signal handling * XRSTOR sees XSTATE_BV[PKRU]==0, ignores just-written value from update_pkru_in_sigframe() Fixes: 70044df250d0 ("x86/pkeys: Update PKRU to enable all pkeys before XSAVE") Suggested-by: Rudi Horn <rudi.horn@oracle.com> Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241119174520.3987538-3-aruna.ramakrishna%40oracle.com
2024-12-02x86/pkeys: Change caller of update_pkru_in_sigframe()Aruna Ramakrishna
update_pkru_in_sigframe() will shortly need some information which is only available inside xsave_to_user_sigframe(). Move update_pkru_in_sigframe() inside the other function to make it easier to provide it that information. No functional changes. Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241119174520.3987538-2-aruna.ramakrishna%40oracle.com
2024-09-17Merge tag 'x86-mm-2024-09-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 memory management updates from Thomas Gleixner: - Make LAM enablement safe vs. kernel threads using a process mm temporarily as switching back to the process would not update CR3 and therefore not enable LAM causing faults in user space when using tagged pointers. Cure it by synchronizing LAM enablement via IPIs to all CPUs which use the related mm. - Cure a LAM harmless inconsistency between CR3 and the state during context switch. It's both confusing and prone to lead to real bugs - Handle alt stack handling for threads which run with a non-zero protection key. The non-zero key prevents the kernel to access the alternate stack. Cure it by temporarily enabling all protection keys for the alternate stack setup/restore operations. - Provide a EFI config table identity mapping for kexec kernel to prevent kexec fails because the new kernel cannot access the config table array - Use GB pages only when a full GB is mapped in the identity map as otherwise the CPU can speculate into reserved areas after the end of memory which causes malfunction on UV systems. - Remove the noisy and pointless SRAT table dump during boot - Use is_ioremap_addr() for iounmap() address range checks instead of high_memory. is_ioremap_addr() is more precise. * tag 'x86-mm-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioremap: Improve iounmap() address range checks x86/mm: Remove duplicate check from build_cr3() x86/mm: Remove unused NX related declarations x86/mm: Remove unused CR3_HW_ASID_BITS x86/mm: Don't print out SRAT table information x86/mm/ident_map: Use gbpages only where full GB page should be mapped. x86/kexec: Add EFI config table identity mapping for kexec kernel selftests/mm: Add new testcases for pkeys x86/pkeys: Restore altstack access in sigreturn() x86/pkeys: Update PKRU to enable all pkeys before XSAVE x86/pkeys: Add helper functions to update PKRU on the sigframe x86/pkeys: Add PKRU as a parameter in signal handling functions x86/mm: Cleanup prctl_enable_tagged_addr() nr_bits error checking x86/mm: Fix LAM inconsistency during context switch x86/mm: Use IPIs to synchronize LAM enablement
2024-09-17Merge tag 'x86-fpu-2024-09-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: "Provide FPU buffer layout in core dumps: Debuggers have guess the FPU buffer layout in core dumps, which is error prone. This is because AMD and Intel layouts differ. To avoid buggy heuristics add a ELF section which describes the buffer layout which can be retrieved by tools" * tag 'x86-fpu-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/elf: Add a new FPU buffer layout info to x86 core files