|
Avoid a conflict in <asm/cpufeatures.h> by merging pending x86/cpu changes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
> obviously wasn't allocated, thus the oops.
BUG: unable to handle page fault for address: f75fe000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
EIP: __free_pages_core+0x3c/0x74
...
Call Trace:
memblock_free_pages+0x11/0x2c
memblock_free_all+0x2ce/0x3a0
mm_core_init+0xf5/0x320
start_kernel+0x296/0x79c
i386_start_kernel+0xad/0xb0
startup_32_smp+0x151/0x154
The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.
The bug was introduced by this recent commit:
6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM, but after this change memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well, and that causes
access to mem_map[] entries beyond the end of the memory map.
To fix this, discard the memory after max_pfn from memblock on
32-bit systems, so that the core MM is aware only of actually
usable memory.
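A minimal sketch of the shape of such clamping (the function name is
illustrative and the real patch's hook point may differ; memblock_remove()
with an open-ended size discards everything from max_pfn upward):

    #include <linux/memblock.h>
    #include <linux/pfn.h>

    /* Hedged sketch: drop all memblock memory above max_pfn so that
     * memblock_free_all() never walks mem_map[] past ZONE_HIGHMEM. */
    static void __init clamp_memory_to_max_pfn(void)
    {
            memblock_remove(PFN_PHYS(max_pfn), (phys_addr_t)-1);
    }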
Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org # discussion and submission
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix hypercall detection on Xen guests
- Extend the AMD microcode loader SHA check to Zen5, to block loading
of any unreleased standalone Zen5 microcode patches
- Add new Intel CPU model number for Bartlett Lake
- Fix the workaround for AMD erratum 1054
- Fix buggy early memory acceptance between SEV-SNP guests and the EFI
stub
* tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/sev: Avoid shared GHCB page for early memory acceptance
x86/cpu/amd: Fix workaround for erratum 1054
x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
x86/microcode/AMD: Extend the SHA check to Zen5, block loading of any unreleased standalone Zen5 microcode patches
x86/xen: Fix __xen_hypercall_setfunc()
|
|
Erratum 1054 affects AMD Zen processors that are part of Family 17h
Models 00h-2Fh, and the workaround is to not set HWCR[IRPerfEn]. However,
when X86_FEATURE_ZEN1 was introduced, the condition to detect unaffected
processors was incorrectly changed in a way that the IRPerfEn bit gets
set only for unaffected Zen 1 processors.
Ensure that HWCR[IRPerfEn] is set for all unaffected processors. This
includes a subset of Zen 1 (Family 17h Models 30h and above) and all
later processors. Also clear X86_FEATURE_IRPERF on affected processors
so that the IRPerfCount register is not used by other entities like the
MSR PMU driver.
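A hedged sketch of the corrected logic ('c' is a cpuinfo pointer and the
'affected' variable is illustrative; MSR_K7_HWCR is 0xc0010015 and
IRPerfEn is bit 30):

    /* Affected: Family 17h, Models 00h-2Fh. */
    bool affected = c->x86 == 0x17 && c->x86_model <= 0x2f;

    if (affected)
            clear_cpu_cap(c, X86_FEATURE_IRPERF);  /* hide IRPerfCount */
    else if (cpu_has(c, X86_FEATURE_IRPERF))
            msr_set_bit(MSR_K7_HWCR, MSR_K7_HWCR_IRPERF_EN_BIT);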
Fixes: 232afb557835 ("x86/CPU/AMD: Add X86_FEATURE_ZEN1")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/caa057a9d6f8ad579e2f1abaa71efbd5bd4eaf6d.1744956467.git.sandipan.das@amd.com
|
|
Rename the rep_nop() function to reflect what it really does.
No functional change intended.
Suggested-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250418080805.83679-1-ubizjak@gmail.com
|
|
The current minimum required version of binutils is 2.25, which supports
the PAUSE instruction mnemonic. Replace "REP; NOP" with this proper
mnemonic.
No functional change intended.
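A sketch of the change's shape (function names illustrative; the actual
rename is the subject of the previous patch in this series):

    /* Before: PAUSE spelled via its legacy "REP; NOP" encoding: */
    static __always_inline void cpu_pause_old(void)
    {
            asm volatile("rep; nop" ::: "memory");
    }

    /* After: the proper mnemonic, available since binutils 2.25: */
    static __always_inline void cpu_pause_new(void)
    {
            asm volatile("pause" ::: "memory");
    }

Both spellings encode the same instruction (F3 90), so only the
readability of the source changes.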
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250418080805.83679-2-ubizjak@gmail.com
|
|
The minimum version of binutils required to compile the kernel is 2.25.
This version correctly handles the "rep" prefixes, so it is possible
to remove the semicolon, which was used to support ancient versions
of GNU as.
Due to the semicolon, the compiler considers "rep; insn" (or its
alternate "rep\n\tinsn" form) to be two separate instructions. Removing
the semicolon makes asm length calculations more accurate, which in turn
improves the compiler's scheduling and inlining decisions.
Removing the semicolon also enables assembler checks involving "rep"
prefixes. Trying to assemble e.g. "rep addl %eax, %ebx" results in:
Error: invalid instruction `add' after `rep'
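To illustrate with a self-contained userspace sketch (not a patch hunk),
the prefixed instruction is now written without the semicolon, as a single
instruction in extended asm:

    #include <stddef.h>

    /* "rep movsb" is one instruction to the compiler's length estimator,
     * instead of the two that "rep; movsb" appeared to be. The rep prefix
     * updates RDI, RSI and RCX, hence the "+" constraints. */
    static void *copy_rep_movsb(void *dst, const void *src, size_t n)
    {
            asm volatile("rep movsb"
                         : "+D" (dst), "+S" (src), "+c" (n)
                         : : "memory");
            return dst;
    }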
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20250418071437.4144391-2-ubizjak@gmail.com
|
|
Add support to emulate all NOP instructions as the original uprobe
instruction.
This change speeds up uprobe on top of all NOP instructions and is a
preparation for usdt probe optimization, that will be done on top of
NOP5 instructions.
With this change the usdt probe on top of NOP5s won't take the performance
hit compared to usdt probe on top of standard NOP instructions.
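A hedged sketch of the recognition step (byte patterns only; the real
patch relies on the kernel's x86 instruction decoder, and the helper name
here is illustrative):

    #include <stdbool.h>
    #include <string.h>

    /* NOP encodings of interest: the 1-byte NOP (0x90) and the canonical
     * 5-byte NOP (0F 1F 44 00 00) that usdt probes sit on top of. */
    static bool insn_is_nop5(const unsigned char *p)
    {
            static const unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

            return memcmp(p, nop5, sizeof(nop5)) == 0;
    }

    /* Emulation is then trivial: advance the instruction pointer by the
     * NOP's length and touch nothing else. */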
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Hao Luo <haoluo@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20250414083647.1234007-1-jolsa@kernel.org
|
|
The static key mmio_stale_data_clear controls the KVM-only mitigation for
the MMIO Stale Data vulnerability. Rename it to reflect its purpose.
No functional change.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250416-mmio-rename-v2-1-ad1f5488767c@linux.intel.com
|
|
The original function name was an overly compressed form of
'fpstate_regs', introduced by this commit:
e61d6310a0f8 ("x86/fpu: Reset permission and fpstate on exec()")
However, the term 'fpregs' typically refers to physical FPU registers. In
contrast, this function copies the init values to fpu->fpstate->regs, not
hardware registers.
Rename the function to better reflect what it actually does.
No functional change.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-11-chang.seok.bae@intel.com
|
|
The variable was previously referenced in KVM code but the last usage was
removed by:
ea4d6938d4c0 ("x86/fpu: Replace KVMs home brewed FPU copy from user")
Remove its export symbol.
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-10-chang.seok.bae@intel.com
|
|
The signal delivery logic was modified to always set the PKRU bit in
xregs_state->header->xfeatures by this commit:
ae6012d72fa6 ("x86/pkeys: Ensure updated PKRU value is XRSTOR'd")
However, the change derives the bitmask value using XGETBV(1), rather
than simply updating the buffer that already holds the value. Thus, this
approach induces an unnecessary dependency on XGETBV1 for PKRU handling.
Eliminate the dependency by using the established helper function.
Subsequently, remove the now-unused 'mask' argument.
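For reference, the XGETBV(1) read that the patch makes unnecessary for
PKRU handling looks roughly like this (a sketch):

    /* XGETBV with ECX=1 returns the XINUSE "in use" feature bitmap,
     * which requires the XGETBV1 extension to be present. */
    static inline unsigned long long xgetbv(unsigned int index)
    {
            unsigned int eax, edx;

            asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (index));
            return eax | ((unsigned long long)edx << 32);
    }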
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250416021720.12305-9-chang.seok.bae@intel.com
|
|
Currently, when saving register states in the signal frame, the legacy
feature bits are always set in xregs_state->header->xfeatures. This code
sequence can be generalized for reuse in similar cases.
Refactor the logic to ensure a consistent approach across similar usages.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-8-chang.seok.bae@intel.com
|
|
Not all paths that lead to fpu__init_disable_system_xstate() currently
emit a message indicating that XSAVE has been disabled. Move the print
statement into the function to ensure the message is printed in all cases.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250416021720.12305-7-chang.seok.bae@intel.com
|
|
Now that APX is secured against conflicting MPX, it is ready to be
enabled.
Include APX in the enabled xfeature set.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-5-chang.seok.bae@intel.com
|
|
XSTATE components are architecturally independent. There is no rule
requiring their offsets in the non-compacted format to be strictly
ascending or mutually non-overlapping. However, in practice, such
overlaps have not occurred -- until now.
APX is introduced as xstate component 19, following AMX. In the
non-compacted XSAVE format, its offset overlaps with the space previously
occupied by the now-deprecated MPX feature:
45fc24e89b7c ("x86/mpx: remove MPX from arch/x86")
To prevent conflicts, the kernel must ensure that the CPU never exposes
both features at the same time. If it does, that indicates unreliable
hardware, and XSAVE should be disabled entirely as a precautionary measure.
Add a sanity check to detect this condition and disable XSAVE if an
invalid hardware configuration is identified.
Note: MPX state components remain enabled on legacy systems solely for
KVM guest support.
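A minimal sketch of the check's shape (mask values follow the xstate
component numbers: MPX occupies components 3-4, APX is component 19; the
helper name is illustrative):

    #define XFEATURE_MASK_BNDREGS   (1ULL <<  3)    /* MPX bound regs    */
    #define XFEATURE_MASK_BNDCSR    (1ULL <<  4)    /* MPX bound config  */
    #define XFEATURE_MASK_APX       (1ULL << 19)    /* APX extended GPRs */

    /* Both enumerated at once -> unreliable hardware: disable XSAVE. */
    static bool xstate_apx_mpx_conflict(unsigned long long xfeatures)
    {
            return (xfeatures & XFEATURE_MASK_APX) &&
                   (xfeatures & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR));
    }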
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-4-chang.seok.bae@intel.com
|
|
Advanced Performance Extensions (APX) is associated with a new state
component number 19. To support saving and restoring of the corresponding
registers via the XSAVE mechanism, introduce the component definition
along with the necessary sanity checks.
Define the new component number, state name, and the corresponding
register data type. Then, extend the size checker to validate that type
and explicitly list the APX feature flag as a dependency for the new
component in xsave_cpuid_features[].
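A hedged sketch of the register data type, assuming the component holds
the 16 new 64-bit general-purpose registers (r16-r31), i.e. a 128-byte
component:

    /* State of the extended GPRs introduced by APX: */
    struct apx_state {
            unsigned long long egpr[16];
    } __attribute__((packed));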
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-3-chang.seok.bae@intel.com
|
|
Intel Advanced Performance Extensions (APX) introduce a new set of
general-purpose registers, managed as an extended state component via the
xstate management facility.
Before enabling this new xstate, define a feature flag to clarify the
dependency in xsave_cpuid_features[]. APX is enumerated in CPUID leaf 7,
subleaf 1 (EDX). Since this CPUID word is not yet allocated as a feature
word in the kernel, place the flag in a scattered feature word.
While this feature is intended only for userspace, exposing it via
/proc/cpuinfo is unnecessary. Instead, the existing arch_prctl(2)
mechanism with the ARCH_GET_XCOMP_SUPP option can be used to query the
feature availability.
Finally, clarify that APX depends on XSAVE.
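A userspace query along the lines the commit describes (a sketch;
ARCH_GET_XCOMP_SUPP is 0x1021 in <asm/prctl.h>, and APX is assumed to be
bit 19 of the returned mask, per its xstate component number):

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define ARCH_GET_XCOMP_SUPP 0x1021

    int main(void)
    {
            unsigned long long mask = 0;

            /* Ask the kernel which xstate components are supported. */
            if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &mask))
                    return 1;
            printf("xfeatures: %#llx, APX supported: %s\n",
                   mask, (mask >> 19) & 1 ? "yes" : "no");
            return 0;
    }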
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-2-chang.seok.bae@intel.com
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Conflicts:
tools/arch/x86/include/asm/cpufeatures.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Shorten X86_FEATURE_AMD_HETEROGENEOUS_CORES to X86_FEATURE_AMD_HTR_CORES
to make the last column aligned consistently in the whole file.
No functional changes.
Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250415175410.2944032-4-xin@zytor.com
|
|
Shorten X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT to
X86_FEATURE_CLEAR_BHB_VMEXIT to make the last column aligned
consistently in the whole file.
There's no need to explain in the name what the mitigation does.
No functional changes.
Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250415175410.2944032-3-xin@zytor.com
|
|
Whack this thing because:
- the "unknown" handling is done only for this vuln and not for the
others
- it doesn't do anything besides reporting things differently. It
doesn't apply any mitigations - it is simply causing unnecessary
complications to the code which don't bring anything besides
maintenance overhead to what is already a very nasty spaghetti pile
- all the currently unaffected CPUs can also be in "unknown" status so
there's no need for special handling here
so get rid of it.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: David Kaplan <david.kaplan@amd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250414150951.5345-1-bp@kernel.org
|
|
Collect AMD-specific platform header files in <asm/amd/*.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <superm1@kernel.org>
Link: https://lore.kernel.org/r/20250413084144.3746608-7-mingo@kernel.org
|
|
Collect AMD-specific platform header files in <asm/amd/*.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <superm1@kernel.org>
Link: https://lore.kernel.org/r/20250413084144.3746608-4-mingo@kernel.org
|
|
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/Z_ejggklB5-IWB5W@gmail.com
|
|
A few uses of 'fps' snuck in, which is rather confusing
(to me) as it suggests frames-per-second. ;-)
Rename them to the canonical 'fpstate' name.
No change in functionality.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-9-mingo@kernel.org
|
|
PF_KTHREAD tasks
init_task's FPU state initialization was a bit of a hack:

    __x86_init_fpu_begin = .;
    . = __x86_init_fpu_begin + 128*PAGE_SIZE;
    __x86_init_fpu_end = .;
But the init task isn't supposed to be using the FPU context
in any case, so remove the hack and add in some debug warnings.
As Linus noted in the discussion, the init task (and other
PF_KTHREAD tasks) *can* use the FPU via kernel_fpu_begin()/_end(),
but they don't need the context area because their FPU use is not
preemptible or reentrant, and they don't return to user-space.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250409211127.3544993-8-mingo@kernel.org
|
|
PF_KTHREAD|PF_USER_WORKER tasks during exit
fpu__drop() and arch_release_task_struct() call x86_task_fpu()
unconditionally, but the FPU context area will not be present
if it's the init task, and should not be in use when it's some
other type of kthread.
Return early for PF_KTHREAD or PF_USER_WORKER tasks. The debug
warning in x86_task_fpu() will catch any kthreads attempting to
use the FPU save area.
Fixed-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-7-mingo@kernel.org
|
|
This encapsulates the fpu__drop() functionality better, and it
will also enable other changes that want to check a task for
PF_KTHREAD before calling x86_task_fpu().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-6-mingo@kernel.org
|
|
As suggested by Oleg, remove the thread::fpu pointer, as we can
calculate it via x86_task_fpu() at compile-time.
This improves code generation a bit:
    kepler:~/tip> size vmlinux.before vmlinux.after
        text     data     bss      dec      hex     filename
    26475405 10435342 1740804 38651551 24dc69f vmlinux.before
    26475339 10959630 1216516 38651485 24dc65d vmlinux.after
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250409211127.3544993-5-mingo@kernel.org
|
|
Turn thread.fpu into a pointer. Since most FPU code internals work by passing
around the FPU pointer already, the code generation impact is small.
This allows us to remove the old kludge of task_struct being variable size:
    struct task_struct {
            ...
            /*
             * New fields for task_struct should be added above here, so that
             * they are included in the randomized portion of task_struct.
             */
            randomized_struct_fields_end

            /* CPU-specific state of this task: */
            struct thread_struct    thread;

            /*
             * WARNING: on x86, 'thread_struct' contains a variable-sized
             * structure. It *MUST* be at the end of 'task_struct'.
             *
             * Do not put anything below here!
             */
    };
... which creates a number of problems, such as requiring thread_struct to
be the last member of the struct and not allowing it to be
struct-randomized, etc.
But the primary motivation is to allow the decoupling of task_struct from
hardware details (<asm/processor.h> in particular), and to eventually allow
the per-task infrastructure:
    DECLARE_PER_TASK(type, name);
    ...
    per_task(current, name) = val;
... which requires task_struct to be a constant size struct.
The fpu_thread_struct_whitelist() quirk to hardened usercopy can be removed,
now that the FPU structure is not embedded in the task struct anymore, which
reduces text footprint a bit.
Fixed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-4-mingo@kernel.org
|
|
This will make the removal of the task_struct::thread.fpu array
easier.
No change in functionality - code generated before and after this
commit is identical on x86-defconfig:
    kepler:~/tip> diff -up vmlinux.before.asm vmlinux.after.asm
    kepler:~/tip>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250409211127.3544993-3-mingo@kernel.org
|
|
== Background ==
As feature positions in the userspace XSAVE buffer do not always align
with their feature numbers, the XSAVE format conversion needs to be
reconsidered to align with the revised xstate size calculation logic.
* For signal handling, XSAVE and XRSTOR are used directly to save and
restore extended registers.
* For ptrace, KVM, and signal returns (for 32-bit frame), the kernel
copies data between its internal buffer and the userspace XSAVE buffer.
If memcpy() were used for these cases, existing offset helpers — such
as __raw_xsave_addr() or xstate_offsets[] — would be sufficient to
handle the format conversion.
== Problem ==
When copying data from the compacted in-kernel buffer to the
non-compacted userspace buffer, the function follows the
user_regset_get2_fn() prototype. This means it utilizes struct membuf
helpers for the destination buffer. As defined in regset.h, these helpers
update the memory pointer during the copy process, enforcing sequential
writes within the loop.
Since xstate components are processed sequentially, any component whose
buffer position does not align with its feature number has an issue.
== Solution ==
Replace for_each_extended_xfeature() with the newly introduced
for_each_extended_xfeature_in_order(). This macro ensures xstate
components are handled in the correct order based on their actual
positions in the destination buffer, rather than their feature numbers.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320234301.8342-5-chang.seok.bae@intel.com
|
|
The current xstate size calculation assumes that the highest-numbered
xstate feature has the highest offset in the buffer, determining the size
based on the topmost bit in the feature mask. However, this assumption is
not architecturally guaranteed -- higher-numbered features may have lower
offsets.
With the introduction of the xfeature order table and its helper macro,
xstate components can now be traversed in their positional order. Update
the non-compacted format handling to iterate through the table to
determine the last-positioned feature. Then, set the offset accordingly.
Since size calculation primarily occurs during initialization or in
non-critical paths, looping to find the last feature is not expected to
have a meaningful performance impact.
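A sketch of the revised calculation, assuming the ordered-traversal
helpers from the preparatory patch (the order array name and the function
itself are illustrative; xstate_offsets[]/xstate_sizes[] are the kernel's
cached CPUID values):

    /* Compute the non-compacted buffer size for @xfeatures by scanning
     * components in positional order: the topmost feature bit no longer
     * implies the highest offset. */
    static unsigned int xstate_size_in_order(u64 xfeatures)
    {
            unsigned int size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
            int i;

            for_each_extended_xfeature_in_order(i, xfeatures) {
                    unsigned int nr = xfeature_uncompact_order[i];

                    size = max(size, xstate_offsets[nr] + xstate_sizes[nr]);
            }
            return size;
    }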
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320234301.8342-4-chang.seok.bae@intel.com
|
|
The kernel has largely assumed that higher xstate component numbers
correspond to later offsets in the buffer. However, this assumption no
longer holds for the non-compacted format, where a newer state component
may have a lower offset.
When iterating over xstate components in offset order, using the feature
number as an index may be misleading. At the same time, the CPU exposes
each component’s size and offset based on its feature number, making it a
key for state information.
To provide flexibility in handling xstate ordering, introduce a mapping
table: feature order -> feature number. The table is dynamically
populated based on the CPU-exposed features and is sorted in offset order
at boot time.
Additionally, add an accessor macro to facilitate sequential traversal of
xstate components based on their actual buffer positions, given a feature
bitmask. This accessor macro will be particularly useful for computing
custom non-compacted format sizes and iterating over xstate offsets in
non-compacted buffers.
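One plausible shape for the table and accessor (hedged: the macro name
for_each_extended_xfeature_in_order() comes from the follow-up patches in
this series, while the array name here is illustrative):

    /* Position-in-buffer -> feature number, populated from CPUID and
     * sorted by each component's non-compacted offset at boot time: */
    static unsigned int xfeature_uncompact_order[XFEATURE_MAX] __ro_after_init;

    #define for_each_extended_xfeature_in_order(i, mask)                \
            for ((i) = 0; (i) < XFEATURE_MAX; (i)++)                    \
                    if ((mask) & BIT_ULL(xfeature_uncompact_order[(i)]))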
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320234301.8342-3-chang.seok.bae@intel.com
|
|
Traditionally, new xstate components have been assigned sequentially,
aligning feature numbers with their offsets in the XSAVE buffer. However,
this ordering is not architecturally mandated in the non-compacted
format, where a component's offset may not correspond to its feature
number.
The kernel caches CPUID-reported xstate component details, including size
and offset in the non-compacted format. As part of this process, a sanity
check is also conducted to ensure alignment between feature numbers and
offsets.
This check was likely intended as a general guideline rather than a
strict requirement. Upcoming changes will support out-of-order offsets.
Remove the check, as it is now obsolete.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320234301.8342-2-chang.seok.bae@intel.com
|
|
unreleased standalone Zen5 microcode patches
All Zen5 machines out there should get BIOS updates which update to the
correct microcode patches addressing the microcode signature issue.
However, silly people carve out random microcode blobs from BIOS
packages and think they are doing other people a service this way...
Block loading of any unreleased standalone Zen5 microcode patches.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250410114222.32523-1-bp@kernel.org
|
|
The startup code that constructs the kernel virtual mapping runs from
the 1:1 mapping of memory itself, and therefore, cannot use absolute
symbol references. Before making changes in subsequent patches, move
this code into a separate source file under arch/x86/boot/startup/ where
all such code will be kept from now on.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-16-ardb+git@google.com
|
|
Move the early GDT/IDT setup code that runs long before the kernel
virtual mapping is up into arch/x86/boot/startup/, and build it in a way
that ensures that the code tolerates being called from the 1:1 mapping
of memory. The code itself is left unchanged by this patch.
Also tweak the sed symbol matching pattern in the decompressor to match
on lower case 't' or 'b', as these will be emitted by Clang for symbols
with hidden linkage.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-15-ardb+git@google.com
|
|
RIP_REL_REF() is used in non-PIC C code that is called very early,
before the kernel virtual mapping is up, which is the mapping that the
linker expects. It is currently used in two different ways:
- to refer to the value of a global variable, including as an lvalue in
assignments;
- to take the address of a global variable via the mapping that the code
currently executes at.
The former case is only needed in non-PIC code, as PIC code will never
use absolute symbol references when the address of the symbol is not
being used. But taking the address of a variable in PIC code may still
require extra care, as a stack allocated struct assignment may be
emitted as a memcpy() from a statically allocated copy in .rodata.
For instance, this:

    void startup_64_setup_gdt_idt(void)
    {
            struct desc_ptr startup_gdt_descr = {
                    .address = (__force unsigned long)gdt_page.gdt,
                    .size    = GDT_SIZE - 1,
            };

may result in an absolute symbol reference in PIC code, even though the
struct is allocated on the stack and populated at runtime.
To address this case, make rip_rel_ptr() accessible in PIC code, and
update any existing uses where the address of a global variable is
taken using RIP_REL_REF.
Once all code of this nature has been moved into arch/x86/boot/startup
and built with -fPIC, RIP_REL_REF() can be retired, and only
rip_rel_ptr() will remain.
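For context, the helper's shape is roughly the following (a hedged sketch
of the rip_rel_ptr() definition as found in <asm/asm.h>):

    static __always_inline void *rip_rel_ptr(void *p)
    {
            /* Force a RIP-relative LEA, so the address is computed
             * relative to wherever the code currently executes. */
            asm("leaq %c1(%%rip), %0" : "=r" (p) : "i" (p));

            return p;
    }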
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-14-ardb+git@google.com
|
|
Now that unuse_temporary_mm() lives in tlb.c it can access
cpu_tlbstate.loaded_mm.
[ mingo: Merged it on top of x86/alternatives ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-5-mingo@kernel.org
|
|
This prepares them for use outside of the alternative machinery.
The code is unchanged.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-4-mingo@kernel.org
|
|
In commit 209954cbc7d0 ("x86/mm/tlb: Update mm_cpumask lazily")
unuse_temporary_mm() grew the assumption that it gets used on
poking_mm exclusively. While this is currently true, lets not hard
code this assumption.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-2-mingo@kernel.org
|
|
Reference header files using their canonical form <linux/cacheinfo.h>.
Standardize on CPUID(0xN), instead of CPUID(N), for all standard leaves.
This removes ambiguity and aligns them with their extended counterparts
like CPUID(0x8000001d).
References: 0dd09e215a39 ("x86/cacheinfo: Apply maintainer-tip coding style fixes")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20250411070401.1358760-3-darwi@linutronix.de
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
smp_text_poke_batch_finish()
Simplify the alternatives interface some more by moving the
poke_batch_finish check into poke_batch_process and renaming the latter.
The net effect is one less function name to consider when reading the
code.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-54-mingo@kernel.org
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-53-mingo@kernel.org
|
|
Join the text_poke_*() API namespace.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-52-mingo@kernel.org
|
|
- Capitalize 'INT3' consistently,
- make it clear that 'sync cores' means an SMP sync to all CPUs,
- fix typos and spelling.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-51-mingo@kernel.org
|