Age | Commit message (Collapse) | Author |
|
Sometimes it can be very useful to run CPU vulnerability mitigations on
systems where they aren't known to mitigate any real-world
vulnerabilities. This can be handy for mundane reasons like debugging
HW-agnostic logic on whatever machine is to hand, but also for research
reasons: while some mitigations are focused on individual vulns and
uarches, others are fairly general, and it's strategically useful to
have an idea how they'd perform on systems where they aren't currently
needed.
As evidence for this being useful, a flag specifically for Retbleed was
added in:
5c9a92dec323 ("x86/bugs: Add retbleed=force").
Since CPU bugs are tracked using the same basic mechanism as features,
and there are already parameters for manipulating them by hand, extend
that mechanism to support bug as well as capabilities.
With this patch and setcpuid=srso, a QEMU guest running on an Intel host
will boot with Safe-RET enabled.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-3-7dc71bce742a@google.com
|
|
In preparation for adding support to inject fake CPU bugs at boot-time,
add a general facility to force enablement of CPU flags.
The flag taints the kernel and the documentation attempts to be clear
that this is highly unsuitable for uses outside of kernel development
and platform experimentation.
The new arg is parsed just like clearcpuid, but instead of leading to
setup_clear_cpu_cap() it leads to setup_force_cpu_cap().
I've tested this by booting a nested QEMU guest on an Intel host, which
with setcpuid=svm will claim that it supports AMD virtualization.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-2-7dc71bce742a@google.com
|
|
This is in preparation for a later commit that will reuse this code, to
make review convenient.
Factor out a helper function which does the full handling for this arg
including printing info to the console.
No functional change intended.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-1-7dc71bce742a@google.com
|
|
running in a virtual machine
When running in a virtual machine, we might see the original hardware CPU
vendor string (i.e. "AuthenticAMD"), but a model and family ID set by the
hypervisor. In case we run on AMD hardware and the hypervisor sets a model
ID < 0x14, the LAHF cpu feature is eliminated from the the list of CPU
capabilities present to circumvent a bug with some BIOSes in conjunction with
AMD K8 processors.
Parsing the flags list from /proc/cpuinfo seems to be happening mostly in
bash scripts and prebuilt Docker containers, as it does not need to have
additionals tools present – even though more reliable ways like using "kcpuid",
which calls the CPUID instruction instead of parsing a list, should be preferred.
Scripts, that use /proc/cpuinfo to determine if the current CPU is
"compliant" with defined microarchitecture levels like x86-64-v2 will falsely
claim the CPU is incapable of modern CPU instructions when "lahf_lm" is missing
in that flags list.
This can prevent some docker containers from starting or build scripts to create
unoptimized binaries.
Admittably, this is more a small inconvenience than a severe bug in the kernel
and the shoddy scripts that rely on parsing /proc/cpuinfo
should be fixed instead.
This patch adds an additional check to see if we're running inside a
virtual machine (X86_FEATURE_HYPERVISOR is present), which, to my
understanding, can't be present on a real K8 processor as it was introduced
only with the later/other Athlon64 models.
Example output with the "lahf_lm" flag missing in the flags list
(should be shown between "hypervisor" and "abm"):
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 6
model name : Common KVM processor
stepping : 1
microcode : 0x1000065
cpu MHz : 2599.998
cache size : 512 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp
lm rep_good nopl cpuid extd_apicid tsc_known_freq pni
pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c hypervisor abm
3dnowprefetch vmmcall bmi1 avx2 bmi2 xsaveopt
... while kcpuid shows the feature to be present in the CPU:
# kcpuid -d | grep lahf
lahf_lm - LAHF/SAHF available in 64-bit mode
[ mingo: Updated the comment a bit, incorporated Boris's review feedback. ]
Signed-off-by: Max Grobecker <max@grobecker.info>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
|
|
Fix this sparse warning:
arch/x86/kernel/quirks.c:662:6: warning: symbol 'x86_apple_machine' was not declared. Should it be static?
Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241126015636.3463994-1-zhangkunbo@huawei.com
|
|
Fix this sparse warning:
arch/x86/kernel/i8259.c:57:15: warning: symbol 'io_apic_irqs' was not declared. Should it be static?
Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241126020511.3464664-1-zhangkunbo@huawei.com
|
|
The first GDT descriptor is reserved as 'NULL descriptor'. As bits 0
and 1 of a segment selector, i.e., the RPL bits, are NOT used to index
GDT, selector values 0~3 all point to the NULL descriptor, thus values
0, 1, 2 and 3 are all valid NULL selector values.
When a NULL selector value is to be loaded into a segment register,
reload_segments() sets its RPL bits. Later IRET zeros ES, FS, GS, and
DS segment registers if any of them is found to have any nonzero NULL
selector value. The two operations offset each other to actually effect
a nop.
Besides, zeroing of RPL in NULL selector values is an information leak
in pre-FRED systems as userspace can spot any interrupt/exception by
loading a nonzero NULL selector, and waiting for it to become zero.
But there is nothing software can do to prevent it before FRED.
ERETU, the only legit instruction to return to userspace from kernel
under FRED, by design does NOT zero any segment register to avoid this
problem behavior.
As such, leave NULL selector values 0~3 unchanged and close the leak.
Do the same on 32-bit kernel as well.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241126184529.1607334-1-xin@zytor.com
|
|
print_xstate_features() currently invokes print_xstate_feature() multiple
times on separate lines, which can be simplified in a loop.
print_xstate_feature() already checks the feature's enabled status and is
only called within print_xstate_features(). Inline print_xstate_feature()
and iterate over features in a loop to streamline the enabling message.
No functional changes.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250227184502.10288-2-chang.seok.bae@intel.com
|
|
Before restoring xstate from the user space buffer, the kernel performs
sanity checks on these magic numbers: magic1 in the software reserved
area, and magic2 at the end of XSAVE region.
The position of magic2 is calculated based on the xstate size derived
from the user space buffer. But, the in-kernel record is directly
available and reliable for this purpose.
This reliance on user space data is also inconsistent with the recent
fix in:
d877550eaf2d ("x86/fpu: Stop relying on userspace for info to fault in xsave buffer")
Simply use fpstate->user_size, and then get rid of unnecessary
size-evaluation code.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211014500.3738-1-chang.seok.bae@intel.com
|
|
Refactor parity calculations to use the standard parity8() helper. This
change eliminates redundant implementations and improves code
efficiency.
[ ubizjak: Updated the patch to apply to the latest x86 tree. ]
Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250227125616.2253774-1-ubizjak@gmail.com
|
|
Because calls to get_this_hybrid_cpu_type() and
get_this_hybrid_cpu_native_id() are not required now. cpu-type and
native-model-id are cached at boot in per-cpu struct cpuinfo_topology.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211-add-cpu-type-v5-4-2ae010f50370@linux.intel.com
|
|
The hex values in CPU debug interface are not prefixed with 0x. This may
cause misinterpretation of values. Fix it.
[ mingo: Restore previous vertical alignment of the output. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211-add-cpu-type-v5-1-2ae010f50370@linux.intel.com
|
|
The x86-32 kernel used to support multiple platforms with more than eight
logical CPUs, from the 1999-2003 timeframe: Sequent NUMA-Q, IBM Summit,
Unisys ES7000 and HP F8. Support for all except the latter was dropped
back in 2014, leaving only the F8 based DL740 and DL760 G2 machines in
this catery, with up to eight single-core Socket-603 Xeon-MP processors
with hyperthreading.
Like the already removed machines, the HP F8 servers at the time cost
upwards of $100k in typical configurations, but were quickly obsoleted
by their 64-bit Socket-604 cousins and the AMD Opteron.
Earlier servers with up to 8 Pentium Pro or Xeon processors remain
fully supported as they had no hyperthreading. Similarly, the more
common 4-socket Xeon-MP machines with hyperthreading using Intel
or ServerWorks chipsets continue to work without this, and all the
multi-core Xeon processors also run 64-bit kernels.
While the "bigsmp" support can also be used to run on later 64-bit
machines (including VM guests), it seems best to discourage that
and get any remaining users to update their kernels to 64-bit builds
on these. As a side-effect of this, there is also no more need to
support NUMA configurations on 32-bit x86, as all true 32-bit
NUMA platforms are already gone.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-3-arnd@kernel.org
|
|
We are going to apply a new series that conflicts with pending
work in x86/mm, so merge in x86/mm to avoid it, and also to
refresh the x86/cpu branch with fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
X86_FEATURE_USE_IBPB was introduced in:
2961298efe1e ("x86/cpufeatures: Clean up Spectre v2 related CPUID flags")
to have separate flags for when the CPU supports IBPB (i.e. X86_FEATURE_IBPB)
and when an IBPB is actually used to mitigate Spectre v2.
Ever since then, the uses of IBPB expanded. The name became confusing
because it does not control all IBPB executions in the kernel.
Furthermore, because its name is generic and it's buried within
indirect_branch_prediction_barrier(), it's easy to use it not knowing
that it is specific to Spectre v2.
X86_FEATURE_USE_IBPB is no longer needed because all the IBPB executions
it used to control are now controlled through other means (e.g.
switch_mm_*_ibpb static branches).
Remove the unused feature bit.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20250227012712.3193063-7-yosry.ahmed@linux.dev
|
|
Instead of using X86_FEATURE_USE_IBPB to guard the IBPB execution in KVM
when a new vCPU is loaded, introduce a static branch, similar to
switch_mm_*_ibpb.
This makes it obvious in spectre_v2_user_select_mitigation() what
exactly is being toggled, instead of the unclear X86_FEATURE_USE_IBPB
(which will be shortly removed). It also provides more fine-grained
control, making it simpler to change/add paths that control the IBPB in
the vCPU switch path without affecting other IBPBs.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-5-yosry.ahmed@linux.dev
|
|
If X86_FEATURE_USE_IBPB is not set, then both spectre_v2_user_ibpb and
spectre_v2_user_stibp are set to SPECTRE_V2_USER_NONE in
spectre_v2_user_select_mitigation(). Since ib_prctl_set() already checks
for this before performing the IBPB, the X86_FEATURE_USE_IBPB check is
redundant. Remove it.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20250227012712.3193063-4-yosry.ahmed@linux.dev
|
|
indirect_branch_prediction_barrier() only performs the MSR write if
X86_FEATURE_USE_IBPB is set, using alternative_msr_write(). In
preparation for removing X86_FEATURE_USE_IBPB, move the feature check
into the callers so that they can be addressed one-by-one, and use
X86_FEATURE_IBPB instead to guard the MSR write.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-2-yosry.ahmed@linux.dev
|
|
Change parity bit with XOR when !parity instead of masking bit out
and conditionally setting it when !parity.
Saves a couple of bytes in the object file.
Co-developed-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226153709.6370-1-ubizjak@gmail.com
|
|
In preparation of support of inline static calls on powerpc, provide
trampoline address when updating sites, so that when the destination
function is too far for a direct function call, the call site is
patched with a call to the trampoline.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/5efe0cffc38d6f69b1ec13988a99f1acff551abf.1733245362.git.christophe.leroy@csgroup.eu
|
|
The init_task instance of struct task_struct is statically allocated and
may not contain the full FP state for userspace. As such, limit the copy
to the valid area of both init_task and 'dst' and ensure all memory is
initialized.
Note that the FP state is only needed for userspace, and as such it is
entirely reasonable for init_task to not contain parts of it.
Fixes: 5aaeb5c01c5b ("x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250226133136.816901-1-benjamin@sipsolutions.net
----
v2:
- Fix code if arch_task_struct_size < sizeof(init_task) by using
memcpy_and_pad.
|
|
Add support for
CPUID Fn8000_0021_EAX[31] (SRSO_MSR_FIX). If this bit is 1, it
indicates that software may use MSR BP_CFG[BpSpecReduce] to mitigate
SRSO.
Enable BpSpecReduce to mitigate SRSO across guest/host boundaries.
Switch back to enabling the bit when virtualization is enabled and to
clear the bit when virtualization is disabled because using a MSR slot
would clear the bit when the guest is exited and any training the guest
has done, would potentially influence the host kernel when execution
enters the kernel and hasn't VMRUN the guest yet.
More detail on the public thread in Link below.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241202120416.6054-1-bp@kernel.org
|
|
Saves a CALL to an out-of-line thunk for the common case of 1
argument.
Suggested-by: Scott Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.927885784@infradead.org
|
|
While WAIT_FOR_ENDBR is specified to be a full speculation stop; it
has been shown that some implementations are 'leaky' to such an extend
that speculation can escape even the FineIBT preamble.
To deal with this, add additional hardening to the FineIBT preamble.
Notably, using a new LLVM feature:
https://github.com/llvm/llvm-project/commit/e223485c9b38a5579991b8cebb6a200153eee245
which encodes the number of arguments in the kCFI preamble's register.
Using this register<->arity mapping, have the FineIBT preamble CALL
into a stub clobbering the relevant argument registers in the
speculative case.
Scott sayeth thusly:
Microarchitectural attacks such as Branch History Injection (BHI) and
Intra-mode Branch Target Injection (IMBTI) [1] can cause an indirect
call to mispredict to an adversary-influenced target within the same
hardware domain (e.g., within the kernel). Instructions at the
mispredicted target may execute speculatively and potentially expose
kernel data (e.g., to a user-mode adversary) through a
microarchitectural covert channel such as CPU cache state.
CET-IBT [2] is a coarse-grained control-flow integrity (CFI) ISA
extension that enforces that each indirect call (or indirect jump)
must land on an ENDBR (end branch) instruction, even speculatively*.
FineIBT is a software technique that refines CET-IBT by associating
each function type with a 32-bit hash and enforcing (at the callee)
that the hash of the caller's function pointer type matches the hash
of the callee's function type. However, recent research [3] has
demonstrated that the conditional branch that enforces FineIBT's hash
check can be coerced to mispredict, potentially allowing an adversary
to speculatively bypass the hash check:
__cfi_foo:
ENDBR64
SUB R10d, 0x01234567
JZ foo # Even if the hash check fails and ZF=0, this branch could still mispredict as taken
UD2
foo:
...
The techniques demonstrated in [3] require the attacker to be able to
control the contents of at least one live register at the mispredicted
target. Therefore, this patch set introduces a sequence of CMOV
instructions at each indirect-callable target that poisons every live
register with data that the attacker cannot control whenever the
FineIBT hash check fails, thus mitigating any potential attack.
The security provided by this scheme has been discussed in detail on
an earlier thread [4].
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
[2] Intel Software Developer's Manual, Volume 1, Chapter 18
[3] https://www.vusec.net/projects/native-bhi/
[4] https://lore.kernel.org/lkml/20240927194925.707462984@infradead.org/
*There are some caveats for certain processors, see [1] for more info
Suggested-by: Scott Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.820402212@infradead.org
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Due to concerns about circumvention attacks against FineIBT on 'naked'
ENDBR, add an additional caller side hash check to FineIBT. This
should make it impossible to pivot over such a 'naked' ENDBR
instruction at the cost of an additional load.
The specific pivot reported was against the SYSCALL entry site and
FRED will have all those holes fixed up.
https://lore.kernel.org/linux-hardening/Z60NwR4w%2F28Z7XUa@ubun/
This specific fineibt_paranoid_start[] sequence was concocted by
Scott.
Suggested-by: Scott Constable <scott.d.constable@intel.com>
Reported-by: Jennifer Miller <jmill@asu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.598033084@infradead.org
|
|
Because overlapping code sequences are all the rage.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.486463917@infradead.org
|
|
Scott notes that non-taken branches are faster. Abuse overlapping code
that traps instead of explicit UD2 instructions.
And LEA does not modify flags and will have less dependencies.
Suggested-by: Scott Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.371942555@infradead.org
|
|
The normal fixup in handle_bug() is simply continuing at the next
instruction. However upcoming patches make this the wrong thing, so
allow handlers (specifically handle_cfi_failure()) to over-ride
regs->ip.
The callchain is such that the fixup needs to be done before it is
determined if the exception is fatal, as such, revert any changes in
that case.
Additionally, have handle_cfi_failure() remember the regs->ip value it
starts with for reporting.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.275223080@infradead.org
|
|
FineIBT will start using 0xEA as #UD. Normally '0xEA' is a 'bad',
invalid instruction for the CPU.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.166774696@infradead.org
|
|
The call to mce_notify_irq() has been there since the initial version of
the soft inject mce machinery, introduced in
ea149b36c7f5 ("x86, mce: add basic error injection infrastructure").
At that time it was functional since injecting an MCE resulted in the
following call chain:
raise_mce()
->machine_check_poll()
->mce_log() - sets notfiy_user_bit
->mce_notify_user() (current mce_notify_irq) consumed the bit and called the
usermode helper.
However, with the introduction of
011d82611172 ("RAS: Add a Corrected Errors Collector")
the code got moved around and the usermode helper began to be called via the
early notifier mce_first_notifier() rendering the call in raise_local()
defunct as the mce_need_notify bit (ex notify_user) is only being set from the
early notifier.
Remove the noop call and make mce_notify_irq() static.
No functional changes.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250225143348.268469-1-nik.borisov@suse.com
|
|
When in the middle of a kernel source code file a kernel developer
sees a lone #else or #endif:
...
#else
...
It's not obvious at a glance what those preprocessor blocks are
conditional on, if the starting #ifdef is outside visible range.
So apply the standard pattern we use in such cases elsewhere in
the kernel for large preprocessor blocks:
#ifdef CONFIG_XXX
...
...
...
#endif /* CONFIG_XXX */
...
#ifdef CONFIG_XXX
...
...
...
#else /* !CONFIG_XXX: */
...
...
...
#endif /* !CONFIG_XXX */
( Note that in the #else case we use the /* !CONFIG_XXX */ marker
in the final #endif, not /* CONFIG_XXX */, which serves as an easy
visual marker to differentiate #else or #elif related #endif closures
from singular #ifdef/#endif blocks. )
Also clean up __CFI_DEFAULT definition with a bit more vertical alignment
applied, and a pointless tab converted to the standard space we use in
such definitions.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
For when we want to exactly match ENDBR, and not everything that we
can scribble it with.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.059556588@infradead.org
|
|
Rebuilding with CONFIG_CFI_PERMISSIVE=y enabled is such a pain, esp. since
clang is so slow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124159.924496481@infradead.org
|
|
When both of X86_LOCAL_APIC and X86_THERMAL_VECTOR are disabled,
the irq tracing produces a W=1 build warning for the tracing
definitions:
In file included from include/trace/trace_events.h:27,
from include/trace/define_trace.h:113,
from arch/x86/include/asm/trace/irq_vectors.h:383,
from arch/x86/kernel/irq.c:29:
include/trace/stages/init.h:2:23: error: 'str__irq_vectors__trace_system_name' defined but not used [-Werror=unused-const-variable=]
Make the tracepoints conditional on the same symbosl that guard
their usage.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250225213236.3141752-1-arnd@kernel.org
|
|
I still have some Soekris net4826 in a Community Wireless Network I
volunteer with. These devices use an AMD SC1100 SoC. I am running
OpenWrt on them, which uses a patched kernel, that naturally has
evolved over time. I haven't updated the ones in the field in a
number of years (circa 2017), but have one in a test bed, where I have
intermittently tried out test builds.
A few years ago, I noticed some trouble, particularly when "warm
booting", that is, doing a reboot without removing power, and noticed
the device was hanging after the kernel message:
[ 0.081615] Working around Cyrix MediaGX virtual DMA bugs.
If I removed power and then restarted, it would boot fine, continuing
through the message above, thusly:
[ 0.081615] Working around Cyrix MediaGX virtual DMA bugs.
[ 0.090076] Enable Memory-Write-back mode on Cyrix/NSC processor.
[ 0.100000] Enable Memory access reorder on Cyrix/NSC processor.
[ 0.100070] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.110058] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[ 0.120037] CPU: NSC Geode(TM) Integrated Processor by National Semi (family: 0x5, model: 0x9, stepping: 0x1)
[...]
In order to continue using modern tools, like ssh, to interact with
the software on these old devices, I need modern builds of the OpenWrt
firmware on the devices. I confirmed that the warm boot hang was still
an issue in modern OpenWrt builds (currently using a patched linux
v6.6.65).
Last night, I decided it was time to get to the bottom of the warm
boot hang, and began bisecting. From preserved builds, I narrowed down
the bisection window from late February to late May 2019. During this
period, the OpenWrt builds were using 4.14.x. I was able to build
using period-correct Ubuntu 18.04.6. After a number of bisection
iterations, I identified a kernel bump from 4.14.112 to 4.14.113 as
the commit that introduced the warm boot hang.
https://github.com/openwrt/openwrt/commit/07aaa7e3d62ad32767d7067107db64b6ade81537
Looking at the upstream changes in the stable kernel between 4.14.112
and 4.14.113 (tig v4.14.112..v4.14.113), I spotted a likely suspect:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=20afb90f730982882e65b01fb8bdfe83914339c5
So, I tried reverting just that kernel change on top of the breaking
OpenWrt commit, and my warm boot hang went away.
Presumably, the warm boot hang is due to some register not getting
cleared in the same way that a loss of power does. That is
approximately as much as I understand about the problem.
More poking/prodding and coaching from Jonas Gorski, it looks
like this test patch fixes the problem on my board: Tested against
v6.6.67 and v4.14.113.
Fixes: 18fb053f9b82 ("x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors")
Debugged-by: Jonas Gorski <jonas.gorski@gmail.com>
Signed-off-by: Russell Senior <russell@personaltelco.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/CAHP3WfOgs3Ms4Z+L9i0-iBOE21sdMk5erAiJurPjnrL9LSsgRA@mail.gmail.com
Cc: Matthew Whitehead <tedheadster@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
|
|
There are cases when it is useful to use both ACPI and DTB provided by
the bootloader, however in such cases we should make sure to prevent
conflicts between the two. Namely, don't try to use DTB for SMP setup
if ACPI is enabled.
Precisely, this prevents at least:
- incorrectly calling register_lapic_address(APIC_DEFAULT_PHYS_BASE)
after the LAPIC was already successfully enumerated via ACPI, causing
noisy kernel warnings and probably potential real issues as well
- failed IOAPIC setup in the case when IOAPIC is enumerated via mptable
instead of ACPI (e.g. with acpi=noirq), due to
mpparse_parse_smp_config() overridden by x86_dtb_parse_smp_config()
Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250105172741.3476758-2-dmaluka@chromium.org
|
|
The local variable length already holds the string length after calling
strncpy_from_user(). Using another local variable linlen and calling
strlen() is therefore unnecessary and can be removed. Remove linlen
and strlen() and use length instead.
No change in functionality intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250225131621.329699-2-thorsten.blum@linux.dev
|
|
Depending on the type of panics, it was found that the
__register_nmi_handler() function can be called in NMI context from
nmi_shootdown_cpus() leading to a lockdep splat:
WARNING: inconsistent lock state
inconsistent {INITIAL USE} -> {IN-NMI} usage.
lock(&nmi_desc[0].lock);
<Interrupt>
lock(&nmi_desc[0].lock);
Call Trace:
_raw_spin_lock_irqsave
__register_nmi_handler
nmi_shootdown_cpus
kdump_nmi_shootdown_cpus
native_machine_crash_shutdown
__crash_kexec
In this particular case, the following panic message was printed before:
Kernel panic - not syncing: Fatal hardware error!
This message seemed to be given out from __ghes_panic() running in
NMI context.
The __register_nmi_handler() function which takes the nmi_desc lock
with irq disabled shouldn't be called from NMI context as this can
lead to deadlock.
The nmi_shootdown_cpus() function can only be invoked once. After the
first invocation, all other CPUs should be stuck in the newly added
crash_nmi_callback() and cannot respond to a second NMI.
Fix it by adding a new emergency NMI handler to the nmi_desc
structure and provide a new set_emergency_nmi_handler() helper to set
crash_nmi_callback() in any context. The new emergency handler will
preempt other handlers in the linked list. That will eliminate the need
to take any lock and serve the panic in NMI use case.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250206191844.131700-1-longman@redhat.com
|
|
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref)
to use optimized implementation on targets that define
atomic_inc_return() and to remove now unneeded initialization of the
%eax/%edx register pair before the call to atomic64_inc_return().
On x86_32 the code improves from:
1b0: b9 00 00 00 00 mov $0x0,%ecx
1b1: R_386_32 .bss
1b5: 89 43 0c mov %eax,0xc(%ebx)
1b8: 31 d2 xor %edx,%edx
1ba: b8 01 00 00 00 mov $0x1,%eax
1bf: e8 fc ff ff ff call 1c0 <ksys_ioperm+0xa8>
1c0: R_386_PC32 atomic64_add_return_cx8
1c4: 89 03 mov %eax,(%ebx)
1c6: 89 53 04 mov %edx,0x4(%ebx)
to:
1b0: be 00 00 00 00 mov $0x0,%esi
1b1: R_386_32 .bss
1b5: 89 43 0c mov %eax,0xc(%ebx)
1b8: e8 fc ff ff ff call 1b9 <ksys_ioperm+0xa1>
1b9: R_386_PC32 atomic64_inc_return_cx8
1bd: 89 03 mov %eax,(%ebx)
1bf: 89 53 04 mov %edx,0x4(%ebx)
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250223161355.3607-1-ubizjak@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix AVX-VNNI CPU feature dependency bug triggered via the 'noxsave'
boot option
- Fix typos in the SVA documentation
- Add Tony Luck as RDT co-maintainer and remove Fenghua Yu
* tag 'x86-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
docs: arch/x86/sva: Fix two grammar errors under Background and FAQ
x86/cpufeatures: Make AVX-VNNI depend on AVX
MAINTAINERS: Change maintainer for RDT
|
|
Previously the e820_table_kexec[] was exported to sysfs since kexec-tools uses
the memmap entries to prepare the e820 table for the new kernel.
The following commit, ~8 years ago, introduced e820_table_firmware[] and changed
the behavior to export the firmware table instead:
12df216c61c8 ("x86/boot/e820: Introduce the bootloader provided e820_table_firmware[] table")
Originally the kexec_file_load and kexec_load syscalls both used e820_table_kexec[].
Since the sysfs exported entries are from e820_table_firmware[] people
now need to tune both tables for kexec.
Restore the old behavior so the kexec_load and kexec_file_load syscalls work with
only one table update. The e820_table_firmware[] is used by hibernation kernel
code and it works without the sysfs exporting. Also remove the SEV
e820_table_firmware[] updating code.
Also update the code comments here and drop the comments about setup_data
reservation since it is not needed any more after this change was made
a year ago:
fc7f27cda843 ("x86/kexec: Do not update E820 kexec table for setup_data")
[ mingo: Tidy up the changelog and comments. ]
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/Z5jcb1GKhLvH8kDc@darkstar.users.ipa.redhat.com
|
|
The return values of some functions are of boolean type. Change the
type of these function to bool and adjust their return values.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250129154920.6773-1-ubizjak@gmail.com
|
|
Load patches for which the driver carries a SHA256 checksum of the patch
blob.
This can be disabled by adding "microcode.amd_sha_check=off" on the
kernel cmdline. But it is highly NOT recommended.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
Introduce hv_curr_partition_type to store the partition type
as an enum.
Right now this is limited to guest or root partition, but there will
be other kinds in future and the enum is easily extensible.
Set up hv_curr_partition_type early in Hyper-V initialization with
hv_identify_partition_type(). hv_root_partition() just queries this
value, and shouldn't be called before that.
Making this check into a function sets the stage for adding a config
option to gate the compilation of root partition code. In particular,
hv_root_partition() can be stubbed out always be false if root
partition support isn't desired.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1740167795-13296-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1740167795-13296-3-git-send-email-nunodasneves@linux.microsoft.com>
|
|
process_64.c is not built on native 32-bit, so CONFIG_X86_32 will never
be set.
No change in functionality intended.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250202202323.422113-3-brgerst@gmail.com
|
|
Use in_ia32_syscall() instead of a compat syscall entry.
No change in functionality intended.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250202202323.422113-2-brgerst@gmail.com
|
|
Remove hard-coded strings by using the str_disabled_enabled() helper.
No change in functionality intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250209210333.5666-2-thorsten.blum@linux.dev
|
|
Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.
Get rid of the indirection by simply calling tlb_remove_table directly,
and not going through the paravirt function pointers.
Suggested-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250213161423.449435-3-riel@surriel.com
|
|
Currently x86 uses CONFIG_MMU_GATHER_TABLE_FREE when using
paravirt, and not when running on bare metal.
There is no real good reason to do things differently for
each setup. Make them all the same.
Currently get_user_pages_fast synchronizes against page table
freeing in two different ways:
- on bare metal, by blocking IRQs, which block TLB flush IPIs
- on paravirt, with MMU_GATHER_RCU_TABLE_FREE
This is done because some paravirt TLB flush implementations
handle the TLB flush in the hypervisor, and will do the flush
even when the target CPU has interrupts disabled.
Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
Using RCU synchronization between page table freeing and get_user_pages_fast()
allows bare metal to also do TLB flushing while interrupts are disabled.
Various places in the mm do still block IRQs or disable preemption
as an implicit way to block RCU frees.
That makes it safe to use INVLPGB on AMD CPUs.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250213161423.449435-2-riel@surriel.com
|