summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2025-05-11x86/sev: Fix operator precedence in GHCB_MSR_VMPL_REQ_LEVEL macroSeongman Lee
The GHCB_MSR_VMPL_REQ_LEVEL macro lacked parentheses around the bitmask expression, causing the shift operation to bind too early. As a result, when requesting VMPL1 (e.g., GHCB_MSR_VMPL_REQ_LEVEL(1)), incorrect values such as 0x000000016 were generated instead of the intended 0x100000016 (the requested VMPL level is specified in GHCBData[39:32]). Fix the precedence issue by grouping the masked value before applying the shift. [ bp: Massage commit message. ] Fixes: 34ff65901735 ("x86/sev: Use kernel provided SVSM Calling Areas") Signed-off-by: Seongman Lee <augustus92@kaist.ac.kr> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250511092329.12680-1-cloudlee1719@gmail.com
2025-05-10Merge tag 'mm-hotfixes-stable-2025-05-10-14-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "22 hotfixes. 13 are cc:stable and the remainder address post-6.14 issues or aren't considered necessary for -stable kernels. About half are for MM. Five OCFS2 fixes and a few MAINTAINERS updates" * tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) mm: fix folio_pte_batch() on XEN PV nilfs2: fix deadlock warnings caused by lock dependency in init_nilfs() mm/hugetlb: copy the CMA flag when demoting mm, swap: fix false warning for large allocation with !THP_SWAP selftests/mm: fix a build failure on powerpc selftests/mm: fix build break when compiling pkey_util.c mm: vmalloc: support more granular vrealloc() sizing tools/testing/selftests: fix guard region test tmpfs assumption ocfs2: stop quota recovery before disabling quotas ocfs2: implement handshaking with ocfs2 recovery thread ocfs2: switch osb->disable_recovery to enum mailmap: map Uwe's BayLibre addresses to a single one MAINTAINERS: add mm THP section mm/userfaultfd: fix uninitialized output field for -EAGAIN race selftests/mm: compaction_test: support platform with huge mount of memory MAINTAINERS: add core mm section ocfs2: fix panic in failed foilio allocation mm/huge_memory: fix dereferencing invalid pmd migration entry MAINTAINERS: add reverse mapping section x86: disable image size check for test builds ...
2025-05-10Merge tag 'kvm-x86-fixes-6.15-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 fixes for 6.15-rcN - Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid causing problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to sanitize the VMCB as its state is undefined after SHUTDOWN, emulating INIT is the least awful choice). - Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM KVM doesn't goof a sanity check in the future. - Free obsolete roots when (re)loading the MMU to fix a bug where pre-faulting memory can get stuck due to always encountering a stale root. - When dumping GHCB state, use KVM's snapshot instead of the raw GHCB page to print state, so that KVM doesn't print stale/wrong information. - When changing memory attributes (e.g. shared <=> private), add potential hugepage ranges to the mmu_invalidate_range_{start,end} set so that KVM doesn't create a shared/private hugepage when the the corresponding attributes will become mixed (the attributes are commited *after* KVM finishes the invalidation). - Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM has at least one active VM. Effectively BP_SPEC_REDUCE when KVM is loaded led to very measurable performance regressions for non-KVM workloads.
2025-05-09lib/crc: make arch-optimized code use subsys_initcallEric Biggers
Make the architecture-optimized CRC code do its CPU feature checks in subsys_initcalls instead of arch_initcalls. This makes it consistent with arch/*/lib/crypto/ and ensures that it runs after initcalls that possibly could be a prerequisite for kernel-mode FPU, such as x86's xfd_update_static_branch() and loongarch's init_euen_mask(). Note: as far as I can tell, x86's xfd_update_static_branch() isn't *actually* needed for kernel-mode FPU. loongarch's init_euen_mask() is needed to enable save/restore of the vector registers, but loongarch doesn't yet have any CRC or crypto code that uses vector registers anyway. Regardless, let's be consistent with arch/*/lib/crypto/ and robust against any potential future dependency on an arch_initcall. Link: https://lore.kernel.org/r/20250510035959.87995-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-05-09Merge tag 'rust-fixes-6.15-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux Pull rust fixes from Miguel Ojeda: - Make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88.0 - Clean Rust (and Clippy) lints for the upcoming Rust 1.87.0 and 1.88.0 releases - Clean objtool warning for the upcoming Rust 1.87.0 release by adding one more noreturn function * tag 'rust-fixes-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: x86/Kconfig: make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88 rust: clean Rust 1.88.0's `clippy::uninlined_format_args` lint rust: clean Rust 1.88.0's warning about `clippy::disallowed_macros` configuration rust: clean Rust 1.88.0's `unnecessary_transmutes` lint rust: allow Rust 1.87.0's `clippy::ptr_eq` lint objtool/rust: add one more `noreturn` Rust function for Rust 1.87.0
2025-05-09x86/its: FineIBT-paranoid vs ITSPeter Zijlstra
FineIBT-paranoid was using the retpoline bytes for the paranoid check, disabling retpolines, because all parts that have IBT also have eIBRS and thus don't need no stinking retpolines. Except... ITS needs the retpolines for indirect calls must not be in the first half of a cacheline :-/ So what was the paranoid call sequence: <fineibt_paranoid_start>: 0: 41 ba 78 56 34 12 mov $0x12345678, %r10d 6: 45 3b 53 f7 cmp -0x9(%r11), %r10d a: 4d 8d 5b <f0> lea -0x10(%r11), %r11 e: 75 fd jne d <fineibt_paranoid_start+0xd> 10: 41 ff d3 call *%r11 13: 90 nop Now becomes: <fineibt_paranoid_start>: 0: 41 ba 78 56 34 12 mov $0x12345678, %r10d 6: 45 3b 53 f7 cmp -0x9(%r11), %r10d a: 4d 8d 5b f0 lea -0x10(%r11), %r11 e: 2e e8 XX XX XX XX cs call __x86_indirect_paranoid_thunk_r11 Where the paranoid_thunk looks like: 1d: <ea> (bad) __x86_indirect_paranoid_thunk_r11: 1e: 75 fd jne 1d __x86_indirect_its_thunk_r11: 20: 41 ff eb jmp *%r11 23: cc int3 [ dhansen: remove initialization to false ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Use dynamic thunks for indirect branchesPeter Zijlstra
ITS mitigation moves the unsafe indirect branches to a safe thunk. This could degrade the prediction accuracy as the source address of indirect branches becomes same for different execution paths. To improve the predictions, and hence the performance, assign a separate thunk for each indirect callsite. This is also a defense-in-depth measure to avoid indirect branches aliasing with each other. As an example, 5000 dynamic thunks would utilize around 16 bits of the address space, thereby gaining entropy. For a BTB that uses 32 bits for indexing, dynamic thunks could provide better prediction accuracy over fixed thunks. Have ITS thunks be variable sized and use EXECMEM_MODULE_TEXT such that they are both more flexible (got to extend them later) and live in 2M TLBs, just like kernel code, avoiding undue TLB pressure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/ibt: Keep IBT disabled during alternative patchingPawan Gupta
cfi_rewrite_callers() updates the fineIBT hash matching at the caller side, but except for paranoid-mode it relies on apply_retpoline() and friends for any ENDBR relocation. This could temporarily cause an indirect branch to land on a poisoned ENDBR. For instance, with para-virtualization enabled, a simple wrmsrl() could have an indirect branch pointing to native_write_msr() who's ENDBR has been relocated due to fineIBT: <wrmsrl>: push %rbp mov %rsp,%rbp mov %esi,%eax mov %rsi,%rdx shr $0x20,%rdx mov %edi,%edi mov %rax,%rsi call *0x21e65d0(%rip) # <pv_ops+0xb8> ^^^^^^^^^^^^^^^^^^^^^^^ Such an indirect call during the alternative patching could #CP if the caller is not *yet* adjusted for the new target ENDBR. To prevent a false #CP, keep CET-IBT disabled until all callers are patched. Patching during the module load does not need to be guarded by IBT-disable because the module code is not executed until the patching is complete. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09mm/execmem: Unify early execmem_cache behaviourPeter Zijlstra
Early kernel memory is RWX, only at the end of early boot (before SMP) do we mark things ROX. Have execmem_cache mirror this behaviour for early users. This avoids having to remember what code is execmem and what is not -- we can poke everything with impunity ;-) Also performance for not having to do endless text_poke_mm switches. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Align RETs in BHB clear sequence to avoid thunkingPawan Gupta
The software mitigation for BHI is to execute BHB clear sequence at syscall entry, and possibly after a cBPF program. ITS mitigation thunks RETs in the lower half of the cacheline. This causes the RETs in the BHB clear sequence to be thunked as well, adding unnecessary branches to the BHB clear sequence. Since the sequence is in hot path, align the RET instructions in the sequence to avoid thunking. This is how disassembly clear_bhb_loop() looks like after this change: 0x44 <+4>: mov $0x5,%ecx 0x49 <+9>: call 0xffffffff81001d9b <clear_bhb_loop+91> 0x4e <+14>: jmp 0xffffffff81001de5 <clear_bhb_loop+165> 0x53 <+19>: int3 ... 0x9b <+91>: call 0xffffffff81001dce <clear_bhb_loop+142> 0xa0 <+96>: ret 0xa1 <+97>: int3 ... 0xce <+142>: mov $0x5,%eax 0xd3 <+147>: jmp 0xffffffff81001dd6 <clear_bhb_loop+150> 0xd5 <+149>: nop 0xd6 <+150>: sub $0x1,%eax 0xd9 <+153>: jne 0xffffffff81001dd3 <clear_bhb_loop+147> 0xdb <+155>: sub $0x1,%ecx 0xde <+158>: jne 0xffffffff81001d9b <clear_bhb_loop+91> 0xe0 <+160>: ret 0xe1 <+161>: int3 0xe2 <+162>: int3 0xe3 <+163>: int3 0xe4 <+164>: int3 0xe5 <+165>: lfence 0xe8 <+168>: pop %rbp 0xe9 <+169>: ret Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Add support for RSB stuffing mitigationPawan Gupta
When retpoline mitigation is enabled for spectre-v2, enabling call-depth-tracking and RSB stuffing also mitigates ITS. Add cmdline option indirect_target_selection=stuff to allow enabling RSB stuffing mitigation. When retpoline mitigation is not enabled, =stuff option is ignored, and default mitigation for ITS is deployed. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Add "vmexit" option to skip mitigation on some CPUsPawan Gupta
Ice Lake generation CPUs are not affected by guest/host isolation part of ITS. If a user is only concerned about KVM guests, they can now choose a new cmdline option "vmexit" that will not deploy the ITS mitigation when CPU is not affected by guest/host isolation. This saves the performance overhead of ITS mitigation on Ice Lake gen CPUs. When "vmexit" option selected, if the CPU is affected by ITS guest/host isolation, the default ITS mitigation is deployed. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Enable Indirect Target Selection mitigationPawan Gupta
Indirect Target Selection (ITS) is a bug in some pre-ADL Intel CPUs with eIBRS. It affects prediction of indirect branch and RETs in the lower half of cacheline. Due to ITS such branches may get wrongly predicted to a target of (direct or indirect) branch that is located in the upper half of the cacheline. Scope of impact =============== Guest/host isolation -------------------- When eIBRS is used for guest/host isolation, the indirect branches in the VMM may still be predicted with targets corresponding to branches in the guest. Intra-mode ---------- cBPF or other native gadgets can be used for intra-mode training and disclosure using ITS. User/kernel isolation --------------------- When eIBRS is enabled user/kernel isolation is not impacted. Indirect Branch Prediction Barrier (IBPB) ----------------------------------------- After an IBPB, indirect branches may be predicted with targets corresponding to direct branches which were executed prior to IBPB. This is mitigated by a microcode update. Add cmdline parameter indirect_target_selection=off|on|force to control the mitigation to relocate the affected branches to an ITS-safe thunk i.e. located in the upper half of cacheline. Also add the sysfs reporting. When retpoline mitigation is deployed, ITS safe-thunks are not needed, because retpoline sequence is already ITS-safe. Similarly, when call depth tracking (CDT) mitigation is deployed (retbleed=stuff), ITS safe return thunk is not used, as CDT prevents RSB-underflow. To not overcomplicate things, ITS mitigation is not supported with spectre-v2 lfence;jmp mitigation. Moreover, it is less practical to deploy lfence;jmp mitigation on ITS affected parts anyways. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Add support for ITS-safe return thunkPawan Gupta
RETs in the lower half of cacheline may be affected by ITS bug, specifically when the RSB-underflows. Use ITS-safe return thunk for such RETs. RETs that are not patched: - RET in retpoline sequence does not need to be patched, because the sequence itself fills an RSB before RET. - RET in Call Depth Tracking (CDT) thunks __x86_indirect_{call|jump}_thunk and call_depth_return_thunk are not patched because CDT by design prevents RSB-underflow. - RETs in .init section are not reachable after init. - RETs that are explicitly marked safe with ANNOTATE_UNRET_SAFE. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Add support for ITS-safe indirect thunkPawan Gupta
Due to ITS, indirect branches in the lower half of a cacheline may be vulnerable to branch target injection attack. Introduce ITS-safe thunks to patch indirect branches in the lower half of cacheline with the thunk. Also thunk any eBPF generated indirect branches in emit_indirect_jump(). Below category of indirect branches are not mitigated: - Indirect branches in the .init section are not mitigated because they are discarded after boot. - Indirect branches that are explicitly marked retpoline-safe. Note that retpoline also mitigates the indirect branches against ITS. This is because the retpoline sequence fills an RSB entry before RET, and it does not suffer from RSB-underflow part of the ITS. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Enumerate Indirect Target Selection (ITS) bugPawan Gupta
ITS bug in some pre-Alderlake Intel CPUs may allow indirect branches in the first half of a cache line get predicted to a target of a branch located in the second half of the cache line. Set X86_BUG_ITS on affected CPUs. Mitigation to follow in later commits. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/mm: Eliminate window where TLB flushes may be inadvertently skippedDave Hansen
tl;dr: There is a window in the mm switching code where the new CR3 is set and the CPU should be getting TLB flushes for the new mm. But should_flush_tlb() has a bug and suppresses the flush. Fix it by widening the window where should_flush_tlb() sends an IPI. Long Version: === History === There were a few things leading up to this. First, updating mm_cpumask() was observed to be too expensive, so it was made lazier. But being lazy caused too many unnecessary IPIs to CPUs due to the now-lazy mm_cpumask(). So code was added to cull mm_cpumask() periodically[2]. But that culling was a bit too aggressive and skipped sending TLB flushes to CPUs that need them. So here we are again. === Problem === The too-aggressive code in should_flush_tlb() strikes in this window: // Turn on IPIs for this CPU/mm combination, but only // if should_flush_tlb() agrees: cpumask_set_cpu(cpu, mm_cpumask(next)); next_tlb_gen = atomic64_read(&next->context.tlb_gen); choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush); load_new_mm_cr3(need_flush); // ^ After 'need_flush' is set to false, IPIs *MUST* // be sent to this CPU and not be ignored. this_cpu_write(cpu_tlbstate.loaded_mm, next); // ^ Not until this point does should_flush_tlb() // become true! should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3() and writing to 'loaded_mm', which is a window where they should not be suppressed. Whoops. === Solution === Thankfully, the fuzzy "just about to write CR3" window is already marked with loaded_mm==LOADED_MM_SWITCHING. Simply checking for that state in should_flush_tlb() is sufficient to ensure that the CPU is targeted with an IPI. This will cause more TLB flush IPIs. But the window is relatively small and I do not expect this to cause any kind of measurable performance impact. Update the comment where LOADED_MM_SWITCHING is written since it grew yet another user. Peter Z also raised a concern that should_flush_tlb() might not observe 'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off() writes them. Add a barrier to ensure that they are observed in the order they are written. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/ [1] Fixes: 6db2526c1d69 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2] Reported-by: Stephen Dolan <sdolan@janestreet.com> Cc: stable@vger.kernel.org Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-05-08x86/tdx: tdx_mcall_get_report0: Return -EBUSY on TDCALL_OPERAND_BUSY errorCedric Xing
Return `-EBUSY` from tdx_mcall_get_report0() when `TDG.MR.REPORT` returns `TDCALL_OPERAND_BUSY`. This enables the caller to retry obtaining a TDREPORT later if another VCPU is extending an RTMR concurrently. Signed-off-by: Cedric Xing <cedric.xing@intel.com> Acked-by: Dionna Amalie Glaze <dionnaglaze@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20250506-tdx-rtmr-v6-4-ac6ff5e9d58a@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2025-05-08x86/tdx: Add tdx_mcall_extend_rtmr() interfaceCedric Xing
The TDX guest exposes one MRTD (Build-time Measurement Register) and four RTMR (Run-time Measurement Register) registers to record the build and boot measurements of a virtual machine (VM). These registers are similar to PCR (Platform Configuration Register) registers in the TPM (Trusted Platform Module) space. This measurement data is used to implement security features like attestation and trusted boot. To facilitate updating the RTMR registers, the TDX module provides support for the `TDG.MR.RTMR.EXTEND` TDCALL which can be used to securely extend the RTMR registers. Add helper function to update RTMR registers. It will be used by the TDX guest driver in enabling RTMR extension support. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Cedric Xing <cedric.xing@intel.com> Acked-by: Dionna Amalie Glaze <dionnaglaze@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20250506-tdx-rtmr-v6-3-ac6ff5e9d58a@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2025-05-08treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()Ingo Molnar
Move this API to the canonical timer_*() namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250507175338.672442-9-mingo@kernel.org
2025-05-08KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitionsSean Christopherson
Set the magic BP_SPEC_REDUCE bit to mitigate SRSO when running VMs if and only if KVM has at least one active VM. Leaving the bit set at all times unfortunately degrades performance by a wee bit more than expected. Use a dedicated spinlock and counter instead of hooking virtualization enablement, as changing the behavior of kvm.enable_virt_at_load based on SRSO_BP_SPEC_REDUCE is painful, and has its own drawbacks, e.g. could result in performance issues for flows that are sensitive to VM creation latency. Defer setting BP_SPEC_REDUCE until VMRUN is imminent to avoid impacting performance on CPUs that aren't running VMs, e.g. if a setup is using housekeeping CPUs. Setting BP_SPEC_REDUCE in task context, i.e. without blasting IPIs to all CPUs, also helps avoid serializing 1<=>N transitions without incurring a gross amount of complexity (see the Link for details on how ugly coordinating via IPIs gets). Link: https://lore.kernel.org/all/aBOnzNCngyS_pQIW@google.com Fixes: 8442df2b49ed ("x86/bugs: KVM: Add support for SRSO_MSR_FIX") Reported-by: Michael Larabel <Michael@michaellarabel.com> Closes: https://www.phoronix.com/review/linux-615-amd-regression Cc: Borislav Petkov <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250505180300.973137-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-07x86: disable image size check for test buildsGuenter Roeck
64-bit allyesconfig builds fail with x86_64-linux-ld: kernel image bigger than KERNEL_IMAGE_SIZE Bisect points to commit 6f110a5e4f99 ("Disable SLUB_TINY for build testing") as the responsible commit. Reverting that patch does indeed fix the problem. Further analysis shows that disabling SLUB_TINY enables KASAN, and that KASAN is responsible for the image size increase. Solve the build problem by disabling the image size check for test builds. [akpm@linux-foundation.org: add comment, fix nearby typo (sink->sync)] [akpm@linux-foundation.org: fix comment snafu Link: https://lore.kernel.org/oe-kbuild-all/202504191813.4r9H6Glt-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250417010950.2203847-1-linux@roeck-us.net Fixes: 6f110a5e4f99 ("Disable SLUB_TINY for build testing") Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <x86@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07ubsan: Remove regs from report_ubsan_failure()Mostafa Saleh
report_ubsan_failure() doesn't use argument regs, and soon it will be called from the hypervisor context were regs are not available. So, remove the unused argument. Signed-off-by: Mostafa Saleh <smostafa@google.com> Acked-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250430162713.1997569-3-smostafa@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-07x86/Kconfig: make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88Paweł Anikiel
Calling core::fmt::write() from rust code while FineIBT is enabled results in a kernel panic: [ 4614.199779] kernel BUG at arch/x86/kernel/cet.c:132! [ 4614.205343] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 4614.211781] CPU: 2 UID: 0 PID: 6057 Comm: dmabuf_dump Tainted: G U O 6.12.17-android16-0-g6ab38c534a43 #1 9da040f27673ec3945e23b998a0f8bd64c846599 [ 4614.227832] Tainted: [U]=USER, [O]=OOT_MODULE [ 4614.241247] RIP: 0010:do_kernel_cp_fault+0xea/0xf0 ... [ 4614.398144] RIP: 0010:_RNvXs5_NtNtNtCs3o2tGsuHyou_4core3fmt3num3impyNtB9_7Display3fmt+0x0/0x20 [ 4614.407792] Code: 48 f7 df 48 0f 48 f9 48 89 f2 89 c6 5d e9 18 fd ff ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 41 81 ea 14 61 af 2c 74 03 0f 0b 90 <66> 0f 1f 00 55 48 89 e5 48 89 f2 48 8b 3f be 01 00 00 00 5d e9 e7 [ 4614.428775] RSP: 0018:ffffb95acfa4ba68 EFLAGS: 00010246 [ 4614.434609] RAX: 0000000000000000 RBX: 0000000000000010 RCX: 0000000000000000 [ 4614.442587] RDX: 0000000000000007 RSI: ffffb95acfa4ba70 RDI: ffffb95acfa4bc88 [ 4614.450557] RBP: ffffb95acfa4bae0 R08: ffff0a00ffffff05 R09: 0000000000000070 [ 4614.458527] R10: 0000000000000000 R11: ffffffffab67eaf0 R12: ffffb95acfa4bcc8 [ 4614.466493] R13: ffffffffac5d50f0 R14: 0000000000000000 R15: 0000000000000000 [ 4614.474473] ? __cfi__RNvXs5_NtNtNtCs3o2tGsuHyou_4core3fmt3num3impyNtB9_7Display3fmt+0x10/0x10 [ 4614.484118] ? _RNvNtCs3o2tGsuHyou_4core3fmt5write+0x1d2/0x250 This happens because core::fmt::write() calls core::fmt::rt::Argument::fmt(), which currently has CFI disabled: library/core/src/fmt/rt.rs: 171 // FIXME: Transmuting formatter in new and indirectly branching to/calling 172 // it here is an explicit CFI violation. 173 #[allow(inline_no_sanitize)] 174 #[no_sanitize(cfi, kcfi)] 175 #[inline] 176 pub(super) unsafe fn fmt(&self, f: &mut Formatter<'_>) -> Result { This causes a Control Protection exception, because FineIBT has sealed off the original function's endbr64. This makes rust currently incompatible with FineIBT. Add a Kconfig dependency that prevents FineIBT from getting turned on by default if rust is enabled. [ Rust 1.88.0 (scheduled for 2025-06-26) should have this fixed [1], and thus we relaxed the condition with Rust >= 1.88. When `objtool` lands checking for this with e.g. [2], the plan is to ideally run that in upstream Rust's CI to prevent regressions early [3], since we do not control `core`'s source code. Alice tested the Rust PR backported to an older compiler. Peter would like that Rust provides a stable `core` which can be pulled into the kernel: "Relying on that much out of tree code is 'unfortunate'". - Miguel ] Signed-off-by: Paweł Anikiel <panikiel@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://github.com/rust-lang/rust/pull/139632 [1] Link: https://lore.kernel.org/rust-for-linux/20250410154556.GB9003@noisy.programming.kicks-ass.net/ [2] Link: https://github.com/rust-lang/rust/pull/139632#issuecomment-2801950873 [3] Link: https://lore.kernel.org/r/20250410115420.366349-1-panikiel@google.com Link: https://lore.kernel.org/r/att0-CANiq72kjDM0cKALVy4POEzhfdT4nO7tqz0Pm7xM+3=_0+L1t=A@mail.gmail.com [ Reduced splat. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-05-06Merge tag 'v6.15-rc5' into x86/msr, to pick up fixes and to resolve conflictsIngo Molnar
Conflicts: drivers/cpufreq/intel_pstate.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-06x86/bhi: Do not set BHI_DIS_S in 32-bit modePawan Gupta
With the possibility of intra-mode BHI via cBPF, complete mitigation for BHI is to use IBHF (history fence) instruction with BHI_DIS_S set. Since this new instruction is only available in 64-bit mode, setting BHI_DIS_S in 32-bit mode is only a partial mitigation. Do not set BHI_DIS_S in 32-bit mode so as to avoid reporting misleading mitigated status. With this change IBHF won't be used in 32-bit mode, also remove the CONFIG_X86_64 check from emit_spectre_bhb_barrier(). Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-06x86/bpf: Add IBHF call at end of classic BPFDaniel Sneddon
Classic BPF programs can be run by unprivileged users, allowing unprivileged code to execute inside the kernel. Attackers can use this to craft branch history in kernel mode that can influence the target of indirect branches. BHI_DIS_S provides user-kernel isolation of branch history, but cBPF can be used to bypass this protection by crafting branch history in kernel mode. To stop intra-mode attacks via cBPF programs, Intel created a new instruction Indirect Branch History Fence (IBHF). IBHF prevents the predicted targets of subsequent indirect branches from being influenced by branch history prior to the IBHF. IBHF is only effective while BHI_DIS_S is enabled. Add the IBHF instruction to cBPF jitted code's exit path. Add the new fence when the hardware mitigation is enabled (i.e., X86_FEATURE_CLEAR_BHB_HW is set) or after the software sequence (X86_FEATURE_CLEAR_BHB_LOOP) is being used in a virtual machine. Note that X86_FEATURE_CLEAR_BHB_HW and X86_FEATURE_CLEAR_BHB_LOOP are mutually exclusive, so the JIT compiler will only emit the new fence, not the SW sequence, when X86_FEATURE_CLEAR_BHB_HW is set. Hardware that enumerates BHI_NO basically has BHI_DIS_S protections always enabled, regardless of the value of BHI_DIS_S. Since BHI_DIS_S doesn't protect against intra-mode attacks, enumerate BHI bug on BHI_NO hardware as well. Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-06x86/bpf: Call branch history clearing sequence on exitDaniel Sneddon
Classic BPF programs have been identified as potential vectors for intra-mode Branch Target Injection (BTI) attacks. Classic BPF programs can be run by unprivileged users. They allow unprivileged code to execute inside the kernel. Attackers can use unprivileged cBPF to craft branch history in kernel mode that can influence the target of indirect branches. Introduce a branch history buffer (BHB) clearing sequence during the JIT compilation of classic BPF programs. The clearing sequence is the same as is used in previous mitigations to protect syscalls. Since eBPF programs already have their own mitigations in place, only insert the call on classic programs that aren't run by privileged users. Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-06x86/insn: Stop decoding i64 instructions in x86-64 mode at opcodeMasami Hiramatsu (Google)
In commit 2e044911be75 ("x86/traps: Decode 0xEA instructions as #UD") FineIBT starts using 0xEA as an invalid instruction like UD2. But insn decoder always returns the length of "0xea" instruction as 7 because it does not check the (i64) superscript. The x86 instruction decoder should also decode 0xEA on x86-64 as a one-byte invalid instruction by decoding the "(i64)" superscript tag. This stops decoding instruction which has (i64) but does not have (o64) superscript in 64-bit mode at opcode and skips other fields. With this change, insn_decoder_test says 0xea is 1 byte length if x86-64 (-y option means 64-bit): $ printf "0:\tea\t\n" | insn_decoder_test -y -v insn_decoder_test: success: Decoded and checked 1 instructions Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/174580490000.388420.5225447607417115496.stgit@devnote2
2025-05-06x86/insn: Fix opcode map (!REX2) superscript tagsMasami Hiramatsu (Google)
Commit: 159039af8c07 ("x86/insn: x86/insn: Add support for REX2 prefix to the instruction decoder opcode map") added (!REX2) superscript with a space, but the correct format requires ',' for concatination with other superscript tags. Add ',' to generate correct insn attribute tables. I confirmed with following command: arch/x86/lib/x86-opcode-map.txt | grep e8 | head -n 1 [0xe8] = INAT_MAKE_IMM(INAT_IMM_VWORD32) | INAT_FORCE64 | INAT_NO_REX2, Fixes: 159039af8c07 ("x86/insn: x86/insn: Add support for REX2 prefix to the instruction decoder opcode map") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/174580489027.388420.15539375184727726142.stgit@devnote2
2025-05-06Merge tag 'v6.15-rc4' into x86/asm, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-06x86/fpu: Drop @perm from guest pseudo FPU containerChao Gao
Remove @perm from the guest pseudo FPU container. The field is initialized during allocation and never used later. Rename fpu_init_guest_permissions() to show that its sole purpose is to lock down guest permissions. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biggers <ebiggers@google.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mitchell Levy <levymitchell0@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/kvm/af972fe5981b9e7101b64de43c7be0a8cc165323.camel@redhat.com/ Link: https://lore.kernel.org/r/20250506093740.2864458-3-chao.gao@intel.com
2025-05-06x86/fpu/xstate: Always preserve non-user xfeatures/flags in __state_permSean Christopherson
When granting userspace or a KVM guest access to an xfeature, preserve the entity's existing supervisor and software-defined permissions as tracked by __state_perm, i.e. use __state_perm to track *all* permissions even though all supported supervisor xfeatures are granted to all FPUs and FPU_GUEST_PERM_LOCKED disallows changing permissions. Effectively clobbering supervisor permissions results in inconsistent behavior, as xstate_get_group_perm() will report supervisor features for process that do NOT request access to dynamic user xfeatures, whereas any and all supervisor features will be absent from the set of permissions for any process that is granted access to one or more dynamic xfeatures (which right now means AMX). The inconsistency isn't problematic because fpu_xstate_prctl() already strips out everything except user xfeatures: case ARCH_GET_XCOMP_PERM: /* * Lockless snapshot as it can also change right after the * dropping the lock. */ permitted = xstate_get_host_group_perm(); permitted &= XFEATURE_MASK_USER_SUPPORTED; return put_user(permitted, uptr); case ARCH_GET_XCOMP_GUEST_PERM: permitted = xstate_get_guest_group_perm(); permitted &= XFEATURE_MASK_USER_SUPPORTED; return put_user(permitted, uptr); and similarly KVM doesn't apply the __state_perm to supervisor states (kvm_get_filtered_xcr0() incorporates xstate_get_guest_group_perm()): case 0xd: { u64 permitted_xcr0 = kvm_get_filtered_xcr0(); u64 permitted_xss = kvm_caps.supported_xss; But if KVM in particular were to ever change, dropping supervisor permissions would result in subtle bugs in KVM's reporting of supported CPUID settings. And the above behavior also means that having supervisor xfeatures in __state_perm is correctly handled by all users. Dropping supervisor permissions also creates another landmine for KVM. If more dynamic user xfeatures are ever added, requesting access to multiple xfeatures in separate ARCH_REQ_XCOMP_GUEST_PERM calls will result in the second invocation of __xstate_request_perm() computing the wrong ksize, as as the mask passed to xstate_calculate_size() would not contain *any* supervisor features. Commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions") fudged around the size issue for userspace FPUs, but for reasons unknown skipped guest FPUs. Lack of a fix for KVM "works" only because KVM doesn't yet support virtualizing features that have supervisor xfeatures, i.e. as of today, KVM guest FPUs will never need the relevant xfeatures. Simply extending the hack-a-fix for guests would temporarily solve the ksize issue, but wouldn't address the inconsistency issue and would leave another lurking pitfall for KVM. KVM support for virtualizing CET will likely add CET_KERNEL as a guest-only xfeature, i.e. CET_KERNEL will not be set in xfeatures_mask_supervisor() and would again be dropped when granting access to dynamic xfeatures. Note, the existing clobbering behavior is rather subtle. The @permitted parameter to __xstate_request_perm() comes from: permitted = xstate_get_group_perm(guest); which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm, where __state_perm is initialized to: fpu->perm.__state_perm = fpu_kernel_cfg.default_features; and copied to the guest side of things: /* Same defaults for guests */ fpu->guest_perm = fpu->perm; fpu_kernel_cfg.default_features contains everything except the dynamic xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA: fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features; fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC; When __xstate_request_perm() restricts the local "mask" variable to compute the user state size: mask &= XFEATURE_MASK_USER_SUPPORTED; usize = xstate_calculate_size(mask, false); it subtly overwrites the target __state_perm with "mask" containing only user xfeatures: perm = guest ? &fpu->guest_perm : &fpu->perm; /* Pairs with the READ_ONCE() in xstate_get_group_perm() */ WRITE_ONCE(perm->__state_perm, mask); Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: John Allen <john.allen@amd.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mitchell Levy <levymitchell0@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Vignesh Balasubramanian <vigbalas@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xin Li <xin3.li@intel.com> Cc: kvm@vger.kernel.org Link: https://lore.kernel.org/all/ZTqgzZl-reO1m01I@google.com Link: https://lore.kernel.org/r/20250506093740.2864458-2-chao.gao@intel.com
2025-05-06x86/mm: Fix false positive warning in switch_mm_irqs_off()Peter Zijlstra
Multiple testers reported the following new warning: WARNING: CPU: 0 PID: 0 at arch/x86/mm/tlb.c:795 Which corresponds to: if (IS_ENABLED(CONFIG_DEBUG_VM) && WARN_ON_ONCE(prev != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))) cpumask_set_cpu(cpu, mm_cpumask(next)); So the problem is that unuse_temporary_mm() explicitly clears that bit; and it has to, because otherwise the flush_tlb_mm_range() in __text_poke() will try sending IPIs, which are not at all needed. See also: https://lore.kernel.org/all/20241113095550.GBZzR3pg-RhJKPDazS@fat_crate.local/ Notably, the whole {,un}use_temporary_mm() thing requires preemption to be disabled across it with the express purpose of keeping all TLB nonsense CPU local, such that invalidations can also stay local etc. However, as a side-effect, we violate this above WARN(), which sorta makes sense for the normal case, but very much doesn't make sense here. Change unuse_temporary_mm() to mark the mm_struct such that a further exception (beyond init_mm) can be grafted, to keep the warning for all the other cases. Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Reported-by: Jani Nikula <jani.nikula@linux.intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/r/20250430081154.GH4439@noisy.programming.kicks-ass.net
2025-05-06x86/cpu: Sanitize CPUID(0x80000000) outputAhmed S. Darwish
CPUID(0x80000000).EAX returns the max extended CPUID leaf available. On x86-32 machines without an extended CPUID range, a CPUID(0x80000000) query will just repeat the output of the last valid standard CPUID leaf on the CPU; i.e., a garbage values. Current tip:x86/cpu code protects against this by doing: eax = cpuid_eax(0x80000000); c->extended_cpuid_level = eax; if ((eax & 0xffff0000) == 0x80000000) { // CPU has an extended CPUID range. Check for 0x80000001 if (eax >= 0x80000001) { cpuid(0x80000001, ...); } } This is correct so far. Afterwards though, the same possibly broken EAX value is used to check the availability of other extended CPUID leaves: if (c->extended_cpuid_level >= 0x80000007) ... if (c->extended_cpuid_level >= 0x80000008) ... if (c->extended_cpuid_level >= 0x8000000a) ... if (c->extended_cpuid_level >= 0x8000001f) ... which is invalid. Fix this by immediately setting the CPU's max extended CPUID leaf to zero if CPUID(0x80000000).EAX doesn't indicate a valid CPUID extended range. While at it, add a comment, similar to kernel/head_32.S, clarifying the CPUID(0x80000000) sanity check. References: 8a50e5135af0 ("x86-32: Use symbolic constants, safer CPUID when enabling EFER.NX") Fixes: 3da99c977637 ("x86: make (early)_identify_cpu more the same between 32bit and 64 bit") Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: x86-cpuid@lists.linux.dev Link: https://lore.kernel.org/r/20250506050437.10264-3-darwi@linutronix.de
2025-05-06Merge tag 'v6.15-rc5' into x86/cpu, to resolve conflictsIngo Molnar
Conflicts: tools/arch/x86/include/asm/cpufeatures.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-05x86/alternative: Remove unused header #definesJuergen Gross
Remove some unfortunately-named unused macros which could potentially result in weird build failures. Fortunately, they are under an #ifdef __ASSEMBLER__ which has kept them from causing problems so far. [ dhansen: subject and changelog tweaks ] Fixes: 1a6ade825079 ("x86/alternative: Convert the asm ALTERNATIVE_3() macro") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20250505131646.29288-1-jgross%40suse.com
2025-05-05x86/CPU/AMD: Print the reason for the last resetYazen Ghannam
The following register contains bits that indicate the cause for the previous reset. PMx000000C0 (FCH::PM::S5_RESET_STATUS) This is useful for debug. The reasons for reset are broken into 6 high level categories. Decode it by category and print during boot. Specifics within a category are split off into debugging documentation. The register is accessed indirectly through a "PM" port in the FCH. Use MMIO access in order to avoid restrictions with legacy port access. Use a late_initcall() to ensure that MMIO has been set up before trying to access the register. This register was introduced with AMD Family 17h, so avoid access on older families. There is no CPUID feature bit for this register. [ bp: Simplify the reason dumping loop. - merge a fix to not access an array element after the last one: https://lore.kernel.org/r/20250505133609.83933-1-superm1@kernel.org Reported-by: James Dutton <james.dutton@gmail.com> ] [ mingo: - Use consistent .rst formatting - Fix 'Sleep' class field to 'ACPI-State' - Standardize pin messages around the 'tripped' verbiage - Remove reference to ring-buffer printing & simplify the wording - Use curly braces for multi-line conditional statements ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Co-developed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250422234830.2840784-6-superm1@kernel.org
2025-05-05x86/fpu: Restore fpu_thread_struct_whitelist() to fix ↵Kees Cook
CONFIG_HARDENED_USERCOPY=y crash Borislav Petkov reported the following boot crash on x86-32, with CONFIG_HARDENED_USERCOPY=y: | usercopy: Kernel memory overwrite attempt detected to SLUB object 'task_struct' (offset 2112, size 160)! | ... | kernel BUG at mm/usercopy.c:102! So the useroffset and usersize arguments are what control the allowed window of copying in/out of the "task_struct" kmem cache: /* create a slab on which task_structs can be allocated */ task_struct_whitelist(&useroffset, &usersize); task_struct_cachep = kmem_cache_create_usercopy("task_struct", arch_task_struct_size, align, SLAB_PANIC|SLAB_ACCOUNT, useroffset, usersize, NULL); task_struct_whitelist() positions this window based on the location of the thread_struct within task_struct, and gets the arch-specific details via arch_thread_struct_whitelist(offset, size): static void __init task_struct_whitelist(unsigned long *offset, unsigned long *size) { /* Fetch thread_struct whitelist for the architecture. */ arch_thread_struct_whitelist(offset, size); /* * Handle zero-sized whitelist or empty thread_struct, otherwise * adjust offset to position of thread_struct in task_struct. */ if (unlikely(*size == 0)) *offset = 0; else *offset += offsetof(struct task_struct, thread); } Commit cb7ca40a3882 ("x86/fpu: Make task_struct::thread constant size") removed the logic for the window, leaving: static inline void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size) { *offset = 0; *size = 0; } So now there is no window that usercopy hardening will allow to be copied in/out of task_struct. But as reported above, there *is* a copy in copy_uabi_to_xstate(). (It seems there are several, actually.) int copy_sigframe_from_user_to_xstate(struct task_struct *tsk, const void __user *ubuf) { return copy_uabi_to_xstate(x86_task_fpu(tsk)->fpstate, NULL, ubuf, &tsk->thread.pkru); } This appears to be writing into x86_task_fpu(tsk)->fpstate. With or without CONFIG_X86_DEBUG_FPU, this resolves to: ((struct fpu *)((void *)(task) + sizeof(*(task)))) i.e. the memory "after task_struct" is cast to "struct fpu", and the uses the "fpstate" pointer. How that pointer gets set looks to be variable, but I think the one we care about here is: fpu->fpstate = &fpu->__fpstate; And struct fpu::__fpstate says: struct fpstate __fpstate; /* * WARNING: '__fpstate' is dynamically-sized. Do not put * anything after it here. */ So we're still dealing with a dynamically sized thing, even if it's not within the literal struct task_struct -- it's still in the kmem cache, though. Looking at the kmem cache size, it has allocated "arch_task_struct_size" bytes, which is calculated in fpu__init_task_struct_size(): int task_size = sizeof(struct task_struct); task_size += sizeof(struct fpu); /* * Subtract off the static size of the register state. * It potentially has a bunch of padding. */ task_size -= sizeof(union fpregs_state); /* * Add back the dynamically-calculated register state * size. */ task_size += fpu_kernel_cfg.default_size; /* * We dynamically size 'struct fpu', so we require that * 'state' be at the end of 'it: */ CHECK_MEMBER_AT_END_OF(struct fpu, __fpstate); arch_task_struct_size = task_size; So, this is still copying out of the kmem cache for task_struct, and the window seems unchanged (still fpu regs). This is what the window was before: void fpu_thread_struct_whitelist(unsigned long *offset, unsigned long *size) { *offset = offsetof(struct thread_struct, fpu.__fpstate.regs); *size = fpu_kernel_cfg.default_size; } And the same commit I mentioned above removed it. I think the misunderstanding is here: | The fpu_thread_struct_whitelist() quirk to hardened usercopy can be removed, | now that the FPU structure is not embedded in the task struct anymore, which | reduces text footprint a bit. Yes, FPU is no longer in task_struct, but it IS in the kmem cache named "task_struct", since the fpstate is still being allocated there. Partially revert the earlier mentioned commit, along with a recalculation of the fpstate regs location. Fixes: cb7ca40a3882 ("x86/fpu: Make task_struct::thread constant size") Reported-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/all/20250409211127.3544993-1-mingo@kernel.org/ # Discussion #1 Link: https://lore.kernel.org/r/202505041418.F47130C4C8@keescook # Discussion #2
2025-05-05crypto: x86/sha256 - Add simd block functionHerbert Xu
Add CRYPTO_ARCH_HAVE_LIB_SHA256_SIMD and a SIMD block function so that the caller can decide whether to use SIMD. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-05crypto: arch/sha256 - Export block functions as GPL onlyHerbert Xu
Export the block functions as GPL only, there is no reason to let arbitrary modules use these internal functions. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-05crypto: x86/blake2s - Include linux/init.hHerbert Xu
Explicitly include linux/init.h rather than pulling it through potluck. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-05Revert "crypto: run initcalls for generic implementations earlier"Herbert Xu
This reverts commit c4741b23059794bd99beef0f700103b0d983b3fd. Crypto API self-tests no longer run at registration time and now occur either at late_initcall or upon the first use. Therefore the premise of the above commit no longer exists. Revert it and subsequent additions of subsys_initcall and arch_initcall. Note that lib/crypto calls will stay at subsys_initcall (or rather downgraded from arch_initcall) because they may need to occur before Crypto API registration. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-05crypto: x86/sha256 - implement library instead of shashEric Biggers
Instead of providing crypto_shash algorithms for the arch-optimized SHA-256 code, instead implement the SHA-256 library. This is much simpler, it makes the SHA-256 library functions be arch-optimized, and it fixes the longstanding issue where the arch-optimized SHA-256 was disabled by default. SHA-256 still remains available through crypto_shash, but individual architectures no longer need to handle it. To match sha256_blocks_arch(), change the type of the nblocks parameter of the assembly functions from int to size_t. The assembly functions actually already treated it as size_t. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-05x86/microcode: Consolidate the loader enablement checkingBorislav Petkov (AMD)
Consolidate the whole logic which determines whether the microcode loader should be enabled or not into a single function and call it everywhere. Well, almost everywhere - not in mk_early_pgtbl_32() because there the kernel is running without paging enabled and checking dis_ucode_ldr et al would require physical addresses and uglification of the code. But since this is 32-bit, the easier thing to do is to simply map the initrd unconditionally especially since that mapping is getting removed later anyway by zap_early_initrd_mapping() and avoid the uglification. In doing so, address the issue of old 486er machines without CPUID support, not booting current kernels. [ mingo: Fix no previous prototype for ‘microcode_loader_disabled’ [-Wmissing-prototypes] ] Fixes: 4c585af7180c1 ("x86/boot/32: Temporarily map initrd for microcode loading") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/CANpbe9Wm3z8fy9HbgS8cuhoj0TREYEEkBipDuhgkWFvqX0UoVQ@mail.gmail.com
2025-05-05um/asm: Replace "REP; NOP" with PAUSE mnemonicUros Bizjak
Current minimum required version of binutils is 2.25, which supports PAUSE instruction mnemonic. Replace "REP; NOP" with this proper mnemonic. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: David Laight <david.laight.linux@gmail.com> Link: https://patch.msgid.link/20250418083436.133148-2-ubizjak@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-05um/asm: Rename rep_nop() to native_pause()Uros Bizjak
Rename rep_nop() function to what it really does. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: David Laight <david.laight.linux@gmail.com> Link: https://patch.msgid.link/20250418083436.133148-1-ubizjak@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-05um: Add cmpxchg8b_emu and checksum functions to asm-prototypes.hSami Tolvanen
With CONFIG_GENDWARFKSYMS, um builds fail due to missing prototypes in asm/asm-prototypes.h. Add declarations for cmpxchg8b_emu and the exported checksum functions, including csum_partial_copy_generic as it's also exported. Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: linux-kbuild@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503251216.lE4t9Ikj-lkp@intel.com/ Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Link: https://patch.msgid.link/20250326190500.847236-2-samitolvanen@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-05um: fix _nofault accessesJohannes Berg
Nathan reported [1] that when built with clang, the um kernel crashes pretty much immediately. This turned out to be an issue with the inline assembly I had added, when clang used %rax/%eax for both operands. Reorder it so current->thread.segv_continue is written first, and then the lifetime of _faulted won't have overlap with the lifetime of segv_continue. In the email thread Benjamin also pointed out that current->mm is only NULL for true kernel tasks, but we could do this for a userspace task, so the current->thread.segv_continue logic must be lifted out of the mm==NULL check. Finally, while looking at this, put a barrier() so the NULL assignment to thread.segv_continue cannot be reorder before the possibly faulting operation. Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/r/20250402221254.GA384@ax162 [1] Fixes: d1d7f01f7cd3 ("um: mark rodata read-only and implement _nofault accesses") Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-05-05crypto: lib/poly1305 - Use block-only interfaceHerbert Xu
Now that every architecture provides a block function, use that to implement the lib/poly1305 and remove the old per-arch code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>