path: root/arch/x86
Age  Commit message  Author
2023-09-23  KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway  (Paolo Bonzini)
svm_recalc_instruction_intercepts() is always called at least once before the vCPU is started, so the setting or clearing of the RDTSCP intercept can be dropped from the TSC_AUX virtualization support. Extracted from a patch by Tom Lendacky. Cc: stable@vger.kernel.org Fixes: 296d5a17e793 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23  KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously  (Sean Christopherson)
Stop zapping invalidated TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended duration if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels). While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot deletion. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, *beyond* the CPUs KVM is already using to run vCPUs. When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+. Consuming all CPUs for an extended duration can lead to significant jitter throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite the tasks having a high realtime priority. Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity. Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed. Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots. Reported-by: Pattara Teerapong <pteerapong@google.com> Cc: David Stevens <stevensd@google.com> Cc: Yiwei Zhang <zzyiwei@google.com> Cc: Paul Hsia <paulhsia@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23  KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe()  (Paolo Bonzini)
All callers except the MMU notifier want to process all address spaces. Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe() and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe(). Extracted out of a patch by Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-22  Merge tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 rethunk fixes from Borislav Petkov: "Fix the patching ordering between static calls and return thunks"

* tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86,static_call: Fix static-call vs return-thunk
  x86/alternatives: Remove faulty optimization
2023-09-22  Merge tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull misc x86 fixes from Ingo Molnar:

- Fix a kexec bug
- Fix an UML build bug
- Fix a handful of SRSO related bugs
- Fix a shadow stacks handling bug & robustify related code

* tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/shstk: Add warning for shadow stack double unmap
  x86/shstk: Remove useless clone error handling
  x86/shstk: Handle vfork clone failure correctly
  x86/srso: Fix SBPB enablement for spec_rstack_overflow=off
  x86/srso: Don't probe microcode in a guest
  x86/srso: Set CPUID feature bits independently of bug or mitigation status
  x86/srso: Fix srso_show_state() side effect
  x86/asm: Fix build of UML with KASAN
  x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()
2023-09-22  x86/hyperv: Add common print prefix "Hyper-V" in hv_init  (Saurabh Sengar)
Add "#define pr_fmt()" in hv_init.c to use "Hyper-V:" as the common print prefix for all pr_*() statements in this file. Remove the "Hyper-V:" prefix already present in a couple of prints. Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/1695123361-8877-1-git-send-email-ssengar@linux.microsoft.com
2023-09-22  x86/hyperv: Remove hv_vtl_early_init initcall  (Saurabh Sengar)
There have been cases reported where HYPERV_VTL_MODE is enabled by mistake on non-Hyper-V platforms. This causes the hv_vtl_early_init function to be called on non-Hyper-V/VTL platforms, which results in memory corruption. Remove the early_initcall for hv_vtl_early_init and call it at the end of hyperv_init to make sure it is never called on a non-Hyper-V platform by mistake. Reported-by: Mathias Krause <minipli@grsecurity.net> Closes: https://lore.kernel.org/lkml/40467722-f4ab-19a5-4989-308225b1f9f0@grsecurity.net/ Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Acked-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/1695358720-27681-1-git-send-email-ssengar@linux.microsoft.com
2023-09-22  x86/hyperv: Restrict get_vtl to only VTL platforms  (Saurabh Sengar)
When Linux runs in a non-default VTL (CONFIG_HYPERV_VTL_MODE=y), get_vtl() must never fail as its return value is used in negotiations with the host. In the more generic case (CONFIG_HYPERV_VTL_MODE=n), the VTL is always zero so there's no need to do the hypercall. Make get_vtl() BUG() in case of failure and put the implementation under "if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)" to avoid the call altogether in the most generic use case. Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/1695182675-13405-1-git-send-email-ssengar@linux.microsoft.com
2023-09-22  x86,static_call: Fix static-call vs return-thunk  (Peter Zijlstra)
Commit 7825451fa4dc ("static_call: Add call depth tracking support") failed to realize the problem fixed there is not specific to call depth tracking but applies to all return-thunk uses. Move the fix to the appropriate place and condition. Fixes: ee88d363d156 ("x86,static_call: Use alternative RET encoding") Reported-by: David Kaplan <David.Kaplan@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org>
2023-09-22  x86/alternatives: Remove faulty optimization  (Josh Poimboeuf)
The following commit 095b8303f383 ("x86/alternative: Make custom return thunk unconditional") made '__x86_return_thunk' a placeholder value. All code setting X86_FEATURE_RETHUNK also changes the value of 'x86_return_thunk'. So the optimization at the beginning of apply_returns() is dead code. Also, before the above-mentioned commit, the optimization actually had a bug: it bypassed __static_call_fixup(), causing some raw returns to remain unpatched in static call trampolines. Thus the 'Fixes' tag. Fixes: d2408e043e72 ("x86/alternative: Optimize returns patching") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/16d19d2249d4485d8380fb215ffaae81e6b8119e.1693889988.git.jpoimboe@kernel.org
2023-09-22  hardening: Provide Kconfig fragments for basic options  (Kees Cook)
Inspired by Salvatore Mesoraca's earlier[1] efforts to provide some in-tree guidance for kernel hardening Kconfig options, add a new fragment named "hardening-basic.config" (along with some arch-specific fragments) that enable a basic set of kernel hardening options that have the least (or no) performance impact and remove a reasonable set of legacy APIs. Using this fragment is as simple as running "make hardening.config". More extreme fragments can be added[2] in the future to cover all the recognized hardening options, and more per-architecture files can be added too. For now, document the fragments directly via comments. Perhaps .rst documentation can be generated from them in the future (rather than the other way around). [1] https://lore.kernel.org/kernel-hardening/1536516257-30871-1-git-send-email-s.mesoraca16@gmail.com/ [2] https://github.com/KSPP/linux/issues/14 Cc: Salvatore Mesoraca <s.mesoraca16@gmail.com> Cc: x86@kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: linux-kbuild@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-09-22  perf/x86/amd/core: Fix overflow reset on hotplug  (Sandipan Das)
Kernels older than v5.19 do not support PerfMonV2 and the PMI handler does not clear the overflow bits of the PerfCntrGlobalStatus register. Because of this, loading a recent kernel using kexec from an older kernel can result in inconsistent register states on Zen 4 systems. The PMI handler of the new kernel gets confused and shows a warning when an overflow occurs because some of the overflow bits are set even if the corresponding counters are inactive. These are remnants from overflows that were handled by the older kernel. During CPU hotplug, the PerfCntrGlobalCtl and PerfCntrGlobalStatus registers should always be cleared for PerfMonV2-capable processors. However, a condition used for NB event constraints applicable only to older processors currently prevents this from happening. Move the reset sequence to an appropriate place and also clear the LBR Freeze bit. Fixes: 21d59e3e2c40 ("perf/x86/amd/core: Detect PerfMonV2 support") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/882a87511af40792ba69bb0e9026f19a2e71e8a3.1694696888.git.sandipan.das@amd.com
2023-09-22  x86/speculation, objtool: Use absolute relocations for annotations  (Fangrui Song)
.discard.retpoline_safe sections do not have the SHF_ALLOC flag. Having these sections reference text sections' STT_SECTION symbols with PC-relative relocations like R_386_PC32 [0] is conceptually not suitable. Newer LLD will report warnings for REL relocations even for relocatable links [1]:

  ld.lld: warning: vmlinux.a(drivers/i2c/busses/i2c-i801.o):(.discard.retpoline_safe+0x120): has non-ABS relocation R_386_PC32 against symbol ''

Switch to absolute relocations instead, which indicate link-time addresses. In a relocatable link, these addresses are also output section offsets, used by checks in tools/objtool/check.c. When linking vmlinux, these .discard.* sections will be discarded, therefore it is not a problem that R_X86_64_32 cannot represent a kernel address. Alternatively, we could set the SHF_ALLOC flag for .discard.* sections, but I think non-SHF_ALLOC for sections to be discarded makes more sense. Note: if we decide to never support REL architectures (e.g. arm, i386), we can utilize R_*_NONE relocations (.reloc ., BFD_RELOC_NONE, sym), making .discard.* sections zero-sized. That said, the section content waste is 4 bytes per entry, much smaller than sizeof(Elf{32,64}_Rel). [0] commit 1c0c1faf5692 ("objtool: Use relative pointers for annotations") [1] https://github.com/ClangBuiltLinux/linux/issues/1937 Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20230920001728.1439947-1-maskray@google.com
2023-09-22  x86/cpu: Clear SVM feature if disabled by BIOS  (Paolo Bonzini)
When SVM is disabled by BIOS, one cannot use KVM but the SVM feature is still shown in the output of /proc/cpuinfo. On Intel machines, VMX is cleared by init_ia32_feat_ctl(), so do the same on AMD and Hygon processors. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230921114940.957141-1-pbonzini@redhat.com
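A minimal sketch of the kind of check this implies (an approximation; the MSR bit name and the exact placement in the AMD/Hygon init paths are assumptions, not the verbatim patch):

  #include <linux/bits.h>
  #include <asm/msr.h>
  #include <asm/cpufeature.h>

  static void detect_svm_disabled_by_bios(struct cpuinfo_x86 *c)
  {
          u64 vm_cr;

          if (!cpu_has(c, X86_FEATURE_SVM))
                  return;

          rdmsrl(MSR_VM_CR, vm_cr);                       /* BIOS can lock SVM off via VM_CR */
          if (vm_cr & BIT(SVM_VM_CR_SVM_DISABLE)) {
                  clear_cpu_cap(c, X86_FEATURE_SVM);      /* hide "svm" from /proc/cpuinfo */
                  pr_info_once("SVM disabled (by BIOS) in MSR_VM_CR\n");
          }
  }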
2023-09-22  x86/bitops: Remove unused __sw_hweight64() assembly implementation on x86-32  (Ingo Molnar)
Header cleanups in the fast-headers tree highlighted that we have an unused assembly implementation for __sw_hweight64():

  WARNING: modpost: EXPORT symbol "__sw_hweight64" [vmlinux] version ...

__arch_hweight64() on x86-32 is defined in the arch/x86/include/asm/arch_hweight.h header as an inline, using __arch_hweight32():

  #ifdef CONFIG_X86_32
  static inline unsigned long __arch_hweight64(__u64 w)
  {
          return __arch_hweight32((u32)w) +
                 __arch_hweight32((u32)(w >> 32));
  }

*But* there's also a __sw_hweight64() assembly implementation in arch/x86/lib/hweight.S:

  SYM_FUNC_START(__sw_hweight64)
  #ifdef CONFIG_X86_64
  ...
  #else /* CONFIG_X86_32 */
  /* We're getting an u64 arg in (%eax,%edx): unsigned long hweight64(__u64 w) */
          pushl   %ecx
          call    __sw_hweight32
          movl    %eax, %ecx      # stash away result
          movl    %edx, %eax      # second part of input
          call    __sw_hweight32
          addl    %ecx, %eax      # result
          popl    %ecx
          ret
  #endif

But this __sw_hweight64 assembly implementation is unused - and it's essentially doing the same thing that the inline wrapper does. Remove the assembly version and add a comment about it. Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org
2023-09-22  x86/mm: Move arch_memory_failure() and arch_is_platform_page() definitions from <asm/processor.h> to <asm/pgtable.h>  (Ingo Molnar)
<linux/mm.h> relies on these definitions being included first, which is true currently due to historic header spaghetti, but in the future <asm/processor.h> will not be guaranteed to be included by the MM code. Move these definitions over into a suitable MM header. This is a preparatory patch for x86 header dependency simplifications and reductions. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org
2023-09-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Paolo Abeni)
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-21  futex: Add sys_futex_requeue()  (peterz@infradead.org)
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall. As such, use struct futex_waitv to pass the 'source' and 'destination' futexes to the syscall. This syscall implements what was previously known as FUTEX_CMP_REQUEUE and uses {val, uaddr, flags} for source and {uaddr, flags} for destination. This design explicitly allows requeueing between different types of futex by having a different flags word per uaddr. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.511860556@noisy.programming.kicks-ass.net
2023-09-21  futex: Add sys_futex_wait()  (peterz@infradead.org)
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for the value and bitmask arguments, takes timespec and clockid_t arguments for the absolute timeout and uses FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.164324363@noisy.programming.kicks-ass.net
2023-09-21  futex: Add sys_futex_wake()  (peterz@infradead.org)
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmask and takes FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105247.936205525@noisy.programming.kicks-ass.net
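For a sense of how a thin userspace wrapper for such a syscall might look, here is a hypothetical sketch; the syscall number macro, argument order and flag values are assumptions based on the description above, not a documented ABI:

  #include <unistd.h>
  #include <sys/syscall.h>

  /* hypothetical wrapper: wake up to 'nr' waiters whose wait mask intersects 'mask' */
  static long futex2_wake(void *uaddr, unsigned long mask, int nr, unsigned int flags)
  {
          /* __NR_futex_wake is assumed to be provided by the installed kernel headers */
          return syscall(__NR_futex_wake, uaddr, mask, nr, flags);
  }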
2023-09-21  KVM: x86/mmu: Open code leaf invalidation from mmu_notifier  (Sean Christopherson)
The mmu_notifier path is a bit of a special snowflake, e.g. it zaps only a single address space (because it's per-slot), and can't always yield. Because of this, it calls kvm_tdp_mmu_zap_leafs() in ways that no one else does. Iterate manually over the leafs in response to an mmu_notifier invalidation, instead of invoking kvm_tdp_mmu_zap_leafs(). Drop the @can_yield param from kvm_tdp_mmu_zap_leafs() as its sole remaining caller unconditionally passes "true". Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-21  x86/platform/uv/apic: Clean up inconsistent indenting  (Yang Li)
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230816003842.116574-1-yang.lee@linux.alibaba.com
2023-09-21  x86/percpu: Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op  (Uros Bizjak)
The fallback alternative uses the %rsi register to manually load the pointer to the percpu variable before the call to the emulation function. This is unoptimal, because the load is hidden from the compiler. Move the load of %rsi outside the inline asm, so the compiler can reuse the value. The code in slub.o improves from:

    55ac: 49 8b 3c 24           mov    (%r12),%rdi
    55b0: 48 8d 4a 40           lea    0x40(%rdx),%rcx
    55b4: 49 8b 1c 07           mov    (%r15,%rax,1),%rbx
    55b8: 4c 89 f8              mov    %r15,%rax
    55bb: 48 8d 37              lea    (%rdi),%rsi
    55be: e8 00 00 00 00        callq  55c3 <...>
          55bf: R_X86_64_PLT32  this_cpu_cmpxchg16b_emu-0x4
    55c3: 75 a3                 jne    5568 <...>
    55c5: ...

  0000000000000000 <.altinstr_replacement>:
    5: 65 48 0f c7 0f           cmpxchg16b %gs:(%rdi)

to:

    55ac: 49 8b 34 24           mov    (%r12),%rsi
    55b0: 48 8d 4a 40           lea    0x40(%rdx),%rcx
    55b4: 49 8b 1c 07           mov    (%r15,%rax,1),%rbx
    55b8: 4c 89 f8              mov    %r15,%rax
    55bb: e8 00 00 00 00        callq  55c0 <...>
          55bc: R_X86_64_PLT32  this_cpu_cmpxchg16b_emu-0x4
    55c0: 75 a6                 jne    5568 <...>
    55c2: ...

Where the alternative replacement instruction now uses %rsi:

  0000000000000000 <.altinstr_replacement>:
    5: 65 48 0f c7 0e           cmpxchg16b %gs:(%rsi)

The instruction (effectively a reg-reg move) at 55bb: in the original assembly is removed. Also, both the CALL and replacement CMPXCHG16B are 5 bytes long, removing the need for NOPs in the asm code. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230918151452.62344-1-ubizjak@gmail.com
2023-09-21  x86/unwind/orc: Remove redundant initialization of 'mid' pointer in __orc_find()  (Colin Ian King)
The 'mid' pointer is being initialized with a value that is never read, it is being re-assigned and used inside a for-loop. Remove the redundant initialization. Cleans up clang scan build warning: arch/x86/kernel/unwind_orc.c:88:7: warning: Value stored to 'mid' during its initialization is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20230920114141.118919-1-colin.i.king@gmail.com
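For context, the class of issue being cleaned up looks roughly like this generic (hypothetical) binary-search helper, where the initializer is a dead store:

  /* 'mid = first' is never read: every path assigns 'mid' before using it */
  static const int *find_first_ge(const int *first, const int *last, int key)
  {
          const int *mid = first;         /* redundant initialization (DeadStores) */

          while (first < last) {
                  mid = first + (last - first) / 2;
                  if (*mid < key)
                          first = mid + 1;
                  else
                          last = mid;
          }
          return first;
  }

Dropping the initializer (declaring 'const int *mid;' uninitialized) silences clang's scan-build warning without changing behavior.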
2023-09-20  crypto: x86/sha - load modules based on CPU features  (Roxana Nicolescu)
x86 optimized crypto modules are built as modules rather than built-in, and they are not loaded when the crypto API is initialized, resulting in the generic builtin module (sha1-generic) being used instead. It was discovered when creating a sha1/sha256 checksum of a 2Gb file by using kcapi-tools, because it would take significantly longer than creating a sha512 checksum of the same file. trace-cmd showed that for sha1/256 the generic module was used, whereas for sha512 the optimized module was used instead. Add module aliases for these x86 optimized crypto modules based on CPU feature bits so udev gets a chance to load them later in the boot process. This resulted in a ~3x decrease in the real-time execution of kcapi-dsg. The fix is inspired by commit aa031b8f702e ("crypto: x86/sha512 - load based on CPU features") where a similar fix was done for sha512. Cc: stable@vger.kernel.org # 5.15+ Suggested-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com> Suggested-by: Julian Andres Klode <julian.klode@canonical.com> Signed-off-by: Roxana Nicolescu <roxana.nicolescu@canonical.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
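The general mechanism is the usual CPU-feature-based module autoloading pattern, sketched below; the exact feature bits the sha1/sha256 modules match on may differ:

  #include <linux/module.h>
  #include <asm/cpu_device_id.h>

  static const struct x86_cpu_id module_cpu_ids[] = {
          X86_MATCH_FEATURE(X86_FEATURE_AVX2, NULL),
          X86_MATCH_FEATURE(X86_FEATURE_AVX, NULL),
          X86_MATCH_FEATURE(X86_FEATURE_XMM2, NULL),
          {}
  };
  MODULE_DEVICE_TABLE(x86cpu, module_cpu_ids);    /* emits the alias udev matches on */

  static int __init sha_example_init(void)
  {
          if (!x86_match_cpu(module_cpu_ids))
                  return -ENODEV;
          /* register the optimized shash implementations here */
          return 0;
  }
  module_init(sha_example_init);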
2023-09-20  crypto: aesni - Fix double word in comments  (Bo Liu)
Remove the repeated word "if" in comments. Signed-off-by: Bo Liu <liubo03@inspur.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-09-19  x86/shstk: Add warning for shadow stack double unmap  (Rick Edgecombe)
There are several ways a thread's shadow stacks can get unmapped. This can happen on exit or exec, as well as error handling in exec or clone. The task struct already keeps track of the thread's shadow stack. Use the size variable to keep track of whether the shadow stack has already been freed. When an attempt to double unmap the thread shadow stack is caught, warn about it and abort the operation. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Link: https://lore.kernel.org/all/20230908203655.543765-4-rick.p.edgecombe%40intel.com
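Conceptually, the guard looks something like the sketch below; the function and field names follow the shstk code only loosely and are not the verbatim patch:

  static void unmap_thread_shadow_stack(struct task_struct *tsk)  /* illustrative only */
  {
          struct thread_shstk *shstk = &tsk->thread.shstk;

          /* size == 0 doubles as "already freed / not managed by this thread" */
          if (WARN_ON(!shstk->size))
                  return;                         /* abort instead of unmapping twice */

          /* ... vm_munmap() of shstk->base elided ... */
          shstk->base = 0;
          shstk->size = 0;
  }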
2023-09-19  x86/shstk: Remove useless clone error handling  (Rick Edgecombe)
When clone fails after the shadow stack is allocated, any allocated shadow stack is cleaned up in exit_thread() in copy_process(). So the logic in copy_thread() is unneeded, and also will not handle failures that happen outside of copy_thread(). In addition, since there is a second attempt to unmap the same shadow stack, there is a race where a newly mapped region could get unmapped. So remove the logic in copy_thread() and rely on exit_thread() to handle clone failure. Fixes: b2926a36b97a ("x86/shstk: Handle thread shadow stack") Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Link: https://lore.kernel.org/all/20230908203655.543765-3-rick.p.edgecombe%40intel.com
2023-09-19  x86/shstk: Handle vfork clone failure correctly  (Rick Edgecombe)
Shadow stacks are allocated automatically and freed on exit, depending on the clone flags. The two cases where new shadow stacks are not allocated are !CLONE_VM (fork()) and CLONE_VFORK (vfork()). For !CLONE_VM, although a new stack is not allocated, it can be freed normally because it will happen in the child's copy of the VM. However, for CLONE_VFORK the parent and the child are actually using the same shadow stack. So the kernel doesn't need to allocate *or* free a shadow stack for a CLONE_VFORK child. CLONE_VFORK children already need special tracking to avoid returning to userspace until the child exits or execs. Shadow stack uses this same tracking to avoid freeing CLONE_VFORK shadow stacks. However, the tracking is not set up until the clone has succeeded (internally). This means that if a CLONE_VFORK fails, the existing logic will not know it is a CLONE_VFORK and will proceed to unmap the parent's shadow stack. This error handling cleanup logic runs via exit_thread() in the bad_fork_cleanup_thread label in copy_process(). The issue was seen in the glibc test "posix/tst-spawn3-pidfd" while running with shadow stack using currently out-of-tree glibc patches. Fix it by not unmapping the vfork shadow stack in the error case as well. Since clone is implemented in core code, it is not ideal to pass the clone flags along the error path in order to have shadow stack code have symmetric logic in the freeing half of the thread shadow stack handling. Instead use the existing state for thread shadow stacks to track whether the thread is managing its own shadow stack. For CLONE_VFORK, simply set shstk->base and shstk->size to 0, and have it mean the thread is not managing a shadow stack and so should skip cleanup work. Implement this by breaking up the CLONE_VFORK and !CLONE_VM cases in shstk_alloc_thread_stack() into separate conditionals, since the logic is now different between them. In the case of CLONE_VFORK && !CLONE_VM, the existing behavior is to not clean up the shadow stack in the child (which should go away quickly with either exit or exec), so maintain that behavior by handling the CLONE_VFORK case first in the allocation path. This new logic cleanly handles the case of normal, successful CLONE_VFORKs skipping cleaning up their shadow stacks on exit as well. So remove the existing vfork shadow stack freeing logic. This is in deactivate_mm() where vfork_done is used to tell if it is a vfork child that can skip cleaning up the thread shadow stack. Fixes: b2926a36b97a ("x86/shstk: Handle thread shadow stack") Reported-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Link: https://lore.kernel.org/all/20230908203655.543765-2-rick.p.edgecombe%40intel.com
2023-09-19  bpf: Disable exceptions when CONFIG_UNWINDER_FRAME_POINTER=y  (Kumar Kartikeya Dwivedi)
The build with CONFIG_UNWINDER_FRAME_POINTER=y is broken for current exceptions feature as it assumes ORC unwinder specific fields in the unwind_state. Disable exceptions when frame_pointer unwinder is enabled for now. Fixes: fd5d27b70188 ("arch/x86: Implement arch_bpf_stack_walk") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230918155233.297024-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19  x86/srso: Fix SBPB enablement for spec_rstack_overflow=off  (Josh Poimboeuf)
If the user has requested no SRSO mitigation, other mitigations can use the lighter-weight SBPB instead of IBPB. Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/b20820c3cfd1003171135ec8d762a0b957348497.1693889988.git.jpoimboe@kernel.org
2023-09-19  x86/srso: Don't probe microcode in a guest  (Josh Poimboeuf)
To support live migration, the hypervisor sets the "lowest common denominator" of features. Probing the microcode isn't allowed because any detected features might go away after a migration. As Andy Cooper states: "Linux must not probe microcode when virtualised.  What it may see instantaneously on boot (owing to MSR_PRED_CMD being fully passed through) is not accurate for the lifetime of the VM." Rely on the hypervisor to set the needed IBPB_BRTYPE and SBPB bits. Fixes: 1b5277c0ea0b ("x86/srso: Add SRSO_NO support") Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/3938a7209606c045a3f50305d201d840e8c834c7.1693889988.git.jpoimboe@kernel.org
2023-09-19  x86/srso: Set CPUID feature bits independently of bug or mitigation status  (Josh Poimboeuf)
Booting with mitigations=off incorrectly prevents the X86_FEATURE_{IBPB_BRTYPE,SBPB} CPUID bits from getting set. Also, future CPUs without X86_BUG_SRSO might still have IBPB with branch type prediction flushing, in which case SBPB should be used instead of IBPB. The current code doesn't allow for that. Also, cpu_has_ibpb_brtype_microcode() has some surprising side effects and the setting of these feature bits really doesn't belong in the mitigation code anyway. Move it earlier. Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/869a1709abfe13b673bdd10c2f4332ca253a40bc.1693889988.git.jpoimboe@kernel.org
2023-09-19  x86/srso: Fix srso_show_state() side effect  (Josh Poimboeuf)
Reading the 'spec_rstack_overflow' sysfs file can trigger an unnecessary MSR write, and possibly even a (handled) exception if the microcode hasn't been updated. Avoid all that by just checking X86_FEATURE_IBPB_BRTYPE instead, which gets set by srso_select_mitigation() if the updated microcode exists. Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/27d128899cb8aee9eb2b57ddc996742b0c1d776b.1693889988.git.jpoimboe@kernel.org
2023-09-19  xen/efi: refactor deprecated strncpy  (Justin Stitt)
`strncpy` is deprecated for use on NUL-terminated destination strings [1]. `efi_loader_signature` has space for 4 bytes. We are copying "Xen" (3 bytes) plus a NUL-byte, which makes 4 total bytes. With that being said, there is currently not a bug with the current `strncpy()` implementation in terms of buffer overreads, but we should favor a more robust string interface either way. A suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on the destination buffer while being functionally the same in this case. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230911-strncpy-arch-x86-xen-efi-c-v1-1-96ab2bba2feb@google.com Signed-off-by: Juergen Gross <jgross@suse.com>
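The shape of the change, as a hedged before/after sketch (the surrounding Xen EFI code is elided):

  char efi_loader_signature[4];

  /* before: bounded, but does not guarantee NUL-termination in the general case */
  strncpy(efi_loader_signature, "Xen", sizeof(efi_loader_signature));

  /* after: always NUL-terminates the destination, returns -E2BIG on truncation */
  strscpy(efi_loader_signature, "Xen", sizeof(efi_loader_signature));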
2023-09-19  x86/xen: allow nesting of same lazy mode  (Juergen Gross)
When running as a paravirtualized guest under Xen, Linux is using "lazy mode" for issuing hypercalls which don't need to take immediate effect in order to improve performance (examples are e.g. multiple PTE changes). There are two different lazy modes defined: MMU and CPU lazy mode. Today it is not possible to nest multiple lazy mode sections, even if they are of the same kind. A recent change in memory management added nesting of MMU lazy mode sections, resulting in a regression when running as Xen PV guest. Technically there is no reason why nesting of multiple sections of the same kind of lazy mode shouldn't be allowed. So add support for that for fixing the regression. Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20230913113828.18421-4-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-09-19  x86/xen: move paravirt lazy code  (Juergen Gross)
Only Xen is using the paravirt lazy mode code, so it can be moved to Xen specific sources. This allows making some of the functions static or merging them into their only call sites. While at it do a rename from "paravirt" to "xen" for all moved specifiers. No functional change. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20230913113828.18421-3-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-09-19  xen: simplify evtchn_do_upcall() call maze  (Juergen Gross)
There are several functions involved in performing the functionality of evtchn_do_upcall():

- __xen_evtchn_do_upcall() doing the real work
- xen_hvm_evtchn_do_upcall() just being a wrapper for __xen_evtchn_do_upcall(), exposed for external callers
- xen_evtchn_do_upcall() calling __xen_evtchn_do_upcall(), too, but without any user

Simplify this maze by:

- removing the unused xen_evtchn_do_upcall()
- removing xen_hvm_evtchn_do_upcall() as the only left caller of __xen_evtchn_do_upcall(), while renaming __xen_evtchn_do_upcall() to xen_evtchn_do_upcall()

Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-09-18  locking/lockref/x86: Enable ARCH_USE_CMPXCHG_LOCKREF for X86_CMPXCHG64  (Uros Bizjak)
The following commit: bc08b449ee14 ("lockref: implement lockless reference count updates using cmpxchg()") enabled lockless reference count updates using cmpxchg() only for x86_64, and left x86_32 behind due to the inability to detect support for the cmpxchg8b instruction. Nowadays, we can use CONFIG_X86_CMPXCHG64 for this purpose. Also, by using try_cmpxchg64() instead of cmpxchg64() in the CMPXCHG_LOOP macro, the compiler actually produces sane code, improving the lockref_get_not_zero() main loop from:

    eb:  8d 48 01              lea    0x1(%eax),%ecx
    ee:  85 c0                 test   %eax,%eax
    f0:  7e 2f                 jle    121 <lockref_get_not_zero+0x71>
    f2:  8b 44 24 10           mov    0x10(%esp),%eax
    f6:  8b 54 24 14           mov    0x14(%esp),%edx
    fa:  8b 74 24 08           mov    0x8(%esp),%esi
    fe:  f0 0f c7 0e           lock cmpxchg8b (%esi)
   102:  8b 7c 24 14           mov    0x14(%esp),%edi
   106:  89 c1                 mov    %eax,%ecx
   108:  89 c3                 mov    %eax,%ebx
   10a:  8b 74 24 10           mov    0x10(%esp),%esi
   10e:  89 d0                 mov    %edx,%eax
   110:  31 fa                 xor    %edi,%edx
   112:  31 ce                 xor    %ecx,%esi
   114:  09 f2                 or     %esi,%edx
   116:  75 58                 jne    170 <lockref_get_not_zero+0xc0>

to:

   350:  8d 4f 01              lea    0x1(%edi),%ecx
   353:  85 ff                 test   %edi,%edi
   355:  7e 79                 jle    3d0 <lockref_get_not_zero+0xb0>
   357:  f0 0f c7 0e           lock cmpxchg8b (%esi)
   35b:  75 53                 jne    3b0 <lockref_get_not_zero+0x90>

Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230918184050.9180-1-ubizjak@gmail.com
2023-09-18  x86/asm: Fix build of UML with KASAN  (Vincent Whitchurch)
Building UML with KASAN fails since commit 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") with the following errors:

  $ tools/testing/kunit/kunit.py run --kconfig_add CONFIG_KASAN=y
  ...
  ld: mm/kasan/shadow.o: in function `memset':
  shadow.c:(.text+0x40): multiple definition of `memset'; arch/x86/lib/memset_64.o:(.noinstr.text+0x0): first defined here
  ld: mm/kasan/shadow.o: in function `memmove':
  shadow.c:(.text+0x90): multiple definition of `memmove'; arch/x86/lib/memmove_64.o:(.noinstr.text+0x0): first defined here
  ld: mm/kasan/shadow.o: in function `memcpy':
  shadow.c:(.text+0x110): multiple definition of `memcpy'; arch/x86/lib/memcpy_64.o:(.noinstr.text+0x0): first defined here

UML does not use GENERIC_ENTRY and is still supposed to be allowed to override the mem*() functions, so use weak aliases in that case. Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230918-uml-kasan-v3-1-7ad6db477df6@axis.com
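The underlying idea of a weak alias, expressed in plain C rather than the assembler macros the x86 string routines actually use (illustrative only):

  #include <stddef.h>

  void *__memcpy(void *dst, const void *src, size_t n);

  /* 'memcpy' is only a weak alias for '__memcpy', so another object file
     (e.g. KASAN's instrumented memcpy) may provide a strong definition
     without triggering a "multiple definition" link error. */
  void *memcpy(void *dst, const void *src, size_t n)
          __attribute__((weak, alias("__memcpy")));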
2023-09-18  x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()  (Rik van Riel)
The code calling ima_free_kexec_buffer() runs long after the memblock allocator has already been torn down, potentially resulting in a use after free in memblock_isolate_range(). With KASAN or KFENCE, this use after free will result in a BUG from the idle task, and a subsequent kernel panic. Switch ima_free_kexec_buffer() over to memblock_free_late() to avoid that bug. Fixes: fee3ff99bc67 ("powerpc: Move arch independent ima kexec functions to drivers/of/kexec.c") Suggested-by: Mike Rappoport <rppt@kernel.org> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230817135558.67274c83@imladris.surriel.com
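A simplified sketch of the substitution (the real ima_free_kexec_buffer() carries more bookkeeping; the variable names here are approximate):

  #include <linux/memblock.h>

  static phys_addr_t ima_buffer_phys;   /* assumed bookkeeping, set at allocation time */
  static size_t ima_buffer_size;

  int __init ima_free_kexec_buffer(void)
  {
          if (!ima_buffer_size)
                  return -ENOENT;

          /* memblock proper may already be torn down by now; memblock_free_late()
             releases the range straight to the page allocator instead. */
          memblock_free_late(ima_buffer_phys, ima_buffer_size);

          ima_buffer_phys = 0;
          ima_buffer_size = 0;
          return 0;
  }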
2023-09-18  x86/tdx: Fix __noreturn build warning around __tdx_hypercall_failed()  (Kai Huang)
LKP reported the build warning below:

  vmlinux.o: warning: objtool: __tdx_hypercall+0x128: __tdx_hypercall_failed() is missing a __noreturn annotation

The __tdx_hypercall_failed() function definition already has the __noreturn annotation, but it turns out the __noreturn must be applied to the function declaration. PeterZ explains: "FWIW, the reason being that... The point of noreturn is that the caller should know to stop generating code. For that the declaration needs the attribute, because call sites typically do not have access to the function definition in C." Add the __noreturn annotation to the declaration of __tdx_hypercall_failed() to fix this. It's not a bad idea to document the __noreturn nature at the definition site either, so keep the annotation at the definition. Note <asm/shared/tdx.h> is also included by TDX related assembly files. Include <linux/compiler_attributes.h> only in case of !__ASSEMBLY__, otherwise compiling an assembly file would trigger a build error. Also, following the objtool documentation, add __tdx_hypercall_failed() to "tools/objtool/noreturns.h". Fixes: c641cfb5c157 ("x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230918041858.331234-1-kai.huang@intel.com Closes: https://lore.kernel.org/oe-kbuild-all/202309140828.9RdmlH2Z-lkp@intel.com/
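The general rule being applied, as a minimal sketch (not the TDX header verbatim):

  /* header - what call sites see; this is where __noreturn actually matters */
  #ifndef __ASSEMBLY__
  #include <linux/compiler_attributes.h>
  void __noreturn __tdx_hypercall_failed(void);
  #endif

  /* definition site - keeping the annotation here too documents the intent */
  void __noreturn __tdx_hypercall_failed(void)
  {
          panic("TDVMCALL failed. TDX module bug?");
  }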
2023-09-17  Merge tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fixes from Ingo Molnar: "Misc fixes:

- Fix an UV boot crash
- Skip spurious ENDBR generation on _THIS_IP_
- Fix ENDBR use in putuser() asm methods
- Fix corner case boot crashes on 5-level paging
- and fix a false positive WARNING on LTO kernels"

* tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/purgatory: Remove LTO flags
  x86/boot/compressed: Reserve more memory for page tables
  x86/ibt: Avoid duplicate ENDBR in __put_user_nocheck*()
  x86/ibt: Suppress spurious ENDBR
  x86/platform/uv: Use alternate source for socket to node data
2023-09-17  Merge tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler fixes from Ingo Molnar: "Fix a performance regression on large SMT systems, an Intel SMT4 balancing bug, and a topology setup bug on (Intel) hybrid processors"

* tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sched: Restore the SD_ASYM_PACKING flag in the DIE domain
  sched/fair: Fix SMT4 group_smt_balance handling
  sched/fair: Optimize should_we_balance() for large SMT systems
2023-09-17  x86/boot: Increase section and file alignment to 4k/512  (Ard Biesheuvel)
Align x86 with other EFI architectures, and increase the section alignment to the EFI page size (4k), so that firmware is able to honour the section permission attributes and map code read-only and data non-executable. There are a number of requirements that have to be taken into account:

- the sign tools get cranky when there are gaps between sections in the file view of the image
- the virtual offset of each section must be aligned to the image's section alignment
- the file offset *and size* of each section must be aligned to the image's file alignment
- the image size must be aligned to the section alignment
- each section's virtual offset must be greater than or equal to the size of the headers

In order to meet all these requirements, while avoiding the need for lots of padding to accommodate the .compat section, the latter is placed at an arbitrary offset towards the end of the image, but aligned to the minimum file alignment (512 bytes). The space before the .text section is therefore distributed between the PE header, the .setup section and the .compat section, leaving no gaps in the file coverage, making the signing tools happy. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-18-ardb@google.com
2023-09-17  x86/boot: Split off PE/COFF .data section  (Ard Biesheuvel)
Describe the code and data of the decompressor binary using separate .text and .data PE/COFF sections, so that we will be able to map them using restricted permissions once we increase the section and file alignment sufficiently. This avoids the need for memory mappings that are writable and executable at the same time, which is something that is best avoided for security reasons. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-17-ardb@google.com
2023-09-17  x86/boot: Drop PE/COFF .reloc section  (Ard Biesheuvel)
Ancient buggy EFI loaders may have required a .reloc section to be present at some point in time, but this has not been true for a long time so the .reloc section can just be dropped. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-16-ardb@google.com
2023-09-17  x86/boot: Construct PE/COFF .text section from assembler  (Ard Biesheuvel)
Now that the size of the setup block is visible to the assembler, it is possible to populate the PE/COFF header fields from the asm code directly, instead of poking the values into the binary using the build tool. This will make it easier to reorganize the section layout without having to tweak the build tool in lockstep. This change has no impact on the resulting bzImage binary. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-15-ardb@google.com
2023-09-17  x86/boot: Derive file size from _edata symbol  (Ard Biesheuvel)
Tweak the linker script so that the value of _edata represents the decompressor binary's file size rounded up to the appropriate alignment. This removes the need to calculate it in the build tool, and will make it easier to refer to the file size from the header directly in subsequent changes to the PE header layout. While adding _edata to the sed regex that parses the compressed vmlinux's symbol list, tweak the regex a bit for conciseness. This change has no impact on the resulting bzImage binary when configured with CONFIG_EFI_STUB=y. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-14-ardb@google.com
2023-09-17  x86/boot: Define setup size in linker script  (Ard Biesheuvel)
The setup block contains the real mode startup code that is used when booting from a legacy BIOS, along with the boot_params/setup_data that is used by legacy x86 bootloaders to pass the command line and initial ramdisk parameters, among other things. The setup block also contains the PE/COFF header of the entire combined image, which includes the compressed kernel image, the decompressor and the EFI stub. This PE header describes the layout of the executable image in memory, and currently, the fact that the setup block precedes it makes it rather fiddly to get the right values into the right place in the final image. Let's make things a bit easier by defining the setup_size in the linker script so it can be referenced from the asm code directly, rather than having to rely on the build tool to calculate it. For the time being, add 64 bytes of fixed padding for the .reloc and .compat sections - this will be removed in a subsequent patch after the PE/COFF header has been reorganized. This change has no impact on the resulting bzImage binary when configured with CONFIG_EFI_MIXED=y. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915171623.655440-13-ardb@google.com