summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2023-09-14x86/elf: Make loading of 32bit processes depend on ia32_enabled()Nikolay Borisov
Major aspect of ia32 emulation is the ability to load 32bit processes. That's currently decided (among others) by compat_elf_check_arch(). Make the macro use ia32_enabled() to decide if IA32 compat is enabled before loading a 32bit process. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230623111409.3047467-5-nik.borisov@suse.com
2023-09-14x86/entry: Compile entry_SYSCALL32_ignore() unconditionallyNikolay Borisov
To limit the IA32 exposure on 64bit kernels while keeping the flexibility for the user to enable it when required, the compile time enable/disable via CONFIG_IA32_EMULATION is not good enough and will be complemented with a kernel command line option. Right now entry_SYSCALL32_ignore() is only compiled when CONFIG_IA32_EMULATION=n, but boot-time enable- / disablement obviously requires it to be unconditionally available. Remove the #ifndef CONFIG_IA32_EMULATION guard. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230623111409.3047467-4-nik.borisov@suse.com
2023-09-14x86/entry: Rename ignore_sysret()Nikolay Borisov
The SYSCALL instruction cannot really be disabled in compatibility mode. The best that can be done is to configure the CSTAR msr to point to a minimal handler. Currently this handler has a rather misleading name - ignore_sysret() as it's not really doing anything with sysret. Give it a more descriptive name. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230623111409.3047467-3-nik.borisov@suse.com
2023-09-14x86: Introduce ia32_enabled()Nikolay Borisov
IA32 support on 64bit kernels depends on whether CONFIG_IA32_EMULATION is selected or not. As it is a compile time option it doesn't provide the flexibility to have distributions set their own policy for IA32 support and give the user the flexibility to override it. As a first step introduce ia32_enabled() which abstracts whether IA32 compat is turned on or off. Upcoming patches will implement the ability to set IA32 compat state at boot time. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230623111409.3047467-2-nik.borisov@suse.com
2023-09-13x86/sched: Restore the SD_ASYM_PACKING flag in the DIE domainRicardo Neri
Commit 8f2d6c41e5a6 ("x86/sched: Rewrite topology setup") dropped the SD_ASYM_PACKING flag in the DIE domain added in commit 044f0e27dec6 ("x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors"). Restore it on hybrid processors. The die-level domain does not depend on any build configuration and now x86_sched_itmt_flags() is always needed. Remove the build dependency on CONFIG_SCHED_[SMT|CLUSTER|MC]. Fixes: 8f2d6c41e5a6 ("x86/sched: Rewrite topology setup") Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Caleb Callaway <caleb.callaway@intel.com> Link: https://lkml.kernel.org/r/20230815035747.11529-1-ricardo.neri-calderon@linux.intel.com
2023-09-12x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GPKai Huang
SEAMCALL instruction causes #UD if the CPU isn't in VMX operation. Currently the TDX_MODULE_CALL assembly doesn't handle #UD, thus making SEAMCALL when VMX is disabled would cause Oops. Unfortunately, there are legal cases that SEAMCALL can be made when VMX is disabled. For instance, VMX can be disabled due to emergency reboot while there are still TDX guests running. Extend the TDX_MODULE_CALL assembly to return an error code for #UD to handle this case gracefully, e.g., KVM can then quietly eat all SEAMCALL errors caused by emergency reboot. SEAMCALL instruction also causes #GP when TDX isn't enabled by the BIOS. Use _ASM_EXTABLE_FAULT() to catch both exceptions with the trap number recorded, and define two new error codes by XORing the trap number to the TDX_SW_ERROR. This opportunistically handles #GP too while using the same simple assembly code. A bonus is when kernel mistakenly calls SEAMCALL when CPU isn't in VMX operation, or when TDX isn't enabled by the BIOS, or when the BIOS is buggy, the kernel can get a nicer error code rather than a less understandable Oops. This is basically based on Peter's code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/de975832a367f476aab2d0eb0d9de66019a16b54.1692096753.git.kai.huang%40intel.com
2023-09-12x86/virt/tdx: Wire up basic SEAMCALL functionsKai Huang
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This mode runs only the TDX module itself or other code to load the TDX module. The host kernel communicates with SEAM software via a new SEAMCALL instruction. This is conceptually similar to a guest->host hypercall, except it is made from the host to SEAM software instead. The TDX module establishes a new SEAMCALL ABI which allows the host to initialize the module and to manage VMs. The SEAMCALL ABI is very similar to the TDCALL ABI and leverages much TDCALL infrastructure. Wire up basic functions to make SEAMCALLs for the basic support of running TDX guests: __seamcall(), __seamcall_ret(), and __seamcall_saved_ret() for TDH.VP.ENTER. All SEAMCALLs involved in the basic TDX support don't use "callee-saved" registers as input and output, except the TDH.VP.ENTER. To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST to opt-in TDX host kernel support (to distinguish with TDX guest kernel support). So far only KVM uses TDX. Make the new config option depend on KVM_INTEL. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Isaku Yamahata <isaku.yamahata@intel.com> Link: https://lore.kernel.org/all/4db7c3fc085e6af12acc2932294254ddb3d320b3.1692096753.git.kai.huang%40intel.com
2023-09-12x86/tdx: Remove 'struct tdx_hypercall_args'Kai Huang
Now 'struct tdx_hypercall_args' is basically 'struct tdx_module_args' minus RCX. Although from __tdx_hypercall()'s perspective RCX isn't used as shared register thus not part of input/output registers, it's not worth to have a separate structure just due to one register. Remove the 'struct tdx_hypercall_args' and use 'struct tdx_module_args' instead in __tdx_hypercall() related code. This also saves the memory copy between the two structures within __tdx_hypercall(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/798dad5ce24e9d745cf0e16825b75ccc433ad065.1692096753.git.kai.huang%40intel.com
2023-09-12x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asmKai Huang
Now the TDX_HYPERCALL asm is basically identical to the TDX_MODULE_CALL with both '\saved' and '\ret' enabled, with two minor things though: 1) The way to restore the structure pointer is different The TDX_HYPERCALL uses RCX as spare to restore the structure pointer, but the TDX_MODULE_CALL assumes no spare register can be used. In other words, TDX_MODULE_CALL already covers what TDX_HYPERCALL does. 2) TDX_MODULE_CALL only clears shared registers for TDH.VP.ENTER For this just need to make that code available for the non-host case. Thus, remove the TDX_HYPERCALL and reimplement the __tdx_hypercall() using the TDX_MODULE_CALL. Extend the TDX_MODULE_CALL to cover "clear shared registers" for TDG.VP.VMCALL. Introduce a new __tdcall_saved_ret() to replace the temporary __tdcall_hypercall(). The __tdcall_saved_ret() can also be used for those new TDCALLs which require more input/output registers than the basic TDCALLs do. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/e68a2473fb6f5bcd78b078cae7510e9d0753b3df.1692096753.git.kai.huang%40intel.com
2023-09-12x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALLKai Huang
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro share similar code pattern too. The __tdx_hypercall() and __tdcall() should be unified to use the same assembly code. As a preparation to unify them, simplify the TDX_HYPERCALL to make it more like the TDX_MODULE_CALL. The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as function call argument, and does below extra things comparing to the TDX_MODULE_CALL: 1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally; 2) It sets RCX to the (fixed) bitmap of shared registers internally; 3) It calls __tdx_hypercall_failed() internally (and panics) when the TDCALL instruction itself fails; 4) After TDCALL, it moves R10 to RAX to return the return code of the VMCALL leaf, regardless the '\ret' asm macro argument; Firstly, change the TDX_HYPERCALL to take the same function call arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer to 'struct tdx_module_args'. Then 1) and 2) can be moved to the caller: - TDG.VP.VMCALL leaf ID can be passed via the function call argument; - 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus the bitmap of shared registers can be passed via RCX in the structure. Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL always save output registers to the structure. The caller then can: - Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error; - Return R10 in the structure as the return code of the VMCALL leaf; With above changes, change the asm function from __tdx_hypercall() to __tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper of it. This avoids having to add another wrapper of __tdx_hypercall() (_tdx_hypercall() is already taken). The __tdcall_hypercall() will be replaced with a __tdcall() variant using TDX_MODULE_CALL in a later commit as the final goal is to have one assembly to handle both TDCALL and TDVMCALL. Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep this unchanged, annotate __tdx_hypercall(), which is a C function now, as 'noinstr'. Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so. Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the compressed code. Opportunistically fix a checkpatch error complaining using space around parenthesis '(' and ')' while moving the bitmap of shared registers to <asm/shared/tdx.h>. [ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-09-12x86/numa: Introduce numa_fill_memblks()Alison Schofield
numa_fill_memblks() fills in the gaps in numa_meminfo memblks over an physical address range. The ACPI driver will use numa_fill_memblks() to implement a new Linux policy that prescribes extending proximity domains in a portion of a CFMWS window to the entire window. Dan Williams offered this explanation of the policy: A CFWMS is an ACPI data structure that indicates *potential* locations where CXL memory can be placed. It is the playground where the CXL driver has free reign to establish regions. That space can be populated by BIOS created regions, or driver created regions, after hotplug or other reconfiguration. When BIOS creates a region in a CXL Window it additionally describes that subset of the Window range in the other typical ACPI tables SRAT, SLIT, and HMAT. The rationale for BIOS not pre-describing the entire CXL Window in SRAT, SLIT, and HMAT is that it can not predict the future. I.e. there is nothing stopping higher or lower performance devices being placed in the same Window. Compare that to ACPI memory hotplug that just onlines additional capacity in the proximity domain with little freedom for dynamic performance differentiation. That leaves the OS with a choice, should unpopulated window capacity match the proximity domain of an existing region, or should it allocate a new one? This patch takes the simple position of minimizing proximity domain proliferation by reusing any proximity domain intersection for the entire Window. If the Window has no intersections then allocate a new proximity domain. Note that SRAT, SLIT and HMAT information can be enumerated dynamically in a standard way from device provided data. Think of CXL as the end of ACPI needing to describe memory attributes, CXL offers a standard discovery model for performance attributes, but Linux still needs to interoperate with the old regime. Reported-by: Derick Marks <derick.w.marks@intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Derick Marks <derick.w.marks@intel.com> Link: https://lore.kernel.org/all/ef078a6f056ca974e5af85997013c0fda9e3326d.1689018477.git.alison.schofield%40intel.com
2023-09-12bpf, x64: Fix tailcall infinite loopLeon Hwang
From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT"), the tailcall on x64 works better than before. From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms for x64 JIT"), tailcall is able to run in BPF subprograms on x64. From commit 5b92a28aae4dd0f8 ("bpf: Support attaching tracing BPF program to other BPF programs"), BPF program is able to trace other BPF programs. How about combining them all together? 1. FENTRY/FEXIT on a BPF subprogram. 2. A tailcall runs in the BPF subprogram. 3. The tailcall calls the subprogram's caller. As a result, a tailcall infinite loop comes up. And the loop would halt the machine. As we know, in tail call context, the tail_call_cnt propagates by stack and rax register between BPF subprograms. So do in trampolines. Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230912150442.2009-3-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12bpf, x64: Comment tail_call_cnt initialisationLeon Hwang
Without understanding emit_prologue(), it is really hard to figure out where does tail_call_cnt come from, even though searching tail_call_cnt in the whole kernel repo. By adding these comments, it is a little bit easier to understand tail_call_cnt initialisation. Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230912150442.2009-2-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12x86/ibt: Avoid duplicate ENDBR in __put_user_nocheck*()Peter Zijlstra
Commit cb855971d717 ("x86/putuser: Provide room for padding") changed __put_user_nocheck_*() into proper functions but failed to note that SYM_FUNC_START() already provides ENDBR, rendering the explicit ENDBR superfluous. Fixes: cb855971d717 ("x86/putuser: Provide room for padding") Reported-by: David Kaplan <David.Kaplan@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230802110323.086971726@infradead.org
2023-09-12x86/ibt: Suppress spurious ENDBRPeter Zijlstra
It was reported that under certain circumstances GCC emits ENDBR instructions for _THIS_IP_ usage. Specifically, when it appears at the start of a basic block -- but not elsewhere. Since _THIS_IP_ is never used for control flow, these ENDBR instructions are completely superfluous. Override the _THIS_IP_ definition for x86_64 to avoid this. Less ENDBR instructions is better. Fixes: 156ff4a544ae ("x86/ibt: Base IBT bits") Reported-by: David Kaplan <David.Kaplan@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230802110323.016197440@infradead.org
2023-09-12perf/x86/intel: Extend the ref-cycles event to GP countersKan Liang
The current ref-cycles event is only available on the fixed counter 2. Starting from the GLC and GRT core, the architectural UnHalted Reference Cycles event (0x013c) which is available on general-purpose counters can collect the exact same events as the fixed counter 2. Update the mapping of ref-cycles to 0x013c. So the ref-cycles can be available on both fixed counter 2 and general-purpose counters. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230911144139.2354015-1-kan.liang@linux.intel.com
2023-09-12perf/x86/intel: Fix broken fixed event constraints extensionKan Liang
Unnecessary multiplexing is triggered when running an "instructions" event on an MTL. perf stat -e cpu_core/instructions/,cpu_core/instructions/ -a sleep 1 Performance counter stats for 'system wide': 115,489,000 cpu_core/instructions/ (50.02%) 127,433,777 cpu_core/instructions/ (49.98%) 1.002294504 seconds time elapsed Linux architectural perf events, e.g., cycles and instructions, usually have dedicated fixed counters. These events also have equivalent events which can be used in the general-purpose counters. The counters are precious. In the intel_pmu_check_event_constraints(), perf check/extend the event constraints of these events. So these events can utilize both fixed counters and general-purpose counters. The following cleanup commit: 97588df87b56 ("perf/x86/intel: Add common intel_pmu_init_hybrid()") forgot adding the intel_pmu_check_event_constraints() into update_pmu_cap(). The architectural perf events cannot utilize the general-purpose counters. The code to check and update the counters, event constraints and extra_regs is the same among hybrid systems. Move intel_pmu_check_hybrid_pmus() to init_hybrid_pmu(), and emove the duplicate check in update_pmu_cap(). Fixes: 97588df87b56 ("perf/x86/intel: Add common intel_pmu_init_hybrid()") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230911135128.2322833-1-kan.liang@linux.intel.com
2023-09-11x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafsKai Huang
The TDX guest live migration support (TDX 1.5) adds new TDCALL/SEAMCALL leaf functions. Those new TDCALLs/SEAMCALLs take additional registers for input (R10-R13) and output (R12-R13). TDG.SERVTD.RD is an example. Also, the current TDX_MODULE_CALL doesn't aim to handle TDH.VP.ENTER SEAMCALL, which monitors the TDG.VP.VMCALL in input/output registers when it returns in case of VMCALL from TDX guest. With those new TDCALLs/SEAMCALLs and the TDH.VP.ENTER covered, the TDX_MODULE_CALL macro basically needs to handle the same input/output registers as the TDX_HYPERCALL does. And as a result, they also share similar logic in the assembly, thus should be unified to use one common assembly. Extend the TDX_MODULE_CALL asm to support the new TDCALLs/SEAMCALLs and also the TDH.VP.ENTER SEAMCALL. Eventually it will be unified with the TDX_HYPERCALL. The new input/output registers fit with the "callee-saved" registers in the x86 calling convention. Add a new "saved" parameter to support those new TDCALLs/SEAMCALLs and TDH.VP.ENTER and keep the existing TDCALLs/SEAMCALLs minimally impacted. For TDH.VP.ENTER, after it returns the registers shared by the guest contain guest's values. Explicitly clear them to prevent speculative use of guest's values. Note most TDX live migration related SEAMCALLs may also clobber AVX* state ("AVX, AVX2 and AVX512 state: may be reset to the architectural INIT state" -- see TDH.EXPORT.MEM for example). And TDH.VP.ENTER also clobbers XMM0-XMM15 when the corresponding bit is set in RCX. Don't handle them in the TDX_MODULE_CALL macro but let the caller save and restore when needed. This is basically based on Peter's code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/d4785de7c392f7c5684407f6c24a73b92148ec49.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structureKai Huang
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and SEAMCALL, takes one parameter for each input register and an optional 'struct tdx_module_output' (a collection of output registers) as output. This is different from the TDX_HYPERCALL macro which uses a single 'struct tdx_hypercall_args' to carry all input/output registers. The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more input/output registers. Also, the TDH.VP.ENTER (which isn't covered by the current TDX_MODULE_CALL macro) basically can use all registers that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't extendible to cover those cases. Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro to use a single structure 'struct tdx_module_args' to carry all the input/output registers. Currently, R10/R11 are only used as output register but not as input by any TDCALL/SEAMCALL. Change to also use R10/R11 as input register to make input/output registers symmetric. Currently, the TDX_MODULE_CALL macro depends on the caller to pass a non-NULL 'struct tdx_module_output' to get additional output registers. Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to take a new 'ret' macro argument to indicate whether to save the output registers to the 'struct tdx_module_args'. Also introduce a new __tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret(). Note the tdcall(), which is a wrapper of __tdcall(), is called by three callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init(). The former two need the additional output but the last one doesn't. For simplicity, make tdcall() always call __tdcall_ret() to avoid another "_ret()" wrapper. The last caller tdx_early_init() isn't performance critical anyway. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Rename __tdx_module_call() to __tdcall()Kai Huang
__tdx_module_call() is only used by the TDX guest to issue TDCALL to the TDX module. Rename it to __tdcall() to match its behaviour, e.g., it cannot be used to make host-side SEAMCALL. Also rename tdx_module_call() which is a wrapper of __tdx_module_call() to tdcall(). No functional change intended. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/785d20d99fbcd0db8262c94da6423375422d8c75.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Make macros of TDCALLs consistent with the specKai Huang
The TDX spec names all TDCALLs with prefix "TDG". Currently, the kernel doesn't follow such convention for the macros of those TDCALLs but uses prefix "TDX_" for all of them. Although it's arguable whether the TDX spec names those TDCALLs properly, it's better for the kernel to follow the spec when naming those macros. Change all macros of TDCALLs to make them consistent with the spec. As a bonus, they get distinguished easily from the host-side SEAMCALLs, which all have prefix "TDH". No functional change intended. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/516dccd0bd8fb9a0b6af30d25bb2d971aa03d598.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Skip saving output regs when SEAMCALL fails with VMFailInvalidKai Huang
If SEAMCALL fails with VMFailInvalid, the SEAM software (e.g., the TDX module) won't have chance to set any output register. Skip saving the output registers to the structure in this case. Also, as '.Lno_output_struct' is the very last symbol before RET, rename it to '.Lout' to make it short. Opportunistically make the asm directives unindented. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/704088f5b4d72c7e24084f7f15bd1ac5005b7213.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Zero out the missing RSI in TDX_HYPERCALL macroKai Huang
In the TDX_HYPERCALL asm, after the TDCALL instruction returns from the untrusted VMM, the registers that the TDX guest shares to the VMM need to be cleared to avoid speculative execution of VMM-provided values. RSI is specified in the bitmap of those registers, but it is missing when zeroing out those registers in the current TDX_HYPERCALL. It was there when it was originally added in commit 752d13305c78 ("x86/tdx: Expand __tdx_hypercall() to handle more arguments"), but was later removed in commit 1e70c680375a ("x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall()"), which was correct because %rsi is later restored in the "pop %rsi". However a later commit 7a3a401874be ("x86/tdx: Drop flags from __tdx_hypercall()") removed that "pop %rsi" but forgot to add the "xor %rsi, %rsi" back. Fix by adding it back. Fixes: 7a3a401874be ("x86/tdx: Drop flags from __tdx_hypercall()") Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/e7d1157074a0b45d34564d5f17f3e0ffee8115e9.1692096753.git.kai.huang%40intel.com
2023-09-11x86/tdx: Retry partially-completed page conversion hypercallsDexuan Cui
TDX guest memory is private by default and the VMM may not access it. However, in cases where the guest needs to share data with the VMM, the guest and the VMM can coordinate to make memory shared between them. The guest side of this protocol includes the "MapGPA" hypercall. This call takes a guest physical address range. The hypercall spec (aka. the GHCI) says that the MapGPA call is allowed to return partial progress in mapping this range and indicate that fact with a special error code. A guest that sees such partial progress is expected to retry the operation for the portion of the address range that was not completed. Hyper-V does this partial completion dance when set_memory_decrypted() is called to "decrypt" swiotlb bounce buffers that can be up to 1GB in size. It is evidently the only VMM that does this, which is why nobody noticed this until now. [ dhansen: rewrite changelog ] Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20230811021246.821-2-decui%40microsoft.com
2023-09-11x86/platform/uv: Use alternate source for socket to node dataSteve Wahl
The UV code attempts to build a set of tables to allow it to do bidirectional socket<=>node lookups. But when nr_cpus is set to a smaller number than actually present, the cpu_to_node() mapping information for unused CPUs is not available to build_socket_tables(). This results in skipping some nodes or sockets when creating the tables and leaving some -1's for later code to trip. over, causing oopses. The problem is that the socket<=>node lookups are created by doing a loop over all CPUs, then looking up the CPU's APICID and socket. But if a CPU is not present, there is no way to start this lookup. Instead of looping over all CPUs, take CPUs out of the equation entirely. Loop over all APICIDs which are mapped to a valid NUMA node. Then just extract the socket-id from the APICID. This avoid tripping over disabled CPUs. Fixes: 8a50c5851927 ("x86/platform/uv: UV support for sub-NUMA clustering") Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20230807141730.1117278-1-steve.wahl%40hpe.com
2023-09-11efi/x86: Ensure that EFI_RUNTIME_MAP is enabled for kexecArd Biesheuvel
CONFIG_EFI_RUNTIME_MAP needs to be enabled in order for kexec to be able to provide the required information about the EFI runtime mappings to the incoming kernel, regardless of whether kexec_load() or kexec_file_load() is being used. Without this information, kexec boot in EFI mode is not possible. The CONFIG_EFI_RUNTIME_MAP option is currently directly configurable if CONFIG_EXPERT is enabled, so that it can be turned on for debugging purposes even if KEXEC is not enabled. However, the upshot of this is that it can also be disabled even when it shouldn't. So tweak the Kconfig declarations to avoid this situation. Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-09-11efi/x86: Move EFI runtime call setup/teardown helpers out of lineArd Biesheuvel
Only the arch_efi_call_virt() macro that some architectures override needs to be a macro, given that it is variadic and encapsulates calls via function pointers that have different prototypes. The associated setup and teardown code are not special in this regard, and don't need to be instantiated at each call site. So turn them into ordinary C functions and move them out of line. Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-09-10Merge tag 'x86-urgent-2023-09-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Fix preemption delays in the SGX code, remove unnecessarily UAPI-exported code, fix a ld.lld linker (in)compatibility quirk and make the x86 SMP init code a bit more conservative to fix kexec() lockups" * tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Break up long non-preemptible delays in sgx_vepc_release() x86: Remove the arch_calc_vm_prot_bits() macro from the UAPI x86/build: Fix linker fill bytes quirk/incompatibility for ld.lld x86/smp: Don't send INIT to non-present and non-booted CPUs
2023-09-10Merge tag 'perf-urgent-2023-09-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf event fix from Ingo Molnar: "Work around a firmware bug in the uncore PMU driver, affecting certain Intel systems" * tag 'perf-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/uncore: Correct the number of CHAs on EMR
2023-09-07Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - Clean up vCPU targets, always returning generic v8 as the preferred target - Trap forwarding infrastructure for nested virtualization (used for traps that are taken from an L2 guest and are needed by the L1 hypervisor) - FEAT_TLBIRANGE support to only invalidate specific ranges of addresses when collapsing a table PTE to a block PTE. This avoids that the guest refills the TLBs again for addresses that aren't covered by the table PTE. - Fix vPMU issues related to handling of PMUver. - Don't unnecessary align non-stack allocations in the EL2 VA space - Drop HCR_VIRT_EXCP_MASK, which was never used... - Don't use smp_processor_id() in kvm_arch_vcpu_load(), but the cpu parameter instead - Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort() - Remove prototypes without implementations RISC-V: - Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest - Added ONE_REG interface for SATP mode - Added ONE_REG interface to enable/disable multiple ISA extensions - Improved error codes returned by ONE_REG interfaces - Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V - Added get-reg-list selftest for KVM RISC-V s390: - PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch) Allows a PV guest to use crypto cards. Card access is governed by the firmware and once a crypto queue is "bound" to a PV VM every other entity (PV or not) looses access until it is not bound anymore. Enablement is done via flags when creating the PV VM. - Guest debug fixes (Ilya) x86: - Clean up KVM's handling of Intel architectural events - Intel bugfixes - Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use debug registers and generate/handle #DBs - Clean up LBR virtualization code - Fix a bug where KVM fails to set the target pCPU during an IRTE update - Fix fatal bugs in SEV-ES intrahost migration - Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it) - Retry APIC map recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR cannot diverge from the default when TSC scaling is disabled up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID - Rip out the ancient MMU_DEBUG crud and replace the useful bits with CONFIG_KVM_PROVE_MMU - Fix KVM's handling of !visible guest roots to avoid premature triple fault injection - Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the API surface that is needed by external users (currently only KVMGT), and fix a variety of issues in the process Generic: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. - Drop unused function declarations Selftests: - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs - Add support for printf() in guest code and covert all guest asserts to use printf-based reporting - Clean up the PMU event filter test and add new testcases - Include x86 selftests in the KVM x86 MAINTAINERS entry" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (279 commits) KVM: x86/mmu: Include mmu.h in spte.h KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots KVM: x86/mmu: Disallow guest from using !visible slots for page tables KVM: x86/mmu: Harden TDP MMU iteration against root w/o shadow page KVM: x86/mmu: Harden new PGD against roots without shadow pages KVM: x86/mmu: Add helper to convert root hpa to shadow page drm/i915/gvt: Drop final dependencies on KVM internal details KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callers KVM: x86/mmu: Drop @slot param from exported/external page-track APIs KVM: x86/mmu: Bug the VM if write-tracking is used but not enabled KVM: x86/mmu: Assert that correct locks are held for page write-tracking KVM: x86/mmu: Rename page-track APIs to reflect the new reality KVM: x86/mmu: Drop infrastructure for multiple page-track modes KVM: x86/mmu: Use page-track notifiers iff there are external users KVM: x86/mmu: Move KVM-only page-track declarations to internal header KVM: x86: Remove the unused page-track hook track_flush_slot() drm/i915/gvt: switch from ->track_flush_slot() to ->track_remove_region() KVM: x86: Add a new page-track hook to handle memslot deletion drm/i915/gvt: Don't bother removing write-protection on to-be-deleted slot KVM: x86: Reject memslot MOVE operations if KVMGT is attached ...
2023-09-07x86/asm/bitops: Use __builtin_clz{l|ll} to evaluate constant expressionsNick Desaulniers
Micro-optimize the bitops code some more, similar to commits: fdb6649ab7c1 ("x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions") 2fcff790dcb4 ("powerpc: Use builtin functions for fls()/__fls()/fls64()") From a recent discussion, I noticed that x86 is lacking an optimization that appears in arch/powerpc/include/asm/bitops.h related to constant folding. If you add a BUILD_BUG_ON(__builtin_constant_p(param)) to these functions, you'll find that there were cases where the use of inline asm pessimized the compiler's ability to perform constant folding resulting in runtime calculation of a value that could have been computed at compile time. Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230828-x86_fls-v1-1-e6a31b9f79c3@google.com
2023-09-06x86/sgx: Break up long non-preemptible delays in sgx_vepc_release()Jack Wang
On large enclaves we hit the softlockup warning with following call trace: xa_erase() sgx_vepc_release() __fput() task_work_run() do_exit() The latency issue is similar to the one fixed in: 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") The test system has 64GB of enclave memory, and all is assigned to a single VM. Release of 'vepc' takes a longer time and causes long latencies, which triggers the softlockup warning. Add cond_resched() to give other tasks a chance to run and reduce latencies, which also avoids the softlockup detector. [ mingo: Rewrote the changelog. ] Fixes: 540745ddbc70 ("x86/sgx: Introduce virtual EPC for use by KVM guests") Reported-by: Yu Zhang <yu.zhang@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Yu Zhang <yu.zhang@ionos.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Haitao Huang <haitao.huang@linux.intel.com> Cc: stable@vger.kernel.org
2023-09-06x86: Remove the arch_calc_vm_prot_bits() macro from the UAPIThomas Huth
The arch_calc_vm_prot_bits() macro uses VM_PKEY_BIT0 etc. which are not part of the UAPI, so the macro is completely useless for userspace. It is also hidden behind the CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS config switch which we shouldn't expose to userspace. Thus let's move this macro into a new internal header instead. Fixes: 8f62c883222c ("x86/mm/pkeys: Add arch-specific VMA protection bits") Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/r/20230906162658.142511-1-thuth@redhat.com
2023-09-06x86/build: Fix linker fill bytes quirk/incompatibility for ld.lldSong Liu
With ":text =0xcccc", ld.lld fills unused text area with 0xcccc0000. Example objdump -D output: ffffffff82b04203: 00 00 add %al,(%rax) ffffffff82b04205: cc int3 ffffffff82b04206: cc int3 ffffffff82b04207: 00 00 add %al,(%rax) ffffffff82b04209: cc int3 ffffffff82b0420a: cc int3 Replace it with ":text =0xcccccccc", so we get the following instead: ffffffff82b04203: cc int3 ffffffff82b04204: cc int3 ffffffff82b04205: cc int3 ffffffff82b04206: cc int3 ffffffff82b04207: cc int3 ffffffff82b04208: cc int3 gcc/ld doesn't seem to have the same issue. The generated code stays the same for gcc/ld. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Fixes: 7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes") Link: https://lore.kernel.org/r/20230906175215.2236033-1-song@kernel.org
2023-09-05perf/x86/uncore: Correct the number of CHAs on EMRKan Liang
Starting from SPR, the basic uncore PMON information is retrieved from the discovery table (resides in an MMIO space populated by BIOS). It is called the discovery method. The existing value of the type->num_boxes is from the discovery table. On some SPR variants, there is a firmware bug that makes the value from the discovery table incorrect. We use the value from the SPR_MSR_UNC_CBO_CONFIG MSR to replace the one from the discovery table: 38776cc45eb7 ("perf/x86/uncore: Correct the number of CHAs on SPR") Unfortunately, the SPR_MSR_UNC_CBO_CONFIG isn't available for the EMR XCC (Always returns 0), but the above firmware bug doesn't impact the EMR XCC. Don't let the value from the MSR replace the existing value from the discovery table. Fixes: 38776cc45eb7 ("perf/x86/uncore: Correct the number of CHAs on SPR") Reported-by: Stephane Eranian <eranian@google.com> Reported-by: Yunying Sun <yunying.sun@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Yunying Sun <yunying.sun@intel.com> Link: https://lore.kernel.org/r/20230905134248.496114-1-kan.liang@linux.intel.com
2023-09-05Merge tag 'kbuild-v6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Enable -Wenum-conversion warning option - Refactor the rpm-pkg target - Fix scripts/setlocalversion to consider annotated tags for rt-kernel - Add a jump key feature for the search menu of 'make nconfig' - Support Qt6 for 'make xconfig' - Enable -Wformat-overflow, -Wformat-truncation, -Wstringop-overflow, and -Wrestrict warnings for W=1 builds - Replace <asm/export.h> with <linux/export.h> for alpha, ia64, and sparc - Support DEB_BUILD_OPTIONS=parallel=N for the debian source package - Refactor scripts/Makefile.modinst and fix some modules_sign issues - Add a new Kconfig env variable to warn symbols that are not defined anywhere - Show help messages of config fragments in 'make help' * tag 'kbuild-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (62 commits) kconfig: fix possible buffer overflow kbuild: Show marked Kconfig fragments in "help" kconfig: add warn-unknown-symbols sanity check kbuild: dummy-tools: make MPROFILE_KERNEL checks work on BE Documentation/llvm: refresh docs modpost: Skip .llvm.call-graph-profile section check kbuild: support modules_sign for external modules as well kbuild: support 'make modules_sign' with CONFIG_MODULE_SIG_ALL=n kbuild: move more module installation code to scripts/Makefile.modinst kbuild: reduce the number of mkdir calls during modules_install kbuild: remove $(MODLIB)/source symlink kbuild: move depmod rule to scripts/Makefile.modinst kbuild: add modules_sign to no-{compiler,sync-config}-targets kbuild: do not run depmod for 'make modules_sign' kbuild: deb-pkg: support DEB_BUILD_OPTIONS=parallel=N in debian/rules alpha: remove <asm/export.h> alpha: replace #include <asm/export.h> with #include <linux/export.h> ia64: remove <asm/export.h> ia64: replace #include <asm/export.h> with #include <linux/export.h> sparc: remove <asm/export.h> ...
2023-09-04Merge tag 'uml-for-linus-6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull UML updates from Richard Weinberger: - Drop 32-bit checksum implementation and re-use it from arch/x86 - String function cleanup - Fixes for -Wmissing-variable-declarations and -Wmissing-prototypes builds * tag 'uml-for-linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: um: virt-pci: fix missing declaration warning um: Refactor deprecated strncpy to memcpy um: fix 3 instances of -Wmissing-prototypes um: port_kern: fix -Wmissing-variable-declarations uml: audio: fix -Wmissing-variable-declarations um: vector: refactor deprecated strncpy um: use obj-y to descend into arch/um/*/ um: Hard-code the result of 'uname -s' um: Use the x86 checksum implementation on 32-bit asm-generic: current: Don't include thread-info.h if building asm um: Remove unsued extern declaration ldt_host_info() um: Fix hostaudio build errors um: Remove strlcpy usage
2023-09-04Merge tag 'hyperv-next-signed-20230902' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Support for SEV-SNP guests on Hyper-V (Tianyu Lan) - Support for TDX guests on Hyper-V (Dexuan Cui) - Use SBRM API in Hyper-V balloon driver (Mitchell Levy) - Avoid dereferencing ACPI root object handle in VMBus driver (Maciej Szmigiero) - A few misecllaneous fixes (Jiapeng Chong, Nathan Chancellor, Saurabh Sengar) * tag 'hyperv-next-signed-20230902' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits) x86/hyperv: Remove duplicate include x86/hyperv: Move the code in ivm.c around to avoid unnecessary ifdef's x86/hyperv: Remove hv_isolation_type_en_snp x86/hyperv: Use TDX GHCI to access some MSRs in a TDX VM with the paravisor Drivers: hv: vmbus: Bring the post_msg_page back for TDX VMs with the paravisor x86/hyperv: Introduce a global variable hyperv_paravisor_present Drivers: hv: vmbus: Support >64 VPs for a fully enlightened TDX/SNP VM x86/hyperv: Fix serial console interrupts for fully enlightened TDX guests Drivers: hv: vmbus: Support fully enlightened TDX guests x86/hyperv: Support hypercalls for fully enlightened TDX guests x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests x86/hyperv: Fix undefined reference to isolation_type_en_snp without CONFIG_HYPERV x86/hyperv: Add missing 'inline' to hv_snp_boot_ap() stub hv: hyperv.h: Replace one-element array with flexible-array member Drivers: hv: vmbus: Don't dereference ACPI root object handle x86/hyperv: Add hyperv-specific handling for VMMCALL under SEV-ES x86/hyperv: Add smp support for SEV-SNP guest clocksource: hyper-v: Mark hyperv tsc page unencrypted in sev-snp enlightened guest x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest drivers: hv: Mark percpu hvcall input arg page unencrypted in SEV-SNP enlightened guest ...
2023-09-04x86/smp: Don't send INIT to non-present and non-booted CPUsThomas Gleixner
Vasant reported that kexec() can hang or reset the machine when it tries to park CPUs via INIT. This happens when the kernel is using extended APIC, but the present mask has APIC IDs >= 0x100 enumerated. As extended APIC can only handle 8 bit of APIC ID sending INIT to APIC ID 0x100 sends INIT to APIC ID 0x0. That's the boot CPU which is special on x86 and INIT causes the system to hang or resets the machine. Prevent this by sending INIT only to those CPUs which have been booted once. Fixes: 45e34c8af58f ("x86/smp: Put CPUs into INIT on shutdown if possible") Reported-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/87cyzwjbff.ffs@tglx
2023-09-04kbuild: Show marked Kconfig fragments in "help"Kees Cook
Currently the Kconfig fragments in kernel/configs and arch/*/configs that aren't used internally aren't discoverable through "make help", which consists of hard-coded lists of config fragments. Instead, list all the fragment targets that have a "# Help: " comment prefix so the targets can be generated dynamically. Add logic to the Makefile to search for and display the fragment and comment. Add comments to fragments that are intended to be direct targets. Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-09-01Merge tag 'x86-urgent-2023-09-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Dave Hansen: "The most important fix here adds a missing CPU model to the recent Gather Data Sampling (GDS) mitigation list to ensure that mitigations are available on that CPU. There are also a pair of warning fixes, and closure of a covert channel that pops up when protection keys are disabled. Summary: - Mark all Skylake CPUs as vulnerable to GDS - Fix PKRU covert channel - Fix -Wmissing-variable-declarations warning for ia32_xyz_class - Fix kernel-doc annotation warning" * tag 'x86-urgent-2023-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu/xstate: Fix PKRU covert channel x86/irq/i8259: Fix kernel-doc annotation warning x86/speculation: Mark all Skylake CPUs as vulnerable to GDS x86/audit: Fix -Wmissing-variable-declarations warning for ia32_xyz_class
2023-09-01Merge tag 'char-misc-6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the big set of char/misc and other small driver subsystem changes for 6.6-rc1. Stuff all over the place here, lots of driver updates and changes and new additions. Short summary is: - new IIO drivers and updates - Interconnect driver updates - fpga driver updates and additions - fsi driver updates - mei driver updates - coresight driver updates - nvmem driver updates - counter driver updates - lots of smaller misc and char driver updates and additions All of these have been in linux-next for a long time with no reported problems" * tag 'char-misc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (267 commits) nvmem: core: Notify when a new layout is registered nvmem: core: Do not open-code existing functions nvmem: core: Return NULL when no nvmem layout is found nvmem: core: Create all cells before adding the nvmem device nvmem: u-boot-env:: Replace zero-length array with DECLARE_FLEX_ARRAY() helper nvmem: sec-qfprom: Add Qualcomm secure QFPROM support dt-bindings: nvmem: sec-qfprom: Add bindings for secure qfprom dt-bindings: nvmem: Add compatible for QCM2290 nvmem: Kconfig: Fix typo "drive" -> "driver" nvmem: Explicitly include correct DT includes nvmem: add new NXP QorIQ eFuse driver dt-bindings: nvmem: Add t1023-sfp efuse support dt-bindings: nvmem: qfprom: Add compatible for MSM8226 nvmem: uniphier: Use devm_platform_get_and_ioremap_resource() nvmem: qfprom: do some cleanup nvmem: stm32-romem: Use devm_platform_get_and_ioremap_resource() nvmem: rockchip-efuse: Use devm_platform_get_and_ioremap_resource() nvmem: meson-mx-efuse: Convert to devm_platform_ioremap_resource() nvmem: lpc18xx_otp: Convert to devm_platform_ioremap_resource() nvmem: brcm_nvram: Use devm_platform_get_and_ioremap_resource() ...
2023-09-01Merge tag 'driver-core-6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is a small set of driver core updates and additions for 6.6-rc1. Included in here are: - stable kernel documentation updates - class structure const work from Ivan on various subsystems - kernfs tweaks - driver core tests! - kobject sanity cleanups - kobject structure reordering to save space - driver core error code handling fixups - other minor driver core cleanups All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits) driver core: Call in reversed order in device_platform_notify_remove() driver core: Return proper error code when dev_set_name() fails kobject: Remove redundant checks for whether ktype is NULL kobject: Add sanity check for kset->kobj.ktype in kset_register() drivers: base: test: Add missing MODULE_* macros to root device tests drivers: base: test: Add missing MODULE_* macros for platform devices tests drivers: base: Free devm resources when unregistering a device drivers: base: Add basic devm tests for platform devices drivers: base: Add basic devm tests for root devices kernfs: fix missing kernfs_iattr_rwsem locking docs: stable-kernel-rules: mention that regressions must be prevented docs: stable-kernel-rules: fine-tune various details docs: stable-kernel-rules: make the examples for option 1 a proper list docs: stable-kernel-rules: move text around to improve flow docs: stable-kernel-rules: improve structure by changing headlines base/node: Remove duplicated include kernfs: attach uuid for every kernfs and report it in fsid kernfs: add stub helper for kernfs_generic_poll() x86/resctrl: make pseudo_lock_class a static const structure x86/MSR: make msr_class a static const structure ...
2023-08-31x86/fpu/xstate: Fix PKRU covert channelJim Mattson
When XCR0[9] is set, PKRU can be read and written from userspace with XSAVE and XRSTOR, even when CR4.PKE is clear. Clear XCR0[9] when protection keys are disabled. Reported-by: Tavis Ormandy <taviso@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230831043228.1194256-1-jmattson@google.com
2023-08-31Merge tag 'x86_shstk_for_6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...
2023-08-31x86/irq/i8259: Fix kernel-doc annotation warningVincenzo Palazzo
Fix this warning: arch/x86/kernel/i8259.c:235: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * ELCR registers (0x4d0, 0x4d1) control edge/level of IRQ CC arch/x86/kernel/irqinit.o Signed-off-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230830131211.88226-1-vincenzopalazzodev@gmail.com
2023-08-31x86/speculation: Mark all Skylake CPUs as vulnerable to GDSDave Hansen
The Gather Data Sampling (GDS) vulnerability is common to all Skylake processors. However, the "client" Skylakes* are now in this list: https://www.intel.com/content/www/us/en/support/articles/000022396/processors.html which means they are no longer included for new vulnerabilities here: https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html or in other GDS documentation. Thus, they were not included in the original GDS mitigation patches. Mark SKYLAKE and SKYLAKE_L as vulnerable to GDS to match all the other Skylake CPUs (which include Kaby Lake). Also group the CPUs so that the ones that share the exact same vulnerabilities are next to each other. Last, move SRBDS to the end of each line. This makes it clear at a glance that SKYLAKE_X is unique. Of the five Skylakes, it is the only "server" CPU and has a different implementation from the clients of the "special register" hardware, making it immune to SRBDS. This makes the diff much harder to read, but the resulting table is worth it. I very much appreciate the report from Michael Zhivich about this issue. Despite what level of support a hardware vendor is providing, the kernel very much needs an accurate and up-to-date list of vulnerable CPUs. More reports like this are very welcome. * Client Skylakes are CPUID 406E3/506E3 which is family 6, models 0x4E and 0x5E, aka INTEL_FAM6_SKYLAKE and INTEL_FAM6_SKYLAKE_L. Reported-by: Michael Zhivich <mzhivich@akamai.com> Fixes: 8974eb588283 ("x86/speculation: Add Gather Data Sampling mitigation") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-31KVM: x86/mmu: Include mmu.h in spte.hSean Christopherson
Explicitly include mmu.h in spte.h instead of relying on the "parent" to include mmu.h. spte.h references a variety of macros and variables that are defined/declared in mmu.h, and so including spte.h before (or instead of) mmu.h will result in build errors, e.g. arch/x86/kvm/mmu/spte.h: In function ‘is_mmio_spte’: arch/x86/kvm/mmu/spte.h:242:23: error: ‘enable_mmio_caching’ undeclared 242 | likely(enable_mmio_caching); | ^~~~~~~~~~~~~~~~~~~ arch/x86/kvm/mmu/spte.h: In function ‘is_large_pte’: arch/x86/kvm/mmu/spte.h:302:22: error: ‘PT_PAGE_SIZE_MASK’ undeclared 302 | return pte & PT_PAGE_SIZE_MASK; | ^~~~~~~~~~~~~~~~~ arch/x86/kvm/mmu/spte.h: In function ‘is_dirty_spte’: arch/x86/kvm/mmu/spte.h:332:56: error: ‘PT_WRITABLE_MASK’ undeclared 332 | return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK; | ^~~~~~~~~~~~~~~~ Fixes: 5a9624affe7c ("KVM: mmu: extract spte.h and spte.c") Link: https://lore.kernel.org/r/20230808224059.2492476-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest rootsSean Christopherson
When attempting to allocate a shadow root for a !visible guest root gfn, e.g. that resides in MMIO space, load a dummy root that is backed by the zero page instead of immediately synthesizing a triple fault shutdown (using the zero page ensures any attempt to translate memory will generate a !PRESENT fault and thus VM-Exit). Unless the vCPU is racing with memslot activity, KVM will inject a page fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e. the end result is mostly same, but critically KVM will inject a fault only *after* KVM runs the vCPU with the bogus root. Waiting to inject a fault until after running the vCPU fixes a bug where KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled and a !visible root. Even though a bad root will *probably* lead to shutdown, (a) it's not guaranteed and (b) the CPU won't read the underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with a VMX preemption timer value of '0', then architecturally the preemption timer VM-Exit is guaranteed to occur before the CPU executes any instruction, i.e. before the CPU needs to translate a GPA to a HPA (so long as there are no injected events with higher priority than the preemption timer). If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because userspace created a memslot between installing the dummy root and handling the page fault, simply unload the MMU to allocate a new root and retry the instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and conceptually the dummy root has indeeed become obsolete. The only difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is that the root has become obsolete due to memslot *creation*, not memslot deletion or movement. Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Disallow guest from using !visible slots for page tablesSean Christopherson
Explicitly inject a page fault if guest attempts to use a !visible gfn as a page table. kvm_vcpu_gfn_to_hva_prot() will naturally handle the case where there is no memslot, but doesn't catch the scenario where the gfn points at a KVM-internal memslot. Letting the guest backdoor its way into accessing KVM-internal memslots isn't dangerous on its own, e.g. at worst the guest can crash itself, but disallowing the behavior will simplify fixing how KVM handles !visible guest root gfns (immediately synthesizing a triple fault when loading the root is architecturally wrong). Link: https://lore.kernel.org/r/20230729005200.1057358-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>