summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2023-03-23x86,objtool: Introduce ORC_TYPE_*Josh Poimboeuf
Unwind hints and ORC entry types are two distinct things. Separate them out more explicitly. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/cc879d38fff8a43f8f7beb2fd56e35a5a384d7cd.1677683419.git.jpoimboe@kernel.org
2023-03-22x86: Simplify one-level sysctl registration for itmt_kern_tableLuis Chamberlain
There is no need to declare an extra tables to just create directory, this can be easily be done with a prefix path with register_sysctl(). Simplify this registration. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230310233248.3965389-3-mcgrof%40kernel.org
2023-03-22x86/ACPI/boot: Improve __acpi_acquire_global_lockUros Bizjak
Improve __acpi_acquire_global_lock by using a temporary variable. This enables compiler to perform if-conversion and improves generated code from: ... 72a: d1 ea shr %edx 72c: 83 e1 fc and $0xfffffffc,%ecx 72f: 83 e2 01 and $0x1,%edx 732: 09 ca or %ecx,%edx 734: 83 c2 02 add $0x2,%edx 737: f0 0f b1 17 lock cmpxchg %edx,(%rdi) 73b: 75 e9 jne 726 <__acpi_acquire_global_lock+0x6> 73d: 83 e2 03 and $0x3,%edx 740: 31 c0 xor %eax,%eax 742: 83 fa 03 cmp $0x3,%edx 745: 0f 95 c0 setne %al 748: f7 d8 neg %eax to: ... 72a: d1 e9 shr %ecx 72c: 83 e2 fc and $0xfffffffc,%edx 72f: 83 e1 01 and $0x1,%ecx 732: 09 ca or %ecx,%edx 734: 83 c2 02 add $0x2,%edx 737: f0 0f b1 17 lock cmpxchg %edx,(%rdi) 73b: 75 e9 jne 726 <__acpi_acquire_global_lock+0x6> 73d: 8d 41 ff lea -0x1(%rcx),%eax BTW: the compiler could generate: lea 0x2(%rcx,%rdx,1),%edx instead of: or %ecx,%edx add $0x2,%edx but unwated conversion from add to or when bits are known to be zero prevents this improvement. This is GCC PR108477. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/all/20230320212012.12704-1-ubizjak%40gmail.com
2023-03-22x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_uabi_buf()Chang S. Bae
__copy_xstate_to_uabi_buf() copies either from the tasks XSAVE buffer or from init_fpstate into the ptrace buffer. Dynamic features, like XTILEDATA, have an all zeroes init state and are not saved in init_fpstate, which means the corresponding bit is not set in the xfeatures bitmap of the init_fpstate header. But __copy_xstate_to_uabi_buf() retrieves addresses for both the tasks xstate and init_fpstate unconditionally via __raw_xsave_addr(). So if the tasks XSAVE buffer has a dynamic feature set, then the address retrieval for init_fpstate triggers the warning in __raw_xsave_addr() which checks the feature bit in the init_fpstate header. Remove the address retrieval from init_fpstate for extended features. They have an all zeroes init state so init_fpstate has zeros for them. Then zeroing the user buffer for the init state is the same as copying them from init_fpstate. Fixes: 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode") Reported-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/kvm/20230221163655.920289-2-mizhang@google.com/ Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/all/20230227210504.18520-2-chang.seok.bae%40intel.com Cc: stable@vger.kernel.org
2023-03-21ftrace: selftest: remove broken trace_direct_trampMark Rutland
The ftrace selftest code has a trace_direct_tramp() function which it uses as a direct call trampoline. This happens to work on x86, since the direct call's return address is in the usual place, and can be returned to via a RET, but in general the calling convention for direct calls is different from regular function calls, and requires a trampoline written in assembly. On s390, regular function calls place the return address in %r14, and an ftrace patch-site in an instrumented function places the trampoline's return address (which is within the instrumented function) in %r0, preserving the original %r14 value in-place. As a regular C function will return to the address in %r14, using a C function as the trampoline results in the trampoline returning to the caller of the instrumented function, skipping the body of the instrumented function. Note that the s390 issue is not detcted by the ftrace selftest code, as the instrumented function is trivial, and returning back into the caller happens to be equivalent. On arm64, regular function calls place the return address in x30, and an ftrace patch-site in an instrumented function saves this into r9 and places the trampoline's return address (within the instrumented function) in x30. A regular C function will return to the address in x30, but will not restore x9 into x30. Consequently, using a C function as the trampoline results in returning to the trampoline's return address having corrupted x30, such that when the instrumented function returns, it will return back into itself. To avoid future issues in this area, remove the trace_direct_tramp() function, and require that each architecture with direct calls provides a stub trampoline, named ftrace_stub_direct_tramp. This can be written to handle the architecture's trampoline calling convention, and in future could be used elsewhere (e.g. in the ftrace ops sample, to measure the overhead of direct calls), so we may as well always build it in. Link: https://lkml.kernel.org/r/20230321140424.345218-8-revest@chromium.org Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Florent Revest <revest@chromium.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21x86/sev: Change snp_guest_issue_request()'s fw_err argumentDionna Glaze
The GHCB specification declares that the firmware error value for a guest request will be stored in the lower 32 bits of EXIT_INFO_2. The upper 32 bits are for the VMM's own error code. The fw_err argument to snp_guest_issue_request() is thus a misnomer, and callers will need access to all 64 bits. The type of unsigned long also causes problems, since sw_exit_info2 is u64 (unsigned long long) vs the argument's unsigned long*. Change this type for issuing the guest request. Pass the ioctl command struct's error field directly instead of in a local variable, since an incomplete guest request may not set the error code, and uninitialized stack memory would be written back to user space. The firmware might not even be called, so bookend the call with the no firmware call error and clear the error. Since the "fw_err" field is really exitinfo2 split into the upper bits' vmm error code and lower bits' firmware error code, convert the 64 bit value to a union. [ bp: - Massage commit message - adjust code - Fix a build issue as Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303070609.vX6wp2Af-lkp@intel.com - print exitinfo2 in hex Tom: - Correct -EIO exit case. ] Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230214164638.1189804-5-dionnaglaze@google.com Link: https://lore.kernel.org/r/20230307192449.24732-12-bp@alien8.de
2023-03-21x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()David Woodhouse
When bringing up a secondary CPU from do_boot_cpu(), the warm reset flag is set in CMOS and the starting IP for the trampoline written inside the BDA at 0x467. Once the CPU is running, the CMOS flag is unset and the value in the BDA cleared. To allow for parallel bringup of CPUs, add a reference count to track the number of CPUs currently bring brought up, and clear the state only when the count reaches zero. Since the RTC spinlock is required to write to the CMOS, it can be used for mutual exclusion on the refcount too. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Kim Phillips <kim.phillips@amd.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Link: https://lore.kernel.org/r/20230316222109.1940300-5-usama.arif@bytedance.com
2023-03-21x86/smpboot: Remove initial_gsBrian Gerst
Given its CPU#, each CPU can find its own per-cpu offset, and directly set GSBASE accordingly. The global variable can be eliminated. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Usama Arif <usama.arif@bytedance.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230316222109.1940300-9-usama.arif@bytedance.com
2023-03-21x86/smpboot: Remove early_gdt_descr on 64-bitBrian Gerst
Build the GDT descriptor on the stack instead. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Usama Arif <usama.arif@bytedance.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230316222109.1940300-8-usama.arif@bytedance.com
2023-03-21x86/smpboot: Remove initial_stack on 64-bitBrian Gerst
In order to facilitate parallel startup, start to eliminate some of the global variables passing information to CPUs in the startup path. However, start by introducing one more: smpboot_control. For now this merely holds the CPU# of the CPU which is coming up. Each CPU can then find its own per-cpu data, and everything else it needs can be found from there, allowing the other global variables to be removed. First to be removed is initial_stack. Each CPU can load %rsp from its current_task->thread.sp instead. That is already set up with the correct idle thread for APs. Set up the .sp field in INIT_THREAD on x86 so that the BSP also finds a suitable stack pointer in the static per-cpu data when coming up on first boot. On resume from S3, the CPU needs a temporary stack because its idle task is already active. Instead of setting initial_stack, the sleep code can simply set its own current->thread.sp to point to the temporary stack. Nobody else cares about ->thread.sp for a thread which is currently on a CPU, because the true value is actually in the %rsp register. Which is restored with the rest of the CPU context in do_suspend_lowlevel(). Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Usama Arif <usama.arif@bytedance.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230316222109.1940300-7-usama.arif@bytedance.com
2023-03-21x86/apic/x2apic: Allow CPU cluster_mask to be populated in parallelDavid Woodhouse
Each of the sibling CPUs in a cluster uses the same clustermask. The first CPU in a cluster will need a new clustermask allocated, while subsequent siblings will use the same clustermask as the first. However, the CPU being brought up cannot yet perform memory allocations at the point that this occurs in init_x2apic_ldr(). So at present, the alloc_clustermask() function allocates a clustermask just in case it's needed, storing it in the global cluster_hotplug_mask. A CPU which is the first sibling of a cluster will "take" it from there and set cluster_hotplug_mask to NULL, in order for alloc_clustermask() to allocate a new one before bringing up the next CPU. To facilitate parallel bringup of CPUs in future, switch to a model where alloc_clustermask() prepopulates the clustermask in the per_cpu data for each present CPU in the cluster in advance. All that the CPU needs to do for itself in init_x2apic_ldr() is set its own bit in that mask. The 'node' and 'clusterid' members of struct cluster_mask are thus redundant, and it can become a simple struct cpumask instead. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Kim Phillips <kim.phillips@amd.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Link: https://lore.kernel.org/r/20230316222109.1940300-2-usama.arif@bytedance.com
2023-03-19x86/MCE/AMD: Use an u64 for bank_mapMuralidhara M K
Thee maximum number of MCA banks is 64 (MAX_NR_BANKS), see a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64"). However, the bank_map which contains a bitfield of which banks to initialize is of type unsigned int and that overflows when those bit numbers are >= 32, leading to UBSAN complaining correctly: UBSAN: shift-out-of-bounds in arch/x86/kernel/cpu/mce/amd.c:1365:38 shift exponent 32 is too large for 32-bit type 'int' Change the bank_map to a u64 and use the proper BIT_ULL() macro when modifying bits in there. [ bp: Rewrite commit message. ] Fixes: a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64") Signed-off-by: Muralidhara M K <muralimk@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230127151601.1068324-1-muralimk@amd.com
2023-03-19Merge tag 'ras_urgent_for_v6.3_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fix from Borislav Petkov: - Flush out logged errors immediately after MCA banks configuration changes over sysfs have been done instead of waiting until something else triggers the workqueue later - another error or the polling interval cycle is reached * tag 'ras_urgent_for_v6.3_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Make sure logged MCEs are processed after sysfs update
2023-03-19Merge tag 'x86_urgent_for_v6.3_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "There's a little bit more 'movement' in there for my taste but it needs to happen and should make the code better after it. - Check cmdline_find_option()'s return value before further processing - Clear temporary storage in the resctrl code to prevent access to an unexistent MSR - Add a simple throttling mechanism to protect the hypervisor from potentially malicious SEV guests issuing requests in rapid succession. In order to not jeopardize the sanity of everyone involved in maintaining this code, the request issuing side has received a cleanup, split in more or less trivial, small and digestible pieces. Otherwise, the code was threatening to become an unmaintainable mess. Therefore, that cleanup is marked indirectly also for stable so that there's no differences between the upstream code and the stable variant when it comes down to backporting more there" * tag 'x86_urgent_for_v6.3_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Fix use of uninitialized buffer in sme_enable() x86/resctrl: Clear staged_config[] before and after it is used virt/coco/sev-guest: Add throttling awareness virt/coco/sev-guest: Convert the sw_exit_info_2 checking to a switch-case virt/coco/sev-guest: Do some code style cleanups virt/coco/sev-guest: Carve out the request issuing logic into a helper virt/coco/sev-guest: Remove the disable_vmpck label in handle_guest_request() virt/coco/sev-guest: Simplify extended guest request handling virt/coco/sev-guest: Check SEV_SNP attribute at probe time
2023-03-17x86/umwait: move to use bus_get_dev_root()Greg Kroah-Hartman
Direct access to the struct bus_type dev_root pointer is going away soon so replace that with a call to bus_get_dev_root() instead, which is what it is there for. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20230313182918.1312597-10-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17x86/microcode: move to use bus_get_dev_root()Greg Kroah-Hartman
Direct access to the struct bus_type dev_root pointer is going away soon so replace that with a call to bus_get_dev_root() instead, which is what it is there for. Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20230313182918.1312597-9-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17driver core: class: remove module * from class_create()Greg Kroah-Hartman
The module pointer in class_create() never actually did anything, and it shouldn't have been requred to be set as a parameter even if it did something. So just remove it and fix up all callers of the function in the kernel tree at the same time. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17x86/paravirt: Convert simple paravirt functions to asmJuergen Gross
All functions referenced via __PV_IS_CALLEE_SAVE() need to be assembler functions, as those functions calls are hidden from the compiler. In case the kernel is compiled with "-fzero-call-used-regs" the compiler will clobber caller-saved registers at the end of C functions, which will result in unexpectedly zeroed registers at the call site of the related paravirt functions. Replace the C functions with DEFINE_PARAVIRT_ASM() constructs using the same instructions as the related paravirt calls in the PVOP_ALT_[V]CALLEE*() macros. And since they're not C functions visible to the compiler anymore, latter won't do the callee-clobbered zeroing invoked by -fzero-call-used-regs and thus won't corrupt registers. [ bp: Extend commit message. ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230317063325.361-1-jgross@suse.com
2023-03-17x86/hyperv: Block root partition functionality in a Confidential VMMichael Kelley
Hyper-V should never specify a VM that is a Confidential VM and also running in the root partition. Nonetheless, explicitly block such a combination to guard against a compromised Hyper-V maliciously trying to exploit root partition functionality in a Confidential VM to expose Confidential VM secrets. No known bug is being fixed, but the attack surface for Confidential VMs on Hyper-V is reduced. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/1678894453-95392-1-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-03-16x86/mm/iommu/sva: Make LAM and SVA mutually exclusiveKirill A. Shutemov
IOMMU and SVA-capable devices know nothing about LAM and only expect canonical addresses. An attempt to pass down tagged pointer will lead to address translation failure. By default do not allow to enable both LAM and use SVA in the same process. The new ARCH_FORCE_TAGGED_SVA arch_prctl() overrides the limitation. By using the arch_prctl() userspace takes responsibility to never pass tagged address to the device. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Ashok Raj <ashok.raj@intel.com> Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230312112612.31869-12-kirill.shutemov%40linux.intel.com
2023-03-16iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()Kirill A. Shutemov
Kernel has few users of pasid_valid() and all but one checks if the process has PASID allocated. The helper takes ioasid_t as the input. Replace the helper with mm_valid_pasid() that takes mm_struct as the argument. The only call that checks PASID that is not tied to mm_struct is open-codded now. This is preparatory patch. It helps avoid ifdeffery: no need to dereference mm->pasid in generic code to check if the process has PASID. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230312112612.31869-11-kirill.shutemov%40linux.intel.com
2023-03-16x86/mm: Provide arch_prctl() interface for LAMKirill A. Shutemov
Add a few of arch_prctl() handles: - ARCH_ENABLE_TAGGED_ADDR enabled LAM. The argument is required number of tag bits. It is rounded up to the nearest LAM mode that can provide it. For now only LAM_U57 is supported, with 6 tag bits. - ARCH_GET_UNTAG_MASK returns untag mask. It can indicates where tag bits located in the address. - ARCH_GET_MAX_TAG_BITS returns the maximum tag bits user can request. Zero if LAM is not supported. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-9-kirill.shutemov%40linux.intel.com
2023-03-16x86/uaccess: Provide untagged_addr() and remove tags before address checkKirill A. Shutemov
untagged_addr() is a helper used by the core-mm to strip tag bits and get the address to the canonical shape based on rules of the current thread. It only handles userspace addresses. The untagging mask is stored in per-CPU variable and set on context switching to the task. The tags must not be included into check whether it's okay to access the userspace address. Strip tags in access_ok(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-7-kirill.shutemov%40linux.intel.com
2023-03-16x86: Allow atomic MM_CONTEXT flags settingKirill A. Shutemov
So far there's no need in atomic setting of MM context flags in mm_context_t::flags. The flags set early in exec and never change after that. LAM enabling requires atomic flag setting. The upcoming flag MM_CONTEXT_FORCE_TAGGED_SVA can be set much later in the process lifetime where multiple threads exist. Convert the field to unsigned long and do MM_CONTEXT_* accesses with __set_bit() and test_bit(). No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-3-kirill.shutemov%40linux.intel.com
2023-03-16x86/split_lock: Enumerate architectural split lock disable bitFenghua Yu
The December 2022 edition of the Intel Instruction Set Extensions manual defined that the split lock disable bit in the IA32_CORE_CAPABILITIES MSR is (and retrospectively always has been) architectural. Remove all the model specific checks except for Ice Lake variants which are still needed because these CPU models do not enumerate presence of the IA32_CORE_CAPABILITIES MSR. Originally-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/lkml/20220701131958.687066-1-fenghua.yu@intel.com/t/#mada243bee0915532a6adef6a9e32d244d1a9aef4
2023-03-16x86/CPU/AMD: Make sure EFER[AIBRSE] is setBorislav Petkov (AMD)
The AutoIBRS bit gets set only on the BSP as part of determining which mitigation to enable on AMD. Setting on the APs relies on the circumstance that the APs get booted through the trampoline and EFER - the MSR which contains that bit - gets replicated on every AP from the BSP. However, this can change in the future and considering the security implications of this bit not being set on every CPU, make sure it is set by verifying EFER later in the boot process and on every AP. Reported-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230224185257.o3mcmloei5zqu7wa@treble
2023-03-15x86/resctrl: Avoid redundant counter read in __mon_event_count()Peter Newman
__mon_event_count() does the per-RMID, per-domain work for user-initiated event count reads and the initialization of new monitor groups. In the initialization case, after resctrl_arch_reset_rmid() calls __rmid_read() to record an initial count for a new monitor group, it immediately calls resctrl_arch_rmid_read(). This re-read of the hardware counter is unnecessary and the following computations are ignored by the caller during initialization. Following return from resctrl_arch_reset_rmid(), just clear the mbm_state and return. This involves moving the mbm_state lookup into the rr->first case, as it's not needed for regular event count reads: the QOS_L3_OCCUP_EVENT_ID case was redundant with the accumulating logic at the end of the function. Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/all/20221220164132.443083-2-peternewman%40google.com
2023-03-15x86/resctrl: Clear staged_config[] before and after it is usedShawn Wang
As a temporary storage, staged_config[] in rdt_domain should be cleared before and after it is used. The stale value in staged_config[] could cause an MSR access error. Here is a reproducer on a system with 16 usable CLOSIDs for a 15-way L3 Cache (MBA should be disabled if the number of CLOSIDs for MB is less than 16.) : mount -t resctrl resctrl -o cdp /sys/fs/resctrl mkdir /sys/fs/resctrl/p{1..7} umount /sys/fs/resctrl/ mount -t resctrl resctrl /sys/fs/resctrl mkdir /sys/fs/resctrl/p{1..8} An error occurs when creating resource group named p8: unchecked MSR access error: WRMSR to 0xca0 (tried to write 0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60) Call Trace: <IRQ> __flush_smp_call_function_queue+0x11d/0x170 __sysvec_call_function+0x24/0xd0 sysvec_call_function+0x89/0xc0 </IRQ> <TASK> asm_sysvec_call_function+0x16/0x20 When creating a new resource control group, hardware will be configured by the following process: rdtgroup_mkdir() rdtgroup_mkdir_ctrl_mon() rdtgroup_init_alloc() resctrl_arch_update_domains() resctrl_arch_update_domains() iterates and updates all resctrl_conf_type whose have_new_ctrl is true. Since staged_config[] holds the same values as when CDP was enabled, it will continue to update the CDP_CODE and CDP_DATA configurations. When group p8 is created, get_config_index() called in resctrl_arch_update_domains() will return 16 and 17 as the CLOSIDs for CDP_CODE and CDP_DATA, which will be translated to an invalid register - 0xca0 in this scenario. Fix it by clearing staged_config[] before and after it is used. [reinette: re-order commit tags] Fixes: 75408e43509e ("x86/resctrl: Allow different CODE/DATA configurations to be staged") Suggested-by: Xin Hao <xhao@linux.alibaba.com> Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/2fad13f49fbe89687fc40e9a5a61f23a28d1507a.1673988935.git.reinette.chatre%40intel.com
2023-03-14Merge tag 'trace-v6.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Do not allow histogram values to have modifies. They can cause a NULL pointer dereference if they do. - Warn if hist_field_name() is passed a NULL. Prevent the NULL pointer dereference mentioned above. - Fix invalid address look up race in lookup_rec() - Define ftrace_stub_graph conditionally to prevent linker errors - Always check if RCU is watching at all tracepoint locations * tag 'trace-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Make tracepoint lockdep check actually test something ftrace,kcfi: Define ftrace_stub_graph conditionally ftrace: Fix invalid address access in lookup_rec() when index is 0 tracing: Check field value in hist_field_name() tracing: Do not let histogram values have some modifiers
2023-03-13virt/coco/sev-guest: Add throttling awarenessDionna Glaze
A potentially malicious SEV guest can constantly hammer the hypervisor using this driver to send down requests and thus prevent or at least considerably hinder other guests from issuing requests to the secure processor which is a shared platform resource. Therefore, the host is permitted and encouraged to throttle such guest requests. Add the capability to handle the case when the hypervisor throttles excessive numbers of requests issued by the guest. Otherwise, the VM platform communication key will be disabled, preventing the guest from attesting itself. Realistically speaking, a well-behaved guest should not even care about throttling. During its lifetime, it would end up issuing a handful of requests which the hardware can easily handle. This is more to address the case of a malicious guest. Such guest should get throttled and if its VMPCK gets disabled, then that's its own wrongdoing and perhaps that guest even deserves it. To the implementation: the hypervisor signals with SNP_GUEST_REQ_ERR_BUSY that the guest requests should be throttled. That error code is returned in the upper 32-bit half of exitinfo2 and this is part of the GHCB spec v2. So the guest is given a throttling period of 1 minute in which it retries the request every 2 seconds. This is a good default but if it turns out to not pan out in practice, it can be tweaked later. For safety, since the encryption algorithm in GHCBv2 is AES_GCM, control must remain in the kernel to complete the request with the current sequence number. Returning without finishing the request allows the guest to make another request but with different message contents. This is IV reuse, and breaks cryptographic protections. [ bp: - Rewrite commit message and do a simplified version. - The stable tags are supposed to denote that a cleanup should go upfront before backporting this so that any future fixes to this can preserve the sanity of the backporter(s). ] Fixes: d5af44dde546 ("x86/sev: Provide support for SNP guest request NAEs") Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> # d6fd48eff750 ("virt/coco/sev-guest: Check SEV_SNP attribute at probe time") Cc: <stable@kernel.org> # 970ab823743f (" virt/coco/sev-guest: Simplify extended guest request handling") Cc: <stable@kernel.org> # c5a338274bdb ("virt/coco/sev-guest: Remove the disable_vmpck label in handle_guest_request()") Cc: <stable@kernel.org> # 0fdb6cc7c89c ("virt/coco/sev-guest: Carve out the request issuing logic into a helper") Cc: <stable@kernel.org> # d25bae7dc7b0 ("virt/coco/sev-guest: Do some code style cleanups") Cc: <stable@kernel.org> # fa4ae42cc60a ("virt/coco/sev-guest: Convert the sw_exit_info_2 checking to a switch-case") Link: https://lore.kernel.org/r/20230214164638.1189804-2-dionnaglaze@google.com
2023-03-13virt/coco/sev-guest: Convert the sw_exit_info_2 checking to a switch-caseBorislav Petkov (AMD)
snp_issue_guest_request() checks the value returned by the hypervisor in sw_exit_info_2 and returns a different error depending on it. Convert those checks into a switch-case to make it more readable when more error values are going to be checked in the future. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230307192449.24732-8-bp@alien8.de
2023-03-13virt/coco/sev-guest: Simplify extended guest request handlingBorislav Petkov (AMD)
Return a specific error code - -ENOSPC - to signal the too small cert data buffer instead of checking exit code and exitinfo2. While at it, hoist the *fw_err assignment in snp_issue_guest_request() so that a proper error value is returned to the callers. [ Tom: check override_err instead of err. ] Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230307192449.24732-4-bp@alien8.de
2023-03-13virt/coco/sev-guest: Check SEV_SNP attribute at probe timeBorislav Petkov (AMD)
No need to check it on every ioctl. And yes, this is a common SEV driver but it does only SNP-specific operations currently. This can be revisited later, when more use cases appear. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230307192449.24732-3-bp@alien8.de
2023-03-12x86/mce: Make sure logged MCEs are processed after sysfs updateYazen Ghannam
A recent change introduced a flag to queue up errors found during boot-time polling. These errors will be processed during late init once the MCE subsystem is fully set up. A number of sysfs updates call mce_restart() which goes through a subset of the CPU init flow. This includes polling MCA banks and logging any errors found. Since the same function is used as boot-time polling, errors will be queued. However, the system is now past late init, so the errors will remain queued until another error is found and the workqueue is triggered. Call mce_schedule_work() at the end of mce_restart() so that queued errors are processed. Fixes: 3bff147b187d ("x86/mce: Defer processing of early errors") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230301221420.2203184-1-yazen.ghannam@amd.com
2023-03-12Merge tag 'x86_urgent_for_v6.3_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: "A single erratum fix for AMD machines: - Disable XSAVES on AMD Zen1 and Zen2 machines due to an erratum. No impact to anything as those machines will fallback to XSAVEC which is equivalent there" * tag 'x86_urgent_for_v6.3_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/CPU/AMD: Disable XSAVES on AMD family 0x17
2023-03-09ftrace,kcfi: Define ftrace_stub_graph conditionallyArnd Bergmann
When CONFIG_FUNCTION_GRAPH_TRACER is disabled, __kcfi_typeid_ftrace_stub_graph is missing, causing a link failure: ld.lld: error: undefined symbol: __kcfi_typeid_ftrace_stub_graph referenced by arch/x86/kernel/ftrace_64.o:(__cfi_ftrace_stub_graph) in archive vmlinux.a Mark the reference to it as conditional on the same symbol, as is done on arm64. Link: https://lore.kernel.org/linux-trace-kernel/20230131093643.3850272-1-arnd@kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Fixes: 883bbbffa5a4 ("ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()") See-also: 2598ac6ec493 ("arm64: ftrace: Define ftrace_stub_graph only with FUNCTION_GRAPH_TRACER") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-09module: replace module_layout with module_memorySong Liu
module_layout manages different types of memory (text, data, rodata, etc.) in one allocation, which is problematic for some reasons: 1. It is hard to enable CONFIG_STRICT_MODULE_RWX. 2. It is hard to use huge pages in modules (and not break strict rwx). 3. Many archs uses module_layout for arch-specific data, but it is not obvious how these data are used (are they RO, RX, or RW?) Improve the scenario by replacing 2 (or 3) module_layout per module with up to 7 module_memory per module: MOD_TEXT, MOD_DATA, MOD_RODATA, MOD_RO_AFTER_INIT, MOD_INIT_TEXT, MOD_INIT_DATA, MOD_INIT_RODATA, and allocating them separately. This adds slightly more entries to mod_tree (from up to 3 entries per module, to up to 7 entries per module). However, this at most adds a small constant overhead to __module_address(), which is expected to be fast. Various archs use module_layout for different data. These data are put into different module_memory based on their location in module_layout. IOW, data that used to go with text is allocated with MOD_MEM_TYPE_TEXT; data that used to go with data is allocated with MOD_MEM_TYPE_DATA, etc. module_memory simplifies quite some of the module code. For example, ARCH_WANTS_MODULES_DATA_IN_VMALLOC is a lot cleaner, as it just uses a different allocator for the data. kernel/module/strict_rwx.c is also much cleaner with module_memory. Signed-off-by: Song Liu <song@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-08x86/resctl: fix scheduler confusion with 'current'Linus Torvalds
The implementation of 'current' on x86 is very intentionally special: it is a very common thing to look up, and it uses 'this_cpu_read_stable()' to get the current thread pointer efficiently from per-cpu storage. And the keyword in there is 'stable': the current thread pointer never changes as far as a single thread is concerned. Even if when a thread is preempted, or moved to another CPU, or even across an explicit call 'schedule()' that thread will still have the same value for 'current'. It is, after all, the kernel base pointer to thread-local storage. That's why it's stable to begin with, but it's also why it's important enough that we have that special 'this_cpu_read_stable()' access for it. So this is all done very intentionally to allow the compiler to treat 'current' as a value that never visibly changes, so that the compiler can do CSE and combine multiple different 'current' accesses into one. However, there is obviously one very special situation when the currently running thread does actually change: inside the scheduler itself. So the scheduler code paths are special, and do not have a 'current' thread at all. Instead there are _two_ threads: the previous and the next thread - typically called 'prev' and 'next' (or prev_p/next_p) internally. So this is all actually quite straightforward and simple, and not all that complicated. Except for when you then have special code that is run in scheduler context, that code then has to be aware that 'current' isn't really a valid thing. Did you mean 'prev'? Did you mean 'next'? In fact, even if then look at the code, and you use 'current' after the new value has been assigned to the percpu variable, we have explicitly told the compiler that 'current' is magical and always stable. So the compiler is quite free to use an older (or newer) value of 'current', and the actual assignment to the percpu storage is not relevant even if it might look that way. Which is exactly what happened in the resctl code, that blithely used 'current' in '__resctrl_sched_in()' when it really wanted the new process state (as implied by the name: we're scheduling 'into' that new resctl state). And clang would end up just using the old thread pointer value at least in some configurations. This could have happened with gcc too, and purely depends on random compiler details. Clang just seems to have been more aggressive about moving the read of the per-cpu current_task pointer around. The fix is trivial: just make the resctl code adhere to the scheduler rules of using the prev/next thread pointer explicitly, instead of using 'current' in a situation where it just wasn't valid. That same code is then also used outside of the scheduler context (when a thread resctl state is explicitly changed), and then we will just pass in 'current' as that pointer, of course. There is no ambiguity in that case. The fix may be trivial, but noticing and figuring out what went wrong was not. The credit for that goes to Stephane Eranian. Reported-by: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/ Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/ Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Stephane Eranian <eranian@google.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-08x86/cpu: Expose arch_cpu_idle_dead()'s prototype definitionPhilippe Mathieu-Daudé
Include <linux/cpu.h> to make sure arch_cpu_idle_dead() matches its prototype going forward. Inspired-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230214083857.50163-1-philmd@linaro.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-03-08sched/idle: Mark arch_cpu_idle_dead() __noreturnJosh Poimboeuf
Before commit 076cbf5d2163 ("x86/xen: don't let xen_pv_play_dead() return"), in Xen, when a previously offlined CPU was brought back online, it unexpectedly resumed execution where it left off in the middle of the idle loop. There were some hacks to make that work, but the behavior was surprising as do_idle() doesn't expect an offlined CPU to return from the dead (in arch_cpu_idle_dead()). Now that Xen has been fixed, and the arch-specific implementations of arch_cpu_idle_dead() also don't return, give it a __noreturn attribute. This will cause the compiler to complain if an arch-specific implementation might return. It also improves code generation for both caller and callee. Also fixes the following warning: vmlinux.o: warning: objtool: do_idle+0x25f: unreachable instruction Reported-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/60d527353da8c99d4cf13b6473131d46719ed16d.1676358308.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-03-08x86/cpu: Mark play_dead() __noreturnJosh Poimboeuf
play_dead() doesn't return. Annotate it as such. By extension this also makes arch_cpu_idle_dead() noreturn. Link: https://lore.kernel.org/r/f3a069e6869c51ccfdda656b76882363bc9fcfa4.1676358308.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-03-08x86/CPU/AMD: Disable XSAVES on AMD family 0x17Andrew Cooper
AMD Erratum 1386 is summarised as: XSAVES Instruction May Fail to Save XMM Registers to the Provided State Save Area This piece of accidental chronomancy causes the %xmm registers to occasionally reset back to an older value. Ignore the XSAVES feature on all AMD Zen1/2 hardware. The XSAVEC instruction (which works fine) is equivalent on affected parts. [ bp: Typos, move it into the F17h-specific function. ] Reported-by: Tavis Ormandy <taviso@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230307174643.1240184-1-andrew.cooper3@citrix.com
2023-03-08x86/mce: Always inline old MCA stubsBorislav Petkov (AMD)
The stubs for the ancient MCA support (CONFIG_X86_ANCIENT_MCE) are normally optimized away on 64-bit builds. However, an allmodconfig one causes the compiler to add sanitizer calls gunk into them and they exist as constprop calls. Which objtool then complains about: vmlinux.o: warning: objtool: do_machine_check+0xad8: call to \ pentium_machine_check.constprop.0() leaves .noinstr.text section due to them missing noinstr. One could tag them "noinstr" but what should really happen is, they should be forcefully inlined so that all that gunk gets optimized away and the warning doesn't even have a chance to fire. Do so. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230222191054.4701-1-bp@alien8.de
2023-03-06x86/MCE/AMD: Make kobj_type structure constantThomas Weißschuh
Since ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230217-kobj_type-mce-amd-v1-1-40ef94816444@weissschuh.net
2023-03-06x86/paravirt: Merge activate_mm() and dup_mmap() callbacksJuergen Gross
The two paravirt callbacks .mmu.activate_mm() and .mmu.dup_mmap() are sharing the same implementations in all cases: for Xen PV guests they are pinning the PGD of the new mm_struct, and for all other cases they are a NOP. In the end, both callbacks are meant to register an address space with the underlying hypervisor, so there needs to be only a single callback for that purpose. So merge them to a common callback .mmu.enter_mmap() (in contrast to the corresponding already existing .mmu.exit_mmap()). As the first parameter of the old callbacks isn't used, drop it from the replacement one. [ bp: Remove last occurrence of paravirt_activate_mm() in asm/mmu_context.h ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Link: https://lore.kernel.org/r/20230207075902.7539-1-jgross@suse.com
2023-03-05Merge tag 'x86-urgent-2023-03-05' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 updates from Thomas Gleixner: "A small set of updates for x86: - Return -EIO instead of success when the certificate buffer for SEV guests is not large enough - Allow STIPB to be enabled with legacy IBSR. Legacy IBRS is cleared on return to userspace for performance reasons, but the leaves user space vulnerable to cross-thread attacks which STIBP prevents. Update the documentation accordingly" * tag 'x86-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: virt/sev-guest: Return -EIO if certificate buffer is not large enough Documentation/hw-vuln: Document the interaction between IBRS and STIBP x86/speculation: Allow enabling STIBP with legacy IBRS
2023-03-02Merge tag 'objtool-core-2023-03-02' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Shrink 'struct instruction', to improve objtool performance & memory footprint - Other maximum memory usage reductions - this makes the build both faster, and fixes kernel build OOM failures on allyesconfig and similar configs when they try to build the final (large) vmlinux.o - Fix ORC unwinding when a kprobe (INT3) is set on a stack-modifying single-byte instruction (PUSH/POP or LEAVE). This requires the extension of the ORC metadata structure with a 'signal' field - Misc fixes & cleanups * tag 'objtool-core-2023-03-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) objtool: Fix ORC 'signal' propagation objtool: Remove instruction::list x86: Fix FILL_RETURN_BUFFER objtool: Fix overlapping alternatives objtool: Union instruction::{call_dest,jump_table} objtool: Remove instruction::reloc objtool: Shrink instruction::{type,visited} objtool: Make instruction::alts a single-linked list objtool: Make instruction::stack_ops a single-linked list objtool: Change arch_decode_instruction() signature x86/entry: Fix unwinding from kprobe on PUSH/POP instruction x86/unwind/orc: Add 'signal' field to ORC metadata objtool: Optimize layout of struct special_alt objtool: Optimize layout of struct symbol objtool: Allocate multiple structures with calloc() objtool: Make struct check_options static objtool: Make struct entries[] static and const objtool: Fix HOSTCC flag usage objtool: Properly support make V=1 objtool: Install libsubcmd in build ...
2023-02-27x86/speculation: Allow enabling STIBP with legacy IBRSKP Singh
When plain IBRS is enabled (not enhanced IBRS), the logic in spectre_v2_user_select_mitigation() determines that STIBP is not needed. The IBRS bit implicitly protects against cross-thread branch target injection. However, with legacy IBRS, the IBRS bit is cleared on returning to userspace for performance reasons which leaves userspace threads vulnerable to cross-thread branch target injection against which STIBP protects. Exclude IBRS from the spectre_v2_in_ibrs_mode() check to allow for enabling STIBP (through seccomp/prctl() by default or always-on, if selected by spectre_v2_user kernel cmdline parameter). [ bp: Massage. ] Fixes: 7c693f54c873 ("x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS") Reported-by: José Oliveira <joseloliveira11@gmail.com> Reported-by: Rodrigo Branco <rodrigo@kernelhacking.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230220120127.1975241-1-kpsingh@kernel.org Link: https://lore.kernel.org/r/20230221184908.2349578-1-kpsingh@kernel.org
2023-02-25Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - Provide a virtual cache topology to the guest to avoid inconsistencies with migration on heterogenous systems. Non secure software has no practical need to traverse the caches by set/way in the first place - Add support for taking stage-2 access faults in parallel. This was an accidental omission in the original parallel faults implementation, but should provide a marginal improvement to machines w/o FEAT_HAFDBS (such as hardware from the fruit company) - A preamble to adding support for nested virtualization to KVM, including vEL2 register state, rudimentary nested exception handling and masking unsupported features for nested guests - Fixes to the PSCI relay that avoid an unexpected host SVE trap when resuming a CPU when running pKVM - VGIC maintenance interrupt support for the AIC - Improvements to the arch timer emulation, primarily aimed at reducing the trap overhead of running nested - Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the interest of CI systems - Avoid VM-wide stop-the-world operations when a vCPU accesses its own redistributor - Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions in the host - Aesthetic and comment/kerneldoc fixes - Drop the vestiges of the old Columbia mailing list and add [Oliver] as co-maintainer RISC-V: - Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE - Correctly place the guest in S-mode after redirecting a trap to the guest - Redirect illegal instruction traps to guest - SBI PMU support for guest s390: - Sort out confusion between virtual and physical addresses, which currently are the same on s390 - A new ioctl that performs cmpxchg on guest memory - A few fixes x86: - Change tdp_mmu to a read-only parameter - Separate TDP and shadow MMU page fault paths - Enable Hyper-V invariant TSC control - Fix a variety of APICv and AVIC bugs, some of them real-world, some of them affecting architecurally legal but unlikely to happen in practice - Mark APIC timer as expired if its in one-shot mode and the count underflows while the vCPU task was being migrated - Advertise support for Intel's new fast REP string features - Fix a double-shootdown issue in the emergency reboot code - Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give SVM similar treatment to VMX - Update Xen's TSC info CPUID sub-leaves as appropriate - Add support for Hyper-V's extended hypercalls, where "support" at this point is just forwarding the hypercalls to userspace - Clean up the kvm->lock vs. kvm->srcu sequences when updating the PMU and MSR filters - One-off fixes and cleanups - Fix and cleanup the range-based TLB flushing code, used when KVM is running on Hyper-V - Add support for filtering PMU events using a mask. If userspace wants to restrict heavily what events the guest can use, it can now do so without needing an absurd number of filter entries - Clean up KVM's handling of "PMU MSRs to save", especially when vPMU support is disabled - Add PEBS support for Intel Sapphire Rapids - Fix a mostly benign overflow bug in SEV's send|receive_update_data() - Move several SVM-specific flags into vcpu_svm x86 Intel: - Handle NMI VM-Exits before leaving the noinstr region - A few trivial cleanups in the VM-Enter flows - Stop enabling VMFUNC for L1 purely to document that KVM doesn't support EPTP switching (or any other VM function) for L1 - Fix a crash when using eVMCS's enlighted MSR bitmaps Generic: - Clean up the hardware enable and initialization flow, which was scattered around multiple arch-specific hooks. Instead, just let the arch code call into generic code. Both x86 and ARM should benefit from not having to fight common KVM code's notion of how to do initialization - Account allocations in generic kvm_arch_alloc_vm() - Fix a memory leak if coalesced MMIO unregistration fails selftests: - On x86, cache the CPU vendor (AMD vs. Intel) and use the info to emit the correct hypercall instruction instead of relying on KVM to patch in VMMCALL - Use TAP interface for kvm_binary_stats_test and tsc_msrs_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits) KVM: SVM: hyper-v: placate modpost section mismatch error KVM: x86/mmu: Make tdp_mmu_allowed static KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes KVM: arm64: nv: Filter out unsupported features from ID regs KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2 KVM: arm64: nv: Allow a sysreg to be hidden from userspace only KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2 KVM: arm64: nv: Handle SMCs taken from virtual EL2 KVM: arm64: nv: Handle trapped ERET from virtual EL2 KVM: arm64: nv: Inject HVC exceptions to the virtual EL2 KVM: arm64: nv: Support virtual EL2 exceptions KVM: arm64: nv: Handle HCR_EL2.NV system register traps KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state KVM: arm64: nv: Add EL2 system registers to vcpu context KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set KVM: arm64: nv: Introduce nested virtualization VCPU feature KVM: arm64: Use the S2 MMU context to iterate over S2 table ...
2023-02-25Merge tag 'x86_tdx_for_6.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull Intel Trust Domain Extensions (TDX) updates from Dave Hansen: "Other than a minor fixup, the content here is to ensure that TDX guests never see virtualization exceptions (#VE's) that might be induced by the untrusted VMM. This is a highly desirable property. Without it, #VE exception handling would fall somewhere between NMIs, machine checks and total insanity. With it, #VE handling remains pretty mundane. Summary: - Fixup comment typo - Prevent unexpected #VE's from: - Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE) - Excessive #VE notifications (NOTIFY_ENABLES) which are delivered via a #VE" * tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall() x86/tdx: Disable NOTIFY_ENABLES x86/tdx: Relax SEPT_VE_DISABLE check for debug TD x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE x86/tdx: Expand __tdx_hypercall() to handle more arguments x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments x86/tdx: Add more registers to struct tdx_hypercall_args x86/tdx: Fix typo in comment in __tdx_hypercall()