path: root/arch/x86
2023-02-07KVM: SVM: Fix potential overflow in SEV's send|receive_update_data()Peter Gonda
KVM_SEV_SEND_UPDATE_DATA and KVM_SEV_RECEIVE_UPDATE_DATA have an integer overflow issue. params.guest_len and offset are both 32 bits wide; with a large params.guest_len the check to confirm a page boundary is not crossed can falsely pass:

  /* Check if we are crossing the page boundary */
  offset = params.guest_uaddr & (PAGE_SIZE - 1);
  if ((params.guest_len + offset > PAGE_SIZE))

Add an additional check to confirm that params.guest_len itself is not greater than PAGE_SIZE.

Note, this isn't a security concern as overflow can happen if and only if params.guest_len is greater than 0xfffff000, and the FW spec says these commands fail with lengths greater than 16KB, i.e. the PSP will detect KVM's goof.

Fixes: 15fb7de1a7f5 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
Fixes: d3d1af85e2c7 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Reported-by: Andy Nguyen <theflow@google.com>
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Peter Gonda <pgonda@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230207171354.4012821-1-pgonda@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
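As an aside, a minimal standalone C sketch of the wrap-around described above; the address and length are made-up example values, not taken from the patch:

  /*
   * Standalone sketch (not the kernel code) showing how the unsigned
   * 32-bit addition in the original check can wrap and falsely pass,
   * and how the extra "guest_len > PAGE_SIZE" test catches it.
   */
  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SIZE 4096u

  int main(void)
  {
      uint32_t guest_uaddr = 0x1800;        /* arbitrary example address */
      uint32_t guest_len   = 0xfffff800u;   /* > 0xfffff000, triggers wrap */
      uint32_t offset      = guest_uaddr & (PAGE_SIZE - 1);

      /* Old check: guest_len + offset wraps around to a small value. */
      int old_check_rejects = (guest_len + offset > PAGE_SIZE);

      /* With the additional check, overly large lengths are rejected outright. */
      int new_check_rejects = (guest_len > PAGE_SIZE) ||
                              (guest_len + offset > PAGE_SIZE);

      printf("old check rejects: %d\n", old_check_rejects);  /* 0: falsely passes */
      printf("new check rejects: %d\n", new_check_rejects);  /* 1: caught */
      return 0;
  }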
2023-02-07x86/vdso: Fix -Wmissing-prototypes warningsBorislav Petkov (AMD)
Fix those:

  In file included from arch/x86/entry/vdso/vdso32/vclock_gettime.c:4:
  arch/x86/entry/vdso/vdso32/../vclock_gettime.c:70:5: warning: no previous prototype for ‘__vdso_clock_gettime64’ [-Wmissing-prototypes]
     70 | int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts)
        |
  In file included from arch/x86/entry/vdso/vdso32/vgetcpu.c:3:
  arch/x86/entry/vdso/vdso32/../vgetcpu.c:13:1: warning: no previous prototype for ‘__vdso_getcpu’ [-Wmissing-prototypes]
     13 | __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
        | ^~~~~~~~~~~~~

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/202302070742.iYcnoJwk-lkp@intel.com
2023-02-07x86/vdso: Fake 32bit VDSO build on 64bit compile for vgetcpuSebastian Andrzej Siewior
The 64bit register constraints in __arch_hweight64() cannot be fulfilled in a 32-bit build. The function is only declared but not used within vclock_gettime.c and gcc does not care. LLVM complains and aborts, reportedly because it validates extended asm even if the latter would get compiled out, see

  https://lore.kernel.org/r/Y%2BJ%2BUQ1vAKr6RHuH@dev-arch.thelio-3990X

i.e., a long-standing design difference between gcc and LLVM.

Move the "fake a 32 bit kernel configuration" bits from vclock_gettime.c into a common header file. Use this from vclock_gettime.c and vgetcpu.c.

[ bp: Add background info from Nathan. ]

Fixes: 92d33063c081a ("x86/vdso: Provide getcpu for x86-32.")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y+IsCWQdXEr8d9Vy@linutronix.de
2023-02-07KVM: VMX: Fix crash due to uninitialized current_vmcsAlexandru Matei
KVM enables 'Enlightened VMCS' and 'Enlightened MSR Bitmap' when running as a nested hypervisor on top of Hyper-V. When the MSR bitmap is updated, the evmcs_touch_msr_bitmap() function uses the current_vmcs per-cpu variable to mark that the msr bitmap was changed.

vmx_vcpu_create() modifies the msr bitmap via vmx_disable_intercept_for_msr -> vmx_msr_bitmap_l01_changed which in the end calls this function. The function checks whether current_vmcs is null, but the check is insufficient because current_vmcs is not initialized. Because of this, the code might incorrectly write to the structure pointed to by the current_vmcs value left behind by another task. Preemption is not disabled, so the current task can be preempted and moved to another CPU while current_vmcs is accessed multiple times from evmcs_touch_msr_bitmap(), which leads to a crash.

The manipulation of MSR bitmaps by callers happens only for vmcs01, so the solution is to use vmx->vmcs01.vmcs instead of current_vmcs.

  BUG: kernel NULL pointer dereference, address: 0000000000000338
  PGD 4e1775067 P4D 0
  Oops: 0002 [#1] PREEMPT SMP NOPTI
  ...
  RIP: 0010:vmx_msr_bitmap_l01_changed+0x39/0x50 [kvm_intel]
  ...
  Call Trace:
   vmx_disable_intercept_for_msr+0x36/0x260 [kvm_intel]
   vmx_vcpu_create+0xe6/0x540 [kvm_intel]
   kvm_arch_vcpu_create+0x1d1/0x2e0 [kvm]
   kvm_vm_ioctl_create_vcpu+0x178/0x430 [kvm]
   kvm_vm_ioctl+0x53f/0x790 [kvm]
   __x64_sys_ioctl+0x8a/0xc0
   do_syscall_64+0x5c/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: ceef7d10dfb6 ("KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Link: https://lore.kernel.org/r/20230123221208.4964-1-alexandru.matei@uipath.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-07KVM: nVMX: Simplify the setting of SECONDARY_EXEC_ENABLE_VMFUNC for nested.Yu Zhang
The values of the base settings for the nested proc-based VM-Execution control MSR come from the non-nested ones. For the SECONDARY_EXEC_ENABLE_VMFUNC flag, KVM currently: a) first masks it off from vmcs_conf->cpu_based_2nd_exec_ctrl; b) then checks it against the same source; c) and resets it again if the host has it.

So just simplify this, by not masking off SECONDARY_EXEC_ENABLE_VMFUNC in the first place.

No functional change.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20221109075413.1405803-3-yu.c.zhang@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-07KVM: VMX: Do not trap VMFUNC instructions for L1 guests.Yu Zhang
Explicitly disable VMFUNC in vmcs01 to document that KVM doesn't support any VM-Functions for L1. WARN in the dedicated VMFUNC handler if an exit occurs while L1 is active, but keep the existing handlers as fallbacks to avoid killing the VM, as an unexpected VMFUNC VM-Exit isn't fatal.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20221109075413.1405803-2-yu.c.zhang@linux.intel.com
[sean: don't kill the VM on an unexpected VMFUNC from L1, reword changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-06clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requestedPaul E. McKenney
Unconditionally enabling TSC watchdog checking of the HPET and PMTMR clocksources can degrade latency and performance. Therefore, provide a new "watchdog" option to the tsc= boot parameter that opts into such checking. Note that tsc=watchdog is overridden by a tsc=nowatchdog regardless of their relative positions in the list of boot parameters. Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Waiman Long <longman@redhat.com>
2023-02-06x86/vdso: Provide getcpu for x86-32.Sebastian Andrzej Siewior
Wire up __vdso_getcpu() for x86-32. The 64bit version is reused with trivial modifications.

Contrary to vclock_gettime.c there is no requirement to fake any defines in the case of 32bit VDSO on a 64bit kernel because the GDT entry from which the CPU and node information is read is always the native one.

Adopt vdso_getcpu.c by:

  - removing the unneeded time* header files which lead to compile errors for 32bit.
  - adding segment.h which provides vdso_read_cpunode() and the defines required by it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221125094216.3663444-3-bigeasy@linutronix.de
2023-02-06x86/cpu: Provide the full setup for getcpu() on x86-32Sebastian Andrzej Siewior
setup_getcpu() configures two things:

  - it writes the current CPU & node information into MSR_TSC_AUX
  - it writes the same information as a GDT entry.

By using the "full" setup_getcpu() on i386 it is possible to read the CPU information in userland via RDTSCP or via LSL from the GDT.

Provide a GDT_ENTRY_CPUNODE for x86-32 and make the setup function unconditionally available.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org>
Link: https://lore.kernel.org/r/20221125094216.3663444-2-bigeasy@linutronix.de
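As an aside, a minimal userspace sketch (not part of the patch) of reading that information back with RDTSCP, assuming the usual Linux TSC_AUX layout of cpu in bits 11:0 and node in the bits above:

  /*
   * Userspace sketch: read the CPU/node info that setup_getcpu()
   * publishes via MSR_TSC_AUX, using the RDTSCP instruction.
   * Assumes the usual Linux encoding (VDSO_CPUNODE_BITS == 12).
   */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint32_t lo, hi, aux;

      /* RDTSCP returns the TSC in EDX:EAX and IA32_TSC_AUX in ECX. */
      __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));

      unsigned int cpu  = aux & 0xfff;   /* low 12 bits: CPU number */
      unsigned int node = aux >> 12;     /* remaining bits: node number */

      printf("running on cpu %u, node %u\n", cpu, node);
      return 0;
  }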
2023-02-06x86/microcode/core: Return an error only when necessaryBorislav Petkov (AMD)
Return an error from the late loading function which is run on each CPU only when an error has actually been encountered during the update. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230130161709.11615-5-bp@alien8.de
2023-02-06x86/microcode/AMD: Fix mixed steppings supportBorislav Petkov (AMD)
The AMD side of the loader has always claimed to support mixed steppings. But somewhere along the way, it broke that by assuming that the cached patch blob is a single one instead of it being one per *node*. So turn it into a per-node one so that each node can stash the blob relevant for it. [ NB: Fixes tag is not really the exactly correct one but it is good enough. ] Fixes: fe055896c040 ("x86/microcode: Merge the early microcode loader") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> # 2355370cd941 ("x86/microcode/amd: Remove load_microcode_amd()'s bsp parameter") Cc: <stable@kernel.org> # a5ad92134bd1 ("x86/microcode/AMD: Add a @cpu parameter to the reloading functions") Link: https://lore.kernel.org/r/20230130161709.11615-4-bp@alien8.de
2023-02-06x86/microcode/AMD: Add a @cpu parameter to the reloading functionsBorislav Petkov (AMD)
Will be used in a subsequent change. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230130161709.11615-3-bp@alien8.de
2023-02-06x86/microcode/amd: Remove load_microcode_amd()'s bsp parameterBorislav Petkov (AMD)
It is always the BSP. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230130161709.11615-2-bp@alien8.de
2023-02-05kbuild: remove --include-dir MAKEFLAG from top MakefileMasahiro Yamada
I added $(srctree)/ to some included Makefiles in the following commits:

  - 3204a7fb98a3 ("kbuild: prefix $(srctree)/ to some included Makefiles")
  - d82856395505 ("kbuild: do not require sub-make for separate output tree builds")

They were a preparation for removing the --include-dir flag.

I have never thought --include-dir useful. Rather, it _is_ harmful. For example, run the following commands:

  $ make -s ARCH=x86 mrproper defconfig
  $ make ARCH=arm O=foo dtbs
  make[1]: Entering directory '/tmp/linux/foo'
    HOSTCC  scripts/basic/fixdep
  Error: kernelrelease not valid - run 'make prepare' to update it
    UPD     include/config/kernel.release
  make[1]: Leaving directory '/tmp/linux/foo'

The first command configures the source tree for x86. The next command tries to build ARM device trees in the separate foo/ directory - this must stop because the directory foo/ has not been configured yet.

However, due to --include-dir=$(abs_srctree), the top Makefile includes the wrong include/config/auto.conf from the source tree and continues building. Kbuild traverses the directory tree, but of course it does not work correctly. The Error message is also pointless - 'make prepare' does not help at all for fixing the issue.

This commit fixes more arch Makefiles, and finally removes --include-dir from the top Makefile.

There are more breakages under drivers/, but I do not volunteer to fix them all. I just moved --include-dir to drivers/Makefile. With this commit, the second command will stop with a sensible message.

  $ make -s ARCH=x86 mrproper defconfig
  $ make ARCH=arm O=foo dtbs
  make[1]: Entering directory '/tmp/linux/foo'
    SYNC    include/config/auto.conf.cmd
  ***
  *** The source tree is not clean, please run 'make ARCH=arm mrproper'
  *** in /tmp/linux
  ***
  make[2]: *** [../Makefile:646: outputmakefile] Error 1
  /tmp/linux/Makefile:770: include/config/auto.conf.cmd: No such file or directory
  make[1]: *** [/tmp/linux/Makefile:793: include/config/auto.conf.cmd] Error 2
  make[1]: Leaving directory '/tmp/linux/foo'
  make: *** [Makefile:226: __sub-make] Error 2

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-02-04efi: Discover BTI support in runtime services regionsArd Biesheuvel
Add the generic plumbing to detect whether or not the runtime code regions were constructed with BTI/IBT landing pads by the firmware, permitting the OS to enable enforcement when mapping these regions into the OS's address space. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-02-03KVM: x86: Simplify msr_io()Michal Luczaj
As of commit bccf2150fe62 ("KVM: Per-vcpu inodes"), __msr_io() doesn't return a negative value. Remove unnecessary checks. Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-7-mhal@rbox.co [sean: call out commit which left behind the unnecessary check] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03KVM: x86: Remove unnecessary initialization in kvm_vm_ioctl_set_msr_filter()Michal Luczaj
Do not initialize the value of `r`, as it will be overwritten. Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-6-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03KVM: x86: Explicitly state lockdep condition of msr_filter updateMichal Luczaj
Replace `1` with the actual mutex_is_locked() check.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-5-mhal@rbox.co
[sean: delete the comment that explained the hardcoded '1']
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03KVM: x86: Simplify msr_filter updateMichal Luczaj
Replace srcu_dereference()+rcu_assign_pointer() sequence with a single rcu_replace_pointer(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-4-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
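A simplified sketch of the pattern (not the exact KVM code): rcu_replace_pointer() returns the old pointer and publishes the new one in a single step, taking the lockdep condition that justifies the update.

  /* Before: separate dereference and assignment (simplified). */
  old = srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu,
                               mutex_is_locked(&kvm->lock));
  rcu_assign_pointer(kvm->arch.msr_filter, new);

  /* After: one helper returns the old pointer and publishes the new one. */
  old = rcu_replace_pointer(kvm->arch.msr_filter, new,
                            mutex_is_locked(&kvm->lock));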
2023-02-03KVM: x86: Optimize kvm->lock and SRCU interaction (KVM_X86_SET_MSR_FILTER)Michal Luczaj
Reduce time spent holding kvm->lock: unlock mutex before calling synchronize_srcu(). There is no need to hold kvm->lock until all vCPUs have been kicked, KVM only needs to guarantee that all vCPUs will switch to the new filter before exiting to userspace. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-3-mhal@rbox.co [sean: expand changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
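A hedged sketch of the locking shape described above (simplified, not the literal diff): publish the new filter under kvm->lock, drop the mutex, and only then wait for SRCU readers and kick the vCPUs.

  /*
   * Hedged sketch only; simplified and not the literal KVM code.
   */
  mutex_lock(&kvm->lock);
  old = rcu_replace_pointer(kvm->arch.msr_filter, new,
                            mutex_is_locked(&kvm->lock));
  mutex_unlock(&kvm->lock);

  /*
   * kvm->lock is no longer held: vCPUs only need to observe the new
   * filter before they next return to userspace.
   */
  synchronize_srcu(&kvm->srcu);
  /* ... free the old filter, then kick vCPUs so they pick up the change ... */
  kvm_make_all_cpus_request(kvm, KVM_REQ_MSR_FILTER_CHANGED);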
2023-02-03KVM: x86: Optimize kvm->lock and SRCU interaction (KVM_SET_PMU_EVENT_FILTER)Michal Luczaj
Reduce time spent holding kvm->lock: unlock mutex before calling synchronize_srcu_expedited(). There is no need to hold kvm->lock until all vCPUs have been kicked, KVM only needs to guarantee that all vCPUs will switch to the new filter before exiting to userspace. Protecting the write to __reprogram_pmi is also unnecessary as a vCPU may process a set bit before receiving the final KVM_REQ_PMU, but the per-vCPU writes are guaranteed to occur after all vCPUs have switched to the new filter. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230107001256.2365304-2-mhal@rbox.co [sean: expand changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03KVM: x86/emulator: Fix comment in __load_segment_descriptor()Michal Luczaj
The comment refers to the same condition twice. Make it reflect what the code actually does. No functional change intended. Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230126013405.2967156-3-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03KVM: x86/emulator: Fix segment load privilege level validationMichal Luczaj
The Intel SDM describes what steps are taken by the CPU to verify if a memory segment can actually be used at a given privilege level. Loading DS/ES/FS/GS involves checking the segment's type as well as making sure that neither the selector's RPL nor the caller's CPL is greater than the segment's DPL.

The emulator implements Intel's pseudocode in __load_segment_descriptor(), even quoting the pseudocode in the comments. Although the pseudocode is correctly translated, the implementation is incorrect. This is most likely due to the SDM, at the time, being wrong.

The patch fixes the emulator's logic and updates the pseudocode in the comment. Below are historical notes.

Emulator code for handling segment descriptors appears to have been introduced in March 2010 in commit 38ba30ba51a0 ("KVM: x86 emulator: Emulate task switch in emulator.c").

Intel SDM Vol 2A: Instruction Set Reference, A-M (Order Number: 253666-034US, _March 2010_) lists the steps for loading segment registers in the section related to the MOV instruction:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
      or segment is not a data or readable code segment
      or ((segment is a data or nonconforming code segment)
      and (both RPL and CPL > DPL))              <---
        THEN #GP(selector); FI;

This is precisely what __load_segment_descriptor() quotes and implements. But there's a twist; a few SDM revisions later (253667-044US), in August 2012, the snippet above becomes:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
      or segment is not a data or readable code segment
      or ((segment is a data or nonconforming code segment)
      [note: missing or superfluous parenthesis?]
      or ((RPL > DPL) and (CPL > DPL))           <---
        THEN #GP(selector); FI;

Many SDMs later (253667-065US), in December 2017, the pseudocode reaches what seems to be its final form:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
      OR segment is not a data or readable code segment
      OR ((segment is a data or nonconforming code segment)
      AND ((RPL > DPL) or (CPL > DPL)))          <---
        THEN #GP(selector); FI;

which also matches the behavior described in AMD's APM, which states that a #GP occurs if:

  The DS, ES, FS, or GS register was loaded and the segment pointed to
  was a data or non-conforming code segment, but the RPL or CPL was
  greater than the DPL.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230126013405.2967156-2-mhal@rbox.co
[sean: add blurb to changelog calling out AMD agrees]
Signed-off-by: Sean Christopherson <seanjc@google.com>
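As an aside, a standalone sketch of the corrected condition; the structure and helper names are made up and do not match the emulator's code:

  /*
   * Illustrative, standalone sketch of the final-form check for loading
   * DS/ES/FS/GS; names are invented, not the KVM emulator's.
   */
  #include <stdbool.h>

  struct seg_desc_info {
      bool data_or_nonconforming;   /* data or non-conforming code segment */
      unsigned int dpl;
  };

  /* Return true if loading the segment must raise #GP(selector). */
  static bool segment_load_gp(const struct seg_desc_info *seg,
                              unsigned int rpl, unsigned int cpl)
  {
      /*
       * Old (incorrect) logic required BOTH RPL and CPL to exceed DPL:
       *   seg->data_or_nonconforming && (rpl > seg->dpl && cpl > seg->dpl)
       *
       * Fixed logic: either RPL or CPL above DPL is enough.
       */
      return seg->data_or_nonconforming &&
             (rpl > seg->dpl || cpl > seg->dpl);
  }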
2023-02-03efi: Drop minimum EFI version check at bootArd Biesheuvel
We currently pass a minimum major version to the generic EFI helper that checks the system table magic and version, and refuse to boot if the value is lower.

The motivation for this check is unknown, and even the code that uses major version 2 as the minimum (ARM, arm64 and RISC-V) should make it past this check without problems, and boot to a point where we have access to a console or some other means to inform the user that the firmware's major revision number made us unhappy.

(Revision 2.0 of the UEFI specification was released in January 2006, whereas ARM, arm64 and RISC-V support were added in 2009, 2013 and 2017, respectively, so checking for major version 2 or higher is completely arbitrary.)

So just drop the check.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-02-03livepatch,x86: Clear relocation targets on a module removalSong Liu
Josh reported a bug: when the object to be patched is a module, and that module is rmmod'ed and reloaded, it fails to load with:

  module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 2, loc 00000000ba0302e9, val ffffffffa03e293c
  livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' (-8)
  livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to load module 'nfsd'

The livepatch module has a relocation which references a symbol in the _previous_ loading of nfsd. When apply_relocate_add() tries to replace the old relocation with a new one, it sees that the previous one is nonzero and it errors out.

He also proposed three different solutions. We could remove the error check in apply_relocate_add() introduced by commit eda9cec4c9a1 ("x86/module: Detect and skip invalid relocations"). However the check is useful for detecting corrupted modules.

We could also refuse to remove patched modules. If it proved to be a major drawback for users, we could still implement a different approach later. That solution would also complicate the existing code a lot.

We thus decided to reverse the relocation patching (clear all relocation targets on x86_64). The solution is not universal and is too arch-specific, but it may prove to be simpler in the end.

Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Originally-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230125185401.279042-2-song@kernel.org
2023-02-03x86/module: remove unused code in __apply_relocate_addSong Liu
This "#if 0" block has been untouched for many years. Remove it to clean up the code. Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230125185401.279042-1-song@kernel.org
2023-02-02scripts/spelling.txt: add `permitted'Ricardo Ribalda
Patch series "spelling: Fix some trivial typos".

Seems like permitted has two t's :), let's add that to spellings to help others.

This patch (of 3):

Add another common typo. Noticed when I sent a patch with the typo; it also shows up in kvm and of.

[ribalda@chromium.org: fix trivial typo]
Link: https://lkml.kernel.org/r/20221220-permited-v1-2-52ea9857fa61@chromium.org
Link: https://lkml.kernel.org/r/20221220-permited-v1-1-52ea9857fa61@chromium.org
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: add vma_alloc_zeroed_movable_folio()Matthew Wilcox (Oracle)
Replace alloc_zeroed_user_highpage_movable(). The main difference is returning a folio containing a single page instead of returning the page, but take the opportunity to rename the function to match other allocation functions a little better and rewrite the documentation to place more emphasis on the zeroing rather than the highmem aspect. Link: https://lkml.kernel.org/r/20230116191813.2145215-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVEDavid Hildenbrand
__HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that support swp PTEs, so let's drop it. Link: https://lkml.kernel.org/r/20230113171026.582290-27-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02x86/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE also on 32bitDavid Hildenbrand
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE just like we already do on x86-64. After deciphering the PTE layout it becomes clear that there are still unused bits for 2-level and 3-level page tables that we should be able to use. Reusing a bit avoids stealing one bit from the swap offset. While at it, mask the type in __swp_entry(); use some helper definitions to make the macros easier to grasp. Link: https://lkml.kernel.org/r/20230113171026.582290-25-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02Revert "x86: kmsan: sync metadata pages on page fault"Alexander Potapenko
This reverts commit 3f1e2c7a9099c1ed32c67f12cdf432ba782cf51f. As noticed by Qun-Wei Lin, arch_sync_kernel_mappings() in arch/x86/mm/fault.c is only used with CONFIG_X86_32, whereas KMSAN is only supported on x86_64, where this code is not compiled. The patch in question dates back to downstream KMSAN branch based on v5.8-rc5, it sneaked into upstream unnoticed in v6.1. Link: https://lkml.kernel.org/r/20230111101806.3236991-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Qun-Wei Lin <qun-wei.lin@mediatek.com> Link: https://github.com/google/kmsan/issues/91 Cc: Andy Lutomirski <luto@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Elver <elver@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-03crypto: x86 - exit fpu context earlier in ECB/CBC macrosPeter Lafreniere
Currently the ecb/cbc macros hold fpu context unnecessarily when using scalar cipher routines (e.g. when handling odd sizes of blocks per walk). Change the macros to drop fpu context as soon as the fpu is out of use. No performance impact found (on Intel Haswell). Signed-off-by: Peter Lafreniere <peter@n8pjl.ca> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
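A rough sketch of the restructuring idea, using made-up helpers (process_fpu_batch(), process_one_block()) rather than the real macro internals:

  /*
   * Rough sketch only: the real ECB/CBC macros walk a skcipher request.
   * process_fpu_batch() and process_one_block() are invented stand-ins.
   */
  kernel_fpu_begin();
  while (nbytes >= FPU_BATCH_BYTES)
      nbytes -= process_fpu_batch(dst, src);   /* wide SIMD path */
  kernel_fpu_end();        /* drop FPU context before any scalar tail work */

  while (nbytes >= BLOCK_BYTES)
      nbytes -= process_one_block(dst, src);   /* scalar path, no FPU needed */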
2023-02-02x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall()Kirill A. Shutemov
If compiled with CONFIG_FRAME_POINTER=y, objtool is not happy that __tdx_hypercall() messes up RBP:

  objtool: __tdx_hypercall+0x7f: return with modified stack frame

Rework the function to store TDX_HCALL_ flags on stack instead of RBP.

[ dhansen: minor changelog tweaks ]

Fixes: c30c4b2555ba ("x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/202301290255.buUBs99R-lkp@intel.com
Link: https://lore.kernel.org/all/20230130135354.27674-1-kirill.shutemov%40linux.intel.com
2023-02-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
net/core/gro.c
  7d2c89b32587 ("skb: Do mix page pool and page referenced frags in GRO")
  b1a78b9b9886 ("net: add support for ipv4 big tcp")
https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02clocksource: Verify HPET and PMTMR when TSC unverifiedPaul E. McKenney
On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC, NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the TSC is disabled. This works well much of the time, but there is the occasional production-level system that meets all of these criteria, but which still has a TSC that skews significantly from atomic-clock time. This is usually attributed to a firmware or hardware fault. Yes, the various NTP daemons do express their opinions of userspace-to-atomic-clock time skew, but they put them in various places, depending on the daemon and distro in question. It would therefore be good for the kernel to have some clue that there is a problem. The old behavior of marking the TSC unstable is a non-starter because a great many workloads simply cannot tolerate the overheads and latencies of the various non-TSC clocksources. In addition, NTP-corrected systems sometimes can tolerate significant kernel-space time skew as long as the userspace time sources are within epsilon of atomic-clock time. Therefore, when watchdog verification of TSC is disabled, enable it for HPET and PMTMR (AKA ACPI PM timer). This provides the needed in-kernel time-skew diagnostic without degrading the system's performance. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Waiman Long <longman@redhat.com> Cc: <x86@kernel.org> Tested-by: Feng Tang <feng.tang@intel.com>
2023-02-02x86/tsc: Add option to force frequency recalibration with HW timerFeng Tang
The kernel assumes that the TSC frequency which is provided by the hardware / firmware via MSRs or CPUID(0x15) is correct after applying a few basic consistency checks. This disables the TSC recalibration against HPET or PM timer.

As a result there is no mechanism to validate that frequency in cases where a firmware or hardware defect is suspected. And there was a case where a user measured the TSC frequency against an atomic clock and reported an inaccuracy issue, which was later fixed in firmware.

Add a 'recalibrate' option to the 'tsc' kernel parameter to force TSC frequency recalibration with HPET or PM timer, and warn if the deviation from the previous value is more than about 500 PPM. This provides a way to verify the data from hardware / firmware. There is no functional change to the existing work flow.

Recently there was a real-world case: "The 40ms/s divergence between TSC and HPET was observed on hardware that is quite recent" [1]. On that platform the TSC frequency of 1896 MHz was read from CPUID(0x15), and forced recalibration with HPET/PMTIMER both calibrated a value of 1975 MHz, which also matched a cross-check with the 'chronyd' software, indicating a problem in the BIOS or firmware.

[ Thanks tglx for helping improve the commit log. ]
[ paulmck: Wordsmith Kconfig help text. ]

[1] https://lore.kernel.org/lkml/20221117230910.GI4001@paulmck-ThinkPad-P17-Gen-1/

Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: <x86@kernel.org>
Cc: <linux-doc@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
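As an aside, a standalone sketch of the kind of PPM deviation check described above; the helper is illustrative, not the kernel's code, and reuses the 1896 MHz / 1975 MHz figures quoted in the changelog:

  /*
   * Illustrative, standalone check of frequency deviation in parts per
   * million; not the kernel's implementation.
   */
  #include <stdio.h>
  #include <stdint.h>

  static uint64_t deviation_ppm(uint64_t freq_old_khz, uint64_t freq_new_khz)
  {
      uint64_t diff = freq_old_khz > freq_new_khz ?
                      freq_old_khz - freq_new_khz : freq_new_khz - freq_old_khz;

      return diff * 1000000 / freq_old_khz;
  }

  int main(void)
  {
      /* Values from the real-world case quoted above: 1896 MHz vs 1975 MHz. */
      uint64_t ppm = deviation_ppm(1896000, 1975000);

      if (ppm > 500)
          printf("warning: TSC frequency deviates by %llu PPM\n",
                 (unsigned long long)ppm);
      return 0;
  }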
2023-02-01perf/x86/intel: Expose EPT-friendly PEBS for SPR and future modelsLike Xu
According to Intel SDM, the EPT-friendly PEBS is supported by all the platforms after ICX, ADL and the future platforms with PEBS format 5. Currently the only in-kernel user of this capability is KVM, which has very limited support for hybrid core pmu, so ADL and its successors do not currently expose this capability. When both hybrid core and PEBS format 5 are present, KVM will decide on its own merits. Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-perf-users@vger.kernel.org Suggested-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221109082802.27543-4-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01KVM: x86/pmu: Add PRIR++ and PDist support for SPR and later modelsLike Xu
The PEBS capability on SPR is basically the same as Ice Lake Server, with the exception of two special facilities that have been enhanced and require special handling.

Upon triggering a PEBS assist, there will be a finite delay between the time the counter overflows and when the microcode starts to carry out its data collection obligations. Even if the delay is constant in core clock space, it invariably manifests as variable "skids" in instruction address space.

On Ice Lake Server, the Precise Distribution of Instructions Retire (PDIR) facility mitigates the "skid" problem by providing an early indication of when the counter is about to overflow. On SPR, the PDIR counter available (Fixed 0) is unchanged, but the capability is enhanced to Instruction-Accurate PDIR (PDIR++), where PEBS is taken on the next instruction after the one that caused the overflow.

SPR also introduces a new Precise Distribution (PDist) facility, only on general programmable counter 0. Per the Intel SDM, PDist eliminates any skid or shadowing effects from PEBS. With PDist, the PEBS record will be generated precisely upon completion of the instruction or operation that causes the counter to overflow (there is no "wait for next occurrence" by default).

In terms of KVM handling, when the guest accesses those special counters, KVM needs to request the same index counters via the perf_event kernel subsystem to ensure that the guest uses the correct PEBS hardware counter (PDIR++ or PDist). This is mainly achieved by adjusting the event precise level to the maximum, where the semantics of this magic number is mainly defined by the internal software context of perf_event and it's also backwards compatible as part of the user space interface.

Opportunistically, refine confusing comments on TNT+, as the only ones that currently support pebs_ept are Ice Lake server and SPR (GLC+).

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20221109082802.27543-3-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01KVM: x86: Reinitialize xAPIC ID when userspace forces x2APIC => xAPICEmanuele Giuseppe Esposito
Reinitialize the xAPIC ID to the vCPU ID when userspace forces the APIC to transition directly from x2APIC to xAPIC mode, e.g. to emulate RESET. KVM already stuffs the xAPIC ID when the APIC is transitioned from DISABLED to xAPIC (commit 49bd29ba1dbd ("KVM: x86: reset APIC ID when enabling LAPIC")), i.e. userspace is conditioned to expect KVM to update the xAPIC ID, but KVM doesn't handle the architecturally-impossible case where userspace forces x2APIC=>xAPIC via KVM_SET_MSRS.

On its own, the "bug" is benign, as userspace emulation of RESET will also stuff APIC registers via KVM_SET_LAPIC, i.e. will manually set the xAPIC ID. However, commit 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") introduced a bug, fixed by commit ef40757743b4 ("KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself"), that caused KVM to fail to properly update the xAPIC ID when handling KVM_SET_LAPIC. Refresh the xAPIC ID even though it's not strictly necessary so that KVM provides consistent behavior.

Note, KVM follows Intel architecture with regard to handling the xAPIC ID and x2APIC IDs across mode transitions.

For the APIC DISABLED case (commit 49bd29ba1dbd), Intel's SDM says the xAPIC ID _may_ be reinitialized:

  10.4.3 Enabling or Disabling the Local APIC

  When IA32_APIC_BASE[11] is set to 0, prior initialization to the APIC
  may be lost and the APIC may return to the state described in Section
  10.4.7.1, “Local APIC State After Power-Up or Reset.”

  10.4.7.1 Local APIC State After Power-Up or Reset

  ... The local APIC ID register is set to a unique APIC ID. ...

i.e. KVM's behavior is legal as per Intel's architecture. In practice, Intel's behavior is N/A as modern Intel CPUs (since at least Haswell) make the xAPIC ID fully read-only.

And for xAPIC => x2APIC transitions (commit 257b9a5faab5 ("KVM: x86: use correct APIC ID on x2APIC transition")), Intel's SDM says:

  Any APIC ID value written to the memory-mapped local APIC ID register
  is not preserved.

AMD's APM says nothing (that I could find) about the xAPIC ID when the APIC is DISABLED, but testing on bare metal (Rome) shows that the xAPIC ID is preserved when the APIC is DISABLED and re-enabled in xAPIC mode. AMD also preserves the xAPIC ID when the APIC is transitioned from xAPIC to x2APIC, i.e. allows a backdoor write of the x2APIC ID, which is again not emulated by KVM.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Link: https://lore.kernel.org/all/20230109130605.2013555-2-eesposit@redhat.com
[sean: rewrite changelog, set xAPIC ID iff APIC is enabled]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01KVM: x86: hyper-v: Add extended hypercall support in Hyper-vVipin Sharma
Add support for extended hypercall in Hyper-v. Hyper-v TLFS 6.0b describes hypercalls above call code 0x8000 as extended hypercalls. A Hyper-v hypervisor's guest VM finds availability of extended hypercalls via CPUID.0x40000003.EBX BIT(20). If the bit is set then the guest can call extended hypercalls. All extended hypercalls will exit to userspace by default. This allows for easy support of future hypercalls without being dependent on KVM releases. If there will be need to process the hypercall in KVM instead of userspace then KVM can create a capability which userspace can query to know which hypercalls can be handled by the KVM and enable handling of those hypercalls. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20221212183720.4062037-10-vipinsh@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01KVM: x86: hyper-v: Use common code for hypercall userspace exitVipin Sharma
Remove duplicate code to exit to userspace for hyper-v hypercalls and use a common place to exit. No functional change intended. Signed-off-by: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20221212183720.4062037-9-vipinsh@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31KVM: x86: Use emulator callbacks instead of duplicating "host flags"Maxim Levitsky
Instead of re-defining the "host flags" bits, just expose dedicated helpers for each of the two remaining flags that are consumed by the emulator. The emulator never consumes both "is guest" and "is SMM" in close proximity, so there is no motivation to avoid additional indirect branches. Also while at it, garbage collect the recently removed host flags. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Santosh Shukla <Santosh.Shukla@amd.com> Link: https://lore.kernel.org/r/20221129193717.513824-6-mlevitsk@redhat.com [sean: fix CONFIG_KVM_SMM=n builds, tweak names of wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31KVM: x86: Move HF_NMI_MASK and HF_IRET_MASK into "struct vcpu_svm"Maxim Levitsky
Move HF_NMI_MASK and HF_IRET_MASK (a.k.a. "waiting for IRET") out of the common "hflags" and into dedicated flags in "struct vcpu_svm". The flags are used only by SVM and thus should not be in hflags.

Tracking NMI masking in software isn't SVM specific, e.g. VMX has a similar flag (soft_vnmi_blocked), but that's much more of a hack as VMX can't intercept IRET, is useful only for ancient CPUs, i.e. will hopefully be removed at some point, and again the exact behavior is vendor specific and shouldn't ever be referenced in common code.

No functional change is intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-5-mlevitsk@redhat.com
[sean: split from HF_GIF_MASK patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31KVM: x86: Move HF_GIF_MASK into "struct vcpu_svm" as "guest_gif"Maxim Levitsky
Move HF_GIF_MASK out of the common "hflags" and into vcpu_svm.guest_gif. GIF is an SVM-only concept and should never be consulted outside of SVM-specific code.

No functional change is intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-5-mlevitsk@redhat.com
[sean: split to separate patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31KVM: nSVM: Don't sync tlb_ctl back to vmcb12 on nested VM-ExitMaxim Levitsky
Don't sync the TLB control field from vmcb02 to vmcb12 on nested VM-Exit. Per AMD's APM, the field is not modified by hardware:

  The VMRUN instruction reads, but does not change, the value of the
  TLB_CONTROL field

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-2-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31x86/amd: Cache debug register values in percpu variablesAlexey Kardashevskiy
Reading DR[0-3]_ADDR_MASK MSRs takes about 250 cycles which is going to be noticeable with the AMD KVM SEV-ES DebugSwap feature enabled. KVM is going to store host's DR[0-3] and DR[0-3]_ADDR_MASK before switching to a guest; the hardware is going to swap these on VMRUN and VMEXIT. Store MSR values passed to set_dr_addr_mask() in percpu variables (when changed) and return them via new amd_get_dr_addr_mask(). The gain here is about 10x. As set_dr_addr_mask() uses the array too, change the @dr type to unsigned to avoid checking for <0. And give it the amd_ prefix to match the new helper as the whole DR_ADDR_MASK feature is AMD-specific anyway. While at it, replace deprecated boot_cpu_has() with cpu_feature_enabled() in set_dr_addr_mask(). Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230120031047.628097-2-aik@amd.com
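A hedged sketch of the caching idea, with amd_msr_dr_addr_mask() as a made-up stand-in for the per-register MSR numbers; this is not the kernel's actual code:

  /*
   * Simplified sketch only. amd_msr_dr_addr_mask() is hypothetical and
   * merely stands in for the DR0..DR3 ADDR_MASK MSR numbers.
   */
  static DEFINE_PER_CPU(unsigned long, amd_dr_addr_mask[4]);

  void amd_set_dr_addr_mask(unsigned long mask, unsigned int dr)
  {
      if (this_cpu_read(amd_dr_addr_mask[dr]) == mask)
          return;                              /* skip the expensive WRMSR */

      wrmsrl(amd_msr_dr_addr_mask(dr), mask);  /* hypothetical MSR lookup */
      this_cpu_write(amd_dr_addr_mask[dr], mask);
  }

  unsigned long amd_get_dr_addr_mask(unsigned int dr)
  {
      /* Much cheaper than RDMSR: just read the cached per-CPU value. */
      return this_cpu_read(amd_dr_addr_mask[dr]);
  }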
2023-01-31x86/microcode: Allow only "1" as a late reload trigger valueAshok Raj
Microcode gets reloaded late only if "1" is written to the reload file. However, the code silently treats any other unsigned integer as a successful write even though no actions are performed to load microcode. Make the loader more strict to accept only "1" as a trigger value and return an error otherwise. [ bp: Massage commit message. ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230130213955.6046-3-ashok.raj@intel.com
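A hedged sketch of the stricter parsing as a simplified sysfs store handler; this is not the microcode loader's exact code:

  /*
   * Simplified sketch of a stricter "reload" store handler; illustrative
   * only, not the loader's implementation.
   */
  static ssize_t reload_store(struct device *dev, struct device_attribute *attr,
                              const char *buf, size_t size)
  {
      unsigned long val;
      int ret;

      ret = kstrtoul(buf, 0, &val);
      if (ret)
          return ret;

      /* Only "1" triggers a late reload; anything else is an error now. */
      if (val != 1)
          return -EINVAL;

      /* ... perform the actual late load here ... */

      return size;
  }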
2023-01-31x86/static_call: Add support for Jcc tail-callsPeter Zijlstra
Clang likes to create conditional tail calls like:

  0000000000000350 <amd_pmu_add_event>:
  350:  0f 1f 44 00 00           nopl   0x0(%rax,%rax,1)
                        351: R_X86_64_NONE   __fentry__-0x4
  355:  48 83 bf 20 01 00 00 00  cmpq   $0x0,0x120(%rdi)
  35d:  0f 85 00 00 00 00        jne    363 <amd_pmu_add_event+0x13>
                        35f: R_X86_64_PLT32  __SCT__amd_pmu_branch_add-0x4
  363:  e9 00 00 00 00           jmp    368 <amd_pmu_add_event+0x18>
                        364: R_X86_64_PLT32  __x86_return_thunk-0x4

Where 0x35d is a static call site that's turned into a conditional tail-call using the Jcc class of instructions.

Teach the in-line static call text patching about this.

Notably, since there is no conditional-ret, in that case patch the Jcc to point at an empty stub function that does the ret -- or the return thunk when needed.

Reported-by: "Erhard F." <erhard_f@mailbox.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/Y9Kdg9QjHkr9G5b5@hirez.programming.kicks-ass.net
2023-01-31x86/alternatives: Teach text_poke_bp() to patch Jcc.d32 instructionsPeter Zijlstra
In order to re-write Jcc.d32 instructions text_poke_bp() needs to be taught about them. The biggest hurdle is that the whole machinery is currently made for 5 byte instructions and extending this would grow struct text_poke_loc which is currently a nice 16 bytes and used in an array. However, since text_poke_loc contains a full copy of the (s32) displacement, it is possible to map the Jcc.d32 2 byte opcodes to Jcc.d8 1 byte opcode for the int3 emulation. This then leaves the replacement bytes; fudge that by only storing the last 5 bytes and adding the rule that 'length == 6' instruction will be prefixed with a 0x0f byte. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20230123210607.115718513@infradead.org
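As an aside, a standalone sketch of the Jcc.d32-to-Jcc.d8 opcode mapping used for int3 emulation (the condition code sits in the low nibble of both encodings); illustrative only, not the kernel's code:

  /*
   * Standalone illustration of mapping a Jcc.d32 opcode to its Jcc.d8
   * counterpart for int3 emulation purposes.
   */
  #include <stdio.h>
  #include <stdint.h>

  /* Second opcode byte of "0x0f 0x8?" Jcc.d32 -> one-byte Jcc.d8 opcode. */
  static uint8_t jcc_d32_to_d8(uint8_t opcode2)
  {
      /* The condition code lives in the low nibble for both encodings. */
      return 0x70 | (opcode2 & 0x0f);
  }

  int main(void)
  {
      /* 0x0f 0x85 is JNE rel32; the short form JNE rel8 is 0x75. */
      printf("0x0f 0x85 (jne rel32) -> 0x%02x (jne rel8)\n",
             jcc_d32_to_d8(0x85));
      return 0;
  }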
2023-01-31x86/alternatives: Introduce int3_emulate_jcc()Peter Zijlstra
Move the kprobe Jcc emulation into int3_emulate_jcc() so it can be used by more code -- specifically static_call() will need this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20230123210607.057678245@infradead.org