path: root/arch
2021-10-22  ARM: kprobes: Make a frame pointer on __kretprobe_trampoline  (Masami Hiramatsu)
Currently kretprobe on ARM fills only r0-r11 of pt_regs, which is not enough for a stacktrace. Moreover, a stacktrace taken from a user kretprobe handler needs a frame pointer on __kretprobe_trampoline. This adds a frame pointer on __kretprobe_trampoline for both the gcc and clang cases. The two compilers use different frame pointer conventions, so we need different but similar stack layouts on pt_regs. Gcc makes the frame pointer (fp) point to the 'pc' entry of the {fp, ip (=sp), lr, pc} record, i.e. {r11, r13, r14, r15}. Thus if we save r11 (fp) in pt_regs->r12, we can build this record at the end of pt_regs. Clang, on the other hand, makes the frame pointer point to the 'fp' entry of a {fp, lr} pair on the stack. Since pt_regs->sp is next to pt_regs->lr, this reuses the pair of pt_regs->fp and pt_regs->ip: it stores 'lr' in pt_regs->ip and makes fp point to pt_regs->fp. In both cases, the __kretprobe_trampoline address is saved in pt_regs->lr, so that the stack tracer can identify this frame pointer as made by __kretprobe_trampoline. Note that if CONFIG_FRAME_POINTER is not set, fp is kept as-is. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-22  ARM: clang: Do not rely on lr register for stacktrace  (Masami Hiramatsu)
Currently the stacktrace on a clang-compiled arm kernel uses the 'lr' register to find the first frame address from pt_regs. However, that is wrong after calling another function, because the 'lr' register is clobbered by the 'bl' instruction and never recovered. As with the gcc-compiled arm kernel, directly use the frame pointer (r11) of pt_regs to find the first frame address. Note that this fixes the kretprobe stacktrace issue only with CONFIG_UNWINDER_FRAME_POINTER=y; CONFIG_UNWINDER_ARM needs another fix. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-22  arm64: Recover kretprobe modified return address in stacktrace  (Masami Hiramatsu)
Since kretprobe replaces the function return address on the stack with kretprobe_trampoline, the stack unwinder shows the trampoline instead of the correct return address. This checks whether the next return address is __kretprobe_trampoline(), and if so, tries to find the correct return address from the kretprobe instance list. For this purpose, a 'kr_cur' loop cursor is added to remember the current kretprobe instance. With this fix, arm64 can now enable CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE and pass the kprobe self tests. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
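The unwinder hook at the heart of this change looks roughly like the sketch below (reconstructed from the description above; field and helper names may differ slightly in the tree):

    /* Sketch: when the unwinder hits the kretprobe trampoline, look up
     * the real return address. 'kr_cur' remembers how far we are in the
     * kretprobe instance list across frames. */
    #ifdef CONFIG_KRETPROBES
        if (is_kretprobe_trampoline(frame->pc))
            frame->pc = kretprobe_find_ret_addr(tsk, (void *)frame->fp,
                                                &frame->kr_cur);
    #endif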
2021-10-22  arm64: kprobes: Make a frame pointer on __kretprobe_trampoline  (Masami Hiramatsu)
Make a frame pointer on __kretprobe_trampoline (that is, make the x29 register point to the address of pt_regs->regs[29]). This frame pointer will be used by the stacktracer when it is called from the kretprobe handlers: in this case, the stack tracer unwinds the stack to trampoline_probe_handler() and finds the next frame pointer in the stack frame of __kretprobe_trampoline(). Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-22  arm64: kprobes: Record frame pointer with kretprobe instance  (Masami Hiramatsu)
Record the frame pointer instead of the stack address with the kretprobe instance as the identifier on the instance list. Since arm64 always enables CONFIG_FRAME_POINTER, we can use the actual frame pointer (x29). This allows the stacktrace code to find the original return address from the FP alone. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-22  x86/unwind: Compile kretprobe fixup code only if CONFIG_KRETPROBES=y  (Masami Hiramatsu)
Compile kretprobe related stacktrace entry recovery code and unwind_state::kr_cur field only when CONFIG_KRETPROBES=y. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-22  dts: socfpga: Add Mercury+ AA1 devicetree  (Paweł Anikiel)
Add support for the Mercury+ AA1 module for Arria 10 SoC FPGA. Signed-off-by: Paweł Anikiel <pan@semihalf.com> Signed-off-by: Joanna Brozek <jbrozek@antmicro.com> Signed-off-by: Mariusz Glebocki <mglebocki@antmicro.com> Signed-off-by: Tomasz Gorochowik <tgorochowik@antmicro.com> Signed-off-by: Maciej Mikunda <mmikunda@antmicro.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20211021151736.2096926-2-pan@semihalf.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-10-22  x86/sgx/virt: implement SGX_IOC_VEPC_REMOVE ioctl  (Paolo Bonzini)
For bare-metal SGX on real hardware, the hardware provides guarantees about SGX state at reboot. For instance, all pages start out uninitialized. The vepc driver provides a similar guarantee today for freshly-opened vepc instances, but guests such as Windows expect all pages to be in the uninitialized state on startup, including after every guest reboot. Some userspace implementations of virtual SGX would rather avoid having to close and reopen the /dev/sgx_vepc file descriptor and re-mmap the virtual EPC. For example, they could sandbox themselves after the guest starts and forbid further calls to open(), in order to mitigate exploits from untrusted guests. Therefore, add an ioctl that does this with EREMOVE. Userspace can invoke the ioctl to bring its vEPC pages back to the uninitialized state. Some pages may fail to be removed if they are SECS pages, since the child and SECS pages could be in separate vEPC regions. Therefore, the ioctl returns the number of EREMOVE failures, telling userspace to try the ioctl again after it's done with all vEPC regions. A more verbose description of the correct usage and the possible error conditions is documented in sgx.rst. Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20211021201155.1523989-3-pbonzini@redhat.com
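A minimal userspace sketch of the intended usage, assuming the ioctl is exposed under the name in the subject line and returns the number of EREMOVE failures as described (error handling omitted):

    #include <sys/ioctl.h>

    /* Reset all vEPC regions of a guest to the uninitialized state.
     * Retrying handles SECS pages whose children live in other regions. */
    static void vepc_reset(int *vepc_fds, int nfds)
    {
        int failures;
        do {
            failures = 0;
            for (int i = 0; i < nfds; i++)
                failures += ioctl(vepc_fds[i], SGX_IOC_VEPC_REMOVE);
        } while (failures > 0);
    }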
2021-10-22  x86/sgx/virt: extract sgx_vepc_remove_page  (Paolo Bonzini)
For bare-metal SGX on real hardware, the hardware provides guarantees about SGX state at reboot. For instance, all pages start out uninitialized. The vepc driver provides a similar guarantee today for freshly-opened vepc instances, but guests such as Windows expect all pages to be in the uninitialized state on startup, including after every guest reboot. One way to do this is to simply close and reopen the /dev/sgx_vepc file descriptor and re-mmap the virtual EPC. However, this is problematic because it prevents sandboxing the userspace (for example, forbidding open() after the guest starts; this is doable with heavy use of SCM_RIGHTS file descriptor passing). In order to implement this, we will need an ioctl that performs EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor: other possibilities, such as closing and reopening the device, are racy. Start the implementation by creating a separate function with just the __eremove wrapper. Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20211021201155.1523989-2-pbonzini@redhat.com
2021-10-22  ARM: dts: spear13xx: Drop malformed 'interrupt-map' on PCI nodes  (Rob Herring)
The spear13xx PCI 'interrupt-map' property is not parse-able. '#interrupt-cells' is missing and there are 3 #address-cells. Based on the driver, the only supported interrupt is for MSI. Therefore, 'interrupt-map' is not needed. Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Shiraz Hashim <shiraz.linux.kernel@gmail.com> Cc: linux-arm-kernel@lists.infradead.org Link: https://lore.kernel.org/r/20211022141156.2592221-1-robh@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2021-10-22  KVM: x86: Use rw_semaphore for APICv lock to allow vCPU parallelism  (Sean Christopherson)
Use a rw_semaphore instead of a mutex to coordinate APICv updates so that vCPUs responding to requests can take the lock for read and run in parallel. Using a mutex forces serialization of vCPUs even though kvm_vcpu_update_apicv() only touches data local to that vCPU or is protected by a different lock, e.g. SVM's ir_list_lock. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022004927.1448382-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
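Conceptually this is the classic mutex-to-rwsem conversion (an illustrative sketch, not the exact KVM code):

    /* Updater: still exclusive. */
    down_write(&kvm->arch.apicv_update_lock);
    /* ... toggle APICv inhibit state, kick vCPUs ... */
    up_write(&kvm->arch.apicv_update_lock);

    /* Each vCPU responding to the request: shared, so vCPUs run in parallel. */
    down_read(&kvm->arch.apicv_update_lock);
    kvm_vcpu_update_apicv(vcpu);
    up_read(&kvm->arch.apicv_update_lock);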
2021-10-22  KVM: x86: Move SVM's APICv sanity check to common x86  (Sean Christopherson)
Move SVM's assertion that a vCPU's APICv state is consistent with its VM's state out of svm_vcpu_run() and into x86's common inner run loop. The assertion and underlying logic are not unique to SVM; it's just that SVM has more inhibiting conditions and thus is more likely to run headfirst into any KVM bugs. Add relevant comments to document exactly why the update path has unusual ordering between the update and the kick, why said ordering is safe, and also the basic rules behind the assertion in the run loop. Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022004927.1448382-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: SEV-ES: go over the sev_pio_data buffer in multiple passes if needed  (Paolo Bonzini)
The PIO scratch buffer is larger than a single page, and therefore it is not possible to copy it in a single step to vcpu->arch.pio_data. Bound each call to emulator_pio_in/out to a single page; keep track of how many I/O operations are left in vcpu->arch.sev_pio_count, so that the operation can be restarted in the complete_userspace_io callback. For OUT, this means that the previous kvm_sev_es_outs implementation becomes an iterator of the loop, and we can consume the sev_pio_data buffer before leaving to userspace. For IN, instead, consuming the buffer and decreasing sev_pio_count is always done in the complete_userspace_io callback, because that is when the memcpy is done into sev_pio_data. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reported-by: Felix Wilhelm <fwilhelm@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
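The per-pass bounding described above looks roughly like this (an illustrative sketch of one OUT pass, not the literal patch):

    /* One pass: emit at most a page's worth of the scratch buffer
     * through vcpu->arch.pio_data, then remember how much is left so
     * complete_userspace_io can restart the operation. */
    unsigned int n = min_t(unsigned int, vcpu->arch.sev_pio_count,
                           PAGE_SIZE / size);
    emulator_pio_out(vcpu, size, port, vcpu->arch.sev_pio_data, n);
    vcpu->arch.sev_pio_count -= n;
    vcpu->arch.sev_pio_data += n * size;   /* advance into the buffer */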
2021-10-22  KVM: SEV-ES: keep INS functions together  (Paolo Bonzini)
Make the diff a little nicer when we actually get to fixing the bug. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86: remove unnecessary arguments from complete_emulator_pio_in  (Paolo Bonzini)
complete_emulator_pio_in can expect that vcpu->arch.pio has been filled in, and therefore does not need the size and count arguments. This makes things nicer when the function is called directly from a complete_userspace_io callback. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86: split the two parts of emulator_pio_in  (Paolo Bonzini)
emulator_pio_in handles both the case where the data is pending in vcpu->arch.pio.count, and the case where I/O has to be done via either an in-kernel device or a userspace exit. For SEV-ES we would like to split these, to identify clearly the moment at which the sev_pio_data is consumed. To this end, create two different functions: __emulator_pio_in fills in vcpu->arch.pio.count, while complete_emulator_pio_in clears it and releases vcpu->arch.pio.data. Because this patch has to be backported, things are left a bit messy. kernel_pio() operates on vcpu->arch.pio, which leads to emulator_pio_in() having two calls to complete_emulator_pio_in(). It will be fixed in the next release. While at it, remove the unused void* val argument of emulator_pio_in_out. The function currently hardcodes vcpu->arch.pio_data as the source/destination buffer, which sucks but will be fixed after the more severe SEV-ES buffer overflow. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
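The split can be pictured like this (a rough sketch of the two halves, not the literal patch):

    /* Start the I/O; on success vcpu->arch.pio.count is left non-zero
     * so the result can be picked up later. */
    static int __emulator_pio_in(struct kvm_vcpu *vcpu, int size,
                                 unsigned short port, unsigned int count)
    {
        return emulator_pio_in_out(vcpu, size, port, count, true);
    }

    /* Consume the result: copy out of vcpu->arch.pio_data and clear
     * vcpu->arch.pio.count, marking the data as released. */
    static void complete_emulator_pio_in(struct kvm_vcpu *vcpu, void *val)
    {
        memcpy(val, vcpu->arch.pio_data,
               vcpu->arch.pio.size * vcpu->arch.pio.count);
        vcpu->arch.pio.count = 0;
    }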
2021-10-22  KVM: SEV-ES: clean up kvm_sev_es_ins/outs  (Paolo Bonzini)
A few very small cleanups to the functions, smushed together because the patch is already very small like this: - inline emulator_pio_in_emulated and emulator_pio_out_emulated, since we already have the vCPU - remove the data argument and pull setting vcpu->arch.sev_pio_data into the caller - remove unnecessary clearing of vcpu->arch.pio.count when emulation is done by the kernel (and therefore vcpu->arch.pio.count is already clear on exit from emulator_pio_in and emulator_pio_out). No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86: leave vcpu->arch.pio.count alone in emulator_pio_in_out  (Paolo Bonzini)
Currently emulator_pio_in clears vcpu->arch.pio.count twice if emulator_pio_in_out performs kernel PIO. Move the clear into emulator_pio_out where it is actually necessary. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: SEV-ES: rename guest_ins_data to sev_pio_data  (Paolo Bonzini)
We will be using this field for OUTS emulation as well, in case the data that is pushed via OUTS spans more than one page. In that case, there will be a need to save the data pointer across exits to userspace. So, change the name to something that refers to any kind of PIO. Also spell out what it is used for, namely SEV-ES. No functional change intended. Cc: stable@vger.kernel.org Fixes: 7ed9abfe8e9f ("KVM: SVM: Support string IO operations for an SEV-ES guest") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  crypto: x86/sm4 - Fix invalid section entry size  (Tianjia Zhang)
This fixes the following warning: vmlinux.o: warning: objtool: elf_update: invalid section entry size. The size of the rodata section is 164 bytes; directly using an entry_size of 164 bytes causes errors in some versions of the gcc compiler, while using 16 bytes directly causes errors in the clang compiler. This patch corrects it by padding the rodata section out to a 16-byte boundary. Fixes: a7ee22ee1445 ("crypto: x86/sm4 - add AES-NI/AVX/x86_64 implementation") Fixes: 5b2efa2bb865 ("crypto: x86/sm4 - add AES-NI/AVX2/x86_64 implementation") Reported-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Tested-by: Heyuan Shi <heyuan@linux.alibaba.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-10-22  x86/build: Tuck away built-in firmware under FW_LOADER  (Luis Chamberlain)
When FW_LOADER is modular or disabled we don't use it. Update x86 relocs to reflect that. Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20211021155843.1969401-7-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-22  x86/microcode: Use the firmware_loader built-in API  (Borislav Petkov)
The microcode loader has been looping through __start_builtin_fw down to __end_builtin_fw to look for possibly built-in firmware for microcode updates. Now that the firmware loader code exports an API for looping through the kernel's built-in firmware section, use it and drop the x86 implementation in favor of it. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20211021155843.1969401-4-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
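With the new helper (firmware_request_builtin() in this series), the lookup reduces to something like the following sketch:

    #include <linux/firmware.h>

    struct firmware fw;

    /* The helper walks the kernel's built-in firmware section for us,
     * replacing the open-coded __start_builtin_fw/__end_builtin_fw loop. */
    if (firmware_request_builtin(&fw, name)) {
        /* fw.data / fw.size now describe the built-in microcode blob */
    }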
2021-10-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller)
Lots of simple overlapping additions. With a build fix from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-22  riscv: do not select non-existing config ANON_INODES  (Lukas Bulwahn)
Commit 99cdc6c18c2d ("RISC-V: Add initial skeletal KVM support") selects the config ANON_INODES in config KVM, but config ANON_INODES was removed by commit 5dd50aaeb185 ("Make anon_inodes unconditional") in 2018. Hence, ./scripts/checkkconfigsymbols.py warns on non-existing symbols:
    ANON_INODES
    Referencing files: arch/riscv/kvm/Kconfig
Remove the select of the non-existing config ANON_INODES. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Message-Id: <20211022061514.25946-1-lukas.bulwahn@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86/mmu: Extract zapping of rmaps for gfn range to separate helper  (Sean Christopherson)
Extract the zapping of rmaps, a.k.a. legacy MMU, for a gfn range to a separate helper to clean up the unholy mess that kvm_zap_gfn_range() has become. In addition to deep nesting, the rmaps zapping spreads out the declaration of several variables and is generally a mess. Clean up the mess now so that future work to improve the memslots implementation doesn't need to deal with it. Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86/mmu: Drop a redundant remote TLB flush in kvm_zap_gfn_range()  (Sean Christopherson)
Remove an unnecessary remote TLB flush in kvm_zap_gfn_range() now that said function holds mmu_lock for write for its entire duration. The flush was added by the now-reverted commit to allow TDP MMU to flush while holding mmu_lock for read, as the transition from write=>read required dropping the lock and thus a pending flush needed to be serviced. Fixes: 5a324c24b638 ("Revert "KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock"") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86/mmu: Drop a redundant, broken remote TLB flush  (Sean Christopherson)
A recent commit to fix the calls to kvm_flush_remote_tlbs_with_address() in kvm_zap_gfn_range() inadvertently added yet another flush instead of fixing the existing flush. Drop the redundant flush, and fix the params for the existing flush. Cc: stable@vger.kernel.org Fixes: 2822da446640 ("KVM: x86/mmu: fix parameters to kvm_flush_remote_tlbs_with_address") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest()  (Lai Jiangshan)
kvm_mmu_unload() destroys all the PGD caches. Use the lighter kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: X86: pair smp_wmb() of mmu_try_to_unsync_pages() with smp_rmb()  (Lai Jiangshan)
Commit 578e1c4db2213 ("kvm: x86: Avoid taking MMU lock in kvm_mmu_sync_roots if no sync is needed") added smp_wmb() in mmu_try_to_unsync_pages(), but the corresponding smp_load_acquire() isn't used on the load of SPTE.W: smp_load_acquire() orders _subsequent_ loads after sp->is_unsync; it does not order _earlier_ loads before the load of sp->is_unsync, so smp_rmb() is the barrier that pairs with the smp_wmb(). This has no functional change; smp_rmb() is a NOP on x86, and no compiler barrier is required because there is a VMEXIT between the load of SPTE.W and kvm_mmu_sync_roots. Cc: Junaid Shahid <junaids@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
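The pairing follows the standard publish/consume pattern (illustrative only; mark_spte_writable()/spte_is_writable() are stand-in names, not KVM functions):

    /* Writer, in mmu_try_to_unsync_pages(): publish unsync before the
     * SPTE becomes writable. */
    sp->unsync = true;
    smp_wmb();
    mark_spte_writable(sptep);

    /* Reader: order the earlier load of SPTE.W before the later load of
     * sp->unsync -- the ordering smp_load_acquire() cannot provide. */
    writable = spte_is_writable(READ_ONCE(*sptep));
    smp_rmb();
    unsync = READ_ONCE(sp->unsync);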
2021-10-22  KVM: X86: Cache CR3 in prev_roots when PCID is disabled  (Lai Jiangshan)
Commit 21823fbda5522 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific PCID; in the case where PCID is disabled, that includes all PGDs in prev_roots, and the commit made prev_roots totally unused in this case. Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1 before the said commit (CR4.PCIDE=0, CR4.PGE=1; CR3=cr3_a; the page for the guest RIP is global; cr3_b is cached in prev_roots):
    1. modify page tables under cr3_b
    2. the shadow root of cr3_b is unsync in kvm
    3. INVPCID single context; the guest expects the TLB is clean for PCID=0
    4. change CR4.PCIDE 0 -> 1
    5. switch to cr3_b with PCID=0, NOFLUSH=1
    6. no sync in kvm, cr3_b is still unsync in kvm
    7. jump to the page that was modified in step 1
    8. shadow page tables point to the wrong page
It is a very unlikely case, but it shows that stale prev_roots can be a problem after CR4.PCIDE changes from 0 to 1. However, to fix this case, the commit disabled caching CR3 in prev_roots altogether when PCID is disabled. Not all CPUs have PCID; in particular, PCID support on AMD CPUs is relatively recent. To restore the prev_roots optimization for CR4.PCIDE=0, flush the whole MMU (including all prev_roots) when CR4.PCIDE changes. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()  (Lai Jiangshan)
KVM doesn't know whether any TLB entries for a specific PCID are cached in the CPU when TDP is enabled. So it is better to flush all the guest TLB entries when invalidating any single PCID context. The case is very rare, or even impossible, since KVM generally doesn't intercept CR3 writes or INVPCID instructions when TDP is enabled, so the fix is mostly for the sake of overall robustness. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE  (Lai Jiangshan)
X86_CR4_PGE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset; only a flush of all guest TLB entries is required. It is also inconsistent that X86_CR4_PGE is in KVM_MMU_CR4_ROLE_BITS while kvm_mmu_role doesn't use X86_CR4_PGE, so X86_CR4_PGE is also removed from KVM_MMU_CR4_ROLE_BITS. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
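A sketch of what the CR4 write path then does on a PGE toggle (illustrative; KVM_REQ_TLB_FLUSH_GUEST is KVM's existing request for guest-scoped TLB flushes):

    /* No kvm_mmu_reset_context(): the mmu role is unchanged, so just
     * request a flush of the guest TLB. */
    if ((cr4 ^ old_cr4) & X86_CR4_PGE)
        kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);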
2021-10-22  KVM: X86: Don't reset mmu context when X86_CR4_PCIDE 1->0  (Lai Jiangshan)
X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset; only a flush of all guest TLB entries is required. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: emulate: Comment on difference between RDPMC implementation and manual  (Wanpeng Li)
The SDM describes RDPMC as:
    IF (((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0))
            and (ECX indicates a supported counter))
        THEN
            EAX := counter[31:0];
            EDX := ZeroExtend(counter[MSCB:32]);
        ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *)
            #GP(0);
    FI;
Let's add a comment explaining why CR0.PE isn't tested: it is impossible for CPL to be >0 if CR0.PE=0. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1634724836-73721-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86: Add vendor name to kvm_x86_ops, use it for error messages  (Sean Christopherson)
Paul pointed out the error messages when KVM fails to load are unhelpful in understanding exactly what went wrong if userspace probes the "wrong" module. Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel and kvm_amd, and use the name for relevant error message when KVM fails to load so that the user knows which module failed to load. Opportunistically tweak the "disabled by bios" error message to clarify that _support_ was disabled, not that the module itself was magically disabled by BIOS. Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211018183929.897461-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
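The shape of the change (a sketch; the exact message text is illustrative):

    struct kvm_x86_ops {
        const char *name;   /* "kvm_intel" or "kvm_amd" */
        /* ... the existing callbacks ... */
    };

    /* Error paths can now say which vendor module failed to load: */
    pr_err_ratelimited("kvm: support for '%s' disabled by bios\n",
                       ops->name);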
2021-10-22  kvm: x86: mmu: Make NX huge page recovery period configurable  (Junaid Shahid)
Currently, the NX huge page recovery thread wakes up every minute and zaps 1/nx_huge_pages_recovery_ratio of the total number of split NX huge pages at a time. This is intended to ensure that only a relatively small number of pages get zapped at a time. But for very large VMs (or more specifically, VMs with a large number of executable pages), a period of 1 minute could still result in this number being too high (unless the ratio is changed significantly, but that can result in split pages lingering on for too long). This change makes the period configurable instead of fixing it at 1 minute. Users of large VMs can then adjust the period and/or the ratio to reduce the number of pages zapped at one time while still maintaining the same overall duration for cycling through the entire list. By default, KVM derives a period from the ratio such that a page will remain on the list for 1 hour on average. Signed-off-by: Junaid Shahid <junaids@google.com> Message-Id: <20211020010627.305925-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
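The default derivation described above amounts to the following (an illustrative sketch):

    /* With 'ratio' wakeups needed to cycle the whole list, a period of
     * (1 hour / ratio) keeps a page on the list ~1 hour on average. */
    static unsigned long recovery_period_ms(unsigned int ratio,
                                            unsigned int period_ms)
    {
        if (period_ms)
            return period_ms;            /* explicitly configured */
        if (!ratio)
            return 0;                    /* recovery disabled */
        return 60UL * 60 * 1000 / ratio; /* derived from the ratio */
    }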
2021-10-22  KVM: vPMU: Fill get_msr MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0  (Wanpeng Li)
SDM section 18.2.3 mentions that: "IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of any general-purpose or fixed-function counters via a single WRMSR." The SDM describes the MSR as R/W; reading it on bare metal during perf testing, the value is always 0 on the ICX/SKX boxes at hand. Let's fill get_msr MSR_CORE_PERF_GLOBAL_OVF_CTRL with 0 to match hardware behavior, and drop the global_ovf_ctrl variable. Tested-by: Like Xu <likexu@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1634631160-67276-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86/mmu: Rename slot_handle_leaf to slot_handle_level_4k  (David Matlack)
slot_handle_leaf is a misnomer because it only operates on 4K SPTEs, whereas "leaf" is used to describe any valid terminal SPTE (4K or large page). Rename slot_handle_leaf to slot_handle_level_4k to avoid confusion. Making this change makes it more obvious that there is a benign discrepancy between the legacy MMU and the TDP MMU when it comes to dirty logging. The legacy MMU only iterates through 4K SPTEs when zapping for collapsing and when clearing D-bits. The TDP MMU, on the other hand, iterates through SPTEs on all levels. The TDP MMU behavior of zapping SPTEs at all levels is technically overkill for its current dirty logging implementation, which always demotes to 4K SPTEs, but both the TDP MMU and legacy MMU zap if and only if the SPTE can be replaced by a larger page, i.e. will not spuriously zap 2m (or larger) SPTEs. Opportunistically add comments to explain this discrepancy in the code. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20211019162223.3935109-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: VMX: RTIT_CTL_BRANCH_EN has no dependency on other CPUID bit  (Xiaoyao Li)
Per the Intel SDM, the RTIT_CTL_BRANCH_EN bit has no dependency on any bit enumerated in CPUID leaf 0x14. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-5-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: VMX: Rename pt_desc.addr_range to pt_desc.num_address_ranges  (Xiaoyao Li)
To better explain the meaning of this field and match the PT_CAP_num_address_ranges constant. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-4-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: VMX: Use precomputed vmx->pt_desc.addr_range  (Xiaoyao Li)
The number of valid PT ADDR MSRs for the guest is precomputed in vmx->pt_desc.addr_range. Use it instead of calculating again. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-3-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: VMX: Restore host's MSR_IA32_RTIT_CTL when it's not zero  (Xiaoyao Li)
A minor optimization: WRMSR MSR_IA32_RTIT_CTL only when necessary. Opportunistically refine the comment to call out that KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-2-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: x86/mmu: clean up prefetch/prefault/speculative naming  (Paolo Bonzini)
"prefetch", "prefault" and "speculative" are used throughout KVM to mean the same thing. Use a single name, standardizing on "prefetch" which is already used by various functions such as direct_pte_prefetch, FNAME(prefetch_gpte), FNAME(pte_prefetch), etc. Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  KVM: cleanup allocation of rmaps and page tracking data  (David Stevens)
Unify the flags for rmaps and page tracking data, using a single flag in struct kvm_arch and a single loop to go over all the address spaces and memslots. This avoids code duplication between alloc_all_memslots_rmaps and kvm_page_track_enable_mmu_write_tracking. Signed-off-by: David Stevens <stevensd@chromium.org> [This patch is the delta between David's v2 and v3, with conflicts fixed and my own commit message. - Paolo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22  x86/fpu/xstate: Move remaining xfeature helpers to core  (Thomas Gleixner)
Now that everything is mopped up, move all the helpers and prototypes into the core header. They are not required by the outside. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211014230739.514095101@linutronix.de
2021-10-22  x86/fpu: Rework restore_regs_from_fpstate()  (Thomas Gleixner)
xfeatures_mask_fpstate() is no longer valid when dynamically enabled features come into play. Rework restore_regs_from_fpstate() so it takes a constant mask which will then be applied against the maximum feature set so that the restore operation brings all features which are not in the xsave buffer xfeature bitmap into init state. This ensures that if the previous task used a dynamically enabled feature that the task which restores has all unused components properly initialized. Cleanup the last user of xfeatures_mask_fpstate() as well and remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211014230739.461348278@linutronix.de
2021-10-22  x86/fpu: Mop up xfeatures_mask_uabi()  (Thomas Gleixner)
Use the new fpu_user_cfg to retrieve the information instead of xfeatures_mask_uabi() which will be no longer correct when dynamically enabled features become available. Using fpu_user_cfg is appropriate when setting XCOMP_BV in the init_fpstate since it has space allocated for "max_features". But, normal fpstates might only have space for default xfeatures. Since XRSTOR* derives the format of the XSAVE buffer from XCOMP_BV, this can lead to XRSTOR reading out of bounds. So when copying actively used fpstate, simply read the XCOMP_BV features bits directly out of the fpstate instead. This correction courtesy of Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211014230739.408879849@linutronix.de
2021-10-22  powerpc/pseries/mobility: ignore ibm,platform-facilities updates  (Nathan Lynch)
On VMs with NX encryption, compression, and/or RNG offload, these capabilities are described by nodes in the ibm,platform-facilities device tree hierarchy:

    $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
    /sys/firmware/devicetree/base/ibm,platform-facilities/
    ├── ibm,compression-v1
    ├── ibm,random-v1
    └── ibm,sym-encryption-v1
    3 directories

The acceleration functions that these nodes describe are not disrupted by live migration, not even temporarily. But in the post-migration ibm,update-nodes sequence, firmware always sends "delete" messages for this hierarchy, followed by an "add" directive to reconstruct it via ibm,configure-connector (log with debugging statements enabled in mobility.c):

    mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
    mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
    mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
    mobility: removing node /ibm,platform-facilities:4294967286
    ...
    mobility: added node /ibm,platform-facilities:4294967286

Note we receive a single "add" message for the entire hierarchy, and what we receive from the ibm,configure-connector sequence is the top-level platform-facilities node along with its three children. The debug message simply reports the parent node and not the whole subtree. Also, significantly, the nodes added are almost completely equivalent to the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in the leaf nodes is the only property I've observed to differ, and Linux does not use that. So in practice, the sum of update messages Linux receives for this hierarchy is equivalent to minor property updates.

We succeed in removing the original hierarchy from the device tree. But the vio bus code is ignorant of this, and does not unbind or relinquish its references. The leaf nodes, still reachable through sysfs, of course still refer to the now-freed ibm,platform-facilities parent node, which makes use-after-free possible:

    refcount_t: addition on 0; use-after-free.
    WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
    refcount_warn_saturate+0x160/0x1f0 (unreliable)
    kobject_get+0xf0/0x100
    of_node_get+0x30/0x50
    of_get_parent+0x50/0xb0
    of_fwnode_get_parent+0x54/0x90
    fwnode_count_parents+0x50/0x150
    fwnode_full_name_string+0x30/0x110
    device_node_string+0x49c/0x790
    vsnprintf+0x1c0/0x4c0
    sprintf+0x44/0x60
    devspec_show+0x34/0x50
    dev_attr_show+0x40/0xa0
    sysfs_kf_seq_show+0xbc/0x200
    kernfs_seq_show+0x44/0x60
    seq_read_iter+0x2a4/0x740
    kernfs_fop_read_iter+0x254/0x2e0
    new_sync_read+0x120/0x190
    vfs_read+0x1d0/0x240

Moreover, the "new" replacement subtree is not correctly added to the device tree, resulting in an ibm,platform-facilities parent node without the appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:

    $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
    /sys/firmware/devicetree/base/ibm,platform-facilities/
    0 directories
    $ cd /sys/devices/vio ; find . -xtype l -exec file {} +
    ./ibm,sym-encryption-v1/of_node: broken symbolic link to ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
    ./ibm,random-v1/of_node: broken symbolic link to ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
    ./ibm,compression-v1/of_node: broken symbolic link to ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1

This is because add_dt_node() -> dlpar_attach_node() attaches only the parent node returned from configure-connector, ignoring any children. This should be corrected for the general case, but fixing that won't help with the stale OF node references, which is the more urgent problem. One way to address that would be to make the drivers respond to node removal notifications, so that node references can be dropped appropriately. But this would likely force the drivers to disrupt active clients for no useful purpose: equivalent nodes are immediately re-added. And recall that the acceleration capabilities described by the nodes remain available throughout the whole process.

The solution I believe to be robust for this situation is to convert remove+add of a node with an unchanged phandle to an update of the node's properties in the Linux device tree structure. That would involve changing and adding a fair amount of code, and may take several iterations to land. Until that can be realized we have a confirmed use-after-free and the possibility of memory corruption. So add a limited workaround that discriminates on the node type, ignoring adds and removes. This should be amenable to backporting in the meantime. Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel") Cc: stable@vger.kernel.org Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
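The limited workaround boils down to a node-type check in the update path (a sketch; the helper name is hypothetical, not the function added by the patch):

    /* Ignore add/remove updates for anything in the
     * ibm,platform-facilities subtree: the replacement nodes are
     * equivalent to the deleted ones, so treating these messages as
     * no-op property updates is safe. */
    static bool node_is_platform_facilities(struct device_node *dn)
    {
        return of_node_name_eq(dn, "ibm,platform-facilities") ||
               (dn->parent &&
                of_node_name_eq(dn->parent, "ibm,platform-facilities"));
    }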
2021-10-22  powerpc/32: Don't use a struct based type for pte_t  (Christophe Leroy)
Long ago we had a config item called STRICT_MM_TYPECHECKS to build the kernel with pte_t defined as a structure in order to perform additional build checks, or with pte_t defined as a simple type in order to get simpler generated code. Commit 670eea924198 ("powerpc/mm: Always use STRICT_MM_TYPECHECKS") made the struct-based definition the only one, considering that the generated code was similar in both cases. That's right on ppc64, because the ABI is such that the content of a struct having a single simple-type element is passed in a register, but on ppc32 such a structure is passed via the stack like any other structure. Simple test function:

    pte_t test(pte_t pte)
    {
        return pte;
    }

Before this patch we get:

    c00108ec <test>:
    c00108ec:       81 24 00 00     lwz     r9,0(r4)
    c00108f0:       91 23 00 00     stw     r9,0(r3)
    c00108f4:       4e 80 00 20     blr

So, for PPC32, restore the simple-type behaviour we had before commit 670eea924198, but instead of adding a config option to activate type checking, do it when __CHECKER__ is set, so that type checking is performed by 'sparse' and provides feedback like:

    arch/powerpc/mm/pgtable.c:466:16: warning: incorrect type in return expression (different base types)
    arch/powerpc/mm/pgtable.c:466:16:    expected unsigned long
    arch/powerpc/mm/pgtable.c:466:16:    got struct pte_t [usertype] x

With this patch we now get:

    c0010890 <test>:
    c0010890:       4e 80 00 20     blr

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> [mpe: Define STRICT_MM_TYPECHECKS rather than repeating the condition] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c904599f33aaf6bb7ee2836a9ff8368509e0d78d.1631887042.git.christophe.leroy@csgroup.eu
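The resulting definition has roughly this shape (a sketch; per the [mpe:] remark, the actual patch defines STRICT_MM_TYPECHECKS from the __CHECKER__ condition rather than repeating it):

    #if defined(CONFIG_PPC64) || defined(__CHECKER__)
    #define STRICT_MM_TYPECHECKS
    #endif

    #ifdef STRICT_MM_TYPECHECKS
    typedef struct { pte_basic_t pte; } pte_t;   /* struct: sparse checks types */
    #else
    typedef pte_basic_t pte_t;                   /* plain scalar: better ppc32 code */
    #endif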
2021-10-22  powerpc/breakpoint: Cleanup  (Christophe Leroy)
cache_op_size() does exactly the same as l1_dcache_bytes(); remove it. MSR_64BIT already exists; there is no need to enclose the check in #ifdef __powerpc64__. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/6184b08088312a7d787d450eb902584e4ae77f7a.1632317816.git.christophe.leroy@csgroup.eu