summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2021-03-21x86: Fix various typos in comments, take #2Ingo Molnar
Fix another ~42 single-word typos in arch/x86/ code comments, missed a few in the first pass, in particular in .S files. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-21Merge branch 'linus' into x86/cleanups, to resolve conflictIngo Molnar
Conflicts: arch/x86/kernel/kprobes/ftrace.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-21Merge tag 'x86_urgent_for_v5.12-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "The freshest pile of shiny x86 fixes for 5.12: - Add the arch-specific mapping between physical and logical CPUs to fix devicetree-node lookups - Restore the IRQ2 ignore logic - Fix get_nr_restart_syscall() to return the correct restart syscall number. Split in a 4-patches set to avoid kABI breakage when backporting to dead kernels" * tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/apic/of: Fix CPU devicetree-node lookups x86/ioapic: Ignore IRQ2 again x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall() x86: Move TS_COMPAT back to asm/thread_info.h kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
2021-03-20Merge tag 'riscv-for-linus-5.12-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: "A handful of fixes for 5.12: - fix the SBI remote fence numbers for hypervisor fences, which had been transcribed in the wrong order in Linux. These fences are only used with the KVM patches applied. - fix a whole host of build warnings, these should have no functional change. - fix init_resources() to prevent an off-by-one error from causing an out-of-bounds array reference. This was manifesting during boot on vexriscv. - ensure the KASAN mappings are visible before proceeding to use them" * tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Correct SPARSEMEM configuration RISC-V: kasan: Declare kasan_shallow_populate() static riscv: Ensure page table writes are flushed when initializing KASAN vmalloc RISC-V: Fix out-of-bounds accesses in init_resources() riscv: Fix compilation error with Canaan SoC ftrace: Fix spelling mistake "disabed" -> "disabled" riscv: fix bugon.cocci warnings riscv: process: Fix no prototype for arch_dup_task_struct riscv: ftrace: Use ftrace_get_regs helper riscv: process: Fix no prototype for show_regs riscv: syscall_table: Reduce W=1 compilation warnings noise riscv: time: Fix no prototype for time_init riscv: ptrace: Fix no prototype warnings riscv: sbi: Fix comment of __sbi_set_timer_v01 riscv: irq: Fix no prototype warning riscv: traps: Fix no prototype warnings RISC-V: correct enum sbi_ext_rfence_fid
2021-03-19x86/apic/of: Fix CPU devicetree-node lookupsJohan Hovold
Architectures that describe the CPU topology in devicetree and do not have an identity mapping between physical and logical CPU ids must override the default implementation of arch_match_cpu_phys_id(). Failing to do so breaks CPU devicetree-node lookups using of_get_cpu_node() and of_cpu_device_node_get() which several drivers rely on. It also causes the CPU struct devices exported through sysfs to point to the wrong devicetree nodes. On x86, CPUs are described in devicetree using their APIC ids and those do not generally coincide with the logical ids, even if CPU0 typically uses APIC id 0. Add the missing implementation of arch_match_cpu_phys_id() so that CPU-node lookups work also with SMP. Apart from fixing the broken sysfs devicetree-node links this likely does not affect current users of mainline kernels on x86. Fixes: 4e07db9c8db8 ("x86/devicetree: Use CPU description from Device Tree") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210312092033.26317-1-johan@kernel.org
2021-03-19x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()Jarkko Sakkinen
Background ========== SGX enclave memory is enumerated by the processor in contiguous physical ranges called Enclave Page Cache (EPC) sections. Currently, there is a free list per section, but allocations simply target the lowest-numbered sections. This is functional, but has no NUMA awareness. Fortunately, EPC sections are covered by entries in the ACPI SRAT table. These entries allow each EPC section to be associated with a NUMA node, just like normal RAM. Solution ======== Implement a NUMA-aware enclave page allocator. Mirror the buddy allocator and maintain a list of enclave pages for each NUMA node. Attempt to allocate enclave memory first from local nodes, then fall back to other nodes. Note that the fallback is not as sophisticated as the buddy allocator and is itself not aware of NUMA distances. When a node's free list is empty, it searches for the next-highest node with enclave pages (and will wrap if necessary). This could be improved in the future. Other ===== NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node(). [ Kai Huang: Do not return NULL from __sgx_alloc_epc_page() because callers do not expect that and that leads to a NULL ptr deref. ] [ dhansen: Fix an uninitialized 'nid' variable in __sgx_alloc_epc_page() as Reported-by: kernel test robot <lkp@intel.com> to avoid any potential allocations from the wrong NUMA node or even premature allocation failures. ] Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/lkml/158188326978.894464.217282995221175417.stgit@dwillia2-desk3.amr.corp.intel.com/ Link: https://lkml.kernel.org/r/20210319040602.178558-1-kai.huang@intel.com Link: https://lkml.kernel.org/r/20210318214933.29341-1-dave.hansen@intel.com Link: https://lkml.kernel.org/r/20210317235332.362001-2-jarkko.sakkinen@intel.com
2021-03-19x86/sev-es: Optimize __sev_es_ist_enter() for better readabilityJoerg Roedel
Reorganize the code and improve the comments to make the function more readable and easier to understand. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210303141716.29223-4-joro@8bytes.org
2021-03-19x86/ioapic: Ignore IRQ2 againThomas Gleixner
Vitaly ran into an issue with hotplugging CPU0 on an Amazon instance where the matrix allocator claimed to be out of vectors. He analyzed it down to the point that IRQ2, the PIC cascade interrupt, which is supposed to be not ever routed to the IO/APIC ended up having an interrupt vector assigned which got moved during unplug of CPU0. The underlying issue is that IRQ2 for various reasons (see commit af174783b925 ("x86: I/O APIC: Never configure IRQ2" for details) is treated as a reserved system vector by the vector core code and is not accounted as a regular vector. The Amazon BIOS has an routing entry of pin2 to IRQ2 which causes the IO/APIC setup to claim that interrupt which is granted by the vector domain because there is no sanity check. As a consequence the allocation counter of CPU0 underflows which causes a subsequent unplug to fail with: [ ... ] CPU 0 has 4294967295 vectors, 589 available. Cannot disable CPU There is another sanity check missing in the matrix allocator, but the underlying root cause is that the IO/APIC code lost the IRQ2 ignore logic during the conversion to irqdomains. For almost 6 years nobody complained about this wreckage, which might indicate that this requirement could be lifted, but for any system which actually has a PIC IRQ2 is unusable by design so any routing entry has no effect and the interrupt cannot be connected to a device anyway. Due to that and due to history biased paranoia reasons restore the IRQ2 ignore logic and treat it as non existent despite a routing entry claiming otherwise. Fixes: d32932d02e18 ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210318192819.636943062@linutronix.de
2021-03-18x86/sev-es: Replace open-coded hlt-loops with sev_es_terminate()Joerg Roedel
There are a few places left in the SEV-ES C code where hlt loops and/or terminate requests are implemented. Replace them all with calls to sev_es_terminate(). Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210312123824.306-9-joro@8bytes.org
2021-03-18x86/kvm: Fix broken irq restoration in kvm_waitWanpeng Li
After commit 997acaf6b4b59c (lockdep: report broken irq restoration), the guest splatting below during boot: raw_local_irq_restore() called with IRQs enabled WARNING: CPU: 1 PID: 169 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x26/0x30 Modules linked in: hid_generic usbhid hid CPU: 1 PID: 169 Comm: systemd-udevd Not tainted 5.11.0+ #25 RIP: 0010:warn_bogus_irq_restore+0x26/0x30 Call Trace: kvm_wait+0x76/0x90 __pv_queued_spin_lock_slowpath+0x285/0x2e0 do_raw_spin_lock+0xc9/0xd0 _raw_spin_lock+0x59/0x70 lockref_get_not_dead+0xf/0x50 __legitimize_path+0x31/0x60 legitimize_root+0x37/0x50 try_to_unlazy_next+0x7f/0x1d0 lookup_fast+0xb0/0x170 path_openat+0x165/0x9b0 do_filp_open+0x99/0x110 do_sys_openat2+0x1f1/0x2e0 do_sys_open+0x5c/0x80 __x64_sys_open+0x21/0x30 do_syscall_64+0x32/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae The new consistency checking, expects local_irq_save() and local_irq_restore() to be paired and sanely nested, and therefore expects local_irq_restore() to be called with irqs disabled. The irqflags handling in kvm_wait() which ends up doing: local_irq_save(flags); safe_halt(); local_irq_restore(flags); instead triggers it. This patch fixes it by using local_irq_disable()/enable() directly. Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1615791328-2735-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18x86/sev: Do not require Hypervisor CPUID bit for SEV guestsJoerg Roedel
A malicious hypervisor could disable the CPUID intercept for an SEV or SEV-ES guest and trick it into the no-SEV boot path, where it could potentially reveal secrets. This is not an issue for SEV-SNP guests, as the CPUID intercept can't be disabled for those. Remove the Hypervisor CPUID bit check from the SEV detection code to protect against this kind of attack and add a Hypervisor bit equals zero check to the SME detection path to prevent non-encrypted guests from trying to enable SME. This handles the following cases: 1) SEV(-ES) guest where CPUID intercept is disabled. The guest will still see leaf 0x8000001f and the SEV bit. It can retrieve the C-bit and boot normally. 2) Non-encrypted guests with intercepted CPUID will check the SEV_STATUS MSR and find it 0 and will try to enable SME. This will fail when the guest finds MSR_K8_SYSCFG to be zero, as it is emulated by KVM. But we can't rely on that, as there might be other hypervisors which return this MSR with bit 23 set. The Hypervisor bit check will prevent that the guest tries to enable SME in this case. 3) Non-encrypted guests on SEV capable hosts with CPUID intercept disabled (by a malicious hypervisor) will try to boot into the SME path. This will fail, but it is also not considered a problem because non-encrypted guests have no protection against the hypervisor anyway. [ bp: s/non-SEV/non-encrypted/g ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20210312123824.306-3-joro@8bytes.org
2021-03-18Merge tag 'v5.12-rc3' into x86/sevesBorislav Petkov
Pick up dependent SEV-ES urgent changes which went into -rc3 to base new work ontop. Signed-off-by: Borislav Petkov <bp@suse.de>
2021-03-18x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_listJarkko Sakkinen
During normal runtime, the "ksgxd" daemon behaves like a version of kswapd just for SGX. But, before it starts acting like kswapd, its first job is to initialize enclave memory. Currently, the SGX boot code places each enclave page on a epc_section->init_laundry_list. Once it starts up, the ksgxd code walks over that list and populates the actual SGX page allocator. However, the per-section structures are going away to make way for the SGX NUMA allocator. There's also little need to have a per-section structure; the enclave pages are all treated identically, and they can be placed on the correct allocator list from metadata stored in the enclave page (struct sgx_epc_page) itself. Modify sgx_sanitize_section() to take a single page list instead of taking a section and deriving the list from there. Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20210317235332.362001-1-jarkko.sakkinen@intel.com
2021-03-18x86: Fix various typos in commentsIngo Molnar
Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-18Merge tag 'v5.12-rc3' into x86/cleanups, to refresh the treeIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-16ftrace: Fix spelling mistake "disabed" -> "disabled"Colin Ian King
There is a spelling mistake in a comment, fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-03-16x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTARTOleg Nesterov
Save the current_thread_info()->status of X86 in the new restart_block->arch_data field so TS_COMPAT_RESTART can be removed again. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210201174716.GA17898@redhat.com
2021-03-16x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()Oleg Nesterov
The comment in get_nr_restart_syscall() says: * The problem is that we can get here when ptrace pokes * syscall-like values into regs even if we're not in a syscall * at all. Yes, but if not in a syscall then the status & (TS_COMPAT|TS_I386_REGS_POKED) check below can't really help: - TS_COMPAT can't be set - TS_I386_REGS_POKED is only set if regs->orig_ax was changed by 32bit debugger; and even in this case get_nr_restart_syscall() is only correct if the tracee is 32bit too. Suppose that a 64bit debugger plays with a 32bit tracee and * Tracee calls sleep(2) // TS_COMPAT is set * User interrupts the tracee by CTRL-C after 1 sec and does "(gdb) call func()" * gdb saves the regs by PTRACE_GETREGS * does PTRACE_SETREGS to set %rip='func' and %orig_rax=-1 * PTRACE_CONT // TS_COMPAT is cleared * func() hits int3. * Debugger catches SIGTRAP. * Restore original regs by PTRACE_SETREGS. * PTRACE_CONT get_nr_restart_syscall() wrongly returns __NR_restart_syscall==219, the tracee calls ia32_sys_call_table[219] == sys_madvise. Add the sticky TS_COMPAT_RESTART flag which survives after return to user mode. It's going to be removed in the next step again by storing the information in the restart block. As a further cleanup it might be possible to remove also TS_I386_REGS_POKED with that. Test-case: $ cvs -d :pserver:anoncvs:anoncvs@sourceware.org:/cvs/systemtap co ptrace-tests $ gcc -o erestartsys-trap-debuggee ptrace-tests/tests/erestartsys-trap-debuggee.c --m32 $ gcc -o erestartsys-trap-debugger ptrace-tests/tests/erestartsys-trap-debugger.c -lutil $ ./erestartsys-trap-debugger Unexpected: retval 1, errno 22 erestartsys-trap-debugger: ptrace-tests/tests/erestartsys-trap-debugger.c:421 Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code") Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210201174709.GA17895@redhat.com
2021-03-14Merge tag 'x86_urgent_for_v5.12_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - A couple of SEV-ES fixes and robustifications: verify usermode stack pointer in NMI is not coming from the syscall gap, correctly track IRQ states in the #VC handler and access user insn bytes atomically in same handler as latter cannot sleep. - Balance 32-bit fast syscall exit path to do the proper work on exit and thus not confuse audit and ptrace frameworks. - Two fixes for the ORC unwinder going "off the rails" into KASAN redzones and when ORC data is missing. * tag 'x86_urgent_for_v5.12_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev-es: Use __copy_from_user_inatomic() x86/sev-es: Correctly track IRQ states in runtime #VC handler x86/sev-es: Check regs->sp is trusted before adjusting #VC IST stack x86/sev-es: Introduce ip_within_syscall_gap() helper x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls x86/unwind/orc: Silence warnings caused by missing ORC data x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2
2021-03-14Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "More fixes for ARM and x86" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: LAPIC: Advancing the timer expiration on guest initiated write KVM: x86/mmu: Skip !MMU-present SPTEs when removing SP in exclusive mode KVM: kvmclock: Fix vCPUs > 64 can't be online/hotpluged kvm: x86: annotate RCU pointers KVM: arm64: Fix exclusive limit for IPA size KVM: arm64: Reject VM creation when the default IPA size is unsupported KVM: arm64: Ensure I-cache isolation between vcpus of a same VM KVM: arm64: Don't use cbz/adr with external symbols KVM: arm64: Fix range alignment when walking page tables KVM: arm64: Workaround firmware wrongly advertising GICv2-on-v3 compatibility KVM: arm64: Rename __vgic_v3_get_ich_vtr_el2() to __vgic_v3_get_gic_config() KVM: arm64: Don't access PMSELR_EL0/PMUSERENR_EL0 when no PMU is available KVM: arm64: Turn kvm_arm_support_pmu_v3() into a static key KVM: arm64: Fix nVHE hyp panic host context restore KVM: arm64: Avoid corrupting vCPU context register in guest exit KVM: arm64: nvhe: Save the SPE context early kvm: x86: use NULL instead of using plain integer as pointer KVM: SVM: Connect 'npt' module param to KVM's internal 'npt_enabled' KVM: x86: Ensure deadline timer has truly expired before posting its IRQ
2021-03-12KVM: kvmclock: Fix vCPUs > 64 can't be online/hotplugedWanpeng Li
# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 88 On-line CPU(s) list: 0-63 Off-line CPU(s) list: 64-87 # cat /proc/cmdline BOOT_IMAGE=/vmlinuz-5.10.0-rc3-tlinux2-0050+ root=/dev/mapper/cl-root ro rd.lvm.lv=cl/root rhgb quiet console=ttyS0 LANG=en_US .UTF-8 no-kvmclock-vsyscall # echo 1 > /sys/devices/system/cpu/cpu76/online -bash: echo: write error: Cannot allocate memory The per-cpu vsyscall pvclock data pointer assigns either an element of the static array hv_clock_boot (#vCPU <= 64) or dynamically allocated memory hvclock_mem (vCPU > 64), the dynamically memory will not be allocated if kvmclock vsyscall is disabled, this can result in cpu hotpluged fails in kvmclock_setup_percpu() which returns -ENOMEM. It's broken for no-vsyscall and sometimes you end up with vsyscall disabled if the host does something strange. This patch fixes it by allocating this dynamically memory unconditionally even if vsyscall is disabled. Fixes: 6a1cac56f4 ("x86/kvm: Use __bss_decrypted attribute in shared variables") Reported-by: Zelin Deng <zelin.deng@linux.alibaba.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: stable@vger.kernel.org#v4.19-rc5+ Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1614130683-24137-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-11x86/paravirt: Have only one paravirt patch functionJuergen Gross
There is no need any longer to have different paravirt patch functions for native and Xen. Eliminate native_patch() and rename paravirt_patch_default() to paravirt_patch(). Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-15-jgross@suse.com
2021-03-11x86/paravirt: Switch functions with custom code to ALTERNATIVEJuergen Gross
Instead of using paravirt patching for custom code sequences use ALTERNATIVE for the functions with custom code replacements. Instead of patching an ud2 instruction for unpopulated vector entries into the caller site, use a simple function just calling BUG() as a replacement. Simplify the register defines for assembler paravirt calling, as there isn't much usage left. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-14-jgross@suse.com
2021-03-11x86/paravirt: Switch iret pvops to ALTERNATIVEJuergen Gross
The iret paravirt op is rather special as it is using a jmp instead of a call instruction. Switch it to ALTERNATIVE. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-12-jgross@suse.com
2021-03-11x86/paravirt: Remove no longer needed 32-bit pvops cruftJuergen Gross
PVOP_VCALL4() is only used for Xen PV, while PVOP_CALL4() isn't used at all. Keep PVOP_CALL4() for 64 bits due to symmetry reasons. This allows to remove the 32-bit definitions of those macros leading to a substantial simplification of the paravirt macros, as those were the only ones needing non-empty "pre" and "post" parameters. PVOP_CALLEE2() and PVOP_VCALLEE2() are used nowhere, so remove them. Another no longer needed case is special handling of return types larger than unsigned long. Replace that with a BUILD_BUG_ON(). DISABLE_INTERRUPTS() is used in 32-bit code only, so it can just be replaced by cli. INTERRUPT_RETURN in 32-bit code can be replaced by iret. ENABLE_INTERRUPTS is used nowhere, so it can be removed. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-10-jgross@suse.com
2021-03-11x86/paravirt: Add new features for paravirt patchingJuergen Gross
For being able to switch paravirt patching from special cased custom code sequences to ALTERNATIVE handling some X86_FEATURE_* are needed as new features. This enables to have the standard indirect pv call as the default code and to patch that with the non-Xen custom code sequence via ALTERNATIVE patching later. Make sure paravirt patching is performed before alternatives patching. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-9-jgross@suse.com
2021-03-11x86/alternative: Support not-featureJuergen Gross
Add support for alternative patching for the case a feature is not present on the current CPU. For users of ALTERNATIVE() and friends, an inverted feature is specified by applying the ALT_NOT() macro to it, e.g.: ALTERNATIVE(old, new, ALT_NOT(feature)); Committer note: The decision to encode the NOT-bit in the feature bit itself is because a future change which would make objtool generate such alternative calls, would keep the code in objtool itself fairly simple. Also, this allows for the alternative macros to support the NOT feature without having to change them. Finally, the u16 cpuid member encoding the X86_FEATURE_ flags is not an ABI so if more bits are needed, cpuid itself can be enlarged or a flags field can be added to struct alt_instr after having considered the size growth in either cases. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210311142319.4723-6-jgross@suse.com
2021-03-11x86/paravirt: Switch time pvops functions to use static_call()Juergen Gross
The time pvops functions are the only ones left which might be used in 32-bit mode and which return a 64-bit value. Switch them to use the static_call() mechanism instead of pvops, as this allows quite some simplification of the pvops implementation. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-5-jgross@suse.com
2021-03-11x86/setup: Remove unused RESERVE_BRK_ARRAY()Cao jin
Since a13f2ef168cb ("x86/xen: remove 32-bit Xen PV guest support"), RESERVE_BRK_ARRAY() has no user anymore so drop it. Update related comments too. Signed-off-by: Cao jin <jojing64@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210311083919.27530-1-jojing64@gmail.com
2021-03-10stacktrace: Move documentation for arch_stack_walk_reliable() to headerMark Brown
Currently arch_stack_walk_reliable() is documented with an identical comment in both x86 and S/390 implementations which is a bit redundant. Move this to the header and convert to kerneldoc while we're at it. Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lkml.kernel.org/r/20210309194125.652-1-broonie@kernel.org
2021-03-09x86/sev-es: Use __copy_from_user_inatomic()Joerg Roedel
The #VC handler must run in atomic context and cannot sleep. This is a problem when it tries to fetch instruction bytes from user-space via copy_from_user(). Introduce a insn_fetch_from_user_inatomic() helper which uses __copy_from_user_inatomic() to safely copy the instruction bytes to kernel memory in the #VC handler. Fixes: 5e3427a7bc432 ("x86/sev-es: Handle instruction fetches from user-space") Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # v5.10+ Link: https://lkml.kernel.org/r/20210303141716.29223-6-joro@8bytes.org
2021-03-09x86/sev-es: Correctly track IRQ states in runtime #VC handlerJoerg Roedel
Call irqentry_nmi_enter()/irqentry_nmi_exit() in the #VC handler to correctly track the IRQ state during its execution. Fixes: 0786138c78e79 ("x86/sev-es: Add a Runtime #VC Exception Handler") Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # v5.10+ Link: https://lkml.kernel.org/r/20210303141716.29223-5-joro@8bytes.org
2021-03-09x86/sev-es: Check regs->sp is trusted before adjusting #VC IST stackJoerg Roedel
The code in the NMI handler to adjust the #VC handler IST stack is needed in case an NMI hits when the #VC handler is still using its IST stack. But the check for this condition also needs to look if the regs->sp value is trusted, meaning it was not set by user-space. Extend the check to not use regs->sp when the NMI interrupted user-space code or the SYSCALL gap. Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler") Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # 5.10+ Link: https://lkml.kernel.org/r/20210303141716.29223-3-joro@8bytes.org
2021-03-08clocksource/drivers/hyper-v: Move handling of STIMER0 interruptsMichael Kelley
STIMER0 interrupts are most naturally modeled as per-cpu IRQs. But because x86/x64 doesn't have per-cpu IRQs, the core STIMER0 interrupt handling machinery is done in code under arch/x86 and Linux IRQs are not used. Adding support for ARM64 means adding equivalent code using per-cpu IRQs under arch/arm64. A better model is to treat per-cpu IRQs as the normal path (which it is for modern architectures), and the x86/x64 path as the exception. Do this by incorporating standard Linux per-cpu IRQ allocation into the main SITMER0 driver code, and bypass it in the x86/x64 exception case. For x86/x64, special case code is retained under arch/x86, but no STIMER0 interrupt handling code is needed under arch/arm64. No functional change. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://lore.kernel.org/r/1614721102-2241-11-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-03-08Drivers: hv: vmbus: Move handling of VMbus interruptsMichael Kelley
VMbus interrupts are most naturally modelled as per-cpu IRQs. But because x86/x64 doesn't have per-cpu IRQs, the core VMbus interrupt handling machinery is done in code under arch/x86 and Linux IRQs are not used. Adding support for ARM64 means adding equivalent code using per-cpu IRQs under arch/arm64. A better model is to treat per-cpu IRQs as the normal path (which it is for modern architectures), and the x86/x64 path as the exception. Do this by incorporating standard Linux per-cpu IRQ allocation into the main VMbus driver, and bypassing it in the x86/x64 exception case. For x86/x64, special case code is retained under arch/x86, but no VMbus interrupt handling code is needed under arch/arm64. No functional change. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/1614721102-2241-7-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-03-08x86/sev-es: Introduce ip_within_syscall_gap() helperJoerg Roedel
Introduce a helper to check whether an exception came from the syscall gap and use it in the SEV-ES code. Extend the check to also cover the compatibility SYSCALL entry path. Fixes: 315562c9af3d5 ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler") Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # 5.10+ Link: https://lkml.kernel.org/r/20210303141716.29223-2-joro@8bytes.org
2021-03-08x86/platform/uv: Set section block size for hubless architecturesMike Travis
Commit bbbd2b51a2aa ("x86/platform/UV: Use new set memory block size function") added a call to set the block size value that is needed by the kernel to set the boundaries in the section list. This was done for UV Hubbed systems but missed in the UV Hubless setup. Fix that mistake by adding that same set call for hubless systems, which support the same NVRAMs and Intel BIOS, thus the same problem occurs. [ bp: Massage commit message. ] Fixes: bbbd2b51a2aa ("x86/platform/UV: Use new set memory block size function") Signed-off-by: Mike Travis <mike.travis@hpe.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Reviewed-by: Russ Anderson <rja@hpe.com> Link: https://lkml.kernel.org/r/20210305162853.299892-1-mike.travis@hpe.com
2021-03-06x86/unwind/orc: Silence warnings caused by missing ORC dataJosh Poimboeuf
The ORC unwinder attempts to fall back to frame pointers when ORC data is missing for a given instruction. It sets state->error, but then tries to keep going as a best-effort type of thing. That may result in further warnings if the unwinder gets lost. Until we have some way to register generated code with the unwinder, missing ORC will be expected, and occasionally going off the rails will also be expected. So don't warn about it. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Ivan Babrou <ivan@cloudflare.com> Link: https://lkml.kernel.org/r/06d02c4bbb220bd31668db579278b0352538efbb.1612534649.git.jpoimboe@redhat.com
2021-03-06x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2Josh Poimboeuf
KASAN reserves "redzone" areas between stack frames in order to detect stack overruns. A read or write to such an area triggers a KASAN "stack-out-of-bounds" BUG. Normally, the ORC unwinder stays in-bounds and doesn't access the redzone. But sometimes it can't find ORC metadata for a given instruction. This can happen for code which is missing ORC metadata, or for generated code. In such cases, the unwinder attempts to fall back to frame pointers, as a best-effort type thing. This fallback often works, but when it doesn't, the unwinder can get confused and go off into the weeds into the KASAN redzone, triggering the aforementioned KASAN BUG. But in this case, the unwinder's confusion is actually harmless and working as designed. It already has checks in place to prevent off-stack accesses, but those checks get short-circuited by the KASAN BUG. And a BUG is a lot more disruptive than a harmless unwinder warning. Disable the KASAN checks by using READ_ONCE_NOCHECK() for all stack accesses. This finishes the job started by commit 881125bfe65b ("x86/unwind: Disable KASAN checking in the ORC unwinder"), which only partially fixed the issue. Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder") Reported-by: Ivan Babrou <ivan@cloudflare.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Ivan Babrou <ivan@cloudflare.com> Cc: stable@kernel.org Link: https://lkml.kernel.org/r/9583327904ebbbeda399eca9c56d6c7085ac20fe.1612534649.git.jpoimboe@redhat.com
2021-03-06x86/sev-es: Remove subtraction of res variableBorislav Petkov
vc_decode_insn() calls copy_from_kernel_nofault() by way of vc_fetch_insn_kernel() to fetch 15 bytes max of opcodes to decode. copy_from_kernel_nofault() returns negative on error and 0 on success. The error case is handled by returning ES_EXCEPTION. In the success case, the ret variable which contains the return value is 0 so there's no need to subtract it from MAX_INSN_SIZE when initializing the insn buffer for further decoding. Remove it. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20210223111130.16201-1-bp@alien8.de
2021-02-27Merge tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring thread rewrite from Jens Axboe: "This converts the io-wq workers to be forked off the tasks in question instead of being kernel threads that assume various bits of the original task identity. This kills > 400 lines of code from io_uring/io-wq, and it's the worst part of the code. We've had several bugs in this area, and the worry is always that we could be missing some pieces for file types doing unusual things (recent /dev/tty example comes to mind, userfaultfd reads installing file descriptors is another fun one... - both of which need special handling, and I bet it's not the last weird oddity we'll find). With these identical workers, we can have full confidence that we're never missing anything. That, in itself, is a huge win. Outside of that, it's also more efficient since we're not wasting space and code on tracking state, or switching between different states. I'm sure we're going to find little things to patch up after this series, but testing has been pretty thorough, from the usual regression suite to production. Any issue that may crop up should be manageable. There's also a nice series of further reductions we can do on top of this, but I wanted to get the meat of it out sooner rather than later. The general worry here isn't that it's fundamentally broken. Most of the little issues we've found over the last week have been related to just changes in how thread startup/exit is done, since that's the main difference between using kthreads and these kinds of threads. In fact, if all goes according to plan, I want to get this into the 5.10 and 5.11 stable branches as well. That said, the changes outside of io_uring/io-wq are: - arch setup, simple one-liner to each arch copy_thread() implementation. - Removal of net and proc restrictions for io_uring, they are no longer needed or useful" * tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block: (30 commits) io-wq: remove now unused IO_WQ_BIT_ERROR io_uring: fix SQPOLL thread handling over exec io-wq: improve manager/worker handling over exec io_uring: ensure SQPOLL startup is triggered before error shutdown io-wq: make buffered file write hashed work map per-ctx io-wq: fix race around io_worker grabbing io-wq: fix races around manager/worker creation and task exit io_uring: ensure io-wq context is always destroyed for tasks arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread() io_uring: cleanup ->user usage io-wq: remove nr_process accounting io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS net: remove cmsg restriction from io_uring based send/recvmsg calls Revert "proc: don't allow async path resolution of /proc/self components" Revert "proc: don't allow async path resolution of /proc/thread-self components" io_uring: move SQPOLL thread io-wq forked worker io-wq: make io_wq_fork_thread() available to other users io-wq: only remove worker from free_list, if it was there io_uring: remove io_identity io_uring: remove any grabbing of context ...
2021-02-24Merge tag 'x86-entry-2021-02-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 irq entry updates from Thomas Gleixner: "The irq stack switching was moved out of the ASM entry code in course of the entry code consolidation. It ended up being suboptimal in various ways. This reworks the X86 irq stack handling: - Make the stack switching inline so the stackpointer manipulation is not longer at an easy to find place. - Get rid of the unnecessary indirect call. - Avoid the double stack switching in interrupt return and reuse the interrupt stack for softirq handling. - A objtool fix for CONFIG_FRAME_POINTER=y builds where it got confused about the stack pointer manipulation" * tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix stack-swizzle for FRAME_POINTER=y um: Enforce the usage of asm-generic/softirq_stack.h x86/softirq/64: Inline do_softirq_own_stack() softirq: Move do_softirq_own_stack() to generic asm header softirq: Move __ARCH_HAS_DO_SOFTIRQ to Kconfig x86: Select CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK x86/softirq: Remove indirection in do_softirq_own_stack() x86/entry: Use run_sysvec_on_irqstack_cond() for XEN upcall x86/entry: Convert device interrupts to inline stack switching x86/entry: Convert system vectors to irq stack macro x86/irq: Provide macro for inlining irq stack switching x86/apic: Split out spurious handling code x86/irq/64: Adjust the per CPU irq stack pointer by 8 x86/irq: Sanitize irq stack tracking x86/entry: Fix instrumentation annotation
2021-02-24Merge tag 'sfi-removal-5.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull Simple Firmware Interface (SFI) support removal from Rafael Wysocki: "Drop support for depercated platforms using SFI, drop the entire support for SFI that has been long deprecated too and make some janitorial changes on top of that (Andy Shevchenko)" * tag 'sfi-removal-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: x86/platform/intel-mid: Update Copyright year and drop file names x86/platform/intel-mid: Remove unused header inclusion in intel-mid.h x86/platform/intel-mid: Drop unused __intel_mid_cpu_chip and Co. x86/platform/intel-mid: Get rid of intel_scu_ipc_legacy.h x86/PCI: Describe @reg for type1_access_ok() x86/PCI: Get rid of custom x86 model comparison sfi: Remove framework for deprecated firmware cpufreq: sfi-cpufreq: Remove driver for deprecated firmware media: atomisp: Remove unused header mfd: intel_msic: Remove driver for deprecated platform x86/apb_timer: Remove driver for deprecated platform x86/platform/intel-mid: Remove unused leftovers (vRTC) x86/platform/intel-mid: Remove unused leftovers (msic) x86/platform/intel-mid: Remove unused leftovers (msic_thermal) x86/platform/intel-mid: Remove unused leftovers (msic_power_btn) x86/platform/intel-mid: Remove unused leftovers (msic_gpio) x86/platform/intel-mid: Remove unused leftovers (msic_battery) x86/platform/intel-mid: Remove unused leftovers (msic_ocd) x86/platform/intel-mid: Remove unused leftovers (msic_audio) platform/x86: intel_scu_wdt: Drop mistakenly added const
2021-02-24Merge tag 'char-misc-5.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the large set of char/misc/whatever driver subsystem updates for 5.12-rc1. Over time it seems like this tree is collecting more and more tiny driver subsystems in one place, making it easier for those maintainers, which is why this is getting larger. Included in here are: - coresight driver updates - habannalabs driver updates - virtual acrn driver addition (proper acks from the x86 maintainers) - broadcom misc driver addition - speakup driver updates - soundwire driver updates - fpga driver updates - amba driver updates - mei driver updates - vfio driver updates - greybus driver updates - nvmeem driver updates - phy driver updates - mhi driver updates - interconnect driver udpates - fsl-mc bus driver updates - random driver fix - some small misc driver updates (rtsx, pvpanic, etc.) All of these have been in linux-next for a while, with the only reported issue being a merge conflict due to the dfl_device_id addition from the fpga subsystem in here" * tag 'char-misc-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (311 commits) spmi: spmi-pmic-arb: Fix hw_irq overflow Documentation: coresight: Add PID tracing description coresight: etm-perf: Support PID tracing for kernel at EL2 coresight: etm-perf: Clarify comment on perf options ACRN: update MAINTAINERS: mailing list is subscribers-only regmap: sdw-mbq: use MODULE_LICENSE("GPL") regmap: sdw: use no_pm routines for SoundWire 1.2 MBQ regmap: sdw: use _no_pm functions in regmap_read/write soundwire: intel: fix possible crash when no device is detected MAINTAINERS: replace my with email with replacements mhi: Fix double dma free uapi: map_to_7segment: Update example in documentation uio: uio_pci_generic: don't fail probe if pdev->irq equals to IRQ_NOTCONNECTED drivers/misc/vmw_vmci: restrict too big queue size in qp_host_alloc_queue firewire: replace tricky statement by two simple ones vme: make remove callback return void firmware: google: make coreboot driver's remove callback return void firmware: xilinx: Use explicit values for all enum values sample/acrn: Introduce a sample of HSM ioctl interface usage virt: acrn: Introduce an interface for Service VM to control vCPU ...
2021-02-23Merge tag 'objtool-core-2021-02-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Thomas Gleixner: - Make objtool work for big-endian cross compiles - Make stack tracking via stack pointer memory operations match push/pop semantics to prepare for architectures w/o PUSH/POP instructions. - Add support for analyzing alternatives - Improve retpoline detection and handling - Improve assembly code coverage on x86 - Provide support for inlined stack switching * tag 'objtool-core-2021-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) objtool: Support stack-swizzle objtool,x86: Additionally decode: mov %rsp, (%reg) x86/unwind/orc: Change REG_SP_INDIRECT x86/power: Support objtool validation in hibernate_asm_64.S x86/power: Move restore_registers() to top of the file x86/power: Annotate indirect branches as safe x86/acpi: Support objtool validation in wakeup_64.S x86/acpi: Annotate indirect branch as safe x86/ftrace: Support objtool vmlinux.o validation in ftrace_64.S x86/xen/pvh: Annotate indirect branch as safe x86/xen: Support objtool vmlinux.o validation in xen-head.S x86/xen: Support objtool validation in xen-asm.S objtool: Add xen_start_kernel() to noreturn list objtool: Combine UNWIND_HINT_RET_OFFSET and UNWIND_HINT_FUNC objtool: Add asm version of STACK_FRAME_NON_STANDARD objtool: Assume only ELF functions do sibling calls x86/ftrace: Add UNWIND_HINT_FUNC annotation for ftrace_stub objtool: Support retpoline jump detection for vmlinux.o objtool: Fix ".cold" section suffix check for newer versions of GCC objtool: Fix retpoline detection in asm code ...
2021-02-21arch: setup PF_IO_WORKER threads like PF_KTHREADJens Axboe
PF_IO_WORKER are kernel threads too, but they aren't PF_KTHREAD in the sense that we don't assign ->set_child_tid with our own structure. Just ensure that every arch sets up the PF_IO_WORKER threads like kthreads in the arch implementation of copy_thread(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "x86: - Support for userspace to emulate Xen hypercalls - Raise the maximum number of user memslots - Scalability improvements for the new MMU. Instead of the complex "fast page fault" logic that is used in mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent, but the code that can run against page faults is limited. Right now only page faults take the lock for reading; in the future this will be extended to some cases of page table destruction. I hope to switch the default MMU around 5.12-rc3 (some testing was delayed due to Chinese New Year). - Cleanups for MAXPHYADDR checks - Use static calls for vendor-specific callbacks - On AMD, use VMLOAD/VMSAVE to save and restore host state - Stop using deprecated jump label APIs - Workaround for AMD erratum that made nested virtualization unreliable - Support for LBR emulation in the guest - Support for communicating bus lock vmexits to userspace - Add support for SEV attestation command - Miscellaneous cleanups PPC: - Support for second data watchpoint on POWER10 - Remove some complex workarounds for buggy early versions of POWER9 - Guest entry/exit fixes ARM64: - Make the nVHE EL2 object relocatable - Cleanups for concurrent translation faults hitting the same page - Support for the standard TRNG hypervisor call - A bunch of small PMU/Debug fixes - Simplification of the early init hypercall handling Non-KVM changes (with acks): - Detection of contended rwlocks (implemented only for qrwlocks, because KVM only needs it for x86) - Allow __DISABLE_EXPORTS from assembly code - Provide a saner follow_pfn replacements for modules" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits) KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes KVM: selftests: Don't bother mapping GVA for Xen shinfo test KVM: selftests: Fix hex vs. decimal snafu in Xen test KVM: selftests: Fix size of memslots created by Xen tests KVM: selftests: Ignore recently added Xen tests' build output KVM: selftests: Add missing header file needed by xAPIC IPI tests KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static locking/arch: Move qrwlock.h include after qspinlock.h KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2 KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path KVM: PPC: remove unneeded semicolon KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest KVM: PPC: Book3S HV: Fix radix guest SLB side channel KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR ...
2021-02-21Merge tag 'hyperv-next-signed-20210216' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull Hyper-V updates from Wei Liu: - VMBus hardening patches from Andrea Parri and Andres Beltran. - Patches to make Linux boot as the root partition on Microsoft Hypervisor from Wei Liu. - One patch to add a new sysfs interface to support hibernation on Hyper-V from Dexuan Cui. - Two miscellaneous clean-up patches from Colin and Gustavo. * tag 'hyperv-next-signed-20210216' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (31 commits) Revert "Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer" iommu/hyperv: setup an IO-APIC IRQ remapping domain for root partition x86/hyperv: implement an MSI domain for root partition asm-generic/hyperv: import data structures for mapping device interrupts asm-generic/hyperv: introduce hv_device_id and auxiliary structures asm-generic/hyperv: update hv_interrupt_entry asm-generic/hyperv: update hv_msi_entry x86/hyperv: implement and use hv_smp_prepare_cpus x86/hyperv: provide a bunch of helper functions ACPI / NUMA: add a stub function for node_to_pxm() x86/hyperv: handling hypercall page setup for root x86/hyperv: extract partition ID from Microsoft Hypervisor if necessary x86/hyperv: allocate output arg pages if required clocksource/hyperv: use MSR-based access if running as root Drivers: hv: vmbus: skip VMBus initialization if Linux is root x86/hyperv: detect if Linux is the root partition asm-generic/hyperv: change HV_CPU_POWER_MANAGEMENT to HV_CPU_MANAGEMENT hv: hyperv.h: Replace one-element array with flexible-array in struct icmsg_negotiate hv_netvsc: Restrict configurations on isolated guests Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests ...
2021-02-21Merge tag 'perf-core-2021-02-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance event updates from Ingo Molnar: - Add CPU-PMU support for Intel Sapphire Rapids CPUs - Extend the perf ABI with PERF_SAMPLE_WEIGHT_STRUCT, to offer two-parameter sampling event feedback. Not used yet, but is intended for Golden Cove CPU-PMU, which can provide both the instruction latency and the cache latency information for memory profiling events. - Remove experimental, default-disabled perfmon-v4 counter_freezing support that could only be enabled via a boot option. The hardware is hopelessly broken, we'd like to make sure nobody starts relying on this, as it would only end in tears. - Fix energy/power events on Intel SPR platforms - Simplify the uprobes resume_execution() logic - Misc smaller fixes. * tag 'perf-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Fix psys-energy event on Intel SPR platform perf/x86/rapl: Only check lower 32bits for RAPL energy counters perf/x86/rapl: Add msr mask support perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters perf/x86/intel: Add perf core PMU support for Sapphire Rapids perf/x86/intel: Filter unsupported Topdown metrics event perf/x86/intel: Factor out intel_update_topdown_event() perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT perf/intel: Remove Perfmon-v4 counter_freezing support x86/perf: Use static_call for x86_pmu.guest_get_msrs perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info perf/x86/intel/uncore: Store the logical die id instead of the physical die id. x86/kprobes: Do not decode opcode in resume_execution()
2021-02-21Merge tag 'sched-core-2021-02-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core scheduler updates: - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the preempt=none/voluntary/full boot options (default: full), to allow distros to build a PREEMPT kernel but fall back to close to PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via a boot time selection. There's also the /debug/sched_debug switch to do this runtime. This feature is implemented via runtime patching (a new variant of static calls). The scope of the runtime patching can be best reviewed by looking at the sched_dynamic_update() function in kernel/sched/core.c. ( Note that the dynamic none/voluntary mode isn't 100% identical, for example preempt-RCU is available in all cases, plus the preempt count is maintained in all models, which has runtime overhead even with the code patching. ) The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast majority of distributions, are supposed to be unaffected. - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that was found via rcutorture triggering a hang. The bug is that rcu_idle_enter() may wake up a NOCB kthread, but this happens after the last generic need_resched() check. Some cpuidle drivers fix it by chance but many others don't. In true 2020 fashion the original bug fix has grown into a 5-patch scheduler/RCU fix series plus another 16 RCU patches to address the underlying issue of missed preemption events. These are the initial fixes that should fix current incarnations of the bug. - Clean up rbtree usage in the scheduler, by providing & using the following consistent set of rbtree APIs: partial-order; less() based: - rb_add(): add a new entry to the rbtree - rb_add_cached(): like rb_add(), but for a rb_root_cached total-order; cmp() based: - rb_find(): find an entry in an rbtree - rb_find_add(): find an entry, and add if not found - rb_find_first(): find the first (leftmost) matching entry - rb_next_match(): continue from rb_find_first() - rb_for_each(): iterate a sub-tree using the previous two - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a single pass. This is a 4-commit series where each commit improves one aspect of the idle sibling scan logic. - Improve the cpufreq cooling driver by getting the effective CPU utilization metrics from the scheduler - Improve the fair scheduler's active load-balancing logic by reducing the number of active LB attempts & lengthen the load-balancing interval. This improves stress-ng mmapfork performance. - Fix CFS's estimated utilization (util_est) calculation bug that can result in too high utilization values Misc updates & fixes: - Fix the HRTICK reprogramming & optimization feature - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code - Reduce dl_add_task_root_domain() overhead - Fix uprobes refcount bug - Process pending softirqs in flush_smp_call_function_from_idle() - Clean up task priority related defines, remove *USER_*PRIO and USER_PRIO() - Simplify the sched_init_numa() deduplication sort - Documentation updates - Fix EAS bug in update_misfit_status(), which degraded the quality of energy-balancing - Smaller cleanups" * tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) sched,x86: Allow !PREEMPT_DYNAMIC entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point entry: Explicitly flush pending rcuog wakeup before last rescheduling point rcu/nocb: Trigger self-IPI on late deferred wake up before user resume rcu/nocb: Perform deferred wake up before last idle's need_resched() check rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers sched/features: Distinguish between NORMAL and DEADLINE hrtick sched/features: Fix hrtick reprogramming sched/deadline: Reduce rq lock contention in dl_add_task_root_domain() uprobes: (Re)add missing get_uprobe() in __find_uprobe() smp: Process pending softirqs in flush_smp_call_function_from_idle() sched: Harden PREEMPT_DYNAMIC static_call: Allow module use without exposing static_call_key sched: Add /debug/sched_preempt preempt/dynamic: Support dynamic preempt with preempt= boot option preempt/dynamic: Provide irqentry_exit_cond_resched() static call preempt/dynamic: Provide preempt_schedule[_notrace]() static calls preempt/dynamic: Provide cond_resched() and might_resched() static calls preempt: Introduce CONFIG_PREEMPT_DYNAMIC static_call: Provide DEFINE_STATIC_CALL_RET0() ...