summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2021-06-03Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-03x86/alternative: Optimize single-byte NOPs at an arbitrary positionBorislav Petkov
Up until now the assumption was that an alternative patching site would have some instructions at the beginning and trailing single-byte NOPs (0x90) padding. Therefore, the patching machinery would go and optimize those single-byte NOPs into longer ones. However, this assumption is broken on 32-bit when code like hv_do_hypercall() in hyperv_init() would use the ratpoline speculation killer CALL_NOSPEC. The 32-bit version of that macro would align certain insns to 16 bytes, leading to the compiler issuing a one or more single-byte NOPs, depending on the holes it needs to fill for alignment. That would lead to the warning in optimize_nops() to fire: ------------[ cut here ]------------ Not a NOP at 0xc27fb598 WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:211 optimize_nops.isra.13 due to that function verifying whether all of the following bytes really are single-byte NOPs. Therefore, carve out the NOP padding into a separate function and call it for each NOP range beginning with a single-byte NOP. Fixes: 23c1ad538f4f ("x86/alternatives: Optimize optimize_nops()") Reported-by: Richard Narron <richard@aaazen.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=213301 Link: https://lkml.kernel.org/r/20210601212125.17145-1-bp@alien8.de
2021-06-03x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()Thomas Gleixner
While digesting the XSAVE-related horrors which got introduced with the supervisor/user split, the recent addition of ENQCMD-related functionality got on the radar and turned out to be similarly broken. update_pasid(), which is only required when X86_FEATURE_ENQCMD is available, is invoked from two places: 1) From switch_to() for the incoming task 2) Via a SMP function call from the IOMMU/SMV code #1 is half-ways correct as it hacks around the brokenness of get_xsave_addr() by enforcing the state to be 'present', but all the conditionals in that code are completely pointless for that. Also the invocation is just useless overhead because at that point it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task and all of this can be handled at return to user space. #2 is broken beyond repair. The comment in the code claims that it is safe to invoke this in an IPI, but that's just wishful thinking. FPU state of a running task is protected by fregs_lock() which is nothing else than a local_bh_disable(). As BH-disabled regions run usually with interrupts enabled the IPI can hit a code section which modifies FPU state and there is absolutely no guarantee that any of the assumptions which are made for the IPI case is true. Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is invoked with a NULL pointer argument, so it can hit a completely unrelated task and unconditionally force an update for nothing. Worse, it can hit a kernel thread which operates on a user space address space and set a random PASID for it. The offending commit does not cleanly revert, but it's sufficient to force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid() code to make this dysfunctional all over the place. Anything more complex would require more surgery and none of the related functions outside of the x86 core code are blatantly wrong, so removing those would be overkill. As nothing enables the PASID bit in the IA32_XSS MSR yet, which is required to make this actually work, this cannot result in a regression except for related out of tree train-wrecks, but they are broken already today. Fixes: 20f0afd1fb3d ("x86/mmu: Allocate/free a PASID") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
2021-06-03kprobes: Do not increment probe miss count in the fault handlerNaveen N. Rao
Kprobes has a counter 'nmissed', that is used to count the number of times a probe handler was not called. This generally happens when we hit a kprobe while handling another kprobe. However, if one of the probe handlers causes a fault, we are currently incrementing 'nmissed'. The comment in fault handler indicates that this can be used to account faults taken by the probe handlers. But, this has never been the intention as is evident from the comment above 'nmissed' in 'struct kprobe': /*count the number of times this probe was temporarily disarmed */ unsigned long nmissed; Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20210601120150.672652-1-naveen.n.rao@linux.vnet.ibm.com
2021-06-03x86/alternative: Align insn bytes verticallyBorislav Petkov
For easier inspection which bytes have changed. For example: feat: 7*32+12, old: (__x86_indirect_thunk_r10+0x0/0x20 (ffffffff81c02480) len: 17), repl: (ffffffff897813aa, len: 17) ffffffff81c02480: old_insn: 41 ff e2 90 90 90 90 90 90 90 90 90 90 90 90 90 90 ffffffff897813aa: rpl_insn: e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 14 24 c3 ffffffff81c02480: final_insn: e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 14 24 c3 No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210601193713.16190-1-bp@alien8.de
2021-06-02x86/hyperv: fix logical processor creationPraveen Kumar
Microsoft Hypervisor expects the logical processor index to be the same as CPU's index during logical processor creation. Using cpu_physical_id confuses hypervisor's scheduler. That causes the root partition not boot when core scheduler is used. This patch removes the call to cpu_physical_id and uses the CPU index directly for bringing up logical processor. This scheme works for both classic scheduler and core scheduler. Fixes: 333abaf5abb3 (x86/hyperv: implement and use hv_smp_prepare_cpus) Signed-off-by: Praveen Kumar <kumarpraveen@linux.microsoft.com> Link: https://lore.kernel.org/r/20210531074046.113452-1-kumarpraveen@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-06-01perf/x86/rapl: Use CPUID bit on AMD and Hygon partsAndrew Cooper
AMD and Hygon CPUs have a CPUID bit for RAPL. Drop the fam17h suffix as it is stale already. Make use of this instead of a model check to work more nicely in virtual environments where RAPL typically isn't available. [ bp: drop the ../cpu/powerflags.c hunk which is superfluous as the "rapl" bit name appears already in flags. ] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210514135920.16093-1-andrew.cooper3@citrix.com
2021-06-01x86,kprobes: WARN if kprobes tries to handle a faultPeter Zijlstra
With the removal of kprobe::handle_fault there is no reason left that kprobe_page_fault() would ever return true on x86, make sure it doesn't happen by accident. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20210525073213.660594073@infradead.org
2021-06-01kprobes: Remove kprobe::fault_handlerPeter Zijlstra
The reason for kprobe::fault_handler(), as given by their comment: * We come here because instructions in the pre/post * handler caused the page_fault, this could happen * if handler tries to access user space by * copy_from_user(), get_user() etc. Let the * user-specified handler try to fix it first. Is just plain bad. Those other handlers are ran from non-preemptible context and had better use _nofault() functions. Also, there is no upstream usage of this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20210525073213.561116662@infradead.org
2021-06-01perf/x86/intel/uncore: Fix M2M event umask for Ice Lake serverKan Liang
Perf tool errors out with the latest event list for the Ice Lake server. event syntax error: 'unc_m2m_imc_reads.to_pmm' \___ value too big for format, maximum is 255 The same as the Snow Ridge server, the M2M uncore unit in the Ice Lake server has the unit mask extension field as well. Fixes: 2b3b76b5ec67 ("perf/x86/intel/uncore: Add Ice Lake server uncore support") Reported-by: Jin Yao <yao.jin@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1622552943-119174-1-git-send-email-kan.liang@linux.intel.com
2021-05-31x86/thermal: Fix LVT thermal setup for SMI delivery modeBorislav Petkov
There are machines out there with added value crap^WBIOS which provide an SMI handler for the local APIC thermal sensor interrupt. Out of reset, the BSP on those machines has something like 0x200 in that APIC register (timestamps left in because this whole issue is timing sensitive): [ 0.033858] read lvtthmr: 0x330, val: 0x200 which means: - bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled - bits [10:8] have 010b which means SMI delivery mode. Now, later during boot, when the kernel programs the local APIC, it soft-disables it temporarily through the spurious vector register: setup_local_APIC: ... /* * If this comes from kexec/kcrash the APIC might be enabled in * SPIV. Soft disable it before doing further initialization. */ value = apic_read(APIC_SPIV); value &= ~APIC_SPIV_APIC_ENABLED; apic_write(APIC_SPIV, value); which means (from the SDM): "10.4.7.2 Local APIC State After It Has Been Software Disabled ... * The mask bits for all the LVT entries are set. Attempts to reset these bits will be ignored." And this happens too: [ 0.124111] APIC: Switch to symmetric I/O mode setup [ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0 [ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0 This results in CPU 0 soft lockups depending on the placement in time when the APIC soft-disable happens. Those soft lockups are not 100% reproducible and the reason for that can only be speculated as no one tells you what SMM does. Likely, it confuses the SMM code that the APIC is disabled and the thermal interrupt doesn't doesn't fire at all, leading to CPU 0 stuck in SMM forever... Now, before 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()") due to how the APIC_LVTTHMR was read before APIC initialization in mcheck_intel_therm_init(), it would read the value with the mask bit 16 clear and then intel_init_thermal() would replicate it onto the APs and all would be peachy - the thermal interrupt would remain enabled. But that commit moved that reading to a later moment in intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too late and with its interrupt mask bit set. Thus, revert back to the old behavior of reading the thermal LVT register before the APIC gets initialized. Fixes: 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()") Reported-by: James Feeney <james@nurealm.net> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
2021-05-31x86/cstate: Allow ACPI C1 FFH MWAIT use on Hygon systemsPu Wen
Hygon systems support the MONITOR/MWAIT instructions and these can be used for ACPI C1 in the same way as on AMD and Intel systems. The BIOS declares a C1 state in _CST to use FFH and CPUID_Fn00000005_EDX is non-zero on Hygon systems. Allow ffh_cstate_init() to succeed on Hygon systems to default using FFH MWAIT instead of HALT for ACPI C1. Signed-off-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210528081417.31474-1-puwen@hygon.cn
2021-05-31perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1Kan Liang
A kernel WARNING may be triggered when setting maxcpus=1. The uncore counters are Die-scope. When probing a PCI device, only the BUS information can be retrieved. The uncore driver has to maintain a mapping table used to calculate the logical Die ID from a given BUS#. Before the patch ba9506be4e40, the mapping table stores the mapping information from the BUS# -> a Physical Socket ID. To calculate the logical die ID, perf does, - In snbep_pci2phy_map_init(), retrieve the BUS# -> a Physical Socket ID from the UBOX PCI configure space. - Calculate the mapping information (a BUS# -> a Physical Socket ID) for the other PCI BUS. - In the uncore_pci_probe(), get the physical Socket ID from a given BUS and the mapping table. - Calculate the logical Die ID Since only the logical Die ID is required, with the patch ba9506be4e40, the mapping table stores the mapping information from the BUS# -> a logical Die ID. Now perf does, - In snbep_pci2phy_map_init(), retrieve the BUS# -> a Physical Socket ID from the UBOX PCI configure space. - Calculate the logical Die ID - Calculate the mapping information (a BUS# -> a logical Die ID) for the other PCI BUS. - In the uncore_pci_probe(), get the logical die ID from a given BUS and the mapping table. When calculating the logical Die ID, -1 may be returned, especially when maxcpus=1. Here, -1 means the logical Die ID is not found. But when calculating the mapping information for the other PCI BUS, -1 indicates that it's the other PCI BUS that requires the calculation of the mapping. The driver will mistakenly do the calculation. Uses the -ENODEV to indicate the case which the logical Die ID is not found. The driver will not mess up the mapping table anymore. Fixes: ba9506be4e40 ("perf/x86/intel/uncore: Store the logical die id instead of the physical die id.") Reported-by: John Donnelly <john.p.donnelly@oracle.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Donnelly <john.p.donnelly@oracle.com> Tested-by: John Donnelly <john.p.donnelly@oracle.com> Link: https://lkml.kernel.org/r/1622037527-156028-1-git-send-email-kan.liang@linux.intel.com
2021-05-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "ARM fixes: - Another state update on exit to userspace fix - Prevent the creation of mixed 32/64 VMs - Fix regression with irqbypass not restarting the guest on failed connect - Fix regression with debug register decoding resulting in overlapping access - Commit exception state on exit to usrspace - Fix the MMU notifier return values - Add missing 'static' qualifiers in the new host stage-2 code x86 fixes: - fix guest missed wakeup with assigned devices - fix WARN reported by syzkaller - do not use BIT() in UAPI headers - make the kvm_amd.avic parameter bool PPC fixes: - make halt polling heuristics consistent with other architectures selftests: - various fixes - new performance selftest memslot_perf_test - test UFFD minor faults in demand_paging_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits) selftests: kvm: fix overlapping addresses in memslot_perf_test KVM: X86: Kill off ctxt->ud KVM: X86: Fix warning caused by stale emulation context KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception KVM: x86/mmu: Fix comment mentioning skip_4k KVM: VMX: update vcpu posted-interrupt descriptor when assigning device KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK KVM: x86: add start_assignment hook to kvm_x86_ops KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch selftests: kvm: do only 1 memslot_perf_test run by default KVM: X86: Use _BITUL() macro in UAPI headers KVM: selftests: add shared hugetlbfs backing source type KVM: selftests: allow using UFFD minor faults for demand paging KVM: selftests: create alias mappings when using shared memory KVM: selftests: add shmem backing source type KVM: selftests: refactor vm_mem_backing_src_type flags KVM: selftests: allow different backing source types KVM: selftests: compute correct demand paging size KVM: selftests: simplify setup_demand_paging error handling KVM: selftests: Print a message if /dev/kvm is missing ...
2021-05-29x86/apic: Mark _all_ legacy interrupts when IO/APIC is missingThomas Gleixner
PIC interrupts do not support affinity setting and they can end up on any online CPU. Therefore, it's required to mark the associated vectors as system-wide reserved. Otherwise, the corresponding irq descriptors are copied to the secondary CPUs but the vectors are not marked as assigned or reserved. This works correctly for the IO/APIC case. When the IO/APIC is disabled via config, kernel command line or lack of enumeration then all legacy interrupts are routed through the PIC, but nothing marks them as system-wide reserved vectors. As a consequence, a subsequent allocation on a secondary CPU can result in allocating one of these vectors, which triggers the BUG() in apic_update_vector() because the interrupt descriptor slot is not empty. Imran tried to work around that by marking those interrupts as allocated when a CPU comes online. But that's wrong in case that the IO/APIC is available and one of the legacy interrupts, e.g. IRQ0, has been switched to PIC mode because then marking them as allocated will fail as they are already marked as system vectors. Stay consistent and update the legacy vectors after attempting IO/APIC initialization and mark them as system vectors in case that no IO/APIC is available. Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Reported-by: Imran Khan <imran.f.khan@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com
2021-05-28KVM: X86: Kill off ctxt->udWanpeng Li
ctxt->ud is consumed only by x86_decode_insn(), we can kill it off by passing emulation_type to x86_decode_insn() and dropping ctxt->ud altogether. Tracking that info in ctxt for literally one call is silly. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
2021-05-28KVM: X86: Fix warning caused by stale emulation contextWanpeng Li
Reported by syzkaller: WARNING: CPU: 7 PID: 10526 at linux/arch/x86/kvm//x86.c:7621 x86_emulate_instruction+0x41b/0x510 [kvm] RIP: 0010:x86_emulate_instruction+0x41b/0x510 [kvm] Call Trace: kvm_mmu_page_fault+0x126/0x8f0 [kvm] vmx_handle_exit+0x11e/0x680 [kvm_intel] vcpu_enter_guest+0xd95/0x1b40 [kvm] kvm_arch_vcpu_ioctl_run+0x377/0x6a0 [kvm] kvm_vcpu_ioctl+0x389/0x630 [kvm] __x64_sys_ioctl+0x8e/0xd0 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae Commit 4a1e10d5b5d8 ("KVM: x86: handle hardware breakpoints during emulation()) adds hardware breakpoints check before emulation the instruction and parts of emulation context initialization, actually we don't have the EMULTYPE_NO_DECODE flag here and the emulation context will not be reused. Commit c8848cee74ff ("KVM: x86: set ctxt->have_exception in x86_decode_insn()) triggers the warning because it catches the stale emulation context has #UD, however, it is not during instruction decoding which should result in EMULATION_FAILED. This patch fixes it by moving the second part emulation context initialization into init_emulate_ctxt() and before hardware breakpoints check. The ctxt->ud will be dropped by a follow-up patch. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=134683fdd00000 Reported-by: syzbot+71271244f206d17f6441@syzkaller.appspotmail.com Fixes: 4a1e10d5b5d8 (KVM: x86: handle hardware breakpoints during emulation) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <1622160097-37633-1-git-send-email-wanpengli@tencent.com>
2021-05-28KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interceptionYuan Yao
The kvm_get_linear_rip() handles x86/long mode cases well and has better readability, __kvm_set_rflags() also use the paired function kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip set in kvm_arch_vcpu_ioctl_set_guest_debug(), so change the "CS.BASE + RIP" code in kvm_arch_vcpu_ioctl_set_guest_debug() and handle_exception_nmi() to this one. Signed-off-by: Yuan Yao <yuan.yao@intel.com> Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-28x86/mce: Include a MCi_MISC value in faked mce logsTony Luck
When BIOS reports memory errors to Linux using the ACPI/APEI error reporting method Linux creates a "struct mce" to pass to the normal reporting code path. The constructed record doesn't include a value for the "misc" field of the structure, and so mce_usable_address() says this record doesn't include a valid address. Net result is that functions like uc_decode_notifier() will just ignore this record instead of taking action to offline a page. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210527222846.931851-1-tony.luck@intel.com
2021-05-27x86/pci: Return true/false (not 1/0) from bool functionsYang Li
Return boolean values ("true" or "false") instead of 1 or 0 from bool functions. This fixes the following warnings from coccicheck: ./arch/x86/pci/mmconfig-shared.c:464:9-10: WARNING: return of 0/1 in function 'is_mmconf_reserved' with return type bool ./arch/x86/pci/mmconfig-shared.c:493:5-6: WARNING: return of 0/1 in function 'is_mmconf_reserved' with return type bool ./arch/x86/pci/mmconfig-shared.c:501:9-10: WARNING: return of 0/1 in function 'is_mmconf_reserved' with return type bool ./arch/x86/pci/mmconfig-shared.c:522:5-6: WARNING: return of 0/1 in function 'is_mmconf_reserved' with return type bool Link: https://lore.kernel.org/r/1615794000-102771-1-git-send-email-yang.lee@linux.alibaba.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Krzysztof Wilczyński <kw@linux.com>
2021-05-27x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank typesMuralidhara M K
Add the (HWID, MCATYPE) tuples and names for new SMCA bank types. Also, add their respective error descriptions to the MCE decoding module edac_mce_amd. Also while at it, optimize the string names for some SMCA banks. [ bp: Drop repeated comments, explain why UMC_V2 is a separate entry. ] Signed-off-by: Muralidhara M K <muralimk@amd.com> Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20210526164601.66228-1-nchatrad@amd.com
2021-05-27KVM: x86/mmu: Fix comment mentioning skip_4kDavid Matlack
This comment was left over from a previous version of the patch that introduced wrprot_gfn_range, when skip_4k was passed in instead of min_level. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210526163227.3113557-1-dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27KVM: VMX: update vcpu posted-interrupt descriptor when assigning deviceMarcelo Tosatti
For VMX, when a vcpu enters HLT emulation, pi_post_block will: 1) Add vcpu to per-cpu list of blocked vcpus. 2) Program the posted-interrupt descriptor "notification vector" to POSTED_INTR_WAKEUP_VECTOR With interrupt remapping, an interrupt will set the PIR bit for the vector programmed for the device on the CPU, test-and-set the ON bit on the posted interrupt descriptor, and if the ON bit is clear generate an interrupt for the notification vector. This way, the target CPU wakes upon a device interrupt and wakes up the target vcpu. Problem is that pi_post_block only programs the notification vector if kvm_arch_has_assigned_device() is true. Its possible for the following to happen: 1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false, notification vector is not programmed 2) device is assigned to VM 3) device interrupts vcpu V, sets ON bit (notification vector not programmed, so pcpu P remains in idle) 4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set, kvm_vcpu_kick is skipped 5) vcpu 0 busy spins on vcpu V's response for several seconds, until RCU watchdog NMIs all vCPUs. To fix this, use the start_assignment kvm_x86_ops callback to kick vcpus out of the halt loop, so the notification vector is properly reprogrammed to the wakeup vector. Reported-by: Pei Zhang <pezhang@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20210526172014.GA29007@fuller.cnet> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCKMarcelo Tosatti
KVM_REQ_UNBLOCK will be used to exit a vcpu from its inner vcpu halt emulation loop. Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch PowerPC to arch specific request bit. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20210525134321.303768132@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27KVM: x86: add start_assignment hook to kvm_x86_opsMarcelo Tosatti
Add a start_assignment hook to kvm_x86_ops, which is called when kvm_arch_start_assignment is done. The hook is required to update the wakeup vector of a sleeping vCPU when a device is assigned to the guest. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20210525134321.254128742@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switchWanpeng Li
Let's treat lapic_timer_advance_ns automatic tuning logic as hypervisor overhead, move it before wait_lapic_expire instead of between wait_lapic_expire and the world switch, the wait duration should be calculated by the up-to-date guest_tsc after the overhead of automatic tuning logic. This patch reduces ~30+ cycles for kvm-unit-tests/tscdeadline-latency when testing busy waits. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-5-git-send-email-wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27KVM: X86: hyper-v: Task srcu lock when accessing kvm_memslots()Wanpeng Li
WARNING: suspicious RCU usage 5.13.0-rc1 #4 Not tainted ----------------------------- ./include/linux/kvm_host.h:710 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by hyperv_clock/8318: #0: ffffb6b8cb05a7d8 (&hv->hv_lock){+.+.}-{3:3}, at: kvm_hv_invalidate_tsc_page+0x3e/0xa0 [kvm] stack backtrace: CPU: 3 PID: 8318 Comm: hyperv_clock Not tainted 5.13.0-rc1 #4 Call Trace: dump_stack+0x87/0xb7 lockdep_rcu_suspicious+0xce/0xf0 kvm_write_guest_page+0x1c1/0x1d0 [kvm] kvm_write_guest+0x50/0x90 [kvm] kvm_hv_invalidate_tsc_page+0x79/0xa0 [kvm] kvm_gen_update_masterclock+0x1d/0x110 [kvm] kvm_arch_vm_ioctl+0x2a7/0xc50 [kvm] kvm_vm_ioctl+0x123/0x11d0 [kvm] __x64_sys_ioctl+0x3ed/0x9d0 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae kvm_memslots() will be called by kvm_write_guest(), so we should take the srcu lock. Fixes: e880c6ea5 (KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs) Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-4-git-send-email-wanpengli@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27KVM: X86: Fix vCPU preempted state from guest's point of viewWanpeng Li
Commit 66570e966dd9 (kvm: x86: only provide PV features if enabled in guest's CPUID) avoids to access pv tlb shootdown host side logic when this pv feature is not exposed to guest, however, kvm_steal_time.preempted not only leveraged by pv tlb shootdown logic but also mitigate the lock holder preemption issue. From guest's point of view, vCPU is always preempted since we lose the reset of kvm_steal_time.preempted before vmentry if pv tlb shootdown feature is not exposed. This patch fixes it by clearing kvm_steal_time.preempted before vmentry. Fixes: 66570e966dd9 (kvm: x86: only provide PV features if enabled in guest's CPUID) Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-3-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27KVM: X86: Bail out of direct yield in case of under-committed scenariosWanpeng Li
In case of under-committed scenarios, vCPUs can be scheduled easily; kvm_vcpu_yield_to adds extra overhead, and it is also common to see when vcpu->ready is true but yield later failing due to p->state is TASK_RUNNING. Let's bail out in such scenarios by checking the length of current cpu runqueue, which can be treated as a hint of under-committed instead of guarantee of accuracy. 30%+ of directed-yield attempts can now avoid the expensive lookups in kvm_sched_yield() in an under-committed scenario. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1621339235-11131-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-27Merge v5.13-rc3 into drm-nextDaniel Vetter
drm/i915 is extremely on fire without the below revert from -rc3: commit 293837b9ac8d3021657f44c9d7a14948ec01c5d0 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed May 19 05:55:57 2021 -1000 Revert "i915: fix remap_io_sg to verify the pgprot" Backmerge so we don't have a too wide bisect window for anything that's a more involved workload than booting the driver. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2021-05-26kbuild: require all architectures to have arch/$(SRCARCH)/KbuildMasahiro Yamada
arch/$(SRCARCH)/Kbuild is useful for Makefile cleanups because you can use the obj-y syntax. Add an empty file if it is missing in arch/$(SRCARCH)/. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-05-26locking/atomic: delete !ARCH_ATOMIC remnantsMark Rutland
Now that all architectures implement ARCH_ATOMIC, we can make it mandatory, removing the Kconfig symbol and logic for !ARCH_ATOMIC. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210525140232.53872-33-mark.rutland@arm.com
2021-05-26locking/atomic: make ARCH_ATOMIC a Kconfig symbolMark Rutland
Subsequent patches will move architectures over to the ARCH_ATOMIC API, after preparing the asm-generic atomic implementations to function with or without ARCH_ATOMIC. As some architectures use the asm-generic implementations exclusively (and don't have a local atomic.h), and to avoid the risk that ARCH_ATOMIC isn't defined in some cases we expect, let's make the ARCH_ATOMIC macro a Kconfig symbol instead, so that we can guarantee it is consistently available where needed. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210525140232.53872-2-mark.rutland@arm.com
2021-05-25x86/syscalls: Don't adjust CFLAGS for syscall tablesBrian Gerst
The syscall_*.c files only contain data (the syscall tables). There is no need to adjust CFLAGS for tracing and stack protector since they contain no code. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/r/20210524181707.132844-4-brgerst@gmail.com
2021-05-25x86/syscalls: Remove -Wno-override-init for syscall tablesBrian Gerst
Commit 44fe4895f47c ("Stop filling syscall arrays with *_sys_ni_syscall") removes the need for -Wno-override-init, since the table is now filled sequentially instead of overriding a default value. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/r/20210524181707.132844-3-brgerst@gmail.com
2021-05-25x86/uml/syscalls: Remove array index from syscall initializersBrian Gerst
The recent syscall table generator rework removed the index from the initializers for native x86 syscall tables, but missed the UML syscall tables. Fixes: 44fe4895f47c ("Stop filling syscall arrays with *_sys_ni_syscall") Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/r/20210524181707.132844-2-brgerst@gmail.com
2021-05-25x86/syscalls: Clear 'offset' and 'prefix' in case they are set in envMasahiro Yamada
If the environment variable 'prefix' is set on the build host, it is wrongly used as syscall macro prefixes. $ export prefix=/usr $ make -s defconfig all In file included from ./arch/x86/include/asm/unistd.h:20, from <stdin>:2: ./arch/x86/include/generated/uapi/asm/unistd_64.h:4:9: warning: missing whitespace after the macro name 4 | #define __NR_/usrread 0 | ^~~~~ arch/x86/entry/syscalls/Makefile should clear 'offset' and 'prefix'. Fixes: 3cba325b358f ("x86/syscalls: Switch to generic syscallhdr.sh") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210525115420.679416-1-masahiroy@kernel.org
2021-05-25x86/entry: Use int everywhere for system call numbersH. Peter Anvin (Intel)
System call numbers are defined as int, so use int everywhere for system call numbers. This is strictly a cleanup; it should not change anything user visible; all ABI changes have been done in the preceeding patches. [ tglx: Replaced the unsigned long cast ] Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210518191303.4135296-7-hpa@zytor.com
2021-05-24KVM: SVM: make the avic parameter a boolPaolo Bonzini
Make it consistent with kvm_intel.enable_apicv. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-24KVM: VMX: Drop unneeded CONFIG_X86_LOCAL_APIC checkVitaly Kuznetsov
CONFIG_X86_LOCAL_APIC is always on when CONFIG_KVM (on x86) since commit e42eef4ba388 ("KVM: add X86_LOCAL_APIC dependency"). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210518144339.1987982-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com>
2021-05-24KVM: SVM: Drop unneeded CONFIG_X86_LOCAL_APIC checkVitaly Kuznetsov
AVIC dependency on CONFIG_X86_LOCAL_APIC is dead code since commit e42eef4ba388 ("KVM: add X86_LOCAL_APIC dependency"). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210518144339.1987982-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com>
2021-05-23Merge tag 'perf-urgent-2021-05-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two perf fixes: - Do not check the LBR_TOS MSR when setting up unrelated LBR MSRs as this can cause malfunction when TOS is not supported - Allocate the LBR XSAVE buffers along with the DS buffers upfront because allocating them when adding an event can deadlock" * tag 'perf-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/lbr: Remove cpuc->lbr_xsave allocation from atomic context perf/x86: Avoid touching LBR_TOS MSR for Arch LBR
2021-05-23Merge tag 'x86_urgent_for_v5.13_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Fix how SEV handles MMIO accesses by forwarding potential page faults instead of killing the machine and by using the accessors with the exact functionality needed when accessing memory. - Fix a confusion with Clang LTO compiler switches passed to the it - Handle the case gracefully when VMGEXIT has been executed in userspace * tag 'x86_urgent_for_v5.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev-es: Use __put_user()/__get_user() for data accesses x86/sev-es: Forward page-faults which happen during emulation x86/sev-es: Don't return NULL from sev_es_get_ghcb() x86/build: Fix location of '-plugin-opt=' flags x86/sev-es: Invalidate the GHCB after completing VMGEXIT x86/sev-es: Move sev_es_put_ghcb() in prep for follow on patch
2021-05-23Merge tag 'efi-next-for-v5.14' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core Pull EFI updates for v5.14 from Ard Biesheuvel: "First microbatch of EFI updates - not a lot going on these days." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-22Merge tag 'for-linus-5.13b-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - a fix for a boot regression when running as PV guest on hardware without NX support - a small series fixing a bug in the Xen pciback driver when configuring a PCI card with multiple virtual functions * tag 'for-linus-5.13b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen-pciback: reconfigure also from backend watch handler xen-pciback: redo VF placement in the virtual topology x86/Xen: swap NX determination and GDT setup on BSP
2021-05-22x86/efi: Log 32/64-bit mismatch with kernel as an errorPaul Menzel
Log the message No EFI runtime due to 32/64-bit mismatch with kernel as an error condition, as several things like efivarfs won’t work without the EFI runtime. Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2021-05-21Merge branch 'for-v5.13-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo fix from Eric Biederman: "During the merge window an issue with si_perf and the siginfo ABI came up. The alpha and sparc siginfo structure layout had changed with the addition of SIGTRAP TRAP_PERF and the new field si_perf. The reason only alpha and sparc were affected is that they are the only architectures that use si_trapno. Looking deeper it was discovered that si_trapno is used for only a few select signals on alpha and sparc, and that none of the other _sigfault fields past si_addr are used at all. Which means technically no regression on alpha and sparc. While the alignment concerns might be dismissed the abuse of si_errno by SIGTRAP TRAP_PERF does have the potential to cause regressions in existing userspace. While we still have time before userspace starts using and depending on the new definition siginfo for SIGTRAP TRAP_PERF this set of changes cleans up siginfo_t. - The si_trapno field is demoted from magic alpha and sparc status and made an ordinary union member of the _sigfault member of siginfo_t. Without moving it of course. - si_perf is replaced with si_perf_data and si_perf_type ending the abuse of si_errno. - Unnecessary additions to signalfd_siginfo are removed" * 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo signal: Deliver all of the siginfo perf data in _perf signal: Factor force_sig_perf out of perf_sigtrap signal: Implement SIL_FAULT_TRAPNO siginfo: Move si_trapno inside the union inside _si_fault
2021-05-21x86/kexec: Set_[gi]dt() -> native_[gi]dt_invalidate() in machine_kexec_*.cH. Peter Anvin (Intel)
These files contain private set_gdt() functions which are only used to invalid the gdt; machine_kexec_64.c also contains a set_idt() function to invalidate the idt. phys_to_virt(0) *really* doesn't make any sense for creating an invalid GDT. A NULL pointer (virtual 0) makes a lot more sense; although neither will allow any actual memory reference, a NULL pointer stands out more. Replace these calls with native_[gi]dt_invalidate(). Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210519212154.511983-7-hpa@zytor.com
2021-05-21x86: Add native_[ig]dt_invalidate()H. Peter Anvin (Intel)
In some places, the native forms of descriptor table invalidation is required. Rather than open-coding them, add explicitly native functions to invalidate the GDT and IDT. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210519212154.511983-6-hpa@zytor.com
2021-05-21x86/idt: Remove address argument from idt_invalidate()H. Peter Anvin (Intel)
There is no reason to specify any specific address to idt_invalidate(). It looks mostly like an artifact of unifying code done differently by accident. The most "sensible" address to set here is a NULL pointer - virtual address zero, just as a visual marker. This also makes it possible to mark the struct desc_ptr in idt_invalidate() as static const. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210519212154.511983-5-hpa@zytor.com