summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-07-02Merge branch 'perf/vlbr'Peter Zijlstra
2020-07-02perf/x86: Keep LBR records unchanged in host context for guest usageLike Xu
When a guest wants to use the LBR registers, its hypervisor creates a guest LBR event and let host perf schedules it. The LBR records msrs are accessible to the guest when its guest LBR event is scheduled on by the perf subsystem. Before scheduling this event out, we should avoid host changes on IA32_DEBUGCTLMSR or LBR_SELECT. Otherwise, some unexpected branch operations may interfere with guest behavior, pollute LBR records, and even cause host branches leakage. In addition, the read operation on host is also avoidable. To ensure that guest LBR records are not lost during the context switch, the guest LBR event would enable the callstack mode which could save/restore guest unread LBR records with the help of intel_pmu_lbr_sched_task() naturally. However, the guest LBR_SELECT may changes for its own use and the host LBR event doesn't save/restore it. To ensure that we doesn't lost the guest LBR_SELECT value when the guest LBR event is running, the vlbr_constraint is bound up with a new constraint flag PERF_X86_EVENT_LBR_SELECT. Signed-off-by: Like Xu <like.xu@linux.intel.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200514083054.62538-6-like.xu@linux.intel.com
2020-07-02perf/x86: Add constraint to create guest LBR event without hw counterLike Xu
The hypervisor may request the perf subsystem to schedule a time window to directly access the LBR records msrs for its own use. Normally, it would create a guest LBR event with callstack mode enabled, which is scheduled along with other ordinary LBR events on the host but in an exclusive way. To avoid wasting a counter for the guest LBR event, the perf tracks its hw->idx via INTEL_PMC_IDX_FIXED_VLBR and assigns it with a fake VLBR counter with the help of new vlbr_constraint. As with the BTS event, there is actually no hardware counter assigned for the guest LBR event. Signed-off-by: Like Xu <like.xu@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200514083054.62538-5-like.xu@linux.intel.com
2020-07-02perf/x86/lbr: Add interface to get LBR informationLike Xu
The LBR records msrs are model specific. The perf subsystem has already obtained the base addresses of LBR records based on the cpu model. Therefore, an interface is added to allow callers outside the perf subsystem to obtain these LBR information. It's useful for hypervisors to emulate the LBR feature for guests with less code. Signed-off-by: Like Xu <like.xu@linux.intel.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200613080958.132489-4-like.xu@linux.intel.com
2020-07-02perf/x86/core: Refactor hw->idx checks and cleanupLike Xu
For intel_pmu_en/disable_event(), reorder the branches checks for hw->idx and make them sorted by probability: gp,fixed,bts,others. Clean up the x86_assign_hw_event() by converting multiple if-else statements to a switch statement. To skip x86_perf_event_update() and x86_perf_event_set_period(), it's generic to replace "idx == INTEL_PMC_IDX_FIXED_BTS" check with '!hwc->event_base' because that should be 0 for all non-gp/fixed cases. Wrap related bit operations into intel_set/clear_masks() and make the main path more cleaner and readable. No functional changes. Signed-off-by: Like Xu <like.xu@linux.intel.com> Original-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200613080958.132489-3-like.xu@linux.intel.com
2020-07-02perf/x86: Fix variable types for LBR registersWei Wang
The MSR variable type can be 'unsigned int', which uses less memory than the longer 'unsigned long'. Fix 'struct x86_pmu' for that. The lbr_nr won't be a negative number, so make it 'unsigned int' as well. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200613080958.132489-2-like.xu@linux.intel.com
2020-07-02cpufreq: intel_pstate: Allow enable/disable energy efficiencySrinivas Pandruvada
By default intel_pstate the driver disables energy efficiency by setting MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode. This CPU model is also shared by Coffee Lake desktop CPUs. This allows these systems to reach maximum possible frequency. But this adds power penalty, which some customers don't want. They want some way to enable/ disable dynamically. So, add an additional attribute "energy_efficiency" under /sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows to read and write bit 19 ("Disable Energy Efficiency Optimization") in the MSR IA32_POWER_CTL. This attribute is present in both HWP and non-HWP mode as this has an effect in both modes. Refer to Intel Software Developer's manual for details. The scope of this bit is package wide. Also these systems are single package systems. So read/write MSR on the current CPU is enough. The energy efficiency (EE) bit setting needs to be preserved during suspend/resume and CPU offline/online operation. To do this: - Restoring the EE setting from the cpufreq resume() callback, if there is change from the system default. - By default, don't disable EE from cpufreq init() callback for matching CPU models. Since the scope is package wide and is a single package system, move the disable EE calls from init() callback to intel_pstate_init() function, which is called only once. Suggested-by: Len Brown <lenb@kernel.org> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-07-01x86/fsgsbase: Fix Xen PV supportAndy Lutomirski
On Xen PV, SWAPGS doesn't work. Teach __rdfsbase_inactive() and __wrgsbase_inactive() to use rdmsrl()/wrmsrl() on Xen PV. The Xen pvop code will understand this and issue the correct hypercalls. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/f07c08f178fe9711915862b656722a207cd52c28.1593192140.git.luto@kernel.org
2020-07-01x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbaseAndy Lutomirski
Debuggers expect that doing PTRACE_GETREGS, then poking at a tracee and maybe letting it run for a while, then doing PTRACE_SETREGS will put the tracee back where it was. In the specific case of a 32-bit tracer and tracee, the PTRACE_GETREGS/SETREGS data structure doesn't have fs_base or gs_base fields, so FSBASE and GSBASE fields are never stored anywhere. Everything used to still work because nonzero FS or GS would result full reloads of the segment registers when the tracee resumes, and the bases associated with FS==0 or GS==0 are irrelevant to 32-bit code. Adding FSGSBASE support broke this: when FSGSBASE is enabled, FSBASE and GSBASE are now restored independently of FS and GS for all tasks when context-switched in. This means that, if a 32-bit tracer restores a previous state using PTRACE_SETREGS but the tracee's pre-restore and post-restore bases don't match, then the tracee is resumed with the wrong base. Fix it by explicitly loading the base when a 32-bit tracer pokes FS or GS on a 64-bit kernel. Also add a test case. Fixes: 673903495c85 ("x86/process/64: Use FSBSBASE in switch_to() if available") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/229cc6a50ecbb701abd50fe4ddaf0eda888898cd.1593192140.git.luto@kernel.org
2020-07-01x86/entry/64/compat: Fix Xen PV SYSENTER frame setupAndy Lutomirski
The SYSENTER frame setup was nonsense. It worked by accident because the normal code into which the Xen asm jumped (entry_SYSENTER_32/compat) threw away SP without touching the stack. entry_SYSENTER_compat was recently modified such that it relied on having a valid stack pointer, so now the Xen asm needs to invoke it with a valid stack. Fix it up like SYSCALL: use the Xen-provided frame and skip the bare metal prologue. Fixes: 1c3e5d3f60e2 ("x86/entry: Make entry_64_compat.S objtool clean") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lkml.kernel.org/r/947880c41ade688ff4836f665d0c9fcaa9bd1201.1593191971.git.luto@kernel.org
2020-07-01x86/entry: Move SYSENTER's regs->sp and regs->flags fixups into CAndy Lutomirski
The SYSENTER asm (32-bit and compat) contains fixups for regs->sp and regs->flags. Move the fixups into C and fix some comments while at it. This is a valid cleanup all by itself, and it also simplifies the subsequent patch that will fix Xen PV SYSENTER. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/fe62bef67eda7fac75b8f3dbafccf571dc4ece6b.1593191971.git.luto@kernel.org
2020-07-01x86/entry: Assert that syscalls are on the right stackAndy Lutomirski
Now that the entry stack is a full page, it's too easy to regress the system call entry code and end up on the wrong stack without noticing. Assert that all system calls (SYSCALL64, SYSCALL32, SYSENTER, and INT80) are on the right stack and have pt_regs in the right place. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/52059e42bb0ab8551153d012d68f7be18d72ff8e.1593191971.git.luto@kernel.org
2020-06-30PCI: Replace http:// links with https://Alexander A. Klimov
Replace http:// links with https:// links. This reduces the likelihood of man-in-the-middle attacks when developers open these links. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. [bhelgaas: also update samsung.com links, drop sourceforge link] Link: https://lore.kernel.org/r/20200627103050.71712-1-grandmaster@al2klimov.de Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-06-30x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelistedSean Christopherson
Choo! Choo! All aboard the Split Lock Express, with direct service to Wreckage! Skip split_lock_verify_msr() if the CPU isn't whitelisted as a possible SLD-enabled CPU model to avoid writing MSR_TEST_CTRL. MSR_TEST_CTRL exists, and is writable, on many generations of CPUs. Writing the MSR, even with '0', can result in bizarre, undocumented behavior. This fixes a crash on Haswell when resuming from suspend with a live KVM guest. Because APs use the standard SMP boot flow for resume, they will go through split_lock_init() and the subsequent RDMSR/WRMSR sequence, which runs even when sld_state==sld_off to ensure SLD is disabled. On Haswell (at least, my Haswell), writing MSR_TEST_CTRL with '0' will succeed and _may_ take the SMT _sibling_ out of VMX root mode. When KVM has an active guest, KVM performs VMXON as part of CPU onlining (see kvm_starting_cpu()). Because SMP boot is serialized, the resulting flow is effectively: on_each_ap_cpu() { WRMSR(MSR_TEST_CTRL, 0) VMXON } As a result, the WRMSR can disable VMX on a different CPU that has already done VMXON. This ultimately results in a #UD on VMPTRLD when KVM regains control and attempt run its vCPUs. The above voodoo was confirmed by reworking KVM's VMXON flow to write MSR_TEST_CTRL prior to VMXON, and to serialize the sequence as above. Further verification of the insanity was done by redoing VMXON on all APs after the initial WRMSR->VMXON sequence. The additional VMXON, which should VM-Fail, occasionally succeeded, and also eliminated the unexpected #UD on VMPTRLD. The damage done by writing MSR_TEST_CTRL doesn't appear to be limited to VMX, e.g. after suspend with an active KVM guest, subsequent reboots almost always hang (even when fudging VMXON), a #UD on a random Jcc was observed, suspend/resume stability is qualitatively poor, and so on and so forth. kernel BUG at arch/x86/kvm/x86.c:386! CPU: 1 PID: 2592 Comm: CPU 6/KVM Tainted: G D Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014 RIP: 0010:kvm_spurious_fault+0xf/0x20 Call Trace: vmx_vcpu_load_vmcs+0x1fb/0x2b0 vmx_vcpu_load+0x3e/0x160 kvm_arch_vcpu_load+0x48/0x260 finish_task_switch+0x140/0x260 __schedule+0x460/0x720 _cond_resched+0x2d/0x40 kvm_arch_vcpu_ioctl_run+0x82e/0x1ca0 kvm_vcpu_ioctl+0x363/0x5c0 ksys_ioctl+0x88/0xa0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4c/0x170 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: dbaba47085b0c ("x86/split_lock: Rework the initialization flow of split lock detection") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200605192605.7439-1-sean.j.christopherson@intel.com
2020-06-30KVM: x86: bit 8 of non-leaf PDPEs is not reservedPaolo Bonzini
Bit 8 would be the "global" bit, which does not quite make sense for non-leaf page table entries. Intel ignores it; AMD ignores it in PDEs and PDPEs, but reserves it in PML4Es. Probably, earlier versions of the AMD manual documented it as reserved in PDPEs as well, and that behavior made it into KVM as well as kvm-unit-tests; fix it. Cc: stable@vger.kernel.org Reported-by: Nadav Amit <namit@vmware.com> Fixes: a0c0feb57992 ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD", 2014-09-03) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-30x86: Remove dev->archdata.iommu pointerJoerg Roedel
There are no users left, all drivers have been converted to use the per-device private pointer offered by IOMMU core. Signed-off-by: Joerg Roedel <jroedel@suse.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200625130836.1916-10-joro@8bytes.org
2020-06-29x86/mm/pat: Mark an intentional data raceQian Cai
cpa_4k_install could be accessed concurrently as noticed by KCSAN, read to 0xffffffffaa59a000 of 8 bytes by interrupt on cpu 7: cpa_inc_4k_install arch/x86/mm/pat/set_memory.c:131 [inline] __change_page_attr+0x10cf/0x1840 arch/x86/mm/pat/set_memory.c:1514 __change_page_attr_set_clr+0xce/0x490 arch/x86/mm/pat/set_memory.c:1636 __set_pages_np+0xc4/0xf0 arch/x86/mm/pat/set_memory.c:2148 __kernel_map_pages+0xb0/0xc8 arch/x86/mm/pat/set_memory.c:2178 kernel_map_pages include/linux/mm.h:2719 [inline] <snip> write to 0xffffffffaa59a000 of 8 bytes by task 1 on cpu 6: cpa_inc_4k_install arch/x86/mm/pat/set_memory.c:131 [inline] __change_page_attr+0x10ea/0x1840 arch/x86/mm/pat/set_memory.c:1514 __change_page_attr_set_clr+0xce/0x490 arch/x86/mm/pat/set_memory.c:1636 __set_pages_p+0xc4/0xf0 arch/x86/mm/pat/set_memory.c:2129 __kernel_map_pages+0x2e/0xc8 arch/x86/mm/pat/set_memory.c:2176 kernel_map_pages include/linux/mm.h:2719 [inline] <snip> Both accesses are due to the same "cpa_4k_install++" in cpa_inc_4k_install. A data race here could be potentially undesirable: depending on compiler optimizations or how x86 executes a non-LOCK'd increment, it may lose increments, corrupt the counter, etc. Since this counter only seems to be used for printing some stats, this data race itself is unlikely to cause harm to the system though. Thus, mark this intentional data race using the data_race() marco. Suggested-by: Macro Elver <elver@google.com> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29x86/ftrace: Do not jump to direct code in created trampolinesSteven Rostedt (VMware)
When creating a trampoline based on the ftrace_regs_caller code, nop out the jnz test that would jmup to the code that would return to a direct caller (stored in the ORIG_RAX field) and not back to the function that called it. Link: http://lkml.kernel.org/r/20200422162750.638839749@goodmis.org Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-29x86/ftrace: Only have the builtin ftrace_regs_caller call direct hooksSteven Rostedt (VMware)
If a direct hook is attached to a function that ftrace also has a function attached to it, then it is required that the ftrace_ops_list_func() is used to iterate over the registered ftrace callbacks. This will also include the direct ftrace_ops helper, that tells ftrace_regs_caller where to return to (the direct callback and not the function that called it). As this direct helper is only to handle the case of ftrace callbacks attached to the same function as the direct callback, the ftrace callback allocated trampolines (used to only call them), should never be used to return back to a direct callback. Only copy the portion of the ftrace_regs_caller that will return back to what called it, and not the portion that returns back to the direct caller. The direct ftrace_ops must then pick the ftrace_regs_caller builtin function as its own trampoline to ensure that it will never have one allocated for it (which would not include the handling of direct callbacks). Link: http://lkml.kernel.org/r/20200422162750.495903799@goodmis.org Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-29x86/ftrace: Make non direct case the default in ftrace_regs_callerSteven Rostedt (VMware)
If a direct function is hooked along with one of the ftrace registered functions, then the ftrace_regs_caller is attached to the function that shares the direct hook as well as the ftrace hook. The ftrace_regs_caller will call ftrace_ops_list_func() that iterates through all the registered ftrace callbacks, and if there's a direct callback attached to that function, the direct ftrace_ops callback is called to notify that ftrace_regs_caller to return to the direct caller instead of going back to the function that called it. But this is a very uncommon case. Currently, the code has it as the default case. Modify ftrace_regs_caller to make the default case (the non jump) to just return normally, and have the jump to the handling of the direct caller. Link: http://lkml.kernel.org/r/20200422162750.350373278@goodmis.org Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-29KVM: X86: Fix async pf caused null-ptr-derefWanpeng Li
Syzbot reported that: CPU: 1 PID: 6780 Comm: syz-executor153 Not tainted 5.7.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__apic_accept_irq+0x46/0xb80 Call Trace: kvm_arch_async_page_present+0x7de/0x9e0 kvm_check_async_pf_completion+0x18d/0x400 kvm_arch_vcpu_ioctl_run+0x18bf/0x69f0 kvm_vcpu_ioctl+0x46a/0xe20 ksys_ioctl+0x11a/0x180 __x64_sys_ioctl+0x6f/0xb0 do_syscall_64+0xf6/0x7d0 entry_SYSCALL_64_after_hwframe+0x49/0xb3 The testcase enables APF mechanism in MSR_KVM_ASYNC_PF_EN with ASYNC_PF_INT enabled w/o setting MSR_KVM_ASYNC_PF_INT before, what's worse, interrupt based APF 'page ready' event delivery depends on in kernel lapic, however, we didn't bail out when lapic is not in kernel during guest setting MSR_KVM_ASYNC_PF_EN which causes the null-ptr-deref in host later. This patch fixes it. Reported-by: syzbot+1bf777dfdde86d64b89b@syzkaller.appspotmail.com Fixes: 2635b5c4a0 (KVM: x86: interrupt based APF 'page ready' event delivery) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1593426391-8231-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-29x86/fpu: Reset MXCSR to default in kernel_fpu_begin()Petteri Aimonen
Previously, kernel floating point code would run with the MXCSR control register value last set by userland code by the thread that was active on the CPU core just before kernel call. This could affect calculation results if rounding mode was changed, or a crash if a FPU/SIMD exception was unmasked. Restore MXCSR to the kernel's default value. [ bp: Carve out from a bigger patch by Petteri, add feature check, add FNINIT call too (amluto). ] Signed-off-by: Petteri Aimonen <jpa@git.mail.kapsi.fi> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=207979 Link: https://lkml.kernel.org/r/20200624114646.28953-2-bp@alien8.de
2020-06-28Merge tag 'perf-urgent-2020-06-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "A single Kbuild dependency fix" * tag 'perf-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Fix RAPL config variable bug
2020-06-28Merge tag 'efi-urgent-2020-06-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Ingo Molnar: - Fix build regression on v4.8 and older - Robustness fix for TPM log parsing code - kobject refcount fix for the ESRT parsing code - Two efivarfs fixes to make it behave more like an ordinary file system - Style fixup for zero length arrays - Fix a regression in path separator handling in the initrd loader - Fix a missing prototype warning - Add some kerneldoc headers for newly introduced stub routines - Allow support for SSDT overrides via EFI variables to be disabled - Report CPU mode and MMU state upon entry for 32-bit ARM - Use the correct stack pointer alignment when entering from mixed mode * tag 'efi-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/libstub: arm: Print CPU boot mode and MMU state at boot efi/libstub: arm: Omit arch specific config table matching array on arm64 efi/x86: Setup stack correctly for efi_pe_entry efi: Make it possible to disable efivar_ssdt entirely efi/libstub: Descriptions for stub helper functions efi/libstub: Fix path separator regression efi/libstub: Fix missing-prototype warning for skip_spaces() efi: Replace zero-length array and use struct_size() helper efivarfs: Don't return -EINTR when rate-limiting reads efivarfs: Update inode modification time for successful writes efi/esrt: Fix reference count leak in esre_create_sysfs_entry. efi/tpm: Verify event log header before parsing efi/x86: Fix build with gcc 4
2020-06-28Merge tag 'x86_urgent_for_5.8_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - AMD Memory bandwidth counter width fix, by Babu Moger. - Use the proper length type in the 32-bit truncate() syscall variant, by Jiri Slaby. - Reinit IA32_FEAT_CTL during wakeup to fix the case where after resume, VMXON would #GP due to VMX not being properly enabled, by Sean Christopherson. - Fix a static checker warning in the resctrl code, by Dan Carpenter. - Add a CR4 pinning mask for bits which cannot change after boot, by Kees Cook. - Align the start of the loop of __clear_user() to 16 bytes, to improve performance on AMD zen1 and zen2 microarchitectures, by Matt Fleming. * tag 'x86_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm/64: Align start of __clear_user() loop to 16-bytes x86/cpu: Use pinning mask for CR4 bits needing to be 0 x86/resctrl: Fix a NULL vs IS_ERR() static checker warning in rdt_cdp_peer_get() x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup syscalls: Fix offset type of ksys_ftruncate() x86/resctrl: Fix memory bandwidth counter width for AMD
2020-06-28Merge tag 'objtool_urgent_for_5.8_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Borislav Petkov: "Three fixes from Peter Zijlstra suppressing KCOV instrumentation in noinstr sections. Peter Zijlstra says: "Address KCOV vs noinstr. There is no function attribute to selectively suppress KCOV instrumentation, instead teach objtool to NOP out the calls in noinstr functions" This cures a bunch of KCOV crashes (as used by syzcaller)" * tag 'objtool_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix noinstr vs KCOV objtool: Provide elf_write_{insn,reloc}() objtool: Clean up elf_write() condition
2020-06-28Merge tag 'x86_entry_for_5.8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry fixes from Borislav Petkov: "This is the x86/entry urgent pile which has accumulated since the merge window. It is not the smallest but considering the almost complete entry core rewrite, the amount of fixes to follow is somewhat higher than usual, which is to be expected. Peter Zijlstra says: 'These patches address a number of instrumentation issues that were found after the x86/entry overhaul. When combined with rcu/urgent and objtool/urgent, these patches make UBSAN/KASAN/KCSAN happy again. Part of making this all work is bumping the minimum GCC version for KASAN builds to gcc-8.3, the reason for this is that the __no_sanitize_address function attribute is broken in GCC releases before that. No known GCC version has a working __no_sanitize_undefined, however because the only noinstr violation that results from this happens when an UB is found, we treat it like WARN. That is, we allow it to violate the noinstr rules in order to get the warning out'" * tag 'x86_entry_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Fix #UD vs WARN more x86/entry: Increase entry_stack size to a full page x86/entry: Fixup bad_iret vs noinstr objtool: Don't consider vmlinux a C-file kasan: Fix required compiler version compiler_attributes.h: Support no_sanitize_undefined check with GCC 4 x86/entry, bug: Comment the instrumentation_begin() usage for WARN() x86/entry, ubsan, objtool: Whitelist __ubsan_handle_*() x86/entry, cpumask: Provide non-instrumented variant of cpu_is_offline() compiler_types.h: Add __no_sanitize_{address,undefined} to noinstr kasan: Bump required compiler version x86, kcsan: Add __no_kcsan to noinstr kcsan: Remove __no_kcsan_or_inline x86, kcsan: Remove __no_kcsan_or_inline usage
2020-06-26docs: fix references for DMA*.txt filesMauro Carvalho Chehab
As we moved those files to core-api, fix references to point to their newer locations. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/37b2fd159fbc7655dbf33b3eb1215396a25f6344.1592895969.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-06-26Merge branch 'linus' into x86/entry, to resolve conflictsIngo Molnar
Conflicts: arch/x86/kernel/traps.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-06-26x86/hyperv: allocate the hypercall page with only read and execute bitsChristoph Hellwig
Patch series "fix a hyperv W^X violation and remove vmalloc_exec" Dexuan reported a W^X violation due to the fact that the hyper hypercall page due switching it to be allocated using vmalloc_exec. The problem is that PAGE_KERNEL_EXEC as used by vmalloc_exec actually sets writable permissions in the pte. This series fixes the issue by switching to the low-level __vmalloc_node_range interface that allows specifing more detailed permissions instead. It then also open codes the other two callers and removes the somewhat confusing vmalloc_exec interface. Peter noted that the hyper hypercall page allocation also has another long standing issue in that it shouldn't use the full vmalloc but just the module space. This issue is so far theoretical as the allocation is done early in the boot process. I plan to fix it with another bigger series for 5.9. This patch (of 3): Avoid a W^X violation cause by the fact that PAGE_KERNEL_EXEC includes the writable bit. For this resurrect the removed PAGE_KERNEL_RX definition, but as PAGE_KERNEL_ROX to match arm64 and powerpc. Link: http://lkml.kernel.org/r/20200618064307.32739-2-hch@lst.de Fixes: 78bb17f76edc ("x86/hyperv: use vmalloc_exec for the hypercall page") Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Dexuan Cui <decui@microsoft.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26x86: kill dump_fpu()Al Viro
dead since the removal of aout coredump support... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-26x86: copy_fpstate_to_sigframe(): have fpregs_soft_get() use kernel bufferAl Viro
... then copy_to_user() the results Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-25Merge drm/drm-next into drm-intel-next-queuedJani Nikula
Catch up with upstream, in particular to get c1e8d7c6a7a6 ("mmap locking API: convert mmap_sem comments"). Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2020-06-25x86/entry: Fix #UD vs WARN morePeter Zijlstra
vmlinux.o: warning: objtool: exc_invalid_op()+0x47: call to probe_kernel_read() leaves .noinstr.text section Since we use UD2 as a short-cut for 'CALL __WARN', treat it as such. Have the bare exception handler do the report_bug() thing. Fixes: 15a416e8aaa7 ("x86/entry: Treat BUG/WARN as NMI-like entries") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20200622114713.GE577403@hirez.programming.kicks-ass.net
2020-06-25x86/entry: Increase entry_stack size to a full pagePeter Zijlstra
Marco crashed in bad_iret with a Clang11/KCSAN build due to overflowing the stack. Now that we run C code on it, expand it to a full page. Suggested-by: Andy Lutomirski <luto@amacapital.net> Reported-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200618144801.819246178@infradead.org
2020-06-25x86/entry: Fixup bad_iret vs noinstrPeter Zijlstra
vmlinux.o: warning: objtool: fixup_bad_iret()+0x8e: call to memcpy() leaves .noinstr.text section Worse, when KASAN there is no telling what memcpy() actually is. Force the use of __memcpy() which is our assmebly implementation. Reported-by: Marco Elver <elver@google.com> Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200618144801.760070502@infradead.org
2020-06-25x86/msr: Filter MSR writesBorislav Petkov
Add functionality to disable writing to MSRs from userspace. Writes can still be allowed by supplying the allow_writes=on module parameter. The kernel will be tainted so that it shows in oopses. Having unfettered access to all MSRs on a system is and has always been a disaster waiting to happen. Think performance counter MSRs, MSRs with sticky or locked bits, MSRs making major system changes like loading microcode, MTRRs, PAT configuration, TSC counter, security mitigations MSRs, you name it. This also destroys all the kernel's caching of MSR values for performance, as the recent case with MSR_AMD64_LS_CFG showed. Another example is writing MSRs by mistake by simply typing the wrong MSR address. System freezes have been experienced that way. In general, poking at MSRs under the kernel's feet is a bad bad idea. So log writing to MSRs by default. Longer term, such writes will be disabled by default. If userspace still wants to do that, then proper interfaces should be defined which are under the kernel's control and accesses to those MSRs can be synchronized and sanitized properly. [ Fix sparse warnings. ] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Sean Christopherson <sean.j.christopherson@intel.com> Link: https://lkml.kernel.org/r/20200612105026.GA22660@zn.tnic
2020-06-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "All bugfixes except for a couple cleanup patches" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling KVM: X86: Fix MSR range of APIC registers in X2APIC mode KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL KVM: nVMX: Plumb L2 GPA through to PML emulation KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic() KVM: LAPIC: ensure APIC map is up to date on concurrent update requests kvm: lapic: fix broken vcpu hotplug Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU" KVM: VMX: Add helpers to identify interrupt type from intr_info kvm/svm: disable KCSAN for svm_vcpu_run() KVM: MIPS: Fix a build error for !CPU_LOONGSON64
2020-06-23x86/mce, EDAC/mce_amd: Print PPIN in machine check recordsSmita Koralahalli
Print the Protected Processor Identification Number (PPIN) on processors which support it. [ bp: Massage. ] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200623130059.8870-1-Smita.KoralahalliChannabasappa@amd.com
2020-06-23KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkruSean Christopherson
Remove vcpu_vmx.host_pkru, which got left behind when PKRU support was moved to common x86 code. No functional change intended. Fixes: 37486135d3a7b ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200617034123.25647-1-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-23KVM: x86: allow TSC to differ by NTP correction bounds without TSC scalingMarcelo Tosatti
The Linux TSC calibration procedure is subject to small variations (its common to see +-1 kHz difference between reboots on a given CPU, for example). So migrating a guest between two hosts with identical processor can fail, in case of a small variation in calibrated TSC between them. Without TSC scaling, the current kernel interface will either return an error (if user_tsc_khz <= tsc_khz) or enable TSC catchup mode. This change enables the following TSC tolerance check to accept KVM_SET_TSC_KHZ within tsc_tolerance_ppm (which is 250ppm by default). /* * Compute the variation in TSC rate which is acceptable * within the range of tolerance and decide if the * rate being applied is within that bounds of the hardware * rate. If so, no scaling or compensation need be done. */ thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm); thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm); if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) { pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", user_tsc_khz, thresh_lo, thresh_hi); use_scaling = 1; } NTP daemon in the guest can correct this difference (NTP can correct upto 500ppm). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20200616114741.GA298183@fuller.cnet> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-23KVM: X86: Fix MSR range of APIC registers in X2APIC modeXiaoyao Li
Only MSR address range 0x800 through 0x8ff is architecturally reserved and dedicated for accessing APIC registers in x2APIC mode. Fixes: 0105d1a52640 ("KVM: x2apic interface to lapic") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200616073307.16440-1-xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROLSean Christopherson
Remove support for context switching between the guest's and host's desired UMWAIT_CONTROL. Propagating the guest's value to hardware isn't required for correct functionality, e.g. KVM intercepts reads and writes to the MSR, and the latency effects of the settings controlled by the MSR are not architecturally visible. As a general rule, KVM should not allow the guest to control power management settings unless explicitly enabled by userspace, e.g. see KVM_CAP_X86_DISABLE_EXITS. E.g. Intel's SDM explicitly states that C0.2 can improve the performance of SMT siblings. A devious guest could disable C0.2 so as to improve the performance of their workloads at the detriment to workloads running in the host or on other VMs. Wholesale removal of UMWAIT_CONTROL context switching also fixes a race condition where updates from the host may cause KVM to enter the guest with the incorrect value. Because updates are are propagated to all CPUs via IPI (SMP function callback), the value in hardware may be stale with respect to the cached value and KVM could enter the guest with the wrong value in hardware. As above, the guest can't observe the bad value, but it's a weird and confusing wart in the implementation. Removal also fixes the unnecessary usage of VMX's atomic load/store MSR lists. Using the lists is only necessary for MSRs that are required for correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on old hardware, or for MSRs that need to-the-uop precision, e.g. perf related MSRs. For UMWAIT_CONTROL, the effects are only visible in the kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in vcpu_vmx_run(). Using the atomic lists is undesirable as they are more expensive than direct RDMSR/WRMSR. Furthermore, even if giving the guest control of the MSR is legitimate, e.g. in pass-through scenarios, it's not clear that the benefits would outweigh the overhead. E.g. saving and restoring an MSR across a VMX roundtrip costs ~250 cycles, and if the guest diverged from the host that cost would be paid on every run of the guest. In other words, if there is a legitimate use case then it should be enabled by a new per-VM capability. Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can correctly expose other WAITPKG features to the guest, e.g. TPAUSE, UMWAIT and UMONITOR. Fixes: 6e3ba4abcea56 ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Cc: stable@vger.kernel.org Cc: Jingqi Liu <jingqi.liu@intel.com> Cc: Tao Xu <tao3.xu@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22KVM: nVMX: Plumb L2 GPA through to PML emulationSean Christopherson
Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all intents and purposes is vmx_write_pml_buffer(), instead of having the latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit update is the result of KVM emulation (rare for L2), then the GPA in the VMCS may be stale and/or hold a completely unrelated GPA. Fixes: c5f983f6e8455 ("nVMX: Implement emulated Page Modification Logging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic()Vitaly Kuznetsov
translate_gpa() returns a GPA, assigning it to 'real_gfn' seems obviously wrong. There is no real issue because both 'gpa_t' and 'gfn_t' are u64 and we don't use the value in 'real_gfn' as a GFN, we do real_gfn = gpa_to_gfn(real_gfn); instead. 'If you see a "buffalo" sign on an elephant's cage, do not trust your eyes', but let's fix it for good. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200622151435.752560-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22KVM: LAPIC: ensure APIC map is up to date on concurrent update requestsPaolo Bonzini
The following race can cause lost map update events: cpu1 cpu2 apic_map_dirty = true ------------------------------------------------------------ kvm_recalculate_apic_map: pass check mutex_lock(&kvm->arch.apic_map_lock); if (!kvm->arch.apic_map_dirty) and in process of updating map ------------------------------------------------------------- other calls to apic_map_dirty = true might be too late for affected cpu ------------------------------------------------------------- apic_map_dirty = false ------------------------------------------------------------- kvm_recalculate_apic_map: bail out on if (!kvm->arch.apic_map_dirty) To fix it, record the beginning of an update of the APIC map in apic_map_dirty. If another APIC map change switches apic_map_dirty back to DIRTY during the update, kvm_recalculate_apic_map should not make it CLEAN, and the other caller will go through the slow path. Reported-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22kvm: lapic: fix broken vcpu hotplugIgor Mammedov
Guest fails to online hotplugged CPU with error smpboot: do_boot_cpu failed(-1) to wakeup CPU#4 It's caused by the fact that kvm_apic_set_state(), which used to call recalculate_apic_map() unconditionally and pulled hotplugged CPU into apic map, is updating map conditionally on state changes. In this case the APIC map is not considered dirty and the is not updated. Fix the issue by forcing unconditional update from kvm_apic_set_state(), like it used to be. Fixes: 4abaffce4d25a ("KVM: LAPIC: Recalculate apic map in batch") Cc: stable@vger.kernel.org Signed-off-by: Igor Mammedov <imammedo@redhat.com> Message-Id: <20200622160830.426022-1-imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-22x86/msr: Move the F15h MSRs where they belongBorislav Petkov
1068ed4547ad ("x86/msr: Lift AMD family 0x15 power-specific MSRs") moved the three F15h power MSRs to the architectural list but that was wrong as they belong in the family 0x15 list. That also caused: In file included from trace/beauty/tracepoints/x86_msr.c:10: perf/trace/beauty/generated/x86_arch_MSRs_array.c:292:45: error: initialized field overwritten [-Werror=override-init] 292 | [0xc0010280 - x86_AMD_V_KVM_MSRs_offset] = "F15H_PTSC", | ^~~~~~~~~~~ perf/trace/beauty/generated/x86_arch_MSRs_array.c:292:45: note: (near initialization for 'x86_AMD_V_KVM_MSRs[640]') due to MSR_F15H_PTSC ending up being defined twice. Move them where they belong and drop the duplicate. Also, drop the respective tools/ changes of the msr-index.h copy the above commit added because perf tool developers prefer to go through those changes themselves in order to figure out whether changes to the kernel headers would need additional handling in perf. Fixes: 1068ed4547ad ("x86/msr: Lift AMD family 0x15 power-specific MSRs") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lkml.kernel.org/r/20200621163323.14e8533f@canb.auug.org.au
2020-06-22fork: fold legacy_clone_args_valid() into _do_fork()Christian Brauner
This separate helper only existed to guarantee the mutual exclusivity of CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD abuses the parent_tid field to return the pidfd. But we can actually handle this uniformely thus removing the helper. For legacy clone we can detect that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID because they will share the same memory which is invalid and for clone3() setting the separate pidfd and parent_tid fields to the same memory is bogus as well. So fold that helper directly into _do_fork() by detecting this case. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: linux-m68k@lists.linux-m68k.org Cc: x86@kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-20Merge branch 'i2c/for-current' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: - a small collection of remaining API conversion patches (all acked) which allow to finally remove the deprecated API - some documentation fixes and a MAINTAINERS addition * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: MAINTAINERS: Add robert and myself as qcom i2c cci maintainers i2c: smbus: Fix spelling mistake in the comments Documentation/i2c: SMBus start signal is S not A i2c: remove deprecated i2c_new_device API Documentation: media: convert to use i2c_new_client_device() video: backlight: tosa_lcd: convert to use i2c_new_client_device() x86/platform/intel-mid: convert to use i2c_new_client_device() drm: encoder_slave: use new I2C API drm: encoder_slave: fix refcouting error for modules