path: root/arch/x86
2018-07-20  x86/tsc: Redefine notsc to behave as tsc=unstable  (Pavel Tatashin)
Currently, the notsc kernel parameter disables the use of the TSC by sched_clock(). However, this parameter does not prevent the kernel from accessing tsc in other places. The only rationale to boot with notsc is to avoid timing discrepancies on multi-socket systems where TSC are not properly synchronized, and thus exclude TSC from being used for time keeping. But that prevents using TSC as sched_clock() as well, which is not necessary as the core sched_clock() implementation can handle non synchronized TSC based sched clocks just fine. However, there is another method to solve the above problem: booting with tsc=unstable parameter. This parameter allows sched_clock() to use TSC and just excludes it from timekeeping. So there is no real reason to keep notsc, but for compatibility reasons the parameter has to stay. Make it behave like 'tsc=unstable' instead. [ tglx: Massaged changelog ] Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-12-pasha.tatashin@oracle.com
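For illustration, a minimal sketch of what such a redirection can look like in arch/x86/kernel/tsc.c, assuming the existing __setup() handler is simply rewired to the same path 'tsc=unstable' takes (not the verbatim patch):

    /*
     * Sketch only: treat "notsc" as "tsc=unstable" so sched_clock() keeps
     * using the TSC while timekeeping excludes it.
     */
    static int __init notsc_setup(char *str)
    {
            mark_tsc_unstable("boot parameter notsc");
            return 1;
    }
    __setup("notsc", notsc_setup);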
2018-07-20  x86/CPU: Call detect_nopl() only on the BSP  (Borislav Petkov)
Make it use the setup_* variants and have it be called only on the BSP and drop the call in generic_identify() - X86_FEATURE_NOPL will be replicated to the APs through the forced caps. Helps to keep the mess at a manageable level. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-11-pasha.tatashin@oracle.com
2018-07-20  x86/jump_label: Initialize static branching early  (Pavel Tatashin)
Static branching is useful to runtime patch branches that are used in hot path, but are infrequently changed. The x86 clock framework is one example that uses static branches to setup the best clock during boot and never changes it again. It is desired to enable the TSC based sched clock early to allow fine grained boot time analysis early on. That requires the static branching functionality to be functional early as well. Static branching requires patching nop instructions, thus, arch_init_ideal_nops() must be called prior to jump_label_init(). Do all the necessary steps to call arch_init_ideal_nops() right after early_cpu_init(), which also allows to insert a call to jump_label_init() right after that. jump_label_init() will be called again from the generic init code, but the code is protected against reinitialization already. [ tglx: Massaged changelog ] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-10-pasha.tatashin@oracle.com
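As a reminder of the mechanism this enables, here is a hedged sketch of the static-branch pattern; the key name, fallback clock and helper names are illustrative, not taken from the patch:

    #include <linux/init.h>
    #include <linux/jiffies.h>
    #include <linux/jump_label.h>
    #include <asm/msr.h>

    /* Illustrative key, not from the patch. */
    static DEFINE_STATIC_KEY_FALSE(sched_clock_uses_tsc);

    static u64 example_sched_clock(void)
    {
            /*
             * static_branch_likely() compiles to a NOP that is patched into a
             * jump when the key is enabled; arch_init_ideal_nops() must have
             * selected the right NOPs before any such patching happens.
             */
            if (static_branch_likely(&sched_clock_uses_tsc))
                    return rdtsc();                          /* fast path  */

            return (u64)jiffies * (1000000000UL / HZ);       /* coarse fallback */
    }

    static void __init example_pick_clock(void)
    {
            static_branch_enable(&sched_clock_uses_tsc);     /* patches the branch sites */
    }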
2018-07-20  x86/alternatives, jumplabel: Use text_poke_early() before mm_init()  (Pavel Tatashin)
It is supposed to be safe to modify static branches after jump_label_init(). But because the static key modifying code eventually calls text_poke(), it can end up accessing a struct page which has not been initialized yet. Here is how to quickly reproduce the problem. Insert code like this into init/main.c:

 | +static DEFINE_STATIC_KEY_FALSE(__test);
 | asmlinkage __visible void __init start_kernel(void)
 | {
 |         char *command_line;
 |@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
 |         vfs_caches_init_early();
 |         sort_main_extable();
 |         trap_init();
 |+        {
 |+                static_branch_enable(&__test);
 |+                WARN_ON(!static_branch_likely(&__test));
 |+        }
 |         mm_init();

The following warning shows up:

 WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
 RIP: 0010:text_poke+0x20d/0x230
 Call Trace:
  ? text_poke_bp+0x50/0xda
  ? arch_jump_label_transform+0x89/0xe0
  ? __jump_label_update+0x78/0xb0
  ? static_key_enable_cpuslocked+0x4d/0x80
  ? static_key_enable+0x11/0x20
  ? start_kernel+0x23e/0x4c8
  ? secondary_startup_64+0xa5/0xb0
 ---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct pages are initialized during the handover from memblock. Use text_poke_early() for static branching until early boot IRQs are enabled, and switch to text_poke() from there on. Also, ensure text_poke() is never invoked while uninitialized memory access may happen, by adding a !after_bootmem assertion. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: pbonzini@redhat.com Link: https://lkml.kernel.org/r/20180719205545.16512-9-pasha.tatashin@oracle.com
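A sketch of the assertion described in the last paragraph; its exact placement inside text_poke() is an assumption:

    void *text_poke(void *addr, const void *opcode, size_t len)
    {
            /*
             * struct page for the kernel image only exists after mm_init();
             * earlier callers must use text_poke_early() instead.
             */
            BUG_ON(!after_bootmem);

            /* ... existing temporary-mapping poke logic continues here ... */
            return addr;
    }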
2018-07-20  x86/kvmclock: Switch kvmclock data to a PER_CPU variable  (Thomas Gleixner)
The previous removal of the memblock dependency from kvmclock introduced a static data array sized 64 bytes * CONFIG_NR_CPUS. That's wasteful on large systems when kvmclock is not used. Replace it with: - A static page sized array of pvclock data. It's page sized because the pvclock data of the boot cpu is mapped into the vDSO, so otherwise random other data would be exposed to the vDSO. - A PER_CPU variable of pvclock data pointers. This is used to access the pvclock data storage on each CPU. The setup is done in two stages: - Early boot stores the pointer to the static page for the boot CPU in the per cpu data. - In the preparatory stage of CPU hotplug, assign either an element of the static array (when the CPU number is in that range) or allocate memory and initialize the per cpu pointer. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Link: https://lkml.kernel.org/r/20180719205545.16512-8-pasha.tatashin@oracle.com
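A sketch of the two-stage setup, as it would live in kvmclock.c; the names mirror what the description suggests but should be treated as illustrative rather than the exact patch:

    #define HVC_BOOT_ARRAY_SIZE \
            (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))

    /* Page-sized so only pvclock data ends up mapped into the vDSO. */
    static struct pvclock_vsyscall_time_info
                    hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE);
    static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);

    /* CPU-hotplug preparation stage: point each CPU at its pvclock storage. */
    static int kvmclock_setup_percpu(unsigned int cpu)
    {
            struct pvclock_vsyscall_time_info **p =
                    per_cpu_ptr(&hv_clock_per_cpu, cpu);

            if (!*p && cpu < HVC_BOOT_ARRAY_SIZE)
                    *p = &hv_clock_boot[cpu];               /* static boot-time slots  */
            else if (!*p)
                    *p = kzalloc(sizeof(**p), GFP_KERNEL);  /* CPUs above that range   */

            return *p ? 0 : -ENOMEM;
    }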
2018-07-20  x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock  (Thomas Gleixner)
There is no point to have this in the kvm code itself and call it from there. This can be called from an initcall and the parameter is cleared when the hypervisor is not KVM. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Link: https://lkml.kernel.org/r/20180719205545.16512-7-pasha.tatashin@oracle.com
2018-07-20  x86/kvmclock: Mark variables __initdata and __ro_after_init  (Thomas Gleixner)
The kvmclock parameter is init data and the other variables are not modified after init. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Link: https://lkml.kernel.org/r/20180719205545.16512-6-pasha.tatashin@oracle.com
2018-07-20  x86/kvmclock: Cleanup the code  (Thomas Gleixner)
- Clean up the MSR write for the wall clock. The type casts to (int) are sloppy because the wrmsr parameters are u32, and aside from that, wrmsrl() already provides the high/low split for free. - Remove the pointless get_cpu()/put_cpu() dance from various functions. Either they are called during early init where the CPU is guaranteed to be 0, or they are already called from non-preemptible context where smp_processor_id() can be used safely. - Simplify the convoluted check for kvmclock in the init function. - Mark the parameter parsing function __init. No point in keeping it around. - Convert to pr_info(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Link: https://lkml.kernel.org/r/20180719205545.16512-5-pasha.tatashin@oracle.com
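The first point amounts to the difference sketched below (variable usage illustrative, not the verbatim diff):

    /* Before: open-coded high/low split with sloppy (int) casts. */
    pa = slow_virt_to_phys(&wall_clock);
    native_write_msr(msr_kvm_wall_clock, (int)pa, (int)(pa >> 32));

    /* After: wrmsrl() takes the full 64-bit value and splits it internally. */
    wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));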
2018-07-20  x86/kvmclock: Decrapify kvm_register_clock()  (Thomas Gleixner)
The return value is pointless because the wrmsr cannot fail if KVM_FEATURE_CLOCKSOURCE or KVM_FEATURE_CLOCKSOURCE2 are set. kvm_register_clock() is only called locally so wants to be static. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Link: https://lkml.kernel.org/r/20180719205545.16512-4-pasha.tatashin@oracle.com
2018-07-20  x86/kvmclock: Remove page size requirement from wall_clock  (Thomas Gleixner)
There is no requirement for wall_clock data to be page aligned or page sized. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Link: https://lkml.kernel.org/r/20180719205545.16512-3-pasha.tatashin@oracle.com
2018-07-20  x86/kvmclock: Remove memblock dependency  (Pavel Tatashin)
KVM clock is initialized later than other hypervisor clocks because it has a dependency on the memblock allocator. Bring it in line with other hypervisors by using memory from the BSS instead of allocating it. The benefits: - Remove ifdef from common code - Earlier availability of the clock - Remove dependency on memblock, and reduce code The downside: - Static allocation of the per cpu data structures sized NR_CPUS * 64 bytes; this will be addressed in follow-up patches. [ tglx: Split out from larger series ] Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: steven.sistare@oracle.com Cc: daniel.m.jordan@oracle.com Cc: linux@armlinux.org.uk Cc: schwidefsky@de.ibm.com Cc: heiko.carstens@de.ibm.com Cc: john.stultz@linaro.org Cc: sboyd@codeaurora.org Cc: hpa@zytor.com Cc: douly.fnst@cn.fujitsu.com Cc: peterz@infradead.org Cc: prarit@redhat.com Cc: feng.tang@intel.com Cc: pmladek@suse.com Cc: gnomes@lxorguk.ukuu.org.uk Cc: linux-s390@vger.kernel.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Link: https://lkml.kernel.org/r/20180719205545.16512-2-pasha.tatashin@oracle.com
2018-07-19  Merge branch 'linus' into x86/timers  (Thomas Gleixner)
Pick up upstream changes to avoid conflicts
2018-07-19  x86: Avoid pr_cont() in show_opcodes()  (Rasmus Villemoes)
If concurrent printk() messages are emitted, then pr_cont() makes it extremely hard to decode which part of the output belongs to what. See the convoluted example at: https://syzkaller.appspot.com/text?tag=CrashReport&x=139d342c400000 Avoid this by using a proper prefix for each line and by using the %ph format in show_opcodes(), which emits the 'Code:' line in one go. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: joe@perches.com Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@suse.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/1532009278-5953-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
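A hedged sketch of the %ph usage; the buffer size, copy helper and function signature are assumptions, not the exact patch:

    static void show_opcodes_sketch(struct pt_regs *regs, const char *loglvl)
    {
            u8 opcodes[42];         /* illustrative size */

            if (probe_kernel_read(opcodes, (void *)regs->ip, sizeof(opcodes))) {
                    printk("%sCode: Bad RIP value.\n", loglvl);
                    return;
            }

            /*
             * "%<n>ph" dumps up to 64 bytes from one buffer as hex in a single
             * printk() call, so the whole "Code:" line is emitted atomically
             * instead of being assembled piecewise with pr_cont().
             */
            printk("%sCode: %42ph\n", loglvl, opcodes);
    }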
2018-07-19  x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content  (Nicolai Stange)
The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order to evict the L1d cache. However, these pages are never cleared and, in theory, their data could be leaked. More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses. Fix this by initializing the individual vmx_l1d_flush_pages with a different pattern each. Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to "flush_pages" to reflect this change. Fixes: a47dd5f06714 ("x86/KVM/VMX: Add L1D flush algorithm") Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
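A sketch of the per-page initialization the fix describes, using the constants already referenced in vmx.c; the exact fill pattern is an assumption:

    struct page *page;
    int i;

    page = alloc_pages(GFP_KERNEL, L1D_CACHE_ORDER);
    if (!page)
            return -ENOMEM;
    vmx_l1d_flush_pages = page_address(page);

    /*
     * Give each of the 1 << L1D_CACHE_ORDER pages distinct, non-zero
     * contents so KSM cannot merge them into fewer host physical pages,
     * which would defeat the physically-indexed L1D eviction loop.
     */
    for (i = 0; i < 1u << L1D_CACHE_ORDER; i++)
            memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1, PAGE_SIZE);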
2018-07-19  x86/speculation: Remove SPECTRE_V2_IBRS in enum spectre_v2_mitigation  (Jiang Biao)
SPECTRE_V2_IBRS in enum spectre_v2_mitigation is never used. Remove it. Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Cc: dwmw2@amazon.co.uk Cc: konrad.wilk@oracle.com Cc: bp@suse.de Cc: zhong.weidong@zte.com.cn Link: https://lkml.kernel.org/r/1531872194-39207-1-git-send-email-jiang.biao2@zte.com.cn
2018-07-19  um: remove redundant 'export LDFLAGS' in arch/x86/Makefile.um  (Masahiro Yamada)
This is already exported by the top-level Makefile. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
2018-07-18  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm fixes from Paolo Bonzini: "Miscellaneous bugfixes, plus a small patchlet related to Spectre v2"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvmclock: fix TSC calibration for nested guests
  KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled
  KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
  KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.
  x86/kvmclock: set pvti_cpu0_va after enabling kvmclock
  x86/kvm/Kconfig: Ensure CRYPTO_DEV_CCP_DD state at minimum matches KVM_AMD
  kvm: nVMX: Restore exit qual for VM-entry failure due to MSR loading
  x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks
  KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR
2018-07-18  kvmclock: fix TSC calibration for nested guests  (Peng Hao)
Inside a nested guest, access to hardware can be slow enough that tsc_read_refs always returns ULLONG_MAX, causing tsc_refine_calibration_work to be called periodically and the nested guest to spend a lot of time reading the ACPI timer. However, if the TSC frequency is available from the pvclock page, we can just set X86_FEATURE_TSC_KNOWN_FREQ and avoid the 'refine' recalibration operation. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> [Commit message rewritten. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-18  KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled  (Liran Alon)
When eVMCS is enabled, all VMCS allocated to be used by KVM are marked with the revision_id of KVM_EVMCS_VERSION instead of the revision_id reported by MSR_IA32_VMX_BASIC. However, even though not explicitly documented by TLFS, the VMXArea passed as the VMXON argument should still be marked with the revision_id reported by the physical CPU. This issue was found by the following setup:

* L0 = KVM which exposes eVMCS to its L1 guest.
* L1 = KVM which consumes eVMCS reported by L0.

This setup caused the following to occur:

1) L1 executes hardware_enable().
2) hardware_enable() calls kvm_cpu_vmxon() to execute VMXON.
3) L0 intercepts L1 VMXON and executes handle_vmon(), which notes vmxarea->revision_id != VMCS12_REVISION and therefore fails with nested_vmx_failInvalid(), which sets RFLAGS.CF.
4) L1 kvm_cpu_vmxon() doesn't check RFLAGS.CF for failure and therefore hardware_enable() continues as usual.
5) L1 hardware_enable() then calls ept_sync_global(), which executes INVEPT.
6) L0 intercepts INVEPT and executes handle_invept(), which notes !vmx->nested.vmxon and thus raises a #UD to L1.
7) The raised #UD caused L1 to panic.

Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Cc: stable@vger.kernel.org Fixes: 773e8a0425c923bc02668a2d6534a5ef5a43cc69 Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-18  x86: Add build salt to the vDSO  (Laura Abbott)
The vDSO needs to have a unique build id in a similar manner to the kernel and modules. Use the build salt macro. Acked-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-07-18  kbuild: move bin2c back to scripts/ from scripts/basic/  (Masahiro Yamada)
Commit 8370edea81e3 ("bin2c: move bin2c in scripts/basic") moved bin2c to the scripts/basic/ directory, incorrectly stating "Kexec wants to use bin2c and it wants to use it really early in the build process. See arch/x86/purgatory/ code in later patches." Commit bdab125c9301 ("Revert "kexec/purgatory: Add clean-up for purgatory directory"") and commit d6605b6bbee8 ("x86/build: Remove unnecessary preparation for purgatory") removed the redundant purgatory build magic entirely. That means that the move of bin2c was unnecessary in the first place. fixdep is the only host program that deserves to sit in the scripts/basic/ directory. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-07-17  x86/MCE: Remove min interval polling limitation  (Dewet Thibaut)
commit b3b7c4795c ("x86/MCE: Serialize sysfs changes") introduced a min interval limitation when setting the check interval for polled MCEs. However, the logic is that 0 disables polling for corrected MCEs, see Documentation/x86/x86_64/machinecheck. The limitation prevents disabling. Remove this limitation and allow the value 0 to disable polling again. Fixes: b3b7c4795c ("x86/MCE: Serialize sysfs changes") Signed-off-by: Dewet Thibaut <thibaut.dewet@nokia.com> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180716084927.24869-1-alexander.sverdlin@nokia.com
2018-07-17  x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()  (Rik van Riel)
Song Liu noticed switch_mm_irqs_off() taking a lot of CPU time in recent kernels, using 1.8% of a 48 CPU system during a netperf to localhost run. Digging into the profile, we noticed that cpumask_clear_cpu() and cpumask_set_cpu() together take about half of the CPU time taken by switch_mm_irqs_off(). However, the CPUs running netperf end up switching back and forth between netperf and the idle task, which does not require changes to the mm_cpumask. Furthermore, the init_mm cpumask ends up being the most heavily contended one in the system. Simply skipping changes to mm_cpumask(&init_mm) reduces overhead. Reported-and-tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-8-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
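The change boils down to guarding the two cpumask updates, roughly as sketched here (simplified from the real switch_mm_irqs_off() flow):

    /* Leaving the old mm: kernel threads on init_mm never set the bit. */
    if (real_prev != &init_mm)
            cpumask_clear_cpu(cpu, mm_cpumask(real_prev));

    /* Entering the new mm: init_mm is never flushed via mm_cpumask(). */
    if (next != &init_mm)
            cpumask_set_cpu(cpu, mm_cpumask(next));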
2018-07-17  x86/mm/tlb: Always use lazy TLB mode  (Rik van Riel)
Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except at page table freeing time, and idle CPUs will no longer get shootdown IPIs for things like mprotect and madvise, we can always use lazy TLB mode. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17  x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs  (Rik van Riel)
CPUs in !is_lazy have either received TLB flush IPIs earlier on during the munmap (when the user memory was unmapped), or have context switched and reloaded during that stage of the munmap. Page table free TLB flushes only need to be sent to CPUs in lazy TLB mode, whose TLB contents might not be up to date yet. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-6-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17  x86/mm/tlb: Make lazy TLB mode lazier  (Rik van Riel)
Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, when all it really needs to do is reload %CR3 at the next context switch, assuming no page table pages got freed. Memory ordering is used to prevent race conditions between switch_mm_irqs_off, which checks whether .tlb_gen changed, and the TLB invalidation code, which increments .tlb_gen whenever page table entries get invalidated. The atomic increment in inc_mm_tlb_gen is its own barrier; the context switch code adds an explicit barrier between reading tlbstate.is_lazy and next->context.tlb_gen. Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set are not removed from the mm_cpumask(mm), since that would prevent the TLB flush IPIs at page table free time from being sent to all the CPUs that need them. This patch reduces total CPU use in the system by about 1-2% for a memcache workload on two socket systems, and by about 1% for a heavily multi-process netperf between two systems. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-5-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17  x86/mm/tlb: Restructure switch_mm_irqs_off()  (Rik van Riel)
Move some code that will be needed for the lazy -> !lazy state transition when a lazy TLB CPU has gotten out of date. No functional changes, since the if (real_prev == next) branch always returns. Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Link: http://lkml.kernel.org/r/20180716190337.26133-4-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17  x86/mm/tlb: Leave lazy TLB mode at page table free time  (Rik van Riel)
Andy discovered that speculative memory accesses while in lazy TLB mode can crash a system, when a CPU tries to dereference a speculative access using memory contents that used to be valid page table memory, but have since been reused for something else and point into la-la land. This problem can be prevented in two ways. The first is to always send a TLB shootdown IPI to CPUs in lazy TLB mode, while the second one is to only send the TLB shootdown at page table freeing time. The second option should result in fewer IPIs, since operations like mprotect and madvise are very common with some workloads, but do not involve page table freeing. Also, on munmap, batching of page table freeing covers much larger ranges of virtual memory than the batching of unmapped user pages. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17  Merge tag 'v4.18-rc5' into x86/mm, to pick up fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17  Merge tag 'v4.18-rc5' into locking/core, to pick up fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16  x86/apm: Don't access __preempt_count with zeroed fs  (Ville Syrjälä)
APM_DO_POP_SEGS does not restore fs/gs which were zeroed by APM_DO_ZERO_SEGS. Trying to access __preempt_count with zeroed fs doesn't really work. Move the ibrs call outside the APM_DO_SAVE_SEGS/APM_DO_RESTORE_SEGS invocations so that fs is actually restored before calling preempt_enable(). Fixes the following sort of oopses: [ 0.313581] general protection fault: 0000 [#1] PREEMPT SMP [ 0.313803] Modules linked in: [ 0.314040] CPU: 0 PID: 268 Comm: kapmd Not tainted 4.16.0-rc1-triton-bisect-00090-gdd84441a7971 #19 [ 0.316161] EIP: __apm_bios_call_simple+0xc8/0x170 [ 0.316161] EFLAGS: 00210016 CPU: 0 [ 0.316161] EAX: 00000102 EBX: 00000000 ECX: 00000102 EDX: 00000000 [ 0.316161] ESI: 0000530e EDI: dea95f64 EBP: dea95f18 ESP: dea95ef0 [ 0.316161] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 0.316161] CR0: 80050033 CR2: 00000000 CR3: 015d3000 CR4: 000006d0 [ 0.316161] Call Trace: [ 0.316161] ? cpumask_weight.constprop.15+0x20/0x20 [ 0.316161] on_cpu0+0x44/0x70 [ 0.316161] apm+0x54e/0x720 [ 0.316161] ? __switch_to_asm+0x26/0x40 [ 0.316161] ? __schedule+0x17d/0x590 [ 0.316161] kthread+0xc0/0xf0 [ 0.316161] ? proc_apm_show+0x150/0x150 [ 0.316161] ? kthread_create_worker_on_cpu+0x20/0x20 [ 0.316161] ret_from_fork+0x2e/0x38 [ 0.316161] Code: da 8e c2 8e e2 8e ea 57 55 2e ff 1d e0 bb 5d b1 0f 92 c3 5d 5f 07 1f 89 47 0c 90 8d b4 26 00 00 00 00 90 8d b4 26 00 00 00 00 90 <64> ff 0d 84 16 5c b1 74 7f 8b 45 dc 8e e0 8b 45 d8 8e e8 8b 45 [ 0.316161] EIP: __apm_bios_call_simple+0xc8/0x170 SS:ESP: 0068:dea95ef0 [ 0.316161] ---[ end trace 656253db2deaa12c ]--- Fixes: dd84441a7971 ("x86/speculation: Use IBRS if available before calling into firmware") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709133534.5963-1-ville.syrjala@linux.intel.com
2018-07-16  x86/pti: Make pti_set_kernel_image_nonglobal() static  (Jiang Biao)
pti_set_kernel_image_nonglobal() is only used in pti.c, make it static. Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: luto@kernel.org Cc: hpa@zytor.com Cc: albcamus@gmail.com Cc: zhong.weidong@zte.com.cn Link: https://lkml.kernel.org/r/1531713820-24544-4-git-send-email-jiang.biao2@zte.com.cn
2018-07-16  x86/hyper-v: Check for VP_INVAL in hyperv_flush_tlb_others()  (Vitaly Kuznetsov)
Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI enlightenment") pre-filled hv_vp_index with VP_INVAL so it is now (theoretically) possible to observe hv_cpu_number_to_vp_number() returning VP_INVAL. We need to check for that in hyperv_flush_tlb_others(). Not checking for VP_INVAL on the first call site where we do if (hv_cpu_number_to_vp_number(cpumask_last(cpus)) >= 64) goto do_ex_hypercall; is OK, in case we're eligible for non-ex hypercall we'll catch the issue later in for_each_cpu() cycle and in case we'll be doing ex- hypercall cpumask_to_vpset() will fail. It would be nice to change hv_cpu_number_to_vp_number() return value's type to 'u32' but this will likely be a bigger change as all call sites need to be checked first. Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI enlightenment") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com> Cc: devel@linuxdriverproject.org Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709174012.17429-3-vkuznets@redhat.com
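A sketch of the added guard in the non-ex hypercall path; the label names and the surrounding loop details are assumptions based on the description above:

    for_each_cpu(cpu, cpus) {
            int vcpu = hv_cpu_number_to_vp_number(cpu);

            /* hv_vp_index may still hold the VP_INVAL pre-fill for this CPU. */
            if (vcpu == VP_INVAL)
                    goto do_native_flush;   /* fall back to the native IPI path */

            if (vcpu >= 64)
                    goto do_ex_hypercall;   /* sparse-set variant, as in the text */

            __set_bit(vcpu, (unsigned long *)&flush->processor_mask);
    }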
2018-07-16  x86/hyper-v: Check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()  (Vitaly Kuznetsov)
Commit 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI enlightenment") made cpumask_to_vpset() return '-1' when there is a CPU with an unknown VP index in the supplied set. This needs to be checked before we pass 'nr_bank' to the hypercall. Fixes: 1268ed0c474a ("x86/hyper-v: Fix the circular dependency in IPI enlightenment") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com> Cc: devel@linuxdriverproject.org Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709174012.17429-2-vkuznets@redhat.com
2018-07-16  Merge 4.18-rc5 into char-misc-next  (Greg Kroah-Hartman)
We want the char-misc fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-16  efi: Drop type and attribute checks in efi_mem_desc_lookup()  (Ard Biesheuvel)
The current implementation of efi_mem_desc_lookup() includes the following check on the memory descriptor it returns: if (!(md->attribute & EFI_MEMORY_RUNTIME) && md->type != EFI_BOOT_SERVICES_DATA && md->type != EFI_RUNTIME_SERVICES_DATA) { continue; } This means that only EfiBootServicesData or EfiRuntimeServicesData regions are considered, or any other region type provided that it has the EFI_MEMORY_RUNTIME attribute set. Given what the name of the function implies, and the fact that any physical address can be described in the UEFI memory map only a single time, it does not make sense to impose this condition in the body of the loop, but instead, should be imposed by the caller depending on the value that is returned to it. Two such callers exist at the moment: - The BGRT code when running on x86, via efi_mem_reserve() and efi_arch_mem_reserve(). In this case, the region is already known to be EfiBootServicesData, and so the check is redundant. - The ESRT handling code which introduced this function, which calls it both directly from efi_esrt_init() and again via efi_mem_reserve() and efi_arch_mem_reserve() [on x86]. So let's move this check into the callers instead. This preserves the current behavior both for BGRT and ESRT handling, and allows the lookup routine to be reused by other [upcoming] users that don't have this limitation. In the ESRT case, keep the entire condition, so that platforms that deviate from the UEFI spec and use something other than EfiBootServicesData for the ESRT table will keep working as before. For x86's efi_arch_mem_reserve() implementation, limit the type to EfiBootServicesData, since it is the only type the reservation code expects to operate on in the first place. While we're at it, drop the __init annotation so that drivers can use it as well. Tested-by: Laszlo Ersek <lersek@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20180711094040.12506-8-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16  efi/x86: Use non-blocking SetVariable() for efi_delete_dummy_variable()  (Sai Praneeth)
Presently, efi_delete_dummy_variable() uses set_variable() which might block, which the scheduler is rightfully upset about when used from the idle thread, producing this splat: "bad: scheduling from the idle thread!" So, make efi_delete_dummy_variable() use set_variable_nonblocking(), which, as the name suggests, doesn't block. Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20180711094040.12506-3-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16  efi/x86: Clean up the eboot code  (Ingo Molnar)
Various small cleanups: - Standardize printk messages: 'alloc' => 'allocate' 'mem' => 'memory' also put variable names in printk messages between quotes. - Align mass-assignments vertically for better readability - Break multi-line function prototypes at the name where possible, not in the middle of the parameter list - Use a newline before return statements consistently. - Use curly braces in a balanced fashion. - Remove stray newlines. No change in functionality. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20180711094040.12506-2-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16  x86/build: Remove old -funit-at-a-time GCC quirk  (Masahiro Yamada)
The following commit: e501ce957a78 ("x86: Force asm-goto") ... bumped the minimum GCC version to 4.5 for building the x86 kernel. arch/x86/Makefile no longer needs to take care of older GCC versions, such as this pre-4.0 -funit-at-a-time quirk. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: http://lkml.kernel.org/r/1531138041-24200-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16  x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling  (Dan Williams)
All copy_to_user() implementations need to be prepared to handle faults accessing userspace. The __memcpy_mcsafe() implementation handles both mmu-faults on the user destination and machine-check-exceptions on the source buffer. However, the memcpy_mcsafe() wrapper may silently fallback to memcpy() depending on build options and cpu-capabilities. Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when available, and otherwise disable all of the copy_to_user_mcsafe() infrastructure when __memcpy_mcsafe() is not available, i.e. CONFIG_X86_MCE=n. This fixes crashes of the form: run fstests generic/323 at 2018-07-02 12:46:23 BUG: unable to handle kernel paging request at 00007f0d50001000 RIP: 0010:__memcpy+0x12/0x20 [..] Call Trace: copyout_mcsafe+0x3a/0x50 _copy_to_iter_mcsafe+0xa1/0x4a0 ? dax_alive+0x30/0x50 dax_iomap_actor+0x1f9/0x280 ? dax_iomap_rw+0x100/0x100 iomap_apply+0xba/0x130 ? dax_iomap_rw+0x100/0x100 dax_iomap_rw+0x95/0x100 ? dax_iomap_rw+0x100/0x100 xfs_file_dax_read+0x7b/0x1d0 [xfs] xfs_file_read_iter+0xa7/0xc0 [xfs] aio_read+0x11c/0x1a0 Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Fixes: 8780356ef630 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()") Link: http://lkml.kernel.org/r/153108277790.37979.1486841789275803399.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
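A hedged sketch of the forced-mcsafe wrapper this describes; treat it as an illustration of the idea rather than the verbatim patch:

    #ifdef CONFIG_X86_MCE   /* __memcpy_mcsafe() only exists in this configuration */
    static __always_inline unsigned long
    copy_to_user_mcsafe(void *to, const void *from, unsigned int len)
    {
            unsigned long ret;

            __uaccess_begin();
            /*
             * Always take the mcsafe path: it has exception handling for both
             * #MC on the source and faults on the user destination, which a
             * plain memcpy() fallback would not.
             */
            ret = __memcpy_mcsafe(to, from, len);
            __uaccess_end();
            return ret;
    }
    #endif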
2018-07-15  Merge tag 'v4.18-rc5' into sched/core, to pick up fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15  x86/kvmclock: set pvti_cpu0_va after enabling kvmclock  (Radim Krčmář)
pvti_cpu0_va is the address of shared kvmclock data structure. pvti_cpu0_va is currently kept unset (1) on 32 bit systems, (2) when kvmclock vsyscall is disabled, and (3) if kvmclock is not stable. This poses a problem, because kvm_ptp needs pvti_cpu0_va, but (1) can work on 32 bit, (2) has little relation to the vsyscall, and (3) does not need stable kvmclock (although kvmclock won't be used for system clock if it's not stable, so kvm_ptp is pointless in that case). Expose pvti_cpu0_va whenever kvmclock is enabled to allow all users to work with it. This fixes a regression found on Gentoo: https://bugs.gentoo.org/658544. Fixes: 9f08890ab906 ("x86/pvclock: add setter for pvclock_pvti_cpu0_va") Cc: stable@vger.kernel.org Reported-by: Andreas Steinmetz <ast@domdv.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-15  x86/kvm/Kconfig: Ensure CRYPTO_DEV_CCP_DD state at minimum matches KVM_AMD  (Janakarajan Natarajan)
Prevent a config where KVM_AMD=y and CRYPTO_DEV_CCP_DD=m thereby ensuring that AMD Secure Processor device driver will be built-in when KVM_AMD is also built-in. v1->v2: * Removed usage of 'imply' Kconfig option. * Change patch commit message. Fixes: 505c9e94d832 ("KVM: x86: prefer "depends on" to "select" for SEV") Cc: <stable@vger.kernel.org> # 4.16.x Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-15  kvm: nVMX: Restore exit qual for VM-entry failure due to MSR loading  (Jim Mattson)
This exit qualification was inadvertently dropped when the two VM-entry failure blocks were coalesced. Fixes: e79f245ddec1 ("X86/KVM: Properly update 'tsc_offset' to represent the running guest") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-15  x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks  (Vitaly Kuznetsov)
When we switched from doing rdmsr() to reading FS/GS base values from current->thread we completely forgot about legacy 32-bit userspaces which we still support in KVM (why?). task->thread.{fsbase,gsbase} are only synced for 64-bit processes, calling save_fsgs_for_kvm() and using its result from current is illegal for legacy processes. There's no ARCH_SET_FS/GS prctls for legacy applications. Base MSRs are, however, not always equal to zero. Intel's manual says (3.4.4 Segment Loading Instructions in IA-32e Mode): "In order to set up compatibility mode for an application, segment-load instructions (MOV to Sreg, POP Sreg) work normally in 64-bit mode. An entry is read from the system descriptor table (GDT or LDT) and is loaded in the hidden portion of the segment register. ... The hidden descriptor register fields for FS.base and GS.base are physically mapped to MSRs in order to load all address bits supported by a 64-bit implementation. " The issue was found by strace test suite where 32-bit ioctl_kvm_run test started segfaulting. Reported-by: Dmitry V. Levin <ldv@altlinux.org> Bisected-by: Masatake YAMATO <yamato@redhat.com> Fixes: 42b933b59721 ("x86/kvm/vmx: read MSR_{FS,KERNEL_GS}_BASE from current->thread") Cc: stable@vger.kernel.org Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-15  KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR  (Paolo Bonzini)
This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all requested features are available on the host. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-15  x86/events/intel/ds: Fix bts_interrupt_threshold alignment  (Hugh Dickins)
Markus reported that BTS is sporadically missing the tail of the trace in the perf_event data buffer: [decode error (1): instruction overflow] shown in GDB; and bisected it to the conversion of debug_store to PTI. A little "optimization" crept into alloc_bts_buffer(), which mistakenly placed bts_interrupt_threshold away from the 24-byte record boundary. Intel SDM Vol 3B 17.4.9 says "This address must point to an offset from the BTS buffer base that is a multiple of the BTS record size." Revert "max" from a byte count to a record count, to calculate the bts_interrupt_threshold correctly: which turns out to fix problem seen. Fixes: c1961a4631da ("x86/events/intel/ds: Map debug buffers in cpu_entry_area") Reported-and-tested-by: Markus T Metzger <markus.t.metzger@intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: stable@vger.kernel.org # v4.14+ Link: https://lkml.kernel.org/r/alpine.LSU.2.11.1807141248290.1614@eggly.anvils
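The arithmetic in question, roughly; the headroom fraction is illustrative, the point being that 'max' and the threshold are computed in 24-byte records so the threshold stays record-aligned (fragment of alloc_bts_buffer(), other variables from its context):

    /* Work in records, not bytes, so the threshold lands on a record boundary. */
    max    = BTS_BUFFER_SIZE / BTS_RECORD_SIZE;   /* records that fit in the buffer */
    thresh = max - (max / 16);                    /* illustrative interrupt headroom */

    ds->bts_buffer_base         = (unsigned long)cea;
    ds->bts_index               = ds->bts_buffer_base;
    ds->bts_absolute_maximum    = ds->bts_buffer_base + max    * BTS_RECORD_SIZE;
    ds->bts_interrupt_threshold = ds->bts_buffer_base + thresh * BTS_RECORD_SIZE;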
2018-07-14  Merge tag 'for-linus-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip  (Linus Torvalds)
Pull xen fixes from Juergen Gross: "Two related fixes for a boot failure of Xen PV guests"

* tag 'for-linus-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: setup pv irq ops vector earlier
  xen: remove global bit from __default_kernel_pte_mask for pv guests
2018-07-14  x86/purgatory: add missing FORCE to Makefile target  (Philipp Rudo)
To reproduce the problem:

- Build the kernel without the fix
- Add some flag to the purgatory's KBUILD_CFLAGS; I used -fno-asynchronous-unwind-tables
- Re-build the kernel

When you look at make's output you see that sha256.o is not re-built in the last step. Also, readelf -S still shows the .eh_frame section for sha256.o. With the fix, sha256.o is rebuilt in the last step. Without FORCE, make does not detect changes made only to the command line options, so object files might not be re-built even when they should be. Fix this by adding FORCE where it is missing. Link: http://lkml.kernel.org/r/20180704110044.29279-2-prudo@linux.ibm.com Fixes: df6f2801f511 ("kernel/kexec_file.c: move purgatories sha256 to common code") Signed-off-by: Philipp Rudo <prudo@linux.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> [4.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-13  x86/bugs, kvm: Introduce boot-time control of L1TF mitigations  (Jiri Kosina)
Introduce the 'l1tf=' kernel command line option to allow for boot-time switching of the mitigation that is used on processors affected by L1TF. The possible values are:

  full         Provides all available mitigations for the L1TF vulnerability. Disables SMT and enables all mitigations in the hypervisors. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled.

  full,force   Same as 'full', but disables SMT control. Implies the 'nosmt=force' command line option. sysfs control of SMT and the hypervisor flush control is disabled.

  flush        Leaves SMT enabled and enables the conditional hypervisor mitigation. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled.

  flush,nosmt  Disables SMT and enables the conditional hypervisor mitigation. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. If SMT is re-enabled or flushing is disabled at runtime, hypervisors will issue a warning.

  flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is started in a potentially insecure configuration.

  off          Disables hypervisor mitigations and doesn't emit any warnings.

The default is 'flush'. Let KVM adhere to these semantics, which means:

  - 'l1tf=full,force': Perform L1D flushes. No runtime control possible.
  - 'l1tf=full', 'l1tf=flush', 'l1tf=flush,nosmt': Perform L1D flushes and warn on VM start if SMT has been runtime enabled or L1D flushing has been runtime disabled.
  - 'l1tf=flush,nowarn': Perform L1D flushes and no warnings are emitted.
  - 'l1tf=off': L1D flushes are not performed and no warnings are emitted.

KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush' module parameter, except when 'l1tf=full,force' is set. This makes KVM's private 'nosmt' option redundant, and as it is a bit non-systematic anyway (this is something to control globally, not on hypervisor level), remove that option. Add the missing Documentation entry for the l1tf vulnerability sysfs file while at it. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
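For reference, a sketch of how such an early boot option is typically parsed; the enum names follow the scheme the text implies but should be treated as assumptions:

    static int __init l1tf_cmdline(char *str)
    {
            if (!boot_cpu_has_bug(X86_BUG_L1TF))
                    return 0;               /* nothing to mitigate on this CPU */

            if (!str)
                    return -EINVAL;

            if (!strcmp(str, "off"))
                    l1tf_mitigation = L1TF_MITIGATION_OFF;
            else if (!strcmp(str, "flush,nowarn"))
                    l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;
            else if (!strcmp(str, "flush"))
                    l1tf_mitigation = L1TF_MITIGATION_FLUSH;
            else if (!strcmp(str, "flush,nosmt"))
                    l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
            else if (!strcmp(str, "full"))
                    l1tf_mitigation = L1TF_MITIGATION_FULL;
            else if (!strcmp(str, "full,force"))
                    l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;

            return 0;
    }
    early_param("l1tf", l1tf_cmdline);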