summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2018-04-27x86/headers/UAPI: Move DISABLE_EXITS KVM capability bits to the UAPIKarimAllah Ahmed
Move DISABLE_EXITS KVM capability bits to the UAPI just like the rest of capabilities. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-04-27kvm: apic: Flush TLB after APIC mode/address change if VPIDs are in useJunaid Shahid
Currently, KVM flushes the TLB after a change to the APIC access page address or the APIC mode when EPT mode is enabled. However, even in shadow paging mode, a TLB flush is needed if VPIDs are being used, as specified in the Intel SDM Section 29.4.5. So replace vmx_flush_tlb_ept_only() with vmx_flush_tlb(), which will flush if either EPT or VPIDs are in use. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-04-27x86/entry/64/compat: Preserve r8-r11 in int $0x80Andy Lutomirski
32-bit user code that uses int $80 doesn't care about r8-r11. There is, however, some 64-bit user code that intentionally uses int $0x80 to invoke 32-bit system calls. From what I've seen, basically all such code assumes that r8-r15 are all preserved, but the kernel clobbers r8-r11. Since I doubt that there's any code that depends on int $0x80 zeroing r8-r11, change the kernel to preserve them. I suspect that very little user code is broken by the old clobber, since r8-r11 are only rarely allocated by gcc, and they're clobbered by function calls, so they only way we'd see a problem is if the same function that invokes int $0x80 also spills something important to one of these registers. The current behavior seems to date back to the historical commit "[PATCH] x86-64 merge for 2.6.4". Before that, all regs were preserved. I can't find any explanation of why this change was made. Update the test_syscall_vdso_32 testcase as well to verify the new behavior, and it strengthens the test to make sure that the kernel doesn't accidentally permute r8..r15. Suggested-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Link: https://lkml.kernel.org/r/d4c4d9985fbe64f8c9e19291886453914b48caee.1523975710.git.luto@kernel.org
2018-04-27x86/ipc: Fix x32 version of shmid64_ds and msqid64_dsArnd Bergmann
A bugfix broke the x32 shmid64_ds and msqid64_ds data structure layout (as seen from user space) a few years ago: Originally, __BITS_PER_LONG was defined as 64 on x32, so we did not have padding after the 64-bit __kernel_time_t fields, After __BITS_PER_LONG got changed to 32, applications would observe extra padding. In other parts of the uapi headers we seem to have a mix of those expecting either 32 or 64 on x32 applications, so we can't easily revert the path that broke these two structures. Instead, this patch decouples x32 from the other architectures and moves it back into arch specific headers, partially reverting the even older commit 73a2d096fdf2 ("x86: remove all now-duplicate header files"). It's not clear whether this ever made any difference, since at least glibc carries its own (correct) copy of both of these header files, so possibly no application has ever observed the definitions here. Based on a suggestion from H.J. Lu, I tried out the tool from https://github.com/hjl-tools/linux-header to find other such bugs, which pointed out the same bug in statfs(), which also has a separate (correct) copy in glibc. Fixes: f4b4aae18288 ("x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H . J . Lu" <hjl.tools@gmail.com> Cc: Jeffrey Walton <noloader@gmail.com> Cc: stable@vger.kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180424212013.3967461-1-arnd@arndb.de
2018-04-27x86/setup: Do not reserve a crash kernel region if booted on Xen PVPetr Tesarik
Xen PV domains cannot shut down and start a crash kernel. Instead, the crashing kernel makes a SCHEDOP_shutdown hypercall with the reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in arch/x86/xen/enlighten_pv.c. A crash kernel reservation is merely a waste of RAM in this case. It may also confuse users of kexec_load(2) and/or kexec_file_load(2). When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH, respectively, these syscalls return success, which is technically correct, but the crash kexec image will never be actually used. Signed-off-by: Petr Tesarik <ptesarik@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: xen-devel@lists.xenproject.org Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jean Delvare <jdelvare@suse.de> Link: https://lkml.kernel.org/r/20180425120835.23cef60c@ezekiel.suse.cz
2018-04-27x86/i8237: Register device based on FADT legacy boot flagRajneesh Bhardwaj
From Skylake onwards, the platform controller hub (Sunrisepoint PCH) does not support legacy DMA operations to IO ports 81h-83h, 87h, 89h-8Bh, 8Fh. Currently this driver registers as syscore ops and its resume function is called on every resume from S3. On Skylake and Kabylake, this causes a resume delay of around 100ms due to port IO operations, which is a problem. This change allows to load the driver only when the platform bios explicitly supports such devices or has a cut-off date earlier than 2017 due to the following reasons: - The platforms released before year 2017 have support for the 8237. (except Sunrisepoint PCH e.g. Skylake) - Some of the BIOS that were released for platforms (Skylake, Kabylake) during 2016-17 are buggy. These BIOS do not set/unset the ACPI_FADT_LEGACY_DEVICES field in FADT table properly based on the presence or absence of the DMA device. Very recently, open source system firmware like coreboot started unsetting ACPI_FADT_LEGACY_DEVICES field in FADT table if the 8237 DMA device is not present on the PCH. Please refer to chapter 21 of 6th Generation Intel® Core™ Processor Platform Controller Hub Family: BIOS Specification. Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: rjw@rjwysocki.net Cc: hpa@zytor.com Cc: Alan Cox <alan@linux.intel.com> Link: https://lkml.kernel.org/r/1522336015-22994-1-git-send-email-anshuman.gupta@intel.com
2018-04-27x86/bpf: Clean up non-standard comments, to make the code more readableIngo Molnar
So by chance I looked into x86 assembly in arch/x86/net/bpf_jit_comp.c and noticed the weird and inconsistent comment style it mistakenly learned from the networking code: /* Multi-line comment ... * ... looks like this. */ Fix this to use the standard comment style specified in Documentation/CodingStyle and used in arch/x86/ as well: /* * Multi-line comment ... * ... looks like this. */ Also, to quote Linus's ... more explicit views about this: http://article.gmane.org/gmane.linux.kernel.cryptoapi/21066 > But no, the networking code picked *none* of the above sane formats. > Instead, it picked these two models that are just half-arsed > shit-for-brains: > > (no) > /* This is disgusting drug-induced > * crap, and should die > */ > > (no-no-no) > /* This is also very nasty > * and visually unbalanced */ > > Please. The networking code actually has the *worst* possible comment > style. You can literally find that (no-no-no) style, which is just > really horribly disgusting and worse than the otherwise fairly similar > (d) in pretty much every way. Also improve the comments and some other details while at it: - Don't mix same-line and previous-line comment style on otherwise identical code patterns within the same function, - capitalize 'BPF' and x86 register names consistently, - capitalize sentences consistently, - instead of 'x64' use 'x86-64': x64 is a Microsoft specific term, - use more consistent punctuation, - use standard coding style in macros as well, - fix typos and a few other minor details. Consistent coding style is not optional, at least in arch/x86/. No change in functionality. ( In case this commit causes conflicts with pending development code I'll be glad to help resolve any conflicts! ) Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@fb.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-27locking/qspinlock: Use smp_store_release() in queued_spin_unlock()Will Deacon
A qspinlock can be unlocked simply by writing zero to the locked byte. This can be implemented in the generic code, so do that and remove the arch-specific override for x86 in the !PV case. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boqun.feng@gmail.com Cc: linux-arm-kernel@lists.infradead.org Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1524738868-31318-11-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-27locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper boundWill Deacon
On x86, atomic_cond_read_relaxed will busy-wait with a cpu_relax() loop, so it is desirable to increase the number of times we spin on the qspinlock lockword when it is found to be transitioning from pending to locked. According to Waiman Long: | Ideally, the spinning times should be at least a few times the typical | cacheline load time from memory which I think can be down to 100ns or | so for each cacheline load with the newest systems or up to several | hundreds ns for older systems. which in his benchmarking corresponded to 512 iterations. Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boqun.feng@gmail.com Cc: linux-arm-kernel@lists.infradead.org Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1524738868-31318-5-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-27locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'Will Deacon
'struct __qspinlock' provides a handy union of fields so that subcomponents of the lockword can be accessed by name, without having to manage shifts and masks explicitly and take endianness into account. This is useful in qspinlock.h and also potentially in arch headers, so move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra definition. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-26Merge tag 'trace-v4.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Add workqueue forward declaration (for new work, but a nice clean up) - seftest fixes for the new histogram code - Print output fix for hwlat tracer - Fix missing system call events - due to change in x86 syscall naming - Fix kprobe address being used by perf being hashed * tag 'trace-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix missing tab for hwlat_detector print format selftests: ftrace: Add a testcase for multiple actions on trigger selftests: ftrace: Fix trigger extended error testcase kprobes: Fix random address output of blacklist file tracing: Fix kernel crash while using empty filter with perf tracing/x86: Update syscall trace events to handle new prefixed syscall func names tracing: Add missing forward declaration
2018-04-26x86/cpu/intel: Add missing TLB cpuid valuesjacek.tomaka@poczta.fm
Make kernel print the correct number of TLB entries on Intel Xeon Phi 7210 (and others) Before: [ 0.320005] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0 After: [ 0.320005] Last level dTLB entries: 4KB 256, 2MB 128, 4MB 128, 1GB 16 The entries do exist in the official Intel SMD but the type column there is incorrect (states "Cache" where it should read "TLB"), but the entries for the values 0x6B, 0x6C and 0x6D are correctly described as 'Data TLB'. Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20180423161425.24366-1-jacekt@dugeo.com
2018-04-26x86/dumpstack: Explain the reasoning for the prologue and buffer sizeBorislav Petkov
The whole reasoning behind the amount of opcode bytes dumped and prologue length isn't very clear so write down some of the reasons for why it is done the way it is. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-10-bp@alien8.de
2018-04-26x86/dumpstack: Save first regs set for the executive summaryBorislav Petkov
Save the regs set when __die() is onvoked for the first time and print it in oops_end(). Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-9-bp@alien8.de
2018-04-26x86/dumpstack: Add a show_ip() functionBorislav Petkov
... which shows the Instruction Pointer along with the insn bytes around it. Use it whenever rIP is printed. Drop the rIP < PAGE_OFFSET check since probe_kernel_read() can handle any address properly. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-8-bp@alien8.de
2018-04-26x86/fault: Dump user opcode bytes on fatal faultsBorislav Petkov
Sometimes it is useful to see which user opcode bytes RIP points to when a fault happens: be it to rule out RIP corruption, to dump info early during boot, when doing core dumps is impossible due to not having a writable filesystem yet. Sometimes it is useful if debugging an issue and one doesn't have access to the executable which caused the fault in order to disassemble it. That last aspect might have some security implications so show_unhandled_signals could be revisited for that or a new config option added. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-7-bp@alien8.de
2018-04-26x86/dumpstack: Add loglevel argument to show_opcodes()Borislav Petkov
Will be used in the next patch. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-6-bp@alien8.de
2018-04-26x86/dumpstack: Improve opcodes dumping in the code sectionBorislav Petkov
The code used to iterate byte-by-byte over the bytes around RIP and that is expensive: disabling pagefaults around it, copy_from_user, etc... Make it read the whole buffer of OPCODE_BUFSIZE size in one go. Use a statically allocated 64 bytes buffer so that concurrent show_opcodes() do not interleave in the output even though in the majority of the cases it's serialized via die_lock. Except the #PF path which doesn't... Also, do the PAGE_OFFSET check outside of the function because latter will be reused in other context. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-5-bp@alien8.de
2018-04-26x86/dumpstack: Carve out code-dumping into a functionBorislav Petkov
No functionality change, carve it out into a separate function for later changes. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-4-bp@alien8.de
2018-04-26x86/dumpstack: Unexport oops_begin()Borislav Petkov
The only user outside of arch/ is not a module since 86cd47334b00 ("ACPI, APEI, GHES, Prevent GHES to be built as module") No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-3-bp@alien8.de
2018-04-26x86/dumpstack: Remove code_bytesBorislav Petkov
This was added by 86c418374223 ("[PATCH] i386: add option to show more code in oops reports") long time ago but experience shows that 64 instruction bytes are plenty when deciphering an oops. So get rid of it. Removing it will simplify further enhancements to the opcodes dumping machinery coming in the following patches. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/20180417161124.5294-2-bp@alien8.de
2018-04-26x86/smpboot: Don't use mwait_play_dead() on AMD systemsYazen Ghannam
Recent AMD systems support using MWAIT for C1 state. However, MWAIT will not allow deeper cstates than C1 on current systems. play_dead() expects to use the deepest state available. The deepest state available on AMD systems is reached through SystemIO or HALT. If MWAIT is available, it is preferred over the other methods, so the CPU never reaches the deepest possible state. Don't try to use MWAIT to play_dead() on AMD systems. Instead, use CPUIDLE to enter the deepest state advertised by firmware. If CPUIDLE is not available then fallback to HALT. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Link: https://lkml.kernel.org/r/20180403140228.58540-1-Yazen.Ghannam@amd.com
2018-04-26x86/mm: Make vmemmap and vmalloc base address constants unsigned longJiri Kosina
Commits 9b46a051e4 ("x86/mm: Initialize vmemmap_base at boot-time") and a7412546d8 ("x86/mm: Adjust vmalloc base and size at boot-time") lost the type information for __VMALLOC_BASE_L4, __VMALLOC_BASE_L5, __VMEMMAP_BASE_L4 and __VMEMMAP_BASE_L5 constants. Declare them explicitly unsigned long again. Fixes: 9b46a051e4 ("x86/mm: Initialize vmemmap_base at boot-time") Fixes: a7412546d8 ("x86/mm: Adjust vmalloc base and size at boot-time") Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1804121437350.28129@cbobk.fhfr.pm
2018-04-26x86/vector: Remove the unused macro FPU_IRQDou Liyang
The macro FPU_IRQ has never been used since v3.10, So remove it. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180426060832.27312-1-douly.fnst@cn.fujitsu.com
2018-04-26x86/vector: Remove the macro VECTOR_OFFSET_STARTDou Liyang
Now, Linux uses matrix allocator for vector assignment, the original assignment code which used VECTOR_OFFSET_START has been removed. So remove the stale macro as well. Fixes: commit 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Rientjes <rientjes@google.com> Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180425020553.17210-1-douly.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-26x86/cpufeatures: Enumerate cldemote instructionFenghua Yu
cldemote is a new instruction in future x86 processors. It hints to hardware that a specified cache line should be moved ("demoted") from the cache(s) closest to the processor core to a level more distant from the processor core. This instruction is faster than snooping to make the cache line available for other cores. cldemote instruction is indicated by the presence of the CPUID feature flag CLDEMOTE (CPUID.(EAX=0x7, ECX=0):ECX[bit25]). More details on cldemote instruction can be found in the latest Intel Architecture Instruction Set Extensions and Future Features Programming Reference. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: "Ashok Raj" <ashok.raj@intel.com> Link: https://lkml.kernel.org/r/1524508162-192587-1-git-send-email-fenghua.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-04-25 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix to clear the percpu metadata_dst that could otherwise carry stale ip_tunnel_info, from William. 2) Fix that reduces the number of passes in x64 JIT with regards to dead code sanitation to avoid risk of prog rejection, from Gianluca. 3) Several fixes of sockmap programs, besides others, fixing a double page_put() in error path, missing refcount hold for pinned sockmap, adding required -target bpf for clang in sample Makefile, from John. 4) Fix to disable preemption in __BPF_PROG_RUN_ARRAY() paths, from Roman. 5) Fix tools/bpf/ Makefile with regards to a lex/yacc build error seen on older gcc-5, from John. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25perf/x86/intel: Don't enable freeze-on-smi for PerfMon V1Kan Liang
The SMM freeze feature was introduced since PerfMon V2. But the current code unconditionally enables the feature for all platforms. It can generate #GP exception, if the related FREEZE_WHILE_SMM bit is set for the machine with PerfMon V1. To disable the feature for PerfMon V1, perf needs to - Remove the freeze_on_smi sysfs entry by moving intel_pmu_attrs to intel_pmu, which is only applied to PerfMon V2 and later. - Check the PerfMon version before flipping the SMM bit when starting CPU Fixes: 6089327f5424 ("perf/x86: Add sysfs entry to freeze counters on SMI") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: ak@linux.intel.com Cc: eranian@google.com Cc: acme@redhat.com Link: https://lkml.kernel.org/r/1524682637-63219-1-git-send-email-kan.liang@linux.intel.com
2018-04-25signal: Add TRAP_UNK si_code for undiagnosted trap exceptionsEric W. Biederman
Both powerpc and alpha have cases where they wronly set si_code to 0 in combination with SIGTRAP and don't mean SI_USER. About half the time this is because the architecture can not report accurately what kind of trap exception triggered the trap exception. The other half the time it looks like no one has bothered to figure out an appropriate si_code. For the cases where the architecture does not have enough information or is too lazy to figure out exactly what kind of trap exception it is define TRAP_UNK. Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-04-25signal: Ensure every siginfo we send has all bits initializedEric W. Biederman
Call clear_siginfo to ensure every stack allocated siginfo is properly initialized before being passed to the signal sending functions. Note: It is not safe to depend on C initializers to initialize struct siginfo on the stack because C is allowed to skip holes when initializing a structure. The initialization of struct siginfo in tracehook_report_syscall_exit was moved from the helper user_single_step_siginfo into tracehook_report_syscall_exit itself, to make it clear that the local variable siginfo gets fully initialized. In a few cases the scope of struct siginfo has been reduced to make it clear that siginfo siginfo is not used on other paths in the function in which it is declared. Instances of using memset to initialize siginfo have been replaced with calls clear_siginfo for clarity. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-04-25bpf, x64: fix JIT emission for dead codeGianluca Borello
Commit 2a5418a13fcf ("bpf: improve dead code sanitizing") replaced dead code with a series of ja-1 instructions, for safety. That made JIT compilation much more complex for some BPF programs. One instance of such programs is, for example: bool flag = false ... /* A bunch of other code */ ... if (flag) do_something() In some cases llvm is not able to remove at compile time the code for do_something(), so the generated BPF program ends up with a large amount of dead instructions. In one specific real life example, there are two series of ~500 and ~1000 dead instructions in the program. When the verifier replaces them with a series of ja-1 instructions, it causes an interesting behavior at JIT time. During the first pass, since all the instructions are estimated at 64 bytes, the ja-1 instructions end up being translated as 5 bytes JMP instructions (0xE9), since the jump offsets become increasingly large (> 127) as each instruction gets discovered to be 5 bytes instead of the estimated 64. Starting from the second pass, the first N instructions of the ja-1 sequence get translated into 2 bytes JMPs (0xEB) because the jump offsets become <= 127 this time. In particular, N is defined as roughly 127 / (5 - 2) ~= 42. So, each further pass will make the subsequent N JMP instructions shrink from 5 to 2 bytes, making the image shrink every time. This means that in order to have the entire program converge, there need to be, in the real example above, at least ~1000 / 42 ~= 24 passes just for translating the dead code. If we add this number to the passes needed to translate the other non dead code, it brings such program to 40+ passes, and JIT doesn't complete. Ultimately the userspace loader fails because such BPF program was supposed to be part of a prog array owner being JITed. While it is certainly possible to try to refactor such programs to help the compiler remove dead code, the behavior is not really intuitive and it puts further burden on the BPF developer who is not expecting such behavior. To make things worse, such programs are working just fine in all the kernel releases prior to the ja-1 fix. A possible approach to mitigate this behavior consists into noticing that for ja-1 instructions we don't really need to rely on the estimated size of the previous and current instructions, we know that a -1 BPF jump offset can be safely translated into a 0xEB instruction with a jump offset of -2. Such fix brings the BPF program in the previous example to complete again in ~9 passes. Fixes: 2a5418a13fcf ("bpf: improve dead code sanitizing") Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-25tracing/x86: Update syscall trace events to handle new prefixed syscall func ↵Steven Rostedt (VMware)
names Arnaldo noticed that the latest kernel is missing the syscall event system directory in x86. I bisected it down to d5a00528b58c ("syscalls/core, syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*()"). The system call trace events are special, as there is only one trace event for all system calls (the raw_syscalls). But a macro that wraps the system calls creates meta data for them that copies the name to find the system call that maps to the system call table (the number). At boot up, it does a kallsyms lookup of the system call table to find the function that maps to the meta data of the system call. If it does not find a function, then that system call is ignored. Because the x86 system calls had "__x64_", or "__ia32_" prefixed to the "sys" for the names, they do not match the default compare algorithm. As this was a problem for power pc, the algorithm can be overwritten by the architecture. The solution is to have x86 have its own algorithm to do the compare and this brings back the system call trace events. Link: http://lkml.kernel.org/r/20180417174128.0f3457f0@gandalf.local.home Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Thomas Gleixner <tglx@linutronix.de> Fixes: d5a00528b58c ("syscalls/core, syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*()") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-04-25x86/pti: Filter at vma->vm_page_prot populationDave Hansen
commit ce9962bf7e22bb3891655c349faff618922d4a73 0day reported warnings at boot on 32-bit systems without NX support: attempted to set unsupported pgprot: 8000000000000025 bits: 8000000000000000 supported: 7fffffffffffffff WARNING: CPU: 0 PID: 1 at arch/x86/include/asm/pgtable.h:540 handle_mm_fault+0xfc1/0xfe0: check_pgprot at arch/x86/include/asm/pgtable.h:535 (inlined by) pfn_pte at arch/x86/include/asm/pgtable.h:549 (inlined by) do_anonymous_page at mm/memory.c:3169 (inlined by) handle_pte_fault at mm/memory.c:3961 (inlined by) __handle_mm_fault at mm/memory.c:4087 (inlined by) handle_mm_fault at mm/memory.c:4124 The problem is that due to the recent commit which removed auto-massaging of page protections, filtering page permissions at PTE creation time is not longer done, so vma->vm_page_prot is passed unfiltered to PTE creation. Filter the page protections before they are installed in vma->vm_page_prot. Fixes: fb43d6cb91 ("x86/mm: Do not auto-massage page protections") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222028.99D72858@viggo.jf.intel.com
2018-04-25x86/pti: Disallow global kernel text with RANDSTRUCTDave Hansen
commit 26d35ca6c3776784f8156e1d6f80cc60d9a2a915 RANDSTRUCT derives its hardening benefits from the attacker's lack of knowledge about the layout of kernel data structures. Keep the kernel image non-global in cases where RANDSTRUCT is in use to help keep the layout a secret. Fixes: 8c06c7740 (x86/pti: Leave kernel text global for !PCID) Reported-by: Kees Cook <keescook@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lkml.kernel.org/r/20180420222026.D0B4AAC9@viggo.jf.intel.com
2018-04-25x86/pti: Reduce amount of kernel text allowed to be GlobalDave Hansen
commit abb67605203687c8b7943d760638d0301787f8d9 Kees reported to me that I made too much of the kernel image global. It was far more than just text: I think this is too much set global: _end is after data, bss, and brk, and all kinds of other stuff that could hold secrets. I think this should match what mark_rodata_ro() is doing. This does exactly that. We use __end_rodata_hpage_align as our marker both because it is huge-page-aligned and it does not contain any sections we expect to hold secrets. Kees's logic was that r/o data is in the kernel image anyway and, in the case of traditional distributions, can be freely downloaded from the web, so there's no reason to hide it. Fixes: 8c06c7740 (x86/pti: Leave kernel text global for !PCID) Reported-by: Kees Cook <keescook@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222023.1C8B2B20@viggo.jf.intel.com
2018-04-25x86/pti: Fix boot warning from Global-bit settingDave Hansen
commit 231df823c4f04176f607afc4576c989895cff40e The pageattr.c code attempts to process "faults" when it goes looking for PTEs to change and finds non-present entries. It allows these faults in the linear map which is "expected to have holes", but WARN()s about them elsewhere, like when called on the kernel image. However, change_page_attr_clear() is now called on the kernel image in the process of trying to clear the Global bit. This trips the warning in __cpa_process_fault() if a non-present PTE is encountered in the kernel image. The "holes" in the kernel image result from free_init_pages()'s use of set_memory_np(). These holes are totally fine, and result from normal operation, just as they would be in the kernel linear map. Just silence the warning when holes in the kernel image are encountered. Fixes: 39114b7a7 (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image) Reported-by: Mariusz Ceier <mceier@gmail.com> Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Kees Cook <keescook@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222021.1C7D2B3F@viggo.jf.intel.com
2018-04-25x86/pti: Fix boot problems from Global-bit settingDave Hansen
commit 16dce603adc9de4237b7bf2ff5c5290f34373e7b Part of the global bit _setting_ patches also includes clearing the Global bit when it should not be enabled. That is done with set_memory_nonglobal(), which uses change_page_attr_clear() in pageattr.c under the covers. The TLB flushing code inside pageattr.c has has checks like BUG_ON(irqs_disabled()), looking for interrupt disabling that might cause deadlocks. But, these also trip in early boot on certain preempt configurations. Just copy the existing BUG_ON() sequence from cpa_flush_range() to the other two sites and check for early boot. Fixes: 39114b7a7 (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image) Reported-by: Mariusz Ceier <mceier@gmail.com> Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Kees Cook <keescook@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222019.20C4A410@viggo.jf.intel.com
2018-04-24x86/microcode: Do not exit early from __reload_late()Borislav Petkov
Vitezslav reported a case where the "Timeout during microcode update!" panic would hit. After a deeper look, it turned out that his .config had CONFIG_HOTPLUG_CPU disabled which practically made save_mc_for_early() a no-op. When that happened, the discovered microcode patch wasn't saved into the cache and the late loading path wouldn't find any. This, then, lead to early exit from __reload_late() and thus CPUs waiting until the timeout is reached, leading to the panic. In hindsight, that function should have been written so it does not return before the post-synchronization. Oh well, I know better now... Fixes: bb8c13d61a62 ("x86/microcode: Fix CPU synchronization routine") Reported-by: Vitezslav Samel <vitezslav@samel.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vitezslav Samel <vitezslav@samel.cz> Tested-by: Ashok Raj <ashok.raj@intel.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180418081140.GA2439@pc11.op.pod.cz Link: https://lkml.kernel.org/r/20180421081930.15741-2-bp@alien8.de
2018-04-24x86/microcode/intel: Save microcode patch unconditionallyBorislav Petkov
save_mc_for_early() was a no-op on !CONFIG_HOTPLUG_CPU but the generic_load_microcode() path saves the microcode patches it has found into the cache of patches which is used for late loading too. Regardless of whether CPU hotplug is used or not. Make the saving unconditional so that late loading can find the proper patch. Reported-by: Vitezslav Samel <vitezslav@samel.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vitezslav Samel <vitezslav@samel.cz> Tested-by: Ashok Raj <ashok.raj@intel.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180418081140.GA2439@pc11.op.pod.cz Link: https://lkml.kernel.org/r/20180421081930.15741-1-bp@alien8.de
2018-04-23x86/jailhouse: Fix incorrect SPDX identifierThomas Gleixner
GPL2.0 is not a valid SPDX identiier. Replace it with GPL-2.0. Fixes: 4a362601baa6 ("x86/jailhouse: Add infrastructure for running in non-root cell") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Link: https://lkml.kernel.org/r/20180422220832.815346488@linutronix.de
2018-04-22Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A small set of fixes for x86: - Prevent X2APIC ID 0xFFFFFFFF from being treated as valid, which causes the possible CPU count to be wrong. - Prevent 32bit truncation in calc_hpet_ref() which causes the TSC calibration to fail - Fix the page table setup for temporary text mappings in the resume code which causes resume failures - Make the page table dump code handle HIGHPTE correctly instead of oopsing - Support for topologies where NUMA nodes share an LLC to prevent a invalid topology warning and further malfunction on such systems. - Remove the now unused pci-nommu code - Remove stale function declarations" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/power/64: Fix page-table setup for temporary text mapping x86/mm: Prevent kernel Oops in PTDUMP code with HIGHPTE=y x86,sched: Allow topologies where NUMA nodes share an LLC x86/processor: Remove two unused function declarations x86/acpi: Prevent X2APIC id 0xffffffff from being accounted x86/tsc: Prevent 32bit truncation in calc_hpet_ref() x86: Remove pci-nommu.c
2018-04-22Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A larger set of updates for perf. Kernel: - Handle the SBOX uncore monitoring correctly on Broadwell CPUs which do not have SBOX. - Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]. The percentage of preempting and non-preempting context switches help understanding the nature of workloads (CPU or IO bound) that are running on a machine. This adds the kernel facility and userspace changes needed to show this information in 'perf script' and 'perf report -D' (Alexey Budankov) - Remove a WARN_ON() in the trace/kprobes code which is pointless because the return error code is already telling the caller what's wrong. - Revert a fugly workaround for clang BPF targets. - Fix sample_max_stack maximum check and do not proceed when an error has been detect, return them to avoid misidentifying errors (Jiri Olsa) - Add SPDX idenitifiers and get rid of GPL boilderplate. Tools: - Synchronize kernel ABI headers, v4.17-rc1 (Ingo Molnar) - Support MAP_FIXED_NOREPLACE, noticed when updating the tools/include/ copies (Arnaldo Carvalho de Melo) - Add '\n' at the end of parse-options error messages (Ravi Bangoria) - Add s390 support for detailed/verbose PMU event description (Thomas Richter) - perf annotate fixes and improvements: * Allow showing offsets in more than just jump targets, use the new 'O' hotkey in the TUI, config ~/.perfconfig annotate.offset_level for it and for --stdio2 (Arnaldo Carvalho de Melo) * Use the resolved variable names from objdump disassembled lines to make them more compact, just like was already done for some instructions, like "mov", this eventually will be done more generally, but lets now add some more to the existing mechanism (Arnaldo Carvalho de Melo) - perf record fixes: * Change warning for missing topology sysfs entry to debug, as not all architectures have those files, s390 being one of those (Thomas Richter) * Remove old error messages about things that unlikely to be the root cause in modern systems (Andi Kleen) - perf sched fixes: * Fix -g/--call-graph documentation (Takuya Yamamoto) - perf stat: * Enable 1ms interval for printing event counters values in (Alexey Budankov) - perf test fixes: * Run dwarf unwind on arm32 (Kim Phillips) * Remove unused ptrace.h include from LLVM test, sidesteping older clang's lack of support for some asm constructs (Arnaldo Carvalho de Melo) * Fixup BPF test using epoll_pwait syscall function probe, to cope with the syscall routines renames performed in this development cycle (Arnaldo Carvalho de Melo) - perf version fixes: * Do not print info about HAVE_LIBAUDIT_SUPPORT in 'perf version --build-options' when HAVE_SYSCALL_TABLE_SUPPORT is true, as libaudit won't be used in that case, print info about syscall_table support instead (Jin Yao) - Build system fixes: * Use HAVE_..._SUPPORT used consistently (Jin Yao) * Restore READ_ONCE() C++ compatibility in tools/include (Mark Rutland) * Give hints about package names needed to build jvmti (Arnaldo Carvalho de Melo)" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) perf/x86/intel/uncore: Fix SBOX support for Broadwell CPUs perf/x86/intel/uncore: Revert "Remove SBOX support for Broadwell server" coresight: Move to SPDX identifier perf test BPF: Fixup BPF test using epoll_pwait syscall function probe perf tests mmap: Show which tracepoint is failing perf tools: Add '\n' at the end of parse-options error messages perf record: Remove suggestion to enable APIC perf record: Remove misleading error suggestion perf hists browser: Clarify top/report browser help perf mem: Allow all record/report options perf trace: Support MAP_FIXED_NOREPLACE perf: Remove superfluous allocation error check perf: Fix sample_max_stack maximum check perf: Return proper values for user stack errors perf list: Add s390 support for detailed/verbose PMU event description perf script: Extend misc field decoding with switch out event type perf report: Extend raw dump (-D) out with switch out event type perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE] tools/headers: Synchronize kernel ABI headers, v4.17-rc1 trace_kprobe: Remove warning message "Could not insert probe at..." ...
2018-04-20kexec_file: do not add extra alignment to efi memmapDave Young
Chun-Yi reported a kernel warning message below: WARNING: CPU: 0 PID: 0 at ../mm/early_ioremap.c:182 early_iounmap+0x4f/0x12c() early_iounmap(ffffffffff200180, 00000118) [0] size not consistent 00000120 The problem is x86 kexec_file_load adds extra alignment to the efi memmap: in bzImage64_load(): efi_map_sz = efi_get_runtime_map_size(); efi_map_sz = ALIGN(efi_map_sz, 16); And __efi_memmap_init maps with the size including the alignment bytes but efi_memmap_unmap use nr_maps * desc_size which does not include the extra bytes. The alignment in kexec code is only needed for the kexec buffer internal use Actually kexec should pass exact size of the efi memmap to 2nd kernel. Link: http://lkml.kernel.org/r/20180417083600.GA1972@dhcp-128-65.nay.redhat.com Signed-off-by: Dave Young <dyoung@redhat.com> Reported-by: joeyli <jlee@suse.com> Tested-by: Randy Wright <rwright@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20y2038: x86: Extend sysvipc data structuresArnd Bergmann
This extends the x86 copy of the sysvipc data structures to deal with 32-bit user space that has 64-bit time_t and wants to see timestamps beyond 2038. Fortunately, x86 has padding for this purpose in all the data structures, so we can just add extra fields. With msgid64_ds and shmid64_ds, the data structure is identical to the asm-generic version, which we have already extended. For some reason however, the 64-bit version of semid64_ds ended up with extra padding, so I'm implementing the same approach as the asm-generic version here, by using separate fields for the upper and lower halves of the two timestamps. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20perf/x86/intel/uncore: Fix SBOX support for Broadwell CPUsOskar Senft
SBOX on some Broadwell CPUs is broken because it's enabled unconditionally despite the fact that there are no SBOXes available. Check the Power Control Unit CAPID4 register to determine the number of available SBOXes on the particular CPU before trying to enable them. If there are none, nullify the SBOX descriptor so it isn't tried to be initialized. Signed-off-by: Oskar Senft <osk@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mark van Dijk <mark@voidzero.net> Reviewed-by: Kan Liang <kan.liang@intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: ak@linux.intel.com Cc: peterz@infradead.org Cc: eranian@google.com Link: https://lkml.kernel.org/r/1521810690-2576-2-git-send-email-kan.liang@linux.intel.com
2018-04-20perf/x86/intel/uncore: Revert "Remove SBOX support for Broadwell server"Stephane Eranian
This reverts commit 3b94a891667c ("perf/x86/intel/uncore: Remove SBOX support for Broadwell server") Revert because there exists a proper workaround for Broadwell-EP servers without SBOX now. Note that BDX-DE does not have a SBOX. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kan Liang <kan.liang@intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: ak@linux.intel.com Cc: osk@google.com Cc: mark@voidzero.net Link: https://lkml.kernel.org/r/1521810690-2576-1-git-send-email-kan.liang@linux.intel.com
2018-04-20x86/Centaur: Initialize supported CPU features properlyDavid Wang
Centaur CPUs have some Intel compatible capabilities,including Permformance Monitoring Counters and CPU virtualization capabilities. Initialize them in the Centaur specific init code. Signed-off-by: David Wang <davidwang@zhaoxin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: lukelin@viacpu.com Cc: qiyuanwang@zhaoxin.com Cc: gregkh@linuxfoundation.org Cc: brucechang@via-alliance.com Cc: timguo@zhaoxin.com Cc: cooperyan@zhaoxin.com Cc: hpa@zytor.com Cc: benjaminpan@viatech.com Link: https://lkml.kernel.org/r/1524212968-28998-1-git-send-email-davidwang@zhaoxin.com
2018-04-20x86/power/64: Fix page-table setup for temporary text mappingJoerg Roedel
On a system with 4-level page-tables there is no p4d, so the pud in the pgd should be mapped. The old code before commit fb43d6cb91ef already did that. The change from above commit causes an invalid page-table which causes undefined behavior. In one report it caused triple faults. Fix it by changing the p4d back to pud. Fixes: fb43d6cb91ef ('x86/mm: Do not auto-massage page protections') Reported-by: Borislav Petkov <bp@suse.de> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michal Kubecek <mkubecek@suse.cz> Tested-by: Borislav Petkov <bp@suse.de> Cc: linux-pm@vger.kernel.org Cc: rjw@rjwysocki.net Cc: pavel@ucw.cz Cc: hpa@zytor.com Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/1524162360-26179-1-git-send-email-joro@8bytes.org
2018-04-19x86/xen: Remove use of VLAsLaura Abbott
There's an ongoing effort to remove VLAs[1] from the kernel to eventually turn on -Wvla. It turns out, the few VLAs in use in Xen produce only a single entry array that is always bounded by GDT_SIZE. Clean up the code to get rid of the VLA and the loop. [1] https://lkml.org/lkml/2018/3/7/621 Signed-off-by: Laura Abbott <labbott@redhat.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> [boris: Use BUG_ON(size>PAGE_SIZE) instead of GDT_SIZE] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2018-04-19Merge tag 'y2038-timekeeping' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground into timers/core Pull y2038 timekeeping syscall changes from Arnd Bergmann: This is the first set of system call entry point changes to enable 32-bit architectures to have variants on both 32-bit and 64-bit time_t. Typically these system calls take a 'struct timespec' argument, but that structure is defined in user space by the C library and its layout will change. The kernel already supports handling the 32-bit time_t on 64-bit architectures through the CONFIG_COMPAT mechanism. As there are a total of 51 system calls suffering from this problem, reusing that mechanism on 32-bit architectures. We already have patches for most of the remaining system calls, but this set contains most of the complexity and is best tested. There was one last-minute regression that prevented it from going into 4.17, but that is fixed now. More details from Deepa's patch series description: Big picture is as per the lwn article: https://lwn.net/Articles/643234/ [2] The series is directed at converting posix clock syscalls: clock_gettime, clock_settime, clock_getres and clock_nanosleep to use a new data structure __kernel_timespec at syscall boundaries. __kernel_timespec maintains 64 bit time_t across all execution modes. vdso will be handled as part of each architecture when they enable support for 64 bit time_t. The compat syscalls are repurposed to provide backward compatibility by using them as native syscalls as well for 32 bit architectures. They will continue to use timespec at syscall boundaries. CONFIG_64_BIT_TIME controls whether the syscalls use __kernel_timespec or timespec at syscall boundaries. The series does the following: 1. Enable compat syscalls on 32 bit architectures. 2. Add a new __kernel_timespec type to be used as the data structure for all the new syscalls. 3. Add new config CONFIG_64BIT_TIME(intead of the CONFIG_COMPAT_TIME in [1] and [2] to switch to new definition of __kernel_timespec. It is the same as struct timespec otherwise. 4. Add new CONFIG_32BIT_TIME to conditionally compile compat syscalls.