summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-08-30Merge tag 'x86-urgent-2020-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Three interrupt related fixes for X86: - Move disabling of the local APIC after invoking fixup_irqs() to ensure that interrupts which are incoming are noted in the IRR and not ignored. - Unbreak affinity setting. The rework of the entry code reused the regular exception entry code for device interrupts. The vector number is pushed into the errorcode slot on the stack which is then lifted into an argument and set to -1 because that's regs->orig_ax which is used in quite some places to check whether the entry came from a syscall. But it was overlooked that orig_ax is used in the affinity cleanup code to validate whether the interrupt has arrived on the new target. It turned out that this vector check is pointless because interrupts are never moved from one vector to another on the same CPU. That check is a historical leftover from the time where x86 supported multi-CPU affinities, but not longer needed with the now strict single CPU affinity. Famous last words ... - Add a missing check for an empty cpumask into the matrix allocator. The affinity change added a warning to catch the case where an interrupt is moved on the same CPU to a different vector. This triggers because a condition with an empty cpumask returns an assignment from the allocator as the allocator uses for_each_cpu() without checking the cpumask for being empty. The historical inconsistent for_each_cpu() behaviour of ignoring the cpumask and unconditionally claiming that CPU0 is in the mask struck again. Sigh. plus a new entry into the MAINTAINER file for the HPE/UV platform" * tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/matrix: Deal with the sillyness of for_each_cpu() on UP x86/irq: Unbreak interrupt affinity setting x86/hotplug: Silence APIC only after all interrupts are migrated MAINTAINERS: Add entry for HPE Superdome Flex (UV) maintainers
2020-08-30Merge tag 'locking-urgent-2020-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "A set of fixes for lockdep, tracing and RCU: - Prevent recursion by using raw_cpu_* operations - Fixup the interrupt state in the cpu idle code to be consistent - Push rcu_idle_enter/exit() invocations deeper into the idle path so that the lock operations are inside the RCU watching sections - Move trace_cpu_idle() into generic code so it's called before RCU goes idle. - Handle raw_local_irq* vs. local_irq* operations correctly - Move the tracepoints out from under the lockdep recursion handling which turned out to be fragile and inconsistent" * tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep,trace: Expose tracepoints lockdep: Only trace IRQ edges mips: Implement arch_irqs_disabled() arm64: Implement arch_irqs_disabled() nds32: Implement arch_irqs_disabled() locking/lockdep: Cleanup x86/entry: Remove unused THUNKs cpuidle: Move trace_cpu_idle() into generic code cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic sched,idle,rcu: Push rcu_idle deeper into the idle path cpuidle: Fixup IRQ state lockdep: Use raw_cpu_*() for per-cpu variables
2020-08-27x86/irq: Unbreak interrupt affinity settingThomas Gleixner
Several people reported that 5.8 broke the interrupt affinity setting mechanism. The consolidation of the entry code reused the regular exception entry code for device interrupts and changed the way how the vector number is conveyed from ptregs->orig_ax to a function argument. The low level entry uses the hardware error code slot to push the vector number onto the stack which is retrieved from there into a function argument and the slot on stack is set to -1. The reason for setting it to -1 is that the error code slot is at the position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax indicates that the entry came via a syscall. If it's not set to a negative value then a signal delivery on return to userspace would try to restart a syscall. But there are other places which rely on pt_regs::orig_ax being a valid indicator for syscall entry. But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the interrupt affinity setting mechanism, which was overlooked when this change was made. Moving interrupts on x86 happens in several steps. A new vector on a different CPU is allocated and the relevant interrupt source is reprogrammed to that. But that's racy and there might be an interrupt already in flight to the old vector. So the old vector is preserved until the first interrupt arrives on the new vector and the new target CPU. Once that happens the old vector is cleaned up, but this cleanup still depends on the vector number being stored in pt_regs::orig_ax, which is now -1. That -1 makes the check for cleanup: pt_regs::orig_ax == new_vector always false. As a consequence the interrupt is moved once, but then it cannot be moved anymore because the cleanup of the old vector never happens. There would be several ways to convey the vector information to that place in the guts of the interrupt handling, but on deeper inspection it turned out that this check is pointless and a leftover from the old affinity model of X86 which supported multi-CPU affinities. Under this model it was possible that an interrupt had an old and a new vector on the same CPU, so the vector match was required. Under the new model the effective affinity of an interrupt is always a single CPU from the requested affinity mask. If the affinity mask changes then either the interrupt stays on the CPU and on the same vector when that CPU is still in the new affinity mask or it is moved to a different CPU, but it is never moved to a different vector on the same CPU. Ergo the cleanup check for the matching vector number is not required and can be removed which makes the dependency on pt_regs:orig_ax go away. The remaining check for new_cpu == smp_processsor_id() is completely sufficient. If it matches then the interrupt was successfully migrated and the cleanup can proceed. For paranoia sake add a warning into the vector assignment code to validate that the assumption of never moving to a different vector on the same CPU holds. Fixes: 633260fa143 ("x86/irq: Convey vector as argument and not in ptregs") Reported-by: Alex bykov <alex.bykov@scylladb.com> Reported-by: Avi Kivity <avi@scylladb.com> Reported-by: Alexander Graf <graf@amazon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Alexander Graf <graf@amazon.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87wo1ltaxz.fsf@nanos.tec.linutronix.de
2020-08-27x86/hotplug: Silence APIC only after all interrupts are migratedAshok Raj
There is a race when taking a CPU offline. Current code looks like this: native_cpu_disable() { ... apic_soft_disable(); /* * Any existing set bits for pending interrupt to * this CPU are preserved and will be sent via IPI * to another CPU by fixup_irqs(). */ cpu_disable_common(); { .... /* * Race window happens here. Once local APIC has been * disabled any new interrupts from the device to * the old CPU are lost */ fixup_irqs(); // Too late to capture anything in IRR. ... } } The fix is to disable the APIC *after* cpu_disable_common(). Testing was done with a USB NIC that provided a source of frequent interrupts. A script migrated interrupts to a specific CPU and then took that CPU offline. Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead") Reported-by: Evan Green <evgreen@chromium.org> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com> Tested-by: Evan Green <evgreen@chromium.org> Reviewed-by: Evan Green <evgreen@chromium.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/ Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com
2020-08-26x86/entry: Remove unused THUNKsPeter Zijlstra
Unused remnants Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200821085348.487040689@infradead.org
2020-08-26cpuidle: Move trace_cpu_idle() into generic codePeter Zijlstra
Remove trace_cpu_idle() from the arch_cpu_idle() implementations and put it in the generic code, right before disabling RCU. Gets rid of more trace_*_rcuidle() users. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200821085348.428433395@infradead.org
2020-08-26cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED genericPeter Zijlstra
This allows moving the leave_mm() call into generic code before rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Marco Elver <elver@google.com> Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23Merge tag 'x86-urgent-2020-08-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "A single fix for x86 which removes the RDPID usage from the paranoid entry path and unconditionally uses LSL to retrieve the CPU number. RDPID depends on MSR_TSX_AUX. KVM has an optmization to avoid expensive MRS read/writes on VMENTER/EXIT. It caches the MSR values and restores them either when leaving the run loop, on preemption or when going out to user space. MSR_TSX_AUX is part of that lazy MSR set, so after writing the guest value and before the lazy restore any exception using the paranoid entry will read the guest value and use it as CPU number to retrieve the GSBASE value for the current CPU when FSGSBASE is enabled. As RDPID is only used in that particular entry path, there is no reason to burden VMENTER/EXIT with two extra MSR writes. Remove the RDPID optimization, which is not even backed by numbers from the paranoid entry path instead" * tag 'x86-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM
2020-08-23Merge tag 'perf-urgent-2020-08-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf fix from Thomas Gleixner: "A single update for perf on x86 which has support for the broken down bandwith counters" * tag 'perf-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdown
2020-08-23Merge tag 'efi-urgent-2020-08-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Thomas Gleixner: - Enforce NX on RO data in mixed EFI mode - Destroy workqueue in an error handling path to prevent UAF - Stop argument parser at '--' which is the delimiter for init - Treat a NULL command line pointer as empty instead of dereferncing it unconditionally. - Handle an unterminated command line correctly - Cleanup the 32bit code leftovers and remove obsolete documentation * tag 'efi-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation: efi: remove description of efi=old_map efi/x86: Move 32-bit code into efi_32.c efi/libstub: Handle unterminated cmdline efi/libstub: Handle NULL cmdline efi/libstub: Stop parsing arguments at "--" efi: add missed destroy_workqueue when efisubsys_init fails efi/x86: Mark kernel rodata non-executable for mixed mode
2020-08-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: - PAE and PKU bugfixes for x86 - selftests fix for new binutils - MMU notifier fix for arm64 * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode KVM: x86: fix access code passed to gva_to_gpa selftests: kvm: Use a shorter encoding to clear RAX
2020-08-21KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()Will Deacon
The 'flags' field of 'struct mmu_notifier_range' is used to indicate whether invalidate_range_{start,end}() are permitted to block. In the case of kvm_mmu_notifier_invalidate_range_start(), this field is not forwarded on to the architecture-specific implementation of kvm_unmap_hva_range() and therefore the backend cannot sensibly decide whether or not to block. Add an extra 'flags' parameter to kvm_unmap_hva_range() so that architectures are aware as to whether or not they are permitted to block. Cc: <stable@vger.kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20200811102725.7121-2-will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-21Merge tag 'for-linus-5.9-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "One build fix and a minor fix for suppressing a useless warning when booting a Xen dom0 via UEFI" * tag 'for-linus-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: Fix build error when CONFIG_ACPI is not set/enabled: efi: avoid error message when booting under Xen
2020-08-21x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVMSean Christopherson
KVM has an optmization to avoid expensive MRS read/writes on VMENTER/EXIT. It caches the MSR values and restores them either when leaving the run loop, on preemption or when going out to user space. The affected MSRs are not required for kernel context operations. This changed with the recently introduced mechanism to handle FSGSBASE in the paranoid entry code which has to retrieve the kernel GSBASE value by accessing per CPU memory. The mechanism needs to retrieve the CPU number and uses either LSL or RDPID if the processor supports it. Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and lazily restored MSRs, which means between the point where the guest value is written and the point of restore, MSR_TSC_AUX contains a random number. If an NMI or any other exception which uses the paranoid entry path happens in such a context, then RDPID returns the random guest MSR_TSC_AUX value. As a consequence this reads from the wrong memory location to retrieve the kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*() operations. If the GSBASE in the exception handler points to the per CPU memory of a different CPU then this has the obvious consequences of data corruption and crashes. As the paranoid entry path is the only place which accesses MSR_TSX_AUX (via RDPID) and the fallback via LSL is not significantly slower, remove the RDPID alternative from the entry path and always use LSL. The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT which would be inflicting massive overhead on that code path. [ tglx: Rewrote changelog ] Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro") Reported-by: Tom Lendacky <thomas.lendacky@amd.com> Debugged-by: Tom Lendacky <thomas.lendacky@amd.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200821105229.18938-1-pbonzini@redhat.com
2020-08-20efi/x86: Move 32-bit code into efi_32.cArd Biesheuvel
Now that the old memmap code has been removed, some code that was left behind in arch/x86/platform/efi/efi.c is only used for 32-bit builds, which means it can live in efi_32.c as well. So move it over. Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-08-20efi/x86: Mark kernel rodata non-executable for mixed modeArvind Sankar
When remapping the kernel rodata section RO in the EFI pagetables, the protection flags that were used for the text section are being reused, but the rodata section should not be marked executable. Cc: <stable@vger.kernel.org> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Link: https://lore.kernel.org/r/20200717194526.3452089-1-nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-08-20Fix build error when CONFIG_ACPI is not set/enabled:Randy Dunlap
../arch/x86/pci/xen.c: In function ‘pci_xen_init’: ../arch/x86/pci/xen.c:410:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration] acpi_noirq_set(); Fixes: 88e9ca161c13 ("xen/pci: Use acpi_noirq_set() helper to avoid #ifdef") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xenproject.org Cc: linux-pci@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com>
2020-08-19x86/boot/compressed: Use builtin mem functions for decompressorArvind Sankar
Since commits c041b5ad8640 ("x86, boot: Create a separate string.h file to provide standard string functions") fb4cac573ef6 ("x86, boot: Move memcmp() into string.h and string.c") the decompressor stub has been using the compiler's builtin memcpy, memset and memcmp functions, _except_ where it would likely have the largest impact, in the decompression code itself. Remove the #undef's of memcpy and memset in misc.c so that the decompressor code also uses the compiler builtins. The rationale given in the comment doesn't really apply: just because some functions use the out-of-line version is no reason to not use the builtin version in the rest. Replace the comment with an explanation of why memzero and memmove are being #define'd. Drop the suggestion to #undef in boot/string.h as well: the out-of-line versions are not really optimized versions, they're generic code that's good enough for the preboot environment. The compiler will likely generate better code for constant-size memcpy/memset/memcmp if it is allowed to. Most decompressors' performance is unchanged, with the exception of LZ4 and 64-bit ZSTD. Before After ARCH LZ4 73ms 10ms 32 LZ4 120ms 10ms 64 ZSTD 90ms 74ms 64 Measurements on QEMU on 2.2GHz Broadwell Xeon, using defconfig kernels. Decompressor code size has small differences, with the largest being that 64-bit ZSTD decreases just over 2k. The largest code size increase was on 64-bit XZ, of about 400 bytes. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Suggested-by: Nick Terrell <nickrterrell@gmail.com> Tested-by: Nick Terrell <nickrterrell@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-17kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE modeJim Mattson
See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: b9baba8614890 ("KVM, pkeys: expose CPUID/CR4 to guest") Cc: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE modeJim Mattson
See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: 0be0226f07d14 ("KVM: MMU: fix SMAP virtualization") Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17KVM: x86: fix access code passed to gva_to_gpaPaolo Bonzini
The PK bit of the error code is computed dynamically in permission_fault and therefore need not be passed to gva_to_gpa: only the access bits (fetch, user, write) need to be passed down. Not doing so causes a splat in the pku test: WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020 RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm] Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89 RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007 RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000 RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007 R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000 FS: 00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: paging64_gva_to_gpa+0x3f/0xb0 [kvm] kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm] handle_exception_nmi+0x4fc/0x5b0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm] kvm_vcpu_ioctl+0x23e/0x5d0 [kvm] ksys_ioctl+0x92/0xb0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x3e/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace d17eb998aee991da ]--- Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-15perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdownVaibhav Shankar
Linux only has support to read total DDR reads and writes. Here we add support to enable bandwidth breakdown-GT, IA and IO. Breakdown of BW is important to debug and optimize memory access. This can also be used for telemetry and improving the system software.The offsets for GT, IA and IO are added and these free running counters can be accessed via MMIO space. The BW breakdown can be measured using the following cmd: perf stat -e uncore_imc/gt_requests/,uncore_imc/ia_requests/,uncore_imc/io_requests/ 30.57 MiB uncore_imc/gt_requests/ 1346.13 MiB uncore_imc/ia_requests/ 190.97 MiB uncore_imc/io_requests/ 5.984572733 seconds time elapsed BW/s = <gt,ia,io>_requests/time elapsed Signed-off-by: Vaibhav Shankar <vaibhav.shankar@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200814022234.23605-1-vaibhav.shankar@intel.com
2020-08-15Merge tag 'x86-urgent-2020-08-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes and small updates all around the place: - Fix mitigation state sysfs output - Fix an FPU xstate/sxave code assumption bug triggered by Architectural LBR support - Fix Lightning Mountain SoC TSC frequency enumeration bug - Fix kexec debug output - Fix kexec memory range assumption bug - Fix a boundary condition in the crash kernel code - Optimize porgatory.ro generation a bit - Enable ACRN guests to use X2APIC mode - Reduce a __text_poke() IRQs-off critical section for the benefit of PREEMPT_RT" * tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/alternatives: Acquire pte lock with interrupts enabled x86/bugs/multihit: Fix mitigation reporting when VMX is not in use x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs x86/purgatory: Don't generate debug info for purgatory.ro x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoC kexec_file: Correctly output debugging information for the PT_LOAD ELF header kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges x86/crash: Correct the address boundary of function parameters x86/acrn: Remove redundant chars from ACRN signature x86/acrn: Allow ACRN guest to use X2APIC mode
2020-08-15Merge tag 'perf-urgent-2020-08-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes, an expansion of perf syscall access to CAP_PERFMON privileged tools, plus a RAPL HW-enablement for Intel SPR platforms" * tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Add support for Intel SPR platform perf/x86/rapl: Support multiple RAPL unit quirks perf/x86/rapl: Fix missing psys sysfs attributes hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration kprobes: Remove show_registers() function prototype perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
2020-08-14all arch: remove system call sys_sysctlXiaoming Ni
Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"), sys_sysctl is actually unavailable: any input can only return an error. We have been warning about people using the sysctl system call for years and believe there are no more users. Even if there are users of this interface if they have not complained or fixed their code by now they probably are not going to, so there is no point in warning them any longer. So completely remove sys_sysctl on all architectures. [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu] Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Will Deacon <will@kernel.org> [arm/arm64] Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bin Meng <bin.meng@windriver.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: chenzefeng <chenzefeng2@huawei.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christian Brauner <christian@brauner.io> Cc: Chris Zankel <chris@zankel.net> Cc: David Howells <dhowells@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Diego Elio Pettenò <flameeyes@flameeyes.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kars de Jong <jongk@linux-m68k.org> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Olof Johansson <olof@lixom.net> Cc: Paul Burton <paulburton@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Sven Schnelle <svens@stackframe.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com> Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14Merge tag 'timers-urgent-2020-08-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping updates from Thomas Gleixner: "A set of timekeeping/VDSO updates: - Preparatory work to allow S390 to switch over to the generic VDSO implementation. S390 requires that the VDSO data pointer is handed in to the counter read function when time namespace support is enabled. Adding the pointer is a NOOP for all other architectures because the compiler is supposed to optimize that out when it is unused in the architecture specific inline. The change also solved a similar problem for MIPS which fortunately has time namespaces not yet enabled. S390 needs to update clock related VDSO data independent of the timekeeping updates. This was solved so far with yet another sequence counter in the S390 implementation. A better solution is to utilize the already existing VDSO sequence count for this. The core code now exposes helper functions which allow to serialize against the timekeeper code and against concurrent readers. S390 needs extra data for their clock readout function. The initial common VDSO data structure did not provide a way to add that. It now has an embedded architecture specific struct embedded which defaults to an empty struct. Doing this now avoids tree dependencies and conflicts post rc1 and allows all other architectures which work on generic VDSO support to work from a common upstream base. - A trivial comment fix" * tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Delete repeated words in comments lib/vdso: Allow to add architecture-specific vdso data timekeeping/vsyscall: Provide vdso_update_begin/end() vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()
2020-08-14Merge tag 'timers-core-2020-08-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more timer updates from Thomas Gleixner: "A set of posix CPU timer changes which allows to defer the heavy work of posix CPU timers into task work context. The tick interrupt is reduced to a quick check which queues the work which is doing the heavy lifting before returning to user space or going back to guest mode. Moving this out is deferring the signal delivery slightly but posix CPU timers are inaccurate by nature as they depend on the tick so there is no real damage. The relevant test cases all passed. This lifts the last offender for RT out of the hard interrupt context tick handler, but it also has the general benefit that the actual heavy work is accounted to the task/process and not to the tick interrupt itself. Further optimizations are possible to break long sighand lock hold and interrupt disabled (on !RT kernels) times when a massive amount of posix CPU timers (which are unpriviledged) is armed for a task/process. This is currently only enabled for x86 because the architecture has to ensure that task work is handled in KVM before entering a guest, which was just established for x86 with the new common entry/exit code which got merged post 5.8 and is not the case for other KVM architectures" * tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Select POSIX_CPU_TIMERS_TASK_WORK posix-cpu-timers: Provide mechanisms to defer timer handling to task_work posix-cpu-timers: Split run_posix_cpu_timers()
2020-08-14Merge tag 'for-linus-5.9-rc1b-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull more xen updates from Juergen Gross: - Remove support for running as 32-bit Xen PV-guest. 32-bit PV guests are rarely used, are lacking security fixes for Meltdown, and can be easily replaced by PVH mode. Another series for doing more cleanup will follow soon (removal of 32-bit-only pvops functionality). - Fixes and additional features for the Xen display frontend driver. * tag 'for-linus-5.9-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: drm/xen-front: Pass dumb buffer data offset to the backend xen: Sync up with the canonical protocol definition in Xen drm/xen-front: Add YUYV to supported formats drm/xen-front: Fix misused IS_ERR_OR_NULL checks xen/gntdev: Fix dmabuf import with non-zero sgt offset x86/xen: drop tests for highmem in pv code x86/xen: eliminate xen-asm_64.S x86/xen: remove 32-bit Xen PV guest support
2020-08-14Merge tag 'hyperv-fixes-signed' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyper-v fixes from Wei Liu: - fix oops reporting on Hyper-V - make objtool happy * tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Make hv_setup_sched_clock inline Drivers: hv: vmbus: Only notify Hyper-V for die events that are oops
2020-08-14x86/fsgsbase/64: Fix NULL deref in 86_fsgsbase_read_taskEric Dumazet
syzbot found its way in 86_fsgsbase_read_task() and triggered this oops: KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 6866 Comm: syz-executor262 Not tainted 5.8.0-syzkaller #0 RIP: 0010:x86_fsgsbase_read_task+0x16d/0x310 arch/x86/kernel/process_64.c:393 Call Trace: putreg32+0x3ab/0x530 arch/x86/kernel/ptrace.c:876 genregs32_set arch/x86/kernel/ptrace.c:1026 [inline] genregs32_set+0xa4/0x100 arch/x86/kernel/ptrace.c:1006 copy_regset_from_user include/linux/regset.h:326 [inline] ia32_arch_ptrace arch/x86/kernel/ptrace.c:1061 [inline] compat_arch_ptrace+0x36c/0xd90 arch/x86/kernel/ptrace.c:1198 __do_compat_sys_ptrace kernel/ptrace.c:1420 [inline] __se_compat_sys_ptrace kernel/ptrace.c:1389 [inline] __ia32_compat_sys_ptrace+0x220/0x2f0 kernel/ptrace.c:1389 do_syscall_32_irqs_on arch/x86/entry/common.c:84 [inline] __do_fast_syscall_32+0x57/0x80 arch/x86/entry/common.c:126 do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:149 entry_SYSENTER_compat_after_hwframe+0x4d/0x5c This can happen if ptrace() or sigreturn() pokes an LDT selector into FS or GS for a task with no LDT and something tries to read the base before a return to usermode notices the bad selector and fixes it. The fix is to make sure ldt pointer is not NULL. Fixes: 07e1d88adaae ("x86/fsgsbase/64: Fix ptrace() to read the FS/GS base accurately") Co-developed-by: Jann Horn <jannh@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Markus T Metzger <markus.t.metzger@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14perf/x86/rapl: Add support for Intel SPR platformZhang Rui
Intel SPR platform uses fixed 16 bit energy unit for DRAM RAPL domain, and fixed 0 bit energy unit for Psys RAPL domain. After this, on SPR platform the energy counters appear in perf list. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Acked-by: Len Brown <len.brown@intel.com> Link: https://lore.kernel.org/r/20200811153149.12242-4-rui.zhang@intel.com
2020-08-14perf/x86/rapl: Support multiple RAPL unit quirksZhang Rui
There will be more platforms with different fixed energy units. Enhance the code to support different RAPL unit quirks for different platforms. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Link: https://lore.kernel.org/r/20200811153149.12242-3-rui.zhang@intel.com
2020-08-14perf/x86/rapl: Fix missing psys sysfs attributesZhang Rui
This fixes a problem introduced by commit: 5fb5273a905c ("perf/x86/rapl: Use new MSR detection interface") that perf event sysfs attributes for psys RAPL domain are missing. Fixes: 5fb5273a905c ("perf/x86/rapl: Use new MSR detection interface") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/r/20200811153149.12242-2-rui.zhang@intel.com
2020-08-13x86/alternatives: Acquire pte lock with interrupts enabledSebastian Andrzej Siewior
pte lock is never acquired in-IRQ context so it does not require interrupts to be disabled. The lock is a regular spinlock which cannot be acquired with interrupts disabled on RT. RT complains about pte_lock() in __text_poke() because it's invoked after disabling interrupts. __text_poke() has to disable interrupts as use_temporary_mm() expects interrupts to be off because it invokes switch_mm_irqs_off() and uses per-CPU (current active mm) data. Move the PTE lock handling outside the interrupt disabled region. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by; Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20200813105026.bvugytmsso6muljw@linutronix.de
2020-08-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull more KVM updates from Paolo Bonzini: "PPC: - Improvements and bugfixes for secure VM support, giving reduced startup time and memory hotplug support. - Locking fixes in nested KVM code - Increase number of guests supported by HV KVM to 4094 - Preliminary POWER10 support ARM: - Split the VHE and nVHE hypervisor code bases, build the EL2 code separately, allowing for the VHE code to now be built with instrumentation - Level-based TLB invalidation support - Restructure of the vcpu register storage to accomodate the NV code - Pointer Authentication available for guests on nVHE hosts - Simplification of the system register table parsing - MMU cleanups and fixes - A number of post-32bit cleanups and other fixes MIPS: - compilation fixes x86: - bugfixes - support for the SERIALIZE instruction" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits) KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabled MIPS: KVM: Convert a fallthrough comment to fallthrough MIPS: VZ: Only include loongson_regs.h for CPU_LOONGSON64 x86: Expose SERIALIZE for supported cpuid KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled KVM: arm64: Move S1PTW S2 fault logic out of io_mem_abort() KVM: arm64: Don't skip cache maintenance for read-only memslots KVM: arm64: Handle data and instruction external aborts the same way KVM: arm64: Rename kvm_vcpu_dabt_isextabt() KVM: arm: Add trace name for ARM_NISV KVM: arm64: Ensure that all nVHE hyp code is in .hyp.text KVM: arm64: Substitute RANDOMIZE_BASE for HARDEN_EL2_VECTORS KVM: arm64: Make nVHE ASLR conditional on RANDOMIZE_BASE KVM: PPC: Book3S HV: Rework secure mem slot dropping KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up KVM: PPC: Book3S HV: Migrate hot plugged memory KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START ...
2020-08-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction, mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util, memory-hotplug, cleanups, uaccess, migration, gup, pagemap), - various other subsystems (alpha, misc, sparse, bitmap, lib, bitops, checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump, exec, kdump, rapidio, panic, kcov, kgdb, ipc). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits) mm/gup: remove task_struct pointer for all gup code mm: clean up the last pieces of page fault accountings mm/xtensa: use general page fault accounting mm/x86: use general page fault accounting mm/sparc64: use general page fault accounting mm/sparc32: use general page fault accounting mm/sh: use general page fault accounting mm/s390: use general page fault accounting mm/riscv: use general page fault accounting mm/powerpc: use general page fault accounting mm/parisc: use general page fault accounting mm/openrisc: use general page fault accounting mm/nios2: use general page fault accounting mm/nds32: use general page fault accounting mm/mips: use general page fault accounting mm/microblaze: use general page fault accounting mm/m68k: use general page fault accounting mm/ia64: use general page fault accounting mm/hexagon: use general page fault accounting mm/csky: use general page fault accounting ...
2020-08-12mm/x86: use general page fault accountingPeter Xu
Use the general page fault accounting by passing regs into handle_mm_fault(). Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/20200707225021.200906-23-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: do page fault accounting in handle_mm_faultPeter Xu
Patch series "mm: Page fault accounting cleanups", v5. This is v5 of the pf accounting cleanup series. It originates from Gerald Schaefer's report on an issue a week ago regarding to incorrect page fault accountings for retried page fault after commit 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple times"): https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/ What this series did: - Correct page fault accounting: we do accounting for a page fault (no matter whether it's from #PF handling, or gup, or anything else) only with the one that completed the fault. For example, page fault retries should not be counted in page fault counters. Same to the perf events. - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf event is used in an adhoc way across different archs. Case (1): for many archs it's done at the entry of a page fault handler, so that it will also cover e.g. errornous faults. Case (2): for some other archs, it is only accounted when the page fault is resolved successfully. Case (3): there're still quite some archs that have not enabled this perf event. Since this series will touch merely all the archs, we unify this perf event to always follow case (1), which is the one that makes most sense. And since we moved the accounting into handle_mm_fault, the other two MAJ/MIN perf events are well taken care of naturally. - Unify definition of "major faults": the definition of "major fault" is slightly changed when used in accounting (not VM_FAULT_MAJOR). More information in patch 1. - Always account the page fault onto the one that triggered the page fault. This does not matter much for #PF handlings, but mostly for gup. More information on this in patch 25. Patchset layout: Patch 1: Introduced the accounting in handle_mm_fault(), not enabled. Patch 2-23: Enable the new accounting for arch #PF handlers one by one. Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.) Patch 25: Cleanup GUP task_struct pointer since it's not needed any more This patch (of 25): This is a preparation patch to move page fault accountings into the general code in handle_mm_fault(). This includes both the per task flt_maj/flt_min counters, and the major/minor page fault perf events. To do this, the pt_regs pointer is passed into handle_mm_fault(). PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault handlers. So far, all the pt_regs pointer that passed into handle_mm_fault() is NULL, which means this patch should have no intented functional change. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12uaccess: remove segment_eqChristoph Hellwig
segment_eq is only used to implement uaccess_kernel. Just open code uaccess_kernel in the arch uaccess headers and remove one layer of indirection. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Greentime Hu <green.hu@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Link: http://lkml.kernel.org/r/20200710135706.537715-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid()Jia He
This is to introduce a general dummy helper. memory_add_physaddr_to_nid() is a fallback option to get the nid in case NUMA_NO_NID is detected. After this patch, arm64/sh/s390 can simply use the general dummy version. PowerPC/x86/ia64 will still use their specific version. This is the preparation to set a fallback value for dev_dax->target_node. Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chuhong Yuan <hslester96@gmail.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com> Cc: Kaly Xin <Kaly.Xin@arm.com> Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12x86/mm: use max memory block size on bare metalDaniel Jordan
Some of our servers spend significant time at kernel boot initializing memory block sysfs directories and then creating symlinks between them and the corresponding nodes. The slowness happens because the machines get stuck with the smallest supported memory block size on x86 (128M), which results in 16,288 directories to cover the 2T of installed RAM. The search for each memory block is noticeable even with commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup"). Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on the end of boot memory") chooses the block size based on alignment with memory end. That addresses hotplug failures in qemu guests, but for bare metal systems whose memory end isn't aligned to even the smallest size, it leaves them at 128M. Make kernels that aren't running on a hypervisor use the largest supported size (2G) to minimize overhead on big machines. Kernel boot goes 7% faster on the aforementioned servers, shaving off half a second. [daniel.m.jordan@oracle.com: v3] Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.com Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: - IRQ bypass support for vdpa and IFC - MLX5 vdpa driver - Endianness fixes for virtio drivers - Misc other fixes * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits) vdpa/mlx5: fix up endian-ness for mtu vdpa: Fix pointer math bug in vdpasim_get_config() vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config() vdpa/mlx5: fix memory allocation failure checks vdpa/mlx5: Fix uninitialised variable in core/mr.c vdpa_sim: init iommu lock virtio_config: fix up warnings on parisc vdpa/mlx5: Add VDPA driver for supported mlx5 devices vdpa/mlx5: Add shared memory registration code vdpa/mlx5: Add support library for mlx5 VDPA implementation vdpa/mlx5: Add hardware descriptive header file vdpa: Modify get_vq_state() to return error code net/vdpa: Use struct for set/get vq state vdpa: remove hard coded virtq num vdpasim: support batch updating vhost-vdpa: support IOTLB batching hints vhost-vdpa: support get/set backend features vhost: generialize backend features setting/getting vhost-vdpa: refine ioctl pre-processing vDPA: dont change vq irq after DRIVER_OK ...
2020-08-11Merge tag 'iommu-updates-v5.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: - Remove of the dev->archdata.iommu (or similar) pointers from most architectures. Only Sparc is left, but this is private to Sparc as their drivers don't use the IOMMU-API. - ARM-SMMU updates from Will Deacon: - Support for SMMU-500 implementation in Marvell Armada-AP806 SoC - Support for SMMU-500 implementation in NVIDIA Tegra194 SoC - DT compatible string updates - Remove unused IOMMU_SYS_CACHE_ONLY flag - Move ARM-SMMU drivers into their own subdirectory - Intel VT-d updates from Lu Baolu: - Misc tweaks and fixes for vSVA - Report/response page request events - Cleanups - Move the Kconfig and Makefile bits for the AMD and Intel drivers into their respective subdirectory. - MT6779 IOMMU Support - Support for new chipsets in the Renesas IOMMU driver - Other misc cleanups and fixes (e.g. to improve compile test coverage) * tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (77 commits) iommu/amd: Move Kconfig and Makefile bits down into amd directory iommu/vt-d: Move Kconfig and Makefile bits down into intel directory iommu/arm-smmu: Move Arm SMMU drivers into their own subdirectory iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu iommu: Add gfp parameter to io_pgtable_ops->map() iommu: Mark __iommu_map_sg() as static iommu/vt-d: Rename intel-pasid.h to pasid.h iommu/vt-d: Add page response ops support iommu/vt-d: Report page request faults for guest SVA iommu/vt-d: Add a helper to get svm and sdev for pasid iommu/vt-d: Refactor device_to_iommu() helper iommu/vt-d: Disable multiple GPASID-dev bind iommu/vt-d: Warn on out-of-range invalidation address iommu/vt-d: Fix devTLB flush for vSVA iommu/vt-d: Handle non-page aligned address iommu/vt-d: Fix PASID devTLB invalidation iommu/vt-d: Remove global page support in devTLB flush iommu/vt-d: Enforce PASID devTLB field mask iommu: Make some functions static iommu/amd: Remove double zero check ...
2020-08-11x86/hyperv: Make hv_setup_sched_clock inlineMichael Kelley
Make hv_setup_sched_clock inline so the reference to pv_ops works correctly with objtool updates to detect noinstr violations. See https://lore.kernel.org/patchwork/patch/1283635/ Signed-off-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/1597022991-24088-1-git-send-email-mikelley@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2020-08-11x86/xen: drop tests for highmem in pv codeJuergen Gross
With support for 32-bit pv guests gone pure pv-code no longer needs to test for highmem. Dropping those tests removes the need for flushing in some places. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2020-08-11x86/xen: eliminate xen-asm_64.SJuergen Gross
With 32-bit pv-guest support removed xen-asm_64.S can be merged with xen-asm.S Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2020-08-11x86/xen: remove 32-bit Xen PV guest supportJuergen Gross
Xen is requiring 64-bit machines today and since Xen 4.14 it can be built without 32-bit PV guest support. There is no need to carry the burden of 32-bit PV guest support in the kernel any longer, as new guests can be either HVM or PVH, or they can use a 64 bit kernel. Remove the 32-bit Xen PV support from the kernel. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2020-08-10Merge tag 'locking-urgent-2020-08-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Thomas Gleixner: "A set of locking fixes and updates: - Untangle the header spaghetti which causes build failures in various situations caused by the lockdep additions to seqcount to validate that the write side critical sections are non-preemptible. - The seqcount associated lock debug addons which were blocked by the above fallout. seqcount writers contrary to seqlock writers must be externally serialized, which usually happens via locking - except for strict per CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot validate that the lock is held. This new debug mechanism adds the concept of associated locks. sequence count has now lock type variants and corresponding initializers which take a pointer to the associated lock used for writer serialization. If lockdep is enabled the pointer is stored and write_seqcount_begin() has a lockdep assertion to validate that the lock is held. Aside of the type and the initializer no other code changes are required at the seqcount usage sites. The rest of the seqcount API is unchanged and determines the type at compile time with the help of _Generic which is possible now that the minimal GCC version has been moved up. Adding this lockdep coverage unearthed a handful of seqcount bugs which have been addressed already independent of this. While generally useful this comes with a Trojan Horse twist: On RT kernels the write side critical section can become preemtible if the writers are serialized by an associated lock, which leads to the well known reader preempts writer livelock. RT prevents this by storing the associated lock pointer independent of lockdep in the seqcount and changing the reader side to block on the lock when a reader detects that a writer is in the write side critical section. - Conversion of seqcount usage sites to associated types and initializers" * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) locking/seqlock, headers: Untangle the spaghetti monster locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header x86/headers: Remove APIC headers from <asm/smp.h> seqcount: More consistent seqprop names seqcount: Compress SEQCNT_LOCKNAME_ZERO() seqlock: Fold seqcount_LOCKNAME_init() definition seqlock: Fold seqcount_LOCKNAME_t definition seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g hrtimer: Use sequence counter with associated raw spinlock kvm/eventfd: Use sequence counter with associated spinlock userfaultfd: Use sequence counter with associated spinlock NFSv4: Use sequence counter with associated spinlock iocost: Use sequence counter with associated spinlock raid5: Use sequence counter with associated spinlock vfs: Use sequence counter with associated spinlock timekeeping: Use sequence counter with associated raw spinlock xfrm: policy: Use sequence counters with associated lock netfilter: nft_set_rbtree: Use sequence counter with associated rwlock netfilter: conntrack: Use sequence counter with associated spinlock sched: tasks: Use sequence counter with associated spinlock ...
2020-08-10x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabledJon Doron
Based on an analysis of the HyperV firmwares (Gen1 and Gen2) it seems like the SCONTROL is not being set to the ENABLED state as like we have thought. Also from a test done by Vitaly Kuznetsov, running a nested HyperV it was concluded that the first access to the SCONTROL MSR with a read resulted with the value of 0x1, aka HV_SYNIC_CONTROL_ENABLE. It's important to note that this diverges from the value states in the HyperV TLFS of 0. Signed-off-by: Jon Doron <arilou@gmail.com> Message-Id: <20200717125238.1103096-2-arilou@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>