Age | Commit message (Collapse) | Author |
|
Several people reported that 5.8 broke the interrupt affinity setting
mechanism.
The consolidation of the entry code reused the regular exception entry code
for device interrupts and changed the way how the vector number is conveyed
from ptregs->orig_ax to a function argument.
The low level entry uses the hardware error code slot to push the vector
number onto the stack which is retrieved from there into a function
argument and the slot on stack is set to -1.
The reason for setting it to -1 is that the error code slot is at the
position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax
indicates that the entry came via a syscall. If it's not set to a negative
value then a signal delivery on return to userspace would try to restart a
syscall. But there are other places which rely on pt_regs::orig_ax being a
valid indicator for syscall entry.
But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the
interrupt affinity setting mechanism, which was overlooked when this change
was made.
Moving interrupts on x86 happens in several steps. A new vector on a
different CPU is allocated and the relevant interrupt source is
reprogrammed to that. But that's racy and there might be an interrupt
already in flight to the old vector. So the old vector is preserved until
the first interrupt arrives on the new vector and the new target CPU. Once
that happens the old vector is cleaned up, but this cleanup still depends
on the vector number being stored in pt_regs::orig_ax, which is now -1.
That -1 makes the check for cleanup: pt_regs::orig_ax == new_vector
always false. As a consequence the interrupt is moved once, but then it
cannot be moved anymore because the cleanup of the old vector never
happens.
There would be several ways to convey the vector information to that place
in the guts of the interrupt handling, but on deeper inspection it turned
out that this check is pointless and a leftover from the old affinity model
of X86 which supported multi-CPU affinities. Under this model it was
possible that an interrupt had an old and a new vector on the same CPU, so
the vector match was required.
Under the new model the effective affinity of an interrupt is always a
single CPU from the requested affinity mask. If the affinity mask changes
then either the interrupt stays on the CPU and on the same vector when that
CPU is still in the new affinity mask or it is moved to a different CPU, but
it is never moved to a different vector on the same CPU.
Ergo the cleanup check for the matching vector number is not required and
can be removed which makes the dependency on pt_regs:orig_ax go away.
The remaining check for new_cpu == smp_processsor_id() is completely
sufficient. If it matches then the interrupt was successfully migrated and
the cleanup can proceed.
For paranoia sake add a warning into the vector assignment code to
validate that the assumption of never moving to a different vector on
the same CPU holds.
Fixes: 633260fa143 ("x86/irq: Convey vector as argument and not in ptregs")
Reported-by: Alex bykov <alex.bykov@scylladb.com>
Reported-by: Avi Kivity <avi@scylladb.com>
Reported-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexander Graf <graf@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87wo1ltaxz.fsf@nanos.tec.linutronix.de
|
|
There is a race when taking a CPU offline. Current code looks like this:
native_cpu_disable()
{
...
apic_soft_disable();
/*
* Any existing set bits for pending interrupt to
* this CPU are preserved and will be sent via IPI
* to another CPU by fixup_irqs().
*/
cpu_disable_common();
{
....
/*
* Race window happens here. Once local APIC has been
* disabled any new interrupts from the device to
* the old CPU are lost
*/
fixup_irqs(); // Too late to capture anything in IRR.
...
}
}
The fix is to disable the APIC *after* cpu_disable_common().
Testing was done with a USB NIC that provided a source of frequent
interrupts. A script migrated interrupts to a specific CPU and
then took that CPU offline.
Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead")
Reported-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Tested-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com
|
|
The kernel currently crashes if 4-level paging is used. Add missing
p4d_populate for just allocated pud entry.
Fixes: 3e0d3e408e63 ("s390/vmem: consolidate vmem_add_range() and vmem_remove_range()")
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Since commit a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context}
to per-cpu variables") the lockdep code itself uses percpu variables. This
leads to recursions because the percpu macros are calling preempt_enable()
which might call trace_preempt_on().
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
Problem:
raw_local_irq_save(); // software state on
local_irq_save(); // software state off
...
local_irq_restore(); // software state still off, because we don't enable IRQs
raw_local_irq_restore(); // software state still off, *whoopsie*
existing instances:
- lock_acquire()
raw_local_irq_save()
__lock_acquire()
arch_spin_lock(&graph_lock)
pv_wait() := kvm_wait() (same or worse for Xen/HyperV)
local_irq_save()
- trace_clock_global()
raw_local_irq_save()
arch_spin_lock()
pv_wait() := kvm_wait()
local_irq_save()
- apic_retrigger_irq()
raw_local_irq_save()
apic->send_IPI() := default_send_IPI_single_phys()
local_irq_save()
Possible solutions:
A) make it work by enabling the tracing inside raw_*()
B) make it work by keeping tracing disabled inside raw_*()
C) call it broken and clean it up now
Now, given that the only reason to use the raw_* variant is because you don't
want tracing. Therefore A) seems like a weird option (although it can be done).
C) is tempting, but OTOH it ends up converting a _lot_ of code to raw just
because there is one raw user, this strips the validation/tracing off for all
the other users.
So we pick B) and declare any code that ends up doing:
raw_local_irq_save()
local_irq_save()
lockdep_assert_irqs_disabled();
broken. AFAICT this problem has existed forever, the only reason it came
up is because commit: 859d069ee1dd ("lockdep: Prepare for NMI IRQ
state tracking") changed IRQ tracing vs lockdep recursion and the
first instance is fairly common, the other cases hardly ever happen.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[rewrote changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200723105615.1268126-1-npiggin@gmail.com
|
|
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Paul Burton <paulburton@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200826101653.GE1362448@hirez.programming.kicks-ass.net
|
|
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200821085348.664425120@infradead.org
|
|
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.604899379@infradead.org
|
|
Unused remnants
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.487040689@infradead.org
|
|
Remove trace_cpu_idle() from the arch_cpu_idle() implementations and
put it in the generic code, right before disabling RCU. Gets rid of
more trace_*_rcuidle() users.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.428433395@infradead.org
|
|
This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
|
|
Since migration of guests using SAO to ISA 3.1 hosts may cause issues,
disable PROT_SAO in LPARs by default and introduce a new Kconfig option
PPC_PROT_SAO_LPAR to allow users to enable it if desired.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821185558.35561-3-shawn@anastas.io
|
|
This reverts commit 5c9fa16e8abd342ce04dc830c1ebb2a03abf6c05.
Since PROT_SAO can still be useful for certain classes of software,
reintroduce it. Concerns about guest migration for LPARs using SAO
will be addressed next.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821185558.35561-2-shawn@anastas.io
|
|
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.
[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Add perf support for emitting extended registers for power10.
- A fix for CPU hotplug on pseries, where on large/loaded systems we
may not wait long enough for the CPU to be offlined, leading to
crashes.
- Addition of a raw cputable entry for Power10, which is not required
to boot, but is required to make our PMU setup work correctly in
guests.
- Three fixes for the recent changes on 32-bit Book3S to move modules
into their own segment for strict RWX.
- A fix for a recent change in our powernv PCI code that could lead to
crashes.
- A change to our perf interrupt accounting to avoid soft lockups when
using some events, found by syzkaller.
- A change in the way we handle power loss events from the hypervisor
on pseries. We no longer immediately shut down if we're told we're
running on a UPS.
- A few other minor fixes.
Thanks to Alexey Kardashevskiy, Andreas Schwab, Aneesh Kumar K.V, Anju T
Sudhakar, Athira Rajeev, Christophe Leroy, Frederic Barrat, Greg Kurz,
Kajol Jain, Madhavan Srinivasan, Michael Neuling, Michael Roth,
Nageswara R Sastry, Oliver O'Halloran, Thiago Jung Bauermann,
Vaidyanathan Srinivasan, Vasant Hegde.
* tag 'powerpc-5.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/perf/hv-24x7: Move cpumask file to top folder of hv-24x7 driver
powerpc/32s: Fix module loading failure when VMALLOC_END is over 0xf0000000
powerpc/pseries: Do not initiate shutdown when system is running on UPS
powerpc/perf: Fix soft lockups due to missed interrupt accounting
powerpc/powernv/pci: Fix possible crash when releasing DMA resources
powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death
powerpc/32s: Fix is_module_segment() when MODULES_VADDR is defined
powerpc/kasan: Fix KASAN_SHADOW_START on BOOK3S_32
powerpc/fixmap: Fix the size of the early debug area
powerpc/pkeys: Fix build error with PPC_MEM_KEYS disabled
powerpc/kernel: Cleanup machine check function declarations
powerpc: Add POWER10 raw mode cputable entry
powerpc/perf: Add extended regs support for power10 platform
powerpc/perf: Add support for outputting extended regs in perf intr_regs
powerpc: Fix P10 PVR revision in /proc/cpuinfo for SMT4 cores
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for x86 which removes the RDPID usage from the paranoid
entry path and unconditionally uses LSL to retrieve the CPU number.
RDPID depends on MSR_TSX_AUX. KVM has an optmization to avoid
expensive MRS read/writes on VMENTER/EXIT. It caches the MSR values
and restores them either when leaving the run loop, on preemption or
when going out to user space. MSR_TSX_AUX is part of that lazy MSR
set, so after writing the guest value and before the lazy restore any
exception using the paranoid entry will read the guest value and use
it as CPU number to retrieve the GSBASE value for the current CPU when
FSGSBASE is enabled. As RDPID is only used in that particular entry
path, there is no reason to burden VMENTER/EXIT with two extra MSR
writes. Remove the RDPID optimization, which is not even backed by
numbers from the paranoid entry path instead"
* tag 'x86-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fix from Thomas Gleixner:
"A single update for perf on x86 which has support for the broken down
bandwith counters"
* tag 'perf-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdown
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Thomas Gleixner:
- Enforce NX on RO data in mixed EFI mode
- Destroy workqueue in an error handling path to prevent UAF
- Stop argument parser at '--' which is the delimiter for init
- Treat a NULL command line pointer as empty instead of dereferncing it
unconditionally.
- Handle an unterminated command line correctly
- Cleanup the 32bit code leftovers and remove obsolete documentation
* tag 'efi-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: efi: remove description of efi=old_map
efi/x86: Move 32-bit code into efi_32.c
efi/libstub: Handle unterminated cmdline
efi/libstub: Handle NULL cmdline
efi/libstub: Stop parsing arguments at "--"
efi: add missed destroy_workqueue when efisubsys_init fails
efi/x86: Mark kernel rodata non-executable for mixed mode
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Allow booting of late secondary CPUs affected by erratum 1418040
(currently they are parked if none of the early CPUs are affected by
this erratum).
- Add the 32-bit vdso Makefile to the vdso_install rule so that 'make
vdso_install' installs the 32-bit compat vdso when it is compiled.
- Print a warning that untrusted guests without a CPU erratum
workaround (Cortex-A57 832075) may deadlock the affected system.
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
ARM64: vdso32: Install vdso32 from vdso_install
KVM: arm64: Print warning when cpu erratum can cause guests to deadlock
arm64: Allow booting of late CPUs affected by erratum 1418040
arm64: Move handling of erratum 1418040 into C code
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- a couple of fixes for storage key handling relevant for debugging
- add cond_resched into potentially slow subchannels scanning loop
- fixes for PF/VF linking and to ignore stale PCI configuration request
events
* tag 's390-5.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: fix PF/VF linking on hot plug
s390/pci: re-introduce zpci_remove_device()
s390/pci: fix zpci_bus_link_virtfn()
s390/ptrace: fix storage key handling
s390/runtime_instrumentation: fix storage key handling
s390/pci: ignore stale configuration request event
s390/cio: add cond_resched() in the slow_eval_known_fn() loop
|
|
Pull kvm fixes from Paolo Bonzini:
- PAE and PKU bugfixes for x86
- selftests fix for new binutils
- MMU notifier fix for arm64
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set
KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()
kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode
kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode
KVM: x86: fix access code passed to gva_to_gpa
selftests: kvm: Use a shorter encoding to clear RAX
|
|
We want neither
"
include/linux/compiler_attributes.h:201:41: warning: statement will never
be executed [-Wswitch-unreachable]
201 | # define fallthrough __attribute__((__fallthrough__))
| ^~~~~~~~~~~~~
"
nor
"
include/linux/compiler_attributes.h:201:41: warning: attribute
'fallthrough' not preceding a case label or default label
201 | # define fallthrough __attribute__((__fallthrough__))
| ^~~~~~~~~~~~~
"
It's not worth adding one more macro. Let's simply place the fallthrough
in between the expansions.
Fixes: c9b029903466 ("MIPS: Use fallthrough for arch/mips")
Cc: stable@vger.kernel.org
Signed-off-by: He Zhe <zhe.he@windriver.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
The couple of #includes are unused by now; remove to prevent namespace
pollution.
This fixes e.g. build of dm_thin, which has a VIRTUAL symbol that
conflicted with the newly-introduced one in mach-loongson64/boot_param.h.
Fixes: 39c1485c8baa ("MIPS: KVM: Add kvm guest support for Loongson-3")
Signed-off-by: WANG Xuerui <git@xen0n.name>
Reviewed-by: Huacai Chen <chenhc@lemote.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
When an MMU notifier call results in unmapping a range that spans multiple
PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary,
since this avoids running into RCU stalls during VM teardown. Unfortunately,
if the VM is destroyed as a result of OOM, then blocking is not permitted
and the call to the scheduler triggers the following BUG():
| BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394
| in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper
| INFO: lockdep is turned off.
| CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1
| Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
| Call trace:
| dump_backtrace+0x0/0x284
| show_stack+0x1c/0x28
| dump_stack+0xf0/0x1a4
| ___might_sleep+0x2bc/0x2cc
| unmap_stage2_range+0x160/0x1ac
| kvm_unmap_hva_range+0x1a0/0x1c8
| kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8
| __mmu_notifier_invalidate_range_start+0x218/0x31c
| mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0
| __oom_reap_task_mm+0x128/0x268
| oom_reap_task+0xac/0x298
| oom_reaper+0x178/0x17c
| kthread+0x1e4/0x1fc
| ret_from_fork+0x10/0x30
Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we
only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier
flags.
Cc: <stable@vger.kernel.org>
Fixes: 8b3405e345b5 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd")
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20200811102725.7121-3-will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The 'flags' field of 'struct mmu_notifier_range' is used to indicate
whether invalidate_range_{start,end}() are permitted to block. In the
case of kvm_mmu_notifier_invalidate_range_start(), this field is not
forwarded on to the architecture-specific implementation of
kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
whether or not to block.
Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
architectures are aware as to whether or not they are permitted to block.
Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20200811102725.7121-2-will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- The CLINT driver has been split in two: one to handle the M-mode
CLINT (memory mapped and used on NOMMU systems) and one to handle the
S-mode CLINT (via SBI).
- The addition of SiFive's drivers to rv32_defconfig
* tag 'riscv-for-linus-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Add SiFive drivers to rv32_defconfig
dt-bindings: timer: Add CLINT bindings
RISC-V: Remove CLINT related code from timer and arch
clocksource/drivers: Add CLINT timer driver
RISC-V: Add mechanism to provide custom IPI operations
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"One build fix and a minor fix for suppressing a useless warning when
booting a Xen dom0 via UEFI"
* tag 'for-linus-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
Fix build error when CONFIG_ACPI is not set/enabled:
efi: avoid error message when booting under Xen
|
|
Add the 32-bit vdso Makefile to the vdso_install rule so that 'make
vdso_install' installs the 32-bit compat vdso when it is compiled.
Fixes: a7f71a2c8903 ("arm64: compat: Add vDSO")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20200818014950.42492-1-swboyd@chromium.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
KVM has an optmization to avoid expensive MRS read/writes on
VMENTER/EXIT. It caches the MSR values and restores them either when
leaving the run loop, on preemption or when going out to user space.
The affected MSRs are not required for kernel context operations. This
changed with the recently introduced mechanism to handle FSGSBASE in the
paranoid entry code which has to retrieve the kernel GSBASE value by
accessing per CPU memory. The mechanism needs to retrieve the CPU number
and uses either LSL or RDPID if the processor supports it.
Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and
lazily restored MSRs, which means between the point where the guest value
is written and the point of restore, MSR_TSC_AUX contains a random number.
If an NMI or any other exception which uses the paranoid entry path happens
in such a context, then RDPID returns the random guest MSR_TSC_AUX value.
As a consequence this reads from the wrong memory location to retrieve the
kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*()
operations. If the GSBASE in the exception handler points to the per CPU
memory of a different CPU then this has the obvious consequences of data
corruption and crashes.
As the paranoid entry path is the only place which accesses MSR_TSX_AUX
(via RDPID) and the fallback via LSL is not significantly slower, remove
the RDPID alternative from the entry path and always use LSL.
The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT
which would be inflicting massive overhead on that code path.
[ tglx: Rewrote changelog ]
Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Debugged-by: Tom Lendacky <thomas.lendacky@amd.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200821105229.18938-1-pbonzini@redhat.com
|
|
Commit 792f73f747b8 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7
device to show cpumask") added cpumask file as part of hv-24x7 driver
inside the interface folder. The cpumask file is supposed to be in the
top folder of the PMU driver in order to make hotplug work.
This patch fixes that issue and creates new group 'cpumask_attr_group'
to add cpumask file and make sure it added in top folder.
command:# cat /sys/devices/hv_24x7/cpumask
0
Fixes: 792f73f747b8 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask")
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821080610.123997-1-kjain@linux.ibm.com
|
|
In is_module_segment(), when VMALLOC_END is over 0xf0000000,
ALIGN(VMALLOC_END, SZ_256M) has value 0.
In that case, addr >= ALIGN(VMALLOC_END, SZ_256M) is always
true then is_module_segment() always returns false.
Use (ALIGN(VMALLOC_END, SZ_256M) - 1) which will have
value 0xffffffff and will be suitable for the comparison.
Fixes: c49643319715 ("powerpc/32s: Only leave NX unset on segments used for modules")
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/09fc73fe9c7423c6b4cf93f93df9bb0ed8eefab5.1597994047.git.christophe.leroy@csgroup.eu
|
|
If guests don't have certain CPU erratum workarounds implemented, then
there is a possibility a guest can deadlock the system. IOW, only trusted
guests should be used on systems with the erratum.
This is the case for Cortex-A57 erratum 832075.
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: kvmarm@lists.cs.columbia.edu
Link: https://lore.kernel.org/r/20200803193127.3012242-2-robh@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
As we can now switch from a system that isn't affected by 1418040
to a system that globally is affected, let's allow affected CPUs
to come in at a later time.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200731173824.107480-3-maz@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Instead of dealing with erratum 1418040 on each entry and exit,
let's move the handling to __switch_to() instead, which has
several advantages:
- It can be applied when it matters (switching between 32 and 64
bit tasks).
- It is written in C (yay!)
- It can rely on static keys rather than alternatives
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200731173824.107480-2-maz@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The initialization done by bmips_cpu_setup() typically affects both
threads of a given core, on 7435 which supports 2 cores and 2 threads,
logical CPU number 2 and 3 would not run this initialization.
Fixes: 738a3f79027b ("MIPS: BMIPS: Add early CPU initialization code")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
When the BMIPS generic cpu-feature-overrides.h file was introduced,
cpu_has_inclusive_caches/MIPS_CPU_INCLUSIVE_CACHES was not set for
BMIPS5000 CPUs. Correct this when we have initialized the MIPS secondary
cache successfully.
Fixes: f337967d6d87 ("MIPS: BMIPS: Add cpu-feature-overrides.h")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
This adds SiFive drivers to rv32_defconfig, to keep in sync with the
64-bit config. This is useful when testing 32-bit kernel with QEMU
'sifive_u' 32-bit machine.
Signed-off-by: Bin Meng <bin.meng@windriver.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
Right now the RISC-V timer driver is convoluted to support:
1. Linux RISC-V S-mode (with MMU) where it will use TIME CSR for
clocksource and SBI timer calls for clockevent device.
2. Linux RISC-V M-mode (without MMU) where it will use CLINT MMIO
counter register for clocksource and CLINT MMIO compare register
for clockevent device.
We now have a separate CLINT timer driver which also provide CLINT
based IPI operations so let's remove CLINT MMIO related code from
arch/riscv directory and RISC-V timer driver.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Tested-by: Emil Renner Berhing <kernel@esmil.dk>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Atish Patra <atish.patra@wdc.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
We add mechanism to set custom IPI operations so that CLINT driver
from drivers directory can provide custom IPI operations.
Signed-off-by: Anup Patel <anup.patel@wdc.com>
Tested-by: Emil Renner Berhing <kernel@esmil.dk>
Reviewed-by: Atish Patra <atish.patra@wdc.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
|
|
As per PAPR we have to look for both EPOW sensor value and event
modifier to identify the type of event and take appropriate action.
In LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes":
SYSTEM_SHUTDOWN 3
The system must be shut down. An EPOW-aware OS logs the EPOW error
log information, then schedules the system to be shut down to begin
after an OS defined delay internal (default is 10 minutes.)
Then in section 10.3.2.2.8 there is table 146 "Platform Event Log
Format, Version 6, EPOW Section", which includes the "EPOW Event
Modifier":
For EPOW sensor value = 3
0x01 = Normal system shutdown with no additional delay
0x02 = Loss of utility power, system is running on UPS/Battery
0x03 = Loss of system critical functions, system should be shutdown
0x04 = Ambient temperature too high
All other values = reserved
We have a user space tool (rtas_errd) on LPAR to monitor for
EPOW_SHUTDOWN_ON_UPS. Once it gets an event it initiates shutdown
after predefined time. It also starts monitoring for any new EPOW
events. If it receives "Power restored" event before predefined time
it will cancel the shutdown. Otherwise after predefined time it will
shutdown the system.
Commit 79872e35469b ("powerpc/pseries: All events of
EPOW_SYSTEM_SHUTDOWN must initiate shutdown") changed our handling of
the "on UPS/Battery" case, to immediately shutdown the system. This
breaks existing setups that rely on the userspace tool to delay
shutdown and let the system run on the UPS.
Fixes: 79872e35469b ("powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must initiate shutdown")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[mpe: Massage change log and add PAPR references]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200820061844.306460-1-hegdevasant@linux.vnet.ibm.com
|
|
Performance monitor interrupt handler checks if any counter has
overflown and calls record_and_restart() in core-book3s which invokes
perf_event_overflow() to record the sample information. Apart from
creating sample, perf_event_overflow() also does the interrupt and
period checks via perf_event_account_interrupt().
Currently we record information only if the SIAR (Sampled Instruction
Address Register) valid bit is set (using siar_valid() check) and
hence the interrupt check.
But it is possible that we do sampling for some events that are not
generating valid SIAR, and hence there is no chance to disable the
event if interrupts are more than max_samples_per_tick. This leads to
soft lockup.
Fix this by adding perf_event_account_interrupt() in the invalid SIAR
code path for a sampling event. ie if SIAR is invalid, just do
interrupt check and don't record the sample information.
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1596717992-7321-1-git-send-email-atrajeev@linux.vnet.ibm.com
|
|
Now that the old memmap code has been removed, some code that was left
behind in arch/x86/platform/efi/efi.c is only used for 32-bit builds,
which means it can live in efi_32.c as well. So move it over.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
When remapping the kernel rodata section RO in the EFI pagetables, the
protection flags that were used for the text section are being reused,
but the rodata section should not be marked executable.
Cc: <stable@vger.kernel.org>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20200717194526.3452089-1-nivedita@alum.mit.edu
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
../arch/x86/pci/xen.c: In function ‘pci_xen_init’:
../arch/x86/pci/xen.c:410:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration]
acpi_noirq_set();
Fixes: 88e9ca161c13 ("xen/pci: Use acpi_noirq_set() helper to avoid #ifdef")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Fix a typo introduced during recent code cleanup, which could lead to
silently not freeing resources or an oops message (on PCI hotplug or
CAPI reset).
Only impacts ioda2, the code path for ioda1 is correct.
Fixes: 01e12629af4e ("powerpc/powernv/pci: Add explicit tracking of the DMA setup state")
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200819130741.16769-1-fbarrat@linux.ibm.com
|
|
Since commits
c041b5ad8640 ("x86, boot: Create a separate string.h file to provide standard string functions")
fb4cac573ef6 ("x86, boot: Move memcmp() into string.h and string.c")
the decompressor stub has been using the compiler's builtin memcpy,
memset and memcmp functions, _except_ where it would likely have the
largest impact, in the decompression code itself.
Remove the #undef's of memcpy and memset in misc.c so that the
decompressor code also uses the compiler builtins.
The rationale given in the comment doesn't really apply: just because
some functions use the out-of-line version is no reason to not use the
builtin version in the rest.
Replace the comment with an explanation of why memzero and memmove are
being #define'd.
Drop the suggestion to #undef in boot/string.h as well: the out-of-line
versions are not really optimized versions, they're generic code that's
good enough for the preboot environment. The compiler will likely
generate better code for constant-size memcpy/memset/memcmp if it is
allowed to.
Most decompressors' performance is unchanged, with the exception of LZ4
and 64-bit ZSTD.
Before After ARCH
LZ4 73ms 10ms 32
LZ4 120ms 10ms 64
ZSTD 90ms 74ms 64
Measurements on QEMU on 2.2GHz Broadwell Xeon, using defconfig kernels.
Decompressor code size has small differences, with the largest being
that 64-bit ZSTD decreases just over 2k. The largest code size increase
was on 64-bit XZ, of about 400 bytes.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Suggested-by: Nick Terrell <nickrterrell@gmail.com>
Tested-by: Nick Terrell <nickrterrell@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For a power9 KVM guest with XIVE enabled, running a test loop
where we hotplug 384 vcpus and then unplug them, the following traces
can be seen (generally within a few loops) either from the unplugged
vcpu:
cpu 65 (hwid 65) Ready to die...
Querying DEAD? cpu 66 (66) shows 2
list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:56!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 ...
CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 4.18.0-221.el8.ppc64le #1
NIP: c0000000007ab50c LR: c0000000007ab508 CTR: 00000000000003ac
REGS: c0000009e5a17840 TRAP: 0700 Not tainted (4.18.0-221.el8.ppc64le)
MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28000842 XER: 20040000
...
NIP __list_del_entry_valid+0xac/0x100
LR __list_del_entry_valid+0xa8/0x100
Call Trace:
__list_del_entry_valid+0xa8/0x100 (unreliable)
free_pcppages_bulk+0x1f8/0x940
free_unref_page+0xd0/0x100
xive_spapr_cleanup_queue+0x148/0x1b0
xive_teardown_cpu+0x1bc/0x240
pseries_mach_cpu_die+0x78/0x2f0
cpu_die+0x48/0x70
arch_cpu_idle_dead+0x20/0x40
do_idle+0x2f4/0x4c0
cpu_startup_entry+0x38/0x40
start_secondary+0x7bc/0x8f0
start_secondary_prolog+0x10/0x14
or on the worker thread handling the unplug:
pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
Querying DEAD? cpu 314 (314) shows 2
BUG: Bad page state in process kworker/u768:3 pfn:95de1
cpu 314 (hwid 314) Ready to die...
page:c00a000002577840 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0
flags: 0x5ffffc00000000()
raw: 005ffffc00000000 5deadbeef0000100 5deadbeef0000200 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000
page dumped because: nonzero mapcount
Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack ...
CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted 4.18.0-224.el8.bz1856588.ppc64le #1
Workqueue: pseries hotplug workque pseries_hp_work_fn
Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
bad_page+0x12c/0x1b0
free_pcppages_bulk+0x5bc/0x940
page_alloc_cpu_dead+0x118/0x120
cpuhp_invoke_callback.constprop.5+0xb8/0x760
_cpu_down+0x188/0x340
cpu_down+0x5c/0xa0
cpu_subsys_offline+0x24/0x40
device_offline+0xf0/0x130
dlpar_offline_cpu+0x1c4/0x2a0
dlpar_cpu_remove+0xb8/0x190
dlpar_cpu_remove_by_index+0x12c/0x150
dlpar_cpu+0x94/0x800
pseries_hp_work_fn+0x128/0x1e0
process_one_work+0x304/0x5d0
worker_thread+0xcc/0x7a0
kthread+0x1ac/0x1c0
ret_from_kernel_thread+0x5c/0x80
The latter trace is due to the following sequence:
page_alloc_cpu_dead
drain_pages
drain_pages_zone
free_pcppages_bulk
where drain_pages() in this case is called under the assumption that
the unplugged cpu is no longer executing. To ensure that is the case,
and early call is made to __cpu_die()->pseries_cpu_die(), which runs a
loop that waits for the cpu to reach a halted state by polling its
status via query-cpu-stopped-state RTAS calls. It only polls for 25
iterations before giving up, however, and in the trace above this
results in the following being printed only .1 seconds after the
hotplug worker thread begins processing the unplug request:
pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
Querying DEAD? cpu 314 (314) shows 2
At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and procedes with the cleanup, causing the race
with the XIVE cleanup code executed by the unplugged CPU.
Fix this by waiting indefinitely, but also making an effort to avoid
spurious lockup messages by allowing for rescheduling after polling
the CPU status and printing a warning if we wait for longer than 120s.
Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Tested-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
[mpe: Trim oopses in change log slightly for readability]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200811161544.10513-1-mdroth@linux.vnet.ibm.com
|
|
When MODULES_VADDR is defined, is_module_segment() shall check the
address against it instead of checking agains VMALLOC_START.
Fixes: 6ca055322da8 ("powerpc/32s: Use dedicated segment for modules with STRICT_KERNEL_RWX")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/07884ed033c31e074747b7eb8eaa329d15db07ec.1596641219.git.christophe.leroy@csgroup.eu
|
|
On BOOK3S_32, when we have modules and strict kernel RWX, modules
are not in vmalloc space but in a dedicated segment that is
below PAGE_OFFSET.
So KASAN_SHADOW_START must take it into account.
MODULES_VADDR can't be used because it is not defined yet
in kasan.h
Fixes: 6ca055322da8 ("powerpc/32s: Use dedicated segment for modules with STRICT_KERNEL_RWX")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6eddca2d5611fd57312a88eae31278c87a8fc99d.1596641224.git.christophe.leroy@csgroup.eu
|
|
See the SDM, volume 3, section 4.4.1:
If PAE paging would be in use following an execution of MOV to CR0 or
MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
the PDPTEs are loaded from the address in CR3.
Fixes: b9baba8614890 ("KVM, pkeys: expose CPUID/CR4 to guest")
Cc: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20200817181655.3716509-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|