summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2019-02-22powerpc: simplify BDI switchChristophe Leroy
There is no reason to re-read each time the pointer at location 0xf0 as it is fixed and known. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/83xx: Also save/restore SPRG4-7 during suspendChristophe Leroy
The 83xx has 8 SPRG registers and uses at least SPRG4 for DTLB handling LRU. Fixes: 2319f1239592 ("powerpc/mm: e300c2/c3/c4 TLB errata workaround") Cc: stable@vger.kernel.org Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/traps: fix recoverability of machine check handling on book3s/32Christophe Leroy
Looks like book3s/32 doesn't set RI on machine check, so checking RI before calling die() will always be fatal allthought this is not an issue in most cases. Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt") Fixes: daf00ae71dad ("powerpc/traps: restore recoverability of machine_check interrupts") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/32: Remove unneccessary MSR[RI] clearing for 8xxChristophe Leroy
MSR[RI] has already been cleared a few lines above. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/setup: display reason for not bootingChristophe Leroy
When no machine description matches, display it clearly before looping forever. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/8xx: hide itlbie and dtlbie symbolsChristophe Leroy
When disassembling InstructionTLBError we get the following messy code: c000138c: 7d 84 63 78 mr r4,r12 c0001390: 75 25 58 00 andis. r5,r9,22528 c0001394: 75 2a 40 00 andis. r10,r9,16384 c0001398: 41 a2 00 08 beq c00013a0 <itlbie> c000139c: 7c 00 22 64 tlbie r4,r0 c00013a0 <itlbie>: c00013a0: 39 40 04 01 li r10,1025 c00013a4: 91 4b 00 b0 stw r10,176(r11) c00013a8: 39 40 10 32 li r10,4146 c00013ac: 48 00 cc 59 bl c000e004 <transfer_to_handler> For a cleaner code dump, this patch replaces itlbie and dtlbie symbols by local symbols. c000138c: 7d 84 63 78 mr r4,r12 c0001390: 75 25 58 00 andis. r5,r9,22528 c0001394: 75 2a 40 00 andis. r10,r9,16384 c0001398: 41 a2 00 08 beq c00013a0 <InstructionTLBError+0xa0> c000139c: 7c 00 22 64 tlbie r4,r0 c00013a0: 39 40 04 01 li r10,1025 c00013a4: 91 4b 00 b0 stw r10,176(r11) c00013a8: 39 40 10 32 li r10,4146 c00013ac: 48 00 cc 59 bl c000e004 <transfer_to_handler> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exitPaul Mackerras
Commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug", 2017-07-21) added two calls to opal_slw_set_reg() inside pnv_cpu_offline(), with the aim of changing the LPCR value in the SLW image to disable wakeups from the decrementer while a CPU is offline. However, pnv_cpu_offline() gets called each time a secondary CPU thread is woken up to participate in running a KVM guest, that is, not just when a CPU is offlined. Since opal_slw_set_reg() is a very slow operation (with observed execution times around 20 milliseconds), this means that an offline secondary CPU can often be busy doing the opal_slw_set_reg() call when the primary CPU wants to grab all the secondary threads so that it can run a KVM guest. This leads to messages like "KVM: couldn't grab CPU n" being printed and guest execution failing. There is no need to reprogram the SLW image on every KVM guest entry and exit. So that we do it only when a CPU is really transitioning between online and offline, this moves the calls to pnv_program_cpu_hotplug_lpcr() into pnv_smp_cpu_kill_self(). Fixes: 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/64s: Fix logic when handling unknown CPU featuresMichael Ellerman
In cpufeatures_process_feature(), if a provided CPU feature is unknown and enable_unknown is false, we erroneously print that the feature is being enabled and return true, even though no feature has been enabled, and may also set feature bits based on the last entry in the match table. Fix this so that we only set feature bits from the match table if we have actually enabled a feature from that table, and when failing to enable an unknown feature, always print the "not enabling" message and return false. Coincidentally, some older gccs (<GCC 7), when invoked with -fsanitize-coverage=trace-pc, cause a spurious uninitialised variable warning in this function: arch/powerpc/kernel/dt_cpu_ftrs.c: In function ‘cpufeatures_process_feature’: arch/powerpc/kernel/dt_cpu_ftrs.c:686:7: warning: ‘m’ may be used uninitialized in this function [-Wmaybe-uninitialized] if (m->cpu_ftr_bit_mask) An upcoming patch will enable support for kcov, which requires this option. This patch avoids the warning. Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features") Reported-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> [ajd: add commit message] Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
2019-02-22powerpc/smp: Make __smp_send_nmi_ipi() staticNicholas Piggin
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/smp: Fix NMI IPI xmon timeoutNicholas Piggin
The xmon debugger IPI handler waits in the callback function while xmon is still active. This means they don't complete the IPI, and the initiator always times out waiting for them. Things manage to work after the timeout because there is some fallback logic to keep NMI IPI state sane in case of the timeout, but this is a bit ugly. This patch changes NMI IPI back to half-asynchronous (i.e., wait for everyone to call in, do not wait for IPI function to complete), but the complexity is avoided by going one step further and allowing new IPIs to be issued before the IPI functions to all complete. If synchronization against that is required, it is left up to the caller, but current callers don't require that. In fact with the timeout handling, callers must be able to cope with this already. Fixes: 5b73151fff63 ("powerpc: NMI IPI make NMI IPIs fully sychronous") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/smp: Fix NMI IPI timeoutNicholas Piggin
The NMI IPI timeout logic is broken, if __smp_send_nmi_ipi() times out on the first condition, delay_us will be zero which will send it into the second spin loop with no timeout so it will spin forever. Fixes: 5b73151fff63 ("powerpc: NMI IPI make NMI IPIs fully sychronous") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc: Make PPC_64K_PAGES depend on only 44x or PPC_BOOK3S_64Michael Ellerman
In commit 7820856a4fcd ("powerpc/mm/book3e/64: Remove unsupported 64Kpage size from 64bit booke") we dropped the 64K page size support from the 64-bit nohash (Book3E) code. But we didn't update the dependencies of the PPC_64K_PAGES option, meaning a randconfig can still trigger this code and cause a build breakage, eg: arch/powerpc/include/asm/nohash/64/pgtable.h:14:2: error: #error "Page size not supported" arch/powerpc/include/asm/nohash/mmu-book3e.h:275:2: error: #error Unsupported page size So remove PPC_BOOK3E_64 from the dependencies. This also means we don't need to worry about PPC_FSL_BOOK3E, because that was just trying to prevent the PPC_BOOK3E_64=y && PPC_FSL_BOOK3E=y case. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/64: Make sys_switch_endian() traceableMichael Ellerman
We weren't using SYSCALL_DEFINE for sys_switch_endian(), which means it wasn't able to be traced by CONFIG_FTRACE_SYSCALLS. By using the macro we create the right metadata and the syscall is visible. eg: # cd /sys/kernel/debug/tracing # echo 1 | tee events/syscalls/sys_*_switch_endian/enable # ~/switch_endian_test # cat trace ... switch_endian_t-3604 [009] .... 315.175164: sys_switch_endian() switch_endian_t-3604 [009] .... 315.175167: sys_switch_endian -> 0x5555aaaa5555aaaa switch_endian_t-3604 [009] .... 315.175169: sys_switch_endian() switch_endian_t-3604 [009] .... 315.175169: sys_switch_endian -> 0x5555aaaa5555aaaa Fixes: 529d235a0e19 ("powerpc: Add a proper syscall for switching endianness") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/dts: Standardize DTS status assignments from "ok" to "okay"Robert P. J. Day
While the current kernel drivers/of/ code allows developers to be sloppy and use a DTS status value of "ok", the current DTSpec 0.1 makes it clear that the proper spelling is "okay", so fix the small number of PowerPC .dts files that do this. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/book3s: Remove pgd/pud/pmd_set() interfacesAneesh Kumar K.V
When updating page tables, we need to make sure we fill the page table entry valid bits. We do this by or'ing in one of PGD/PUD/PMD_VAL_BITS. The page table 'set' interfaces allow updating the raw value of page table entries without setting the valid bits, so remove those interfaces to avoid incorrect usage in future. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [mpe: Reword commit message based on mailing list discussion] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guestMark Cave-Ayland
Commit 8792468da5e1 "powerpc: Add the ability to save FPU without giving it up" unexpectedly removed the MSR_FE0 and MSR_FE1 bits from the bitmask used to update the MSR of the previous thread in __giveup_fpu() causing a KVM-PR MacOS guest to lockup and panic the host kernel. Leaving FE0/1 enabled means unrelated processes might receive FPEs when they're not expecting them and crash. In particular if this happens to init the host will then panic. eg (transcribed): qemu-system-ppc[837]: unhandled signal 8 at 12cc9ce4 nip 12cc9ce4 lr 12cc9ca4 code 0 systemd[1]: unhandled signal 8 at 202f02e0 nip 202f02e0 lr 001003d4 code 0 Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b Reinstate these bits to the MSR bitmask to enable MacOS guests to run under 32-bit KVM-PR once again without issue. Fixes: 8792468da5e1 ("powerpc: Add the ability to save FPU without giving it up") Cc: stable@vger.kernel.org # v4.6+ Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/pseries: export timebase register sample in lparcfgTyrel Datwyler
The Processor Utilzation of Resource Registers (PURR) provide an estimate of resources used by a cpu thread. Section 7.6 in Book III of the ISA outlines how to calculate the percentage of shared resources for threads using the ratio of the PURR delta and Timebase Register delta for a sampled period. This calculation is currently done erroneously by the lparstat tool from the powerpc-utils package. This patch exports the current timebase value after we sample the PURRs and exposes it to userspace accounting tools via /proc/ppc64/lparcfg. Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/44x: Force PCI on for CURRITUCKMichael Ellerman
The recent rework of PCI kconfig symbols exposed an existing bug in the CURRITUCK kconfig logic. It selects PPC4xx_PCI_EXPRESS which depends on PCI, but PCI is user selectable and might be disabled, leading to a warning: WARNING: unmet direct dependencies detected for PPC4xx_PCI_EXPRESS Depends on [n]: PCI [=n] && 4xx [=y] Selected by [y]: - CURRITUCK [=y] && PPC_47x [=y] Prior to commit eb01d42a7778 ("PCI: consolidate PCI config entry in drivers/pci") PCI was enabled by default for currituck_defconfig so we didn't see the warning. The bad logic was still there, it just required someone disabling PCI in their .config to hit it. Fix it by forcing PCI on for CURRITUCK, which seems was always the expectation anyway. Fixes: eb01d42a7778 ("PCI: consolidate PCI config entry in drivers/pci") Reported-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/eeh: Add eeh_force_recover to debugfsOliver O'Halloran
This patch adds a debugfs interface to force scheduling a recovery event. This can be used to recover a specific PE or schedule a "special" recovery even that checks for errors at the PHB level. To force a recovery of a normal PE, use: echo '<#pe>:<#phb>' > /sys/kernel/debug/powerpc/eeh_force_recover To force a scan for broken PHBs: echo 'hwcheck' > /sys/kernel/debug/powerpc/eeh_force_recover Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/eeh: Allow disabling recoveryOliver O'Halloran
Currently when we detect an error we automatically invoke the EEH recovery handler. This can be annoying when debugging EEH problems, or when working on EEH itself so this patch adds a debugfs knob that will prevent a recovery event from being queued up when an issue is detected. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/pci: Add pci_find_controller_for_domain()Oliver O'Halloran
Add a helper to find the pci_controller structure based on the domain number / phb id. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/eeh_cache: Bump log level of eeh_addr_cache_print()Oliver O'Halloran
To use this function at all #define DEBUG needs to be set in eeh_cache.c. Considering that printing at pr_debug is probably not all that useful since it adds the additional hurdle of requiring you to enable the debug print if dynamic_debug is in use so this patch bumps it to pr_info. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/eeh_cache: Add a way to dump the EEH address cacheOliver O'Halloran
Adds a debugfs file that can be read to view the contents of the EEH address cache. This is pretty similar to the existing eeh_addr_cache_print() function, but that function is intended to debug issues inside of the kernel since it's #ifdef`ed out by default, and writes into the kernel log. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/eeh_cache: Add pr_debug() prints for insert/removeOliver O'Halloran
The EEH address cache is used to map a physical MMIO address back to a PCI device. It's useful to know when it's being manipulated, but currently this requires recompiling with #define DEBUG set. This is pointless since we have dynamic_debug nowdays, so remove the #ifdef guard and add a pr_debug() for the remove case too. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/eeh: Use debugfs_create_u32 for eeh_max_freezesOliver O'Halloran
There's no need to the custom getter/setter functions so we should remove them in favour of using the generic one. While we're here, change the type of eeh_max_freeze to u32 and print the value in decimal rather than hex because printing it in hex makes no sense. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc: drop unused GENERIC_CSUM Kconfig itemChristophe Leroy
Commit d4fde568a34a ("powerpc/64: Use optimized checksum routines on little-endian") converted last powerpc user of GENERIC_CSUM. This patch does a final cleanup dropping the Kconfig GENERIC_CSUM option which is always 'n', and associated piece of code in asm/checksum.h Fixes: d4fde568a34a ("powerpc/64: Use optimized checksum routines on little-endian") Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/64s/hash: Fix assert_slb_presence() use of the slbfee. instructionNicholas Piggin
The slbfee. instruction must have bit 24 of RB clear, failure to do so can result in false negatives that result in incorrect assertions. This is not obvious from the ISA v3.0B document, which only says: The hardware ignores the contents of RB 36:38 40:63 -- p.1032 This patch fixes the bug and also clears all other bits from PPC bit 36-63, which is good practice when dealing with reserved or ignored bits. Fixes: e15a4fea4dee ("powerpc/64s/hash: Add some SLB debugging tests") Cc: stable@vger.kernel.org # v4.20+ Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/mm/hash: Increase vmalloc space to 512T with hash MMUMichael Ellerman
This patch updates the kernel non-linear virtual map to 512TB when we're built with 64K page size and are using the hash MMU. We allocate one context for the vmalloc region and hence the max virtual area size is limited by the context map size (512TB for 64K and 64TB for 4K page size). This patch fixes boot failures with large amounts of system RAM where we need large vmalloc space to handle per cpu allocations. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
2019-02-22powerpc/ptrace: Simplify vr_get/set() to avoid GCC warningMichael Ellerman
GCC 8 warns about the logic in vr_get/set(), which with -Werror breaks the build: In function ‘user_regset_copyin’, inlined from ‘vr_set’ at arch/powerpc/kernel/ptrace.c:628:9: include/linux/regset.h:295:4: error: ‘memcpy’ offset [-527, -529] is out of the bounds [0, 16] of object ‘vrsave’ with type ‘union <anonymous>’ [-Werror=array-bounds] arch/powerpc/kernel/ptrace.c: In function ‘vr_set’: arch/powerpc/kernel/ptrace.c:623:5: note: ‘vrsave’ declared here } vrsave; This has been identified as a regression in GCC, see GCC bug 88273. However we can avoid the warning and also simplify the logic and make it more robust. Currently we pass -1 as end_pos to user_regset_copyout(). This says "copy up to the end of the regset". The definition of the regset is: [REGSET_VMX] = { .core_note_type = NT_PPC_VMX, .n = 34, .size = sizeof(vector128), .align = sizeof(vector128), .active = vr_active, .get = vr_get, .set = vr_set }, The end is calculated as (n * size), ie. 34 * sizeof(vector128). In vr_get/set() we pass start_pos as 33 * sizeof(vector128), meaning we can copy up to sizeof(vector128) into/out-of vrsave. The on-stack vrsave is defined as: union { elf_vrreg_t reg; u32 word; } vrsave; And elf_vrreg_t is: typedef __vector128 elf_vrreg_t; So there is no bug, but we rely on all those sizes lining up, otherwise we would have a kernel stack exposure/overwrite on our hands. Rather than relying on that we can pass an explict end_pos based on the sizeof(vrsave). The result should be exactly the same but it's more obviously not over-reading/writing the stack and it avoids the compiler warning. Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Mathieu Malaterre <malat@debian.org> Cc: stable@vger.kernel.org Tested-by: Mathieu Malaterre <malat@debian.org> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22powerpc/powernv/npu: Remove redundant change_pte() hookPeter Xu
The change_pte() notifier was designed to use as a quick path to update secondary MMU PTEs on write permission changes or PFN changes. For KVM, it could reduce the vm-exits when vcpu faults on the pages that was touched up by KSM. It's not used to do cache invalidations, for example, if we see the notifier will be called before the real PTE update after all (please see set_pte_at_notify that set_pte_at was called later). All the necessary cache invalidation should all be done in invalidate_range() already. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Alistair Popple <alistair@popple.id.au> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22Merge branch 'topic/ppc-kvm' into nextMichael Ellerman
Merge commits we're sharing with kvm-ppc tree.
2019-02-21powerpc/64s: Better printing of machine check info for guest MCEsPaul Mackerras
This adds an "in_guest" parameter to machine_check_print_event_info() so that we can avoid trying to translate guest NIP values into symbolic form using the host kernel's symbol table. Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21KVM: PPC: Book3S HV: Simplify machine check handlingPaul Mackerras
This makes the handling of machine check interrupts that occur inside a guest simpler and more robust, with less done in assembler code and in real mode. Now, when a machine check occurs inside a guest, we always get the machine check event struct and put a copy in the vcpu struct for the vcpu where the machine check occurred. We no longer call machine_check_queue_event() from kvmppc_realmode_mc_power7(), because on POWER8, when a vcpu is running on an offline secondary thread and we call machine_check_queue_event(), that calls irq_work_queue(), which doesn't work because the CPU is offline, but instead triggers the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which fires again and again because nothing clears the condition). All that machine_check_queue_event() actually does is to cause the event to be printed to the console. For a machine check occurring in the guest, we now print the event in kvmppc_handle_exit_hv() instead. The assembly code at label machine_check_realmode now just calls C code and then continues exiting the guest. We no longer either synthesize a machine check for the guest in assembly code or return to the guest without a machine check. The code in kvmppc_handle_exit_hv() is extended to handle the case where the guest is not FWNMI-capable. In that case we now always synthesize a machine check interrupt for the guest. Previously, if the host thinks it has recovered the machine check fully, it would return to the guest without any notification that the machine check had occurred. If the machine check was caused by some action of the guest (such as creating duplicate SLB entries), it is much better to tell the guest that it has caused a problem. Therefore we now always generate a machine check interrupt for guests that are not FWNMI-capable. Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21Merge branch 'topic/dma' into nextMichael Ellerman
Merge hch's big DMA rework series. This is in a topic branch in case he wants to merge it to minimise conflicts.
2019-02-21Merge branch 'ib-qcom-ssbi' into develLinus Walleij
2019-02-21arm64: dts: Add the PCIE EP node in dtsXiaowei Bao
Add the PCIE EP node in dts for ls1046a. Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Minghuan Lian <minghuan.lian@nxp.com> Reviewed-by: Zhiqiang Hou <zhiqiang.hou@nxp.com> Reviewed-by: Rob Herring <robh+dt@kernel.org>
2019-02-21RISC-V: Free-up initrd in free_initrd_mem()Anup Patel
We should free-up initrd memory in free_initrd_mem() instead of doing nothing. Signed-off-by: Anup Patel <anup.patel@wdc.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
2019-02-21RISC-V: Implement compile-time fixed mappingsAnup Patel
This patch implements compile-time virtual to physical mappings. These compile-time fixed mappings can be used by earlycon, ACPI, and early ioremap for creating fixed mappings when FIX_EARLYCON_MEM=y. To start with, we have enabled compile-time fixed mappings for earlycon. Signed-off-by: Anup Patel <anup.patel@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
2019-02-21RISC-V: Move setup_vm() to mm/init.cAnup Patel
The setup_vm() is responsible for setting up initial page table hence should be placed in mm/init.c. Signed-off-by: Anup Patel <anup.patel@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
2019-02-21RISC-V: Move setup_bootmem() to mm/init.cAnup Patel
The setup_bootmem() mainly populates memblocks and does early memory reservations. The right location for this function is mm/init.c. It calls setup_initrd() so we move that as well. Signed-off-by: Anup Patel <anup.patel@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Palmer Dabbelt <palmer@sifive.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
2019-02-21RISC-V: Setup init_mm before parse_early_param()Anup Patel
We should setup init_mm before doing parse_early_param() in setup_arch() to be consistent with setup_arch() of other architectures such as x86, ARM, and ARM64. Signed-off-by: Anup Patel <anup.patel@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
2019-02-21KVM: PPC: Book3S HV: Context switch AMR on Power9Michael Ellerman
kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9 when guest and host are both running with the Radix MMU. Currently in that path we don't save the host AMR (Authority Mask Register) value, and we always restore 0 on return to the host. That is OK at the moment because the AMR is not used for storage keys with the Radix MMU. However we plan to start using the AMR on Radix to prevent the kernel from reading/writing to userspace outside of copy_to/from_user(). In order to make that work we need to save/restore the AMR value. We only restore the value if it is different from the guest value, which is already in the register when we exit to the host. This should mean we rarely need to actually restore the value when running a modern Linux as a guest, because it will be using the same value as us. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Russell Currey <ruscur@russell.cc>
2019-02-20Merge tag 'kvm-s390-master-5.0' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master KVM: s390: Fix crypto handling for nested KVM
2019-02-20x86: kvmguest: use TSC clocksource if invariant TSC is exposedMarcelo Tosatti
The invariant TSC bit has the following meaning: "The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource." IOW, TSC does not change frequency. In such case, and with TSC scaling hardware available to handle migration, it is possible to use the TSC clocksource directly, whose system calls are faster. Reduce the rating of kvmclock clocksource to allow TSC clocksource to be the default if invariant TSC is exposed. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> v2: Use feature bits and tsc_unstable() check (Sean Christopherson) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_startNir Weiner
grow_halt_poll_ns() have a strange behaviour in case (vcpu->halt_poll_ns != 0) && (vcpu->halt_poll_ns < halt_poll_ns_grow_start). In this case, vcpu->halt_poll_ns will be multiplied by grow factor (halt_poll_ns_grow) which will require several grow iteration in order to reach a value bigger than halt_poll_ns_grow_start. This means that growing vcpu->halt_poll_ns from value of 0 is slower than growing it from a positive value less than halt_poll_ns_grow_start. Which is misleading and inaccurate. Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns to halt_poll_ns_grow_start in any case that (vcpu->halt_poll_ns < halt_poll_ns_grow_start). Regardless if vcpu->halt_poll_ns is 0. use READ_ONCE to get a consistent number for all cases. Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameterNir Weiner
The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial start value when raising up vcpu->halt_poll_ns. It actually sets the first timeout to the first polling session. This value has significant effect on how tolerant we are to outliers. On the standard case, higher value is better - we will spend more time in the polling busyloop, handle events/interrupts faster and result in better performance. But on outliers it puts us in a busy loop that does nothing. Even if the shrink factor is zero, we will still waste time on the first iteration. The optimal value changes between different workloads. It depends on outliers rate and polling sessions length. As this value has significant effect on the dynamic halt-polling algorithm, it should be configurable and exposed. Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_nsNir Weiner
grow_halt_poll_ns() have a strange behavior in case (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0). In this case, vcpu->halt_pol_ns will be set to zero. That results in shrinking instead of growing. Fix issue by changing grow_halt_poll_ns() to not modify vcpu->halt_poll_ns in case halt_poll_ns_grow is zero Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Nir Weiner <nir.weiner@oracle.com> Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()Sean Christopherson
...via a new helper, __kvm_mmu_zap_all(). An alternative to passing a 'bool mmio_only' would be to pass a callback function to filter the shadow page, i.e. to make __kvm_mmu_zap_all() generic and reusable, but zapping all shadow pages is a last resort, i.e. making the helper less extensible is a feature of sorts. And the explicit MMIO parameter makes it easy to preserve the WARN_ON_ONCE() if a restart is triggered when zapping MMIO sptes. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping childrenSean Christopherson
Paolo expressed a concern that kvm_mmu_zap_mmio_sptes() could have a quadratic runtime[1], i.e. restarting the spte walk while zapping only MMIO sptes could result in re-walking large portions of the list over and over due to the non-MMIO sptes encountered before the restart not being removed. At the time, the concern was legitimate as the walk was restarted when any spte was zapped. But that is no longer the case as the walk is now restarted iff one or more children have been zapped, which is necessary because zapping children makes the active_mmu_pages list unstable. Furthermore, it should be impossible for an MMIO spte to have children, i.e. zapping an MMIO spte should never result in zapping children. In other words, kvm_mmu_zap_mmio_sptes() should never restart its walk, and so should always execute in linear time. WARN if this assertion fails. Although it should never be needed, leave the restart logic in place. In normal operation, the cost is at worst an extra CMP+Jcc, and if for some reason the list does become unstable, not restarting would likely crash KVM, or worse, the kernel. [1] https://patchwork.kernel.org/patch/10756589/#22452085 Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: Differentiate between nr zapped and list unstableSean Christopherson
The return value of kvm_mmu_prepare_zap_page() has evolved to become overloaded to convey two separate pieces of information. 1) was at least one page zapped and 2) has the list of MMU pages become unstable. In it's original incarnation (as kvm_mmu_zap_page()), there was no return value at all. Commit 0738541396be ("KVM: MMU: awareness of new kvm_mmu_zap_page behaviour") added a return value in preparation for commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"). Although the return value was of type 'int', it was actually used as a boolean to indicate whether or not active_mmu_pages may have become unstable due to zapping children. Walking a list with list_for_each_entry_safe() only protects against deleting/moving the current entry, i.e. zapping a child page would break iteration due to modifying any number of entries. Later, commit 60c8aec6e2c9 ("KVM: MMU: use page array in unsync walk") modified mmu_zap_unsync_children() to return an approximation of the number of children zapped. This was not intentional, it was simply a side effect of how the code was written. The unintented side affect was then morphed into an actual feature by commit 77662e0028c7 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling path"), which modified kvm_mmu_change_mmu_pages() to use the number of zapped pages when determining the number of MMU pages in use by the VM. Finally, commit 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed") added the initial page to the return value to make its behavior more consistent with what most users would expect. Incorporating the initial parent page in the return value of kvm_mmu_zap_page() breaks the original usage of restarting a list walk on a non-zero return value to handle a potentially unstable list, i.e. walks will unnecessarily restart when any page is zapped. Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e. return a boolean to indicate that the list may be unstable and move the number of zapped children to a dedicated parameter. Since the majority of callers to kvm_mmu_prepare_zap_page() don't care about either return value, preserve the current definition of kvm_mmu_prepare_zap_page() by making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page(). This avoids having to update every call site and also provides cleaner code for functions that only care about the number of pages zapped. Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>