summaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)Author
2017-08-11Merge tag 'powerpc-4.13-6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "All fixes for code that went in this cycle. - a revert of an optimisation to the syscall exit path, which could lead to an oops on either older machines or machines with > 1TB of memory - disable some deep idle states if the firmware configuration for them fails - re-enable HARD/SOFT lockup detectors in defconfigs after a Kconfig change - six fairly small patches fixing bugs in our new watchdog code Thanks to: Gautham R Shenoy, Nicholas Piggin" * tag 'powerpc-4.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/watchdog: add locking around init/exit functions powerpc/watchdog: Fix marking of stuck CPUs powerpc/watchdog: Fix final-check recovered case powerpc/watchdog: Moderate touch_nmi_watchdog overhead powerpc/watchdog: Improve watchdog lock primitive powerpc: NMI IPI improve lock primitive powerpc/configs: Re-enable HARD/SOFT lockup detectors powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT states when stop-api fails Revert "powerpc/64: Avoid restore_math call if possible in syscall exit"
2017-08-09powerpc/watchdog: add locking around init/exit functionsNicholas Piggin
When CPUs start and stop the watchdog, they manipulate shared data that is normally protected by the lock. Other CPUs can be running concurrently at this time, so it's a good idea to use locking here to be on the safe side. Remove the barrier which is undocumented and didn't do anything. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-09powerpc/watchdog: Fix marking of stuck CPUsNicholas Piggin
When the SMP detector finds other CPUs stuck, it iterates over them and marks them as stuck. This pulls them out of the pending mask and allows the detector to continue with remaining good CPUs (if nmi_watchdog=panic is not enabled). The code to dothat was buggy because when setting a CPU stuck, if the pending mask became empty, it resets it to keep the watchdog running. However the iterator will continue to run over the new pending mask and mark remaining good CPUs sas stuck. Fix this by doing it with cpumask bitwise operations. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-09powerpc/watchdog: Fix final-check recovered caseNicholas Piggin
When the watchdog decides to panic, it takes the lock and double checks everything (to avoid races with the CPU being unstuck or panic()ed by something else). The exit label was misplaced and would result in all-CPUs backtrace and watchdog panic even in the case that the condition was found to be resolved. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-09powerpc/watchdog: Moderate touch_nmi_watchdog overheadNicholas Piggin
Some code can go into a tight loop calling touch_nmi_watchdog (e.g., stop_machine CPU hotplug code). This can cause contention on watchdog locks particularly if all CPUs with watchdog enabled are spinning in the loops. Avoid this storm of activity by running the watchdog timer callback from this path if we have exceeded the timer period since it was last run. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-09powerpc/watchdog: Improve watchdog lock primitiveNicholas Piggin
- Hard-disable interrupts before taking the lock, which prevents soft-NMI re-entrancy and therefore can prevent deadlocks. - Use raw_ variants of local_irq_disable to avoid irq debugging. - When the lock is contended, spin at low SMT priority, using loads only, and with interrupts enabled (where possible). Some stalls have been noticed at high loads that go away with improved locking. There should not be so much locking contention in the first place (which is addressed in a subsequent patch), but locking should still be improved. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-09powerpc: NMI IPI improve lock primitiveNicholas Piggin
When the NMI IPI lock is contended, spin at low SMT priority, using loads only, and with interrupts enabled (where possible). This improves behaviour under high contention (e.g., a system crash when a number of CPUs are trying to enter the debugger). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-09powerpc/configs: Re-enable HARD/SOFT lockup detectorsMichael Ellerman
In commit 05a4a9527931 ("kernel/watchdog: split up config options"), CONFIG_LOCKUP_DETECTOR was split into two separate config options, CONFIG_HARDLOCKUP_DETECTOR and CONFIG_SOFTLOCKUP_DETECTOR. Our defconfigs still have CONFIG_LOCKUP_DETECTOR=y, but that is no longer user selectable, and we don't mention the new options, so we end up with none of them enabled. So update the defconfigs to turn on the new SOFT and HARD options, the end result being the same as what we had previously. Fixes: 05a4a9527931 ("kernel/watchdog: split up config options") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT states when stop-api failsGautham R. Shenoy
Currently, we use the opal call opal_slw_set_reg() to inform the Sleep-Winkle Engine (SLW) to restore the contents of some of the Hypervisor state on wakeup from deep idle states that lose full hypervisor context (characterized by the flag OPAL_PM_LOSE_FULL_CONTEXT). However, the current code has a bug in that if opal_slw_set_reg() fails, we don't disable the use of these deep states (winkle on POWER8, stop4 onwards on POWER9). This patch fixes this bug by ensuring that if programing the sleep-winkle engine to restore the hypervisor states in pnv_save_sprs_for_deep_states() fails, then we exclude such states by clearing the OPAL_PM_LOSE_FULL_CONTEXT flag from supported_cpuidle_states. As a result POWER8 will be prevented from using winkle for CPU-Hotplug, and POWER9 will put the offlined CPUs to the default stop state when available. Further, we ensure in the initialization of the cpuidle-powernv driver to only include those states whose flags are present in supported_cpuidle_states, thereby skipping OPAL_PM_LOSE_FULL_CONTEXT states when they have been disabled due to stop-api failure. Fixes: 1e1601b38e6 ("powerpc/powernv/idle: Restore SPRs for deep idle states via stop API.") Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-07Revert "powerpc/64: Avoid restore_math call if possible in syscall exit"Michael Ellerman
This reverts commit bc4f65e4cf9d6cc43e0e9ba0b8648cf9201cd55f. As reported by Andreas, this commit is causing unrecoverable SLB misses in the system call exit path: Unrecoverable exception 4100 at c00000000000a1ec Oops: Unrecoverable exception, sig: 6 [#1] SMP NR_CPUS=2 PowerMac ... CPU: 0 PID: 18626 Comm: rm Not tainted 4.13.0-rc3 #1 task: c00000018335e080 task.stack: c000000139e50000 NIP: c00000000000a1ec LR: c00000000000a118 CTR: 0000000000000000 REGS: c000000139e53bb0 TRAP: 4100 Not tainted (4.13.0-rc3) MSR: 9000000000001030 <SF,HV,ME,IR,DR> CR: 24000044 XER: 20000000 SOFTE: 1 GPR00: 0000000000000000 c000000139e53e30 c000000000abb500 fffffffffffffffe GPR04: c0000001eb866298 0000000000000000 0000000000000000 c00000018335e080 GPR08: 900000000000d032 0000000000000000 0000000000000002 fffffffffffff001 GPR12: c000000139e50000 c00000000ffff000 00003fffa8c0dca0 00003fffa8c0dc88 GPR16: 0000000010000000 0000000000000001 00003fffa8c0eaa0 0000000000000000 GPR20: 00003fffa8c27528 00003fffa8c27b00 0000000000000000 0000000000000000 GPR24: 00003fffa8c0d918 00003ffff1b3efa0 00003fffa8c26d68 0000000000000000 GPR28: 00003fffa8c249e8 00003fffa8c263d0 00003fffa8c27550 00003ffff1b3ef10 NIP [c00000000000a1ec] system_call_exit+0xc0/0x21c LR [c00000000000a118] system_call+0x58/0x6c Call Trace: [c000000139e53e30] [c00000000000a118] system_call+0x58/0x6c (unreliable) Instruction dump: 64a51000 7c6300d0 f8a101a0 4bffff9c 3c000000 60000006 780007c6 64000000 60000000 7c004039 4082001c e8ed0170 <88070b78> 88c70b79 7c003214 2c200000 This is caused by us trying to load THREAD_LOAD_FP with MSR_RI=0, and taking an SLB miss on the thread struct. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Diagnosed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-04Merge tag 'powerpc-4.13-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Fixes for recently merged code: - a fix for the _PAGE_DEVMAP support, which was breaking KVM on Power9 radix - avoid a (harmless) lockdep warning in the early SMP code - return failure for some uses of dma_set_mask() rather than falling back to 32-bits - fix stack setup in watchdog soft_nmi_common() to use emergency stack - fix of_irq_to_resource() error check in of_fsl_spi_probe() Two fixes going to stable: - fix saving of Transactional Memory SPRs in core dump - fix __check_irq_replay missing decrementer interrupt And two misc: - fix 64-bit boot wrapper build with non-biarch compiler - work around a POWER9 PMU hang after state-loss idle Thanks to: Alistair Popple, Aneesh Kumar K.V, Cyril Bur, Gustavo Romero, Jose Ricardo Ziviani, Laurent Vivier, Nicholas Piggin, Oliver O'Halloran, Sergei Shtylyov, Suraj Jitindar Singh, Thomas Gleixner" * tag 'powerpc-4.13-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64: Fix __check_irq_replay missing decrementer interrupt powerpc/perf: POWER9 PMU stops after idle workaround powerpc/83xx/mpc832x_rdb: fix of_irq_to_resource() error check powerpc/64s: Fix stack setup in watchdog soft_nmi_common() powerpc/powernv/pci: Return failure for some uses of dma_set_mask() powerpc/boot: Fix 64-bit boot wrapper build with non-biarch compiler powerpc/smp: Call smp_ops->setup_cpu() directly on the boot CPU powerpc/tm: Fix saving of TM SPRs in core dump powerpc/mm: Fix pmd/pte_devmap() on non-leaf entries
2017-08-04powerpc/64: Fix __check_irq_replay missing decrementer interruptNicholas Piggin
If the decrementer wraps again and de-asserts the decrementer exception while hard-disabled, __check_irq_replay() has a test to notice the wrap when interrupts are re-enabled. The decrementer check must be done when clearing the PACA_IRQ_HARD_DIS flag, not when the PACA_IRQ_DEC flag is tested. Previously this worked because the decrementer interrupt was always the first one checked after clearing the hard disable flag, but HMI check was moved ahead of that, which introduced this bug. This can cause a missed decrementer interrupt if we soft-disable interrupts then take an HMI which is recorded in irq_happened, then hard-disable interrupts for > 4s to wrap the decrementer. Fixes: e0e0d6b7390b ("powerpc/64: Replay hypervisor maintenance interrupt first") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-04powerpc/perf: POWER9 PMU stops after idle workaroundNicholas Piggin
POWER9 DD2 PMU can stop after a state-loss idle in some conditions. A solution is to set then clear MMCRA[60] after wake from state-loss idle. MMCRA[60] is a non-architected bit, see the user manual for details. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Acked-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02powerpc/83xx/mpc832x_rdb: fix of_irq_to_resource() error checkSergei Shtylyov
of_irq_to_resource() has recently been fixed to return negative error #'s along with 0 in case of failure, however the Freescale MPC832x RDB board code still only regards 0 as a failure indication -- fix it up. Fixes: 7a4228bbff76 ("of: irq: use of_irq_get() in of_irq_to_resource()") Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-31powerpc/64s: Fix stack setup in watchdog soft_nmi_common()Nicholas Piggin
The watchdog soft-NMI exception stack setup loads a stack pointer twice, which is an obvious error. It ends up using the system reset interrupt (true-NMI) stack, which is also a bug because the watchdog could be preempted by a system reset interrupt that overwrites the NMI stack. Change the soft-NMI to use the "emergency stack". The current kernel stack is not used, because of the longer-term goal to prevent asynchronous stack access using soft-disable. Fixes: 2104180a5369 ("powerpc/64s: implement arch-specific hardlockup watchdog") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-31Merge tag 'v4.13-rc1' into fixesMichael Ellerman
The fixes branch is based off a random pre-rc1 commit, because we had some fixes that needed to go in before rc1 was released. However we now need to fix some code that went in after that point, but before rc1, so merge rc1 to get that code into fixes so we can fix it!
2017-07-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "s390: - SRCU fix PPC: - host crash fixes x86: - bugfixes, including making nested posted interrupts really work Generic: - tweaks to kvm_stat and to uevents" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: LAPIC: Fix reentrancy issues with preempt notifiers tools/kvm_stat: add '-f help' to get the available event list tools/kvm_stat: use variables instead of hard paths in help output KVM: nVMX: Fix loss of L2's NMI blocking state KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode x86: irq: Define a global vector for nested posted interrupts KVM: x86: do mask out upper bits of PAE CR3 KVM: make pid available for uevents without debugfs KVM: s390: take srcu lock when getting/setting storage keys KVM: VMX: remove unused field KVM: PPC: Book3S HV: Fix host crash on changing HPT size KVM: PPC: Book3S HV: Enable TM before accessing TM registers
2017-07-28Merge tag 'powerpc-4.13-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "The highlight is Ben's patch to work around a host killing bug when running KVM guests with the Radix MMU on Power9. See the long change log of that commit for more detail. And then three fairly minor fixes: - fix of_node_put() underflow during reconfig remove, using old DLPAR tools. - fix recently introduced ld version check with 64-bit LE-only toolchain. - free the subpage_prot_table correctly, avoiding a memory leak. Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Laurent Vivier" * tag 'powerpc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm/hash: Free the subpage_prot_table correctly powerpc/Makefile: Fix ld version check with 64-bit LE-only toolchain powerpc/pseries: Fix of_node_put() underflow during reconfig remove powerpc/mm/radix: Workaround prefetch issue with KVM
2017-07-28powerpc/powernv/pci: Return failure for some uses of dma_set_mask()Alistair Popple
Commit 8e3f1b1d8255 ("powerpc/powernv/pci: Enable 64-bit devices to access >4GB DMA space") introduced the ability for PCI device drivers to request a DMA mask between 64 and 32 bits and actually get a mask greater than 32-bits. However currently if certain machine configuration dependent conditions are not meet the code silently falls back to a 32-bit mask. This makes it hard for device drivers to detect which mask they actually got. Instead we should return an error when the request could not be fulfilled which allows drivers to either fallback or implement other workarounds as documented in DMA-API-HOWTO.txt. Signed-off-by: Alistair Popple <alistair@popple.id.au> Acked-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-28powerpc/boot: Fix 64-bit boot wrapper build with non-biarch compilerMichael Ellerman
Historically the boot wrapper was always built 32-bit big endian, even for 64-bit kernels. That was because old firmwares didn't necessarily support booting a 64-bit image. Because of that arch/powerpc/boot/Makefile uses CROSS32CC for compilation. However when we added 64-bit little endian support, we also added support for building the boot wrapper 64-bit. However we kept using CROSS32CC, because in most cases it is just CC and everything works. However if the user doesn't specify CROSS32_COMPILE (which no one ever does AFAIK), and CC is *not* biarch (32/64-bit capable), then CROSS32CC becomes just "gcc". On native systems that is probably OK, but if we're cross building it definitely isn't, leading to eg: gcc ... -m64 -mlittle-endian -mabi=elfv2 ... arch/powerpc/boot/cpm-serial.c gcc: error: unrecognized argument in option ‘-mabi=elfv2’ gcc: error: unrecognized command line option ‘-mlittle-endian’ make: *** [zImage] Error 2 To fix it, stop using CROSS32CC, because we may or may not be building 32-bit. Instead setup a BOOTCC, which defaults to CC, and only use CROSS32_COMPILE if it's set and we're building for 32-bit. Fixes: 147c05168fc8 ("powerpc/boot: Add support for 64bit little endian wrapper") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
2017-07-28powerpc/smp: Call smp_ops->setup_cpu() directly on the boot CPUMichael Ellerman
In smp_cpus_done() we need to call smp_ops->setup_cpu() for the boot CPU, which means it has to run *on* the boot CPU. In the past we ensured it ran on the boot CPU by changing the CPU affinity mask of current directly. That was removed in commit 6d11b87d55eb ("powerpc/smp: Replace open coded task affinity logic"), and replaced with a work queue call. Unfortunately using a work queue leads to a lockdep warning, now that the CPU hotplug lock is a regular semaphore: ====================================================== WARNING: possible circular locking dependency detected ... kworker/0:1/971 is trying to acquire lock: (cpu_hotplug_lock.rw_sem){++++++}, at: [<c000000000100974>] apply_workqueue_attrs+0x34/0xa0 but task is already holding lock: ((&wfc.work)){+.+.+.}, at: [<c0000000000fdb2c>] process_one_work+0x25c/0x800 ... CPU0 CPU1 ---- ---- lock((&wfc.work)); lock(cpu_hotplug_lock.rw_sem); lock((&wfc.work)); lock(cpu_hotplug_lock.rw_sem); Although the deadlock can't happen in practice, because smp_cpus_done() only runs in early boot before CPU hotplug is allowed, lockdep can't tell that. Luckily in commit 8fb12156b8db ("init: Pin init task to the boot CPU, initially") tglx changed the generic code to pin init to the boot CPU to begin with. The unpinning of init from the boot CPU happens in sched_init_smp(), which is called after smp_cpus_done(). So smp_cpus_done() is always called on the boot CPU, which means we don't need the work queue call at all - and the lockdep warning goes away. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2017-07-28powerpc/tm: Fix saving of TM SPRs in core dumpGustavo Romero
Currently flush_tmregs_to_thread() does not save the TM SPRs (TFHAR, TFIAR, TEXASR) to the thread struct, unless the process is currently inside a suspended transaction. If the process is core dumping, and the TM SPRs have changed since the last time the process was context switched, then we will save stale values of the TM SPRs to the core dump. Fix it by saving the live register state to the thread struct in that case. Fixes: 08e1c01d6aed ("powerpc/ptrace: Enable support for TM SPR state") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com> Reviewed-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-28powerpc/mm: Fix pmd/pte_devmap() on non-leaf entriesOliver O'Halloran
The Radix MMU translation tree as defined in ISA v3.0 contains two different types of entry, directories and leaves. Leaves are identified by _PAGE_PTE being set. The formats of the two entries are different, with the directory entries containing no spare bits for use by software. In particular the bit we use for _PAGE_DEVMAP is not reserved for software, and is part of the NLB (Next Level Base) field, essentially the address of the next level in the tree. Note that the Linux pte_t is not == _PAGE_PTE. A huge page pmd entry (or devmap!) is also a leaf and so has _PAGE_PTE set, even though we use a pmd_t for it in Linux. The fix is to ensure that the pmd/pte_devmap() confirm they are looking at a leaf entry (_PAGE_PTE) as well as checking _PAGE_DEVMAP. Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Tested-by: Laurent Vivier <lvivier@redhat.com> Tested-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Add a comment in the code and flesh out change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-27powerpc/mm/hash: Free the subpage_prot_table correctlyAneesh Kumar K.V
Fixes: dad6f37c2602e ("powerpc: subpage_protect: Increase the array size to take care of 64TB") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26powerpc/Makefile: Fix ld version check with 64-bit LE-only toolchainMichael Ellerman
In commit efe0160cfd40 ("powerpc/64: Linker on-demand sfpr functions for modules"), we added an ld version check early in the powerpc top-level Makefile. Because the Makefile runs before the kernel config is setup, the checks for CONFIG_CPU_LITTLE_ENDIAN etc. all take the default case. So we end up configuring ld for 32-bit big endian. That would be OK, except that for historical (or perhaps no) reason, we use 'override LD' to add the endian flags to the LD variable itself, rather than the normal approach of adding them to LDFLAGS. The end result is that when we check the ld version we run it as: $(CROSS_COMPILE)ld -EB -m elf32ppc --version This often works, unless you are using a 64-bit only and/or little endian only, toolchain. In which case you see something like: $ make defconfig powerpc64le-linux-ld: unrecognised emulation mode: elf32ppc Supported emulations: elf64lppc elf32lppc elf32lppclinux elf32lppcsim /bin/sh: 1: [: -ge: unexpected operator The proper fix is to stop using 'override LD', but that will require a fair bit of testing. Instead we can fix it for now just by reordering the Makefile to do the version check earlier. Fixes: efe0160cfd40 ("powerpc/64: Linker on-demand sfpr functions for modules") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26powerpc/pseries: Fix of_node_put() underflow during reconfig removeLaurent Vivier
As for commit 68baf692c435 ("powerpc/pseries: Fix of_node_put() underflow during DLPAR remove"), the call to of_node_put() must be removed from pSeries_reconfig_remove_node(). dlpar_detach_node() and pSeries_reconfig_remove_node() both call of_detach_node(), and thus the node should not be released in both cases. Fixes: 0829f6d1f69e ("of: device_node kobject lifecycle fixes") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26powerpc/mm/radix: Workaround prefetch issue with KVMBenjamin Herrenschmidt
There's a somewhat architectural issue with Radix MMU and KVM. When coming out of a guest with AIL (Alternate Interrupt Location, ie, MMU enabled), we start executing hypervisor code with the PID register still containing whatever the guest has been using. The problem is that the CPU can (and will) then start prefetching or speculatively load from whatever host context has that same PID (if any), thus bringing translations for that context into the TLB, which Linux doesn't know about. This can cause stale translations and subsequent crashes. Fixing this in a way that is neither racy nor a huge performance impact is difficult. We could just make the host invalidations always use broadcast forms but that would hurt single threaded programs for example. We chose to fix it instead by partitioning the PID space between guest and host. This is possible because today Linux only use 19 out of the 20 bits of PID space, so existing guests will work if we make the host use the top half of the 20 bits space. We additionally add support for a property to indicate to Linux the size of the PID register which will be useful if we eventually have processors with a larger PID space available. There is still an issue with malicious guests purposefully setting the PID register to a value in the hosts PID range. Hopefully future HW can prevent that, but in the meantime, we handle it with a pair of kludges: - On the way out of a guest, before we clear the current VCPU in the PACA, we check the PID and if it's outside of the permitted range we flush the TLB for that PID. - When context switching, if the mm is "new" on that CPU (the corresponding bit was set for the first time in the mm cpumask), we check if any sibling thread is in KVM (has a non-NULL VCPU pointer in the PACA). If that is the case, we also flush the PID for that CPU (core). This second part is needed to handle the case where a process is migrated (or starts a new pthread) on a sibling thread of the CPU coming out of KVM, as there's a window where stale translations can exist before we detect it and flush them out. A future optimization could be added by keeping track of whether the PID has ever been used and avoid doing that for completely fresh PIDs. We could similarily mark PIDs that have been the subject of a global invalidation as "fresh". But for now this will do. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop unneeded include of kvm_book3s_asm.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-24KVM: PPC: Book3S HV: Fix host crash on changing HPT sizePaul Mackerras
Commit f98a8bf9ee20 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size", 2016-12-20) changed the behaviour of the KVM_PPC_ALLOCATE_HTAB ioctl so that it now allocates a new HPT and new revmap array if there was a previously-allocated HPT of a different size from the size being requested. In this case, we need to reset the rmap arrays of the memslots, because the rmap arrays will contain references to HPTEs which are no longer valid. Worse, these references are also references to slots in the new revmap array (which parallels the HPT), and the new revmap array contains random contents, since it doesn't get zeroed on allocation. The effect of having these stale references to slots in the revmap array that contain random contents is that subsequent calls to functions such as kvmppc_add_revmap_chain will crash because they will interpret the non-zero contents of the revmap array as HPTE indexes and thus index outside of the revmap array. This leads to host crashes such as the following. [ 7072.862122] Unable to handle kernel paging request for data at address 0xd000000c250c00f8 [ 7072.862218] Faulting instruction address: 0xc0000000000e1c78 [ 7072.862233] Oops: Kernel access of bad area, sig: 11 [#1] [ 7072.862286] SMP NR_CPUS=1024 [ 7072.862286] NUMA [ 7072.862325] PowerNV [ 7072.862378] Modules linked in: kvm_hv vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 mlx5_ib ib_core ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel i2c_opal nfsd auth_rpcgss oid_registry [ 7072.863085] nfs_acl lockd grace sunrpc kvm_pr kvm xfs libcrc32c scsi_dh_alua dm_service_time radeon lpfc nvme_fc nvme_fabrics nvme_core scsi_transport_fc i2c_algo_bit tg3 drm_kms_helper ptp pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm dm_multipath i2c_core cxgb3 mlx5_core mdio [last unloaded: kvm_hv] [ 7072.863381] CPU: 72 PID: 56929 Comm: qemu-system-ppc Not tainted 4.12.0-kvm+ #59 [ 7072.863457] task: c000000fe29e7600 task.stack: c000001e3ffec000 [ 7072.863520] NIP: c0000000000e1c78 LR: c0000000000e2e3c CTR: c0000000000e25f0 [ 7072.863596] REGS: c000001e3ffef560 TRAP: 0300 Not tainted (4.12.0-kvm+) [ 7072.863658] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]> [ 7072.863667] CR: 44082882 XER: 20000000 [ 7072.863767] CFAR: c0000000000e2e38 DAR: d000000c250c00f8 DSISR: 42000000 SOFTE: 1 GPR00: c0000000000e2e3c c000001e3ffef7e0 c000000001407d00 d000000c250c00f0 GPR04: d00000006509fb70 d00000000b3d2048 0000000003ffdfb7 0000000000000000 GPR08: 00000001007fdfb7 00000000c000000f d0000000250c0000 000000000070f7bf GPR12: 0000000000000008 c00000000fdad000 0000000010879478 00000000105a0d78 GPR16: 00007ffaf4080000 0000000000001190 0000000000000000 0000000000010000 GPR20: 4001ffffff000415 d00000006509fb70 0000000004091190 0000000ee1881190 GPR24: 0000000003ffdfb7 0000000003ffdfb7 00000000007fdfb7 c000000f5c958000 GPR28: d00000002d09fb70 0000000003ffdfb7 d00000006509fb70 d00000000b3d2048 [ 7072.864439] NIP [c0000000000e1c78] kvmppc_add_revmap_chain+0x88/0x130 [ 7072.864503] LR [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0 [ 7072.864566] Call Trace: [ 7072.864594] [c000001e3ffef7e0] [c000001e3ffef830] 0xc000001e3ffef830 (unreliable) [ 7072.864671] [c000001e3ffef830] [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0 [ 7072.864751] [c000001e3ffef920] [d00000000b38d878] kvmppc_map_vrma+0x168/0x200 [kvm_hv] [ 7072.864831] [c000001e3ffef9e0] [d00000000b38a684] kvmppc_vcpu_run_hv+0x1284/0x1300 [kvm_hv] [ 7072.864914] [c000001e3ffefb30] [d00000000f465664] kvmppc_vcpu_run+0x44/0x60 [kvm] [ 7072.865008] [c000001e3ffefb60] [d00000000f461864] kvm_arch_vcpu_ioctl_run+0x114/0x290 [kvm] [ 7072.865152] [c000001e3ffefbe0] [d00000000f453c98] kvm_vcpu_ioctl+0x598/0x7a0 [kvm] [ 7072.865292] [c000001e3ffefd40] [c000000000389328] do_vfs_ioctl+0xd8/0x8c0 [ 7072.865410] [c000001e3ffefde0] [c000000000389be4] SyS_ioctl+0xd4/0x130 [ 7072.865526] [c000001e3ffefe30] [c00000000000b760] system_call+0x58/0x6c [ 7072.865644] Instruction dump: [ 7072.865715] e95b2110 793a0020 7b4926e4 7f8a4a14 409e0098 807c000c 786326e4 7c6a1a14 [ 7072.865857] 935e0008 7bbd0020 813c000c 913e000c <93a30008> 93bc000c 48000038 60000000 [ 7072.866001] ---[ end trace 627b6e4bf8080edc ]--- Note that to trigger this, it is necessary to use a recent upstream QEMU (or other userspace that resizes the HPT at CAS time), specify a maximum memory size substantially larger than the current memory size, and boot a guest kernel that does not support HPT resizing. This fixes the problem by resetting the rmap arrays when the old HPT is freed. Fixes: f98a8bf9ee20 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size") Cc: stable@vger.kernel.org # v4.11+ Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-24KVM: PPC: Book3S HV: Enable TM before accessing TM registersPaul Mackerras
Commit 46a704f8409f ("KVM: PPC: Book3S HV: Preserve userspace HTM state properly", 2017-06-15) added code to read transactional memory (TM) registers but forgot to enable TM before doing so. The result is that if userspace does have live values in the TM registers, a KVM_RUN ioctl will cause a host kernel crash like this: [ 181.328511] Unrecoverable TM Unavailable Exception f60 at d00000001e7d9980 [ 181.328605] Oops: Unrecoverable TM Unavailable Exception, sig: 6 [#1] [ 181.328613] SMP NR_CPUS=2048 [ 181.328613] NUMA [ 181.328618] PowerNV [ 181.328646] Modules linked in: vhost_net vhost tap nfs_layout_nfsv41_files rpcsec_gss_krb5 nfsv4 dns_resolver nfs +fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat +nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables +ip6table_filter ip6_tables iptable_filter bridge stp llc kvm_hv kvm nfsd ses enclosure scsi_transport_sas ghash_generic +auth_rpcgss gf128mul xts sg ctr nfs_acl lockd vmx_crypto shpchp ipmi_powernv i2c_opal grace ipmi_devintf i2c_core +powernv_rng sunrpc ipmi_msghandler ibmpowernv uio_pdrv_genirq uio leds_powernv powernv_op_panel ip_tables xfs sd_mod +lpfc ipr bnx2x libata mdio ptp pps_core scsi_transport_fc libcrc32c dm_mirror dm_region_hash dm_log dm_mod [ 181.329278] CPU: 40 PID: 9926 Comm: CPU 0/KVM Not tainted 4.12.0+ #1 [ 181.329337] task: c000003fc6980000 task.stack: c000003fe4d80000 [ 181.329396] NIP: d00000001e7d9980 LR: d00000001e77381c CTR: d00000001e7d98f0 [ 181.329465] REGS: c000003fe4d837e0 TRAP: 0f60 Not tainted (4.12.0+) [ 181.329523] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> [ 181.329527] CR: 24022448 XER: 00000000 [ 181.329608] CFAR: d00000001e773818 SOFTE: 1 [ 181.329608] GPR00: d00000001e77381c c000003fe4d83a60 d00000001e7ef410 c000003fdcfe0000 [ 181.329608] GPR04: c000003fe4f00000 0000000000000000 0000000000000000 c000003fd7954800 [ 181.329608] GPR08: 0000000000000001 c000003fc6980000 0000000000000000 d00000001e7e2880 [ 181.329608] GPR12: d00000001e7d98f0 c000000007b19000 00000001295220e0 00007fffc0ce2090 [ 181.329608] GPR16: 0000010011886608 00007fff8c89f260 0000000000000001 00007fff8c080028 [ 181.329608] GPR20: 0000000000000000 00000100118500a6 0000010011850000 0000010011850000 [ 181.329608] GPR24: 00007fffc0ce1b48 0000010011850000 00000000d673b901 0000000000000000 [ 181.329608] GPR28: 0000000000000000 c000003fdcfe0000 c000003fdcfe0000 c000003fe4f00000 [ 181.330199] NIP [d00000001e7d9980] kvmppc_vcpu_run_hv+0x90/0x6b0 [kvm_hv] [ 181.330264] LR [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm] [ 181.330322] Call Trace: [ 181.330351] [c000003fe4d83a60] [d00000001e773478] kvmppc_set_one_reg+0x48/0x340 [kvm] (unreliable) [ 181.330437] [c000003fe4d83b30] [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm] [ 181.330513] [c000003fe4d83b50] [d00000001e7700b4] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm] [ 181.330586] [c000003fe4d83bd0] [d00000001e7642f8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm] [ 181.330658] [c000003fe4d83d40] [c0000000003451b8] do_vfs_ioctl+0xc8/0x8b0 [ 181.330717] [c000003fe4d83de0] [c000000000345a64] SyS_ioctl+0xc4/0x120 [ 181.330776] [c000003fe4d83e30] [c00000000000b004] system_call+0x58/0x6c [ 181.330833] Instruction dump: [ 181.330869] e92d0260 e9290b50 e9290108 792807e3 41820058 e92d0260 e9290b50 e9290108 [ 181.330941] 792ae8a4 794a1f87 408204f4 e92d0260 <7d4022a6> f9490ff0 e92d0260 7d4122a6 [ 181.331013] ---[ end trace 6f6ddeb4bfe92a92 ]--- The fix is just to turn on the TM bit in the MSR before accessing the registers. Cc: stable@vger.kernel.org # v3.14+ Fixes: 46a704f8409f ("KVM: PPC: Book3S HV: Preserve userspace HTM state properly") Reported-by: Jan Stancek <jstancek@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-22Merge tag 'tty-4.13-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are some small tty and serial driver fixes for 4.13-rc2. Nothing huge at all, a revert of a patch that turned out to break things, a fix up for a new tty ioctl we added in 4.13-rc1 to get the uapi definition correct, and a few minor serial driver fixes for reported issues. All of these have been in linux-next for a while with no reported issues" * tag 'tty-4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: tty: Fix TIOCGPTPEER ioctl definition tty: hide unused pty_get_peer function tty: serial: lpuart: Fix the logic for detecting the 32-bit type UART serial: imx: Prevent TX buffer PIO write when a DMA has been started Revert "serial: imx-serial - move DMA buffer configuration to DT" serial: sh-sci: Uninitialized variables in sysfs files serial: st-asc: Potential error pointer dereference
2017-07-21Merge tag 'powerpc-4.13-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "A handful of fixes, mostly for new code: - some reworking of the new STRICT_KERNEL_RWX support to make sure we also remove executable permission from __init memory before it's freed. - a fix to some recent optimisations to the hypercall entry where we were clobbering r12, this was breaking nested guests (PR KVM). - a fix for the recent patch to opal_configure_cores(). This could break booting on bare metal Power8 boxes if the kernel was built without CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG. - .. and finally a workaround for spurious PMU interrupts on Power9 DD2. Thanks to: Nicholas Piggin, Anton Blanchard, Balbir Singh" * tag 'powerpc-4.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y powerpc/mm/hash: Refactor hash__mark_rodata_ro() powerpc/mm/radix: Refactor radix__mark_rodata_ro() powerpc/64s: Fix hypercall entry clobbering r12 input powerpc/perf: Avoid spurious PMU interrupts after idle powerpc/powernv: Fix boot on Power8 bare metal due to opal_configure_cores()
2017-07-21Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Ingo Molnar: "A fix to WARN_ON_ONCE() done by modules, plus a MAINTAINERS update" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: debug: Fix WARN_ON_ONCE() for modules MAINTAINERS: Update the PTRACE entry
2017-07-20debug: Fix WARN_ON_ONCE() for modulesJosh Poimboeuf
Mike Galbraith reported a situation where a WARN_ON_ONCE() call in DRM code turned into an oops. As it turns out, WARN_ON_ONCE() seems to be completely broken when called from a module. The bug was introduced with the following commit: 19d436268dde ("debug: Add _ONCE() logic to report_bug()") That commit changed WARN_ON_ONCE() to move its 'once' logic into the bug trap handler. It requires a writable bug table so that the BUGFLAG_DONE bit can be written to the flags to indicate the first warning has occurred. The bug table was made writable for vmlinux, which relies on vmlinux.lds.S and vmlinux.lds.h for laying out the sections. However, it wasn't made writable for modules, which rely on the ELF section header flags. Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 19d436268dde ("debug: Add _ONCE() logic to report_bug()") Link: http://lkml.kernel.org/r/a53b04235a65478dd9afc51f5b329fdc65c84364.1500095401.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-18powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=yMichael Ellerman
Currently even with STRICT_KERNEL_RWX we leave the __init text marked executable after init, which is bad. Add a hook to mark it NX (no-execute) before we free it, and implement it for radix and hash. Note that we use __init_end as the end address, not _einittext, because overlaps_kernel_text() uses __init_end, because there are additional executable sections other than .init.text between __init_begin and __init_end. Tested on radix and hash with: 0:mon> p $__init_begin *** 400 exception occurred Fixes: 1e0fc9d1eb2b ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18powerpc/mm/hash: Refactor hash__mark_rodata_ro()Michael Ellerman
Move the core logic into a helper, so we can use it for changing other permissions. We also change the logic to align start down, and end up. This means calling the function with a range will expand that range to be at least 1 mmu_linear_psize page in size. We need that so we can use it on __init_begin ... __init_end which is not a full page in size. This should always work for _stext/__init_begin, because we align __init_begin to _stext + 16M in the linker script. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18powerpc/mm/radix: Refactor radix__mark_rodata_ro()Michael Ellerman
Move the core logic into a helper, so we can use it for changing permissions other than _PAGE_WRITE. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18powerpc/64s: Fix hypercall entry clobbering r12 inputNicholas Piggin
A previous optimisation incorrectly assumed the PAPR hcall does not use r12, and clobbers it upon entry. In fact it is used as an input. This can result in KVM guests crashing (observed with PR KVM). Instead of using r12 to save r13, tihs patch saves r13 in ctr. This is more costly, but not as slow as using the SPRG. Fixes: acd7d8cef0153 ("powerpc/64s: Optimize hypercall/syscall entry") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18powerpc/perf: Avoid spurious PMU interrupts after idleNicholas Piggin
POWER9 DD2 can see spurious PMU interrupts after state-loss idle in some conditions. A solution is to save and reload MMCR0 over state-loss idle. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-17tty: Fix TIOCGPTPEER ioctl definitionGleb Fotengauer-Malinovskiy
This ioctl does nothing to justify an _IOC_READ or _IOC_WRITE flag because it doesn't copy anything from/to userspace to access the argument. Fixes: 54ebbfb16034 ("tty: add TIOCGPTPEER ioctl") Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Acked-by: Aleksa Sarai <asarai@suse.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17powerpc/powernv: Fix boot on Power8 bare metal due to opal_configure_cores()Michael Ellerman
In commit 1c0eaf0f56d6 ("powerpc/powernv: Tell OPAL about our MMU mode on POWER9"), we added additional flags to the OPAL call to configure CPUs at boot. These flags only work on Power9 firmwares, and worse can cause boot failures on Power8 machines, so we check for CPU_FTR_ARCH_300 (aka POWER9) before adding the extra flags. Unfortunately we forgot that opal_configure_cores() is called before the CPU feature checks are dynamically patched, meaning the check always returns true. We definitely need to do something to make the CPU feature checks less prone to bugs like this, but for now the minimal fix is to use early_cpu_has_feature(). Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Fixes: 1c0eaf0f56d6 ("powerpc/powernv: Tell OPAL about our MMU mode on POWER9") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-15Merge branch 'work.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ->s_options removal from Al Viro: "Preparations for fsmount/fsopen stuff (coming next cycle). Everything gets moved to explicit ->show_options(), killing ->s_options off + some cosmetic bits around fs/namespace.c and friends. Basically, the stuff needed to work with fsmount series with minimum of conflicts with other work. It's not strictly required for this merge window, but it would reduce the PITA during the coming cycle, so it would be nice to have those bits and pieces out of the way" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: isofs: Fix isofs_show_options() VFS: Kill off s_options and helpers orangefs: Implement show_options 9p: Implement show_options isofs: Implement show_options afs: Implement show_options affs: Implement show_options befs: Implement show_options spufs: Implement show_options bpf: Implement show_options ramfs: Implement show_options pstore: Implement show_options omfs: Implement show_options hugetlbfs: Implement show_options VFS: Don't use save/replace_mount_options if not using generic_show_options VFS: Provide empty name qstr VFS: Make get_filesystem() return the affected filesystem VFS: Clean up whitespace in fs/namespace.c and fs/super.c Provide a function to create a NUL-terminated string from unterminated data
2017-07-15Merge branch 'work.uaccess-unaligned' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull uacess-unaligned removal from Al Viro: "That stuff had just one user, and an exotic one, at that - binfmt_flat on arm and m68k" * 'work.uaccess-unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill {__,}{get,put}_user_unaligned() binfmt_flat: flat_{get,put}_addr_from_rp() should be able to fail
2017-07-14Merge tag 'powerpc-4.13-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Nothing that really stands out, just a bunch of fixes that have come in in the last couple of weeks. None of these are actually fixes for code that is new in 4.13. It's roughly half older bugs, with fixes going to stable, and half fixes/updates for Power9. Thanks to: Aneesh Kumar K.V, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt, Madhavan Srinivasan, Michael Neuling, Nicholas Piggin, Oliver O'Halloran" * tag 'powerpc-4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64: Fix atomic64_inc_not_zero() to return an int powerpc: Fix emulation of mfocrf in emulate_step() powerpc: Fix emulation of mcrf in emulate_step() powerpc/perf: Add POWER9 alternate PM_RUN_CYC and PM_RUN_INST_CMPL events powerpc/perf: Fix SDAR_MODE value for continous sampling on Power9 powerpc/asm: Mark cr0 as clobbered in mftb() powerpc/powernv: Fix local TLB flush for boot and MCE on POWER9 powerpc/mm/radix: Synchronize updates to the process table powerpc/mm/radix: Properly clear process table entry powerpc/powernv: Tell OPAL about our MMU mode on POWER9 powerpc/kexec: Fix radix to hash kexec due to IAMR/AMOR
2017-07-12mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko
semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12powerpc,mmap: properly account for stack randomization in mmap_baseRik van Riel
When RLIMIT_STACK is, for example, 256MB, the current code results in a gap between the top of the task and mmap_base of 256MB, failing to take into account the amount by which the stack address was randomized. In other words, the stack gets less than RLIMIT_STACK space. Ensure that the gap between the stack and mmap_base always takes stack randomization and the stack guard gap into account. Inspired by Daniel Micay's linux-hardened tree. Link: http://lkml.kernel.org/r/20170622200033.25714-4-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Florian Weimer <fweimer@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Hugh Dickins <hughd@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12include/linux/string.h: add the option of fortified string.h functionsDaniel Micay
This adds support for compiling with a rough equivalent to the glibc _FORTIFY_SOURCE=1 feature, providing compile-time and runtime buffer overflow checks for string.h functions when the compiler determines the size of the source or destination buffer at compile-time. Unlike glibc, it covers buffer reads in addition to writes. GNU C __builtin_*_chk intrinsics are avoided because they would force a much more complex implementation. They aren't designed to detect read overflows and offer no real benefit when using an implementation based on inline checks. Inline checks don't add up to much code size and allow full use of the regular string intrinsics while avoiding the need for a bunch of _chk functions and per-arch assembly to avoid wrapper overhead. This detects various overflows at compile-time in various drivers and some non-x86 core kernel code. There will likely be issues caught in regular use at runtime too. Future improvements left out of initial implementation for simplicity, as it's all quite optional and can be done incrementally: * Some of the fortified string functions (strncpy, strcat), don't yet place a limit on reads from the source based on __builtin_object_size of the source buffer. * Extending coverage to more string functions like strlcat. * It should be possible to optionally use __builtin_object_size(x, 1) for some functions (C strings) to detect intra-object overflows (like glibc's _FORTIFY_SOURCE=2), but for now this takes the conservative approach to avoid likely compatibility issues. * The compile-time checks should be made available via a separate config option which can be enabled by default (or always enabled) once enough time has passed to get the issues it catches fixed. Kees said: "This is great to have. While it was out-of-tree code, it would have blocked at least CVE-2016-3858 from being exploitable (improper size argument to strlcpy()). I've sent a number of fixes for out-of-bounds-reads that this detected upstream already" [arnd@arndb.de: x86: fix fortified memcpy] Link: http://lkml.kernel.org/r/20170627150047.660360-1-arnd@arndb.de [keescook@chromium.org: avoid panic() in favor of BUG()] Link: http://lkml.kernel.org/r/20170626235122.GA25261@beast [keescook@chromium.org: move from -mm, add ARCH_HAS_FORTIFY_SOURCE, tweak Kconfig help] Link: http://lkml.kernel.org/r/20170526095404.20439-1-danielmicay@gmail.com Link: http://lkml.kernel.org/r/1497903987-21002-8-git-send-email-keescook@chromium.org Signed-off-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Daniel Axtens <dja@axtens.net> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12powerpc: make feature-fixup tests fortify-safeDaniel Axtens
Testing the fortified string functions[1] would cause a kernel panic on boot in test_feature_fixups() due to a buffer overflow in memcmp. This boils down to things like this: extern unsigned int ftr_fixup_test1; extern unsigned int ftr_fixup_test1_orig; check(memcmp(&ftr_fixup_test1, &ftr_fixup_test1_orig, size) == 0); We know that these are asm labels so it is safe to read up to 'size' bytes at those addresses. However, because we have passed the address of a single unsigned int to memcmp, the compiler believes the underlying object is in fact a single unsigned int. So if size > sizeof(unsigned int), there will be a panic at runtime. We can fix this by changing the types: instead of calling the asm labels unsigned ints, call them unsigned int[]s. Therefore the size isn't incorrectly determined at compile time and we get a regular unsafe memcmp and no panic. [1] http://openwall.com/lists/kernel-hardening/2017/05/09/2 Link: http://lkml.kernel.org/r/1497903987-21002-7-git-send-email-keescook@chromium.org Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Cc: Kees Cook <keescook@chromium.org> Cc: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12powerpc: don't fortify prom_initDaniel Axtens
prom_init is a bit special; in theory it should be able to be linked separately to the kernel. To keep this from getting too complex, the symbols that prom_init.c uses are checked. Fortification adds symbols, and it gets quite messy as it includes things like panic(). So just don't fortify prom_init.c for now. Link: http://lkml.kernel.org/r/1497903987-21002-6-git-send-email-keescook@chromium.org Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12powerpc/64s: implement arch-specific hardlockup watchdogNicholas Piggin
Implement an arch-speicfic watchdog rather than use the perf-based hardlockup detector. The new watchdog takes the soft-NMI directly, rather than going through perf. Perf interrupts are to be made maskable in future, so that would prevent the perf detector from working in those regions. Additionally, implement a SMP based detector where all CPUs watch one another by pinging a shared cpumask. This is because powerpc Book3S does not have a true periodic local NMI, but some platforms do implement a true NMI IPI. If a CPU is stuck with interrupts hard disabled, the soft-NMI watchdog does not work, but the SMP watchdog will. Even on platforms without a true NMI IPI to get a good trace from the stuck CPU, other CPUs will notice the lockup sufficiently to report it and panic. [npiggin@gmail.com: honor watchdog disable at boot/hotplug] Link: http://lkml.kernel.org/r/20170621001346.5bb337c9@roar.ozlabs.ibm.com [npiggin@gmail.com: fix false positive warning at CPU unplug] Link: http://lkml.kernel.org/r/20170630080740.20766-1-npiggin@gmail.com [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170616065715.18390-6-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Don Zickus <dzickus@redhat.com> Tested-by: Babu Moger <babu.moger@oracle.com> [sparc] Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12kernel/watchdog: split up config optionsNicholas Piggin
Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR. LOCKUP_DETECTOR implies the general boot, sysctl, and programming interfaces for the lockup detectors. An architecture that wants to use a hard lockup detector must define HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH. Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and does not implement the LOCKUP_DETECTOR interfaces. sparc is unusual in that it has started to implement some of the interfaces, but not fully yet. It should probably be converted to a full HAVE_HARDLOCKUP_DETECTOR_ARCH. [npiggin@gmail.com: fix] Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Tested-by: Babu Moger <babu.moger@oracle.com> [sparc] Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>