summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2022-12-07powerpc/rtas: define pr_fmt and convert printk call sitesNathan Lynch
Set pr_fmt to "rtas: " and convert the handful of printk() uses in rtas.c, adjusting the messages to remove now-redundant "RTAS" strings. Note that rtas_restart(), rtas_power_off(), and rtas_halt() all currently use printk() without specifying a log level. These have been changed to use pr_emerg(), which matches the behavior of rtas_os_term(). Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221118150751.469393-9-nathanl@linux.ibm.com
2022-12-07powerpc/rtas: clean up includesNathan Lynch
rtas.c used to host complex code related to pseries-specific guest migration and suspend, which used atomics, completions, hcalls, and CPU hotplug APIs. That's all been deleted or moved, so remove the include directives that have been rendered unnecessary. Sort the remainder (with linux/ before asm/) to impose some order on where future additions go. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221118150751.469393-8-nathanl@linux.ibm.com
2022-12-07powerpc/rtas: clean up rtas_error_log_max initializationNathan Lynch
The code in rtas_get_error_log_max() doesn't cause problems in practice, but there are no measures to ensure that the lazy initialization of the static rtas_error_log_max variable is atomic, and it's not worth adding them. Initialize the static rtas_error_log_max variable at boot when we're single-threaded instead of lazily on first use. Use the more appropriate of_property_read_u32() API instead of rtas_token() to consult the "rtas-error-log-max" property, which is not the name of an RTAS function. Convert use of printk() to pr_warn() and distinguish the possible error cases. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221118150751.469393-7-nathanl@linux.ibm.com
2022-12-07powerpc/pseries/eeh: use correct API for error log sizeNathan Lynch
rtas-error-log-max is not the name of an RTAS function, so rtas_token() is not the appropriate API for retrieving its value. We already have rtas_get_error_log_max() which returns a sensible value if the property is absent for any reason, so use that instead. Fixes: 8d633291b4fc ("powerpc/eeh: pseries platform EEH error log retrieval") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> [mpe: Drop no-longer possible error handling as noticed by ajd] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221118150751.469393-6-nathanl@linux.ibm.com
2022-12-07powerpc/rtas: avoid scheduling in rtas_os_term()Nathan Lynch
It's unsafe to use rtas_busy_delay() to handle a busy status from the ibm,os-term RTAS function in rtas_os_term(): Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:618 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0 preempt_count: 2, expected: 0 CPU: 7 PID: 1 Comm: swapper/0 Tainted: G D 6.0.0-rc5-02182-gf8553a572277-dirty #9 Call Trace: [c000000007b8f000] [c000000001337110] dump_stack_lvl+0xb4/0x110 (unreliable) [c000000007b8f040] [c0000000002440e4] __might_resched+0x394/0x3c0 [c000000007b8f0e0] [c00000000004f680] rtas_busy_delay+0x120/0x1b0 [c000000007b8f100] [c000000000052d04] rtas_os_term+0xb8/0xf4 [c000000007b8f180] [c0000000001150fc] pseries_panic+0x50/0x68 [c000000007b8f1f0] [c000000000036354] ppc_panic_platform_handler+0x34/0x50 [c000000007b8f210] [c0000000002303c4] notifier_call_chain+0xd4/0x1c0 [c000000007b8f2b0] [c0000000002306cc] atomic_notifier_call_chain+0xac/0x1c0 [c000000007b8f2f0] [c0000000001d62b8] panic+0x228/0x4d0 [c000000007b8f390] [c0000000001e573c] do_exit+0x140c/0x1420 [c000000007b8f480] [c0000000001e586c] make_task_dead+0xdc/0x200 Use rtas_busy_delay_time() instead, which signals without side effects whether to attempt the ibm,os-term RTAS call again. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221118150751.469393-5-nathanl@linux.ibm.com
2022-12-07powerpc/rtas: avoid device tree lookups in rtas_os_term()Nathan Lynch
rtas_os_term() is called during panic. Its behavior depends on a couple of conditions in the /rtas node of the device tree, the traversal of which entails locking and local IRQ state changes. If the kernel panics while devtree_lock is held, rtas_os_term() as currently written could hang. Instead of discovering the relevant characteristics at panic time, cache them in file-static variables at boot. Note the lookup for "ibm,extended-os-term" is converted to of_property_read_bool() since it is a boolean property, not an RTAS function token. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> [mpe: Incorporate suggested change from Nick] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221118150751.469393-4-nathanl@linux.ibm.com
2022-12-07powerpc/rtasd: use correct OF API for event scan rateNathan Lynch
rtas_token() should be used only for properties that are RTAS function tokens. "rtas-event-scan-rate" does not contain a function token, but it has the same size/format as token properties so reading it with rtas_token() happens to work. Convert to of_property_read_u32(). Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221118150751.469393-3-nathanl@linux.ibm.com
2022-12-07powerpc/rtas: document rtas_call()Nathan Lynch
rtas_call() has a complex calling convention, non-standard return values, and many users. Add kernel-doc for it and remove the less structured commentary from rtas.h. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221118150751.469393-2-nathanl@linux.ibm.com
2022-12-07powerpc/pseries: unregister VPA when hot unplugging a CPULaurent Dufour
The VPA should unregister when offlining a CPU. Otherwise there could be a short window where 2 CPUs could share the same VPA. This happens because the hypervisor is still keeping the VPA attached to the vCPU even if it became offline. Here is a potential situation: 1. remove proc A, 2. add proc B. If proc B gets proc A's place in cpu_present_mask, then it registers proc A's VPAs. 3. If proc B is then re-added to the LP, its threads are sharing VPAs with proc A briefly as they come online. As the hypervisor may check for the VPA's yield_count field oddity, it may detect an unexpected value and kill the LPAR. Suggested-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com> [mpe: s/cpu_present_map/cpu_present_mask/ in change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221114160150.13554-1-ldufour@linux.ibm.com
2022-12-07powerpc/pseries: reset the RCU watchdogs after a LPMLaurent Dufour
The RCU watchdog timer should be reset when restarting the CPU after a Live Partition Mobility operation. Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Combine comments into a single comment block] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221125173204.15329-1-ldufour@linux.ibm.com
2022-12-07powerpc: Take in account addition CPU node when building kexec FDTLaurent Dufour
On a system with a large number of CPUs, the creation of the FDT for a kexec kernel may fail because the allocated FDT is not large enough. When this happens, such a message is displayed on the console: Unable to add ibm,processor-vadd-size property: FDT_ERR_NOSPACE The property's name may change depending when the buffer overwrite is detected. Obviously the created FDT is missing information, and it is expected that system dump or kexec kernel failed to run properly. When the FDT is allocated, the size of the FDT the kernel received at boot time is used and an extra size can be applied. Currently, only memory added after boot time is taken in account, not the CPU nodes. The extra size should take in account these additional CPU nodes and compute the required extra space. To achieve that, the size of a CPU node, including its subnode is computed once and multiplied by the number of additional CPU nodes. The assumption is that the size of the CPU node is _same_ for all the node, the only variable part should be the name "PowerPC,POWERxx@##" where "##" may vary a little. Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> [mpe: Don't shadow function name w/variable, minor coding style changes] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221110180619.15796-3-ldufour@linux.ibm.com
2022-12-07powerpc: export the CPU node countLaurent Dufour
At boot time, the FDT is parsed to compute the number of CPUs. In addition count the number of CPU nodes and export it. This is useful when building the FDT for a kexeced kernel since we need to take in account the CPU node added since the boot time during CPU hotplug operations. Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221110180619.15796-2-ldufour@linux.ibm.com
2022-12-06powerpc/dts/fsl: Fix pca954x i2c-mux node namesGeert Uytterhoeven
"make dtbs_check": arch/powerpc/boot/dts/fsl/t1040rdb-rev-a.dtb: pca9546@77: $nodename:0: 'pca9546@77' does not match '^(i2c-?)?mux' From schema: Documentation/devicetree/bindings/i2c/i2c-mux-pca954x.yaml arch/powerpc/boot/dts/fsl/t1024qds.dtb: pca9547@77: Unevaluated properties are not allowed ('#address-cells', '#size-cells', 'i2c@0', 'i2c@2', 'i2c@3' were unexpected) From schema: Documentation/devicetree/bindings/i2c/i2c-mux-pca954x.yaml ... Fix this by renaming pca954x nodes to "i2c-mux", to match the I2C bus multiplexer/switch DT bindings and the Generic Names Recommendation in the Devicetree Specification. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/6c5d86c49ac170e9d56ab121ea0602f3873849ca.1669999298.git.geert+renesas@glider.be
2022-12-02powerpc/code-patching: Remove protection against patching init addresses ↵Christophe Leroy
after init Once init section is freed, attempting to patch init code ends up in the weed. Commit 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections") protected patch_instruction() against that, but it is the responsibility of the caller to ensure that the patched memory is valid. All callers have now been verified and fixed so the check can be removed. This improves ftrace activation by about 2% on 8xx. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/504310828f473d424e2ed229eff57bf075f52796.1669969781.git.christophe.leroy@csgroup.eu
2022-12-02powerpc/feature-fixups: Do not patch init section after initChristophe Leroy
Once init section is freed, attempting to patch init code ends up in the weed. Commit 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections") protected patch_instruction() against that, but it is the responsibility of the caller to ensure that the patched memory is valid. In the same spirit as jump_label with its jump_label_can_update() function, add is_fixup_addr_valid() function to skip patching on freed init section. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/8e9311fc1b057e4e6a2a3a0701ebcc74b787affe.1669969781.git.christophe.leroy@csgroup.eu
2022-12-02powerpc/feature-fixups: Refactor other fixups patchingChristophe Leroy
Several fonctions have the same loop for patching instructions. Introduce function do_patch_fixups() to refactor those loops. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/58ab36949c18f94d466fc98d6c085783b0cd474f.1669969781.git.christophe.leroy@csgroup.eu
2022-12-02powerpc/feature-fixups: Refactor entry fixups patchingChristophe Leroy
Several fonctions have the same loop for patching instructions. Introduce function do_patch_entry_fixups() to refactor those loops. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/79eeff7b20a98f7136da5f79b1f7c436928f27f3.1669969781.git.christophe.leroy@csgroup.eu
2022-12-02powerpc/code-patching: Remove #ifdef CONFIG_STRICT_KERNEL_RWXChristophe Leroy
No need to have one implementation of patch_instruction() for CONFIG_STRICT_KERNEL_RWX and one for !CONFIG_STRICT_KERNEL_RWX. In patch_instruction(), call raw_patch_instruction() when !CONFIG_STRICT_KERNEL_RWX. In poking_init(), bail out immediately, it will be equivalent to the weak default implementation. Everything else is declared static and will be discarded by GCC when !CONFIG_STRICT_KERNEL_RWX. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/f67d2a109404d03e8fdf1ea15388c8778337a76b.1669969781.git.christophe.leroy@csgroup.eu
2022-12-02powerpc/ftrace: fix syscall tracing on PPC64_ELF_ABI_V1Michael Jeanson
In v5.7 the powerpc syscall entry/exit logic was rewritten in C, on PPC64_ELF_ABI_V1 this resulted in the symbols in the syscall table changing from their dot prefixed variant to the non-prefixed ones. Since ftrace prefixes a dot to the syscall names when matching them to build its syscall event list, this resulted in no syscall events being available. Remove the PPC64_ELF_ABI_V1 specific version of arch_syscall_match_sym_name to have the same behavior across all powerpc variants. Fixes: 68b34588e202 ("powerpc/64/sycall: Implement syscall entry/exit logic in C") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221201161442.2127231-1-mjeanson@efficios.com
2022-12-02powerpc/64: Sanitise user registers on interrupt in pseries, POWERNVRohan McLure
Cause pseries and POWERNV platforms to default to zeroising all potentially user-defined registers when entering the kernel by means of any interrupt source, reducing user-influence of the kernel and the likelihood or producing speculation gadgets. Acked-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221201071019.1953023-7-rmclure@linux.ibm.com
2022-12-02powerpc/64e: Clear gprs on interrupt routine entry on Book3ERohan McLure
Zero GPRS r14-r31 on entry into the kernel for interrupt sources to limit influence of user-space values in potential speculation gadgets. Prior to this commit, all other GPRS are reassigned during the common prologue to interrupt handlers and so need not be zeroised explicitly. This may be done safely, without loss of register state prior to the interrupt, as the common prologue saves the initial values of non-volatiles, which are unconditionally restored in interrupt_64.S. Mitigation defaults to enabled by INTERRUPT_SANITIZE_REGISTERS. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221201071019.1953023-6-rmclure@linux.ibm.com
2022-12-02powerpc/64s: Zeroise gprs on interrupt routine entry on Book3SRohan McLure
Zeroise user state in gprs (assign to zero) to reduce the influence of user registers on speculation within kernel syscall handlers. Clears occur at the very beginning of the sc and scv 0 interrupt handlers, with restores occurring following the execution of the syscall handler. Zeroise GPRS r0, r2-r11, r14-r31, on entry into the kernel for all other interrupt sources. The remaining gprs are overwritten by entry macros to interrupt handlers, irrespective of whether or not a given handler consumes these register values. If an interrupt does not select the IMSR_R12 IOption, zeroise r12. Prior to this commit, r14-r31 are restored on a per-interrupt basis at exit, but now they are always restored on 64bit Book3S. Remove explicit REST_NVGPRS invocations on 64-bit Book3S. 32-bit systems do not clear user registers on interrupt, and continue to depend on the return value of interrupt_exit_user_prepare to determine whether or not to restore non-volatiles. The mmap_bench benchmark in selftests should rapidly invoke pagefaults. See ~0.8% performance regression with this mitigation, but this indicates the worst-case performance due to heavier-weight interrupt handlers. This mitigation is able to be enabled/disabled through CONFIG_INTERRUPT_SANITIZE_REGISTERS. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221201071019.1953023-5-rmclure@linux.ibm.com
2022-12-02powerpc/64s: IOption for MSR stored in r12Rohan McLure
Interrupt handlers in asm/exceptions-64s.S contain a great deal of common code produced by the GEN_COMMON macros. Currently, at the exit point of the macro, r12 will contain the contents of the MSR. A future patch will cause these macros to zeroise architected registers to avoid potential speculation influence of user data. Provide an IOption that signals that r12 must be retained, as the interrupt handler assumes it to hold the contents of the MSR. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221201071019.1953023-4-rmclure@linux.ibm.com
2022-12-02powerpc/64: Sanitise common exit code for interruptsRohan McLure
Interrupt code is shared between Book3E/S 64-bit systems for interrupt handlers. Ensure that exit code correctly restores non-volatile gprs on each system when CONFIG_INTERRUPT_SANITIZE_REGISTERS is enabled. Also introduce macros for clearing/restoring registers on interrupt entry for when this configuration option is either disabled or enabled. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221201071019.1953023-3-rmclure@linux.ibm.com
2022-12-02powerpc/64: Add interrupt register sanitisation macrosRohan McLure
Include in asm/ppc_asm.h macros to be used in multiple successive patches to implement zeroising architected registers in interrupt handlers. Registers will be sanitised in this fashion in future patches to reduce the speculation influence of user-controlled register values. These mitigations will be configurable through the CONFIG_INTERRUPT_SANITIZE_REGISTERS Kconfig option. Included are macros for conditionally zeroising registers and restoring as required with the mitigation enabled. With the mitigation disabled, non-volatiles must be restored on demand at separate locations to those required by the mitigation. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221201071019.1953023-2-rmclure@linux.ibm.com
2022-12-02powerpc/64: Add INTERRUPT_SANITIZE_REGISTERS KconfigRohan McLure
Add Kconfig option for enabling clearing of registers on arrival in an interrupt handler. This reduces the speculation influence of registers on kernel internals. The option will be consumed by 64-bit systems that feature speculation and wish to implement this mitigation. This patch only introduces the Kconfig option, no actual mitigations. The primary overhead of this mitigation lies in an increased number of registers that must be saved and restored by interrupt handlers on Book3S systems. Enable by default on Book3E systems, which prior to this patch eagerly save and restore register state, meaning that the mitigation when implemented will have minimal overhead. Acked-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221201071019.1953023-1-rmclure@linux.ibm.com
2022-12-02powerpc/hv-gpci: Fix hv_gpci event listKajol Jain
Based on getPerfCountInfo v1.018 documentation, some of the hv_gpci events were deprecated for platform firmware that supports counter_info_version 0x8 or above. Fix the hv_gpci event list by adding a new attribute group called "hv_gpci_event_attrs_v6" and a "ENABLE_EVENTS_COUNTERINFO_V6" macro to enable these events for platform firmware that supports counter_info_version 0x6 or below. And assigning the hv_gpci event list based on output counter info version of underlying plaform. Fixes: 97bf2640184f ("powerpc/perf/hv-gpci: add the remaining gpci requests") Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Reviewed-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221130174513.87501-1-kjain@linux.ibm.com
2022-12-02powerpc/83xx/mpc832x_rdb: call platform_device_put() in error case in ↵Yang Yingliang
of_fsl_spi_probe() If platform_device_add() is not called or failed, it can not call platform_device_del() to clean up memory, it should call platform_device_put() in error case. Fixes: 26f6cb999366 ("[POWERPC] fsl_soc: add support for fsl_spi") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221029111626.429971-1-yangyingliang@huawei.com
2022-12-02Merge branch 'topic/qspinlock' into nextMichael Ellerman
Merge Nick's powerpc qspinlock implementation. From his cover letter: This replaces the generic queued spinlock code (like s390 does) with our own implementation. Generic PV qspinlock code is causing latency / starvation regressions on large systems that are resulting in hard lockups reported (mostly in pathoogical cases). The generic qspinlock code has a number of issues important for powerpc hardware and hypervisors that aren't easily solved without changing code that would impact other architectures. Follow s390's lead and implement our own for now. Issues for powerpc using generic qspinlocks: - The previous lock value should not be loaded with simple loads, and need not be passed around from previous loads or cmpxchg results, because powerpc uses ll/sc-style atomics which can perform more complex operations that do not require this. powerpc implementations tend to prefer loads use larx for improved coherency performance. - The queueing process should absolutely minimise the number of stores to the lock word to reduce exclusive coherency probes, important for large system scalability. The pending logic is counter productive here. - Non-atomic unlock for paravirt locks is important (atomic instructions tend to still be more expensive than x86 CPUs). - Yielding to the lock owner is important in the oversubscribed paravirt case, which requires storing the owner CPU in the lock word. - More control of lock stealing for the paravirt case is important to keep latency down on large systems. - The lock acquisition operation should always be made with a special variant of atomic instructions with the lock hint bit set, including (especially) in the queueing paths. This is more a matter of adding more arch lock helpers so not an insurmountable problem for generic code.
2022-12-02powerpc/64s/hash: add stress_hpt kernel boot option to increase hash faultsNicholas Piggin
This option increases the number of hash misses by limiting the number of kernel HPT entries, by keeping a per-CPU record of the last kernel HPTEs installed, and removing that from the hash table on the next hash insertion. A timer round-robins CPUs removing remaining kernel HPTEs and clearing the TLB (in the case of bare metal) to increase and slightly randomise kernel fault activity. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Add comment about NR_CPUS usage, fixup whitespace] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221024030150.852517-1-npiggin@gmail.com
2022-12-02powerpc: remove STACK_FRAME_OVERHEADNicholas Piggin
This is equal to STACK_FRAME_MIN_SIZE on 32-bit and 64-bit ELFv1, and no longer used in 64-bit ELFv2, so replace STACK_FRAME_OVERHEAD occurrences with STACK_FRAME_MIN_SIZE. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-18-npiggin@gmail.com
2022-12-02powerpc/64: ELFv2 use minimal stack frames in int and switch frame sizesNicholas Piggin
Adjust the ELFv2 interrupt and switch frames to the minimum C ABI size, plus pt_regs, plus 16 bytes for the aligned regs marker for the int frame (and the switch frame needs to match that because it uses the same regs offset as the int frame). This saves 80 bytes of kernel stack per interrupt. It's the principle of getting our accounting right that's more important than the practical saving. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-17-npiggin@gmail.com
2022-12-02powerpc: allow minimum sized kernel stack framesNicholas Piggin
This affects only 64-bit ELFv2 kernels, and reduces the minimum asm-created stack frame size from 112 to 32 byte on those kernels. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-16-npiggin@gmail.com
2022-12-02powerpc: split validate_sp into two functionsNicholas Piggin
Most callers just want to validate an arbitrary kernel stack pointer, some need a particular size. Make the size case the exceptional one with an extra function. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-15-npiggin@gmail.com
2022-12-02powerpc: copy_thread add a back chain to the switch stack frameNicholas Piggin
Stack unwinders need LR and the back chain as a minimum. The switch stack uses regs->nip for its return pointer rather than lrsave, so that was not set in the fork frame, and neither was the back chain. This change sets those fields in the stack. With this and the previous change, a stack trace in the switch or interrupt stack goes from looking like this: Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 3 PID: 90 Comm: systemd Not tainted NIP: c000000000011060 LR: c000000000010f68 CTR: 0000000000007fff [ ... regs ... ] NIP [c000000000011060] _switch+0x160/0x17c LR [c000000000010f68] _switch+0x68/0x17c Call Trace: To this: Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries CPU: 0 PID: 93 Comm: systemd Not tainted NIP: c000000000011060 LR: c000000000010f68 CTR: 0000000000007fff [ ... regs ... ] NIP [c000000000011060] _switch+0x160/0x17c LR [c000000000010f68] _switch+0x68/0x17c Call Trace: [c000000005a93e10] [c00000000000cdbc] ret_from_fork_scv+0x0/0x54 --- interrupt: 3000 at 0x7fffa72f56d8 NIP: 00007fffa72f56d8 LR: 0000000000000000 CTR: 0000000000000000 [ ... regs ... ] NIP [00007fffa72f56d8] 0x7fffa72f56d8 LR [0000000000000000] 0x0 --- interrupt: 3000 Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-14-npiggin@gmail.com
2022-12-02powerpc: copy_thread fill in interrupt frame marker and back chainNicholas Piggin
Backtraces will not recognise the fork system call interrupt without the regs marker. And regular interrupt entry from userspace creates the back chain to the user stack, so do this for the initial fork frame too, to be consistent. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-13-npiggin@gmail.com
2022-12-02powerpc: add a define for the switch frame size and regs offsetNicholas Piggin
This is open-coded in process.c, ppc32 uses a different define with the same value, and the C definition is name differently which makes it an extra indirection to grep for. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-12-npiggin@gmail.com
2022-12-02powerpc: add a define for the user interrupt frame sizeNicholas Piggin
The user interrupt frame is a different size from the kernel frame, so give it its own name. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-11-npiggin@gmail.com
2022-12-02powerpc: Rename STACK_FRAME_MARKER and derive it from frame offsetNicholas Piggin
This is a count of longs from the stack pointer to the regs marker. Rename it to make it more distinct from the other byte offsets. It can be derived from the byte offset definitions just added. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-10-npiggin@gmail.com
2022-12-02powerpc: add a definition for the marker offset within the interrupt frameNicholas Piggin
Define a constant rather than open-code the offset for the "regs" marker. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-9-npiggin@gmail.com
2022-12-02powerpc: add definition for pt_regs offset within an interrupt frameNicholas Piggin
This is a common offset that currently uses the overloaded STACK_FRAME_OVERHEAD constant. It's easier to read and more flexible to use a specific regs offset for this. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-8-npiggin@gmail.com
2022-12-02powerpc: simplify ppc_save_regsNicholas Piggin
Adjust the pt_regs pointer so the interrupt frame offsets can be used to save registers. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-7-npiggin@gmail.com
2022-12-02powerpc/pseries: hvcall stack frame overheadNicholas Piggin
This call may use the min size stack frame. The scratch space used is in the caller's parameter area frame, not this function's frame. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-6-npiggin@gmail.com
2022-12-02powerpc: Rearrange copy_thread child stack creationNicholas Piggin
This makes it a bit clearer where the stack frame is created, and will allow easier use of some of the stack offset constants in a later change. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-5-npiggin@gmail.com
2022-12-02powerpc/perf: callchain validate kernel stack pointer boundsNicholas Piggin
The interrupt frame detection and loads from the hypothetical pt_regs are not bounds-checked. The next-frame validation only bounds-checks STACK_FRAME_OVERHEAD, which does not include the pt_regs. Add another test for this. The user could set r1 to be equal to the address matching the first interrupt frame - STACK_INT_FRAME_SIZE, which is in the previous page due to the kernel redzone, and induce the kernel to load the marker from there. Possibly this could cause a crash at least. If the user could induce the previous page to contain a valid marker, then it might be able to direct perf to read specific memory addresses in a way that could be transmitted back to the user in the perf data. Fixes: 20002ded4d93 ("perf_counter: powerpc: Add callchain support") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-4-npiggin@gmail.com
2022-12-02powerpc/64: Remove asm interrupt tracing call helpersNicholas Piggin
These are now unused. Remove. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221127124942.1665522-3-npiggin@gmail.com
2022-12-02powerpc/64: Option to build big-endian with ELFv2 ABINicholas Piggin
Provide an option to build big-endian kernels using the ELFv2 ABI. This works on GCC only for now. Clang is rumored to support this, but core build files need updating first, at least. This gives big-endian kernels useful advantages of the ELFv2 ABI, e.g., less stack usage, -mprofile-kernel support, better compatibility with eBPF tools. BE+ELFv2 is not officially supported by the GNU toolchain, but it works fine in testing and has been used by some userspace for some time (e.g., Void Linux). Tested-by: Michal Suchánek <msuchanek@suse.de> Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221128041539.1742489-5-npiggin@gmail.com
2022-12-02powerpc/64: Add module check for ELF ABI versionNicholas Piggin
Override the generic module ELF check to provide a check for the ELF ABI version. This becomes important if we allow big-endian ELF ABI V2 builds but it doesn't hurt to check now. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221128041539.1742489-3-npiggin@gmail.com
2022-12-02powerpc/code-patching: Consolidate and cache per-cpu patching contextBenjamin Gray
With the temp mm context support, there are CPU local variables to hold the patch address and pte. Use these in the non-temp mm path as well instead of adding a level of indirection through the text_poke_area vm_struct and pointer chasing the pte. As both paths use these fields now, there is no need to let unreferenced variables be dropped by the compiler, so it is cleaner to merge them into a single context struct. This has the additional benefit of removing a redundant CPU local pointer, as only one of cpu_patching_mm / text_poke_area is ever used, while remaining well-typed. It also groups each CPU's data into a single cacheline. Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> [mpe: Shorten name to 'area' as suggested by Christophe] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221109045112.187069-10-bgray@linux.ibm.com
2022-12-02powerpc/code-patching: Use temporary mm for Radix MMUChristopher M. Riedl
x86 supports the notion of a temporary mm which restricts access to temporary PTEs to a single CPU. A temporary mm is useful for situations where a CPU needs to perform sensitive operations (such as patching a STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing said mappings to other CPUs. Another benefit is that other CPU TLBs do not need to be flushed when the temporary mm is torn down. Mappings in the temporary mm can be set in the userspace portion of the address-space. Interrupts must be disabled while the temporary mm is in use. HW breakpoints, which may have been set by userspace as watchpoints on addresses now within the temporary mm, are saved and disabled when loading the temporary mm. The HW breakpoints are restored when unloading the temporary mm. All HW breakpoints are indiscriminately disabled while the temporary mm is in use - this may include breakpoints set by perf. Use the `poking_init` init hook to prepare a temporary mm and patching address. Initialize the temporary mm using mm_alloc(). Choose a randomized patching address inside the temporary mm userspace address space. The patching address is randomized between PAGE_SIZE and DEFAULT_MAP_WINDOW-PAGE_SIZE. Bits of entropy with 64K page size on BOOK3S_64: bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE) PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB bits of entropy = log2(128TB / 64K) bits of entropy = 31 The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU operates - by default the space above DEFAULT_MAP_WINDOW is not available. Currently the Hash MMU does not use a temporary mm so technically this upper limit isn't necessary; however, a larger randomization range does not further "harden" this overall approach and future work may introduce patching with a temporary mm on Hash as well. Randomization occurs only once during initialization for each CPU as it comes online. The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock KUAP) according to PowerISA v3.0b Figure 35 on Radix. Based on x86 implementation: commit 4fc19708b165 ("x86/alternatives: Initialize temporary mm for patching") and: commit b3fd8e83ada0 ("x86/alternatives: Use temporary mm for text poking") From: Benjamin Gray <bgray@linux.ibm.com> Synchronisation is done according to ISA 3.1B Book 3 Chapter 13 "Synchronization Requirements for Context Alterations". Switching the mm is a change to the PID, which requires a CSI before and after the change, and a hwsync between the last instruction that performs address translation for an associated storage access. Instruction fetch is an associated storage access, but the instruction address mappings are not being changed, so it should not matter which context they use. We must still perform a hwsync to guard arbitrary prior code that may have accessed a userspace address. TLB invalidation is local and VA specific. Local because only this core used the patching mm, and VA specific because we only care that the writable mapping is purged. Leaving the other mappings intact is more efficient, especially when performing many code patches in a row (e.g., as ftrace would). Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> [mpe: Use mm_alloc() per 107b6828a7cd ("x86/mm: Use mm_alloc() in poking_init()")] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20221109045112.187069-9-bgray@linux.ibm.com