summaryrefslogtreecommitdiff
path: root/arch/riscv
AgeCommit message (Collapse)Author
2024-07-12Merge patch series "riscv: Apply Zawrs when available"Palmer Dabbelt
Andrew Jones <ajones@ventanamicro.com> says: Zawrs provides two instructions (wrs.nto and wrs.sto), where both are meant to allow the hart to enter a low-power state while waiting on a store to a memory location. The instructions also both wait an implementation-defined "short" duration (unless the implementation terminates the stall for another reason). The difference is that while wrs.sto will terminate when the duration elapses, wrs.nto, depending on configuration, will either just keep waiting or an ILL exception will be raised. Linux will use wrs.nto, so if platforms have an implementation which falls in the "just keep waiting" category (which is not expected), then it should _not_ advertise Zawrs in the hardware description. Like wfi (and with the same {m,h}status bits to configure it), when wrs.nto is configured to raise exceptions it's expected that the higher privilege level will see the instruction was a wait instruction, do something, and then resume execution following the instruction. For example, KVM does configure exceptions for wfi (hstatus.VTW=1) and therefore also for wrs.nto. KVM does this for wfi since it's better to allow other tasks to be scheduled while a VCPU waits for an interrupt. For waits such as those where wrs.nto/sto would be used, which are typically locks, it is also a good idea for KVM to be involved, as it can attempt to schedule the lock holding VCPU. This series starts with Christoph's addition of the riscv smp_cond_load_relaxed function which applies wrs.sto when available. That patch has been reworked to use wrs.nto and to use the same approach as Arm for the wait loop, since we can't have arbitrary C code between the load-reserved and the wrs. Then, hwprobe support is added (since the instructions are also usable from usermode), and finally KVM is taught about wrs.nto, allowing guests to see and use the Zawrs extension. We still don't have test results from hardware, and it's not possible to prove that using Zawrs is a win when testing on QEMU, not even when oversubscribing VCPUs to guests. However, it is possible to use KVM selftests to force a scenario where we can prove Zawrs does its job and does it well. [4] is a test which does this and, on my machine, without Zawrs it takes 16 seconds to complete and with Zawrs it takes 0.25 seconds. This series is also available here [1]. In order to use QEMU for testing a build with [2] is needed. In order to enable guests to use Zawrs with KVM using kvmtool, the branch at [3] may be used. [1] https://github.com/jones-drew/linux/commits/riscv/zawrs-v3/ [2] https://lore.kernel.org/all/20240312152901.512001-2-ajones@ventanamicro.com/ [3] https://github.com/jones-drew/kvmtool/commits/riscv/zawrs/ [4] https://github.com/jones-drew/linux/commit/cb2beccebcece10881db842ed69bdd5715cfab5d Link: https://lore.kernel.org/r/20240426100820.14762-8-ajones@ventanamicro.com * b4-shazam-merge: KVM: riscv: selftests: Add Zawrs extension to get-reg-list test KVM: riscv: Support guest wrs.nto riscv: hwprobe: export Zawrs ISA extension riscv: Add Zawrs support for spinlocks dt-bindings: riscv: Add Zawrs ISA extension description riscv: Provide a definition for 'pause' Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-12KVM: riscv: Support guest wrs.ntoAndrew Jones
When a guest traps on wrs.nto, call kvm_vcpu_on_spin() to attempt to yield to the lock holding VCPU. Also extend the KVM ISA extension ONE_REG interface to allow KVM userspace to detect and enable the Zawrs extension for the Guest/VM. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240426100820.14762-13-ajones@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-12riscv: hwprobe: export Zawrs ISA extensionAndrew Jones
Export Zawrs ISA extension through hwprobe. [Palmer: there's a gap in the numbers here as there will be a merge conflict when this is picked up. To avoid confusion I just set the hwprobe ID to match what it would be post-merge.] Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Clément Léger <cleger@rivosinc.com> Link: https://lore.kernel.org/r/20240426100820.14762-12-ajones@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-12Merge tag 'loongarch-kvm-6.11' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch.
2024-07-12Merge tag 'kvm-riscv-6.11-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini
KVM/riscv changes for 6.11 - Redirect AMO load/store access fault traps to guest - Perf kvm stat support for RISC-V - Use guest files for IMSIC virtualization, when available ONE_REG support for the Zimop, Zcmop, Zca, Zcf, Zcd, Zcb and Zawrs ISA extensions is coming through the RISC-V tree.
2024-07-12riscv: Add Zawrs support for spinlocksChristoph Müllner
RISC-V code uses the generic ticket lock implementation, which calls the macros smp_cond_load_relaxed() and smp_cond_load_acquire(). Introduce a RISC-V specific implementation of smp_cond_load_relaxed() which applies WRS.NTO of the Zawrs extension in order to reduce power consumption while waiting and allows hypervisors to enable guests to trap while waiting. smp_cond_load_acquire() doesn't need a RISC-V specific implementation as the generic implementation is based on smp_cond_load_relaxed() and smp_acquire__after_ctrl_dep() sufficiently provides the acquire semantics. This implementation is heavily based on Arm's approach which is the approach Andrea Parri also suggested. The Zawrs specification can be found here: https://github.com/riscv/riscv-zawrs/blob/main/zawrs.adoc Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu> Co-developed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20240426100820.14762-11-ajones@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-12riscv: Provide a definition for 'pause'Andrew Jones
If we're going to provide the encoding for 'pause' in cpu_relax() anyway, then we can drop the toolchain checks and just always use it. The advantage of doing this is that other code that need pause don't need to also define it (yes, another use is coming). Add the definition to insn-def.h since it's an instruction definition and also because insn-def.h doesn't include much, so it's safe to include from asm/vdso/processor.h without concern for circular dependencies. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20240426100820.14762-9-ajones@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/sched/act_ct.c 26488172b029 ("net/sched: Fix UAF when resolving a clash") 3abbd7ed8b76 ("act_ct: prepare for stolen verdict coming from conntrack and nat engine") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11riscv: hwprobe: export highest virtual userspace addressClément Léger
Some userspace applications (OpenJDK for instance) uses the free MSBs in pointers to insert additional information for their own logic and need to get this information from somewhere. Currently they rely on parsing /proc/cpuinfo "mmu=svxx" string to obtain the current value of virtual address usable bits [1]. Since this reflect the raw supported MMU mode, it might differ from the logical one used internally which is why arch_get_mmap_end() is used. Exporting the highest mmapable address through hwprobe will allow a more stable interface to be used. For that purpose, add a new hwprobe key named RISCV_HWPROBE_KEY_HIGHEST_VIRT_ADDRESS which will export the highest userspace virtual address. Link: https://github.com/openjdk/jdk/blob/master/src/hotspot/os_cpu/linux_riscv/vm_version_linux_riscv.cpp#L171 [1] Signed-off-by: Clément Léger <cleger@rivosinc.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20240410144558.1104006-1-cleger@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-10riscv: Remove unnecessary int cast in variable_fls()Thorsten Blum
__builtin_clz() returns an int and casting the whole expression to int is unnecessary. Remove it. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Yury Norov <yury.norov@gmail.com>
2024-07-10riscv: Improve sbi_ecall() code generation by reordering argumentsAlexandre Ghiti
The sbi_ecall() function arguments are not in the same order as the ecall arguments, so we end up re-ordering the registers before the ecall which is useless and costly. So simply reorder the arguments in the same way as expected by ecall. Instead of reordering directly the arguments of sbi_ecall(), use a proxy macro since the current ordering is more natural. Before: Dump of assembler code for function sbi_ecall: 0xffffffff800085e0 <+0>: add sp,sp,-32 0xffffffff800085e2 <+2>: sd s0,24(sp) 0xffffffff800085e4 <+4>: mv t1,a0 0xffffffff800085e6 <+6>: add s0,sp,32 0xffffffff800085e8 <+8>: mv t3,a1 0xffffffff800085ea <+10>: mv a0,a2 0xffffffff800085ec <+12>: mv a1,a3 0xffffffff800085ee <+14>: mv a2,a4 0xffffffff800085f0 <+16>: mv a3,a5 0xffffffff800085f2 <+18>: mv a4,a6 0xffffffff800085f4 <+20>: mv a5,a7 0xffffffff800085f6 <+22>: mv a6,t3 0xffffffff800085f8 <+24>: mv a7,t1 0xffffffff800085fa <+26>: ecall 0xffffffff800085fe <+30>: ld s0,24(sp) 0xffffffff80008600 <+32>: add sp,sp,32 0xffffffff80008602 <+34>: ret After: Dump of assembler code for function __sbi_ecall: 0xffffffff8000b6b2 <+0>: add sp,sp,-32 0xffffffff8000b6b4 <+2>: sd s0,24(sp) 0xffffffff8000b6b6 <+4>: add s0,sp,32 0xffffffff8000b6b8 <+6>: ecall 0xffffffff8000b6bc <+10>: ld s0,24(sp) 0xffffffff8000b6be <+12>: add sp,sp,32 0xffffffff8000b6c0 <+14>: ret Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Yunhui Cui <cuiyunhui@bytedance.com> Link: https://lore.kernel.org/r/20240322112629.68170-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-10riscv: Add tracepoints for SBI calls and returnsSamuel Holland
These are useful for measuring the latency of SBI calls. The SBI HSM extension is excluded because those functions are called from contexts such as cpuidle where instrumentation is not allowed. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Link: https://lore.kernel.org/r/20240321230131.1838105-1-samuel.holland@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-10riscv: Optimize crc32 with Zbc extensionXiao Wang
As suggested by the B-ext spec, the Zbc (carry-less multiplication) instructions can be used to accelerate CRC calculations. Currently, the crc32 is the most widely used crc function inside kernel, so this patch focuses on the optimization of just the crc32 APIs. Compared with the current table-lookup based optimization, Zbc based optimization can also achieve large stride during CRC calculation loop, meantime, it avoids the memory access latency of the table-lookup based implementation and it reduces memory footprint. If Zbc feature is not supported in a runtime environment, then the table-lookup based implementation would serve as fallback via alternative mechanism. By inspecting the vmlinux built by gcc v12.2.0 with default optimization level (-O2), we can see below instruction count change for each 8-byte stride in the CRC32 loop: rv64: crc32_be (54->31), crc32_le (54->13), __crc32c_le (54->13) rv32: crc32_be (50->32), crc32_le (50->16), __crc32c_le (50->16) The compile target CPU is little endian, extra effort is needed for byte swapping for the crc32_be API, thus, the instruction count change is not as significant as that in the *_le cases. This patch is tested on QEMU VM with the kernel CRC32 selftest for both rv64 and rv32. Running the CRC32 selftest on a real hardware (SpacemiT K1) with Zbc extension shows 65% and 125% performance improvement respectively on crc32_test() and crc32c_test(). Signed-off-by: Xiao Wang <xiao.w.wang@intel.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20240621054707.1847548-1-xiao.w.wang@intel.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-10riscv: convert to generic syscall tableArnd Bergmann
The uapi/asm/unistd_{32,64}.h and asm/syscall_table_{32,64}.h headers can now be generated from scripts/syscall.tbl, which makes this consistent with the other architectures that have their own syscall.tbl. riscv has two extra system call that gets added to scripts/syscall.tbl. The newstat and rlimit entries in the syscall_abis_64 line are for system calls that were part of the generic ABI when riscv64 got added but are no longer enabled by default for new architectures. Both riscv32 and riscv64 also implement memfd_secret, which is optional for all architectures. Unlike all the other 32-bit architectures, the time32 and stat64 sets of syscalls are not enabled on riscv32. Both the user visible side of asm/unistd.h and the internal syscall table in the kernel should have the same effective contents after this. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-10clone3: drop __ARCH_WANT_SYS_CLONE3 macroArnd Bergmann
When clone3() was introduced, it was not obvious how each architecture deals with setting up the stack and keeping the register contents in a fork()-like system call, so this was left for the architecture maintainers to implement, with __ARCH_WANT_SYS_CLONE3 defined by those that already implement it. Five years later, we still have a few architectures left that are missing clone3(), and the macro keeps getting in the way as it's fundamentally different from all the other __ARCH_WANT_SYS_* macros that are meant to provide backwards-compatibility with applications using older syscalls that are no longer provided by default. Address this by reversing the polarity of the macro, adding an __ARCH_BROKEN_SYS_CLONE3 macro to all architectures that don't already provide the syscall, and remove __ARCH_WANT_SYS_CLONE3 from all the other ones. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-09Merge tag 'for-netdev' of ↵Paolo Abeni
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-07-08 The following pull-request contains BPF updates for your *net-next* tree. We've added 102 non-merge commits during the last 28 day(s) which contain a total of 127 files changed, 4606 insertions(+), 980 deletions(-). The main changes are: 1) Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman. 2) Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs, from Daniel Xu. 3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement support for BPF exceptions, from Ilya Leoshkevich. 4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui. 5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option for skbs and add coverage along with it, from Vadim Fedorenko. 6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan. 7) Extend the BPF verifier to track the delta between linked registers in order to better deal with recent LLVM code optimizations, from Alexei Starovoitov. 8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should have been a pointer to the map value, from Benjamin Tissoires. 9) Extend BPF selftests to add regular expression support for test output matching and adjust some of the selftest when compiled under gcc, from Cupertino Miranda. 10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always iterates exactly once anyway, from Dan Carpenter. 11) Add the capability to offload the netfilter flowtable in XDP layer through kfuncs, from Florian Westphal & Lorenzo Bianconi. 12) Various cleanups in networking helpers in BPF selftests to shave off a few lines of open-coded functions on client/server handling, from Geliang Tang. 13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so that x86 JIT does not need to implement detection, from Leon Hwang. 14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an out-of-bounds memory access for dynpointers, from Matt Bobrowski. 15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as it might lead to problems on 32-bit archs, from Jiri Olsa. 16) Enhance traffic validation and dynamic batch size support in xsk selftests, from Tushar Vyavahare. bpf-next-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits) selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature bpf: helpers: fix bpf_wq_set_callback_impl signature libbpf: Add NULL checks to bpf_object__{prev_map,next_map} selftests/bpf: Remove exceptions tests from DENYLIST.s390x s390/bpf: Implement exceptions s390/bpf: Change seen_reg to a mask bpf: Remove unnecessary loop in task_file_seq_get_next() riscv, bpf: Optimize stack usage of trampoline bpf, devmap: Add .map_alloc_check selftests/bpf: Remove arena tests from DENYLIST.s390x selftests/bpf: Add UAF tests for arena atomics selftests/bpf: Introduce __arena_global s390/bpf: Support arena atomics s390/bpf: Enable arena s390/bpf: Support address space cast instruction s390/bpf: Support BPF_PROBE_MEM32 s390/bpf: Land on the next JITed instruction after exception s390/bpf: Introduce pre- and post- probe functions s390/bpf: Get rid of get_probe_mem_regno() ... ==================== Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-09Merge tag 'riscv-sophgo-dt-for-v6.11' of https://github.com/sophgo/linux ↵Arnd Bergmann
into soc/dt RISC-V Devicetrees for v6.11 Sopgho: Add clock support for SG2042. Signed-off-by: Chen Wang <unicorn_wang@outlook.com> * tag 'riscv-sophgo-dt-for-v6.11' of https://github.com/sophgo/linux: riscv: dts: add clock generator for Sophgo SG2042 SoC Link: https://lore.kernel.org/r/PN1P287MB281861EA2B1706B430D2FA3EFEDB2@PN1P287MB2818.INDP287.PROD.OUTLOOK.COM Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-09riscv: dts: add clock generator for Sophgo SG2042 SoCChen Wang
Add clock generator node to device tree for SG2042, and enable clock for uart. Signed-off-by: Chen Wang <unicorn_wang@outlook.com> Reviewed-by: Guo Ren <guoren@kernel.org>
2024-07-08Merge tag 'riscv-config-for-v6.11' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux into soc/defconfig RISC-V config update for v6.11 StarFive: Enable most of the options needed for the jh7100 based boards to be properly testable with defconfig. Signed-off-by: Conor Dooley <conor.dooley@microchip.com> * tag 'riscv-config-for-v6.11' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux: riscv: defconfig: Enable StarFive JH7110 drivers Link: https://lore.kernel.org/r/20240707-unused-outflank-aa127ccb2cfe@spud Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-08Merge tag 'riscv-dt-for-v6.11' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux into soc/dt RISC-V Devicetrees for v6.11 T-Head: Last change from me before this starts going via Drew's tree is the addition of the SBI PMU events node for the th1520. StarFive: A dts for the Pin64 Star64, another board with a jh7110 SoC. This board is almost identical to the existing Milk-v Mars and VisionFive 2 boards that are already support - just with a different PHY configuration and only one of the two PCIe ports exposed. Additionally, the Mars and VisionFive 2 get their PCie configuration added. Microchip: A dts for the BeagleV Fire. PCIe is disabled on it for now, as some binding and driver changes are required. Signed-off-by: Conor Dooley <conor.dooley@microchip.com> * tag 'riscv-dt-for-v6.11' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux: riscv: dts: starfive: add PCIe dts configuration for JH7110 riscv: dts: microchip: add an initial devicetree for the BeagleV Fire dt-bindings: riscv: microchip: document beaglev-fire riscv: dts: starfive: Update flash partition layout riscv: dts: thead: th1520: Add PMU event node riscv: dts: starfive: add Star64 board devicetree dt-bindings: riscv: starfive: add Star64 board compatible dt-bindings: riscv: Add T-HEAD C908 compatible Link: https://lore.kernel.org/r/20240707-nuttiness-lustfully-4aaf03c991b2@spud Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-08Merge tag 'sunxi-dt-for-6.11' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into soc/dt Allwinner SoC device tree changes for 6.11 This includes a commit shared with the clk tree. This commit adds clock and reset indices to the device tree binding, and thus is needed for both the device tree and driver changes. ARM64 device tree and binding-only changes - Add LRADC (low resolution ADC for resistor network based keys) for H616 SoC - Add cache information for A64, H6, and H616 SoCs - Correct model names and descriptions for Pine64 boards - Add GPADC (general purpose ADC) for H616 SoC - Add ADC joysticks based on GPADC for anbernic-rg35xx-h board - Add additional CPU OPPs for the H700 on top of existing H616 ones - Enable DVFS for rg35xx boards - Add IOMMU for H616 SoC RISC-V device tree changes - Add system LDOs to D1s/T113 SoC - Add ClockworkPi and DevTerm device trees * tag 'sunxi-dt-for-6.11' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux: riscv: dts: allwinner: Add ClockworkPi and DevTerm devicetrees riscv: dts: allwinner: d1s-t113: Add system LDOs arm64: dts: allwinner: h616: add IOMMU node arm64: dts: allwinner: rg35xx: Enable DVFS CPU frequency scaling arm64: dts: allwinner: h616: add additional CPU OPPs for the H700 arm64: dts: allwinner: anbernic-rg35xx-h: Add ADC joysticks arm64: dts: allwinner: h616: Add GPADC device node dt-bindings: clock: sun50i-h616-ccu: Add GPADC clocks ARM: dts: sunxi: remove duplicated entries in makefile arm64: dts: allwinner: Add cache information to the SoC dtsi for H616 arm64: dts: allwinner: Add cache information to the SoC dtsi for A64 arm64: dts: allwinner: Correct the model names for Pine64 boards dt-bindings: arm: sunxi: Correct the descriptions for Pine64 boards arm64: dts: allwinner: Add cache information to the SoC dtsi for H6 ARM: dts: sun50i: Add LRADC node dt-bindings: input: sun4i-lradc-keys: Add H616 compatible Link: https://lore.kernel.org/r/ZoQa8r1N8yi7FlPV@wens.tw Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-08riscv, bpf: Optimize stack usage of trampolinePuranjay Mohan
When BPF_TRAMP_F_CALL_ORIG is not set, stack space for passing arguments on stack doesn't need to be reserved because the original function is not called. Only reserve space for stacked arguments when BPF_TRAMP_F_CALL_ORIG is set. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/bpf/20240708114758.64414-1-puranjay@kernel.org
2024-07-05Merge tag 'riscv-for-linus-6.10-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - A fix for the CMODX example in the recently added icache flushing prctl() - A fix to the perf driver to avoid corrupting event data on counter overflows when external overflow handlers are in use - A fix to clear all hardware performance monitor events on boot, to avoid dangling events firmware or previously booted kernels from triggering spuriously - A fix to the perf event probing logic to avoid erroneously reporting the presence of unimplemented counters. This also prevents some implemented counters from being reported - A build fix for the vector sigreturn selftest on clang - A fix to ftrace, which now requires the previously optional index argument to ftrace_graph_ret_addr() - A fix to avoid deadlocking if kexec crash handling triggers in an interrupt context * tag 'riscv-for-linus-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: kexec: Avoid deadlock in kexec crash path riscv: stacktrace: fix usage of ftrace_graph_ret_addr() riscv: selftests: Fix vsetivli args for clang perf: RISC-V: Check standard event availability drivers/perf: riscv: Reset the counter to hpmevent mapping while starting cpus drivers/perf: riscv: Do not update the event data if uptodate documentation: Fix riscv cmodx example
2024-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h 219343755eae ("net: phy: aquantia: add missing include guards") 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx") b501d261a5b3 ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/ include/linux/mlx5/mlx5_ifc.h 048a403648fc ("net/mlx5: IFC updates for changing max EQs") 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c 4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") 3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h 816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") 5a009b42e041 ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-03mm: implement update_mmu_tlb() using update_mmu_tlb_range()Bang Li
Let's make update_mmu_tlb() simply a generic wrapper around update_mmu_tlb_range(). Only the latter can now be overridden by the architecture. We can now remove __HAVE_ARCH_UPDATE_MMU_TLB as well. Link: https://lkml.kernel.org/r/20240522061204.117421-3-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: add update_mmu_tlb_range()Bang Li
Patch series "Add update_mmu_tlb_range() to simplify code", v4. This series of commits mainly adds the update_mmu_tlb_range() to batch update tlb in an address range and implement update_mmu_tlb() using update_mmu_tlb_range(). After commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size THP"), We may need to batch update tlb of a certain address range by calling update_mmu_tlb() in a loop. Using the update_mmu_tlb_range(), we can simplify the code and possibly reduce the execution of some unnecessary code in some architectures. This patch (of 3): Add update_mmu_tlb_range(), we can batch update tlb of an address range. Link: https://lkml.kernel.org/r/20240522061204.117421-1-libang.li@antgroup.com Link: https://lkml.kernel.org/r/20240522061204.117421-2-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03riscv: kexec: Avoid deadlock in kexec crash pathSong Shuai
If the kexec crash code is called in the interrupt context, the machine_kexec_mask_interrupts() function will trigger a deadlock while trying to acquire the irqdesc spinlock and then deactivate irqchip in irq_set_irqchip_state() function. Unlike arm64, riscv only requires irq_eoi handler to complete EOI and keeping irq_set_irqchip_state() will only leave this possible deadlock without any use. So we simply remove it. Link: https://lore.kernel.org/linux-riscv/20231208111015.173237-1-songshuaishuai@tinylab.org/ Fixes: b17d19a5314a ("riscv: kexec: Fixup irq controller broken in kexec crash path") Signed-off-by: Song Shuai <songshuaishuai@tinylab.org> Reviewed-by: Ryo Takakura <takakura@valinux.co.jp> Link: https://lore.kernel.org/r/20240626023316.539971-1-songshuaishuai@tinylab.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03riscv: stacktrace: fix usage of ftrace_graph_ret_addr()Puranjay Mohan
ftrace_graph_ret_addr() takes an `idx` integer pointer that is used to optimize the stack unwinding. Pass it a valid pointer to utilize the optimizations that might be available in the future. The commit is making riscv's usage of ftrace_graph_ret_addr() match x86_64. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20240618145820.62112-1-puranjay@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03Merge patch series "Assorted fixes in RISC-V PMU driver"Palmer Dabbelt
Atish Patra <atishp@rivosinc.com> says: This series contains 3 fixes out of which the first one is a new fix for invalid event data reported in lkml[2]. The last two are v3 of Samuel's patch[1]. I added the RB/TB/Fixes tag and moved 1 unrelated change to its own patch. I also changed an error message in kvm vcpu_pmu from pr_err to pr_debug to avoid redundant failure error messages generated due to the boot time quering of events implemented in the patch[1] Here is the original cover letter for the patch[1] Before this patch: $ perf list hw List of pre-defined events (to be used in -e or -M): branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] ref-cycles [Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] $ perf stat -ddd true Performance counter stats for 'true': 4.36 msec task-clock # 0.744 CPUs utilized 1 context-switches # 229.325 /sec 0 cpu-migrations # 0.000 /sec 38 page-faults # 8.714 K/sec 4,375,694 cycles # 1.003 GHz (60.64%) 728,945 instructions # 0.17 insn per cycle 79,199 branches # 18.162 M/sec 17,709 branch-misses # 22.36% of all branches 181,734 L1-dcache-loads # 41.676 M/sec 5,547 L1-dcache-load-misses # 3.05% of all L1-dcache accesses <not counted> LLC-loads (0.00%) <not counted> LLC-load-misses (0.00%) <not counted> L1-icache-loads (0.00%) <not counted> L1-icache-load-misses (0.00%) <not counted> dTLB-loads (0.00%) <not counted> dTLB-load-misses (0.00%) <not counted> iTLB-loads (0.00%) <not counted> iTLB-load-misses (0.00%) <not counted> L1-dcache-prefetches (0.00%) <not counted> L1-dcache-prefetch-misses (0.00%) 0.005860375 seconds time elapsed 0.000000000 seconds user 0.010383000 seconds sys After this patch: $ perf list hw List of pre-defined events (to be used in -e or -M): branch-instructions OR branches [Hardware event] branch-misses [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] $ perf stat -ddd true Performance counter stats for 'true': 5.16 msec task-clock # 0.848 CPUs utilized 1 context-switches # 193.817 /sec 0 cpu-migrations # 0.000 /sec 37 page-faults # 7.171 K/sec 5,183,625 cycles # 1.005 GHz 961,696 instructions # 0.19 insn per cycle 85,853 branches # 16.640 M/sec 20,462 branch-misses # 23.83% of all branches 243,545 L1-dcache-loads # 47.203 M/sec 5,974 L1-dcache-load-misses # 2.45% of all L1-dcache accesses <not supported> LLC-loads <not supported> LLC-load-misses <not supported> L1-icache-loads <not supported> L1-icache-load-misses <not supported> dTLB-loads 19,619 dTLB-load-misses <not supported> iTLB-loads 6,831 iTLB-load-misses <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 0.006085625 seconds time elapsed 0.000000000 seconds user 0.013022000 seconds sys [1] https://lore.kernel.org/linux-riscv/20240418014652.1143466-1-samuel.holland@sifive.com/ [2] https://lore.kernel.org/all/CC51D53B-846C-4D81-86FC-FBF969D0A0D6@pku.edu.cn/ * b4-shazam-merge: perf: RISC-V: Check standard event availability drivers/perf: riscv: Reset the counter to hpmevent mapping while starting cpus drivers/perf: riscv: Do not update the event data if uptodate Link: https://lore.kernel.org/r/20240628-misc_perf_fixes-v4-0-e01cfddcf035@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03perf: RISC-V: Check standard event availabilitySamuel Holland
The RISC-V SBI PMU specification defines several standard hardware and cache events. Currently, all of these events are exposed to userspace, even when not actually implemented. They appear in the `perf list` output, and commands like `perf stat` try to use them. This is more than just a cosmetic issue, because the PMU driver's .add function fails for these events, which causes pmu_groups_sched_in() to prematurely stop scheduling in other (possibly valid) hardware events. Add logic to check which events are supported by the hardware (i.e. can be mapped to some counter), so only usable events are reported to userspace. Since the kernel does not know the mapping between events and possible counters, this check must happen during boot, when no counters are in use. Make the check asynchronous to minimize impact on boot time. Fixes: e9991434596f ("RISC-V: Add perf platform driver based on SBI PMU extension") Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240628-misc_perf_fixes-v4-3-e01cfddcf035@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-02riscv, bpf: Add 12-argument support for RV64 bpf trampolinePu Lehui
This patch adds 12 function arguments support for riscv64 bpf trampoline. The current bpf trampoline supports <= sizeof(u64) bytes scalar arguments [0] and <= 16 bytes struct arguments [1]. Therefore, we focus on the situation where scalars are at most XLEN bits and aggregates whose total size does not exceed 2×XLEN bits in the riscv calling convention [2]. Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Björn Töpel <bjorn@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Link: https://elixir.bootlin.com/linux/v6.8/source/kernel/bpf/btf.c#L6184 [0] Link: https://elixir.bootlin.com/linux/v6.8/source/kernel/bpf/btf.c#L6769 [1] Link: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/releases/download/draft-20230929-e5c800e661a53efe3c2678d71a306323b60eb13b/riscv-abi.pdf [2] Link: https://lore.kernel.org/bpf/20240702121944.1091530-2-pulehui@huaweicloud.com
2024-07-01Merge tag 'arm-fixes-6.10-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC fixes from Arnd Bergmann: "A number of devicetree fixes came in for the rockchip platforms, correcting some of the address information, and reverting a change to the MMC controller configuration that caused regressions. Four drivers have one code change each, addressing minor build issues for the optee firmware driver, the litex SoC platform driver and two reset drivers. The riscv fixes as also simple, mainly turning off device nodes in the canaan dts files unless they are actually usable on a particular board. Finally, Drew takes over maintaining the THEAD RISC-V SoC platform" * tag 'arm-fixes-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: drivers/soc/litex: drop obsolete dependency on COMPILE_TEST tee: optee: ffa: Fix missing-field-initializers warning arm64: dts: rockchip: Add sound-dai-cells for RK3368 arm64: dts: rockchip: Fix the i2c address of es8316 on Cool Pi 4B reset: hisilicon: hi6220: add missing MODULE_DESCRIPTION() macro reset: gpio: Fix missing gpiolib dependency for GPIO reset controller MAINTAINERS: thead: update Maintainer arm64: dts: rockchip: fix PMIC interrupt pin on ROCK Pi E riscv: dts: starfive: Set EMMC vqmmc maximum voltage to 3.3V on JH7110 boards arm64: dts: rockchip: make poweroff(8) work on Radxa ROCK 5A Revert "arm64: dts: rockchip: remove redundant cd-gpios from rk3588 sdmmc nodes" ARM: dts: rockchip: rk3066a: add #sound-dai-cells to hdmi node arm64: dts: rockchip: Fix the value of `dlg,jack-det-rate` mismatch on rk3399-gru arm64: dts: rockchip: set correct pwm0 pinctrl on rk3588-tiger riscv: dts: canaan: Disable I/O devices unless used riscv: dts: canaan: Clean up serial aliases arm64: dts: rockchip: Rename LED related pinctrl nodes on rk3308-rock-pi-s arm64: dts: rockchip: Fix SD NAND and eMMC init on rk3308-rock-pi-s arm64: dts: rockchip: Fix rk3308 codec@ff560000 reset-names arm64: dts: rockchip: Fix the DCDC_REG2 minimum voltage on Quartz64 Model B
2024-07-01riscv, bpf: Use bpf_prog_pack for RV64 bpf trampolinePu Lehui
We used bpf_prog_pack to aggregate bpf programs into huge page to relieve the iTLB pressure on the system. We can apply it to bpf trampoline, as Song had been implemented it in core and x86 [0]. This patch is going to use bpf_prog_pack to RV64 bpf trampoline. Since Song and Puranjay have done a lot of work for bpf_prog_pack on RV64, implementing this function will be easy. Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Björn Töpel <bjorn@rivosinc.com> #riscv Link: https://lore.kernel.org/all/20231206224054.492250-1-song@kernel.org [0] Link: https://lore.kernel.org/bpf/20240622030437.3973492-4-pulehui@huaweicloud.com
2024-07-01riscv, bpf: Fix out-of-bounds issue when preparing trampoline imagePu Lehui
We get the size of the trampoline image during the dry run phase and allocate memory based on that size. The allocated image will then be populated with instructions during the real patch phase. But after commit 26ef208c209a ("bpf: Use arch_bpf_trampoline_size"), the `im` argument is inconsistent in the dry run and real patch phase. This may cause emit_imm in RV64 to generate a different number of instructions when generating the 'im' address, potentially causing out-of-bounds issues. Let's emit the maximum number of instructions for the "im" address during dry run to fix this problem. Fixes: 26ef208c209a ("bpf: Use arch_bpf_trampoline_size") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240622030437.3973492-3-pulehui@huaweicloud.com
2024-07-01riscv: dts: starfive: add PCIe dts configuration for JH7110Minda Chen
Add PCIe dts configuraion for JH7110 SoC platform. The Star64 only has one exposed PCIe port, so only the Mars and VisionFive 2 get two enabled. Signed-off-by: Minda Chen <minda.chen@starfivetech.com> Reviewed-by: Hal Feng <hal.feng@starfivetech.com> [conor: squash in star64's single exposed port] Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2024-07-01Merge 6.10-rc6 into tty-nextGreg Kroah-Hartman
This resolves the merge issues in the 8250 code due to some reverts in 6.10-rc6 in the console changes. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-30riscv: dts: allwinner: Add ClockworkPi and DevTerm devicetreesSamuel Holland
Clockwork Tech manufactures several SoMs for their RasPi CM3-compatible "ClockworkPi" mainboard. Their R-01 SoM features the Allwinner D1 SoC. The R-01 contains only the CPU, DRAM, and always-on voltage regulation; it does not merit a separate devicetree. The ClockworkPi mainboard features analog audio, a MIPI-DSI panel, USB host and peripheral ports, an Ampak AP6256 WiFi/Bluetooth module, and an X-Powers AXP228 PMIC for managing a Li-ion battery. The DevTerm is a complete system which extends the ClockworkPi mainboard with a MIPI-DSI panel and a pair of expansion boards. These expansion boards provide a fan, a USB keyboard, speakers, and a thermal printer. Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Samuel Holland <samuel@sholland.org> Link: https://lore.kernel.org/r/20240622150731.1105901-4-wens@kernel.org Signed-off-by: Chen-Yu Tsai <wens@csie.org>
2024-06-30riscv: dts: allwinner: d1s-t113: Add system LDOsChen-Yu Tsai
Now that the bindings for the system LDOs have been merged, the nodes for the system LDOs can be added. These are used on the ClockworkPi. This was originally part of Samuel's D1 device tree series [1], but was dropped in v5 as the regulator bindings weren't merged at the time. [1] https://lore.kernel.org/linux-sunxi/20221231233851.24923-1-samuel@sholland.org/ Link: https://lore.kernel.org/r/20240622150731.1105901-3-wens@kernel.org Signed-off-by: Chen-Yu Tsai <wens@csie.org>
2024-06-28Merge tag 'riscv-for-linus-6.10-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - A fix for vector load/store instruction decoding, which could result in reserved vector element length encodings decoding as valid vector instructions. - Instruction patching now aggressively flushes the local instruction cache, to avoid situations where patching functions on the flush path results in torn instructions being fetched. - A fix to prevent the stack walker from showing up as part of traces. * tag 'riscv-for-linus-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: stacktrace: convert arch_stack_walk() to noinstr riscv: patch: Flush the icache right after patching to avoid illegal insns RISC-V: fix vector insn load/store width mask
2024-06-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: e3f02f32a050 ("ionic: fix kernel panic due to multi-buffer handling") d9c04209990b ("ionic: Mark error paths in the data path as unlikely") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-27Merge tag 'riscv-dt-fixes-for-v6.10-rc5+' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux into arm/fixes RISC-V Devicetree fixes for v6.10-rc5+ T-Head: Jisheng hasn't got enough time to look after the platform, so Drew Fustini is going to take over. StarFive: A fix for a regulator voltage range that prevented using low performance SD cards. Canaan: Cleanup for some "over eager" aliases for serial ports that did not exist on some boards and I/O devices disabled on boards where they were not actually in use. Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-06-26Merge patch series "riscv: Memory Hot(Un)Plug support"Palmer Dabbelt
Björn Töpel <bjorn@kernel.org> says: From: Björn Töpel <bjorn@rivosinc.com> ================================================================ Memory Hot(Un)Plug support (and ZONE_DEVICE) for the RISC-V port ================================================================ Introduction ============ To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory hot(un)plug allows for increasing and decreasing the size of physical memory available to a machine at runtime." This series adds memory hot(un)plugging, and ZONE_DEVICE support for the RISC-V Linux port. MM configuration ================ RISC-V MM has the following configuration: * Memory blocks are 128M, analogous to x86-64. It uses PMD ("hugepage") vmemmaps. From that follows that 2M (PMD) worth of vmemmap spans 32768 pages á 4K which gets us 128M. * The pageblock size is the minimum minimum virtio_mem size, and on RISC-V it's 2M (2^9 * 4K). Implementation ============== The PGD table on RISC-V is shared/copied between for all processes. To avoid doing page table synchronization, the first patch (patch 1) pre-allocated the PGD entries for vmemmap/direct map. By doing that the init_mm PGD will be fixed at kernel init, and synchronization can be avoided all together. The following two patches (patch 2-3) does some preparations, followed by the actual MHP implementation (patch 4-5). Then, MHP and virtio-mem are enabled (patch 6-7), and finally ZONE_DEVICE support is added (patch 8). MHP and locking =============== TL;DR: The MHP does not step on any toes, except for ptdump. Additional locking is required for ptdump. Long version: For v2 I spent some time digging into init_mm synchronization/update. Here are my findings, and I'd love them to be corrected if incorrect. It's been a gnarly path... The `init_mm` structure is a special mm (perhaps not a "real" one). It's a "lazy context" that tracks kernel page table resources, e.g., the kernel page table (swapper_pg_dir), a kernel page_table_lock (more about the usage below), mmap_lock, and such. `init_mm` does not track/contain any VMAs. Having the `init_mm` is convenient, so that the regular kernel page table walk/modify functions can be used. Now, `init_mm` being special means that the locking for kernel page tables are special as well. On RISC-V the PGD (top-level page table structure), similar to x86, is shared (copied) with user processes. If the kernel PGD is modified, it has to be synched to user-mode processes PGDs. This is avoided by pre-populating the PGD, so it'll be fixed from boot. The in-kernel pgd regions are documented in `Documentation/arch/riscv/vm-layout.rst`. The distinct regions are: * vmemmap * vmalloc/ioremap space * direct mapping of all physical memory * kasan * modules, BPF * kernel Memory hotplug is the process of adding/removing memory to/from the kernel. Adding is done in two phases: 1. Add the memory to the kernel 2. Online memory, making it available to the page allocator. Step 1 is partially architecture dependent, and updates the init_mm page table: * Update the direct map page tables. The direct map is a linear map, representing all physical memory: `virt = phys + PAGE_OFFSET` * Add a `struct page` for each added page of memory. Update the vmemmap (virtual mapping to the `struct page`, so we can easily transform a kernel virtual address to a `struct page *` address. From an MHP perspective, there are two regions of the PGD that are updated: * vmemmap * direct mapping of all physical memory The `struct mm_struct` has a couple of locks in play: * `spinlock_t page_table_lock` protects the page table, and some counters * `struct rw_semaphore mmap_lock` protect an mm's VMAs Note again that `init_mm` does not contain any VMAs, but still uses the mmap_lock in some places. The `page_table_lock` was originally used to to protect all pages tables, but more recently a split page table lock has been introduced. The split lock has a per-table lock for the PTE and PMD tables. If split lock is disabled, all tables are guarded by `mm->page_table_lock` (for user processes). Split page table locks are not used for init_mm. MHP operations is typically synchronized using `DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock)`. Actors ------ The following non-MHP actors in the kernel traverses (read), and/or modifies the kernel PGD. * `ptdump` Walks the entire `init_mm`, via `ptdump_walk_pgd()` with the `mmap_write_lock(init_mm)` taken. Observation: ptdump can race with MHP, and needs additional locking to avoid crashes/races. * `set_direct_*` / `arch/riscv/mm/pageattr.c` The `set_direct_*` functionality is used to "synchronize" the direct map to other kernel mappings, e.g. modules/kernel text. The direct map is using "as large huge table mappings as possible", which means that the `set_direct_*` might need to split the direct map. The `set_direct_*` functions operates with the `mmap_write_lock(init_mm)` taken. Observation: `set_direct_*` uses the direct map, but will never modify the same entry as MHP. If there is a mapping, that entry will never race with MHP. Further, MHP acts when memory is offline. * HVO / `mm/hugetlb_vmemmap` HVO optimizes the backing `struct page` for hugetlb pages, which means changing the "vmemmap" region. HVO can split (merge?) a vmemmap pmd. However, it will never race with MHP, since HVO only operates at online memory. HVO cannot touch memory being MHP added or removed. * `apply_to_page_range` Walks a range, creates pages and applies a callback (setting permissions) for the page. When creating a table, it might use `int __pte_alloc_kernel(pmd_t *pmd)` which takes the `init_mm.page_table_lock` to synchronize pmd populate. Used by: `mm/vmalloc.c` and `mm/kasan/shadow.c`. The KASAN callback takes the `init_mm.page_table_lock` to synchronize pte creation. Observations: `apply_to_page_range` applies to the "vmalloc/ioremap space" region, and "kasan" region. *Not* affected by MHP. * `apply_to_existing_page_range` Walks a range, applies a callback (setting permissions) for the page (no page creation). Used by: `kernel/bpf/arena.c` and `mm/kasan/shadow.c`. The KASAN callback takes the `init_mm.page_table_lock` to synchronize pte creation. *Not* affected by MHP regions. * `apply_to_existing_page_range` applies to the "vmalloc/ioremap space" region, and "kasan" region. *Not* affected by MHP regions. * `ioremap_page_range` and `vmap_page_range` Uses the same internal function, and might create table entries at the "vmalloc/ioremap space" region. Can call `__pte_alloc_kernel()` which takes the `init_mm.page_table_lock` synchronizing pmd populate in the region. *Not* affected by MHP regions. Summary: * MHP add will never modify the same page table entries, as any of the other actors. * MHP remove is done when memory is offlined, and will not clash with any of the actors. * Functions that walk the entire kernel page table need synchronization * It's sufficient to add the MHP lock ptdump. Testing ======= This series adds basic DT supported hotplugging. There is a QEMU series enabling MHP for the RISC-V "virt" machine here: [1] ACPI/MSI support is still in the making for RISC-V, and prior proper (ACPI) PCI MSI support lands [2] and NUMA SRAT support [3], it hard to try it out. I've prepared a QEMU branch with proper ACPI GED/PC-DIMM support [4], and a this series with the required prerequisites [5] (AIA, ACPI AIA MADT, ACPI NUMA SRAT). To test with virtio-mem, e.g.: | qemu-system-riscv64 \ | -machine virt,aia=aplic-imsic \ | -cpu rv64,v=true,vlen=256,elen=64,h=true,zbkb=on,zbkc=on,zbkx=on,zkr=on,zkt=on,svinval=on,svnapot=on,svpbmt=on \ | -nodefaults \ | -nographic -smp 8 -kernel rv64-u-boot.bin \ | -drive file=rootfs.img,format=raw,if=virtio \ | -device virtio-rng-pci \ | -m 16G,slots=3,maxmem=32G \ | -object memory-backend-ram,id=mem0,size=16G \ | -numa node,nodeid=0,memdev=mem0 \ | -serial chardev:char0 \ | -mon chardev=char0,mode=readline \ | -chardev stdio,mux=on,id=char0 \ | -device pci-serial,id=serial0,chardev=char0 \ | -object memory-backend-ram,id=vmem0,size=2G \ | -device virtio-mem-pci,id=vm0,memdev=vmem0,node=0 where "rv64-u-boot.bin" is U-boot with EFI/ACPI-support (use [6] if you're lazy). In the QEMU monitor: | (qemu) info memory-devices | (qemu) qom-set vm0 requested-size 1G ...to test DAX/KMEM, use the follow QEMU parameters: | -object memory-backend-file,id=mem1,share=on,mem-path=virtio_pmem.img,size=4G \ | -device virtio-pmem-pci,memdev=mem1,id=nv1 and the regular ndctl/daxctl dance. If you're brave to try the ACPI branch, add "acpi=on" to "-machine virt", and test PC-DIMM MHP (in addition to virtio-{p},mem): In the QEMU monitor: | (qemu) object_add memory-backend-ram,id=mem1,size=1G | (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 You can also try hot-remove with some QEMU options, say: | -object memory-backend-file,id=mem-1,size=256M,mem-path=/pagesize-2MB | -device pc-dimm,id=mem1,memdev=mem-1 | -object memory-backend-file,id=mem-2,size=1G,mem-path=/pagesize-1GB | -device pc-dimm,id=mem2,memdev=mem-2 | -object memory-backend-file,id=mem-3,size=256M,mem-path=/pagesize-2MB | -device pc-dimm,id=mem3,memdev=mem-3 Remove "acpi=on" to run with DT. Thanks to Alex, Andrew, David, and Oscar for all comments/tests/fixups. References ========== [1] https://lore.kernel.org/qemu-devel/20240521105635.795211-1-bjorn@kernel.org/ [2] https://lore.kernel.org/linux-riscv/20240501121742.1215792-1-sunilvl@ventanamicro.com/ [3] https://lore.kernel.org/linux-riscv/cover.1713778236.git.haibo1.xu@intel.com/ [4] https://github.com/bjoto/qemu/commits/virtio-mem-pc-dimm-mhp-acpi-v2/ [5] https://github.com/bjoto/linux/commits/mhp-v4-acpi [6] https://github.com/bjoto/riscv-rootfs-utils/tree/acpi * b4-shazam-merge: riscv: Enable DAX VMEMMAP optimization riscv: mm: Add support for ZONE_DEVICE virtio-mem: Enable virtio-mem for RISC-V riscv: Enable memory hotplugging for RISC-V riscv: mm: Take memory hotplug read-lock during kernel page table dump riscv: mm: Add memory hotplugging support riscv: mm: Add pfn_to_kaddr() implementation riscv: mm: Refactor create_linear_mapping_range() for memory hot add riscv: mm: Change attribute from __init to __meminit for page functions riscv: mm: Pre-allocate vmemmap/direct map/kasan PGD entries riscv: mm: Properly forward vmemmap_populate() altmap parameter Link: https://lore.kernel.org/r/20240605114100.315918-1-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26riscv: Enable DAX VMEMMAP optimizationBjörn Töpel
Now that DAX is usable, enable the DAX VMEMMAP optimization as well. Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20240605114100.315918-12-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26riscv: mm: Add support for ZONE_DEVICEBjörn Töpel
ZONE_DEVICE pages need DEVMAP PTEs support to function (ARCH_HAS_PTE_DEVMAP). Claim another RSW (reserved for software) bit in the PTE for DEVMAP mark, add the corresponding helpers, and enable ARCH_HAS_PTE_DEVMAP for riscv64. Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20240605114100.315918-11-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26riscv: Enable memory hotplugging for RISC-VBjörn Töpel
Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for RISC-V. Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20240605114100.315918-9-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26riscv: mm: Take memory hotplug read-lock during kernel page table dumpBjörn Töpel
During memory hot remove, the ptdump functionality can end up touching stale data. Avoid any potential crashes (or worse), by holding the memory hotplug read-lock while traversing the page table. This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm: Hold memory hotplug lock while walking for kernel page table dump"). Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20240605114100.315918-8-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26riscv: mm: Add memory hotplugging supportBjörn Töpel
For an architecture to support memory hotplugging, a couple of callbacks needs to be implemented: arch_add_memory() This callback is responsible for adding the physical memory into the direct map, and call into the memory hotplugging generic code via __add_pages() that adds the corresponding struct page entries, and updates the vmemmap mapping. arch_remove_memory() This is the inverse of the callback above. vmemmap_free() This function tears down the vmemmap mappings (if CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the backing vmemmap pages. Note that for persistent memory, an alternative allocator for the backing pages can be used; The vmem_altmap. This means that when the backing pages are cleared, extra care is needed so that the correct deallocation method is used. arch_get_mappable_range() This functions returns the PA range that the direct map can map. Used by the MHP internals for sanity checks. The page table unmap/teardown functions are heavily based on code from the x86 tree. The same remove_pgd_mapping() function is used in both vmemmap_free() and arch_remove_memory(), but in the latter function the backing pages are not removed. Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20240605114100.315918-7-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26riscv: mm: Add pfn_to_kaddr() implementationBjörn Töpel
The pfn_to_kaddr() function is used by KASAN's memory hotplugging path. Add the missing function to the RISC-V port, so that it can be built with MHP and CONFIG_KASAN. Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20240605114100.315918-6-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26riscv: mm: Refactor create_linear_mapping_range() for memory hot addBjörn Töpel
Add a parameter to the direct map setup function, so it can be used in arch_add_memory() later. Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20240605114100.315918-5-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26riscv: mm: Change attribute from __init to __meminit for page functionsBjörn Töpel
Prepare for memory hotplugging support by changing from __init to __meminit for the page table functions that are used by the upcoming architecture specific callbacks. Changing the __init attribute to __meminit, avoids that the functions are removed after init. The __meminit attribute makes sure the functions are kept in the kernel text post init, but only if memory hotplugging is enabled for the build. Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20240605114100.315918-4-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>