summaryrefslogtreecommitdiff
path: root/arch/arm64/kernel
AgeCommit message (Collapse)Author
2024-07-11Merge branches 'for-next/cpufeature', 'for-next/misc', 'for-next/kselftest', ↵Catalin Marinas
'for-next/mte', 'for-next/errata', 'for-next/acpi', 'for-next/gic-v3-pmr' and 'for-next/doc', remote-tracking branch 'arm64/for-next/perf' into for-next/core * arm64/for-next/perf: perf: add missing MODULE_DESCRIPTION() macros perf: arm_pmuv3: Include asm/arm_pmuv3.h from linux/perf/arm_pmuv3.h perf: arm_v6/7_pmu: Drop non-DT probe support perf/arm: Move 32-bit PMU drivers to drivers/perf/ perf: arm_pmuv3: Drop unnecessary IS_ENABLED(CONFIG_ARM64) check perf: arm_pmuv3: Avoid assigning fixed cycle counter with threshold perf: imx_perf: add support for i.MX95 platform perf: imx_perf: fix counter start and config sequence perf: imx_perf: refactor driver for imx93 perf: imx_perf: let the driver manage the counter usage rather the user perf: imx_perf: add macro definitions for parsing config attr dt-bindings: perf: fsl-imx-ddr: Add i.MX95 compatible perf: pmuv3: Add new Cortex and Neoverse PMUs dt-bindings: arm: pmu: Add new Cortex and Neoverse cores perf/arm-cmn: Enable support for tertiary match group perf/arm-cmn: Decouple wp_config registers from filter group number * for-next/cpufeature: : Various cpufeature infrastructure patches arm64/cpufeature: Replace custom macros with fields from ID_AA64PFR0_EL1 KVM: arm64: Replace custom macros with fields from ID_AA64PFR0_EL1 arm64/cpufeatures/kvm: Add ARMv8.9 FEAT_ECBHB bits in ID_AA64MMFR1 register * for-next/misc: : Miscellaneous patches arm64: smp: Fix missing IPI statistics arm64: Cleanup __cpu_set_tcr_t0sz() arm64/mm: Stop using ESR_ELx_FSC_TYPE during fault arm64: Kconfig: fix typo in __builtin_return_adddress ARM64: reloc_test: add missing MODULE_DESCRIPTION() macro arm64: implement raw_smp_processor_id() using thread_info arm64/arch_timer: include <linux/percpu.h> * for-next/kselftest: : arm64 kselftest updates selftests: arm64: tags: remove the result script selftests: arm64: tags_test: conform test to TAP output kselftest/arm64: Fix a couple of spelling mistakes kselftest/arm64: Fix redundancy of a testcase kselftest/arm64: Include kernel mode NEON in fp-stress * for-next/mte: : MTE updates arm64: mte: Make mte_check_tfsr_*() conditional on KASAN instead of MTE * for-next/errata: : Arm CPU errata workarounds arm64: errata: Expand speculative SSBS workaround arm64: errata: Unify speculative SSBS errata logic arm64: cputype: Add Cortex-X925 definitions arm64: cputype: Add Cortex-A720 definitions arm64: cputype: Add Cortex-X3 definitions * for-next/acpi: : arm64 ACPI patches ACPI: Add acpi=nospcr to disable ACPI SPCR as default console on ARM64 ACPI / amba: Drop unnecessary check for registered amba_dummy_clk arm64: FFH: Move ACPI specific code into drivers/acpi/arm64/ arm64: cpuidle: Move ACPI specific code into drivers/acpi/arm64/ ACPI: arm64: Sort entries alphabetically * for-next/gic-v3-pmr: : arm64: irqchip/gic-v3: Use compiletime constant PMR values arm64: irqchip/gic-v3: Select priorities at boot time irqchip/gic-v3: Detect GICD_CTRL.DS and SCR_EL3.FIQ earlier irqchip/gic-v3: Make distributor priorities variables irqchip/gic-common: Remove sync_access callback wordpart.h: Add REPEAT_BYTE_U32() * for-next/doc: : arm64 documentation updates Documentation: arm64: Update memory.rst for TBI
2024-07-08arm64: smp: Fix missing IPI statisticsJinjie Ruan
commit 83cfac95c018 ("genirq: Allow interrupts to be excluded from /proc/interrupts") is to avoid IPIs appear twice in /proc/interrupts. But the commit 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") and commit 2f5cd0c7ffde("arm64: kgdb: Implement kgdb_roundup_cpus() to enable pseudo-NMI roundup") set CPU_BACKTRACE and KGDB_ROUNDUP IPIs "IRQ_HIDDEN" flag but not show them in arch_show_interrupts(), which cause the interrupt kstat_irqs accounting is missing in display. Before this patch, CPU_BACKTRACE and KGDB_ROUNDUP IPIs are missing: / # cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 11: 466 600 309 332 GICv3 27 Level arch_timer 13: 24 0 0 0 GICv3 33 Level uart-pl011 15: 64 0 0 0 GICv3 78 Edge virtio0 16: 0 0 0 0 GICv3 79 Edge virtio1 17: 0 0 0 0 GICv3 34 Level rtc-pl031 18: 3 3 3 3 GICv3 23 Level arm-pmu 19: 0 0 0 0 9030000.pl061 3 Edge GPIO Key Poweroff IPI0: 7 14 9 26 Rescheduling interrupts IPI1: 354 93 233 255 Function call interrupts IPI2: 0 0 0 0 CPU stop interrupts IPI3: 0 0 0 0 CPU stop (for crash dump) interrupts IPI4: 0 0 0 0 Timer broadcast interrupts IPI5: 1 0 0 0 IRQ work interrupts Err: 0 After this pacth, CPU_BACKTRACE and KGDB_ROUNDUP IPIs are displayed: / # cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 11: 393 281 532 449 GICv3 27 Level arch_timer 13: 15 0 0 0 GICv3 33 Level uart-pl011 15: 64 0 0 0 GICv3 78 Edge virtio0 16: 0 0 0 0 GICv3 79 Edge virtio1 17: 0 0 0 0 GICv3 34 Level rtc-pl031 18: 2 2 2 2 GICv3 23 Level arm-pmu 19: 0 0 0 0 9030000.pl061 3 Edge GPIO Key Poweroff IPI0: 11 19 4 23 Rescheduling interrupts IPI1: 279 347 222 72 Function call interrupts IPI2: 0 0 0 0 CPU stop interrupts IPI3: 0 0 0 0 CPU stop (for crash dump) interrupts IPI4: 0 0 0 0 Timer broadcast interrupts IPI5: 1 0 0 1 IRQ work interrupts IPI6: 0 0 0 0 CPU backtrace interrupts IPI7: 0 0 0 0 KGDB roundup interrupts Err: 0 Fixes: 331a1b3a836c ("arm64: smp: Add arch support for backtrace using pseudo-NMI") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Suggested-by: Doug Anderson <dianders@chromium.org> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20240620063600.573559-1-ruanjinjie@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-07-04ACPI: Add acpi=nospcr to disable ACPI SPCR as default console on ARM64Liu Wei
For varying privacy and security reasons, sometimes we would like to completely silence the _serial_ console, and only enable it when needed. But there are many existing systems that depend on this _serial_ console, so add acpi=nospcr to disable console in ACPI SPCR table as default _serial_ console. Signed-off-by: Liu Wei <liuwei09@cestc.cn> Suggested-by: Prarit Bhargava <prarit@redhat.com> Suggested-by: Will Deacon <will@kernel.org> Suggested-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Prarit Bhargava <prarit@redhat.com> Link: https://lore.kernel.org/r/20240625030504.58025-1-liuwei09@cestc.cn Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-07-04arm64/cpufeature: Replace custom macros with fields from ID_AA64PFR0_EL1Anshuman Khandual
This replaces custom macros usage (i.e ID_AA64PFR0_EL1_ELx_64BIT_ONLY and ID_AA64PFR0_EL1_ELx_32BIT_64BIT) and instead directly uses register fields from ID_AA64PFR0_EL1 sysreg definition. Finally let's drop off both these custom macros as they are now redundant. Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20240613102710.3295108-3-anshuman.khandual@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-24arm64: irqchip/gic-v3: Select priorities at boot timeMark Rutland
The distributor and PMR/RPR can present different views of the interrupt priority space dependent upon the values of GICD_CTLR.DS and SCR_EL3.FIQ. Currently we treat the distributor's view of the priority space as canonical, and when the two differ we change the way we handle values in the PMR/RPR, using the `gic_nonsecure_priorities` static key to decide what to do. This approach works, but it's sub-optimal. When using pseudo-NMI we manipulate the distributor rarely, and we manipulate the PMR/RPR registers very frequently in code spread out throughout the kernel (e.g. local_irq_{save,restore}()). It would be nicer if we could use fixed values for the PMR/RPR, and dynamically choose the values programmed into the distributor. This patch changes the GICv3 driver and arm64 code accordingly. PMR values are chosen at compile time, and the GICv3 driver determines the appropriate values to program into the distributor at boot time. This removes the need for the `gic_nonsecure_priorities` static key and results in smaller and better generated code for saving/restoring the irqflags. Before this patch, local_irq_disable() compiles to: | 0000000000000000 <outlined_local_irq_disable>: | 0: d503201f nop | 4: d50343df msr daifset, #0x3 | 8: d65f03c0 ret | c: d503201f nop | 10: d2800c00 mov x0, #0x60 // #96 | 14: d5184600 msr icc_pmr_el1, x0 | 18: d65f03c0 ret | 1c: d2801400 mov x0, #0xa0 // #160 | 20: 17fffffd b 14 <outlined_local_irq_disable+0x14> After this patch, local_irq_disable() compiles to: | 0000000000000000 <outlined_local_irq_disable>: | 0: d503201f nop | 4: d50343df msr daifset, #0x3 | 8: d65f03c0 ret | c: d2801800 mov x0, #0xc0 // #192 | 10: d5184600 msr icc_pmr_el1, x0 | 14: d65f03c0 ret ... with 3 fewer instructions per call. For defconfig + CONFIG_PSEUDO_NMI=y, this results in a minor saving of ~4K of text, and will make it easier to make further improvements to the way we manipulate irqflags and DAIF bits. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Tested-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240617111841.2529370-6-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de>
2024-06-13ARM64: reloc_test: add missing MODULE_DESCRIPTION() macroJeff Johnson
With ARCH=arm64, make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm64/kernel/arm64-reloc-test.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240612-md-arch-arm64-kernel-v1-1-1fafe8d11df3@quicinc.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-13arm64: FFH: Move ACPI specific code into drivers/acpi/arm64/Sudeep Holla
The ACPI FFH Opregion code can be moved out of arm64 arch code as it just uses SMCCC. Move all the ACPI FFH Opregion code into drivers/acpi/arm64/ffh.c Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/20240605131458.3341095-4-sudeep.holla@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-13arm64: cpuidle: Move ACPI specific code into drivers/acpi/arm64/Sudeep Holla
The ACPI cpuidle LPI FFH code can be moved out of arm64 arch code as it just uses SMCCC. Move all the ACPI cpuidle LPI FFH code into drivers/acpi/arm64/cpuidle.c Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/20240605131458.3341095-3-sudeep.holla@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-12arm64: errata: Expand speculative SSBS workaroundMark Rutland
A number of Arm Ltd CPUs suffer from errata whereby an MSR to the SSBS special-purpose register does not affect subsequent speculative instructions, permitting speculative store bypassing for a window of time. We worked around this for Cortex-X4 and Neoverse-V3, in commit: 7187bb7d0b5c7dfa ("arm64: errata: Add workaround for Arm errata 3194386 and 3312417") ... as per their Software Developer Errata Notice (SDEN) documents: * Cortex-X4 SDEN v8.0, erratum 3194386: https://developer.arm.com/documentation/SDEN-2432808/0800/ * Neoverse-V3 SDEN v6.0, erratum 3312417: https://developer.arm.com/documentation/SDEN-2891958/0600/ Since then, similar errata have been published for a number of other Arm Ltd CPUs, for which the mitigation is the same. This is described in their respective SDEN documents: * Cortex-A710 SDEN v19.0, errataum 3324338 https://developer.arm.com/documentation/SDEN-1775101/1900/?lang=en * Cortex-A720 SDEN v11.0, erratum 3456091 https://developer.arm.com/documentation/SDEN-2439421/1100/?lang=en * Cortex-X2 SDEN v19.0, erratum 3324338 https://developer.arm.com/documentation/SDEN-1775100/1900/?lang=en * Cortex-X3 SDEN v14.0, erratum 3324335 https://developer.arm.com/documentation/SDEN-2055130/1400/?lang=en * Cortex-X925 SDEN v8.0, erratum 3324334 https://developer.arm.com/documentation/109108/800/?lang=en * Neoverse-N2 SDEN v17.0, erratum 3324339 https://developer.arm.com/documentation/SDEN-1982442/1700/?lang=en * Neoverse-V2 SDEN v9.0, erratum 3324336 https://developer.arm.com/documentation/SDEN-2332927/900/?lang=en Note that due to shared design lineage, some CPUs share the same erratum number. Add these to the existing mitigation under CONFIG_ARM64_ERRATUM_3194386. As listing all of the erratum IDs in the runtime description would be unwieldy, this is reduced to: "SSBS not fully self-synchronizing" ... matching the description of the errata in all of the SDENs. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20240603111812.1514101-6-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-12arm64: errata: Unify speculative SSBS errata logicMark Rutland
Cortex-X4 erratum 3194386 and Neoverse-V3 erratum 3312417 are identical, with duplicate Kconfig text and some unsightly ifdeffery. While we try to share code behind CONFIG_ARM64_WORKAROUND_SPECULATIVE_SSBS, having separate options results in a fair amount of boilerplate code, and this will only get worse as we expand the set of affected CPUs. To reduce this boilerplate, unify the two behind a common Kconfig option. This removes the duplicate text and Kconfig logic, and removes the need for the intermediate ARM64_WORKAROUND_SPECULATIVE_SSBS option. The set of affected CPUs is described as a list so that this can easily be extended. I've used ARM64_ERRATUM_3194386 (matching the Neoverse-V3 erratum ID) as the common option, matching the way we use ARM64_ERRATUM_1319367 to cover Cortex-A57 erratum 1319537 and Cortex-A72 erratum 1319367. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <wilL@kernel.org> Link: https://lore.kernel.org/r/20240603111812.1514101-5-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-12arm64: implement raw_smp_processor_id() using thread_infoPuranjay Mohan
Historically, arm64 implemented raw_smp_processor_id() as a read of current_thread_info()->cpu. This changed when arm64 moved thread_info into task struct, as at the time CONFIG_THREAD_INFO_IN_TASK made core code use thread_struct::cpu for the cpu number, and due to header dependencies prevented using this in raw_smp_processor_id(). As a workaround, we moved to using a percpu variable in commit: 57c82954e77fa12c ("arm64: make cpu number a percpu variable") Since then, thread_info::cpu was reintroduced, and core code was made to use this in commits: 001430c1910df65a ("arm64: add CPU field to struct thread_info") bcf9033e5449bdca ("sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y") Consequently it is possible to use current_thread_info()->cpu again. This decreases the number of emitted instructions like in the following example: Dump of assembler code for function bpf_get_smp_processor_id: 0xffff8000802cd608 <+0>: nop 0xffff8000802cd60c <+4>: nop 0xffff8000802cd610 <+8>: adrp x0, 0xffff800082138000 0xffff8000802cd614 <+12>: mrs x1, tpidr_el1 0xffff8000802cd618 <+16>: add x0, x0, #0x8 0xffff8000802cd61c <+20>: ldrsw x0, [x0, x1] 0xffff8000802cd620 <+24>: ret After this patch: Dump of assembler code for function bpf_get_smp_processor_id: 0xffff8000802c9130 <+0>: nop 0xffff8000802c9134 <+4>: nop 0xffff8000802c9138 <+8>: mrs x0, sp_el0 0xffff8000802c913c <+12>: ldr w0, [x0, #24] 0xffff8000802c9140 <+16>: ret A microbenchmark[1] was built to measure the performance improvement provided by this change. It calls the following function given number of times and finds the runtime overhead: static noinline int get_cpu_id(void) { return smp_processor_id(); } Run the benchmark like: modprobe smp_processor_id nr_function_calls=1000000000 +--------------------------+------------------------+ | | Number of Calls | Time taken | +--------+-----------------+------------------------+ | Before | 1000000000 | 1602888401ns | +--------+-----------------+------------------------+ | After | 1000000000 | 1206212658ns | +--------+-----------------+------------------------+ | Difference (decrease) | 396675743ns (24.74%) | +---------------------------------------------------+ Remove the percpu variable cpu_number as it is used only in set_smp_ipi_range() as a dummy variable to be passed to ipi_handler(). Use irq_stat in place of cpu_number here like arm32. [1] https://github.com/puranjaymohan/linux/commit/77d3fdd Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/20240503171847.68267-2-puranjay@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-12arm64/cpufeatures/kvm: Add ARMv8.9 FEAT_ECBHB bits in ID_AA64MMFR1 registerNianyao Tang
Enable ECBHB bits in ID_AA64MMFR1 register as per ARM DDI 0487K.a specification. When guest OS read ID_AA64MMFR1_EL1, kvm emulate this reg using ftr_id_aa64mmfr1 and always return ID_AA64MMFR1_EL1.ECBHB=0 to guest. It results in guest syscall jump to tramp ventry, which is not needed in implementation with ID_AA64MMFR1_EL1.ECBHB=1. Let's make the guest syscall process the same as the host. Signed-off-by: Nianyao Tang <tangnianyao@huawei.com> Link: https://lore.kernel.org/r/20240611122049.2758600-1-tangnianyao@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-05arm64: armv8_deprecated: Fix warning in isndep cpuhp starting processWei Li
The function run_all_insn_set_hw_mode() is registered as startup callback of 'CPUHP_AP_ARM64_ISNDEP_STARTING', it invokes set_hw_mode() methods of all emulated instructions. As the STARTING callbacks are not expected to fail, if one of the set_hw_mode() fails, e.g. due to el0 mixed-endian is not supported for 'setend', it will report a warning: ``` CPU[2] cannot support the emulation of setend CPU 2 UP state arm64/isndep:starting (136) failed (-22) CPU2: Booted secondary processor 0x0000000002 [0x414fd0c1] ``` To fix it, add a check for INSN_UNAVAILABLE status and skip the process. Signed-off-by: Wei Li <liwei391@huawei.com> Tested-by: Huisong Li <lihuisong@huawei.com> Link: https://lore.kernel.org/r/20240423093501.3460764-1-liwei391@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2024-05-25Merge tag 'mm-hotfixes-stable-2024-05-25-09-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes, 11 of which are cc:stable. A few nilfs2 fixes, the remainder are for MM: a couple of selftests fixes, various singletons fixing various issues in various parts" * tag 'mm-hotfixes-stable-2024-05-25-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/ksm: fix possible UAF of stable_node mm/memory-failure: fix handling of dissolved but not taken off from buddy pages mm: /proc/pid/smaps_rollup: avoid skipping vma after getting mmap_lock again nilfs2: fix potential hang in nilfs_detach_log_writer() nilfs2: fix unexpected freezing of nilfs_segctor_sync() nilfs2: fix use-after-free of timer for log writer thread selftests/mm: fix build warnings on ppc64 arm64: patching: fix handling of execmem addresses selftests/mm: compaction_test: fix bogus test success and reduce probability of OOM-killer invocation selftests/mm: compaction_test: fix incorrect write of zero to nr_hugepages selftests/mm: compaction_test: fix bogus test success on Aarch64 mailmap: update email address for Satya Priya mm/huge_memory: don't unpoison huge_zero_folio kasan, fortify: properly rename memintrinsics lib: add version into /proc/allocinfo output mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL
2024-05-24arm64: patching: fix handling of execmem addressesWill Deacon
Klara Modin reported warnings for a kernel configured with BPF_JIT but without MODULES: [ 44.131296] Trying to vfree() bad address (000000004a17c299) [ 44.138024] WARNING: CPU: 1 PID: 193 at mm/vmalloc.c:3189 remove_vm_area (mm/vmalloc.c:3189 (discriminator 1)) [ 44.146675] CPU: 1 PID: 193 Comm: kworker/1:2 Tainted: G D W 6.9.0-01786-g2c9e5d4a0082 #25 [ 44.158229] Hardware name: Raspberry Pi 3 Model B (DT) [ 44.164433] Workqueue: events bpf_prog_free_deferred [ 44.170492] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 44.178601] pc : remove_vm_area (mm/vmalloc.c:3189 (discriminator 1)) [ 44.183705] lr : remove_vm_area (mm/vmalloc.c:3189 (discriminator 1)) [ 44.188772] sp : ffff800082a13c70 [ 44.193112] x29: ffff800082a13c70 x28: 0000000000000000 x27: 0000000000000000 [ 44.201384] x26: 0000000000000000 x25: ffff00003a44efa0 x24: 00000000d4202000 [ 44.209658] x23: ffff800081223dd0 x22: ffff00003a198a40 x21: ffff8000814dd880 [ 44.217924] x20: 00000000d4202000 x19: ffff8000814dd880 x18: 0000000000000006 [ 44.226206] x17: 0000000000000000 x16: 0000000000000020 x15: 0000000000000002 [ 44.234460] x14: ffff8000811a6370 x13: 0000000020000000 x12: 0000000000000000 [ 44.242710] x11: ffff8000811a6370 x10: 0000000000000144 x9 : ffff8000811fe370 [ 44.250959] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000811fe370 [ 44.259206] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 44.267457] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000002203240 [ 44.275703] Call trace: [ 44.279158] remove_vm_area (mm/vmalloc.c:3189 (discriminator 1)) [ 44.283858] vfree (mm/vmalloc.c:3322) [ 44.287835] execmem_free (mm/execmem.c:70) [ 44.292347] bpf_jit_free_exec+0x10/0x1c [ 44.297283] bpf_prog_pack_free (kernel/bpf/core.c:1006) [ 44.302457] bpf_jit_binary_pack_free (kernel/bpf/core.c:1195) [ 44.307951] bpf_jit_free (include/linux/filter.h:1083 arch/arm64/net/bpf_jit_comp.c:2474) [ 44.312342] bpf_prog_free_deferred (kernel/bpf/core.c:2785) [ 44.317785] process_one_work (kernel/workqueue.c:3273) [ 44.322684] worker_thread (kernel/workqueue.c:3342 (discriminator 2) kernel/workqueue.c:3429 (discriminator 2)) [ 44.327292] kthread (kernel/kthread.c:388) [ 44.331342] ret_from_fork (arch/arm64/kernel/entry.S:861) The problem is because bpf_arch_text_copy() silently fails to write to the read-only area as a result of patch_map() faulting and the resulting -EFAULT being chucked away. Update patch_map() to use CONFIG_EXECMEM instead of CONFIG_STRICT_MODULE_RWX to check for vmalloc addresses. Link: https://lkml.kernel.org/r/20240521213813.703309-1-rppt@kernel.org Fixes: 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of") Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reported-by: Klara Modin <klarasmodin@gmail.com> Closes: https://lore.kernel.org/all/7983fbbf-0127-457c-9394-8d6e4299c685@gmail.com Tested-by: Klara Modin <klarasmodin@gmail.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-23Merge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
2024-05-23Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The major fix here is for a filesystem corruption issue reported on Apple M1 as a result of buggy management of the floating point register state introduced in 6.8. I initially reverted one of the offending patches, but in the end Ard cooked a proper fix so there's a revert+reapply in the series. Aside from that, we've got some CPU errata workarounds and misc other fixes. - Fix broken FP register state tracking which resulted in filesystem corruption when dm-crypt is used - Workarounds for Arm CPU errata affecting the SSBS Spectre mitigation - Fix lockdep assertion in DMC620 memory controller PMU driver - Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is disabled" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64/fpsimd: Avoid erroneous elide of user state reload Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY perf/arm-dmc620: Fix lockdep assert in ->event_init() Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: errata: Add workaround for Arm errata 3194386 and 3312417 arm64: cputype: Add Neoverse-V3 definitions arm64: cputype: Add Cortex-X4 definitions arm64: barrier: Restore spec_bar() macro
2024-05-22tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22arm64/fpsimd: Avoid erroneous elide of user state reloadArd Biesheuvel
TIF_FOREIGN_FPSTATE is a 'convenience' flag that should reflect whether the current CPU holds the most recent user mode FP/SIMD state of the current task. It combines two conditions: - whether the current CPU's FP/SIMD state belongs to the task; - whether that state is the most recent associated with the task (as a task may have executed on other CPUs as well). When a task is scheduled in and TIF_KERNEL_FPSTATE is set, it means the task was in a kernel mode NEON section when it was scheduled out, and so the kernel mode FP/SIMD state is restored. Since this implies that the current CPU is *not* holding the most recent user mode FP/SIMD state of the current task, the TIF_FOREIGN_FPSTATE flag is set too, so that the user mode FP/SIMD state is reloaded from memory when returning to userland. However, the task may be scheduled out after completing the kernel mode NEON section, but before returning to userland. When this happens, the TIF_FOREIGN_FPSTATE flag will not be preserved, but will be set as usual the next time the task is scheduled in, and will be based on the above conditions. This means that, rather than setting TIF_FOREIGN_FPSTATE when scheduling in a task with TIF_KERNEL_FPSTATE set, the underlying state should be updated so that TIF_FOREIGN_FPSTATE will assume the expected value as a result. So instead, call fpsimd_flush_cpu_state(), which takes care of this. Closes: https://lore.kernel.org/all/cb8822182231850108fa43e0446a4c7f@kernel.org Reported-by: Johannes Nixdorf <mixi@shadowice.org> Fixes: aefbab8e77eb ("arm64: fpsimd: Preserve/restore kernel mode NEON at context switch") Cc: Mark Brown <broonie@kernel.org> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Janne Grunau <j@jannau.net> Cc: stable@vger.kernel.org Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Janne Grunau <j@jannau.net> Tested-by: Johannes Nixdorf <mixi@shadowice.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20240522091335.335346-2-ardb+git@google.com Signed-off-by: Will Deacon <will@kernel.org>
2024-05-22Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"Will Deacon
This reverts commit b8995a18417088bb53f87c49d200ec72a9dd4ec1. Ard managed to reproduce the dm-crypt corruption problem and got to the bottom of it, so re-apply the problematic patch in preparation for fixing things properly. Cc: stable@vger.kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Aside from the usual things this has an arch update for __iowrite64_copy() used by the RDMA drivers. This API was intended to generate large 64 byte MemWr TLPs on PCI. These days most processors had done this by just repeating writel() in a loop. S390 and some new ARM64 designs require a special helper to get this to generate. - Small improvements and fixes for erdma, efa, hfi1, bnxt_re - Fix a UAF crash after module unload on leaking restrack entry - Continue adding full RDMA support in mana with support for EQs, GID's and CQs - Improvements to the mkey cache in mlx5 - DSCP traffic class support in hns and several bug fixes - Cap the maximum number of MADs in the receive queue to avoid OOM - Another batch of rxe bug fixes from large scale testing - __iowrite64_copy() optimizations for write combining MMIO memory - Remove NULL checks before dev_put/hold() - EFA support for receive with immediate - Fix a recent memleaking regression in a cma error path" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (70 commits) RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw RDMA/IPoIB: Fix format truncation compilation errors bnxt_re: avoid shift undefined behavior in bnxt_qplib_alloc_init_hwq RDMA/efa: Support QP with unsolicited write w/ imm. receive IB/hfi1: Remove generic .ndo_get_stats64 IB/hfi1: Do not use custom stat allocator RDMA/hfi1: Use RMW accessors for changing LNKCTL2 RDMA/mana_ib: implement uapi for creation of rnic cq RDMA/mana_ib: boundary check before installing cq callbacks RDMA/mana_ib: introduce a helper to remove cq callbacks RDMA/mana_ib: create and destroy RNIC cqs RDMA/mana_ib: create EQs for RNIC CQs RDMA/core: Remove NULL check before dev_{put, hold} RDMA/ipoib: Remove NULL check before dev_{put, hold} RDMA/mlx5: Remove NULL check before dev_{put, hold} RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources. RDMA/core: Add an option to display driver-specific QPs in the rdmatool RDMA/efa: Add shutdown notifier RDMA/mana_ib: Fix missing ret value IB/mlx5: Use __iowrite64_copy() for write combining stores ...
2024-05-18Merge tag 'kbuild-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig * tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits) kconfig: use sym_get_choice_menu() in sym_check_prop() rapidio: remove choice for enumeration kconfig: lxdialog: remove initialization with A_NORMAL kconfig: m/nconf: merge two item_add_str() calls kconfig: m/nconf: remove dead code to display value of bool choice kconfig: m/nconf: remove dead code to display children of choice members kconfig: gconf: show checkbox for choice correctly kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal Makefile: remove redundant tool coverage variables kbuild: provide reasonable defaults for tool coverage modules: Drop the .export_symbol section from the final modules kconfig: use menu_list_for_each_sym() in sym_check_choice_deps() kconfig: use sym_get_choice_menu() in conf_write_defconfig() kconfig: add sym_get_choice_menu() helper kconfig: turn defaults and additional prompt for choice members into error kconfig: turn missing prompt for choice members into error kconfig: turn conf_choice() into void function kconfig: use linked list in sym_set_changed() kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED kconfig: gconf: remove debug code ...
2024-05-17Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD"Will Deacon
This reverts commit 2632e25217696712681dd1f3ecc0d71624ea3b23. Johannes (and others) report data corruption with dm-crypt on Apple M1 which has been bisected to this change. Revert the offending commit while we figure out what's going on. Cc: stable@vger.kernel.org Reported-by: Johannes Nixdorf <mixi@shadowice.org> Link: https://lore.kernel.org/all/D1B7GPIR9K1E.5JFV37G0YTIF@shadowice.org/ Signed-off-by: Will Deacon <will@kernel.org>
2024-05-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "ARM: - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greatly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements. LoongArch: - Add ParaVirt IPI support - Add software breakpoint support - Add mmio trace events support RISC-V: - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak - Some preparatory work for both TDX and SNP page fault handling. This also cleans up the page fault path, so that the priorities of various kinds of fauls (private page, no memory, write to read-only slot, etc.) are easier to follow. x86: - Minimize amount of time that shadow PTEs remain in the special REMOVED_SPTE state. This is a state where the mmu_lock is held for reading but concurrent accesses to the PTE have to spin; shortening its use allows other vCPUs to repopulate the zapped region while the zapper finishes tearing down the old, defunct page tables. - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is defined by hardware but left for software use. This lets KVM communicate its inability to map GPAs that set bits 51:48 on hosts without 5-level nested page tables. Guest firmware is expected to use the information when mapping BARs; this avoids that they end up at a legal, but unmappable, GPA. - Fixed a bug where KVM would not reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - As usual, a bunch of code cleanups. x86 (AMD): - Implement a new and improved API to initialize SEV and SEV-ES VMs, which will also be extendable to SEV-SNP. The new API specifies the desired encryption in KVM_CREATE_VM and then separately initializes the VM. The new API also allows customizing the desired set of VMSA features; the features affect the measurement of the VM's initial state, and therefore enabling them cannot be done tout court by the hypervisor. While at it, the new API includes two bugfixes that couldn't be applied to the old one without a flag day in userspace or without affecting the initial measurement. When a SEV-ES VM is created with the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected once the VMSA has been encrypted. Also, the FPU and AVX state will be synchronized and encrypted too. - Support for GHCB version 2 as applicable to SEV-ES guests. This, once more, is only accessible when using the new KVM_SEV_INIT2 flow for initialization of SEV-ES VMs. x86 (Intel): - An initial bunch of prerequisite patches for Intel TDX were merged. They generally don't do anything interesting. The only somewhat user visible change is a new debugging mode that checks that KVM's MMU never triggers a #VE virtualization exception in the guest. - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. Generic: - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Remove .change_pte() MMU notifier - the changes to non-KVM code are small and Andrew Morton asked that I also take those through the KVM tree. The callback was only ever implemented by KVM (which was also the original user of MMU notifiers) but it had been nonfunctional ever since calls to set_pte_at_notify were wrapped with invalidate_range_start and invalidate_range_end... in 2012. Selftests: - Enhance the demand paging test to allow for better reporting and stressing of UFFD performance. - Convert the steal time test to generate TAP-friendly output. - Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed time across two different clock domains. - Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT. - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell script, to play nice with running in a minimal userspace environment. - Allow skipping the RSEQ test's sanity check that the vCPU was able to complete a reasonable number of KVM_RUNs, as the assert can fail on a completely valid setup. If the test is run on a large-ish system that is otherwise idle, and the test isn't affined to a low-ish number of CPUs, the vCPU task can be repeatedly migrated to CPUs that are in deep sleep states, which results in the vCPU having very little net runtime before the next migration due to high wakeup latencies. - Define _GNU_SOURCE for all selftests to fix a warning that was introduced by a change to kselftest_harness.h late in the 6.9 cycle, and because forcing every test to #define _GNU_SOURCE is painful. - Provide a global pseudo-RNG instance for all tests, so that library code can generate random, but determinstic numbers. - Use the global pRNG to randomly force emulation of select writes from guest code on x86, e.g. to help validate KVM's emulation of locked accesses. - Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception handlers at VM creation, instead of forcing tests to manually trigger the related setup. Documentation: - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits) selftests/kvm: remove dead file KVM: selftests: arm64: Test vCPU-scoped feature ID registers KVM: selftests: arm64: Test that feature ID regs survive a reset KVM: selftests: arm64: Store expected register value in set_id_regs KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope KVM: arm64: Only reset vCPU-scoped feature ID regs once KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs() KVM: arm64: Rename is_id_reg() to imply VM scope KVM: arm64: Destroy mpidr_data for 'late' vCPU creation KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support KVM: arm64: Fix hvhe/nvhe early alias parsing KVM: SEV: Allow per-guest configuration of GHCB protocol version KVM: SEV: Add GHCB handling for termination requests KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests KVM: SEV: Add support to handle AP reset MSR protocol KVM: x86: Explicitly zero kvm_caps during vendor module load KVM: x86: Fully re-initialize supported_mce_cap on vendor module load KVM: x86: Fully re-initialize supported_vm_types on vendor module load KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values ...
2024-05-15Merge tag 'modules-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules updates from Luis Chamberlain: "Finally something fun. Mike Rapoport does some cleanup to allow us to take out module_alloc() out of modules into a new paint shedded execmem_alloc() and execmem_free() so to make emphasis these helpers are actually used outside of modules. It starts with a non-functional changes API rename / placeholders to then allow architectures to define their requirements into a new shiny struct execmem_info with ranges, and requirements for those ranges. Archs now can intitialize this execmem_info as the last part of mm_core_init() if they have to diverge from the norm. Each range is a known type clearly articulated and spelled out in enum execmem_type. Although a lot of this is major cleanup and prep work for future enhancements an immediate clear gain is we get to enable KPROBES without MODULES now. That is ultimately what motiviated to pick this work up again, now with smaller goal as concrete stepping stone" * tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of kprobes: remove dependency on CONFIG_MODULES powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate x86/ftrace: enable dynamic ftrace without CONFIG_MODULES arch: make execmem setup available regardless of CONFIG_MODULES powerpc: extend execmem_params for kprobes allocations arm64: extend execmem_info for generated code allocations riscv: extend execmem_params for generated code allocations mm/execmem, arch: convert remaining overrides of module_alloc to execmem mm/execmem, arch: convert simple overrides of module_alloc to execmem mm: introduce execmem_alloc() and execmem_free() module: make module_memory_{alloc,free} more self-contained sparc: simplify module_alloc() nios2: define virtual address space for modules mips: module: rename MODULE_START to MODULES_VADDR arm64: module: remove unneeded call to kasan_alloc_module_shadow() kallsyms: replace deprecated strncpy with strscpy module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
2024-05-14Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The most interesting parts are probably the mm changes from Ryan which optimise the creation of the linear mapping at boot and (separately) implement write-protect support for userfaultfd. Outside of our usual directories, the Kbuild-related changes under scripts/ have been acked by Masahiro whilst the drivers/acpi/ parts have been acked by Rafael and the addition of cpumask_any_and_but() has been acked by Yury. ACPI: - Support for the Firmware ACPI Control Structure (FACS) signature feature which is used to reboot out of hibernation on some systems Kbuild: - Support for building Flat Image Tree (FIT) images, where the kernel Image is compressed alongside a set of devicetree blobs Memory management: - Optimisation of our early page-table manipulation for creation of the linear mapping - Support for userfaultfd write protection, which brings along some nice cleanups to our handling of invalid but present ptes - Extend our use of range TLBI invalidation at EL1 Perf and PMUs: - Ensure that the 'pmu->parent' pointer is correctly initialised by PMU drivers - Avoid allocating 'cpumask_t' types on the stack in some PMU drivers - Fix parsing of the CPU PMU "version" field in assembly code, as it doesn't follow the usual architectural rules - Add best-effort unwinding support for USER_STACKTRACE - Minor driver fixes and cleanups Selftests: - Minor cleanups to the arm64 selftests (missing NULL check, unused variable) Miscellaneous: - Add a command-line alias for disabling 32-bit application support - Add part number for Neoverse-V2 CPUs - Minor fixes and cleanups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits) arm64/mm: Fix pud_user_accessible_page() for PGTABLE_LEVELS <= 2 arm64/mm: Add uffd write-protect support arm64/mm: Move PTE_PRESENT_INVALID to overlay PTE_NG arm64/mm: Remove PTE_PROT_NONE bit arm64/mm: generalize PMD_PRESENT_INVALID for all levels arm64: simplify arch_static_branch/_jump function arm64: Add USER_STACKTRACE support arm64: Add the arm64.no32bit_el0 command line option drivers/perf: hisi: hns3: Actually use devm_add_action_or_reset() drivers/perf: hisi: hns3: Fix out-of-bound access when valid event group drivers/perf: hisi_pcie: Fix out-of-bound access when valid event group kselftest: arm64: Add a null pointer check arm64: defer clearing DAIF.D arm64: assembler: update stale comment for disable_step_tsk arm64/sysreg: Update PIE permission encodings kselftest/arm64: Remove unused parameters in abi test perf/arm-spe: Assign parents for event_source device perf/arm-smmuv3: Assign parents for event_source device perf/arm-dsu: Assign parents for event_source device perf/arm-dmc620: Assign parents for event_source device ...
2024-05-14Makefile: remove redundant tool coverage variablesMasahiro Yamada
Now Kbuild provides reasonable defaults for objtool, sanitizers, and profilers. Remove redundant variables. Note: This commit changes the coverage for some objects: - include arch/mips/vdso/vdso-image.o into UBSAN, GCOV, KCOV - include arch/sparc/vdso/vdso-image-*.o into UBSAN - include arch/sparc/vdso/vma.o into UBSAN - include arch/x86/entry/vdso/extable.o into KASAN, KCSAN, UBSAN, GCOV, KCOV - include arch/x86/entry/vdso/vdso-image-*.o into KASAN, KCSAN, UBSAN, GCOV, KCOV - include arch/x86/entry/vdso/vdso32-setup.o into KASAN, KCSAN, UBSAN, GCOV, KCOV - include arch/x86/entry/vdso/vma.o into GCOV, KCOV - include arch/x86/um/vdso/vma.o into KASAN, GCOV, KCOV I believe these are positive effects because all of them are kernel space objects. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Roberto Sassu <roberto.sassu@huawei.com>
2024-05-14arch: make execmem setup available regardless of CONFIG_MODULESMike Rapoport (IBM)
execmem does not depend on modules, on the contrary modules use execmem. To make execmem available when CONFIG_MODULES=n, for instance for kprobes, split execmem_params initialization out from arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14arm64: extend execmem_info for generated code allocationsMike Rapoport (IBM)
The memory allocations for kprobes and BPF on arm64 can be placed anywhere in vmalloc address space and currently this is implemented with overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64. Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_info and drop overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14mm/execmem, arch: convert remaining overrides of module_alloc to execmemMike Rapoport (IBM)
Extend execmem parameters to accommodate more complex overrides of module_alloc() by architectures. This includes specification of a fallback range required by arm, arm64 and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for allocation of KASAN shadow required by s390 and x86 and support for late initialization of execmem required by arm64. The core implementation of execmem_alloc() takes care of suppressing warnings when the initial allocation fails but there is a fallback range defined. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Song Liu <song@kernel.org> Tested-by: Liviu Dudau <liviu@dudau.co.uk> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14arm64: module: remove unneeded call to kasan_alloc_module_shadow()Mike Rapoport (IBM)
Since commit f6f37d9320a1 ("arm64: select KASAN_VMALLOC for SW/HW_TAGS modes") KASAN_VMALLOC is always enabled when KASAN is on. This means that allocations in module_alloc() will be tracked by KASAN protection for vmalloc() and that kasan_alloc_module_shadow() will be always an empty inline and there is no point in calling it. Drop meaningless call to kasan_alloc_module_shadow() from module_alloc(). Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-13Merge tag 'perf-core-2024-05-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events updates from Ingo Molnar: - Combine perf and BPF for fast evalution of HW breakpoint conditions - Add LBR capture support outside of hardware events - Trigger IO signals for watermark_wakeup - Add RAPL support for Intel Arrow Lake and Lunar Lake - Optimize frequency-throttling - Miscellaneous cleanups & fixes * tag 'perf-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) perf/bpf: Mark perf_event_set_bpf_handler() and perf_event_free_bpf_handler() as inline too selftests/perf_events: Test FASYNC with watermark wakeups perf/ring_buffer: Trigger IO signals for watermark_wakeup perf: Move perf_event_fasync() to perf_event.h perf/bpf: Change the !CONFIG_BPF_SYSCALL stubs to static inlines selftest/bpf: Test a perf BPF program that suppresses side effects perf/bpf: Allow a BPF program to suppress all sample side effects perf/bpf: Remove unneeded uses_default_overflow_handler() perf/bpf: Call BPF handler directly, not through overflow machinery perf/bpf: Remove #ifdef CONFIG_BPF_SYSCALL from struct perf_event members perf/bpf: Create bpf_overflow_handler() stub for !CONFIG_BPF_SYSCALL perf/bpf: Reorder bpf_overflow_handler() ahead of __perf_event_overflow() perf/x86/rapl: Add support for Intel Lunar Lake perf/x86/rapl: Add support for Intel Arrow Lake perf/core: Reduce PMU access to adjust sample freq perf/core: Optimize perf_adjust_freq_unthr_context() perf/x86/amd: Don't reject non-sampling events with configured LBR perf/x86/amd: Support capturing LBR from software events perf/x86/amd: Avoid taking branches before disabling LBR perf/x86/amd: Ensure amd_pmu_core_disable_all() is always inlined ...
2024-05-12Merge tag 'kvmarm-6.10-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.10 - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greattly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements.
2024-05-10Merge branch 'for-next/errata' into for-next/coreWill Deacon
* for-next/errata: arm64: errata: Add workaround for Arm errata 3194386 and 3312417 arm64: cputype: Add Neoverse-V3 definitions arm64: cputype: Add Cortex-X4 definitions arm64: barrier: Restore spec_bar() macro
2024-05-10arm64: errata: Add workaround for Arm errata 3194386 and 3312417Mark Rutland
Cortex-X4 and Neoverse-V3 suffer from errata whereby an MSR to the SSBS special-purpose register does not affect subsequent speculative instructions, permitting speculative store bypassing for a window of time. This is described in their Software Developer Errata Notice (SDEN) documents: * Cortex-X4 SDEN v8.0, erratum 3194386: https://developer.arm.com/documentation/SDEN-2432808/0800/ * Neoverse-V3 SDEN v6.0, erratum 3312417: https://developer.arm.com/documentation/SDEN-2891958/0600/ To workaround these errata, it is necessary to place a speculation barrier (SB) after MSR to the SSBS special-purpose register. This patch adds the requisite SB after writes to SSBS within the kernel, and hides the presence of SSBS from EL0 such that userspace software which cares about SSBS will manipulate this via prctl(PR_GET_SPECULATION_CTRL, ...). Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20240508081400.235362-5-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-05-10kbuild: use $(src) instead of $(srctree)/$(src) for source directoryMasahiro Yamada
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for checked-in source files. It is merely a convention without any functional difference. In fact, $(obj) and $(src) are exactly the same, as defined in scripts/Makefile.build: src := $(obj) When the kernel is built in a separate output directory, $(src) does not accurately reflect the source directory location. While Kbuild resolves this discrepancy by specifying VPATH=$(srctree) to search for source files, it does not cover all cases. For example, when adding a header search path for local headers, -I$(srctree)/$(src) is typically passed to the compiler. This introduces inconsistency between upstream and downstream Makefiles because $(src) is used instead of $(srctree)/$(src) for the latter. To address this inconsistency, this commit changes the semantics of $(src) so that it always points to the directory in the source tree. Going forward, the variables used in Makefiles will have the following meanings: $(obj) - directory in the object tree $(src) - directory in the source tree (changed by this commit) $(objtree) - the top of the kernel object tree $(srctree) - the top of the kernel source tree Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced with $(src). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-05-09Merge branch 'for-next/perf' into for-next/coreWill Deacon
* for-next/perf: (41 commits) arm64: Add USER_STACKTRACE support drivers/perf: hisi: hns3: Actually use devm_add_action_or_reset() drivers/perf: hisi: hns3: Fix out-of-bound access when valid event group drivers/perf: hisi_pcie: Fix out-of-bound access when valid event group perf/arm-spe: Assign parents for event_source device perf/arm-smmuv3: Assign parents for event_source device perf/arm-dsu: Assign parents for event_source device perf/arm-dmc620: Assign parents for event_source device perf/arm-ccn: Assign parents for event_source device perf/arm-cci: Assign parents for event_source device perf/alibaba_uncore: Assign parents for event_source device perf/arm_pmu: Assign parents for event_source devices perf/imx_ddr: Assign parents for event_source devices perf/qcom: Assign parents for event_source devices Documentation: qcom-pmu: Use /sys/bus/event_source/devices paths perf/riscv: Assign parents for event_source devices perf/thunderx2: Assign parents for event_source devices Documentation: thunderx2-pmu: Use /sys/bus/event_source/devices paths perf/xgene: Assign parents for event_source devices Documentation: xgene-pmu: Use /sys/bus/event_source/devices paths ...
2024-05-09Merge branch 'for-next/misc' into for-next/coreWill Deacon
* for-next/misc: arm64: simplify arch_static_branch/_jump function arm64: Add the arm64.no32bit_el0 command line option arm64: defer clearing DAIF.D arm64: assembler: update stale comment for disable_step_tsk arm64/sysreg: Update PIE permission encodings arm64: Add Neoverse-V2 part arm64: Remove unnecessary irqflags alternative.h include
2024-05-08Merge branch kvm-arm64/misc-6.10 into kvmarm-master/nextMarc Zyngier
* kvm-arm64/misc-6.10: : . : Misc fixes and updates targeting 6.10 : : - Improve boot-time diagnostics when the sysreg tables : are not correctly sorted : : - Allow FFA_MSG_SEND_DIRECT_REQ in the FFA proxy : : - Fix duplicate XNX field in the ID_AA64MMFR1_EL1 : writeable mask : : - Allocate PPIs and SGIs outside of the vcpu structure, allowing : for smaller EL2 mapping and some flexibility in implementing : more or less than 32 private IRQs. : : - Use bitmap_gather() instead of its open-coded equivalent : : - Make protected mode use hVHE if available : : - Purge stale mpidr_data if a vcpu is created after the MPIDR : map has been created : . KVM: arm64: Destroy mpidr_data for 'late' vCPU creation KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support KVM: arm64: Fix hvhe/nvhe early alias parsing KVM: arm64: Convert kvm_mpidr_index() to bitmap_gather() KVM: arm64: vgic: Allocate private interrupts on demand KVM: arm64: Remove duplicated AA64MMFR1_EL1 XNX KVM: arm64: Remove FFA_MSG_SEND_DIRECT_REQ from the denylist KVM: arm64: Improve out-of-order sysreg table diagnostics Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-08KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE supportWill Deacon
The early command line parsing treats "kvm-arm.mode=protected" as an alias for "id_aa64mmfr1.vh=0", forcing the use of nVHE so that the host kernel runs at EL1 with the pKVM hypervisor at EL2. With the introduction of hVHE support in ad744e8cb346 ("arm64: Allow arm64_sw.hvhe on command line"), the hypervisor can run using the EL2+0 translation regime. This is interesting for unusual CPUs that have VH stuck to 1, but also because it opens the possibility of a hypervisor "userspace" in the distant future which could be used to isolate vCPU contexts in the hypervisor (see Marc's talk from KVM Forum 2022 [1]). Repaint the "kvm-arm.mode=protected" alias to map to "arm64_sw.hvhe=1", which will use hVHE on CPUs that support it and remain with nVHE otherwise. [1] https://www.youtube.com/watch?v=1F_Mf2j9eIo Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240501163400.15838-3-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-08KVM: arm64: Fix hvhe/nvhe early alias parsingWill Deacon
Booting a kernel with "arm64_sw.hvhe=1 kvm-arm.mode=nvhe" on the command-line results in KVM initialising using hVHE, whereas one might expect the latter option to override the former. Fix this by adding "arm64_sw.hvhe=0" to the alias expansion for "kvm-arm.mode=nvhe". Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240501163400.15838-2-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-05-03arm64: Add USER_STACKTRACE supportchenqiwu
Currently, userstacktrace is unsupported for ftrace and uprobe tracers on arm64. This patch uses the perf_callchain_user() code as blueprint to implement the arch_stack_walk_user() which add userstacktrace support on arm64. Meanwhile, we can use arch_stack_walk_user() to simplify the implementation of perf_callchain_user(). This patch is tested pass with ftrace, uprobe and perf tracers profiling userstacktrace cases. Tested-by: chenqiwu <qiwu.chen@transsion.com> Signed-off-by: chenqiwu <qiwu.chen@transsion.com> Link: https://lore.kernel.org/r/20231219022229.10230-1-qiwu.chen@transsion.com Signed-off-by: Will Deacon <will@kernel.org>
2024-05-03arm64: Add the arm64.no32bit_el0 command line optionAndrea della Porta
Introducing the field 'el0' to the idreg-override for register ID_AA64PFR0_EL1. This field is also aliased to the new kernel command line option 'arm64.no32bit_el0' as a more recognizable and mnemonic name to disable the execution of 32 bit userspace applications (i.e. avoid Aarch32 execution state in EL0) from kernel command line. Link: https://lore.kernel.org/all/20240207105847.7739-1-andrea.porta@suse.com/ Signed-off-by: Andrea della Porta <andrea.porta@suse.com> Link: https://lore.kernel.org/r/20240429102833.6426-1-andrea.porta@suse.com Signed-off-by: Will Deacon <will@kernel.org>
2024-05-02arch: use $(obj)/ instead of $(src)/ for preprocessed linker scriptsMasahiro Yamada
These are generated files. Prefix them with $(obj)/ instead of $(src)/. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Helge Deller <deller@gmx.de> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-04-28arm64: defer clearing DAIF.DMark Rutland
For historical reasons we unmask debug exceptions in __cpu_setup(), but it's not necessary to unmask debug exceptions this early in the boot/idle entry paths. It would be better to unmask debug exceptions later in C code as this simplifies the current code and will make it easier to rework exception masking logic to handle non-DAIF bits in future (e.g. PSTATE.{ALLINT,PM}). We started clearing DAIF.D in __cpu_setup() in commit: 2ce39ad15182604b ("arm64: debug: unmask PSTATE.D earlier") At the time, we needed to ensure that DAIF.D was clear on the primary CPU before scheduling and preemption were possible, and chose to do this in __cpu_setup() so that this occurred in the same place for primary and secondary CPUs. As we cannot handle debug exceptions this early, we placed an ISB between initializing MDSCR_EL1 and clearing DAIF.D so that no exceptions should be triggered. Subsequently we rewrote the return-from-{idle,suspend} paths to use __cpu_setup() in commit: cabe1c81ea5be983 ("arm64: Change cpu_resume() to enable mmu early then access sleep_sp by va") ... which allowed for earlier use of the MMU and had the desirable property of using the same code to reset the CPU in the cold and warm boot paths. This introduced a bug: DAIF.D was clear while cpu_do_resume() restored MDSCR_EL1 and other control registers (e.g. breakpoint/watchpoint control/value registers), and so we could unexpectedly take debug exceptions. We fixed that in commit: 744c6c37cc18705d ("arm64: kernel: Fix unmasked debug exceptions when restoring mdscr_el1") ... by having cpu_do_resume() use the `disable_dbg` macro to set DAIF.D before restoring MDSCR_EL1 and other control registers. This relies on DAIF.D being subsequently cleared again in cpu_resume(). Subsequently we reworked DAIF masking in commit: 0fbeb318754860b3 ("arm64: explicitly mask all exceptions") ... where we began enforcing a policy that DAIF.D being set implies all other DAIF bits are set, and so e.g. we cannot take an IRQ while DAIF.D is set. As part of this the use of `disable_dbg` in cpu_resume() was replaced with `disable_daif` for consistency with the rest of the kernel. These days, there's no need to clear DAIF.D early within __cpu_setup(): * setup_arch() clears DAIF.DA before scheduling and preemption are possible on the primary CPU, avoiding the problem we we originally trying to work around. Note: DAIF.IF get cleared later when interrupts are enabled for the first time. * secondary_start_kernel() clears all DAIF bits before scheduling and preemption are possible on secondary CPUs. Note: with pseudo-NMI, the PMR is initialized here before any DAIF bits are cleared. Similar will be necessary for the architectural NMI. * cpu_suspend() restores all DAIF bits when returning from idle, ensuring that we don't unexpectedly leave DAIF.D clear or set. Note: with pseudo-NMI, the PMR is initialized here before DAIF is cleared. Similar will be necessary for the architectural NMI. This patch removes the unmasking of debug exceptions from __cpu_setup(), relying on the above locations to initialize DAIF. This allows some other cleanups: * It is no longer necessary for cpu_resume() to explicitly mask debug (or other) exceptions, as it is always called with all DAIF bits set. Thus we drop the use of `disable_daif`. * The `enable_dbg` macro is no longer used, and so is dropped. * It is no longer necessary to have an ISB immediately after initializing MDSCR_EL1 in __cpu_setup(), and we can revert to relying on the context synchronization that occurs when the MMU is enabled between __cpu_setup() and code which clears DAIF.D Comments are added to setup_arch() and secondary_start_kernel() to explain the initial unmasking of the DAIF bits. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20240422113523.4070414-3-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-04-25fix missing vmalloc.h includesKent Overstreet
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-22arm64/io: Provide a WC friendly __iowriteXX_copy()Jason Gunthorpe
The kernel provides driver support for using write combining IO memory through the __iowriteXX_copy() API which is commonly used as an optional optimization to generate 16/32/64 byte MemWr TLPs in a PCIe environment. iomap_copy.c provides a generic implementation as a simple 4/8 byte at a time copy loop that has worked well with past ARM64 CPUs, giving a high frequency of large TLPs being successfully formed. However modern ARM64 CPUs are quite sensitive to how the write combining CPU HW is operated and a compiler generated loop with intermixed load/store is not sufficient to frequently generate a large TLP. The CPUs would like to see the entire TLP generated by consecutive store instructions from registers. Compilers like gcc tend to intermix loads and stores and have poor code generation, in part, due to the ARM64 situation that writeq() does not codegen anything other than "[xN]". However even with that resolved compilers like clang still do not have good code generation. This means on modern ARM64 CPUs the rate at which __iowriteXX_copy() successfully generates large TLPs is very small (less than 1 in 10,000) tries), to the point that the use of WC is pointless. Implement __iowrite32/64_copy() specifically for ARM64 and use inline assembly to build consecutive blocks of STR instructions. Provide direct support for 64/32/16 large TLP generation in this manner. Optimize for common constant lengths so that the compiler can directly inline the store blocks. This brings the frequency of large TLP generation up to a high level that is comparable with older CPU generations. As the __iowriteXX_copy() family of APIs is intended for use with WC incorporate the DGH hint directly into the function. Link: https://lore.kernel.org/r/4-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-18arm64/head: Disable MMU at EL2 before clearing HCR_EL2.E2HArd Biesheuvel
Even though the boot protocol stipulates otherwise, an exception has been made for the EFI stub, and entering the core kernel with the MMU enabled is permitted. This allows a substantial amount of cache maintenance to be elided, wich is significant when fast boot times are critical (e.g., for booting micro-VMs) Once the initial ID map has been populated, the MMU is disabled as part of the logic sequence that puts all system registers into a known state. Any code that needs to execute within the window where the MMU is off is cleaned to the PoC explicitly, which includes all of HYP text when entering at EL2. However, the current sequence of initializing the EL2 system registers is not safe: HCR_EL2 is set to its nVHE initial state before SCTLR_EL2 is reprogrammed, and this means that a VHE-to-nVHE switch may occur while the MMU is enabled. This switch causes some system registers as well as page table descriptors to be interpreted in a different way, potentially resulting in spurious exceptions relating to MMU translation. So disable the MMU explicitly first when entering in EL2 with the MMU and caches enabled. Fixes: 617861703830 ("efi: arm64: enter with MMU and caches enabled") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Cc: <stable@vger.kernel.org> # 6.3.x Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240415075412.2347624-6-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-04-18arm64/head: Drop unnecessary pre-disable-MMU workaroundArd Biesheuvel
The Falkor erratum that results in the need for an ISB before clearing the M bit in SCTLR_ELx only applies to execution at exception level x, and so the workaround is not needed when disabling the EL1 MMU while running at EL2. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20240415075412.2347624-5-ardb+git@google.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>