summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2023-10-30arm64: text replication: verify kernel textktextRussell King (Oracle)
Verify that the replicated kernel image for the non-boot nodes matches the boot kernel image, and report differences found. This ensures that the non-boot modes are running an identical copy of the kernel. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: add test moduleRussell King (Oracle)
Add a module to allow kernel text replication to be tested; this exposes some data in procfs which can be used to verify that: (a) we're using different page tables in TTBR1 on CPUs in different NUMA nodes (b) that CPUs in different NUMA nodes are indeed accessing different copies of the kernel Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: add Kconfig for default stateRussell King (Oracle)
Add a kernel configuration option to determine whether kernel text replication should default to being enabled or disabled at boot without a command line specifier. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: add KconfigRussell King (Oracle)
Add the Kconfig symbol for kernel text replication. This unfortunately requires KASAN and kernel text randomisation options to be disabled at the moment. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: early kernel option to enable replicationRussell King (Oracle)
Provide an early kernel option "ktext=" which allows the kernel text replication to be enabled. This takes a boolean argument. The way this has been implemented means that we take all the same paths through the kernel at runtime whether kernel text replication has been enabled or not; this allows the performance effects of the code changes to be evaluated separately from the act of running with replicating the kernel text. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: include most of read-only data as wellRussell King (Oracle)
Include as much of the read-only data in the replication as we can without needing to move away from the generic RO_DATA() macro in the linker script. Unfortunately, the read-only data section is immedaitely followed by the read-only after init data with no page alignment, which means we can't have separate mappings for the read-only data section and everything else. Changing that would mean replacing the generic RO_DATA() macro which increases the maintenance burden. however, this is likely not worth the effort as the majority of read-only data will be covered. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: setup page tables for copied kernelRussell King (Oracle)
Setup page table entries in each non-boot NUMA node page table to point at each node's own copy of the kernel text. This switches each node to use its own unique copy of the kernel text. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: update cnp supportRussell King (Oracle)
Add changes for CNP (Common Not Private) support of kernel text replication. Although text replication has only been tested on dual-socket Ampere A1 systems, provided the different NUMA nodes are not part of the same inner shareable domain, CNP should not be a problem. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: boot secondary CPUs with appropriate TTBR1Russell King (Oracle)
Arrange for secondary CPUs to boot with TTBR1 pointing at the appropriate per-node copy of the kernel page tables for the CPUs NUMA node. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: create per-node kernel page tablesRussell King (Oracle)
Allocate the level 0 page tables for the per-node kernel text replication, but copy all level 0 table entries from the NUMA node 0 table. Therefore, for the time being, each node's level 0 page tables will contain identical entries, and thus other nodes will continue to use the node 0 kernel text. Since the level 0 page tables can be updated at runtime to add entries for vmalloc and module space, propagate these updates to the other swapper page tables. The exception is if we see an update for the level 0 entry which points to the kernel mapping. We also need to setup a copy of the trampoline page tables as well, as the assembly code relies on the two page tables being a fixed offset apart. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: add swapper page directory helpersRussell King (Oracle)
Add a series of helpers for the swapper page directories - a set which return those for the calling CPU, and those which take the NUMA node number. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: add node 0 page table definitionsRussell King (Oracle)
Add a struct definition for the level zero page table group (the optional trampoline page tables, reserved page tables, and swapper page tables). Add a symbol and extern declaration for the node 0 page table group. Add an array of pointers to per-node page tables, which will default to using the node 0 page table group. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: handle aarch64_insn_write_literal_u64()Russell King (Oracle)
aarch64_insn_write_literal_u64() was introduced in v6.3-rc1 for updating ftrace ops pointers in the kernel text. This needs to be fixed up for kernel text replication, so provide a version that will update the mapping. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: add node text patchingRussell King (Oracle)
Add support for text patching on our replicated texts. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: copy initial kernel textRussell King (Oracle)
Allocate memory on the appropriate node for the per-node copies of the kernel text, and copy the kernel text to that memory. Clean and invalidate the caches to the point of unification so that the copied text is correctly visible to the target node. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: add sanity checksRussell King (Oracle)
The kernel text and modules must be in separate L0 page table entries. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: text replication: add init functionRussell King (Oracle)
A simple patch that adds an empty function for kernel text replication initialisation and hooks it into the initialisation path. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: place kernel in its own L0 page table entryRussell King (Oracle)
Kernel text replication needs to maintain separate per-node page tables for the kernel text. In order to do this without affecting other kernel memory mappings, placing the kernel such that it does not share a L0 page table entry with any other mapping is desirable. Prior to this commit, the layout without KASLR was: +----------+ | vmalloc | +----------+ | Kernel | +----------+ MODULES_END, VMALLOC_START, KIMAGE_VADDR = | Modules | MODULES_VADDR + MODULES_VSIZE +----------+ MODULES_VADDR = _PAGE_END(VA_BITS_MIN) | VA space | +----------+ 0 This becomes: +----------+ | vmalloc | +----------+ VMALLOC_START = MODULES_END + PGDIR_SIZE | Kernel | +----------+ MODULES_END, KIMAGE_VADDR = _PAGE_END(VA_BITS_MIN) + PGDIR_SIZE | Modules | +----------+ MODULES_VADDR = MODULES_END - MODULES_VSIZE | VA space | +----------+ 0 This assumes MODULES_VSIZE (128M) <= PGDIR_SIZE. One side effect of this change is that KIMAGE_VADDR's definition now includes PGDIR_SIZE (to leave room for the modules) but this is not defined when asm/memory.h is included. This means KIMAGE_VADDR can not be used in inline functions within this file, so we convert kaslr_offset() and kaslr_enabled() to be macros instead. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: make clean_dcache_range_nopatch() visibleRussell King (Oracle)
When we hook into the kernel text patching code, we will need to call clean_dcache_range_nopatch() to ensure that the patching of the replicated kernel text is properly visible to other CPUs. Make this function available to the replication code. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-30arm64: provide cpu_replace_ttbr1_phys()Russell King (Oracle)
Provide a version of cpu_replace_ttbr1_phys() which operates using a physical address rather than the virtual address of the page tables. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
2023-10-28Merge tag 'x86-urgent-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Fix a possible CPU hotplug deadlock bug caused by the new TSC synchronization code - Fix a legacy PIC discovery bug that results in device troubles on affected systems, such as non-working keybards, etc - Add a new Intel CPU model number to <asm/intel-family.h> * tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tsc: Defer marking TSC unstable to a worker x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility x86/cpu: Add model number for Intel Arrow Lake mobile processor
2023-10-27Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc filesystem fixes from Al Viro: "Assorted fixes all over the place: literally nothing in common, could have been three separate pull requests. All are simple regression fixes, but not for anything from this cycle" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed sparc32: fix a braino in fault handling in csum_and_copy_..._user()
2023-10-27sparc32: fix a braino in fault handling in csum_and_copy_..._user()Al Viro
Fault handler used to make non-trivial calls, so it needed to set a stack frame up. Used to be save ... - grab a stack frame, old %o... become %i... .... ret - go back to address originally in %o7, currently %i7 restore - switch to previous stack frame, in delay slot Non-trivial calls had been gone since ab5e8b331244 and that code should have become retl - go back to address in %o7 clr %o0 - have return value set to 0 What it had become instead was ret - go back to address in %i7 - return address of *caller* clr %o0 - have return value set to 0 which is not good, to put it mildly - we forcibly return 0 from csum_and_copy_{from,to}_iter() (which is what the call of that thing had been inlined into) and do that without dropping the stack frame of said csum_and_copy_..._iter(). Confuses the hell out of the caller of csum_and_copy_..._iter(), obviously... Reviewed-by: Sam Ravnborg <sam@ravnborg.org> Fixes: ab5e8b331244 "sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic()" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-27x86/tsc: Defer marking TSC unstable to a workerThomas Gleixner
Tetsuo reported the following lockdep splat when the TSC synchronization fails during CPU hotplug: tsc: Marking TSC unstable due to check_tsc_sync_source failed WARNING: inconsistent lock state inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0 {IN-HARDIRQ-W} state was registered at: _raw_spin_lock_irqsave+0x3f/0x60 clocksource_mark_unstable+0x1b/0x90 mark_tsc_unstable+0x41/0x50 check_tsc_sync_source+0x14f/0x180 sysvec_call_function_single+0x69/0x90 Possible unsafe locking scenario: lock(watchdog_lock); <Interrupt> lock(watchdog_lock); stack backtrace: _raw_spin_lock+0x30/0x40 clocksource_watchdog+0x23/0x5a0 run_timer_softirq+0x2a/0x50 sysvec_apic_timer_interrupt+0x6e/0x90 The reason is the recent conversion of the TSC synchronization function during CPU hotplug on the control CPU to a SMP function call. In case that the synchronization with the upcoming CPU fails, the TSC has to be marked unstable via clocksource_mark_unstable(). clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is taken with interrupts enabled in the watchdog timer callback to minimize interrupt disabled time. That's obviously a possible deadlock scenario, Before that change the synchronization function was invoked in thread context so this could not happen. As it is not crucical whether the unstable marking happens slightly delayed, defer the call to a worker thread which avoids the lock context problem. Fixes: 9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87zg064ceg.ffs@tglx
2023-10-27x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibilityThomas Gleixner
David and a few others reported that on certain newer systems some legacy interrupts fail to work correctly. Debugging revealed that the BIOS of these systems leaves the legacy PIC in uninitialized state which makes the PIC detection fail and the kernel switches to a dummy implementation. Unfortunately this fallback causes quite some code to fail as it depends on checks for the number of legacy PIC interrupts or the availability of the real PIC. In theory there is no reason to use the PIC on any modern system when IO/APIC is available, but the dependencies on the related checks cannot be resolved trivially and on short notice. This needs lots of analysis and rework. The PIC detection has been added to avoid quirky checks and force selection of the dummy implementation all over the place, especially in VM guest scenarios. So it's not an option to revert the relevant commit as that would break a lot of other scenarios. One solution would be to try to initialize the PIC on detection fail and retry the detection, but that puts the burden on everything which does not have a PIC. Fortunately the ACPI/MADT table header has a flag field, which advertises in bit 0 that the system is PCAT compatible, which means it has a legacy 8259 PIC. Evaluate that bit and if set avoid the detection routine and keep the real PIC installed, which then gets initialized (for nothing) and makes the rest of the code with all the dependencies work again. Fixes: e179f6914152 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately") Reported-by: David Lazar <dlazar@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: David Lazar <dlazar@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Cc: stable@vger.kernel.org Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218003 Link: https://lore.kernel.org/r/875y2u5s8g.ffs@tglx
2023-10-27x86/cpu: Add model number for Intel Arrow Lake mobile processorTony Luck
For "reasons" Intel has code-named this CPU with a "_H" suffix. [ dhansen: As usual, apply this and send it upstream quickly to make it easier for anyone who is doing work that consumes this. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231025202513.12358-1-tony.luck%40intel.com
2023-10-27Merge tag 'powerpc-6.6-6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix boot crash with FLATMEM since set_ptes() introduction - Avoid calling arch_enter/leave_lazy_mmu() in set_ptes() Thanks to Aneesh Kumar K.V and Erhard Furtner. * tag 'powerpc-6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes powerpc/mm: Fix boot crash with FLATMEM
2023-10-26Merge tag 'soc-fixes-6.7-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "A couple of platforms have some last-minute fixes, in particular: - riscv gets some fixes for noncoherent DMA on the renesas and thead platforms and dts fix for SPI on the visionfive 2 board - Qualcomm Snapdragon gets three dts fixes to address board specific regressions on the pmic and gpio nodes - Rockchip platforms get multiple dts fixes to address issues on the recent rk3399 platform as well as the older rk3128 platform that apparently regressed a while ago. - TI OMAP gets some trivial code and dts fixes and a regression fix for the omap1 ams-delta modem - NXP i.MX firmware has one fix for a use-after-free but in its error handling" * tag 'soc-fixes-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (25 commits) soc: renesas: ARCH_R9A07G043 depends on !RISCV_ISA_ZICBOM riscv: only select DMA_DIRECT_REMAP from RISCV_ISA_ZICBOM and ERRATA_THEAD_PBMT riscv: RISCV_NONSTANDARD_CACHE_OPS shouldn't depend on RISCV_DMA_NONCOHERENT riscv: dts: thead: set dma-noncoherent to soc bus arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399 clk: ti: Fix missing omap5 mcbsp functional clock and aliases clk: ti: Fix missing omap4 mcbsp functional clock and aliases ARM: OMAP1: ams-delta: Fix MODEM initialization failure soc: renesas: Make ARCH_R9A07G043 depend on required options riscv: dts: starfive: visionfive 2: correct spi's ss pin firmware/imx-dsp: Fix use_after_free in imx_dsp_setup_channels() ARM: OMAP: timer32K: fix all kernel-doc warnings ARM: omap2: fix a debug printk ARM: dts: rockchip: Fix timer clocks for RK3128 ARM: dts: rockchip: Add missing quirk for RK3128's dma engine ARM: dts: rockchip: Add missing arm timer interrupt for RK3128 ARM: dts: rockchip: Fix i2c0 register address for RK3128 arm64: dts: rockchip: set codec system-clock-fixed on px30-ringneck-haikou arm64: dts: rockchip: use codec as clock master on px30-ringneck-haikou ...
2023-10-26Merge tag 'renesas-fixes-for-v6.6-tag3' of ↵Arnd Bergmann
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel into arm/fixes Renesas fixes for v6.6 (take three) - Sort out a few Kconfig dependency issues for the rich set of RISC-V non-coherent DMA support. * tag 'renesas-fixes-for-v6.6-tag3' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel: soc: renesas: ARCH_R9A07G043 depends on !RISCV_ISA_ZICBOM riscv: only select DMA_DIRECT_REMAP from RISCV_ISA_ZICBOM and ERRATA_THEAD_PBMT riscv: RISCV_NONSTANDARD_CACHE_OPS shouldn't depend on RISCV_DMA_NONCOHERENT Link: https://lore.kernel.org/r/cover.1698312384.git.geert+renesas@glider.be Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-26riscv: only select DMA_DIRECT_REMAP from RISCV_ISA_ZICBOM and ERRATA_THEAD_PBMTChristoph Hellwig
RISCV_DMA_NONCOHERENT is also used for whacky non-standard non-coherent ops that use different hooks in dma-direct. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Tested-by: Samuel Holland <samuel.holland@sifive.com> Link: https://lore.kernel.org/r/20231018052654.50074-3-hch@lst.de Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
2023-10-26riscv: RISCV_NONSTANDARD_CACHE_OPS shouldn't depend on RISCV_DMA_NONCOHERENTChristoph Hellwig
RISCV_NONSTANDARD_CACHE_OPS is also used for the pmem cache maintenance helpers, which are built into the kernel unconditionally. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20231018052654.50074-2-hch@lst.de Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
2023-10-25powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptesAneesh Kumar K.V
With commit 9fee28baa601 ("powerpc: implement the new page table range API") we added set_ptes to powerpc architecture. The implementation included calling arch_enter/leave_lazy_mmu() calls. The patch removes the usage of arch_enter/leave_lazy_mmu() because set_pte is not supposed to be used when updating a pte entry. Powerpc architecture uses this rule to skip the expensive tlb invalidate which is not needed when you are setting up the pte for the first time. See commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit") for more details The patch also makes sure we are not using the interface to update a valid/present pte entry by adding VM_WARN_ON check all the ptes we are setting up. Furthermore, we add a comment to set_pte_filter to clarify it can only update folio-related flags and cannot filter pfn specific details in pte filtering. Removal of arch_enter/leave_lazy_mmu() also will avoid nesting of these functions that are not supported. For ex: remap_pte_range() -> arch_enter_lazy_mmu() -> set_ptes() -> arch_enter_lazy_mmu() -> arch_leave_lazy_mmu() -> arch_leave_lazy_mmu() Fixes: 9fee28baa601 ("powerpc: implement the new page table range API") Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231024143604.16749-1-aneesh.kumar@linux.ibm.com
2023-10-24Merge tag 'mm-hotfixes-stable-2023-10-24-09-40' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes. 12 are cc:stable and the remainder address post-6.5 issues or aren't considered necessary for earlier kernel versions" * tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: maple_tree: add GFP_KERNEL to allocations in mas_expected_entries() selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier mailmap: correct email aliasing for Oleksij Rempel mailmap: map Bartosz's old address to the current one mm/damon/sysfs: check DAMOS regions update progress from before_terminate() MAINTAINERS: Ondrej has moved kasan: disable kasan_non_canonical_hook() for HW tags kasan: print the original fault addr when access invalid shadow hugetlbfs: close race between MADV_DONTNEED and page fault hugetlbfs: extend hugetlb_vma_lock to private VMAs hugetlbfs: clear resv_map pointer if mmap fails mm: zswap: fix pool refcount bug around shrink_worker() mm/migrate: fix do_pages_move for compat pointers riscv: fix set_huge_pte_at() for NAPOT mappings when a swap entry is set riscv: handle VM_FAULT_[HWPOISON|HWPOISON_LARGE] faults instead of panicking mmap: fix error paths with dup_anon_vma() mmap: fix vma_iterator in error path of vma_merge() mm: fix vm_brk_flags() to not bail out while holding lock mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer mm/page_alloc: correct start page when guard page debug is enabled
2023-10-23powerpc/mm: Fix boot crash with FLATMEMMichael Ellerman
Erhard reported that his G5 was crashing with v6.6-rc kernels: mpic: Setting up HT PICs workarounds for U3/U4 BUG: Unable to handle kernel data access at 0xfeffbb62ffec65fe Faulting instruction address: 0xc00000000005dc40 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G T 6.6.0-rc3-PMacGS #1 Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac NIP: c00000000005dc40 LR: c000000000066660 CTR: c000000000007730 REGS: c0000000022bf510 TRAP: 0380 Tainted: G T (6.6.0-rc3-PMacGS) MSR: 9000000000001032 <SF,HV,ME,IR,DR,RI> CR: 44004242 XER: 00000000 IRQMASK: 3 GPR00: 0000000000000000 c0000000022bf7b0 c0000000010c0b00 00000000000001ac GPR04: 0000000003c80000 0000000000000300 c0000000f20001ae 0000000000000300 GPR08: 0000000000000006 feffbb62ffec65ff 0000000000000001 0000000000000000 GPR12: 9000000000001032 c000000002362000 c000000000f76b80 000000000349ecd8 GPR16: 0000000002367ba8 0000000002367f08 0000000000000006 0000000000000000 GPR20: 00000000000001ac c000000000f6f920 c0000000022cd985 000000000000000c GPR24: 0000000000000300 00000003b0a3691d c0003e008030000e 0000000000000000 GPR28: c00000000000000c c0000000f20001ee feffbb62ffec65fe 00000000000001ac NIP hash_page_do_lazy_icache+0x50/0x100 LR __hash_page_4K+0x420/0x590 Call Trace: hash_page_mm+0x364/0x6f0 do_hash_fault+0x114/0x2b0 data_access_common_virt+0x198/0x1f0 --- interrupt: 300 at mpic_init+0x4bc/0x10c4 NIP: c000000002020a5c LR: c000000002020a04 CTR: 0000000000000000 REGS: c0000000022bf9f0 TRAP: 0300 Tainted: G T (6.6.0-rc3-PMacGS) MSR: 9000000000001032 <SF,HV,ME,IR,DR,RI> CR: 24004248 XER: 00000000 DAR: c0003e008030000e DSISR: 40000000 IRQMASK: 1 ... NIP mpic_init+0x4bc/0x10c4 LR mpic_init+0x464/0x10c4 --- interrupt: 300 pmac_setup_one_mpic+0x258/0x2dc pmac_pic_init+0x28c/0x3d8 init_IRQ+0x90/0x140 start_kernel+0x57c/0x78c start_here_common+0x1c/0x20 A bisect pointed to the breakage beginning with commit 9fee28baa601 ("powerpc: implement the new page table range API"). Analysis of the oops pointed to a struct page with a corrupted compound_head being loaded via page_folio() -> _compound_head() in hash_page_do_lazy_icache(). The access by the mpic code is to an MMIO address, so the expectation is that the struct page for that address would be initialised by init_unavailable_range(), as pointed out by Aneesh. Instrumentation showed that was not the case, which eventually lead to the realisation that pfn_valid() was returning false for that address, causing the struct page to not be initialised. Because the system is using FLATMEM, the version of pfn_valid() in memory_model.h is used: static inline int pfn_valid(unsigned long pfn) { ... return pfn >= pfn_offset && (pfn - pfn_offset) < max_mapnr; } Which relies on max_mapnr being initialised. Early in boot max_mapnr is zero meaning no PFNs are valid. max_mapnr is initialised in mem_init() called via: start_kernel() mm_core_init() # init/main.c:928 mem_init() But that is too late for the usage in init_unavailable_range() called via: start_kernel() setup_arch() # init/main.c:893 paging_init() free_area_init() init_unavailable_range() Although max_mapnr is currently set in mem_init(), the value is actually already available much earlier, as soon as mem_topology_setup() has completed, which is also before paging_init() is called. So move the initialisation there, which causes paging_init() to correctly initialise the struct page and fixes the bug. This bug seems to have been lurking for years, but went unnoticed because the pre-folio code was inspecting the uninitialised page->flags but not dereferencing it. Thanks to Erhard and Aneesh for help debugging. Reported-by: Erhard Furtner <erhard_f@mailbox.org> Closes: https://lore.kernel.org/all/20230929132750.3cd98452@yea/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231023112500.1550208-1-mpe@ellerman.id.au
2023-10-21Merge tag 'powerpc-6.6-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix stale propagated yield_cpu in qspinlocks leading to lockups - Fix broken hugepages on some configs due to ARCH_FORCE_MAX_ORDER - Fix a spurious warning when copros are in use at exit time Thanks to Nicholas Piggin, Christophe Leroy, Nysal Jan K.A Sachin Sant, and Shrikanth Hegde. * tag 'powerpc-6.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/qspinlock: Fix stale propagated yield_cpu powerpc/64s/radix: Don't warn on copros in radix__tlb_flush() powerpc/mm: Allow ARCH_FORCE_MAX_ORDER up to 12
2023-10-21Merge tag 's390-6.6-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix IOMMU bitmap allocation in s390 PCI to avoid out of bounds access when IOMMU pages aren't a multiple of 64 - Fix kasan crashes when accessing DCSS mapping in memory holes by adding corresponding kasan zero shadow mappings - Fix a memory leak in css_alloc_subchannel in case dma_set_coherent_mask fails * tag 's390-6.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/pci: fix iommu bitmap allocation s390/kasan: handle DCSS mapping in memory holes s390/cio: fix a memleak in css_alloc_subchannel
2023-10-19Merge tag 'sev_fixes_for_v6.6' of ↵Linus Torvalds
//git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "Take care of a race between when the #VC exception is raised and when the guest kernel gets to emulate certain instructions in SEV-{ES,SNP} guests by: - disabling emulation of MMIO instructions when coming from user mode - checking the IO permission bitmap before emulating IO instructions and verifying the memory operands of INS/OUTS insns" * tag 'sev_fixes_for_v6.6' of //git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Check for user-space IOIO pointing to kernel space x86/sev: Check IOBM for IOIO exceptions from user-space x86/sev: Disable MMIO emulation from user mode
2023-10-19Merge tag 'loongarch-fixes-6.6-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch fixes from Huacai ChenL "Fix 4-level pagetable building, disable WUC for pgprot_writecombine() like ioremap_wc(), use correct annotation for exception handlers, and a trivial cleanup" * tag 'loongarch-fixes-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: LoongArch: Disable WUC for pgprot_writecombine() like ioremap_wc() LoongArch: Replace kmap_atomic() with kmap_local_page() in copy_user_highpage() LoongArch: Export symbol invalid_pud_table for modules building LoongArch: Use SYM_CODE_* to annotate exception handlers
2023-10-19s390/pci: fix iommu bitmap allocationNiklas Schnelle
Since the fixed commits both zdev->iommu_bitmap and zdev->lazy_bitmap are allocated as vzalloc(zdev->iommu_pages / 8). The problem is that zdev->iommu_bitmap is a pointer to unsigned long but the above only yields an allocation that is a multiple of sizeof(unsigned long) which is 8 on s390x if the number of IOMMU pages is a multiple of 64. This in turn is the case only if the effective IOMMU aperture is a multiple of 64 * 4K = 256K. This is usually the case and so didn't cause visible issues since both the virt_to_phys(high_memory) reduced limit and hardware limits use nice numbers. Under KVM, and in particular with QEMU limiting the IOMMU aperture to the vfio DMA limit (default 65535), it is possible for the reported aperture not to be a multiple of 256K however. In this case we end up with an iommu_bitmap whose allocation is not a multiple of 8 causing bitmap operations to access it out of bounds. Sadly we can't just fix this in the obvious way and use bitmap_zalloc() because for large RAM systems (tested on 8 TiB) the zdev->iommu_bitmap grows too large for kmalloc(). So add our own bitmap_vzalloc() wrapper. This might be a candidate for common code, but this area of code will be replaced by the upcoming conversion to use the common code DMA API on s390 so just add a local routine. Fixes: 224593215525 ("s390/pci: use virtual memory for iommu bitmap") Fixes: 13954fd6913a ("s390/pci_dma: improve lazy flush for unmap") Cc: stable@vger.kernel.org Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-10-18Merge tag 'omap-fixes-audio-clock-and-modem-signed' of ↵Arnd Bergmann
git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes Few minor fixes for omaps Regression fixes for mcbsp audio clock, and for ams-delta modem. And two warning fixes. These all can be merged whenever and are not urgent by any means. Feel free to defer to the merge window unless other fixes are still pending. * tag 'omap-fixes-audio-clock-and-modem-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: clk: ti: Fix missing omap5 mcbsp functional clock and aliases clk: ti: Fix missing omap4 mcbsp functional clock and aliases ARM: OMAP1: ams-delta: Fix MODEM initialization failure ARM: OMAP: timer32K: fix all kernel-doc warnings ARM: omap2: fix a debug printk Link: https://lore.kernel.org/r/pull-1697606314-911862@atomide.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-18powerpc/qspinlock: Fix stale propagated yield_cpuNicholas Piggin
yield_cpu is a sample of a preempted lock holder that gets propagated back through the queue. Queued waiters use this to yield to the preempted lock holder without continually sampling the lock word (which would defeat the purpose of MCS queueing by bouncing the cache line). The problem is that yield_cpu can become stale. It can take some time to be passed down the chain, and if any queued waiter gets preempted then it will cease to propagate the yield_cpu to later waiters. This can result in yielding to a CPU that no longer holds the lock, which is bad, but particularly if it is currently in H_CEDE (idle), then it appears to be preempted and some hypervisors (PowerVM) can cause very long H_CONFER latencies waiting for H_CEDE wakeup. This results in latency spikes and hard lockups on oversubscribed partitions with lock contention. This is a minimal fix. Before yielding to yield_cpu, sample the lock word to confirm yield_cpu is still the owner, and bail out of it is not. Thanks to a bunch of people who reported this and tracked down the exact problem using tracepoints and dispatch trace logs. Fixes: 28db61e207ea ("powerpc/qspinlock: allow propagation of yield CPU down the queue") Cc: stable@vger.kernel.org # v6.2+ Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reported-by: Laurent Dufour <ldufour@linux.ibm.com> Reported-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Debugged-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231016124305.139923-2-npiggin@gmail.com
2023-10-18LoongArch: Disable WUC for pgprot_writecombine() like ioremap_wc()Icenowy Zheng
Currently the code disables WUC only disables it for ioremap_wc(), which is only used when mapping writecombine pages like ioremap() (mapped to the kernel space). But for VRAM mapped in TTM/GEM, it is mapped with a crafted pgprot by the pgprot_writecombine() function, in which case WUC isn't disabled now. Disable WUC for pgprot_writecombine() (fallback to SUC) if needed, like ioremap_wc(). This improves the AMDGPU driver's stability (solves some misrendering) on Loongson-3A5000/3A6000 machines. Signed-off-by: Icenowy Zheng <uwu@icenowy.me> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-10-18LoongArch: Replace kmap_atomic() with kmap_local_page() in copy_user_highpage()Huacai Chen
Replace kmap_atomic()/kunmap_atomic() calls with kmap_local_page()/ kunmap_local() in copy_user_highpage() which can be invoked from both preemptible and atomic context [1]. [1] https://lore.kernel.org/all/20201029222652.302358281@linutronix.de/ Suggested-by: Deepak R Varma <drv@mailo.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-10-18LoongArch: Export symbol invalid_pud_table for modules buildingHuacai Chen
Export symbol invalid_pud_table for modules building (such as the KVM module) if 4-level page tables enabled. Otherwise we get: ERROR: modpost: "invalid_pud_table" [arch/loongarch/kvm/kvm.ko] undefined! Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-10-18LoongArch: Use SYM_CODE_* to annotate exception handlersTiezhu Yang
As described in include/linux/linkage.h, FUNC -- C-like functions (proper stack frame etc.) CODE -- non-C code (e.g. irq handlers with different, special stack etc.) SYM_FUNC_{START, END} -- use for global functions SYM_CODE_{START, END} -- use for non-C (special) functions So use SYM_CODE_* to annotate exception handlers. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-10-18powerpc/64s/radix: Don't warn on copros in radix__tlb_flush()Michael Ellerman
Sachin reported a warning when running the inject-ra-err selftest: # selftests: powerpc/mce: inject-ra-err Disabling lock debugging due to kernel taint MCE: CPU19: machine check (Severe) Real address Load/Store (foreign/control memory) [Not recovered] MCE: CPU19: PID: 5254 Comm: inject-ra-err NIP: [0000000010000e48] MCE: CPU19: Initiator CPU MCE: CPU19: Unknown ------------[ cut here ]------------ WARNING: CPU: 19 PID: 5254 at arch/powerpc/mm/book3s64/radix_tlb.c:1221 radix__tlb_flush+0x160/0x180 CPU: 19 PID: 5254 Comm: inject-ra-err Kdump: loaded Tainted: G M E 6.6.0-rc3-00055-g9ed22ae6be81 #4 Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.20 (NH1030_058) hv:phyp pSeries ... NIP radix__tlb_flush+0x160/0x180 LR radix__tlb_flush+0x104/0x180 Call Trace: radix__tlb_flush+0xf4/0x180 (unreliable) tlb_finish_mmu+0x15c/0x1e0 exit_mmap+0x1a0/0x510 __mmput+0x60/0x1e0 exit_mm+0xdc/0x170 do_exit+0x2bc/0x5a0 do_group_exit+0x4c/0xc0 sys_exit_group+0x28/0x30 system_call_exception+0x138/0x330 system_call_vectored_common+0x15c/0x2ec And bisected it to commit e43c0a0c3c28 ("powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs"), which added a warning in radix__tlb_flush() if mm->context.copros is still elevated. However it's possible for the copros count to be elevated if a process exits without first closing file descriptors that are associated with a copro, eg. VAS. If the process exits with a VAS file still open, the release callback is queued up for exit_task_work() via: exit_files() put_files_struct() close_files() filp_close() fput() And called via: exit_task_work() ____fput() __fput() file->f_op->release(inode, file) coproc_release() vas_user_win_ops->close_win() vas_deallocate_window() mm_context_remove_vas_window() mm_context_remove_copro() But that is after exit_mm() has been called from do_exit() and triggered the warning. Fix it by dropping the warning, and always calling __flush_all_mm(). In the normal case of no copros, that will result in a call to _tlbiel_pid(mm->context.id, RIC_FLUSH_ALL) just as the current code does. If the copros count is elevated then it will cause a global flush, which should flush translations from any copros. Note that the process table entry was cleared in arch_exit_mmap(), so copros should not be able to fetch any new translations. Fixes: e43c0a0c3c28 ("powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs") Reported-by: Sachin Sant <sachinp@linux.ibm.com> Closes: https://lore.kernel.org/all/A8E52547-4BF1-47CE-8AEA-BC5A9D7E3567@linux.ibm.com/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://msgid.link/20231017121527.1574104-1-mpe@ellerman.id.au
2023-10-17Merge tag 'v6.6-rockchip-dtsfixes1' of ↵Arnd Bergmann
git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into arm/fixes I2S pinctrl fixes, someone resurrected the rk3128 arm32 and found some needed fixes and finally some sound fixes for the px30 ringneck som. * tag 'v6.6-rockchip-dtsfixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399 ARM: dts: rockchip: Fix timer clocks for RK3128 ARM: dts: rockchip: Add missing quirk for RK3128's dma engine ARM: dts: rockchip: Add missing arm timer interrupt for RK3128 ARM: dts: rockchip: Fix i2c0 register address for RK3128 arm64: dts: rockchip: set codec system-clock-fixed on px30-ringneck-haikou arm64: dts: rockchip: use codec as clock master on px30-ringneck-haikou Link: https://lore.kernel.org/r/1965242.usQuhbGJ8B@phil Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-17riscv: dts: thead: set dma-noncoherent to soc busJisheng Zhang
riscv select ARCH_DMA_DEFAULT_COHERENT by default, and th1520 isn't dma coherent, so set dma-noncoherent to reflect this fact. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Tested-by: Drew Fustini <dfustini@baylibre.com> Reviewed-by: Guo Ren <guoren@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-17x86/sev: Check for user-space IOIO pointing to kernel spaceJoerg Roedel
Check the memory operand of INS/OUTS before emulating the instruction. The #VC exception can get raised from user-space, but the memory operand can be manipulated to access kernel memory before the emulation actually begins and after the exception handler has run. [ bp: Massage commit message. ] Fixes: 597cfe48212a ("x86/boot/compressed/64: Setup a GHCB-based VC Exception handler") Reported-by: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org>
2023-10-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix the handling of the phycal timer offset when FEAT_ECV and CNTPOFF_EL2 are implemented - Restore the functionnality of Permission Indirection that was broken by the Fine Grained Trapping rework - Cleanup some PMU event sharing code MIPS: - Fix W=1 build s390: - One small fix for gisa to avoid stalls x86: - Truncate writes to PMU counters to the counter's width to avoid spurious overflows when emulating counter events in software - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined architectural behavior) - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to kick the guest out of emulated halt - Fix for loading XSAVE state from an old kernel into a new one - Fixes for AMD AVIC selftests: - Play nice with %llx when formatting guest printf and assert statements - Clean up stale test metadata - Zero-initialize structures in memslot perf test to workaround a suspected 'may be used uninitialized' false positives from GCC" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits) KVM: arm64: timers: Correctly handle TGE flip with CNTPOFF_EL2 KVM: arm64: POR{E0}_EL1 do not need trap handlers KVM: arm64: Add nPIR{E0}_EL1 to HFG traps KVM: MIPS: fix -Wunused-but-set-variable warning KVM: arm64: pmu: Drop redundant check for non-NULL kvm_pmu_events KVM: SVM: Fix build error when using -Werror=unused-but-set-variable x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested() x86: KVM: SVM: add support for Invalid IPI Vector interception x86: KVM: SVM: always update the x2avic msr interception KVM: selftests: Force load all supported XSAVE state in state test KVM: selftests: Load XSAVE state into untouched vCPU during state test KVM: selftests: Touch relevant XSAVE state in guest for state test KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2} x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer KVM: selftests: Zero-initialize entire test_result in memslot perf test KVM: selftests: Remove obsolete and incorrect test case metadata KVM: selftests: Treat %llx like %lx when formatting guest printf KVM: x86/pmu: Synthesize at most one PMI per VM-exit KVM: x86: Mask LVTPC when handling a PMI KVM: x86/pmu: Truncate counter value to allowed width on write ...