path: root/arch
2024-04-26  dma-mapping: Simplify arch_setup_dma_ops()  (Robin Murphy)
The dma_base, size and iommu arguments are only used by ARM, and can now easily be deduced from the device itself, so there's no need to pass them through the callchain as well. Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Michael Kelley <mhklinux@outlook.com> # For Hyper-V Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/5291c2326eab405b1aa7693aa964e8d3cb7193de.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
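For illustration, the resulting call shape is roughly the following; this is a sketch of the before/after prototypes, not a verbatim copy of the patch:

  /* Before (sketch): callers had to thread IOMMU/range details through */
  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
                          const struct iommu_ops *iommu, bool coherent);

  /* After (sketch): everything needed is deduced from the device itself */
  void arch_setup_dma_ops(struct device *dev, bool coherent);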
2024-04-26  iommu/dma: Centralise iommu_setup_dma_ops()  (Robin Murphy)
It's somewhat hard to see, but arm64's arch_setup_dma_ops() should only ever call iommu_setup_dma_ops() after a successful iommu_probe_device(), which means there should be no harm in achieving the same order of operations by running it off the back of iommu_probe_device() itself. This then puts it in line with the x86 and s390 .probe_finalize bodges, letting us pull it all into the main flow properly. As a bonus this lets us fold in and de-scope the PCI workaround setup as well. At this point we can also then pull the call up inside the group mutex, and avoid having to think about whether iommu_group_store_type() could theoretically race and free the domain if iommu_setup_dma_ops() ran just *before* iommu_device_use_default_domain() claims it... Furthermore we replace one .probe_finalize call completely, since the only remaining implementations are now one which only needs to run once for the initial boot-time probe, and two which themselves render that path unreachable. This leaves us a big step closer to realistically being able to unpick the variety of different things that iommu_setup_dma_ops() has been muddling together, and further streamline iommu-dma into core API flows in future. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> # For Intel IOMMU Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/bebea331c1d688b34d9862eefd5ede47503961b8.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-04-26  dma-mapping: Add helpers for dma_range_map bounds  (Robin Murphy)
Several places want to compute the lower and/or upper bounds of a dma_range_map, so let's factor that out into reusable helpers. Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> # For arm64 Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/45ec52f033ec4dfb364e23f48abaf787f612fa53.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
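A minimal sketch of what such helpers could look like, assuming the existing struct bus_dma_region fields (dma_start, size); names and exact semantics are per the patch itself:

  /* Sketch: lowest DMA address covered by the range map */
  static inline dma_addr_t dma_range_map_min(const struct bus_dma_region *map)
  {
          dma_addr_t ret = ~(dma_addr_t)0;

          for (; map->size; map++)
                  if (map->dma_start < ret)
                          ret = map->dma_start;
          return ret;
  }

  /* Sketch: one past the highest DMA address covered by the range map */
  static inline dma_addr_t dma_range_map_max(const struct bus_dma_region *map)
  {
          dma_addr_t ret = 0;

          for (; map->size; map++)
                  if (map->dma_start + map->size > ret)
                          ret = map->dma_start + map->size;
          return ret;
  }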
2024-04-26  crypto: x86/aes-gcm - simplify GCM hash subkey derivation  (Eric Biggers)
Remove a redundant expansion of the AES key, and use rodata for zeroes. Also rename rfc4106_set_hash_subkey() to aes_gcm_derive_hash_subkey() because it's used for both versions of AES-GCM, not just RFC4106. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
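For reference, the GHASH subkey H is just the block cipher applied to the all-zero block, so the derivation only needs the key schedule the caller already has. A conceptual sketch (not the patch's exact code, which operates on the x86 assembly key schedule):

  /* Sketch: H = AES_K(0^128), using an already-expanded key */
  static void aes_gcm_derive_hash_subkey(const struct crypto_aes_ctx *key,
                                         u8 hash_subkey[AES_BLOCK_SIZE])
  {
          static const u8 zeroes[AES_BLOCK_SIZE];   /* rodata, per the patch */

          aes_encrypt(key, hash_subkey, zeroes);
  }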
2024-04-26  crypto: x86/aes-gcm - delete unused GCM assembly code  (Eric Biggers)
Delete aesni_gcm_enc() and aesni_gcm_dec() because they are unused. Only the incremental AES-GCM functions (aesni_gcm_init(), aesni_gcm_enc_update(), aesni_gcm_finalize()) are actually used. This saves 17 KB of object code. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26  crypto: x86/aes-xts - simplify loop in xts_crypt_slowpath()  (Eric Biggers)
Since the total length processed by the loop in xts_crypt_slowpath() is a multiple of AES_BLOCK_SIZE, just round the length down to AES_BLOCK_SIZE even on the last step. This doesn't change behavior, as the last step will process a multiple of AES_BLOCK_SIZE regardless. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
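In other words, the per-step length can be rounded unconditionally. A one-line sketch of the idea, assuming an skcipher-walk style loop (variable names are illustrative):

  /* Sketch: total length is a multiple of AES_BLOCK_SIZE, so the last
   * iteration still processes whole blocks even after rounding down. */
  nbytes = round_down(walk.nbytes, AES_BLOCK_SIZE);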
2024-04-26  crypto: arm64/aes-ce - Simplify round key load sequence  (Ard Biesheuvel)
Tweak the round key logic so that they can be loaded using a single branchless sequence using overlapping loads. This is shorter and simpler, and puts the conditional branches based on the key size further apart, which might benefit microarchitectures that cannot record taken branches at every instruction. For these branches, use test-bit-branch instructions that don't clobber the condition flags. Note that none of this has any impact on performance, positive or otherwise (and the branch prediction benefit would only benefit AES-192 which nobody uses). It does make for nicer code, though. While at it, use \@ to generate the labels inside the macros, which is more robust than using fixed numbers, which could clash inadvertently. Also, bring aes-neon.S in line with these changes, including the switch to test-and-branch instructions, to avoid surprises in the future when we might start relying on the condition flags being preserved in the chaining mode wrappers in aes-modes.S Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26  RISC-V: KVM: Improve firmware counter read function  (Atish Patra)
Rename the function to indicate that it is meant for firmware counter read. While at it, add a range sanity check for it as well. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-17-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26  RISC-V: KVM: Support 64 bit firmware counters on RV32  (Atish Patra)
The SBI v2.0 introduced a fw_read_hi function to read 64 bit firmware counters for RV32 based systems. Add infrastructure to support that. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-16-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
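Conceptually, the read/read_hi pair lets an RV32 guest reassemble the full 64-bit value from two SBI calls. A hypothetical sketch of the split (constant names assumed from the SBI PMU extension):

  /* Sketch: an RV32 guest combines the two halves of a firmware counter */
  u32 lo = (u32)fw_counter_val;           /* via SBI_EXT_PMU_COUNTER_FW_READ    */
  u32 hi = (u32)(fw_counter_val >> 32);   /* via SBI_EXT_PMU_COUNTER_FW_READ_HI */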
2024-04-26  RISC-V: KVM: Add perf sampling support for guests  (Atish Patra)
KVM enables perf for the guest via counter virtualization. However, sampling cannot be supported as there is no mechanism to enable trap/emulation of scountovf in the ISA yet. Rely on the SBI PMU snapshot to provide the counter overflow data via the shared memory. In case of a sampling event, the host first sets the guest's LCOFI interrupt and injects it to the guest via the irq filtering mechanism defined in the AIA specification. Thus, ssaia must be enabled in the host in order to use perf sampling in the guest. No other kernel-side AIA dependency is required. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-15-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26  RISC-V: KVM: Implement SBI PMU Snapshot feature  (Atish Patra)
The PMU snapshot function allows the number of traps to be minimized when the guest configures/accesses the hpmcounters. If the snapshot feature is enabled, the hypervisor updates the shared memory with the counter data and the state of overflown counters. The guest can then just read the shared memory instead of relying on trap & emulate by the hypervisor. This patch doesn't implement the counter overflow yet. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-14-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
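The shared-memory region the guest reads is laid out roughly as below, per the SBI v2.0 PMU snapshot definition (a sketch; field names follow the spec, the struct name is assumed):

  /* Sketch of the SBI PMU snapshot shared memory (one 4 KiB region) */
  struct riscv_pmu_snapshot_data {
          u64 ctr_overflow_mask;   /* bitmap of overflown counters       */
          u64 ctr_values[64];      /* latest values of all counters      */
          u64 reserved[447];       /* pads the region out to 4096 bytes  */
  };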
2024-04-26  RISC-V: KVM: No need to exit to the user space if perf event failed  (Atish Patra)
Currently, we return a Linux error code if creating a perf event fails in kvm. That shouldn't be necessary, as the guest can continue to operate without perf profiling, or profile using firmware counters. Return an appropriate SBI error code to indicate that the PMU configuration failed. An error message in kvm already describes the reason for the failure. Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling") Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-13-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
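A minimal sketch of the idea; the function name, the retdata field and the chosen SBI error constant are illustrative, not the patch's exact code:

  /* Sketch: report failure to the guest via the SBI return value
   * instead of propagating a Linux errno out to userspace. */
  ret = kvm_pmu_create_perf_event(vcpu, ctr_idx, flags, eidx, evtdata);
  if (ret < 0) {
          retdata->err_val = SBI_ERR_NOT_SUPPORTED;   /* illustrative choice */
          return 0;                                   /* handled in-kernel   */
  }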
2024-04-26  RISC-V: KVM: No need to update the counter value during reset  (Atish Patra)
The virtual counter value is updated during pmu_ctr_read. There is no need to update it in the reset case. Otherwise, it will be counted twice, which is incorrect. Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling") Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-12-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26  RISC-V: KVM: Fix the initial sample period value  (Atish Patra)
The initial sample period value, when the counter value is not assigned, should be set to the maximum value supported by the counter width. Otherwise, it may result in spurious interrupts. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-11-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
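A sketch of the intended computation; the helper name and the cinfo.width field are assumptions based on the surrounding PMU code, and the exact fix is in the patch itself:

  /* Sketch: with no counter value assigned, start from the widest period
   * the counter supports so the first overflow is not spuriously early. */
  static u64 kvm_pmu_get_sample_period(struct kvm_pmc *pmc)
  {
          u64 counter_val_mask = GENMASK(pmc->cinfo.width, 0);

          if (!pmc->counter_val)
                  return counter_val_mask;

          return (-pmc->counter_val) & counter_val_mask;
  }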
2024-04-25  x86: call instrumentation hooks from copy_mc.c  (Alexander Potapenko)
Memory accesses in copy_mc_to_kernel() and copy_mc_to_user() are performed by assembly routines and are invisible to KASAN, KCSAN, and KMSAN. Add hooks from instrumentation.h to tell the tools these functions have memcpy/copy_from_user semantics. The call to copy_mc_fragile() in copy_mc_fragile_handle_tail() is left intact, because the latter is only called from the assembly implementation of copy_mc_fragile(), so the memory accesses in it are covered by the instrumentation in copy_mc_to_kernel() and copy_mc_to_user(). Link: https://lore.kernel.org/all/3b7dbd88-0861-4638-b2d2-911c97a4cadf@I-love.SAKURA.ne.jp/ Link: https://lkml.kernel.org/r/20240320101851.2589698-3-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
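The shape of the change is roughly as below; treat this as a sketch, since the hook names come from the instrumentation.h addition in the same series and the real copy_mc_to_kernel() selects between several assembly backends:

  /* Sketch: tell KASAN/KCSAN/KMSAN about the assembly routine's accesses */
  unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src,
                                               unsigned int len)
  {
          unsigned long ret;

          instrument_memcpy_before(dst, src, len);
          ret = copy_mc_fragile(dst, src, len);   /* or the fast-string variant */
          instrument_memcpy_after(dst, src, len, ret);
          return ret;
  }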
2024-04-25  mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST  (David Hildenbrand)
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/20240402125516.223131-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  x86: mm: accelerate pagefault when badaccess  (Kefeng Wang)
The access_error() of the vma is already checked under the per-VMA lock; if it is a bad access, directly handle the error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_area_access_error(): if mm is NULL, release the vma lock, otherwise release mmap_lock. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-8-wangkefeng.wang@huawei.com Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  s390: mm: accelerate pagefault when badaccess  (Kefeng Wang)
The vm_flags of the vma are already checked under the per-VMA lock; if it is a bad access, directly handle the error, no need to retry with mmap_lock again. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  riscv: mm: accelerate pagefault when badaccess  (Kefeng Wang)
The access_error() of the vma is already checked under the per-VMA lock; if it is a bad access, directly handle the error, no need to retry with mmap_lock again. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. [wangkefeng.wang@huawei.com: use `cause' rather than SIGSEGV, per Alexandre] Link: https://lkml.kernel.org/r/ac978061-ce1a-40a4-8b0a-61883b42bea7@huawei.com Link: https://lkml.kernel.org/r/20240403083805.1818160-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  powerpc: mm: accelerate pagefault when badaccess  (Kefeng Wang)
The access_[pkey]_error() of the vma is already checked under the per-VMA lock; if it is a bad access, directly handle the error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_access_pkey()/bad_access(): if mm is NULL, release the vma lock, otherwise release mmap_lock. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  arm: mm: accelerate pagefault when VM_FAULT_BADACCESS  (Kefeng Wang)
The vm_flags of the vma are already checked under the per-VMA lock; if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle the error, no need to retry with mmap_lock again. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS  (Kefeng Wang)
The vm_flags of the vma are already checked under the per-VMA lock; if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle the error, no need to retry with mmap_lock again. The latency is reduced by 34% in 'lat_sig -P 1 prot lat_sig' from the lmbench testcase. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
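The per-VMA-lock fast path then looks roughly like this (a sketch of the arm64 flow; local variable names are illustrative, exact code per the patch):

  /* Sketch: bad access detected under the per-VMA lock -- fail fast,
   * without falling back to mmap_lock-based handling. */
  if (!(vma->vm_flags & vm_flags)) {
          vma_end_read(vma);
          fault = VM_FAULT_BADACCESS;
          count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
          goto done;
  }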
2024-04-25  arm64: mm: cleanup __do_page_fault()  (Kefeng Wang)
Patch series "arch/mm/fault: accelerate pagefault when badaccess", v2. After VMA lock-based page fault handling is enabled, if a bad access is met under the per-vma lock, it will fall back to mmap_lock-based handling, which leads to an unnecessary mmap_lock acquisition and vma lookup again. A test from lmbench shows a 34% improvement after these changes on arm64: lat_sig -P 1 prot lat_sig 0.29194 -> 0.19198 This patch (of 7): __do_page_fault() only calls handle_mm_fault() after vm_flags is checked, and it is only called by do_page_fault(), so let's squash it into do_page_fault() to clean up the code. Link: https://lkml.kernel.org/r/20240403083805.1818160-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240403083805.1818160-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  x86/mm: care about shadow stack guard gap during placement  (Rick Edgecombe)
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in any existing mapping's guard gaps. 2. That the new mapping isn't placed such that any existing mappings end up in *its* guard gaps. The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. Now that the vm_flags are passed into the arch get_unmapped_area() implementations, and vm_unmapped_area() is ready to consider them, give VM_SHADOW_STACK mappings guard gap consideration for scenario 2. Link: https://lkml.kernel.org/r/20240326021656.202649-14-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  x86/mm: implement HAVE_ARCH_UNMAPPED_AREA_VMFLAGS  (Rick Edgecombe)
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in any existing mapping's guard gaps. 2. That the new mapping isn't placed such that any existing mappings end up in *its* guard gaps. The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. Add x86 arch implementations of arch_get_unmapped_area_vmflags/_topdown() so future changes can allow the guard gap of the type of vma being placed to be taken into account. This will be used for shadow stack memory. Link: https://lkml.kernel.org/r/20240326021656.202649-13-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  treewide: use initializer for struct vm_unmapped_area_info  (Rick Edgecombe)
Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new ones are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It could be possible to initialize the new member manually to 0 at each call site. This and a couple other options were discussed. Having some struct vm_unmapped_area_info instances not zero initialized will put those sites at risk of feeding garbage into vm_unmapped_area(), if the convention is to zero initialize the struct and any new field addition missed a call site that initializes each field manually. So it is useful to do things similarly across the kernel. The consensus (see links) was that in general the best way to accomplish this, taking into account both code cleanliness and minimizing the chance of introducing bugs, was to do C99 static initialization. As in: struct vm_unmapped_area_info info = {}; With this method of initialization, the whole struct will be zero initialized, and any statements setting fields to zero will be unneeded. The change should not leave cleanup work behind at the call sites. While iterating through the possible solutions a few archs kindly acked other variations that still zero initialized the struct. These sites have been modified in previous changes using the pattern acked by the respective arch. So, to reduce the chance of bugs via uninitialized fields, perform a tree-wide change using the consensus on the best general way to do this. Use C99 static initialization to zero the struct and remove any statements that simply set members to zero. Link: https://lkml.kernel.org/r/20240326021656.202649-11-rick.p.edgecombe@intel.com Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Link: https://lore.kernel.org/lkml/ec3e377a-c0a0-4dd3-9cb9-96517e54d17e@csgroup.eu/ Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
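Concretely, the pattern applied at each call site looks like this (an illustrative before/after; the member values shown are typical of the generic arch_get_unmapped_area() callers, not any one specific site):

  /* Before: every member assigned by hand; a new member would be garbage */
  struct vm_unmapped_area_info info;

  info.flags = 0;
  info.length = len;
  info.low_limit = mm->mmap_base;
  info.high_limit = mmap_end;
  info.align_mask = 0;
  info.align_offset = 0;

  /* After: an empty/designated initializer zeroes everything up front */
  struct vm_unmapped_area_info info = {
          .length = len,
          .low_limit = mm->mmap_base,
          .high_limit = mmap_end,
  };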
2024-04-25  powerpc: use initializer for struct vm_unmapped_area_info  (Rick Edgecombe)
Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new members are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It could be possible to initialize the new member manually to 0 at each call site. This and a couple other options were discussed, and a working consensus (see links) was that in general the best way to accomplish this would be via static initialization with designated member initializers. Having some struct vm_unmapped_area_info instances not zero initialized will put those sites at risk of feeding garbage into vm_unmapped_area() if the convention is to zero initialize the struct and any new member addition misses a call site that initializes each member manually. It could be possible to leave the code mostly untouched, and just change the line: struct vm_unmapped_area_info info to: struct vm_unmapped_area_info info = {}; However, that would leave cleanup for the members that are manually set to zero, as it would no longer be required. So, to reduce the chance of bugs via uninitialized members, instead simply continue the process of initializing the struct this way tree-wide. This will zero any unspecified members. Move the member initializers to the struct declaration when they are known at that time. Leave out the members that were manually initialized to zero, as this would be redundant for designated initializers. Link: https://lkml.kernel.org/r/20240326021656.202649-10-rick.p.edgecombe@intel.com Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  parisc: use initializer for struct vm_unmapped_area_info  (Rick Edgecombe)
Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new members are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It could be possible to initialize the new member manually to 0 at each call site. This and a couple other options were discussed, and a working consensus (see links) was that in general the best way to accomplish this would be via static initialization with designated member initializers. Having some struct vm_unmapped_area_info instances not zero initialized will put those sites at risk of feeding garbage into vm_unmapped_area() if the convention is to zero initialize the struct and any new member addition misses a call site that initializes each member manually. It could be possible to leave the code mostly untouched, and just change the line: struct vm_unmapped_area_info info to: struct vm_unmapped_area_info info = {}; However, that would leave cleanup for the members that are manually set to zero, as it would no longer be required. So, to reduce the chance of bugs via uninitialized members, instead simply continue the process of initializing the struct this way tree-wide. This will zero any unspecified members. Move the member initializers to the struct declaration when they are known at that time. Leave out the members that were manually initialized to zero, as this would be redundant for designated initializers. Link: https://lkml.kernel.org/r/20240326021656.202649-9-rick.p.edgecombe@intel.com Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  csky: use initializer for struct vm_unmapped_area_info  (Rick Edgecombe)
Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new members are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It could be possible to initialize the new member manually to 0 at each call site. This and a couple other options were discussed, and a working consensus (see links) was that in general the best way to accomplish this would be via static initialization with designated member initializers. Having some struct vm_unmapped_area_info instances not zero initialized will put those sites at risk of feeding garbage into vm_unmapped_area() if the convention is to zero initialize the struct and any new member addition misses a call site that initializes each member manually. It could be possible to leave the code mostly untouched, and just change the line: struct vm_unmapped_area_info info to: struct vm_unmapped_area_info info = {}; However, that would leave cleanup for the members that are manually set to zero, as it would no longer be required. So, to reduce the chance of bugs via uninitialized members, instead simply continue the process of initializing the struct this way tree-wide. This will zero any unspecified members. Move the member initializers to the struct declaration when they are known at that time. Leave out the members that were manually initialized to zero, as this would be redundant for designated initializers. Link: https://lkml.kernel.org/r/20240326021656.202649-8-rick.p.edgecombe@intel.com Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Guo Ren <guoren@kernel.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Guo Ren <guoren@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: switch mm->get_unmapped_area() to a flag  (Rick Edgecombe)
The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arches, a function pointer is not really required. In addition, future changes will want to add versions of the functions that take additional arguments. So, to save a pointer's worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
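The helper reduces to a flag test, roughly as below (a sketch; MMF_TOPDOWN is the flag described above, and the argument list mirrors the old function pointer):

  /* Sketch: pick the topdown or bottom-up implementation from an mm flag
   * instead of an indirect call through mm->get_unmapped_area(). */
  unsigned long
  mm_get_unmapped_area(struct mm_struct *mm, struct file *filp,
                       unsigned long addr, unsigned long len,
                       unsigned long pgoff, unsigned long flags)
  {
          if (test_bit(MMF_TOPDOWN, &mm->flags))
                  return arch_get_unmapped_area_topdown(filp, addr, len, pgoff, flags);
          return arch_get_unmapped_area(filp, addr, len, pgoff, flags);
  }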
2024-04-25mm: remove "prot" parameter from move_pte()David Hildenbrand
The "prot" parameter is unused, and using it instead of what's stored in that particular PTE would very likely be wrong. Let's simply remove it. Link: https://lkml.kernel.org/r/20240327143301.741807-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm/arch: provide pud_pfn() fallback  (Peter Xu)
The comment in the code explains the reasons. We took a different approach compared to pmd_pfn() by providing a fallback function. Another option is to provide some lower-level config options (compare to HUGETLB_PAGE or THP) to identify which layer an arch can support for such huge mappings. However, that can be overkill. [peterx@redhat.com: fix loongson defconfig] Link: https://lkml.kernel.org/r/20240403013249.1418299-4-peterx@redhat.com Link: https://lkml.kernel.org/r/20240327152332.950956-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
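The fallback is just a stub that trips at build time if anything ever uses it on an architecture without huge-PUD mappings; a sketch of the generic-header pattern:

  /* Sketch: default pud_pfn() for archs with no huge-PUD support.
   * Any caller must be compiled out on such archs, hence BUILD_BUG(). */
  #ifndef pud_pfn
  #define pud_pfn pud_pfn
  static inline unsigned long pud_pfn(pud_t pud)
  {
          BUILD_BUG();
          return 0;
  }
  #endif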
2024-04-25  sparc: use is_huge_zero_pmd()  (Matthew Wilcox (Oracle))
Patch series "Convert huge_zero_page to huge_zero_folio". Almost all the callers of is_huge_zero_page() already have a folio. And they should -- is_huge_zero_page() will return false for tail pages, even if they're tail pages of the huge zero page. That's confusing, and one of the benefits of the folio conversion is to get rid of this confusion. This patch (of 8): There's no need to convert to a page, much less a folio. We can tell from the pmd whether it is a huge zero page or not. Saves 60 bytes of text. Link: https://lkml.kernel.org/r/20240326202833.523759-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240326202833.523759-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: convert arch_clear_hugepage_flags to take a folio  (Matthew Wilcox (Oracle))
All implementations that aren't no-ops just set a bit in the flags, and we want to use the folio flags rather than the page flags for that. Rename it to arch_clear_hugetlb_flags() while we're touching it so nobody thinks it's used for THP. [willy@infradead.org: fix arm64 build] Link: https://lkml.kernel.org/r/ZgQvNKGdlDkwhQEX@casper.infradead.org Link: https://lkml.kernel.org/r/20240326171045.410737-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  xtensa: remove uses of PG_arch_1 on individual pages  (Matthew Wilcox (Oracle))
Since switching to the new page table range API, we disregard the PG_arch_1 (aka dcache dirty) flag on tail pages, and only pay attention to it on the folio. Fix these two missed spots where we were setting it on arbitrary pages. Link: https://lkml.kernel.org/r/20240326171045.410737-3-willy@infradead.org Reported-by: Svetly Todorov <svetly.todorov@memverge.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Svetly Todorov <svetly.todorov@memverge.com> [xtensa] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  sh: remove use of PG_arch_1 on individual pages  (Matthew Wilcox (Oracle))
Patch series "Various page->flags cleanups". The first two patches are bug fixes, although I'm not sure that either architecture will have noticed. There aren't a lot of uses of page->flags left! The big build-up here is to reworking stable_page_flags(), which will definitely be a user-visible change. I think a welcome one, given the special case we had to spread the Slab flag into all tail pages. This patch (of 10): Since switching to the new page table range API, we do not set the PG_arch_1 (aka dcache clean) flag on tail pages, only on the folio. Test it on the folio. Also use page_mapped() instead of page_mapcount() as it is more efficient. [akpm@linux-foundation.org: fix folio_flags call] Link: https://lkml.kernel.org/r/20240326171045.410737-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240326171045.410737-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: move follow_phys to arch/x86/mm/pat/memtype.c  (Christoph Hellwig)
follow_phys is only used by two callers in arch/x86/mm/pat/memtype.c. Move it there and hardcode the two arguments that get the same values passed by both callers. [david@redhat.com: conflict resolutions] Link: https://lkml.kernel.org/r/20240403212131.929421-4-david@redhat.com Link: https://lkml.kernel.org/r/20240324234542.2038726-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fei Li <fei1.li@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm/mm_init.c: remove arch_reserved_kernel_pages()  (Baoquan He)
Since the current calculation in calc_nr_kernel_pages() already takes kernel reserved memory into consideration, there is no need for arch_reserved_kernel_pages() any more. Link: https://lkml.kernel.org/r/20240325145646.1044760-7-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  x86: remove unneeded memblock_find_dma_reserve()  (Baoquan He)
Patch series "mm/mm_init.c: refactor free_area_init_core()". In function free_area_init_core(), the code calculating zone->managed_pages and the subtraction of dma_reserve from the DMA zone look very confusing. From git history, the code calculating zone->managed_pages was for zone->present_pages originally. The early rough assignment is to optimize the zone's pcp and watermark setting. Later, managed_pages was introduced into zone to represent the number of pages managed by buddy. Now, zone->managed_pages is zeroed out and reset in mem_init() when calling memblock_free_all(). The zone's pcp and wmark setting relying on the actual zone->managed_pages are done later than the mem_init() invocation. So we don't need to rush to calculate and set zone->managed_pages early; just set it to zone->present_pages and adjust it in mem_init(). Also add a new function calc_nr_kernel_pages() to count up free but not reserved pages in memblock, then assign it to nr_all_pages and nr_kernel_pages after memmap pages are allocated. This patch (of 6): Variable dma_reserve and its usage was introduced in commit 0e0b864e069c ("[PATCH] Account for memmap and optionally the kernel image as holes"). Its original purpose was to account for the reserved pages in the DMA zone to make the DMA zone's watermark calculation more accurate on x86. However, currently there's zone->managed_pages to account for all available pages for buddy, and zone->present_pages to account for all present physical pages in the zone. What is more important, on x86, calculating and setting zone->managed_pages is a temporary move; all zones' managed_pages will be zeroed out and reset to the actual value according to how many pages are added to the buddy allocator in mem_init(). Before mem_init(), no buddy allocation is requested. And the zone's pcp and watermark setting are all done after mem_init(). So there is no need to worry about the DMA zone's setting accuracy during free_area_init(). Hence, remove memblock_find_dma_reserve() to stop calculating and setting dma_reserve. Link: https://lkml.kernel.org/r/20240325145646.1044760-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20240325145646.1044760-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  arm64: mm: swap: support THP_SWAP on hardware with MTE  (Barry Song)
Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up THP_SWAP on ARM64, but it doesn't enable THP_SWAP on hardware with MTE, as the MTE code works with the assumption that tag save/restore always handles a folio with only one page. The limitation should be removed as more and more ARM64 SoCs have this feature. Co-existence of MTE and THP_SWAP becomes more and more important. This patch makes MTE tag saving support large folios, so we don't need to split large folios into base pages for swapping out on ARM64 SoCs with MTE any more. arch_prepare_to_swap() should take a folio rather than a page as parameter because we support THP swap-out as a whole. It saves tags for all pages in a large folio. As we are now restoring tags based on the folio, in arch_swap_restore() we may add some extra loops and early exits while refaulting a large folio which is still in the swapcache in do_swap_page(). In case a large folio has nr pages, do_swap_page() will only set the PTE of the particular page which is causing the page fault. Thus do_swap_page() runs nr times, and each time arch_swap_restore() will loop nr times for the subpages in the folio. So right now the algorithmic complexity becomes O(nr^2). Once we support mapping large folios in do_swap_page(), the extra loops and early exits will decrease while not being completely removed, as a large folio might get partially tagged in corner cases such as: 1. a large folio in the swapcache can be partially unmapped, thus MTE tags for the unmapped pages will be invalidated; 2. users might use mprotect() to set MTE tags on a part of a large folio. arch_thp_swp_supported() is dropped since ARM64 MTE was the only one who needed it. Link: https://lkml.kernel.org/r/20240322114136.61386-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Steven Price <steven.price@arm.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
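The tag-save hook then walks every subpage of the folio, roughly as below (a sketch assuming the existing per-page helpers mte_save_tags() and __mte_invalidate_tags(); exact code per the patch):

  /* Sketch: save MTE tags for all subpages so a large folio can be
   * swapped out without being split first. */
  int arch_prepare_to_swap(struct folio *folio)
  {
          long i, nr;
          int err;

          if (!system_supports_mte())
                  return 0;

          nr = folio_nr_pages(folio);
          for (i = 0; i < nr; i++) {
                  err = mte_save_tags(folio_page(folio, i));
                  if (err) {
                          /* unwind tags already saved for earlier subpages */
                          while (i--)
                                  __mte_invalidate_tags(folio_page(folio, i));
                          return err;
                  }
          }
          return 0;
  }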
2024-04-25  mm: vmalloc: enable memory allocation profiling  (Kent Overstreet)
This wraps all external vmalloc allocation functions with the alloc_hooks() wrapper, and switches internal allocations to _noprof variants where appropriate, for the new memory allocation profiling feature. [surenb@google.com: arch/um: fix forward declaration for vmalloc] Link: https://lkml.kernel.org/r/20240326073750.726636-1-surenb@google.com [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-5-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-31-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
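The wrapping pattern looks roughly like this (a sketch of two representative entries; the real header treats every vmalloc-family entry point the same way):

  /* Sketch: the callable name becomes a macro that records the callsite,
   * while the real implementation moves to a _noprof variant. */
  extern void *vmalloc_noprof(unsigned long size) __alloc_size(1);
  #define vmalloc(...)            alloc_hooks(vmalloc_noprof(__VA_ARGS__))

  extern void *vzalloc_noprof(unsigned long size) __alloc_size(1);
  #define vzalloc(...)            alloc_hooks(vzalloc_noprof(__VA_ARGS__))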
2024-04-25  change alloc_pages name in dma_map_ops to avoid name conflicts  (Suren Baghdasaryan)
After redefining alloc_pages, all uses of that name are being replaced. Change the conflicting names to prevent the preprocessor from replacing them when that is not intended. Link: https://lkml.kernel.org/r/20240321163705.3067592-18-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25fix missing vmalloc.h includesKent Overstreet
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. 
Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
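The fix itself is mechanical: any file that uses vmalloc()/vfree() (or
related declarations) now includes linux/vmalloc.h directly instead of
relying on another header to drag it in.  A minimal sketch of the
pattern, with a hypothetical module (file, function and variable names
are illustrative, not from the patch):

	/* illustrative_driver.c -- hypothetical example of the include fix */
	#include <linux/vmalloc.h>	/* previously pulled in indirectly */
	#include <linux/module.h>
	#include <linux/errno.h>

	static void *big_buffer;

	static int __init illustrative_init(void)
	{
		/* vmalloc()/vfree() are declared in linux/vmalloc.h */
		big_buffer = vmalloc(1 << 20);
		return big_buffer ? 0 : -ENOMEM;
	}

	static void __exit illustrative_exit(void)
	{
		vfree(big_buffer);
	}

	module_init(illustrative_init);
	module_exit(illustrative_exit);
	MODULE_LICENSE("GPL");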
2024-04-25mm/arm: remove pmd_thp_or_huge()Peter Xu
ARM/ARM64 used to define pmd_thp_or_huge().  Now this macro is completely
redundant.  Remove it and use pmd_leaf().

Link: https://lkml.kernel.org/r/20240318200404.448346-14-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Cc: Fabio Estevam <festevam@denx.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
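As a hedged sketch of what callers look like after the conversion (the
helper below is illustrative, not a function from the patch): a check
for "this PMD maps a huge page, whether THP or hugetlb" collapses into a
single pmd_leaf() test.

	#include <linux/pgtable.h>

	/* Illustrative only -- not code added by this patch. */
	static inline bool pmd_maps_huge_page(pmd_t pmd)
	{
		/*
		 * Previously this kind of check was spelled
		 * pmd_thp_or_huge(pmd) on arm/arm64; pmd_leaf() covers both
		 * THP and hugetlb leaf entries.
		 */
		return pmd_leaf(pmd);
	}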
2024-04-25mm/treewide: remove pXd_huge()Peter Xu
This API is not used anymore, drop it for the whole tree.

Link: https://lkml.kernel.org/r/20240318200404.448346-13-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fabio Estevam <festevam@denx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Mark Salter <msalter@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25mm/treewide: replace pXd_huge() with pXd_leaf()Peter Xu
Now that all pXd_huge() definitions are the same as pXd_leaf(), reuse the
latter.  Luckily, pXd_huge() isn't widely used.

Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fabio Estevam <festevam@denx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Mark Salter <msalter@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
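A hedged sketch of the call-site side of the treewide change (the helper
below is illustrative, not a function taken from the patch): the hugetlb
checks at pud/pmd level simply switch to the equivalent, always-defined
pXd_leaf() tests.

	#include <linux/pgtable.h>

	/* Illustrative only: what a converted call site looks like. */
	static inline bool entry_is_leaf(pud_t pud, pmd_t pmd)
	{
		/*
		 * Before: pud_huge(pud) || pmd_huge(pmd)
		 * After:  the pXd_leaf() equivalents, which have generic
		 *         fallbacks for every architecture.
		 */
		return pud_leaf(pud) || pmd_leaf(pmd);
	}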
2024-04-25mm/powerpc: redefine pXd_huge() with pXd_leaf()Peter Xu
PowerPC book3s 4K mostly has the same definition for the two, except that
pXd_huge() constantly returns 0 for hash MMUs.  As Michael Ellerman
pointed out [1], it is safe to check _PAGE_PTE on hash MMUs, as the bit
will never be set so it will keep returning false.

As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc
hugetlb pgtable walker __find_linux_pte() already uses pXd_leaf() to
check leaf hugetlb mappings.

The goal should be that we will have one API, pXd_leaf(), to detect all
kinds of huge mappings (hugepd is still special in this case, though).
AFAICT we need to use the pXd_leaf() impl (rather than pXd_huge()'s) to
make sure that, e.g., THPs on hash MMUs will also return true.  This
helps to simplify a follow-up patch to drop pXd_huge() treewide.

NOTE: the *_leaf() definitions need to be moved before the inclusion of
asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with them.

[1] https://lore.kernel.org/r/87v85zo6w7.fsf@mail.lhotse

Link: https://lkml.kernel.org/r/20240318200404.448346-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fabio Estevam <festevam@denx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Mark Salter <msalter@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
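A simplified sketch of the shape of the change (not the literal powerpc
header hunk; the surrounding context in the real headers differs):

	/*
	 * Sketch only: pXd_huge() becomes the corresponding pXd_leaf() test,
	 * which checks _PAGE_PTE.  On hash MMUs _PAGE_PTE is never set in a
	 * pmd/pud, so the helper keeps returning false there, as before,
	 * while radix THP/hugetlb leaf entries now return true.
	 */
	#define pmd_huge(pmd)	pmd_leaf(pmd)
	#define pud_huge(pud)	pud_leaf(pud)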
2024-04-25mm/arm64: merge pXd_huge() and pXd_leaf() definitionsPeter Xu
Unlike most archs, aarch64 defines pXd_huge() and pXd_leaf() slightly
differently.  Redefine pXd_huge() with pXd_leaf().

There used to be two traps in the old aarch64 definitions of these APIs
that I found when reading the surrounding code; they are:

  (1) 4797ec2dc83a ("arm64: fix pud_huge() for 2-level pagetables")
  (2) 23bc8f69f0ec ("arm64: mm: fix p?d_leaf()")

Defining pXd_huge() with the current pXd_leaf() makes sure (2) isn't a
problem (on PROT_NONE checks).  To make sure it also works for (1), move
the __PAGETABLE_PMD_FOLDED check over to pud_leaf(), allowing it to
constantly return "false" for 2-level pgtables, which looks even safer
and now covers both.

Link: https://lkml.kernel.org/r/20240318200404.448346-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Mark Salter <msalter@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fabio Estevam <festevam@denx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
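A simplified sketch of the idea, not the exact arm64 definitions (the
pud_present()/pud_table() form is only an approximation of the real
header):

	/*
	 * Sketch only: when the PMD level is folded (2-level pgtables),
	 * pud_leaf() constantly returns false -- the same effect the old
	 * pud_huge() fix in 4797ec2dc83a had -- and pud_huge() becomes a
	 * plain alias for pud_leaf().
	 */
	#ifndef __PAGETABLE_PMD_FOLDED
	#define pud_leaf(pud)	(pud_present(pud) && !pud_table(pud))
	#else
	#define pud_leaf(pud)	false
	#endif

	#define pud_huge(pud)	pud_leaf(pud)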
2024-04-25mm/arm: redefine pmd_huge() with pmd_leaf()Peter Xu
Most of the archs already define these two APIs the same way.  ARM is
more complicated in two aspects:

  - For pXd_huge() it's always checking against !PXD_TABLE_BIT, while for
    pXd_leaf() it's always checking against PXD_TYPE_SECT.

  - SECT/TABLE bits are defined differently on 2-level vs. 3-level ARM
    pgtables, which makes the whole thing even harder to follow.

Luckily, the second complexity should be hidden by the pmd_leaf()
implementation in the 2-level vs. 3-level headers.  Invoke pmd_leaf()
directly for pmd_huge() to remove the first part of the complexity.

This prepares for dropping the pXd_huge() API globally.  While at it,
drop the obsolete comments - they're outdated.

Link: https://lkml.kernel.org/r/20240318200404.448346-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Cc: Fabio Estevam <festevam@denx.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Mark Salter <msalter@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
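A hedged sketch of the resulting helper (simplified; the real change is
in the ARM hugetlb code, and the old check shown in the comment is
paraphrased from the description above rather than copied from the
header):

	int pmd_huge(pmd_t pmd)
	{
		/*
		 * Old style: "not a table entry", i.e. a check against
		 * !PMD_TABLE_BIT.  New style: pmd_leaf(), which tests the
		 * PMD_TYPE_SECT encoding and already hides the 2-level vs.
		 * 3-level differences in the respective headers.
		 */
		return pmd_leaf(pmd);
	}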
2024-04-25mm/arm: use macros to define pmd/pud helpersPeter Xu
It's already confusing that ARM 2-level vs. 3-level pgtables define the
SECT bit differently on pmds/puds.  Always use a macro, which is much
clearer.

Link: https://lkml.kernel.org/r/20240318200404.448346-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
Cc: Fabio Estevam <festevam@denx.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Mark Salter <msalter@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
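A hypothetical illustration only - the macro name and exact bit test
below are not taken from the patch - of what "always use a macro" means
here: the "is this entry a section mapping?" check is written once and
reused, instead of the 2-level and 3-level headers each open-coding
their own SECT-bit test for pmds and puds.

	/* Illustrative shared helper; PMD_TYPE_* as in the 32-bit ARM hwdef headers. */
	#define pxd_sect(pxdval)	(((pxdval) & PMD_TYPE_MASK) == PMD_TYPE_SECT)

	#define pmd_leaf(pmd)		pxd_sect(pmd_val(pmd))
	#define pud_leaf(pud)		pxd_sect(pud_val(pud))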