summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2024-02-22mm/cma: make MAX_CMA_AREAS = CONFIG_CMA_AREASAnshuman Khandual
There is no real difference between the global area, and other additionally configured CMA areas via CONFIG_CMA_AREAS that always defaults without user input. This makes MAX_CMA_AREAS same as CONFIG_CMA_AREAS, also incrementing its default values, thus maintaining current default for MAX_CMA_AREAS both for UMA and NUMA systems. Link: https://lkml.kernel.org/r/20240205051929.298559-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: reduce dependencies on <linux/kernel.h>Christophe JAILLET
"page_counter.h" does not need <linux/kernel.h>. <linux/limits.h> is enough to get LONG_MAX. Files that include page_counter.h are limited. They have been compile tested or checked. $ git grep page_counter\.h include/linux/hugetlb_cgroup.h: struct page_counter hugepage[HUGE_MAX_HSTATE]; --> all files that include it have been compile tested include/linux/memcontrol.h:#include <linux/page_counter.h> --> <linux/kernel.h> has been added, to be safe include/net/sock.h:#include <linux/page_counter.h> --> already include <linux/kernel.h> mm/hugetlb_cgroup.c:#include <linux/page_counter.h> mm/memcontrol.c:#include <linux/page_counter.h> mm/page_counter.c:#include <linux/page_counter.h> --> compile tested Link: https://lkml.kernel.org/r/adfdbe21c4d06400d7bd802868762deb85cae8b6.1706908921.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory: optimize fork() with PTE-mapped THPDavid Hildenbrand
Let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio, and all other PTE bits besides the PFNs are equal. We will optimize folio_pte_batch() separately, to ignore selected PTE bits. This patch is based on work by Ryan Roberts. Use __always_inline for __copy_present_ptes() and keep the handling for single PTEs completely separate from the multi-PTE case: we really want the compiler to optimize for the single-PTE case with small folios, to not degrade performance. Note that PTE batching will never exceed a single page table and will always stay within VMA boundaries. Further, processing PTE-mapped THP that maybe pinned and have PageAnonExclusive set on at least one subpage should work as expected, but there is room for improvement: We will repeatedly (1) detect a PTE batch (2) detect that we have to copy a page (3) fall back and allocate a single page to copy a single page. For now we won't care as pinned pages are a corner case, and we should rather look into maintaining only a single PageAnonExclusive bit for large folios. Link: https://lkml.kernel.org/r/20240129124649.189745-14-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/pgtable: make pte_next_pfn() independent of set_ptes()David Hildenbrand
Let's provide pte_next_pfn(), independently of set_ptes(). This allows for using the generic pte_next_pfn() version in some arch-specific set_ptes() implementations, and prepares for reusing pte_next_pfn() in other context. Link: https://lkml.kernel.org/r/20240129124649.189745-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: compaction: update the cc->nr_migratepages when allocating or freeing ↵Baolin Wang
the freepages Currently we will use 'cc->nr_freepages >= cc->nr_migratepages' comparison to ensure that enough freepages are isolated in isolate_freepages(), however it just decreases the cc->nr_freepages without updating cc->nr_migratepages in compaction_alloc(), which will waste more CPU cycles and cause too many freepages to be isolated. So we should also update the cc->nr_migratepages when allocating or freeing the freepages to avoid isolating excess freepages. And I can see fewer free pages are scanned and isolated when running thpcompact on my Arm64 server: k6.7 k6.7_patched Ops Compaction pages isolated 120692036.00 118160797.00 Ops Compaction migrate scanned 131210329.00 154093268.00 Ops Compaction free scanned 1090587971.00 1080632536.00 Ops Compact scan efficiency 12.03 14.26 Moreover, I did not see an obvious latency improvements, this is likely because isolating freepages is not the bottleneck in the thpcompact test case. k6.7 k6.7_patched Amean fault-both-1 1089.76 ( 0.00%) 1080.16 * 0.88%* Amean fault-both-3 1616.48 ( 0.00%) 1636.65 * -1.25%* Amean fault-both-5 2266.66 ( 0.00%) 2219.20 * 2.09%* Amean fault-both-7 2909.84 ( 0.00%) 2801.90 * 3.71%* Amean fault-both-12 4861.26 ( 0.00%) 4733.25 * 2.63%* Amean fault-both-18 7351.11 ( 0.00%) 6950.51 * 5.45%* Amean fault-both-24 9059.30 ( 0.00%) 9159.99 * -1.11%* Amean fault-both-30 10685.68 ( 0.00%) 11399.02 * -6.68%* Link: https://lkml.kernel.org/r/6440493f18da82298152b6305d6b41c2962a3ce6.1708409245.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm: ptdump: have ptdump_check_wx() return boolChristophe Leroy
Have ptdump_check_wx() return true when the check is successful or false otherwise. [akpm@linux-foundation.org: fix a couple of build issues (x86_64 allmodconfig)] Link: https://lkml.kernel.org/r/7943149fe955458cb7b57cd483bf41a3aad94684.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WXChristophe Leroy
All architectures using the core ptdump functionality also implement CONFIG_DEBUG_WX, and they all do it more or less the same way, with a function called debug_checkwx() that is called by mark_rodata_ro(), which is a substitute to ptdump_check_wx() when CONFIG_DEBUG_WX is set and a no-op otherwise. Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call debug_checkwx() immediately after calling mark_rodata_ro() instead of calling it at the end of every mark_rodata_ro(). On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX before calling debug_checkwx(). Now the check is inside the callee ptdump_walk_pgd_level_checkwx(). On powerpc_64, mark_rodata_ro() bails out early before calling ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The check is now also done in ptdump_check_wx() as it is called outside mark_rodata_ro(). Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleavingGregory Price
When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be an wise option. However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics. Instead, interleave is more effective when the allocation policy follows each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple numa nodes, preferably apportioned to match the bandwidth of each node. For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1). Weights for each node can be assigned via the new sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/ For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time. The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1. The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2). Some high level notes about the pieces of weighted interleave: current->il_prev: Tracks the node previously allocated from. current->il_weight: The active weight of the current node (current->il_prev) When this reaches 0, current->il_prev is set to the next node and current->il_weight is set to the next weight. weighted_interleave_nodes: Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. Operates only on task->mempolicy. weighted_interleave_nid: Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. Operates on VMA policies. bulk_array_weighted_interleave: Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them. If a node was scheduled for interleave via interleave_nodes, the current weight will be allocated first. Operates only on the task->mempolicy. One piece of complexity is the interaction between a recent refactor which split the logic to acquire the "ilx" (interleave index) of an allocation and the actually application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above. An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx. For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively). Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com> Signed-off-by: Gregory Price <gregory.price@memverge.com> Co-developed-by: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Rakie Kim <rakie.kim@sk.com> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/list_lru: remove list_lru_putback()Chengming Zhou
Since the only user zswap_lru_putback() has gone, remove list_lru_putback() too. Link: https://lkml.kernel.org/r/20240126-zswap-writeback-race-v2-3-b10479847099@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chriscli@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22x86/mm: delete unused cpu argument to leave_mm()Yosry Ahmed
The argument is unused since commit 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm"), delete it. Link: https://lkml.kernel.org/r/20240126080644.1714297-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm and cache_info: remove unnecessary CPU cache info updateHuang Ying
For each CPU hotplug event, we will update per-CPU data slice size and corresponding PCP configuration for every online CPU to make the implementation simple. But, Kyle reported that this takes tens seconds during boot on a machine with 34 zones and 3840 CPUs. So, in this patch, for each CPU hotplug event, we only update per-CPU data slice size and corresponding PCP configuration for the CPUs that share caches with the hotplugged CPU. With the patch, the system boot time reduces 67 seconds on the machine. Link: https://lkml.kernel.org/r/20240126081944.414520-1-ying.huang@intel.com Fixes: 362d37a106dd ("mm, pcp: reduce lock contention for draining high-order pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Originally-by: Kyle Meyer <kyle.meyer@hpe.com> Reported-and-tested-by: Kyle Meyer <kyle.meyer@hpe.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22highmem: add kernel-doc for memcpy_*_folio()Matthew Wilcox (Oracle)
This was inadvertently skipped when adding the new functions. Link: https://lkml.kernel.org/r/20240124181217.1761674-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory_hotplug: export mhp_supports_memmap_on_memory()Vishal Verma
In preparation for adding sysfs ABI to toggle memmap_on_memory semantics for drivers adding memory, export the mhp_supports_memmap_on_memory() helper. This allows drivers to check if memmap_on_memory support is available before trying to request it, and display an appropriate message if it isn't available. As part of this, remove the size argument to this - with recent updates to allow memmap_on_memory for larger ranges, and the internal splitting of altmaps into respective memory blocks, the size argument is meaningless. [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-4-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/zswap: split zswap rb-treeChengming Zhou
Each swapfile has one rb-tree to search the mapping of swp_entry_t to zswap_entry, that use a spinlock to protect, which can cause heavy lock contention if multiple tasks zswap_store/load concurrently. Optimize the scalability problem by splitting the zswap rb-tree into multiple rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M), just like we did in the swap cache address_space splitting. Although this method can't solve the spinlock contention completely, it can mitigate much of that contention. Below is the results of kernel build in tmpfs with zswap shrinker enabled: linux-next zswap-lock-optimize real 1m9.181s 1m3.820s user 17m44.036s 17m40.100s sys 7m37.297s 4m54.622s So there are clearly improvements. Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-2-b5cc55479090@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chriscli@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/zswap: make sure each swapfile always have zswap rb-treeChengming Zhou
Patch series "mm/zswap: optimize the scalability of zswap rb-tree", v2. When testing the zswap performance by using kernel build -j32 in a tmpfs directory, I found the scalability of zswap rb-tree is not good, which is protected by the only spinlock. That would cause heavy lock contention if multiple tasks zswap_store/load concurrently. So a simple solution is to split the only one zswap rb-tree into multiple rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"). Although this method can't solve the spinlock contention completely, it can mitigate much of that contention. Below is the results of kernel build in tmpfs with zswap shrinker enabled: linux-next zswap-lock-optimize real 1m9.181s 1m3.820s user 17m44.036s 17m40.100s sys 7m37.297s 4m54.622s So there are clearly improvements. And it's complementary with the ongoing zswap xarray conversion by Chris. Anyway, I think we can also merge this first, it's complementary IMHO. So I just refresh and resend this for further discussion. This patch (of 2): Not all zswap interfaces can handle the absence of the zswap rb-tree, actually only zswap_store() has handled it for now. To make things simple, we make sure each swapfile always have the zswap rb-tree prepared before being enabled and used. The preparation is unlikely to fail in practice, this patch just make it explicit. Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-0-b5cc55479090@bytedance.com Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-1-b5cc55479090@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chriscli@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22bpf: Clarify batch lookup/lookup_and_delete semanticsMartin Kelly
The batch lookup and lookup_and_delete APIs have two parameters, in_batch and out_batch, to facilitate iterative lookup/lookup_and_deletion operations for supported maps. Except NULL for in_batch at the start of these two batch operations, both parameters need to point to memory equal or larger than the respective map key size, except for various hashmaps (hash, percpu_hash, lru_hash, lru_percpu_hash) where the in_batch/out_batch memory size should be at least 4 bytes. Document these semantics to clarify the API. Signed-off-by: Martin Kelly <martin.kelly@crowdstrike.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240221211838.1241578-1-martin.kelly@crowdstrike.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-22Merge tag 'vfs-6.8-rc6.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a memory leak in cachefiles - Restrict aio cancellations to I/O submitted through the aio interfaces as this is otherwise causing issues for I/O submitted via io_uring - Increase buffer for afs volume status to avoid overflow - Fix a missing zero-length check in unbuffered writes in the netfs library. If generic_write_checks() returns zero make netfs_unbuffered_write_iter() return right away - Prevent a leak in i_dio_count caused by netfs_begin_read() operating past i_size. It will return early and leave i_dio_count incremented - Account for ipv4 addresses as well as ipv6 addresses when processing incoming callbacks in afs * tag 'vfs-6.8-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio afs: Increase buffer size in afs_update_volume_status() afs: Fix ignored callbacks over ipv4 cachefiles: fix memory leak in cachefiles_add_cache() netfs: Fix missing zero-length check in unbuffered write netfs: Fix i_dio_count leak on DIO read past i_size
2024-02-22Merge tag 'net-6.8.0-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf and netfilter. Current release - regressions: - af_unix: fix another unix GC hangup Previous releases - regressions: - core: fix a possible AF_UNIX deadlock - bpf: fix NULL pointer dereference in sk_psock_verdict_data_ready() - netfilter: nft_flow_offload: release dst in case direct xmit path is used - bridge: switchdev: ensure MDB events are delivered exactly once - l2tp: pass correct message length to ip6_append_data - dccp/tcp: unhash sk from ehash for tb2 alloc failure after check_estalblished() - tls: fixes for record type handling with PEEK - devlink: fix possible use-after-free and memory leaks in devlink_init() Previous releases - always broken: - bpf: fix an oops when attempting to read the vsyscall page through bpf_probe_read_kernel - sched: act_mirred: use the backlog for mirred ingress - netfilter: nft_flow_offload: fix dst refcount underflow - ipv6: sr: fix possible use-after-free and null-ptr-deref - mptcp: fix several data races - phonet: take correct lock to peek at the RX queue Misc: - handful of fixes and reliability improvements for selftests" * tag 'net-6.8.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) l2tp: pass correct message length to ip6_append_data net: phy: realtek: Fix rtl8211f_config_init() for RTL8211F(D)(I)-VD-CG PHY selftests: ioam: refactoring to align with the fix Fix write to cloned skb in ipv6_hop_ioam() phonet/pep: fix racy skb_queue_empty() use phonet: take correct lock to peek at the RX queue net: sparx5: Add spinlock for frame transmission from CPU net/sched: flower: Add lock protection when remove filter handle devlink: fix port dump cmd type net: stmmac: Fix EST offset for dwmac 5.10 tools: ynl: don't leak mcast_groups on init error tools: ynl: make sure we always pass yarg to mnl_cb_run net: mctp: put sock on tag allocation failure netfilter: nf_tables: use kzalloc for hook allocation netfilter: nf_tables: register hooks last when adding new chain/flowtable netfilter: nft_flow_offload: release dst in case direct xmit path is used netfilter: nft_flow_offload: reset dst in route object after setting up flow netfilter: nf_tables: set dormant flag on hook register failure selftests: tls: add test for peeking past a record of a different type selftests: tls: add test for merging of same-type control messages ...
2024-02-22timers: Always queue timers on the local CPUAnna-Maria Behnsen
The timer pull model is in place so we can remove the heuristics which try to guess the best target CPU at enqueue/modification time. All non pinned timers are queued on the local CPU in the separate storage and eventually pulled at expiry time to a remote CPU. Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-21-anna-maria@linutronix.de
2024-02-22timer_migration: Add tracepointsAnna-Maria Behnsen
The timer pull logic needs proper debugging aids. Add tracepoints so the hierarchical idle machinery can be diagnosed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240222103403.31923-1-anna-maria@linutronix.de
2024-02-22timers: Implement the hierarchical pull modelAnna-Maria Behnsen
Placing timers at enqueue time on a target CPU based on dubious heuristics does not make any sense: 1) Most timer wheel timers are canceled or rearmed before they expire. 2) The heuristics to predict which CPU will be busy when the timer expires are wrong by definition. So placing the timers at enqueue wastes precious cycles. The proper solution to this problem is to always queue the timers on the local CPU and allow the non pinned timers to be pulled onto a busy CPU at expiry time. Therefore split the timer storage into local pinned and global timers: Local pinned timers are always expired on the CPU on which they have been queued. Global timers can be expired on any CPU. As long as a CPU is busy it expires both local and global timers. When a CPU goes idle it arms for the first expiring local timer. If the first expiring pinned (local) timer is before the first expiring movable timer, then no action is required because the CPU will wake up before the first movable timer expires. If the first expiring movable timer is before the first expiring pinned (local) timer, then this timer is queued into an idle timerqueue and eventually expired by another active CPU. To avoid global locking the timerqueues are implemented as a hierarchy. The lowest level of the hierarchy holds the CPUs. The CPUs are associated to groups of 8, which are separated per node. If more than one CPU group exist, then a second level in the hierarchy collects the groups. Depending on the size of the system more than 2 levels are required. Each group has a "migrator" which checks the timerqueue during the tick for remote expirable timers. If the last CPU in a group goes idle it reports the first expiring event in the group up to the next group(s) in the hierarchy. If the last CPU goes idle it arms its timer for the first system wide expiring timer to ensure that no timer event is missed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240222103710.32582-1-anna-maria@linutronix.de
2024-02-22timers: Introduce add_timer() variants which modify timer flagsAnna-Maria Behnsen
A timer might be used as a pinned timer (using add_timer_on()) and later on as non-pinned timer using add_timer(). When the "NOHZ timer pull at expiry model" is in place, the TIMER_PINNED flag is required to be used whenever a timer needs to expire on a dedicated CPU. Otherwise the flag must not be set if expiration on a dedicated CPU is not required. add_timer_on()'s behavior will be changed during the preparation patches for the "NOHZ timer pull at expiry model" to unconditionally set the TIMER_PINNED flag. To be able to clear/ set the flag when queueing a timer, two variants of add_timer() are introduced. This is a preparatory step and has no functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-6-anna-maria@linutronix.de
2024-02-22net: mctp: provide a more specific tag allocation ioctlJeremy Kerr
Now that we have net-specific tags, extend the tag allocation ioctls (SIOCMCTPALLOCTAG / SIOCMCTPDROPTAG) to allow a network parameter to be passed to the tag allocation. We also add a local_addr member to the ioc struct, to allow for a future finer-grained tag allocation using local EIDs too. We don't add any specific support for that now though, so require MCTP_ADDR_ANY or MCTP_ADDR_NULL for those at present. The old ioctls will still work, but allocate for the default MCTP net. These are now marked as deprecated in the header. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-22net: mctp: separate key correlation across netsJeremy Kerr
Currently, we lookup sk_keys from the entire struct net_namespace, which may contain multiple MCTP net IDs. In those cases we want to distinguish between endpoints with the same EID but different net ID. Add the net ID data to the struct mctp_sk_key, populate on add and filter on this during route lookup. For the ioctl interface, we use a default net of MCTP_INITIAL_DEFAULT_NET (ie., what will be in use for single-net configurations), but we'll extend the ioctl interface to provide net-specific tag allocation in an upcoming change. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-22net: mctp: avoid confusion over local/peer dest/source addressesJeremy Kerr
We have a double-swap of local and peer addresses in mctp_alloc_local_tag; the arguments in both call sites are swapped, but there is also a swap in the implementation of alloc_local_tag. This is opaque because we're using source/dest address references, which don't match the local/peer semantics. Avoid this confusion by naming the arguments as 'local' and 'peer', and remove the double swap. The calling order now matches mctp_key_alloc. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-22ACPI: x86: Move acpi_quirk_skip_serdev_enumeration() out of ↵Hans de Goede
CONFIG_X86_ANDROID_TABLETS Some recent(ish) Dell AIO devices have a backlight controller board connected to an UART. This UART has a DELL0501 HID with CID set to PNP0501 so that the UART is still handled by 8250_pnp.c. Unfortunately there is no separate ACPI device with an UartSerialBusV2() resource to model the backlight-controller. The next patch in this series will use acpi_quirk_skip_serdev_enumeration() to still create a serdev for this for a backlight driver to bind to instead of creating a /dev/ttyS0. This new acpi_quirk_skip_serdev_enumeration() use is not limited to Android X86 tablets, so move it out of the ifdef CONFIG_X86_ANDROID_TABLETS block. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-02-22Merge series 'Use Maple Trees for simple_offset utilities' of ↵Christian Brauner
https://lore.kernel.org/r/170820083431.6328.16233178852085891453.stgit@91.116.238.104.host.secureserver.net Pull simple offset series from Chuck Lever In an effort to address slab fragmentation issues reported a few months ago, I've replaced the use of xarrays for the directory offset map in "simple" file systems (including tmpfs). Thanks to Liam Howlett for helping me get this working with Maple Trees. * series 'Use Maple Trees for simple_offset utilities' of https://lore.kernel.org/r/170820083431.6328.16233178852085891453.stgit@91.116.238.104.host.secureserver.net: (6 commits) libfs: Convert simple directory offsets to use a Maple Tree test_maple_tree: testing the cyclic allocation maple_tree: Add mtree_alloc_cyclic() libfs: Add simple_offset_empty() libfs: Define a minimum directory offset libfs: Re-arrange locking in offset_iterate_dir() Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-22firmware: arm_scmi: Add standard clock OEM definitionsCristian Marussi
Add a common enum to define the standard clock OEM types defined by the SCMI specification, so as to enable the configuration of such extended configuration properties with the existent clock protocol operations. Signed-off-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20240214183006.3403207-5-cristian.marussi@arm.com Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
2024-02-22firmware: arm_scmi: Add clock check for extended config supportCristian Marussi
SCMI v3.2 added support to set/get clock custom OEM types; such support is conditionally present, though, depending on an extended config attribute bit possibly advertised by the platform server on a per-domain base. Add a check to verify if OEM types are supported before allowing any kind of OEM-specific get/set operation. Also add a check around all the new v3.2 clock features. Signed-off-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20240214183006.3403207-4-cristian.marussi@arm.com Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
2024-02-21dt-bindings: clock: mobileye,eyeq5-clk: add bindingsThéo Lebrun
Add DT schema bindings for the EyeQ5 clock controller driver. Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://lore.kernel.org/r/20240221-mbly-clk-v7-3-31d4ce3630c3@bootlin.com Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-02-21clk: fixed-factor: add fwname-based constructor functionsThéo Lebrun
Add four functions to register clk_hw based on the fw_name field in clk_parent_data, ie the value in the DT property `clock-names`. There are variants for devm or not and passing an accuracy or not passing one: - clk_hw_register_fixed_factor_fwname - clk_hw_register_fixed_factor_with_accuracy_fwname - devm_clk_hw_register_fixed_factor_fwname - devm_clk_hw_register_fixed_factor_with_accuracy_fwname The `struct clk_parent_data` init is extracted from __clk_hw_register_fixed_factor to each calling function. It is required to allow each function to pass whatever field they want, not only index. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://lore.kernel.org/r/20240221-mbly-clk-v7-2-31d4ce3630c3@bootlin.com Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-02-21clk: fixed-factor: add optional accuracy supportThéo Lebrun
Fixed factor clock reports the parent clock accuracy. Add flags and acc fields to `struct clk_fixed_factor` to support setting a fixed accuracy. The default if no flag is set is not changed: use the parent clock accuracy. Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://lore.kernel.org/r/20240221-mbly-clk-v7-1-31d4ce3630c3@bootlin.com Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-02-21dt-bindings: clock: ast2600: Add FSI clockEddie James
Add a definition for the FSI clock. Signed-off-by: Eddie James <eajames@linux.ibm.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240215220759.976998-2-eajames@linux.ibm.com Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-02-21clk: Provide managed helper to get and enable bulk clocksShradha Todi
Provide a managed devm_clk_bulk* wrapper to get and enable all bulk clocks in order to simplify drivers that keeps all clocks enabled for the time of driver operation. Suggested-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Shradha Todi <shradha.t@samsung.com> Link: https://lore.kernel.org/r/20240220084046.23786-2-shradha.t@samsung.com Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-02-21dt-bindings: reset: mediatek: add MT7988 infracfg reset IDsFrank Wunderlich
Add reset constants for using as index in driver and dts. Value is starting again from 0 because resets are used in another device than existing constants. Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20240201182409.39878-2-linux@fw-web.de Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-02-21udp: add local "peek offset enabled" flagPaolo Abeni
We want to re-organize the struct sock layout. The sk_peek_off field location is problematic, as most protocols want it in the RX read area, while UDP wants it on a cacheline different from sk_receive_queue. Create a local (inside udp_sock) copy of the 'peek offset is enabled' flag and place it inside the same cacheline of reader_queue. Check such flag before reading sk_peek_off. This will save potential false sharing and cache misses in the fast-path. Tested under UDP flood with small packets. The struct sock layout update causes a 4% performance drop, and this patch restores completely the original tput. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/67ab679c15fbf49fa05b3ffe05d91c47ab84f147.1708426665.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-21mm: convert mm_counter_file() to take a folioKefeng Wang
Now all callers of mm_counter_file() have a folio, convert mm_counter_file() to take a folio. Saves a call to compound_head() hidden inside PageSwapBacked(). Link: https://lkml.kernel.org/r/20240111152429.3374566-11-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: convert mm_counter() to take a folioKefeng Wang
Now all callers of mm_counter() have a folio, convert mm_counter() to take a folio. Saves a call to compound_head() hidden inside PageAnon(). Link: https://lkml.kernel.org/r/20240111152429.3374566-10-willy@infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: add pfn_swap_entry_folio()Matthew Wilcox (Oracle)
Patch series "mm: convert mm counter to take a folio", v3. Make sure all mm_counter() and mm_counter_file() callers have a folio, then convert mm counter functions to take a folio, which saves some compound_head() calls. This patch (of 10): Thanks to the compound_head() hidden inside PageLocked(), this saves a call to compound_head() over calling page_folio(pfn_swap_entry_to_page()) Link: https://lkml.kernel.org/r/20240111152429.3374566-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240111152429.3374566-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm: list_lru: remove unused macro list_lru_init_key()Haifeng Xu
list_lru_init_key() isn't used by anyone, remove it to clean up. Link: https://lkml.kernel.org/r/20231228062715.338672-2-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiersSumanth Korikkar
Patch series "implement "memmap on memory" feature on s390". This series provides "memmap on memory" support on s390 platform. "memmap on memory" allows struct pages array to be allocated from the hotplugged memory range instead of allocating it from main system memory. s390 currently preallocates struct pages array for all potentially possible memory, which ensures memory onlining always succeeds, but with the cost of significant memory consumption from the available system memory during boottime. In certain extreme configuration, this could lead to ipl failure. "memmap on memory" ensures struct pages array are populated from self contained hotplugged memory range instead of depleting the available system memory and this could eliminate ipl failure on s390 platform. On other platforms, system might go OOM when the physically hotplugged memory depletes the available memory before it is onlined. Hence, "memmap on memory" feature was introduced as described in commit a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range"). Unlike other architectures, s390 memory blocks are not physically accessible until it is online. To make it physically accessible two new memory notifiers MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE are added and this notifier lets the hypervisor inform that the memory should be made physically accessible. This allows for "memmap on memory" initialization during memory hotplug onlining phase, which is performed before calling MEM_GOING_ONLINE notifier. Patch 1 introduces MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to prepare the transition of memory to and from a physically accessible state. New mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced to ensure altmap cannot be written when adding memory - before it is set online. This enhancement is crucial for implementing the "memmap on memory" feature for s390 in a subsequent patch. Patches 2 allocates vmemmap pages from self-contained memory range for s390. It allocates memory map (struct pages array) from the hotplugged memory range, rather than using system memory by passing altmap to vmemmap functions. Patch 3 removes unhandled memory notifier types on s390. Patch 4 implements MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers on s390. MEM_PREPARE_ONLINE memory notifier makes memory block physical accessible via sclp assign command. The notifier ensures self-contained memory maps are accessible and hence enabling the "memmap on memory" on s390. MEM_FINISH_OFFLINE memory notifier shifts the memory block to an inaccessible state via sclp unassign command. Patch 5 finally enables MHP_MEMMAP_ON_MEMORY on s390. This patch (of 5): Introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to prepare the transition of memory to and from a physically accessible state. This enhancement is crucial for implementing the "memmap on memory" feature for s390 in a subsequent patch. Platforms such as x86 can support physical memory hotplug via ACPI. When there is physical memory hotplug, ACPI event leads to the memory addition with the following callchain: acpi_memory_device_add() -> acpi_memory_enable_device() -> __add_memory() After this, the hotplugged memory is physically accessible, and altmap support prepared, before the "memmap on memory" initialization in memory_block_online() is called. On s390, memory hotplug works in a different way. The available hotplug memory has to be defined upfront in the hypervisor, but it is made physically accessible only when the user sets it online via sysfs, currently in the MEM_GOING_ONLINE notifier. This is too late and "memmap on memory" initialization is performed before calling MEM_GOING_ONLINE notifier. During the memory hotplug addition phase, altmap support is prepared and during the memory onlining phase s390 requires memory to be physically accessible and then subsequently initiate the "memmap on memory" initialization process. The memory provider will handle new MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE notifications and make the memory accessible. The mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced and is relevant when used along with MHP_MEMMAP_ON_MEMORY, because the altmap cannot be written (e.g., poisoned) when adding memory -- before it is set online. This allows for adding memory with an altmap that is not currently made available by a hypervisor. When onlining that memory, the hypervisor can be instructed to make that memory accessible via the new notifiers and the onlining phase will not require any memory allocations, which is helpful in low-memory situations. All architectures ignore unknown memory notifiers. Therefore, the introduction of these new notifiers does not result in any functional modifications across architectures. Link: https://lkml.kernel.org/r/20240108132747.3238763-1-sumanthk@linux.ibm.com Link: https://lkml.kernel.org/r/20240108132747.3238763-2-sumanthk@linux.ibm.com Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Suggested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22netfilter: nft_flow_offload: reset dst in route object after setting up flowPablo Neira Ayuso
dst is transferred to the flow object, route object does not own it anymore. Reset dst in route object, otherwise if flow_offload_add() fails, error path releases dst twice, leading to a refcount underflow. Fixes: a3c90f7a2323 ("netfilter: nf_tables: flow offload expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-02-21net: phy: marvell-88q2xxx: add driver for the Marvell 88Q2220 PHYDimitri Fedrau
Add a driver for the Marvell 88Q2220. This driver allows to detect the link, switch between 100BASE-T1 and 1000BASE-T1 and switch between master and slave mode. Autonegotiation is supported. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Gregor Herburger <gregor.herburger@ew.tq-group.com> Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com> Link: https://lore.kernel.org/r/20240218075753.18067-6-dima.fedrau@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-21net: phy: Support 100/1000BT1 linkmode advertisementsDimitri Fedrau
Extend helper functions mii_t1_adv_m_mod_linkmode_t and linkmode_adv_to_mii_t1_adv_m_t to support 100BT1 and 1000BT1 linkmode advertisements. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com> Link: https://lore.kernel.org/r/20240218075753.18067-3-dima.fedrau@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-21net: phy: Add BaseT1 auto-negotiation constantsDimitri Fedrau
Added constants for advertising 100BT1 and 1000BT1 in register BASE-T1 auto-negotiation advertisement register [31:16] (Register 7.515) Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Dimitri Fedrau <dima.fedrau@gmail.com> Link: https://lore.kernel.org/r/20240218075753.18067-2-dima.fedrau@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-21bpf: Shrink size of struct bpf_map/bpf_array.Alexei Starovoitov
Back in 2018 the commit be95a845cc44 ("bpf: avoid false sharing of map refcount with max_entries") added ____cacheline_aligned to "struct bpf_map" to make sure that fields like refcnt don't share a cache line with max_entries that is used to bounds check map access. That was done to make spectre style attacks harder. The main mitigation is done via code similar to array_index_nospec(), of course. This was an additional precaution. It increased the size of "struct bpf_map" a little, but it's affect on all other maps (like array) is significant, since "struct bpf_map" is typically the first member in other map types. Undo this ____cacheline_aligned tag. Instead move freeze_mutex field around, so that refcnt and max_entries are still in different cache lines. The main effect is seen in sizeof(struct bpf_array) that reduces from 320 to 248 bytes. BEFORE: struct bpf_map { const struct bpf_map_ops * ops; /* 0 8 */ ... char name[16]; /* 96 16 */ /* XXX 16 bytes hole, try to pack */ /* --- cacheline 2 boundary (128 bytes) --- */ atomic64_t refcnt __attribute__((__aligned__(64))); /* 128 8 */ ... /* size: 256, cachelines: 4, members: 30 */ /* sum members: 232, holes: 1, sum holes: 16 */ /* padding: 8 */ /* paddings: 1, sum paddings: 2 */ } __attribute__((__aligned__(64))); struct bpf_array { struct bpf_map map; /* 0 256 */ ... /* size: 320, cachelines: 5, members: 5 */ /* padding: 48 */ /* paddings: 1, sum paddings: 8 */ } __attribute__((__aligned__(64))); AFTER: struct bpf_map { /* size: 232, cachelines: 4, members: 30 */ /* paddings: 1, sum paddings: 2 */ /* last cacheline: 40 bytes */ }; struct bpf_array { /* size: 248, cachelines: 4, members: 5 */ /* last cacheline: 56 bytes */ }; Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240220235001.57411-1-alexei.starovoitov@gmail.com
2024-02-21drm/xe: Add uapi for dumpable bosMaarten Lankhorst
Add the flag XE_VM_BIND_FLAG_DUMPABLE to notify devcoredump that this mapping should be dumped. This is not hooked up, but the uapi should be ready before merging. It's likely easier to dump the contents of the bo's at devcoredump readout time, so it's better if the bos will stay unmodified after a hang. The NEEDS_CPU_MAPPING flag is removed as requirement. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240221133024.898315-3-maarten.lankhorst@linux.intel.com
2024-02-21RDMA/uverbs: Remove flexible arrays from struct *_filterErick Archer
When a struct containing a flexible array is included in another struct, and there is a member after the struct-with-flex-array, there is a possibility of memory overlap. These cases must be audited [1]. See: struct inner { ... int flex[]; }; struct outer { ... struct inner header; int overlap; ... }; This is the scenario for all the "struct *_filter" structures that are included in the following "struct ib_flow_spec_*" structures: struct ib_flow_spec_eth struct ib_flow_spec_ib struct ib_flow_spec_ipv4 struct ib_flow_spec_ipv6 struct ib_flow_spec_tcp_udp struct ib_flow_spec_tunnel struct ib_flow_spec_esp struct ib_flow_spec_gre struct ib_flow_spec_mpls The pattern is like the one shown below: struct *_filter { ... u8 real_sz[]; }; struct ib_flow_spec_* { ... struct *_filter val; struct *_filter mask; }; In this case, the trailing flexible array "real_sz" is never allocated and is only used to calculate the size of the structures. Here the use of the "offsetof" helper can be changed by the "sizeof" operator because the goal is to get the size of these structures. Therefore, the trailing flexible arrays can also be removed. However, due to the trailing padding that can be induced in structs it is possible that the: offsetof(struct *_filter, real_sz) != sizeof(struct *_filter) This situation happens with the "struct ib_flow_ipv6_filter" and to avoid it the "__packed" macro is used in this structure. But now, the "sizeof(struct ib_flow_ipv6_filter)" has changed. This is not a problem since this size is not used in the code. The situation now is that "sizeof(struct ib_flow_spec_ipv6)" has also changed (this struct contains the struct ib_flow_ipv6_filter). This is also not a problem since it is only used to set the size of the "union ib_flow_spec", which can store all the "ib_flow_spec_*" structures. Link: https://lore.kernel.org/r/20240217142913.4285-1-erick.archer@gmx.com Signed-off-by: Erick Archer <erick.archer@gmx.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-02-21fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaioBart Van Assche
If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the following kernel warning appears: WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8 Call trace: kiocb_set_cancel_fn+0x9c/0xa8 ffs_epfile_read_iter+0x144/0x1d0 io_read+0x19c/0x498 io_issue_sqe+0x118/0x27c io_submit_sqes+0x25c/0x5fc __arm64_sys_io_uring_enter+0x104/0xab0 invoke_syscall+0x58/0x11c el0_svc_common+0xb4/0xf4 do_el0_svc+0x2c/0xb0 el0_svc+0x2c/0xa4 el0t_64_sync_handler+0x68/0xb4 el0t_64_sync+0x1a4/0x1a8 Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is submitted by libaio. Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-21wifi: cfg80211: use ML element parsing helpersJohannes Berg
Use the existing ML element parsing helpers and add a new one for this (ieee80211_mle_get_mld_id). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240216135047.4da47b1f035b.I437a5570ac456449facb0b147851ef24a1e473c2@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>