From c5f1e2d1890935a734c302b9b8579748222b8e1e Mon Sep 17 00:00:00 2001 From: Sumanth Korikkar Date: Mon, 8 Jan 2024 14:27:43 +0100 Subject: mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers Patch series "implement "memmap on memory" feature on s390". This series provides "memmap on memory" support on s390 platform. "memmap on memory" allows struct pages array to be allocated from the hotplugged memory range instead of allocating it from main system memory. s390 currently preallocates struct pages array for all potentially possible memory, which ensures memory onlining always succeeds, but with the cost of significant memory consumption from the available system memory during boottime. In certain extreme configuration, this could lead to ipl failure. "memmap on memory" ensures struct pages array are populated from self contained hotplugged memory range instead of depleting the available system memory and this could eliminate ipl failure on s390 platform. On other platforms, system might go OOM when the physically hotplugged memory depletes the available memory before it is onlined. Hence, "memmap on memory" feature was introduced as described in commit a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range"). Unlike other architectures, s390 memory blocks are not physically accessible until it is online. To make it physically accessible two new memory notifiers MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE are added and this notifier lets the hypervisor inform that the memory should be made physically accessible. This allows for "memmap on memory" initialization during memory hotplug onlining phase, which is performed before calling MEM_GOING_ONLINE notifier. Patch 1 introduces MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to prepare the transition of memory to and from a physically accessible state. New mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced to ensure altmap cannot be written when adding memory - before it is set online. This enhancement is crucial for implementing the "memmap on memory" feature for s390 in a subsequent patch. Patches 2 allocates vmemmap pages from self-contained memory range for s390. It allocates memory map (struct pages array) from the hotplugged memory range, rather than using system memory by passing altmap to vmemmap functions. Patch 3 removes unhandled memory notifier types on s390. Patch 4 implements MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers on s390. MEM_PREPARE_ONLINE memory notifier makes memory block physical accessible via sclp assign command. The notifier ensures self-contained memory maps are accessible and hence enabling the "memmap on memory" on s390. MEM_FINISH_OFFLINE memory notifier shifts the memory block to an inaccessible state via sclp unassign command. Patch 5 finally enables MHP_MEMMAP_ON_MEMORY on s390. This patch (of 5): Introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to prepare the transition of memory to and from a physically accessible state. This enhancement is crucial for implementing the "memmap on memory" feature for s390 in a subsequent patch. Platforms such as x86 can support physical memory hotplug via ACPI. When there is physical memory hotplug, ACPI event leads to the memory addition with the following callchain: acpi_memory_device_add() -> acpi_memory_enable_device() -> __add_memory() After this, the hotplugged memory is physically accessible, and altmap support prepared, before the "memmap on memory" initialization in memory_block_online() is called. On s390, memory hotplug works in a different way. The available hotplug memory has to be defined upfront in the hypervisor, but it is made physically accessible only when the user sets it online via sysfs, currently in the MEM_GOING_ONLINE notifier. This is too late and "memmap on memory" initialization is performed before calling MEM_GOING_ONLINE notifier. During the memory hotplug addition phase, altmap support is prepared and during the memory onlining phase s390 requires memory to be physically accessible and then subsequently initiate the "memmap on memory" initialization process. The memory provider will handle new MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE notifications and make the memory accessible. The mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced and is relevant when used along with MHP_MEMMAP_ON_MEMORY, because the altmap cannot be written (e.g., poisoned) when adding memory -- before it is set online. This allows for adding memory with an altmap that is not currently made available by a hypervisor. When onlining that memory, the hypervisor can be instructed to make that memory accessible via the new notifiers and the onlining phase will not require any memory allocations, which is helpful in low-memory situations. All architectures ignore unknown memory notifiers. Therefore, the introduction of these new notifiers does not result in any functional modifications across architectures. Link: https://lkml.kernel.org/r/20240108132747.3238763-1-sumanthk@linux.ibm.com Link: https://lkml.kernel.org/r/20240108132747.3238763-2-sumanthk@linux.ibm.com Signed-off-by: Sumanth Korikkar Suggested-by: Gerald Schaefer Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Cc: Alexander Gordeev Cc: Aneesh Kumar K.V Cc: Anshuman Khandual Cc: Heiko Carstens Cc: Michal Hocko Cc: Oscar Salvador Cc: Vasily Gorbik Signed-off-by: Andrew Morton --- drivers/base/memory.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) (limited to 'drivers/base') diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 14f964a7719b..c0436f46cfb7 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -188,6 +188,7 @@ static int memory_block_online(struct memory_block *mem) unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; unsigned long nr_vmemmap_pages = 0; + struct memory_notify arg; struct zone *zone; int ret; @@ -207,9 +208,19 @@ static int memory_block_online(struct memory_block *mem) if (mem->altmap) nr_vmemmap_pages = mem->altmap->free; + arg.altmap_start_pfn = start_pfn; + arg.altmap_nr_pages = nr_vmemmap_pages; + arg.start_pfn = start_pfn + nr_vmemmap_pages; + arg.nr_pages = nr_pages - nr_vmemmap_pages; mem_hotplug_begin(); + ret = memory_notify(MEM_PREPARE_ONLINE, &arg); + ret = notifier_to_errno(ret); + if (ret) + goto out_notifier; + if (nr_vmemmap_pages) { - ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone); + ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, + zone, mem->altmap->inaccessible); if (ret) goto out; } @@ -231,7 +242,11 @@ static int memory_block_online(struct memory_block *mem) nr_vmemmap_pages); mem->zone = zone; + mem_hotplug_done(); + return ret; out: + memory_notify(MEM_FINISH_OFFLINE, &arg); +out_notifier: mem_hotplug_done(); return ret; } @@ -244,6 +259,7 @@ static int memory_block_offline(struct memory_block *mem) unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; unsigned long nr_vmemmap_pages = 0; + struct memory_notify arg; int ret; if (!mem->zone) @@ -275,6 +291,11 @@ static int memory_block_offline(struct memory_block *mem) mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages); mem->zone = NULL; + arg.altmap_start_pfn = start_pfn; + arg.altmap_nr_pages = nr_vmemmap_pages; + arg.start_pfn = start_pfn + nr_vmemmap_pages; + arg.nr_pages = nr_pages - nr_vmemmap_pages; + memory_notify(MEM_FINISH_OFFLINE, &arg); out: mem_hotplug_done(); return ret; -- cgit From 5cec4eb7fad6fb1e9a3dd8403b558d1eff7490ff Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Fri, 26 Jan 2024 16:19:44 +0800 Subject: mm and cache_info: remove unnecessary CPU cache info update For each CPU hotplug event, we will update per-CPU data slice size and corresponding PCP configuration for every online CPU to make the implementation simple. But, Kyle reported that this takes tens seconds during boot on a machine with 34 zones and 3840 CPUs. So, in this patch, for each CPU hotplug event, we only update per-CPU data slice size and corresponding PCP configuration for the CPUs that share caches with the hotplugged CPU. With the patch, the system boot time reduces 67 seconds on the machine. Link: https://lkml.kernel.org/r/20240126081944.414520-1-ying.huang@intel.com Fixes: 362d37a106dd ("mm, pcp: reduce lock contention for draining high-order pages") Signed-off-by: "Huang, Ying" Originally-by: Kyle Meyer Reported-and-tested-by: Kyle Meyer Cc: Sudeep Holla Cc: Mel Gorman Signed-off-by: Andrew Morton --- drivers/base/cacheinfo.c | 50 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 44 insertions(+), 6 deletions(-) (limited to 'drivers/base') diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index f1e79263fe61..23b8cba4a2a3 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -898,6 +898,37 @@ err: return rc; } +static unsigned int cpu_map_shared_cache(bool online, unsigned int cpu, + cpumask_t **map) +{ + struct cacheinfo *llc, *sib_llc; + unsigned int sibling; + + if (!last_level_cache_is_valid(cpu)) + return 0; + + llc = per_cpu_cacheinfo_idx(cpu, cache_leaves(cpu) - 1); + + if (llc->type != CACHE_TYPE_DATA && llc->type != CACHE_TYPE_UNIFIED) + return 0; + + if (online) { + *map = &llc->shared_cpu_map; + return cpumask_weight(*map); + } + + /* shared_cpu_map of offlined CPU will be cleared, so use sibling map */ + for_each_cpu(sibling, &llc->shared_cpu_map) { + if (sibling == cpu || !last_level_cache_is_valid(sibling)) + continue; + sib_llc = per_cpu_cacheinfo_idx(sibling, cache_leaves(sibling) - 1); + *map = &sib_llc->shared_cpu_map; + return cpumask_weight(*map); + } + + return 0; +} + /* * Calculate the size of the per-CPU data cache slice. This can be * used to estimate the size of the data cache slice that can be used @@ -929,28 +960,31 @@ static void update_per_cpu_data_slice_size_cpu(unsigned int cpu) ci->per_cpu_data_slice_size = llc->size / nr_shared; } -static void update_per_cpu_data_slice_size(bool cpu_online, unsigned int cpu) +static void update_per_cpu_data_slice_size(bool cpu_online, unsigned int cpu, + cpumask_t *cpu_map) { unsigned int icpu; - for_each_online_cpu(icpu) { + for_each_cpu(icpu, cpu_map) { if (!cpu_online && icpu == cpu) continue; update_per_cpu_data_slice_size_cpu(icpu); + setup_pcp_cacheinfo(icpu); } } static int cacheinfo_cpu_online(unsigned int cpu) { int rc = detect_cache_attributes(cpu); + cpumask_t *cpu_map; if (rc) return rc; rc = cache_add_dev(cpu); if (rc) goto err; - update_per_cpu_data_slice_size(true, cpu); - setup_pcp_cacheinfo(); + if (cpu_map_shared_cache(true, cpu, &cpu_map)) + update_per_cpu_data_slice_size(true, cpu, cpu_map); return 0; err: free_cache_attributes(cpu); @@ -959,12 +993,16 @@ err: static int cacheinfo_cpu_pre_down(unsigned int cpu) { + cpumask_t *cpu_map; + unsigned int nr_shared; + + nr_shared = cpu_map_shared_cache(false, cpu, &cpu_map); if (cpumask_test_and_clear_cpu(cpu, &cache_dev_map)) cpu_cache_sysfs_exit(cpu); free_cache_attributes(cpu); - update_per_cpu_data_slice_size(false, cpu); - setup_pcp_cacheinfo(); + if (nr_shared > 1) + update_per_cpu_data_slice_size(false, cpu, cpu_map); return 0; } -- cgit From 02aff8480533817a29e820729360866441d7403d Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Wed, 24 Jan 2024 13:12:44 +0800 Subject: crash: split crash dumping code out from kexec_core.c Currently, KEXEC_CORE select CRASH_CORE automatically because crash codes need be built in to avoid compiling error when building kexec code even though the crash dumping functionality is not enabled. E.g -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- After splitting out crashkernel reservation code and vmcoreinfo exporting code, there's only crash related code left in kernel/crash_core.c. Now move crash related codes from kexec_core.c to crash_core.c and only build it in when CONFIG_CRASH_DUMP=y. And also wrap up crash codes inside CONFIG_CRASH_DUMP ifdeffery scope, or replace inappropriate CONFIG_KEXEC_CORE ifdef with CONFIG_CRASH_DUMP ifdef in generic kernel files. With these changes, crash_core codes are abstracted from kexec codes and can be disabled at all if only kexec reboot feature is wanted. Link: https://lkml.kernel.org/r/20240124051254.67105-5-bhe@redhat.com Signed-off-by: Baoquan He Cc: Al Viro Cc: Eric W. Biederman Cc: Hari Bathini Cc: Pingfan Liu Cc: Klara Modin Cc: Michael Kelley Cc: Nathan Chancellor Cc: Stephen Rothwell Cc: Yang Li Signed-off-by: Andrew Morton --- drivers/base/cpu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'drivers/base') diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 47de0f140ba6..b621a0fc75e1 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -144,7 +144,7 @@ static DEVICE_ATTR(release, S_IWUSR, NULL, cpu_release_store); #endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */ #endif /* CONFIG_HOTPLUG_CPU */ -#ifdef CONFIG_KEXEC_CORE +#ifdef CONFIG_CRASH_DUMP #include static ssize_t crash_notes_show(struct device *dev, @@ -189,14 +189,14 @@ static const struct attribute_group crash_note_cpu_attr_group = { #endif static const struct attribute_group *common_cpu_attr_groups[] = { -#ifdef CONFIG_KEXEC_CORE +#ifdef CONFIG_CRASH_DUMP &crash_note_cpu_attr_group, #endif NULL }; static const struct attribute_group *hotplugable_cpu_attr_groups[] = { -#ifdef CONFIG_KEXEC_CORE +#ifdef CONFIG_CRASH_DUMP &crash_note_cpu_attr_group, #endif NULL -- cgit