From 9b8b17541f13809d06f6f873325305ddbb760e3e Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Fri, 10 Apr 2020 14:32:19 -0700 Subject: mm, memcg: do not high throttle allocators based on wraparound If a cgroup violates its memory.high constraints, we may end up unduly penalising it. For example, for the following hierarchy: A: max high, 20 usage A/B: 9 high, 10 usage A/C: max high, 10 usage We would end up doing the following calculation below when calculating high delay for A/B: A/B: 10 - 9 = 1... A: 20 - PAGE_COUNTER_MAX = 21, so set max_overage to 21. This gets worse with higher disparities in usage in the parent. I have no idea how this disappeared from the final version of the patch, but it is certainly Not Good(tm). This wasn't obvious in testing because, for a simple cgroup hierarchy with only one child, the result is usually roughly the same. It's only in more complex hierarchies that things go really awry (although still, the effects are limited to a maximum of 2 seconds in schedule_timeout_killable at a maximum). [chris@chrisdown.name: changelog] Fixes: e26733e0d0ec ("mm, memcg: throttle allocators based on ancestral memory.high") Signed-off-by: Jakub Kicinski Signed-off-by: Chris Down Signed-off-by: Andrew Morton Acked-by: Michal Hocko Cc: Johannes Weiner Cc: [5.4.x] Link: http://lkml.kernel.org/r/20200331152424.GA1019937@chrisdown.name Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 05b4ec2c6499..5beea03dd58a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2336,6 +2336,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, usage = page_counter_read(&memcg->memory); high = READ_ONCE(memcg->high); + if (usage <= high) + continue; + /* * Prevent division by 0 in overage calculation by acting as if * it was a threshold of 1 page -- cgit From b991cee567bf045097d9426719d7f1477bd7dc59 Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Fri, 10 Apr 2020 14:32:22 -0700 Subject: mm, slab_common: fix a typo in comment "eariler"->"earlier" There is a typo in comment, fix it. s/eariler/earlier/ Signed-off-by: Qiujun Huang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Acked-by: Christoph Lameter Link: http://lkml.kernel.org/r/20200405160544.1246-1-hqjagain@gmail.com Signed-off-by: Linus Torvalds --- mm/slab_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/slab_common.c b/mm/slab_common.c index 93ec4a574d8d..23c7500eea7d 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -731,7 +731,7 @@ static void kmemcg_rcufn(struct rcu_head *head) /* * We need to grab blocking locks. Bounce to ->work. The * work item shares the space with the RCU head and can't be - * initialized eariler. + * initialized earlier. 
*/ INIT_WORK(&s->memcg_params.work, kmemcg_workfn); queue_work(memcg_kmem_cache_wq, &s->memcg_params.work); -- cgit From e6a0a7ad1c2b6608e294fbbce60c42ba5a1304ce Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 10 Apr 2020 14:32:29 -0700 Subject: mm/page_alloc.c: fix kernel-doc warning Add description of function parameter 'mt' to fix kernel-doc warning: mm/page_alloc.c:3246: warning: Function parameter or member 'mt' not described in '__putback_isolated_page' Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Acked-by: Pankaj Gupta Link: http://lkml.kernel.org/r/02998bd4-0b82-2f15-2570-f86130304d1e@infradead.org Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 114c56c3685d..d98441ab1036 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3224,6 +3224,7 @@ int __isolate_free_page(struct page *page, unsigned int order) * __putback_isolated_page - Return a now-isolated page back where we got it * @page: Page that was isolated * @order: Order of the isolated page + * @mt: The page's pageblock's migratetype * * This function is meant to return a page pulled from the free lists via * __isolate_free_page back to the free lists they were pulled from. -- cgit From 8b885f53b03e3b14003083daf64a6ed16bf561b3 Mon Sep 17 00:00:00 2001 From: Jason Yan Date: Fri, 10 Apr 2020 14:32:32 -0700 Subject: mm/page_alloc: make pcpu_drain_mutex and pcpu_drain static Fix the following sparse warning: mm/page_alloc.c:106:1: warning: symbol 'pcpu_drain_mutex' was not declared. Should it be static? mm/page_alloc.c:107:1: warning: symbol '__pcpu_scope_pcpu_drain' was not declared. Should it be static? Reported-by: Hulk Robot Signed-off-by: Jason Yan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200407023925.46438-1-yanaijie@huawei.com Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d98441ab1036..69827d4fa052 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -103,8 +103,8 @@ struct pcpu_drain { struct zone *zone; struct work_struct work; }; -DEFINE_MUTEX(pcpu_drain_mutex); -DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain); +static DEFINE_MUTEX(pcpu_drain_mutex); +static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain); #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY volatile unsigned long latent_entropy __latent_entropy; -- cgit From 8676af1ff2d28e64e5636147821bda7524cf007d Mon Sep 17 00:00:00 2001 From: Aslan Bakirov Date: Fri, 10 Apr 2020 14:32:42 -0700 Subject: mm: cma: NUMA node interface I've noticed that there is no interface exposed by CMA which would let me to declare contigous memory on particular NUMA node. This patchset adds the ability to try to allocate contiguous memory on a specific node. It will fallback to other nodes if the specified one doesn't work. Implement a new method for declaring contigous memory on particular node and keep cma_declare_contiguous() as a wrapper. 
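For context, the header-side compatibility wrapper is not part of this mm/-only excerpt; it is expected to simply forward to the new function with NUMA_NO_NODE, roughly like this sketch (the real declaration lives in include/linux/cma.h, outside this diff):

    /* Sketch of the wrapper kept for existing callers; illustrative only. */
    static inline int __init cma_declare_contiguous(phys_addr_t base,
    			phys_addr_t size, phys_addr_t limit,
    			phys_addr_t alignment, unsigned int order_per_bit,
    			bool fixed, const char *name, struct cma **res_cma)
    {
    	return cma_declare_contiguous_nid(base, size, limit, alignment,
    			order_per_bit, fixed, name, res_cma,
    			NUMA_NO_NODE);
    }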
[akpm@linux-foundation.org: build fix] Signed-off-by: Aslan Bakirov Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Acked-by: Michal Hocko Cc: Andreas Schaufler Cc: Mike Kravetz Cc: Rik van Riel Cc: Joonsoo Kim Link: http://lkml.kernel.org/r/20200407163840.92263-2-guro@fb.com Signed-off-by: Linus Torvalds --- mm/cma.c | 16 +++++++++------- mm/memblock.c | 2 +- 2 files changed, 10 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/mm/cma.c b/mm/cma.c index be55d1988c67..0463ad2ce06b 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -220,7 +220,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, } /** - * cma_declare_contiguous() - reserve custom contiguous area + * cma_declare_contiguous_nid() - reserve custom contiguous area * @base: Base address of the reserved area optional, use 0 for any * @size: Size of the reserved area (in bytes), * @limit: End address of the reserved memory (optional, 0 for any). @@ -229,6 +229,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, * @fixed: hint about where to place the reserved area * @name: The name of the area. See function cma_init_reserved_mem() * @res_cma: Pointer to store the created cma region. + * @nid: nid of the free area to find, %NUMA_NO_NODE for any node * * This function reserves memory from early allocator. It should be * called by arch specific code once the early allocator (memblock or bootmem) @@ -238,10 +239,11 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, * If @fixed is true, reserve contiguous area at exactly @base. If false, * reserve in range from @base to @limit. */ -int __init cma_declare_contiguous(phys_addr_t base, +int __init cma_declare_contiguous_nid(phys_addr_t base, phys_addr_t size, phys_addr_t limit, phys_addr_t alignment, unsigned int order_per_bit, - bool fixed, const char *name, struct cma **res_cma) + bool fixed, const char *name, struct cma **res_cma, + int nid) { phys_addr_t memblock_end = memblock_end_of_DRAM(); phys_addr_t highmem_start; @@ -336,14 +338,14 @@ int __init cma_declare_contiguous(phys_addr_t base, * memory in case of failure. */ if (base < highmem_start && limit > highmem_start) { - addr = memblock_phys_alloc_range(size, alignment, - highmem_start, limit); + addr = memblock_alloc_range_nid(size, alignment, + highmem_start, limit, nid, false); limit = highmem_start; } if (!addr) { - addr = memblock_phys_alloc_range(size, alignment, base, - limit); + addr = memblock_alloc_range_nid(size, alignment, base, + limit, nid, false); if (!addr) { ret = -ENOMEM; goto err; diff --git a/mm/memblock.c b/mm/memblock.c index 4d06bbaded0f..c79ba6f9920c 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1349,7 +1349,7 @@ __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone, * Return: * Physical address of allocated memory block on success, %0 on failure. */ -static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, +phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, phys_addr_t align, phys_addr_t start, phys_addr_t end, int nid, bool exact_nid) -- cgit From cf11e85fc08cc6a4fe3ac2ba2e610c962bf20bc3 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Fri, 10 Apr 2020 14:32:45 -0700 Subject: mm: hugetlb: optionally allocate gigantic hugepages using cma Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime") has added the run-time allocation of gigantic pages. However it actually works only at early stages of the system loading, when the majority of memory is free. 
After some time the memory gets fragmented by non-movable pages, so the chances to find a contiguous 1GB block are getting close to zero. Even dropping caches manually doesn't help a lot. At large scale rebooting servers in order to allocate gigantic hugepages is quite expensive and complex. At the same time keeping some constant percentage of memory in reserved hugepages even if the workload isn't using it is a big waste: not all workloads can benefit from using 1 GB pages. The following solution can solve the problem: 1) On boot time a dedicated cma area* is reserved. The size is passed as a kernel argument. 2) Run-time allocations of gigantic hugepages are performed using the cma allocator and the dedicated cma area In this case gigantic hugepages can be allocated successfully with a high probability, however the memory isn't completely wasted if nobody is using 1GB hugepages: it can be used for pagecache, anon memory, THPs, etc. * On a multi-node machine a per-node cma area is allocated on each node. Following gigantic hugetlb allocation are using the first available numa node if the mask isn't specified by a user. Usage: 1) configure the kernel to allocate a cma area for hugetlb allocations: pass hugetlb_cma=10G as a kernel argument 2) allocate hugetlb pages as usual, e.g. echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages If the option isn't enabled or the allocation of the cma area failed, the current behavior of the system is preserved. x86 and arm-64 are covered by this patch, other architectures can be trivially added later. The patch contains clean-ups and fixes proposed and implemented by Aslan Bakirov and Randy Dunlap. It also contains ideas and suggestions proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks! Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Tested-by: Andreas Schaufler Acked-by: Mike Kravetz Acked-by: Michal Hocko Cc: Aslan Bakirov Cc: Randy Dunlap Cc: Rik van Riel Cc: Joonsoo Kim Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 109 insertions(+) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index f5fb53fdfa02..cd459155d28a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -44,6 +45,9 @@ int hugetlb_max_hstate __read_mostly; unsigned int default_hstate_idx; struct hstate hstates[HUGE_MAX_HSTATE]; + +static struct cma *hugetlb_cma[MAX_NUMNODES]; + /* * Minimum page order among possible hugepage sizes, set to a proper value * at boot time. @@ -1228,6 +1232,14 @@ static void destroy_compound_gigantic_page(struct page *page, static void free_gigantic_page(struct page *page, unsigned int order) { + /* + * If the page isn't allocated using the cma allocator, + * cma_release() returns false. 
+ */ + if (IS_ENABLED(CONFIG_CMA) && + cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order)) + return; + free_contig_range(page_to_pfn(page), 1 << order); } @@ -1237,6 +1249,21 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask, { unsigned long nr_pages = 1UL << huge_page_order(h); + if (IS_ENABLED(CONFIG_CMA)) { + struct page *page; + int node; + + for_each_node_mask(node, *nodemask) { + if (!hugetlb_cma[node]) + continue; + + page = cma_alloc(hugetlb_cma[node], nr_pages, + huge_page_order(h), true); + if (page) + return page; + } + } + return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask); } @@ -1281,8 +1308,14 @@ static void update_and_free_page(struct hstate *h, struct page *page) set_compound_page_dtor(page, NULL_COMPOUND_DTOR); set_page_refcounted(page); if (hstate_is_gigantic(h)) { + /* + * Temporarily drop the hugetlb_lock, because + * we might block in free_gigantic_page(). + */ + spin_unlock(&hugetlb_lock); destroy_compound_gigantic_page(page, huge_page_order(h)); free_gigantic_page(page, huge_page_order(h)); + spin_lock(&hugetlb_lock); } else { __free_pages(page, huge_page_order(h)); } @@ -2539,6 +2572,10 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h) for (i = 0; i < h->max_huge_pages; ++i) { if (hstate_is_gigantic(h)) { + if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) { + pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n"); + break; + } if (!alloc_bootmem_huge_page(h)) break; } else if (!alloc_pool_huge_page(h, @@ -3194,6 +3231,7 @@ static int __init hugetlb_init(void) default_hstate.max_huge_pages = default_hstate_max_huge_pages; } + hugetlb_cma_check(); hugetlb_init_hstates(); gather_bootmem_prealloc(); report_hugepages(); @@ -5506,3 +5544,74 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason) spin_unlock(&hugetlb_lock); } } + +#ifdef CONFIG_CMA +static unsigned long hugetlb_cma_size __initdata; +static bool cma_reserve_called __initdata; + +static int __init cmdline_parse_hugetlb_cma(char *p) +{ + hugetlb_cma_size = memparse(p, &p); + return 0; +} + +early_param("hugetlb_cma", cmdline_parse_hugetlb_cma); + +void __init hugetlb_cma_reserve(int order) +{ + unsigned long size, reserved, per_node; + int nid; + + cma_reserve_called = true; + + if (!hugetlb_cma_size) + return; + + if (hugetlb_cma_size < (PAGE_SIZE << order)) { + pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n", + (PAGE_SIZE << order) / SZ_1M); + return; + } + + /* + * If 3 GB area is requested on a machine with 4 numa nodes, + * let's allocate 1 GB on first three nodes and ignore the last one. 
+ */ + per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes); + pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n", + hugetlb_cma_size / SZ_1M, per_node / SZ_1M); + + reserved = 0; + for_each_node_state(nid, N_ONLINE) { + int res; + + size = min(per_node, hugetlb_cma_size - reserved); + size = round_up(size, PAGE_SIZE << order); + + res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order, + 0, false, "hugetlb", + &hugetlb_cma[nid], nid); + if (res) { + pr_warn("hugetlb_cma: reservation failed: err %d, node %d", + res, nid); + continue; + } + + reserved += size; + pr_info("hugetlb_cma: reserved %lu MiB on node %d\n", + size / SZ_1M, nid); + + if (reserved >= hugetlb_cma_size) + break; + } +} + +void __init hugetlb_cma_check(void) +{ + if (!hugetlb_cma_size || cma_reserve_called) + return; + + pr_warn("hugetlb_cma: the option isn't supported by current arch\n"); +} + +#endif /* CONFIG_CMA */ -- cgit From 09ef5283fd96ac424ef0e569626f359bf9ab86c9 Mon Sep 17 00:00:00 2001 From: Jaewon Kim Date: Fri, 10 Apr 2020 14:32:48 -0700 Subject: mm/mmap.c: initialize align_offset explicitly for vm_unmapped_area On passing requirement to vm_unmapped_area, arch_get_unmapped_area and arch_get_unmapped_area_topdown did not set align_offset. Internally on both unmapped_area and unmapped_area_topdown, if info->align_mask is 0, then info->align_offset was meaningless. But commit df529cabb7a2 ("mm: mmap: add trace point of vm_unmapped_area") always prints info->align_offset even though it is uninitialized. Fix this uninitialized value issue by setting it to 0 explicitly. Before: vm_unmapped_area: addr=0x755b155000 err=0 total_vm=0x15aaf0 flags=0x1 len=0x109000 lo=0x8000 hi=0x75eed48000 mask=0x0 ofs=0x4022 After: vm_unmapped_area: addr=0x74a4ca1000 err=0 total_vm=0x168ab1 flags=0x1 len=0x9000 lo=0x8000 hi=0x753d94b000 mask=0x0 ofs=0x0 Signed-off-by: Jaewon Kim Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Matthew Wilcox (Oracle) Cc: Michel Lespinasse Cc: Borislav Petkov Link: http://lkml.kernel.org/r/20200409094035.19457-1-jaewon31.kim@samsung.com Signed-off-by: Linus Torvalds --- mm/mmap.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index 8d77dbbb80fe..de07bbc0e21f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2123,6 +2123,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, info.low_limit = mm->mmap_base; info.high_limit = mmap_end; info.align_mask = 0; + info.align_offset = 0; return vm_unmapped_area(&info); } #endif @@ -2164,6 +2165,7 @@ arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr, info.low_limit = max(PAGE_SIZE, mmap_min_addr); info.high_limit = arch_get_mmap_base(addr, mm->mmap_base); info.align_mask = 0; + info.align_offset = 0; addr = vm_unmapped_area(&info); /* -- cgit From 8efd6f5b1732c4ac88b4bb6908d481d95804fa1c Mon Sep 17 00:00:00 2001 From: Arjun Roy Date: Fri, 10 Apr 2020 14:32:51 -0700 Subject: mm/memory.c: refactor insert_page to prepare for batched-lock insert Add helper methods for vm_insert_page()/insert_page() to prepare for vm_insert_pages(), which batch-inserts pages to reduce spinlock operations when inserting multiple consecutive pages into the user page table. The intention of this patch-set is to reduce atomic ops for tcp zerocopy receives, which normally hits the same spinlock multiple times consecutively. 
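To make the intent of the split concrete, the locked helper is what a batched caller can invoke repeatedly under a single PTE lock. A simplified sketch of that pattern follows; the real batching lands in the follow-up vm_insert_pages() patch, and the loop below is illustrative only ('i', 'nr', 'err', 'pages', 'prot' are assumed caller state):

    /*
     * Illustrative only: lock once, insert many. Assumes all nr entries
     * fall within a single page table (one PMD), which the follow-up
     * patch enforces by batching per PMD.
     */
    pte = get_locked_pte(mm, addr, &ptl);
    if (!pte)
    	return -ENOMEM;
    for (i = 0; i < nr; i++) {
    	err = insert_page_into_pte_locked(mm, pte + i,
    			addr + i * PAGE_SIZE, pages[i], prot);
    	if (err)
    		break;
    }
    pte_unmap_unlock(pte, ptl);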
Signed-off-by: Arjun Roy Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Andrew Morton Cc: David Miller Cc: Matthew Wilcox Cc: Jason Gunthorpe Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200128025958.43490-1-arjunroy.kdev@gmail.com Signed-off-by: Linus Torvalds --- mm/memory.c | 39 ++++++++++++++++++++++++--------------- 1 file changed, 24 insertions(+), 15 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 19874d133a66..52a3303458cb 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1442,6 +1442,27 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, return pte_alloc_map_lock(mm, pmd, addr, ptl); } +static int validate_page_before_insert(struct page *page) +{ + if (PageAnon(page) || PageSlab(page) || page_has_type(page)) + return -EINVAL; + flush_dcache_page(page); + return 0; +} + +static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, + unsigned long addr, struct page *page, pgprot_t prot) +{ + if (!pte_none(*pte)) + return -EBUSY; + /* Ok, finally just insert the thing.. */ + get_page(page); + inc_mm_counter_fast(mm, mm_counter_file(page)); + page_add_file_rmap(page, false); + set_pte_at(mm, addr, pte, mk_pte(page, prot)); + return 0; +} + /* * This is the old fallback for page remapping. * @@ -1457,26 +1478,14 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, pte_t *pte; spinlock_t *ptl; - retval = -EINVAL; - if (PageAnon(page) || PageSlab(page) || page_has_type(page)) + retval = validate_page_before_insert(page); + if (retval) goto out; retval = -ENOMEM; - flush_dcache_page(page); pte = get_locked_pte(mm, addr, &ptl); if (!pte) goto out; - retval = -EBUSY; - if (!pte_none(*pte)) - goto out_unlock; - - /* Ok, finally just insert the thing.. */ - get_page(page); - inc_mm_counter_fast(mm, mm_counter_file(page)); - page_add_file_rmap(page, false); - set_pte_at(mm, addr, pte, mk_pte(page, prot)); - - retval = 0; -out_unlock: + retval = insert_page_into_pte_locked(mm, pte, addr, page, prot); pte_unmap_unlock(pte, ptl); out: return retval; -- cgit From 8cd3984d81d5fd5e18bccb12d7d228a114ec2508 Mon Sep 17 00:00:00 2001 From: Arjun Roy Date: Fri, 10 Apr 2020 14:33:01 -0700 Subject: mm/memory.c: add vm_insert_pages() Add the ability to insert multiple pages at once to a user VM with lower PTE spinlock operations. The intention of this patch-set is to reduce atomic ops for tcp zerocopy receives, which normally hits the same spinlock multiple times consecutively. 
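A hypothetical caller (for example the tcp zerocopy receive path this series targets) would hand over a whole batch at once; the variable names below are made up for illustration and are not defined by this patch:

    /* Hypothetical usage sketch; 'pages', 'start_addr' and 'nr_to_map'
     * are caller-side state.
     */
    unsigned long nr = nr_to_map;
    int err = vm_insert_pages(vma, start_addr, pages, &nr);

    /* On return, nr holds the number of pages that were NOT mapped. */
    if (err)
    	pr_debug("mapped %lu of %lu pages, err %d\n",
    		 nr_to_map - nr, nr_to_map, err);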
[akpm@linux-foundation.org: pte_alloc() no longer takes the `addr' argument] [arjunroy@google.com: add missing page_count() check to vm_insert_pages()] Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com [arjunroy@google.com: vm_insert_pages() checks if pte_index defined] Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.com Signed-off-by: Arjun Roy Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Andrew Morton Cc: David Miller Cc: Matthew Wilcox Cc: Jason Gunthorpe Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.com Signed-off-by: Linus Torvalds --- mm/memory.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 127 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 52a3303458cb..f703fe8c8346 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1419,8 +1419,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL_GPL(zap_vma_ptes); -pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, - spinlock_t **ptl) +static pmd_t *walk_to_pmd(struct mm_struct *mm, unsigned long addr) { pgd_t *pgd; p4d_t *p4d; @@ -1439,6 +1438,16 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, return NULL; VM_BUG_ON(pmd_trans_huge(*pmd)); + return pmd; +} + +pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr, + spinlock_t **ptl) +{ + pmd_t *pmd = walk_to_pmd(mm, addr); + + if (!pmd) + return NULL; return pte_alloc_map_lock(mm, pmd, addr, ptl); } @@ -1491,6 +1500,122 @@ out: return retval; } +#ifdef pte_index +static int insert_page_in_batch_locked(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr, struct page *page, pgprot_t prot) +{ + int err; + + if (!page_count(page)) + return -EINVAL; + err = validate_page_before_insert(page); + return err ? err : insert_page_into_pte_locked( + mm, pte_offset_map(pmd, addr), addr, page, prot); +} + +/* insert_pages() amortizes the cost of spinlock operations + * when inserting pages in a loop. Arch *must* define pte_index. + */ +static int insert_pages(struct vm_area_struct *vma, unsigned long addr, + struct page **pages, unsigned long *num, pgprot_t prot) +{ + pmd_t *pmd = NULL; + spinlock_t *pte_lock = NULL; + struct mm_struct *const mm = vma->vm_mm; + unsigned long curr_page_idx = 0; + unsigned long remaining_pages_total = *num; + unsigned long pages_to_write_in_pmd; + int ret; +more: + ret = -EFAULT; + pmd = walk_to_pmd(mm, addr); + if (!pmd) + goto out; + + pages_to_write_in_pmd = min_t(unsigned long, + remaining_pages_total, PTRS_PER_PTE - pte_index(addr)); + + /* Allocate the PTE if necessary; takes PMD lock once only. 
*/ + ret = -ENOMEM; + if (pte_alloc(mm, pmd)) + goto out; + pte_lock = pte_lockptr(mm, pmd); + + while (pages_to_write_in_pmd) { + int pte_idx = 0; + const int batch_size = min_t(int, pages_to_write_in_pmd, 8); + + spin_lock(pte_lock); + for (; pte_idx < batch_size; ++pte_idx) { + int err = insert_page_in_batch_locked(mm, pmd, + addr, pages[curr_page_idx], prot); + if (unlikely(err)) { + spin_unlock(pte_lock); + ret = err; + remaining_pages_total -= pte_idx; + goto out; + } + addr += PAGE_SIZE; + ++curr_page_idx; + } + spin_unlock(pte_lock); + pages_to_write_in_pmd -= batch_size; + remaining_pages_total -= batch_size; + } + if (remaining_pages_total) + goto more; + ret = 0; +out: + *num = remaining_pages_total; + return ret; +} +#endif /* ifdef pte_index */ + +/** + * vm_insert_pages - insert multiple pages into user vma, batching the pmd lock. + * @vma: user vma to map to + * @addr: target start user address of these pages + * @pages: source kernel pages + * @num: in: number of pages to map. out: number of pages that were *not* + * mapped. (0 means all pages were successfully mapped). + * + * Preferred over vm_insert_page() when inserting multiple pages. + * + * In case of error, we may have mapped a subset of the provided + * pages. It is the caller's responsibility to account for this case. + * + * The same restrictions apply as in vm_insert_page(). + */ +int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, + struct page **pages, unsigned long *num) +{ +#ifdef pte_index + const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1; + + if (addr < vma->vm_start || end_addr >= vma->vm_end) + return -EFAULT; + if (!(vma->vm_flags & VM_MIXEDMAP)) { + BUG_ON(down_read_trylock(&vma->vm_mm->mmap_sem)); + BUG_ON(vma->vm_flags & VM_PFNMAP); + vma->vm_flags |= VM_MIXEDMAP; + } + /* Defer page refcount checking till we're about to map that page. */ + return insert_pages(vma, addr, pages, num, vma->vm_page_prot); +#else + unsigned long idx = 0, pgcount = *num; + int err; + + for (; idx < pgcount; ++idx) { + err = vm_insert_page(vma, addr + (PAGE_SIZE * idx), pages[idx]); + if (err) + break; + } + *num = pgcount - idx; + return err; +#endif /* ifdef pte_index */ +} +EXPORT_SYMBOL(vm_insert_pages); + /** * vm_insert_page - insert single page into user vma * @vma: user vma to map to -- cgit From 6cb4d9a2870d2062e34c93bfef4d52fca3fe42d1 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Fri, 10 Apr 2020 14:33:09 -0700 Subject: mm/vma: introduce VM_ACCESS_FLAGS There are many places where all basic VMA access flags (read, write, exec) are initialized or checked against as a group. One such example is during page fault. Existing vma_is_accessible() wrapper already creates the notion of VMA accessibility as a group access permissions. Hence lets just create VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which will not only reduce code duplication but also extend the VMA accessibility concept in general. 
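The mask itself is introduced on the include/linux/mm.h side of the series and is therefore not visible in this mm/-only diff; it is expected to be nothing more than the grouped basic access bits:

    /* In include/linux/mm.h (outside this diff). */
    #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)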
Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Reviewed-by: Vlastimil Babka Cc: Russell King Cc: Catalin Marinas Cc: Mark Salter Cc: Nick Hu Cc: Ley Foon Tan Cc: Michael Ellerman Cc: Heiko Carstens Cc: Yoshinori Sato Cc: Guan Xuetao Cc: Dave Hansen Cc: Thomas Gleixner Cc: Rob Springer Cc: Greg Kroah-Hartman Cc: Geert Uytterhoeven Link: http://lkml.kernel.org/r/1583391014-8170-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- mm/mmap.c | 2 +- mm/mprotect.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index de07bbc0e21f..f609e9ec4a25 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1224,7 +1224,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file && - !((a->vm_flags ^ b->vm_flags) & ~(VM_READ|VM_WRITE|VM_EXEC|VM_SOFTDIRTY)) && + !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) && b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 1d823b050329..494192ca954b 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -419,7 +419,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, */ if (arch_has_pfn_modify_check() && (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && - (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) { + (newflags & VM_ACCESS_FLAGS) == 0) { pgprot_t new_pgprot = vm_get_page_prot(newflags); error = walk_page_range(current->mm, start, end, @@ -598,7 +598,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, newflags |= (vma->vm_flags & ~mask_off_old_flags); /* newflags >> 4 shift VM_MAY% in place of VM_% */ - if ((newflags & ~(newflags >> 4)) & (VM_READ | VM_WRITE | VM_EXEC)) { + if ((newflags & ~(newflags >> 4)) & VM_ACCESS_FLAGS) { error = -EACCES; goto out; } -- cgit From f5637d3b42ab0465ef71d5fb8461bce97fba95e8 Mon Sep 17 00:00:00 2001 From: Logan Gunthorpe Date: Fri, 10 Apr 2020 14:33:21 -0700 Subject: mm/memory_hotplug: rename mhp_restrictions to mhp_params The mhp_restrictions struct really doesn't specify anything resembling a restriction anymore so rename it to be mhp_params as it is a list of extended parameters. Signed-off-by: Logan Gunthorpe Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Reviewed-by: Dan Williams Acked-by: Michal Hocko Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Dave Hansen Cc: Eric Badger Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Michael Ellerman Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: http://lkml.kernel.org/r/20200306170846.9333-3-logang@deltatee.com Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 8 ++++---- mm/memremap.c | 8 ++++---- 2 files changed, 8 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 635e8e286598..fbfe7b40f552 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -304,12 +304,12 @@ static int check_hotplug_memory_addressable(unsigned long pfn, * add the new pages. 
*/ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, - struct mhp_restrictions *restrictions) + struct mhp_params *params) { const unsigned long end_pfn = pfn + nr_pages; unsigned long cur_nr_pages; int err; - struct vmem_altmap *altmap = restrictions->altmap; + struct vmem_altmap *altmap = params->altmap; err = check_hotplug_memory_addressable(pfn, nr_pages); if (err) @@ -1002,7 +1002,7 @@ static int online_memory_block(struct memory_block *mem, void *arg) */ int __ref add_memory_resource(int nid, struct resource *res) { - struct mhp_restrictions restrictions = {}; + struct mhp_params params = {}; u64 start, size; bool new_node = false; int ret; @@ -1030,7 +1030,7 @@ int __ref add_memory_resource(int nid, struct resource *res) new_node = ret; /* call arch's memory hotadd */ - ret = arch_add_memory(nid, start, size, &restrictions); + ret = arch_add_memory(nid, start, size, ¶ms); if (ret < 0) goto error; diff --git a/mm/memremap.c b/mm/memremap.c index bbf457c4f166..b0b5170843ff 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -184,7 +184,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) { struct resource *res = &pgmap->res; struct dev_pagemap *conflict_pgmap; - struct mhp_restrictions restrictions = { + struct mhp_params params = { /* * We do not want any optional features only our own memmap */ @@ -302,7 +302,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) */ if (pgmap->type == MEMORY_DEVICE_PRIVATE) { error = add_pages(nid, PHYS_PFN(res->start), - PHYS_PFN(resource_size(res)), &restrictions); + PHYS_PFN(resource_size(res)), ¶ms); } else { error = kasan_add_zero_shadow(__va(res->start), resource_size(res)); if (error) { @@ -311,7 +311,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } error = arch_add_memory(nid, res->start, resource_size(res), - &restrictions); + ¶ms); } if (!error) { @@ -319,7 +319,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE]; move_pfn_range_to_zone(zone, PHYS_PFN(res->start), - PHYS_PFN(resource_size(res)), restrictions.altmap); + PHYS_PFN(resource_size(res)), params.altmap); } mem_hotplug_done(); -- cgit From bfeb022f8fe4c5afdcfd7a3d868fac9765f9bcad Mon Sep 17 00:00:00 2001 From: Logan Gunthorpe Date: Fri, 10 Apr 2020 14:33:36 -0700 Subject: mm/memory_hotplug: add pgprot_t to mhp_params devm_memremap_pages() is currently used by the PCI P2PDMA code to create struct page mappings for IO memory. At present, these mappings are created with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on x86, an mtrr register will typically override this and force the cache type to be UC-. In the case firmware doesn't set this register it is effectively WB and will typically result in a machine check exception when it's accessed. Other arches are not currently likely to function correctly seeing they don't have any MTRR registers to fall back on. To solve this, provide a way to specify the pgprot value explicitly to arch_add_memory(). Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple change to pass the pgprot_t down to their respective functions which set up the page tables. For x86_32, set the page tables explicitly using _set_memory_prot() (seeing they are already mapped). For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this should be fine, for now, seeing these architectures don't support ZONE_DEVICE. A check in __add_pages() is also added to ensure the pgprot parameter was set for all arches. 
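As a caller-side illustration (not taken from this patch), a hotplug user that needs an uncached mapping would now express that through the params struct rather than relying on MTRRs; 'nid', 'start' and 'size' below are assumed caller state:

    /* Illustrative sketch only. */
    struct mhp_params params = {
    	.pgprot = pgprot_noncached(PAGE_KERNEL),
    };
    int rc = arch_add_memory(nid, start, size, &params);

The follow-up memremap patch in this series does essentially this for PCI P2PDMA memory.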
Signed-off-by: Logan Gunthorpe Signed-off-by: Andrew Morton Acked-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Dan Williams Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Dave Hansen Cc: Eric Badger Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jason Gunthorpe Cc: Michael Ellerman Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 5 ++++- mm/memremap.c | 6 +++--- 2 files changed, 7 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index fbfe7b40f552..fc0aad0bc1f5 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -311,6 +311,9 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, int err; struct vmem_altmap *altmap = params->altmap; + if (WARN_ON_ONCE(!params->pgprot.pgprot)) + return -EINVAL; + err = check_hotplug_memory_addressable(pfn, nr_pages); if (err) return err; @@ -1002,7 +1005,7 @@ static int online_memory_block(struct memory_block *mem, void *arg) */ int __ref add_memory_resource(int nid, struct resource *res) { - struct mhp_params params = {}; + struct mhp_params params = { .pgprot = PAGE_KERNEL }; u64 start, size; bool new_node = false; int ret; diff --git a/mm/memremap.c b/mm/memremap.c index b0b5170843ff..bc167cde3237 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -189,8 +189,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) * We do not want any optional features only our own memmap */ .altmap = pgmap_altmap(pgmap), + .pgprot = PAGE_KERNEL, }; - pgprot_t pgprot = PAGE_KERNEL; int error, is_ram; bool need_devmap_managed = true; @@ -282,8 +282,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) if (nid < 0) nid = numa_mem_id(); - error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(res->start), 0, - resource_size(res)); + error = track_pfn_remap(NULL, ¶ms.pgprot, PHYS_PFN(res->start), + 0, resource_size(res)); if (error) goto err_pfn_remap; -- cgit From a50d8d98a87f33efa07adfa20747e13a93839a4b Mon Sep 17 00:00:00 2001 From: Logan Gunthorpe Date: Fri, 10 Apr 2020 14:33:39 -0700 Subject: mm/memremap: set caching mode for PCI P2PDMA memory to WC PCI BAR IO memory should never be mapped as WB, however prior to this the PAT bits were set WB and it was typically overridden by MTRR registers set by the firmware. Set PCI P2PDMA memory to be UC as this is what it currently, typically, ends up being mapped as on x86 after the MTRR registers override the cache setting. Future use-cases may need to generalize this by adding flags to select the caching type, as some P2PDMA cases may not want UC. However, those use-cases are not upstream yet and this can be changed when they arrive. Signed-off-by: Logan Gunthorpe Signed-off-by: Andrew Morton Reviewed-by: Dan Williams Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Andy Lutomirski Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Dave Hansen Cc: David Hildenbrand Cc: Eric Badger Cc: "H. 
Peter Anvin" Cc: Ingo Molnar Cc: Michael Ellerman Cc: Michal Hocko Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Link: http://lkml.kernel.org/r/20200306170846.9333-8-logang@deltatee.com Signed-off-by: Linus Torvalds --- mm/memremap.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm') diff --git a/mm/memremap.c b/mm/memremap.c index bc167cde3237..03e38b7a38f1 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -217,7 +217,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } break; case MEMORY_DEVICE_DEVDAX: + need_devmap_managed = false; + break; case MEMORY_DEVICE_PCI_P2PDMA: + params.pgprot = pgprot_noncached(params.pgprot); need_devmap_managed = false; break; default: -- cgit