summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08Merge tag 'mm-slub-5.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux Pull SLUB updates from Vlastimil Babka: "SLUB: reduce irq disabled scope and make it RT compatible This series was initially inspired by Mel's pcplist local_lock rewrite, and also interest to better understand SLUB's locking and the new primitives and RT variants and implications. It makes SLUB compatible with PREEMPT_RT and generally more preemption-friendly, apparently without significant regressions, as the fast paths are not affected. The main changes to SLUB by this series: - irq disabling is now only done for minimum amount of time needed to protect the strict kmem_cache_cpu fields, and as part of spin lock, local lock and bit lock operations to make them irq-safe - SLUB is fully PREEMPT_RT compatible The series should now be sufficiently tested in both RT and !RT configs, mainly thanks to Mike. The RFC/v1 version also got basic performance screening by Mel that didn't show major regressions. Mike's testing with hackbench of v2 on !RT reported negligible differences [6]: virgin(ish) tip 5.13.0.g60ab3ed-tip 7,320.67 msec task-clock # 7.792 CPUs utilized ( +- 0.31% ) 221,215 context-switches # 0.030 M/sec ( +- 3.97% ) 16,234 cpu-migrations # 0.002 M/sec ( +- 4.07% ) 13,233 page-faults # 0.002 M/sec ( +- 0.91% ) 27,592,205,252 cycles # 3.769 GHz ( +- 0.32% ) 8,309,495,040 instructions # 0.30 insn per cycle ( +- 0.37% ) 1,555,210,607 branches # 212.441 M/sec ( +- 0.42% ) 5,484,209 branch-misses # 0.35% of all branches ( +- 2.13% ) 0.93949 +- 0.00423 seconds time elapsed ( +- 0.45% ) 0.94608 +- 0.00384 seconds time elapsed ( +- 0.41% ) (repeat) 0.94422 +- 0.00410 seconds time elapsed ( +- 0.43% ) 5.13.0.g60ab3ed-tip +slub-local-lock-v2r3 7,343.57 msec task-clock # 7.776 CPUs utilized ( +- 0.44% ) 223,044 context-switches # 0.030 M/sec ( +- 3.02% ) 16,057 cpu-migrations # 0.002 M/sec ( +- 4.03% ) 13,164 page-faults # 0.002 M/sec ( +- 0.97% ) 27,684,906,017 cycles # 3.770 GHz ( +- 0.45% ) 8,323,273,871 instructions # 0.30 insn per cycle ( +- 0.28% ) 1,556,106,680 branches # 211.901 M/sec ( +- 0.31% ) 5,463,468 branch-misses # 0.35% of all branches ( +- 1.33% ) 0.94440 +- 0.00352 seconds time elapsed ( +- 0.37% ) 0.94830 +- 0.00228 seconds time elapsed ( +- 0.24% ) (repeat) 0.93813 +- 0.00440 seconds time elapsed ( +- 0.47% ) (repeat) RT configs showed some throughput regressions, but that's expected tradeoff for the preemption improvements through the RT mutex. It didn't prevent the v2 to be incorporated to the 5.13 RT tree [7], leading to testing exposure and bugfixes. Before the series, SLUB is lockless in both allocation and free fast paths, but elsewhere, it's disabling irqs for considerable periods of time - especially in allocation slowpath and the bulk allocation, where IRQs are re-enabled only when a new page from the page allocator is needed, and the context allows blocking. The irq disabled sections can then include deactivate_slab() which walks a full freelist and frees the slab back to page allocator or unfreeze_partials() going through a list of percpu partial slabs. The RT tree currently has some patches mitigating these, but we can do much better in mainline too. Patches 1-6 are straightforward improvements or cleanups that could exist outside of this series too, but are prerequsities. Patches 7-9 are also preparatory code changes without functional changes, but not so useful without the rest of the series. Patch 10 simplifies the fast paths on systems with preemption, based on (hopefully correct) observation that the current loops to verify tid are unnecessary. Patches 11-20 focus on reducing irq disabled scope in the allocation slowpath: - patch 11 moves disabling of irqs into ___slab_alloc() from its callers, which are the allocation slowpath, and bulk allocation. Instead these callers only disable preemption to stabilize the cpu. - The following patches then gradually reduce the scope of disabled irqs in ___slab_alloc() and the functions called from there. As of patch 14, the re-enabling of irqs based on gfp flags before calling the page allocator is removed from allocate_slab(). As of patch 17, it's possible to reach the page allocator (in case of existing slabs depleted) without disabling and re-enabling irqs a single time. Pathces 21-26 reduce the scope of disabled irqs in functions related to unfreezing percpu partial slab. Patch 27 is preparatory. Patch 28 is adopted from the RT tree and converts the flushing of percpu slabs on all cpus from using IPI to workqueue, so that the processing isn't happening with irqs disabled in the IPI handler. The flushing is not performance critical so it should be acceptable. Patch 29 also comes from RT tree and makes object_map_lock RT compatible. Patch 30 make slab_lock irq-safe on RT where we cannot rely on having irq disabled from the list_lock spin lock usage. Patch 31 changes kmem_cache_cpu->partial handling in put_cpu_partial() from cmpxchg loop to a short irq disabled section, which is used by all other code modifying the field. This addresses a theoretical race scenario pointed out by Jann, and makes the critical section safe wrt with RT local_lock semantics after the conversion in patch 35. Patch 32 changes preempt disable to migrate disable, so that the nested list_lock spinlock is safe to take on RT. Because migrate_disable() is a function call even on !RT, a small set of private wrappers is introduced to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations. As of this patch, SLUB should be already compatible with RT's lock semantics. Finally, patch 33 changes irq disabled sections that protect kmem_cache_cpu fields in the slow paths, with a local lock. However on PREEMPT_RT it means the lockless fast paths can now preempt slow paths which don't expect that, so the local lock has to be taken also in the fast paths and they are no longer lockless. RT folks seem to not mind this tradeoff. The patch also updates the locking documentation in the file's comment" Mike Galbraith and Mel Gorman verified that their earlier testing observations still hold for the final series: Link: https://lore.kernel.org/lkml/89ba4f783114520c167cc915ba949ad2c04d6790.camel@gmx.de/ Link: https://lore.kernel.org/lkml/20210907082010.GB3959@techsingularity.net/ * tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux: (33 commits) mm, slub: convert kmem_cpu_slab protection to local_lock mm, slub: use migrate_disable() on PREEMPT_RT mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg mm, slub: make slab_lock() disable irqs with PREEMPT_RT mm: slub: make object_map_lock a raw_spinlock_t mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context mm, slab: split out the cpu offline variant of flush_slab() mm, slub: don't disable irqs in slub_cpu_dead() mm, slub: only disable irq with spin_lock in __unfreeze_partials() mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing mm, slub: detach whole partial list at once in unfreeze_partials() mm, slub: discard slabs in unfreeze_partials() without irqs disabled mm, slub: move irq control into unfreeze_partials() mm, slub: call deactivate_slab() without disabling irqs mm, slub: make locking in deactivate_slab() irq-safe mm, slub: move reset of c->page and freelist out of deactivate_slab() mm, slub: stop disabling irqs around get_partial() mm, slub: check new pages with restored irqs mm, slub: validate slab from partial list or page allocator before making it cpu slab mm, slub: restore irqs around calling new_slab() ...
2021-09-08mm/workingset: correct kernel-doc notationsRandy Dunlap
Use the documented kernel-doc format to prevent kernel-doc warnings. mm/workingset.c:256: warning: No description found for return value of 'workingset_eviction' mm/workingset.c:285: warning: Function parameter or member 'folio' not described in 'workingset_refault' mm/workingset.c:285: warning: Excess function parameter 'page' description in 'workingset_refault' Link: https://lkml.kernel.org/r/20210808203153.10678-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08percpu: remove export of pcpu_base_addrGreg Kroah-Hartman
This is not needed by any modules, so remove the export. Link: https://lkml.kernel.org/r/20210722185814.504541-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/damon: add kunit testsSeongJae Park
This commit adds kunit based unit tests for the core and the virtual address spaces monitoring primitives of DAMON. Link: https://lkml.kernel.org/r/20210716081449.22187-12-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Fernand Sieber <sieberf@amazon.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/damon/dbgfs: support multiple contextsSeongJae Park
In some use cases, users would want to run multiple monitoring context. For example, if a user wants a high precision monitoring and dedicating multiple CPUs for the job is ok, because DAMON creates one monitoring thread per one context, the user can split the monitoring target regions into multiple small regions and create one context for each region. Or, someone might want to simultaneously monitor different address spaces, e.g., both virtual address space and physical address space. The DAMON's API allows such usage, but 'damon-dbgfs' does not. Therefore, only kernel space DAMON users can do multiple contexts monitoring. This commit allows the user space DAMON users to use multiple contexts monitoring by introducing two new 'damon-dbgfs' debugfs files, 'mk_context' and 'rm_context'. Users can create a new monitoring context by writing the desired name of the new context to 'mk_context'. Then, a new directory with the name and having the files for setting of the context ('attrs', 'target_ids' and 'record') will be created under the debugfs directory. Writing the name of the context to remove to 'rm_context' will remove the related context and directory. Link: https://lkml.kernel.org/r/20210716081449.22187-10-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/damon/dbgfs: export kdamond pid to the user spaceSeongJae Park
For CPU usage accounting, knowing pid of the monitoring thread could be helpful. For example, users could use cpuaccount cgroups with the pid. This commit therefore exports the pid of currently running monitoring thread to the user space via 'kdamond_pid' file in the debugfs directory. Link: https://lkml.kernel.org/r/20210716081449.22187-9-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/damon: implement a debugfs-based user space interfaceSeongJae Park
DAMON is designed to be used by kernel space code such as the memory management subsystems, and therefore it provides only kernel space API. That said, letting the user space control DAMON could provide some benefits to them. For example, it will allow user space to analyze their specific workloads and make their own special optimizations. For such cases, this commit implements a simple DAMON application kernel module, namely 'damon-dbgfs', which merely wraps the DAMON api and exports those to the user space via the debugfs. 'damon-dbgfs' exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``. Attributes ---------- Users can read and write the ``sampling interval``, ``aggregation interval``, ``regions update interval``, and min/max number of monitoring target regions by reading from and writing to the ``attrs`` file. For example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10, 1000 and check it again:: # cd <debugfs>/damon # echo 5000 100000 1000000 10 1000 > attrs # cat attrs 5000 100000 1000000 10 1000 Target IDs ---------- Some types of address spaces supports multiple monitoring target. For example, the virtual memory address spaces monitoring can have multiple processes as the monitoring targets. Users can set the targets by writing relevant id values of the targets to, and get the ids of the current targets by reading from the ``target_ids`` file. In case of the virtual address spaces monitoring, the values should be pids of the monitoring target processes. For example, below commands set processes having pids 42 and 4242 as the monitoring targets and check it again:: # cd <debugfs>/damon # echo 42 4242 > target_ids # cat target_ids 42 4242 Note that setting the target ids doesn't start the monitoring. Turning On/Off -------------- Setting the files as described above doesn't incur effect unless you explicitly start the monitoring. You can start, stop, and check the current status of the monitoring by writing to and reading from the ``monitor_on`` file. Writing ``on`` to the file starts the monitoring of the targets with the attributes. Writing ``off`` to the file stops those. DAMON also stops if every targets are invalidated (in case of the virtual memory monitoring, target processes are invalidated when terminated). Below example commands turn on, off, and check the status of DAMON:: # cd <debugfs>/damon # echo on > monitor_on # echo off > monitor_on # cat monitor_on off Please note that you cannot write to the above-mentioned debugfs files while the monitoring is turned on. If you write to the files while DAMON is running, an error code such as ``-EBUSY`` will be returned. [akpm@linux-foundation.org: remove unneeded "alloc failed" printks] [akpm@linux-foundation.org: replace macro with static inline] Link: https://lkml.kernel.org/r/20210716081449.22187-8-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/damon: add a tracepointSeongJae Park
This commit adds a tracepoint for DAMON. It traces the monitoring results of each region for each aggregation interval. Using this, DAMON can easily integrated with tracepoints supporting tools such as perf. Link: https://lkml.kernel.org/r/20210716081449.22187-7-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/damon: implement primitives for the virtual memory address spacesSeongJae Park
This commit introduces a reference implementation of the address space specific low level primitives for the virtual address space, so that users of DAMON can easily monitor the data accesses on virtual address spaces of specific processes by simply configuring the implementation to be used by DAMON. The low level primitives for the fundamental access monitoring are defined in two parts: 1. Identification of the monitoring target address range for the address space. 2. Access check of specific address range in the target space. The reference implementation for the virtual address space does the works as below. PTE Accessed-bit Based Access Check ----------------------------------- The implementation uses PTE Accessed-bit for basic access checks. That is, it clears the bit for the next sampling target page and checks whether it is set again after one sampling period. This could disturb the reclaim logic. DAMON uses ``PG_idle`` and ``PG_young`` page flags to solve the conflict, as Idle page tracking does. VMA-based Target Address Range Construction ------------------------------------------- Only small parts in the super-huge virtual address space of the processes are mapped to physical memory and accessed. Thus, tracking the unmapped address regions is just wasteful. However, because DAMON can deal with some level of noise using the adaptive regions adjustment mechanism, tracking every mapping is not strictly required but could even incur a high overhead in some cases. That said, too huge unmapped areas inside the monitoring target should be removed to not take the time for the adaptive mechanism. For the reason, this implementation converts the complex mappings to three distinct regions that cover every mapped area of the address space. Also, the two gaps between the three regions are the two biggest unmapped areas in the given address space. The two biggest unmapped areas would be the gap between the heap and the uppermost mmap()-ed region, and the gap between the lowermost mmap()-ed region and the stack in most of the cases. Because these gaps are exceptionally huge in usual address spaces, excluding these will be sufficient to make a reasonable trade-off. Below shows this in detail:: <heap> <BIG UNMAPPED REGION 1> <uppermost mmap()-ed region> (small mmap()-ed regions and munmap()-ed regions) <lowermost mmap()-ed region> <BIG UNMAPPED REGION 2> <stack> [akpm@linux-foundation.org: mm/damon/vaddr.c needs highmem.h for kunmap_atomic()] [sjpark@amazon.de: remove unnecessary PAGE_EXTENSION setup] Link: https://lkml.kernel.org/r/20210806095153.6444-2-sj38.park@gmail.com [sjpark@amazon.de: safely walk page table] Link: https://lkml.kernel.org/r/20210831161800.29419-1-sj38.park@gmail.com Link: https://lkml.kernel.org/r/20210716081449.22187-6-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/idle_page_tracking: make PG_idle reusableSeongJae Park
PG_idle and PG_young allow the two PTE Accessed bit users, Idle Page Tracking and the reclaim logic concurrently work while not interfering with each other. That is, when they need to clear the Accessed bit, they set PG_young to represent the previous state of the bit, respectively. And when they need to read the bit, if the bit is cleared, they further read the PG_young to know whether the other has cleared the bit meanwhile or not. For yet another user of the PTE Accessed bit, we could add another page flag, or extend the mechanism to use the flags. For the DAMON usecase, however, we don't need to do that just yet. IDLE_PAGE_TRACKING and DAMON are mutually exclusive, so there's only ever going to be one user of the current set of flags. In this commit, we split out the CONFIG options to allow for the use of PG_young and PG_idle outside of idle page tracking. In the next commit, DAMON's reference implementation of the virtual memory address space monitoring primitives will use it. [sjpark@amazon.de: set PAGE_EXTENSION for non-64BIT] Link: https://lkml.kernel.org/r/20210806095153.6444-1-sj38.park@gmail.com [akpm@linux-foundation.org: tweak Kconfig text] [sjpark@amazon.de: hide PAGE_IDLE_FLAG from users] Link: https://lkml.kernel.org/r/20210813081238.34705-1-sj38.park@gmail.com Link: https://lkml.kernel.org/r/20210716081449.22187-5-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/damon: adaptively adjust regionsSeongJae Park
Even somehow the initial monitoring target regions are well constructed to fulfill the assumption (pages in same region have similar access frequencies), the data access pattern can be dynamically changed. This will result in low monitoring quality. To keep the assumption as much as possible, DAMON adaptively merges and splits each region based on their access frequency. For each ``aggregation interval``, it compares the access frequencies of adjacent regions and merges those if the frequency difference is small. Then, after it reports and clears the aggregated access frequency of each region, it splits each region into two or three regions if the total number of regions will not exceed the user-specified maximum number of regions after the split. In this way, DAMON provides its best-effort quality and minimal overhead while keeping the upper-bound overhead that users set. Link: https://lkml.kernel.org/r/20210716081449.22187-4-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/damon/core: implement region-based samplingSeongJae Park
To avoid the unbounded increase of the overhead, DAMON groups adjacent pages that are assumed to have the same access frequencies into a region. As long as the assumption (pages in a region have the same access frequencies) is kept, only one page in the region is required to be checked. Thus, for each ``sampling interval``, 1. the 'prepare_access_checks' primitive picks one page in each region, 2. waits for one ``sampling interval``, 3. checks whether the page is accessed meanwhile, and 4. increases the access count of the region if so. Therefore, the monitoring overhead is controllable by adjusting the number of regions. DAMON allows both the underlying primitives and user callbacks to adjust regions for the trade-off. In other words, this commit makes DAMON to use not only time-based sampling but also space-based sampling. This scheme, however, cannot preserve the quality of the output if the assumption is not guaranteed. Next commit will address this problem. Link: https://lkml.kernel.org/r/20210716081449.22187-3-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: introduce Data Access MONitor (DAMON)SeongJae Park
Patch series "Introduce Data Access MONitor (DAMON)", v34. Introduction ============ DAMON is a data access monitoring framework for the Linux kernel. The core mechanisms of DAMON called 'region based sampling' and 'adaptive regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this patchset for the detail) make it - accurate (The monitored information is useful for DRAM level memory management. It might not appropriate for Cache-level accuracy, though.), - light-weight (The monitoring overhead is low enough to be applied online while making no impact on the performance of the target workloads.), and - scalable (the upper-bound of the instrumentation overhead is controllable regardless of the size of target workloads.). Using this framework, therefore, several memory management mechanisms such as reclamation and THP can be optimized to aware real data access patterns. Experimental access pattern aware memory management optimization works that incurring high instrumentation overhead will be able to have another try. Though DAMON is for kernel subsystems, it can be easily exposed to the user space by writing a DAMON-wrapper kernel subsystem. Then, user space users who have some special workloads will be able to write personalized tools or applications for deeper understanding and specialized optimizations of their systems. DAMON is also merged in two public Amazon Linux kernel trees that based on v5.4.y[1] and v5.10.y[2]. [1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon [2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon The userspace tool[1] is available, released under GPLv2, and actively being maintained. I am also planning to implement another basic user interface in perf[2]. Also, the basic test suite for DAMON is available under GPLv2[3]. [1] https://github.com/awslabs/damo [2] https://lore.kernel.org/linux-mm/20210107120729.22328-1-sjpark@amazon.com/ [3] https://github.com/awslabs/damon-tests Long-term Plan -------------- DAMON is a part of a project called Data Access-aware Operating System (DAOS). As the name implies, I want to improve the performance and efficiency of systems using fine-grained data access patterns. The optimizations are for both kernel and user spaces. I will therefore modify or create kernel subsystems, export some of those to user space and implement user space library / tools. Below shows the layers and components for the project. --------------------------------------------------------------------------- Primitives: PTE Accessed bit, PG_idle, rmap, (Intel CMT), ... Framework: DAMON Features: DAMOS, virtual addr, physical addr, ... Applications: DAMON-debugfs, (DARC), ... ^^^^^^^^^^^^^^^^^^^^^^^ KERNEL SPACE ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Raw Interface: debugfs, (sysfs), (damonfs), tracepoints, (sys_damon), ... vvvvvvvvvvvvvvvvvvvvvvv USER SPACE vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv Library: (libdamon), ... Tools: DAMO, (perf), ... --------------------------------------------------------------------------- The components in parentheses or marked as '...' are not implemented yet but in the future plan. IOW, those are the TODO tasks of DAOS project. For more detail, please refer to the plans: https://lore.kernel.org/linux-mm/20201202082731.24828-1-sjpark@amazon.com/ Evaluations =========== We evaluated DAMON's overhead, monitoring quality and usefulness using 24 realistic workloads on my QEMU/KVM based virtual machine running a kernel that v24 DAMON patchset is applied. DAMON is lightweight. It increases system memory usage by 0.39% and slows target workloads down by 1.16%. DAMON is accurate and useful for memory management optimizations. An experimental DAMON-based operation scheme for THP, namely 'ethp', removes 76.15% of THP memory overheads while preserving 51.25% of THP speedup. Another experimental DAMON-based 'proactive reclamation' implementation, 'prcl', reduces 93.38% of residential sets and 23.63% of system memory footprint while incurring only 1.22% runtime overhead in the best case (parsec3/freqmine). NOTE that the experimental THP optimization and proactive reclamation are not for production but only for proof of concepts. Please refer to the official document[1] or "Documentation/admin-guide/mm: Add a document for DAMON" patch in this patchset for detailed evaluation setup and results. [1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html Real-world User Story ===================== In summary, DAMON has used on production systems and proved its usefulness. DAMON as a profiler ------------------- We analyzed characteristics of a large scale production systems of our customers using DAMON. The systems utilize 70GB DRAM and 36 CPUs. From this, we were able to find interesting things below. There were obviously different access pattern under idle workload and active workload. Under the idle workload, it accessed large memory regions with low frequency, while the active workload accessed small memory regions with high freuqnecy. DAMON found a 7GB memory region that showing obviously high access frequency under the active workload. We believe this is the performance-effective working set and need to be protected. There was a 4KB memory region that showing highest access frequency under not only active but also idle workloads. We think this must be a hottest code section like thing that should never be paged out. For this analysis, DAMON used only 0.3-1% of single CPU time. Because we used recording-based analysis, it consumed about 3-12 MB of disk space per 20 minutes. This is only small amount of disk space, but we can further reduce the disk usage by using non-recording-based DAMON features. I'd like to argue that only DAMON can do such detailed analysis (finding 4KB highest region in 70GB memory) with the light overhead. DAMON as a system optimization tool ----------------------------------- We also found below potential performance problems on the systems and made DAMON-based solutions. The system doesn't want to make the workload suffer from the page reclamation and thus it utilizes enough DRAM but no swap device. However, we found the system is actively reclaiming file-backed pages, because the system has intensive file IO. The file IO turned out to be not performance critical for the workload, but the customer wanted to ensure performance critical file-backed pages like code section to not mistakenly be evicted. Using direct IO should or `mlock()` would be a straightforward solution, but modifying the user space code is not easy for the customer. Alternatively, we could use DAMON-based operation scheme[1]. By using it, we can ask DAMON to track access frequency of each region and make 'process_madvise(MADV_WILLNEED)[2]' call for regions having specific size and access frequency for a time interval. We also found the system is having high number of TLB misses. We tried 'always' THP enabled policy and it greatly reduced TLB misses, but the page reclamation also been more frequent due to the THP internal fragmentation caused memory bloat. We could try another DAMON-based operation scheme that applies 'MADV_HUGEPAGE' to memory regions having >=2MB size and high access frequency, while applying 'MADV_NOHUGEPAGE' to regions having <2MB size and low access frequency. We do not own the systems so we only reported the analysis results and possible optimization solutions to the customers. The customers satisfied about the analysis results and promised to try the optimization guides. [1] https://lore.kernel.org/linux-mm/20201006123931.5847-1-sjpark@amazon.com/ [2] https://lore.kernel.org/linux-api/20200622192900.22757-4-minchan@kernel.org/ Comparison with Idle Page Tracking ================================== Idle Page Tracking allows users to set and read idleness of pages using a bitmap file which represents each page with each bit of the file. One recommended usage of it is working set size detection. Users can do that by 1. find PFN of each page for workloads in interest, 2. set all the pages as idle by doing writes to the bitmap file, 3. wait until the workload accesses its working set, and 4. read the idleness of the pages again and count pages became not idle. NOTE: While Idle Page Tracking is for user space users, DAMON is primarily designed for kernel subsystems though it can easily exposed to the user space. Hence, this section only assumes such user space use of DAMON. For what use cases Idle Page Tracking would be better? ------------------------------------------------------ 1. Flexible usecases other than hotness monitoring. Because Idle Page Tracking allows users to control the primitive (Page idleness) by themselves, Idle Page Tracking users can do anything they want. Meanwhile, DAMON is primarily designed to monitor the hotness of each memory region. For this, DAMON asks users to provide sampling interval and aggregation interval. For the reason, there could be some use case that using Idle Page Tracking is simpler. 2. Physical memory monitoring. Idle Page Tracking receives PFN range as input, so natively supports physical memory monitoring. DAMON is designed to be extensible for multiple address spaces and use cases by implementing and using primitives for the given use case. Therefore, by theory, DAMON has no limitation in the type of target address space as long as primitives for the given address space exists. However, the default primitives introduced by this patchset supports only virtual address spaces. Therefore, for physical memory monitoring, you should implement your own primitives and use it, or simply use Idle Page Tracking. Nonetheless, RFC patchsets[1] for the physical memory address space primitives is already available. It also supports user memory same to Idle Page Tracking. [1] https://lore.kernel.org/linux-mm/20200831104730.28970-1-sjpark@amazon.com/ For what use cases DAMON is better? ----------------------------------- 1. Hotness Monitoring. Idle Page Tracking let users know only if a page frame is accessed or not. For hotness check, the user should write more code and use more memory. DAMON do that by itself. 2. Low Monitoring Overhead DAMON receives user's monitoring request with one step and then provide the results. So, roughly speaking, DAMON require only O(1) user/kernel context switches. In case of Idle Page Tracking, however, because the interface receives contiguous page frames, the number of user/kernel context switches increases as the monitoring target becomes complex and huge. As a result, the context switch overhead could be not negligible. Moreover, DAMON is born to handle with the monitoring overhead. Because the core mechanism is pure logical, Idle Page Tracking users might be able to implement the mechanism on their own, but it would be time consuming and the user/kernel context switching will still more frequent than that of DAMON. Also, the kernel subsystems cannot use the logic in this case. 3. Page granularity working set size detection. Until v22 of this patchset, this was categorized as the thing Idle Page Tracking could do better, because DAMON basically maintains additional metadata for each of the monitoring target regions. So, in the page granularity working set size detection use case, DAMON would incur (number of monitoring target pages * size of metadata) memory overhead. Size of the single metadata item is about 54 bytes, so assuming 4KB pages, about 1.3% of monitoring target pages will be additionally used. All essential metadata for Idle Page Tracking are embedded in 'struct page' and page table entries. Therefore, in this use case, only one counter variable for working set size accounting is required if Idle Page Tracking is used. There are more details to consider, but roughly speaking, this is true in most cases. However, the situation changed from v23. Now DAMON supports arbitrary types of monitoring targets, which don't use the metadata. Using that, DAMON can do the working set size detection with no additional space overhead but less user-kernel context switch. A first draft for the implementation of monitoring primitives for this usage is available in a DAMON development tree[1]. An RFC patchset for it based on this patchset will also be available soon. Since v24, the arbitrary type support is dropped from this patchset because this patchset doesn't introduce real use of the type. You can still get it from the DAMON development tree[2], though. [1] https://github.com/sjp38/linux/tree/damon/pgidle_hack [2] https://github.com/sjp38/linux/tree/damon/master 4. More future usecases While Idle Page Tracking has tight coupling with base primitives (PG_Idle and page table Accessed bits), DAMON is designed to be extensible for many use cases and address spaces. If you need some special address type or want to use special h/w access check primitives, you can write your own primitives for that and configure DAMON to use those. Therefore, if your use case could be changed a lot in future, using DAMON could be better. Can I use both Idle Page Tracking and DAMON? -------------------------------------------- Yes, though using them concurrently for overlapping memory regions could result in interference to each other. Nevertheless, such use case would be rare or makes no sense at all. Even in the case, the noise would bot be really significant. So, you can choose whatever you want depending on the characteristics of your use cases. More Information ================ We prepared a showcase web site[1] that you can get more information. There are - the official documentations[2], - the heatmap format dynamic access pattern of various realistic workloads for heap area[3], mmap()-ed area[4], and stack[5] area, - the dynamic working set size distribution[6] and chronological working set size changes[7], and - the latest performance test results[8]. [1] https://damonitor.github.io/_index [2] https://damonitor.github.io/doc/html/latest-damon [3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.0.png.html [4] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html [5] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.2.png.html [6] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html [7] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html [8] https://damonitor.github.io/test/result/perf/latest/html/index.html Baseline and Complete Git Trees =============================== The patches are based on the latest -mm tree, specifically v5.14-rc1-mmots-2021-07-15-18-47 of https://github.com/hnaz/linux-mm. You can also clone the complete git tree: $ git clone git://github.com/sjp38/linux -b damon/patches/v34 The web is also available: https://github.com/sjp38/linux/releases/tag/damon/patches/v34 Development Trees ----------------- There are a couple of trees for entire DAMON patchset series and features for future release. - For latest release: https://github.com/sjp38/linux/tree/damon/master - For next release: https://github.com/sjp38/linux/tree/damon/next Long-term Support Trees ----------------------- For people who want to test DAMON but using LTS kernels, there are another couple of trees based on two latest LTS kernels respectively and containing the 'damon/master' backports. - For v5.4.y: https://github.com/sjp38/linux/tree/damon/for-v5.4.y - For v5.10.y: https://github.com/sjp38/linux/tree/damon/for-v5.10.y Amazon Linux Kernel Trees ------------------------- DAMON is also merged in two public Amazon Linux kernel trees that based on v5.4.y[1] and v5.10.y[2]. [1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon [2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon Git Tree for Diff of Patches ============================ For easy review of diff between different versions of each patch, I prepared a git tree containing all versions of the DAMON patchset series: https://github.com/sjp38/damon-patches You can clone it and use 'diff' for easy review of changes between different versions of the patchset. For example: $ git clone https://github.com/sjp38/damon-patches && cd damon-patches $ diff -u damon/v33 damon/v34 Sequence Of Patches =================== First three patches implement the core logics of DAMON. The 1st patch introduces basic sampling based hotness monitoring for arbitrary types of targets. Following two patches implement the core mechanisms for control of overhead and accuracy, namely regions based sampling (patch 2) and adaptive regions adjustment (patch 3). Now the essential parts of DAMON is complete, but it cannot work unless someone provides monitoring primitives for a specific use case. The following two patches make it just work for virtual address spaces monitoring. The 4th patch makes 'PG_idle' can be used by DAMON and the 5th patch implements the virtual memory address space specific monitoring primitives using page table Accessed bits and the 'PG_idle' page flag. Now DAMON just works for virtual address space monitoring via the kernel space api. To let the user space users can use DAMON, following four patches add interfaces for them. The 6th patch adds a tracepoint for monitoring results. The 7th patch implements a DAMON application kernel module, namely damon-dbgfs, that simply wraps DAMON and exposes DAMON interface to the user space via the debugfs interface. The 8th patch further exports pid of monitoring thread (kdamond) to user space for easier cpu usage accounting, and the 9th patch makes the debugfs interface to support multiple contexts. Three patches for maintainability follows. The 10th patch adds documentations for both the user space and the kernel space. The 11th patch provides unit tests (based on the kunit) while the 12th patch adds user space tests (based on the kselftest). Finally, the last patch (13th) updates the MAINTAINERS file. This patch (of 13): DAMON is a data access monitoring framework for the Linux kernel. The core mechanisms of DAMON make it - accurate (the monitoring output is useful enough for DRAM level performance-centric memory management; It might be inappropriate for CPU cache levels, though), - light-weight (the monitoring overhead is normally low enough to be applied online), and - scalable (the upper-bound of the overhead is in constant range regardless of the size of target workloads). Using this framework, hence, we can easily write efficient kernel space data access monitoring applications. For example, the kernel's memory management mechanisms can make advanced decisions using this. Experimental data access aware optimization works that incurring high access monitoring overhead could again be implemented on top of this. Due to its simple and flexible interface, providing user space interface would be also easy. Then, user space users who have some special workloads can write personalized applications for better understanding and optimizations of their workloads and systems. === Nevertheless, this commit is defining and implementing only basic access check part without the overhead-accuracy handling core logic. The basic access check is as below. The output of DAMON says what memory regions are how frequently accessed for a given duration. The resolution of the access frequency is controlled by setting ``sampling interval`` and ``aggregation interval``. In detail, DAMON checks access to each page per ``sampling interval`` and aggregates the results. In other words, counts the number of the accesses to each region. After each ``aggregation interval`` passes, DAMON calls callback functions that previously registered by users so that users can read the aggregated results and then clears the results. This can be described in below simple pseudo-code:: init() while monitoring_on: for page in monitoring_target: if accessed(page): nr_accesses[page] += 1 if time() % aggregation_interval == 0: for callback in user_registered_callbacks: callback(monitoring_target, nr_accesses) for page in monitoring_target: nr_accesses[page] = 0 if time() % update_interval == 0: update() sleep(sampling interval) The target regions constructed at the beginning of the monitoring and updated after each ``regions_update_interval``, because the target regions could be dynamically changed (e.g., mmap() or memory hotplug). The monitoring overhead of this mechanism will arbitrarily increase as the size of the target workload grows. The basic monitoring primitives for actual access check and dynamic target regions construction aren't in the core part of DAMON. Instead, it allows users to implement their own primitives that are optimized for their use case and configure DAMON to use those. In other words, users cannot use current version of DAMON without some additional works. Following commits will implement the core mechanisms for the overhead-accuracy control and default primitives implementations. Link: https://lkml.kernel.org/r/20210716081449.22187-1-sj38.park@gmail.com Link: https://lkml.kernel.org/r/20210716081449.22187-2-sj38.park@gmail.com Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Marco Elver <elver@google.com> Cc: Fan Du <fan.du@intel.com> Cc: Greg Kroah-Hartman <greg@kroah.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joe Perches <joe@perches.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Maximilian Heyne <mheyne@amazon.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Markus Boehme <markubo@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08kfence: test: fail fast if disabled at bootMarco Elver
Fail kfence_test fast if KFENCE was disabled at boot, instead of each test case trying several seconds to allocate from KFENCE and failing. KUnit will fail all test cases if kunit_suite::init returns an error. Even if KFENCE was disabled, we still want the test to fail, so that CI systems that parse KUnit output will alert on KFENCE being disabled (accidentally or otherwise). Link: https://lkml.kernel.org/r/20210825105533.1247922-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08kfence: show cpu and timestamp in alloc/free infoMarco Elver
Record cpu and timestamp on allocations and frees, and show them in reports. Upon an error, this can help correlate earlier messages in the kernel log via allocation and free timestamps. Link: https://lkml.kernel.org/r/20210714175312.2947941-1-elver@google.com Suggested-by: Joern Engel <joern@purestorage.com> Signed-off-by: Marco Elver <elver@google.com> Acked-by: Alexander Potapenko <glider@google.com> Acked-by: Joern Engel <joern@purestorage.com> Cc: Yuanyuan Zhong <yzhong@purestorage.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/secretmem: use refcount_t instead of atomic_tJordy Zomer
When a secret memory region is active, memfd_secret disables hibernation. One of the goals is to keep the secret data from being written to persistent-storage. It accomplishes this by maintaining a reference count to `secretmem_users`. Once this reference is held your system can not be hibernated due to the check in `hibernation_available()`. However, because `secretmem_users` is of type `atomic_t`, reference counter overflows are possible. As you can see there's an `atomic_inc` for each `memfd` that is opened in the `memfd_secret` syscall. If a local attacker succeeds to open 2^32 memfd's, the counter will wrap around to 0. This implies that you may hibernate again, even though there are still regions of this secret memory, thereby bypassing the security check. In an attempt to fix this I have used `refcount_t` instead of `atomic_t` which prevents reference counter overflows. Link: https://lkml.kernel.org/r/20210820043339.2151352-1-jordy@pwning.systems Signed-off-by: Jordy Zomer <jordy@pwning.systems> Cc: Kees Cook <keescook@chromium.org>, Cc: Jordy Zomer <jordy@jordyzomer.github.io> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: in_irq() cleanupChangbin Du
Replace the obsolete and ambiguos macro in_irq() with new macro in_hardirq(). Link: https://lkml.kernel.org/r/20210813145245.86070-1-changbin.du@gmail.com Signed-off-by: Changbin Du <changbin.du@gmail.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [kmemleak] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/early_ioremap.c: remove redundant early_ioremap_shutdown()Weizhao Ouyang
early_ioremap_reset() reserved a weak function so that architectures can provide a specific cleanup. Now no architectures use it, remove this redundant function. Link: https://lkml.kernel.org/r/20210901082917.399953-1-o451686892@gmail.com Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: don't allow executable ioremap mappingsChristoph Hellwig
There is no need to execute from iomem (and most platforms it is impossible anyway), so add the pgprot_nx() call similar to vmap. Link: https://lkml.kernel.org/r/20210824091259.1324527-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: move ioremap_page_range to vmalloc.cChristoph Hellwig
Patch series "small ioremap cleanups". The first patch moves a little code around the vmalloc/ioremap boundary following a bigger move by Nick earlier. The second enforces non-executable mapping on ioremap just like we do for vmap. No driver currently uses executable mappings anyway, as they should. This patch (of 2): This keeps it together with the implementation, and to remove the vmap_range wrapper. Link: https://lkml.kernel.org/r/20210824091259.1324527-1-hch@lst.de Link: https://lkml.kernel.org/r/20210824091259.1324527-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: remove redundant compound_head() callingMuchun Song
There is a READ_ONCE() in the macro of compound_head(), which will prevent compiler from optimizing the code when there are more than once calling of it in a function. Remove the redundant calling of compound_head() from page_to_index() and page_add_file_rmap() for better code generation. Link: https://lkml.kernel.org/r/20210811101431.83940-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: use helper zone_is_zone_device() to simplify the codeMiaohe Lin
Patch series "Cleanup and fixups for memory hotplug". This series contains cleanup to use helper function to simplify the code. Also we fix some potential bugs. More details can be found in the respective changelogs. This patch (of 3): Use helper zone_is_zone_device() to simplify the code and remove some explicit CONFIG_ZONE_DEVICE codes. Link: https://lkml.kernel.org/r/20210821094246.10149-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210821094246.10149-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online ↵David Hildenbrand
policy Currently, the "auto-movable" online policy does not allow for hotplugged KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we can have, primarily, because there is no coordiantion across memory devices and we don't want to create zone-imbalances accidentially when unplugging memory. However, within a single memory device it's different. Let's allow for KERNEL memory within a dynamic memory group to allow for more MOVABLE within the same memory group. The only thing we have to take care of is that the managing driver avoids zone imbalances by unplugging MOVABLE memory first, otherwise there can be corner cases where unplug of memory could result in (accidential) zone imbalances. virtio-mem is the only user of dynamic memory groups and recently added support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we don't need a new toggle to enable it for dynamic memory groups. We limit this handling to dynamic memory groups, because: * We want to keep the runtime overhead for collecting stats when onlining a single memory block small. We tend to have only a handful of dynamic memory groups, but we can have quite some static memory groups (e.g., 256 DIMMs). * It doesn't make too much sense for static memory groups, as we try onlining all applicable memory blocks either completely to ZONE_MOVABLE or not. In ordinary operation, we won't have a mixture of zones within a static memory group. When adding memory to a dynamic memory group, we'll first online memory to ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll online the next unit(s) to ZONE_NORMAL, until we can online the next unit(s) to ZONE_MOVABLE. For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will result in a layout like: [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]... ^ movable memory due to early kernel memory ^ allows for more movable memory ... ^-----^ ... here ^ allows for more movable memory ... ^-----^ ... here While the created layout is sub-optimal when it comes to contiguous zones, it gives us the maximum flexibility when dynamically growing/shrinking a device; we can grow small VMs really big in small steps, and still shrink reliably to e.g., 1/4 of the maximum VM size in this example, removing full memory blocks along with meta data more reliably. Mark dynamic memory groups in the xarray such that we can efficiently iterate over them when collecting stats. In usual setups, we have one virtio-mem device per NUMA node, and usually only a small number of NUMA nodes. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: memory group aware "auto-movable" online policyDavid Hildenbrand
Use memory groups to improve our "auto-movable" onlining policy: 1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE only if all other memory blocks in the group are either MOVABLE or could be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture. 2. For dynamic memory groups (e.g., a virtio-mem device), online a memory block MOVABLE only if all other memory blocks inside the current unit are either MOVABLE or could be onlined MOVABLE. For a virtio-mem device with a device block size with 512 MiB, all 128 MiB memory blocks wihin a 512 MiB unit will either be MOVABLE or not, not a mixture. We have to pass the memory group to zone_for_pfn_range() to take the memory group into account. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: track present pages in memory groupsDavid Hildenbrand
Let's track all present pages in each memory group. Especially, track memory present in ZONE_MOVABLE and memory present in one of the kernel zones (which really only is ZONE_NORMAL right now as memory groups only apply to hotplugged memory) separately within a memory group, to prepare for making smart auto-online decision for individual memory blocks within a memory group based on group statistics. Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08drivers/base/memory: introduce "memory groups" to logically group memory blocksDavid Hildenbrand
In our "auto-movable" memory onlining policy, we want to make decisions across memory blocks of a single memory device. Examples of memory devices include ACPI memory devices (in the simplest case a single DIMM) and virtio-mem. For now, we don't have a connection between a single memory block device and the real memory device. Each memory device consists of 1..X memory block devices. Let's logically group memory blocks belonging to the same memory device in "memory groups". Memory groups can span multiple physical ranges and a memory group itself does not contain any information regarding physical ranges, only properties (e.g., "max_pages") necessary for improved memory onlining. Introduce two memory group types: 1) Static memory group: E.g., a single ACPI memory device, consisting of 1..X memory resources. A memory group consists of 1..Y memory blocks. The whole group is added/removed in one go. If any part cannot get offlined, the whole group cannot be removed. 2) Dynamic memory group: E.g., a single virtio-mem device. Memory is dynamically added/removed in a fixed granularity, called a "unit", consisting of 1..X memory blocks. A unit is added/removed in one go. If any part of a unit cannot get offlined, the whole unit cannot be removed. In case of 1) we usually want either all memory managed by ZONE_MOVABLE or none. In case of 2) we usually want to have as many units as possible managed by ZONE_MOVABLE. We want a single unit to be of the same type. For now, memory groups are an internal concept that is not exposed to user space; we might want to change that in the future, though. add_memory() users can specify a mgid instead of a nid when passing the MHP_NID_IS_MGID flag. Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: introduce "auto-movable" online policyDavid Hildenbrand
When onlining without specifying a zone (using "online" instead of "online_kernel" or "online_movable"), we currently select a zone such that existing zones are kept contiguous. This online policy made sense in the past, where contiguous zones where required. We'd like to implement smarter policies, however: * User space has little insight. As one example, it has no idea which memory blocks logically belong together (e.g., to a DIMM or to a virtio-mem device). * Drivers that add memory in separate memory blocks, especially virtio-mem, want memory to get onlined right from the kernel when adding. So we really want to have onlining to differing zones managed in the kernel, configured by user space. We see more and more cases where we might eventually hotplug a lot of memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB), however: * Resizing happens dynamically, in smaller steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...) * We still want as much flexibility as possible, especially, hotunplugging as much memory as possible later. We can really only use "online_movable" if we know that the amount of memory we are going to hotplug upfront, and we know that it won't result in a zone imbalance. So in our example, a 2 GiB VM that could grow to 64 GiB could currently not use "online_movable", and instead, "online_kernel" would have to be used, resulting in worse (no) memory hotunplug reliability. Let's add a new "auto-movable" online policy that considers the current zone ratios (global, per-node) to determine, whether we a memory block can be onlined to ZONE_MOVABLE: MOVABLE : KERNEL However, internally we'll only consider the following ratio for now: MOVABLE : KERNEL_EARLY For now, we don't allow for hotplugged KERNEL memory to allow for more MOVABLE memory, because there is no coordination across memory devices. In follow-up patches, we will allow for more KERNEL memory within a memory device to allow for more MOVABLE memory within the same memory device -- which only makes sense for special memory device types. We base our calculation on "present pages", see the code comments for details. Hotplugged memory will get online to ZONE_MOVABLE if the configured ratio allows for it. Depending on the setup, this can result in fragmented zones, which can make compaction slower and dynamic allocation of gigantic pages when not using CMA less reliable (... which is already pretty unreliable). The old policy will be the default and called "contig-zones". In follow-up patches, our new policy will use additional information, such as memory groups, to make even smarter decisions across memory blocks. Configuration: * memory_hotplug.online_policy is used to switch between both polices and defaults to "contig-zones". * memory_hotplug.auto_movable_ratio defines the maximum ratio is in percent and defaults to "301" -- allowing e.g., most 8 GiB machines to grow to 32 GiB and have all hotplugged memory in ZONE_MOVABLE. The additional percent accounts for a handful of lost present pages (e.g., firmware allocations). User space is expected to adjust this ratio when enabling the new "auto-movable" policy, though. * memory_hotplug.auto_movable_numa_aware considers numa node stats in addition to global stats, and defaults to "true". Note: just like the old policy, the new policy won't take things like unmovable huge pages or memory ballooning that doesn't support balloon compaction into account. User space has to configure onlining accordingly. Link: https://lkml.kernel.org/r/20210806124715.17090-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: track present early pages per zoneDavid Hildenbrand
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3. I. Goal The goal of this series is improving in-kernel auto-online support. It tackles the fundamental problems that: 1) We can create zone imbalances when onlining all memory blindly to ZONE_MOVABLE, in the worst case crashing the system. We have to know upfront how much memory we are going to hotplug such that we can safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE via "online_movable". This is far from practical and only applicable in limited setups -- like inside VMs under the RHV/oVirt hypervisor which will never hotplug more than 3 times the boot memory (and the limitation is only in place due to the Linux limitation). 2) We see more setups that implement dynamic VM resizing, hot(un)plugging memory to resize VM memory. In these setups, we might hotplug a lot of memory, but it might happen in various small steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the primary driver of this upstream right now, performing such dynamic resizing NUMA-aware via multiple virtio-mem devices. Onlining all hotplugged memory to ZONE_NORMAL means we basically have no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can easily run into zone imbalances when growing a VM. We want a mixture, and we want as much memory as reasonable/configured in ZONE_MOVABLE. Details regarding zone imbalances can be found at [1]. 3) Memory devices consist of 1..X memory block devices, however, the kernel doesn't really track the relationship. Consequently, also user space has no idea. We want to make per-device decisions. As one example, for memory hotunplug it doesn't make sense to use a mixture of zones within a single DIMM: we want all MOVABLE if possible, otherwise all !MOVABLE, because any !MOVABLE part will easily block the whole DIMM from getting hotunplugged. As another example, virtio-mem operates on individual units that span 1..X memory blocks. Similar to a DIMM, we want a unit to either be all MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however, all units of a virtio-mem device logically belong together and are managed (added/removed) by a single driver. We want as much memory of a virtio-mem device to be MOVABLE as possible. 4) We want memory onlining to be done right from the kernel while adding memory, not triggered by user space via udev rules; for example, this is reqired for fast memory hotplug for drivers that add individual memory blocks, like virito-mem. We want a way to configure a policy in the kernel and avoid implementing advanced policies in user space. The auto-onlining support we have in the kernel is not sufficient. All we have is a) online everything MOVABLE (online_movable) b) online everything !MOVABLE (online_kernel) c) keep zones contiguous (online). This series allows configuring c) to mean instead "online movable if possible according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new onlining policy. II. Approach This series does 3 things: 1) Introduces the "auto-movable" online policy that initially operates on individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio to make a decision whether a memory block will be onlined to ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL memory does not allow for more MOVABLE memory (details in the patches). CMA memory is treated like MOVABLE memory. 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory groups and uses group information to make decisions in the "auto-movable" online policy across memory blocks of a single memory device (modeled as memory group). More details can be found in patch #3 or in the DIMM example below. 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by allowing ZONE_NORMAL memory within a dynamic memory group to allow for more ZONE_MOVABLE memory within the same memory group. The target use case is dynamic VM resizing using virtio-mem. See the virtio-mem example below. I remember that the basic idea of using a ratio to implement a policy in the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I lost the pointer to that discussion). For me, the main use case is using it along with virtio-mem (and DIMMs / ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the amount of memory we can hotunplug reliably again if we might eventually hotplug a lot of memory to a VM. III. Target Usage The target usage will be: 1) Linux boots with "mhp_default_online_type=offline" 2) User space (e.g., systemd unit) configures memory onlining (according to a config file and system properties), for example: * Setting memory_hotplug.online_policy=auto-movable * Setting memory_hotplug.auto_movable_ratio=301 * Setting memory_hotplug.auto_movable_numa_aware=true 3) User space enabled auto onlining via "echo online > /sys/devices/system/memory/auto_online_blocks" 4) User space triggers manual onlining of all already-offline memory blocks (go over offline memory blocks and set them to "online") IV. Example For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of 301% results in the following layout: Memory block 0-15: DMA32 (early) Memory block 32-47: Normal (early) Memory block 48-79: Movable (DIMM 0) Memory block 80-111: Movable (DIMM 1) Memory block 112-143: Movable (DIMM 2) Memory block 144-275: Normal (DIMM 3) Memory block 176-207: Normal (DIMM 4) ... all Normal (-> hotplugged Normal memory does not allow for more Movable memory) For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM will result in the following layout: Memory block 0-15: DMA32 (early) Memory block 32-47: Normal (early) Memory block 48-143: Movable (virtio-mem, first 12 GiB) Memory block 144: Normal (virtio-mem, next 128 MiB) Memory block 145-147: Movable (virtio-mem, next 384 MiB) Memory block 148: Normal (virtio-mem, next 128 MiB) Memory block 149-151: Movable (virtio-mem, next 384 MiB) ... Normal/Movable mixture as above (-> hotplugged Normal memory allows for more Movable memory within the same device) Which gives us maximum flexibility when dynamically growing/shrinking a VM in smaller steps. V. Doc Update I'll update the memory-hotplug.rst documentation, once the overhaul [1] is usptream. Until then, details can be found in patch #2. VI. Future Work 1) Use memory groups for ppc64 dlpar 2) Being able to specify a portion of (early) kernel memory that will be excluded from the ratio. Like "128 MiB globally/per node" are excluded. This might be helpful when starting VMs with extremely small memory footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting the first hotplugged units getting onlined to ZONE_MOVABLE. One alternative would be a trigger to not consider ZONE_DMA memory in the ratio. We'll have to see if this is really rrequired. 3) Indicate to user space that MOVABLE might be a bad idea -- especially relevant when memory ballooning without support for balloon compaction is active. This patch (of 9): For implementing a new memory onlining policy, which determines when to online memory blocks to ZONE_MOVABLE semi-automatically, we need the number of present early (boot) pages -- present pages excluding hotplugged pages. Let's track these pages per zone. Pass a page instead of the zone to adjust_present_page_count(), similar as adjust_managed_page_count() and derive the zone from the page. It's worth noting that a memory block to be offlined/onlined is either completely "early" or "not early". add_memory() and friends can only add complete memory blocks and we only online/offline complete (individual) memory blocks. Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: Hui Zhu <teawater@gmail.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: remove nid parameter from remove_memory() and friendsDavid Hildenbrand
There is only a single user remaining. We can simply lookup the nid only used for node offlining purposes when walking our memory blocks. We don't expect to remove multi-nid ranges; and if we'd ever do, we most probably don't care about removing multi-nid ranges that actually result in empty nodes. If ever required, we can detect the "multi-nid" scenario and simply try offlining all online nodes. Link: https://lkml.kernel.org/r/20210712124052.26491-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta@ionos.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Rich Felker <dalias@libc.org> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: remove nid parameter from arch_remove_memory()David Hildenbrand
The parameter is unused, let's remove it. Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390] Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Baoquan He <bhe@redhat.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Joe Perches <joe@perches.com> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: Jia He <justin.he@arm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()David Hildenbrand
Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory" These are all cleanups and one fix previously sent as part of [1]: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups. These patches make sense even without the other series, therefore I pulled them out to make the other series easier to digest. [1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com This patch (of 4): Checkpatch complained on a follow-up patch that we are using "unsigned" here, which defaults to "unsigned int" and checkpatch is correct. As we will search for a fitting zone using the wrong pfn, we might end up onlining memory to one of the special kernel zones, such as ZONE_DMA, which can end badly as the onlined memory does not satisfy properties of these zones. Use "unsigned long" instead, just as we do in other places when handling PFNs. This can bite us once we have physical addresses in the range of multiple TB. Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: virtualization@lists.linux-foundation.org Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Rich Felker <dalias@libc.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: memory_hotplug: cleanup after removal of pfn_valid_within()Mike Rapoport
When test_pages_in_a_zone() used pfn_valid_within() is has some logic surrounding pfn_valid_within() checks. Since pfn_valid_within() is gone, this logic can be removed. Link: https://lkml.kernel.org/r/20210713080035.7464-3-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONEMike Rapoport
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-07Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"Linus Torvalds
This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e. That commit was completely broken, and I should have caught on to it earlier. But happily, the kernel test robot noticed the breakage fairly quickly. The breakage is because "try_get_page()" is about avoiding the page reference count overflow case, but is otherwise the exact same as a plain "get_page()". In contrast, "try_get_compound_head()" is an entirely different beast, and uses __page_cache_add_speculative() because it's not just about the page reference count, but also about possibly racing with the underlying page going away. So all the commentary about how "try_get_page() has fallen a little behind in terms of maintenance, try_get_compound_head() handles speculative page references more thoroughly" was just completely wrong: yes, try_get_compound_head() handles speculative page references, but the point is that try_get_page() does not, and must not. So there's no lack of maintainance - there are fundamentally different semantics. A speculative page reference would be entirely wrong in "get_page()", and it's entirely wrong in "try_get_page()". It's not about speculation, it's purely about "uhhuh, you can't get this page because you've tried to increment the reference count too much already". The reason the kernel test robot noticed this bug was that it hit the VM_BUG_ON() in __page_cache_add_speculative(), which is all about verifying that the context of any speculative page access is correct. But since that isn't what try_get_page() is all about, the VM_BUG_ON() tests things that are not correct to test for try_get_page(). Reported-by: kernel test robot <oliver.sang@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-04Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linuxLinus Torvalds
Pull MAP_DENYWRITE removal from David Hildenbrand: "Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove VM_DENYWRITE. There are some (minor) user-visible changes: - We no longer deny write access to shared libaries loaded via legacy uselib(); this behavior matches modern user space e.g. dlopen(). - We no longer deny write access to the elf interpreter after exec completed, treating it just like shared libraries (which it often is). - We always deny write access to the file linked via /proc/pid/exe: sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the file cannot be denied, and write access to the file will remain denied until the link is effectivel gone (exec, termination, sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file. Cross-compiled for a bunch of architectures (alpha, microblaze, i386, s390x, ...) and verified via ltp that especially the relevant tests (i.e., creat07 and execve04) continue working as expected" * tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux: fs: update documentation of get_write_access() and friends mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() mm: remove VM_DENYWRITE binfmt: remove in-tree usage of MAP_DENYWRITE kernel/fork: always deny write access to current MM exe_file kernel/fork: factor out replacing the current MM exe_file binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
2021-09-04mm, slub: convert kmem_cpu_slab protection to local_lockVlastimil Babka
Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's equivalent, with better lockdep visibility. On PREEMPT_RT that means better preemption. However, the cost on PREEMPT_RT is the loss of lockless fast paths which only work with cpu freelist. Those are designed to detect and recover from being preempted by other conflicting operations (both fast or slow path), but the slow path operations assume they cannot be preempted by a fast path operation, which is guaranteed naturally with disabled irqs. With local locks on PREEMPT_RT, the fast paths now also need to take the local lock to avoid races. In the allocation fastpath slab_alloc_node() we can just defer to the slowpath __slab_alloc() which also works with cpu freelist, but under the local lock. In the free fastpath do_slab_free() we have to add a new local lock protected version of freeing to the cpu freelist, as the existing slowpath only works with the page freelist. Also update the comment about locking scheme in SLUB to reflect changes done by this series. [ Mike Galbraith <efault@gmx.de>: use local_lock() without irq in PREEMPT_RT scope; debugging of RT crashes resulting in put_cpu_partial() locking changes ] Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: use migrate_disable() on PREEMPT_RTVlastimil Babka
We currently use preempt_disable() (directly or via get_cpu_ptr()) to stabilize the pointer to kmem_cache_cpu. On PREEMPT_RT this would be incompatible with the list_lock spinlock. We can use migrate_disable() instead, but that increases overhead on !PREEMPT_RT as it's an unconditional function call. In order to get the best available mechanism on both PREEMPT_RT and !PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr() wrappers and use them. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchgVlastimil Babka
Jann Horn reported [1] the following theoretically possible race: task A: put_cpu_partial() calls preempt_disable() task A: oldpage = this_cpu_read(s->cpu_slab->partial) interrupt: kfree() reaches unfreeze_partials() and discards the page task B (on another CPU): reallocates page as page cache task A: reads page->pages and page->pobjects, which are actually halves of the pointer page->lru.prev task B (on another CPU): frees page interrupt: allocates page as SLUB page and places it on the percpu partial list task A: this_cpu_cmpxchg() succeeds which would cause page->pages and page->pobjects to end up containing halves of pointers that would then influence when put_cpu_partial() happens and show up in root-only sysfs files. Maybe that's acceptable, I don't know. But there should probably at least be a comment for now to point out that we're reading union fields of a page that might be in a completely different state. Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the latter disables irqs, otherwise a __slab_free() in an irq handler could call put_cpu_partial() in the middle of ___slab_alloc() manipulating ->partial and corrupt it. This becomes an issue on RT after a local_lock is introduced in later patch. The fix means taking the local_lock also in put_cpu_partial() on RT. After debugging this issue, Mike Galbraith suggested [2] that to avoid different locking schemes on RT and !RT, we can just protect put_cpu_partial() with disabled irqs (to be converted to local_lock_irqsave() later) everywhere. This should be acceptable as it's not a fast path, and moving the actual partial unfreezing outside of the irq disabled section makes it short, and with the retry loop gone the code can be also simplified. In addition, the race reported by Jann should no longer be possible. [1] https://lore.kernel.org/lkml/CAG48ez1mvUuXwg0YPH5ANzhQLpbphqk-ZS+jbRz+H66fvm4FcA@mail.gmail.com/ [2] https://lore.kernel.org/linux-rt-users/e3470ab357b48bccfbd1f5133b982178a7d2befb.camel@gmx.de/ Reported-by: Jann Horn <jannh@google.com> Suggested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: make slab_lock() disable irqs with PREEMPT_RTVlastimil Babka
We need to disable irqs around slab_lock() (a bit spinlock) to make it irq-safe. Most calls to slab_lock() are nested under spin_lock_irqsave() which doesn't disable irqs on PREEMPT_RT, so add explicit disabling with PREEMPT_RT. The exception is cmpxchg_double_slab() which already disables irqs, so use a __slab_[un]lock() variant without irq disable there. slab_[un]lock() thus needs a flags pointer parameter, which is unused on !RT. free_debug_processing() now has two flags variables, which looks odd, but only one is actually used - the one used in spin_lock_irqsave() on !RT and the one used in slab_lock() on RT. As a result, __cmpxchg_double_slab() and cmpxchg_double_slab() become effectively identical on RT, as both will disable irqs, which is necessary on RT as most callers of this function also rely on irqsaving lock operations. Thus, assert that irqs are already disabled in __cmpxchg_double_slab() only on !RT and also change the VM_BUG_ON assertion to the more standard lockdep_assert one. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm: slub: make object_map_lock a raw_spinlock_tSebastian Andrzej Siewior
The variable object_map is protected by object_map_lock. The lock is always acquired in debug code and within already atomic context Make object_map_lock a raw_spinlock_t. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of ↵Sebastian Andrzej Siewior
IRQ context flush_all() flushes a specific SLAB cache on each CPU (where the cache is present). The deactivate_slab()/__free_slab() invocation happens within IPI handler and is problematic for PREEMPT_RT. The flush operation is not a frequent operation or a hot path. The per-CPU flush operation can be moved to within a workqueue. Because a workqueue handler, unlike IPI handler, does not disable irqs, flush_slab() now has to disable them for working with the kmem_cache_cpu fields. deactivate_slab() is safe to call with irqs enabled. [vbabka@suse.cz: adapt to new SLUB changes] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slab: split out the cpu offline variant of flush_slab()Vlastimil Babka
flush_slab() is called either as part IPI handler on given live cpu, or as a cleanup on behalf of another cpu that went offline. The first case needs to protect updating the kmem_cache_cpu fields with disabled irqs. Currently the whole call happens with irqs disabled by the IPI handler, but the following patch will change from IPI to workqueue, and flush_slab() will have to disable irqs (to be replaced with a local lock later) in the critical part. To prepare for this change, replace the call to flush_slab() for the dead cpu handling with an opencoded variant that will not disable irqs nor take a local lock. Suggested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: don't disable irqs in slub_cpu_dead()Vlastimil Babka
slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls only functions that are now irq safe, so we don't need to disable irqs anymore. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: only disable irq with spin_lock in __unfreeze_partials()Vlastimil Babka
__unfreeze_partials() no longer needs to have irqs disabled, except for making the spin_lock operations irq-safe, so convert the spin_locks operations and remove the separate irq handling. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: separate detaching of partial list in unfreeze_partials() from ↵Vlastimil Babka
unfreezing Unfreezing partial list can be split to two phases - detaching the list from struct kmem_cache_cpu, and processing the list. The whole operation does not need to be protected by disabled irqs. Restructure the code to separate the detaching (with disabled irqs) and unfreezing (with irq disabling to be reduced in the next patch). Also, unfreeze_partials() can be called from another cpu on behalf of a cpu that is being offlined, where disabling irqs on the local cpu has no sense, so restructure the code as follows: - __unfreeze_partials() is the bulk of unfreeze_partials() that processes the detached percpu partial list - unfreeze_partials() detaches list from current cpu with irqs disabled and calls __unfreeze_partials() - unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no irq disabling, and is called from __flush_cpu_slab() - flush_cpu_slab() is for the local cpu thus it needs to call unfreeze_partials(). So it can't simply call __flush_cpu_slab(smp_processor_id()) anymore and we have to open-code the proper calls. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: detach whole partial list at once in unfreeze_partials()Vlastimil Babka
Instead of iterating through the live percpu partial list, detach it from the kmem_cache_cpu at once. This is simpler and will allow further optimization. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: discard slabs in unfreeze_partials() without irqs disabledVlastimil Babka
No need for disabled irqs when discarding slabs, so restore them before discarding. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: move irq control into unfreeze_partials()Vlastimil Babka
unfreeze_partials() can be optimized so that it doesn't need irqs disabled for the whole time. As the first step, move irq control into the function and remove it from the put_cpu_partial() caller. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: call deactivate_slab() without disabling irqsVlastimil Babka
The function is now safe to be called with irqs enabled, so move the calls outside of irq disabled sections. When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to reenable them before deactivate_slab() we need to open-code flush_slab() in ___slab_alloc() and reenable irqs after modifying the kmem_cache_cpu fields. But that means a IRQ handler meanwhile might have assigned a new page to kmem_cache_cpu.page so we have to retry the whole check. The remaining callers of flush_slab() are the IPI handler which has disabled irqs anyway, and slub_cpu_dead() which will be dealt with in the following patch. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>