summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-11-11mm: remove unnecessary page_table_lock on stack expansionLorenzo Stoakes
Ever since commit 8d7071af8907 ("mm: always expand the stack with the mmap write lock held") we have been expanding the stack with the mmap write lock held. This is true in all code paths: get_arg_page() -> expand_downwards() setup_arg_pages() -> expand_stack_locked() -> expand_downwards() / expand_upwards() lock_mm_and_find_vma() -> expand_stack_locked() -> expand_downwards() / expand_upwards() create_elf_tables() -> find_extend_vma_locked() -> expand_stack_locked() expand_stack() -> vma_expand_down() -> expand_downwards() expand_stack() -> vma_expand_up() -> expand_upwards() Each of which acquire the mmap write lock before doing so. Despite this, we maintain code that acquires a page table lock in the expand_upwards() and expand_downwards() code, stating that we hold a shared mmap lock and thus this is necessary. It is not, we do not have to worry about concurrent VMA expansions so we can simply drop this, and update comments accordingly. We do not even need be concerned with racing page faults, as vma_start_write() is invoked in both cases. Link: https://lkml.kernel.org/r/20241101184627.131391-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: huge_memory: use strscpy() instead of strcpy()Maíra Canal
Replace strcpy() with strscpy() in mm/huge_memory.c strcpy() has been deprecated because it is generally unsafe, so help to eliminate it from the kernel source. Link: https://github.com/KSPP/linux/issues/88 Link: https://lkml.kernel.org/r/20241101165719.1074234-7-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: shmem: override mTHP shmem default with a kernel parameterMaíra Canal
Add the ``thp_shmem=`` kernel command line to allow specifying the default policy of each supported shmem hugepage size. The kernel parameter accepts the following format: thp_shmem=<size>[KMG],<size>[KMG]:<policy>;<size>[KMG]-<size>[KMG]:<policy> For example, thp_shmem=16K-64K:always;128K,512K:inherit;256K:advise;1M-2M:never;4M-8M:within_size Some GPUs may benefit from using huge pages. Since DRM GEM uses shmem to allocate anonymous pageable memory, it's essential to control the huge page allocation policy for the internal shmem mount. This control can be achieved through the ``transparent_hugepage_shmem=`` parameter. Beyond just setting the allocation policy, it's crucial to have granular control over the size of huge pages that can be allocated. The GPU may support only specific huge page sizes, and allocating pages larger/smaller than those sizes would be ineffective. Link: https://lkml.kernel.org/r/20241101165719.1074234-6-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: move ``get_order_from_str()`` to internal.hMaíra Canal
In order to implement a kernel parameter similar to ``thp_anon=`` for shmem, we'll need the function ``get_order_from_str()``. Instead of duplicating the function, move the function to a shared header, in which both mm/shmem.c and mm/huge_memory.c will be able to use it. Link: https://lkml.kernel.org/r/20241101165719.1074234-5-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: shmem: control THP support through the kernel command lineMaíra Canal
Patch series "mm: add more kernel parameters to control mTHP", v5. This series introduces four patches related to the kernel parameters controlling mTHP and a fifth patch replacing `strcpy()` for `strscpy()` in the file `mm/huge_memory.c`. The first patch is a straightforward documentation update, correcting the format of the kernel parameter ``thp_anon=``. The second, third, and fourth patches focus on controlling THP support for shmem via the kernel command line. The second patch introduces a parameter to control the global default huge page allocation policy for the internal shmem mount. The third patch moves a piece of code to a shared header to ease the implementation of the fourth patch. Finally, the fourth patch implements a parameter similar to ``thp_anon=``, but for shmem. The goal of these changes is to simplify the configuration of systems that rely on mTHP support for shmem. For instance, a platform with a GPU that benefits from huge pages may want to enable huge pages for shmem. Having these kernel parameters streamlines the configuration process and ensures consistency across setups. This patch (of 4): Add a new kernel command line to control the hugepage allocation policy for the internal shmem mount, ``transparent_hugepage_shmem``. The parameter is similar to ``transparent_hugepage`` and has the following format: transparent_hugepage_shmem=<policy> where ``<policy>`` is one of the seven valid policies available for shmem. Configuring the default huge page allocation policy for the internal shmem mount can be beneficial for DRM GPU drivers. Just as CPU architectures, GPUs can also take advantage of huge pages, but this is possible only if DRM GEM objects are backed by huge pages. Since GEM uses shmem to allocate anonymous pageable memory, having control over the default huge page allocation policy allows for the exploration of huge pages use on GPUs that rely on GEM objects backed by shmem. Link: https://lkml.kernel.org/r/20241101165719.1074234-2-mcanal@igalia.com Link: https://lkml.kernel.org/r/20241101165719.1074234-4-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: dri-devel@lists.freedesktop.org Cc: Hugh Dickins <hughd@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kernel-dev@igalia.com Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11tools/mm: fix slabinfo crash when MAX_SLABS is exceededMarc Dionne
The number of slabs can easily exceed the hard coded MAX_SLABS in the slabinfo tool, causing it to overwrite memory and crash. Increase the value of MAX_SLABS, and check if that has been exceeded for each new slab, instead of at the end when it's already too late. Also move the check for MAX_ALIASES into the loop body. Link: https://lkml.kernel.org/r/20241031105534.565533-1-marc.c.dionne@gmail.com Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11maple_tree: add a test checking storing nullWei Yang
Add a test to assert that, when storing null to am empty tree or a single entry tree it will not result into: * a root node with range [0, ULONG_MAX] set to NULL * a root node with consecutive slot set to NULL [akpm@linux-foundation.org: work around build error (mas_root)] Link: https://lkml.kernel.org/r/20241031231627.14316-6-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11maple_tree: refine mas_store_root() on storing NULLWei Yang
Currently, when storing NULL on mas_store_root(), the behavior could be improved. Storing NULLs over the entire tree may result in a node being used to store a single range. Further stores of NULL may cause the node and tree to be corrupt and cause incorrect behaviour. Fixing the store to the root null fixes the issue by ensuring that a range of 0 - ULONG_MAX results in an empty tree. Users of the tree may experience incorrect values returned if the tree was expanded to store values, then overwritten by all NULLS, then continued to store NULLs over the empty area. For example possible cases are: * store NULL at any range result a new node * store NULL at range [m, n] where m > 0 to a single entry tree result a new node with range [m, n] set to NULL * store NULL at range [m, n] where m > 0 to an empty tree result consecutive NULL slot * it allows for multiple NULL entries by expanding root to store NULLs to an empty tree This patch tries to improve in: * memory efficient by setting to empty tree instead of using a node * remove the possibility of consecutive NULL slot which will prohibit extended null in later operation Link: https://lkml.kernel.org/r/20241031231627.14316-5-richard.weiyang@gmail.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11maple_tree: not necessary to check index/last againWei Yang
Before calling mas_new_root(), the range has been checked. Link: https://lkml.kernel.org/r/20241031231627.14316-4-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11maple_tree: the return value of mas_root_expand() is not usedWei Yang
No user of the return value now, just remove it. Link: https://lkml.kernel.org/r/20241031231627.14316-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11maple_tree: print empty for an empty tree on mt_dump()Wei Yang
Patch series "refine storing null", v5. When overwriting the whole range with NULL, current behavior is not correct. An empty tree is represented by having the tree point to NULL directly. An empty tree indicates the entire range (0-ULONG_MAX) is NULL. A store operation into an existing node that causes 0 - ULONG_MAX to be equal to NULL may not be restored to an empty state - a node is used to store the single range instead. This is wasteful and different from the initial setup of the tree. Once the tree is using a single node to store 0 - ULONG_MAX, problems may arise when storing more values into a tree with the unexpected state of 0 - ULONG being a single range in a node. User visible issues may mean a corrupt tree and incorrect storage of information within the tree. This would be limited to users who create and then empty a tree by overwriting all values, then try to store more NULLs into the empty tree. I cannot come up with an example of any user doing this (users usually destroy the tree and generally don't keep trying to store NULLs over NULLs), but patch 4/5 "maple_tree: refine mas_store_root() on storing NULL" should be backported just in case. This patch (of 5): Currently for an empty tree, it would print: maple_tree(0x7ffcd02c6ee0) flags 1, height 0 root (nil) 0: (nil) This is a little misleading. Let's print (empty) for an empty tree. Link: https://lkml.kernel.org/r/20241031231627.14316-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20241031231627.14316-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11vma: detect infinite loop in vma treeLiam R. Howlett
There have been no reported infinite loops in the tree, but checking the detection of an infinite loop during validation is simple enough. Add the detection to the validate_mm() function so that error reports are clear and don't just report stalls. This does not protect against internal maple tree issues, but it does detect too many vmas being returned from the tree. The variance of +10 is to allow for the debugging output to be more useful for nearly correct counts. In the event of more than 10 over the map_count, the count will be set to -1 for easier identification of a potential infinite loop. Note that the mmap lock is held to ensure a consistent tree state during the validation process. [akpm@linux-foundation.org: add comment] Link: https://lkml.kernel.org/r/20241031193608.1965366-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11selftests/mm: skip virtual_address_range tests on riscvChunyan Zhang
RISC-V doesn't currently have the behavior of restricting the virtual address space which virtual_address_range tests check, this will cause the tests fail. So lets disable the whole test suite for riscv64 for now, not build it and run_vmtests.sh will skip it if it is not present. Link: https://lkml.kernel.org/r/20241008094141.549248-5-zhangchunyan@iscas.ac.cn Signed-off-by: Chunyan Zhang <zhangchunyan@iscas.ac.cn> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11selftest/mm: fix typo in virtual_address_rangeChunyan Zhang
The function name should be *hint* address, so correct it. Link: https://lkml.kernel.org/r/20241008094141.549248-4-zhangchunyan@iscas.ac.cn Signed-off-by: Chunyan Zhang <zhangchunyan@iscas.ac.cn> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11zram: clear IDLE flag in mark_idle()Sergey Senozhatsky
If entry does not fulfill current mark_idle() parameters, e.g. cutoff time, then we should clear its ZRAM_IDLE from previous mark_idle() invocations. Consider the following case: - mark_idle() cutoff time 8h - mark_idle() cutoff time 4h - writeback() idle - will writeback entries with cutoff time 8h, while it should only pick entries with cutoff time 4h The bug was reported by Shin Kawamura. Link: https://lkml.kernel.org/r/20241028153629.1479791-3-senozhatsky@chromium.org Fixes: 755804d16965 ("zram: introduce an aged idle interface") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Shin Kawamura <kawasin@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org>
2024-11-11zram: clear IDLE flag after recompressionSergey Senozhatsky
Patch series "zram: IDLE flag handling fixes", v2. zram can wrongly preserve ZRAM_IDLE flag on its entries which can result in premature post-processing (writeback and recompression) of such entries. This patch (of 2) Recompression should clear ZRAM_IDLE flag on the entries it has accessed, because otherwise some entries, specifically those for which recompression has failed, become immediate candidate entries for another post-processing (e.g. writeback). Consider the following case: - recompression marks entries IDLE every 4 hours and attempts to recompress them - some entries are incompressible, so we keep them intact and hence preserve IDLE flag - writeback marks entries IDLE every 8 hours and writebacks IDLE entries, however we have IDLE entries left from recompression, so writeback prematurely writebacks those entries. The bug was reported by Shin Kawamura. Link: https://lkml.kernel.org/r/20241028153629.1479791-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20241028153629.1479791-2-senozhatsky@chromium.org Fixes: 84b33bf78889 ("zram: introduce recompress sysfs knob") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Shin Kawamura <kawasin@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org>
2024-11-11mm/memory-failure: replace sprintf() with sysfs_emit()zhangguopeng
As Documentation/filesystems/sysfs.rst suggested, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Link: https://lkml.kernel.org/r/20241029101853.37890-1-zhangguopeng@kylinos.cn Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11memcg: add flush tracepointJP Kobryn
This tracepoint gives visibility on how often the flushing of memcg stats occurs and contains info on whether it was forced, skipped, and the value of stats updated. It can help with understanding how readers are affected by having to perform the flush, and the effectiveness of the flush by inspecting the number of stats updated. Paired with the recently added tracepoints for tracing rstat updates, it can also help show correlation where stats exceed thresholds frequently. Link: https://lkml.kernel.org/r/20241029021106.25587-3-inwardvessel@gmail.com Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11memcg: rename do_flush_stats and add force flagJP Kobryn
Patch series "memcg: tracepoint for flushing stats", v3. This series adds new capability for understanding frequency and circumstances behind flushing memcg stats. This patch (of 2): Change the name to something more consistent with others in the file and use double unders to signify it is associated with the mem_cgroup_flush_stats() API call. Additionally include a new flag that call sites use to indicate a forced flush; skipping checks and flushing unconditionally. There are no changes in functionality. Link: https://lkml.kernel.org/r/20241029021106.25587-1-inwardvessel@gmail.com Link: https://lkml.kernel.org/r/20241029021106.25587-2-inwardvessel@gmail.com Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: delete the unused put_pages_list()Hugh Dickins
The last user of put_pages_list() converted away from it in 6.10 commit 06c375053cef ("iommu/vt-d: add wrapper functions for page allocations"): delete put_pages_list(). Link: https://lkml.kernel.org/r/d9985d6a-293e-176b-e63d-82fdfd28c139@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11selftests/mm: add self tests for guard page featureLorenzo Stoakes
Utilise the kselftest harmness to implement tests for the guard page implementation. We start by implement basic tests asserting that guard pages can be installed, removed and that touching guard pages result in SIGSEGV. We also assert that, in removing guard pages from a range, non-guard pages remain intact. We then examine different operations on regions containing guard markers behave to ensure correct behaviour: * Operations over multiple VMAs operate as expected. * Invoking MADV_GUARD_INSTALL / MADV_GUARD_REMOVE via process_madvise() in batches works correctly. * Ensuring that munmap() correctly tears down guard markers. * Using mprotect() to adjust protection bits does not in any way override or cause issues with guard markers. * Ensuring that splitting and merging VMAs around guard markers causes no issue - i.e. that a marker which 'belongs' to one VMA can function just as well 'belonging' to another. * Ensuring that madvise(..., MADV_DONTNEED) and madvise(..., MADV_FREE) do not remove guard markers. * Ensuring that mlock()'ing a range containing guard markers does not cause issues. * Ensuring that mremap() can move a guard range and retain guard markers. * Ensuring that mremap() can expand a guard range and retain guard markers (perhaps moving the range). * Ensuring that mremap() can shrink a guard range and retain guard markers. * Ensuring that forking a process correctly retains guard markers. * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely. * Ensuring that lazyfree simply clears guard markers. * Ensuring that userfaultfd can co-exist with guard pages. * Ensuring that madvise(..., MADV_POPULATE_READ) and madvise(..., MADV_POPULATE_WRITE) error out when encountering guard markers. * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do not remove guard markers. If any test is unable to be run due to lack of permissions, that test is skipped. Link: https://lkml.kernel.org/r/c3dcca76b736bac0aeaf1dc085927536a253ac94.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabkba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11tools: testing: update tools UAPI header for mman-common.hLorenzo Stoakes
Import the new MADV_GUARD_INSTALL/REMOVE madvise flags. Link: https://lkml.kernel.org/r/ada462fa73fa1defc114242e446ab625b8290b71.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabkba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: madvise: implement lightweight guard page mechanismLorenzo Stoakes
Implement a new lightweight guard page feature, that is regions of userland virtual memory that, when accessed, cause a fatal signal to arise. Currently users must establish PROT_NONE ranges to achieve this. However this is very costly memory-wise - we need a VMA for each and every one of these regions AND they become unmergeable with surrounding VMAs. In addition repeated mmap() calls require repeated kernel context switches and contention of the mmap lock to install these ranges, potentially also having to unmap memory if installed over existing ranges. The lightweight guard approach eliminates the VMA cost altogether - rather than establishing a PROT_NONE VMA, it operates at the level of page table entries - establishing PTE markers such that accesses to them cause a fault followed by a SIGSGEV signal being raised. This is achieved through the PTE marker mechanism, which we have already extended to provide PTE_MARKER_GUARD, which we installed via the generic page walking logic which we have extended for this purpose. These guard ranges are established with MADV_GUARD_INSTALL. If the range in which they are installed contain any existing mappings, they will be zapped, i.e. free the range and unmap memory (thus mimicking the behaviour of MADV_DONTNEED in this respect). Any existing guard entries will be left untouched. There is therefore no nesting of guarded pages. Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both instances the memory range may be reused at which point a user would expect guards to still be in place), but they are cleared via MADV_GUARD_REMOVE, process teardown or unmapping of memory ranges. The guard property can be removed from ranges via MADV_GUARD_REMOVE. The ranges over which this is applied, should they contain non-guard entries, will be untouched, with only guard entries being cleared. We permit this operation on anonymous memory only, and only VMAs which are non-special, non-huge and not mlock()'d (if we permitted this we'd have to drop locked pages which would be rather counterintuitive). Racing page faults can cause repeated attempts to install guard pages that are interrupted, result in a zap, and this process can end up being repeated. If this happens more than would be expected in normal operation, we rescind locks and retry the whole thing, which avoids lock contention in this scenario. Link: https://lkml.kernel.org/r/6aafb5821bf209f277dfae0787abb2ef87a37542.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: add PTE_MARKER_GUARD PTE markerLorenzo Stoakes
Add a new PTE marker that results in any access causing the accessing process to segfault. This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range. This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination. Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single purpose was simply incorrect. We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or MADV_FREE. We additionally add a WARN_ON_ONCE() in hugetlb logic should a guard marker be encountered there, as we explicitly do not support this operation and this should not occur. Link: https://lkml.kernel.org/r/f47f3d5acca2dcf9bbf655b6d33f3dc713e4a4a0.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabkba@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: pagewalk: add the ability to install PTEsLorenzo Stoakes
Patch series "implement lightweight guard pages", v4. Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process. The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region. With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted. This series takes a different approach - an idea suggested by Vlastimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges. Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code. We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging. This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'. After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed. By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs. This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning. The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL which installs page table-level guard page markers - and MADV_GUARD_REMOVE - which clears them. Guard markers can be installed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the guard page markers in the page tables. There is no concept of 'nested' guard markers, multiple attempts to install guard markers in a range will, after the first attempt, have no effect. Importantly, removing guard markers over a range that contains both guard markers and ordinary backed memory has no effect on anything but the guard markers (including leaving huge pages un-split), so a user can safely remove guard markers over a range of memory leaving the rest intact. The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism. Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it. We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse). We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not remove guard markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics). It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the markers if they wish. We simply implement it in such a way as to provide the least surprising behaviour. An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of guard ranges. This patch (of 5): The existing generic pagewalk logic permits the walking of page tables, invoking callbacks at individual page table levels via user-provided mm_walk_ops callbacks. This is useful for traversing existing page table entries, but precludes the ability to establish new ones. Existing mechanism for performing a walk which also installs page table entries if necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere. Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback in mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal. If a THP huge page is encountered at either the PMD or PUD level we split it only if there are ops->pte_entry() (or ops->pmd_entry at PUD level), otherwise if there is only an ops->install_pte(), we avoid the unnecessary split. We do not support hugetlb at this stage. If this function returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges. If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range. If install_pte() is not defined, then there is no functional difference to this patch, so all existing logic will work precisely as it did before. As we only permit the installation of PTEs where a mapping does not already exist there is no need for TLB management, however we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs. We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function exposed only to mm/internal.h. We take the opportunity to additionally clean up the page walker logic to be a little easier to follow. Link: https://lkml.kernel.org/r/cover.1730123433.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/51b432ebef013e3fdf9f92101533435de1bffadf.1730123433.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Zankel <chris@zankel.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Vlastimil Babka <vbabkba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11kasan: delete CONFIG_KASAN_MODULE_TESTSabyrzhan Tasbolatov
Since we've migrated all tests to the KUnit framework, we can delete CONFIG_KASAN_MODULE_TEST and mentioning of it in the documentation as well. I've used the online translator to modify the non-English documentation. [snovitoll@gmail.com: fix indentation in translation] Link: https://lkml.kernel.org/r/20241020042813.3223449-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20241016131802.3115788-4-snovitoll@gmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11kasan: migrate copy_user_test to kunitSabyrzhan Tasbolatov
Migrate the copy_user_test to the KUnit framework to verify out-of-bound detection via KASAN reports in copy_from_user(), copy_to_user() and their static functions. This is the last migrated test in kasan_test_module.c, therefore delete the file. [arnd@arndb.de: export copy_to_kernel_nofault] Link: https://lkml.kernel.org/r/20241018151112.3533820-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20241016131802.3115788-3-snovitoll@gmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11kasan: move checks to do_strncpy_from_userSabyrzhan Tasbolatov
Patch series "kasan: migrate the last module test to kunit", v4. copy_user_test() is the last KUnit-incompatible test with CONFIG_KASAN_MODULE_TEST requirement, which we are going to migrate to KUnit framework and delete the former test and Kconfig as well. In this patch series: - [1/3] move kasan_check_write() and check_object_size() to do_strncpy_from_user() to cover with KASAN checks with multiple conditions in strncpy_from_user(). - [2/3] migrated copy_user_test() to KUnit, where we can also test strncpy_from_user() due to [1/4]. KUnits have been tested on: - x86_64 with CONFIG_KASAN_GENERIC. Passed - arm64 with CONFIG_KASAN_SW_TAGS. 1 fail. See [1] - arm64 with CONFIG_KASAN_HW_TAGS. 1 fail. See [1] [1] https://lore.kernel.org/linux-mm/CACzwLxj21h7nCcS2-KA_q7ybe+5pxH0uCDwu64q_9pPsydneWQ@mail.gmail.com/ - [3/3] delete CONFIG_KASAN_MODULE_TEST and documentation occurrences. This patch (of 3): Since in the commit 2865baf54077("x86: support user address masking instead of non-speculative conditional") do_strncpy_from_user() is called from multiple places, we should sanitize the kernel *dst memory and size which were done in strncpy_from_user() previously. Link: https://lkml.kernel.org/r/20241016131802.3115788-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20241016131802.3115788-2-snovitoll@gmail.com Fixes: 2865baf54077 ("x86: support user address masking instead of non-speculative conditional") Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: add per-order mTHP swpin countersBarry Song
This helps profile the sizes of folios being swapped in. Currently, only mTHP swap-out is being counted. The new interface can be found at: /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats swpin For example, cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpin 12809 cat /sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpin 4763 [v-songbaohua@oppo.com: add a blank line in doc] Link: https://lkml.kernel.org/r/20241030233423.80759-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20241026082423.26298-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: zswap: zswap_store_page() will initialize entry after adding to xarray.Kanchana P Sridhar
This incorporates Yosry's suggestions in [1] for further simplifying zswap_store_page(). If the page is successfully compressed and added to the xarray, we get the pool/objcg refs, and initialize all the entry's members. Only after this, we add it to the zswap LRU. In the time between the entry's addition to the xarray and it's member initialization, we are protected against concurrent stores/loads/swapoff through the folio lock, and are protected against writeback because the entry is not on the LRU yet. This way, we don't have to drop the pool/objcg refs, now that the entry initialization is centralized to the successful page store code path. zswap_compress() is modified to take a zswap_pool parameter in keeping with this simplification (as against obtaining this from entry->pool). [1]: https://lore.kernel.org/all/CAJD7tkZh6ufHQef5HjXf_F5b5LC1EATexgseD=4WvrO+a6Ni6w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20241002173329.213722-1-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: swap: count successful large folio zswap stores in hugepage zswpout statsKanchana P Sridhar
Added a new MTHP_STAT_ZSWPOUT entry to the sysfs transparent_hugepage stats so that successful large folio zswap stores can be accounted under the per-order sysfs "zswpout" stats: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout Other non-zswap swap device swap-out events will be counted under the existing sysfs "swpout" stats: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout Also, added documentation for the newly added sysfs per-order hugepage "zswpout" stats. The documentation clarifies that only non-zswap swapouts will be accounted in the existing "swpout" stats. Link: https://lkml.kernel.org/r/20241001053222.6944-8-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: zswap: support large folios in zswap_store()Kanchana P Sridhar
This series enables zswap_store() to accept and store large folios. The most significant contribution in this series is from the earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been migrated to mm-unstable as of 9-30-2024 in patch 6 of this series, and adapted based on code review comments received for the current patch-series. [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u The first few patches do the prep work for supporting large folios in zswap_store. Patch 6 provides the main functionality to swap-out large folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout" counters that get incremented upon successful zswap_store of large folios, and also updates the documentation for this: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout This series is a pre-requisite for zswap compress batching of large folio swap-out and decompress batching of swap-ins based on swapin_readahead(), using Intel IAA hardware acceleration, which we would like to submit in subsequent patch-series, with performance improvement data. Thanks to Ying Huang for pre-posting review feedback and suggestions! Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama, Ying and Matthew for their helpful feedback, code/data reviews and suggestions! I would like to thank Ryan Roberts for his original RFC [1]. System setup for testing: ========================= Testing of this series was done with mm-unstable as of 9-27-2024, commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000 (without this patch-series) and mm-unstable 9-30-2024 commit c121617e3606be6575cdacfdb63cc8d67b46a568 (with this patch-series). Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk partition swap. Core frequency was fixed at 2500MHz. The vm-scalability "usemem" test was run in a cgroup whose memory.high was fixed at 150G. The is no swap limit set for the cgroup. 30 usemem processes were run, each allocating and writing 10G of memory, and sleeping for 10 sec before exiting: usemem --init-time -w -O -s 10 -n 30 10g Other kernel configuration parameters: zswap compressors : zstd, deflate-iaa zswap allocator : zsmalloc vm.page-cluster : 2 In the experiments where "deflate-iaa" is used as the zswap compressor, IAA "compression verification" is enabled by default (cat /sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA compression will be decompressed internally by the "iaa_crypto" driver, the crc-s returned by the hardware will be compared and errors reported in case of mismatches. Thus "deflate-iaa" helps ensure better data integrity as compared to the software compressors, and the experimental data listed below is with verify_compress set to "1". Metrics reporting methodology: ============================== Total and average throughput are derived from the individual 30 processes' throughputs reported by usemem. elapsed/sys times are measured with perf. All percentage changes are "new" vs. "old"; hence a positive value denotes an increase in the metric, whether it is throughput or latency, and a negative value denotes a reduction in the metric. Positive throughput change percentages and negative latency change percentages denote improvements. The vm stats and sysfs hugepages stats included with the performance data provide details on the swapout activity to zswap/swap device. Testing labels used in data summaries: ====================================== The data refers to these test configurations and the before/after comparisons that they do: before-case1: ------------- mm-unstable 9-27-2024, CONFIG_THP_SWAP=N (compares zswap 4K vs. zswap 64K) In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios to be split into 4K folios that get processed by zswap. before-case2: ------------- mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios vs. zswap large folios) In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large folios, which will then be stored by the SSD swap device. after: ------ v10 of this patch-series, CONFIG_THP_SWAP=Y The "after" is CONFIG_THP_SWAP=Y and v10 of this patch-series, that results in 64K/2M folios to not be split, and to be processed by zswap_store. Regression Testing: =================== I ran vm-scalability usemem without large folios, i.e., only 4K folios with mm-unstable and this patch-series. The main goal was to make sure that there is no functional or performance regression wrt the earlier zswap behavior for 4K folios, now that 4K folios will be processed by the new zswap_store() code. The data indicates there is no significant regression. ------------------------------------------------------------------------------- 4K folios: ========== zswap compressor zstd zstd zstd zstd v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 4,793,363 4,880,978 4,853,074 1% -1% Average throughput (KB/s) 159,778 162,699 161,769 1% -1% elapsed time (sec) 130.14 123.17 126.29 -3% 3% sys time (sec) 3,135.53 2,985.64 3,083.18 -2% 3% memcg_high 446,826 444,626 452,930 memcg_swap_fail 0 0 0 zswpout 48,932,107 48,931,971 48,931,820 zswpin 383 386 397 pswpout 0 0 0 pswpin 0 0 0 thp_swpout 0 0 0 thp_swpout_fallback 0 0 0 64kB-mthp_swpout_fallback 0 0 0 pgmajfault 3,063 3,077 3,479 swap_ra 93 94 96 swap_ra_hit 47 47 50 ZSWPOUT-64kB n/a n/a 0 SWPOUT-64kB 0 0 0 ------------------------------------------------------------------------------- Performance Testing: ==================== We list the data for 64K folios with before/after data per-compressor, followed by the same for 2M pmd-mappable folios. ------------------------------------------------------------------------------- 64K folios: zstd: ================= zswap compressor zstd zstd zstd zstd v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 5,222,213 1,076,611 6,159,776 18% 472% Average throughput (KB/s) 174,073 35,887 205,325 18% 472% elapsed time (sec) 120.50 347.16 108.33 -10% -69% sys time (sec) 2,930.33 248.16 2,549.65 -13% 927% memcg_high 416,773 552,200 465,874 memcg_swap_fail 3,192,906 1,293 1,012 zswpout 48,931,583 20,903 48,931,218 zswpin 384 363 410 pswpout 0 40,778,448 0 pswpin 0 16 0 thp_swpout 0 0 0 thp_swpout_fallback 0 0 0 64kB-mthp_swpout_fallback 3,192,906 1,293 1,012 pgmajfault 3,452 3,072 3,061 swap_ra 90 87 107 swap_ra_hit 42 43 57 ZSWPOUT-64kB n/a n/a 3,057,173 SWPOUT-64kB 0 2,548,653 0 ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- 64K folios: deflate-iaa: ======================== zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 5,652,608 1,089,180 7,189,778 27% 560% Average throughput (KB/s) 188,420 36,306 239,659 27% 560% elapsed time (sec) 102.90 343.35 87.05 -15% -75% sys time (sec) 2,246.86 213.53 1,864.16 -17% 773% memcg_high 576,104 502,907 642,083 memcg_swap_fail 4,016,117 1,407 1,478 zswpout 61,163,423 22,444 57,798,716 zswpin 401 368 454 pswpout 0 40,862,080 0 pswpin 0 20 0 thp_swpout 0 0 0 thp_swpout_fallback 0 0 0 64kB-mthp_swpout_fallback 4,016,117 1,407 1,478 pgmajfault 3,063 3,153 3,122 swap_ra 96 93 156 swap_ra_hit 46 45 83 ZSWPOUT-64kB n/a n/a 3,611,032 SWPOUT-64kB 0 2,553,880 0 ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- 2M folios: zstd: ================ zswap compressor zstd zstd zstd zstd v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 5,895,500 1,109,694 6,484,224 10% 484% Average throughput (KB/s) 196,516 36,989 216,140 10% 484% elapsed time (sec) 108.77 334.28 106.33 -2% -68% sys time (sec) 2,657.14 94.88 2,376.13 -11% 2404% memcg_high 64,200 66,316 56,898 memcg_swap_fail 101,182 70 27 zswpout 48,931,499 36,507 48,890,640 zswpin 380 379 377 pswpout 0 40,166,400 0 pswpin 0 0 0 thp_swpout 0 78,450 0 thp_swpout_fallback 101,182 70 27 2MB-mthp_swpout_fallback 0 0 27 pgmajfault 3,067 3,417 3,311 swap_ra 91 90 854 swap_ra_hit 45 45 810 ZSWPOUT-2MB n/a n/a 95,459 SWPOUT-2MB 0 78,450 0 ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- 2M folios: deflate-iaa: ======================= zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 6,286,587 1,126,785 7,073,464 13% 528% Average throughput (KB/s) 209,552 37,559 235,782 13% 528% elapsed time (sec) 96.19 333.03 85.79 -11% -74% sys time (sec) 2,141.44 99.96 1,826.67 -15% 1727% memcg_high 99,253 64,666 79,718 memcg_swap_fail 129,074 53 165 zswpout 61,312,794 28,321 56,045,120 zswpin 383 406 403 pswpout 0 40,048,128 0 pswpin 0 0 0 thp_swpout 0 78,219 0 thp_swpout_fallback 129,074 53 165 2MB-mthp_swpout_fallback 0 0 165 pgmajfault 3,430 3,077 31,468 swap_ra 91 103 84,373 swap_ra_hit 47 46 84,317 ZSWPOUT-2MB n/a n/a 109,229 SWPOUT-2MB 0 78,219 0 ------------------------------------------------------------------------------- And finally, this is a comparison of deflate-iaa vs. zstd with v10 of this patch-series: --------------------------------------------- zswap_store large folios v10 Impr w/ deflate-iaa vs. zstd 64K folios 2M folios --------------------------------------------- Throughput (KB/s) 17% 9% elapsed time (sec) -20% -19% sys time (sec) -27% -23% --------------------------------------------- Conclusions based on the performance results: ============================================= v10 wrt before-case1: --------------------- We see significant improvements in throughput, elapsed and sys time for zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after (THP_SWAP=Y) with zswap_store large folios. v10 wrt before-case2: --------------------- We see even more significant improvements in throughput and elapsed time for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD) vs. after (large-folio-zswap). The sys time increases with large-folio-zswap as expected, due to the CPU compression time vs. asynchronous disk write times, as pointed out by Ying and Yosry. In before-case2, when zswap does not store large folios, only allocations and cgroup charging due to 4K folio zswap stores count towards the cgroup memory limit. However, in the after scenario, with the introduction of zswap_store() of large folios, there is an added component of the zswap compressed pool usage from large folio stores from potentially all 30 processes, that gets counted towards the memory limit. As a result, we see higher swapout activity in the "after" data. Summary: ======== The v10 data presented above shows that zswap_store of large folios demonstrates good throughput/performance improvements compared to conventional SSD swap of large folios with a sufficiently large 525G SSD swap device. Hence, it seems reasonable for zswap_store to support large folios, so that further performance improvements can be implemented. In the experimental setup used in this patchset, we have enabled IAA compress verification to ensure additional hardware data integrity CRC checks not currently done by the software compressors. We see good throughput/latency improvements with deflate-iaa vs. zstd with zswap_store of large folios. Some of the ideas for further reducing latency that have shown promise in our experiments, are: 1) IAA compress/decompress batching. 2) Distributing compress jobs across all IAA devices on the socket. The tests run for this patchset are using only 1 IAA device per core, that avails of 2 compress engines on the device. In our experiments with IAA batching, we distribute compress jobs from all cores to the 8 compress engines available per socket. We further compress the pages in each folio in parallel in the accelerator. As a result, we improve compress latency and reclaim throughput. In decompress batching, we use swapin_readahead to generate a prefetch batch of 4K folios that we decompress in parallel in IAA. ------------------------------------------------------------------------------ IAA compress/decompress batching Further improvements wrt v10 zswap_store Sequential subpage store using "deflate-iaa": "deflate-iaa" Batching "deflate-iaa-canned" [2] Batching Additional Impr Additional Impr 64K folios 2M folios 64K folios 2M folios ------------------------------------------------------------------------------ Throughput (KB/s) 19% 43% 26% 55% elapsed time (sec) -5% -14% -10% -21% sys time (sec) 4% -7% -4% -18% ------------------------------------------------------------------------------ With zswap IAA compress/decompress batching, we are able to demonstrate significant performance improvements and memory savings in server scalability experiments in highly contended system scenarios under significant memory pressure; as compared to software compressors. We hope to submit this work in subsequent patch series. The current patch-series is a prequisite for these future submissions. This patch (of 7): zswap_store() will store large folios by compressing them page by page. This patch provides a sequential implementation of storing a large folio in zswap_store() by iterating through each page in the folio to compress and store it in the zswap zpool. zswap_store() calls the newly added zswap_store_page() function for each page in the folio. zswap_store_page() handles compressing and storing each page. We check the global and per-cgroup limits once at the beginning of zswap_store(), and only check that the limit is not reached yet. This is racy and inaccurate, but it should be sufficient for now. We also obtain initial references to the relevant objcg and pool to guarantee that subsequent references can be acquired by zswap_store_page(). A new function zswap_pool_get() is added to facilitate this. If these one-time checks pass, we compress the pages of the folio, while maintaining a running count of compressed bytes for all the folio's pages. If all pages are successfully compressed and stored, we do the cgroup zswap charging with the total compressed bytes, and batch update the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once, before returning from zswap_store(). If an error is encountered during the store of any page in the folio, all pages in that folio currently stored in zswap will be invalidated. Thus, a folio is either entirely stored in zswap, or entirely not stored in zswap. The most important value provided by this patch is it enables swapping out large folios to zswap without splitting them. Furthermore, it batches some operations while doing so (cgroup charging, stats updates). This patch also forms the basis for building compress batching of pages in a large folio in zswap_store() by compressing up to say, 8 pages of the folio in parallel in hardware using the Intel In-Memory Analytics Accelerator (Intel IAA). This change reuses and adapts the functionality in Ryan Roberts' RFC patch [1]: "[RFC,v1] mm: zswap: Store large folios without splitting" [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u Link: https://lkml.kernel.org/r/20241001053222.6944-1-kanchana.p.sridhar@intel.com Link: https://lkml.kernel.org/r/20241001053222.6944-7-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Originally-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: zswap: modify zswap_stored_pages to be atomic_long_tKanchana P Sridhar
For zswap_store() to support large folios, we need to be able to do a batch update of zswap_stored_pages upon successful store of all pages in the folio. For this, we need to add folio_nr_pages(), which returns a long, to zswap_stored_pages. Link: https://lkml.kernel.org/r/20241001053222.6944-6-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: zswap: rename zswap_pool_get() to zswap_pool_tryget()Kanchana P Sridhar
Modify the name of the existing zswap_pool_get() to zswap_pool_tryget() to be representative of the call it makes to percpu_ref_tryget(). A subsequent patch will introduce a new zswap_pool_get() that calls percpu_ref_get(). The intent behind this change is for higher level zswap API such as zswap_store() to call zswap_pool_tryget() to check upfront if the pool's refcount is "0" (which means it could be getting destroyed) and to handle this as an error condition. zswap_store() would proceed only if zswap_pool_tryget() returns success, and any additional pool refcounts that need to be obtained for compressing sub-pages in a large folio could simply call zswap_pool_get(). Link: https://lkml.kernel.org/r/20241001053222.6944-4-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: zswap: modify zswap_compress() to accept a page instead of a folioKanchana P Sridhar
For zswap_store() to be able to store a large folio by compressing it one page at a time, zswap_compress() needs to accept a page as input. This will allow us to iterate through each page in the folio in zswap_store(), compress it and store it in the zpool. Link: https://lkml.kernel.org/r/20241001053222.6944-3-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm: define obj_cgroup_get() if CONFIG_MEMCG is not definedKanchana P Sridhar
Patch series "mm: zswap swap-out of large folios", v10. This patch series enables zswap_store() to accept and store large folios. The most significant contribution in this series is from the earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been migrated to mm-unstable as of 9-30-2024 in patch 6 of this series, and adapted based on code review comments received for the current patch-series. [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u The first few patches do the prep work for supporting large folios in zswap_store. Patch 6 provides the main functionality to swap-out large folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout" counters that get incremented upon successful zswap_store of large folios, and also updates the documentation for this: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout This patch series is a prerequisite for zswap compress batching of large folio swap-out and decompress batching of swap-ins based on swapin_readahead(), using Intel IAA hardware acceleration, which we would like to submit in subsequent patch-series, with performance improvement data. Thanks to Ying Huang for pre-posting review feedback and suggestions! Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama, Ying and Matthew for their helpful feedback, code/data reviews and suggestions! Co-development signoff request: =============================== I would like to thank Ryan Roberts for his original RFC [1] and request his co-developer signoff on patch 6 in this series. Thanks Ryan! System setup for testing: ========================= Testing of this patch series was done with mm-unstable as of 9-27-2024, commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000 (without this patch-series) and mm-unstable 9-30-2024 commit c121617e3606be6575cdacfdb63cc8d67b46a568 (with this patch-series). Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk partition swap. Core frequency was fixed at 2500MHz. The vm-scalability "usemem" test was run in a cgroup whose memory.high was fixed at 150G. The is no swap limit set for the cgroup. 30 usemem processes were run, each allocating and writing 10G of memory, and sleeping for 10 sec before exiting: usemem --init-time -w -O -s 10 -n 30 10g Other kernel configuration parameters: zswap compressors : zstd, deflate-iaa zswap allocator : zsmalloc vm.page-cluster : 2 In the experiments where "deflate-iaa" is used as the zswap compressor, IAA "compression verification" is enabled by default (cat /sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA compression will be decompressed internally by the "iaa_crypto" driver, the crc-s returned by the hardware will be compared and errors reported in case of mismatches. Thus "deflate-iaa" helps ensure better data integrity as compared to the software compressors, and the experimental data listed below is with verify_compress set to "1". Metrics reporting methodology: ============================== Total and average throughput are derived from the individual 30 processes' throughputs reported by usemem. elapsed/sys times are measured with perf. All percentage changes are "new" vs. "old"; hence a positive value denotes an increase in the metric, whether it is throughput or latency, and a negative value denotes a reduction in the metric. Positive throughput change percentages and negative latency change percentages denote improvements. The vm stats and sysfs hugepages stats included with the performance data provide details on the swapout activity to zswap/swap device. Testing labels used in data summaries: ====================================== The data refers to these test configurations and the before/after comparisons that they do: before-case1: ------------- mm-unstable 9-27-2024, CONFIG_THP_SWAP=N (compares zswap 4K vs. zswap 64K) In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios to be split into 4K folios that get processed by zswap. before-case2: ------------- mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios vs. zswap large folios) In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large folios, which will then be stored by the SSD swap device. after: ------ v10 of this patch-series, CONFIG_THP_SWAP=Y The "after" is CONFIG_THP_SWAP=Y and v10 of this patch-series, that results in 64K/2M folios to not be split, and to be processed by zswap_store. Regression Testing: =================== I ran vm-scalability usemem without large folios, i.e., only 4K folios with mm-unstable and this patch-series. The main goal was to make sure that there is no functional or performance regression wrt the earlier zswap behavior for 4K folios, now that 4K folios will be processed by the new zswap_store() code. The data indicates there is no significant regression. ------------------------------------------------------------------------------- 4K folios: ========== zswap compressor zstd zstd zstd zstd v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 4,793,363 4,880,978 4,853,074 1% -1% Average throughput (KB/s) 159,778 162,699 161,769 1% -1% elapsed time (sec) 130.14 123.17 126.29 -3% 3% sys time (sec) 3,135.53 2,985.64 3,083.18 -2% 3% memcg_high 446,826 444,626 452,930 memcg_swap_fail 0 0 0 zswpout 48,932,107 48,931,971 48,931,820 zswpin 383 386 397 pswpout 0 0 0 pswpin 0 0 0 thp_swpout 0 0 0 thp_swpout_fallback 0 0 0 64kB-mthp_swpout_fallback 0 0 0 pgmajfault 3,063 3,077 3,479 swap_ra 93 94 96 swap_ra_hit 47 47 50 ZSWPOUT-64kB n/a n/a 0 SWPOUT-64kB 0 0 0 ------------------------------------------------------------------------------- Performance Testing: ==================== We list the data for 64K folios with before/after data per-compressor, followed by the same for 2M pmd-mappable folios. ------------------------------------------------------------------------------- 64K folios: zstd: ================= zswap compressor zstd zstd zstd zstd v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 5,222,213 1,076,611 6,159,776 18% 472% Average throughput (KB/s) 174,073 35,887 205,325 18% 472% elapsed time (sec) 120.50 347.16 108.33 -10% -69% sys time (sec) 2,930.33 248.16 2,549.65 -13% 927% memcg_high 416,773 552,200 465,874 memcg_swap_fail 3,192,906 1,293 1,012 zswpout 48,931,583 20,903 48,931,218 zswpin 384 363 410 pswpout 0 40,778,448 0 pswpin 0 16 0 thp_swpout 0 0 0 thp_swpout_fallback 0 0 0 64kB-mthp_swpout_fallback 3,192,906 1,293 1,012 pgmajfault 3,452 3,072 3,061 swap_ra 90 87 107 swap_ra_hit 42 43 57 ZSWPOUT-64kB n/a n/a 3,057,173 SWPOUT-64kB 0 2,548,653 0 ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- 64K folios: deflate-iaa: ======================== zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 5,652,608 1,089,180 7,189,778 27% 560% Average throughput (KB/s) 188,420 36,306 239,659 27% 560% elapsed time (sec) 102.90 343.35 87.05 -15% -75% sys time (sec) 2,246.86 213.53 1,864.16 -17% 773% memcg_high 576,104 502,907 642,083 memcg_swap_fail 4,016,117 1,407 1,478 zswpout 61,163,423 22,444 57,798,716 zswpin 401 368 454 pswpout 0 40,862,080 0 pswpin 0 20 0 thp_swpout 0 0 0 thp_swpout_fallback 0 0 0 64kB-mthp_swpout_fallback 4,016,117 1,407 1,478 pgmajfault 3,063 3,153 3,122 swap_ra 96 93 156 swap_ra_hit 46 45 83 ZSWPOUT-64kB n/a n/a 3,611,032 SWPOUT-64kB 0 2,553,880 0 ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- 2M folios: zstd: ================ zswap compressor zstd zstd zstd zstd v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 5,895,500 1,109,694 6,484,224 10% 484% Average throughput (KB/s) 196,516 36,989 216,140 10% 484% elapsed time (sec) 108.77 334.28 106.33 -2% -68% sys time (sec) 2,657.14 94.88 2,376.13 -11% 2404% memcg_high 64,200 66,316 56,898 memcg_swap_fail 101,182 70 27 zswpout 48,931,499 36,507 48,890,640 zswpin 380 379 377 pswpout 0 40,166,400 0 pswpin 0 0 0 thp_swpout 0 78,450 0 thp_swpout_fallback 101,182 70 27 2MB-mthp_swpout_fallback 0 0 27 pgmajfault 3,067 3,417 3,311 swap_ra 91 90 854 swap_ra_hit 45 45 810 ZSWPOUT-2MB n/a n/a 95,459 SWPOUT-2MB 0 78,450 0 ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- 2M folios: deflate-iaa: ======================= zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v10 before-case1 before-case2 after vs. vs. case1 case2 ------------------------------------------------------------------------------- Total throughput (KB/s) 6,286,587 1,126,785 7,073,464 13% 528% Average throughput (KB/s) 209,552 37,559 235,782 13% 528% elapsed time (sec) 96.19 333.03 85.79 -11% -74% sys time (sec) 2,141.44 99.96 1,826.67 -15% 1727% memcg_high 99,253 64,666 79,718 memcg_swap_fail 129,074 53 165 zswpout 61,312,794 28,321 56,045,120 zswpin 383 406 403 pswpout 0 40,048,128 0 pswpin 0 0 0 thp_swpout 0 78,219 0 thp_swpout_fallback 129,074 53 165 2MB-mthp_swpout_fallback 0 0 165 pgmajfault 3,430 3,077 31,468 swap_ra 91 103 84,373 swap_ra_hit 47 46 84,317 ZSWPOUT-2MB n/a n/a 109,229 SWPOUT-2MB 0 78,219 0 ------------------------------------------------------------------------------- And finally, this is a comparison of deflate-iaa vs. zstd with v10 of this patch-series: --------------------------------------------- zswap_store large folios v10 Impr w/ deflate-iaa vs. zstd 64K folios 2M folios --------------------------------------------- Throughput (KB/s) 17% 9% elapsed time (sec) -20% -19% sys time (sec) -27% -23% --------------------------------------------- Conclusions based on the performance results: ============================================= v10 wrt before-case1: --------------------- We see significant improvements in throughput, elapsed and sys time for zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after (THP_SWAP=Y) with zswap_store large folios. v10 wrt before-case2: --------------------- We see even more significant improvements in throughput and elapsed time for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD) vs. after (large-folio-zswap). The sys time increases with large-folio-zswap as expected, due to the CPU compression time vs. asynchronous disk write times, as pointed out by Ying and Yosry. In before-case2, when zswap does not store large folios, only allocations and cgroup charging due to 4K folio zswap stores count towards the cgroup memory limit. However, in the after scenario, with the introduction of zswap_store() of large folios, there is an added component of the zswap compressed pool usage from large folio stores from potentially all 30 processes, that gets counted towards the memory limit. As a result, we see higher swapout activity in the "after" data. Summary: ======== The v10 data presented above shows that zswap_store of large folios demonstrates good throughput/performance improvements compared to conventional SSD swap of large folios with a sufficiently large 525G SSD swap device. Hence, it seems reasonable for zswap_store to support large folios, so that further performance improvements can be implemented. In the experimental setup used in this patchset, we have enabled IAA compress verification to ensure additional hardware data integrity CRC checks not currently done by the software compressors. We see good throughput/latency improvements with deflate-iaa vs. zstd with zswap_store of large folios. Some of the ideas for further reducing latency that have shown promise in our experiments, are: 1) IAA compress/decompress batching. 2) Distributing compress jobs across all IAA devices on the socket. The tests run for this patchset are using only 1 IAA device per core, that avails of 2 compress engines on the device. In our experiments with IAA batching, we distribute compress jobs from all cores to the 8 compress engines available per socket. We further compress the pages in each folio in parallel in the accelerator. As a result, we improve compress latency and reclaim throughput. In decompress batching, we use swapin_readahead to generate a prefetch batch of 4K folios that we decompress in parallel in IAA. ------------------------------------------------------------------------------ IAA compress/decompress batching Further improvements wrt v10 zswap_store Sequential subpage store using "deflate-iaa": "deflate-iaa" Batching "deflate-iaa-canned" [2] Batching Additional Impr Additional Impr 64K folios 2M folios 64K folios 2M folios ------------------------------------------------------------------------------ Throughput (KB/s) 19% 43% 26% 55% elapsed time (sec) -5% -14% -10% -21% sys time (sec) 4% -7% -4% -18% ------------------------------------------------------------------------------ With zswap IAA compress/decompress batching, we are able to demonstrate significant performance improvements and memory savings in server scalability experiments in highly contended system scenarios under significant memory pressure; as compared to software compressors. We hope to submit this work in subsequent patch series. The current patch-series is a prequisite for these future submissions. [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u [2] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/ This patch (of 6): This resolves an issue with obj_cgroup_get() not being defined if CONFIG_MEMCG is not defined. Before this patch, we would see build errors if obj_cgroup_get() is called from code that is agnostic of CONFIG_MEMCG. The zswap_store() changes for large folios in subsequent commits will require the use of obj_cgroup_get() in zswap code that falls into this category. Link: https://lkml.kernel.org/r/20241001053222.6944-1-kanchana.p.sridhar@intel.com Link: https://lkml.kernel.org/r/20241001053222.6944-2-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton
Pick up e7ac4daeed91 ("mm: count zeromap read and set for swapout and swapin") in order to move mm: define obj_cgroup_get() if CONFIG_MEMCG is not defined mm: zswap: modify zswap_compress() to accept a page instead of a folio mm: zswap: rename zswap_pool_get() to zswap_pool_tryget() mm: zswap: modify zswap_stored_pages to be atomic_long_t mm: zswap: support large folios in zswap_store() mm: swap: count successful large folio zswap stores in hugepage zswpout stats mm: zswap: zswap_store_page() will initialize entry after adding to xarray. mm: add per-order mTHP swpin counters from mm-unstable into mm-stable.
2024-11-11mm: count zeromap read and set for swapout and swapinBarry Song
When the proportion of folios from the zeromap is small, missing their accounting may not significantly impact profiling. However, it's easy to construct a scenario where this becomes an issue—for example, allocating 1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out and swap-in counts seem to vanish into a black hole, potentially causing semantic ambiguity. On the other hand, Usama reported that zero-filled pages can exceed 10% in workloads utilizing zswap, while Hailong noted that some app in Android have more than 6% zero-filled pages. Before commit 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap"), both zswap and zRAM implemented similar optimizations, leading to these optimized-out pages being counted in either zswap or zRAM counters (with pswpin/pswpout also increasing for zRAM). With zeromap functioning prior to both zswap and zRAM, userspace will no longer detect these swap-out and swap-in actions. We have three ways to address this: 1. Introduce a dedicated counter specifically for the zeromap. 2. Use pswpin/pswpout accounting, treating the zero map as a standard backend. This approach aligns with zRAM's current handling of same-page fills at the device level. However, it would mean losing the optimized-out page counters previously available in zRAM and would not align with systems using zswap. Additionally, as noted by Nhat Pham, pswpin/pswpout counters apply only to I/O done directly to the backend device. 3. Count zeromap pages under zswap, aligning with system behavior when zswap is enabled. However, this would not be consistent with zRAM, nor would it align with systems lacking both zswap and zRAM. Given the complications with options 2 and 3, this patch selects option 1. We can find these counters from /proc/vmstat (counters for the whole system) and memcg's memory.stat (counters for the interested memcg). For example: $ grep -E 'swpin_zero|swpout_zero' /proc/vmstat swpin_zero 1648 swpout_zero 33536 $ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat swpin_zero 3905 swpout_zero 3985 This patch does not address any specific zeromap bug, but the missing swpout and swpin counts for zero-filled pages can be highly confusing and may mislead user-space agents that rely on changes in these counters as indicators. Therefore, we add a Fixes tag to encourage the inclusion of this counter in any kernel versions with zeromap. Many thanks to Kanchana for the contribution of changing count_objcg_event() to count_objcg_events() to support large folios[1], which has now been incorporated into this patch. [1] https://lkml.kernel.org/r/20241001053222.6944-5-kanchana.p.sridhar@intel.com Link: https://lkml.kernel.org/r/20241107011246.59137-1-21cnbao@gmail.com Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap") Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Hailong Liu <hailong.liu@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Andi Kleen <ak@linux.intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm/damon/tests/dbgfs-kunit: fix the header double inclusion guarding ifdef ↵SeongJae Park
comment Closing part of double inclusion guarding macro for dbgfs-kunit.h was copy-pasted from somewhere (maybe before the initial mainline merge of DAMON), and not properly updated. Fix it. Link: https://lkml.kernel.org/r/20241028233058.283381-7-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Andrew Paniakin <apanyaki@amazon.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm/damon/Kconfig: update DBGFS_KUNIT prompt copy for SYSFS_KUNITSeongJae Park
CONFIG_DAMON_SYSFS_KUNIT_TEST prompt is copied from that for DAMON debugfs interface kunit tests, and not correctly updated. Fix it. Link: https://lkml.kernel.org/r/20241028233058.283381-6-sj@kernel.org Fixes: b8ee5575f763 ("mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Andrew Paniakin <apanyaki@amazon.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07selftests/damon/debugfs_duplicate_context_creation: hide errors from ↵SeongJae Park
expected file write failures debugfs_duplicate_context_creation.sh does an invalid file write to ensure it fails. Check of the failure is sufficient, so the error message from the failure only makes the output unnecessarily noisy. Hide it. Link: https://lkml.kernel.org/r/20241028233058.283381-5-sj@kernel.org Fixes: ade38b8ca5ce ("selftest/damon: add a test for duplicate context dirs creation") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Andrew Paniakin <apanyaki@amazon.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07selftests/damon/_debugfs_common: hide expected error message from ↵SeongJae Park
test_write_result() DAMON debugfs interface selftests use test_write_result() to check if valid or invalid writes to files of the interface success or fail as expected. File write error messages from expected failures are only making the output noisy. Hide such expected error messages. Link: https://lkml.kernel.org/r/20241028233058.283381-4-sj@kernel.org Fixes: b348eb7abd09 ("mm/damon: add user space selftests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Andrew Paniakin <apanyaki@amazon.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07selftests/damon/huge_count_read_write: remove unnecessary debugging messageSeongJae Park
The program prints expected errors from write/read of the files with invalid huge count, for only debugging purpose. It is only making the output noisy. Remove those. Link: https://lkml.kernel.org/r/20241028233058.283381-3-sj@kernel.org Fixes: b4a002889d24 ("selftests/damon: test debugfs file reads/writes with huge count") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Andrew Paniakin <apanyaki@amazon.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07selftests/damon/huge_count_read_write: provide sufficiently large buffer for ↵Andrew Paniakin
DEPRECATED file read Patch series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests". Fixup small broken window panes in DAMON selftests and kunit tests. First four patches clean up DAMON debugfs interface selftests output, by fixing segmentation fault of a test program (patch 1), removing unnecessary debugging messages (patch 2), and hiding error messages from expected failures (patches 3 and 4). Following two patches fix copy-paste mistakes in DAMON Kconfig help message that copied from debugfs kunit test (patch 5) and a comment on the debugfs kunit test code (patch 6). This patch (of 6): 'huge_count_read_write' crashes with segmentation fault when reading DEPRECATED file of DAMON debugfs interface. This is not causing any problem for users or other tests because the purpose of the test is just ensuring the read is not causing kernel warning messages. Nonetheless, it makes the output unnecessarily noisy, and the DEPRECATED file is not properly being tested. It happens because the size of the content of the file is larger than the size of the buffer for the read. The file contains about 170 characters. Increase the buffer size to 256 characters. Link: https://lkml.kernel.org/r/20241028233058.283381-1-sj@kernel.org Link: https://lkml.kernel.org/r/20241028233058.283381-2-sj@kernel.org Fixes: b4a002889d24 ("selftests/damon: test debugfs file reads/writes with huge count") Signed-off-by: Andrew Paniakin <apanyaki@amazon.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Andrew Panyakin <apanyaki@amazon.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07memcg: factor out mem_cgroup_stat_aggregate()Xiu Jianfeng
Currently mem_cgroup_css_rstat_flush() is used to flush the per-CPU statistics from a specified CPU into the global statistics of the memcg. It processes three kinds of data in three for loops using exactly the same method. Therefore, the for loop can be factored out and may make the code more clean. Link: https://lkml.kernel.org/r/20241026093407.310955-1-xiujianfeng@huaweicloud.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wang Weiyang <wangweiyang2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm/show_mem: use str_yes_no() helper in show_free_areas()Thorsten Blum
Remove hard-coded strings by using the str_yes_no() helper function. Link: https://lkml.kernel.org/r/20241026103552.6790-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm/vmscan: wake up flushers conditionally to avoid cgroup OOMZeng Jingxiang
Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle") removed the opportunity to wake up flushers during the MGLRU page reclamation process can lead to an increased likelihood of triggering OOM when encountering many dirty pages during reclamation on MGLRU. This leads to premature OOM if there are too many dirty pages in cgroup: Killed dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0 Call Trace: <TASK> dump_stack_lvl+0x5f/0x80 dump_stack+0x14/0x20 dump_header+0x46/0x1b0 oom_kill_process+0x104/0x220 out_of_memory+0x112/0x5a0 mem_cgroup_out_of_memory+0x13b/0x150 try_charge_memcg+0x44f/0x5c0 charge_memcg+0x34/0x50 __mem_cgroup_charge+0x31/0x90 filemap_add_folio+0x4b/0xf0 __filemap_get_folio+0x1a4/0x5b0 ? srso_return_thunk+0x5/0x5f ? __block_commit_write+0x82/0xb0 ext4_da_write_begin+0xe5/0x270 generic_perform_write+0x134/0x2b0 ext4_buffered_write_iter+0x57/0xd0 ext4_file_write_iter+0x76/0x7d0 ? selinux_file_permission+0x119/0x150 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f vfs_write+0x30c/0x440 ksys_write+0x65/0xe0 __x64_sys_write+0x1e/0x30 x64_sys_call+0x11c2/0x1d50 do_syscall_64+0x47/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e memory: usage 308224kB, limit 308224kB, failcnt 2589 swap: usage 0kB, limit 9007199254740988kB, failcnt 0 ... file_dirty 303247360 file_writeback 0 ... oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test, mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0 Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB, anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:0 The flusher wake up was removed to decrease SSD wearing, but if we are seeing all dirty folios at the tail of an LRU, not waking up the flusher could lead to thrashing easily. So wake it up when a memcg is about to OOM due to dirty caches. I did run the build kernel test[1] on V6, with -j16 1G memcg on my local branch: Without the patch(10 times): user 1449.394 system 368.78 372.58 363.03 362.31 360.84 372.70 368.72 364.94 373.51 366.58 (avg 367.399) real 164.883 With the V6 patch(10 times): user 1447.525 system 360.87 360.63 372.39 364.09 368.49 365.15 359.93 362.04 359.72 354.60 (avg 362.79) real 164.514 Test results show that this patch has about 1% performance improvement, which should be caused by noise. Link: https://lkml.kernel.org/r/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com Link: https://lore.kernel.org/all/CACePvbV4L-gRN9UKKuUnksfVJjOTq_5Sti2-e=pb_w51kucLKQ@mail.gmail.com/ [1] Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle") Suggested-by: Wei Xu <weixugc@google.com> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com> Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Wei Xu <weixugc@google.com> Tested-by: Chris Li <chrisl@kernel.org> Cc: T.J. Mercier <tjmercier@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm: use page->private instead of page->index in percpuMatthew Wilcox (Oracle)
The percpu allocator only uses one field in struct page, just change it from page->index to page->private. Link: https://lkml.kernel.org/r/20241005200121.3231142-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm: remove references to page->index in huge_memory.cMatthew Wilcox (Oracle)
We already have folios in all these places; it's just a matter of using them instead of the pages. Link: https://lkml.kernel.org/r/20241005200121.3231142-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07bootmem: stop using page->indexMatthew Wilcox (Oracle)
Encode the type into the bottom four bits of page->private and the info into the remaining bits. Also turn the bootmem type into a named enum. [arnd@arndb.de: bootmem: add bootmem_type stub function] Link: https://lkml.kernel.org/r/20241015143802.577613-1-arnd@kernel.org [akpm@linux-foundation.org: fix build with !CONFIG_HAVE_BOOTMEM_INFO_NODE] Link: https://lore.kernel.org/oe-kbuild-all/202410090311.eaqcL7IZ-lkp@intel.com/ Link: https://lkml.kernel.org/r/20241005200121.3231142-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>