summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2023-03-28mm: multi-gen LRU: improve design docT.J. Alumbaugh
This patch improves the design doc. Specifically, 1. add a section for the per-memcg mm_struct list, and 2. add a section for the PID controller. Link: https://lkml.kernel.org/r/20230214035445.1250139-2-talumbau@google.com Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: multi-gen LRU: clean up sysfs codeT.J. Alumbaugh
This patch cleans up the sysfs code. Specifically, 1. use sysfs_emit(), 2. use __ATTR_RW(), and 3. constify multi-gen LRU struct attribute_group. Link: https://lkml.kernel.org/r/20230214035445.1250139-1-talumbau@google.com Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28x86/mm/pat: clear VM_PAT if copy_p4d_range failedMa Wupeng
Syzbot reports a warning in untrack_pfn(). Digging into the root we found that this is due to memory allocation failure in pmd_alloc_one. And this failure is produced due to failslab. In copy_page_range(), memory alloaction for pmd failed. During the error handling process in copy_page_range(), mmput() is called to remove all vmas. While untrack_pfn this empty pfn, warning happens. Here's a simplified flow: dup_mm dup_mmap copy_page_range copy_p4d_range copy_pud_range copy_pmd_range pmd_alloc __pmd_alloc pmd_alloc_one page = alloc_pages(gfp, 0); if (!page) return NULL; mmput exit_mmap unmap_vmas unmap_single_vma untrack_pfn follow_phys WARN_ON_ONCE(1); Since this vma is not generate successfully, we can clear flag VM_PAT. In this case, untrack_pfn() will not be called while cleaning this vma. Function untrack_pfn_moved() has also been renamed to fit the new logic. Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm/userfaultfd: support WP on multiple VMAsMuhammad Usama Anjum
mwriteprotect_range() errors out if [start, end) doesn't fall in one VMA. We are facing a use case where multiple VMAs are present in one range of interest. For example, the following pseudocode reproduces the error which we are trying to fix: - Allocate memory of size 16 pages with PROT_NONE with mmap - Register userfaultfd - Change protection of the first half (1 to 8 pages) of memory to PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs. - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors out. This is a simple use case where user may or may not know if the memory area has been divided into multiple VMAs. We need an implementation which doesn't disrupt the already present users. So keeping things simple, stop going over all the VMAs if any one of the VMA hasn't been registered in WP mode. While at it, remove the un-needed error check as well. [akpm@linux-foundation.org: s/VM_WARN_ON_ONCE/VM_WARN_ONCE/ to fix build] Link: https://lkml.kernel.org/r/20230217105558.832710-1-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reported-by: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm, page_alloc: reduce page alloc/free sanity checksVlastimil Babka
Historically, we have performed sanity checks on all struct pages being allocated or freed, making sure they have no unexpected page flags or certain field values. This can detect insufficient cleanup and some cases of use-after-free, although on its own it can't always identify the culprit. The result is a warning and the "bad page" being leaked. The checks do need some cpu cycles, so in 4.7 with commits 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") and 4db7548ccbd9 ("mm, page_alloc: defer debugging checks of freed pages until a PCP drain") they were no longer performed in the hot paths when allocating and freeing from pcplists, but only when pcplists are bypassed, refilled or drained. For debugging purposes, with CONFIG_DEBUG_VM enabled the checks were instead still done in the hot paths and not when refilling or draining pcplists. With 4462b32c9285 ("mm, page_alloc: more extensive free page checking with debug_pagealloc"), enabling debug_pagealloc also moved the sanity checks back to hot pahs. When both debug_pagealloc and CONFIG_DEBUG_VM are enabled, the checks are done both in hotpaths and pcplist refill/drain. Even though the non-debug default today might seem to be a sensible tradeoff between overhead and ability to detect bad pages, on closer look it's arguably not. As most allocations go through the pcplists, catching any bad pages when refilling or draining pcplists has only a small chance, insufficient for debugging or serious hardening purposes. On the other hand the cost of the checks is concentrated in the already expensive drain/refill batching operations, and those are done under the often contended zone lock. That was recently identified as an issue for page allocation and the zone lock contention reduced by moving the checks outside of the locked section with a patch "mm: reduce lock contention of pcp buffer refill", but the cost of the checks is still visible compared to their removal [1]. In the pcplist draining path free_pcppages_bulk() the checks are still done under zone->lock. Thus, remove the checks from pcplist refill and drain paths completely. Introduce a static key check_pages_enabled to control checks during page allocation a freeing (whether pcplist is used or bypassed). The static key is enabled if either is true: - kernel is built with CONFIG_DEBUG_VM=y (debugging) - debug_pagealloc or page poisoning is boot-time enabled (debugging) - init_on_alloc or init_on_free is boot-time enabled (hardening) The resulting user visible changes: - no checks when draining/refilling pcplists - less overhead, with likely no practical reduction of ability to catch bad pages - no checks when bypassing pcplists in default config (no debugging/hardening) - less overhead etc. as above - on typical hardened kernels [2], checks are now performed on each page allocation/free (previously only when bypassing/draining/refilling pcplists) - the init_on_alloc/init_on_free enabled should be sufficient indication for preferring more costly alloc/free operations for hardening purposes and we shouldn't need to introduce another toggle - code (various wrappers) removal and simplification [1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de/ [2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@mx.google.com/ [akpm@linux-foundation.org: coding-style cleanups] [akpm@linux-foundation.org: make check_pages_enabled static] Link: https://lkml.kernel.org/r/20230216095131.17336-1-vbabka@suse.cz Reported-by: Alexander Halbuer <halbuer@sra.uni-hannover.de> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: reduce lock contention of pcp buffer refillAlexander Halbuer
rmqueue_bulk() batches the allocation of multiple elements to refill the per-CPU buffers into a single hold of the zone lock. Each element is allocated and checked using check_pcp_refill(). The check touches every related struct page which is especially expensive for higher order allocations (huge pages). This patch reduces the time holding the lock by moving the check out of the critical section similar to rmqueue_buddy() which allocates a single element. Measurements of parallel allocation-heavy workloads show a reduction of the average huge page allocation latency of 50 percent for two cores and nearly 90 percent for 24 cores. Link: https://lkml.kernel.org/r/20230201162549.68384-1-halbuer@sra.uni-hannover.de Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: cma: make kobj_type structure constantThomas Weißschuh
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Link: https://lkml.kernel.org/r/20230220-kobj_type-mm-cma-v1-1-45996cff1a81@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm/khugepaged: alloc_charge_hpage() take care of mem charge errorsPeter Xu
If memory charge failed, instead of returning the hpage but with an error, allow the function to cleanup the folio properly, which is normally what a function should do in this case - either return successfully, or return with no side effect of partial runs with an indicated error. This will also avoid the caller calling mem_cgroup_uncharge() unnecessarily with either anon or shmem path (even if it's safe to do so). Link: https://lkml.kernel.org/r/20230222195247.791227-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Stevens <stevensd@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: hugetlb_vmemmap: simplify hugetlb_vmemmap_init() a bitMuchun Song
The check of IS_ENABLED(CONFIG_PROC_SYSCTL) is unnecessary since register_sysctl_init() will be empty in this case. So, there is no warnings after removing the check. Link: https://lkml.kernel.org/r/20230223065947.64134-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: kfence: fix handling discontiguous pageMuchun Song
The struct pages could be discontiguous when the kfence pool is allocated via alloc_contig_pages() with CONFIG_SPARSEMEM and !CONFIG_SPARSEMEM_VMEMMAP. This may result in setting PG_slab and memcg_data to a arbitrary address (may be not used as a struct page), which in the worst case might corrupt the kernel. So the iteration should use nth_page(). Link: https://lkml.kernel.org/r/20230323025003.94447-1-songmuchun@bytedance.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm: kfence: fix PG_slab and memcg_data clearingMuchun Song
It does not reset PG_slab and memcg_data when KFENCE fails to initialize kfence pool at runtime. It is reporting a "Bad page state" message when kfence pool is freed to buddy. The checking of whether it is a compound head page seems unnecessary since we already guarantee this when allocating kfence pool. Remove the check to simplify the code. Link: https://lkml.kernel.org/r/20230320030059.20189-1-songmuchun@bytedance.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: Marco Elver <elver@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sjpark@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-27mm,kfence: decouple kfence from page granularity mapping judgementZhenhua Huang
Kfence only needs its pool to be mapped as page granularity, if it is inited early. Previous judgement was a bit over protected. From [1], Mark suggested to "just map the KFENCE region a page granularity". So I decouple it from judgement and do page granularity mapping for kfence pool only. Need to be noticed that late init of kfence pool still requires page granularity mapping. Page granularity mapping in theory cost more(2M per 1GB) memory on arm64 platform. Like what I've tested on QEMU(emulated 1GB RAM) with gki_defconfig, also turning off rodata protection: Before: [root@liebao ]# cat /proc/meminfo MemTotal: 999484 kB After: [root@liebao ]# cat /proc/meminfo MemTotal: 1001480 kB To implement this, also relocate the kfence pool allocation before the linear mapping setting up, arm64_kfence_alloc_pool is to allocate phys addr, __kfence_pool is to be set after linear mapping set up. LINK: [1] https://lore.kernel.org/linux-arm-kernel/Y+IsdrvDNILA59UN@FVFF77S0Q05N/ Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/1679066974-690-1-git-send-email-quic_zhenhuah@quicinc.com Signed-off-by: Will Deacon <will@kernel.org>
2023-03-24Merge tag 'mm-hotfixes-stable-2023-03-24-17-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder are for other subsystems" * tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) mm: mmap: remove newline at the end of the trace mailmap: add entries for Richard Leitner kcsan: avoid passing -g for test kfence: avoid passing -g for test mm: kfence: fix using kfence_metadata without initialization in show_object() lib: dhry: fix unstable smp_processor_id(_) usage mailmap: add entry for Enric Balletbo i Serra mailmap: map Sai Prakash Ranjan's old address to his current one mailmap: map Rajendra Nayak's old address to his current one Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare" mailmap: add entry for Tobias Klauser kasan, powerpc: don't rename memintrinsics if compiler adds prefixes mm/ksm: fix race with VMA iteration and mm_struct teardown kselftest: vm: fix unused variable warning mm: fix error handling for map_deny_write_exec mm: deduplicate error handling for map_deny_write_exec checksyscalls: ignore fstat to silence build warning on LoongArch nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy() test_maple_tree: add more testing for mas_empty_area() maple_tree: fix mas_skip_node() end slot detection ...
2023-03-24Merge tag 'slab-fix-for-6.3-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: "A single build fix for a corner case configuration that is apparently possible to achieve on some arches, from Geert" * tag 'slab-fix-for-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: Fix undefined init_cache_node_node() for NUMA and !SMP
2023-03-23cpuset: Clean up cpuset_node_allowedHaifeng Xu
Commit 002f290627c2 ("cpuset: use static key better and convert to new API") has used __cpuset_node_allowed() instead of cpuset_node_allowed() to check whether we can allocate on a memory node. Now this function isn't used by anyone, so we can do the follow things to clean up it. 1. remove unused codes 2. rename __cpuset_node_allowed() to cpuset_node_allowed() 3. update comments in mm/page_alloc.c Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-23kfence: avoid passing -g for testMarco Elver
Nathan reported that when building with GNU as and a version of clang that defaults to DWARF5: $ make -skj"$(nproc)" ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- \ LLVM=1 LLVM_IAS=0 O=build \ mrproper allmodconfig mm/kfence/kfence_test.o /tmp/kfence_test-08a0a0.s: Assembler messages: /tmp/kfence_test-08a0a0.s:14627: Error: non-constant .uleb128 is not supported /tmp/kfence_test-08a0a0.s:14628: Error: non-constant .uleb128 is not supported /tmp/kfence_test-08a0a0.s:14632: Error: non-constant .uleb128 is not supported /tmp/kfence_test-08a0a0.s:14633: Error: non-constant .uleb128 is not supported /tmp/kfence_test-08a0a0.s:14639: Error: non-constant .uleb128 is not supported ... This is because `-g` defaults to the compiler debug info default. If the assembler does not support some of the directives used, the above errors occur. To fix, remove the explicit passing of `-g`. All the test wants is that stack traces print valid function names, and debug info is not required for that. (I currently cannot recall why I added the explicit `-g`.) Link: https://lkml.kernel.org/r/20230316224705.709984-1-elver@google.com Fixes: bc8fbc5f305a ("kfence: add test suite") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23mm: kfence: fix using kfence_metadata without initialization in show_object()Muchun Song
The variable kfence_metadata is initialized in kfence_init_pool(), then, it is not initialized if kfence is disabled after booting. In this case, kfence_metadata will be used (e.g. ->lock and ->state fields) without initialization when reading /sys/kernel/debug/kfence/objects. There will be a warning if you enable CONFIG_DEBUG_SPINLOCK. Fix it by creating debugfs files when necessary. Link: https://lkml.kernel.org/r/20230315034441.44321-1-songmuchun@bytedance.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"Peter Collingbourne
This reverts commit 487a32ec24be819e747af8c2ab0d5c515508086a. should_skip_kasan_poison() reads the PG_skip_kasan_poison flag from page->flags. However, this line of code in free_pages_prepare(): page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; clears most of page->flags, including PG_skip_kasan_poison, before calling should_skip_kasan_poison(), which meant that it would never return true as a result of the page flag being set. Therefore, fix the code to call should_skip_kasan_poison() before clearing the flags, as we were doing before the reverted patch. This fixes a measurable performance regression introduced in the reverted commit, where munmap() takes longer than intended if HW tags KASAN is supported and enabled at runtime. Without this patch, we see a single-digit percentage performance regression in a particular mmap()-heavy benchmark when enabling HW tags KASAN, and with the patch, there is no statistically significant performance impact when enabling HW tags KASAN. Link: https://lkml.kernel.org/r/20230310042914.3805818-2-pcc@google.com Fixes: 487a32ec24be ("kasan: drop skip_kasan_poison variable in free_pages_prepare") Link: https://linux-review.googlesource.com/id/Ic4f13affeebd20548758438bb9ed9ca40e312b79 Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Evgenii Stepanov <eugenis@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.1] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23mm/ksm: fix race with VMA iteration and mm_struct teardownLiam R. Howlett
exit_mmap() will tear down the VMAs and maple tree with the mmap_lock held in write mode. Ensure that the maple tree is still valid by checking ksm_test_exit() after taking the mmap_lock in read mode, but before the for_each_vma() iterator dereferences a destroyed maple tree. Since the maple tree is destroyed, the flags telling lockdep to check an external lock has been cleared. Skip the for_each_vma() iterator to avoid dereferencing a maple tree without the external lock flag, which would create a lockdep warning. Link: https://lkml.kernel.org/r/20230308220310.3119196-1-Liam.Howlett@oracle.com Fixes: a5f18ba07276 ("mm/ksm: use vma iterators instead of vma linked list") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Link: https://lore.kernel.org/lkml/ZAdUUhSbaa6fHS36@xpf.sh.intel.com/ Reported-by: syzbot+2ee18845e89ae76342c5@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=64a3e95957cd3deab99df7cd7b5a9475af92c93e Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <heng.su@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23mm: fix error handling for map_deny_write_execJoey Gouly
Commit 4a18419f71cd ("mm/mprotect: use mmu_gather") changed 'goto out;' to 'break' in the loop. This wasn't noticed while rebasing the MDWE patches, so fix it now. Link: https://lkml.kernel.org/r/20230308190423.46491-3-joey.gouly@arm.com Fixes: b507808ebce2 ("mm: implement memory-deny-write-execute as a prctl") Signed-off-by: Joey Gouly <joey.gouly@arm.com> Reported-by: Alexey Izbyshev <izbyshev@ispras.ru> Link: https://lore.kernel.org/linux-arm-kernel/8408d8901e9d7ee6b78db4c6cba04b78@ispras.ru/ Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: nd <nd@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23mm: deduplicate error handling for map_deny_write_execJoey Gouly
Patch series "Fixes for MDWE prctl" These are four small fixes for the recent memory-write-deny-execute prctl patches [1]. Two reported by Alexey about error handling and two tooling fixes by Peter. This patch (of 4): Commit cc8d1b097de7 ("mmap: clean up mmap_region() unrolling") deduplicated the error handling, do the same for the return value of `map_deny_write_exec`. Link: https://lkml.kernel.org/r/20230308190423.46491-1-joey.gouly@arm.com Link: https://lkml.kernel.org/r/20230308190423.46491-2-joey.gouly@arm.com Link: https://lore.kernel.org/linux-arm-kernel/20230119160344.54358-1-joey.gouly@arm.com/ [1] Fixes: b507808ebce2 ("mm: implement memory-deny-write-execute as a prctl") Signed-off-by: Joey Gouly <joey.gouly@arm.com> Reported-by: Alexey Izbyshev <izbyshev@ispras.ru> Link: https://lore.kernel.org/linux-arm-kernel/8408d8901e9d7ee6b78db4c6cba04b78@ispras.ru/ Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: nd <nd@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23mm, vmalloc: fix high order __GFP_NOFAIL allocationsMichal Hocko
Gao Xiang has reported that the page allocator complains about high order __GFP_NOFAIL request coming from the vmalloc core: __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549 alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286 vm_area_alloc_pages mm/vmalloc.c:2989 [inline] __vmalloc_area_node mm/vmalloc.c:3057 [inline] __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227 kvmalloc_node+0x156/0x1a0 mm/util.c:606 kvmalloc include/linux/slab.h:737 [inline] kvmalloc_array include/linux/slab.h:755 [inline] kvcalloc include/linux/slab.h:760 [inline] it seems that I have completely missed high order allocation backing vmalloc areas case when implementing __GFP_NOFAIL support. This means that [k]vmalloc at al. can allocate higher order allocations with __GFP_NOFAIL which can trigger OOM killer for non-costly orders easily or cause a lot of reclaim/compaction activity if those requests cannot be satisfied. Fix the issue by falling back to zero order allocations for __GFP_NOFAIL requests if the high order request fails. Link: https://lkml.kernel.org/r/ZAXynvdNqcI0f6Us@dhcp22.suse.cz Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL") Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lkml.kernel.org/r/20230305053035.1911-1-hsiangkao@linux.alibaba.com Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-22mm/slab: Fix undefined init_cache_node_node() for NUMA and !SMPGeert Uytterhoeven
sh/migor_defconfig: mm/slab.c: In function ‘slab_memory_callback’: mm/slab.c:1127:23: error: implicit declaration of function ‘init_cache_node_node’; did you mean ‘drain_cache_node_node’? [-Werror=implicit-function-declaration] 1127 | ret = init_cache_node_node(nid); | ^~~~~~~~~~~~~~~~~~~~ | drain_cache_node_node The #ifdef condition protecting the definition of init_cache_node_node() no longer matches the conditions protecting the (multiple) users. Fix this by syncing the conditions. Fixes: 76af6a054da40553 ("mm/migrate: add CPU hotplug to demotion #ifdef") Reported-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/b5bdea22-ed2f-3187-6efe-0c72330270a4@infradead.org Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-03-20mm: hugetlb: move hugeltb sysctls to its own fileKefeng Wang
This moves all hugetlb sysctls to its own file, also kill an useless hugetlb_treat_movable_handler() defination. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-17driver core: class: remove module * from class_create()Greg Kroah-Hartman
The module pointer in class_create() never actually did anything, and it shouldn't have been requred to be set as a parameter even if it did something. So just remove it and fix up all callers of the function in the kernel tree at the same time. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17mm/slub: fix help comment of SLUB_DEBUGVernon Yang
Since commit ab4d5ed5eeda ("slub: Enable sysfs support for !CONFIG_SLUB_DEBUG"), disabling SLUB_DEBUG no longer disables sysfs support completely, so fix the description. Also update the path to /sys/kernel/slab. Signed-off-by: Vernon Yang <vernon2gm@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-03-16mm: Introduce untagged_addr_remote()Kirill A. Shutemov
untagged_addr() removes tags/metadata from the address and brings it to the canonical form. The helper is implemented on arm64 and sparc. Both of them do untagging based on global rules. However, Linear Address Masking (LAM) on x86 introduces per-process settings for untagging. As a result, untagged_addr() is now only suitable for untagging addresses for the current proccess. The new helper untagged_addr_remote() has to be used when the address targets remote process. It requires the mmap lock for target mm to be taken. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com
2023-03-13mm: slub: make kobj_type structure constantThomas Weißschuh
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-03-12mm,jfs: move write_one_page/folio_write_one to jfsChristoph Hellwig
The last remaining user of folio_write_one through the write_one_page wrapper is jfs, so move the functionality there and hard code the call to metapage_writepage. Note that the use of the pagecache by the JFS 'metapage' buffer cache is a bit odd, and we could probably do without VM-level dirty tracking at all, but that's a change for another time. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-03-12fork/vm: Move common PF_IO_WORKER behavior to new flagMike Christie
This adds a new flag, PF_USER_WORKER, that's used for behavior common to to both PF_IO_WORKER and users like vhost which will use a new helper instead of create_io_thread because they require different behavior for operations like signal handling. The common behavior PF_USER_WORKER covers is the vm reclaim handling. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-07mm/damon/paddr: fix folio_nr_pages() after folio_put() in ↵SeongJae Park
damon_pa_mark_accessed_or_deactivate() damon_pa_mark_accessed_or_deactivate() is accessing a folio via folio_nr_pages() after folio_put() for the folio has invoked. Fix it. Link: https://lkml.kernel.org/r/20230304193949.296391-3-sj@kernel.org Fixes: f70da5ee8fe1 ("mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use folios") Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07mm/damon/paddr: fix folio_size() call after folio_put() in damon_pa_young()SeongJae Park
Patch series "mm/damon/paddr: Fix folio-use-after-put bugs". There are two folio accesses after folio_put() in mm/damon/paddr.c file. Fix those. This patch (of 2): damon_pa_young() is accessing a folio via folio_size() after folio_put() for the folio has invoked. Fix it. Link: https://lkml.kernel.org/r/20230304193949.296391-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230304193949.296391-2-sj@kernel.org Fixes: 397b0c3a584b ("mm/damon/paddr: remove folio_sz field from damon_pa_access_chk_result") Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: <stable@vger.kernel.org> [6.2.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07migrate_pages: try migrate in batch asynchronously firstlyHuang Ying
When we have locked more than one folios, we cannot wait the lock or bit (e.g., page lock, buffer head lock, writeback bit) synchronously. Otherwise deadlock may be triggered. This make it hard to batch the synchronous migration directly. This patch re-enables batching synchronous migration via trying to migrate in batch asynchronously firstly. And any folios that are failed to be migrated asynchronously will be migrated synchronously one by one. Test shows that this can restore the TLB flushing batching performance for synchronous migration effectively. Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07migrate_pages: move split folios processing out of migrate_pages_batch()Huang Ying
To simplify the code logic and reduce the line number. Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07migrate_pages: fix deadlock in batched migrationHuang Ying
Patch series "migrate_pages: fix deadlock in batched synchronous migration", v2. Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei. Analysis shows that if we have locked some other folios except the one we are migrating, it's not safe in general to wait synchronously, for example, to wait the writeback to complete or wait to lock the buffer head. So 1/3 fixes the deadlock in a simple way, where the batching support for the synchronous migration is disabled. The change is straightforward and easy to be understood. While 3/3 re-introduce the batching for synchronous migration via trying to migrate asynchronously in batch optimistically, then fall back to migrate synchronously one by one for fail-to-migrate folios. Test shows that this can restore the TLB flushing batching performance for synchronous migration effectively. This patch (of 3): Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei! For example, in the following deadlock trace snippet, INFO: task kworker/u4:0:9 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 Workqueue: loop4 loop_rootcg_workfn Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? __pfx_wake_page_function+0x10/0x10 __filemap_get_folio+0x73d/0x770 shmem_get_folio_gfp+0x1fd/0xc80 shmem_write_begin+0x91/0x220 generic_perform_write+0x10e/0x2e0 __generic_file_write_iter+0x17e/0x290 ? generic_write_checks+0x12b/0x1a0 generic_file_write_iter+0x97/0x180 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_readv_writev+0x13c/0x210 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_write+0xf6/0x330 vfs_iter_write+0x46/0x70 loop_process_work+0x723/0xfe0 loop_rootcg_workfn+0x28/0x40 process_one_work+0x3cc/0x8d0 worker_thread+0x66/0x630 ? __pfx_worker_thread+0x10/0x10 kthread+0x153/0x190 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> INFO: task repro:1023 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? compaction_alloc+0x77/0x1150 ? __pfx_wake_page_function+0x10/0x10 folio_wait_bit+0x30/0x40 folio_wait_writeback+0x2e/0x1e0 migrate_pages_batch+0x555/0x1ac0 ? __pfx_compaction_alloc+0x10/0x10 ? __pfx_compaction_free+0x10/0x10 ? __this_cpu_preempt_check+0x17/0x20 ? lock_is_held_type+0xe6/0x140 migrate_pages+0x100e/0x1180 ? __pfx_compaction_free+0x10/0x10 ? __pfx_compaction_alloc+0x10/0x10 compact_zone+0xe10/0x1b50 ? lock_is_held_type+0xe6/0x140 ? check_preemption_disabled+0x80/0xf0 compact_node+0xa3/0x100 ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 ? _find_first_bit+0x7b/0x90 sysctl_compaction_handler+0x5d/0xb0 proc_sys_call_handler+0x29d/0x420 proc_sys_write+0x2b/0x40 vfs_write+0x3a3/0x780 ksys_write+0xb7/0x180 __x64_sys_write+0x26/0x30 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f3a2471f59d RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 </TASK> The page migration task has held the lock of the shmem folio A, and is waiting the writeback of the folio B of the file system on the loop block device to complete. While the loop worker task which writes back the folio B is waiting to lock the shmem folio A, because the folio A backs the folio B in the loop device. Thus deadlock is triggered. In general, if we have locked some other folios except the one we are migrating, it's not safe to wait synchronously, for example, to wait the writeback to complete or wait to lock the buffer head. To fix the deadlock, in this patch, we avoid to batch the page migration except for MIGRATE_ASYNC mode. In MIGRATE_ASYNC mode, synchronous waiting is avoided. The fix can be improved further. We will do that as soon as possible. Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/ Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/ Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/ Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07mm/userfaultfd: propagate uffd-wp bit when PTE-mapping the huge zeropageDavid Hildenbrand
Currently, we'd lose the userfaultfd-wp marker when PTE-mapping a huge zeropage, resulting in the next write faults in the PMD range not triggering uffd-wp events. Various actions (partial MADV_DONTNEED, partial mremap, partial munmap, partial mprotect) could trigger this. However, most importantly, un-protecting a single sub-page from the userfaultfd-wp handler when processing a uffd-wp event will PTE-map the shared huge zeropage and lose the uffd-wp bit for the remainder of the PMD. Let's properly propagate the uffd-wp bit to the PMDs. #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <stdbool.h> #include <inttypes.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <poll.h> #include <pthread.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/ioctl.h> #include <linux/userfaultfd.h> static size_t pagesize; static int uffd; static volatile bool uffd_triggered; #define barrier() __asm__ __volatile__("": : :"memory") static void uffd_wp_range(char *start, size_t size, bool wp) { struct uffdio_writeprotect uffd_writeprotect; uffd_writeprotect.range.start = (unsigned long) start; uffd_writeprotect.range.len = size; if (wp) { uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP; } else { uffd_writeprotect.mode = 0; } if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) { fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno); exit(1); } } static void *uffd_thread_fn(void *arg) { static struct uffd_msg msg; ssize_t nread; while (1) { struct pollfd pollfd; int nready; pollfd.fd = uffd; pollfd.events = POLLIN; nready = poll(&pollfd, 1, -1); if (nready == -1) { fprintf(stderr, "poll() failed: %d\n", errno); exit(1); } nread = read(uffd, &msg, sizeof(msg)); if (nread <= 0) continue; if (msg.event != UFFD_EVENT_PAGEFAULT || !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) { printf("FAIL: wrong uffd-wp event fired\n"); exit(1); } /* un-protect the single page. */ uffd_triggered = true; uffd_wp_range((char *)(uintptr_t)msg.arg.pagefault.address, pagesize, false); } return arg; } static int setup_uffd(char *map, size_t size) { struct uffdio_api uffdio_api; struct uffdio_register uffdio_register; pthread_t thread; uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); if (uffd < 0) { fprintf(stderr, "syscall() failed: %d\n", errno); return -errno; } uffdio_api.api = UFFD_API; uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { fprintf(stderr, "UFFDIO_API failed: %d\n", errno); return -errno; } if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) { fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n"); return -ENOSYS; } uffdio_register.range.start = (unsigned long) map; uffdio_register.range.len = size; uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) { fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno); return -errno; } pthread_create(&thread, NULL, uffd_thread_fn, NULL); return 0; } int main(void) { const size_t size = 4 * 1024 * 1024ull; char *map, *cur; pagesize = getpagesize(); map = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } if (madvise(map, size, MADV_HUGEPAGE)) { fprintf(stderr, "MADV_HUGEPAGE failed\n"); return -errno; } if (setup_uffd(map, size)) return 1; /* Read the whole range, populating zeropages. */ madvise(map, size, MADV_POPULATE_READ); /* Write-protect the whole range. */ uffd_wp_range(map, size, true); /* Make sure uffd-wp triggers on each page. */ for (cur = map; cur < map + size; cur += pagesize) { uffd_triggered = false; barrier(); /* Trigger a write fault. */ *cur = 1; barrier(); if (!uffd_triggered) { printf("FAIL: uffd-wp did not trigger\n"); return 1; } } printf("PASS: uffd-wp triggered\n"); return 0; } Link: https://lkml.kernel.org/r/20230302175423.589164-1-david@redhat.com Fixes: e06f1e1dd499 ("userfaultfd: wp: enabled write protection in userfaultfd API") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07mm: teach mincore_hugetlb about pte markersJames Houghton
By checking huge_pte_none(), we incorrectly classify PTE markers as "present". Instead, check huge_pte_none_mostly(), classifying PTE markers the same as if the PTE were completely blank. PTE markers, unlike other kinds of swap entries, don't reference any physical page and don't indicate that a physical page was mapped previously. As such, treat them as non-present for the sake of mincore(). Link: https://lkml.kernel.org/r/20230302222404.175303-1-jthoughton@google.com Fixes: 5c041f5d1f23 ("mm: teach core mm about pte markers") Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: James Houghton <jthoughton@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-06fs: drop unused posix acl handlersChristian Brauner
Remove struct posix_acl_{access,default}_handler for all filesystems that don't depend on the xattr handler in their inode->i_op->listxattr() method in any way. There's nothing more to do than to simply remove the handler. It's been effectively unused ever since we introduced the new posix acl api. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-04mm: avoid gcc complaint about pointer castingLinus Torvalds
The migration code ends up temporarily stashing information of the wrong type in unused fields of the newly allocated destination folio. That all works fine, but gcc does complain about the pointer type mis-use: mm/migrate.c: In function ‘__migrate_folio_extract’: mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’ 1050 | *anon_vmap = (void *)dst->mapping; | ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~ and gcc is actually right to complain since it really doesn't understand that this is a very temporary special case where this is ok. This could be fixed in different ways by just obfuscating the assignment sufficiently that gcc doesn't see what is going on, but the truly "proper C" way to do this is by explicitly using a union. Using unions for type conversions like this is normally hugely ugly and syntactically nasty, but this really is one of the few cases where we want to make it clear that we're not doing type conversion, we're really re-using the value bit-for-bit just using another type. IOW, this should not become a common pattern, but in this one case using that odd union is probably the best way to document to the compiler what is conceptually going on here. [ Side note: there are valid cases where we convert pointers to other pointer types, notably the whole "folio vs page" situation, where the types actually have fundamental commonalities. The fact that the gcc note is limited to just randomized structures means that we don't see equivalent warnings for those cases, but it migth also mean that we miss other cases where we do play these kinds of dodgy games, and this kind of explicit conversion might be a good idea. ] I verified that at least for an allmodconfig build on x86-64, this generates the exact same code, apart from line numbers and assembler comment changes. Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()") Cc: Huang, Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-04Merge tag 'mm-hotfixes-stable-2023-03-04-13-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes. Eight are for MM and seven are for other parts of the kernel. Seven are cc:stable and eight address post-6.3 issues or were judged unsuitable for -stable backporting" * tag 'mm-hotfixes-stable-2023-03-04-13-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: map Dikshita Agarwal's old address to his current one mailmap: map Vikash Garodia's old address to his current one fs/cramfs/inode.c: initialize file_ra_state fs: hfsplus: fix UAF issue in hfsplus_put_super panic: fix the panic_print NMI backtrace setting lib: parser: update documentation for match_NUMBER functions kasan, x86: don't rename memintrinsics in uninstrumented files kasan: test: fix test for new meminstrinsic instrumentation kasan: treat meminstrinsic as builtins in uninstrumented files kasan: emit different calls for instrumentable memintrinsics ocfs2: fix non-auto defrag path not working issue ocfs2: fix defrag path triggering jbd2 ASSERT mailmap: map Georgi Djakov's old Linaro address to his current one mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON lib/zlib: DFLTCC deflate does not write all available bits for Z_NO_FLUSH mm/damon/paddr: fix missing folio_put() mm/mremap: fix dup_anon_vma() in vma_merge() case 4
2023-03-02kasan: test: fix test for new meminstrinsic instrumentationMarco Elver
The tests for memset/memmove have been failing since they haven't been instrumented in 69d4c0d32186. Fix the test to recognize when memintrinsics aren't instrumented, and skip test cases accordingly. We also need to conditionally pass -fno-builtin to the test, otherwise the instrumentation pass won't recognize memintrinsics and end up not instrumenting them either. Link: https://lkml.kernel.org/r/20230224085942.1791837-3-elver@google.com Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-02kasan: treat meminstrinsic as builtins in uninstrumented filesMarco Elver
Where the compiler instruments meminstrinsics by generating calls to __asan/__hwasan_ prefixed functions, let the compiler consider memintrinsics as builtin again. To do so, never override memset/memmove/memcpy if the compiler does the correct instrumentation - even on !GENERIC_ENTRY architectures. [elver@google.com: powerpc: don't rename memintrinsics if compiler adds prefixes] Link: https://lore.kernel.org/all/20230224085942.1791837-1-elver@google.com/ [1] Link: https://lkml.kernel.org/r/20230227094726.3833247-1-elver@google.com Link: https://lkml.kernel.org/r/20230224085942.1791837-2-elver@google.com Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-02kasan: emit different calls for instrumentable memintrinsicsMarco Elver
Clang 15 provides an option to prefix memcpy/memset/memmove calls with __asan_/__hwasan_ in instrumented functions: https://reviews.llvm.org/D122724 GCC will add support in future: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108777 Use it to regain KASAN instrumentation of memcpy/memset/memmove on architectures that require noinstr to be really free from instrumented mem*() functions (all GENERIC_ENTRY architectures). Link: https://lkml.kernel.org/r/20230224085942.1791837-1-elver@google.com Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Signed-off-by: Marco Elver <elver@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: kasan-dev@googlegroups.com Cc: Kees Cook <keescook@chromium.org> Cc: Linux Kernel Functional Testing <lkft@linaro.org> Cc: Nathan Chancellor <nathan@kernel.org> # build only Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-27mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISONNaoya Horiguchi
After a memory error happens on a clean folio, a process unexpectedly receives SIGBUS when it accesses the error page. This SIGBUS killing is pointless and simply degrades the level of RAS of the system, because the clean folio can be dropped without any data lost on memory error handling as we do for a clean pagecache. When memory_failure() is called on a clean folio, try_to_unmap() is called twice (one from split_huge_page() and one from hwpoison_user_mappings()). The root cause of the issue is that pte conversion to hwpoisoned entry is now done in the first call of try_to_unmap() because PageHWPoison is already set at this point, while it's actually expected to be done in the second call. This behavior disturbs the error handling operation like removing pagecache, which results in the malfunction described above. So convert TTU_IGNORE_HWPOISON into TTU_HWPOISON and set TTU_HWPOISON only when we really intend to convert pte to hwpoison entry. This can prevent other callers of try_to_unmap() from accidentally converting to hwpoison entries. Link: https://lkml.kernel.org/r/20230221085905.1465385-1-naoya.horiguchi@linux.dev Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-27mm/damon/paddr: fix missing folio_put()andrew.yang
damon_get_folio() would always increase folio _refcount and folio_isolate_lru() would increase folio _refcount if the folio's lru flag is set. If an unevictable folio isolated successfully, there will be two more _refcount. The one from folio_isolate_lru() will be decreased in folio_puback_lru(), but the other one from damon_get_folio() will be left behind. This causes a pin page. Whatever the case, the _refcount from damon_get_folio() should be decreased. Link: https://lkml.kernel.org/r/20230222064223.6735-1-andrew.yang@mediatek.com Fixes: 57223ac29584 ("mm/damon/paddr: support the pageout scheme") Signed-off-by: andrew.yang <andrew.yang@mediatek.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.16.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-27mm/mremap: fix dup_anon_vma() in vma_merge() case 4Vlastimil Babka
In case 4, we are shrinking 'prev' (PPPP in the comment) and expanding 'mid' (NNNN). So we need to make sure 'mid' clones the anon_vma from 'prev', if it doesn't have any. After commit 0503ea8f5ba7 ("mm/mmap: remove __vma_adjust()") we can fail to do that due to wrong parameters for dup_anon_vma(). The call is a no-op because res == next, adjust == mid and mid == next. Fix it. Link: https://lkml.kernel.org/r/ad91d62b-37eb-4b73-707a-3c45c9e16256@suse.cz Fixes: 0503ea8f5ba7 ("mm/mmap: remove __vma_adjust()") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-27Merge tag 'memblock-v6.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: "Small optimizations: - fix off-by-one in the check whether memblock_add_range() should reallocate memory to accommodate newly inserted range - check only for relevant regions in memblock_merge_regions() rather than swipe over the entire array" * tag 'memblock-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: Avoid useless checks in memblock_merge_regions(). memblock: Make a boundary tighter in memblock_add_range().
2023-02-25mm/mprotect: Fix successful vma_merge() of next in do_mprotect_pkey()Liam R. Howlett
If mprotect_fixup() successfully calls vma_merge() and replaces vma and the next vma, then the tmp variable in the do_mprotect_pkey() is not updated to point to the new vma end. This results in the loop detecting a gap between VMAs that does not exist. Fix the faulty value of tmp by setting it to the end location of the vma iterator at the end of the loop. Link: https://lkml.kernel.org/r/20230224212055.1786100-1-Liam.Howlett@oracle.com Fixes: 2286a6914c77 ("mm: change mprotect_fixup to vma iterator") Link: https://lore.kernel.org/linux-mm/20230223120407.729110a6ecd1416ac59d9cb0@linux-foundation.org/ Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Bert Karwatzki <spasswolf@web.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217061 Tested-by: Bert Karwatzki <spasswolf@web.de> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-mm/CAHk-=wjFmVL7NiuxL54qLkoabni_yD-oF9=dpDgETtdsiCbhUg@mail.gmail.com/ Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-23Merge tag 'sysctl-6.3-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl update from Luis Chamberlain: "Just one fix which just came in. Sadly the eager beavers willing to help with the sysctl moves have slowed" * tag 'sysctl-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sysctl: fix proc_dobool() usability