summaryrefslogtreecommitdiff
path: root/mm/slub.c
AgeCommit message (Collapse)Author
2023-10-02mm/slub: refactor calculate_order() and calc_slab_order()Vlastimil Babka
After the previous cleanups, we can now move some code from calc_slab_order() to calculate_order() so it's executed just once, and do some more cleanups. - move the min_order and MAX_OBJS_PER_PAGE evaluation to calculate_order(). - change calc_slab_order() parameter min_objects to min_order Also make MAX_OBJS_PER_PAGE check more robust by considering also min_objects in addition to slub_min_order. Otherwise this is not a functional change. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-and-tested-by: Jay Patel <jaypatel@linux.ibm.com>
2023-10-02mm/slub: attempt to find layouts up to 1/2 waste in calculate_order()Vlastimil Babka
The main loop in calculate_order() currently tries to find an order with at most 1/4 waste. If that's impossible (for particular large object sizes), there's a fallback that will try to place one object within slab_max_order. If we expand the loop boundary to also allow up to 1/2 waste as the last resort, we can remove the fallback and simplify the code, as the loop will find an order for such sizes as well. Note we don't need to allow more than 1/2 waste as that will never happen - calc_slab_order() would calculate more objects to fit, reducing waste below 1/2. Successfully finding an order in the loop (compared to the fallback) will also have the benefit in trying to satisfy min_objects, because the fallback was passing 1. Thus the resulting slab orders might be larger (not because it would improve waste, but to reduce pressure on shared locks), which is one of the goals of calculate_order(). For example, with nr_cpus=1 and 4kB PAGE_SIZE, slub_max_order=3, before the patch we would get the following orders for these object sizes: 2056 to 10920 - order-3 as selected by the loop 10928 to 12280 - order-2 due to fallback, as <1/4 waste is not possible 12288 to 32768 - order-3 as <1/4 waste is again possible After the patch: 2056 to 32768 - order-3, because even in the range of 10928 to 12280 we try to satisfy the calculated min_objects. As a result the code is simpler and gives more consistent results. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-and-tested-by: Jay Patel <jaypatel@linux.ibm.com>
2023-10-02mm/slub: remove min_objects loop from calculate_order()Vlastimil Babka
calculate_order() currently has two nested loops. The inner one that gradually modifies the acceptable waste from 1/16 up to 1/4, and the outer one that decreases min_objects down to 2. Upon closer inspection, the outer loop is unnecessary. Decreasing min_objects could have in theory two effects to make the inner loop and its call to calc_slab_order() succeed where a previous iteration with higher min_objects would not: - it could cause the min_objects-derived min_order to fit within slub_max_order. But min_objects is already pre-capped to max_objects that's derived from slub_max_order above the loops, so every iteration tries at least slub_max_order in calc_slab_order() - it could cause calc_slab_order() to be called with lower min_objects thus potentially lower min_order in its loop. This would make a difference if the lower order could cause the fractional waste test to succeed where a higher order has already failed with same fract_leftover in the previous iteration with a higher min_order. But that's not possible, because increasing the order can only result in lower (or same) fractional waste. If we increase the slab size 2 times, we will fit at least 2 times the number of objects (thus same fraction of waste), or it will allow us to fit one more object (lower fraction of waste). For more confidence I have tried adding a printk to notify when decreasing min_objects resulted in a success, and simulated calculations for a range of object sizes, nr_cpus and page_sizes. As expected, the printk never triggered. Thus remove the outer loop and adjust comments accordingly. There's almost no functional change except a weird corner case when slub_min_objects=1 on boot command line would cause the whole two nested loops to be skipped before this patch. Now it would try to find the best layout as usual, resulting in potentially higher orderthat minimizes waste. This is not wrong and will be further expanded by the next patch. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-and-tested-by: Jay Patel <jaypatel@linux.ibm.com>
2023-10-02mm/slub: simplify the last resort slab order calculationVlastimil Babka
If calculate_order() can't fit even a single large object within slub_max_order, it will try using the smallest necessary order that may exceed slub_max_order but not MAX_ORDER. Currently this is done with a call to calc_slab_order() which is unnecessary. We can simply use get_order(size). No functional change. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-and-tested-by: Jay Patel <jaypatel@linux.ibm.com>
2023-10-02mm/slub: add sanity check for slub_min/max_order cmdline setupFeng Tang
Currently there are 2 parameters could be setup from kernel cmdline: slub_min_order and slub_max_order. It's possible that the user configured slub_min_order is bigger than the default slub_max_order [1], which can still take effect, as calculate_oder() will use MAX_ORDER as a fallback to check against, but has some downsides: * the kernel message about SLUB will be strange in showing min/max orders: SLUB: HWalign=64, Order=9-3, MinObjects=0, CPUs=16, Nodes=1 * in calculate_order() called by each slab, the 2 loops of calc_slab_order() will all be meaningless due to slub_min_order is bigger than slub_max_order * prevent future code cleanup like in [2]. Fix it by adding some sanity check to enforce the min/max semantics. [1]. https://lore.kernel.org/lkml/21a0ba8b-bf05-0799-7c78-2a35f8c8d52a@os.amperecomputing.com/ [2]. https://lore.kernel.org/lkml/20230908145302.30320-7-vbabka@suse.cz/ Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14mm/slub: remove freelist_dereference()Vlastimil Babka
freelist_dereference() is a one-liner only used from get_freepointer(). Remove it and make get_freepointer() call freelist_ptr_decode() directly to make the code easier to follow. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kees Cook <keescook@chromium.org>
2023-07-14mm/slub: remove redundant kasan_reset_tag() from freelist_ptr calculationsVlastimil Babka
Commit d36a63a943e3 ("kasan, slub: fix more conflicts with CONFIG_SLAB_FREELIST_HARDENED") has introduced kasan_reset_tags() to freelist_ptr() encoding/decoding when CONFIG_SLAB_FREELIST_HARDENED is enabled to resolve issues when passing tagged or untagged pointers inconsistently would lead to incorrect calculations. Later, commit aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata") made sure all pointers have tags reset regardless of CONFIG_SLAB_FREELIST_HARDENED, because there was no other way to access the freepointer metadata safely with hw tag-based KASAN. Therefore the kasan_reset_tag() usage in freelist_ptr_encode()/decode() is now redundant, as all callers use kasan_reset_tag() unconditionally when constructing ptr_addr. Remove the redundant calls and simplify the code and remove obsolete comments. Also in freelist_ptr_encode() introduce an 'encoded' variable to make the lines shorter and make it similar to the _decode() one. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Kees Cook <keescook@chromium.org>
2023-07-11mm/slub: refactor freelist to use custom typeJann Horn
Currently the SLUB code represents encoded freelist entries as "void*". That's misleading, those things are encoded under CONFIG_SLAB_FREELIST_HARDENED so that they're not actually dereferencable. Give them their own type, and split freelist_ptr() into one function per direction (one for encoding, one for decoding). Signed-off-by: Jann Horn <jannh@google.com> Co-developed-by: Matteo Rizzo <matteorizzo@google.com> Signed-off-by: Matteo Rizzo <matteorizzo@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-06-29Merge tag 'slab-for-6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - SLAB deprecation: Following the discussion at LSF/MM 2023 [1] and no objections, the SLAB allocator is deprecated by renaming the config option (to make its users notice) to CONFIG_SLAB_DEPRECATED with updated help text. SLUB should be used instead. Existing defconfigs with CONFIG_SLAB are also updated. - SLAB_NO_MERGE kmem_cache flag (Jesper Dangaard Brouer): There are (very limited) cases where kmem_cache merging is undesirable, and existing ways to prevent it are hacky. Introduce a new flag to do that cleanly and convert the existing hacky users. Btrfs plans to use this for debug kernel builds (that use case is always fine), networking for performance reasons (that should be very rare). - Replace the usage of weak PRNGs (David Keisar Schmidt): In addition to using stronger RNGs for the security related features, the code is a bit cleaner. - Misc code cleanups (SeongJae Parki, Xiongwei Song, Zhen Lei, and zhaoxinchao) Link: https://lwn.net/Articles/932201/ [1] * tag 'slab-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab_common: use SLAB_NO_MERGE instead of negative refcount mm/slab: break up RCU readers on SLAB_TYPESAFE_BY_RCU example code mm/slab: add a missing semicolon on SLAB_TYPESAFE_BY_RCU example code mm/slab_common: reduce an if statement in create_cache() mm/slab: introduce kmem_cache flag SLAB_NO_MERGE mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED mm/slab: remove HAVE_HARDENED_USERCOPY_ALLOCATOR mm/slab_common: Replace invocation of weak PRNG mm/slab: Replace invocation of weak PRNG slub: Don't read nr_slabs and total_objects directly slub: Remove slabs_node() function slub: Remove CONFIG_SMP defined check slub: Put objects_show() into CONFIG_SLUB_DEBUG enabled block slub: Correct the error code when slab_kset is NULL mm/slab: correct return values in comment for _kmem_cache_create()
2023-06-05slub: Replace cmpxchg_double()Peter Zijlstra
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.924677086@infradead.org
2023-05-22slub: Don't read nr_slabs and total_objects directlyXiongwei Song
We have node_nr_slabs() to read nr_slabs, node_nr_objs() to read total_objects in a kmem_cache_node, so no need to access the two members directly. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-05-22slub: Remove slabs_node() functionXiongwei Song
When traversing nodes one by one, the get_node() function called in for_each_kmem_cache_node macro, no need to call get_node() again in slabs_node(), just reading nr_slabs field should be enough. However, the node_nr_slabs() function can do this. Hence, the slabs_node() function is not needed anymore. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-05-22slub: Remove CONFIG_SMP defined checkXiongwei Song
As CONFIG_SMP is one of dependencies of CONFIG_SLUB_CPU_PARTIAL, so if CONFIG_SLUB_CPU_PARTIAL is defined then CONFIG_SMP must be defined, no need to check CONFIG_SMP definition here. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-05-22slub: Put objects_show() into CONFIG_SLUB_DEBUG enabled blockXiongwei Song
The SO_ALL|SO_OBJECTS pair is only used when enabling CONFIG_SLUB_DEBUG option, so the objects_show() definition should be surrounded by CONFIG_SLUB_DEBUG too. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-05-22slub: Correct the error code when slab_kset is NULLXiongwei Song
The -ENOSYS is inproper when kset_create_and_add call returns a NULL pointer, the failure more likely is because lacking memory, hence returning -ENOMEM is better. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-18mm: vmscan: refactor updating current->reclaim_stateYosry Ahmed
During reclaim, we keep track of pages reclaimed from other means than LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab, which we stash a pointer to in current task_struct. However, we keep track of more than just reclaimed slab pages through this. We also use it for clean file pages dropped through pruned inodes, and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add a helper function that wraps updating it through current, so that future changes to this logic are contained within include/linux/swap.h. Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm, treewide: redefine MAX_ORDER sanelyKirill A. Shutemov
MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm/slub: fix MAX_ORDER usage in calculate_order()Kirill A. Shutemov
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator can deliver is MAX_ORDER-1. Fix MAX_ORDER usage in calculate_order(). Link: https://lkml.kernel.org/r/20230315113133.11326-9-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-13mm: slub: make kobj_type structure constantThomas Weißschuh
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-21Merge branch 'slab/for-6.3/fixes' into slab/for-linusVlastimil Babka
Two fixes for SLAB and SLUB - Make it possible to use kmem_cache_alloc_bulk() early in boot when interrupts are not yet enabled, as code doing that start to appear via the maple tree (by Thomas Gleixner). - Fix debugfs-related memory leak (by Greg Kroah-Hartman).
2023-02-16lib/stackdepot, mm: rename stack_depot_want_early_initAndrey Konovalov
Rename stack_depot_want_early_init to stack_depot_request_early_init. The old name is confusing, as it hints at returning some kind of intention of stack depot. The new name reflects that this function requests an action from stack depot instead. No functional changes. [akpm@linux-foundation.org: update mm/kmemleak.c] Link: https://lkml.kernel.org/r/359f31bf67429a06e630b4395816a967214ef753.1676063693.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-08mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available earlyThomas Gleixner
The memory allocators are available during early boot even in the phase where interrupts are disabled and scheduling is not yet possible. The setup is so that GFP_KERNEL allocations work in this phase without causing might_alloc() splats to be emitted because the system state is SYSTEM_BOOTING at that point which prevents the warnings to trigger. Most allocation/free functions use local_irq_save()/restore() or a lock variant of that. But kmem_cache_alloc_bulk() and kmem_cache_free_bulk() use local_[lock]_irq_disable()/enable(), which leads to a lockdep warning when interrupts are enabled during the early boot phase. This went unnoticed so far as there are no early users of these interfaces. The upcoming conversion of the interrupt descriptor store from radix_tree to maple_tree triggered this warning as maple_tree uses the bulk interface. Cure this by moving the kmem_cache_alloc/free() bulk variants of SLUB and SLAB to local[_lock]_irq_save()/restore(). There is obviously no reclaim possible and required at this point so there is no need to expand this coverage further. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-02-06mm/slub: fix memory leak with using debugfs_lookup()Greg Kroah-Hartman
When calling debugfs_lookup() the result must have dput() called on it, otherwise the memory will leak over time. To make things simpler, just call debugfs_lookup_and_remove() instead which handles all of the logic at once. Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-01-18mm: introduce folio_is_pfmemallocSidhartha Kumar
Add a folio equivalent for page_is_pfmemalloc. This removes two instances of page_is_pfmemalloc(folio_page(folio, 0)) so the folio can be used directly. Link: https://lkml.kernel.org/r/20230106215251.599222-1-sidhartha.kumar@oracle.com Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-13mm/sl{a,u}b: fix wrong usages of folio_page() for getting head pagesSeongJae Park
The standard idiom for getting head page of a given folio is '&folio->page', but some are wrongly using 'folio_page(folio, 0)' for the purpose. Fix those to use the idiom. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-12-13Merge tag 'mm-stable-2022-12-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfs work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ...
2022-12-12Merge tag 'linux-kselftest-kunit-next-6.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit updates from Shuah Khan: "Several enhancements, fixes, clean-ups, documentation updates, improvements to logging and KTAP compliance of KUnit test output: - log numbers in decimal and hex - parse KTAP compliant test output - allow conditionally exposing static symbols to tests when KUNIT is enabled - make static symbols visible during kunit testing - clean-ups to remove unused structure definition" * tag 'linux-kselftest-kunit-next-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (29 commits) Documentation: dev-tools: Clarify requirements for result description apparmor: test: make static symbols visible during kunit testing kunit: add macro to allow conditionally exposing static symbols to tests kunit: tool: make parser preserve whitespace when printing test log Documentation: kunit: Fix "How Do I Use This" / "Next Steps" sections kunit: tool: don't include KTAP headers and the like in the test log kunit: improve KTAP compliance of KUnit test output kunit: tool: parse KTAP compliant test output mm: slub: test: Use the kunit_get_current_test() function kunit: Use the static key when retrieving the current test kunit: Provide a static key to check if KUnit is actively running tests kunit: tool: make --json do nothing if --raw_ouput is set kunit: tool: tweak error message when no KTAP found kunit: remove KUNIT_INIT_MEM_ASSERTION macro Documentation: kunit: Remove redundant 'tips.rst' page Documentation: KUnit: reword description of assertions Documentation: KUnit: make usage.rst a superset of tips.rst, remove duplication kunit: eliminate KUNIT_INIT_*_ASSERT_STRUCT macros kunit: tool: remove redundant file.close() call in unit test kunit: tool: unit tests all check parser errors, standardize formatting a bit ...
2022-12-12Merge tag 'random-6.2-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random*_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. * tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...
2022-12-12mm: slub: test: Use the kunit_get_current_test() functionDavid Gow
Use the newly-added function kunit_get_current_test() instead of accessing current->kunit_test directly. This function uses a static key to return more quickly when KUnit is enabled, but no tests are actively running. There should therefore be a negligible performance impact to enabling the slub KUnit tests. Other than the performance improvement, this should be a no-op. Cc: Oliver Glitta <glittao@gmail.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Gow <davidgow@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-12-01Merge branch 'slub-tiny-v1r6' into slab/for-nextVlastimil Babka
Merge my series [1] to deprecate the SLOB allocator. - Renames CONFIG_SLOB to CONFIG_SLOB_DEPRECATED with deprecation notice. - The recommended replacement is CONFIG_SLUB, optionally with the new CONFIG_SLUB_TINY tweaks for systems with 16MB or less RAM. - Use cases that stopped working with CONFIG_SLUB_TINY instead of SLOB should be reported to linux-mm@kvack.org and slab maintainers, otherwise SLOB will be removed in few cycles. [1] https://lore.kernel.org/all/20221121171202.22080-1-vbabka@suse.cz/
2022-12-01mm, slub: don't aggressively inline with CONFIG_SLUB_TINYVlastimil Babka
SLUB fastpaths use __always_inline to avoid function calls. With CONFIG_SLUB_TINY we would rather save the memory. Add a __fastpath_inline macro that's __always_inline normally but empty with CONFIG_SLUB_TINY. bloat-o-meter results on x86_64 mm/slub.o: add/remove: 3/1 grow/shrink: 1/8 up/down: 865/-1784 (-919) Function old new delta kmem_cache_free 20 281 +261 slab_alloc_node.isra - 245 +245 slab_free.constprop.isra - 231 +231 __kmem_cache_alloc_lru.isra - 128 +128 __kmem_cache_release 88 83 -5 __kmem_cache_create 1446 1436 -10 __kmem_cache_free 271 142 -129 kmem_cache_alloc_node 330 127 -203 kmem_cache_free_bulk.part 826 613 -213 __kmem_cache_alloc_node 230 10 -220 kmem_cache_alloc_lru 325 12 -313 kmem_cache_alloc 325 10 -315 kmem_cache_free.part 376 - -376 Total: Before=26103, After=25184, chg -3.52% Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
2022-12-01mm, slub: remove percpu slabs with CONFIG_SLUB_TINYVlastimil Babka
SLUB gets most of its scalability by percpu slabs. However for CONFIG_SLUB_TINY the goal is minimal memory overhead, not scalability. Thus, #ifdef out the whole kmem_cache_cpu percpu structure and associated code. Additionally to the slab page savings, this reduces percpu allocator usage, and code size. This change builds on recent commit c7323a5ad078 ("mm/slub: restrict sysfs validation to debug caches and make it safe"), as caches with enabled debugging also avoid percpu slabs and all allocations and freeing ends up working with the partial list. With a bit more refactoring by the preceding patches, use the same code paths with CONFIG_SLUB_TINY. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
2022-12-01mm, slub: split out allocations from pre/post hooksVlastimil Babka
In the following patch we want to introduce CONFIG_SLUB_TINY allocation paths that don't use the percpu slab. To prepare, refactor the allocation functions: Split out __slab_alloc_node() from slab_alloc_node() where the former does the actual allocation and the latter calls the pre/post hooks. Analogically, split out __kmem_cache_alloc_bulk() from kmem_cache_alloc_bulk(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
2022-11-27mm, slub: refactor free debug processingVlastimil Babka
Since commit c7323a5ad078 ("mm/slub: restrict sysfs validation to debug caches and make it safe"), caches with debugging enabled use the free_debug_processing() function to do both freeing checks and actual freeing to partial list under list_lock, bypassing the fast paths. We will want to use the same path for CONFIG_SLUB_TINY, but without the debugging checks, so refactor the code so that free_debug_processing() does only the checks, while the freeing is handled by a new function free_to_partial_list(). For consistency, change return parameter alloc_debug_processing() from int to bool and correct the !SLUB_DEBUG variant to return true and not false. This didn't matter until now, but will in the following changes. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
2022-11-27mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINYVlastimil Babka
With CONFIG_SLUB_TINY we want to minimize memory overhead. By lowering the default slub_max_order we can make slab allocations use smaller pages. However depending on object sizes, order-0 might not be the best due to increased fragmentation. When testing on a 8MB RAM k210 system by Damien Le Moal [1], slub_max_order=1 had the best results, so use that as the default for CONFIG_SLUB_TINY. [1] https://lore.kernel.org/all/6a1883c4-4c3f-545a-90e8-2cd805bcf4ae@opensource.wdc.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
2022-11-27mm, slub: retain no free slabs on partial list with CONFIG_SLUB_TINYVlastimil Babka
SLUB will leave a number of slabs on the partial list even if they are empty, to avoid some slab freeing and reallocation. The goal of CONFIG_SLUB_TINY is to minimize memory overhead, so set the limits to 0 for immediate slab page freeing. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
2022-11-27mm, slub: disable SYSFS support with CONFIG_SLUB_TINYVlastimil Babka
Currently SLUB enables its sysfs support depending unconditionally on the general CONFIG_SYSFS setting. To reduce the configuration combination space, make CONFIG_SLUB_TINY disable SLUB's sysfs support by reusing the existing SLAB_SUPPORTS_SYSFS define. It is unlikely that real tiny systems would combine CONFIG_SLUB_TINY with CONFIG_SYSFS, but a randconfig might. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
2022-11-27mm, slab: ignore hardened usercopy parameters when disabledVlastimil Babka
With CONFIG_HARDENED_USERCOPY not enabled, there are no __check_heap_object() checks happening that would use the struct kmem_cache useroffset and usersize fields. Yet the fields are still initialized, preventing merging of otherwise compatible caches. Also the fields contribute to struct kmem_cache size unnecessarily when unused. Thus #ifdef them out completely when CONFIG_HARDENED_USERCOPY is disabled. In kmem_dump_obj() print object_size instead of usersize, as that's actually the intention. In a quick virtme boot test, this has reduced the number of caches in /proc/slabinfo from 131 to 111. Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
2022-11-21Merge branch 'slab/for-6.2/kmalloc_redzone' into slab/for-nextVlastimil Babka
kmalloc() redzone improvements by Feng Tang From cover letter [1]: kmalloc's API family is critical for mm, and one of its nature is that it will round up the request size to a fixed one (mostly power of 2). When user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes could be allocated, so there is an extra space than what is originally requested. This patchset tries to extend the redzone sanity check to the extra kmalloced buffer than requested, to better detect un-legitimate access to it. (depends on SLAB_STORE_USER & SLAB_RED_ZONE) [1] https://lore.kernel.org/all/20221021032405.1825078-1-feng.tang@intel.com/
2022-11-21Merge branch 'slab/for-6.2/fit_rcu_head' into slab/for-nextVlastimil Babka
A series by myself to reorder fields in struct slab to allow the embedded rcu_head to grow (for debugging purposes). Requires changes to isolate_movable_page() to skip slab pages which can otherwise become false-positive __PageMovable due to its use of low bits in page->mapping.
2022-11-21Merge branch 'slab/for-6.2/slub-sysfs' into slab/for-nextVlastimil Babka
- Two patches for SLUB's sysfs by Rasmus Villemoes to remove dead code and optimize boot time with late initialization. - Allow SLUB's sysfs 'failslab' parameter to be runtime-controllable again as it can be both useful and safe, by Alexander Atanasov.
2022-11-21mm/migrate: make isolate_movable_page() skip slab pagesVlastimil Babka
In the next commit we want to rearrange struct slab fields to allow a larger rcu_head. Afterwards, the page->mapping field will overlap with SLUB's "struct list_head slab_list", where the value of prev pointer can become LIST_POISON2, which is 0x122 + POISON_POINTER_DELTA. Unfortunately the bit 1 being set can confuse PageMovable() to be a false positive and cause a GPF as reported by lkp [1]. To fix this, make isolate_movable_page() skip pages with the PageSlab flag set. This is a bit tricky as we need to add memory barriers to SLAB and SLUB's page allocation and freeing, and their counterparts to isolate_movable_page(). Based on my RFC from [2]. Added a comment update from Matthew's variant in [3] and, as done there, moved the PageSlab checks to happen before trying to take the page lock. [1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/ [2] https://lore.kernel.org/all/aec59f53-0e53-1736-5932-25407125d4d4@suse.cz/ [3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/ Reported-by: kernel test robot <yujie.liu@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
2022-11-21mm/slub, percpu: correct the calculation of early percpu allocation sizeBaoquan He
SLUB allocator relies on percpu allocator to initialize its ->cpu_slab during early boot. For that, the dynamic chunk of percpu which serves the early allocation need be large enough to satisfy the kmalloc creation. However, the current BUILD_BUG_ON() in alloc_kmem_cache_cpus() doesn't consider the kmalloc array with NR_KMALLOC_TYPES length. Fix that with correct calculation. Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-18treewide: use get_random_u32_below() instead of deprecated functionJason A. Donenfeld
This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-11mm/slub: extend redzone check to extra allocated kmalloc space than requestedFeng Tang
kmalloc will round up the request size to a fixed size (mostly power of 2), so there could be a extra space than what is requested, whose size is the actual buffer size minus original request size. To better detect out of bound access or abuse of this space, add redzone sanity check for it. In current kernel, some kmalloc user already knows the existence of the space and utilizes it after calling 'ksize()' to know the real size of the allocated buffer. So we skip the sanity check for objects which have been called with ksize(), as treating them as legitimate users. Kees Cook is working on sanitizing all these user cases, by using kmalloc_size_roundup() to avoid ambiguous usages. And after this is done, this special handling for ksize() can be removed. In some cases, the free pointer could be saved inside the latter part of object data area, which may overlap the redzone part(for small sizes of kmalloc objects). As suggested by Hyeonggon Yoo, force the free pointer to be in meta data area when kmalloc redzone debug is enabled, to make all kmalloc objects covered by redzone check. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-10mm: kasan: Extend kasan_metadata_size() to also cover in-object sizeFeng Tang
When kasan is enabled for slab/slub, it may save kasan' free_meta data in the former part of slab object data area in slab object's free path, which works fine. There is ongoing effort to extend slub's debug function which will redzone the latter part of kmalloc object area, and when both of the debug are enabled, there is possible conflict, especially when the kmalloc object has small size, as caught by 0Day bot [1]. To solve it, slub code needs to know the in-object kasan's meta data size. Currently, there is existing kasan_metadata_size() which returns the kasan's metadata size inside slub's metadata area, so extend it to also cover the in-object meta size by adding a boolean flag 'in_object'. There is no functional change to existing code logic. [1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/ Reported-by: kernel test robot <oliver.sang@intel.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-10mm/slub: only zero requested size of buffer for kzalloc when debug enabledFeng Tang
kzalloc/kmalloc will round up the request size to a fixed size (mostly power of 2), so the allocated memory could be more than requested. Currently kzalloc family APIs will zero all the allocated memory. To detect out-of-bound usage of the extra allocated memory, only zero the requested part, so that redzone sanity check could be added to the extra space later. For kzalloc users who will call ksize() later and utilize this extra space, please be aware that the space is not zeroed any more when debug is enabled. (Thanks to Kees Cook's effort to sanitize all ksize() user cases [1], this won't be a big issue). [1]. https://lore.kernel.org/all/20220922031013.2150682-1-keescook@chromium.org/#r Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-08mm/slub.c: use hotplug_memory_notifier() directlyLiu Shixin
Commit 76ae847497bc52 ("Documentation: raise minimum supported version of GCC to 5.1") updated the minimum gcc version to 5.1. So the problem mentioned in f02c69680088 ("include/linux/memory.h: implement register_hotmemory_notifier()") no longer exist. So we can now switch to use hotplug_memory_notifier() directly rather than register_hotmemory_notifier(). Link: https://lkml.kernel.org/r/20220923033347.3935160-4-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>