Commit log for path: root/lib
2023-12-10  lib/stackdepot: store next pool pointer in new_pool  (Andrey Konovalov)
Instead of using the last pointer in stack_pools for storing the pointer to a new pool (which does not yet store any stack records), use a new new_pool variable. This is a purely code readability change: it seems more logical to store the pointer to a pool with a special meaning in a dedicated variable. Link: https://lkml.kernel.org/r/448bc18296c16bef95cb3167697be6583dcc8ce3.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  lib/stackdepot: rename next_pool_required to new_pool_required  (Andrey Konovalov)
Rename next_pool_required to new_pool_required. This is a purely code readability change: the following patch will change stack depot to store the pointer to the new pool in a separate variable, and "new" seems like a more logical name. Link: https://lkml.kernel.org/r/fd7cd6c6eb250c13ec5d2009d75bb4ddd1470db9.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  lib/stackdepot: rework helpers for depot_alloc_stack  (Andrey Konovalov)
Split code in depot_alloc_stack and depot_init_pool into 3 functions: 1. depot_keep_next_pool that keeps preallocated memory for the next pool if required. 2. depot_update_pools that moves on to the next pool if there's no space left in the current pool, uses preallocated memory for the new current pool if required, and calls depot_keep_next_pool otherwise. 3. depot_alloc_stack that calls depot_update_pools and then allocates a stack record as before. This makes it somewhat easier to follow the logic of depot_alloc_stack and also serves as a preparation for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/71fb144d42b701fcb46708d7f4be6801a4a8270e.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
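For orientation, the split described above looks roughly like this (a simplified sketch based only on the commit description; the real code in lib/stackdepot.c holds the depot lock and handles more corner cases):

    static void *current_pool, *next_pool;
    static size_t pool_offset;
    static bool next_pool_required = true;

    /* 1. Keep preallocated memory for the next pool if required. */
    static void depot_keep_next_pool(void **prealloc)
    {
            if (next_pool_required && *prealloc) {
                    next_pool = *prealloc;
                    *prealloc = NULL;
                    next_pool_required = false;
            }
    }

    /* 2. Move on to the next pool if no space is left in the current one,
     *    using the preallocated memory for the new pool if required. */
    static bool depot_update_pools(size_t required, void **prealloc)
    {
            if (pool_offset + required > DEPOT_POOL_SIZE) {
                    if (!next_pool && *prealloc) {
                            next_pool = *prealloc;
                            *prealloc = NULL;
                    }
                    if (!next_pool)
                            return false;
                    current_pool = next_pool;
                    next_pool = NULL;
                    pool_offset = 0;
                    next_pool_required = true;
            } else {
                    depot_keep_next_pool(prealloc);
            }
            return true;
    }

    /* 3. depot_alloc_stack() calls depot_update_pools() and then carves
     *    the stack record out of current_pool, as before. */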
2023-12-10  lib/stackdepot: fix and clean-up atomic annotations  (Andrey Konovalov)
Drop smp_load_acquire from next_pool_required in depot_init_pool, as both depot_init_pool and all the smp_store_release's to this variable are executed under the stack depot lock. Also simplify and clean up comments accompanying the use of atomic accesses in the stack depot code. Link: https://lkml.kernel.org/r/c118ef044d8db80248d9e1f14592c72e8429e9d9.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
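In effect (a schematic before/after; the surrounding depot lock is implied):

    /* Before: acquire semantics although every access is serialized. */
    if (!smp_load_acquire(&next_pool_required))
            return;

    /* After: the stack depot lock is already held here, so a plain
     * read of the flag is sufficient. */
    if (!next_pool_required)
            return;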
2023-12-10  lib/stackdepot: use fixed-sized slots for stack records  (Andrey Konovalov)
Instead of storing stack records in stack depot pools one right after another, use fixed-sized slots. Add a new Kconfig option STACKDEPOT_MAX_FRAMES that allows selecting the size of the slot in frames. Use 64 as the default value, which is the maximum stack trace size both KASAN and KMSAN use right now. Also add descriptions for other stack depot Kconfig options. This is a preparatory patch for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/dce7d030a99ff61022509665187fac45b0827298.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
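The record layout after this change is roughly the following (a sketch; see the upstream struct stack_record for the exact fields):

    struct stack_record {
            struct stack_record *next;      /* hash table chaining */
            u32 hash;                       /* hash of the stack trace */
            u32 size;                       /* frames actually stored */
            union handle_parts handle;
            /* fixed-size slot instead of a flexible array: */
            unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES];
    };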
2023-12-10  lib/stackdepot: add depot_fetch_stack helper  (Andrey Konovalov)
Add a depot_fetch_stack helper function that fetches the pointer to a stack record. With this change, all static depot_* functions now operate on stack pools and the exported stack_depot_* functions operate on the hash table. Link: https://lkml.kernel.org/r/170d8c202f29dc8e3d5491ee074d1e9e029a46db.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  lib/stackdepot: drop valid bit from handles  (Andrey Konovalov)
Stack depot doesn't use the valid bit in handles in any way, so drop it. Link: https://lkml.kernel.org/r/34969bba2ca6e012c6ad071767197dee64dc5723.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  lib/stackdepot: simplify __stack_depot_save  (Andrey Konovalov)
The retval local variable in __stack_depot_save has the union type handle_parts, but the function never uses anything but the union's handle field. Define retval simply as depot_stack_handle_t to simplify the code. Link: https://lkml.kernel.org/r/3b0763c8057a1cf2f200ff250a5f9580ee36a28c.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  lib/stackdepot: check disabled flag when fetching  (Andrey Konovalov)
Do not try fetching a stack trace from the stack depot if the stack_depot_disabled flag is enabled. Link: https://lkml.kernel.org/r/c3bfa3b7ab00b2e48ab75a3fbb9c67555777cb08.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  lib/stackdepot: print disabled message only if truly disabled  (Andrey Konovalov)
Patch series "stackdepot: allow evicting stack traces", v4. Currently, the stack depot grows indefinitely until it reaches its capacity. Once that happens, the stack depot stops saving new stack traces. This creates a problem for using the stack depot for in-field testing and in production. For such uses, an ideal stack trace storage should: 1. Allow saving fresh stack traces on systems with a large uptime while limiting the amount of memory used to store the traces; 2. Have a low performance impact. Implementing #1 in the stack depot is impossible with the current keep-forever approach. This series aims to address that. Issue #2 is left to be addressed in a future series. This series changes the stack depot implementation to allow evicting unneeded stack traces from the stack depot. The users of the stack depot can do that via new stack_depot_save_flags(STACK_DEPOT_FLAG_GET) and stack_depot_put APIs. Internal changes to the stack depot code include: 1. Storing stack traces in fixed-frame-sized slots (vs precisely-sized slots in the current implementation); the slot size is controlled via CONFIG_STACKDEPOT_MAX_FRAMES (default: 64 frames); 2. Keeping available slots in a freelist (vs keeping an offset to the next free slot); 3. Using a read/write lock for synchronization (vs a lock-free approach combined with a spinlock). This series also integrates the eviction functionality into KASAN: the tag-based modes evict stack traces when the corresponding entry leaves the stack ring, and Generic KASAN evicts stack traces for objects once those leave the quarantine. With KASAN, despite wasting some space on rounding up the size of each stack record, the total memory consumed by stack depot gets saturated due to the eviction of irrelevant stack traces from the stack depot. With the tag-based KASAN modes, the average total amount of memory used for stack traces becomes ~0.5 MB (with the current default stack ring size of 32k entries and the default CONFIG_STACKDEPOT_MAX_FRAMES of 64). With Generic KASAN, the stack traces take up ~1 MB per 1 GB of RAM (as the quarantine's size depends on the amount of RAM). However, with KMSAN, the stack depot ends up using ~4x more memory per stack trace than before. Thus, for KMSAN, the stack depot capacity is increased accordingly. KMSAN uses a lot of RAM for shadow memory anyway, so the increased stack depot memory usage will not make a significant difference. Other users of the stack depot do not save stack traces as often as KASAN and KMSAN. Thus, the increased memory usage is taken as an acceptable trade-off. In the future, these other users can take advantage of the eviction API to limit the memory waste. There is no measurable boot time performance impact of these changes for KASAN on x86-64. I haven't done any tests for arm64 modes (the stack depot without performance optimizations is not suitable for the intended use of those anyway), but I expect a similar result. Obtaining and copying stack trace frames when saving them into stack depot is what takes the most time. This series does not yet provide a way to configure the maximum size of the stack depot externally (e.g. via a command-line parameter). This will be added in a separate series, possibly together with the performance improvement changes. This patch (of 22): Currently, if stack_depot_disable=off is passed on the kernel command line after stack_depot_disable=on, stack depot prints a message that it is disabled, while it is actually enabled.
Fix this by moving the printing of the disabled message to stack_depot_early_init. Place it before the __stack_depot_early_init_requested check, so that the message is printed even if early stack depot init has not been requested. Also drop the stack_table = NULL assignment from disable_stack_depot, as stack_table is NULL by default. Link: https://lkml.kernel.org/r/cover.1700502145.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/73a25c5fff29f3357cd7a9330e85e09bc8da2cbe.1700502145.git.andreyknvl@google.com Fixes: e1fdc403349c ("lib: stackdepot: add support to disable stack depot") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  kasan: default to inline instrumentation  (Paul Heidekrüger)
KASan inline instrumentation can yield up to a 2x performance gain at the cost of a larger binary. Make inline instrumentation the default, as suggested in the bug report below. When an architecture does not support inline instrumentation, it should set ARCH_DISABLE_KASAN_INLINE, as done by PowerPC, for instance. Link: https://lkml.kernel.org/r/20231109155101.186028-1-paul.heidekrueger@tum.de Signed-off-by: Paul Heidekrüger <paul.heidekrueger@tum.de> Reported-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=203495 Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
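For context, the two modes differ in roughly the code the compiler emits around a memory access, e.g. an 8-byte load (schematic, not actual compiler output):

    /* Outline (CONFIG_KASAN_OUTLINE): every access becomes a call. */
    __asan_load8(addr);
    val = *(u64 *)addr;

    /* Inline (CONFIG_KASAN_INLINE): the shadow check is expanded at the
     * call site - a larger binary, but no function call overhead. */
    s8 shadow = *(s8 *)((addr >> 3) + KASAN_SHADOW_OFFSET);
    if (unlikely(shadow))
            __asan_report_load8_noabort(addr);
    val = *(u64 *)addr;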
2023-12-10  maple_tree: preserve the tree attributes when destroying maple tree  (Peng Zhang)
When destroying maple tree, preserve its attributes and then turn it into an empty tree. This allows it to be reused without needing to be reinitialized. Link: https://lkml.kernel.org/r/20231027033845.90608-10-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  maple_tree: update check_forking() and bench_forking()  (Peng Zhang)
Updated check_forking() and bench_forking() to use __mt_dup() to duplicate the maple tree. Link: https://lkml.kernel.org/r/20231027033845.90608-9-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  maple_tree: skip other tests when BENCH is enabled  (Peng Zhang)
Skip other tests when BENCH is enabled so that performance can be measured in user space. Link: https://lkml.kernel.org/r/20231027033845.90608-8-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10  maple_tree: introduce interfaces __mt_dup() and mtree_dup()  (Peng Zhang)
Introduce interfaces __mt_dup() and mtree_dup(), which are used to duplicate a maple tree. They duplicate a maple tree in Depth-First Search (DFS) pre-order traversal, using memcpy() to copy nodes in the source tree and allocating new child nodes for non-leaf nodes. The new node is exactly the same as the source node except for all the addresses stored in it. It will be faster than traversing all elements in the source tree and inserting them one by one into the new tree. The time complexity of these two functions is O(n). The difference between __mt_dup() and mtree_dup() is that mtree_dup() handles locks internally. Analysis of the average time complexity of this algorithm: For simplicity, let's assume that the maximum branching factor of all non-leaf nodes is 16 (in allocation mode, it is 10), and the tree is a full tree. Under the given conditions, if there is a maple tree with n elements, the number of its leaves is n/16. From bottom to top, the number of nodes in each level is 1/16 of the number of nodes in the level below. So the total number of nodes in the entire tree is given by the sum of n/16 + n/16^2 + n/16^3 + ... + 1. This is a geometric series, and it has log(n) terms with base 16. According to the formula for the sum of a geometric series, the sum of this series can be calculated as (n-1)/15. Each node has only one parent node pointer, which can be considered as an edge. In total, there are (n-1)/15-1 edges. This algorithm consists of two operations: 1. Traversing all nodes in DFS order. 2. For each node, making a copy and performing necessary modifications to create a new node. For the first part, DFS traversal will visit each edge twice. Let T(descend) represent the cost of taking one step downwards, and T(ascend) represent the cost of taking one step upwards. And both of them are constants (although mas_ascend() may not be, as it contains a loop, but here we ignore it and treat it as a constant). So the time spent on the first part can be represented as ((n-1)/15-1) * (T(ascend) + T(descend)). For the second part, each node will be copied, and the cost of copying a node is denoted as T(copy_node). For each non-leaf node, it is necessary to reallocate all child nodes, and the cost of this operation is denoted as T(dup_alloc). The behavior behind memory allocation is complex and not specific to the maple tree operation. Here, we assume that the time required for a single allocation is constant. Since the size of a node is fixed, both of these symbols are also constants. We can calculate that the time spent on the second part is ((n-1)/15) * T(copy_node) + ((n-1)/15 - n/16) * T(dup_alloc). Adding both parts together, the total time spent by the algorithm can be represented as: ((n-1)/15) * (T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)) - n/16 * T(dup_alloc) - (T(ascend) + T(descend)) Let C1 = T(ascend) + T(descend) + T(copy_node) + T(dup_alloc) Let C2 = T(dup_alloc) Let C3 = T(ascend) + T(descend) Finally, the expression can be simplified as: ((16 * C1 - 15 * C2) / (15 * 16)) * n - (C1 / 15 + C3). This is a linear function, so the average time complexity is O(n). Link: https://lkml.kernel.org/r/20231027033845.90608-4-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Suggested-by: Liam R.
Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
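A sketch of how a caller might use the two interfaces (illustrative only; signatures per this series, error handling elided, and source_tree is assumed to exist):

    struct maple_tree new_tree;
    int ret;

    /* mtree_dup() handles the locking of both trees internally. */
    ret = mtree_dup(&source_tree, &new_tree, GFP_KERNEL);

    /* __mt_dup() leaves locking to the caller, e.g. fork() duplicating
     * mm_mt while already holding the relevant locks. */
    mt_init_flags(&new_tree, mt_attr(&source_tree));
    ret = __mt_dup(&source_tree, &new_tree, GFP_KERNEL);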
2023-12-10  maple_tree: add mt_free_one() and mt_attr() helpers  (Peng Zhang)
Patch series "Introduce __mt_dup() to improve the performance of fork()", v7. This series introduces __mt_dup() to improve the performance of fork(). During the duplication process of mmap, all VMAs are traversed and inserted one by one into the new maple tree, causing the maple tree to be rebalanced multiple times. Balancing the maple tree is a costly operation. To duplicate VMAs more efficiently, mtree_dup() and __mt_dup() are introduced for the maple tree. They can efficiently duplicate a maple tree. Here are some algorithmic details about {mtree,__mt}_dup(). We perform a DFS pre-order traversal of all nodes in the source maple tree. During this process, we fully copy the nodes from the source tree to the new tree. This involves memory allocation, and when encountering a new node, if it is a non-leaf node, all its child nodes are allocated at once. This idea was originally from Liam R. Howlett's Maple Tree Work email, and I added some of my own ideas to implement it. Some previous discussions can be found in [1]. For a more detailed analysis of the algorithm, please refer to the logs for patch [3/10] and patch [10/10]. There is a "spawn" in byte-unixbench[2], which can be used to test the performance of fork(). I modified it slightly to make it work with different numbers of VMAs. Below are the test results. The first row shows the number of VMAs. The second and third rows show the number of fork() calls per ten seconds, corresponding to next-20231006 and this patchset, respectively; the fourth row shows the relative improvement. The test results were obtained with CPU binding to avoid scheduler load balancing that could cause unstable results. There are still some fluctuations in the test results, but at least they are better than the original performance.

VMAs:               21     121     221     421     821    1621    3221    6421   12821   25621   51221
next-20231006:  112100   76261   54227   34035   20195   11112    6017    3161    1606     802     393
this patchset:  114558   83067   65008   45824   28751   16072    8922    4747    2436    1233     599
improvement:     2.19%   8.92%  19.88%  34.64%  42.37%  44.64%  48.28%  50.17%  51.68%  53.74%  52.42%

Thanks to Liam and Matthew for the review. This patch (of 10): Add two helpers: 1. mt_free_one(), used to free a maple node. 2. mt_attr(), used to obtain the attributes of a maple tree. Link: https://lkml.kernel.org/r/20231027033845.90608-1-zhangpeng.00@bytedance.com Link: https://lkml.kernel.org/r/20231027033845.90608-2-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-08  Merge tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull misc fixes from Andrew Morton: "31 hotfixes. Ten of these address pre-6.6 issues and are marked cc:stable. The remainder address post-6.6 issues or aren't considered serious enough to justify backporting" * tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits) mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range() nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage() mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI scripts/gdb: fix lx-device-list-bus and lx-device-list-class MAINTAINERS: drop Antti Palosaari highmem: fix a memory copy problem in memcpy_from_folio nilfs2: fix missing error check for sb_set_blocksize call kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP units: add missing header drivers/base/cpu: crash data showing should depends on KEXEC_CORE mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions scripts/gdb/tasks: fix lx-ps command error mm/Kconfig: make userfaultfd a menuconfig selftests/mm: prevent duplicate runs caused by TEST_GEN_PROGS mm/damon/core: copy nr_accesses when splitting region lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly checkstack: fix printed address mm/memory_hotplug: fix error handling in add_memory_resource() mm/memory_hotplug: add missing mem_hotplug_lock .mailmap: add a new address mapping for Chester Lin ...
2023-12-06  Merge branch 'master' into mm-hotfixes-stable  (Andrew Morton)
2023-12-06  lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly  (Ming Lei)
group_cpus_evenly() could be part of a storage driver's error handler, such as the nvme driver's, which may happen during CPU hotplug, in which the storage queue has to drain its pending IOs because all CPUs associated with the queue are offline and the queue is becoming inactive. And handling IO needs the error handler to provide forward progress. A deadlock is then caused: 1) inside the CPU hotplug handler, the CPU hotplug lock is held, and blk-mq's handler is waiting for inflight IO; 2) the error handler is waiting for the CPU hotplug lock; 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because error handling can't provide forward progress. Solve the deadlock by not holding the CPU hotplug lock in group_cpus_evenly(), in which a two-stage spread is taken: 1) the 1st stage is over all present CPUs; 2) the 2nd stage is over all other CPUs. It turns out the two-stage spread just needs a consistent 'cpu_present_mask', so remove the CPU hotplug lock by storing the mask into one local cache. This doesn't change correctness, because all CPUs are still covered. Link: https://lkml.kernel.org/r/20231120083559.285174-1-ming.lei@redhat.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Reported-by: Guangwu Zhang <guazhang@redhat.com> Tested-by: Guangwu Zhang <guazhang@redhat.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
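The shape of the fix (schematic; the local mask name is illustrative, see the upstream patch for the exact code):

    cpumask_var_t present_snapshot;

    if (!zalloc_cpumask_var(&present_snapshot, GFP_KERNEL))
            return NULL;

    /* One consistent snapshot instead of holding the hotplug lock. */
    cpumask_copy(present_snapshot, cpu_present_mask);

    /* Stage 1: spread groups across the snapshotted present CPUs. */
    /* Stage 2: spread across the remaining, non-present CPUs. */

    free_cpumask_var(present_snapshot);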
2023-12-06  ACPI: tables: Correct and clean up the logic of acpi_parse_entries_array()  (Yuntao Wang)
The original intention of acpi_parse_entries_array() is to return the number of all matching entries on success. This number may be greater than the value of the max_entries parameter. When this happens, the function will output a warning message, indicating that `count - max_entries` matching entries remain unprocessed and have been ignored. However, commit 4ceacd02f5a1 ("ACPI / table: Always count matched and successfully parsed entries") changed this logic to return the number of entries successfully processed by the handler. In this case, when the max_entries parameter is not zero, the number of entries successfully processed can never be greater than the value of max_entries. In other words, the expression `count > max_entries` will always evaluate to false. This means that the logic in the final if statement will never be executed. Commit 99b0efd7c886 ("ACPI / tables: do not report the number of entries ignored by acpi_parse_entries()") mentioned this issue, but it tried to fix it by removing part of the warning message. This is meaningless because the pr_warn statement will never be executed in the first place. Commit 8726d4f44150 ("ACPI / tables: fix acpi_parse_entries_array() so it traverses all subtables") introduced an errs variable, which is intended to make acpi_parse_entries_array() always traverse all of the subtables, calling as many of the callbacks as possible. However, it seems that the commit does not achieve this goal. For example, when a handler returns an error, none of the handlers will be called again in the subsequent iterations. This result appears to be no different from before the change. This patch corrects and cleans up the logic of acpi_parse_entries_array(), making it return the number of all matching entries, rather than the number of entries successfully processed by handlers. Additionally, if an error occurs when executing a handler, the function will return -EINVAL immediately. This patch should not affect existing users of acpi_parse_entries_array(). Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
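The corrected behaviour, expressed as a schematic loop (paraphrasing the commit message; for_each_subtable_entry() and entry_matches() are hypothetical helpers standing in for the real iteration code):

    int count = 0;

    for_each_subtable_entry(entry, table_header) {
            if (!entry_matches(entry, proc, proc_num))
                    continue;
            count++;                        /* count every match ... */
            if (max_entries && count > max_entries)
                    continue;               /* ... but stop handling here */
            if (handler(entry))
                    return -EINVAL;         /* fail fast on handler error */
    }
    return count;                           /* number of all matching entries */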
2023-12-06  lib/vsprintf: Fix %pfwf when current node refcount == 0  (Herve Codina)
A refcount issue can appear in __fwnode_link_del() due to the pr_debug() call: WARNING: CPU: 0 PID: 901 at lib/refcount.c:25 refcount_warn_saturate+0xe5/0x110 Call Trace: <TASK> ... of_node_get+0x1e/0x30 of_fwnode_get+0x28/0x40 fwnode_full_name_string+0x34/0x90 fwnode_string+0xdb/0x140 ... vsnprintf+0x17b/0x630 ... __fwnode_link_del+0x25/0xa0 fwnode_links_purge+0x39/0xb0 of_node_release+0xd9/0x180 ... Indeed, an fwnode (of_node) is being destroyed and so, of_node_release() is called because the of_node refcount reached 0. From of_node_release() several function calls are made and lead to pr_debug() calls with %pfwf to print the fwnode full name. The issue is not present if we change %pfwf to %pfwP. To print the full name, %pfwf iterates over the current node and its parents and obtains/drops a reference to all nodes involved. In order to allow printing the full name (%pfwf) of a node while it is being destroyed, do not obtain/drop a reference to this current node. Fixes: a92eb7621b9f ("lib/vsprintf: Make use of fwnode API to obtain node names and separators") Cc: stable@vger.kernel.org Signed-off-by: Herve Codina <herve.codina@bootlin.com> Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20231114152655.409331-1-herve.codina@bootlin.com
2023-12-05  iov_iter: replace import_single_range() with import_ubuf()  (Jens Axboe)
With the removal of the 'iov' argument to import_single_range(), the two functions are now fully identical. Convert the import_single_range() callers to import_ubuf(), and remove the former fully. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
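A typical caller-side conversion looks like this (illustrative):

    struct iovec iov;
    struct iov_iter iter;
    int ret;

    /* Before: the iovec was mere scratch space for the single range. */
    ret = import_single_range(ITER_DEST, buf, len, &iov, &iter);

    /* After: import_ubuf() needs no iovec at all. */
    ret = import_ubuf(ITER_DEST, buf, len, &iter);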
2023-12-05  iov_iter: remove unused 'iov' argument from import_single_range()  (Jens Axboe)
It is entirely unused, just get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-2-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-05  mm/slab: remove CONFIG_SLAB from all Kconfig and Makefile  (Vlastimil Babka)
Remove CONFIG_SLAB, CONFIG_DEBUG_SLAB, CONFIG_SLAB_DEPRECATED and everything in Kconfig files and mm/Makefile that depends on those. Since SLUB is the only remaining allocator, remove the allocator choice, make CONFIG_SLUB a "def_bool y" for now and remove all explicit dependencies on SLUB or SLAB as it's now always enabled. Make every option's verbose name and description refer to "the slab allocator" without referring to the specific implementation. Do not rename the CONFIG_ option names yet. Everything under #ifdef CONFIG_SLAB, and mm/slab.c is now dead code, all code under #ifdef CONFIG_SLUB is now always compiled. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-12-03  Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds)
Pull probes fixes from Masami Hiramatsu: - objpool: Fix objpool overrun case on memory/cache access delay especially on the big.LITTLE SoC. The objpool uses a copy of the object slot index in its internal loop, but the slot index can be changed on another processor in parallel. In that case, the difference between the local copy of 'head' and the 'slot->last' index will be bigger than the local slot size, and we need to re-read the slot::head to update it. - kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since kretprobe_holder::rp is RCU managed, it should use rcu_assign_pointer() and rcu_dereference_check() correctly. Also adding __rcu tag for finding wrong usage by sparse. - rethook: Fix to use appropriate rcu API for rethook::handler. The same as kretprobe, rethook::handler is RCU managed and it should use rcu_assign_pointer() and rcu_dereference_check(). This also adds __rcu tag for finding wrong usage by sparse. * tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rethook: Use __rcu pointer for rethook::handler kprobes: consistent rcu api usage for kretprobe holder lib: objpool: fix head overrun on RK3588 SBC
2023-12-02  Merge tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds)
Pull ACPI fixes from Rafael Wysocki: "This fixes a recently introduced build issue on ARM32 and a NULL pointer dereference in the ACPI backlight driver due to a design issue exposed by a recent change in the ACPI bus type code. Specifics: - Fix a recently introduced build issue on ARM32 platforms caused by an inadvertent header file breakage (Dave Jiang) - Eliminate questionable usage of acpi_driver_data() in the ACPI backlight cooling device code that leads to NULL pointer dereferences after recent ACPI core changes (Hans de Goede)" * tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: video: Use acpi_video_device for cooling-dev driver data ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
2023-12-02  Merge tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs  (Linus Torvalds)
Pull more bcachefs bugfixes from Kent Overstreet: - bcache & bcachefs were broken with CFI enabled; patch for closures to fix type punning - mark erasure coding as extra-experimental; there are incompatible disk space accounting changes coming for erasure coding, and I'm still seeing checksum errors in some tests - several fixes for durability-related issues (durability is a device specific setting where we can tell bcachefs that data on a given device should be counted as replicated x times) - a fix for a rare livelock when a btree node merge then updates a parent node that is almost full - fix a race in the device removal path, where dropping a pointer in a btree node to a device would be clobbered by an in flight btree write updating the btree node key on completion - fix one SRCU lock hold time warning in the btree gc code - there's still a bunch more of these to fix - fix a rare race where we'd start copygc before initializing the "are we rw" percpu refcount; copygc would think we were already ro and die immediately * tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits) bcachefs: Extra kthread_should_stop() calls for copygc bcachefs: Convert gc_alloc_start() to for_each_btree_key2() bcachefs: Fix race between btree writes and metadata drop bcachefs: move journal seq assertion bcachefs: -EROFS doesn't count as move_extent_start_fail bcachefs: trace_move_extent_start_fail() now includes errcode bcachefs: Fix split_race livelock bcachefs: Fix bucket data type for stripe buckets bcachefs: Add missing validation for jset_entry_data_usage bcachefs: Fix zstd compress workspace size bcachefs: bpos is misaligned on big endian bcachefs: Fix ec + durability calculation bcachefs: Data update path won't accidentaly grow replicas bcachefs: deallocate_extra_replicas() bcachefs: Proper refcounting for journal_keys bcachefs: preserve device path as device name bcachefs: Fix an endianness conversion bcachefs: Start gc, copygc, rebalance threads after initing writes ref bcachefs: Don't stop copygc thread on device resize bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops ...
2023-12-01  Merge branch 'acpi-tables'  (Rafael J. Wysocki)
Merge a fix for a recently introduced build issue on ARM32 platforms caused by an inadvertent header file breakage (Dave Jiang). * acpi-tables: ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
2023-12-01  lib: objpool: fix head overrun on RK3588 SBC  (wuqiang.matt)
objpool overrun stress with test_objpool on an OrangePi5+ SBC triggered the following kernel warning: WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100 This message is from objpool.c:168: WARN_ON_ONCE(tail - head > pool->nr_objs); The overrun test case is to validate the case that pre-allocated objects are insufficient: 8 objects are pre-allocated for each node and the consumer thread per node tries to grab 16 objects in a row. The testing system is an OrangePi 5+, with RK3588, a big.LITTLE SOC with 4x A76 and 4x A55. When disabling either all 4 big or all 4 little cores, the overrun tests run well, but with big and little cores mixed together, the overrun test would always cause an overrun loop. It's likely the memory timing differences of big and little cores cause this trouble. Here are the debugging data of objpool_try_get_slot after try_cmpxchg_release: objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278 The local copies of 'head' and 'last' were 278 and 276, and reloading of 'slot->head' and 'slot->last' got 279 and 278. After try_cmpxchg_release 'slot->head' became 'head + 1', which is correct. But what's wrong here is the stale value of 'last', and that stale value of 'last' finally led to the overrun of 'head'. Memory updates of 'last' and 'head' are performed in push() and pop() independently, which could be the culprit leading to this out-of-order visibility of 'last' and 'head'. So for objpool_try_get_slot(), it's not enough to only check the condition 'head != slot->last'; the implicit condition 'last - head <= nr_objs' must also be explicitly asserted before the object retrieving. This patch will check and try reloading of 'head' and 'last' to ensure 'last' is behind 'head' at the time of object retrieving. Performance testing shows the average impact is about 0.1% for X86_64 and 1.12% for ARM64. Here are the results:

OS: Debian 10 X86_64, Linux 6.6rc
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s

                   1T          2T          4T          8T         16T
native:      49543304    99277826   199017659   399070324   795185848
objpool:     29909085    59865637   119692073   239750369   478005250
objpool+:    29879313    59230743   119609856   239067773   478509029

                  32T         48T         64T         96T        128T
native:    1596927073  2390099988  2929397330  3183875848  3257546602
objpool:    957553042  1435814086  1680872925  2043126796  2165424198
objpool+:   956476281  1434491297  1666055740  2041556569  2157415622

OS: Debian 11 AARCH64, Linux 6.6rc
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s

                   1T          2T          4T          8T         16T
native:      30890508    60399915   123111980   242257008   494002946
objpool:     14742531    28883047    57739948   115886644   232455421
objpool+:    14107220    29032998    57286084   113730493   232232850

                  24T         32T         48T         64T         96T
native:     746406039  1000174750  1493236240  1998318364  2942911180
objpool:    349164852   467284332   702296756   934459713  1387898285
objpool+:   348388180   462750976   696606096   927865887  1368402195

Link: https://lore.kernel.org/all/20231114115148.298821-1-wuqiang.matt@bytedance.com/ Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC") Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
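The fix, as a schematic of the retry added to objpool_try_get_slot() (simplified sketch, not the exact upstream code):

    u32 head = READ_ONCE(slot->head);

    while (head != READ_ONCE(slot->last)) {
            /* A stale 'last' can make 'last - head' exceed nr_objs:
             * re-read both indexes and retry instead of popping. */
            if (READ_ONCE(slot->last) - head > pool->nr_objs) {
                    head = READ_ONCE(slot->head);
                    continue;
            }
            /* otherwise claim the object at 'head', roughly:
             * try_cmpxchg_release(&slot->head, &head, head + 1) */
            break;
    }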
2023-12-01  Merge tag 'linux_kselftest-kunit-fixes-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest  (Linus Torvalds)
Pull KUnit fixes from Shuah Khan: "Three fixes to warnings and run-time test behavior. With these fixes, the test suite counter will be reset correctly before running tests, kunit will warn if tests are too slow, and a cast warning when adding kfree() as an action is eliminated" * tag 'linux_kselftest-kunit-fixes-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: test: Avoid cast warning when adding kfree() as an action kunit: Reset suite counter right before running tests kunit: Warn if tests are slow
2023-11-26  Merge tag 'parisc-for-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux  (Linus Torvalds)
Pull parisc architecture fixes from Helge Deller: "This patchset fixes and enforces correct section alignments for the ex_table, altinstructions, parisc_unwind, jump_table and bug_table which are created by inline assembly. Due to not being correctly aligned at link & load time they can unnecessarily trigger the kernel unaligned exception handler at runtime. While at it, I switched the bug table to use relative addresses which reduces the size of the table by half on 64-bit. We still had the ENOSYM and EREMOTERELEASE errno symbols as left-overs from HP-UX, which now trigger build issues with glibc. We can simply remove them. Most of the patches are tagged for stable kernel series. Summary: - Drop HP-UX ENOSYM and EREMOTERELEASE return codes to avoid glibc build issues - Fix section alignments for ex_table, altinstructions, parisc unwind table, jump_table and bug_table - Reduce size of bug_table on 64-bit kernel by using relative pointers" * tag 'parisc-for-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Reduce size of the bug_table on 64-bit kernel by half parisc: Drop the HP-UX ENOSYM and EREMOTERELEASE error codes parisc: Use natural CPU alignment for bug_table parisc: Ensure 32-bit alignment on parisc unwind section parisc: Mark lock_aligned variables 16-byte aligned on SMP parisc: Mark jump_table naturally aligned parisc: Mark altinstructions read-only and 32-bit aligned parisc: Mark ex_table entries 32-bit aligned in uaccess.h parisc: Mark ex_table entries 32-bit aligned in assembly.h
2023-11-25  parisc: Drop the HP-UX ENOSYM and EREMOTERELEASE error codes  (Helge Deller)
Those return codes are only defined for the parisc architecture and are leftovers from when we wanted to be HP-UX compatible. They are not returned by any Linux kernel syscall but do trigger problems with the glibc strerrorname_np() and strerror() functions as reported in glibc issue #31080. There is no need to keep them, so simply remove them. Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Bruno Haible <bruno@clisp.org> Closes: https://sourceware.org/bugzilla/show_bug.cgi?id=31080 Cc: stable@vger.kernel.org
2023-11-24  Merge tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds)
Pull vfs fixes from Christian Brauner: - Avoid calling back into LSMs from vfs_getattr_nosec() calls. IMA used to query inode properties accessing raw inode fields without dedicated helpers. That was finally fixed a few releases ago by forcing IMA to use vfs_getattr_nosec() helpers. The goal of the vfs_getattr_nosec() helper is to query for attributes without calling into the LSM layer which would be quite problematic because incredibly IMA is called from __fput()...

__fput()
  -> ima_file_free()

What it does is to call back into the filesystem to update the file's IMA xattr. Querying the inode without using vfs_getattr_nosec() meant that IMA didn't handle stacking filesystems such as overlayfs correctly. So the switch to vfs_getattr_nosec() is quite correct. But the switch to vfs_getattr_nosec() revealed another bug when used on stacking filesystems:

__fput()
  -> ima_file_free()
     -> vfs_getattr_nosec()
        -> i_op->getattr::ovl_getattr()
           -> vfs_getattr()
              -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()
                 -> security_inode_getattr() # calls back into LSMs

Now, if that __fput() happens from task_work_run() of an exiting task, current->fs and various other pointers could already be NULL. So anything in the LSM layer relying on that not being NULL would be quite surprised. Fix that by passing the information that this is a security request through to the stacking filesystem by adding a new internal AT_GETATTR_NOSEC flag. Now the callchain becomes:

__fput()
  -> ima_file_free()
     -> vfs_getattr_nosec()
        -> i_op->getattr::ovl_getattr()
           -> if (AT_GETATTR_NOSEC) vfs_getattr_nosec() else vfs_getattr()
              -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()

- Fix a bug introduced with the iov_iter rework from last cycle. This broke /proc/kcore by copying too much and without the correct offset. - Add a missing NULL check when allocating the root inode in autofs_fill_super(). - Fix stable writes for multi-device filesystems (xfs, btrfs etc) and the block device pseudo filesystem. Stable writes used to be a superblock flag only, making it a per filesystem property. Add an additional AS_STABLE_WRITES mapping flag to allow for fine-grained control. - Ensure that offset_iterate_dir() returns 0 after reaching the end of a directory so it adheres to getdents() convention. * tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: libfs: getdents() should return 0 after reaching EOD xfs: respect the stable writes flag on the RT device xfs: clean up FS_XFLAG_REALTIME handling in xfs_ioctl_setattr_xflags block: update the stable_writes flag in bdev_add filemap: add a per-mapping stable writes flag autofs: add: new_inode check in autofs_fill_super() iov_iter: fix copy_page_to_iter_nofault() fs: Pass AT_GETATTR_NOSEC flag to getattr interface function
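Schematically, the stacking filesystem's getattr now forwards the "nosec" nature of the call (illustrative, following the callchain above):

    /* e.g. in ovl_getattr(), propagate the new AT_GETATTR_NOSEC flag */
    if (flags & AT_GETATTR_NOSEC)
            err = vfs_getattr_nosec(&realpath, stat, request_mask, flags);
    else
            err = vfs_getattr(&realpath, stat, request_mask, flags);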
2023-11-24  closures: CLOSURE_CALLBACK() to fix type punning  (Kent Overstreet)
Control flow integrity is now checking that type signatures match on indirect function calls. That breaks closures, which embed a work_struct in a closure in such a way that a closure_fn may also be used as a workqueue fn by the underlying closure code. So we have to change closure fns to take a work_struct as their argument - but that results in a loss of clarity, as closure fns have different semantics from normal workqueue functions (they run owning a ref on the closure, which must be released with continue_at() or closure_return()). Thus, this patch introduces CLOSURE_CALLBACK() and closure_type() macros as suggested by Kees, to smooth things over a bit. Suggested-by: Kees Cook <keescook@chromium.org> Cc: Coly Li <colyli@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
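The macros are roughly the following, with a made-up example type to show the conversion (see include/linux/closure.h for the authoritative definitions):

    #define CLOSURE_CALLBACK(name)  void name(struct work_struct *ws)
    #define closure_type(name, _type, member)                       \
            _type *name = container_of(ws, _type, member.work)

    struct my_op {
            struct closure cl;
            int result;
    };

    /* Before: void my_op_done(struct closure *cl) - breaks CFI when
     * invoked through the embedded work_struct. */

    /* After: CFI-safe work_struct signature, same semantics. */
    static CLOSURE_CALLBACK(my_op_done)
    {
            closure_type(op, struct my_op, cl); /* op = container_of(...) */
            pr_info("op finished: %d\n", op->result);
            /* still runs owning a ref; release via closure_return() */
    }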
2023-11-22  ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes  (Dave Jiang)
Linus reported that: After commit a103f46633fd the kernel stopped compiling for several ARM32 platforms that I am building with a bare metal compiler. Bare metal compilers (arm-none-eabi-) don't define __linux__. This is because the header <acpi/platform/acenv.h> is now in the include path for <linux/irq.h>:

CC arch/arm/kernel/irq.o
CC kernel/sysctl.o
CC crypto/api.o
In file included from ../include/acpi/acpi.h:22, from ../include/linux/fw_table.h:29, from ../include/linux/acpi.h:18, from ../include/linux/irqchip.h:14, from ../arch/arm/kernel/irq.c:25:
../include/acpi/platform/acenv.h:218:2: error: #error Unknown target environment
  218 | #error Unknown target environment
      | ^~~~~

The issue was introduced by the splitting out of the ACPI code to support the new generic fw_table code. Rafael suggested [1] moving the fw_table.h include in linux/acpi.h to below the linux/mutex.h. Remove the two includes in fw_table.h. Replace the linux/fw_table.h include in fw_table.c with linux/acpi.h. Link: https://lore.kernel.org/linux-acpi/CAJZ5v0idWdJq3JSqQWLG5q+b+b=zkEdWR55rGYEoxh7R6N8kFQ@mail.gmail.com/ Fixes: a103f46633fd ("acpi: Move common tables helper functions to common lib") Closes: https://lore.kernel.org/linux-acpi/20231114-arm-build-bug-v1-1-458745fe32a4@linaro.org/ Reported-by: Linus Walleij <linus.walleij@linaro.org> Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-22  debugobjects: Stop accessing objects after releasing hash bucket lock  (Andrzej Hajda)
After release of the hashbucket lock the tracking object can be modified or freed by a concurrent thread. Using it in such a case is error prone, even for printing the object state: 1. T1 tries to deactivate destroyed object, debugobjects detects it, hash bucket lock is released. 2. T2 preempts T1 and frees the tracking object. 3. The freed tracking object is allocated and initialized for a different to be tracked kernel object. 4. T1 resumes and reports error for wrong kernel object. Create a local copy of the tracking object before releasing the hash bucket lock and use the local copy for reporting and fixups to prevent this. Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231025-debugobjects_fix-v3-1-2bc3bf7084c2@intel.com
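The pattern of the fix (schematic; lookup_object() and debug_print_object() stand for the internal helpers in lib/debugobjects.c):

    struct debug_obj o;                     /* on-stack snapshot */
    struct debug_obj *obj;
    unsigned long flags;

    raw_spin_lock_irqsave(&db->lock, flags);
    obj = lookup_object(addr, db);
    if (obj)
            o = *obj;                       /* copy while still protected */
    raw_spin_unlock_irqrestore(&db->lock, flags);

    if (obj)
            debug_print_object(&o, "deactivate");   /* use only the copy */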
2023-11-18  iov_iter: fix copy_page_to_iter_nofault()  (Omar Sandoval)
The recent conversion to inline functions made two mistakes: 1. It tries to copy the full amount requested (bytes), not just what's available in the kmap'd page (n). 2. It's not applying the offset in the first page. Note that copy_page_to_iter_nofault() is only used by /proc/kcore. This was detected by drgn's test suite. Fixes: f1982740f5e7 ("iov_iter: Convert iterate*() to inline funcs") Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/c1616e06b5248013cbbb1881bb4fef85a7a69ccb.1700257019.git.osandov@fb.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
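Conceptually, each step of the copy has to be bounded like this (illustrative, not the upstream code):

    size_t off = offset_in_page(offset);
    size_t n = min_t(size_t, bytes, PAGE_SIZE - off);
    void *kaddr = kmap_local_page(page);

    /* copy n bytes starting at kaddr + off, not 'bytes' from kaddr */
    if (copy_to_user_nofault(udst, kaddr + off, n))
            n = 0;                          /* faulted, nothing copied */
    kunmap_local(kaddr);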
2023-11-14  Merge tag 'zstd-linus-v6.7-rc2' of https://github.com/terrelln/linux  (Linus Torvalds)
Pull Zstd fix from Nick Terrell: "Only a single line change to fix a benign UBSAN warning" * tag 'zstd-linus-v6.7-rc2' of https://github.com/terrelln/linux: zstd: Fix array-index-out-of-bounds UBSAN warning
2023-11-14  zstd: Fix array-index-out-of-bounds UBSAN warning  (Nick Terrell)
Zstd used an array of length 1 to mean a flexible array for C89 compatibility. Switch to a C99 flexible array to fix the UBSAN warning. Tested locally by booting the kernel and writing to and reading from a BtrFS filesystem with zstd compression enabled. I was unable to reproduce the issue before the fix, however it is a trivial change. Link: https://lkml.kernel.org/r/20231012213428.1390905-1-nickrterrell@gmail.com Reported-by: syzbot+1f2eb3e8cd123ffce499@syzkaller.appspotmail.com Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: Nick Terrell <terrelln@fb.com> Reviewed-by: Kees Cook <keescook@chromium.org>
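The change is the classic C89-to-C99 flexible-array conversion, i.e. something like:

    /* Before: C89 idiom; indexing past element 0 trips UBSAN's
     * array-index-out-of-bounds check. */
    typedef struct {
            unsigned int header;
            unsigned int rest[1];   /* actually variable-length */
    } old_table_t;

    /* After: C99 flexible array member, variable length is explicit. */
    typedef struct {
            unsigned int header;
            unsigned int rest[];
    } new_table_t;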
2023-11-14  kunit: test: Avoid cast warning when adding kfree() as an action  (Richard Fitzgerald)
In kunit_log_test() pass the kfree_wrapper() function to kunit_add_action() instead of directly passing kfree(). This prevents a cast warning: lib/kunit/kunit-test.c:565:25: warning: cast from 'void (*)(const void *)' to 'kunit_action_t *' (aka 'void (*)(void *)') converts to incompatible function type [-Wcast-function-type-strict] 564 full_log = string_stream_get_string(test->log); > 565 kunit_add_action(test, (kunit_action_t *)kfree, full_log); Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311070041.kWVYx7YP-lkp@intel.com/ Fixes: 05e2006ce493 ("kunit: Use string_stream for test log") Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
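The fix routes the free through a trivial wrapper whose signature already matches kunit_action_t, so no cast is needed (a sketch of the approach described above):

    static void kfree_wrapper(void *p)
    {
            kfree(p);
    }

    /* in the test: */
    full_log = string_stream_get_string(test->log);
    kunit_add_action(test, kfree_wrapper, full_log);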
2023-11-14  kunit: Reset suite counter right before running tests  (Michal Wajdeczko)
Today we reset the suite counter as part of the suite cleanup, called from the module exit callback, but it might not work that well as one can try to collect results without unloading a previous test (either unintentionally or due to dependencies). For easy reproduction try to load the kunit-test.ko and then collect and parse results from the kunit-example-test.ko load. The parser will complain about a mismatch of the expected test number:

[ ] KTAP version 1
[ ] 1..1
[ ] # example: initializing suite
[ ] KTAP version 1
[ ] # Subtest: example
..
[ ] # example: pass:5 fail:0 skip:4 total:9
[ ] # Totals: pass:6 fail:0 skip:6 total:12
[ ] ok 7 example
[ ] [ERROR] Test: example: Expected test number 1 but found 7
[ ] ===================== [PASSED] example =====================
[ ] ============================================================
[ ] Testing complete. Ran 12 tests: passed: 6, skipped: 6, errors: 1

Since we are now printing the suite test plan on every module load, right before running suite tests, we should make sure that the suite counter will also start from 1. The easiest solution seems to be moving the counter reset to the __kunit_test_suites_init() function. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: David Gow <davidgow@google.com> Cc: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2023-11-14  kunit: Warn if tests are slow  (Maxime Ripard)
KUnit recently gained support to set up attributes, the first one being the speed of a given test, allowing slow tests to be filtered out. A slow test is defined in the documentation as taking more than one second. There's another speed attribute called "super slow" but whose definition is less clear. Add support to the test runner to check the test execution time, and report tests that should be marked as slow but aren't. Signed-off-by: Maxime Ripard <mripard@kernel.org> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
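A sketch of the check (run_case() and test_is_marked_slow() are hypothetical names; the real runner uses KUnit's attribute API internally):

    ktime_t start = ktime_get();

    run_case(test);                          /* hypothetical */

    if (ktime_ms_delta(ktime_get(), start) > 1000 &&
        !test_is_marked_slow(test))          /* hypothetical */
            kunit_warn(test,
                       "took over 1 s, consider marking it as slow");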
2023-11-10  lib: test_objpool: make global variables static  (wuqiang.matt)
The kernel test robot reported build warnings that the structures g_ot_sync_ops, g_ot_async_ops and g_testcases should be static. These definitions are only used in test_objpool.c, so make them static. Link: https://lore.kernel.org/all/20231108012248.313574-1-wuqiang.matt@bytedance.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311071229.WGrWUjM1-lkp@intel.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-07  Merge tag 'bcachefs-2023-11-5' of https://evilpiepirate.org/git/bcachefs  (Linus Torvalds)
Pull more bcachefs updates from Kent Overstreet: "Here's the second big bcachefs pull request. This brings your tree up to date with my master branch, which is what existing bcachefs users are currently running. New features: - rebalance_work btree (and metadata version 1.3): the rebalance thread no longer has to scan to find extents that need processing - big scalability improvement. - sb_errors superblock section: this adds counters for each fsck error type, since filesystem creation, along with the date of the most recent error. It'll get us better bug reports (since users do not typically report errors that fsck was able to fix), and I might add telemetry for this in the future. Fixes include: - multiple snapshot deletion fixes - members_v2 fixups - deleted_inodes btree fixes - copygc thread no longer spins when a device is full but has no fragmented buckets (i.e. rebalance needs to move data around instead) - a fix for a memory reclaim issue with the btree key cache: we're now careful not to hold the srcu read lock that blocks key cache reclaim for too long - an early allocator locking fix, from Brian - endianness fixes, from Brian - CONFIG_BCACHEFS_DEBUG_TRANSACTIONS no longer defaults to y, a big performance improvement on multithreaded workloads" * tag 'bcachefs-2023-11-5' of https://evilpiepirate.org/git/bcachefs: (70 commits) bcachefs: Improve stripe checksum error message bcachefs: Simplify, fix bch2_backpointer_get_key() bcachefs: kill thing_it_points_to arg to backpointer_not_found() bcachefs: bch2_ec_read_extent() now takes btree_trans bcachefs: bch2_stripe_to_text() now prints ptr gens bcachefs: Don't iterate over journal entries just for btree roots bcachefs: Break up bch2_journal_write() bcachefs: Replace ERANGE with private error codes bcachefs: bkey_copy() is no longer a macro bcachefs: x-macro-ify inode flags enum bcachefs: Convert bch2_fs_open() to darray bcachefs: Move __bch2_members_v2_get_mut to sb-members.h bcachefs: bch2_prt_datetime() bcachefs: CONFIG_BCACHEFS_DEBUG_TRANSACTIONS no longer defaults to y bcachefs: Add a comment for BTREE_INSERT_NOJOURNAL usage bcachefs: rebalance_work btree is not a snapshots btree bcachefs: Add missing printk newlines bcachefs: Fix recovery when forced to use JSET_NO_FLUSH journal entry bcachefs: .get_parent() should return an error pointer bcachefs: Fix bch2_delete_dead_inodes() ...
2023-11-04  Merge tag 'cxl-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl  (Linus Torvalds)
Pull CXL (Compute Express Link) updates from Dan Williams: "The main new functionality this time is work to allow Linux to natively handle CXL link protocol errors signalled via PCIe AER for current generation CXL platforms. This required some enlightenment of the PCIe AER core to workaround the fact that current generation RCH (Restricted CXL Host) platforms physically hide topology details and registers via a mechanism called RCRB (Root Complex Register Block). The next major highlight is reworks to address bugs in parsing region configurations for next generation VH (Virtual Host) topologies. The old broken algorithm is replaced with a simpler one that significantly increases the number of region configurations supported by Linux. This is again relevant for error handling so that forward and reverse address translation of memory errors can be carried out by Linux for memory regions instantiated by platform firmware. As for other cross-tree work, the ACPI table parsing code has been refactored for reuse parsing the "CDAT" structure which is an ACPI-like data structure that is reported by CXL devices. That work is in preparation for v6.8 support for CXL QoS. Think of this as dynamic generation of NUMA node topology information generated by Linux rather than platform firmware. Lastly, a number of internal object lifetime issues have been resolved along with misc. fixes and feature updates (decoders_committed sysfs ABI). Summary: - Add support for RCH (Restricted CXL Host) Error recovery - Fix several region assembly bugs - Fix mem-device lifetime issues relative to the sanitize command and RCH topology. - Refactor ACPI table parsing for CDAT parsing re-use in preparation for CXL QOS support" * tag 'cxl-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (50 commits) lib/fw_table: Remove acpi_parse_entries_array() export cxl/pci: Change CXL AER support check to use native AER cxl/hdm: Remove broken error path cxl/hdm: Fix && vs || bug acpi: Move common tables helper functions to common lib cxl: Add support for reading CXL switch CDAT table cxl: Add checksum verification to CDAT from CXL cxl: Export QTG ids from CFMWS to sysfs as qos_class attribute cxl: Add decoders_committed sysfs attribute to cxl_port cxl: Add cxl_decoders_committed() helper cxl/core/regs: Rework cxl_map_pmu_regs() to use map->dev for devm cxl/core/regs: Rename phys_addr in cxl_map_component_regs() PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler cxl/pci: Disable root port interrupts in RCH mode cxl/pci: Add RCH downstream port error logging cxl/pci: Map RCH downstream AER registers for logging protocol errors cxl/pci: Update CXL error logging to use RAS register address PCI/AER: Refactor cper_print_aer() for use by CXL driver module cxl/pci: Add RCH downstream port AER register discovery ...
2023-11-03Merge tag 'powerpc-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds
Pull powerpc updates from Michael Ellerman:

 - Add support for KVM running as a nested hypervisor under development
   versions of PowerVM, using the new PAPR nested virtualisation API

 - Add support for the BPF prog pack allocator

 - A rework of the non-server MMU handling to support execute-only on
   all platforms

 - Some optimisations & cleanups for the powerpc qspinlock code

 - Various other small features and fixes

Thanks to Aboorva Devarajan, Aditya Gupta, Amit Machhiwal, Benjamin
Gray, Christophe Leroy, Dr. David Alan Gilbert, Gaurav Batra, Gautam
Menghani, Geert Uytterhoeven, Haren Myneni, Hari Bathini, Joel Stanley,
Jordan Niethe, Julia Lawall, Kautuk Consul, Kuan-Wei Chiu, Michael
Neuling, Minjie Du, Muhammad Muzammil, Naveen N Rao, Nicholas Piggin,
Nick Child, Nysal Jan K.A, Peter Lafreniere, Rob Herring, Sachin Sant,
Sebastian Andrzej Siewior, Shrikanth Hegde, Srikar Dronamraju,
Stanislav Kinsburskii, Vaibhav Jain, Wang Yufen, Yang Yingliang, and
Yuan Tan.

* tag 'powerpc-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (100 commits)
  powerpc/vmcore: Add MMU information to vmcoreinfo
  Revert "powerpc: add `cur_cpu_spec` symbol to vmcoreinfo"
  powerpc/bpf: use bpf_jit_binary_pack_[alloc|finalize|free]
  powerpc/bpf: rename powerpc64_jit_data to powerpc_jit_data
  powerpc/bpf: implement bpf_arch_text_invalidate for bpf_prog_pack
  powerpc/bpf: implement bpf_arch_text_copy
  powerpc/code-patching: introduce patch_instructions()
  powerpc/32s: Implement local_flush_tlb_page_psize()
  powerpc/pseries: use kfree_sensitive() in plpks_gen_password()
  powerpc/code-patching: Perform hwsync in __patch_instruction() in case of failure
  powerpc/fsl_msi: Use device_get_match_data()
  powerpc: Remove cpm_dp...() macros
  powerpc/qspinlock: Rename yield_propagate_owner tunable
  powerpc/qspinlock: Propagate sleepy if previous waiter is preempted
  powerpc/qspinlock: don't propagate the not-sleepy state
  powerpc/qspinlock: propagate owner preemptedness rather than CPU number
  powerpc/qspinlock: stop queued waiters trying to set lock sleepy
  powerpc/perf: Fix disabling BHRB and instruction sampling
  powerpc/trace: Add support for HAVE_FUNCTION_ARG_ACCESS_API
  powerpc/tools: Pass -mabi=elfv2 to gcc-check-mprofile-kernel.sh
  ...
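The execute-only rework above is kernel MMU plumbing, but its
user-visible contract can be sketched from userspace: a mapping
requested with PROT_EXEC alone should not be readable. A small
illustration, with the caveat that whether loads from such a mapping
actually fault depends on the CPU and kernel in use (some MMUs cannot
express execute permission without read):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* Ask for execute permission only: no read, no write. */
            void *p = mmap(NULL, 4096, PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            printf("execute-only mapping at %p\n", p);
            munmap(p, 4096);
            return 0;
    }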
2023-11-03Merge tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-traceLinus Torvalds
Pull tracing updates from Steven Rostedt:

 - Remove eventfs_file descriptor

   This is the biggest change, and the second part of making eventfs
   create its files dynamically. In 6.6 the first part was added, and
   it maintained a one-to-one mapping between eventfs meta descriptors
   and the directories and file inodes and dentries that were
   dynamically created. The directories were represented by an
   eventfs_inode and the files were represented by an eventfs_file.

   In v6.7 the eventfs_file is removed. As all events have the same
   directory makeup (sched_switch has "enable", "id", "format", etc.
   files), the handling of what files are underneath each leaf eventfs
   directory is moved back to the tracing subsystem via a callback.
   When an event is added to eventfs, it registers an array of
   eventfs_entry structures. These hold the names of the files and the
   callbacks to call when the file is referenced. The callback gets the
   name, so that the same callback may be used by multiple files. The
   callback then supplies the filesystem_operations structure needed to
   create this file.

   This has brought the memory footprint of creating multiple eventfs
   instances down by 2 megs each!

 - User events now supports persistent events that are not associated
   with a single process. These are privileged events that hang around
   even if no process is attached to them.

 - Clean up of seq_buf

   There's talk about using seq_buf more to replace strscpy() and
   friends, but this also requires some minor modifications of seq_buf
   to make that possible.

 - Expand instance ring buffers individually

   Currently, if boot-up creates an instance and a trace event is
   enabled on that instance, the ring buffer for that instance and the
   top level ring buffer are both expanded (1.4 MB per CPU). This
   wastes memory, as it happens even when nothing is using the top
   level instance.

 - Other minor clean ups and fixes

* tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits)
  seq_buf: Export seq_buf_puts()
  seq_buf: Export seq_buf_putc()
  eventfs: Use simple_recursive_removal() to clean up dentries
  eventfs: Remove special processing of dput() of events directory
  eventfs: Delete eventfs_inode when the last dentry is freed
  eventfs: Hold eventfs_mutex when calling callback functions
  eventfs: Save ownership and mode
  eventfs: Test for ei->is_freed when accessing ei->dentry
  eventfs: Have a free_ei() that just frees the eventfs_inode
  eventfs: Remove "is_freed" union with rcu head
  eventfs: Fix kerneldoc of eventfs_remove_rec()
  tracing: Have the user copy of synthetic event address use correct context
  eventfs: Remove extra dget() in eventfs_create_events_dir()
  tracing: Have trace_event_file have ref counters
  seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str()
  eventfs: Fix typo in eventfs_inode union comment
  eventfs: Fix WARN_ON() in create_file_dentry()
  powerpc: Remove initialisation of readpos
  tracing/histograms: Simplify last_cmd_set()
  seq_buf: fix a misleading comment
  ...
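The eventfs_entry mechanism described above is, at its core, a static
table of file names paired with a callback that produces the ops for a
given name on demand. A much-simplified userspace model of that lookup
follows; the structures and signatures are illustrative stand-ins, not
the kernel's actual eventfs API:

    #include <stdio.h>
    #include <string.h>

    struct file_ops { const char *what; };	/* stand-in for file_operations */

    static const struct file_ops enable_ops = { "enable handler" };
    static const struct file_ops format_ops = { "format handler" };

    /* One entry per file under an event directory: a name, plus a
     * callback that supplies the ops when the file is looked up. */
    struct entry {
            const char *name;
            const struct file_ops *(*callback)(const char *name);
    };

    /* One callback can serve several files; it switches on the name. */
    static const struct file_ops *event_callback(const char *name)
    {
            if (!strcmp(name, "enable"))
                    return &enable_ops;
            if (!strcmp(name, "format"))
                    return &format_ops;
            return NULL;	/* no such file for this event */
    }

    static const struct entry event_entries[] = {
            { "enable", event_callback },
            { "format", event_callback },
    };

    int main(void)
    {
            for (size_t i = 0;
                 i < sizeof(event_entries) / sizeof(event_entries[0]); i++) {
                    const struct file_ops *ops =
                            event_entries[i].callback(event_entries[i].name);
                    printf("%s -> %s\n", event_entries[i].name,
                           ops ? ops->what : "none");
            }
            return 0;
    }

Since nothing is materialized per file until a name is actually looked
up, memory scales with what is accessed rather than with the number of
instances, which is where the 2 MB per instance saving comes from.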
2023-11-03Merge tag 'printk-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linuxLinus Torvalds
Pull printk updates from Petr Mladek:

 - Another preparation step for introducing printk kthreads. The main
   piece is a per-console lock with several features:

    - It supports three priorities: normal, emergency, and panic. The
      priority is given by the context in which the lock is taken. A
      context with a higher priority is allowed to take over the lock
      from a context with a lower one.

      The plan is to use the emergency context for Oops and WARN()
      messages, and also by watchdogs. The panic() context will be used
      on the panic CPU.

    - The owner might enter/exit regions where it is not safe to take
      over the lock. This allows taking over the lock safely in the
      middle of a message. For example, serial drivers emit characters
      one by one, and the serial port is in a safe state in between.

      Only the final console_flush_in_panic() will be allowed to take
      over the lock even in the unsafe state (last chance, pray, and
      hope).

    - A higher priority context might busy wait with a timeout. The
      current owner is informed about the waiter and releases the lock
      on exit from the unsafe state.

    - The new lock is safe even in atomic contexts, including NMI.

   Another change is safe manipulation of the per-console sequence
   number counter under the new lock.

 - simple_strntoull() micro-optimization

 - Reduce pr_flush() polling time

 - Calm down a false warning about possible invalid access to console
   buffers when CONFIG_PRINTK is disabled.

[ .. and Thomas Gleixner wants to point out that while several of the
  commits are attributed to him, he only authored the early versions of
  said commits, and that John Ogness and Petr Mladek have been the ones
  who sorted out the details and really should be those who get the
  credit - Linus ]

* tag 'printk-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  vsprintf: uninline simple_strntoull(), reorder arguments
  printk: printk: Remove unnecessary statements'len = 0;'
  printk: Reduce pr_flush() pooling time
  printk: fix illegal pbufs access for !CONFIG_PRINTK
  printk: nbcon: Allow drivers to mark unsafe regions and check state
  printk: nbcon: Add emit function and callback function for atomic printing
  printk: nbcon: Add sequence handling
  printk: nbcon: Add ownership state functions
  printk: nbcon: Add buffer management
  printk: Make static printk buffers available to nbcon
  printk: nbcon: Add acquire/release logic
  printk: Add non-BKL (nbcon) console basic infrastructure
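The priority and unsafe-region rules above can be modeled compactly
with C11 atomics. This is a toy sketch of the idea only: the real
nbcon code keeps more state in a single atomic word and also handles
waiter notification, timeouts, and hand-back, and it closes races this
model leaves open (for instance between the unsafe check and the
exchange):

    #include <stdatomic.h>
    #include <stdbool.h>

    enum prio { PRIO_NONE, PRIO_NORMAL, PRIO_EMERGENCY, PRIO_PANIC };

    struct con_lock {
            _Atomic int owner_prio;	/* PRIO_NONE when unlocked */
            _Atomic bool unsafe;	/* owner mid-character: no takeover */
    };

    /* Acquire the lock if it is free, or take it over from an owner
     * with strictly lower priority that is in a safe state. */
    static bool con_trylock(struct con_lock *l, enum prio p)
    {
            int cur = atomic_load(&l->owner_prio);

            while (cur == PRIO_NONE ||
                   (cur < p && !atomic_load(&l->unsafe))) {
                    /* On failure, cur is refreshed and rechecked. */
                    if (atomic_compare_exchange_weak(&l->owner_prio,
                                                     &cur, p))
                            return true;
            }
            return false;	/* held by equal/higher prio, or unsafe */
    }

    static void con_unlock(struct con_lock *l)
    {
            atomic_store(&l->unsafe, false);
            atomic_store(&l->owner_prio, PRIO_NONE);
    }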
2023-11-03Merge tag 'bitmap-for-6.7' of https://github.com/norov/linuxLinus Torvalds
Pull bitmap updates from Yury Norov:
"This includes the 'bitmap: cleanup bitmap_*_region() implementation'
series, and scattered cleanup patches"

* tag 'bitmap-for-6.7' of https://github.com/norov/linux:
  buildid: reduce header file dependencies for module
  bitmap: move bitmap_*_region() functions to bitmap.h
  bitmap: drop _reg_op() function
  bitmap: replace _reg_op(REG_OP_ISFREE) with find_next_bit()
  bitmap: replace _reg_op(REG_OP_RELEASE) with bitmap_clear()
  bitmap: replace _reg_op(REG_OP_ALLOC) with bitmap_set()
  bitmap: fix opencoded bitmap_allocate_region()
  bitmap: add test for bitmap_*_region() functions
  bitmap: align __reg_op() wrappers with modern coding style
  lib/bitmap: split-out string-related operations to a separate files
  bitmap: Remove dead code, i.e. bitmap_copy_le()
  bitmap: Fix a typo ("identify map")
  cpumask: kernel-doc cleanups and additions
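The _reg_op() removal in the series above works because each region
operation decomposes into a generic primitive: test whether a run of
bits is clear, set a run, or clear a run. Naive bit-at-a-time
stand-ins for those primitives are sketched below; the names mirror
the kernel helpers, but the kernel versions operate a word at a time:

    #include <stdbool.h>
    #include <stddef.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    static void bitmap_set(unsigned long *map, size_t start, size_t len)
    {
            for (size_t i = start; i < start + len; i++)
                    map[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
    }

    static void bitmap_clear(unsigned long *map, size_t start, size_t len)
    {
            for (size_t i = start; i < start + len; i++)
                    map[i / BITS_PER_LONG] &= ~(1UL << (i % BITS_PER_LONG));
    }

    /* REG_OP_ISFREE equivalent: no bit set in [start, start + len);
     * the kernel expresses this with find_next_bit() instead. */
    static bool region_is_free(const unsigned long *map, size_t start,
                               size_t len)
    {
            for (size_t i = start; i < start + len; i++)
                    if (map[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
                            return false;
            return true;
    }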
2023-11-02Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds
Pull non-MM updates from Andrew Morton:
"As usual, lots of singleton and doubleton patches all over the tree,
and there's little I can say which isn't in the individual changelogs.

The lengthier patch series are:

 - 'kdump: use generic functions to simplify crashkernel reservation in
   arch', from Baoquan He. This is mainly cleanups and consolidation of
   the 'crashkernel=' kernel parameter handling.

 - After much discussion, David Laight's 'minmax: Relax type checks in
   min() and max()' is here. Hopefully this reduces some typecasting
   and the use of min_t() and max_t().

 - A group of patches from Oleg Nesterov which clean up and slightly
   fix our handling of reads from /proc/PID/task/... and which remove
   task_struct.thread_group"

* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
  scripts/gdb/vmalloc: disable on no-MMU
  scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
  .mailmap: add address mapping for Tomeu Vizoso
  mailmap: update email address for Claudiu Beznea
  tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
  .mailmap: map Benjamin Poirier's address
  scripts/gdb: add lx_current support for riscv
  ocfs2: fix a spelling typo in comment
  proc: test ProtectionKey in proc-empty-vm test
  proc: fix proc-empty-vm test with vsyscall
  fs/proc/base.c: remove unneeded semicolon
  do_io_accounting: use sig->stats_lock
  do_io_accounting: use __for_each_thread()
  ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
  ocfs2: fix a typo in a comment
  scripts/show_delta: add __main__ judgement before main code
  treewide: mark stuff as __ro_after_init
  fs: ocfs2: check status values
  proc: test /proc/${pid}/statm
  compiler.h: move __is_constexpr() to compiler.h
  ...
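The min()/max() type-check relaxation mentioned above addresses
exactly the pitfall demonstrated below: comparing a signed value
against an unsigned one silently converts the signed side, and min_t()
works around it with explicit casts. A small self-contained
demonstration (min_t is redefined locally for illustration; the
kernel's macro also avoids double evaluation of its arguments):

    #include <stdio.h>

    /* Force both operands to a named type before comparing, in the
     * spirit of the kernel's min_t(); the cast can be lossy if the
     * type is chosen badly. */
    #define min_t(type, a, b) \
            ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

    int main(void)
    {
            int          a = -1;
            unsigned int b = 1;

            /* The implicit conversion in (a < b) turns -1 into
             * UINT_MAX, so this prints 1, not -1. Compilers flag it
             * with -Wsign-compare, which is the very noise the
             * relaxed type checks aim to keep meaningful. */
            printf("naive min: %u\n", (a < b) ? (unsigned int)a : b);

            /* Forcing a signed comparison gives the intended -1. */
            printf("min_t:     %d\n", min_t(int, a, b));
            return 0;
    }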