2024-07-12  selftests/proc: add PROCMAP_QUERY ioctl tests  (Andrii Nakryiko)
Extend the existing proc-pid-vm.c tests with the PROCMAP_QUERY ioctl() API. Test a few successful and negative cases, validating that query filtering and the exact vs. next-VMA logic work as expected. Link: https://lkml.kernel.org/r/20240627170900.1672542-7-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  tools: sync uapi/linux/fs.h header into tools subdir  (Andrii Nakryiko)
We need this UAPI header in the tools/include subdirectory so it can be used from BPF selftests. Link: https://lkml.kernel.org/r/20240627170900.1672542-6-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence  (Andrii Nakryiko)
Call out the existence of the PROCMAP_QUERY ioctl() in the documentation section describing the /proc/PID/maps file. We refer users to the UAPI header for the low-level details of this programmatic interface. Link: https://lkml.kernel.org/r/20240627170900.1672542-5-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  fs/procfs: add build ID fetching to PROCMAP_QUERY API  (Andrii Nakryiko)
Getting the ELF build ID reliably is an important aspect of profiling and stack trace symbolization, and the textual representation in /proc/<pid>/maps doesn't help with this. To get the backing file's ELF build ID, an application has to first resolve the VMA, then use its start/end address range to follow a special /proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this is necessary because the backing file might have been removed from the disk or already replaced with another binary at the same file path). Such an approach, beyond the complexity of having to do a bunch of extra work, has extra security implications. Because the application opens the underlying ELF file and needs read access to its entire contents (as far as the kernel is concerned), the kernel puts additional capable() checks on following the /proc/<pid>/map_files/<start>-<end> symlink. And that makes sense in general. But in the case of build ID, the profiler/symbolizer doesn't need the contents of the ELF file per se. Only the build ID is of interest, and the ELF build ID itself doesn't expose any sensitive information. So this patch adds a way to request the backing file's ELF build ID along with the rest of the VMA information in the same API. The user controls whether this piece of information is requested by setting the build_id_size field either to zero or to the non-zero maximum size of the buffer provided through the build_id_addr field (which encodes the user pointer as a __u64 field). This is a completely optional piece of information, and so has no performance implications for use cases that don't care about build ID, while improving performance and simplifying the setup for those applications that do need it. The kernel already implements build ID fetching, which is used from the BPF subsystem. We are reusing this code here, but plan follow-up changes to make it work better under the more relaxed assumption (compared to what the existing code assumes) of being called from user process context, in which page faults are allowed. The BPF-specific implementation currently bails out if the necessary part of the ELF file is not paged in, due to extra BPF-specific restrictions (like the need to fetch build ID in restrictive contexts such as NMI handlers). [andrii@kernel.org: fix integer to pointer cast warning in do_procmap_query()] Link: https://lkml.kernel.org/r/20240701174805.1897344-1-andrii@kernel.org Link: https://lkml.kernel.org/r/20240627170900.1672542-4-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
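To make the build ID request concrete, here is a minimal user-space sketch. The struct and field names (procmap_query, size, query_addr, query_flags, build_id_size, build_id_addr) follow the descriptions in this series, but the authoritative layout lives in the UAPI header (linux/fs.h), so treat this as illustrative rather than definitive:

    /* Query the build ID of the file backing the VMA that covers `addr`. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>      /* struct procmap_query, PROCMAP_QUERY */

    static int query_build_id(int maps_fd, unsigned long addr)
    {
        struct procmap_query q;
        unsigned char build_id[20];     /* large enough for a SHA-1 build ID */

        memset(&q, 0, sizeof(q));
        q.size = sizeof(q);             /* required for forward/backward compat */
        q.query_addr = addr;
        q.query_flags = 0;              /* exact match: VMA must cover addr */
        q.build_id_size = sizeof(build_id);
        q.build_id_addr = (unsigned long long)(uintptr_t)build_id;

        if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
            return -1;                  /* e.g. errno == ENOENT: no covering VMA */

        /* On success, build_id_size is updated to the actual build ID length. */
        for (unsigned int i = 0; i < q.build_id_size; i++)
            printf("%02x", build_id[i]);
        printf("\n");
        return 0;
    }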
2024-07-12  fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps  (Andrii Nakryiko)
The /proc/<pid>/maps file is extremely useful in practice for various tasks involving figuring out a process's memory layout, what files back any given memory range, etc. One important class of applications that absolutely rely on this are profilers/stack symbolizers (the perf tool being one of them). Patterns of use differ, but they generally fall into two categories. In the on-demand pattern, a profiler/symbolizer would normally capture a stack trace containing absolute memory addresses of some functions, and would then use /proc/<pid>/maps to find the corresponding backing ELF files (normally, only executable VMAs are of interest), file offsets within them, and then continue from there to get yet more information (ELF symbols, DWARF information) to produce human-readable symbolic information. This pattern is used by Meta's fleet-wide profiler, as one example. In the preprocessing pattern, the application doesn't know the set of addresses of interest, so it has to fetch all relevant VMAs (again, probably only executable ones), store or cache them, then proceed with profiling and stack trace capture. Once done, it would do symbolization based on the stored VMA information. This can happen at a much later point in time. This pattern is used by the perf tool, as an example. In either case, there are both performance and correctness requirements involved. This address-to-VMA-information translation has to be done as efficiently as possible, but also not miss any VMA (especially in the case of loading/unloading shared libraries). In practice, correctness can't be guaranteed (due to the process dying before VMA data can be captured, a shared library being unloaded, etc.), but any effort to maximize the chance of finding the VMA is appreciated. Unfortunately, for all the /proc/<pid>/maps file's universality and usefulness, it doesn't fit the above use cases 100%. First, its main purpose is to emit all VMAs sequentially, but in practice captured addresses fall only into a smaller subset of all the process's VMAs, mainly containing executable text. Yet, a library would need to parse most or all of the contents to find the needed VMAs, as there is no way to skip VMAs that are of no use. An efficient library can do the linear pass relatively cheaply, but it's definitely overhead that could be avoided if there were a way to do more targeted querying of the relevant VMA information. Second, it's a text-based interface, which makes its programmatic use from applications and libraries more cumbersome and inefficient due to the need to handle text parsing to get the necessary pieces of information. The overhead is actually paid both by the kernel, formatting originally binary VMA data into text, and then by the user space application, parsing it back into binary data for further use. For the on-demand pattern of usage described above, another problem when writing a generic stack trace symbolization library is an unfortunate performance-vs-correctness tradeoff that needs to be made. The library either has to cache the parsed contents of /proc/<pid>/maps (after initial processing) to service future requests (if the application asks to symbolize another set of addresses for the same process, captured at some later time, which is typical for periodic/continuous profiling cases), avoiding the higher cost of re-parsing this file, or it has to re-open and re-parse /proc/<pid>/maps anew for each request.
In the former case, more memory is used for the cache and there is a risk of getting stale data if the application loads or unloads shared libraries, or otherwise changes its set of VMAs somehow, e.g., through additional mmap() calls. In the latter case, it's the performance hit that comes from re-opening the file and re-parsing its contents all over again. This patch aims to solve this problem by providing a new API built on top of /proc/<pid>/maps. It's meant to address both the non-selectiveness and the text nature of /proc/<pid>/maps: it gives the user more control over what sort of VMA(s) needs to be queried, and, being a binary interface, it eliminates the overhead of text formatting (on the kernel side) and parsing (on the user space side). It's also designed to be extensible and forward/backward compatible by including a required struct size field, which the user has to provide. We use the established copy_struct_from_user() approach to handle extensibility. The user can choose to get either the VMA that covers the provided address or -ENOENT if none is found (the exact, least surprising, case). Or, with an extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA), they can get either the VMA that covers the address (if there is one), or the closest next VMA (i.e., the VMA with the smallest vm_start > addr). The latter allows more efficient use, but, given it could be surprising behavior, requires an explicit opt-in. There is another query flag that is useful for some use cases. PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return file-backed VMAs. Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA makes it possible to efficiently iterate only the file-backed VMAs of the process, which is what profilers/symbolizers are normally interested in. All the above querying flags can be combined with an (also optional) set of desired VMA permission flags. This allows, for example, iterating only the executable subset of VMAs, which is what the preprocessing pattern used by the perf tool would benefit from, as the assumption is that captured stack traces contain addresses of executable code. This saves time by efficiently skipping non-executable VMAs altogether. All these querying flags (modifiers) are orthogonal and can be combined in a semantically meaningful and natural way. Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes sense given it queries the same set of VMA data. It's also beneficial because permission checks for /proc/<pid>/maps are performed once at open time, and the actual reads of its text contents are done without further permission checks. We piggyback on this pattern with the ioctl()-based API as well, as that's a desired property, both for performance reasons and for security and flexibility reasons. Allowing an application to open an FD for /proc/self/maps without any extra capabilities, and then pass it to some sort of profiling agent through a Unix-domain socket, lets such a profiling agent avoid some of the capabilities that are otherwise expected when opening the /proc/<pid>/maps file for *another* process. This is a desirable property for some more restricted setups. This new ioctl-based implementation doesn't interfere with the seq_file-based implementation of the /proc/<pid>/maps textual interface, and so the two can be used together or independently without paying any price for that. Note also that fetching the VMA name (e.g., the backing file path, or special hard-coded or user-provided names) is optional, just like the build ID.
If the user sets vma_name_size to zero, the kernel code won't attempt to retrieve it, saving resources. Earlier versions of this patch set added per-VMA locking, which is why the code structure is ready for abstracting mmap_lock vs. vm_lock differences (query_vma_setup(), query_vma_teardown(), and query_vma_find_by_addr()), but given that anon_vma_name() is not yet compatible with per-VMA locking, the initial implementation sticks to using only mmap_lock for now. It will be easy to add per-VMA locking back once all the pieces are ready later on, which is why we keep the existing code structure with setup/teardown/query helper functions. [andrii@kernel.org: improve PROCMAP_QUERY's compat mode handling] Link: https://lkml.kernel.org/r/20240701174805.1897344-2-andrii@kernel.org Link: https://lkml.kernel.org/r/20240627170900.1672542-3-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
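An illustrative user-space sketch of the iteration pattern described above (only executable, file-backed VMAs). PROCMAP_QUERY_COVERING_OR_NEXT_VMA and PROCMAP_QUERY_FILE_BACKED_VMA are named in the text; the permission-flag name PROCMAP_QUERY_VMA_EXECUTABLE is an assumption and should be checked against the UAPI header:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <linux/fs.h>

    static void dump_exec_vmas(pid_t pid)
    {
        char path[64], name[256];
        struct procmap_query q;
        unsigned long addr = 0;
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/maps", pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return;

        for (;;) {
            memset(&q, 0, sizeof(q));
            q.size = sizeof(q);
            q.query_addr = addr;
            q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
                            PROCMAP_QUERY_FILE_BACKED_VMA |
                            PROCMAP_QUERY_VMA_EXECUTABLE;
            q.vma_name_size = sizeof(name);
            q.vma_name_addr = (unsigned long long)(uintptr_t)name;

            if (ioctl(fd, PROCMAP_QUERY, &q) < 0)
                break;                  /* ENOENT: no more matching VMAs */

            printf("%llx-%llx %s\n", (unsigned long long)q.vma_start,
                   (unsigned long long)q.vma_end, name);
            addr = q.vma_end;           /* continue with the next VMA */
        }
        close(fd);
    }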
2024-07-12  fs/procfs: extract logic for getting VMA name constituents  (Andrii Nakryiko)
Patch series "ioctl()-based API to query VMAs from /proc/<pid>/maps", v6. Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow applications to query VMA information more efficiently than reading *all* VMAs nonselectively through text-based interface of /proc/<pid>/maps file. Patch #2 goes into a lot of details and background on some common patterns of using /proc/<pid>/maps in the area of performance profiling and subsequent symbolization of captured stack traces. As mentioned in that patch, patterns of VMA querying can differ depending on specific use case, but can generally be grouped into two main categories: the need to query a small subset of VMAs covering a given batch of addresses, or reading/storing/caching all (typically, executable) VMAs upfront for later processing. The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by the former pattern of usage. Earlier revisions had a patch adding a tool that faithfully reproduces an efficient VMA matching pass of a symbolizer, collecting a subset of covering VMAs for a given set of addresses as efficiently as possible. This tool served both as a testing ground, as well as a benchmarking tool. It implements everything both for currently existing text-based /proc/<pid>/maps interface, as well as for newly-added PROCMAP_QUERY ioctl(). This revision dropped the tool from the patch set and, once the API lands upstream, this tool might be added separately on Github as an example. Based on discussion on earlier revisions of this patch set, it turned out that this ioctl() API is competitive with highly-optimized text-based pre-processing pattern that perf tool is using. Based on perf discussion, this revision adds more flexibility in specifying a subset of VMAs that are of interest. Now it's possible to specify desired permissions of VMAs (e.g., request only executable ones) and/or restrict to only a subset of VMAs that have file backing. This further improves the efficiency when using this new API thanks to more selective (executable VMAs only) querying. In addition to a custom benchmarking tool, and experimental perf integration (available at [0]), Daniel Mueller has since also implemented an experimental integration into blazesym (see [1]), a library used for stack trace symbolization by our server fleet-wide profiler and another on-device profiler agent that runs on weaker ARM devices. The latter ARM-based device profiler is especially sensitive to performance, and so we benchmarked and compared text-based /proc/<pid>/maps solution to the equivalent one using PROCMAP_QUERY ioctl(). Results are very encouraging, giving us 5x improvement for end-to-end so-called "address normalization" pass, which is the part of the symbolization process that happens locally on ARM device, before being sent out for further heavier-weight processing on more powerful remote server. Note that this is not an artificial microbenchmark. It's a full end-to-end API call being measured with real-world data on real-world device. TEXT-BASED ========== Benchmarking main/normalize_process_no_build_ids_uncached_maps main/normalize_process_no_build_ids_uncached_maps time: [49.777 µs 49.982 µs 50.250 µs] IOCTL-BASED =========== Benchmarking main/normalize_process_no_build_ids_uncached_maps main/normalize_process_no_build_ids_uncached_maps time: [10.328 µs 10.391 µs 10.457 µs] change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02) Performance has improved. 
You can see above the drop from 50µs down to 10µs for exactly the same amount of work, with the same data and target process. With the aforementioned custom tool, we see about a ~40x improvement (it might vary a bit, depending on the specific captured set of addresses). And even for the perf-based benchmark it's on par or slightly ahead when using permission-based filtering (fetching only executable VMAs). Earlier revisions attempted to use per-VMA locking, if the kernel was compiled with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not yet compatible with per-VMA locking and assumes mmap_lock to be taken, which makes the use of per-VMA locking for this API premature. It was agreed ([2]) to continue for now with just mmap_lock, but the code structure is such that it should be easy to add per-VMA locking support once all the pieces are ready. One thing that did not change was basing this new API as an ioctl() command on the /proc/<pid>/maps file. An ioctl-based API on top of pidfd was considered, but has its own downsides. Implementing ioctl() directly on pidfd would cause access permission checks on every single ioctl(), which leads to performance concerns and potential spam of capable() audit messages. It also prevents a nice pattern, possible with /proc/<pid>/maps, in which an application opens a /proc/self/maps FD (requiring no additional capabilities) and passes this FD to a profiling agent for querying. To achieve a similar pattern, a new file would have to be created from the pidfd just for VMA querying, which is considered inferior to just querying a /proc/<pid>/maps FD as proposed in the current approach. These aspects were discussed in the hallway track at the recent LSF/MM/BPF 2024 and sticking to a procfs ioctl() was the final agreement we arrived at.

[0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
[1] https://github.com/libbpf/blazesym/pull/675
[2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/

This patch (of 6): Extract the generic logic that fetches the relevant pieces of data describing a VMA name. This could be just some string (either a special constant or user-provided), or a string with some formatted wrapping text (e.g., "[anon_shmem:<something>]"), or, commonly, a file path. The seq_file-based logic has different methods to handle all three cases, but they are currently mixed in with extracting the underlying sources of data. This patch splits this into data fetching and data formatting, so that data fetching can be reused later on. There should be no functional changes. Link: https://lkml.kernel.org/r/20240627170900.1672542-1-andrii@kernel.org Link: https://lkml.kernel.org/r/20240627170900.1672542-2-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  selftests/udmabuf: add tests to verify data after page migration  (Vivek Kasireddy)
Since the memfd pages associated with a udmabuf may be migrated as part of udmabuf create, we need to verify the data coherency after successful migration. The new tests added in this patch try to do just that using 4k sized pages and also 2 MB sized huge pages for the memfd. Successful completion of the tests would mean that there is no disconnect between the memfd pages and the ones associated with a udmabuf. And, these tests can also be augmented in the future to test newer udmabuf features (such as handling memfd hole punch). The idea for these tests comes from a patch by Mike Kravetz here: https://lists.freedesktop.org/archives/dri-devel/2023-June/410623.html

v1->v2: (suggestions from Shuah)
- Use ksft_* functions to print and capture results of tests
- Use appropriate KSFT_* status codes for exit()
- Add Mike Kravetz's suggested-by tag

Link: https://lkml.kernel.org/r/20240624063952.1572359-10-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  udmabuf: pin the pages using memfd_pin_folios() API  (Vivek Kasireddy)
Using memfd_pin_folios() will ensure that the pages are pinned correctly using FOLL_PIN. And, this also ensures that we don't accidentally break features such as memory hotunplug as it would not allow pinning pages in the movable zone. Using this new API also simplifies the code as we no longer have to deal with extracting individual pages from their mappings or handle shmem and hugetlb cases separately. Link: https://lkml.kernel.org/r/20240624063952.1572359-9-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  udmabuf: convert udmabuf driver to use folios  (Vivek Kasireddy)
This is mainly a preparatory patch to use memfd_pin_folios() API for pinning folios. Using folios instead of pages makes sense as the udmabuf driver needs to handle both shmem and hugetlb cases. And, using the memfd_pin_folios() API makes this easier as we no longer need to separately handle shmem vs hugetlb cases in the udmabuf driver. Note that, the function vmap_udmabuf() still needs a list of pages; so, we collect all the head pages into a local array in this case. Other changes in this patch include the addition of helpers for checking the memfd seals and exporting dmabuf. Moving code from udmabuf_create() into these helpers improves readability given that udmabuf_create() is a bit long. Link: https://lkml.kernel.org/r/20240624063952.1572359-8-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  udmabuf: add back support for mapping hugetlb pages  (Vivek Kasireddy)
A user or admin can configure a VMM (Qemu) Guest's memory to be backed by hugetlb pages for various reasons. However, a Guest OS would still allocate (and pin) buffers that are backed by regular 4k sized pages. In order to map these buffers and create dma-bufs for them on the Host, we first need to find the hugetlb pages where the buffer allocations are located and then determine the offsets of individual chunks (within those pages) and use this information to eventually populate a scatterlist.

Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options were passed to the Host kernel and Qemu was launched with these relevant options:

    qemu-system-x86_64 -m 4096m....
    -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
    -display gtk,gl=on
    -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
    -machine memory-backend=mem1

Replacing -display gtk,gl=on with -display gtk,gl=off above would exercise the mmap handler.

Link: https://lkml.kernel.org/r/20240624063952.1572359-7-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> (v2) Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  udmabuf: use vmf_insert_pfn and VM_PFNMAP for handling mmap  (Vivek Kasireddy)
Add VM_PFNMAP to vm_flags in the mmap handler to ensure that the mappings would be managed without using struct page. And, in the vm_fault handler, use vmf_insert_pfn to share the page's pfn to userspace instead of directly sharing the page (via struct page *). Link: https://lkml.kernel.org/r/20240624063952.1572359-6-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
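As an illustration of the pattern described above, here is a generic sketch, not the actual udmabuf code; get_backing_pfn() is a hypothetical helper standing in for however a driver resolves the pfn for a fault offset:

    #include <linux/mm.h>

    /* Hypothetical lookup of the pfn backing a given page offset. */
    static int get_backing_pfn(void *priv, pgoff_t pgoff, unsigned long *pfn);

    static vm_fault_t example_vm_fault(struct vm_fault *vmf)
    {
        struct vm_area_struct *vma = vmf->vma;
        unsigned long pfn;

        if (get_backing_pfn(vma->vm_private_data, vmf->pgoff, &pfn))
            return VM_FAULT_SIGBUS;

        /* Insert a raw pfn mapping; no struct page is handed to userspace. */
        return vmf_insert_pfn(vma, vmf->address, pfn);
    }

    static const struct vm_operations_struct example_vm_ops = {
        .fault = example_vm_fault,
    };

    static int example_mmap(struct file *file, struct vm_area_struct *vma)
    {
        vma->vm_ops = &example_vm_ops;
        vma->vm_private_data = file->private_data;
        /* Tell the MM this mapping is not backed by struct page. */
        vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
        return 0;
    }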
2024-07-12  udmabuf: add CONFIG_MMU dependency  (Arnd Bergmann)
There is no !CONFIG_MMU version of vmf_insert_pfn():

    arm-linux-gnueabi-ld: drivers/dma-buf/udmabuf.o: in function `udmabuf_vm_fault':
    udmabuf.c:(.text+0xaa): undefined reference to `vmf_insert_pfn'

Link: https://lkml.kernel.org/r/20240624063952.1572359-5-vivek.kasireddy@intel.com Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Airlie <airlied@redhat.com> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  mm/gup: introduce memfd_pin_folios() for pinning memfd folios  (Vivek Kasireddy)
For drivers that would like to longterm-pin the folios associated with a memfd, the memfd_pin_folios() API provides an option to not only pin the folios via FOLL_PIN but also to check and migrate them if they reside in movable zone or CMA block. This API currently works with memfds but it should work with any files that belong to either shmemfs or hugetlbfs. Files belonging to other filesystems are rejected for now. The folios need to be located first before pinning them via FOLL_PIN. If they are found in the page cache, they can be immediately pinned. Otherwise, they need to be allocated using the filesystem specific APIs and then pinned. [akpm@linux-foundation.org: improve the CONFIG_MMU=n situation, per SeongJae] [vivek.kasireddy@intel.com: return -EINVAL if the end offset is greater than the size of memfd] Link: https://lkml.kernel.org/r/IA0PR11MB71850525CBC7D541CAB45DF1F8DB2@IA0PR11MB7185.namprd11.prod.outlook.com Link: https://lkml.kernel.org/r/20240624063952.1572359-4-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2) Reviewed-by: David Hildenbrand <david@redhat.com> (v3) Reviewed-by: Christoph Hellwig <hch@lst.de> (v6) Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
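A hedged in-kernel usage sketch of the new API (not taken from any driver). The memfd_pin_folios() signature shown here follows this series' description, and the header locations are assumptions; both should be checked against the real tree before use:

    #include <linux/memfd.h>
    #include <linux/mm.h>

    static long pin_memfd_range(struct file *memfd, loff_t len,
                                struct folio **folios, unsigned int max_folios)
    {
        pgoff_t offset;
        long nr;

        /* Pins via FOLL_PIN and migrates folios out of movable/CMA memory. */
        nr = memfd_pin_folios(memfd, 0, len - 1, folios, max_folios, &offset);
        if (nr < 0)
            return nr;      /* e.g. -EINVAL for unsupported filesystems */

        /* ... set up long-term DMA using the pinned folios ... */

        unpin_folios(folios, nr);   /* drop the FOLL_PIN references */
        return 0;
    }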
2024-07-12  mm/gup: introduce check_and_migrate_movable_folios()  (Vivek Kasireddy)
This helper is the folio equivalent of check_and_migrate_movable_pages(). Therefore, all the rules that apply to check_and_migrate_movable_pages() also apply to this one as well. Currently, this helper is only used by memfd_pin_folios(). This patch also includes changes to rename and convert the internal functions collect_longterm_unpinnable_pages() and migrate_longterm_unpinnable_pages() to work on folios. As a result, check_and_migrate_movable_pages() is now a wrapper around check_and_migrate_movable_folios(). Link: https://lkml.kernel.org/r/20240624063952.1572359-3-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  mm/gup: introduce unpin_folio/unpin_folios helpers  (Vivek Kasireddy)
Patch series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios", v16. Currently, some drivers (e.g, Udmabuf) that want to longterm-pin the pages/folios associated with a memfd, do so by simply taking a reference on them. This is not desirable because the pages/folios may reside in Movable zone or CMA block. Therefore, having drivers use memfd_pin_folios() API ensures that the folios are appropriately pinned via FOLL_PIN for longterm DMA. This patchset also introduces a few helpers and converts the Udmabuf driver to use folios and memfd_pin_folios() API to longterm-pin the folios for DMA. Two new Udmabuf selftests are also included to test the driver and the new API. This patch (of 9): These helpers are the folio versions of unpin_user_page/unpin_user_pages. They are currently only useful for unpinning folios pinned by memfd_pin_folios() or other associated routines. However, they could find new uses in the future, when more and more folio-only helpers are added to GUP. We should probably sanity check the folio as part of unpin similar to how it is done in unpin_user_page/unpin_user_pages but we cannot cleanly do that at the moment without also checking the subpage. Therefore, sanity checking needs to be added to these routines once we have a way to determine if any given folio is anon-exclusive (via a per folio AnonExclusive flag). Link: https://lkml.kernel.org/r/20240624063952.1572359-1-vivek.kasireddy@intel.com Link: https://lkml.kernel.org/r/20240624063952.1572359-2-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  mm/zswap: use only one pool in zswap  (Chengming Zhou)
Zswap uses 32 pools to work around the locking scalability problem in zswap backends (mainly zsmalloc nowadays), which brings its own problems like memory waste and more memory fragmentation. Testing results show that we can get nearly the same performance with only one pool in zswap after changing zsmalloc to use a per-size_class lock instead of the pool spinlock.

Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and zswap shrinker enabled with a 10GB swapfile on ext4:

                                 real    user     sys
    6.10.0-rc3                   138.18  1241.38  1452.73
    6.10.0-rc3-onepool           149.45  1240.45  1844.69
    6.10.0-rc3-onepool-perclass  138.23  1242.37  1469.71

Doing the same testing using zbud shows slightly worse performance, as expected, since we don't do any locking optimization for zbud. I think that's acceptable since zsmalloc became a lot more popular than other backends, and we may want to support only zsmalloc in the future.

                                 real    user     sys
    6.10.0-rc3-zbud              138.23  1239.58  1430.09
    6.10.0-rc3-onepool-zbud      139.64  1241.37  1516.59

[chengming.zhou@linux.dev: fix error handling in zswap_pool_create(), per Dan Carpenter] Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-2-d30e9cd2b793@linux.dev [chengming.zhou@linux.dev: fix error handling again in zswap_pool_create(), per Yosry] Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-2-ad941699cb61@linux.dev Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-2-5e5081ea11b3@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  mm/zsmalloc: change back to per-size_class lock  (Chengming Zhou)
Patch series "mm/zsmalloc: change back to per-size_class lock, v2". Commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") changed per-size_class lock to pool spinlock to prepare reclaim support in zsmalloc. Then reclaim support in zsmalloc had been dropped in favor of LRU reclaim in zswap, but this locking change had been left there. Obviously, the scalability of pool spinlock is worse than per-size_class. And we have a workaround that using 32 pools in zswap to avoid this scalability problem, which brings its own problems like memory waste and more memory fragmentation. So this series changes back to use per-size_class lock and using testing data in much stressed situation to verify that we can use only one pool in zswap. Note we only test and care about the zsmalloc backend, which makes sense now since zsmalloc became a lot more popular than other backends. Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and zswap shrinker enabled with 10GB swapfile on ext4. real user sys 6.10.0-rc3 138.18 1241.38 1452.73 6.10.0-rc3-onepool 149.45 1240.45 1844.69 6.10.0-rc3-onepool-perclass 138.23 1242.37 1469.71 We can see from "sys" column that per-size_class locking with only one pool in zswap can have near performance with the current 32 pools. This patch (of 2): This patch is almost the revert of the commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks"), which changed to use a global pool->lock instead of per-size_class lock and pool->migrate_lock, was preparation for suppporting reclaim in zsmalloc. Then reclaim in zsmalloc had been dropped in favor of LRU reclaim in zswap. In theory, per-size_class is more fine-grained than the pool->lock, since a pool can have many size_classes. As for the additional pool->migrate_lock, only free() and map() need to grab it to access stable handle to get zspage, and only in read lock mode. Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-0-ad941699cb61@linux.dev Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-0-d30e9cd2b793@linux.dev Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-0-5e5081ea11b3@linux.dev Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-1-5e5081ea11b3@linux.dev Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  mm/hugetlb.c: undo errant change  (Andrew Morton)
During conflict resolution a line was unintentionally removed by a ksm.c patch. Link: https://lkml.kernel.org/r/85b0d694-d1ac-8e7a-2e50-1edc03eee21a@google.com Fixes: ac90c56bbd73 ("mm/ksm: refactor out try_to_merge_with_zero_page()") Reported-by: Hugh Dickins <hughd@google.com> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  PCI: Extend ACS configurability  (Vidya Sagar)
PCIe ACS settings control the level of isolation and the possible P2P paths between devices. With greater isolation the kernel will create smaller iommu_groups, and with less isolation there is more HW that can achieve P2P transfers. From a virtualization perspective, all devices in the same iommu_group must be assigned to the same VM as they lack security isolation. There is no way for the kernel to automatically know the correct ACS settings for any given system and workload. Existing command line options (e.g., disable_acs_redir) allow only large-scale changes, disabling all isolation, but this is not sufficient for more complex cases. Add a kernel command-line option 'config_acs' to directly control all the ACS bits for specific devices, which allows the operator to set up the right level of isolation to achieve the desired P2P configuration. The definition is future proof; when new ACS bits are added to the spec, the open syntax can be extended. ACS needs to be set up early in kernel boot because the ACS settings affect how iommu_groups are formed. iommu_group formation is a one-time event during initial device discovery, so changing ACS bits after kernel boot can result in an inaccurate view of the iommu_groups compared to the current isolation configuration. ACS applies to PCIe Downstream Ports and multi-function devices. The default ACS settings are strict and deny any direct traffic between two functions. This results in the smallest iommu_groups the HW can support. Frequently these values result in slow or non-working P2PDMA. ACS offers a range of security choices controlling how traffic is allowed to go directly between two devices. Some popular choices:

 - Full prevention
 - Translated requests can be direct, with various options
 - Asymmetric direct traffic, A can reach B but not the reverse
 - All traffic can be direct

along with some other less common ones for special topologies. The intention is that this option would be used with expert knowledge of the HW capability and workload to achieve the desired configuration. Link: https://lore.kernel.org/r/20240625153150.159310-1-vidyas@nvidia.com Signed-off-by: Vidya Sagar <vidyas@nvidia.com> [bhelgaas: add example, tidy printk formats] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
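To make the shape of the new option concrete, an illustration of its general form follows. The per-bit characters ('1' to force a bit on, '0' to force it off, 'x' to leave it unchanged), the device address syntax, and the number/ordering of flag characters shown here are assumptions; the authoritative syntax is the kernel-parameters.txt entry added by this patch:

    pci=config_acs=<ACS flags>@<pci_dev>[; <ACS flags>@<pci_dev> ...]
    pci=config_acs=1xxxxxx@0000:01:00.0

The second line would force one ACS control bit on for a single downstream port while leaving the remaining bits at their default values.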
2024-07-12  PCI: Add missing bridge lock to pci_bus_lock()  (Dan Williams)
One of the true positives that the cfg_access_lock lockdep effort identified is this sequence:

    WARNING: CPU: 14 PID: 1 at drivers/pci/pci.c:4886 pci_bridge_secondary_bus_reset+0x5d/0x70
    RIP: 0010:pci_bridge_secondary_bus_reset+0x5d/0x70
    Call Trace:
     <TASK>
     ? __warn+0x8c/0x190
     ? pci_bridge_secondary_bus_reset+0x5d/0x70
     ? report_bug+0x1f8/0x200
     ? handle_bug+0x3c/0x70
     ? exc_invalid_op+0x18/0x70
     ? asm_exc_invalid_op+0x1a/0x20
     ? pci_bridge_secondary_bus_reset+0x5d/0x70
     pci_reset_bus+0x1d8/0x270
     vmd_probe+0x778/0xa10
     pci_device_probe+0x95/0x120

Where pci_reset_bus() users are triggering unlocked secondary bus resets. Ironically pci_bus_reset(), several calls down from pci_reset_bus(), uses pci_bus_lock() before issuing the reset which locks everything *but* the bridge itself. For the same motivation as adding:

    bridge = pci_upstream_bridge(dev);

    if (bridge)
            pci_dev_lock(bridge);

to pci_reset_function() for the "bus" and "cxl_bus" reset cases, add pci_dev_lock() for @bus->self to pci_bus_lock(). Link: https://lore.kernel.org/r/171711747501.1628941.15217746952476635316.stgit@dwillia2-xfh.jf.intel.com Reported-by: Imre Deak <imre.deak@intel.com> Closes: http://lore.kernel.org/r/6657833b3b5ae_14984b29437@dwillia2-xfh.jf.intel.com.notmuch Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Keith Busch <kbusch@kernel.org> [bhelgaas: squash in recursive locking deadlock fix from Keith Busch: https://lore.kernel.org/r/20240711193650.701834-1-kbusch@meta.com] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Hans de Goede <hdegoede@redhat.com> Tested-by: Kalle Valo <kvalo@kernel.org> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
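A sketch of the shape of the fix described above (not the verbatim upstream diff): lock the bridge that owns the bus first, then walk the children, recursing into subordinate buses so that child bridges are locked exactly once:

    static void pci_bus_lock(struct pci_bus *bus)
    {
        struct pci_dev *dev;

        /* Lock the bridge leading to this bus, which the old code skipped. */
        pci_dev_lock(bus->self);
        list_for_each_entry(dev, &bus->devices, bus_list) {
            if (dev->subordinate)
                /* The recursion locks the child bridge as bus->self. */
                pci_bus_lock(dev->subordinate);
            else
                pci_dev_lock(dev);
        }
    }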
2024-07-12  dt-bindings: ata: ahci-fsl-qoriq: add fsl,ls1046a-ahci and fsl,ls1012a-ahci  (Frank Li)
Add the missing compatible strings 'fsl,ls1046a-ahci' and 'fsl,ls1012a-ahci' to the documented list. Allow 'fsl,ls1012a-ahci' to fall back to 'fsl,ls1043a-ahci'. Fix the following CHECK_DTB warnings:

    arch/arm64/boot/dts/freescale/fsl-ls1012a-qds.dtb: sata@3200000: compatible:0: 'fsl,ls1012a-ahci' is not one of ['fsl,ls1021a-ahci', 'fsl,ls1043a-ahci', 'fsl,ls1028a-ahci', 'fsl,ls1088a-ahci', 'fsl,ls2080a-ahci', 'fsl,lx2160a-ahci']
    arch/arm64/boot/dts/freescale/fsl-ls1012a-qds.dtb: sata@3200000: compatible: ['fsl,ls1012a-ahci', 'fsl,ls1043a-ahci'] is too long

Fixes: e58e12c5c34c ("dt-bindings: ata: ahci-fsl-qoriq: convert to yaml format") Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/r/20240712185740.3819170-1-Frank.Li@nxp.com [cassel: rewrap commit log lines, capitalize SATA] Signed-off-by: Niklas Cassel <cassel@kernel.org>
2024-07-12  docs: driver-model: platform: update the definition of platform_driver  (Eric Biggers)
Update the documented struct platform_driver to match the code. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20240711200421.11428-1-ebiggers@kernel.org
2024-07-12  selftests/bpf: Test for null-pointer-deref bugfix in resolve_prog_type()  (Tengda Wu)
This test verifies that resolve_prog_type() works as expected when `attach_prog_fd` is not passed in. `prog->aux->dst_prog` in resolve_prog_type() is assigned by `attach_prog_fd`, and would be NULL if `attach_prog_fd` is not provided. Loading EXT prog with bpf_dynptr_from_skb() kfunc call in this way will lead to null-pointer-deref. Verify that the null-pointer-deref bug in resolve_prog_type() is fixed. Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240711145819.254178-3-wutengda@huaweicloud.com
2024-07-12  bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT  (Tengda Wu)
When loading an EXT program without specifying `attr->attach_prog_fd`, the `prog->aux->dst_prog` will be null. At this time, calling resolve_prog_type() anywhere will result in a null pointer dereference. Example stack trace:

    [ 8.107863] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
    [ 8.108262] Mem abort info:
    [ 8.108384]   ESR = 0x0000000096000004
    [ 8.108547]   EC = 0x25: DABT (current EL), IL = 32 bits
    [ 8.108722]   SET = 0, FnV = 0
    [ 8.108827]   EA = 0, S1PTW = 0
    [ 8.108939]   FSC = 0x04: level 0 translation fault
    [ 8.109102] Data abort info:
    [ 8.109203]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
    [ 8.109399]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
    [ 8.109614]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
    [ 8.109836] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000101354000
    [ 8.110011] [0000000000000004] pgd=0000000000000000, p4d=0000000000000000
    [ 8.112624] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
    [ 8.112783] Modules linked in:
    [ 8.113120] CPU: 0 PID: 99 Comm: may_access_dire Not tainted 6.10.0-rc3-next-20240613-dirty #1
    [ 8.113230] Hardware name: linux,dummy-virt (DT)
    [ 8.113390] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 8.113429] pc : may_access_direct_pkt_data+0x24/0xa0
    [ 8.113746] lr : add_subprog_and_kfunc+0x634/0x8e8
    [ 8.113798] sp : ffff80008283b9f0
    [ 8.113813] x29: ffff80008283b9f0 x28: ffff800082795048 x27: 0000000000000001
    [ 8.113881] x26: ffff0000c0bb2600 x25: 0000000000000000 x24: 0000000000000000
    [ 8.113897] x23: ffff0000c1134000 x22: 000000000001864f x21: ffff0000c1138000
    [ 8.113912] x20: 0000000000000001 x19: ffff0000c12b8000 x18: ffffffffffffffff
    [ 8.113929] x17: 0000000000000000 x16: 0000000000000000 x15: 0720072007200720
    [ 8.113944] x14: 0720072007200720 x13: 0720072007200720 x12: 0720072007200720
    [ 8.113958] x11: 0720072007200720 x10: 0000000000f9fca4 x9 : ffff80008021f4e4
    [ 8.113991] x8 : 0101010101010101 x7 : 746f72705f6d656d x6 : 000000001e0e0f5f
    [ 8.114006] x5 : 000000000001864f x4 : ffff0000c12b8000 x3 : 000000000000001c
    [ 8.114020] x2 : 0000000000000002 x1 : 0000000000000000 x0 : 0000000000000000
    [ 8.114126] Call trace:
    [ 8.114159]  may_access_direct_pkt_data+0x24/0xa0
    [ 8.114202]  bpf_check+0x3bc/0x28c0
    [ 8.114214]  bpf_prog_load+0x658/0xa58
    [ 8.114227]  __sys_bpf+0xc50/0x2250
    [ 8.114240]  __arm64_sys_bpf+0x28/0x40
    [ 8.114254]  invoke_syscall.constprop.0+0x54/0xf0
    [ 8.114273]  do_el0_svc+0x4c/0xd8
    [ 8.114289]  el0_svc+0x3c/0x140
    [ 8.114305]  el0t_64_sync_handler+0x134/0x150
    [ 8.114331]  el0t_64_sync+0x168/0x170
    [ 8.114477] Code: 7100707f 54000081 f9401c00 f9403800 (b9400403)
    [ 8.118672] ---[ end trace 0000000000000000 ]---

One way to fix it is by forcing `attach_prog_fd` to be non-empty in bpf_prog_load(). But this would break the `libbpf_probe_bpf_prog_type` API, which uses the verifier log to probe the prog type and would log nothing if we rejected an invalid EXT prog before bpf_check(). Another way is by adding a null check in resolve_prog_type(). The issue was introduced by commit 4a9c7bbe2ed4 ("bpf: Resolve to prog->aux->dst_prog->type only for BPF_PROG_TYPE_EXT"), which wanted to correct type resolution for BPF_PROG_TYPE_TRACING programs. Before that, the type resolution of a BPF_PROG_TYPE_EXT prog actually followed the logic below:

    prog->aux->dst_prog ? prog->aux->dst_prog->type : prog->type;

It implies that when an EXT program is not yet attached to a `dst_prog`, the prog type should be EXT itself. This code worked fine in the past, so just keep using it.
Fix this by returning `prog->type` for BPF_PROG_TYPE_EXT if `dst_prog` is not present in resolve_prog_type(). Fixes: 4a9c7bbe2ed4 ("bpf: Resolve to prog->aux->dst_prog->type only for BPF_PROG_TYPE_EXT") Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20240711145819.254178-2-wutengda@huaweicloud.com
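Based on the description above, the fixed helper (in include/linux/bpf_verifier.h) looks roughly like this; treat it as an illustrative sketch rather than the verbatim upstream code:

    static inline enum bpf_prog_type resolve_prog_type(const struct bpf_prog *prog)
    {
        /*
         * An EXT program inherits its type from the program it attaches to,
         * but only if a destination program was actually supplied; otherwise
         * fall back to the program's own type instead of dereferencing NULL.
         */
        return (prog->type == BPF_PROG_TYPE_EXT && prog->aux->dst_prog) ?
                prog->aux->dst_prog->type : prog->type;
    }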
2024-07-12  ubifs: add check for crypto_shash_tfm_digest  (Chen Ni)
Add a check for the return value of crypto_shash_tfm_digest() and return the error if it fails, so that the failure is not silently ignored. Fixes: 817aa094842d ("ubifs: support offline signed images") Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
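A minimal sketch of the kind of check being added (the function and variable names here are placeholders, not the actual ubifs call site):

    #include <crypto/hash.h>

    static int example_compute_digest(struct crypto_shash *tfm,
                                      const u8 *buf, unsigned int len, u8 *hash)
    {
        int err;

        err = crypto_shash_tfm_digest(tfm, buf, len, hash);
        if (err)
            return err;     /* previously the return value was ignored */

        return 0;
    }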
2024-07-12  ubifs: Fix inconsistent inode size when powercut happens during appendant writing  (Zhihao Cheng)
UBIFS always makes sure that the data length won't go beyond the inode size by writing the inode before writing the page (see ubifs_writepage()). After commit c35acef383f4 ("ubifs: Convert ubifs_writepage to use a folio"), the rule is broken in one case: given a file with size 3, then write 4096 bytes from offset 0; the following process will make the inode size smaller than the file data length after powercut & recovery:

          P1                                              P2
    ubifs_writepage
     len = folio_size(folio)                 // 4096
     if (folio_pos(folio) + len <= i_size)   // condition 1: 0 + 4096 <= 4096
                                             // (i_size is updated as 4096 in ubifs_write_end)
      if (folio_pos(folio) >= synced_i_size) // condition 2: 0 >= 3, false
       write_inode                           // Skipped, because condition 2 is false
     do_writepage(folio, len)                // write one page
                                                       do_commit
                                                       // data node won't be replayed
                                                       // in next mounting
          >> Powercut <<

So the inode size (4096) is not written to disk, and we will get the following error messages in the next mounting (chk_fs = 1):

    check_leaf [ubifs]: data node at LEB 14:2048 is not within inode size 3
    dbg_walk_index [ubifs]: leaf checking function returned error -22, for leaf at LEB 14:2048

Fix it by restoring condition 2 to the original comparison (compare the page index of synced_i_size with the current page index). Fixes: c35acef383f4 ("ubifs: Convert ubifs_writepage to use a folio") Link: https://bugzilla.kernel.org/show_bug.cgi?id=218934 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12  ubi: block: fix null-pointer-dereference in ubiblock_create()  (Li Nan)
Similar to commit adbf4c4954e3 ("ubi: block: fix memleak in ubiblock_create()"), 'dev->gd' is not assigned but is dereferenced if blk_mq_alloc_tag_set() fails, leading to a null-pointer dereference. Fix it by using pr_err() and the variable 'dev' to print the error log. Additionally, the log in the error handling path of idr_alloc() has been improved by using pr_err(), too. Before the device name is initialized, dev_err() prints the error log with '(null)' instead of the actual device name, like this:

    block (null): ...
          ~~~~~~

This is unclear. Using pr_err() can print more details of the device. The improved log is:

    ubiblock0_0: ...

Fixes: 77567b25ab9f ("ubi: use blk_mq_alloc_disk and blk_cleanup_disk") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12  ubifs: fix kernel-doc warnings  (Jeff Johnson)
make C=1 reports the following kernel-doc warnings:

    fs/ubifs/compress.c:103: warning: Function parameter or struct member 'c' not described in 'ubifs_compress'
    fs/ubifs/compress.c:155: warning: Function parameter or struct member 'c' not described in 'ubifs_decompress'
    fs/ubifs/find.c:353: warning: Excess function parameter 'data' description in 'scan_for_free_cb'
    fs/ubifs/find.c:353: warning: Function parameter or struct member 'arg' not described in 'scan_for_free_cb'
    fs/ubifs/find.c:594: warning: Excess function parameter 'data' description in 'scan_for_idx_cb'
    fs/ubifs/find.c:594: warning: Function parameter or struct member 'arg' not described in 'scan_for_idx_cb'
    fs/ubifs/find.c:786: warning: Excess function parameter 'data' description in 'scan_dirty_idx_cb'
    fs/ubifs/find.c:786: warning: Function parameter or struct member 'arg' not described in 'scan_dirty_idx_cb'
    fs/ubifs/find.c:86: warning: Excess function parameter 'data' description in 'scan_for_dirty_cb'
    fs/ubifs/find.c:86: warning: Function parameter or struct member 'arg' not described in 'scan_for_dirty_cb'
    fs/ubifs/journal.c:369: warning: expecting prototype for wake_up_reservation(). Prototype was for add_or_start_queue() instead
    fs/ubifs/lprops.c:1018: warning: Excess function parameter 'lst' description in 'scan_check_cb'
    fs/ubifs/lprops.c:1018: warning: Function parameter or struct member 'arg' not described in 'scan_check_cb'
    fs/ubifs/lpt.c:1938: warning: Function parameter or struct member 'ptr' not described in 'lpt_scan_node'
    fs/ubifs/replay.c:60: warning: Function parameter or struct member 'hash' not described in 'replay_entry'

Fix them. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12  ubifs: correct UBIFS_DFS_DIR_LEN macro definition and improve code clarity  (ZhaoLong Wang)
The UBIFS_DFS_DIR_LEN macro, which defines the maximum length of the UBIFS debugfs directory name, has an incorrect formula and misleading comments. The current formula is (3 + 1 + 2*2 + 1), which assumes that both the UBI device number and the volume ID are limited to 2 characters. However, the UBI device number ranges from 0 to 31 (2 characters), and the volume ID ranges from 0 to 127 (up to 3 characters). Although the current code works due to the cancellation of mathematical errors (9 + 1 = 10, which matches the correct UBIFS_DFS_DIR_LEN value), it can lead to confusion and potential issues in the future. This patch aims to improve code clarity and maintainability by making the following changes:

1. Corrects the UBIFS_DFS_DIR_LEN macro definition to (3 + 1 + 2 + 3 + 1), accommodating the maximum lengths of both the UBI device number and the volume ID, plus the separator and null terminator.
2. Updates the snprintf calls to use UBIFS_DFS_DIR_LEN instead of UBIFS_DFS_DIR_LEN + 1, removing the unnecessary +1.
3. Modifies the error checks to compare against UBIFS_DFS_DIR_LEN using >= instead of >, aligning with the corrected macro definition.
4. Removes the redundant +1 in the dfs_dir_name array definitions in ubi.h and debug.h.

While these changes do not affect the runtime behavior, they make the code more readable, maintainable, and less prone to future errors.

v2->v3:
- Removes the duplicated UBIFS_DFS_DIR_LEN and UBIFS_DFS_DIR_NAME macro definitions in ubifs.h, as they are already defined in debug.h.

Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
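For reference, a sketch of the corrected definitions, assuming the debugfs directory name format "ubi<device>_<volume>"; the macro names appear in the commit text, but the comment wording here is illustrative:

    /* "ubi" (3) + '_' (1) + device number, 0-31 (up to 2 digits)
     *           + volume ID, 0-127 (up to 3 digits) + '\0' (1) */
    #define UBIFS_DFS_DIR_NAME "ubi%d_%d"
    #define UBIFS_DFS_DIR_LEN  (3 + 1 + 2 + 3 + 1)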
2024-07-12mtd: ubi: Restore missing cleanup on ubi_init() failure pathBen Hutchings
We need to clean up debugfs and ubiblock if we fail after initialising them. Signed-off-by: Ben Hutchings <ben.hutchings@mind.be> Fixes: 927c145208b0 ("mtd: ubi: attach from device tree") Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12ubifs: dbg_orphan_check: Fix missed key type checkingZhihao Cheng
When selinux/encryption is enabled, the xattr entry node is added into the TNC before the host inode when creating a new file. So it is possible to find an xattr entry without its host inode in the TNC. Orphan debug checking is called by ubifs_orphan_end_commit(); at that time the commit semaphore is already unlocked, so new creations won't be blocked. Fixes: d7f0b70d30ff ("UBIFS: Add security.* XATTR support for the UBIFS") Fixes: d475a507457b ("ubifs: Add skeleton for fscrypto") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12ubifs: Fix unattached inode when powercut happens in creatingZhihao Cheng
In selinux or encryption scenarios, UBIFS can become inconsistent when a powercut happens while creating new files. Encryption/selinux related xattrs are created before the file dentry, which makes the creation process non-atomic. The details are shown below:

Encryption case:
  ubifs_create
    ubifs_new_inode
    fscrypt_set_context
      ubifs_xattr_set
        create_xattr
          ubifs_jnl_update  // Disk: xentry xinode inode(LAST_OF_NODE_GROUP)
    >> power cut <<
    ubifs_jnl_update        // Disk: dentry inode parent_inode(LAST_OF_NODE_GROUP)

Selinux case:
  ubifs_create
    ubifs_new_inode
    ubifs_init_security
      security_inode_init_security
        ubifs_xattr_set
          create_xattr
            ubifs_jnl_update  // Disk: xentry xinode inode(LAST_OF_NODE_GROUP)
    >> power cut <<
    ubifs_jnl_update          // Disk: dentry inode parent_inode(LAST_OF_NODE_GROUP)

The above process makes chk_fs fail in the next mounting:

UBIFS error (ubi0:0 pid 7995): dbg_check_filesystem [ubifs]: inode 66 nlink is 1, but calculated nlink is 0

Fix it by allocating an orphan inode for each non-xattr file creation, then removing it from the orphan list in the journal-writing process, which ensures that both the xattr and the dentry become effective atomically when a powercut happens.

Fixes: d7f0b70d30ff ("UBIFS: Add security.* XATTR support for the UBIFS") Fixes: d475a507457b ("ubifs: Add skeleton for fscrypto") Link: https://bugzilla.kernel.org/show_bug.cgi?id=218309 Suggested-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12ubifs: Fix space leak when powercut happens in linking tmpfileZhihao Cheng
There is a potential space-leak problem when a powercut happens while linking a tmpfile: the inode node (with nlink=0) and its data nodes can be found in the TNC (on flash), but there are no dentries related to the inode, so the file is invisible yet takes up free space. The detailed process is shown below:

ubifs_tmpfile
  ubifs_jnl_update      // Add bud A into log area
  ubifs_add_orphan      // Add inode into orphan list

P1                                     P2
ubifs_link
  ubifs_delete_orphan   // Delete inode from orphan list, then inode won't
                        // be written into orphan area, there is no chance
                        // to delete inode by replaying orphan.
                                       commit  // bud A won't be replayed in next mounting
  >> powercut <<
  ubifs_jnl_update      // Link inode to dentry

The root cause is that the orphan entry deletion and the journal write (for the link) can be interrupted by a commit, so the two operations are not atomic. Fix it by doing ubifs_delete_orphan under the protection of c->commit_sem within ubifs_jnl_update. This is also a preparation for creating all new files via an orphan inode.

v1 is https://lore.kernel.org/linux-mtd/20200701093227.674945-1-chengzhihao1@huawei.com/

Fixes: 32fe905c17f0 ("ubifs: Fix O_TMPFILE corner case in ubifs_link()") Link: https://bugzilla.kernel.org/show_bug.cgi?id=208405 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
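A hedged sketch of the ordering this describes; write_dentry_and_inode_nodes() is a hypothetical stand-in for the body of ubifs_jnl_update(), and only the lock/ordering point is meant to be accurate:

```c
static int jnl_update_and_drop_orphan(struct ubifs_info *c, ino_t inum)
{
	int err;

	down_read(&c->commit_sem);	/* a commit takes this for writing,
					 * so it cannot run in between */
	err = write_dentry_and_inode_nodes(c);	/* hypothetical helper */
	if (!err)
		ubifs_delete_orphan(c, inum);	/* now atomic w.r.t. commit */
	up_read(&c->commit_sem);

	return err;
}
```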
2024-07-12ubifs: Move ui->data initialization after initializing securityZhihao Cheng
The host inode and its xattr are written to disk after initializing security when creating a symlink or device node; then the host inode and its dentry are written again in ubifs_jnl_update. There is no need to write the inode data in the security-initialization pass, so just move the ui->data initialization after security is initialized. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12ubifs: Fix adding orphan entry twice for the same inodeZhihao Cheng
A tmpfile can be added to the orphan list twice: the first time at creation, the second time when it is removed after having been linked. The orphan entry is added twice for a tmpfile if the following sequence occurs:

ubifs_tmpfile
  ubifs_jnl_update
  ubifs_add_orphan              // first time to add orphan entry

P1                                     P2
ubifs_link
                                       do_commit
                                         ubifs_orphan_start_commit
                                           orphan->cmt = 1
  ubifs_delete_orphan
    orphan_delete
      if (orph->cmt)
        orph->del = 1;          // orphan entry is not deleted from tree
        return
ubifs_unlink
  ubifs_jnl_update
  ubifs_add_orphan
    orphan_add                  // found old orphan entry, second time to add orphan entry
      ubifs_err(c, "orphaned twice")
      return -EINVAL            // unlink failed!
                                       ubifs_orphan_end_commit
                                         erase_deleted              // delete old orphan entry
                                           rb_erase(&orphan->rb, &c->orph_tree)

Fix it by removing the orphan entry from the orphan tree immediately, rather than deferring the removal to the commit process.

Fixes: 32fe905c17f0 ("ubifs: Fix O_TMPFILE corner case in ubifs_link()") Link: https://bugzilla.kernel.org/show_bug.cgi?id=218672 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
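A hedged sketch of the idea, using only the names that appear in the trace above; the body is simplified and is not the actual diff:

```c
static void orphan_delete_sketch(struct ubifs_info *c, struct ubifs_orphan *orph)
{
	if (orph->del)
		return;

	if (orph->cmt) {
		orph->del = 1;
		/* Unlink from the tree right away so that a later
		 * orphan_add() cannot stumble over the stale entry. */
		rb_erase(&orph->rb, &c->orph_tree);
		return;
	}

	/* normal (non-commit) deletion path elided */
}
```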
2024-07-12ubifs: Remove insert_dead_orphan from replaying orphan processZhihao Cheng
UBIFS does a commit at the end of the mounting process (rw mode), and dead orphans (added by insert_dead_orphan while replaying orphans) are deleted by ubifs_orphan_end_commit(). The only reason dead orphans are added into the orphan list is that old orphans may be lost when a powercut happens in ubifs_orphan_end_commit():

ubifs_orphan_end_commit   // TNC (updated by orphans) is not written yet
  if (c->cmt_orphans != 0)
    commit_orphans
      consolidate         // traverse orphan list
      write_orph_nodes    // rewrite all orphans by ubifs_leb_change
                          // If dead orphans are not in the list, they will be lost when
                          // a powercut happens, and then the TNC won't be updated by the
                          // old orphans in the next mounting.

Luckily, the condition 'c->cmt_orphans != 0' will never be true in the mounting process: there can't be new orphans added into the orphan list before mounting returns, but a commit will be done at the end of mounting. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12Revert "ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path"Zhihao Cheng
This reverts commit 6379b44cdcd67f5f5d986b73953e99700591edfa. Commit 1e022216dcd2 ("ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path") was applied again as commit 6379b44cdcd6 ("ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path"), which changed ubifs_mknod (it won't become a real problem). Just revert it. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12ubifs: Don't add xattr inode into orphan areaZhihao Cheng
Now the entire inode, with its xattrs, is removed while replaying orphan nodes. There is no need to add xattr inodes into the orphan area, based on the fact that xattr entries won't be cleared from disk before the xattr inodes are deleted; in other words, the current logic makes sure the xattr inode is deleted in any case, even if UBIFS does not record the xattr inode into the orphan area. Let's look for possible paths that could clear xattr entries from disk but leave the xattr inode in the TNC:

1. unlink/tmpfile -> ubifs_jnl_update: inode(nlink=0) is written into a bud LEB and added into the orphan list, then:
   a. powercut: ubifs_tnc_remove_ino (the xattr entry/inode can be found in the TNC and is deleted) is invoked in replaying journal.
   b. commit + powercut: inode is written into orphan area, and ubifs_tnc_remove_ino is invoked in replaying orphan nodes.
   c. evicting + powercut: xattr inode(nlink=0) is written on disk, xattr is removed from TNC, gc could clear xattr entries from disk. ubifs_tnc_remove_ino will apply on inode and xattr inode in replaying journal, so lost xattr entries will make no influence.
   d. evicting + commit + powercut: xattr inode/entry are removed from index tree (on disk) by ubifs_jnl_write_inode, xattr inode is cleared from orphan area by ubifs_jnl_write_inode + commit.
   e. commit + evicting + powercut: inode is written into orphan area, then equivalent to c.
2. remove xattr -> ubifs_jnl_delete_xattr: xattr entry(inum=0) and xattr inode(nlink=0) are written into a bud LEB, xattr entry/inode are removed from TNC, then:
   a. powercut: gc could clear xattr entries from disk, which won't affect deleting the xattr entry from TNC. ubifs_tnc_remove_ino will apply on the xattr inode in replaying journal, ubifs_tnc_remove_nm will apply on the xattr entry in replaying journal.
   b. commit + powercut: xattr entry/inode are removed from the index tree (on disk).

Tracking the xattr inode in the orphan list was introduced by commit 988bec41318f3f ("ubifs: orphan: Handle xattrs like files"); it aims to fix the similar problem described in commit 7959cf3a7506d4a ("ubifs: journal: Handle xattrs like files"). Actually, the problem only exists in the journal case, not the orphan case. So, we can remove the orphan tracking for xattr inodes. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2024-07-12ubifs: Fix unattached xattr inode if powercut happens after deletingZhihao Cheng
When a powercut happens after deleting a file, the xattr inode could exist alone in the TNC while its xattr entry cannot be found in the TNC. The file inode and the xattr inode are added into the orphan list after deleting the file; the file inode's nlink is 0, but the xattr inode's nlink is not 0 (PS: a zero-nlink xattr inode is written on disk in the evicting process by ubifs_jnl_write_inode). So, the following process could happen:

1. touch file
2. setxattr(file)
3. unlink file    // inode(nlink=0), xattr inode(nlink=1) are added into orphan list
4. commit         // write inode inum and xattr inum into orphan area
5. powercut
6. mount
   do_kill_orphans
     // inode(nlink=0) is deleted from TNC by ubifs_tnc_remove_range,
     // xattr entry is deleted too.
     // xattr inode(nlink=1) is not deleted from TNC

Finally we could see the following error while debugging UBIFS:

UBIFS error (ubi0:0 pid 1093): dbg_check_filesystem [ubifs]: inode 66 nlink is 1, but calculated nlink is 0
UBIFS (ubi0:0): dump of the inode 66 sitting in LEB 12:2128
  node_type      0 (inode node)
  group_type     1 (in node group)
  len            197
  key            (66, inode)
  size           37
  nlink          1
  flags          0x20
  xattr_cnt      0
  xattr_size     0
  xattr_names    0
  data len       37

Fix it by removing the entire inode with its xattrs while replaying orphans: just replace the function ubifs_tnc_remove_range with ubifs_tnc_remove_ino.

Fixes: ee1438ce5dc4 ("ubifs: Check link count of inodes when killing orphans.") Link: https://bugzilla.kernel.org/show_bug.cgi?id=218661 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
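A hedged sketch of that replacement; the surrounding do_kill_orphans() logic is elided and only the changed call is shown:

```c
/* Remove the whole inode, including its xattr entries and xattr inodes,
 * from the TNC, instead of removing only a key range of the inode itself
 * (previously done via ubifs_tnc_remove_range()). */
err = ubifs_tnc_remove_ino(c, inum);
if (err)
	return err;
```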
2024-07-12Merge tag 'for-6.10-rc7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fix a regression in extent map shrinker behaviour. In the past weeks we got reports from users that there are huge latency spikes or freezes. This was bisected to newly added shrinker of extent maps (it was added to fix a build up of the structures in memory). I'm assuming that the freezes would happen to many users after release so I'd like to get it merged now so it's in 6.10. Although the diff size is not small the changes are relatively straightforward, the reporters verified the fixes and we did testing on our side. The fixes: - adjust behaviour under memory pressure and check lock or scheduling conditions, bail out if needed - synchronize tracking of the scanning progress so inode ranges are not skipped or work duplicated - do a delayed iput when scanning a root so evicting an inode does not slow things down in case of lots of dirty data, also fix lockdep warning, a deadlock could happen when writing the dirty data would need to start a transaction" * tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: avoid races when tracking progress for extent map shrinking btrfs: stop extent map shrinker if reschedule is needed btrfs: use delayed iput during extent map shrinking
2024-07-12dt-bindings: incomplete-devices: document devices without bindingsKrzysztof Kozlowski
There are devices in the wild with non-updatable firmware coming with ACPI tables with rejected compatibles, e.g. "ltr,ltrf216a". Linux kernel still supports this device via ACPI PRP0001, however the compatible was never accepted to bindings. There are also several early PowerPC or SPARC platforms using compatibles for their OpenFirmware, but without in-tree DTS. Often the legacy compatible is not correct in terms of current Devicetree specification, e.g. missing vendor prefix. Finally there are also Linux-specific tools and test code with compatibles. Add a schema covering above cases purely to satisfy the DT schema and scripts/checkpatch.pl checks for undocumented compatibles. For ltr,ltrf216a this also documents the consensus: compatible is allowed only via ACPI PRP0001, but not bindings. Link: https://lore.kernel.org/all/20240705095047.90558-1-marex@denx.de/ Link: https://lore.kernel.org/lkml/20220731173446.7400bfa8@jic23-huawei/T/#me55be502302d70424a85368c2645c89f860b7b40 Cc: Marek Vasut <marex@denx.de> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240712121146.90942-1-krzysztof.kozlowski@linaro.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2024-07-12mtd: ubi: avoid expensive do_div() on 32-bit machinesArnd Bergmann
The use of do_div() in ubi_nvmem_reg_read() makes calling it on 32-bit machines rather expensive. Since the 'from' variable is known to be a 32-bit quantity, the 64-bit division is never needed and can be replaced with a regular division operation. Fixes: b8a77b9a5f9c ("mtd: ubi: fix NVMEM over UBI volumes on 32-bit systems") Fixes: 3ce485803da1 ("mtd: ubi: provide NVMEM layer over UBI volumes") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
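A hedged illustration of the pattern (variable and field names are placeholders, not the actual driver code):

```c
/* Before: 64-bit division even though 'from' always fits in 32 bits. */
u64 pos = from;
u32 offs = do_div(pos, leb_size);	/* do_div() leaves the quotient in
					 * pos and returns the remainder */

/* After: ordinary 32-bit division, which is cheap on 32-bit machines. */
u32 off32 = from;
u32 leb   = off32 / leb_size;
u32 offs2 = off32 % leb_size;
```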
2024-07-12mtd: ubi: make ubi_class constantRicardo B. Marliere
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows struct class to be placed in read-only memory, so move the ubi_class structure to be declared at build time, placing it into read-only memory, instead of having it dynamically allocated at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Richard Weinberger <richard@nod.at>
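The general pattern looks roughly like the sketch below; names and fields are simplified, and the real ubi_class keeps its existing name macro and attributes:

```c
static const struct class ubi_class = {
	.name = "ubi",
};

static int __init ubi_example_init(void)
{
	/* class_register() accepts a const pointer since 43a7206b0963. */
	return class_register(&ubi_class);
}

static void __exit ubi_example_exit(void)
{
	class_unregister(&ubi_class);
}
```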
2024-07-12ubi: eba: properly rollback inside self_check_ebaFedor Pchelkin
In case of a memory allocation failure in the volumes loop we can only process the already allocated scan_eba and fm_eba array elements on the error path - others are still uninitialized. Found by Linux Verification Center (linuxtesting.org). Fixes: 00abf3041590 ("UBI: Add self_check_eba()") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
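A hedged sketch of the corrected error path (loop bounds and sizes are placeholders, not the literal self_check_eba() code):

```c
for (i = 0; i < num_volumes; i++) {
	scan_eba[i] = kmalloc_array(pebs, sizeof(**scan_eba), GFP_KERNEL);
	if (!scan_eba[i])
		goto out_free;

	fm_eba[i] = kmalloc_array(pebs, sizeof(**fm_eba), GFP_KERNEL);
	if (!fm_eba[i]) {
		kfree(scan_eba[i]);
		goto out_free;
	}
}
return 0;

out_free:
	/* Only entries 0..i-1 were fully allocated; walk back over those. */
	while (--i >= 0) {
		kfree(scan_eba[i]);
		kfree(fm_eba[i]);
	}
	return -ENOMEM;
```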
2024-07-12Merge tag 'ceph-for-6.10-rc8' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A fix for a possible use-after-free following "rbd unmap" or "umount" marked for stable and two kernel-doc fixups" * tag 'ceph-for-6.10-rc8' of https://github.com/ceph/ceph-client: libceph: fix crush_choose_firstn() kernel-doc warnings libceph: suppress crush_choose_indep() kernel-doc warnings libceph: fix race between delayed_work() and ceph_monc_stop()
2024-07-12Merge tag 'pmdomain-v6.10-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm Pull pmdomain fix from Ulf Hansson: - qcom: Skip retention level for rpmhpd's * tag 'pmdomain-v6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm: pmdomain: qcom: rpmhpd: Skip retention level for Power Domains
2024-07-12Merge tag 'mmc-v6.10-rc4-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc Pull MMC host fixes from Ulf Hansson: - davinci_mmc: Prevent transmitted data size from exceeding sgm's length - sdhci: Fix max_seg_size for 64KiB PAGE_SIZE * tag 'mmc-v6.10-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: mmc: davinci_mmc: Prevent transmitted data size from exceeding sgm's length mmc: sdhci: Fix max_seg_size for 64KiB PAGE_SIZE
2024-07-12nfsd: nfsd_file_lease_notifier_call gets a file_lease as an argumentJeff Layton
"data" actually refers to a file_lease and not a file_lock. Both structs have their file_lock_core as the first field though, so this bug should be harmless without struct randomization in play. Reported-by: Florian Evers <florian-evers@gmx.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219008 Fixes: 05580bbfc6bc ("nfsd: adapt to breakup of struct file_lock") Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Florian Evers <florian-evers@gmx.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-12perf trace: Fix iteration of syscall ids in syscalltbl->entriesHoward Chu
This is a bug found when implementing pretty-printing for the landlock_add_rule system call. I decided to send this patch separately because this is a serious bug that should be fixed fast. I wrote a test program that does the landlock_add_rule syscall in a loop, yet perf trace -e landlock_add_rule freezes, giving no output.

This bug was introduced by a false understanding of the variable "key" below:

```
for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
	struct syscall *sc = trace__syscall_info(trace, NULL, key);
	...
}
```

The code above seems right at first glance, but when looking at syscalltbl.c, I found these lines:

```
for (i = 0; i <= syscalltbl_native_max_id; ++i)
	if (syscalltbl_native[i])
		++nr_entries;

entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
...
for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) {
	if (syscalltbl_native[i]) {
		entries[j].name = syscalltbl_native[i];
		entries[j].id = i;
		++j;
	}
}
```

meaning that "key" is merely an index used to traverse the syscall table, not the actual syscall id of that particular syscall. So if one uses key to do trace__syscall_info(trace, NULL, key), then, because key only goes up to trace->sctbl->syscalls.nr_entries (on my X86_64 machine this number is 373), it ends up neglecting all the remaining syscalls; in my case, everything after `rseq`, because the traversal stops at 373 and `rseq` is the last syscall whose id is lower than 373 in tools/perf/arch/x86/include/generated/asm/syscalls_64.c:

```
...
[334] = "rseq",
[424] = "pidfd_send_signal",
...
```

The reason why the key is scrambled but perf trace works well is that key is used in trace__syscall_info(trace, NULL, key) to do trace->syscalls.table[id]; this makes sure that the struct syscall returned actually has an id with the same value as key, making the later bpf_prog matching correct.

After fixing this bug, I can do perf trace on 38 more syscalls, and because more syscalls are visible, we get 8 more syscalls that can be augmented.
before: perf $ perf trace -vv --max-events=1 |& grep Reusing Reusing "open" BPF sys_enter augmenter for "stat" Reusing "open" BPF sys_enter augmenter for "lstat" Reusing "open" BPF sys_enter augmenter for "access" Reusing "connect" BPF sys_enter augmenter for "accept" Reusing "sendto" BPF sys_enter augmenter for "recvfrom" Reusing "connect" BPF sys_enter augmenter for "bind" Reusing "connect" BPF sys_enter augmenter for "getsockname" Reusing "connect" BPF sys_enter augmenter for "getpeername" Reusing "open" BPF sys_enter augmenter for "execve" Reusing "open" BPF sys_enter augmenter for "truncate" Reusing "open" BPF sys_enter augmenter for "chdir" Reusing "open" BPF sys_enter augmenter for "mkdir" Reusing "open" BPF sys_enter augmenter for "rmdir" Reusing "open" BPF sys_enter augmenter for "creat" Reusing "open" BPF sys_enter augmenter for "link" Reusing "open" BPF sys_enter augmenter for "unlink" Reusing "open" BPF sys_enter augmenter for "symlink" Reusing "open" BPF sys_enter augmenter for "readlink" Reusing "open" BPF sys_enter augmenter for "chmod" Reusing "open" BPF sys_enter augmenter for "chown" Reusing "open" BPF sys_enter augmenter for "lchown" Reusing "open" BPF sys_enter augmenter for "mknod" Reusing "open" BPF sys_enter augmenter for "statfs" Reusing "open" BPF sys_enter augmenter for "pivot_root" Reusing "open" BPF sys_enter augmenter for "chroot" Reusing "open" BPF sys_enter augmenter for "acct" Reusing "open" BPF sys_enter augmenter for "swapon" Reusing "open" BPF sys_enter augmenter for "swapoff" Reusing "open" BPF sys_enter augmenter for "delete_module" Reusing "open" BPF sys_enter augmenter for "setxattr" Reusing "open" BPF sys_enter augmenter for "lsetxattr" Reusing "openat" BPF sys_enter augmenter for "fsetxattr" Reusing "open" BPF sys_enter augmenter for "getxattr" Reusing "open" BPF sys_enter augmenter for "lgetxattr" Reusing "openat" BPF sys_enter augmenter for "fgetxattr" Reusing "open" BPF sys_enter augmenter for "listxattr" Reusing "open" BPF sys_enter augmenter for "llistxattr" Reusing "open" BPF sys_enter augmenter for "removexattr" Reusing "open" BPF sys_enter augmenter for "lremovexattr" Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" Reusing "open" BPF sys_enter augmenter for "mq_open" Reusing "open" BPF sys_enter augmenter for "mq_unlink" Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" Reusing "open" BPF sys_enter augmenter for "symlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" Reusing "connect" BPF sys_enter augmenter for "accept4" Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" Reusing "open" BPF sys_enter augmenter for "memfd_create" Reusing "fremovexattr" BPF sys_enter augmenter 
for "execveat" Reusing "fremovexattr" BPF sys_enter augmenter for "statx" after perf $ perf trace -vv --max-events=1 |& grep Reusing Reusing "open" BPF sys_enter augmenter for "stat" Reusing "open" BPF sys_enter augmenter for "lstat" Reusing "open" BPF sys_enter augmenter for "access" Reusing "connect" BPF sys_enter augmenter for "accept" Reusing "sendto" BPF sys_enter augmenter for "recvfrom" Reusing "connect" BPF sys_enter augmenter for "bind" Reusing "connect" BPF sys_enter augmenter for "getsockname" Reusing "connect" BPF sys_enter augmenter for "getpeername" Reusing "open" BPF sys_enter augmenter for "execve" Reusing "open" BPF sys_enter augmenter for "truncate" Reusing "open" BPF sys_enter augmenter for "chdir" Reusing "open" BPF sys_enter augmenter for "mkdir" Reusing "open" BPF sys_enter augmenter for "rmdir" Reusing "open" BPF sys_enter augmenter for "creat" Reusing "open" BPF sys_enter augmenter for "link" Reusing "open" BPF sys_enter augmenter for "unlink" Reusing "open" BPF sys_enter augmenter for "symlink" Reusing "open" BPF sys_enter augmenter for "readlink" Reusing "open" BPF sys_enter augmenter for "chmod" Reusing "open" BPF sys_enter augmenter for "chown" Reusing "open" BPF sys_enter augmenter for "lchown" Reusing "open" BPF sys_enter augmenter for "mknod" Reusing "open" BPF sys_enter augmenter for "statfs" Reusing "open" BPF sys_enter augmenter for "pivot_root" Reusing "open" BPF sys_enter augmenter for "chroot" Reusing "open" BPF sys_enter augmenter for "acct" Reusing "open" BPF sys_enter augmenter for "swapon" Reusing "open" BPF sys_enter augmenter for "swapoff" Reusing "open" BPF sys_enter augmenter for "delete_module" Reusing "open" BPF sys_enter augmenter for "setxattr" Reusing "open" BPF sys_enter augmenter for "lsetxattr" Reusing "openat" BPF sys_enter augmenter for "fsetxattr" Reusing "open" BPF sys_enter augmenter for "getxattr" Reusing "open" BPF sys_enter augmenter for "lgetxattr" Reusing "openat" BPF sys_enter augmenter for "fgetxattr" Reusing "open" BPF sys_enter augmenter for "listxattr" Reusing "open" BPF sys_enter augmenter for "llistxattr" Reusing "open" BPF sys_enter augmenter for "removexattr" Reusing "open" BPF sys_enter augmenter for "lremovexattr" Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr" Reusing "open" BPF sys_enter augmenter for "mq_open" Reusing "open" BPF sys_enter augmenter for "mq_unlink" Reusing "fsetxattr" BPF sys_enter augmenter for "add_key" Reusing "fremovexattr" BPF sys_enter augmenter for "request_key" Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch" Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat" Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat" Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat" Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat" Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat" Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "linkat" Reusing "open" BPF sys_enter augmenter for "symlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat" Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat" Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat" Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat" Reusing "connect" BPF sys_enter augmenter for "accept4" Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at" Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2" Reusing "open" BPF sys_enter 
augmenter for "memfd_create" Reusing "fremovexattr" BPF sys_enter augmenter for "execveat" Reusing "fremovexattr" BPF sys_enter augmenter for "statx" TL;DR: These are the new syscalls that can be augmented Reusing "openat" BPF sys_enter augmenter for "open_tree" Reusing "openat" BPF sys_enter augmenter for "openat2" Reusing "openat" BPF sys_enter augmenter for "mount_setattr" Reusing "openat" BPF sys_enter augmenter for "move_mount" Reusing "open" BPF sys_enter augmenter for "fsopen" Reusing "openat" BPF sys_enter augmenter for "fspick" Reusing "openat" BPF sys_enter augmenter for "faccessat2" Reusing "openat" BPF sys_enter augmenter for "fchmodat2" as for the perf trace output: before perf $ perf trace -e faccessat2 --max-events=1 [no output] after perf $ ./perf trace -e faccessat2 --max-events=1 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0 P.S. The reason why this bug was not found in the past five years is probably because it only happens to the newer syscalls whose id is greater, for instance, faccessat2 of id 439, which not a lot of people care about when using perf trace. [Arnaldo]: notes That and the fact that the BPF code was hidden before having to use -e, that got changed kinda recently when we switched to using BPF skels for augmenting syscalls in 'perf trace': ⬢[acme@toolbox perf-tools-next]$ git log --oneline tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c a9f4c6c999008c92 perf trace: Collect sys_nanosleep first argument 29d16de26df17e94 perf augmented_raw_syscalls.bpf: Move 'struct timespec64' to vmlinux.h 5069211e2f0b47e7 perf trace: Use the right bpf_probe_read(_str) variant for reading user data 33b725ce7b988756 perf trace: Avoid compile error wrt redefining bool 7d9642311b6d9d31 perf bpf augmented_raw_syscalls: Add an assert to make sure sizeof(augmented_arg->value) is a power of two. 262b54b6c9396823 perf bpf augmented_raw_syscalls: Add an assert to make sure sizeof(saddr) is a power of two. 1836480429d173c0 perf bpf_skel augmented_raw_syscalls: Cap the socklen parameter using &= sizeof(saddr) cd2cece61ac5f900 perf trace: Tidy comments related to BPF + syscall augmentation 5e6da6be3082f77b perf trace: Migrate BPF augmentation to use a skeleton ⬢[acme@toolbox perf-tools-next]$ ⬢[acme@toolbox perf-tools-next]$ git show --oneline --pretty=reference 5e6da6be3082f77b | head -1 5e6da6be3082f77b (perf trace: Migrate BPF augmentation to use a skeleton, 2023-08-10) ⬢[acme@toolbox perf-tools-next]$ I.e. from August, 2023. One had as well to ask for BUILD_BPF_SKEL=1, which now is default if all it needs is available on the system. I simplified the code to not expose the 'struct syscall' outside of tools/perf/util/syscalltbl.c, instead providing a function to go from the index to the syscall id: int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx); Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/lkml/ZmhlAxbVcAKoPTg8@x1 Link: https://lore.kernel.org/r/20240705132059.853205-2-howardchu95@gmail.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-07-12cgroup: Add Michal Koutný as a maintainerTejun Heo
Michal has been contributing and reviewing patches across cgroup for a while now. Add him as a maintainer. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Koutný <mkoutny@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>