summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2024-11-07alloc_tag: load module tags into separate contiguous memorySuren Baghdasaryan
When a module gets unloaded there is a possibility that some of the allocations it made are still used and therefore the allocation tags corresponding to these allocations are still referenced. As such, the memory for these tags can't be freed. This is currently handled as an abnormal situation and module's data section is not being unloaded. To handle this situation without keeping module's data in memory, allow codetags with longer lifespan than the module to be loaded into their own separate memory. The in-use memory areas and gaps after module unloading in this separate memory are tracked using maple trees. Allocation tags arrange their separate memory so that it is virtually contiguous and that will allow simple allocation tag indexing later on in this patchset. The size of this virtually contiguous memory is set to store up to 100000 allocation tags. [surenb@google.com: fix empty codetag module section handling] Link: https://lkml.kernel.org/r/20241101000017.3856204-1-surenb@google.com [akpm@linux-foundation.org: update comment, per Dan] Link: https://lkml.kernel.org/r/20241023170759.999909-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xiongwei Song <xiongwei.song@windriver.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07maple_tree: add mas_for_each_rev() helperSuren Baghdasaryan
Patch series "page allocation tag compression", v4. This patchset implements several improvements: 1. Gracefully handles module unloading while there are used allocations allocated from that module; 2. Provides an option to store page allocation tag references in the page flags, removing dependency on page extensions and eliminating the memory overhead from storing page allocation references (~0.2% of total system memory). This also improves page allocation performance when CONFIG_MEM_ALLOC_PROFILING is enabled by eliminating page extension lookup. Page allocation performance overhead is reduced from 41% to 5.5%. Patch #1 introduces mas_for_each_rev() helper function. Patch #2 introduces shutdown_mem_profiling() helper function to be used when disabling memory allocation profiling. Patch #3 copies module tags into virtually contiguous memory which serves two purposes: - Lets us deal with the situation when module is unloaded while there are still live allocations from that module. Since we are using a copy version of the tags we can safely unload the module. Space and gaps in this contiguous memory are managed using a maple tree. - Enables simple indexing of the tags in the later patches. Patch #4 changes the way we allocate virtually contiguous memory for module tags to reserve only vitrual area and populate physical pages only as needed at module load time. Patch #5 abstracts page allocation tag reference to simplify later changes. Patch #6 adds compression option to the sysctl.vm.mem_profiling boot parameter for storing page allocation tag references inside page flags if they fit. If the number of available page flag bits is insufficient to address all kernel allocations, memory allocation profiling gets disabled with an appropriate warning. This patch (of 6): Add mas_for_each_rev() function to iterate maple tree nodes in reverse order. Link: https://lkml.kernel.org/r/20241023170759.999909-1-surenb@google.com Link: https://lkml.kernel.org/r/20241023170759.999909-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xiongwei Song <xiongwei.song@windriver.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07execmem: add support for cache of large ROX pagesMike Rapoport (Microsoft)
Using large pages to map text areas reduces iTLB pressure and improves performance. Extend execmem_alloc() with an ability to use huge pages with ROX permissions as a cache for smaller allocations. To populate the cache, a writable large page is allocated from vmalloc with VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as ROX. The direct map alias of that large page is exculded from the direct map. Portions of that large page are handed out to execmem_alloc() callers without any changes to the permissions. When the memory is freed with execmem_free() it is invalidated again so that it won't contain stale instructions. An architecture has to implement execmem_fill_trapping_insns() callback and select ARCH_HAS_EXECMEM_ROX configuration option to be able to use the ROX cache. The cache is enabled on per-range basis when an architecture sets EXECMEM_ROX_CACHE flag in definition of an execmem_range. Link: https://lkml.kernel.org/r/20241023162711.2579610-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07arch: introduce set_direct_map_valid_noflush()Mike Rapoport (Microsoft)
Add an API that will allow updates of the direct/linear map for a set of physically contiguous pages. It will be used in the following patches. Link: https://lkml.kernel.org/r/20241023162711.2579610-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07module: prepare to handle ROX allocations for textMike Rapoport (Microsoft)
In order to support ROX allocations for module text, it is necessary to handle modifications to the code, such as relocations and alternatives patching, without write access to that memory. One option is to use text patching, but this would make module loading extremely slow and will expose executable code that is not finally formed. A better way is to have memory allocated with ROX permissions contain invalid instructions and keep a writable, but not executable copy of the module text. The relocations and alternative patches would be done on the writable copy using the addresses of the ROX memory. Once the module is completely ready, the updated text will be copied to ROX memory using text patching in one go and the writable copy will be freed. Add support for that to module initialization code and provide necessary interfaces in execmem. Link: https://lkml.kernel.org/r/20241023162711.2579610-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewd-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07asm-generic: introduce text-patching.hMike Rapoport (Microsoft)
Several architectures support text patching, but they name the header files that declare patching functions differently. Make all such headers consistently named text-patching.h and add an empty header in asm-generic for architectures that do not support text patching. Link: https://lkml.kernel.org/r/20241023162711.2579610-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm: vmalloc: group declarations depending on CONFIG_MMU togetherMike Rapoport (Microsoft)
Patch series "x86/module: use large ROX pages for text allocations", v7. These patches add support for using large ROX pages for allocations of executable memory on x86. They address Andy's comments [1] about having executable mappings for code that was not completely formed. The approach taken is to allocate ROX memory along with writable but not executable memory and use the writable copy to perform relocations and alternatives patching. After the module text gets into its final shape, the contents of the writable memory is copied into the actual ROX location using text poking. The allocations of the ROX memory use vmalloc(VMAP_ALLOW_HUGE_MAP) to allocate PMD aligned memory, fill that memory with invalid instructions and in the end remap it as ROX. Portions of these large pages are handed out to execmem_alloc() callers without any changes to the permissions. When the memory is freed with execmem_free() it is invalidated again so that it won't contain stale instructions. The module memory allocation, x86 code dealing with relocations and alternatives patching take into account the existence of the two copies, the writable memory and the ROX memory at the actual allocated virtual address. [1] https://lore.kernel.org/all/a17c65c6-863f-4026-9c6f-a04b659e9ab4@app.fastmail.com This patch (of 8): There are a couple of declarations that depend on CONFIG_MMU in include/linux/vmalloc.h spread all over the file. Group them all together to improve code readability. No functional changes. Link: https://lkml.kernel.org/r/20241023162711.2579610-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20241023162711.2579610-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07signal: restore the override_rlimit logicRoman Gushchin
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set. This behavior change caused production issues. For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications. Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior. Link: https://lkml.kernel.org/r/20241104195419.3962584-1-roman.gushchin@linux.dev Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Co-developed-by: Andrei Vagin <avagin@google.com> Signed-off-by: Andrei Vagin <avagin@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alexey Gladkov <legion@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07mm/page_alloc: keep track of free highatomicYu Zhao
OOM kills due to vastly overestimated free highatomic reserves were observed: ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ... Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ... Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB The second line above shows that the OOM kill was due to the following condition: free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB) And the third line shows there were no free pages in any MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type 'H'. Therefore __zone_watermark_unusable_free() underestimated the usable free memory by over 1GB, which resulted in the unnecessary OOM kill above. The comments in __zone_watermark_unusable_free() warns about the potential risk, i.e., If the caller does not have rights to reserves below the min watermark then subtract the high-atomic reserves. This will over-estimate the size of the atomic reserve but it avoids a search. However, it is possible to keep track of free pages in reserved highatomic pageblocks with a new per-zone counter nr_free_highatomic protected by the zone lock, to avoid a search when calculating the usable free memory. And the cost would be minimal, i.e., simple arithmetics in the highatomic alloc/free/move paths. Note that since nr_free_highatomic can be relatively small, using a per-cpu counter might cause too much drift and defeat its purpose, in addition to the extra memory overhead. Dependson e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting") - see [1] [akpm@linux-foundation.org: s/if/else if/, per Johannes, stealth whitespace tweak] Link: https://lkml.kernel.org/r/20241028182653.3420139-1-yuzhao@google.com Link: https://lkml.kernel.org/r/0d0ddb33-fcdc-43e2-801f-0c1df2031afb@suse.cz [1] Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Link Lin <linkl@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc7). Conflicts: drivers/net/ethernet/freescale/enetc/enetc_pf.c e15c5506dd39 ("net: enetc: allocate vf_state during PF probes") 3774409fd4c6 ("net: enetc: build enetc_pf_common.c as a separate module") https://lore.kernel.org/20241105114100.118bd35e@canb.auug.org.au Adjacent changes: drivers/net/ethernet/ti/am65-cpsw-nuss.c de794169cf17 ("net: ethernet: ti: am65-cpsw: Fix multi queue Rx on J7") 4a7b2ba94a59 ("net: ethernet: ti: am65-cpsw: Use tstats instead of open coded version") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07nvme: add reservation command's definesGuixin Liu
This is a preparation patch for NVMeOF target reservation commands implantation. Add the defines of reservation command, such as reservation log and sub operations. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-07Revert "block: pre-calculate max_zone_append_sectors"Jens Axboe
This causes issue on, at least, nvme-mpath where my boot fails with: WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380 Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181 Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024 Workqueue: async async_run_entry_fn RIP: 0010:blk_validate_limits+0x356/0x380 Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46 RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088 RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38 RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000 R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38 FS: 0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0 Call Trace: <TASK> ? __warn+0xca/0x1a0 ? blk_validate_limits+0x356/0x380 ? report_bug+0x11a/0x1a0 ? handle_bug+0x5e/0x90 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? blk_validate_limits+0x356/0x380 blk_alloc_queue+0x7a/0x250 __blk_alloc_disk+0x39/0x80 nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core] nvme_scan_ns+0xcc7/0x1010 [nvme_core] async_run_entry_fn+0x27/0x120 process_scheduled_works+0x1a0/0x360 worker_thread+0x2bc/0x350 ? pr_cont_work+0x1b0/0x1b0 kthread+0x111/0x120 ? kthread_unuse_mm+0x90/0x90 ret_from_fork+0x30/0x40 ? kthread_unuse_mm+0x90/0x90 ret_from_fork_asm+0x11/0x20 </TASK> ---[ end trace 0000000000000000 ]--- presumably due to max_zone_append_sectors not being cleared to zero, resulting in blk_validate_zoned_limits() complaining and failing. This reverts commit 2a8f6153e1c2db06a537a5c9d61102eb591776f1. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07wwan: core: Add WWAN ADB and MIPC port typeJinjian Song
Add new WWAN ports that connect to the device's ADB protocol interface and MTK MIPC diagnostic interface. Signed-off-by: Jinjian Song <jinjian.song@fibocom.com> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-07arm64: smccc: Remove broken support for SMCCCv1.3 SVE discard hintMark Rutland
SMCCCv1.3 added a hint bit which callers can set in an SMCCC function ID (AKA "FID") to indicate that it is acceptable for the SMCCC implementation to discard SVE and/or SME state over a specific SMCCC call. The kernel support for using this hint is broken and SMCCC calls may clobber the SVE and/or SME state of arbitrary tasks, though FPSIMD state is unaffected. The kernel support is intended to use the hint when there is no SVE or SME state to save, and to do this it checks whether TIF_FOREIGN_FPSTATE is set or TIF_SVE is clear in assembly code: | ldr <flags>, [<current_task>, #TSK_TI_FLAGS] | tbnz <flags>, #TIF_FOREIGN_FPSTATE, 1f // Any live FP state? | tbnz <flags>, #TIF_SVE, 2f // Does that state include SVE? | | 1: orr <fid>, <fid>, ARM_SMCCC_1_3_SVE_HINT | 2: | << SMCCC call using FID >> This is not safe as-is: (1) SMCCC calls can be made in a preemptible context and preemption can result in TIF_FOREIGN_FPSTATE being set or cleared at arbitrary points in time. Thus checking for TIF_FOREIGN_FPSTATE provides no guarantee. (2) TIF_FOREIGN_FPSTATE only indicates that the live FP/SVE/SME state in the CPU does not belong to the current task, and does not indicate that clobbering this state is acceptable. When the live CPU state is clobbered it is necessary to update fpsimd_last_state.st to ensure that a subsequent context switch will reload FP/SVE/SME state from memory rather than consuming the clobbered state. This and the SMCCC call itself must happen in a critical section with preemption disabled to avoid races. (3) Live SVE/SME state can exist with TIF_SVE clear (e.g. with only TIF_SME set), and checking TIF_SVE alone is insufficient. Remove the broken support for the SMCCCv1.3 SVE saving hint. This is effectively a revert of commits: * cfa7ff959a78 ("arm64: smccc: Support SMCCC v1.3 SVE register saving hint") * a7c3acca5380 ("arm64: smccc: Save lr before calling __arm_smccc_sve_check()") ... leaving behind the ARM_SMCCC_VERSION_1_3 and ARM_SMCCC_1_3_SVE_HINT definitions, since these are simply definitions from the SMCCC specification, and the latter is used in KVM via ARM_SMCCC_CALL_HINTS. If we want to bring this back in future, we'll probably want to handle this logic in C where we can use all the usual FPSIMD/SVE/SME helper functions, and that'll likely require some rework of the SMCCC code and/or its callers. Fixes: cfa7ff959a78 ("arm64: smccc: Support SMCCC v1.3 SVE register saving hint") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241106160448.2712997-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-11-06memcg: workingset: remove folio_memcg_rcu usageShakeel Butt
The function workingset_activation() is called from folio_mark_accessed() with the guarantee that the given folio can not be freed under us in workingset_activation(). In addition, the association of the folio and its memcg can not be broken here because charge migration is no more. There is no need to use folio_memcg_rcu. Simply use folio_memcg_charged() because that is what this function cares about. [akpm@linux-foundation.org: provide folio_memcg_charged stub for CONFIG_MEMCG=n] Link: https://lkml.kernel.org/r/20241026163707.2479526-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: remove memcg move locking codeShakeel Butt
The memcg v1's charge move feature has been deprecated. All the places using the memcg move lock, have stopped using it as they don't need the protection any more. Let's proceed to remove all the locking code related to charge moving. Link: https://lkml.kernel.org/r/20241025012304.2473312-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: remove charge move codeShakeel Butt
The memcg-v1 charge move feature has been deprecated completely and let's remove the relevant code as well. Link: https://lkml.kernel.org/r/20241025012304.2473312-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: add missing mmu_notifier_clear_young for !MMU_NOTIFIERJames Houghton
Remove the now unnecessary ifdef in mm/damon/vaddr.c as well. Link: https://lkml.kernel.org/r/20241021160212.9935-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/codetag: fix arg in pgalloc_tag_copy alloc_tag_subSourav Panda
alloc_tag_sub() takes bytes as opposed to number of pages as argument. Currently pgalloc_tag_copy() passes the number of pages. This fix passes the correct unit, which is the number of bytes allocated. Link: https://lkml.kernel.org/r/20241022232440.334820-1-souravpanda@google.com Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: Sourav Panda <souravpanda@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: fix outdated flag name in commentJann Horn
MAPLE_USE_RCU was renamed to MT_FLAGS_USE_RCU at some point, fix up the comment. Link: https://lkml.kernel.org/r/20241007-maple-tree-doc-fix-v1-1-6bbf89c1153d@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove unused has_isolate_pageblockLuoxi Li
has_isolate_pageblock() has been unused since commit 55612e80e722 ("mm: page_alloc: close migratetype race between freeing and stealing") Remove it. Link: https://lkml.kernel.org/r/20241018092235.2764859-1-kaixa@kiloview.com Signed-off-by: Luoxi Li <kaixa@kiloview.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: huge_memory: move file_thp_enabled() into huge_memory.cKefeng Wang
file_thp_enabled() is only used in __thp_vma_allowable_orders(), so move it into huge_memory.c, also check READ_ONLY_THP_FOR_FS ahead to avoid unnecessary code if config disabled. Link: https://lkml.kernel.org/r/20241017141457.1169092-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/mglru: reset page lru tier bits when activatingWei Xu
When a folio is activated, lru_gen_add_folio() moves the folio to the youngest generation. But unlike folio_update_gen()/folio_inc_gen(), lru_gen_add_folio() doesn't reset the folio lru tier bits (LRU_REFS_MASK | LRU_REFS_FLAGS). This inconsistency can affect how pages are aged via folio_mark_accessed() (e.g. fd accesses), though no user visible impact related to this has been detected yet. Note that lru_gen_add_folio() cannot clear PG_workingset if the activation is due to workingset refault, otherwise PSI accounting will be skipped. So fix lru_gen_add_folio() to clear the lru tier bits other than PG_workingset when activating a folio, and also clear all the lru tier bits when a folio is activated via folio_activate() in lru_gen_look_around(). Link: https://lkml.kernel.org/r/20241017181528.3358821-1-weixugc@google.com Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap") Signed-off-by: Wei Xu <weixugc@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens <heftig@archlinux.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: avoid zeroing user movable page twice with init_on_alloc=1Zi Yan
Commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") forces allocated page to be zeroed in post_alloc_hook() when init_on_alloc=1. For order-0 folios, if arch does not define vma_alloc_zeroed_movable_folio(), the default implementation again zeros the page return from the buddy allocator. So the page is zeroed twice. Fix it by passing __GFP_ZERO instead to avoid double page zeroing. At the moment, s390,arm64,x86,alpha,m68k are not impacted since they define their own vma_alloc_zeroed_movable_folio(). For >0 order folios (mTHP and PMD THP), folio_zero_user() is called to zero the folio again. Fix it by calling folio_zero_user() only if init_on_alloc is set. All arch are impacted. Add alloc_zeroed() helper to encapsulate the init_on_alloc check. [ziy@nvidia.com: comment fixes, per David] Link: https://lkml.kernel.org/r/97DB52E1-C594-49B5-9736-89AC302FAB01@nvidia.com Link: https://lkml.kernel.org/r/20241011150304.709590-1-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove unused hugepage for vma_alloc_folio()Kefeng Wang
The hugepage parameter was deprecated since commit ddc1a5cbc05d ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"), for PMD-sized THP, it still tries only preferred node if possible in vma_alloc_folio() by checking the order of the folio allocation. Link: https://lkml.kernel.org/r/20241010061556.1846751-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_endJohn Hubbard
For clarity. It's increasingly hard to reason about the code, when KASLR is moving around the boundaries. In this case where KASLR is randomizing the location of the kernel image within physical memory, the maximum number of address bits for physical memory has not changed. What has changed is the ending address of memory that is allowed to be directly mapped by the kernel. Let's name the variable, and the associated macro accordingly. Also, enhance the comment above the direct_map_physmem_end definition, to further clarify how this all works. Link: https://lkml.kernel.org/r/20241009025024.89813-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jordan Niethe <jniethe@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: drop hugetlb_get_unmapped_area{_*} functionsOscar Salvador
Hugetlb mappings are now handled through normal channels just like any other mapping, so we no longer need hugetlb_get_unmapped_area* specific functions. Link: https://lkml.kernel.org/r/20241007075037.267650-8-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: make hugetlb mappings go through mm_get_unmapped_area_vmflagsOscar Salvador
Hugetlb mappings will no longer be special cased but rather go through the generic mm_get_unmapped_area_vmflags function. For that to happen, let us remove the .get_unmapped_area from hugetlbfs_file_operations struct, and hint __get_unmapped_area that it should not send hugetlb mappings through thp_get_unmapped_area_vmflags but through mm_get_unmapped_area_vmflags. Create also a function called hugetlb_mmap_check_and_align() where a couple of safety checks are being done and the addr is aligned to the huge page size. Otherwise we will have to do this in every single function, which duplicates quite a lot of code. Link: https://lkml.kernel.org/r/20241007075037.267650-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/mmap: teach generic_get_unmapped_area{_topdown} to handle hugetlb mappingsOscar Salvador
Patch series "Unify hugetlb into arch_get_unmapped_area functions", v4. This is an attempt to get rid of a fair amount of duplicated code wrt. hugetlb and *get_unmapped_area* functions. HugeTLB registers a .get_unmapped_area function which gets called from __get_unmapped_area(). hugetlb_get_unmapped_area() is defined by a bunch of architectures and it also has a generic definition for those that do not define it. Short-long story is that there is a ton of duplicated code between specific hugetlb *_get_unmapped_area_* functions and mm-core functions, so we can do better by teaching arch_get_unmapped_area* functions how to deal with hugetlb mappings. Note that not a lot of things need to be taught though. hugetlb_get_unmapped_area, that gets called for hugetlb mappings, runs some sanity checks prior to calling mm_get_unmapped_area_vmflags(), so we do not need to that down the road in the respective {generic,arch}_get_unmapped_area* functions. More information can be found in the respective patches. LTP mmapstress hugetlb selftests were ran succesfully on: This patch (of 9): We want to stop special casing hugetlb mappings and make them go through generic channels, so teach generic_get_unmapped_area{_topdown} to handle those. The main difference is that we set info.align_mask for huge mappings. Link: https://lkml.kernel.org/r/20241007075037.267650-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20241007075037.267650-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: move set_pxd_safe() helpers from generic to platformAnshuman Khandual
set_pxd_safe() helpers that serve a specific purpose for both x86 and riscv platforms, do not need to be in the common memory code. Otherwise they just unnecessarily make the common API more complicated. This moves the helpers from common code to platform instead. Link: https://lkml.kernel.org/r/20241003044842.246016-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove PageKsm()Matthew Wilcox (Oracle)
All callers have been converted to use folio_test_ksm() or PageAnonNotKsm(), so we can remove this wrapper. Link: https://lkml.kernel.org/r/20241002152533.1350629-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: add PageAnonNotKsm()Matthew Wilcox (Oracle)
Check that this anonymous page is really anonymous, not anonymous-or-KSM. This optimises the debug check, but its real purpose is to remove the last two users of PageKsm(). [willy@infradead.org: fix assertions] Link: https://lkml.kernel.org/r/ZwApWPER7caIA_N3@casper.infradead.org Link: https://lkml.kernel.org/r/20241002152533.1350629-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07hrtimers: Delete hrtimer_init_on_stack()Nam Cao
hrtimer_init_on_stack() is now unused. Delete it. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/510ce0d2944c4a382ea51e51d03dcfb73ba0f4f7.1730386209.git.namcao@linutronix.de
2024-11-07hrtimers: Delete hrtimer_init_sleeper_on_stack()Nam Cao
hrtimer_init_sleeper_on_stack() is now unused. Delete it. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/52549846635c0b3a2abf82101f539efdabcd9778.1730386209.git.namcao@linutronix.de
2024-11-07wait: Switch to use hrtimer_setup_sleeper_on_stack()Nam Cao
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack() to keep the naming convention consistent. Convert the usage site over to it. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/fc91182375df81120a88dbe0263267e24d1bf19e.1730386209.git.namcao@linutronix.de
2024-11-07hrtimers: Introduce hrtimer_update_function()Nam Cao
Some users of hrtimer need to change the callback function after the initial setup. They write to hrtimer::function directly. That's not safe under all circumstances as the write is lockless and a concurrent timer expiry might end up using the wrong function pointer. Introduce hrtimer_update_function(), which also performs runtime checks whether it is safe to modify the callback. This allows to make hrtimer::function private once all users are converted. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20a937b0ae09ad54b5b6d86eabead7c570f1b72e.1730386209.git.namcao@linutronix.de
2024-11-07hrtimers: Introduce hrtimer_setup_sleeper_on_stack()Nam Cao
The hrtimer_init*() API is replaced by hrtimer_setup*() variants to initialize the timer including the callback function at once. hrtimer_init_sleeper_on_stack() does not need user to setup the callback function separately, so a new variant would not be strictly necessary. Nonetheless, to keep the naming convention consistent, introduce hrtimer_setup_sleeper_on_stack(). hrtimer_init_on_stack() will be removed once all users are converted. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/7b5e18e6dd0ace9eaa211201528cb9dc23752454.1730386209.git.namcao@linutronix.de
2024-11-07hrtimers: Introduce hrtimer_setup_on_stack()Nam Cao
To initialize hrtimer on stack, hrtimer_init_on_stack() needs to be called and also hrtimer::function must be set. This is error-prone and awkward to use. Introduce hrtimer_setup_on_stack() which does both of these things, so that users of hrtimer can be simplified. The new setup function also has a sanity check for the provided function pointer. If NULL, a warning is emitted and a dummy callback installed. hrtimer_init_on_stack() will be removed as soon as all of its users have been converted to the new function. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/4b05e2ab3a82c517adf67fabc0f0cd8fe118b97c.1730386209.git.namcao@linutronix.de
2024-11-07hrtimers: Introduce hrtimer_setup() to replace hrtimer_init()Nam Cao
To initialize hrtimer, hrtimer_init() needs to be called and also hrtimer::function must be set. This is error-prone and awkward to use. Introduce hrtimer_setup() which does both of these things, so that users of hrtimer can be simplified. The new setup function also has a sanity check for the provided function pointer. If NULL, a warning is emitted and a dummy callback installed. hrtimer_init() will be removed as soon as all of its users have been converted to the new function. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/5057c1ddbfd4b92033cd93d37fe38e6b069d5ba6.1730386209.git.namcao@linutronix.de
2024-11-07hrtimers: Add missing hrtimer_init() trace pointsNam Cao
hrtimer_init*_on_stack() is not covered by tracing when CONFIG_DEBUG_OBJECTS_TIMERS=y. Rework the functions similar to hrtimer_init() and hrtimer_init_sleeper() so that the hrtimer_init() tracepoint is unconditionally available. The rework makes hrtimer_init_sleeper() unused. Delete it. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/74528e8abf2bb96e8bee85ffacbf14e15cf89f0d.1730386209.git.namcao@linutronix.de
2024-11-07softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.Sebastian Andrzej Siewior
The timer and hrtimer soft interrupts are raised in hard interrupt context. With threaded interrupts force enabled or on PREEMPT_RT this leads to waking the ksoftirqd for the processing of the soft interrupt. ksoftirqd runs as SCHED_OTHER task which means it will compete with other tasks for CPU resources. This can introduce long delays for timer processing on heavy loaded systems and is not desired. Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated timers thread and let it run at the lowest SCHED_FIFO priority. Wake-ups for RT tasks happen from hardirq context so only timer_list timers and hrtimers for "regular" tasks are processed here. The higher priority ensures that wakeups are performed before scheduling SCHED_OTHER tasks. Using a dedicated variable to store the pending softirq bits values ensure that the timer are not accidentally picked up by ksoftirqd and other threaded interrupts. It shouldn't be picked up by ksoftirqd since it runs at lower priority. However if ksoftirqd is already running while a timer fires, then ksoftird will be PI-boosted due to the BH-lock to ktimer's priority. The timer thread can pick up pending softirqs from ksoftirqd but only if the softirq load is high. It is not be desired that the picked up softirqs are processed at SCHED_FIFO priority under high softirq load but this can already happen by a PI-boost by a force-threaded interrupt. [ frederic@kernel.org: rcutorture.c fixes, storm fix by introduction of local_timers_pending() for tick_nohz_next_event() ] [ junxiao.chang@intel.com: Ensure ktimersd gets woken up even if a softirq is currently served. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [rcutorture] Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241106150419.2593080-4-bigeasy@linutronix.de
2024-11-06net: add debug check in skb_reset_mac_header()Eric Dumazet
Make sure (skb->data - skb->head) can fit in skb->mac_header This needs CONFIG_DEBUG_NET=y. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://patch.msgid.link/20241105174403.850330-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-06net: add debug check in skb_reset_network_header()Eric Dumazet
Make sure (skb->data - skb->head) can fit in skb->network_header This needs CONFIG_DEBUG_NET=y. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://patch.msgid.link/20241105174403.850330-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-06net: add debug check in skb_reset_transport_header()Eric Dumazet
Make sure (skb->data - skb->head) can fit in skb->transport_header This needs CONFIG_DEBUG_NET=y. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://patch.msgid.link/20241105174403.850330-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-06net: add debug check in skb_reset_inner_mac_header()Eric Dumazet
Make sure (skb->data - skb->head) can fit in skb->inner_mac_header This needs CONFIG_DEBUG_NET=y. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://patch.msgid.link/20241105174403.850330-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-06net: add debug check in skb_reset_inner_network_header()Eric Dumazet
Make sure (skb->data - skb->head) can fit in skb->inner_network_header This needs CONFIG_DEBUG_NET=y. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://patch.msgid.link/20241105174403.850330-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-06net: add debug check in skb_reset_inner_transport_header()Eric Dumazet
Make sure (skb->data - skb->head) can fit in skb->inner_transport_header This needs CONFIG_DEBUG_NET=y. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://patch.msgid.link/20241105174403.850330-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-06net: skb_reset_mac_len() must check if mac_header was setEric Dumazet
Recent discussions show that skb_reset_mac_len() should be more careful. We expect the MAC header being set. If not, clear skb->mac_len and fire a warning for CONFIG_DEBUG_NET=y builds. If after investigations we find that not having a MAC header was okay, we can remove the warning. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/CANn89iJZGH+yEfJxfPWa3Hm7jxb-aeY2Up4HufmLMnVuQXt38A@mail.gmail.com/T/ Cc: En-Wei Wu <en-wei.wu@canonical.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://patch.msgid.link/20241105174403.850330-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07alarmtimers: Remove return value from alarm functionsThomas Gleixner
Now that the SIG_IGN problem is solved in the core code, the alarmtimer callbacks do not require a return value anymore. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241105064214.318837272@linutronix.de
2024-11-07posix-timers: Cleanup SIG_IGN workaround leftoversThomas Gleixner
Now that ignored posix timer signals are requeued and the timers are rearmed on signal delivery the workaround to keep such timers alive and self rearm them is not longer required. Remove the relevant hacks and the not longer required return values from the related functions. The alarm timer workarounds will be cleaned up in a separate step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20241105064214.187239060@linutronix.de