From f21bb37afbba0878c8d417cd861e43d014119845 Mon Sep 17 00:00:00 2001
From: Qi Zheng
Date: Tue, 25 Feb 2025 11:45:51 +0800
Subject: mm: pgtable: make generic tlb_remove_table() use struct ptdesc

Patch series "remove tlb_remove_page_ptdesc()", v2.

As suggested by Peter Zijlstra below [1], this series aims to remove
tlb_remove_page_ptdesc().

: Fundamentally tlb_remove_page() is about removing *pages* as from a PTE,
: there should not be a page-table anywhere near here *ever*.
:
: Yes, some architectures use tlb_remove_page() for page-tables too, but
: that is more or less an implementation detail that can be fixed.

After this series, all architectures use tlb_remove_table() or
tlb_remove_ptdesc() to remove page table pages. In the future, once all
architectures using tlb_remove_table() have also been converted to struct
ptdesc (e.g. powerpc), it may be possible to use only tlb_remove_ptdesc().

[1] https://lore.kernel.org/linux-mm/20250103111457.GC22934@noisy.programming.kicks-ass.net/

This patch (of 6):

Currently, only arm calls tlb_remove_ptdesc()/tlb_remove_table() when
CONFIG_MMU_GATHER_TABLE_FREE is disabled, and in that case the table
parameter actually points to a struct ptdesc, not a struct page. Since
struct ptdesc still overlaps with struct page and has not yet been
separated from it, casting the table parameter to struct page * does not
cause any problems at this time, but it is definitely incorrect and needs
to be fixed. So, just like the generic __tlb_remove_table(), let the
generic tlb_remove_table() use struct ptdesc by default when
CONFIG_MMU_GATHER_TABLE_FREE is disabled.

Link: https://lkml.kernel.org/r/cover.1740454179.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/5be8c3ab7bd68510bf0db4cf84010f4dfe372917.1740454179.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng
Reviewed-by: Kevin Brodsky
Cc: Alexandre Ghiti
Cc: "Aneesh Kumar K.V"
Cc: Arnd Bergmann
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Jann Horn
Cc: Matthew Wilcox (Oracle)
Cc: "Mike Rapoport (IBM)"
Cc: Muchun Song
Cc: Nicholas Piggin
Cc: Peter Zijlstra (Intel)
Cc: Rik van Riel
Cc: Vishal Moola (Oracle)
Cc: Will Deacon
Cc: Yu Zhao
Signed-off-by: Andrew Morton
---
 include/asm-generic/tlb.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index d1adfba8387e..e27d24dd8d5e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -227,10 +227,10 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
  */
 static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
-	struct page *page = (struct page *)table;
+	struct ptdesc *ptdesc = (struct ptdesc *)table;
 
-	pagetable_dtor(page_ptdesc(page));
-	tlb_remove_page(tlb, page);
+	pagetable_dtor(ptdesc);
+	tlb_remove_page(tlb, ptdesc_page(ptdesc));
 }
 #endif /* CONFIG_MMU_GATHER_TABLE_FREE */
-- cgit

From 1a03c275a3ad4e47d479a63037b72f2b305f0c13 Mon Sep 17 00:00:00 2001
From: Qi Zheng
Date: Tue, 25 Feb 2025 11:45:52 +0800
Subject: mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*

All callers of tlb_remove_ptdesc() pass it a pointer to struct ptdesc,
so change the pt parameter from void * to struct ptdesc * to get a type
safety check.
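As a rough illustration of what that type check buys, here is a minimal
sketch; it is not taken from this series, and pte_free_tlb_example() is a
hypothetical architecture-style helper, while tlb_remove_ptdesc() and
virt_to_ptdesc() are the real kernel interfaces:

	/*
	 * Hypothetical sketch of an architecture page table free path.
	 * With the old void *pt parameter, passing a struct page * here
	 * compiled silently; with struct ptdesc *pt it becomes a
	 * compile-time type error.
	 */
	static inline void pte_free_tlb_example(struct mmu_gather *tlb, pte_t *pte)
	{
		/* look up the page table descriptor for this PTE page */
		struct ptdesc *ptdesc = virt_to_ptdesc(pte);

		tlb_remove_ptdesc(tlb, ptdesc);	/* ok: struct ptdesc * expected */
	}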
Link: https://lkml.kernel.org/r/60bb44299cf2d731df6592e446e7f694054d0dbe.1740454179.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng
Originally-by: Peter Zijlstra (Intel)
Reviewed-by: Kevin Brodsky
Cc: Alexandre Ghiti
Cc: "Aneesh Kumar K.V"
Cc: Arnd Bergmann
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Jann Horn
Cc: Matthew Wilcox (Oracle)
Cc: "Mike Rapoport (IBM)"
Cc: Muchun Song
Cc: Nicholas Piggin
Cc: Rik van Riel
Cc: Vishal Moola (Oracle)
Cc: Will Deacon
Cc: Yu Zhao
Signed-off-by: Andrew Morton
---
 include/asm-generic/tlb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e27d24dd8d5e..bf845019af36 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -493,7 +493,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 	return tlb_remove_page_size(tlb, page, PAGE_SIZE);
 }
 
-static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
+static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
 {
 	tlb_remove_table(tlb, pt);
 }
-- cgit

From 02d9e1a2048e47d39733f0ced71ce8e8fee3e56d Mon Sep 17 00:00:00 2001
From: Qi Zheng
Date: Tue, 25 Feb 2025 11:45:56 +0800
Subject: mm: pgtable: remove tlb_remove_page_ptdesc()

tlb_remove_ptdesc()/tlb_remove_table() are specifically designed for page
table pages, and all architectures have now been converted to use them to
remove page table pages. So remove tlb_remove_page_ptdesc(); it has no
remaining users and should not be used for page table pages.

Link: https://lkml.kernel.org/r/3df04c8494339073b71be4acb2d92e108ecd1b60.1740454179.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng
Suggested-by: Peter Zijlstra (Intel)
Reviewed-by: Kevin Brodsky
Cc: Alexandre Ghiti
Cc: "Aneesh Kumar K.V"
Cc: Arnd Bergmann
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Jann Horn
Cc: Matthew Wilcox (Oracle)
Cc: "Mike Rapoport (IBM)"
Cc: Muchun Song
Cc: Nicholas Piggin
Cc: Rik van Riel
Cc: Vishal Moola (Oracle)
Cc: Will Deacon
Cc: Yu Zhao
Signed-off-by: Andrew Morton
---
 include/asm-generic/tlb.h | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index bf845019af36..88a42973fa47 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -498,12 +498,6 @@ static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
 	tlb_remove_table(tlb, pt);
 }
 
-/* Like tlb_remove_ptdesc, but for page-like page directories. */
-static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
-{
-	tlb_remove_page(tlb, ptdesc_page(pt));
-}
-
 static inline void tlb_change_page_size(struct mmu_gather *tlb,
 					unsigned int page_size)
 {
-- cgit

From 5796d3967c0956734bd1249f76989ca80da0225b Mon Sep 17 00:00:00 2001
From: Jeff Xu
Date: Wed, 5 Mar 2025 02:17:05 +0000
Subject: mseal sysmap: kernel config and header change
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patch series "mseal system mappings", v9.

As discussed during the mseal() upstreaming process [1], mseal() protects
the VMAs of a given virtual memory range against modifications, such as
to the read/write (RW) and no-execute (NX) bits. For a complete
description of memory sealing, please see mseal.rst [2].

mseal() is useful to mitigate memory corruption issues where a corrupted
pointer is passed to a memory management system; a userspace sketch of
the sealing semantics follows below.
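As a hedged illustration of those semantics (not part of this patch):
the program seals an anonymous mapping and then watches mprotect() fail.
It assumes a kernel with mseal() support; the raw syscall number 462 is
the x86-64 value and should be checked for other architectures.

	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef SYS_mseal
	#define SYS_mseal 462	/* x86-64; check your architecture's table */
	#endif

	int main(void)
	{
		size_t len = 4096;
		void *p = mmap(NULL, len, PROT_READ,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		/* Seal the VMA; the flags argument must currently be 0. */
		if (syscall(SYS_mseal, p, len, 0) != 0) {
			perror("mseal");
			return 1;
		}

		/* Any attempt to modify the sealed VMA now fails. */
		if (mprotect(p, len, PROT_READ | PROT_WRITE) != 0)
			printf("mprotect blocked: %s\n", strerror(errno));
		return 0;
	}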
Such an attacker primitive can, for example, break control-flow integrity
guarantees, since read-only memory that is supposed to be trusted can
become writable, or .text pages can get remapped. The system mappings are
read-only; memory sealing can protect them from ever being made writable,
or from being unmapped or remapped with different attributes.

System mappings such as vdso, vvar, vvar_vclock, vectors (arm
compat-mode), and sigpage (arm compat-mode) are created by the kernel
during program initialization, and could be sealed after creation.

Unlike the aforementioned mappings, the uprobe mapping is not established
during program startup. However, its lifetime is the same as the
process's lifetime [3]. It could be sealed from creation.

The vsyscall on x86-64 uses a special address (0xffffffffff600000), which
is outside the mm-managed range. This means mprotect, munmap, and mremap
won't work on the vsyscall. Since sealing doesn't enhance the vsyscall's
security, it is skipped in this patch. If we ever seal the vsyscall, it
would probably only be for decorative purposes, i.e. showing the 'sl'
flag in /proc/pid/smaps. For this patch, it is ignored.

It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the system mappings during restore operations. UML (User Mode
Linux), gVisor, and rr are also known to change the vdso/vvar mappings.
Consequently, this feature cannot be universally enabled across all
systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.

To support mseal of system mappings, architectures must define
CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS and update their special
mappings calls to pass the mseal flag. Additionally, architectures must
confirm they do not unmap/remap system mappings during the process
lifetime. The existence of this flag for an architecture implies that it
does not require the remapping of these system mappings during the
process lifetime, so sealing these mappings is safe from a kernel
perspective.

This version covers the x86-64 and arm64 architectures as the minimum
viable feature. While no specific CPU hardware features are required to
enable this feature on an architecture, memory sealing requires a 64-bit
kernel. Other architectures can choose whether or not to adopt this
feature.

Currently, I'm not aware of any instances in the kernel code that
actively munmap/mremap a system mapping without a request from userspace.
PPC does call munmap when _install_special_mapping fails for vdso;
however, it's uncertain whether this will ever fail for PPC - this needs
to be investigated by PPC in the future [4]. The UML kernel can add this
support when KUnit tests require it [5].

In this version, we've improved the handling of system mapping sealing
from previous versions: instead of modifying the _install_special_mapping
function itself, which would affect all architectures, we now call
_install_special_mapping with a sealing flag only within the specific
architecture that requires it. This targeted approach offers two key
advantages: 1) it limits the code change's impact to the necessary
architectures, and 2) it aligns with the software architecture by keeping
the core memory management within the mm layer, while delegating the
decision of sealing system mappings to the individual architecture, which
is particularly relevant since 32-bit architectures never require
sealing.

Prior to this patch series, we explored sealing special mappings from
userspace using glibc's dynamic linker.
This approach revealed several issues:

- The PT_LOAD header may report an incorrect length for vdso (smaller
  than its actual size). The dynamic linker, which relies on PT_LOAD
  information to determine mapping size, would then split and partially
  seal the vdso mapping. Since each architecture has its own vdso/vvar
  code, fixing this in the kernel would require going through each
  architecture. Our initial goal was to enable sealing of read-only
  mappings, e.g. .text, across all architectures; sealing vdso from the
  kernel since creation appears to be simpler than sealing vdso in glibc.

- The [vvar] mapping header only contains address information, not
  length information. Similar issues might exist for other special
  mappings.

- Mappings like uprobe are not covered by the dynamic linker, and there
  is no effective solution for them.

This feature's security enhancements will benefit ChromeOS, Android, and
other high security systems.

Testing:

This feature was tested on ChromeOS and Android for both x86-64 and
ARM64.

- Enable sealing and verify that vdso/vvar, sigpage, and vectors are
  sealed properly, i.e. "sl" is shown in the smaps for those mappings,
  and mremap is blocked.

- Pass various automation tests (e.g. pre-checkin) on ChromeOS and
  Android to ensure the sealing doesn't affect the functionality of
  Chromebooks and Android phones.

I also tested the feature on Ubuntu on x86-64:

- With the config disabled, vdso/vvar is not sealed.

- With the config enabled, vdso/vvar is sealed, booting up Ubuntu is OK,
  and normal operations such as browsing the web and opening/editing
  docs are OK.

Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
Link: Documentation/userspace-api/mseal.rst [2]
Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]
Link: https://lore.kernel.org/all/CABi2SkV6JJwJeviDLsq9N4ONvQ=EFANsiWkgiEOjyT9TQSt+HA@mail.gmail.com/ [4]
Link: https://lore.kernel.org/all/202502251035.239B85A93@keescook/ [5]

This patch (of 7):

Provide infrastructure to mseal system mappings. Establish two kernel
configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS) and the VM_SEALED_SYSMAP macro for
future patches.

Link: https://lkml.kernel.org/r/20250305021711.3867874-1-jeffxu@google.com
Link: https://lkml.kernel.org/r/20250305021711.3867874-2-jeffxu@google.com
Signed-off-by: Jeff Xu
Reviewed-by: Kees Cook
Reviewed-by: Liam R. Howlett
Reviewed-by: Lorenzo Stoakes
Cc: Adhemerval Zanella
Cc: Alexander Mikhalitsyn
Cc: Alexey Dobriyan
Cc: Andrei Vagin
Cc: Anna-Maria Behnsen
Cc: Ard Biesheuvel
Cc: Benjamin Berg
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: David Rientjes
Cc: David S. Miller
Cc: Elliott Hughes
Cc: Florian Fainelli
Cc: Greg Ungerer
Cc: Guenter Roeck
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar
Cc: Jann Horn
Cc: Jason A. Donenfeld
Cc: Johannes Berg
Cc: Jorge Lucangeli Obes
Cc: Linus Walleij
Cc: Mark Rutland
Cc: Matthew Wilcox (Oracle)
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Miguel Ojeda
Cc: Mike Rapoport
Cc: Oleg Nesterov
Cc: Pedro Falcato
Cc: Peter Xu
Cc: Randy Dunlap
Cc: Stephen Röttger
Cc: Thomas Weißschuh
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 include/linux/mm.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32ba0e33422b..778f5de6a12e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4236,4 +4236,14 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
 int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
 int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
 
+
+/*
+ * mseal of userspace process's system mappings.
+ */
+#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
+#define VM_SEALED_SYSMAP	VM_SEALED
+#else
+#define VM_SEALED_SYSMAP	VM_NONE
+#endif
+
 #endif /* _LINUX_MM_H */
-- cgit
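To show how VM_SEALED_SYSMAP is meant to be consumed, here is a hedged
sketch modeled on the per-architecture patches later in this series
rather than quoted from them; map_vdso_example() and vdso_mapping are
hypothetical names, while _install_special_mapping() and the VM_* flags
are real kernel interfaces:

	/*
	 * Hypothetical sketch: an architecture's vdso setup passing the
	 * sealing flag. With CONFIG_MSEAL_SYSTEM_MAPPINGS=n,
	 * VM_SEALED_SYSMAP is VM_NONE, so this collapses to the unsealed
	 * behavior on 32-bit or opted-out architectures.
	 */
	static const struct vm_special_mapping vdso_mapping = {
		.name = "[vdso]",	/* hypothetical descriptor */
	};

	static struct vm_area_struct *map_vdso_example(struct mm_struct *mm,
						       unsigned long addr)
	{
		return _install_special_mapping(mm, addr, PAGE_SIZE,
						VM_READ | VM_EXEC |
						VM_MAYREAD | VM_MAYWRITE |
						VM_MAYEXEC |
						VM_SEALED_SYSMAP,
						&vdso_mapping);
	}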