diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
new file mode 100644
index 000000000000..ea9b11a0bd89
--- /dev/null
+++ b/Documentation/userspace-api/mseal.rst
@@ -0,0 +1,209 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Introduction of mseal
+=====================
+
+:Author: Jeff Xu <jeffxu@chromium.org>
+
+Modern CPUs support memory permissions such as RW and NX bits. These memory
+permissions improve the security stance against memory corruption bugs:
+an attacker can't simply write to arbitrary memory and redirect execution
+to it, because the memory has to be marked with the X bit or an exception
+will occur.
+
+Memory sealing additionally protects the mapping itself against
+modifications. This is useful to mitigate memory corruption issues where a
+corrupted pointer is passed to a memory management system. For example,
+such an attacker primitive can break control-flow integrity guarantees
+since read-only memory that is supposed to be trusted can become writable
+or .text pages can get remapped. Memory sealing can automatically be
+applied by the runtime loader to seal .text and .rodata pages and
+applications can additionally seal security critical data at runtime.
+
+A similar feature already exists in the XNU kernel with the
+VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
+
+SYSCALL
+=======
+mseal syscall signature
+-----------------------
+ ``int mseal(void *addr, size_t len, unsigned long flags)``
+
+ **addr**/**len**: virtual memory address range.
+ The address range set by **addr**/**len** must meet:
+ - The start address must be in an allocated VMA.
+ - The start address must be page aligned.
+ - The end address (**addr** + **len**) must be in an allocated VMA.
+ - no gap (unallocated memory) between start and end address.
+
+ The ``len`` will be page aligned implicitly by the kernel.
+
+ **flags**: reserved for future use.
+
+ **Return values**:
+ - **0**: Success.
+ - **-EINVAL**:
+ * Invalid input ``flags``.
+ * The start address (``addr``) is not page aligned.
+ * Address range (``addr`` + ``len``) overflow.
+ - **-ENOMEM**:
+ * The start address (``addr``) is not allocated.
+ * The end address (``addr`` + ``len``) is not allocated.
+ * A gap (unallocated memory) between start and end address.
+ - **-EPERM**:
+ * sealing is supported only on 64-bit CPUs; 32-bit is not supported.
+
+ **Note about error return**:
+ - For the above error cases, users can expect that the given memory range
+   is unmodified, i.e. no partial update.
+ - There might be other internal errors/cases not listed here, e.g.
+ error during merging/splitting VMAs, or the process reaching the maximum
+ number of supported VMAs. In those cases, partial updates to the given
+ memory range could happen. However, those cases should be rare.
+
+ **Architecture support**:
+ mseal only works on 64-bit CPUs, not 32-bit CPUs.
+
+ **Idempotent**:
+ users can call mseal multiple times. Calling mseal on already sealed
+ memory is a no-op, not an error.
+
+ **no munseal**:
+ Once a mapping is sealed, it can't be unsealed. The kernel will never
+ provide munseal; this is consistent with other sealing features, e.g.
+ F_SEAL_SEAL for file sealing.
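+
+ Example: a minimal usage sketch that puts the pieces above together. It
+ assumes a kernel that provides mseal (Linux 6.10 or later) and invokes the
+ syscall directly via syscall(2) with ``__NR_mseal``, since a libc wrapper
+ may not be available; the ``my_mseal`` helper is purely illustrative::
+
+        #define _GNU_SOURCE
+        #include <assert.h>
+        #include <stddef.h>
+        #include <sys/mman.h>
+        #include <sys/syscall.h>
+        #include <unistd.h>
+
+        /* Illustrative wrapper, not part of any library API. */
+        static int my_mseal(void *addr, size_t len, unsigned long flags)
+        {
+                return syscall(__NR_mseal, addr, len, flags);
+        }
+
+        int main(void)
+        {
+                void *ptr;
+
+                /* mmap returns page-aligned memory, satisfying the addr rule. */
+                ptr = mmap(NULL, 4096, PROT_READ,
+                           MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+                assert(ptr != MAP_FAILED);
+
+                /* flags is reserved for future use and must be 0. */
+                assert(my_mseal(ptr, 4096, 0) == 0);
+
+                /* Sealing is idempotent: sealing again is not an error. */
+                assert(my_mseal(ptr, 4096, 0) == 0);
+
+                /* An unaligned start address is rejected with -EINVAL. */
+                assert(my_mseal((char *)ptr + 1, 4096, 0) == -1);
+
+                return 0;
+        }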
+
+Blocked mm syscall for sealed mapping
+-------------------------------------
+ It might be important to note: **once the mapping is sealed, it will
+ stay in the process's memory until the process terminates**.
+
+ Example::
+
+        ptr = mmap(NULL, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+        rc = mseal(ptr, 4096, 0);
+        /* munmap will fail */
+        rc = munmap(ptr, 4096);
+        assert(rc < 0);
+
+ Blocked mm syscalls:
+ - munmap
+ - mmap
+ - mremap
+ - mprotect and pkey_mprotect
+ - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
+   MADV_DONTNEED_LOCKED, MADV_DONTFORK, MADV_WIPEONFORK
+
+ The first set of syscalls to block is munmap, mremap, and mmap. They can
+ either leave an empty space in the address space, thereby allowing its
+ replacement with a new mapping with a new set of attributes, or they can
+ overwrite the existing mapping with another mapping.
+
+ mprotect and pkey_mprotect are blocked because they change the
+ protection bits (RWX) of the mapping.
+
+ Certain destructive madvise behaviors, specifically MADV_DONTNEED,
+ MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
+ risks when applied to anonymous memory that lacks write permission.
+ Consequently, these operations are prohibited on such sealed mappings.
+ These behaviors can modify the region's contents by discarding pages,
+ effectively performing a memset(0) operation on the anonymous memory.
+
+ The kernel will return -EPERM for blocked syscalls.
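+
+ Example: a sketch of the madvise case, reusing the illustrative
+ ``my_mseal`` wrapper from the earlier example::
+
+        /* Read-only anonymous memory, then sealed. */
+        ptr = mmap(NULL, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+        rc = my_mseal(ptr, 4096, 0);
+        assert(rc == 0);
+
+        /* Discarding the pages would effectively zero the region, so the
+         * call is refused on this sealed, non-writable anonymous mapping. */
+        rc = madvise(ptr, 4096, MADV_DONTNEED);
+        assert(rc < 0 && errno == EPERM);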
+
+ When a blocked syscall returns -EPERM due to sealing, the memory regions
+ may or may not be changed, depending on the syscall being blocked:
+
+ - munmap: munmap is atomic. If one of the VMAs in the given range is
+   sealed, none of the VMAs are updated.
+ - mprotect, pkey_mprotect, madvise: a partial update might happen, e.g.
+   when mprotect spans multiple VMAs, it might update the leading VMAs
+   before reaching the sealed VMA and returning -EPERM (see the sketch
+   after this list).
+ - mmap and mremap: undefined behavior.
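+
+ Example: a sketch of the mprotect partial-update case, again using the
+ illustrative ``my_mseal`` wrapper. Two adjacent pages are mapped and only
+ the second page is sealed, which splits the range into two VMAs::
+
+        char *p;
+
+        p = mmap(NULL, 2 * 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+        assert(p != MAP_FAILED);
+
+        /* Seal only the second page; the first page stays unsealed. */
+        rc = my_mseal(p + 4096, 4096, 0);
+        assert(rc == 0);
+
+        /* mprotect across both VMAs fails on the sealed one ... */
+        rc = mprotect(p, 2 * 4096, PROT_READ | PROT_WRITE);
+        assert(rc < 0 && errno == EPERM);
+
+        /* ... but the leading, unsealed page may already have been updated. */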
+
+Use cases
+=========
+- glibc:
+  The dynamic linker, while loading ELF executables, can apply sealing to
+  the mapped segments.
+
+- Chrome browser: protect some security sensitive data structures.
+
+- System mappings:
+  The system mappings are created by the kernel and include vdso, vvar,
+  vvar_vclock, vectors (arm compat-mode), sigpage (arm compat-mode), and
+  uprobes.
+
+  Those system mappings are read-only or execute-only; memory sealing can
+  protect them from ever becoming writable or being unmapped/remapped with
+  different attributes. This is useful to mitigate memory corruption issues
+  where a corrupted pointer is passed to a memory management system.
+
+  If supported by an architecture (CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS),
+  enabling CONFIG_MSEAL_SYSTEM_MAPPINGS seals all system mappings of that
+  architecture.
+
+ The following architectures currently support this feature: x86-64, arm64,
+ loongarch and s390.
+
+ WARNING: This feature breaks programs which rely on relocating
+ or unmapping system mappings. Known broken software at the time
+ of writing includes CHECKPOINT_RESTORE, UML, gVisor, rr. Therefore
+ this config can't be enabled universally.
+
+When not to use mseal
+=====================
+Applications can apply sealing to any virtual memory region from userspace,
+but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
+applying the seal. This is because the sealed mapping *won’t be unmapped*
+until the process terminates or the exec system call is invoked.
+
+For example:
+ - aio/shm
+   aio/shm can call mmap and munmap on behalf of userspace, e.g.
+   ksys_shmdt() in shm.c. The lifetimes of those mappings are not tied to
+   the lifetime of the process. If those memory regions are sealed from
+   userspace, munmap will fail, leaking the VMAs in the address space for
+   the remaining lifetime of the process.
+
+ - ptr allocated by malloc (heap)
+   Don't use mseal on memory returned from malloc().
+   malloc() is implemented by an allocator, e.g. glibc. The heap manager
+   might allocate the memory from brk or from a mapping created by mmap.
+   If an app calls mseal on a ptr returned from malloc(), this can affect
+   the heap manager's ability to manage the mappings; the outcome is
+   non-deterministic.
+
+ Example::
+
+        ptr = malloc(size);
+        /* don't call mseal on ptr returned from malloc. */
+        mseal(ptr, size, 0);
+        /* free succeeds, but the allocator can't shrink the heap below ptr */
+        free(ptr);
+
+mseal doesn't block
+===================
+In a nutshell, mseal blocks certain mm syscalls from modifying some of a
+VMA's attributes, such as the protection bits (RWX). A sealed mapping doesn't
+mean the memory contents are immutable.
+
+As Jann Horn pointed out in [3], there are still a few ways to write
+to RO memory, which is, in a way, by design. Those can be blocked by
+different security measures.
+
+Those cases are:
+
+ - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
+ - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
+ - userfaultfd.
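+
+Example: a sketch of the first case, writing to sealed read-only memory
+through /proc/self/mem. It assumes the process can open its own
+/proc/self/mem (normally permitted) and reuses the illustrative ``my_mseal``
+wrapper from the earlier example::
+
+        int fd;
+        char *p;
+
+        p = mmap(NULL, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+        assert(my_mseal(p, 4096, 0) == 0);
+
+        /* Writes through /proc/self/mem use FOLL_FORCE and are stopped by
+         * neither the RO protection bits nor the seal. */
+        fd = open("/proc/self/mem", O_RDWR);
+        assert(fd >= 0);
+        assert(pwrite(fd, "A", 1, (off_t)(uintptr_t)p) == 1);
+        assert(p[0] == 'A');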
+
+The idea that inspired this patch comes from Stephen Röttger’s work in V8
+CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
+
+Reference
+=========
+- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
+- [2] https://man.openbsd.org/mimmutable.2
+- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
+- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc