diff options
Diffstat (limited to 'Documentation/admin-guide/mm/userfaultfd.rst')
| -rw-r--r-- | Documentation/admin-guide/mm/userfaultfd.rst | 80 |
1 files changed, 78 insertions, 2 deletions
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 83f31919ebb3..e5cc8848dcb3 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -1,5 +1,3 @@ -.. _userfaultfd: - =========== Userfaultfd =========== @@ -115,6 +113,9 @@ events, except page fault notifications, may be generated: areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating support for shmem virtual memory areas. +- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an + existing page contents from userspace. + The userland application should set the feature flags it intends to use when invoking the ``UFFDIO_API`` ioctl, to request that those features be enabled if supported. @@ -221,6 +222,81 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was used. +Userfaultfd write-protect mode currently behave differently on none ptes +(when e.g. page is missing) over different types of memories. + +For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes +(e.g. when pages are missing and not populated). For file-backed memories +like shmem and hugetlbfs, none ptes will be write protected just like a +present pte. In other words, there will be a userfaultfd write fault +message generated when writing to a missing page on file typed memories, +as long as the page range was write-protected before. Such a message will +not be generated on anonymous memories by default. + +If the application wants to be able to write protect none ptes on anonymous +memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On +newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED +and set the feature bit in advance to make sure none ptes will also be +write protected even upon anonymous memory. + +When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either +``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when +resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE`` +respectively, it may be desirable for the new page / mapping to be +write-protected (so future writes will also result in a WP fault). These ioctls +support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP`` +respectively) to configure the mapping this way. + +If the userfaultfd context has ``UFFD_FEATURE_WP_ASYNC`` feature bit set, +any vma registered with write-protection will work in async mode rather +than the default sync mode. + +In async mode, there will be no message generated when a write operation +happens, meanwhile the write-protection will be resolved automatically by +the kernel. It can be seen as a more accurate version of soft-dirty +tracking and it can be different in a few ways: + + - The dirty result will not be affected by vma changes (e.g. vma + merging) because the dirty is only tracked by the pte. + + - It supports range operations by default, so one can enable tracking on + any range of memory as long as page aligned. + + - Dirty information will not get lost if the pte was zapped due to + various reasons (e.g. during split of a shmem transparent huge page). + + - Due to a reverted meaning of soft-dirty (page clean when uffd-wp bit + set; dirty when uffd-wp bit cleared), it has different semantics on + some of the memory operations. For example: ``MADV_DONTNEED`` on + anonymous (or ``MADV_REMOVE`` on a file mapping) will be treated as + dirtying of memory by dropping uffd-wp bit during the procedure. + +The user app can collect the "written/dirty" status by looking up the +uffd-wp bit for the pages being interested in /proc/pagemap. + +The page will not be under track of uffd-wp async mode until the page is +explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode +flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault +that was tracked by async mode userfaultfd-wp is invalid. + +When userfaultfd-wp async mode is used alone, it can be applied to all +kinds of memory. + +Memory Poisioning Emulation +--------------------------- + +In response to a fault (either missing or minor), an action userspace can +take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any +future faulters to either get a SIGBUS, or in KVM's case the guest will +receive an MCE as if there were hardware memory poisoning. + +This is used to emulate hardware memory poisoning. Imagine a VM running on a +machine which experiences a real hardware memory error. Later, we live migrate +the VM to another physical machine. Since we want the migration to be +transparent to the guest, we want that same address range to act as if it was +still poisoned, even though it's on a new physical host which ostensibly +doesn't have a memory error in the exact same spot. + QEMU/KVM ======== |
