diff options
Diffstat (limited to 'Documentation/admin-guide/mm/pagemap.rst')
| -rw-r--r-- | Documentation/admin-guide/mm/pagemap.rst | 161 |
1 files changed, 123 insertions, 38 deletions
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 6e2e416af783..c57e61b5d8aa 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -1,5 +1,3 @@ -.. _pagemap: - ============================= Examining Process Page Tables ============================= @@ -19,11 +17,12 @@ There are four components to pagemap: * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped * Bit 55 pte is soft-dirty (see - :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`) + Documentation/admin-guide/mm/soft-dirty.rst) * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see - :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`) - * Bits 58-60 zero + Documentation/admin-guide/mm/userfaultfd.rst) + * Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page) + * Bits 59-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped * Bit 63 page present @@ -39,14 +38,30 @@ There are four components to pagemap: precisely which pages are mapped (or in swap) and comparing mapped pages between processes. + Traditionally, bit 56 indicates that a page is mapped exactly once and bit + 56 is clear when a page is mapped multiple times, even when mapped in the + same process multiple times. In some kernel configurations, the semantics + for pages part of a larger allocation (e.g., THP) can differ: bit 56 is set + if all pages part of the corresponding large allocation are *certainly* + mapped in the same process, even if the page is mapped multiple times in that + process. Bit 56 is clear when any page page of the larger allocation + is *maybe* mapped in a different process. In some cases, a large allocation + might be treated as "maybe mapped by multiple processes" even though this + is no longer the case. + Efficient users of this interface will use ``/proc/pid/maps`` to determine which areas of memory are actually mapped and llseek to skip over unmapped regions. * ``/proc/kpagecount``. This file contains a 64-bit count of the number of - times each page is mapped, indexed by PFN. - -The page-types tool in the tools/vm directory can be used to query the + times each page is mapped, indexed by PFN. Some kernel configurations do + not track the precise number of times a page part of a larger allocation + (e.g., THP) is mapped. In these configurations, the average number of + mappings per page in this larger allocation is returned instead. However, + if any page of the large allocation is mapped, the returned value will + be at least 1. + +The page-types tool in the tools/mm directory can be used to query the number of times a page is mapped. * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each @@ -93,20 +108,20 @@ Short descriptions to the page flags The page is being locked for exclusive access, e.g. by undergoing read/write IO. 7 - SLAB - The page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator. - When compound page is used, SLUB/SLQB will only set this flag on the head - page; SLOB will not flag it at all. + The page is managed by the SLAB/SLUB kernel memory allocator. + When compound page is used, either will only set this flag on the head + page. 10 - BUDDY A free memory block managed by the buddy system allocator. The buddy system organizes free memory in blocks of various orders. An order N block has 2^N physically contiguous pages, with the BUDDY flag - set for and _only_ for the first page. + set for all pages. + Before 4.6 only the first page of the block had the flag set. 15 - COMPOUND_HEAD A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages - (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`), + pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst), the SLUB etc. memory allocators and various device drivers. However in this interface, only huge/giga pages are made visible to end users. @@ -121,14 +136,14 @@ Short descriptions to the page flags 21 - KSM Identical memory pages dynamically shared between one or more processes. 22 - THP - Contiguous pages which construct transparent hugepages. + Contiguous pages which construct THP of any size and mapped by any granularity. 23 - OFFLINE The page is logically offline. 24 - ZERO_PAGE Zero page for pfn_zero or huge_zero page. 25 - IDLE The page has not been accessed since it was marked idle (see - :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`). + Documentation/admin-guide/mm/idle_page_tracking.rst). Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. @@ -173,30 +188,9 @@ LRU related page flags 14 - SWAPBACKED The page is backed by swap/RAM. -The page-types tool in the tools/vm directory can be used to query the +The page-types tool in the tools/mm directory can be used to query the above flags. -Using pagemap to do something useful -==================================== - -The general procedure for using pagemap to find out about a process' memory -usage goes like this: - - 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are - mapped to what. - 2. Select the maps you are interested in -- all of them, or a particular - library, or the stack or the heap, etc. - 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. - 4. Read a u64 for each page from pagemap. - 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you - just read, seek to that entry in the file, and read the data you want. - -For example, to find the "unique set size" (USS), which is the amount of -memory that a process is using that is not shared with any other process, -you can go through every map in the process, find the PFNs, look those up -in kpagecount, and tally up the number of pages that are only referenced -once. - Exceptions for Shared Memory ============================ @@ -230,3 +224,94 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is always 12 at most architectures). Since Linux 3.11 their meaning changes after first clear of soft-dirty bits. Since Linux 4.2 they are used for flags unconditionally. + +Pagemap Scan IOCTL +================== + +The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally +clear the info about page table entries. The following operations are supported +in this IOCTL: + +- Scan the address range and get the memory ranges matching the provided criteria. + This is performed when the output buffer is specified. +- Write-protect the pages. The ``PM_SCAN_WP_MATCHING`` is used to write-protect + the pages of interest. The ``PM_SCAN_CHECK_WPASYNC`` aborts the operation if + non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be + used with or without ``PM_SCAN_CHECK_WPASYNC``. +- Both of those operations can be combined into one atomic operation where we can + get and write protect the pages as well. + +Following flags about pages are currently supported: + +- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled +- ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was write protected +- ``PAGE_IS_FILE`` - Page is file backed +- ``PAGE_IS_PRESENT`` - Page is present in the memory +- ``PAGE_IS_SWAPPED`` - Page is in swapped +- ``PAGE_IS_PFNZERO`` - Page has zero PFN +- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed +- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty +- ``PAGE_IS_GUARD`` - Page is a part of a guard region + +The ``struct pm_scan_arg`` is used as the argument of the IOCTL. + + 1. The size of the ``struct pm_scan_arg`` must be specified in the ``size`` + field. This field will be helpful in recognizing the structure if extensions + are done later. + 2. The flags can be specified in the ``flags`` field. The ``PM_SCAN_WP_MATCHING`` + and ``PM_SCAN_CHECK_WPASYNC`` are the only added flags at this time. The get + operation is optionally performed depending upon if the output buffer is + provided or not. + 3. The range is specified through ``start`` and ``end``. + 4. The walk can abort before visiting the complete range such as the user buffer + can get full etc. The walk ending address is specified in``end_walk``. + 5. The output buffer of ``struct page_region`` array and size is specified in + ``vec`` and ``vec_len``. + 6. The optional maximum requested pages are specified in the ``max_pages``. + 7. The masks are specified in ``category_mask``, ``category_anyof_mask``, + ``category_inverted`` and ``return_mask``. + +Find pages which have been written and WP them as well:: + + struct pm_scan_arg arg = { + .size = sizeof(arg), + .flags = PM_SCAN_CHECK_WPASYNC | PM_SCAN_CHECK_WPASYNC, + .. + .category_mask = PAGE_IS_WRITTEN, + .return_mask = PAGE_IS_WRITTEN, + }; + +Find pages which have been written, are file backed, not swapped and either +present or huge:: + + struct pm_scan_arg arg = { + .size = sizeof(arg), + .flags = 0, + .. + .category_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED, + .category_inverted = PAGE_IS_SWAPPED, + .category_anyof_mask = PAGE_IS_PRESENT | PAGE_IS_HUGE, + .return_mask = PAGE_IS_WRITTEN | PAGE_IS_SWAPPED | + PAGE_IS_PRESENT | PAGE_IS_HUGE, + }; + +The ``PAGE_IS_WRITTEN`` flag can be considered as a better-performing alternative +of soft-dirty flag. It doesn't get affected by VMA merging of the kernel and hence +the user can find the true soft-dirty pages in case of normal pages. (There may +still be extra dirty pages reported for THP or Hugetlb pages.) + +"PAGE_IS_WRITTEN" category is used with uffd write protect-enabled ranges to +implement memory dirty tracking in userspace: + + 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall. + 2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features + are set by ``UFFDIO_API`` IOCTL. + 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode + through ``UFFDIO_REGISTER`` IOCTL. + 4. Then any part of the registered memory or the whole memory region must + be write protected using ``PAGEMAP_SCAN`` IOCTL with flag ``PM_SCAN_WP_MATCHING`` + or the ``UFFDIO_WRITEPROTECT`` IOCTL can be used. Both of these perform the + same operation. The former is better in terms of performance. + 5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to either just find pages which + have been written to since they were last marked and/or optionally write protect + the pages as well. |
