summaryrefslogtreecommitdiff
path: root/Documentation/vm/mmu_notifier.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/vm/mmu_notifier.txt')
-rw-r--r--Documentation/vm/mmu_notifier.txt93
1 files changed, 93 insertions, 0 deletions
diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt
new file mode 100644
index 000000000000..23b462566bb7
--- /dev/null
+++ b/Documentation/vm/mmu_notifier.txt
@@ -0,0 +1,93 @@
+When do you need to notify inside page table lock ?
+
+When clearing a pte/pmd we are given a choice to notify the event through
+(notify version of *_clear_flush call mmu_notifier_invalidate_range) under
+the page table lock. But that notification is not necessary in all cases.
+
+For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use
+thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a
+process virtual address space). There is only 2 cases when you need to notify
+those secondary TLB while holding page table lock when clearing a pte/pmd:
+
+ A) page backing address is free before mmu_notifier_invalidate_range_end()
+ B) a page table entry is updated to point to a new page (COW, write fault
+ on zero page, __replace_page(), ...)
+
+Case A is obvious you do not want to take the risk for the device to write to
+a page that might now be used by some completely different task.
+
+Case B is more subtle. For correctness it requires the following sequence to
+happen:
+ - take page table lock
+ - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
+ - set page table entry to point to new page
+
+If clearing the page table entry is not followed by a notify before setting
+the new pte/pmd value then you can break memory model like C11 or C++11 for
+the device.
+
+Consider the following scenario (device use a feature similar to ATS/PASID):
+
+Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume
+they are write protected for COW (other case of B apply too).
+
+[Time N] --------------------------------------------------------------------
+CPU-thread-0 {try to write to addrA}
+CPU-thread-1 {try to write to addrB}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {read addrA and populate device TLB}
+DEV-thread-2 {read addrB and populate device TLB}
+[Time N+1] ------------------------------------------------------------------
+CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
+CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+2] ------------------------------------------------------------------
+CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}}
+CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0 {preempted}
+CPU-thread-1 {preempted}
+CPU-thread-2 {write to addrA which is a write to new page}
+CPU-thread-3 {}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+3] ------------------------------------------------------------------
+CPU-thread-0 {preempted}
+CPU-thread-1 {preempted}
+CPU-thread-2 {}
+CPU-thread-3 {write to addrB which is a write to new page}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+4] ------------------------------------------------------------------
+CPU-thread-0 {preempted}
+CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {}
+DEV-thread-2 {}
+[Time N+5] ------------------------------------------------------------------
+CPU-thread-0 {preempted}
+CPU-thread-1 {}
+CPU-thread-2 {}
+CPU-thread-3 {}
+DEV-thread-0 {read addrA from old page}
+DEV-thread-2 {read addrB from new page}
+
+So here because at time N+2 the clear page table entry was not pair with a
+notification to invalidate the secondary TLB, the device see the new value for
+addrB before seing the new value for addrA. This break total memory ordering
+for the device.
+
+When changing a pte to write protect or to point to a new write protected page
+with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range
+call to mmu_notifier_invalidate_range_end() outside the page table lock. This
+is true even if the thread doing the page table update is preempted right after
+releasing page table lock but before call mmu_notifier_invalidate_range_end().