From 0f9a921cf9bf3b524feddc484e2b4d070b7ca0d0 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 6 Nov 2012 09:54:47 +0000 Subject: x86: mm: only do a local tlb flush in ptep_set_access_flags() The function ptep_set_access_flags() is only ever invoked to set access flags or add write permission on a PTE. The write bit is only ever set together with the dirty bit. Because we only ever upgrade a PTE, it is safe to skip flushing entries on remote TLBs. The worst that can happen is a spurious page fault on other CPUs, which would flush that TLB entry. Lazily letting another CPU incur a spurious page fault occasionally is (much!) cheaper than aggressively flushing everybody else's TLB. Signed-off-by: Rik van Riel Cc: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Cc: Michel Lespinasse Cc: Ingo Molnar --- arch/x86/mm/pgtable.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 8573b83a63d0..be3bb4690887 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd) free_page((unsigned long)pgd); } +/* + * Used to set accessed or dirty bits in the page table entries + * on other architectures. On x86, the accessed and dirty bits + * are tracked by hardware. However, do_wp_page calls this function + * to also make the pte writeable at the same time the dirty bit is + * set. In that case we do actually need to write the PTE. + */ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pte_t entry, int dirty) @@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma, if (changed && dirty) { *ptep = entry; pte_update_defer(vma->vm_mm, address, ptep); - flush_tlb_page(vma, address); + __flush_tlb_one(address); } return changed; -- cgit From e4a1cc56e4d728eb87072c71c07581524e5160b1 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 6 Nov 2012 09:55:18 +0000 Subject: x86: mm: drop TLB flush from ptep_set_access_flags Intel has an architectural guarantee that the TLB entry causing a page fault gets invalidated automatically. This means we should be able to drop the local TLB invalidation. Because of the way other areas of the page fault code work, chances are good that all x86 CPUs do this. However, if someone somewhere has an x86 CPU that does not invalidate the TLB entry causing a page fault, this one-liner should be easy to revert. Signed-off-by: Rik van Riel Cc: Linus Torvalds Cc: Andrew Morton Cc: Michel Lespinasse Cc: Peter Zijlstra Cc: Ingo Molnar --- arch/x86/mm/pgtable.c | 1 - 1 file changed, 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index be3bb4690887..7353de3d98a7 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma, if (changed && dirty) { *ptep = entry; pte_update_defer(vma->vm_mm, address, ptep); - __flush_tlb_one(address); } return changed; -- cgit From 2c3cf556b2f7ab5823195766eddec6f1a43134f5 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 9 Oct 2012 15:31:12 +0200 Subject: x86/mm: Introduce pte_accessible() We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that the pte is associated with a page. However, for TLB flushing purposes, we would like to know whether the pte points to an actually accessible page. This allows us to skip remote TLB flushes for pages that are not actually accessible. Fill in this method for x86 and provide a safe (but slower) method on other architectures. Signed-off-by: Rik van Riel Signed-off-by: Peter Zijlstra Fixed-by: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org [ Added Linus's review fixes. ] Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'arch') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a1f780d45f76..5fe03aaca92e 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -407,6 +407,12 @@ static inline int pte_present(pte_t a) return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } +#define pte_accessible pte_accessible +static inline int pte_accessible(pte_t a) +{ + return pte_flags(a) & _PAGE_PRESENT; +} + static inline int pte_hidden(pte_t pte) { return pte_flags(pte) & _PAGE_HIDDEN; -- cgit From dbe4d2035a5b273c910f8f7eb0b7189ee76f63ad Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Thu, 4 Oct 2012 01:50:46 +0200 Subject: mm: numa: define _PAGE_NUMA The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page faults to identify the per NUMA node working set of the thread at runtime. Arming the NUMA hinting page fault mechanism works similarly to setting up a mprotect(PROT_NONE) virtual range: the present bit is cleared at the same time that _PAGE_NUMA is set, so when the fault triggers we can identify it as a NUMA hinting page fault. _PAGE_NUMA on x86 shares the same bit number of _PAGE_PROTNONE (but it could also use a different bitflag, it's up to the architecture to decide). It would be confusing to call the "NUMA hinting page faults" as "do_prot_none faults". They're different events and _PAGE_NUMA doesn't alter the semantics of mprotect(PROT_NONE) in any way. Sharing the same bitflag with _PAGE_PROTNONE in fact complicates things: it requires us to ensure the code paths executed by _PAGE_PROTNONE remains mutually exclusive to the code paths executed by _PAGE_NUMA at all times, to avoid _PAGE_NUMA and _PAGE_PROTNONE to step into each other toes. Because we want to be able to set this bitflag in any established pte or pmd (while clearing the present bit at the same time) without losing information, this bitflag must never be set when the pte and pmd are present, so the bitflag picked for _PAGE_NUMA usage, must not be used by the swap entry format. Signed-off-by: Andrea Arcangeli Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel --- arch/x86/include/asm/pgtable_types.h | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'arch') diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index ec8a1fc9505d..3c32db8c539d 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -64,6 +64,26 @@ #define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE) #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) +/* + * _PAGE_NUMA indicates that this page will trigger a numa hinting + * minor page fault to gather numa placement statistics (see + * pte_numa()). The bit picked (8) is within the range between + * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't + * require changes to the swp entry format because that bit is always + * zero when the pte is not present. + * + * The bit picked must be always zero when the pmd is present and not + * present, so that we don't lose information when we set it while + * atomically clearing the present bit. + * + * Because we shared the same bit (8) with _PAGE_PROTNONE this can be + * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE + * couldn't reach, like handle_mm_fault() (see access_error in + * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for + * handle_mm_fault() to be invoked). + */ +#define _PAGE_NUMA _PAGE_PROTNONE + #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ _PAGE_ACCESSED | _PAGE_DIRTY) #define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \ -- cgit From be3a728427a605990a7a0b6dbf9e29b68e266146 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Thu, 4 Oct 2012 01:50:47 +0200 Subject: mm: numa: pte_numa() and pmd_numa() Implement pte_numa and pmd_numa. We must atomically set the numa bit and clear the present bit to define a pte_numa or pmd_numa. Once a pte or pmd has been set as pte_numa or pmd_numa, the next time a thread touches a virtual address in the corresponding virtual range, a NUMA hinting page fault will trigger. The NUMA hinting page fault will clear the NUMA bit and set the present bit again to resolve the page fault. The expectation is that a NUMA hinting page fault is used as part of a placement policy that decides if a page should remain on the current node or migrated to a different node. Acked-by: Rik van Riel Signed-off-by: Andrea Arcangeli Signed-off-by: Mel Gorman --- arch/x86/include/asm/pgtable.h | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 5fe03aaca92e..5199db2923d3 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b) static inline int pte_present(pte_t a) { - return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); + return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE | + _PAGE_NUMA); } #define pte_accessible pte_accessible @@ -426,7 +427,8 @@ static inline int pmd_present(pmd_t pmd) * the _PAGE_PSE flag will remain set at all times while the * _PAGE_PRESENT bit is clear). */ - return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE); + return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE | + _PAGE_NUMA); } static inline int pmd_none(pmd_t pmd) @@ -485,6 +487,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address) static inline int pmd_bad(pmd_t pmd) { +#ifdef CONFIG_NUMA_BALANCING + /* pmd_numa check */ + if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA) + return 0; +#endif return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE; } -- cgit From cbee9f88ec1b8dd6b58f25f54e4f52c82ed77690 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 25 Oct 2012 14:16:43 +0200 Subject: mm: numa: Add fault driven placement and migration NOTE: This patch is based on "sched, numa, mm: Add fault driven placement and migration policy" but as it throws away all the policy to just leave a basic foundation I had to drop the signed-offs-by. This patch creates a bare-bones method for setting PTEs pte_numa in the context of the scheduler that when faulted later will be faulted onto the node the CPU is running on. In itself this does nothing useful but any placement policy will fundamentally depend on receiving hints on placement from fault context and doing something intelligent about it. Signed-off-by: Mel Gorman Acked-by: Rik van Riel --- arch/sh/mm/Kconfig | 1 + arch/x86/Kconfig | 2 ++ 2 files changed, 3 insertions(+) (limited to 'arch') diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig index cb8f9920f4dd..0f7c852f355c 100644 --- a/arch/sh/mm/Kconfig +++ b/arch/sh/mm/Kconfig @@ -111,6 +111,7 @@ config VSYSCALL config NUMA bool "Non Uniform Memory Access (NUMA) Support" depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL + select ARCH_WANT_NUMA_VARIABLE_LOCALITY default n help Some SH systems have many various memories scattered around diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 46c3bff3ced2..1137028fc6d9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -22,6 +22,8 @@ config X86 def_bool y select HAVE_AOUT if X86_32 select HAVE_UNSTABLE_SCHED_CLOCK + select ARCH_SUPPORTS_NUMA_BALANCING + select ARCH_WANTS_PROT_NUMA_PROT_NONE select HAVE_IDE select HAVE_OPROFILE select HAVE_PCSPKR_PLATFORM -- cgit