From 67961f9db8c477026ea20ce05761bde6f8bf85b0 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Wed, 8 Jun 2016 15:33:42 -0700
Subject: mm/hugetlb: fix huge page reserve accounting for private mappings

When creating a private mapping of a hugetlbfs file, it is possible to
unmap pages via ftruncate or fallocate hole punch.  If subsequent faults
repopulate these mappings, the reserve counts will go negative.  This is
because the code currently assumes all faults to private mappings will
consume reserves.  The problem can be recreated as follows:

 - mmap(MAP_PRIVATE) a file in hugetlbfs filesystem
 - write fault in pages in the mapping
 - fallocate(FALLOC_FL_PUNCH_HOLE) some pages in the mapping
 - write fault in pages in the hole

This will result in negative huge page reserve counts and negative
subpool usage counts for the hugetlbfs.  Note that this can also be
recreated with ftruncate, but fallocate is more straight forward.

This patch modifies the routines vma_needs_reserves and vma_has_reserves
to examine the reserve map associated with private mappings similar to
that for shared mappings.  However, the reserve map semantics for
private and shared mappings are very different.  This results in subtly
different code that is explained in the comments.

Link: http://lkml.kernel.org/r/1464720957-15698-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/hugetlb.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

(limited to 'mm/hugetlb.c')

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d26162e81fea..388c2bb9b55c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -832,8 +832,27 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	 * Only the process that called mmap() has reserves for
 	 * private mappings.
 	 */
-	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
-		return true;
+	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+		/*
+		 * Like the shared case above, a hole punch or truncate
+		 * could have been performed on the private mapping.
+		 * Examine the value of chg to determine if reserves
+		 * actually exist or were previously consumed.
+		 * Very Subtle - The value of chg comes from a previous
+		 * call to vma_needs_reserves().  The reserve map for
+		 * private mappings has different (opposite) semantics
+		 * than that of shared mappings.  vma_needs_reserves()
+		 * has already taken this difference in semantics into
+		 * account.  Therefore, the meaning of chg is the same
+		 * as in the shared case above.  Code could easily be
+		 * combined, but keeping it separate draws attention to
+		 * subtle differences.
+		 */
+		if (chg)
+			return false;
+		else
+			return true;
+	}
 
 	return false;
 }
@@ -1816,6 +1835,25 @@ static long __vma_reservation_common(struct hstate *h,
 
 	if (vma->vm_flags & VM_MAYSHARE)
 		return ret;
+	else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && ret >= 0) {
+		/*
+		 * In most cases, reserves always exist for private mappings.
+		 * However, a file associated with mapping could have been
+		 * hole punched or truncated after reserves were consumed.
+		 * As subsequent fault on such a range will not use reserves.
+		 * Subtle - The reserve map for private mappings has the
+		 * opposite meaning than that of shared mappings.  If NO
+		 * entry is in the reserve map, it means a reservation exists.
+		 * If an entry exists in the reserve map, it means the
+		 * reservation has already been consumed.  As a result, the
+		 * return value of this routine is the opposite of the
+		 * value returned from reserve map manipulation routines above.
+		 */
+		if (ret)
+			return 0;
+		else
+			return 1;
+	}
 	else
 		return ret < 0 ? ret : 0;
 }
-- 
cgit 


From c17b1f42594eb71b8d3eb5a6dfc907a7eb88a51d Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 24 Jun 2016 14:49:51 -0700
Subject: hugetlb: fix nr_pmds accounting with shared page tables

We account HugeTLB's shared page table to all processes who share it.
The accounting happens during huge_pmd_share().

If somebody populates pud entry under us, we should decrease pagetable's
refcount and decrease nr_pmds of the process.

By mistake, I increase nr_pmds again in this case.  :-/ It will lead to
"BUG: non-zero nr_pmds on freeing mm: 2" on process' exit.

Let's fix this by increasing nr_pmds only when we're sure that the page
table will be used.

Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name
Fixes: dc6c9a35b66b ("mm: account pmd page tables to the process")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: zhongjiang <zhongjiang@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/hugetlb.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

(limited to 'mm/hugetlb.c')

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 388c2bb9b55c..ee0302169959 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4228,7 +4228,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		if (saddr) {
 			spte = huge_pte_offset(svma->vm_mm, saddr);
 			if (spte) {
-				mm_inc_nr_pmds(mm);
 				get_page(virt_to_page(spte));
 				break;
 			}
@@ -4243,9 +4242,9 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (pud_none(*pud)) {
 		pud_populate(mm, pud,
 				(pmd_t *)((unsigned long)spte & PAGE_MASK));
+		mm_inc_nr_pmds(mm);
 	} else {
 		put_page(virt_to_page(spte));
-		mm_inc_nr_pmds(mm);
 	}
 	spin_unlock(ptl);
 out:
-- 
cgit 


From c8cc708a340cc7c5445565079fd4d1c28898d7a2 Mon Sep 17 00:00:00 2001
From: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Date: Fri, 24 Jun 2016 14:50:04 -0700
Subject: mm/hugetlb: clear compound_mapcount when freeing gigantic pages

While working on s390 support for gigantic hugepages I ran into the
following "Bad page state" warning when freeing gigantic pages:

  BUG: Bad page state in process bash  pfn:580001
  page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0
  flags: 0x7fffc0000000000()
  page dumped because: non-NULL mapping

This is because page->compound_mapcount, which is part of a union with
page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
and not cleared again during destroy_compound_gigantic_page().  Fix this
by clearing the compound_mapcount in destroy_compound_gigantic_page()
before clearing compound_head.

Interestingly enough, the warning will not show up on x86_64, although
this should not be architecture specific.  Apparently there is an
endianness issue, combined with the fact that the union contains both a
64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
members.  The resulting bogus page->mapping on x86_64 therefore contains
00000000ffffffff instead of ffffffff00000000 on s390, which will falsely
trigger the PageAnon() check in free_pages_prepare() because
page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
like x86_64 in this case (the page is not compound anymore,
->compound_head was already cleared before).  As a result, page->mapping
will be cleared before doing the checks in free_pages_check().

Not sure if the bogus "PageAnon() returning true" on x86_64 for the
first tail page of a gigantic page (at this stage) has other theoretical
implications, but they would also be fixed with this patch.

Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/hugetlb.c | 1 +
 1 file changed, 1 insertion(+)

(limited to 'mm/hugetlb.c')

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ee0302169959..c1f3c0be150a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1030,6 +1030,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	int nr_pages = 1 << order;
 	struct page *p = page + 1;
 
+	atomic_set(compound_mapcount_ptr(page), 0);
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		clear_compound_head(p);
 		set_page_refcounted(p);
-- 
cgit