2021-06-24mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing arrayRasmus Villemoes
In the event that somebody calls this with an already fully populated page_array, the last loop iteration would access beyond the end of page_array. It's of course extremely unlikely that would ever be done, but it triggers my internal static analyzer. Also, if it really is not supposed to be invoked this way (i.e., with no NULL entries in page_array), the nr_populated < nr_pages check could simply be removed instead. Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
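For illustration, a minimal sketch of the reordered skip loop (abbreviated; variable names follow the commit text, surrounding allocator code omitted) — the bound test must come before the array dereference so that && short-circuits the out-of-bounds access:

	/* Skip already-populated slots; check the index bound first so the
	 * final iteration never reads past the end of page_array.
	 */
	while (page_array && nr_populated < nr_pages && page_array[nr_populated])
		nr_populated++;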
2021-06-24mm/hwpoison: do not lock page again when me_huge_page() successfully recoversNaoya Horiguchi
Currently me_huge_page() temporarily unlocks the page to perform some actions, then locks it again later. My testcase (which calls hard-offline on some tail page in a hugetlb, then accesses the address of the hugetlb range) showed that the page allocation code detects this page lock on a buddy page and prints out a "BUG: Bad page state" message. check_new_page_bad() does not consider a page with __PG_HWPOISON as a bad page, so this flag works as a kind of filter, but the filtering doesn't work in this case because the "bad page" is not the actual hwpoisoned page. So stop locking the page again. Actions to be taken depend on the page type of the error, so page unlocking should be done in the ->action() callbacks. Make that the assumed contract and change all existing callbacks accordingly. Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm,hwpoison: return -EHWPOISON to denote that the page has already been poisonedAili Yao
When memory_failure() is called with MF_ACTION_REQUIRED on a page that has already been hwpoisoned, memory_failure() can fail to send SIGBUS to the affected process, which results in an infinite loop of MCEs. Currently memory_failure() returns 0 if it's called for an already hwpoisoned page, so the caller, kill_me_maybe(), could return without sending SIGBUS to the current process. An action required MCE is raised when the current process accesses the broken memory, so no SIGBUS means that the current process continues to run and accesses the error page again soon, running into the MCE loop. This issue can arise for example in the following scenarios:

- Two or more threads access the poisoned page concurrently. If local MCE is enabled, the MCE handler handles the MCE events independently. So there's a race among MCE events, and the second and later threads fall into the situation in question.

- If there was a preceding memory error event and memory_failure() for that event failed to unmap the error page for some reason, the subsequent memory access to the error page triggers the MCE loop situation.

To fix the issue, make memory_failure() return an error code when the error page has already been hwpoisoned. This allows the memory error handler to control how it sends signals to userspace. And make sure that any process touching a hwpoisoned page gets a SIGBUS even in the "already hwpoisoned" path of memory_failure(), as is done in the page fault path. Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com Signed-off-by: Aili Yao <yaoaili@kingsoft.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
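A hedged sketch of the idea inside memory_failure() (simplified; the error-message text and the unlock_mutex label are illustrative, and the companion patches in this series handle sending SIGBUS from this path):

	if (TestSetPageHWPoison(p)) {
		/* Page was already poisoned: report that to the caller
		 * instead of returning 0, so e.g. kill_me_maybe() can still
		 * make sure the current process gets a SIGBUS.
		 */
		pr_err("Memory failure: %#lx: already hardware poisoned\n", pfn);
		res = -EHWPOISON;
		goto unlock_mutex;
	}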
2021-06-24mm/memory-failure: use a mutex to avoid memory_failure() racesTony Luck
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5. I wrote this patchset to materialize what I think is the currently allowable solution mentioned in the previous discussion [1]. I simply borrowed Tony's mutex patch and Aili's return code patch, then queued another one to find the error virtual address in a best-effort manner. I know that this is not a perfect solution, but it should work for some typical cases. [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/ This patch (of 2): There can be races when multiple CPUs consume poison from the same page. The first into memory_failure() atomically sets the HWPoison page flag and begins hunting for tasks that map this page. Eventually it invalidates those mappings and may send a SIGBUS to the affected tasks. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. [akpm@linux-foundation.org: make mf_mutex local to memory_failure()] Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
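A minimal sketch of the shape of the change, with the mutex made local to memory_failure() as noted in the akpm remark (all of the existing handling is elided):

	int memory_failure(unsigned long pfn, int flags)
	{
		static DEFINE_MUTEX(mf_mutex);	/* serializes CPUs consuming the same poison */
		int res;

		mutex_lock(&mf_mutex);
		/* ... existing handling: set HWPoison, unmap the page, signal tasks ... */
		res = 0;	/* placeholder for the real result */
		mutex_unlock(&mf_mutex);
		return res;
	}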
2021-06-24mm, futex: fix shared futex pgoff on shmem huge pageHugh Dickins
If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when generating futex_key") went in, the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages. page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on a hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling the hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on the head. Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it. [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope] Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Reported-by: Neel Natu <neelnatu@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Zhang Yi <wetpzy@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
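A sketch of the fixed helper in pagemap.h, assuming the surrounding helpers keep their existing names (page_to_index() handles the regular compound-page case):

	extern pgoff_t hugetlb_basepage_index(struct page *page);

	static inline pgoff_t page_to_pgoff(struct page *page)
	{
		/* hugetlbfs pages, head or tail, need the hugetlb-specific index */
		if (unlikely(PageHuge(page)))
			return hugetlb_basepage_index(page);

		/* base pages and other compound pages (e.g. shmem THP) */
		return page_to_index(page);
	}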
2021-06-24kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()Petr Mladek
The system might hang with the following backtrace:

  schedule+0x80/0x100
  schedule_timeout+0x48/0x138
  wait_for_common+0xa4/0x134
  wait_for_completion+0x1c/0x2c
  kthread_flush_work+0x114/0x1cc
  kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
  kthread_cancel_delayed_work_sync+0x18/0x2c
  xxxx_pm_notify+0xb0/0xd8
  blocking_notifier_call_chain_robust+0x80/0x194
  pm_notifier_call_chain_robust+0x28/0x4c
  suspend_prepare+0x40/0x260
  enter_state+0x80/0x3f4
  pm_suspend+0x60/0xdc
  state_store+0x108/0x144
  kobj_attr_store+0x38/0x88
  sysfs_kf_write+0x64/0xc0
  kernfs_fop_write_iter+0x108/0x1d0
  vfs_write+0x2f4/0x368
  ksys_write+0x7c/0xec

It is caused by the following race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync():

CPU0                                    CPU1

Context: Thread A                       Context: Thread B

kthread_mod_delayed_work()
  spin_lock()
  __kthread_cancel_work()
    spin_unlock()
    del_timer_sync()
                                        kthread_cancel_delayed_work_sync()
                                          spin_lock()
                                          __kthread_cancel_work()
                                            spin_unlock()
                                            del_timer_sync()
                                            spin_lock()
                                          work->canceling++
                                          spin_unlock()
  spin_lock()
  queue_delayed_work()
    // dwork is put into the worker->delayed_work_list
  spin_unlock()
                                        kthread_flush_work()
                                          // flush_work is put at the tail of the dwork
                                        wait_for_completion()

Context: IRQ

  kthread_delayed_work_timer_fn()
    spin_lock()
    list_del_init(&work->node);
    spin_unlock()

BANG: flush_work is no longer linked and will never get processed.

The problem is that kthread_mod_delayed_work() checks the work->canceling flag before canceling the timer. A simple solution is to (re)check work->canceling after __kthread_cancel_work(). But then it is not clear what should be returned when __kthread_cancel_work() removed the work from the queue (list) and it can't queue it again with the new @delay. The return value might be used for reference counting. The caller has to know whether a new work has been queued or an existing one was replaced.

The proper solution is that kthread_mod_delayed_work() removes the work from the queue (list) _only_ when work->canceling is not set. The flag must be checked after the timer is stopped, and the remaining operations can be done under worker->lock.

Note that kthread_mod_delayed_work() could remove the timer and then bail out. It is fine. The other canceling caller needs to cancel the timer as well. The important thing is that the queue (list) manipulation is done atomically under worker->lock.

Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com Fixes: 9a6b06c8d9a220860468a ("kthread: allow to modify delayed kthread work") Signed-off-by: Petr Mladek <pmladek@suse.com> Reported-by: Martin Liu <liumartin@google.com> Cc: <jenhaochen@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24kthread_worker: split code for canceling the delayed work timerPetr Mladek
Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()". This patchset fixes the race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync() including proper return value handling. This patch (of 2): Simple code refactoring as a preparation step for fixing a race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(). It does not modify the existing behavior. Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: <jenhaochen@google.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/vmalloc: unbreak kasan vmalloc supportDaniel Axtens
In commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"), __vmalloc_node_range was changed such that __get_vm_area_node was no longer called with the requested/real size of the vmalloc allocation, but rather with a rounded-up size. This means that __get_vm_area_node called kasan_unpoison_vmalloc() with a rounded-up size rather than the real size. This led to it allowing access to too much memory and so missing vmalloc OOBs and failing the kasan kunit tests. Pass the real size and the desired shift into __get_vm_area_node. This allows it to round up the size for the underlying allocators while still unpoisoning the correct quantity of shadow memory. Adjust the other call-sites to pass in PAGE_SHIFT for the shift value. Link: https://lkml.kernel.org/r/20210617081330.98629-1-dja@axtens.net Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335 Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Daniel Axtens <dja@axtens.net> Tested-by: David Gow <davidgow@google.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24KVM: s390: prepare for hugepage vmallocClaudio Imbrenda
The Create Secure Configuration Ultravisor Call does not support using large pages for the virtual memory area. This is a hardware limitation. This patch replaces the vzalloc call with an almost equivalent call to the newly introduced vmalloc_no_huge function, which guarantees that only small pages will be used for the backing. The new call will not clear the allocated memory, but that has never been an actual requirement. Link: https://lkml.kernel.org/r/20210614132357.10202-3-imbrenda@linux.ibm.com Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/vmalloc: add vmalloc_no_hugeClaudio Imbrenda
Patch series "mm: add vmalloc_no_huge and use it", v4. Add vmalloc_no_huge() and export it, so modules can allocate memory with small pages. Use the newly added vmalloc_no_huge() in KVM on s390 to get around a hardware limitation. This patch (of 2): Commit 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") added support for hugepage vmalloc mappings; it also added the flag VM_NO_HUGE_VMAP for __vmalloc_node_range to request that the allocation be performed with 0-order, non-huge pages. This flag is not accessible when calling vmalloc; the only option is to call __vmalloc_node_range directly, which is not exported. This means that a module can't vmalloc memory with small pages. Case in point: KVM on s390x needs to vmalloc a large area, and it needs to be mapped with non-huge pages, because of a hardware limitation. This patch adds the function vmalloc_no_huge, which works like vmalloc, but is guaranteed to always back the mapping using small pages. This new function is exported, therefore it is usable by modules. [akpm@linux-foundation.org: whitespace fixes, per Christoph] Link: https://lkml.kernel.org/r/20210614132357.10202-1-imbrenda@linux.ibm.com Link: https://lkml.kernel.org/r/20210614132357.10202-2-imbrenda@linux.ibm.com Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
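A sketch of what such a wrapper looks like (the __vmalloc_node_range() argument list shown here is an assumption based on the era of this series; the essential point is passing VM_NO_HUGE_VMAP):

	/* vmalloc() semantics, but guaranteed to be backed by 0-order pages. */
	void *vmalloc_no_huge(unsigned long size)
	{
		return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
					    GFP_KERNEL, PAGE_KERNEL, VM_NO_HUGE_VMAP,
					    NUMA_NO_NODE, __builtin_return_address(0));
	}
	EXPORT_SYMBOL(vmalloc_no_huge);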
2021-06-24nilfs2: fix memory leak in nilfs_sysfs_delete_device_groupPavel Skripkin
My local syzbot instance hit a memory leak in nilfs2. The problem was a missing kobject_put() in nilfs_sysfs_delete_device_group(). kobject_del() does not call kobject_cleanup() for the passed kobject, and it leads to leaking the duped kobject name if kobject_put() was not called.

Fail log:

BUG: memory leak
unreferenced object 0xffff8880596171e0 (size 8):
  comm "syz-executor379", pid 8381, jiffies 4294980258 (age 21.100s)
  hex dump (first 8 bytes):
    6c 6f 6f 70 30 00 00 00    loop0...
  backtrace:
    kstrdup+0x36/0x70 mm/util.c:60
    kstrdup_const+0x53/0x80 mm/util.c:83
    kvasprintf_const+0x108/0x190 lib/kasprintf.c:48
    kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
    kobject_add_varg lib/kobject.c:384 [inline]
    kobject_init_and_add+0xc9/0x160 lib/kobject.c:473
    nilfs_sysfs_create_device_group+0x150/0x800 fs/nilfs2/sysfs.c:999
    init_nilfs+0xe26/0x12b0 fs/nilfs2/the_nilfs.c:637

Link: https://lkml.kernel.org/r/20210612140559.20022-1-paskripkin@gmail.com Fixes: da7141fb78db ("nilfs2: add /sys/fs/nilfs2/<device> group") Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
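The kobject rule being applied, as a short sketch (kobj stands in for the nilfs device-group kobject):

	/*
	 * kobject_del() only removes the object from sysfs; the duplicated
	 * name is freed by kobject_cleanup(), which runs when the last
	 * reference goes away, so the del must be paired with a put.
	 */
	kobject_del(kobj);
	kobject_put(kobj);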
2021-06-24mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()Hugh Dickins
Aha! Shouldn't that quick scan over pte_none()s make sure that it holds ptlock in the PVMW_SYNC case? That too might have been responsible for BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though I've never seen any. Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/ Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Wang Yugui <wangyugui@e16-tech.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/thp: fix page_vma_mapped_walk() if THP mapped by ptesHugh Dickins
Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory). Crash dumps showed two tail pages of a shmem huge page remained mapped by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end of a long unmapped range; and no page table had yet been allocated for the head of the huge page to be mapped into. Although designed to handle these odd misaligned huge-page-mapped-by-pte cases, page_vma_mapped_walk() falls short by returning false prematurely when !pmd_present or !pud_present or !p4d_present or !pgd_present: there are cases when a huge page may span the boundary, with ptes present in the next. Restructure page_vma_mapped_walk() as a loop to continue in these cases, while keeping its layout much as before. Add a step_forward() helper to advance pvmw->address across those boundaries: originally I tried to use mm's standard p?d_addr_end() macros, but hit the same crash 512 times less often: because of the way redundant levels are folded together, but folded differently in different configurations, it was just too difficult to use them correctly; and step_forward() is simpler anyway. Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
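A sketch of what a step_forward() helper can look like: round the walk address up to the next boundary of the given size, clamping to avoid wrap-around (illustrative, assuming pvmw->address is the field being advanced):

	static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
	{
		/* Advance to the next size-aligned boundary... */
		pvmw->address = (pvmw->address + size) & ~(size - 1);
		/* ...and don't wrap back to 0 at the top of the address space. */
		if (!pvmw->address)
			pvmw->address = ULONG_MAX;
	}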
2021-06-24mm: page_vma_mapped_walk(): get vma_address_end() earlierHugh Dickins
page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start, rather than later at next_pte. It's a little unnecessary overhead on the first call, but makes for a simpler loop in the following commit. Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): use goto instead of while (1)Hugh Dickins
page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte, and use "goto this_pte", in place of the "while (1)" loop at the end. Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): add a level of indentationHugh Dickins
page_vma_mapped_walk() cleanup: add a level of indentation to much of the body, making no functional change in this commit, but reducing the later diff when this is all converted to a loop. [hughd@google.com: page_vma_mapped_walk(): add a level of indentation fix] Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): crossing page table boundaryHugh Dickins
page_vma_mapped_walk() cleanup: adjust the test for crossing page table boundary - I believe pvmw->address is always page-aligned, but nothing else here assumed that; and remember to reset pvmw->pte to NULL after unmapping the page table, though I never saw any bug from that. Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION blockHugh Dickins
page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to follow the same "return not_found, return not_found, return true" pattern as the block above it (note: returning not_found there is never premature, since existence or prior existence of huge pmd guarantees good alignment). Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): use pmde for *pvmw->pmdHugh Dickins
page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then use it in subsequent tests, instead of repeatedly dereferencing pointer. Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): settle PageHuge on entryHugh Dickins
page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of the way at the start, so no need to worry about it later. Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm: page_vma_mapped_walk(): use page for pvmw->pageHugh Dickins
Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes". I've marked all of these for stable: many are merely cleanups, but I think they are much better before the main fix than after. This patch (of 11): page_vma_mapped_walk() cleanup: sometimes the local copy of pvmw->page was used, sometimes pvmw->page itself: use the local copy "page" throughout. Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24ata: rb532_cf: remove redundant codesgushengxian
The call 'dev_err(&pdev->dev, "no IRQ resource found\n");' is redundant because platform_get_irq() already prints an error. Signed-off-by: gushengxian <gushengxian@yulong.com> Link: https://lore.kernel.org/r/20210622115507.359017-1-13145886936@163.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
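The pattern in question, sketched (the error value is the driver's choice; the point is that platform_get_irq() already logs on failure, so the extra dev_err() adds nothing):

	irq = platform_get_irq(pdev, 0);
	if (irq <= 0)
		return -ENOENT;	/* no dev_err() needed: platform_get_irq() already printed one */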
2021-06-24blk: Fix lock inversion between ioc lock and bfqd lockJan Kara
Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

bfqd -> ioc:
 put_io_context+0x33/0x90 -> ioc->lock grabbed
 blk_mq_free_request+0x51/0x140
 blk_put_request+0xe/0x10
 blk_attempt_req_merge+0x1d/0x30
 elv_attempt_insert_merge+0x56/0xa0
 blk_mq_sched_try_insert_merge+0x4b/0x60
 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
 blk_mq_sched_insert_requests+0xd6/0x2b0
 blk_mq_flush_plug_list+0x154/0x280
 blk_finish_plug+0x40/0x60
 ext4_writepages+0x696/0x1320
 do_writepages+0x1c/0x80
 __filemap_fdatawrite_range+0xd7/0x120
 sync_file_range+0xac/0xf0

ioc -> bfqd:
 bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
 put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
 exit_io_context+0x48/0x50
 do_exit+0x7e9/0xdd0
 do_group_exit+0x54/0xc0

To avoid this inversion we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that up to the caller, similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock.

Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24bfq: Remove merged request already in bfq_requests_merged()Jan Kara
Currently, bfq does very little in bfq_requests_merged() and handles all the request cleanup in bfq_finish_requeue_request() called from blk_mq_free_request(). That is currently safe only because blk_mq_free_request() is called shortly after bfq_requests_merged() while bfqd->lock is still held. However, to fix a lock inversion between bfqd->lock and ioc->lock, we need to call blk_mq_free_request() after dropping bfqd->lock. That would mean that an already merged request could be seen by other processes inside bfq queues and possibly dispatched to the device, which is wrong. So move the cleanup of the request from bfq_finish_requeue_request() to bfq_requests_merged(). Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24net: bcmgenet: Add mdio-bcm-unimac soft dependencyJian-Hong Pan
The Broadcom UniMAC MDIO bus from the mdio-bcm-unimac module comes too late. So, GENET cannot find the ethernet PHY on the UniMAC MDIO bus. This leads GENET to fail to attach the PHY, as in the following log:

bcmgenet fd580000.ethernet: GENET 5.0 EPHY: 0x0000
...
could not attach to PHY
bcmgenet fd580000.ethernet eth0: failed to connect to PHY
uart-pl011 fe201000.serial: no DMA platform data
libphy: bcmgenet MII bus: probed
...
unimac-mdio unimac-mdio.-19: Broadcom UniMAC MDIO bus

It is not just coming too late; there is also no way for the module loader to figure out the dependency between GENET and its MDIO bus driver unless we provide this MODULE_SOFTDEP hint. This patch adds the soft dependency to load the mdio-bcm-unimac module before the genet module to fix this issue.

Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=213485 Fixes: 9a4e79697009 ("net: bcmgenet: utilize generic Broadcom UniMAC MDIO controller driver") Signed-off-by: Jian-Hong Pan <jhp@endlessos.org> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
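The hint itself is a one-line addition at the tail of the genet driver; a sketch:

	/* Ask the module loader to load the UniMAC MDIO bus driver first. */
	MODULE_SOFTDEP("pre: mdio-bcm-unimac");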
2021-06-24ipv6: delete useless dst check in ip6_dst_lookup_tailzhang kai
The parameter dst always points to NULL. Signed-off-by: zhang kai <zhangkaiheb@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24net: dsa: sja1105: fix NULL pointer dereference in sja1105_reload_cbs()Vladimir Oltean
priv->cbs is an array of priv->info->num_cbs_shapers elements of type struct sja1105_cbs_entry which only get allocated if CONFIG_NET_SCH_CBS is enabled. However, sja1105_reload_cbs() is called from sja1105_static_config_reload() which in turn is called for any of the items in sja1105_reset_reasons, therefore during the normal runtime of the driver and not just from a code path which can be triggered by the tc-cbs offload. The sja1105_reload_cbs() function does not contain a check whether the priv->cbs array is NULL or not, it just assumes it isn't and proceeds to iterate through the credit-based shaper elements. This leads to a NULL pointer dereference. The solution is to return success if the priv->cbs array has not been allocated, since sja1105_reload_cbs() has nothing to do. Fixes: 4d7525085a9b ("net: dsa: sja1105: offload the Credit-Based Shaper qdisc") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
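A sketch of the early return described above (the body of the reload loop is elided; structure and field names follow the commit text):

	static int sja1105_reload_cbs(struct sja1105_private *priv)
	{
		int rc = 0, i;

		/* priv->cbs is only allocated when CONFIG_NET_SCH_CBS is enabled
		 * and a shaper has been offloaded; nothing to reload otherwise.
		 */
		if (!priv->cbs)
			return 0;

		for (i = 0; i < priv->info->num_cbs_shapers; i++) {
			/* ... re-program priv->cbs[i] into the switch ... */
		}

		return rc;
	}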
2021-06-24mlxsw: core_env: Avoid unnecessary memcpy()sIdo Schimmel
Simply get a pointer to the data in the register payload instead of copying it to a temporary buffer. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24Merge tag 'ieee802154-for-davem-2021-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpanDavid S. Miller
Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2021-06-24 An update from ieee802154 for your *net* tree. This time we only have fixes for the ieee802154 hwsim driver. Sparked by some syzkaller reports, we got a potential crash fix from Eric Dumazet and two memory leak fixes from Dongliang Mu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24gve: Fix warnings reported for DQO patchsetBailey Forrest
https://patchwork.kernel.org/project/netdevbpf/list/?series=506637&state=*

- Remove unused variable
- Use correct integer type for string formatting.
- Remove `inline` in C files

Fixes: 9c1a59a2f4bc ("gve: DQO: Add ring allocation and initialization") Fixes: a57e5de476be ("gve: DQO: Add TX path") Signed-off-by: Bailey Forrest <bcf@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24e1000e: Check the PCIm stateSasha Neftin
This completes commit def4ec6dce393e ("e1000e: PCIm function state support"). Check the PCIm state only on CSME systems; there is no point in doing this check on non-CSME systems. This patch fixes the generation of a false-positive warning: "Error in exiting dmoff" Fixes: def4ec6dce39 ("e1000e: PCIm function state support") Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24KVM: x86: rename apic_access_page_done to apic_access_memslot_enabledMaxim Levitsky
This better reflects the purpose of this variable on AMD, since on AMD the AVIC's memory slot can be enabled and disabled dynamically. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210623113002.111448-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24kvm: x86: disable the narrow guest module parameter on unloadAaron Lewis
When the kvm_intel module unloads the module parameter 'allow_smaller_maxphyaddr' is not cleared because the backing variable is defined in the kvm module. As a result, if the module parameter's state was set before kvm_intel unloads, it will also be set when it reloads. Explicitly clear the state in vmx_exit() to prevent this from happening. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20210623203426.1891402-1-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com>
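A sketch of the fix as described (the existing teardown in vmx_exit() is elided; allow_smaller_maxphyaddr is the module parameter's backing variable exported by kvm.ko):

	static void __exit vmx_exit(void)
	{
		/* ... existing module teardown ... */

		/*
		 * The backing variable lives in kvm.ko and so survives a
		 * kvm_intel unload; reset it so a reload starts clean.
		 */
		allow_smaller_maxphyaddr = false;
	}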
2021-06-24selftests: kvm: Allows userspace to handle emulation errors.Aaron Lewis
This test exercises the feature KVM_CAP_EXIT_ON_EMULATION_FAILURE. When enabled, errors in the in-kernel instruction emulator are forwarded to userspace with the instruction bytes stored in the exit struct for KVM_EXIT_INTERNAL_ERROR. So, when the guest executes an 'flds' instruction, which KVM isn't able to emulate, instead of failing, KVM sends the instruction to userspace to handle. For this test to work properly the module parameter 'allow_smaller_maxphyaddr' has to be set. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20210510144834.658457-3-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24kvm: x86: Allow userspace to handle emulation errorsAaron Lewis
Add a fallback mechanism to the in-kernel instruction emulator that allows userspace the opportunity to process an instruction the emulator was unable to. When the in-kernel instruction emulator fails to process an instruction it will either inject a #UD into the guest or exit to userspace with exit reason KVM_INTERNAL_ERROR. This is because it does not know how to proceed in an appropriate manner. This feature lets userspace get involved to see if it can figure out a better path forward. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210510144834.658457-2-aaronlewis@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is onSean Christopherson
Let the guest use 1g hugepages if TDP is enabled and the host supports GBPAGES; KVM can't actively prevent the guest from using 1g pages in this case since they can't be disabled in the hardware page walker. While injecting a page fault if a bogus 1g page is encountered during a software page walk is perfectly reasonable since KVM is simply honoring userspace's vCPU model, doing so arguably doesn't provide any meaningful value, and at worst will be horribly confusing as the guest will see inconsistent behavior and seemingly spurious page faults. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-55-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page faultSean Christopherson
Use the current MMU instead of vCPU state to query CR4.SMEP when handling a page fault. In the nested NPT case, the current CR4.SMEP reflects L2, whereas the page fault is shadowing L1's NPT, which uses L1's hCR4. Practically speaking, this is a nop as NPT walks are always user faults, i.e. this code will never be reached, but fix it up for consistency. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-54-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page faultSean Christopherson
Use the current MMU instead of vCPU state to query CR0.WP when handling a page fault. In the nested NPT case, the current CR0.WP reflects L2, whereas the page fault is shadowing L1's NPT. Practically speaking, this is a nop as NPT walks are always user faults, but fix it up for consistency. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-53-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPTSean Christopherson
Drop the extra reset of shadow_zero_bits in the nested NPT flow now that shadow_mmu_init_context computes the correct level for nested NPT. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-52-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logicSean Christopherson
Drop the pre-computed last_nonleaf_level, which is arguably wrong and at best confusing. Per the comment: Can have large pages at levels 2..last_nonleaf_level-1. the intent of the variable would appear to be to track what levels can _legally_ have large pages, but that intent doesn't align with reality. The computed value will be wrong for 5-level paging, or if 1gb pages are not supported. The flawed code is not a problem in practice, because except for 32-bit PSE paging, bit 7 is reserved if large pages aren't supported at the level. Take advantage of this invariant and simply omit the level magic math for 64-bit page tables (including PAE). For 32-bit paging (non-PAE), the adjustments are needed purely because bit 7 is ignored if PSE=0. Retain that logic as is, but make is_last_gpte() unique per PTTYPE so that the PSE check is avoided for PAE and EPT paging. In the spirit of avoiding branches, bump the "last nonleaf level" for 32-bit PSE paging by adding the PSE bit itself. Note, bit 7 is ignored or has other meaning in CR3/EPTP, but despite FNAME(walk_addr_generic) briefly grabbing CR3/EPTP in "pte", they are not PTEs and will blow up all the other gpte helpers. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-51-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86: Enhance comments for MMU roles and nested transition trickinessSean Christopherson
Expand the comments for the MMU roles. The interactions with gfn_track PGD reuse in particular are hairy. Regarding PGD reuse, add comments in the nested virtualization flows to call out why kvm_init_mmu() is unconditionally called even when nested TDP is used. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-50-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTESean Christopherson
Replace make_spte()'s WARN on a collision with the magic MMIO value with a generic WARN on reserved bits being set (including EPT's reserved WX combination). Warning on any reserved bits covers MMIO, A/D tracking bits with PAE paging, and in theory any future goofs that are introduced. Opportunistically convert to ONCE behavior to avoid spamming the kernel log, odds are very good that if KVM screws up one SPTE, it will botch all SPTEs for the same MMU. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-49-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMUSean Christopherson
Extract the reserved SPTE check and print helpers in get_mmio_spte() to new helpers so that KVM can also WARN on reserved badness when making a SPTE. Tag the checking helper with __always_inline to improve the probability of the compiler generating optimal code for the checking loop, e.g. gcc appears to avoid using %rbp when the helper is tagged with a vanilla "inline". No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-48-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Use MMU's role to determine PTTYPESean Christopherson
Use the MMU's role instead of vCPU state or role_regs to determine the PTTYPE, i.e. which helpers to wire up. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-47-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpersSean Christopherson
Skip paging32E_init_context() and paging64_init_context_common() and go directly to paging64_init_context() (was the common version) now that the relevant flows don't need to distinguish between 64-bit PAE and 32-bit PAE for other reasons. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-46-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Add a helper to calculate root from role_regsSean Christopherson
Add a helper to calculate the level for non-EPT page tables from the MMU's role_regs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-45-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Add helper to update paging metadataSean Christopherson
Consolidate MMU guest metadata updates into a common helper for TDP, shadow, and nested MMUs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-44-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0Sean Christopherson
Don't bother updating the bitmasks and last-leaf information if paging is disabled as the metadata will never be used. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-43-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() callsSean Christopherson
Move calls to reset_rsvds_bits_mask() out of the various mode statements and under a more generic CR0.PG=1 check. This will allow for additional code consolidation in the future. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-42-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helperSean Christopherson
Get LA57 from the role_regs, which are initialized from the vCPU even though TDP is enabled, instead of pulling the value directly from the vCPU when computing the guest's root_level for TDP MMUs. Note, the check is inside an is_long_mode() statement, so that requirement is not lost. Use role_regs even though the MMU's role is available and arguably "better". A future commit will consolidate the guest root level logic, and it needs access to EFER.LMA, which is not tracked in the role (it can't be toggled on VM-Exit, unlike LA57). Drop is_la57_mode() as there are no remaining users, and to discourage pulling MMU state from the vCPU (in the future). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-41-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>