summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2022-09-26mm/swap: add swp_offset_pfn() to fetch PFN from swap entryPeter Xu
We've got a bunch of special swap entries that stores PFN inside the swap offset fields. To fetch the PFN, normally the user just calls swp_offset() assuming that'll be the PFN. Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the max possible length of a PFN on the host, meanwhile doing proper check with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry(). One reason to do so is we never tried to sanitize whether swap offset can really fit for storing PFN. At the meantime, this patch also prepares us with the future possibility to store more information inside the swp offset field, so assuming "swp_offset(entry)" to be the PFN will not stand any more very soon. Replace many of the swp_offset() callers to use swp_offset_pfn() where proper. Note that many of the existing users are not candidates for the replacement, e.g.: (1) When the swap entry is not a pfn swap entry at all, or, (2) when we wanna keep the whole swp_offset but only change the swp type. For the latter, it can happen when fork() triggered on a write-migration swap entry pte, we may want to only change the migration type from write->read but keep the rest, so it's not "fetching PFN" but "changing swap type only". They're left aside so that when there're more information within the swp offset they'll be carried over naturally in those cases. Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what the new swp_offset_pfn() is about. Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26mm/x86: use SWP_TYPE_BITS in 3-level swap macrosPeter Xu
Patch series "mm: Remember a/d bits for migration entries", v4. Problem ======= When migrating a page, right now we always mark the migrated page as old & clean. However that could lead to at least two problems: (1) We lost the real hot/cold information while we could have persisted. That information shouldn't change even if the backing page is changed after the migration, (2) There can be always extra overhead on the immediate next access to any migrated page, because hardware MMU needs cycles to set the young bit again for reads, and dirty bits for write, as long as the hardware MMU supports these bits. Many of the recent upstream works showed that (2) is not something trivial and actually very measurable. In my test case, reading 1G chunk of memory - jumping in page size intervals - could take 99ms just because of the extra setting on the young bit on a generic x86_64 system, comparing to 4ms if young set. This issue is originally reported by Andrea Arcangeli. Solution ======== To solve this problem, this patchset tries to remember the young/dirty bits in the migration entries and carry them over when recovering the ptes. We have the chance to do so because in many systems the swap offset is not really fully used. Migration entries use swp offset to store PFN only, while the PFN is normally not as large as swp offset and normally smaller. It means we do have some free bits in swp offset that we can use to store things like A/D bits, and that's how this series tried to approach this problem. max_swapfile_size() is used here to detect per-arch offset length in swp entries. We'll automatically remember the A/D bits when we find that we have enough swp offset field to keep both the PFN and the extra bits. Since max_swapfile_size() can be slow, the last two patches cache the results for it and also swap_migration_ad_supported as a whole. Known Issues / TODOs ==================== We still haven't taught madvise() to recognize the new A/D bits in migration entries, namely MADV_COLD/MADV_FREE. E.g. when MADV_COLD upon a migration entry. It's not clear yet on whether we should clear the A bit, or we should just drop the entry directly. We didn't teach idle page tracking on the new migration entries, because it'll need larger rework on the tree on rmap pgtable walk. However it should make it already better because before this patchset page will be old page after migration, so the series will fix potential false negative of idle page tracking when pages were migrated before observing. The other thing is migration A/D bits will not start to working for private device swap entries. The code is there for completeness but since private device swap entries do not yet have fields to store A/D bits, even if we'll persistent A/D across present pte switching to migration entry, we'll lose it again when the migration entry converted to private device swap entry. Tests ===== After the patchset applied, the immediate read access test [1] of above 1G chunk after migration can shrink from 99ms to 4ms. The test is done by moving 1G pages from node 0->1->0 then read it in page size jumps. The test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. Similar effect can also be measured when writting the memory the 1st time after migration. After applying the patchset, both initial immediate read/write after page migrated will perform similarly like before migration happened. Patch Layout ============ Patch 1-2: Cleanups from either previous versions or on swapops.h macros. Patch 3-4: Prepare for the introduction of migration A/D bits Patch 5: The core patch to remember young/dirty bit in swap offsets. Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast. [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c This patch (of 7): Replace all the magic "5" with the macro. Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26Merge tag 'x86_urgent_for_v6.0-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Dave Hansen: - A performance fix for recent large AMD systems that avoids an ancient cpu idle hardware workaround - A new Intel model number. Folks like these upstream as soon as possible so that each developer doing feature development doesn't need to carry their own #define - SGX fixes for a userspace crash and a rare kernel warning * tag 'x86_urgent_for_v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ACPI: processor idle: Practically limit "Dummy wait" workaround to old Intel systems x86/sgx: Handle VA page allocation failure for EAUG on PF. x86/sgx: Do not fail on incomplete sanitization on premature stop of ksgxd x86/cpu: Add CPU model numbers for Meteor Lake
2022-09-26ARM: dts: integrator: Fix DMA rangesLinus Walleij
A recent change affecting the behaviour of phys_to_dma() to actually require the device tree ranges to work unmasked a bug in the Integrator DMA ranges. The PL110 uses the CMA allocator to obtain coherent allocations from a dedicated 1MB video memory, leading to the following call chain: drm_gem_cma_create() dma_alloc_attrs() dma_alloc_from_dev_coherent() __dma_alloc_from_coherent() dma_get_device_base() phys_to_dma() translate_phys_to_dma() phys_to_dma() by way of translate_phys_to_dma() will nowadays not provide 1:1 mappings unless the ranges are properly defined in the device tree and reflected into the dev->dma_range_map. There is a bug in the device trees because the DMA ranges are incorrectly specified, and the patch uncovers this bug. Solution: - Fix the LB (logic bus) ranges to be 1-to-1 like they should have always been. - Provide a 1:1 dma-ranges attribute to the PL110. - Mark the PL110 display controller as DMA coherent. This makes the DMA ranges work right and makes the PL110 framebuffer work again. Fixes: af6f23b88e95 ("ARM/dma-mapping: use the generic versions of dma_to_phys/phys_to_dma by default") Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220926073311.1610568-1-linus.walleij@linaro.org' Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-09-26Merge tag 'mm-hotfixes-stable-2022-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull last (?) hotfixes from Andrew Morton: "26 hotfixes. 8 are for issues which were introduced during this -rc cycle, 18 are for earlier issues, and are cc:stable" * tag 'mm-hotfixes-stable-2022-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (26 commits) x86/uaccess: avoid check_object_size() in copy_from_user_nmi() mm/page_isolation: fix isolate_single_pageblock() isolation behavior mm,hwpoison: check mm when killing accessing process mm/hugetlb: correct demote page offset logic mm: prevent page_frag_alloc() from corrupting the memory mm: bring back update_mmu_cache() to finish_fault() frontswap: don't call ->init if no ops are registered mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() mm: fix madivse_pageout mishandling on non-LRU page powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush mm: gup: fix the fast GUP race against THP collapse mm: fix dereferencing possible ERR_PTR vmscan: check folio_test_private(), not folio_get_private() mm: fix VM_BUG_ON in __delete_from_swap_cache() tools: fix compilation after gfp_types.h split mm/damon/dbgfs: fix memory leak when using debugfs_lookup() mm/migrate_device.c: copy pte dirty bit to page mm/migrate_device.c: add missing flush_cache_page() mm/migrate_device.c: flush TLB while holding PTL x86/mm: disable instrumentations of mm/pgprot.c ...
2022-09-26Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton
2022-09-26x86/uaccess: avoid check_object_size() in copy_from_user_nmi()Kees Cook
The check_object_size() helper under CONFIG_HARDENED_USERCOPY is designed to skip any checks where the length is known at compile time as a reasonable heuristic to avoid "likely known-good" cases. However, it can only do this when the copy_*_user() helpers are, themselves, inline too. Using find_vmap_area() requires taking a spinlock. The check_object_size() helper can call find_vmap_area() when the destination is in vmap memory. If show_regs() is called in interrupt context, it will attempt a call to copy_from_user_nmi(), which may call check_object_size() and then find_vmap_area(). If something in normal context happens to be in the middle of calling find_vmap_area() (with the spinlock held), the interrupt handler will hang forever. The copy_from_user_nmi() call is actually being called with a fixed-size length, so check_object_size() should never have been called in the first place. Given the narrow constraints, just replace the __copy_from_user_inatomic() call with an open-coded version that calls only into the sanitizers and not check_object_size(), followed by a call to raw_copy_from_user(). [akpm@linux-foundation.org: no instrument_copy_from_user() in my tree...] Link: https://lkml.kernel.org/r/20220919201648.2250764-1-keescook@chromium.org Link: https://lore.kernel.org/all/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Yu Zhao <yuzhao@google.com> Reported-by: Florian Lehner <dev@der-flo.net> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Florian Lehner <dev@der-flo.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flushYang Shi
The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will move to use RCU instead of disabling local interrupts in fast-GUP. Using an IPI is the old-styled way of serializing against fast-GUP although it still works as expected now. And fast-GUP now fixed the potential race with THP collapse by checking whether PMD is changed or not. So IPI broadcast in radix pmd collapse flush is not necessary anymore. But it is still needed for hash TLB. Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.com Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26x86/paravirt: add extra clobbers with ZERO_CALL_USED_REGS enabledBill Wendling
The ZERO_CALL_USED_REGS feature may zero out caller-saved registers before returning. In spurious_kernel_fault(), the "pte_offset_kernel()" call results in this assembly code: .Ltmp151: #APP # ALT: oldnstr .Ltmp152: .Ltmp153: .Ltmp154: .section .discard.retpoline_safe,"",@progbits .quad .Ltmp154 .text callq *pv_ops+536(%rip) .Ltmp155: .section .parainstructions,"a",@progbits .p2align 3, 0x0 .quad .Ltmp153 .byte 67 .byte .Ltmp155-.Ltmp153 .short 1 .text .Ltmp156: # ALT: padding .zero (-(((.Ltmp157-.Ltmp158)-(.Ltmp156-.Ltmp152))>0))*((.Ltmp157-.Ltmp158)-(.Ltmp156-.Ltmp152)),144 .Ltmp159: .section .altinstructions,"a",@progbits .Ltmp160: .long .Ltmp152-.Ltmp160 .Ltmp161: .long .Ltmp158-.Ltmp161 .short 33040 .byte .Ltmp159-.Ltmp152 .byte .Ltmp157-.Ltmp158 .text .section .altinstr_replacement,"ax",@progbits # ALT: replacement 1 .Ltmp158: movq %rdi, %rax .Ltmp157: .text #NO_APP .Ltmp162: testb $-128, %dil The "testb" here is using %dil, but the %rdi register was cleared before returning from "callq *pv_ops+536(%rip)". Adding the proper constraints results in the use of a different register: movq %r11, %rdi # Similar to above. testb $-128, %r11b Link: https://github.com/KSPP/linux/issues/192 Signed-off-by: Bill Wendling <morbo@google.com> Reported-and-tested-by: Nathan Chancellor <nathan@kernel.org> Fixes: 035f7f87b729 ("randstruct: Enable Clang support") Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/lkml/fa6df43b-8a1a-8ad1-0236-94d2a0b588fa@suse.com/ Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220902213750.1124421-3-morbo@google.com
2022-09-26x86/paravirt: clean up typos and grammarosBill Wendling
Drive-by clean up of the comment. [ Impact: cleanup] Signed-off-by: Bill Wendling <morbo@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220902213750.1124421-2-morbo@google.com
2022-09-26x86/entry: Work around Clang __bdos() bugKees Cook
Clang produces a false positive when building with CONFIG_FORTIFY_SOURCE=y and CONFIG_UBSAN_BOUNDS=y when operating on an array with a dynamic offset. Work around this by using a direct assignment of an empty instance. Avoids this warning: ../include/linux/fortify-string.h:309:4: warning: call to __write_overflow_field declared with 'warn ing' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wat tribute-warning] __write_overflow_field(p_size_field, size); ^ which was isolated to the memset() call in xen_load_idt(). Note that this looks very much like another bug that was worked around: https://github.com/ClangBuiltLinux/linux/issues/1592 Cc: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: xen-devel@lists.xenproject.org Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/lkml/41527d69-e8ab-3f86-ff37-6b298c01d5bc@oracle.com Signed-off-by: Kees Cook <keescook@chromium.org>
2022-09-26x86/kprobes: Remove unused arch_kprobe_override_function() declarationGaosheng Cui
All uses of arch_kprobe_override_function() have been removed by commit 540adea3809f ("error-injection: Separate error-injection from kprobe"), so remove the declaration, too. Link: https://lkml.kernel.org/r/20220914110437.1436353-3-cuigaosheng1@huawei.com Cc: <mingo@redhat.com> Cc: <tglx@linutronix.de> Cc: <bp@alien8.de> Cc: <dave.hansen@linux.intel.com> Cc: <x86@kernel.org> Cc: <hpa@zytor.com> Cc: <mhiramat@kernel.org> Cc: <peterz@infradead.org> Cc: <ast@kernel.org> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-26x86/ftrace: Remove unused modifying_ftrace_code declarationGaosheng Cui
All uses of modifying_ftrace_code have been removed by commit 768ae4406a5c ("x86/ftrace: Use text_poke()"), so remove the declaration, too. Link: https://lkml.kernel.org/r/20220914110437.1436353-2-cuigaosheng1@huawei.com Cc: <mingo@redhat.com> Cc: <tglx@linutronix.de> Cc: <bp@alien8.de> Cc: <dave.hansen@linux.intel.com> Cc: <x86@kernel.org> Cc: <hpa@zytor.com> Cc: <mhiramat@kernel.org> Cc: <peterz@infradead.org> Cc: <ast@kernel.org> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-09-26x86: Add support for CONFIG_CFI_CLANGSami Tolvanen
With CONFIG_CFI_CLANG, the compiler injects a type preamble immediately before each function and a check to validate the target function type before indirect calls: ; type preamble __cfi_function: mov <id>, %eax function: ... ; indirect call check mov -<id>,%r10d add -0x4(%r11),%r10d je .Ltmp1 ud2 .Ltmp1: call __x86_indirect_thunk_r11 Add error handling code for the ud2 traps emitted for the checks, and allow CONFIG_CFI_CLANG to be selected on x86_64. This produces the following oops on CFI failure (generated using lkdtm): [ 21.441706] CFI failure at lkdtm_indirect_call+0x16/0x20 [lkdtm] (target: lkdtm_increment_int+0x0/0x10 [lkdtm]; expected type: 0x7e0c52a) [ 21.444579] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 21.445296] CPU: 0 PID: 132 Comm: sh Not tainted 5.19.0-rc8-00020-g9f27360e674c #1 [ 21.445296] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 21.445296] RIP: 0010:lkdtm_indirect_call+0x16/0x20 [lkdtm] [ 21.445296] Code: 52 1c c0 48 c7 c1 c5 50 1c c0 e9 25 48 2a cc 0f 1f 44 00 00 49 89 fb 48 c7 c7 50 b4 1c c0 41 ba 5b ad f3 81 45 03 53 f8 [ 21.445296] RSP: 0018:ffffa9f9c02ffdc0 EFLAGS: 00000292 [ 21.445296] RAX: 0000000000000027 RBX: ffffffffc01cb300 RCX: 385cbbd2e070a700 [ 21.445296] RDX: 0000000000000000 RSI: c0000000ffffdfff RDI: ffffffffc01cb450 [ 21.445296] RBP: 0000000000000006 R08: 0000000000000000 R09: ffffffff8d081610 [ 21.445296] R10: 00000000bcc90825 R11: ffffffffc01c2fc0 R12: 0000000000000000 [ 21.445296] R13: ffffa31b827a6000 R14: 0000000000000000 R15: 0000000000000002 [ 21.445296] FS: 00007f08b42216a0(0000) GS:ffffa31b9f400000(0000) knlGS:0000000000000000 [ 21.445296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 21.445296] CR2: 0000000000c76678 CR3: 0000000001940000 CR4: 00000000000006f0 [ 21.445296] Call Trace: [ 21.445296] <TASK> [ 21.445296] lkdtm_CFI_FORWARD_PROTO+0x30/0x50 [lkdtm] [ 21.445296] direct_entry+0x12d/0x140 [lkdtm] [ 21.445296] full_proxy_write+0x5d/0xb0 [ 21.445296] vfs_write+0x144/0x460 [ 21.445296] ? __x64_sys_wait4+0x5a/0xc0 [ 21.445296] ksys_write+0x69/0xd0 [ 21.445296] do_syscall_64+0x51/0xa0 [ 21.445296] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 21.445296] RIP: 0033:0x7f08b41a6fe1 [ 21.445296] Code: be 07 00 00 00 41 89 c0 e8 7e ff ff ff 44 89 c7 89 04 24 e8 91 c6 02 00 8b 04 24 48 83 c4 68 c3 48 63 ff b8 01 00 00 03 [ 21.445296] RSP: 002b:00007ffcdf65c2e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 21.445296] RAX: ffffffffffffffda RBX: 00007f08b4221690 RCX: 00007f08b41a6fe1 [ 21.445296] RDX: 0000000000000012 RSI: 0000000000c738f0 RDI: 0000000000000001 [ 21.445296] RBP: 0000000000000001 R08: fefefefefefefeff R09: fefefefeffc5ff4e [ 21.445296] R10: 00007f08b42222b0 R11: 0000000000000246 R12: 0000000000c738f0 [ 21.445296] R13: 0000000000000012 R14: 00007ffcdf65c401 R15: 0000000000c70450 [ 21.445296] </TASK> [ 21.445296] Modules linked in: lkdtm [ 21.445296] Dumping ftrace buffer: [ 21.445296] (ftrace buffer empty) [ 21.471442] ---[ end trace 0000000000000000 ]--- [ 21.471811] RIP: 0010:lkdtm_indirect_call+0x16/0x20 [lkdtm] [ 21.472467] Code: 52 1c c0 48 c7 c1 c5 50 1c c0 e9 25 48 2a cc 0f 1f 44 00 00 49 89 fb 48 c7 c7 50 b4 1c c0 41 ba 5b ad f3 81 45 03 53 f8 [ 21.474400] RSP: 0018:ffffa9f9c02ffdc0 EFLAGS: 00000292 [ 21.474735] RAX: 0000000000000027 RBX: ffffffffc01cb300 RCX: 385cbbd2e070a700 [ 21.475664] RDX: 0000000000000000 RSI: c0000000ffffdfff RDI: ffffffffc01cb450 [ 21.476471] RBP: 0000000000000006 R08: 0000000000000000 R09: ffffffff8d081610 [ 21.477127] R10: 00000000bcc90825 R11: ffffffffc01c2fc0 R12: 0000000000000000 [ 21.477959] R13: ffffa31b827a6000 R14: 0000000000000000 R15: 0000000000000002 [ 21.478657] FS: 00007f08b42216a0(0000) GS:ffffa31b9f400000(0000) knlGS:0000000000000000 [ 21.479577] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 21.480307] CR2: 0000000000c76678 CR3: 0000000001940000 CR4: 00000000000006f0 [ 21.481460] Kernel panic - not syncing: Fatal exception Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-23-samitolvanen@google.com
2022-09-26x86/purgatory: Disable CFISami Tolvanen
Disable CONFIG_CFI_CLANG for the stand-alone purgatory.ro. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-22-samitolvanen@google.com
2022-09-26x86: Add types to indirectly called assembly functionsSami Tolvanen
With CONFIG_CFI_CLANG, assembly functions indirectly called from C code must be annotated with type identifiers to pass CFI checking. Define the __CFI_TYPE helper macro to match the compiler generated function preamble, and ensure SYM_TYPED_FUNC_START also emits ENDBR with IBT. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-21-samitolvanen@google.com
2022-09-26x86/tools/relocs: Ignore __kcfi_typeid_ relocationsSami Tolvanen
The compiler generates __kcfi_typeid_ symbols for annotating assembly functions with type information. These are constants that can be referenced in assembly code and are resolved by the linker. Ignore them in relocs. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-20-samitolvanen@google.com
2022-09-26treewide: Drop function_nocfiSami Tolvanen
With -fsanitize=kcfi, we no longer need function_nocfi() as the compiler won't change function references to point to a jump table. Remove all implementations and uses of the macro. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-14-samitolvanen@google.com
2022-09-26arm64: Drop unneeded __nocfi attributesSami Tolvanen
With -fsanitize=kcfi, CONFIG_CFI_CLANG no longer has issues with address space confusion in functions that switch to linear mapping. Now that the indirectly called assembly functions have type annotations, drop the __nocfi attributes. Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-12-samitolvanen@google.com
2022-09-26arm64: Add CFI error handlingSami Tolvanen
With -fsanitize=kcfi, CFI always traps. Add arm64 support for handling CFI failures. The registers containing the target address and the expected type are encoded in the first ten bits of the ESR as follows: - 0-4: n, where the register Xn contains the target address - 5-9: m, where the register Wm contains the type hash This produces the following oops on CFI failure (generated using lkdtm): [ 21.885179] CFI failure at lkdtm_indirect_call+0x2c/0x44 [lkdtm] (target: lkdtm_increment_int+0x0/0x1c [lkdtm]; expected type: 0x7e0c52a) [ 21.886593] Internal error: Oops - CFI: 0 [#1] PREEMPT SMP [ 21.891060] Modules linked in: lkdtm [ 21.893363] CPU: 0 PID: 151 Comm: sh Not tainted 5.19.0-rc1-00021-g852f4e48dbab #1 [ 21.895560] Hardware name: linux,dummy-virt (DT) [ 21.896543] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 21.897583] pc : lkdtm_indirect_call+0x2c/0x44 [lkdtm] [ 21.898551] lr : lkdtm_CFI_FORWARD_PROTO+0x3c/0x6c [lkdtm] [ 21.899520] sp : ffff8000083a3c50 [ 21.900191] x29: ffff8000083a3c50 x28: ffff0000027e0ec0 x27: 0000000000000000 [ 21.902453] x26: 0000000000000000 x25: ffffc2aa3d07e7b0 x24: 0000000000000002 [ 21.903736] x23: ffffc2aa3d079088 x22: ffffc2aa3d07e7b0 x21: ffff000003379000 [ 21.905062] x20: ffff8000083a3dc0 x19: 0000000000000012 x18: 0000000000000000 [ 21.906371] x17: 000000007e0c52a5 x16: 000000003ad55aca x15: ffffc2aa60d92138 [ 21.907662] x14: ffffffffffffffff x13: 2e2e2e2065707974 x12: 0000000000000018 [ 21.909775] x11: ffffc2aa62322b88 x10: ffffc2aa62322aa0 x9 : c7e305fb5195d200 [ 21.911898] x8 : ffffc2aa3d077e20 x7 : 6d20676e696c6c61 x6 : 43203a6d74646b6c [ 21.913108] x5 : ffffc2aa6266c9df x4 : ffffc2aa6266c9e1 x3 : ffff8000083a3968 [ 21.914358] x2 : 80000000fffff122 x1 : 00000000fffff122 x0 : ffffc2aa3d07e8f8 [ 21.915827] Call trace: [ 21.916375] lkdtm_indirect_call+0x2c/0x44 [lkdtm] [ 21.918060] lkdtm_CFI_FORWARD_PROTO+0x3c/0x6c [lkdtm] [ 21.919030] lkdtm_do_action+0x34/0x4c [lkdtm] [ 21.919920] direct_entry+0x170/0x1ac [lkdtm] [ 21.920772] full_proxy_write+0x84/0x104 [ 21.921759] vfs_write+0x188/0x3d8 [ 21.922387] ksys_write+0x78/0xe8 [ 21.922986] __arm64_sys_write+0x1c/0x2c [ 21.923696] invoke_syscall+0x58/0x134 [ 21.924554] el0_svc_common+0xb4/0xf4 [ 21.925603] do_el0_svc+0x2c/0xb4 [ 21.926563] el0_svc+0x2c/0x7c [ 21.927147] el0t_64_sync_handler+0x84/0xf0 [ 21.927985] el0t_64_sync+0x18c/0x190 [ 21.929133] Code: 728a54b1 72afc191 6b11021f 54000040 (d4304500) [ 21.930690] ---[ end trace 0000000000000000 ]--- [ 21.930971] Kernel panic - not syncing: Oops - CFI: Fatal exception Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-11-samitolvanen@google.com
2022-09-26arm64: Add types to indirect called assembly functionsSami Tolvanen
With CONFIG_CFI_CLANG, assembly functions indirectly called from C code must be annotated with type identifiers to pass CFI checking. Use SYM_TYPED_FUNC_START for the indirectly called functions, and ensure we emit `bti c` also with SYM_TYPED_FUNC_START. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-10-samitolvanen@google.com
2022-09-26cfi: Switch to -fsanitize=kcfiSami Tolvanen
Switch from Clang's original forward-edge control-flow integrity implementation to -fsanitize=kcfi, which is better suited for the kernel, as it doesn't require LTO, doesn't use a jump table that requires altering function references, and won't break cross-module function address equality. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-6-samitolvanen@google.com
2022-09-26cfi: Remove CONFIG_CFI_CLANG_SHADOWSami Tolvanen
In preparation to switching to -fsanitize=kcfi, remove support for the CFI module shadow that will no longer be needed. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-4-samitolvanen@google.com
2022-09-26treewide: Filter out CC_FLAGS_CFISami Tolvanen
In preparation for removing CC_FLAGS_CFI from CC_FLAGS_LTO, explicitly filter out CC_FLAGS_CFI in all the makefiles where we currently filter out CC_FLAGS_LTO. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220908215504.3686827-2-samitolvanen@google.com
2022-09-26KVM: remove KVM_REQ_UNHALTPaolo Bonzini
KVM_REQ_UNHALT is now unnecessary because it is replaced by the return value of kvm_vcpu_block/kvm_vcpu_halt. Remove it. No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20220921003201.1441511-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: mips, x86: do not rely on KVM_REQ_UNHALTPaolo Bonzini
KVM_REQ_UNHALT is a weird request that simply reports the value of kvm_arch_vcpu_runnable() on exit from kvm_vcpu_halt(). Only MIPS and x86 are looking at it, the others just clear it. Check the state of the vCPU directly so that the request is handled as a nop on all architectures. No functional change intended, except for corner cases where an event arrive immediately after a signal become pending or after another similar host-side event. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20220921003201.1441511-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: never write to memory from kvm_vcpu_check_block()Paolo Bonzini
kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore it cannot sleep. Writing to guest memory is therefore forbidden, but it can happen on AMD processors if kvm_check_nested_events() causes a vmexit. Fortunately, all events that are caught by kvm_check_nested_events() are also recognized by kvm_vcpu_has_events() through vendor callbacks such as kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so remove the call and postpone the actual processing to vcpu_block(). Opportunistically honor the return of kvm_check_nested_events(). KVM punted on the check in kvm_vcpu_running() because the only error path is if vmx_complete_nested_posted_interrupt() fails, in which case KVM exits to userspace with "internal error" i.e. the VM is likely dead anyways so it wasn't worth overloading the return of kvm_vcpu_running(). Add the check mostly so that KVM is consistent with itself; the return of the call via kvm_apic_accept_events()=>kvm_check_nested_events() that immediately follows _is_ checked. Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sean: check and handle return of kvm_check_nested_events()] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested eventsSean Christopherson
Don't snapshot pending INIT/SIPI events prior to checking nested events, architecturally there's nothing wrong with KVM processing (dropping) a SIPI that is received immediately after synthesizing a VM-Exit. Taking and consuming the snapshot makes the flow way more subtle than it needs to be, e.g. nVMX consumes/clears events that trigger VM-Exit (INIT/SIPI), and so at first glance it appears that KVM is double-dipping on pending INITs and SIPIs. But that's not the case because INIT is blocked unconditionally in VMX root mode the CPU cannot be in wait-for_SIPI after VM-Exit, i.e. the paths that truly consume the snapshot are unreachable if apic->pending_events is modified by kvm_check_nested_events(). nSVM is a similar story as GIF is cleared by the CPU on VM-Exit; INIT is blocked regardless of whether or not it was pending prior to VM-Exit. Drop the snapshot logic so that a future fix doesn't create weirdness when kvm_vcpu_running()'s call to kvm_check_nested_events() is moved to vcpu_block(). In that case, kvm_check_nested_events() will be called immediately before kvm_apic_accept_events(), which raises the obvious question of why that change doesn't break the snapshot logic. Note, there is a subtle functional change. Previously, KVM would clear pending SIPIs if and only SIPI was pending prior to VM-Exit, whereas now KVM clears pending SIPI unconditionally if INIT+SIPI are blocked. The latter is architecturally allowed, as SIPI is ignored if the CPU is not in wait-for-SIPI mode (arguably, KVM should be even more aggressive in dropping SIPIs). It is software's responsibility to ensure the SIPI is delivered, i.e. software shouldn't be firing INIT-SIPI at a CPU until it knows with 100% certaining that the target CPU isn't in VMX root mode. Furthermore, the existing code is extra weird as SIPIs that arrive after VM-Exit _are_ dropped if there also happened to be a pending SIPI before VM-Exit. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pendingSean Christopherson
Explicitly check for a pending INIT/SIPI event when emulating VMXOFF instead of blindly making an event request. There's obviously no need to evaluate events if none are pending. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-EnterSean Christopherson
Evaluate interrupts, i.e. set KVM_REQ_EVENT, if INIT or SIPI is pending when emulating nested VM-Enter. INIT is blocked while the CPU is in VMX root mode, but not in VMX non-root, i.e. becomes unblocked on VM-Enter. This bug has been masked by KVM calling ->check_nested_events() in the core run loop, but that hack will be fixed in the near future. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is setSean Christopherson
Set KVM_REQ_EVENT if INIT or SIPI is pending when the guest enables GIF. INIT in particular is blocked when GIF=0 and needs to be processed when GIF is toggled to '1'. This bug has been masked by (a) KVM calling ->check_nested_events() in the core run loop and (b) hypervisors toggling GIF from 0=>1 only when entering guest mode (L1 entering L2). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: lapic does not have to process INIT if it is blockedPaolo Bonzini
Do not return true from kvm_vcpu_has_events() if the vCPU isn' going to immediately process a pending INIT/SIPI. INIT/SIPI shouldn't be treated as wake events if they are blocked. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sean: rebase onto refactored INIT/SIPI helpers, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specificSean Christopherson
Rename kvm_apic_has_events() to kvm_apic_has_pending_init_or_sipi() so that it's more obvious that "events" really just means "INIT or SIPI". Opportunistically clean up a weirdly worded comment that referenced kvm_apic_has_events() instead of kvm_apic_accept_events(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowedSean Christopherson
Rename and invert kvm_vcpu_latch_init() to kvm_apic_init_sipi_allowed() so as to match the behavior of {interrupt,nmi,smi}_allowed(), and expose the helper so that it can be used by kvm_vcpu_has_events() to determine whether or not an INIT or SIPI is pending _and_ can be taken immediately. Opportunistically replaced usage of the "latch" terminology with "blocked" and/or "allowed", again to align with KVM's terminology used for all other event types. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Make an event request when pending an MTF nested VM-ExitSean Christopherson
Set KVM_REQ_EVENT when MTF becomes pending to ensure that KVM will run through inject_pending_event() and thus vmx_check_nested_events() prior to re-entering the guest. MTF currently works by virtue of KVM's hack that calls kvm_check_nested_events() from kvm_vcpu_running(), but that hack will be removed in the near future. Until that call is removed, the patch introduces no real functional change. Fixes: 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation") Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: make vendor code check for all nested eventsPaolo Bonzini
Interrupts, NMIs etc. sent while in guest mode are already handled properly by the *_interrupt_allowed callbacks, but other events can cause a vCPU to be runnable that are specific to guest mode. In the case of VMX there are two, the preemption timer and the monitor trap. The VMX preemption timer is already special cased via the hv_timer_pending callback, but the purpose of the callback can be easily extended to MTF or in fact any other event that can occur only in guest mode. Rename the callback and add an MTF check; kvm_arch_vcpu_runnable() now can return true if an MTF is pending, without relying on kvm_vcpu_running()'s call to kvm_check_nested_events(). Until that call is removed, however, the patch introduces no functional change. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Allow force_emulation_prefix to be written without a reloadSean Christopherson
Allow force_emulation_prefix to be written by privileged userspace without reloading KVM. The param does not have any persistent affects and is trivial to snapshot. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830231614.3580124-28-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events()Sean Christopherson
Rename inject_pending_events() to kvm_check_and_inject_events() in order to capture the fact that it handles more than just pending events, and to (mostly) align with kvm_check_nested_events(), which omits the "inject" for brevity. Add a comment above kvm_check_and_inject_events() to provide a high-level synopsis, and to document a virtualization hole (KVM erratum if you will) that exists due to KVM not strictly tracking instruction boundaries with respect to coincident instruction restarts and asynchronous events. No functional change inteded. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-25-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behaviorSean Christopherson
Document the oddities of ICEBP interception (trap-like #DB is intercepted as a fault-like exception), and how using VMX's inner "skip" helper deliberately bypasses the pending MTF and single-step #DB logic. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-24-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptionsSean Christopherson
Treat pending TRIPLE_FAULTS as pending exceptions. A triple fault is an exception for all intents and purposes, it's just not tracked as such because there's no vector associated the exception. E.g. if userspace were to set vcpu->request_interrupt_window while running L2 and L2 hit a triple fault, a triple fault nested VM-Exit should be synthesized to L1 before exiting to userspace with KVM_EXIT_IRQ_WINDOW_OPEN. Link: https://lore.kernel.org/all/YoVHAIGcFgJit1qp@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-23-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Morph pending exceptions to pending VM-Exits at queue timeSean Christopherson
Morph pending exceptions to pending VM-Exits (due to interception) when the exception is queued instead of waiting until nested events are checked at VM-Entry. This fixes a longstanding bug where KVM fails to handle an exception that occurs during delivery of a previous exception, KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow paging), and KVM determines that the exception is in the guest's domain, i.e. queues the new exception for L2. Deferring the interception check causes KVM to esclate various combinations of injected+pending exceptions to double fault (#DF) without consulting L1's interception desires, and ends up injecting a spurious #DF into L2. KVM has fudged around the issue for #PF by special casing emulated #PF injection for shadow paging, but the underlying issue is not unique to shadow paging in L0, e.g. if KVM is intercepting #PF because the guest has a smaller maxphyaddr and L1 (but not L0) is using shadow paging. Other exceptions are affected as well, e.g. if KVM is intercepting #GP for one of SVM's workaround or for the VMware backdoor emulation stuff. The other cases have gone unnoticed because the #DF is spurious if and only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1 would have injected #DF anyways. The hack-a-fix has also led to ugly code, e.g. bailing from the emulator if #PF injection forced a nested VM-Exit and the emulator finds itself back in L1. Allowing for direct-to-VM-Exit queueing also neatly solves the async #PF in L2 mess; no need to set a magic flag and token, simply queue a #PF nested VM-Exit. Deal with event migration by flagging that a pending exception was queued by userspace and check for interception at the next KVM_RUN, e.g. so that KVM does the right thing regardless of the order in which userspace restores nested state vs. event state. When "getting" events from userspace, simply drop any pending excpetion that is destined to be intercepted if there is also an injected exception to be migrated. Ideally, KVM would migrate both events, but that would require new ABI, and practically speaking losing the event is unlikely to be noticed, let alone fatal. The injected exception is captured, RIP still points at the original faulting instruction, etc... So either the injection on the target will trigger the same intercepted exception, or the source of the intercepted exception was transient and/or non-deterministic, thus dropping it is ok-ish. Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0") Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2") Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-22-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Document priority of all known events on Intel CPUsSean Christopherson
Add a gigantic comment above vmx_check_nested_events() to document the priorities of all known events on Intel CPUs. Intel's SDM doesn't include VMX-specific events in its "Priority Among Concurrent Events", which makes it painfully difficult to suss out the correct priority between things like Monitor Trap Flag VM-Exits and pending #DBs. Kudos to Jim Mattson for doing the hard work of collecting and interpreting the priorities from various locations throughtout the SDM (because putting them all in one place in the SDM would be too easy). Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-21-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: nVMX: Add a helper to identify low-priority #DB trapsSean Christopherson
Add a helper to identify "low"-priority #DB traps, i.e. trap-like #DBs that aren't TSS T flag #DBs, and tweak the related code to operate on any queued exception. A future commit will separate exceptions that are intercepted by L1, i.e. cause nested VM-Exit, from those that do NOT trigger nested VM-Exit. I.e. there will be multiple exception structs and multiple invocations of the helpers. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-20-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-ExitSean Christopherson
Determine whether or not new events can be injected after checking nested events. If a VM-Exit occurred during nested event handling, any previous event that needed re-injection is gone from's KVM perspective; the event is captured in the vmc*12 VM-Exit information, but doesn't exist in terms of what needs to be done for entry to L1. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-19-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Hoist nested event checks above event injection logicSean Christopherson
Perform nested event checks before re-injecting exceptions/events into L2. If a pending exception causes VM-Exit to L1, re-injecting events into vmcs02 is premature and wasted effort. Take care to ensure events that need to be re-injected are still re-injected if checking for nested events "fails", i.e. if KVM needs to force an immediate entry+exit to complete the to-be-re-injecteed event. Keep the "can_inject" logic the same for now; it too can be pushed below the nested checks, but is a slightly riskier change (see past bugs about events not being properly purged on nested VM-Exit). Add and/or modify comments to better document the various interactions. Of note is the comment regarding "blocking" previously injected NMIs and IRQs if an exception is pending. The old comment isn't wrong strictly speaking, but it failed to capture the reason why the logic even exists. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-18-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Use kvm_queue_exception_e() to queue #DFSean Christopherson
Queue #DF by recursing on kvm_multiple_exception() by way of kvm_queue_exception_e() instead of open coding the behavior. This will allow KVM to Just Work when a future commit moves exception interception checks (for L2 => L1) into kvm_multiple_exception(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-17-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Formalize blocking of nested pending exceptionsSean Christopherson
Capture nested_run_pending as block_pending_exceptions so that the logic of why exceptions are blocked only needs to be documented once instead of at every place that employs the logic. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-16-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Make kvm_queued_exception a properly named, visible structSean Christopherson
Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch in anticipation of adding a second instance in kvm_vcpu_arch to handle exceptions that occur when vectoring an injected exception and are morphed to VM-Exit instead of leading to #DF. Opportunistically take advantage of the churn to rename "nr" to "vector". No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-15-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exceptionSean Christopherson
Rename the kvm_x86_ops hook for exception injection to better reflect reality, and to align with pretty much every other related function name in KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-14-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: VMX: Inject #PF on ENCLS as "emulated" #PFSean Christopherson
Treat #PFs that occur during emulation of ENCLS as, wait for it, emulated page faults. Practically speaking, this is a glorified nop as the exception is never of the nested flavor, and it's extremely unlikely the guest is relying on the side effect of an implicit INVLPG on the faulting address. Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-13-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>