summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-01-10Merge branch 'x86/mm' into efi/core, to pick up dependenciesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-10Merge branch 'linus' into efi/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-10platform/x86: intel_telemetry_pltdrv: use devm_platform_ioremap_resource()Andy Shevchenko
Use devm_platform_ioremap_resource() to simplify the code a bit. While here, drop initialized but unused ssram_base_addr and ssram_size members. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2020-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The ungrafting from PRIO bug fixes in net, when merged into net-next, merge cleanly but create a build failure. The resolution used here is from Petr Machata. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09bpf: Introduce BPF_MAP_TYPE_STRUCT_OPSMartin KaFai Lau
The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value is a kernel struct with its func ptr implemented in bpf prog. This new map is the interface to register/unregister/introspect a bpf implemented kernel struct. The kernel struct is actually embedded inside another new struct (or called the "value" struct in the code). For example, "struct tcp_congestion_ops" is embbeded in: struct bpf_struct_ops_tcp_congestion_ops { refcount_t refcnt; enum bpf_struct_ops_state state; struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ } The map value is "struct bpf_struct_ops_tcp_congestion_ops". The "bpftool map dump" will then be able to show the state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. number of tcp_sock in the tcp_congestion_ops case). This "value" struct is created automatically by a macro. Having a separate "value" struct will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some initialization works before registering the struct_ops to the kernel subsystem). The libbpf will take care of finding and populating the "struct bpf_struct_ops_XYZ" from "struct XYZ". Register a struct_ops to a kernel subsystem: 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the running kernel. Instead of reusing the attr->btf_value_type_id, btf_vmlinux_value_type_id s added such that attr->btf_fd can still be used as the "user" btf which could store other useful sysadmin/debug info that may be introduced in the furture, e.g. creation-date/compiler-details/map-creator...etc. 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described in the running kernel btf. Populate the value of this object. The function ptr should be populated with the prog fds. 4. Call BPF_MAP_UPDATE with the object created in (3) as the map value. The key is always "0". During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's args as an array of u64 is generated. BPF_MAP_UPDATE also allows the specific struct_ops to do some final checks in "st_ops->init_member()" (e.g. ensure all mandatory func ptrs are implemented). If everything looks good, it will register this kernel struct to the kernel subsystem. The map will not allow further update from this point. Unregister a struct_ops from the kernel subsystem: BPF_MAP_DELETE with key "0". Introspect a struct_ops: BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will have the prog _id_ populated as the func ptr. The map value state (enum bpf_struct_ops_state) will transit from: INIT (map created) => INUSE (map updated, i.e. reg) => TOBEFREE (map value deleted, i.e. unreg) The kernel subsystem needs to call bpf_struct_ops_get() and bpf_struct_ops_put() to manage the "refcnt" in the "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt for the purose of tracking the subsystem usage. Another approach is to reuse the map->refcnt and then "show" (i.e. during map_lookup) the subsystem's usage by doing map->refcnt - map->usercnt to filter out the map-fd/pinned-map usage. However, that will also tie down the future semantics of map->refcnt and map->usercnt. The very first subsystem's refcnt (during reg()) holds one count to map->refcnt. When the very last subsystem's refcnt is gone, it will also release the map->refcnt. All bpf_prog will be freed when the map->refcnt reaches 0 (i.e. during map_free()). Here is how the bpftool map command will look like: [root@arch-fb-vm1 bpf]# bpftool map show 6: struct_ops name dctcp flags 0x0 key 4B value 256B max_entries 1 memlock 4096B btf_id 6 [root@arch-fb-vm1 bpf]# bpftool map dump id 6 [{ "value": { "refcnt": { "refs": { "counter": 1 } }, "state": 1, "data": { "list": { "next": 0, "prev": 0 }, "key": 0, "flags": 2, "init": 24, "release": 0, "ssthresh": 25, "cong_avoid": 30, "set_state": 27, "cwnd_event": 28, "in_ack_event": 26, "undo_cwnd": 29, "pkts_acked": 0, "min_tso_segs": 0, "sndbuf_expand": 0, "cong_control": 0, "get_info": 0, "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 ], "owner": 0 } } } ] Misc Notes: * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup. It does an inplace update on "*value" instead returning a pointer to syscall.c. Otherwise, it needs a separate copy of "zero" value for the BPF_STRUCT_OPS_STATE_INIT to avoid races. * The bpf_struct_ops_map_delete_elem() is also called without preempt_disable() from map_delete_elem(). It is because the "->unreg()" may requires sleepable context, e.g. the "tcp_unregister_congestion_control()". * "const" is added to some of the existing "struct btf_func_model *" function arg to avoid a compiler warning caused by this patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
2020-01-09x86/crash: Use resource_size()Julia Lawall
Use resource_size() rather than a verbose computation on the end and start fields. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) <smpl> @@ struct resource ptr; @@ - (ptr.end - ptr.start + 1) + resource_size(&ptr) </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/1577900990-8588-10-git-send-email-Julia.Lawall@inria.fr
2020-01-09x86/cpu: Add a missing prototype for arch_smt_update()Benjamin Thiel
.. in order to fix a -Wmissing-prototype warning. No functional change. Signed-off-by: Benjamin Thiel <b.thiel@posteo.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200109121723.8151-1-b.thiel@posteo.de
2020-01-09x86/entry/64: Add instruction suffix to SYSRETJan Beulich
ignore_sysret() contains an unsuffixed SYSRET instruction. gas correctly interprets this as SYSRETL, but leaving it up to gas to guess when there is no register operand that implies a size is bad practice, and upstream gas is likely to warn about this in the future. Use SYSRETL explicitly. This does not change the assembled output. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/038a7c35-062b-a285-c6d2-653b56585844@suse.com
2020-01-09crypto: remove propagation of CRYPTO_TFM_RES_* flagsEric Biggers
The CRYPTO_TFM_RES_* flags were apparently meant as a way to make the ->setkey() functions provide more information about errors. But these flags weren't actually being used or tested, and in many cases they weren't being set correctly anyway. So they've now been removed. Also, if someone ever actually needs to start better distinguishing ->setkey() errors (which is somewhat unlikely, as this has been unneeded for a long time), we'd be much better off just defining different return values, like -EINVAL if the key is invalid for the algorithm vs. -EKEYREJECTED if the key was rejected by a policy like "no weak keys". That would be much simpler, less error-prone, and easier to test. So just remove CRYPTO_TFM_RES_MASK and all the unneeded logic that propagates these flags around. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-09crypto: remove CRYPTO_TFM_RES_BAD_KEY_LENEric Biggers
The CRYPTO_TFM_RES_BAD_KEY_LEN flag was apparently meant as a way to make the ->setkey() functions provide more information about errors. However, no one actually checks for this flag, which makes it pointless. Also, many algorithms fail to set this flag when given a bad length key. Reviewing just the generic implementations, this is the case for aes-fixed-time, cbcmac, echainiv, nhpoly1305, pcrypt, rfc3686, rfc4309, rfc7539, rfc7539esp, salsa20, seqiv, and xcbc. But there are probably many more in arch/*/crypto/ and drivers/crypto/. Some algorithms can even set this flag when the key is the correct length. For example, authenc and authencesn set it when the key payload is malformed in any way (not just a bad length), the atmel-sha and ccree drivers can set it if a memory allocation fails, and the chelsio driver sets it for bad auth tag lengths, not just bad key lengths. So even if someone actually wanted to start checking this flag (which seems unlikely, since it's been unused for a long time), there would be a lot of work needed to get it working correctly. But it would probably be much better to go back to the drawing board and just define different return values, like -EINVAL if the key is invalid for the algorithm vs. -EKEYREJECTED if the key was rejected by a policy like "no weak keys". That would be much simpler, less error-prone, and easier to test. So just remove this flag. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Horia Geantă <horia.geanta@nxp.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-08x86: Remove force_iret()Brian Gerst
force_iret() was originally intended to prevent the return to user mode with the SYSRET or SYSEXIT instructions, in cases where the register state could have been changed to be incompatible with those instructions. The entry code has been significantly reworked since then, and register state is validated before SYSRET or SYSEXIT are used. force_iret() no longer serves its original purpose and can be eliminated. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20191219115812.102620-1-brgerst@gmail.com
2020-01-08KVM: x86/mmu: WARN if root_hpa is invalid when handling a page faultSean Christopherson
WARN if root_hpa is invalid when handling a page fault. The check on root_hpa exists for historical reasons that no longer apply to the current KVM code base. Remove an equivalent debug-only warning in direct_page_fault(), whose existence more or less confirms that root_hpa should always be valid when handling a page fault. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: WARN on an invalid root_hpaSean Christopherson
WARN on the existing invalid root_hpa checks in __direct_map() and FNAME(fetch). The "legitimate" path that invalidated root_hpa in the middle of a page fault is long since gone, i.e. it should no longer be impossible to invalidate in the middle of a page fault[*]. The root_hpa checks were added by two related commits 989c6b34f6a94 ("KVM: MMU: handle invalid root_hpa at __direct_map") 37f6a4e237303 ("KVM: x86: handle invalid root_hpa everywhere") to fix a bug where nested_vmx_vmexit() could be called *in the middle* of a page fault. At the time, vmx_interrupt_allowed(), which was and still is used by kvm_can_do_async_pf() via ->interrupt_allowed(), directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate a VM-Exit on a pending interrupt. Emulating the nested VM-Exit resulted in root_hpa being invalidated by kvm_mmu_reset_context() without explicitly terminating the page fault. Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN on an invalid root_hpa to detect any flows that reset the MMU while handling a page fault. The broken vmx_interrupt_allowed() behavior has long since been fixed and resetting the MMU during a page fault should not be considered legal behavior. [*] It's actually technically possible in FNAME(page_fault)() because it calls inject_page_fault() when the guest translation is invalid, but in that case the page fault handling is immediately terminated. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Move root_hpa validity checks to top of page fault handlerSean Christopherson
Add a check on root_hpa at the beginning of the page fault handler to consolidate several checks on root_hpa that are scattered throughout the page fault code. This is a preparatory step towards eventually removing such checks altogether, or at the very least WARNing if an invalid root is encountered. Remove only the checks that can be easily audited to confirm that root_hpa cannot be invalidated between their current location and the new check in kvm_mmu_page_fault(), and aren't currently protected by mmu_lock, i.e. keep the checks in __direct_map() and FNAME(fetch) for the time being. The root_hpa checks that are consolidate were all added by commit 37f6a4e237303 ("KVM: x86: handle invalid root_hpa everywhere") which was a follow up to a bug fix for __direct_map(), commit 989c6b34f6a94 ("KVM: MMU: handle invalid root_hpa at __direct_map") At the time, nested VMX had, in hindsight, crazy handling of nested interrupts and would trigger a nested VM-Exit in ->interrupt_allowed(), and thus unexpectedly reset the MMU in flows such as can_do_async_pf(). Now that the wonky nested VM-Exit behavior is gone, the root_hpa checks are bogus and confusing, e.g. it's not at all obvious what they actually protect against, and at first glance they appear to be broken since many of them run without holding mmu_lock. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Move calls to thp_adjust() down a levelSean Christopherson
Move the calls to thp_adjust() down a level from the page fault handlers to the map/fetch helpers and remove the page count shuffling done in thp_adjust(). Despite holding a reference to the underlying page while processing a page fault, the page fault flows don't actually rely on holding a reference to the page when thp_adjust() is called. At that point, the fault handlers hold mmu_lock, which prevents mmu_notifier from completing any invalidations, and have verified no invalidations from mmu_notifier have occurred since the page reference was acquired (which is done prior to taking mmu_lock). The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a quirk that is necessitated because thp_adjust() modifies the pfn that is consumed by its caller. Because the page fault handlers call kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the reference to the correct pfn purely for correctness when the pfn is released. Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn adjustment doesn't change the pfn as seen by the page fault handlers, i.e. the pfn released by the page fault handlers is the same pfn that was returned by gfn_to_pfn(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Move transparent_hugepage_adjust() above __direct_map()Sean Christopherson
Move thp_adjust() above __direct_map() in preparation of calling thp_adjust() from __direct_map() and FNAME(fetch). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Consolidate tdp_page_fault() and nonpaging_page_fault()Sean Christopherson
Consolidate the direct MMU page fault handlers into a common helper, direct_page_fault(). Except for unique max level conditions, the tdp and nonpaging fault handlers are functionally identical. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Rename lpage_disallowed to account_disallowed_nx_lpageSean Christopherson
Rename __direct_map()'s param that controls whether or not a disallowed NX large page should be accounted to match what it actually does. The nonpaging_page_fault() case unconditionally passes %false for the param even though it locally sets lpage_disallowed. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_levelSean Christopherson
Persist the max page level calculated via gfn_lpage_is_disallowed() to the max level "returned" by mapping_level() so that its naturally taken into account by the max level check that conditions calling transparent_hugepage_adjust(). Drop the gfn_lpage_is_disallowed() check in thp_adjust() as it's now handled by mapping_level() and its callers. Add a comment to document the behavior of host_mapping_level() and its interaction with max level and transparent huge pages. Note, transferring the gfn_lpage_is_disallowed() from thp_adjust() to mapping_level() superficially affects how changes to a memslot's disallow_lpage count will be handled due to thp_adjust() being run while holding mmu_lock. In the more common case where a different vCPU increments the count via account_shadowed(), gfn_lpage_is_disallowed() is rechecked by set_spte() to ensure a writable large page isn't created. In the less common case where the count is decremented to zero due to all shadow pages in the memslot being zapped, THP behavior now matches hugetlbfs behavior in the sense that a small page will be created when a large page could be used if the count reaches zero in the miniscule window between mapping_level() and acquiring mmu_lock. Lastly, the new THP behavior also follows hugetlbfs behavior in the absurdly unlikely scenario of a memslot being moved such that the memslot's compatibility with respect to large pages changes, but without changing the validity of the gpf->pfn walk. I.e. if a memslot is moved between mapping_level() and snapshotting mmu_seq, it's theoretically possible to consume a stale disallow_lpage count. But, since KVM zaps all shadow pages when moving a memslot and forces all vCPUs to reload a new MMU, the inserted spte will always be thrown away prior to completing the memslot move, i.e. whether or not the spte accurately reflects disallow_lpage is irrelevant. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMUSean Christopherson
Restrict the max level for a shadow page based on the guest's level instead of capping the level after the fact for host-mapped huge pages, e.g. hugetlbfs pages. Explicitly capping the max level using the guest mapping level also eliminates FNAME(page_fault)'s subtle dependency on THP only supporting 2mb pages. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Refactor handling of forced 4k pages in page faultsSean Christopherson
Refactor the page fault handlers and mapping_level() to track the max allowed page level instead of only tracking if a 4k page is mandatory due to one restriction or another. This paves the way for cleanly consolidating tdp_page_fault() and nonpaging_page_fault(), and for eliminating a redundant check on mmu_gfn_lpage_is_disallowed(). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Refactor the per-slot level calculation in mapping_level()Sean Christopherson
Invert the loop which adjusts the allowed page level based on what's compatible with the associated memslot to use a largest-to-smallest page size walk. This paves the way for passing around a "max level" variable instead of having redundant checks and/or multiple booleans. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Refactor handling of cache consistency with TDPSean Christopherson
Pre-calculate the max level for a TDP page with respect to MTRR cache consistency in preparation of replacing force_pt_level with max_level, and eventually combining the bulk of nonpaging_page_fault() and tdp_page_fault() into a common helper. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Move nonpaging_page_fault() below try_async_pf()Sean Christopherson
Move nonpaging_page_fault() below try_async_pf() to eliminate the forward declaration of try_async_pf() and to prepare for combining the bulk of nonpaging_page_fault() and tdp_page_fault() into a common helper. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Fold nonpaging_map() into nonpaging_page_fault()Sean Christopherson
Fold nonpaging_map() into its sole caller, nonpaging_page_fault(), in preparation for combining the bulk of nonpaging_page_fault() and tdp_page_fault() into a common helper. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86/mmu: Move definition of make_mmu_pages_available() upSean Christopherson
Move make_mmu_pages_available() above its first user to put it closer to related code and eliminate a forward declaration. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVMSean Christopherson
Convert a plethora of parameters and variables in the MMU and page fault flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM. Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical addresses. When TDP is enabled, the fault address is a guest physical address and thus can be a 64-bit value, even when both KVM and its guest are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a 64-bit field, not a natural width field. Using a gva_t for the fault address means KVM will incorrectly drop the upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to translate L2 GPAs to L1 GPAs. Opportunistically rename variables and parameters to better reflect the dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain "addr" instead of "vaddr" when the address may be either a GVA or an L2 GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid a confusing "gpa_t gva" declaration; this also sets the stage for a future patch to combing nonpaging_page_fault() and tdp_page_fault() with minimal churn. Sprinkle in a few comments to document flows where an address is known to be a GVA and thus can be safely truncated to a 32-bit value. Add WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help document such cases and detect bugs. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86: Add a WARN on TIF_NEED_FPU_LOAD in kvm_load_guest_fpu()Sean Christopherson
WARN once in kvm_load_guest_fpu() if TIF_NEED_FPU_LOAD is observed, as that would mean that KVM is corrupting userspace's FPU by saving unknown register state into arch.user_fpu. Add a comment to explain why KVM WARNs on TIF_NEED_FPU_LOAD instead of implementing logic similar to fpu__copy(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platformSean Christopherson
Unlike most state managed by XSAVE, MPX is initialized to zero on INIT. Because INITs are usually recognized in the context of a VCPU_RUN call, kvm_vcpu_reset() puts the guest's FPU so that the FPU state is resident in memory, zeros the MPX state, and reloads FPU state to hardware. But, in the unlikely event that an INIT is recognized during kvm_arch_vcpu_ioctl_get_mpstate() via kvm_apic_accept_events(), kvm_vcpu_reset() will call kvm_put_guest_fpu() without a preceding kvm_load_guest_fpu() and corrupt the guest's FPU state (and possibly userspace's FPU state as well). Given that MPX is being removed from the kernel[*], fix the bug with the simple-but-ugly approach of loading the guest's FPU during KVM_GET_MP_STATE. [*] See commit f240652b6032b ("x86/mpx: Remove MPX APIs"). Fixes: f775b13eedee2 ("x86,kvm: move qemu/guest FPU switching out to vcpu_run") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08kvm: nVMX: Aesthetic cleanup of handle_vmread and handle_vmwriteJim Mattson
Apply reverse fir tree declaration order, shorten some variable names to avoid line wrap, reformat a block comment, delete an extra blank line, and use BIT(10) instead of (1u << 10). Signed-off-by: Jim Mattson <jmattson@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Jon Cargille <jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08kvm: nVMX: VMWRITE checks unsupported field before read-only fieldJim Mattson
According to the SDM, VMWRITE checks to see if the secondary source operand corresponds to an unsupported VMCS field before it checks to see if the secondary source operand corresponds to a VM-exit information field and the processor does not support writing to VM-exit information fields. Fixes: 49f705c5324aa ("KVM: nVMX: Implement VMREAD and VMWRITE") Signed-off-by: Jim Mattson <jmattson@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Jon Cargille <jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08kvm: nVMX: VMWRITE checks VMCS-link pointer before VMCS fieldJim Mattson
According to the SDM, a VMWRITE in VMX non-root operation with an invalid VMCS-link pointer results in VMfailInvalid before the validity of the VMCS field in the secondary source operand is checked. For consistency, modify both handle_vmwrite and handle_vmread, even though there was no problem with the latter. Fixes: 6d894f498f5d1 ("KVM: nVMX: vmread/vmwrite: Use shadow vmcs12 if running L2") Signed-off-by: Jim Mattson <jmattson@google.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Jon Cargille <jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: VMX: Fix the spelling of CPU_BASED_USE_TSC_OFFSETTINGXiaoyao Li
The mis-spelling is found by checkpatch.pl, so fix them. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: VMX: Rename NMI_PENDING to NMI_WINDOWXiaoyao Li
Rename the NMI-window exiting related definitions to match the latest Intel SDM. No functional changes. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: VMX: Rename INTERRUPT_PENDING to INTERRUPT_WINDOWXiaoyao Li
Rename interrupt-windown exiting related definitions to match the latest Intel SDM. No functional changes. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86: Fix some comment typosMiaohe Lin
Fix some typos in comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: X86: Convert the last users of "shorthand = 0" to use macrosPeter Xu
Change the last users of "shorthand = 0" to use APIC_DEST_NOSHORT. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: X86: Fix callers of kvm_apic_match_dest() to use correct macrosPeter Xu
Callers of kvm_apic_match_dest() should always pass in APIC_DEST_* macros for either dest_mode and short_hand parameters. Fix up all the callers of kvm_apic_match_dest() that are not following the rule. Since at it, rename the parameter from short_hand to shorthand in kvm_apic_match_dest(), as suggested by Vitaly. Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: X86: Drop KVM_APIC_SHORT_MASK and KVM_APIC_DEST_MASKPeter Xu
We have both APIC_SHORT_MASK and KVM_APIC_SHORT_MASK defined for the shorthand mask. Similarly, we have both APIC_DEST_MASK and KVM_APIC_DEST_MASK defined for the destination mode mask. Drop the KVM_APIC_* macros and replace the only user of them to use the APIC_DEST_* macros instead. At the meantime, move APIC_SHORT_MASK and APIC_DEST_MASK from lapic.c to lapic.h. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: X86: Use APIC_DEST_* macros properly in kvm_lapic_irq.dest_modePeter Xu
We were using either APIC_DEST_PHYSICAL|APIC_DEST_LOGICAL or 0|1 to fill in kvm_lapic_irq.dest_mode. It's fine only because in most cases when we check against dest_mode it's against APIC_DEST_PHYSICAL (which equals to 0). However, that's not consistent. We'll have problem when we want to start checking against APIC_DEST_LOGICAL, which does not equals to 1. This patch firstly introduces kvm_lapic_irq_dest_mode() helper to take any boolean of destination mode and return the APIC_DEST_* macro. Then, it replaces the 0|1 settings of irq.dest_mode with the helper. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: X86: Move irrelevant declarations out of ioapic.hPeter Xu
kvm_apic_match_dest() is declared in both ioapic.h and lapic.h. Remove the declaration in ioapic.h. kvm_apic_compare_prio() is declared in ioapic.h but defined in lapic.c. Move the declaration to lapic.h. kvm_irq_delivery_to_apic() is declared in ioapic.h but defined in irq_comm.c. Move the declaration to irq.h. hyperv.c needs to use kvm_irq_delivery_to_apic(). Include irq.h in hyperv.c. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: X86: Fix kvm_bitmap_or_dest_vcpus() to use irq shorthandPeter Xu
The 3rd parameter of kvm_apic_match_dest() is the irq shorthand, rather than the irq delivery mode. Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs") Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: explicitly set rmap_head->val to 0 in pte_list_desc_remove_entry()Miaohe Lin
When we reach here, we have desc->sptes[j] = NULL with j = 0. So we can replace desc->sptes[0] with 0 to make it more clear. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: vmx: remove unreachable statement in vmx_get_msr_feature()Miaohe Lin
We have no way to reach the final statement, remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: x86: use CPUID to locate host page table reserved bitsPaolo Bonzini
The comment in kvm_get_shadow_phys_bits refers to MKTME, but the same is actually true of SME and SEV. Just use CPUID[0x8000_0008].EAX[7:0] unconditionally if available, it is simplest and works even if memory is not encrypted. Cc: stable@vger.kernel.org Reported-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08x86/cpufeatures: Add support for fast short REP; MOVSBTony Luck
>From the Intel Optimization Reference Manual: 3.7.6.1 Fast Short REP MOVSB Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID [EAX=7H, ECX=0H).EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance. Add an X86_FEATURE_FSRM flag for this. memmove() avoids REP MOVSB for short (< 32 byte) copies. Check FSRM and use REP MOVSB for short copies on systems that support it. [ bp: Massage and add comment. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20191216214254.26492-1-tony.luck@intel.com
2020-01-07x86/fpu: Deactivate FPU state after failure during state loadSebastian Andrzej Siewior
In __fpu__restore_sig(), fpu_fpregs_owner_ctx needs to be reset if the FPU state was not fully restored. Otherwise the following may happen (on the same CPU): Task A Task B fpu_fpregs_owner_ctx *active* A.fpu __fpu__restore_sig() ctx switch load B.fpu *active* B.fpu fpregs_lock() copy_user_to_fpregs_zeroing() copy_kernel_to_xregs() *modify* copy_user_to_xregs() *fails* fpregs_unlock() ctx switch skip loading B.fpu, *active* B.fpu In the success case, fpu_fpregs_owner_ctx is set to the current task. In the failure case, the FPU state might have been modified by loading the init state. In this case, fpu_fpregs_owner_ctx needs to be reset in order to ensure that the FPU state of the following task is loaded from saved state (and not skipped because it was the previous state). Reset fpu_fpregs_owner_ctx after a failure during restore occurred, to ensure that the FPU state for the next task is always loaded. The problem was debugged-by Yu-cheng Yu <yu-cheng.yu@intel.com>. [ bp: Massage commit message. ] Fixes: 5f409e20b7945 ("x86/fpu: Defer FPU state load until return to userspace") Reported-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191220195906.plk6kpmsrikvbcfn@linutronix.de
2020-01-07um: Implement copy_thread_tlsAmanieu d'Antras
This is required for clone3 which passes the TLS value through a struct rather than a register. Signed-off-by: Amanieu d'Antras <amanieu@gmail.com> Cc: linux-um@lists.infradead.org Cc: <stable@vger.kernel.org> # 5.3.x Link: https://lore.kernel.org/r/20200104123928.1048822-1-amanieu@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-07x86/unwind/orc: Fix !CONFIG_MODULES build warningShile Zhang
To fix follwowing warning due to ORC sort moved to build time: arch/x86/kernel/unwind_orc.c:210:12: warning: ‘orc_sort_cmp’ defined but not used [-Wunused-function] arch/x86/kernel/unwind_orc.c:190:13: warning: ‘orc_sort_swap’ defined but not used [-Wunused-function] Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/c9c81536-2afc-c8aa-c5f8-c7618ecd4f54@linux.alibaba.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-07x86/context-tracking: Remove exception_enter/exit() from ↵Frederic Weisbecker
KVM_PV_REASON_PAGE_NOT_PRESENT async page fault This is a leftover. Page faults, just like most other exceptions, are protected inside user_exit() / user_enter() calls in x86 entry code when we fault from userspace. So this pair of calls is now superfluous. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Link: https://lkml.kernel.org/r/20191227163612.10039-3-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>