summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2017-11-16Merge tag 'kvm-s390-next-4.15-1' of ↵Radim Krčmář
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux KVM: s390: fixes and improvements for 4.15 - Some initial preparation patches for exitless interrupts and crypto - New capability for AIS migration - Fixes - merge of the sthyi tree from the base s390 team, which moves the sthyi out of KVM into a shared function also for non-KVM
2017-11-09KVM: s390: provide a capability for AIS state migrationChristian Borntraeger
The AIS capability was introduced in 4.12, while the interface to migrate the state was added in 4.13. Unfortunately it is not possible for userspace to detect the migration capability without creating a flic kvm device. As in QEMU the cpu model detection runs on the "none" machine this will result in cpu model issues regarding the "ais" capability. To get the "ais" capability properly let's add a new KVM capability that tells userspace that AIS states can be migrated. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Halil Pasic <pasic@linux.vnet.ibm.com>
2017-11-09Merge tag 'kvm-ppc-next-4.15-2' of ↵Radim Krčmář
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc Second PPC KVM update for 4.15 This merges in my kvm-ppc-fixes branch to resolve the conflicts between the fixes that have been applied there and the changes made in my patch series to allow HPT guests to run on a radix host on POWER9. It also resolves another conflict in the code for the KVM_CAP_PPC_HTM capability.
2017-11-09KVM: s390: clear_io_irq() requests are not expected for adapter interruptsMichael Mueller
There is a chance to delete not yet delivered I/O interrupts if an exploiter uses the subsystem identification word 0x0000 while processing a KVM_DEV_FLIC_CLEAR_IO_IRQ ioctl. -EINVAL will be returned now instead in that case. Classic interrupts will always have bit 0x10000 set in the schid while adapter interrupts have a zero schid. The clear_io_irq interface is only useful for classic interrupts (as adapter interrupts belong to many devices). Let's make this interface more strict and forbid a schid of 0. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-11-09KVM: s390: abstract conversion between isc and enum irq_typesMichael Mueller
The abstraction of the conversion between an isc value and an irq_type by means of functions isc_to_irq_type() and irq_type_to_isc() allows to clarify the respective operations where used. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-11-09KVM: s390: vsie: use common code functions for pinningDavid Hildenbrand
We will not see -ENOMEM (gfn_to_hva() will return KVM_ERR_PTR_BAD_PAGE for all errors). So we can also get rid of special handling in the callers of pin_guest_page() and always assume that it is a g2 error. As also kvm_s390_inject_program_int() should never fail, we can simplify pin_scb(), too. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20170901151143.22714-1-david@redhat.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-11-09KVM: s390: SIE considerations for AP Queue virtualizationTony Krowiak
The Crypto Control Block (CRYCB) is referenced by the SIE state description and controls KVM guest access to the Adjunct Processor (AP) adapters, usage domains and control domains. This patch defines the AP control blocks to be used for controlling guest access to the AP adapters and domains. Signed-off-by: Tony Krowiak <akrowiak@linux.vnet.ibm.com> Message-Id: <1507916344-3896-2-git-send-email-akrowiak@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-11-09KVM: s390: document memory ordering for kvm_s390_vcpu_wakeupChristian Borntraeger
swait_active does not enforce any ordering and it can therefore trigger some subtle races when the CPU moves the read for the check before a previous store and that store is then used on another CPU that is preparing the swait. On s390 there is a call to swait_active in kvm_s390_vcpu_wakeup. The good thing is, on s390 all potential races cannot happen because all callers of kvm_s390_vcpu_wakeup do not store (no race) or use an atomic operation, which handles memory ordering. Since this is not guaranteed by the Linux semantics (but by the implementation on s390) let's add smp_mb_after_atomic to make this obvious and document the ordering. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-11-09KVM: PPC: Book3S HV: Cosmetic post-merge cleanupsPaul Mackerras
This rearranges the code in kvmppc_run_vcpu() and kvmppc_run_vcpu_hv() to be neater and clearer. Deeply indented code in kvmppc_run_vcpu() is moved out to a helper function, kvmhv_setup_mmu(). In kvmppc_vcpu_run_hv(), make use of the existing variable 'kvm' in place of 'vcpu->kvm'. No functional change. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-09Merge branch 'kvm-ppc-fixes' into kvm-ppc-nextPaul Mackerras
This merges in a couple of fixes from the kvm-ppc-fixes branch that modify the same areas of code as some commits from the kvm-ppc-next branch, in order to resolve the conflicts. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-08Merge tag 'kvm-arm-for-v4.15' of ↵Radim Krčmář
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into next KVM/ARM Changes for v4.15 Changes include: - Optimized arch timer handling for KVM/ARM - Improvements to the VGIC ITS code and introduction of an ITS reset ioctl - Unification of the 32-bit fault injection logic - More exact external abort matching logic
2017-11-08KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updatesPaul Mackerras
Commit 5e9859699aba ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation", 2016-12-20) added code that tries to exclude any use or update of the hashed page table (HPT) while the HPT resizing code is iterating through all the entries in the HPT. It does this by taking the kvm->lock mutex, clearing the kvm->arch.hpte_setup_done flag and then sending an IPI to all CPUs in the host. The idea is that any VCPU task that tries to enter the guest will see that the hpte_setup_done flag is clear and therefore call kvmppc_hv_setup_htab_rma, which also takes the kvm->lock mutex and will therefore block until we release kvm->lock. However, any VCPU that is already in the guest, or is handling a hypervisor page fault or hypercall, can re-enter the guest without rechecking the hpte_setup_done flag. The IPI will cause a guest exit of any VCPUs that are currently in the guest, but does not prevent those VCPU tasks from immediately re-entering the guest. The result is that after resize_hpt_rehash_hpte() has made a HPTE absent, a hypervisor page fault can occur and make that HPTE present again. This includes updating the rmap array for the guest real page, meaning that we now have a pointer in the rmap array which connects with pointers in the old rev array but not the new rev array. In fact, if the HPT is being reduced in size, the pointer in the rmap array could point outside the bounds of the new rev array. If that happens, we can get a host crash later on such as this one: [91652.628516] Unable to handle kernel paging request for data at address 0xd0000000157fb10c [91652.628668] Faulting instruction address: 0xc0000000000e2640 [91652.628736] Oops: Kernel access of bad area, sig: 11 [#1] [91652.628789] LE SMP NR_CPUS=1024 NUMA PowerNV [91652.628847] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal ipmi_powernv ipmi_devintf i2c_core ipmi_msghandler powernv_op_panel nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm scsi_dh_alua dm_service_time dm_multipath tg3 ptp pps_core [last unloaded: stap_552b612747aec2da355051e464fa72a1_14259] [91652.629566] CPU: 136 PID: 41315 Comm: CPU 21/KVM Tainted: G O 4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le #1 [91652.629684] task: c0000007a419e400 task.stack: c0000000028d8000 [91652.629750] NIP: c0000000000e2640 LR: d00000000c36e498 CTR: c0000000000e25f0 [91652.629829] REGS: c0000000028db5d0 TRAP: 0300 Tainted: G O (4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le) [91652.629932] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022422 XER: 00000000 [91652.630034] CFAR: d00000000c373f84 DAR: d0000000157fb10c DSISR: 40000000 SOFTE: 1 [91652.630034] GPR00: d00000000c36e498 c0000000028db850 c000000001403900 c0000007b7960000 [91652.630034] GPR04: d0000000117fb100 d000000007ab00d8 000000000033bb10 0000000000000000 [91652.630034] GPR08: fffffffffffffe7f 801001810073bb10 d00000000e440000 d00000000c373f70 [91652.630034] GPR12: c0000000000e25f0 c00000000fdb9400 f000000003b24680 0000000000000000 [91652.630034] GPR16: 00000000000004fb 00007ff7081a0000 00000000000ec91a 000000000033bb10 [91652.630034] GPR20: 0000000000010000 00000000001b1190 0000000000000001 0000000000010000 [91652.630034] GPR24: c0000007b7ab8038 d0000000117fb100 0000000ec91a1190 c000001e6a000000 [91652.630034] GPR28: 00000000033bb100 000000000073bb10 c0000007b7960000 d0000000157fb100 [91652.630735] NIP [c0000000000e2640] kvmppc_add_revmap_chain+0x50/0x120 [91652.630806] LR [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv] [91652.630884] Call Trace: [91652.630913] [c0000000028db850] [c0000000028db8b0] 0xc0000000028db8b0 (unreliable) [91652.630996] [c0000000028db8b0] [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv] [91652.631091] [c0000000028db9e0] [d00000000c36a078] kvmppc_vcpu_run_hv+0xdf8/0x1300 [kvm_hv] [91652.631179] [c0000000028dbb30] [d00000000c2248c4] kvmppc_vcpu_run+0x34/0x50 [kvm] [91652.631266] [c0000000028dbb50] [d00000000c220d54] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm] [91652.631351] [c0000000028dbbd0] [d00000000c2139d8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm] [91652.631433] [c0000000028dbd40] [c0000000003832e0] do_vfs_ioctl+0xd0/0x8c0 [91652.631501] [c0000000028dbde0] [c000000000383ba4] SyS_ioctl+0xd4/0x130 [91652.631569] [c0000000028dbe30] [c00000000000b8e0] system_call+0x58/0x6c [91652.631635] Instruction dump: [91652.631676] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ffa1 2fa70000 793d0020 e9432110 [91652.631814] 7bbf26e4 7c7e1b78 7feafa14 409e0094 <807f000c> 786326e4 7c6a1a14 93a40008 [91652.631959] ---[ end trace ac85ba6db72e5b2e ]--- To fix this, we tighten up the way that the hpte_setup_done flag is checked to ensure that it does provide the guarantee that the resizing code needs. In kvmppc_run_core(), we check the hpte_setup_done flag after disabling interrupts and refuse to enter the guest if it is clear (for a HPT guest). The code that checks hpte_setup_done and calls kvmppc_hv_setup_htab_rma() is moved from kvmppc_vcpu_run_hv() to a point inside the main loop in kvmppc_run_vcpu(), ensuring that we don't just spin endlessly calling kvmppc_run_core() while hpte_setup_done is clear, but instead have a chance to block on the kvm->lock mutex. Finally we also check hpte_setup_done inside the region in kvmppc_book3s_hv_page_fault() where the HPTE is locked and we are about to update the HPTE, and bail out if it is clear. If another CPU is inside kvm_vm_ioctl_resize_hpt_commit) and has cleared hpte_setup_done, then we know that either we are looking at a HPTE that resize_hpt_rehash_hpte() has not yet processed, which is OK, or else we will see hpte_setup_done clear and refuse to update it, because of the full barrier formed by the unlock of the HPTE in resize_hpt_rehash_hpte() combined with the locking of the HPTE in kvmppc_book3s_hv_page_fault(). Fixes: 5e9859699aba ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation") Cc: stable@vger.kernel.org # v4.10+ Reported-by: Satheesh Rajendran <satheera@in.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-06KVM: arm/arm64: fix the incompatible matching for external abortDongjiu Geng
kvm_vcpu_dabt_isextabt() tries to match a full fault syndrome, but calls kvm_vcpu_trap_get_fault_type() that only returns the fault class, thus reducing the scope of the check. This doesn't cause any observable bug yet as we end-up matching a closely related syndrome for which we return the same value. Using kvm_vcpu_trap_get_fault() instead fixes it for good. Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2017-11-06KVM: arm/arm64: Unify 32bit fault injectionMarc Zyngier
Both arm and arm64 implementations are capable of injecting faults, and yet have completely divergent implementations, leading to different bugs and reduced maintainability. Let's elect the arm64 version as the canonical one and move it into aarch32.c, which is common to both architectures. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2017-11-06KVM: arm/arm64: vgic-its: Implement KVM_DEV_ARM_ITS_CTRL_RESETEric Auger
On reset we clear the valid bits of GITS_CBASER and GITS_BASER<n>. We also clear command queue registers and free the cache (device, collection, and lpi lists). As we need to take the same locks as save/restore functions, we create a vgic_its_ctrl() wrapper that handles KVM_DEV_ARM_VGIC_GRP_CTRL group functions. Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2017-11-06KVM: arm/arm64: Use kvm_arm_timer_set/get_reg for guest register trapsChristoffer Dall
When trapping on a guest access to one of the timer registers, we were messing with the internals of the timer state from the sysregs handling code, and that logic was about to receive more added complexity when optimizing the timer handling code. Therefore, since we already have timer register access functions (to access registers from userspace), reuse those for the timer register traps from a VM and let the timer code maintain its own consistency. Signed-off-by: Christoffer Dall <cdall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
2017-11-06KVM: arm/arm64: Support EL1 phys timer register access in set/get regChristoffer Dall
Add suport for the physical timer registers in kvm_arm_timer_set_reg and kvm_arm_timer_get_reg so that these functions can be reused to interact with the rest of the system. Note that this paves part of the way for the physical timer state save/restore, but we still need to add those registers to KVM_GET_REG_LIST before we support migrating the physical timer state. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-11-06KVM: arm/arm64: Move timer save/restore out of the hyp codeChristoffer Dall
As we are about to be lazy with saving and restoring the timer registers, we prepare by moving all possible timer configuration logic out of the hyp code. All virtual timer registers can be programmed from EL1 and since the arch timer is always a level triggered interrupt we can safely do this with interrupts disabled in the host kernel on the way to the guest without taking vtimer interrupts in the host kernel (yet). The downside is that the cntvoff register can only be programmed from hyp mode, so we jump into hyp mode and back to program it. This is also safe, because the host kernel doesn't use the virtual timer in the KVM code. It may add a little performance performance penalty, but only until following commits where we move this operation to vcpu load/put. Signed-off-by: Christoffer Dall <cdall@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
2017-11-06arm64: Use physical counter for in-kernel reads when booted in EL2Christoffer Dall
Using the physical counter allows KVM to retain the offset between the virtual and physical counter as long as it is actively running a VCPU. As soon as a VCPU is released, another thread is scheduled or we start running userspace applications, we reset the offset to 0, so that userspace accessing the virtual timer can still read the virtual counter and get the same view of time as the kernel. This opens up potential improvements for KVM performance, but we have to make a few adjustments to preserve system consistency. Currently get_cycles() is hardwired to arch_counter_get_cntvct() on arm64, but as we move to using the physical timer for the in-kernel time-keeping on systems that boot in EL2, we should use the same counter for get_cycles() as for other in-kernel timekeeping operations. Similarly, implementations of arch_timer_set_next_event_phys() is modified to use the counter specific to the timer being programmed. VHE kernels or kernels continuing to use the virtual timer are unaffected. Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-11-06arm64: Implement arch_counter_get_cntpct to read the physical counterChristoffer Dall
As we are about to use the physical counter on arm64 systems that have KVM support, implement arch_counter_get_cntpct() and the associated errata workaround functionality for stable timer reads. Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2017-11-02Merge branch 'kvm-ppc-next' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD Apart from various bugfixes and code cleanups, the major new feature is the ability to run guests using the hashed page table (HPT) MMU mode on a host that is using the radix MMU mode. Because of limitations in the current POWER9 chip (all SMT threads in each core must use the same MMU mode, HPT or radix), this requires the host to be configured to run similar to POWER8: the host runs in single-threaded mode (only thread 0 of each core online), and have KVM be able to wake up the other threads when a KVM guest is to be run, and use the other threads for running guest VCPUs. A new module parameter, called "indep_threads_mode", is normally Y on POWER9 but must be set to N before any HPT guests can be run on a radix host: # echo N >/sys/module/kvm_hv/parameters/indep_threads_mode # ppc64_cpu --smt=off Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-01KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hostsPaul Mackerras
This patch removes the restriction that a radix host can only run radix guests, allowing us to run HPT (hashed page table) guests as well. This is useful because it provides a way to run old guest kernels that know about POWER8 but not POWER9. Unfortunately, POWER9 currently has a restriction that all threads in a given code must either all be in HPT mode, or all in radix mode. This means that when entering a HPT guest, we have to obtain control of all 4 threads in the core and get them to switch their LPIDR and LPCR registers, even if they are not going to run a guest. On guest exit we also have to get all threads to switch LPIDR and LPCR back to host values. To make this feasible, we require that KVM not be in the "independent threads" mode, and that the CPU cores be in single-threaded mode from the host kernel's perspective (only thread 0 online; threads 1, 2 and 3 offline). That allows us to use the same code as on POWER8 for obtaining control of the secondary threads. To manage the LPCR/LPIDR changes required, we extend the kvm_split_info struct to contain the information needed by the secondary threads. All threads perform a barrier synchronization (where all threads wait for every other thread to reach the synchronization point) on guest entry, both before and after loading LPCR and LPIDR. On guest exit, they all once again perform a barrier synchronization both before and after loading host values into LPCR and LPIDR. Finally, it is also currently necessary to flush the entire TLB every time we enter a HPT guest on a radix host. We do this on thread 0 with a loop of tlbiel instructions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded modePaul Mackerras
This patch allows for a mode on POWER9 hosts where we control all the threads of a core, much as we do on POWER8. The mode is controlled by a module parameter on the kvm_hv module, called "indep_threads_mode". The normal mode on POWER9 is the "independent threads" mode, with indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any desired SMT mode) and each thread independently enters and exits from KVM guests without reference to what other threads in the core are doing. If indep_threads_mode is set to N at the point when a VM is started, KVM will expect every core that the guest runs on to be in single threaded mode (that is, threads 1, 2 and 3 offline), and will set the flag that prevents secondary threads from coming online. We can still use all four threads; the code that implements dynamic micro-threading on POWER8 will become active in over-commit situations and will allow up to three other VCPUs to be run on the secondary threads of the core whenever a VCPU is run. The reason for wanting this mode is that this will allow us to run HPT guests on a radix host on a POWER9 machine that does not support "mixed mode", that is, having some threads in a core be in HPT mode while other threads are in radix mode. It will also make it possible to implement a "strict threads" mode in future, if desired. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix hostPaul Mackerras
This sets up the machinery for switching a guest between HPT (hashed page table) and radix MMU modes, so that in future we can run a HPT guest on a radix host on POWER9 machines. * The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or radix mode, on a radix host. * The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9 with HV KVM on a radix host. * The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a radix host. * The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the guest to HPT mode and allocate a HPT. * For simplicity, we now allocate the rmap array for each memslot, even on a radix host, since it will be needed if the guest switches to HPT mode. * Since we cannot yet run a HPT guest on a radix host, the KVM_RUN ioctl will return an EINVAL error in that case. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Unify dirty page map between HPT and radixPaul Mackerras
Currently, the HPT code in HV KVM maintains a dirty bit per guest page in the rmap array, whether or not dirty page tracking has been enabled for the memory slot. In contrast, the radix code maintains a dirty bit per guest page in memslot->dirty_bitmap, and only does so when dirty page tracking has been enabled. This changes the HPT code to maintain the dirty bits in the memslot dirty_bitmap like radix does. This results in slightly less code overall, and will mean that we do not lose the dirty bits when transitioning between HPT and radix mode in future. There is one minor change to behaviour as a result. With HPT, when dirty tracking was enabled for a memslot, we would previously clear all the dirty bits at that point (both in the HPT entries and in the rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately following would show no pages as dirty (assuming no vcpus have run in the meantime). With this change, the dirty bits on HPT entries are not cleared at the point where dirty tracking is enabled, so KVM_GET_DIRTY_LOG would show as dirty any guest pages that are resident in the HPT and dirty. This is consistent with what happens on radix. This also fixes a bug in the mark_pages_dirty() function for radix (in the sense that the function no longer exists). In the case where a large page of 64 normal pages or more is marked dirty, the addressing of the dirty bitmap was incorrect and could write past the end of the bitmap. Fortunately this case was never hit in practice because a 2MB large page is only 32 x 64kB pages, and we don't support backing the guest with 1GB huge pages at this point. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_readyPaul Mackerras
This renames the kvm->arch.hpte_setup_done field to mmu_ready because we will want to use it for radix guests too -- both for setting things up before vcpu execution, and for excluding vcpus from executing while MMU-related things get changed, such as in future switching the MMU from radix to HPT mode or vice-versa. This also moves the call to kvmppc_setup_partition_table() that was done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting of mmu_ready, into the caller in kvmppc_vcpu_run_hv(). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Don't rely on host's page size informationPaul Mackerras
This removes the dependence of KVM on the mmu_psize_defs array (which stores information about hardware support for various page sizes) and the things derived from it, chiefly hpte_page_sizes[], hpte_page_size(), hpte_actual_page_size() and get_sllp_encoding(). We also no longer rely on the mmu_slb_size variable or the MMU_FTR_1T_SEGMENTS feature bit. The reason for doing this is so we can support a HPT guest on a radix host. In a radix host, the mmu_psize_defs array contains information about page sizes supported by the MMU in radix mode rather than the page sizes supported by the MMU in HPT mode. Similarly, mmu_slb_size and the MMU_FTR_1T_SEGMENTS bit are not set. Instead we hard-code knowledge of the behaviour of the HPT MMU in the POWER7, POWER8 and POWER9 processors (which are the only processors supported by HV KVM) - specifically the encoding of the LP fields in the HPT and SLB entries, and the fact that they have 32 SLB entries and support 1TB segments. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras
This merges in the ppc-kvm topic branch of the powerpc tree to get the commit that reverts the patch "KVM: PPC: Book3S HV: POWER9 does not require secondary thread management". This is needed for subsequent patches which will be applied on this branch. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S: Fix gas warning due to using r0 as immediate 0Nicholas Piggin
This fixes the message: arch/powerpc/kvm/book3s_segment.S: Assembler messages: arch/powerpc/kvm/book3s_segment.S:330: Warning: invalid register expression Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S PR: Only install valid SLBs during KVM_SET_SREGSGreg Kurz
Userland passes an array of 64 SLB descriptors to KVM_SET_SREGS, some of which are valid (ie, SLB_ESID_V is set) and the rest are likely all-zeroes (with QEMU at least). Each of them is then passed to kvmppc_mmu_book3s_64_slbmte(), which assumes to find the SLB index in the 3 lower bits of its rb argument. When passed zeroed arguments, it happily overwrites the 0th SLB entry with zeroes. This is exactly what happens while doing live migration with QEMU when the destination pushes the incoming SLB descriptors to KVM PR. When reloading the SLBs at the next synchronization, QEMU first clears its SLB array and only restore valid ones, but the 0th one is now gone and we cannot access the corresponding memory anymore: (qemu) x/x $pc c0000000000b742c: Cannot access memory To avoid this, let's filter out non-valid SLB entries. While here, we also force a full SLB flush before installing new entries. Since SLB is for 64-bit only, we now build this path conditionally to avoid a build break on 32-bit, which doesn't define SLB_ESID_V. Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01KVM: PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not enabledPaul Mackerras
When running a guest on a POWER9 system with the in-kernel XICS emulation disabled (for example by running QEMU with the parameter "-machine pseries,kernel_irqchip=off"), the kernel does not pass the XICS-related hypercalls such as H_CPPR up to userspace for emulation there as it should. The reason for this is that the real-mode handlers for these hypercalls don't check whether a XICS device has been instantiated before calling the xics-on-xive code. That code doesn't check either, leading to potential NULL pointer dereferences because vcpu->arch.xive_vcpu is NULL. Those dereferences won't cause an exception in real mode but will lead to kernel memory corruption. This fixes it by adding kvmppc_xics_enabled() checks before calling the XICS functions. Cc: stable@vger.kernel.org # v4.11+ Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-20KVM: X86: #GP when guest attempts to write MCi_STATUS register w/o 0Wanpeng Li
Both Intel SDM and AMD APM mentioned that MCi_STATUS, when the register is implemented, this register can be cleared by explicitly writing 0s to this register. Writing 1s to this register will cause a general-protection exception. The mce is emulated in qemu, so just the guest attempts to write 1 to this register should cause a #GP, this patch does it. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-20KVM: VMX: Fix VPID capability detectionWanpeng Li
In my setup, EPT is not exposed to L1, the VPID capability is exposed and can be observed by vmxcap tool in L1: INVVPID supported yes Individual-address INVVPID yes Single-context INVVPID yes All-context INVVPID yes Single-context-retaining-globals INVVPID yes However, the module parameter of VPID observed in L1 is always N, the cpu_has_vmx_invvpid() check in L1 KVM fails since vmx_capability.vpid is 0 and it is not read from MSR due to EPT is not exposed. The VPID can be used to tag linear mappings when EPT is not enabled. However, current logic just detects VPID capability if EPT is enabled, this patch fixes it. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-20KVM: nVMX: Fix EPT switching advertisingWanpeng Li
I can use vmxcap tool to observe "EPTP Switching yes" even if EPT is not exposed to L1. EPT switching is advertised unconditionally since it is emulated, however, it can be treated as an extended feature for EPT and it should not be advertised if EPT itself is not exposed. This patch fixes it. Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-20KVM: PPC: Tie KVM_CAP_PPC_HTM to the user-visible TM featureMichael Ellerman
Currently we use CPU_FTR_TM to decide if the CPU/kernel can support TM (Transactional Memory), and if it's true we advertise that to Qemu (or similar) via KVM_CAP_PPC_HTM. PPC_FEATURE2_HTM is the user-visible feature bit, which indicates that the CPU and kernel can support TM. Currently CPU_FTR_TM and PPC_FEATURE2_HTM always have the same value, either true or false, so using the former for KVM_CAP_PPC_HTM is correct. However some Power9 CPUs can operate in a mode where TM is enabled but TM suspended state is disabled. In this mode CPU_FTR_TM is true, but PPC_FEATURE2_HTM is false. Instead a different PPC_FEATURE2 bit is set, to indicate that this different mode of TM is available. It is not safe to let guests use TM as-is, when the CPU is in this mode. So to prevent that from happening, use PPC_FEATURE2_HTM to determine the value of KVM_CAP_PPC_HTM. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-19Revert "KVM: PPC: Book3S HV: POWER9 does not require secondary thread ↵Paul Mackerras
management" This reverts commit 94a04bc25a2c6296bd0c5e82c10e8231c2b11f77. In order to run HPT guests on a radix POWER9 host, we will have to run the host in single-threaded mode, because POWER9 processors do not currently support running some threads of a core in HPT mode while others are in radix mode ("mixed mode"). That means that we will need the same mechanisms that are used on POWER8 to make the secondary threads available to KVM, which were disabled on POWER9 by commit 94a04bc25a2c. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-18KVM: SVM: detect opening of SMI window using STGI interceptLadi Prosek
Commit 05cade71cf3b ("KVM: nSVM: fix SMI injection in guest mode") made KVM mask SMI if GIF=0 but it didn't do anything to unmask it when GIF is enabled. The issue manifests for me as a significantly longer boot time of Windows guests when running with SMM-enabled OVMF. This commit fixes it by intercepting STGI instead of requesting immediate exit if the reason why SMM was masked is GIF. Fixes: 05cade71cf3b ("KVM: nSVM: fix SMI injection in guest mode") Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-16KVM: PPC: Book3S HV: Add more barriers in XIVE load/unload codeBenjamin Herrenschmidt
On POWER9 systems, we push the VCPU context onto the XIVE (eXternal Interrupt Virtualization Engine) hardware when entering a guest, and pull the context off the XIVE when exiting the guest. The push is done with cache-inhibited stores, and the pull with cache-inhibited loads. Testing has revealed that it is possible (though very rare) for the stores to get reordered with the loads so that we end up with the guest VCPU context still loaded on the XIVE after we have exited the guest. When that happens, it is possible for the same VCPU context to then get loaded on another CPU, which causes the machine to checkstop. To fix this, we add I/O barrier instructions (eieio) before and after the push and pull operations. As partial compensation for the potential slowdown caused by the extra barriers, we remove the eieio instructions between the two stores in the push operation, and between the two loads in the pull operation. (The architecture requires loads to cache-inhibited, guarded storage to be kept in order, and requires stores to cache-inhibited, guarded storage likewise to be kept in order, but allows such loads and stores to be reordered with respect to each other.) Reported-by: Carol L Soto <clsoto@us.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-16KVM: PPC: Book3S HV: Explicitly disable HPT operations on radix guestsPaul Mackerras
This adds code to make sure that we don't try to access the non-existent HPT for a radix guest using the htab file for the VM in debugfs, a file descriptor obtained using the KVM_PPC_GET_HTAB_FD ioctl, or via the KVM_PPC_RESIZE_HPT_{PREPARE,COMMIT} ioctls. At present nothing bad happens if userspace does access these interfaces on a radix guest, mostly because kvmppc_hpt_npte() gives 0 for a radix guest, which in turn is because 1 << -4 comes out as 0 on POWER processors. However, that relies on undefined behaviour, so it is better to be explicit about not accessing the HPT for a radix guest. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Book3S PR: Enable in-kernel TCE handlers for PR KVMAlexey Kardashevskiy
The handlers support PR KVM from the day one; however the PR KVM's enable/disable hcalls handler missed these ones. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Book3S HV: Delete an error message for a failed memory allocation ↵Markus Elfring
in kvmppc_allocate_hpt() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: BookE: Use vma_pages functionThomas Meyer
Use vma_pages function on vma object instead of explicit computation. Found by coccinelle spatch "api/vma_pages.cocci" Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Book3S HV: Use ARRAY_SIZE macroThomas Meyer
Use ARRAY_SIZE macro, rather than explicitly coding some variant of it yourself. Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e 's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\) /ARRAY_SIZE(\1)/g' and manual check/verification. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Book3S HV: Handle unexpected interrupts betterPaul Mackerras
At present, if an interrupt (i.e. an exception or trap) occurs in the code where KVM is switching the MMU to or from guest context, we jump to kvmppc_bad_host_intr, where we simply spin with interrupts disabled. In this situation, it is hard to debug what happened because we get no indication as to which interrupt occurred or where. Typically we get a cascade of stall and soft lockup warnings from other CPUs. In order to get more information for debugging, this adds code to create a stack frame on the emergency stack and save register values to it. We start half-way down the emergency stack in order to give ourselves some chance of being able to do a stack trace on secondary threads that are already on the emergency stack. On POWER7 or POWER8, we then just spin, as before, because we don't know what state the MMU context is in or what other threads are doing, and we can't switch back to host context without coordinating with other threads. On POWER9 we can do better; there we load up the host MMU context and jump to C code, which prints an oops message to the console and panics. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Book3S: Protect kvmppc_gpa_to_ua() with SRCUAlexey Kardashevskiy
kvmppc_gpa_to_ua() accesses KVM memory slot array via srcu_dereference_check() and this produces warnings from RCU like below. This extends the existing srcu_read_lock/unlock to cover that kvmppc_gpa_to_ua() as well. We did not hit this before as this lock is not needed for the realmode handlers and hash guests would use the realmode path all the time; however the radix guests are always redirected to the virtual mode handlers and hence the warning. [ 68.253798] ./include/linux/kvm_host.h:575 suspicious rcu_dereference_check() usage! [ 68.253799] other info that might help us debug this: [ 68.253802] rcu_scheduler_active = 2, debug_locks = 1 [ 68.253804] 1 lock held by qemu-system-ppc/6413: [ 68.253806] #0: (&vcpu->mutex){+.+.}, at: [<c00800000e3c22f4>] vcpu_load+0x3c/0xc0 [kvm] [ 68.253826] stack backtrace: [ 68.253830] CPU: 92 PID: 6413 Comm: qemu-system-ppc Tainted: G W 4.14.0-rc3-00553-g432dcba58e9c-dirty #72 [ 68.253833] Call Trace: [ 68.253839] [c000000fd3d9f790] [c000000000b7fcc8] dump_stack+0xe8/0x160 (unreliable) [ 68.253845] [c000000fd3d9f7d0] [c0000000001924c0] lockdep_rcu_suspicious+0x110/0x180 [ 68.253851] [c000000fd3d9f850] [c0000000000e825c] kvmppc_gpa_to_ua+0x26c/0x2b0 [ 68.253858] [c000000fd3d9f8b0] [c00800000e3e1984] kvmppc_h_put_tce+0x12c/0x2a0 [kvm] Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Book3S HV: POWER9 more doorbell fixesNicholas Piggin
- Add another case where msgsync is required. - Required barrier sequence for global doorbells is msgsync ; lwsync When msgsnd is used for IPIs to other cores, msgsync must be executed by the target to order stores performed on the source before its msgsnd (provided the source executes the appropriate sync). Fixes: 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9") Cc: stable@vger.kernel.org # v4.10+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Fix oops when checking KVM_CAP_PPC_HTMGreg Kurz
The following program causes a kernel oops: #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <sys/ioctl.h> #include <linux/kvm.h> main() { int fd = open("/dev/kvm", O_RDWR); ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM); } This happens because when using the global KVM fd with KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets called with a NULL kvm argument, which gets dereferenced in is_kvmppc_hv_enabled(). Spotted while reading the code. Let's use the hv_enabled fallback variable, like everywhere else in this function. Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-12KVM: x86: extend usage of RET_MMIO_PF_* constantsPaolo Bonzini
The x86 MMU if full of code that returns 0 and 1 for retry/emulate. Use the existing RET_MMIO_PF_RETRY/RET_MMIO_PF_EMULATE enum, renaming it to drop the MMIO part. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: nSVM: fix SMI injection in guest modeLadi Prosek
Entering SMM while running in guest mode wasn't working very well because several pieces of the vcpu state were left set up for nested operation. Some of the issues observed: * L1 was getting unexpected VM exits (using L1 interception controls but running in SMM execution environment) * MMU was confused (walk_mmu was still set to nested_mmu) * INTERCEPT_SMI was not emulated for L1 (KVM never injected SVM_EXIT_SMI) Intel SDM actually prescribes the logical processor to "leave VMX operation" upon entering SMM in 34.14.1 Default Treatment of SMI Delivery. AMD doesn't seem to document this but they provide fields in the SMM state-save area to stash the current state of SVM. What we need to do is basically get out of guest mode for the duration of SMM. All this completely transparent to L1, i.e. L1 is not given control and no L1 observable state changes. To avoid code duplication this commit takes advantage of the existing nested vmexit and run functionality, perhaps at the cost of efficiency. To get out of guest mode, nested_svm_vmexit is called, unchanged. Re-entering is performed using enter_svm_guest_mode. This commit fixes running Windows Server 2016 with Hyper-V enabled in a VM with OVMF firmware (OVMF_CODE-need-smm.fd). Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: nSVM: refactor nested_svm_vmrunLadi Prosek
Analogous to 858e25c06fb0 ("kvm: nVMX: Refactor nested_vmx_run()"), this commit splits nested_svm_vmrun into two parts. The newly introduced enter_svm_guest_mode modifies the vcpu state to transition from L1 to L2, while the code left in nested_svm_vmrun handles the VMRUN instruction. Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>