summaryrefslogtreecommitdiff
path: root/arch/powerpc/kvm/book3s_64_vio.c
AgeCommit message (Collapse)Author
2019-04-05KVM: PPC: Book3S: Protect memslots while validating user addressAlexey Kardashevskiy
Guest physical to user address translation uses KVM memslots and reading these requires holding the kvm->srcu lock. However recently introduced kvmppc_tce_validate() broke the rule (see the lockdep warning below). This moves srcu_read_lock(&vcpu->kvm->srcu) earlier to protect kvmppc_tce_validate() as well. ============================= WARNING: suspicious RCU usage 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted ----------------------------- include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by qemu-system-ppc/8020: #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm] stack backtrace: CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Call Trace: [c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable) [c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170 [c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290 [c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm] [c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm] [c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv] [c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] [c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm] [c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] [c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm] [c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930 [c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110 [c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80 [c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70 Fixes: 42de7b9e2167 ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22KVM: PPC: Book3S: Improve KVM reference countingAlexey Kardashevskiy
The anon fd's ops releases the KVM reference in the release hook. However we reference the KVM object after we create the fd so there is small window when the release function can be called and dereferenced the KVM object which potentially may free it. It is not a problem at the moment as the file is created and KVM is referenced under the KVM lock and the release function obtains the same lock before dereferencing the KVM (although the lock is not held when calling kvm_put_kvm()) but it is potentially fragile against future changes. This references the KVM object before creating a file. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19KVM: PPC: Release all hardware TCE tables attached to a groupAlexey Kardashevskiy
The SPAPR TCE KVM device references all hardware IOMMU tables assigned to some IOMMU group to ensure that in-kernel KVM acceleration of H_PUT_TCE can work. The tables are references when an IOMMU group gets registered with the VFIO KVM device by the KVM_DEV_VFIO_GROUP_ADD ioctl; KVM_DEV_VFIO_GROUP_DEL calls into the dereferencing code in kvm_spapr_tce_release_iommu_group() which walks through the list of LIOBNs, finds a matching IOMMU table and calls kref_put() when found. However that code stops after the very first successful derefencing leaving other tables referenced till the SPAPR TCE KVM device is destroyed which normally happens on guest reboot or termination so if we do hotplug and unplug in a loop, we are leaking IOMMU tables here. This removes a premature return to let kvm_spapr_tce_release_iommu_group() find and dereference all attached tables. Fixes: 121f80ba68f ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-21powerpc/vfio/iommu/kvm: Do not pin device memoryAlexey Kardashevskiy
This new memory does not have page structs as it is not plugged to the host so gup() will fail anyway. This adds 2 helpers: - mm_iommu_newdev() to preregister the "memory device" memory so the rest of API can still be used; - mm_iommu_is_devmem() to know if the physical address is one of thise new regions which we must avoid unpinning of. This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test if the memory is device memory to avoid pfn_to_page(). This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which does delayed pages dirtying. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-20KVM: PPC: Optimize clearing TCEs for sparse tablesAlexey Kardashevskiy
The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE table and a table with userspace addresses. These tables are radix trees, we allocate indirect levels when they are written to. Since the memory allocation is problematic in real mode, we have 2 accessors to the entries: - for virtual mode: it allocates the memory and it is always expected to return non-NULL; - fr real mode: it does not allocate and can return NULL. Also, DMA windows can span to up to 55 bits of the address space and since we never have this much RAM, such windows are sparse. However currently the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory. Since we maintain a userspace addresses table for VFIO which is a mirror of the hardware table, we can use it to know which parts of the DMA window have not been mapped and skip these so does this patch. The bare metal systems do not have this problem as they use a bypass mode of a PHB which maps RAM directly. This helps a lot with sparse DMA windows, reducing the shutdown time from about 3 minutes per 1 billion TCEs to a few seconds for 32GB sparse guest. Just skipping the last level seems to be good enough. As non-allocating accessor is used now in virtual mode as well, rename it from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-09KVM: PPC: Remove redundand permission bits removalAlexey Kardashevskiy
The kvmppc_gpa_to_ua() helper itself takes care of the permission bits in the TCE and yet every single caller removes them. This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs (which are GPAs + TCE permission bits) to make the callers simpler. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Propagate errors to the guest when failed instead of ignoringAlexey Kardashevskiy
At the moment if the PUT_TCE{_INDIRECT} handlers fail to update the hardware tables, we print a warning once, clear the entry and continue. This is so as at the time the assumption was that if a VFIO device is hotplugged into the guest, and the userspace replays virtual DMA mappings (i.e. TCEs) to the hardware tables and if this fails, then there is nothing useful we can do about it. However the assumption is not valid as these handlers are not called for TCE replay (VFIO ioctl interface is used for that) and these handlers are for new TCEs. This returns an error to the guest if there is a request which cannot be processed. By now the only possible failure must be H_TOO_HARD. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09KVM: PPC: Validate TCEs against preregistered memory page sizesAlexey Kardashevskiy
The userspace can request an arbitrary supported page size for a DMA window and this works fine as long as the mapped memory is backed with the pages of the same or bigger size; if this is not the case, mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with dangerously incorrect TCEs. However since it is quite easy to misconfigure the KVM and we do not do reverts to all changes made to TCE tables if an error happens in a middle, we better do the acceptable page size validation before we even touch the tables. This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes against the preregistered memory page sizes. Since the new check uses real/virtual mode helpers, this renames kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode case and mirrors it for the virtual mode under the old name. The real mode handler is not used for the virtual mode as: 1. it uses _lockless() list traversing primitives instead of RCU; 2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which virtual mode does not have to use and since on POWER9+radix only virtual mode handlers actually work, we do not want to slow down that path even a bit. This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators are static now. From now on the attempts on mapping IOMMU pages bigger than allowed will result in KVM exit. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [mpe: Fix KVM_HV=n build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-02KVM: PPC: Inform the userspace about TCE update failuresAlexey Kardashevskiy
We return H_TOO_HARD from TCE update handlers when we think that the next handler (realmode -> virtual mode -> user mode) has a chance to handle the request; H_HARDWARE/H_CLOSED otherwise. This changes the handlers to return H_TOO_HARD on every error giving the userspace an opportunity to handle any request or at least log them all. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-02KVM: PPC: Validate all tces before updating tablesAlexey Kardashevskiy
The KVM TCE handlers are written in a way so they fail when either something went horribly wrong or the userspace did some obvious mistake such as passing a misaligned address. We are going to enhance the TCE checker to fail on attempts to map bigger IOMMU page than the underlying pinned memory so let's valitate TCE beforehand. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-19Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull first set of KVM updates from Paolo Bonzini: "PPC: - minor code cleanups x86: - PCID emulation and CR3 caching for shadow page tables - nested VMX live migration - nested VMCS shadowing - optimized IPI hypercall - some optimizations ARM will come next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits) kvm: x86: Set highest physical address bits in non-present/reserved SPTEs KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c KVM: X86: Implement PV IPIs in linux guest KVM: X86: Add kvm hypervisor init time platform setup callback KVM: X86: Implement "send IPI" hypercall KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs() KVM: x86: Skip pae_root shadow allocation if tdp enabled KVM/MMU: Combine flushing remote tlb in mmu_set_spte() KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup KVM: vmx: move struct host_state usage to struct loaded_vmcs KVM: vmx: compute need to reload FS/GS/LDT on demand KVM: nVMX: remove a misleading comment regarding vmcs02 fields KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state() KVM: vmx: add dedicated utility to access guest's kernel_gs_base KVM: vmx: track host_state.loaded using a loaded_vmcs pointer KVM: vmx: refactor segmentation code in vmx_save_host_state() kvm: nVMX: Fix fault priority for VMX operations kvm: nVMX: Fix fault vector for VMX operation at CPL > 0 ...
2018-08-13Merge branch 'fixes' into nextMichael Ellerman
Merge our fixes branch from the 4.18 cycle to resolve some minor conflicts.
2018-08-06Merge tag 'v4.18-rc6' into HEADPaolo Bonzini
Pull bug fixes into the KVM development tree to avoid nasty conflicts.
2018-07-30powerpc: remove unnecessary inclusion of asm/tlbflush.hChristophe Leroy
asm/tlbflush.h is only needed for: - using functions xxx_flush_tlb_xxx() - using MMU_NO_CONTEXT - including asm-generic/pgtable.h Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-18KVM: PPC: Check if IOMMU page is contained in the pinned physical pageAlexey Kardashevskiy
A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-18KVM: PPC: Book3S: Fix matching of hardware and emulated TCE tablesAlexey Kardashevskiy
When attaching a hardware table to LIOBN in KVM, we match table parameters such as page size, table offset and table size. However the tables are created via very different paths - VFIO and KVM - and the VFIO path goes through the platform code which has minimum TCE page size requirement (which is 4K but since we allocate memory by pages and cannot avoid alignment anyway, we align to 64k pages for powernv_defconfig). So when we match the tables, one might be bigger that the other which means the hardware table cannot get attached to LIOBN and DMA mapping fails. This removes the table size alignment from the guest visible table. This does not affect the memory allocation which is still aligned - kvmppc_tce_pages() takes care of this. This relaxes the check we do when attaching tables to allow the hardware table be bigger than the guest visible table. Ideally we want the KVM table to cover the same space as the hardware table does but since the hardware table may use multiple levels, and all levels must use the same table size (IODA2 design), the area it can actually cover might get very different from the window size which the guest requested, even though the guest won't map it all. Fixes: ca1fc489cf "KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages" Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-07-16KVM: PPC: Make iommu_table::it_userspace big endianAlexey Kardashevskiy
We are going to reuse multilevel TCE code for the userspace copy of the TCE table and since it is big endian, let's make the copy big endian too. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-17KVM: PPC: Book3S: Change return type to vm_fault_tSouptick Joarder
Use new return type vm_fault_t for fault handler in struct vm_operations_struct. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f422059ae ("mm: change return type to vm_fault_t") Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17KVM: PPC: Book3S: Check KVM_CREATE_SPAPR_TCE_64 parametersAlexey Kardashevskiy
Although it does not seem possible to break the host by passing bad parameters when creating a TCE table in KVM, it is still better to get an early clear indication of that than debugging weird effect this might bring. This adds some sanity checks that the page size is 4KB..16GB as this is what the actual LoPAPR supports and that the window actually fits 64bit space. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller ↵Alexey Kardashevskiy
physical pages At the moment we only support in the host the IOMMU page sizes which the guest is aware of, which is 4KB/64KB/16MB. However P9 does not support 16MB IOMMU pages, 2MB and 1GB pages are supported instead. We can still emulate bigger guest pages (for example 16MB) with smaller host pages (4KB/64KB/2MB). This allows the physical IOMMU pages to use a page size smaller or equal than the guest visible IOMMU page size. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17KVM: PPC: Book3S: Use correct page shift in H_STUFF_TCEAlexey Kardashevskiy
The other TCE handlers use page shift from the guest visible TCE table (described by kvmppc_spapr_tce_iommu_table) so let's make H_STUFF_TCE handlers do the same thing. This should cause no behavioral change now but soon we will allow the iommu_table::it_page_shift being different from from the emulated table page size so this will play a role. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14KVM: PPC: Book3S: Protect kvmppc_gpa_to_ua() with SRCUAlexey Kardashevskiy
kvmppc_gpa_to_ua() accesses KVM memory slot array via srcu_dereference_check() and this produces warnings from RCU like below. This extends the existing srcu_read_lock/unlock to cover that kvmppc_gpa_to_ua() as well. We did not hit this before as this lock is not needed for the realmode handlers and hash guests would use the realmode path all the time; however the radix guests are always redirected to the virtual mode handlers and hence the warning. [ 68.253798] ./include/linux/kvm_host.h:575 suspicious rcu_dereference_check() usage! [ 68.253799] other info that might help us debug this: [ 68.253802] rcu_scheduler_active = 2, debug_locks = 1 [ 68.253804] 1 lock held by qemu-system-ppc/6413: [ 68.253806] #0: (&vcpu->mutex){+.+.}, at: [<c00800000e3c22f4>] vcpu_load+0x3c/0xc0 [kvm] [ 68.253826] stack backtrace: [ 68.253830] CPU: 92 PID: 6413 Comm: qemu-system-ppc Tainted: G W 4.14.0-rc3-00553-g432dcba58e9c-dirty #72 [ 68.253833] Call Trace: [ 68.253839] [c000000fd3d9f790] [c000000000b7fcc8] dump_stack+0xe8/0x160 (unreliable) [ 68.253845] [c000000fd3d9f7d0] [c0000000001924c0] lockdep_rcu_suspicious+0x110/0x180 [ 68.253851] [c000000fd3d9f850] [c0000000000e825c] kvmppc_gpa_to_ua+0x26c/0x2b0 [ 68.253858] [c000000fd3d9f8b0] [c00800000e3e1984] kvmppc_h_put_tce+0x12c/0x2a0 [kvm] Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-30KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables listPaul Mackerras
Al Viro pointed out that while one thread of a process is executing in kvm_vm_ioctl_create_spapr_tce(), another thread could guess the file descriptor returned by anon_inode_getfd() and close() it before the first thread has added it to the kvm->arch.spapr_tce_tables list. That highlights a more general problem: there is no mutual exclusion between writers to the spapr_tce_tables list, leading to the possibility of the list becoming corrupted, which could cause a host kernel crash. To fix the mutual exclusion problem, we add a mutex_lock/unlock pair around the list_del_rce in kvm_spapr_tce_release(). Also, this moves the call to anon_inode_getfd() inside the region protected by the kvm->lock mutex, after we have done the check for a duplicate LIOBN. This means that if another thread does guess the file descriptor and closes it, its call to kvm_spapr_tce_release() will not do any harm because it will have to wait until the first thread has released kvm->lock. With this, there are no failure points in kvm_vm_ioctl_create_spapr_tce() after the call to anon_inode_getfd(). The other things that the second thread could do with the guessed file descriptor are to mmap it or to pass it as a parameter to a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl on a KVM device fd. An mmap call won't cause any harm because kvm_spapr_tce_mmap() and kvm_spapr_tce_fault() don't access the spapr_tce_tables list or the kvmppc_spapr_tce_table.list field, and the fields that they do use have been properly initialized by the time of the anon_inode_getfd() call. The KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl calls kvm_spapr_tce_attach_iommu_group(), which scans the spapr_tce_tables list looking for the kvmppc_spapr_tce_table struct corresponding to the fd given as the parameter. Either it will find the new entry or it won't; if it doesn't, it just returns an error, and if it does, it will function normally. So, in each case there is no harmful effect. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-25KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce()Paul Mackerras
Nixiaoming pointed out that there is a memory leak in kvm_vm_ioctl_create_spapr_tce() if the call to anon_inode_getfd() fails; the memory allocated for the kvmppc_spapr_tce_table struct is not freed, and nor are the pages allocated for the iommu tables. In addition, we have already incremented the process's count of locked memory pages, and this doesn't get restored on error. David Hildenbrand pointed out that there is a race in that the function checks early on that there is not already an entry in the stt->iommu_tables list with the same LIOBN, but an entry with the same LIOBN could get added between then and when the new entry is added to the list. This fixes all three problems. To simplify things, we now call anon_inode_getfd() before placing the new entry in the list. The check for an existing entry is done while holding the kvm->lock mutex, immediately before adding the new entry to the list. Finally, on failure we now call kvmppc_account_memlimit to decrement the process's count of locked memory pages. Reported-by: Nixiaoming <nixiaoming@huawei.com> Reported-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-20KVM: PPC: VFIO: Add in-kernel acceleration for VFIOAlexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO without passing them to user space which saves time on switching to user space and back. This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. KVM tries to handle a TCE request in the real mode, if failed it passes the request to the virtual mode to complete the operation. If it a virtual mode handler fails, the request is passed to the user space; this is not expected to happen though. To avoid dealing with page use counters (which is tricky in real mode), this only accelerates SPAPR TCE IOMMU v2 clients which are required to pre-register the userspace memory. The very first TCE request will be handled in the VFIO SPAPR TCE driver anyway as the userspace view of the TCE table (iommu_table::it_userspace) is not allocated till the very first mapping happens and we cannot call vmalloc in real mode. If we fail to update a hardware IOMMU table unexpected reason, we just clear it and move on as there is nothing really we can do about it - for example, if we hot plug a VFIO device to a guest, existing TCE tables will be mirrored automatically to the hardware and there is no interface to report to the guest about possible failures. This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd and associates a physical IOMMU table with the SPAPR TCE table (which is a guest view of the hardware IOMMU table). The iommu_table object is cached and referenced so we do not have to look up for it in real mode. This does not implement the UNSET counterpart as there is no use for it - once the acceleration is enabled, the existing userspace won't disable it unless a VFIO container is destroyed; this adds necessary cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user space. This adds real mode version of WARN_ON_ONCE() as the generic version causes problems with rcu_sched. Since we testing what vmalloc_to_phys() returns in the code, this also adds a check for already existing vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect(). This finally makes use of vfio_external_user_iommu_id() which was introduced quite some time ago and was considered for removal. Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20KVM: PPC: Pass kvm* to kvmppc_find_table()Alexey Kardashevskiy
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm* there. This will be used in the following patches where we will be attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than to VCPU). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20KVM: PPC: Align the table size to system page sizeAlexey Kardashevskiy
At the moment the userspace can request a table smaller than a page size and this value will be stored as kvmppc_spapr_tce_table::size. However the actual allocated size will still be aligned to the system page size as alloc_page() is used there. This aligns the table size up to the system page size. It should not change the existing behaviour but when in-kernel TCE acceleration patchset reaches the upstream kernel, this will allow small TCE tables be accelerated as well: PCI IODA iommu_table allocator already aligns the size and, without this patch, an IOMMU group won't attach to LIOBN due to the mismatching table size. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-24mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmfDave Jiang
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-09KVM: PPC: Book 3S: Fix error return in kvm_vm_ioctl_create_spapr_tce()Wei Yongjun
Fix to return error code -ENOMEM from the memory alloc error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-07-14powerpc/kvm: Clarify __user annotationsDaniel Axtens
kvmppc_h_put_tce_indirect labels a u64 pointer as __user. It also labelled the u64 where get_user puts the result as __user. This isn't a pointer and so doesn't need to be labelled __user. Split the u64 value definition onto a new line to make it clear that it doesn't get the annotation. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-22KVM: PPC: Create a virtual-mode only TCE table handlersAlexey Kardashevskiy
Upcoming in-kernel VFIO acceleration needs different handling in real and virtual modes which makes it hard to support both modes in the same handler. This creates a copy of kvmppc_rm_h_stuff_tce and kvmppc_rm_h_put_tce in addition to the existing kvmppc_rm_h_put_tce_indirect. This also fixes linker breakage when only PR KVM was selected (leaving HV KVM off): the kvmppc_h_put_tce/kvmppc_h_stuff_tce functions would not compile at all and the linked would fail. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-19Merge tag 'powerpc-4.6-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "This was delayed a day or two by some build-breakage on old toolchains which we've now fixed. There's two PCI commits both acked by Bjorn. There's one commit to mm/hugepage.c which is (co)authored by Kirill. Highlights: - Restructure Linux PTE on Book3S/64 to Radix format from Paul Mackerras - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh Kumar K.V - Add POWER9 cputable entry from Michael Neuling - FPU/Altivec/VSX save/restore optimisations from Cyril Bur - Add support for new ftrace ABI on ppc64le from Torsten Duwe Various cleanups & minor fixes from: - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy, Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh. General: - atomics: Allow architectures to define their own __atomic_op_* helpers from Boqun Feng - Implement atomic{, 64}_*_return_* variants and acquire/release/ relaxed variants for (cmp)xchg from Boqun Feng - Add powernv_defconfig from Jeremy Kerr - Fix BUG_ON() reporting in real mode from Balbir Singh - Add xmon command to dump OPAL msglog from Andrew Donnellan - Add xmon command to dump process/task similar to ps(1) from Douglas Miller - Clean up memory hotplug failure paths from David Gibson pci/eeh: - Redesign SR-IOV on PowerNV to give absolute isolation between VFs from Wei Yang. - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan. - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang - PCI: Add pcibios_bus_add_device() weak function from Wei Yang - MAINTAINERS: Update EEH details and maintainership from Russell Currey cxl: - Support added to the CXL driver for running on both bare-metal and hypervisor systems, from Christophe Lombard and Frederic Barrat. - Ignore probes for virtual afu pci devices from Vaibhav Jain perf: - Export Power8 generic and cache events to sysfs from Sukadev Bhattiprolu - hv-24x7: Fix usage with chip events, display change in counter values, display domain indices in sysfs, eliminate domain suffix in event names, from Sukadev Bhattiprolu Freescale: - Updates from Scott: "Highlights include 8xx optimizations, 32-bit checksum optimizations, 86xx consolidation, e5500/e6500 cpu hotplug, more fman and other dt bits, and minor fixes/cleanup" * tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits) powerpc: Fix unrecoverable SLB miss during restore_math() powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers powerpc/rcpm: Fix build break when SMP=n powerpc/book3e-64: Use hardcoded mttmr opcode powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible powerpc/T104xRDB: add tdm riser card node to device tree powerpc32: PAGE_EXEC required for inittext powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s) powerpc/86xx: Introduce and use common dtsi powerpc/86xx: Update device tree powerpc/86xx: Move dts files to fsl directory powerpc/86xx: Switch to kconfig fragments approach powerpc/86xx: Update defconfigs powerpc/86xx: Consolidate common platform code powerpc32: Remove one insn in mulhdu powerpc32: small optimisation in flush_icache_range() powerpc: Simplify test in __dma_sync() powerpc32: move xxxxx_dcache_range() functions inline powerpc32: Remove clear_pages() and define clear_page() inline ...
2016-03-03powerpc/mm: Move hash related mmu-*.h headers to book3s/Aneesh Kumar K.V
No code changes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-02KVM: PPC: Add support for 64bit TCE windowsAlexey Kardashevskiy
The existing KVM_CREATE_SPAPR_TCE only supports 32bit windows which is not enough for directly mapped windows as the guest can get more than 4GB. This adds KVM_CREATE_SPAPR_TCE_64 ioctl and advertises it via KVM_CAP_SPAPR_TCE_64 capability. The table size is checked against the locked memory limit. Since 64bit windows are to support Dynamic DMA windows (DDW), let's add @bus_offset and @page_shift which are also required by DDW. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-03-02KVM: PPC: Add @page_shift to kvmppc_spapr_tce_tableAlexey Kardashevskiy
At the moment the kvmppc_spapr_tce_table struct can only describe 4GB windows and handle fixed size (4K) pages. Dynamic DMA windows support more so these limits need to be extended. This replaces window_size (in bytes, 4GB max) with page_shift (32bit) and size (64bit, in pages). This should cause no behavioural change as this is changing the internal structures only - the user interface still only allows one to create a 32-bit table with 4KiB pages at this stage. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16KVM: PPC: Add support for multiple-TCE hcallsAlexey Kardashevskiy
This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition between kernel and user space. The current implementation of kvmppc_h_stuff_tce() allows it to be executed in both real and virtual modes so there is one helper. The kvmppc_rm_h_put_tce_indirect() needs to translate the guest address to the host address and since the translation is different, there are 2 helpers - one for each mode. This implements the KVM_CAP_PPC_MULTITCE capability. When present, the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE if these are enabled by the userspace via KVM_CAP_PPC_ENABLE_HCALL. If they can not be handled by the kernel, they are passed on to the user space. The user space still has to have an implementation for these. Both HV and PR-syle KVM are supported. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4KAlexey Kardashevskiy
SPAPR_TCE_SHIFT is used in few places only and since IOMMU_PAGE_SHIFT_4K can be easily used instead, remove SPAPR_TCE_SHIFT. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16KVM: PPC: Account TCE-containing pages in locked_vmAlexey Kardashevskiy
At the moment pages used for TCE tables (in addition to pages addressed by TCEs) are not counted in locked_vm counter so a malicious userspace tool can call ioctl(KVM_CREATE_SPAPR_TCE) as many times as RLIMIT_NOFILE and lock a lot of memory. This adds counting for pages used for TCE tables. This counts the number of pages required for a table plus pages for the kvmppc_spapr_tce_table struct (TCE table descriptor) itself. This changes release_spapr_tce_table() to store @npages on stack to avoid calling kvmppc_stt_npages() in the loop (tiny optimization, probably). This does not change the amount of used memory. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16KVM: PPC: Use RCU for arch.spapr_tce_tablesAlexey Kardashevskiy
At the moment only spapr_tce_tables updates are protected against races but not lookups. This fixes missing protection by using RCU for the list. As lookups also happen in real mode, this uses list_for_each_entry_lockless() (which is expected not to access any vmalloc'd memory). This converts release_spapr_tce_table() to a RCU scheduled handler. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2013-08-26ppc: kvm: use anon_inode_getfd() with O_CLOEXEC flagYann Droneaud
KVM uses anon_inode_get() to allocate file descriptors as part of some of its ioctls. But those ioctls are lacking a flag argument allowing userspace to choose options for the newly opened file descriptor. In such case it's advised to use O_CLOEXEC by default so that userspace is allowed to choose, without race, if the file descriptor is going to be inherited across exec(). This patch set O_CLOEXEC flag on all file descriptors created with anon_inode_getfd() to not leak file descriptors across exec(). Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com Reviewed-by: Alexander Graf <agraf@suse.de> Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-09constify a bunch of struct file_operations instancesAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-06kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVMBenjamin Herrenschmidt
There is nothing in the code for emulating TCE tables in the kernel that prevents it from working on "PR" KVM... other than ifdef's and location of the code. This and moves the bulk of the code there to a new file called book3s_64_vio.c. This speeds things up a bit on my G5. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [agraf: fix for hv kvm, 32bit, whitespace] Signed-off-by: Alexander Graf <agraf@suse.de>