summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-10-05KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same VMPaul Mackerras
This adds a mode where the vcore scheduling logic in HV KVM limits itself to scheduling only virtual cores from the same VM on any given physical core. This is enabled via a new module parameter on the kvm-hv module called "one_vm_per_core". For this to work on POWER9, it is necessary to set indep_threads_mode=N. (On POWER8, hardware limitations mean that KVM is never in independent threads mode, regardless of the indep_threads_mode setting.) Thus the settings needed for this to work are: 1. The host is in SMT1 mode. 2. On POWER8, the host is not in 2-way or 4-way static split-core mode. 3. On POWER9, the indep_threads_mode parameter is N. 4. The one_vm_per_core parameter is Y. With these settings, KVM can run up to 4 vcpus on a core at the same time on POWER9, or up to 8 vcpus on POWER8 (depending on the guest threading mode), and will ensure that all of the vcpus belong to the same VM. This is intended for use in security-conscious settings where users are concerned about possible side-channel attacks between threads which could perhaps enable one VM to attack another VM on the same core, or the host. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-05KVM: PPC: Book3S PR: Exiting split hack mode needs to fixup both PC and LRCameron Kaiser
When an OS (currently only classic Mac OS) is running in KVM-PR and makes a linked jump from code with split hack addressing enabled into code that does not, LR is not correctly updated and reflects the previously munged PC. To fix this, this patch undoes the address munge when exiting split hack mode so that code relying on LR being a proper address will now execute. This does not affect OS X or other operating systems running on KVM-PR. Signed-off-by: Cameron Kaiser <spectre@floodgap.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-04Merge tag 'kvm-s390-next-4.20-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD KVM: s390: Features for 4.20 - Initial version of AP crypto virtualization via vfio-mdev - Set the host program identifier - Optimize page table locking
2018-10-04kvm: nVMX: fix entry with pending interrupt if APICv is enabledPaolo Bonzini
Commit b5861e5cf2fcf83031ea3e26b0a69d887adf7d21 introduced a check on the interrupt-window and NMI-window CPU execution controls in order to inject an external interrupt vmexit before the first guest instruction executes. However, when APIC virtualization is enabled the host does not need a vmexit in order to inject an interrupt at the next interrupt window; instead, it just places the interrupt vector in RVI and the processor will inject it as soon as possible. Therefore, on machines with APICv it is not enough to check the CPU execution controls: the same scenario can also happen if RVI>vPPR. Fixes: b5861e5cf2fcf83031ea3e26b0a69d887adf7d21 Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-04KVM: VMX: hide flexpriority from guest when disabled at the module levelPaolo Bonzini
As of commit 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings"), KVM will disable VIRTUALIZE_APIC_ACCESSES when a nested guest writes APIC_BASE MSR and kvm-intel.flexpriority=0, whereas previously KVM would allow a nested guest to enable VIRTUALIZE_APIC_ACCESSES so long as it's supported in hardware. That is, KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest but doesn't (always) allow setting it when kvm-intel.flexpriority=0, and may even initially allow the control and then clear it when the nested guest writes APIC_BASE MSR, which is decidedly odd even if it doesn't cause functional issues. Hide the control completely when the module parameter is cleared. reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Fixes: 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings") Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-04KVM: VMX: check for existence of secondary exec controls before accessingSean Christopherson
Return early from vmx_set_virtual_apic_mode() if the processor doesn't support VIRTUALIZE_APIC_ACCESSES or VIRTUALIZE_X2APIC_MODE, both of which reside in SECONDARY_VM_EXEC_CONTROL. This eliminates warnings due to VMWRITEs to SECONDARY_VM_EXEC_CONTROL (VMCS field 401e) failing on processors without secondary exec controls. Remove the similar check for TPR shadowing as it is incorporated in the flexpriority_enabled check and the APIC-related code in vmx_update_msr_bitmap() is further gated by VIRTUALIZE_X2APIC_MODE. Reported-by: Gerhard Wiesinger <redhat@wiesinger.com> Fixes: 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings") Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01KVM: x86: fix L1TF's MMIO GFN calculationSean Christopherson
One defense against L1TF in KVM is to always set the upper five bits of the *legal* physical address in the SPTEs for non-present and reserved SPTEs, e.g. MMIO SPTEs. In the MMIO case, the GFN of the MMIO SPTE may overlap with the upper five bits that are being usurped to defend against L1TF. To preserve the GFN, the bits of the GFN that overlap with the repurposed bits are shifted left into the reserved bits, i.e. the GFN in the SPTE will be split into high and low parts. When retrieving the GFN from the MMIO SPTE, e.g. to check for an MMIO access, get_mmio_spte_gfn() unshifts the affected bits and restores the original GFN for comparison. Unfortunately, get_mmio_spte_gfn() neglects to mask off the reserved bits in the SPTE that were used to store the upper chunk of the GFN. As a result, KVM fails to detect MMIO accesses whose GPA overlaps the repurprosed bits, which in turn causes guest panics and hangs. Fix the bug by generating a mask that covers the lower chunk of the GFN, i.e. the bits that aren't shifted by the L1TF mitigation. The alternative approach would be to explicitly zero the five reserved bits that are used to store the upper chunk of the GFN, but that requires additional run-time computation and makes an already-ugly bit of code even more inscrutable. I considered adding a WARN_ON_ONCE(low_phys_bits-1 <= PAGE_SHIFT) to warn if GENMASK_ULL() generated a nonsensical value, but that seemed silly since that would mean a system that supports VMX has less than 18 bits of physical address space... Reported-by: Sakari Ailus <sakari.ailus@iki.fi> Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs") Cc: Junaid Shahid <junaids@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Reviewed-by: Junaid Shahid <junaids@google.com> Tested-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01tools/kvm_stat: cut down decimal places in update interval dialogStefan Raspl
We currently display the default number of decimal places for floats in _show_set_update_interval(), which is quite pointless. Cutting down to a single decimal place. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGSLiran Alon
L2 IA32_BNDCFGS should be updated with vmcs12->guest_bndcfgs only when VM_ENTRY_LOAD_BNDCFGS is specified in vmcs12->vm_entry_controls. Otherwise, L2 IA32_BNDCFGS should be set to vmcs01->guest_bndcfgs which is L1 IA32_BNDCFGS. Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01KVM: x86: Do not use kvm_x86_ops->mpx_supported() directlyLiran Alon
Commit a87036add092 ("KVM: x86: disable MPX if host did not enable MPX XSAVE features") introduced kvm_mpx_supported() to return true iff MPX is enabled in the host. However, that commit seems to have missed replacing some calls to kvm_x86_ops->mpx_supported() to kvm_mpx_supported(). Complete original commit by replacing remaining calls to kvm_mpx_supported(). Fixes: a87036add092 ("KVM: x86: disable MPX if host did not enable MPX XSAVE features") Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabledLiran Alon
Before this commit, KVM exposes MPX VMX controls to L1 guest only based on if KVM and host processor supports MPX virtualization. However, these controls should be exposed to guest only in case guest vCPU supports MPX. Without this change, a L1 guest running with kernel which don't have commit 691bd4340bef ("kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS") asserts in QEMU on the following: qemu-kvm: error: failed to set MSR 0xd90 to 0x0 qemu-kvm: .../qemu-2.10.0/target/i386/kvm.c:1801 kvm_put_msrs: Assertion 'ret == cpu->kvm_msr_buf->nmsrs failed' This is because L1 KVM kvm_init_msr_list() will see that vmx_mpx_supported() (As it only checks MPX VMX controls support) and therefore KVM_GET_MSR_INDEX_LIST IOCTL will include MSR_IA32_BNDCFGS. However, later when L1 will attempt to set this MSR via KVM_SET_MSRS IOCTL, it will fail because !guest_cpuid_has_mpx(vcpu). Therefore, fix the issue by exposing MPX VMX controls to L1 guest only when vCPU supports MPX. Fixes: 36be0b9deb23 ("KVM: x86: Add nested virtualization support for MPX") Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01Merge branch 'apv11' of ↵Christian Borntraeger
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kernelorgnext
2018-10-01s390/mm: optimize locking without huge pages in gmap_pmd_op_walk()David Hildenbrand
Right now we temporarily take the page table lock in gmap_pmd_op_walk() even though we know we won't need it (if we can never have 1mb pages mapped into the gmap). Let's make this a special case, so gmap_protect_range() and gmap_sync_dirty_log_pmd() will not take the lock when huge pages are not allowed. gmap_protect_range() is called quite frequently for managing shadow page tables in vSIE environments. Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20180806155407.15252-1-david@redhat.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2018-10-01KVM: s390: set host program identifierCollin Walling
A host program identifier (HPID) provides information regarding the underlying host environment. A level-2 (VM) guest will have an HPID denoting Linux/KVM, which is set during VCPU setup. A level-3 (VM on a VM) and beyond guest will have an HPID denoting KVM vSIE, which is set for all shadow control blocks, overriding the original value of the HPID. Signed-off-by: Collin Walling <walling@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <1535734279-10204-4-git-send-email-walling@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28s390: doc: detailed specifications for AP virtualizationTony Krowiak
This patch provides documentation describing the AP architecture and design concepts behind the virtualization of AP devices. It also includes an example of how to configure AP devices for exclusive use of KVM guests. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Message-Id: <20180925231641.4954-27-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: CPU model support for AP virtualizationTony Krowiak
Introduces two new CPU model facilities to support AP virtualization for KVM guests: 1. AP Query Configuration Information (QCI) facility is installed. This is indicated by setting facilities bit 12 for the guest. The kernel will not enable this facility for the guest if it is not set on the host. If this facility is not set for the KVM guest, then only APQNs with an APQI less than 16 will be used by a Linux guest regardless of the matrix configuration for the virtual machine. This is a limitation of the Linux AP bus. 2. AP Facilities Test facility (APFT) is installed. This is indicated by setting facilities bit 15 for the guest. The kernel will not enable this facility for the guest if it is not set on the host. If this facility is not set for the KVM guest, then no AP devices will be available to the guest regardless of the guest's matrix configuration for the virtual machine. This is a limitation of the Linux AP bus. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20180925231641.4954-26-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: device attrs to enable/disable AP interpretationTony Krowiak
Introduces two new VM crypto device attributes (KVM_S390_VM_CRYPTO) to enable or disable AP instruction interpretation from userspace via the KVM_SET_DEVICE_ATTR ioctl: * The KVM_S390_VM_CRYPTO_ENABLE_APIE attribute enables hardware interpretation of AP instructions executed on the guest. * The KVM_S390_VM_CRYPTO_DISABLE_APIE attribute disables hardware interpretation of AP instructions executed on the guest. In this case the instructions will be intercepted and pass through to the guest. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20180925231641.4954-25-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2Pierre Morel
When the guest schedules a SIE with a FORMAT-0 CRYCB, we are able to schedule it in the host with a FORMAT-2 CRYCB if the host uses FORMAT-2 Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Message-Id: <20180925231641.4954-24-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2Pierre Morel
When the guest schedules a SIE with a CRYCB FORMAT-1 CRYCB, we are able to schedule it in the host with a FORMAT-2 CRYCB if the host uses FORMAT-2. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Message-Id: <20180925231641.4954-23-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1Pierre Morel
When the guest schedules a SIE with a FORMAT-0 CRYCB, we are able to schedule it in the host with a FORMAT-1 CRYCB if the host uses FORMAT-1 or FORMAT-0. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Message-Id: <20180925231641.4954-22-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: vsie: allow CRYCB FORMAT-0Pierre Morel
When the host and the guest both use a FORMAT-0 CRYCB, we copy the guest's FORMAT-0 APCB to a shadow CRYCB for use by vSIE. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Message-Id: <20180925231641.4954-21-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: vsie: allow CRYCB FORMAT-1Pierre Morel
When the host and guest both use a FORMAT-1 CRYCB, we copy the guest's FORMAT-0 APCB to a shadow CRYCB for use by vSIE. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Message-Id: <20180925231641.4954-20-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: vsie: Allow CRYCB FORMAT-2Pierre Morel
When the guest and the host both use CRYCB FORMAT-2, we copy the guest's FORMAT-1 APCB to a FORMAT-1 shadow APCB. This patch also cleans up the shadow_crycb() function. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Message-Id: <20180925231641.4954-19-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: vsie: Make use of CRYCB FORMAT2 clearPierre Morel
The comment preceding the shadow_crycb function is misleading, we effectively accept FORMAT2 CRYCB in the guest. When using FORMAT2 in the host we do not need to or with FORMAT1. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Message-Id: <20180925231641.4954-18-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: vsie: Do the CRYCB validation firstPierre Morel
We need to handle the validity checks for the crycb, no matter what the settings for the keywrappings are. So lets move the keywrapping checks after we have done the validy checks. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Message-Id: <20180925231641.4954-17-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28KVM: s390: Clear Crypto Control Block when using vSIEPierre Morel
When we clear the Crypto Control Block (CRYCB) used by a guest level 2, the vSIE shadow CRYCB for guest level 3 must be updated before the guest uses it. We achieve this by using the KVM_REQ_VSIE_RESTART synchronous request for each vCPU belonging to the guest to force the reload of the shadow CRYCB before rerunning the guest level 3. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Message-Id: <20180925231641.4954-16-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28s390: vfio-ap: implement VFIO_DEVICE_RESET ioctlTony Krowiak
Implements the VFIO_DEVICE_RESET ioctl. This ioctl zeroizes all of the AP queues assigned to the guest. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Message-Id: <20180925231641.4954-15-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28s390: vfio-ap: zeroize the AP queuesTony Krowiak
Let's call PAPQ(ZAPQ) to zeroize a queue for each queue configured for a mediated matrix device when it is released. Zeroizing a queue resets the queue, clears all pending messages for the queue entries and disables adapter interruptions associated with the queue. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20180925231641.4954-14-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctlTony Krowiak
Adds support for the VFIO_DEVICE_GET_INFO ioctl to the VFIO AP Matrix device driver. This is a minimal implementation, as vfio-ap does not use I/O regions. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Acked-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Message-Id: <20180925231641.4954-13-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-28s390: vfio-ap: implement mediated device open callbackTony Krowiak
Implements the open callback on the mediated matrix device. The function registers a group notifier to receive notification of the VFIO_GROUP_NOTIFY_SET_KVM event. When notified, the vfio_ap device driver will get access to the guest's kvm structure. The open callback must ensure that only one mediated device shall be opened per guest. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Acked-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Acked-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20180925231641.4954-12-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26KVM: s390: interface to clear CRYCB masksTony Krowiak
Introduces a new KVM function to clear the APCB0 and APCB1 in the guest's CRYCB. This effectively clears all bits of the APM, AQM and ADM masks configured for the guest. The VCPUs are taken out of SIE to ensure the VCPUs do not get out of sync. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Acked-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20180925231641.4954-11-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26s390: vfio-ap: sysfs interface to view matrix mdev matrixTony Krowiak
Provides a sysfs interface to view the AP matrix configured for the mediated matrix device. The relevant sysfs structures are: /sys/devices/vfio_ap/matrix/ ...... [mdev_supported_types] ......... [vfio_ap-passthrough] ............ [devices] ...............[$uuid] .................. matrix To view the matrix configured for the mediated matrix device, print the matrix file: cat matrix Below are examples of the output from the above command: Example 1: Adapters and domains assigned Assignments: Adapters 5 and 6 Domains 4 and 71 (0x47) Output 05.0004 05.0047 06.0004 06.0047 Examples 2: Only adapters assigned Assignments: Adapters 5 and 6 Output: 05. 06. Examples 3: Only domains assigned Assignments: Domains 4 and 71 (0x47) Output: .0004 .0047 Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Message-Id: <20180925231641.4954-10-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26s390: vfio-ap: sysfs interfaces to configure control domainsTony Krowiak
Provides the sysfs interfaces for: 1. Assigning AP control domains to the mediated matrix device 2. Unassigning AP control domains from a mediated matrix device 3. Displaying the control domains assigned to a mediated matrix device The IDs of the AP control domains assigned to the mediated matrix device are stored in an AP domain mask (ADM). The bits in the ADM, from most significant to least significant bit, correspond to AP domain numbers 0 to 255. On some systems, the maximum allowable domain number may be less than 255 - depending upon the host's AP configuration - and assignment may be rejected if the input domain ID exceeds the limit. When a control domain is assigned, the bit corresponding its domain ID will be set in the ADM. Likewise, when a domain is unassigned, the bit corresponding to its domain ID will be cleared in the ADM. The relevant sysfs structures are: /sys/devices/vfio_ap/matrix/ ...... [mdev_supported_types] ......... [vfio_ap-passthrough] ............ [devices] ...............[$uuid] .................. assign_control_domain .................. unassign_control_domain To assign a control domain to the $uuid mediated matrix device's ADM, write its domain number to the assign_control_domain file. To unassign a domain, write its domain number to the unassign_control_domain file. The domain number is specified using conventional semantics: If it begins with 0x the number will be parsed as a hexadecimal (case insensitive) number; if it begins with 0, it is parsed as an octal number; otherwise, it will be parsed as a decimal number. For example, to assign control domain 173 (0xad) to the mediated matrix device $uuid: echo 173 > assign_control_domain or echo 0255 > assign_control_domain or echo 0xad > assign_control_domain To unassign control domain 173 (0xad): echo 173 > unassign_control_domain or echo 0255 > unassign_control_domain or echo 0xad > unassign_control_domain The assignment will be rejected if the APQI exceeds the maximum value for an AP domain: * If the AP Extended Addressing (APXA) facility is installed, the max value is 255 * Else the max value is 15 Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Message-Id: <20180925231641.4954-9-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26s390: vfio-ap: sysfs interfaces to configure domainsTony Krowiak
Introduces two new sysfs attributes for the VFIO mediated matrix device for assigning AP domains to and unassigning AP domains from a mediated matrix device. The IDs of the AP domains assigned to the mediated matrix device will be stored in an AP queue mask (AQM). The bits in the AQM, from most significant to least significant bit, correspond to AP queue index (APQI) 0 to 255 (note that an APQI is synonymous with with a domain ID). On some systems, the maximum allowable domain number may be less than 255 - depending upon the host's AP configuration - and assignment may be rejected if the input domain ID exceeds the limit. When a domain is assigned, the bit corresponding to the APQI will be set in the AQM. Likewise, when a domain is unassigned, the bit corresponding to the APQI will be cleared from the AQM. In order to successfully assign a domain, the APQNs derived from the domain ID being assigned and the adapter numbers of all adapters previously assigned: 1. Must be bound to the vfio_ap device driver. 2. Must not be assigned to any other mediated matrix device. If there are no adapters assigned to the mdev, then there must be an AP queue bound to the vfio_ap device driver with an APQN containing the domain ID (i.e., APQI), otherwise all adapters subsequently assigned will fail because there will be no AP queues bound with an APQN containing the APQI. Assigning or un-assigning an AP domain will also be rejected if a guest using the mediated matrix device is running. The relevant sysfs structures are: /sys/devices/vfio_ap/matrix/ ...... [mdev_supported_types] ......... [vfio_ap-passthrough] ............ [devices] ...............[$uuid] .................. assign_domain .................. unassign_domain To assign a domain to the $uuid mediated matrix device, write the domain's ID to the assign_domain file. To unassign a domain, write the domain's ID to the unassign_domain file. The ID is specified using conventional semantics: If it begins with 0x, the number will be parsed as a hexadecimal (case insensitive) number; if it begins with 0, it will be parsed as an octal number; otherwise, it will be parsed as a decimal number. For example, to assign domain 173 (0xad) to the mediated matrix device $uuid: echo 173 > assign_domain or echo 0255 > assign_domain or echo 0xad > assign_domain To unassign domain 173 (0xad): echo 173 > unassign_domain or echo 0255 > unassign_domain or echo 0xad > unassign_domain Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Message-Id: <20180925231641.4954-8-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26s390: vfio-ap: sysfs interfaces to configure adaptersTony Krowiak
Introduces two new sysfs attributes for the VFIO mediated matrix device for assigning AP adapters to and unassigning AP adapters from a mediated matrix device. The IDs of the AP adapters assigned to the mediated matrix device will be stored in an AP mask (APM). The bits in the APM, from most significant to least significant bit, correspond to AP adapter IDs (APID) 0 to 255. On some systems, the maximum allowable adapter number may be less than 255 - depending upon the host's AP configuration - and assignment may be rejected if the input adapter ID exceeds the limit. When an adapter is assigned, the bit corresponding to the APID will be set in the APM. Likewise, when an adapter is unassigned, the bit corresponding to the APID will be cleared from the APM. In order to successfully assign an adapter, the APQNs derived from the adapter ID being assigned and the queue indexes of all domains previously assigned: 1. Must be bound to the vfio_ap device driver. 2. Must not be assigned to any other mediated matrix device If there are no domains assigned to the mdev, then there must be an AP queue bound to the vfio_ap device driver with an APQN containing the APID, otherwise all domains subsequently assigned will fail because there will be no AP queues bound with an APQN containing the adapter ID. Assigning or un-assigning an AP adapter will be rejected if a guest using the mediated matrix device is running. The relevant sysfs structures are: /sys/devices/vfio_ap/matrix/ ...... [mdev_supported_types] ......... [vfio_ap-passthrough] ............ [devices] ...............[$uuid] .................. assign_adapter .................. unassign_adapter To assign an adapter to the $uuid mediated matrix device's APM, write the APID to the assign_adapter file. To unassign an adapter, write the APID to the unassign_adapter file. The APID is specified using conventional semantics: If it begins with 0x the number will be parsed as a hexadecimal number; if it begins with a 0 the number will be parsed as an octal number; otherwise, it will be parsed as a decimal number. For example, to assign adapter 173 (0xad) to the mediated matrix device $uuid: echo 173 > assign_adapter or echo 0xad > assign_adapter or echo 0255 > assign_adapter To unassign adapter 173 (0xad): echo 173 > unassign_adapter or echo 0xad > unassign_adapter or echo 0255 > unassign_adapter Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Tested-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20180925231641.4954-7-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26s390: vfio-ap: register matrix device with VFIO mdev frameworkTony Krowiak
Registers the matrix device created by the VFIO AP device driver with the VFIO mediated device framework. Registering the matrix device will create the sysfs structures needed to create mediated matrix devices each of which will be used to configure the AP matrix for a guest and connect it to the VFIO AP device driver. Registering the matrix device with the VFIO mediated device framework will create the following sysfs structures: /sys/devices/vfio_ap/matrix/ ...... [mdev_supported_types] ......... [vfio_ap-passthrough] ............ create To create a mediated device for the AP matrix device, write a UUID to the create file: uuidgen > create A symbolic link to the mediated device's directory will be created in the devices subdirectory named after the generated $uuid: /sys/devices/vfio_ap/matrix/ ...... [mdev_supported_types] ......... [vfio_ap-passthrough] ............ [devices] ............... [$uuid] A symbolic link to the mediated device will also be created in the vfio_ap matrix's directory: /sys/devices/vfio_ap/matrix/[$uuid] Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Message-Id: <20180925231641.4954-6-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26s390: vfio-ap: base implementation of VFIO AP device driverTony Krowiak
Introduces a new AP device driver. This device driver is built on the VFIO mediated device framework. The framework provides sysfs interfaces that facilitate passthrough access by guests to devices installed on the linux host. The VFIO AP device driver will serve two purposes: 1. Provide the interfaces to reserve AP devices for exclusive use by KVM guests. This is accomplished by unbinding the devices to be reserved for guest usage from the zcrypt device driver and binding them to the VFIO AP device driver. 2. Implements the functions, callbacks and sysfs attribute interfaces required to create one or more VFIO mediated devices each of which will be used to configure the AP matrix for a guest and serve as a file descriptor for facilitating communication between QEMU and the VFIO AP device driver. When the VFIO AP device driver is initialized: * It registers with the AP bus for control of type 10 (CEX4 and newer) AP queue devices. This limitation was imposed due to: 1. A desire to keep the code as simple as possible; 2. Some older models are no longer supported by the kernel and others are getting close to end of service. 3. A lack of older systems on which to test older devices. The probe and remove callbacks will be provided to support the binding/unbinding of AP queue devices to/from the VFIO AP device driver. * Creates a matrix device, /sys/devices/vfio_ap/matrix, to serve as the parent of the mediated devices created, one for each guest, and to hold the APQNs of the AP devices bound to the VFIO AP device driver. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20180925231641.4954-5-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26KVM: s390: refactor crypto initializationTony Krowiak
This patch refactors the code that initializes and sets up the crypto configuration for a guest. The following changes are implemented via this patch: 1. Introduces a flag indicating AP instructions executed on the guest shall be interpreted by the firmware. This flag is used to set a bit in the guest's state description indicating AP instructions are to be interpreted. 2. Replace code implementing AP interfaces with code supplied by the AP bus to query the AP configuration. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Farhan Ali <alifm@linux.ibm.com> Message-Id: <20180925231641.4954-4-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26KVM: s390: introduce and use KVM_REQ_VSIE_RESTARTDavid Hildenbrand
When we change the crycb (or execution controls), we also have to make sure that the vSIE shadow datastructures properly consider the changed values before rerunning the vSIE. We can achieve that by simply using a VCPU request now. This has to be a synchronous request (== handled before entering the (v)SIE again). The request will make sure that the vSIE handler is left, and that the request will be processed (NOP), therefore forcing a reload of all vSIE data (including rebuilding the crycb) when re-entering the vSIE interception handler the next time. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Message-Id: <20180925231641.4954-3-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-26KVM: s390: vsie: simulate VCPU SIE entry/exitDavid Hildenbrand
VCPU requests and VCPU blocking right now don't take care of the vSIE (as it was not necessary until now). But we want to have synchronous VCPU requests that will also be handled before running the vSIE again. So let's simulate a SIE entry of the VCPU when calling the sie during vSIE handling and check for PROG_ flags. The existing infrastructure (e.g. exit_sie()) will then detect that the SIE (in form of the vSIE) is running and properly kick the vSIE CPU, resulting in it leaving the vSIE loop and therefore the vSIE interception handler, allowing it to handle VCPU requests. E.g. if we want to modify the crycb of the VCPU and make sure that any masks also get applied to the VSIE crycb shadow (which uses masks from the VCPU crycb), we will need a way to hinder the vSIE from running and make sure to process the updated crycb before reentering the vSIE again. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Message-Id: <20180925231641.4954-2-akrowiak@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-09-24KVM: x86: never trap MSR_KERNEL_GS_BASEPaolo Bonzini
KVM has an old optimization whereby accesses to the kernel GS base MSR are trapped when the guest is in 32-bit and not when it is in 64-bit mode. The idea is that swapgs is not available in 32-bit mode, thus the guest has no reason to access the MSR unless in 64-bit mode and 32-bit applications need not pay the price of switching the kernel GS base between the host and the guest values. However, this optimization adds complexity to the code for little benefit (these days most guests are going to be 64-bit anyway) and in fact broke after commit 678e315e78a7 ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base", 2018-08-06); the guest kernel GS base can be corrupted across SMIs and UEFI Secure Boot is therefore broken (a secure boot Linux guest, for example, fails to reach the login prompt about half the time). This patch just removes the optimization; the kernel GS base MSR is now never trapped by KVM, similarly to the FS and GS base MSRs. Fixes: 678e315e78a780dbef384b92339c8414309dbc11 Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmGreg Kroah-Hartman
Paolo writes: "It's mostly small bugfixes and cleanups, mostly around x86 nested virtualization. One important change, not related to nested virtualization, is that the ability for the guest kernel to trap CPUID instructions (in Linux that's the ARCH_SET_CPUID arch_prctl) is now masked by default. This is because the feature is detected through an MSR; a very bad idea that Intel seems to like more and more. Some applications choke if the other fields of that MSR are not initialized as on real hardware, hence we have to disable the whole MSR by default, as was the case before Linux 4.12." * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (23 commits) KVM: nVMX: Fix bad cleanup on error of get/set nested state IOCTLs kvm: selftests: Add platform_info_test KVM: x86: Control guest reads of MSR_PLATFORM_INFO KVM: x86: Turbo bits in MSR_PLATFORM_INFO nVMX x86: Check VPID value on vmentry of L2 guests nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2 KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv KVM: VMX: check nested state and CR4.VMXE against SMM kvm: x86: make kvm_{load|put}_guest_fpu() static x86/hyper-v: rename ipi_arg_{ex,non_ex} structures KVM: VMX: use preemption timer to force immediate VMExit KVM: VMX: modify preemption timer bit only when arming timer KVM: VMX: immediately mark preemption timer expired only for zero value KVM: SVM: Switch to bitmap_zalloc() KVM/MMU: Fix comment in walk_shadow_page_lockless_end() kvm: selftests: use -pthread instead of -lpthread KVM: x86: don't reset root in kvm_mmu_setup() kvm: mmu: Don't read PDPTEs when paging is not enabled x86/kvm/lapic: always disable MMIO interface in x2APIC mode KVM: s390: Make huge pages unavailable in ucontrol VMs ...
2018-09-21Merge tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifsGreg Kroah-Hartman
Richard writes: "This pull request contains fixes for UBIFS: - A wrong UBIFS assertion in mount code - Fix for a NULL pointer deref in mount code - Revert of a bad fix for xattrs" * tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs: Revert "ubifs: xattr: Don't operate on deleted inodes" ubifs: drop false positive assertion ubifs: Check for name being NULL while mounting
2018-09-21Merge tag 'for-linus-20180920' of git://git.kernel.dk/linux-blockGreg Kroah-Hartman
Jens writes: "Storage fixes for 4.19-rc5 - Fix for leaking kernel pointer in floppy ioctl (Andy Whitcroft) - NVMe pull request from Christoph, and a single ANA log page fix (Hannes) - Regression fix for libata qd32 support, where we trigger an illegal active command transition. This fixes a CD-ROM detection issue that was reported, but could also trigger premature completion of the internal tag (me)" * tag 'for-linus-20180920' of git://git.kernel.dk/linux-block: floppy: Do not copy a kernel pointer to user memory in FDGETPRM ioctl libata: mask swap internal and hardware tag nvme: count all ANA groups for ANA Log page
2018-09-21Merge tag 'drm-fixes-2018-09-21' of git://anongit.freedesktop.org/drm/drmGreg Kroah-Hartman
David writes: "drm fixes for 4.19-rc5: - core: fix debugfs for atomic, fix the check for atomic for non-modesetting drivers - amdgpu: adds a new PCI id, some kfd fixes and a sdma fix - i915: a bunch of GVT fixes. - vc4: scaling fix - vmwgfx: modesetting fixes and a old buffer eviction fix - udl: framebuffer destruction fix - sun4i: disable on R40 fix until next kernel - pl111: NULL termination on table fix" * tag 'drm-fixes-2018-09-21' of git://anongit.freedesktop.org/drm/drm: (21 commits) drm/amdkfd: Fix ATS capablity was not reported correctly on some APUs drm/amdkfd: Change the control stack MTYPE from UC to NC on GFX9 drm/amdgpu: Fix SDMA HQD destroy error on gfx_v7 drm/vmwgfx: Fix buffer object eviction drm/vmwgfx: Don't impose STDU limits on framebuffer size drm/vmwgfx: limit mode size for all display unit to texture_max drm/vmwgfx: limit screen size to stdu_max during check_modeset drm/vmwgfx: don't check for old_crtc_state enable status drm/amdgpu: add new polaris pci id drm: sun4i: drop second PLL from A64 HDMI PHY drm: fix drm_drv_uses_atomic_modeset on non modesetting drivers. drm/i915/gvt: clear ggtt entries when destroy vgpu drm/i915/gvt: request srcu_read_lock before checking if one gfn is valid drm/i915/gvt: Add GEN9_CLKGATE_DIS_4 to default BXT mmio handler drm/i915/gvt: Init PHY related registers for BXT drm/atomic: Use drm_drv_uses_atomic_modeset() for debugfs creation drm/fb-helper: Remove set but not used variable 'connector_funcs' drm: udl: Destroy framebuffer only if it was initialized drm/sun4i: Remove R40 display pipeline compatibles drm/pl111: Make sure of_device_id tables are NULL terminated ...
2018-09-21Merge branch 'drm-fixes-4.19' of git://people.freedesktop.org/~agd5f/linux ↵Dave Airlie
into drm-fixes A few fixes for 4.19: - Add a new polaris pci id - KFD fixes for raven and gfx7 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180920155850.5455-1-alexander.deucher@amd.com
2018-09-21Merge branch 'vmwgfx-fixes-4.19' of ↵Dave Airlie
git://people.freedesktop.org/~thomash/linux into drm-fixes A couple of modesetting fixes and a fix for a long-standing buffer-eviction problem cc'd stable. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Hellstrom <thellstrom@vmware.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180920063935.35492-1-thellstrom@vmware.com
2018-09-20ocfs2: fix ocfs2 read block panicJunxiao Bi
While reading block, it is possible that io error return due to underlying storage issue, in this case, BH_NeedsValidate was left in the buffer head. Then when reading the very block next time, if it was already linked into journal, that will trigger the following panic. [203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342! [203748.702533] invalid opcode: 0000 [#1] SMP [203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2 [203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015 [203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000 [203748.703088] RIP: 0010:[<ffffffffa05e9f09>] [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2] [203748.703130] RSP: 0018:ffff88006ff4b818 EFLAGS: 00010206 [203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000 [203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe [203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0 [203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000 [203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000 [203748.705871] FS: 00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000 [203748.706370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670 [203748.707124] Stack: [203748.707371] ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001 [203748.707885] 0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00 [203748.708399] 00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000 [203748.708915] Call Trace: [203748.709175] [<ffffffffa0609f52>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2] [203748.709680] [<ffffffffa05eca00>] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2] [203748.710185] [<ffffffffa05ec0cb>] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2] [203748.710691] [<ffffffffa05f0fbf>] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2] [203748.711204] [<ffffffffa065660f>] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2] [203748.711716] [<ffffffffa05f4f3a>] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2] [203748.712227] [<ffffffffa05f442e>] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2] [203748.712737] [<ffffffffa061b2f2>] ocfs2_mknod+0x4b2/0x1370 [ocfs2] [203748.713003] [<ffffffffa061c385>] ocfs2_create+0x65/0x170 [ocfs2] [203748.713263] [<ffffffff8121714b>] vfs_create+0xdb/0x150 [203748.713518] [<ffffffff8121b225>] do_last+0x815/0x1210 [203748.713772] [<ffffffff812192e9>] ? path_init+0xb9/0x450 [203748.714123] [<ffffffff8121bca0>] path_openat+0x80/0x600 [203748.714378] [<ffffffff811bcd45>] ? handle_pte_fault+0xd15/0x1620 [203748.714634] [<ffffffff8121d7ba>] do_filp_open+0x3a/0xb0 [203748.714888] [<ffffffff8122a767>] ? __alloc_fd+0xa7/0x130 [203748.715143] [<ffffffff81209ffc>] do_sys_open+0x12c/0x220 [203748.715403] [<ffffffff81026ddb>] ? syscall_trace_enter_phase1+0x11b/0x180 [203748.715668] [<ffffffff816f0c9f>] ? system_call_after_swapgs+0xe9/0x190 [203748.715928] [<ffffffff8120a10e>] SyS_open+0x1e/0x20 [203748.716184] [<ffffffff816f0d5e>] system_call_fastpath+0x18/0xd7 [203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff <0f> 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10 [203748.717505] RIP [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2] [203748.717775] RSP <ffff88006ff4b818> Joesph ever reported a similar panic. Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20mm: slowly shrink slabs with a relatively small number of objectsRoman Gushchin
9092c71bb724 ("mm: use sc->priority for slab shrink targets") changed the way that the target slab pressure is calculated and made it priority-based: delta = freeable >> priority; delta *= 4; do_div(delta, shrinker->seeks); The problem is that on a default priority (which is 12) no pressure is applied at all, if the number of potentially reclaimable objects is less than 4096 (1<<12). This causes the last objects on slab caches of no longer used cgroups to (almost) never get reclaimed. It's obviously a waste of memory. It can be especially painful, if these stale objects are holding a reference to a dying cgroup. Slab LRU lists are reparented on memcg offlining, but corresponding objects are still holding a reference to the dying cgroup. If we don't scan these objects, the dying cgroup can't go away. Most likely, the parent cgroup hasn't any directly charged objects, only remaining objects from dying children cgroups. So it can easily hold a reference to hundreds of dying cgroups. If there are no big spikes in memory pressure, and new memory cgroups are created and destroyed periodically, this causes the number of dying cgroups grow steadily, causing a slow-ish and hard-to-detect memory "leak". It's not a real leak, as the memory can be eventually reclaimed, but it could not happen in a real life at all. I've seen hosts with a steadily climbing number of dying cgroups, which doesn't show any signs of a decline in months, despite the host is loaded with a production workload. It is an obvious waste of memory, and to prevent it, let's apply a minimal pressure even on small shrinker lists. E.g. if there are freeable objects, let's scan at least min(freeable, scan_batch) objects. This fix significantly improves a chance of a dying cgroup to be reclaimed, and together with some previous patches stops the steady growth of the dying cgroups number on some of our hosts. Link: http://lkml.kernel.org/r/20180905230759.12236-1-guro@fb.com Fixes: 9092c71bb724 ("mm: use sc->priority for slab shrink targets") Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20kernel/sys.c: remove duplicated includeYueHaibing
Link: http://lkml.kernel.org/r/20180821133424.18716-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>