summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-01-19KVM: VMX: Don't do full kick when handling posted interrupt wakeupSean Christopherson
When waking vCPUs in the posted interrupt wakeup handling, do exactly that and no more. There is no need to kick the vCPU as the wakeup handler just needs to get the vCPU task running, and if it's in the guest then it's definitely running. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: VMX: Fold fallback path into triggering posted IRQ helperSean Christopherson
Move the fallback "wake_up" path into the helper to trigger posted interrupt helper now that the nested and non-nested paths are identical. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: VMX: Pass desired vector instead of bool for triggering posted IRQSean Christopherson
Refactor the posted interrupt helper to take the desired notification vector instead of a bool so that the callers are self-documenting. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: VMX: Don't do full kick when triggering posted interrupt "fails"Sean Christopherson
Replace the full "kick" with just the "wake" in the fallback path when triggering a virtual interrupt via a posted interrupt fails because the guest is not IN_GUEST_MODE. If the guest transitions into guest mode between the check and the kick, then it's guaranteed to see the pending interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting IN_GUEST_MODE. Kicking the guest in this case is nothing more than an unnecessary VM-Exit (and host IRQ). Opportunistically update comments to explain the various ordering rules and barriers at play. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPUSean Christopherson
Don't bother updating the Physical APIC table or IRTE when loading a vCPU that is blocking, i.e. won't be marked IsRun{ning}=1, as the pCPU is queried if and only if IsRunning is '1'. If the vCPU was migrated, the new pCPU will be picked up when avic_vcpu_load() is called by svm_vcpu_unblocking(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemptionSean Christopherson
Use kvm_vcpu_is_blocking() to determine whether or not the vCPU should be marked running during avic_vcpu_load(). Drop avic_is_running, which really should have been named "vcpu_is_not_blocking", as it tracked if the vCPU was blocking, not if it was actually running, e.g. it was set during svm_create_vcpu() when the vCPU was obviously not running. This is technically a teeny tiny functional change, as the vCPU will be marked IsRunning=1 on being reloaded if the vCPU is preempted between svm_vcpu_blocking() and prepare_to_rcuwait(). But that's a benign change as the vCPU will be marked IsRunning=0 when KVM voluntarily schedules out the vCPU. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking pathSean Christopherson
Remove handling of KVM_REQ_APICV_UPDATE from svm_vcpu_unblocking(), it's no longer needed as it was made obsolete by commit df7e4827c549 ("KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC"). Prior to that commit, the manual check was necessary to ensure the AVIC stuff was updated by avic_set_running() when a request to enable APICv became pending while the vCPU was blocking, as the request handling itself would not do the update. But, as evidenced by the commit, that logic was flawed and subject to various races. Now that svm_refresh_apicv_exec_ctrl() does avic_vcpu_load/put() in response to an APICv status change, drop the manual check in the unblocking path. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIsSean Christopherson
Drop the avic_vcpu_is_running() check when waking vCPUs in response to a VM-Exit due to incomplete IPI delivery. The check isn't wrong per se, but it's not 100% accurate in the sense that it doesn't guarantee that the vCPU was one of the vCPUs that didn't receive the IPI. The check isn't required for correctness as blocking == !running in this context. From a performance perspective, waking a live task is not expensive as the only moderately costly operation is a locked operation to temporarily disable preemption. And if that is indeed a performance issue, kvm_vcpu_is_blocking() would be a better check than poking into the AVIC. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: SVM: Signal AVIC doorbell iff vCPU is in guest modeSean Christopherson
Signal the AVIC doorbell iff the vCPU is running in the guest. If the vCPU is not IN_GUEST_MODE, it's guaranteed to pick up any pending IRQs on the next VMRUN, which unconditionally processes the vIRR. Add comments to document the logic. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooksSean Christopherson
Drop kvm_x86_ops' pre/post_block() now that all implementations are nops. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpersSean Christopherson
Unexport switch_to_{hv,sw}_timer() now that common x86 handles the transitions. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: VMX: Move preemption timer <=> hrtimer dance to common x86Sean Christopherson
Handle the switch to/from the hypervisor/software timer when a vCPU is blocking in common x86 instead of in VMX. Even though VMX is the only user of a hypervisor timer, the logic and all functions involved are generic x86 (unless future CPUs do something completely different and implement a hypervisor timer that runs regardless of mode). Handling the switch in common x86 will allow for the elimination of the pre/post_blocks hooks, and also lets KVM switch back to the hypervisor timer if and only if it was in use (without additional params). Add a comment explaining why the switch cannot be deferred to kvm_sched_out() or kvm_vcpu_block(). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmxSean Christopherson
Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and rename the list and all associated variables to clarify that it tracks the set of vCPU that need to be poked on a posted interrupt to the wakeup vector. The list is not used to track _all_ vCPUs that are blocking, and the term "blocked" can be misleading as it may refer to a blocking condition in the host or the guest, where as the PI wakeup case is specifically for the vCPUs that are actively blocking from within the guest. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: Drop unused kvm_vcpu.pre_pcpu fieldSean Christopherson
Remove kvm_vcpu.pre_pcpu as it no longer has any users. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: VMX: Handle PI descriptor updates during vcpu_put/loadSean Christopherson
Move the posted interrupt pre/post_block logic into vcpu_put/load respectively, using the kvm_vcpu_is_blocking() to determining whether or not the wakeup handler needs to be set (and unset). This avoids updating the PI descriptor if halt-polling is successful, reduces the number of touchpoints for updating the descriptor, and eliminates the confusing behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU is scheduled back in after preemption. The downside is that KVM will do the PID update twice if the vCPU is preempted after prepare_to_rcuwait() but before schedule(), but that's a rare case (and non-existent on !PREEMPT kernels). The notable wart is the need to send a self-IPI on the wakeup vector if an outstanding notification is pending after configuring the wakeup vector. Ideally, KVM would just do a kvm_vcpu_wake_up() in this case, but the scheduler doesn't support waking a task from its preemption notifier callback, i.e. while the task is right in the middle of being scheduled out. Note, setting the wakeup vector before halt-polling is not necessary: once the pending IRQ will be recorded in the PIR, kvm_vcpu_has_events() will detect this (via kvm_cpu_get_interrupt(), kvm_apic_get_interrupt(), apic_has_interrupt_for_ppr() and finally vmx_sync_pir_to_irr()) and terminate the polling. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19Merge branch 'kvm-pi-raw-spinlock' into HEADPaolo Bonzini
Bring in fix for VT-d posted interrupts before further changing the code in 5.17. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: avoid warning on s390 in mark_page_dirtyChristian Borntraeger
Avoid warnings on s390 like [ 1801.980931] CPU: 12 PID: 117600 Comm: kworker/12:0 Tainted: G E 5.17.0-20220113.rc0.git0.32ce2abb03cf.300.fc35.s390x+next #1 [ 1801.980938] Workqueue: events irqfd_inject [kvm] [...] [ 1801.981057] Call Trace: [ 1801.981060] [<000003ff805f0f5c>] mark_page_dirty_in_slot+0xa4/0xb0 [kvm] [ 1801.981083] [<000003ff8060e9fe>] adapter_indicators_set+0xde/0x268 [kvm] [ 1801.981104] [<000003ff80613c24>] set_adapter_int+0x64/0xd8 [kvm] [ 1801.981124] [<000003ff805fb9aa>] kvm_set_irq+0xc2/0x130 [kvm] [ 1801.981144] [<000003ff805f8d86>] irqfd_inject+0x76/0xa0 [kvm] [ 1801.981164] [<0000000175e56906>] process_one_work+0x1fe/0x470 [ 1801.981173] [<0000000175e570a4>] worker_thread+0x64/0x498 [ 1801.981176] [<0000000175e5ef2c>] kthread+0x10c/0x110 [ 1801.981180] [<0000000175de73c8>] __ret_from_fork+0x40/0x58 [ 1801.981185] [<000000017698440a>] ret_from_fork+0xa/0x40 when writing to a guest from an irqfd worker as long as we do not have the dirty ring. Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reluctantly-acked-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20220113122924.740496-1-borntraeger@linux.ibm.com> Fixes: 2efd61a608b0 ("KVM: Warn if mark_page_dirty() is called without an active vCPU") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: selftests: Add a test to force emulation with a pending exceptionSean Christopherson
Add a VMX specific test to verify that KVM doesn't explode if userspace attempts KVM_RUN when emulation is required with a pending exception. KVM VMX's emulation support for !unrestricted_guest punts exceptions to userspace instead of attempting to synthesize the exception with all the correct state (and stack switching, etc...). Punting is acceptable as there's never been a request to support injecting exceptions when emulating due to invalid state, but KVM has historically assumed that userspace will do the right thing and either clear the exception or kill the guest. Deliberately do the opposite and attempt to re-enter the guest with a pending exception and emulation required to verify KVM continues to punt the combination to userspace, e.g. doesn't explode, WARN, etc... Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211228232437.1875318-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: VMX: Reject KVM_RUN if emulation is required with pending exceptionSean Christopherson
Reject KVM_RUN if emulation is required (because VMX is running without unrestricted guest) and an exception is pending, as KVM doesn't support emulating exceptions except when emulating real mode via vm86. The vCPU is hosed either way, but letting KVM_RUN proceed triggers a WARN due to the impossible condition. Alternatively, the WARN could be removed, but then userspace and/or KVM bugs would result in the vCPU silently running in a bad state, which isn't very friendly to users. Originally, the bug was hit by syzkaller with a nested guest as that doesn't require kvm_intel.unrestricted_guest=0. That particular flavor is likely fixed by commit cd0e615c49e5 ("KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to trigger the WARN with a non-nested guest, and userspace can likely force bad state via ioctls() for a nested guest as well. Checking for the impossible condition needs to be deferred until KVM_RUN because KVM can't force specific ordering between ioctls. E.g. clearing exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with emulation_required would prevent userspace from queuing an exception and then stuffing sregs. Note, if KVM were to try and detect/prevent the condition prior to KVM_RUN, handle_invalid_guest_state() and/or handle_emulation_failure() would need to be modified to clear the pending exception prior to exiting to userspace. ------------[ cut here ]------------ WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel] CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279 Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014 RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel] Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202 RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000 RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006 R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000 FS: 00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0 Call Trace: kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm] kvm_vcpu_ioctl+0x279/0x690 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211228232437.1875318-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19selftests: kvm/x86: Add test for KVM_SET_PMU_EVENT_FILTERJim Mattson
Verify that the PMU event filter works as expected. Note that the virtual PMU doesn't work as expected on AMD Zen CPUs (an intercepted rdmsr is counted as a retired branch instruction), but the PMU event filter does work. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220115052431.447232-7-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19selftests: kvm/x86: Introduce x86_model()Jim Mattson
Extract the x86 model number from CPUID.01H:EAX. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220115052431.447232-6-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19selftests: kvm/x86: Export x86_family() for use outside of processor.cJim Mattson
Move this static inline function to processor.h, so that it can be used in individual tests, as needed. Opportunistically replace the bare 'unsigned' with 'unsigned int.' Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220115052431.447232-5-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19selftests: kvm/x86: Introduce is_amd_cpu()Jim Mattson
Replace the one ad hoc "AuthenticAMD" CPUID vendor string comparison with a new function, is_amd_cpu(). Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220115052431.447232-4-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19selftests: kvm/x86: Parameterize the CPUID vendor string checkJim Mattson
Refactor is_intel_cpu() to make it easier to reuse the bulk of the code for other vendors in the future. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220115052431.447232-3-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86/pmu: Use binary search to check filtered eventsJim Mattson
The PMU event filter may contain up to 300 events. Replace the linear search in reprogram_gp_counter() with a binary search. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220115052431.447232-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19kvm: selftests: conditionally build vm_xsave_req_perm()Wei Wang
vm_xsave_req_perm() is currently defined and used by x86_64 only. Make it compiled into vm_create_with_vcpus() only when on x86_64 machines. Otherwise, it would cause linkage errors, e.g. on s390x. Fixes: 415a3c33e8 ("kvm: selftests: Add support for KVM_CAP_XSAVE2") Reported-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Tested-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Message-Id: <20220118014817.30910-1-wei.w.wang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86/cpuid: Clear XFD for component i if the base feature is missingLike Xu
According to Intel extended feature disable (XFD) spec, the sub-function i (i > 1) of CPUID function 0DH enumerates "details for state component i. ECX[2] enumerates support for XFD support for this state component." If KVM does not report F(XFD) feature (e.g. due to CONFIG_X86_64), then the corresponding XFD support for any state component i should also be removed. Translate this dependency into KVM terms. Fixes: 690a757d610e ("kvm: x86: Add CPUID support for Intel AMX") Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20220117074531.76925-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86/mmu: Improve TLB flush comment in kvm_mmu_slot_remove_write_access()David Matlack
Rewrite the comment in kvm_mmu_slot_remove_write_access() that explains why it is safe to flush TLBs outside of the MMU lock after write-protecting SPTEs for dirty logging. The current comment is a long run-on sentence that was difficult to understand. In addition it was specific to the shadow MMU (mentioning mmu_spte_update()) when the TDP MMU has to handle this as well. The new comment explains: - Why the TLB flush is necessary at all. - Why it is desirable to do the TLB flush outside of the MMU lock. - Why it is safe to do the TLB flush outside of the MMU lock. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220113233020.3986005-5-dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86/mmu: Document and enforce MMU-writable and Host-writable invariantsDavid Matlack
SPTEs are tagged with software-only bits to indicate if it is "MMU-writable" and "Host-writable". These bits are used to determine why KVM has marked an SPTE as read-only. Document these bits and their invariants, and enforce the invariants with new WARNs in spte_can_locklessly_be_made_writable() to ensure they are not accidentally violated in the future. Opportunistically move DEFAULT_SPTE_{MMU,HOST}_WRITABLE next to EPT_SPTE_{MMU,HOST}_WRITABLE since the new documentation applies to both. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220113233020.3986005-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86/mmu: Clear MMU-writable during changed_pte notifierDavid Matlack
When handling the changed_pte notifier and the new PTE is read-only, clear both the Host-writable and MMU-writable bits in the SPTE. This preserves the invariant that MMU-writable is set if-and-only-if Host-writable is set. No functional change intended. Nothing currently relies on the aforementioned invariant and technically the changed_pte notifier is dead code. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220113233020.3986005-3-dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMUDavid Matlack
When the TDP MMU is write-protection GFNs for page table protection (as opposed to for dirty logging, or due to the HVA not being writable), it checks if the SPTE is already write-protected and if so skips modifying the SPTE and the TLB flush. This behavior is incorrect because it fails to check if the SPTE is write-protected for page table protection, i.e. fails to check that MMU-writable is '0'. If the SPTE was write-protected for dirty logging but not page table protection, the SPTE could locklessly be made writable, and vCPUs could still be running with writable mappings cached in their TLB. Fix this by only skipping setting the SPTE if the SPTE is already write-protected *and* MMU-writable is already clear. Technically, checking only MMU-writable would suffice; a SPTE cannot be writable without MMU-writable being set. But check both to be paranoid and because it arguably yields more readable code. Fixes: 46044f72c382 ("kvm: x86/mmu: Support write protection for nesting in tdp MMU") Cc: stable@vger.kernel.org Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220113233020.3986005-2-dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19ALSA: core: Simplify snd_power_ref_and_wait() with the standard macroTakashi Iwai
Use wait_event_cmd() macro and simplify snd_power_ref_wait() implementation. This may also cover possible races in the current open code, too. Reviewed-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20220119091050.30125-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2022-01-19Merge branch 'ipv4-avoid-pathological-hash-tables'Jakub Kicinski
Eric Dumazet says: ==================== ipv4: avoid pathological hash tables This series speeds up netns dismantles on hosts having many active netns, by making sure two hash tables used for IPV4 fib contains uniformly spread items. v2: changed second patch to add fib_info_laddrhash_bucket() for consistency (David Ahern suggestion). ==================== Link: https://lore.kernel.org/r/20220119100413.4077866-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-19ipv4: add net_hash_mix() dispersion to fib_info_laddrhash keysEric Dumazet
net/ipv4/fib_semantics.c uses a hash table (fib_info_laddrhash) in which fib_sync_down_addr() can locate fib_info based on IPv4 local address. This hash table is resized based on total number of hashed fib_info, but the hash function is only using the local address. For hosts having many active network namespaces, all fib_info for loopback devices (IPv4 address 127.0.0.1) are hashed into a single bucket, making netns dismantles very slow. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-19ipv4: avoid quadratic behavior in netns dismantleEric Dumazet
net/ipv4/fib_semantics.c uses an hash table of 256 slots, keyed by device ifindexes: fib_info_devhash[DEVINDEX_HASHSIZE] Problem is that with network namespaces, devices tend to use the same ifindex. lo device for instance has a fixed ifindex of one, for all network namespaces. This means that hosts with thousands of netns spend a lot of time looking at some hash buckets with thousands of elements, notably at netns dismantle. Simply add a per netns perturbation (net_hash_mix()) to spread elements more uniformely. Also change fib_devindex_hashfn() to use more entropy. Fixes: aa79e66eee5d ("net: Make ifindex generation per-net namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-19Merge branch 'net-fsl-xgmac_mdio-add-workaround-for-erratum-a-009885'Jakub Kicinski
Tobias Waldekranz says: ==================== net/fsl: xgmac_mdio: Add workaround for erratum A-009885 The individual messages mostly speak for themselves. It is very possible that there are more chips out there that are impacted by this, but I only have access to the errata document for the T1024 family, so I've limited the DT changes to the exact FMan version used in that device. Hopefully someone from NXP can supply a follow-up if need be. The final commit is an unrelated fix that was brought to my attention by sparse. ==================== Link: https://lore.kernel.org/r/20220118215054.2629314-1-tobias@waldekranz.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-19net/fsl: xgmac_mdio: Fix incorrect iounmap when removing moduleTobias Waldekranz
As reported by sparse: In the remove path, the driver would attempt to unmap its own priv pointer - instead of the io memory that it mapped in probe. Fixes: 9f35a7342cff ("net/fsl: introduce Freescale 10G MDIO driver") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-19powerpc/fsl/dts: Enable WA for erratum A-009885 on fman3l MDIO busesTobias Waldekranz
This block is used in (at least) T1024 and T1040, including their variants like T1023 etc. Fixes: d55ad2967d89 ("powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-19dt-bindings: net: Document fsl,erratum-a009885Tobias Waldekranz
Update FMan binding documentation with the newly added workaround for erratum A-009885. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-19net/fsl: xgmac_mdio: Add workaround for erratum A-009885Tobias Waldekranz
Once an MDIO read transaction is initiated, we must read back the data register within 16 MDC cycles after the transaction completes. Outside of this window, reads may return corrupt data. Therefore, disable local interrupts in the critical section, to maximize the probability that we can satisfy this requirement. Fixes: d55ad2967d89 ("powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-19dt-bindings: rtc: st,stm32-rtc: Make each example a separate entryRob Herring
Each independent example should be a separate entry. This allows for 'interrupts' to have different cell sizes. Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220106182518.1435497-7-robh@kernel.org
2022-01-19dt-bindings: mmc: arm,pl18x: Make each example a separate entryRob Herring
Each independent example should be a separate entry. This and dropping 'interrupt-parent' allows for 'interrupts' to have different cell sizes. Signed-off-by: Rob Herring <robh@kernel.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20220106182518.1435497-6-robh@kernel.org
2022-01-19dt-bindings: display: Add SPI peripheral schema to SPI based displaysRob Herring
With 'unevaluatedProperties' support enabled, several SPI based display binding examples have warnings: Documentation/devicetree/bindings/display/panel/samsung,ld9040.example.dt.yaml: lcd@0: Unevaluated properties are not allowed ('#address-cells', '#size-cells', 'spi-max-frequency', 'spi-cpol', 'spi-cpha' were unexpected) Documentation/devicetree/bindings/display/panel/kingdisplay,kd035g6-54nt.example.dt.yaml: panel@0: Unevaluated properties are not allowed ('spi-max-frequency', 'spi-3wire' were unexpected) Documentation/devicetree/bindings/display/panel/ilitek,ili9322.example.dt.yaml: display@0: Unevaluated properties are not allowed ('reg' was unexpected) Documentation/devicetree/bindings/display/panel/samsung,s6e63m0.example.dt.yaml: display@0: Unevaluated properties are not allowed ('spi-max-frequency' was unexpected) Documentation/devicetree/bindings/display/panel/abt,y030xx067a.example.dt.yaml: panel@0: Unevaluated properties are not allowed ('spi-max-frequency' was unexpected) Documentation/devicetree/bindings/display/panel/sony,acx565akm.example.dt.yaml: panel@2: Unevaluated properties are not allowed ('spi-max-frequency', 'reg' were unexpected) Documentation/devicetree/bindings/display/panel/tpo,td.example.dt.yaml: panel@0: Unevaluated properties are not allowed ('spi-max-frequency', 'spi-cpol', 'spi-cpha' were unexpected) Documentation/devicetree/bindings/display/panel/lgphilips,lb035q02.example.dt.yaml: panel@0: Unevaluated properties are not allowed ('reg', 'spi-max-frequency', 'spi-cpol', 'spi-cpha' were unexpected) Documentation/devicetree/bindings/display/panel/innolux,ej030na.example.dt.yaml: panel@0: Unevaluated properties are not allowed ('spi-max-frequency' was unexpected) Documentation/devicetree/bindings/display/panel/sitronix,st7789v.example.dt.yaml: panel@0: Unevaluated properties are not allowed ('spi-max-frequency', 'spi-cpol', 'spi-cpha' were unexpected) Fix all of these by adding a reference to spi-peripheral-props.yaml. With this, the description that the binding must follow spi-controller.yaml is both a bit out of date and redundant, so remove it. Signed-off-by: Rob Herring <robh@kernel.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Paul Cercueil <paul@crapouillou.net> Acked-by: Sam Ravnborg <sam@ravnborg.org> Link: https://lore.kernel.org/r/20211221125209.1195932-1-robh@kernel.org
2022-01-19HID: uhid: Use READ_ONCE()/WRITE_ONCE() for ->runningJann Horn
The flag uhid->running can be set to false by uhid_device_add_worker() without holding the uhid->devlock. Mark all reads/writes of the flag that might race with READ_ONCE()/WRITE_ONCE() for clarity and correctness. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2022-01-19HID: uhid: Fix worker destroying device without any protectionJann Horn
uhid has to run hid_add_device() from workqueue context while allowing parallel use of the userspace API (which is protected with ->devlock). But hid_add_device() can fail. Currently, that is handled by immediately destroying the associated HID device, without using ->devlock - but if there are concurrent requests from userspace, that's wrong and leads to NULL dereferences and/or memory corruption (via use-after-free). Fix it by leaving the HID device as-is in the worker. We can clean it up later, either in the UHID_DESTROY command handler or in the ->release() handler. Cc: stable@vger.kernel.org Fixes: 67f8ecc550b5 ("HID: uhid: fix timeout when probe races with IO") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2022-01-19net: mscc: ocelot: fix using match before it is setTom Rix
Clang static analysis reports this issue ocelot_flower.c:563:8: warning: 1st function call argument is an uninitialized value !is_zero_ether_addr(match.mask->dst)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The variable match is used before it is set. So move the block. Fixes: 75944fda1dfe ("net: mscc: ocelot: offload ingress skbedit and vlan actions to VCAP IS1") Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-19net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devicesClaudiu Beznea
On a setup with KSZ9131 and MACB drivers it happens on suspend path, from time to time, that the PHY interrupt arrives after PHY and MACB were suspended (PHY via genphy_suspend(), MACB via macb_suspend()). In this case the phy_read() at the beginning of kszphy_handle_interrupt() will fail (as MACB driver is suspended at this time) leading to phy_error() being called and a stack trace being displayed on console. To solve this .suspend/.resume functions for all KSZ devices implementing .handle_interrupt were replaced with kszphy_suspend()/kszphy_resume() which disable/enable interrupt before/after calling genphy_suspend()/genphy_resume(). The fix has been adapted for all KSZ devices which implements .handle_interrupt but it has been tested only on KSZ9131. Fixes: 59ca4e58b917 ("net: phy: micrel: implement generic .handle_interrupt() callback") Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-19net: cpsw: avoid alignment faults by taking NET_IP_ALIGN into accountArd Biesheuvel
Both versions of the CPSW driver declare a CPSW_HEADROOM_NA macro that takes NET_IP_ALIGN into account, but fail to use it appropriately when storing incoming packets in memory. This results in the IPv4 source and destination addresses to appear misaligned in memory, which causes aligment faults that need to be fixed up in software. So let's switch from CPSW_HEADROOM to CPSW_HEADROOM_NA where needed. This gets rid of any alignment faults on the RX path on a Beaglebone White. Fixes: 9ed4050c0d75 ("net: ethernet: ti: cpsw: add XDP support") Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-19nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind()Krzysztof Kozlowski
Syzbot detected a NULL pointer dereference of nfc_llcp_sock->dev pointer (which is a 'struct nfc_dev *') with calls to llcp_sock_sendmsg() after a failed llcp_sock_bind(). The message being sent is a SOCK_DGRAM. KASAN report: BUG: KASAN: null-ptr-deref in nfc_alloc_send_skb+0x2d/0xc0 Read of size 4 at addr 00000000000005c8 by task llcp_sock_nfc_a/899 CPU: 5 PID: 899 Comm: llcp_sock_nfc_a Not tainted 5.16.0-rc6-next-20211224-00001-gc6437fbf18b0 #125 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x45/0x59 ? nfc_alloc_send_skb+0x2d/0xc0 __kasan_report.cold+0x117/0x11c ? mark_lock+0x480/0x4f0 ? nfc_alloc_send_skb+0x2d/0xc0 kasan_report+0x38/0x50 nfc_alloc_send_skb+0x2d/0xc0 nfc_llcp_send_ui_frame+0x18c/0x2a0 ? nfc_llcp_send_i_frame+0x230/0x230 ? __local_bh_enable_ip+0x86/0xe0 ? llcp_sock_connect+0x470/0x470 ? llcp_sock_connect+0x470/0x470 sock_sendmsg+0x8e/0xa0 ____sys_sendmsg+0x253/0x3f0 ... The issue was visible only with multiple simultaneous calls to bind() and sendmsg(), which resulted in most of the bind() calls to fail. The bind() was failing on checking if there is available WKS/SDP/SAP (respective bit in 'struct nfc_llcp_local' fields). When there was no available WKS/SDP/SAP, the bind returned error but the sendmsg() to such socket was able to trigger mentioned NULL pointer dereference of nfc_llcp_sock->dev. The code looks simply racy and currently it protects several paths against race with checks for (!nfc_llcp_sock->local) which is NULL-ified in error paths of bind(). The llcp_sock_sendmsg() did not have such check but called function nfc_llcp_send_ui_frame() had, although not protected with lock_sock(). Therefore the race could look like (same socket is used all the time): CPU0 CPU1 ==== ==== llcp_sock_bind() - lock_sock() - success - release_sock() - return 0 llcp_sock_sendmsg() - lock_sock() - release_sock() llcp_sock_bind(), same socket - lock_sock() - error - nfc_llcp_send_ui_frame() - if (!llcp_sock->local) - llcp_sock->local = NULL - nfc_put_device(dev) - dereference llcp_sock->dev - release_sock() - return -ERRNO The nfc_llcp_send_ui_frame() checked llcp_sock->local outside of the lock, which is racy and ineffective check. Instead, its caller llcp_sock_sendmsg(), should perform the check inside lock_sock(). Reported-and-tested-by: syzbot+7f23bcddf626e0593a39@syzkaller.appspotmail.com Fixes: b874dec21d1c ("NFC: Implement LLCP connection less Tx path") Cc: <stable@vger.kernel.org> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-19Merge branch 'axienet-fixes'David S. Miller
Robert Hancock says: ==================== Xilinx axienet fixes Various fixes for the Xilinx AXI Ethernet driver. Changed since v2: -added Reviewed-by tags, added some explanation to commit messages, no code changes Changed since v1: -corrected a Fixes tag to point to mainline commit -split up reset changes into 3 patches -added ratelimit on netdev_warn in TX busy case ==================== Signed-off-by: David S. Miller <davem@davemloft.net>