Age | Commit message (Collapse) | Author |
|
On some x86 processors, CPUID leaf 0xA provides information
on Architectural Performance Monitoring features. It
advertises a PMU version which Qemu uses to determine the
availability of additional MSRs to manage the PMCs.
Upon receiving a KVM_GET_SUPPORTED_CPUID ioctl request for
the same, the kernel constructs return values based on the
x86_pmu_capability irrespective of the vendor.
This leaf and the additional MSRs are not supported on AMD
and Hygon processors. If AMD PerfMonV2 is detected, the PMU
version is set to 2 and guest startup breaks because of an
attempt to access a non-existent MSR. Return zeros to avoid
this.
Fixes: a6c06ed1a60a ("KVM: Expose the architectural performance monitoring CPUID leaf")
Reported-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Message-Id: <3fef83d9c2b2f7516e8ff50d60851f29a4bcb716.1651058600.git.sandipan.das@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Zen renumbered some of the performance counters that correspond to the
well known events in perf_hw_id. This code in KVM was never updated for
that, so guest that attempt to use counters on Zen that correspond to the
pre-Zen perf_hw_id values will silently receive the wrong values.
This has been observed in the wild with rr[0] when running in Zen 3
guests. rr uses the retired conditional branch counter 00d1 which is
incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.
[0] https://rr-project.org/
Signed-off-by: Kyle Huey <me@kylehuey.com>
Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
Cc: stable@vger.kernel.org
[Check guest family, not host. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
We are dropping A/D bits (and W bits) in the TDP MMU. Even if mmu_lock
is held for write, as volatile SPTEs can be written by other tasks/vCPUs
outside of mmu_lock.
Attempting to prove that bug exposed another notable goof, which has been
lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
as volatile, even though KVM never clears WRITABLE outside of MMU lock.
As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
update writable SPTEs.
The fix does not seem to have an easily-measurable affect on performance;
page faults are so slow that wasting even a few hundred cycles is dwarfed
by the base cost.
|
|
We are dropping A/D bits (and W bits) in the TDP MMU. Even if mmu_lock
is held for write, as volatile SPTEs can be written by other tasks/vCPUs
outside of mmu_lock.
Attempting to prove that bug exposed another notable goof, which has been
lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
as volatile, even though KVM never clears WRITABLE outside of MMU lock.
As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
update writable SPTEs.
The fix does not seem to have an easily-measurable affect on performance;
page faults are so slow that wasting even a few hundred cycles is dwarfed
by the base cost.
|
|
syzbot is reporting use-after-free read in tcp_retransmit_timer() [1],
for TCP socket used by RDS is accessing sock_net() without acquiring a
refcount on net namespace. Since TCP's retransmission can happen after
a process which created net namespace terminated, we need to explicitly
acquire a refcount.
Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1]
Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com>
Link: https://lore.kernel.org/r/a5fb1fc4-2284-3359-f6a0-e4e390239d7b@I-love.SAKURA.ne.jp
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
(based on the Dirty or Writable bit depending on the method) and so not
update the SPTE and ultimately not flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file back memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
and into its callers. Well, caller, since only one of its two callers
doesn't already do the shadow-present check.
Opportunistically move the helper to spte.c/h so that it can be used by
the TDP MMU, which is also the primary motivation for the shadow-present
change. Unlike the legacy MMU, the TDP MMU uses a single path for clear
leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
MMU will need to check is_last_spte() prior to calling
spte_has_volatile_bits(), and calling is_last_spte() without first
calling is_shadow_present_spte() is at best odd, and at worst a violation
of KVM's loosely defines SPTE rules.
Note, mmu_spte_clear_track_bits() could likely skip the write entirely
for SPTEs that are not shadow-present. Leave that cleanup for a future
patch to avoid introducing a functional change, and because the
shadow-present check can likely be moved further up the stack, e.g.
drop_large_spte() appears to be the only path that doesn't already
explicitly check for a shadow-present SPTE.
No functional change intended.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
being volatile (unless they're volatile for other reasons, e.g. A/D bits).
KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
cleared just because the SPTE is MMU-writable.
Rename the wrapper of MMU-writable to be more literal, the previous name
of spte_can_locklessly_be_made_writable() is wrong and misleading.
Fixes: c7ba5b48cc8d ("KVM: MMU: fast path of handling guest page fault")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The program uses CLOCK_TAI as default clock since it was added to the
Linux repo. In commit:
| 040806343bb4 ("selftests/net: so_txtime multi-host support")
a help text stating the wrong default clock was added.
This patch fixes the help text.
Fixes: 040806343bb4 ("selftests/net: so_txtime multi-host support")
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20220502094638.1921702-3-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This patch fixes the parsing of the cmd line supplied start time on 32
bit systems. A "long" on 32 bit systems is only 32 bit wide and cannot
hold a timestamp in nano second resolution.
Fixes: 040806343bb4 ("selftests/net: so_txtime multi-host support")
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20220502094638.1921702-2-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
commit 11663111cd49 ("KVM: arm64: Hide PMU registers from userspace when
not available") hid the AArch64 PMU registers from userspace and guest
when the PMU VCPU feature was not set. Do the same when the PMU
registers are accessed by an AArch32 guest. While we're at it, rename
the previously unused AA32_ZEROHIGH to AA32_DIRECT to match the behavior
of get_access_mask().
Now that KVM emulates ID_DFR0 and hides the PMU from the guest when the
feature is not set, it is safe to inject to inject an undefined exception
when the PMU is not present, as that corresponds to the architected
behaviour.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
[Oliver - Add AA32_DIRECT to match the zero value of the enum]
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220503060205.2823727-7-oupton@google.com
|
|
To date KVM has not trapped ID register accesses from AArch32, meaning
that guests get an unconstrained view of what hardware supports. This
can be a serious problem because we try to base the guest's feature
registers on values that are safe system-wide. Furthermore, KVM does not
implement the latest ISA in the PMU and Debug architecture, so we
constrain these fields to supported values.
Since KVM now correctly handles CP15 and CP10 register traps, we no
longer need to clear HCR_EL2.TID3 for 32 bit guests and will instead
emulate reads with their safe values.
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220503060205.2823727-6-oupton@google.com
|
|
In order to enable HCR_EL2.TID3 for AArch32 guests KVM needs to handle
traps where ESR_EL2.EC=0x8, which corresponds to an attempted VMRS
access from an ID group register. Specifically, the MVFR{0-2} registers
are accessed this way from AArch32. Conveniently, these registers are
architecturally mapped to MVFR{0-2}_EL1 in AArch64. Furthermore, KVM
already handles reads to these aliases in AArch64.
Plumb VMRS read traps through to the general AArch64 system register
handler.
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220503060205.2823727-5-oupton@google.com
|
|
KVM currently does not trap ID register accesses from an AArch32 EL1.
This is painful for a couple of reasons. Certain unimplemented features
are visible to AArch32 EL1, as we limit PMU to version 3 and the debug
architecture to v8.0. Additionally, we attempt to paper over
heterogeneous systems by using register values that are safe
system-wide. All this hard work is completely sidestepped because KVM
does not set TID3 for AArch32 guests.
Fix up handling of CP15 feature registers by simply rerouting to their
AArch64 aliases. Punt setting HCR_EL2.TID3 to a later change, as we need
to fix up the oddball CP10 feature registers still.
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220503060205.2823727-4-oupton@google.com
|
|
emulate_sys_reg() returns 1 unconditionally, even though a a system
register access can fail. Furthermore, kvm_handle_sys_reg() writes to Rt
for every register read, regardless of if it actually succeeded.
Though this pattern is safe (as params.regval is initialized with the
current value of Rt) it is a bit ugly. Indicate failure if the register
access could not be emulated and only write to Rt on success.
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220503060205.2823727-3-oupton@google.com
|
|
KVM indicates success/failure in several ways, but generally an integer
is used when conditionally bouncing to userspace is involved. That is
not the case from emulate_cp(); just use a bool instead.
No functional change intended.
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220503060205.2823727-2-oupton@google.com
|
|
operational
In emulated environments, the bridge ports enslaved to br1 get a carrier
before changing br1's PVID. This means that by the time the PVID is
changed, br1 is already operational and configured with an IPv6
link-local address.
When the test is run with netdevs registered by mlxsw, changing the PVID
is vetoed, as changing the VID associated with an existing L3 interface
is forbidden. This restriction is similar to the 8021q driver's
restriction of changing the VID of an existing interface.
Fix this by taking br1 down and bringing it back up when it is fully
configured.
With this fix, the test reliably passes on top of both the SW and HW
data paths (emulated or not).
Fixes: 239e754af854 ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20220502084507.364774-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Radhey Shyam Pandey says:
====================
emaclite: improve error handling and minor cleanup
This patchset does error handling for of_address_to_resource() and also
removes "Don't advertise 1000BASE-T" and auto negotiation.
Changes for v3:
- Resolve git apply conflicts for 2/2 patch.
Changes for v2:
- Added Andrew's reviewed by tag in 1/2 patch.
- Move ret to down to align with reverse xmas tree style in 2/2 patch.
- Also add fixes tag in 2/2 patch.
- Specify tree name in subject prefix.
====================
Link: https://lore.kernel.org/r/1651476470-23904-1-git-send-email-radhey.shyam.pandey@xilinx.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
check the return value of of_address_to_resource() and also add
missing of_node_put() for np and npp nodes.
Fixes: e0a3bc65448c ("net: emaclite: Support multiple phys connected to one MDIO bus")
Addresses-Coverity: Event check_return value.
Signed-off-by: Shravya Kumbham <shravya.kumbham@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In xemaclite_open() function we are setting the max speed of
emaclite to 100Mb using phy_set_max_speed() function so,
there is no need to write the advertising registers to stop
giga-bit speed and the phy_start() function starts the
auto-negotiation so, there is no need to handle it separately
using advertising registers. Remove the phy_read and phy_write
of advertising registers in xemaclite_open() function.
Signed-off-by: Shravya Kumbham <shravya.kumbham@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Aborting commands that have already been sent to the firmware can
cause BUG in qlt_free_cmd(): BUG_ON(cmd->sg_mapped)
For instance:
- Command passes rdx_to_xfer state, maps sgl, sends to the firmware
- Reset occurs, qla2xxx performs ISP error recovery, aborts the command
- Target stack calls qlt_abort_cmd() and then qlt_free_cmd()
- BUG_ON(cmd->sg_mapped) in qlt_free_cmd() occurs because sgl was not
unmapped
Thus, unmap sgl in qlt_abort_cmd() for commands with the aborted flag set.
Link: https://lore.kernel.org/r/AS8PR10MB4952D545F84B6B1DFD39EC1E9DEE9@AS8PR10MB4952.EURPRD10.PROD.OUTLOOK.COM
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Gleb Chesnokov <Chesnokov.G@raidix.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The handling of the ALUA transitioning state is currently broken. When a
target goes into this state, it is expected that the target is allowed to
stay in this state for the implicit transition timeout without a path
failure. The handler has this logic, but it gets skipped currently.
When the target transitions, there is in-flight I/O from the initiator. The
first of these responses from the target will be a unit attention letting
the initiator know that the ALUA state has changed. The remaining
in-flight I/Os, before the initiator finds out that the portal state has
changed, will return not ready, ALUA state is transitioning. The portal
state will change to SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all
new I/O immediately failing the path unexpectedly. The path failure happens
in less than a second instead of the expected successes until the
transition timer is exceeded.
Allow I/Os to continue while the path is in the ALUA transitioning
state. The handler already takes care of a target that stays in the
transitioning state for too long by changing the state to ALUA state
standby once the transition timeout is exceeded at which point the path
will fail.
Link: https://lore.kernel.org/r/CAHZQxy+4sTPz9+pY3=7VJH+CLUJsDct81KtnR2be8ycN5mhqTg@mail.gmail.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Krishna Kant <krishna.kant@purestorage.com>
Acked-by: Seamus Connor <sconnor@purestorage.com>
Signed-off-by: Brian Bunker <brian@purestorage.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Issuing a memop on a protected vm does not make sense,
neither is the memory readable/writable, nor does it make sense to check
storage keys. This is why the ioctl will return -EINVAL when it detects
the vm to be protected. However, in order to ensure that the vm cannot
become protected during the memop, the kvm->lock would need to be taken
for the duration of the ioctl. This is also required because
kvm_s390_pv_is_protected asserts that the lock must be held.
Instead, don't try to prevent this. If user space enables secure
execution concurrently with a memop it must accecpt the possibility of
the memop failing.
Still check if the vm is currently protected, but without locking and
consider it a heuristic.
Fixes: ef11c9463ae0 ("KVM: s390: Add vm IOCTL for key checked guest absolute memory access")
Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20220322153204.2637400-1-scgl@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes mostly around how some file attributes could be set.
- fix handling of compression property:
- don't allow setting it on anything else than regular file or
directory
- do not allow setting it on nodatacow files via properties
- improved error handling when setting xattr
- make sure symlinks are always properly logged"
* tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: skip compression property for anything other than files and dirs
btrfs: do not BUG_ON() on failure to update inode when setting xattr
btrfs: always log symlinks in full mode
btrfs: do not allow compression on nodatacow files
btrfs: export a helper for compression hard check
|
|
This reverts commit daaca3522a8e67c46e39ef09c1d542e866f85f3b.
Commit daaca3522a8e ("block: release rq qos structures for queue without
disk") is only needed for v5.15~v5.17, and isn't needed for v5.18, so
revert it.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220426024936.3321341-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The helper function, vcpu_to_pi_desc(), is defined to get the posted
interrupt descriptor from vcpu. There is one place that doesn't use
it, and instead references vmx_vcpu->pi_desc directly. Remove the
inconsistency.
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <ee7be7832bc424546fd4f05015a844a0205b5ba2.1646422845.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
For TDX guests, the maximum number of vcpus needs to be specified when the
TDX guest VM is initialized (creating the TDX data corresponding to TDX
guest) before creating vcpu. It needs to record the maximum number of
vcpus on VM creation (KVM_CREATE_VM) and return error if the number of
vcpus exceeds it
Because there is already max_vcpu member in arm64 struct kvm_arch, move it
to common struct kvm and initialize it to KVM_MAX_VCPUS before
kvm_arch_init_vm() instead of adding it to x86 struct kvm_arch.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <e53234cdee6a92357d06c80c03d77c19cdefb804.1646422845.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This can cause various unexpected issues, since VM is partially
destroyed at that point.
For example when AVIC is enabled, this causes avic_vcpu_load to
access physical id page entry which is already freed by .vm_destroy.
Fixes: 8221c1370056 ("svm: Manage vcpu load/unload when enable AVIC")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
For some net events in irdma_net_event notifier, the netdev can be NULL
which will cause a crash in rdma_vlan_dev_real_dev. Fix this by moving
all processing to the NETEVENT_NEIGH_UPDATE case where the netdev is
guaranteed to not be NULL.
Fixes: 6702bc147448 ("RDMA/irdma: Fix netdev notifications for vlan's")
Link: https://lore.kernel.org/r/20220425181703.1634-4-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
QP destroy is synchronous and waits for its refcnt to be decremented in
irdma_cm_node_free_cb (for iWARP) which fires after the RCU grace period
elapses.
Applications running a large number of connections are exposed to high
wait times on destroy QP for events like SIGABORT.
The long pole for this wait time is the firing of the call_rcu callback
during a CM node destroy which can be slow. It holds the QP reference
count and blocks the destroy QP from completing.
call_rcu only needs to make sure that list walkers have a reference to the
cm_node object before freeing it and thus need to wait for grace period
elapse. The rest of the connection teardown in irdma_cm_node_free_cb is
moved out of the grace period wait in irdma_destroy_connection. Also,
replace call_rcu with a simple kfree_rcu as it just needs to do a kfree on
the cm_node
Fixes: 146b9756f14c ("RDMA/irdma: Add connection manager")
Link: https://lore.kernel.org/r/20220425181703.1634-3-shiraz.saleem@intel.com
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When connection establishment fails in iWARP mode, an app can drain the
QPs and hang because flush isn't issued when the QP is modified from RTR
state to error. Issue a flush in this case using function
irdma_cm_disconn().
Update irdma_cm_disconn() to do flush when cm_id is NULL, which is the
case when the QP is in RTR state and there is an error in the connection
establishment.
Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://lore.kernel.org/r/20220425181703.1634-2-shiraz.saleem@intel.com
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
We defer file assignment to ensure that fixed files work with links
between a direct accept/open and the links that follow it. But this has
the side effect that normal file assignment is then not complete by the
time that request submission has been done.
For deferred execution, if the file is a regular file, assign it when
we do the async prep anyway.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
pwmchip_add() unconditionally assigns the base ID dynamically. Commit
f9a8ee8c8bcd1 ("pwm: Always allocate PWM chip base ID dynamically")
dropped all base assignment from drivers under drivers/pwm/. It missed
this driver. Fix that.
Fixes: f9a8ee8c8bcd1 ("pwm: Always allocate PWM chip base ID dynamically")
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
|
|
Gpiolib interprets the elements of "gpio-reserved-ranges" as "start,size"
because it clears "size" bits starting from the "start" bit in the according
bitmap. So it has to use "greater" instead of "greater or equal" when performs
bounds check to make sure that GPIOs are in the available range.
Previous implementation skipped ranges that include the last GPIO in
the range.
I wrote the mail to the maintainers
(https://lore.kernel.org/linux-gpio/20220412115554.159435-1-andrei.lalaev@emlid.com/T/#u)
of the questioned DTSes (because I couldn't understand how the maintainers
interpreted this property), but I haven't received a response.
Since the questioned DTSes use "gpio-reserved-ranges = <0 4>"
(i.e., the beginning of the range), this patch doesn't affect these DTSes at all.
TBH this patch doesn't break any existing DTSes because none of them
reserve gpios at the end of range.
Fixes: 726cb3ba4969 ("gpiolib: Support 'gpio-reserved-ranges' property")
Signed-off-by: Andrei Lalaev <andrei.lalaev@emlid.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Cc: stable@vger.kernel.org
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
|
|
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Take care of faults occuring between the PARange and IPA range by
injecting an exception
- Fix S2 faults taken from a host EL0 in protected mode
- Work around Oops caused by a PMU access from a 32bit guest when PMU
has been created. This is a temporary bodge until we fix it for
good.
x86:
- Fix potential races when walking host page table
- Fix shadow page table leak when KVM runs nested
- Work around bug in userspace when KVM synthesizes leaf 0x80000021
on older (pre-EPYC) or Intel processors
Generic (but affects only RISC-V):
- Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: work around QEMU issue with synthetic CPUID leaves
Revert "x86/mm: Introduce lookup_address_in_mm()"
KVM: x86/mmu: fix potential races when walking host page table
KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
KVM: arm64: Inject exception on out-of-IPA-range translation fault
KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not set
KVM: arm64: Handle host stage-2 faults from 32-bit EL0
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- A fix to disable PCI/MSI[-X] masking for XEN_HVM guests as that is
solely controlled by the hypervisor
- A build fix to make the function prototype (__warn()) as visible as
the definition itself
- A bunch of objtool annotation fixes which have accumulated over time
- An ORC unwinder fix to handle bad input gracefully
- Well, we thought the microcode gets loaded in time in order to
restore the microcode-emulated MSRs but we thought wrong. So there's
a fix for that to have the ordering done properly
- Add new Intel model numbers
- A spelling fix
* tag 'x86_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pci/xen: Disable PCI/MSI[-X] masking for XEN_HVM guests
bug: Have __warn() prototype defined unconditionally
x86/Kconfig: fix the spelling of 'becoming' in X86_KERNEL_IBT config
objtool: Use offstr() to print address of missing ENDBR
objtool: Print data address for "!ENDBR" data warnings
x86/xen: Add ANNOTATE_NOENDBR to startup_xen()
x86/uaccess: Add ENDBR to __put_user_nocheck*()
x86/retpoline: Add ANNOTATE_NOENDBR for retpolines
x86/static_call: Add ANNOTATE_NOENDBR to static call trampoline
objtool: Enable unreachable warnings for CLANG LTO
x86,objtool: Explicitly mark idtentry_body()s tail REACHABLE
x86,objtool: Mark cpu_startup_entry() __noreturn
x86,xen,objtool: Add UNWIND hint
lib/strn*,objtool: Enforce user_access_begin() rules
MAINTAINERS: Add x86 unwinding entry
x86/unwind/orc: Recheck address range after stack info was updated
x86/cpu: Load microcode during restore_processor_state()
x86/cpu: Add new Alderlake and Raptorlake CPU model numbers
|
|
Convert B53 to use phylink_pcs for the serdes rather than hooking it
into the MAC-layer callbacks.
Fixes: 81c1681cbb9f ("net: dsa: b53: mark as non-legacy")
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:
"A bunch of objtool fixes to improve unwinding, sibling call detection,
fallthrough detection and relocation handling of weak symbols when the
toolchain strips section symbols"
* tag 'objtool_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix code relocs vs weak symbols
objtool: Fix type of reloc::addend
objtool: Fix function fallthrough detection for vmlinux
objtool: Fix sibling call detection in alternatives
objtool: Don't set 'jump_dest' for sibling calls
x86/uaccess: Don't jump between functions
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Fix locking when accessing device MSI descriptors
* tag 'irq_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bus: fsl-mc-msi: Fix MSI descriptor mutex lock for msi_first_desc()
|
|
with the usage of this function in nic_mbx_intr_handler().
Cache the Linux interrupt numbers in struct nicpf and use that cache in the
interrupt handler to select the mailbox.
Fixes: 495c66aca3da ("genirq/msi: Convert to new functions")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sunil Goutham <sgoutham@marvell.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2041772
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Duoming Zhou says:
====================
Replace improper checks and fix bugs in nfc subsystem
The first patch is used to replace improper checks in netlink related
functions of nfc core, the second patch is used to fix bugs in
nfcmrvl driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
nfcmrvl_nci_unregister_dev to avoid bugs
There are destructive operations such as nfcmrvl_fw_dnld_abort and
gpio_free in nfcmrvl_nci_unregister_dev. The resources such as firmware,
gpio and so on could be destructed while the upper layer functions such as
nfcmrvl_fw_dnld_start and nfcmrvl_nci_recv_frame is executing, which leads
to double-free, use-after-free and null-ptr-deref bugs.
There are three situations that could lead to double-free bugs.
The first situation is shown below:
(Thread 1) | (Thread 2)
nfcmrvl_fw_dnld_start |
... | nfcmrvl_nci_unregister_dev
release_firmware() | nfcmrvl_fw_dnld_abort
kfree(fw) //(1) | fw_dnld_over
| release_firmware
... | kfree(fw) //(2)
| ...
The second situation is shown below:
(Thread 1) | (Thread 2)
nfcmrvl_fw_dnld_start |
... |
mod_timer |
(wait a time) |
fw_dnld_timeout | nfcmrvl_nci_unregister_dev
fw_dnld_over | nfcmrvl_fw_dnld_abort
release_firmware | fw_dnld_over
kfree(fw) //(1) | release_firmware
... | kfree(fw) //(2)
The third situation is shown below:
(Thread 1) | (Thread 2)
nfcmrvl_nci_recv_frame |
if(..->fw_download_in_progress)|
nfcmrvl_fw_dnld_recv_frame |
queue_work |
|
fw_dnld_rx_work | nfcmrvl_nci_unregister_dev
fw_dnld_over | nfcmrvl_fw_dnld_abort
release_firmware | fw_dnld_over
kfree(fw) //(1) | release_firmware
| kfree(fw) //(2)
The firmware struct is deallocated in position (1) and deallocated
in position (2) again.
The crash trace triggered by POC is like below:
BUG: KASAN: double-free or invalid-free in fw_dnld_over
Call Trace:
kfree
fw_dnld_over
nfcmrvl_nci_unregister_dev
nci_uart_tty_close
tty_ldisc_kill
tty_ldisc_hangup
__tty_hangup.part.0
tty_release
...
What's more, there are also use-after-free and null-ptr-deref bugs
in nfcmrvl_fw_dnld_start. If we deallocate firmware struct, gpio or
set null to the members of priv->fw_dnld in nfcmrvl_nci_unregister_dev,
then, we dereference firmware, gpio or the members of priv->fw_dnld in
nfcmrvl_fw_dnld_start, the UAF or NPD bugs will happen.
This patch reorders destructive operations after nci_unregister_device
in order to synchronize between cleanup routine and firmware download
routine.
The nci_unregister_device is well synchronized. If the device is
detaching, the firmware download routine will goto error. If firmware
download routine is executing, nci_unregister_device will wait until
firmware download routine is finished.
Fixes: 3194c6870158 ("NFC: nfcmrvl: add firmware download support")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The device_is_registered() in nfc core is used to check whether
nfc device is registered in netlink related functions such as
nfc_fw_download(), nfc_dev_up() and so on. Although device_is_registered()
is protected by device_lock, there is still a race condition between
device_del() and device_is_registered(). The root cause is that
kobject_del() in device_del() is not protected by device_lock.
(cleanup task) | (netlink task)
|
nfc_unregister_device | nfc_fw_download
device_del | device_lock
... | if (!device_is_registered)//(1)
kobject_del//(2) | ...
... | device_unlock
The device_is_registered() returns the value of state_in_sysfs and
the state_in_sysfs is set to zero in kobject_del(). If we pass check in
position (1), then set zero in position (2). As a result, the check
in position (1) is useless.
This patch uses bool variable instead of device_is_registered() to judge
whether the nfc device is registered, which is well synchronized.
Fixes: 3e256b8f8dfa ("NFC: add nfc subsystem core")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Based on DesignWare Ethernet QoS datasheet, we are seeing the limitation
of Split Header (SPH) feature is not supported for Ipv4 fragmented packet.
This SPH limitation will cause ping failure when the packets size exceed
the MTU size. For example, the issue happens once the basic ping packet
size is larger than the configured MTU size and the data is lost inside
the fragmented packet, replaced by zeros/corrupted values, and leads to
ping fail.
So, disable the Split Header for Intel platforms.
v2: Add fixes tag in commit message.
Fixes: 67afd6d1cfdf("net: stmmac: Add Split Header support and enable it in XGMAC cores")
Cc: <stable@vger.kernel.org> # 5.10.x
Suggested-by: Ong, Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com>
Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: Tan Tee Min <tee.min.tan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The PASID is being freed too early. It needs to stay around until after
device drivers that might be using it have had a chance to clear it out
of the hardware.
The relevant refcounts are:
mmget() /mmput() refcount the mm's address space
mmgrab()/mmdrop() refcount the mm itself
The PASID is currently tied to the life of the mm's address space and freed
in __mmput(). This makes logical sense because the PASID can't be used
once the address space is gone.
But, this misses an important point: even after the address space is gone,
the PASID will still be programmed into a device. Device drivers might,
for instance, still need to flush operations that are outstanding and need
to use that PASID. They do this at file->release() time.
Device drivers call the IOMMU driver to hold a reference on the mm itself
and drop it at file->release() time. But, the IOMMU driver holds a
reference on the mm itself, not the address space. The address space (and
the PASID) is long gone by the time the driver tries to clean up. This is
effectively a use-after-free bug on the PASID.
To fix this, move the PASID free operation from __mmput() to __mmdrop().
This ensures that the IOMMU driver's existing mmgrab() keeps the PASID
allocated until it drops its mm reference.
Fixes: 701fac40384f ("iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit")
Reported-by: Zhangfei Gao <zhangfei.gao@foxmail.com>
Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhangfei Gao <zhangfei.gao@foxmail.com>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20220428180041.806809-1-fenghua.yu@intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some small driver core and kernfs fixes for some reported
problems. They include:
- kernfs regression that is causing oopses in 5.17 and newer releases
- topology sysfs fixes for a few small reported problems.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: fix NULL dereferencing in kernfs_remove
topology: Fix up build warning in topology_is_visible()
arch_topology: Do not set llc_sibling if llc_id is invalid
topology: make core_mask include at least cluster_siblings
topology/sysfs: Hide PPIN on systems that do not support it.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are a small number of char/misc/other driver fixes for 5.18-rc5
Nothing major in here, this is mostly IIO driver fixes along with some
other small things:
- at25 driver fix for systems without a dma-able stack
- phy driver fixes for reported issues
- binder driver fixes for reported issues
All of these have been in linux-next without any reported problems"
* tag 'char-misc-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (31 commits)
eeprom: at25: Use DMA safe buffers
binder: Gracefully handle BINDER_TYPE_FDA objects with num_fds=0
binder: Address corner cases in deferred copy and fixup
phy: amlogic: fix error path in phy_g12a_usb3_pcie_probe()
iio: imu: inv_icm42600: Fix I2C init possible nack
iio: dac: ltc2688: fix voltage scale read
interconnect: qcom: sdx55: Drop IP0 interconnects
interconnect: qcom: sc7180: Drop IP0 interconnects
phy: ti: Add missing pm_runtime_disable() in serdes_am654_probe
phy: mapphone-mdm6600: Fix PM error handling in phy_mdm6600_probe
phy: ti: omap-usb2: Fix error handling in omap_usb2_enable_clocks
bus: mhi: host: pci_generic: Flush recovery worker during freeze
bus: mhi: host: pci_generic: Add missing poweroff() PM callback
phy: ti: tusb1210: Fix an error handling path in tusb1210_probe()
phy: samsung: exynos5250-sata: fix missing device put in probe error paths
phy: samsung: Fix missing of_node_put() in exynos_sata_phy_probe
phy: ti: Fix missing of_node_put in ti_pipe3_get_sysctrl()
phy: ti: tusb1210: Make tusb1210_chg_det_states static
iio:dac:ad3552r: Fix an IS_ERR() vs NULL check
iio: sx9324: Fix default precharge internal resistance register
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some small serial driver fixes, and a larger number of GSM
line discipline fixes for 5.18-rc5.
These include:
- lots of tiny n_gsm fixes for issues to resolve a number of reported
problems. Seems that people are starting to actually use this code
again.
- 8250 driver fixes for some devices
- imx serial driver fix
- amba-pl011 driver fix
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (27 commits)
tty: n_gsm: fix sometimes uninitialized warning in gsm_dlci_modem_output()
serial: 8250: Correct the clock for EndRun PTP/1588 PCIe device
serial: 8250: Also set sticky MCR bits in console restoration
tty: n_gsm: fix software flow control handling
tty: n_gsm: fix invalid use of MSC in advanced option
tty: n_gsm: fix broken virtual tty handling
Revert "serial: sc16is7xx: Clear RS485 bits in the shutdown"
tty: n_gsm: fix missing update of modem controls after DLCI open
serial: 8250: Fix runtime PM for start_tx() for empty buffer
serial: imx: fix overrun interrupts in DMA mode
serial: amba-pl011: do not time out prematurely when draining tx fifo
tty: n_gsm: fix incorrect UA handling
tty: n_gsm: fix reset fifo race condition
tty: n_gsm: fix missing tty wakeup in convergence layer type 2
tty: n_gsm: fix wrong signal octets encoding in MSC
tty: n_gsm: fix wrong command frame length field encoding
tty: n_gsm: fix wrong command retry handling
tty: n_gsm: fix missing explicit ldisc flush
tty: n_gsm: fix wrong DLCI release order
tty: n_gsm: fix insufficient txframe size
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of small USB driver fixes for 5.18-rc5 for some
reported issues and new quirks. They include:
- dwc3 driver fixes
- xhci driver fixes
- typec driver fixes
- new usb-serial driver ids
- added new USB devices to existing quirk tables
- other tiny fixes
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (31 commits)
usb: phy: generic: Get the vbus supply
usb: dwc3: gadget: Return proper request status
usb: dwc3: pci: add support for the Intel Meteor Lake-P
usb: dwc3: core: Only handle soft-reset in DCTL
usb: gadget: configfs: clear deactivation flag in configfs_composite_unbind()
usb: misc: eud: Fix an error handling path in eud_probe()
usb: core: Don't hold the device lock while sleeping in do_proc_control()
usb: dwc3: Try usb-role-switch first in dwc3_drd_init
usb: dwc3: core: Fix tx/rx threshold settings
usb: mtu3: fix USB 3.0 dual-role-switch from device to host
xhci: Enable runtime PM on second Alderlake controller
usb: dwc3: fix backwards compat with rockchip devices
dt-bindings: usb: samsung,exynos-usb2: add missing required reg
usb: misc: fix improper handling of refcount in uss720_probe()
USB: Fix ehci infinite suspend-resume loop issue in zhaoxin
usb: typec: tcpm: Fix undefined behavior due to shift overflowing the constant
usb: typec: rt1719: Fix build error without CONFIG_POWER_SUPPLY
usb: typec: ucsi: Fix role swapping
usb: typec: ucsi: Fix reuse of completion structure
usb: xhci: tegra:Fix PM usage reference leak of tegra_xusb_unpowergate_partitions
...
|