|
Reuse the newly added DMA API to cache the IOVA and only link/unlink pages
in the fast path for the UMEM ODP flow.
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
In preparation for removing dma_list, store the access mask in the PFN
pointer rather than in dma_addr_t.
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatter list APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.
This uniqueness has been a long-standing pain point: the scatter list API
is mandatory, yet expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) because the
scatter list itself cannot be improved.
Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO, rlist); this series instead splits up the
DMA API to allow callers to bring their own data structure.
The API is split up into parts:
- Allocate IOVA space:
To do any pre-allocation required. This is done based on the caller
supplying some details about how much IOMMU address space it would need
in the worst case.
- Map and unmap relevant structures to pre-allocated IOVA space:
Perform the actual mapping into the pre-allocated IOVA. This is very
similar to dma_map_page().
Thanks
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add include path to find hns_roce_trace.h to fix the following
build error:
In file included from drivers/infiniband/hw/hns/hns_roce_trace.h:213,
from drivers/infiniband/hw/hns/hns_roce_hw_v2.c:53:
./include/trace/define_trace.h:110:42: fatal error: ./hns_roce_trace.h: No such file or directory
110 | #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
| ^
compilation terminated.
Fixes: 02007e3ddc07 ("RDMA/hns: Add trace for flush CQE")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Closes: https://lore.kernel.org/linux-next/b7dd4dda-37d8-47e4-8d78-b6585be21cfd@paulmck-laptop/T/#t
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250507033903.2879433-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
siw_mem_add() was added in 2019 by commit 2251334dcac9 ("rdma/siw:
application buffer management") but has remained unused.
Remove it.
Link: https://patch.msgid.link/r/20250505210226.88994-1-linux@treblig.org
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
sc_drop() and sdma_all_idle() were both added in 2015's commit
7724105686e7 ("IB/hfi1: add driver files") but have remained unused.
Remove them.
Link: https://patch.msgid.link/r/20250505205419.88131-1-linux@treblig.org
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Introduce new DMA APIs to perform DMA linkage of buffers
in layers higher than DMA.
In the proposed API, the callers will perform the following steps.
In map path:
	if (dma_can_use_iova(...))
	    dma_iova_alloc()
	    for (page in range)
	       dma_iova_link_next(...)
	    dma_iova_sync(...)
	else
	    /* Fallback to legacy map pages */
	    for (all pages)
	       dma_map_page(...)
In unmap path:
	if (dma_can_use_iova(...))
	    dma_iova_destroy()
	else
	    for (all pages)
	       dma_unmap_page(...)
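For illustration, a minimal C sketch of the fallback-aware map path. Function
names and signatures follow the pseudocode above and are assumptions; the
final merged interface may differ:

	/* Sketch only: signatures assumed from the flow described above. */
	static int example_map(struct device *dev, struct dma_iova_state *state,
			       struct page **pages, size_t npages)
	{
		size_t i, mapped = 0;
		int ret;

		if (!dma_can_use_iova(dev, state, npages << PAGE_SHIFT))
			return -EOPNOTSUPP; /* caller falls back to dma_map_page() */

		dma_iova_alloc(dev, state, npages << PAGE_SHIFT);
		for (i = 0; i < npages; i++) {
			ret = dma_iova_link_next(dev, state,
						 page_to_phys(pages[i]),
						 PAGE_SIZE, DMA_BIDIRECTIONAL, 0);
			if (ret) {
				dma_iova_destroy(dev, state);
				return ret;
			}
			mapped += PAGE_SIZE;
		}
		return dma_iova_sync(dev, state, 0, mapped);
	}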
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
Split the iommu logic from iommu_dma_map_page into a separate helper.
This not only keeps the code neatly separated, but will also allow for
reuse in another caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
The existing .map_pages() callback provides both allocation of IOVA
and linking of DMA pages. That combination works great for most of the
callers who use it in control paths, but is less effective in fast
paths where there may be multiple calls to map_page().
These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving the range
linkage operation in the fast path.
Provide an interface to allocate/deallocate IOVA; the next patch will
link/unlink DMA ranges to that specific IOVA.
In the new API a DMA mapping transaction is identified by a
struct dma_iova_state, which holds some precomputed information
for the transaction that does not change for each page being
mapped; add a check for whether IOVA can be used for the specific
transaction.
The API is exported from dma-iommu as it is the only implementation
supported; the namespace is clearly different from the iommu_* functions,
which are not allowed to be used directly. This code layout allows us to
save a function call per API call in the datapath, as well as a lot of
boilerplate code.
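A sketch of the control-path/fast-path split this enables. The names follow
this series' pseudocode and are assumptions, not the verbatim interface:

	/* Control path: reserve the IOVA range once per transaction. */
	static int example_setup(struct device *dev,
				 struct dma_iova_state *state, size_t size)
	{
		if (!dma_can_use_iova(dev, state, size))
			return -EOPNOTSUPP;	/* fall back to dma_map_page() */
		dma_iova_alloc(dev, state, size);
		return 0;
	}

	/* Fast path: only link a page into the pre-allocated range. */
	static int example_link(struct device *dev,
				struct dma_iova_state *state, struct page *page)
	{
		return dma_iova_link_next(dev, state, page_to_phys(page),
					  PAGE_SIZE, DMA_TO_DEVICE, 0);
	}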
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
Add a kernel-doc section for iommu_unmap_fast() to document the existing
limitation of the underlying functions, which can't split individual ranges.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
For the upcoming IOVA-based DMA API we want to batch the
ops->iotlb_sync_map() call after mapping multiple IOVAs from
dma-iommu without having a scatterlist. Improve the API.
Add a wrapper for the map_sync as iommu_sync_map() so that callers
don't need to poke into the methods directly.
Formalize __iommu_map() into iommu_map_nosync() which requires the
caller to call iommu_sync_map() after all maps are completed.
Refactor the existing sanity checks from all the different layers
into iommu_map_nosync().
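The batched pattern this enables looks roughly like the sketch below. The
signatures of iommu_map_nosync() and iommu_sync_map() are assumed to mirror
iommu_map():

	static int example_map_batch(struct iommu_domain *domain,
				     unsigned long iova, phys_addr_t *paddrs,
				     size_t n)
	{
		size_t i;
		int ret;

		for (i = 0; i < n; i++) {
			ret = iommu_map_nosync(domain, iova + i * PAGE_SIZE,
					       paddrs[i], PAGE_SIZE,
					       IOMMU_READ | IOMMU_WRITE,
					       GFP_KERNEL);
			if (ret)
				return ret;
		}
		/* One IOTLB sync for the whole batch instead of one per map. */
		return iommu_sync_map(domain, iova, n * PAGE_SIZE);
	}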
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
To support the upcoming non-scatterlist mapping helpers, we need to go
back to have them called outside of the DMA API. Thus move them out of
dma-map-ops.h, which is only for DMA API implementations to pci-p2pdma.h,
which is for driver use.
Note that the core helper is still not exported as the mapping is
expected to be done only by very highlevel subsystem code at least for
now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
The current scheme, with a single helper to determine the P2P status
and map a scatterlist segment, forces users to always use the map_sg
helper to DMA map, which we're trying to get away from because it is
very cache-inefficient.
Refactor the code so that there is a single helper that checks the P2P
state for a page, including the case where it is not a P2P page, to
simplify the callers, and a second one to perform the address translation
for a bus-mapped P2P transfer that does not depend on the scatterlist
structure.
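The resulting caller-side shape is roughly as follows. The helper names
(pci_p2pdma_state(), pci_p2pdma_bus_addr_map()) are assumptions for
illustration, as is the PCI_P2PDMA_MAP_NONE "not a P2P page" result described
above:

	static dma_addr_t example_map_one(struct pci_p2pdma_map_state *state,
					  struct device *dev, struct page *page)
	{
		switch (pci_p2pdma_state(state, dev, page)) {
		case PCI_P2PDMA_MAP_NONE:
			/* Not a P2P page: take the regular DMA mapping path. */
			return dma_map_page(dev, page, 0, PAGE_SIZE,
					    DMA_TO_DEVICE);
		case PCI_P2PDMA_MAP_BUS_ADDR:
			/* Bus-mapped P2P: translate, no scatterlist needed. */
			return pci_p2pdma_bus_addr_map(state,
						       page_to_phys(page));
		default:
			return DMA_MAPPING_ERROR;
		}
	}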
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
Upon RQ destruction, if the firmware command fails (the RQ being the
last resource to be destroyed), some SW resources were already cleaned
up regardless of the failure.
Now properly roll back the object to its original state upon such a failure,
in order to avoid a use-after-free if someone tries to destroy the
object again, which results in the following kernel trace:
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 37589 at lib/refcount.c:28 refcount_warn_saturate+0xf4/0x148
Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) rfkill mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) psample mlxfw(OE) mlx_compat(OE) macsec tls pci_hyperv_intf sunrpc vfat fat virtio_net net_failover failover fuse loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_console virtio_gpu virtio_blk virtio_dma_buf virtio_mmio dm_mirror dm_region_hash dm_log dm_mod xpmem(OE)
CPU: 0 UID: 0 PID: 37589 Comm: python3 Kdump: loaded Tainted: G OE ------- --- 6.12.0-54.el10.aarch64 #1
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : refcount_warn_saturate+0xf4/0x148
lr : refcount_warn_saturate+0xf4/0x148
sp : ffff80008b81b7e0
x29: ffff80008b81b7e0 x28: ffff000133d51600 x27: 0000000000000001
x26: 0000000000000000 x25: 00000000ffffffea x24: ffff00010ae80f00
x23: ffff00010ae80f80 x22: ffff0000c66e5d08 x21: 0000000000000000
x20: ffff0000c66e0000 x19: ffff00010ae80340 x18: 0000000000000006
x17: 0000000000000000 x16: 0000000000000020 x15: ffff80008b81b37f
x14: 0000000000000000 x13: 2e656572662d7265 x12: ffff80008283ef78
x11: ffff80008257efd0 x10: ffff80008283efd0 x9 : ffff80008021ed90
x8 : 0000000000000001 x7 : 00000000000bffe8 x6 : c0000000ffff7fff
x5 : ffff0001fb8e3408 x4 : 0000000000000000 x3 : ffff800179993000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000133d51600
Call trace:
refcount_warn_saturate+0xf4/0x148
mlx5_core_put_rsc+0x88/0xa0 [mlx5_ib]
mlx5_core_destroy_rq_tracked+0x64/0x98 [mlx5_ib]
mlx5_ib_destroy_wq+0x34/0x80 [mlx5_ib]
ib_destroy_wq_user+0x30/0xc0 [ib_core]
uverbs_free_wq+0x28/0x58 [ib_uverbs]
destroy_hw_idr_uobject+0x34/0x78 [ib_uverbs]
uverbs_destroy_uobject+0x48/0x240 [ib_uverbs]
__uverbs_cleanup_ufile+0xd4/0x1a8 [ib_uverbs]
uverbs_destroy_ufile_hw+0x48/0x120 [ib_uverbs]
ib_uverbs_close+0x2c/0x100 [ib_uverbs]
__fput+0xd8/0x2f0
__fput_sync+0x50/0x70
__arm64_sys_close+0x40/0x90
invoke_syscall.constprop.0+0x74/0xd0
do_el0_svc+0x48/0xe8
el0_svc+0x44/0x1d0
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x1a4/0x1a8
Fixes: e2013b212f9f ("net/mlx5_core: Add RQ and SQ event handling")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://patch.msgid.link/3181433ccdd695c63560eeeb3f0c990961732101.1745839855.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The send completion handler can run after cm_id has advanced to another
message. The cm_id lock is not needed in this case, but a recent change
re-used cm_free_priv_msg(), which asserts that the lock is held and
WARNs if the cm_id's currently outstanding msg is different from the one
being freed.
Fixes: 1e5159219076 ("IB/cm: Do not hold reference on cm_id unless needed")
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Link: https://patch.msgid.link/0c364c29142f72b7875fdeba51f3c9bd6ca863ee.1745839788.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct tid_rb_node **", but the return type will be
"struct rb_node **". These are the same allocation size (pointer size),
but the types do not match. Adjust the allocation type to match the
assignment.
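The class of change looks like this (illustrative, using the sizeof(*ptr)
idiom so the allocation and the assignment can never drift apart; `count`
is a hypothetical variable):

	struct tid_rb_node **nodes;

	-	nodes = kcalloc(count, sizeof(struct rb_node *), GFP_KERNEL);
	+	nodes = kcalloc(count, sizeof(*nodes), GFP_KERNEL);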
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250426061247.work.261-kees@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "unsigned long **", but the returned type will be
"long **". These are the same allocation size (pointer size), but the
types do not match. Adjust the allocation type to match the assignment.
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250426061208.work.000-kees@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add trace for CMDQ dumping.
Output example:
$ cat /sys/kernel/debug/tracing/trace
tracer: nop
entries-in-buffer/entries-written: 2/2 #P:128
_-----=> irqs-off/BH-disabled
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / _-=> migrate-disable
|||| / delay
TASK-PID CPU# ||||| TIMESTAMP FUNCTION
| | | ||||| | |
kworker/u512:1-14003 [089] b..1. 50737.238304: hns_cmdq_req: 0000:bd:00.0 cmdq opcode:0x8500, flag:0x1, retval:0x0, data:{0x2,0x0,0x0,0xffff0000,0x32323232,0x0}
kworker/u512:1-14003 [089] b..1. 50737.238316: hns_cmdq_resp: 0000:bd:00.0 cmdq opcode:0x8500, flag:0x2, retval:0x0, data:{0x2,0x0,0x0,0xffff0000,0x32323232,0x0}
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250421132750.1363348-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
hns_roce_hw_v2.h has a direct dependency on hnae3.h due to the
inline function hns_roce_write64(), but it doesn't currently include
this header. As a result, files including hns_roce_hw_v2.h must also
include hnae3.h to avoid compilation errors, even if they themselves
don't really rely on hnae3.h.
This doesn't make sense; hns_roce_hw_v2.h should include hnae3.h
directly.
Fixes: d3743fa94ccd ("RDMA/hns: Fix the chip hanging caused by sending doorbell during reset")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250421132750.1363348-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add trace for MR/MTR attribute dumping.
Output example:
$ cat /sys/kernel/debug/tracing/trace
tracer: nop
entries-in-buffer/entries-written: 2/2 #P:128
_-----=> irqs-off/BH-disabled
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / _-=> migrate-disable
|||| / delay
TASK-PID CPU# ||||| TIMESTAMP FUNCTION
| | | ||||| | |
ib_send_bw-14751 [111] ..... 8763.823038: hns_buf_attr: rg cnt:1,
pg_sft:0xc, mtt_only:no, rg 0 (sz:131072, hop:2), rg 1 (sz:0, hop:0),
rg 2 (sz:0, hop:0)
ib_send_bw-14751 [111] ..... 8763.823118: hns_mr:
iova:0xffffb2968000, size:131072, key:512, pd:1, pbl_hop:1, npages:4,
type:0, status:0
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250421132750.1363348-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add trace for AEQE dumping.
Output example:
$ cat /sys/kernel/debug/tracing/trace
tracer: nop
entries-in-buffer/entries-written: 2/2 #P:128
_-----=> irqs-off/BH-disabled
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / _-=> migrate-disable
|||| / delay
TASK-PID CPU# ||||| TIMESTAMP FUNCTION
| | | ||||| | |
<idle>-0 [120] d.h1. 7995.835587: hns_ae_info: event 19 aeqe:
{0x80006013,0x0,0x0,0x10d2c,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0}
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250421132750.1363348-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add trace for WQE dumping, including SQ, RQ and SRQ.
Output example:
$ cat /sys/kernel/debug/tracing/trace
tracer: nop
entries-in-buffer/entries-written: 2/2 #P:128
_-----=> irqs-off/BH-disabled
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / _-=> migrate-disable
|||| / delay
TASK-PID CPU# ||||| TIMESTAMP FUNCTION
| | | ||||| | |
roce_test_main-22730 [074] d..1. 16133.898282: hns_sq_wqe: SQ 0xc wqe
(0x0/0xffff0820a6076060): {0x180,0x639c,0x0,0x1000000,0x0,0x0,0x0,0x0,
0x639c,0x300,0xf7e38000,0x0,0x0,0x0,0x0,0x0}
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250421132750.1363348-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add trace to print the producer index of QP when triggering flush CQE.
Output example:
$ cat /sys/kernel/debug/tracing/trace
tracer: nop
entries-in-buffer/entries-written: 2/2 #P:128
_-----=> irqs-off/BH-disabled
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / _-=> migrate-disable
|||| / delay
TASK-PID CPU# ||||| TIMESTAMP FUNCTION
| | | ||||| | |
ib_send_bw-11474 [075] d..1. 2393.434738: hns_sq_flush_cqe: SQ 0x2 flush head 0xb5c7.
ib_send_bw-11474 [075] d..1. 2393.434739: hns_rq_flush_cqe: RQ 0x2 flush head 0.
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250421132750.1363348-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Major Linux distributions have phased out support for 32-bit machines. Since
rxe is primarily used for development and testing, the benefit of
maintaining 32-bit support is minimal. This change simplifies the ATOMIC
WRITE implementation and improves maintainability of the driver.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://patch.msgid.link/20250421025101.3588139-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
rxe_run_task() has been unused since 2024's
commit 23bc06af547f ("RDMA/rxe: Don't call direct between tasks")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250419132725.199785-1-linux@treblig.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
assign_lock_key kernel/locking/lockdep.c:986 [inline]
register_lock_class+0x4a3/0x4c0 kernel/locking/lockdep.c:1300
__lock_acquire+0x99/0x1ba0 kernel/locking/lockdep.c:5110
lock_acquire kernel/locking/lockdep.c:5866 [inline]
lock_acquire+0x179/0x350 kernel/locking/lockdep.c:5823
__timer_delete_sync+0x152/0x1b0 kernel/time/timer.c:1644
rxe_qp_do_cleanup+0x5c3/0x7e0 drivers/infiniband/sw/rxe/rxe_qp.c:815
execute_in_process_context+0x3a/0x160 kernel/workqueue.c:4596
__rxe_cleanup+0x267/0x3c0 drivers/infiniband/sw/rxe/rxe_pool.c:232
rxe_create_qp+0x3f7/0x5f0 drivers/infiniband/sw/rxe/rxe_verbs.c:604
create_qp+0x62d/0xa80 drivers/infiniband/core/verbs.c:1250
ib_create_qp_kernel+0x9f/0x310 drivers/infiniband/core/verbs.c:1361
ib_create_qp include/rdma/ib_verbs.h:3803 [inline]
rdma_create_qp+0x10c/0x340 drivers/infiniband/core/cma.c:1144
rds_ib_setup_qp+0xc86/0x19a0 net/rds/ib_cm.c:600
rds_ib_cm_initiate_connect+0x1e8/0x3d0 net/rds/ib_cm.c:944
rds_rdma_cm_event_handler_cmn+0x61f/0x8c0 net/rds/rdma_transport.c:109
cma_cm_event_handler+0x94/0x300 drivers/infiniband/core/cma.c:2184
cma_work_handler+0x15b/0x230 drivers/infiniband/core/cma.c:3042
process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238
process_scheduled_works kernel/workqueue.c:3319 [inline]
worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400
kthread+0x3c2/0x780 kernel/kthread.c:464
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
The root cause is as follows:
In rxe_create_qp(), rxe_qp_from_init() is called to create the QP; if
rxe_qp_from_init() fails, rxe_cleanup() is called to release all the
allocated resources, including the timers retrans_timer and rnr_nak_timer.
rxe_qp_from_init() calls rxe_qp_init_req() to initialize these timers,
but only at the very end of rxe_qp_init_req(). If an error occurs before
the timers are initialized, this problem occurs.
The solution is to check whether these timers have been initialized;
if they have not, skip deleting them.
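A sketch of the described check (illustrative, not necessarily the verbatim
patch; a never-set-up timer in the zero-initialized QP has a NULL callback,
so it can be skipped safely):

	if (qp->retrans_timer.function)
		timer_delete_sync(&qp->retrans_timer);
	if (qp->rnr_nak_timer.function)
		timer_delete_sync(&qp->rnr_nak_timer);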
Fixes: 8700e3e7c485 ("Soft RoCE driver")
Reported-by: syzbot+4edb496c3cad6e953a31@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=4edb496c3cad6e953a31
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20250419080741.1515231-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The last use of rdma_res_to_id() was removed in 2020 by
commit 211cd9459fda ("RDMA: Add dedicated CM_ID resource tracker function")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250418165848.241305-1-linux@treblig.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Check the PF capability flags to determine whether the 4M, 1G, and 2G
pages are supported, and add these page sizes to mana_ib if so.
Define possible page sizes in enum gdma_page_type and
remove unused enum atb_page_size.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1744621234-26114-4-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add IB_ZERO_BASED to the valid flags and use
the corresponding MR creation request for zero-based
memory.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1744621234-26114-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add IB_ACCESS_REMOTE_ATOMIC to the valid flags for MRs and use
the corresponding flag bit during MR creation in the HW.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1744621234-26114-2-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
On x86_64 with gcc version 13.3.0, I compile
drivers/infiniband/hw/hns/hns_roce_hw_v2.c with:
make defconfig
./scripts/kconfig/merge_config.sh .config <(
echo CONFIG_COMPILE_TEST=y
echo CONFIG_HNS3=m
echo CONFIG_INFINIBAND=m
echo CONFIG_INFINIBAND_HNS_HIP08=m
)
make KCFLAGS="-fno-inline-small-functions -fno-inline-functions-called-once" \
drivers/infiniband/hw/hns/hns_roce_hw_v2.o
Then I get a compile error:
CALL scripts/checksyscalls.sh
DESCEND objtool
INSTALL libsubcmd_headers
CC [M] drivers/infiniband/hw/hns/hns_roce_hw_v2.o
In file included from drivers/infiniband/hw/hns/hns_roce_hw_v2.c:47:
drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function 'update_srq_db':
drivers/infiniband/hw/hns/hns_roce_common.h:74:17: error: 'db' is used uninitialized [-Werror=uninitialized]
74 | *((__le32 *)_ptr + (field_h) / 32) &= \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/hns/hns_roce_common.h:90:17: note: in expansion of macro '_hr_reg_clear'
90 | _hr_reg_clear(ptr, field_type, field_h, field_l); \
| ^~~~~~~~~~~~~
drivers/infiniband/hw/hns/hns_roce_common.h:95:39: note: in expansion of macro '_hr_reg_write'
95 | #define hr_reg_write(ptr, field, val) _hr_reg_write(ptr, field, val)
| ^~~~~~~~~~~~~
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:948:9: note: in expansion of macro 'hr_reg_write'
948 | hr_reg_write(&db, DB_TAG, srq->srqn);
| ^~~~~~~~~~~~
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:946:31: note: 'db' declared here
946 | struct hns_roce_v2_db db;
| ^~
cc1: all warnings being treated as errors
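One minimal way to avoid this class of warning is to zero-initialize the
doorbell before the read-modify-write hr_reg_write() sequence (a sketch;
the actual fix may differ):

	-	struct hns_roce_v2_db db;
	+	struct hns_roce_v2_db db = {};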
Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com>
Co-developed-by: Winston Wen <wentao@uniontech.com>
Signed-off-by: Winston Wen <wentao@uniontech.com>
Link: https://patch.msgid.link/FF922C77946229B6+20250411105459.90782-5-chenlinxuan@uniontech.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Some functions return int values even though they are declared to return
enum resp_states. This patch resolves the mismatches in rxe.
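The shape of the fix (function name hypothetical):

	-static int example_check(struct rxe_qp *qp, struct rxe_pkt_info *pkt)
	+static enum resp_states example_check(struct rxe_qp *qp,
	+				      struct rxe_pkt_info *pkt)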
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://patch.msgid.link/20250409102701.1275265-1-matsuda-daisuke@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In the past %pK was preferable to %p as it would not leak raw pointer
values into the kernel log.
Since commit ad67b74d2469 ("printk: hash addresses printed with %p")
the regular %p has been improved to avoid this issue.
Furthermore, restricted pointers ("%pK") were never meant to be used
through printk(). They can still unintentionally leak raw pointers or
acquire sleeping locks in atomic contexts.
Switch to the regular pointer formatting, which is safer and
easier to reason about.
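The conversion itself is mechanical (illustrative example):

	-	pr_debug("cq %pK\n", cq);
	+	pr_debug("cq %p\n", cq);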
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/20250407-restricted-pointers-infiniband-v1-1-22b20504b84d@linutronix.de
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add rxe_odp_do_atomic_write() so that ODP specific steps are applied to
ATOMIC WRITE requests.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://patch.msgid.link/20250324075649.3313968-3-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
For persistent memories, add rxe_odp_flush_pmem_iova() so that ODP specific
steps are executed. Otherwise, no additional consideration is required.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://patch.msgid.link/20250324075649.3313968-2-matsuda-daisuke@fujitsu.com
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In workloads where there are many processes establishing connections using
RDMA CM in parallel (large scale MPI), there can be heavy contention for
mad_agent_lock in cm_alloc_msg.
This contention can occur while inside of a spin_lock_irq region, leading
to interrupts being disabled for extended durations on many
cores. Furthermore, it leads to the serialization of rdma_create_ah calls,
which has negative performance impacts for NICs which are capable of
processing multiple address handle creations in parallel.
The end result is the machine becoming unresponsive, hung task warnings,
netdev TX timeouts, etc.
Since the lock appears to exist only to protect against cm_remove_one, it
can be changed to an rwlock to resolve these issues.
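For illustration, the change is essentially the following (not the verbatim
patch):

	-static DEFINE_SPINLOCK(mad_agent_lock);
	+static DEFINE_RWLOCK(mad_agent_lock);

	 /* hot-path readers such as cm_alloc_msg() */
	-	spin_lock(&mad_agent_lock);
	+	read_lock(&mad_agent_lock);

	 /* the teardown path in cm_remove_one() */
	-	spin_lock(&mad_agent_lock);
	+	write_lock(&mad_agent_lock);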
Reproducer:
Server:
for i in $(seq 1 512); do
ucmatose -c 32 -p $((i + 5000)) &
done
Client:
for i in $(seq 1 512); do
ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
done
Fixes: 76039ac9095f ("IB/cm: Protect cm_dev, cm_ports and mad_agent with kref and lock")
Link: https://patch.msgid.link/r/20250220175612.2763122-1-jmoroni@google.com
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Remove unused parameters.
Link: https://patch.msgid.link/r/20250327114724.3454268-2-huangjunxian6@hisilicon.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Remove unused flex-array member `class_data` from
`struct opa_mad_notice_attr`.
Fix the following warning:
drivers/infiniband/hw/hfi1/mad.c:23:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
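The shape the warning complains about, for illustration (types hypothetical):

	struct notice {
		u16 trap_num;
		u8 class_data[];	/* flexible array member */
	};

	struct container {
		struct notice n;	/* flex-array member not at the end of
					 * 'struct container': triggers the
					 * warning */
		u32 more;
	};

Since `class_data` was never used, dropping it resolves the warning without
changing the layout for any user.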
Link: https://patch.msgid.link/r/Z-wiYkll8Vo3ME3P@kspp
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
As opposed to open-code, using the ERR_CAST macro clearly indicates that
this is a pointer to an error value and a type conversion was performed.
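For example (illustrative):

	if (IS_ERR(umem))
	-	return (struct ib_mr *)umem;
	+	return ERR_CAST(umem);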
Link: https://patch.msgid.link/r/20250401211015750qxOfU9XZ8QgKizM1Lcyq2@zte.com.cn
Signed-off-by: Li Haoran <li.haoran7@zte.com.cn>
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
As opposed to open-code, using the ERR_CAST macro clearly indicates that
this is a pointer to an error value and a type conversion was performed.
Link: https://patch.msgid.link/r/202504012109233981_YPVbd4wQzmAzP3tA5IG@zte.com.cn
Signed-off-by: Li Haoran <li.haoran7@zte.com.cn>
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
As opposed to open-code, using the ERR_CAST macro clearly indicates that
this is a pointer to an error value and a type conversion was performed.
Link: https://patch.msgid.link/r/20250401210840146_IyrV3zlejzz3eAnDmMSB@zte.com.cn
Signed-off-by: Li Haoran <li.haoran7@zte.com.cn>
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
In drivers/infiniband/hw/mlx4/mcg.c and drivers/infiniband/hw/mlx5/mr.c,
`msecs_to_jiffies` is used to convert milliseconds to jiffies.
For constant milliseconds, using `msecs_to_jiffies` introduces additional
computational overhead. For example, it is unnecessary to check
if m > jiffies_to_msecs(MAX_JIFFY_OFFSET) or (int)m < 0 for constants,
while using `secs_to_jiffies` can avoid these extra calculations.
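The conversion is of the form (illustrative; the work item name is made up):

	-	schedule_delayed_work(&group->work, msecs_to_jiffies(1000));
	+	schedule_delayed_work(&group->work, secs_to_jiffies(1));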
Link: https://patch.msgid.link/r/20250326221955611qu6Ix3Pt5WgKvhL6sTySX@zte.com.cn
Signed-off-by: Peng Jiang <jiang.peng9@zte.com.cn>
Signed-off-by: Ye Xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@depends on patch@
expression E;
@@
-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)
-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)
Link: https://patch.msgid.link/r/20250219-rdma-secs-to-jiffies-v1-2-b506746561a9@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A set of final cleanups for the timer subsystem:
- Convert all del_timer[_sync]() instances over to the new
timer_delete[_sync]() API and remove the legacy wrappers.
Conversion was done with coccinelle plus some manual fixups as
coccinelle chokes on scoped_guard().
- The final cleanup of the hrtimer_init() to hrtimer_setup()
conversion.
This has been delayed to the end of the merge window, so that all
patches which have been merged through other trees are in mainline
and all new users are caught.
Doing this right before rc1 ensures that new code which is merged post
rc1 is not introducing new instances of the original functionality"
* tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing/timers: Rename the hrtimer_init event to hrtimer_setup
hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack()
hrtimers: Rename debug_init() to debug_setup()
hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper()
hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns()
hrtimers: Make callback function pointer private
hrtimers: Merge __hrtimer_init() into __hrtimer_setup()
hrtimers: Switch to use __htimer_setup()
hrtimers: Delete hrtimer_init()
treewide: Convert new and leftover hrtimer_init() users
treewide: Switch/rename to timer_delete[_sync]()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more irq updates from Thomas Gleixner:
"A set of updates for the interrupt subsystem:
- A treewide cleanup for the irq_domain code, which makes the naming
consistent and gets rid of the original oddity of naming domains
'host'.
This is a trivial mechanical change and is done late to ensure that
all instances have been caught and new code merged post rc1 won't
reintroduce new instances.
- A trivial consistency fix in the migration code
The recent introduction of irq_force_complete_move() in the core
code, causes a problem for the nostalgia crowd who maintains ia64
out of tree.
The code assumes that hierarchical interrupt domains are enabled
and dereferences irq_data::parent_data unconditionally. That works
in mainline because both architectures which enable that code have
hierarchical domains enabled. Though it breaks the ia64 build,
which enables the functionality, but does not have hierarchical
domains.
While it's not really a problem for mainline today, this
unconditional dereference is inconsistent and trivially fixable by
using the existing helper function irqd_get_parent_data(), which
has the appropriate #ifdeffery in place"
* tag 'irq-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/migration: Use irqd_get_parent_data() in irq_force_complete_move()
irqdomain: Stop using 'host' for domain
irqdomain: Rename irq_get_default_host() to irq_get_default_domain()
irqdomain: Rename irq_set_default_host() to irq_set_default_domain()
|
|
Pull drm fixes from Dave Airlie:
"Weekly fixes, mostly from the end of last week, this week was very
quiet, maybe you scared everyone away. It's mostly amdgpu, and xe,
with some i915, adp and bridge bits, since I think this is overly
quiet I'd expect rc2 to be a bit more lively.
bridge:
- tda998x: Select CONFIG_DRM_KMS_HELPER
amdgpu:
- Guard against potential division by 0 in fan code
- Zero RPM support for SMU 14.0.2
- Properly handle SI and CIK support being disabled
- PSR fixes
- DML2 fixes
- DP Link training fix
- Vblank fixes
- RAS fixes
- Partitioning fix
- SDMA fix
- SMU 13.0.x fixes
- Rom fetching fix
- MES fixes
- Queue reset fix
xe:
- Fix NULL pointer dereference on error path
- Add missing HW workaround for BMG
- Fix survivability mode not triggering
- Fix build warning when DRM_FBDEV_EMULATION is not set
i915:
- Bounds check for scalers in DSC prefill latency computation
- Fix build by adding a missing include
adp:
- Fix error handling in plane setup"
* tag 'drm-next-2025-04-05' of https://gitlab.freedesktop.org/drm/kernel: (34 commits)
drm/i2c: tda998x: select CONFIG_DRM_KMS_HELPER
drm/amdgpu/gfx12: fix num_mec
drm/amdgpu/gfx11: fix num_mec
drm/amd/pm: Add gpu_metrics_v1_8
drm/amdgpu: Prefer shadow rom when available
drm/amd/pm: Update smu metrics table for smu_v13_0_6
drm/amd/pm: Remove host limit metrics support
Remove unnecessary firmware version check for gc v9_4_2
drm/amdgpu: stop unmapping MQD for kernel queues v3
Revert "drm/amdgpu/sdma_v4_4_2: update VM flush implementation for SDMA"
drm/amdgpu: Parse all deferred errors with UMC aca handle
drm/amdgpu: Update ta ras block
drm/amdgpu: Add NPS2 to DPX compatible mode
drm/amdgpu: Use correct gfx deferred error count
drm/amd/display: Actually do immediate vblank disable
drm/amd/display: prevent hang on link training fail
Revert "drm/amd/display: dml2 soc dscclk use DPM table clk setting"
drm/amd/display: Increase vblank offdelay for PSR panels
drm/amd: Handle being compiled without SI or CIK support better
drm/amd/pm: Add zero RPM enabled OD setting support for SMU14.0.2
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- a brand new driver for touchpads and touchbars in newer Apple devices
- support for Berlin-A series in goodix-berlin touchscreen driver
- improvements to matrix_keypad driver to better handle GPIOs toggling
- assorted small cleanups in other input drivers
* tag 'input-for-v6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: goodix_berlin - add support for Berlin-A series
dt-bindings: input: goodix,gt9916: Document gt9897 compatible
dt-bindings: input: matrix_keypad - add wakeup-source property
dt-bindings: input: matrix_keypad - add missing property
Input: pm8941-pwrkey - fix dev_dbg() output in pm8941_pwrkey_irq()
Input: synaptics - hide unused smbus_pnp_ids[] array
Input: apple_z2 - fix potential confusion in Kconfig
Input: matrix_keypad - use fsleep for delays after activating columns
Input: matrix_keypad - add settle time after enabling all columns
dt-bindings: input: matrix_keypad: add settle time after enabling all columns
dt-bindings: input: matrix_keypad: convert to YAML
dt-bindings: input: Correct indentation and style in DTS example
MAINTAINERS: Add entries for Apple Z2 touchscreen driver
Input: apple_z2 - add a driver for Apple Z2 touchscreens
dt-bindings: input: touchscreen: Add Z2 controller
Input: Switch to use hrtimer_setup()
Input: drop vb2_ops_wait_prepare/finish
|
|
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Coccinelle scripted cleanup.
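The conversion is of the form (names illustrative; the argument order is
timer, callback, clock id, mode):

	-	hrtimer_init(&priv->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	-	priv->timer.function = example_timer_cb;
	+	hrtimer_setup(&priv->timer, example_timer_cb, CLOCK_MONOTONIC,
	+		      HRTIMER_MODE_REL);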
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
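The conversion is of the form (names illustrative):

	-	del_timer(&priv->timer);
	+	timer_delete(&priv->timer);

	-	del_timer_sync(&priv->timer);
	+	timer_delete_sync(&priv->timer);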
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Prepare input updates for 6.15 merge window.
|