Age | Commit message (Collapse) | Author |
|
When invalidating an address range in mlx5, there is an optimization to
do UMR operations in chunks.
Previously, the invalidation counter was incorrectly updated for the
same indexes within a chunk. Now, the invalidation counter is updated
only when a chunk is complete and mlx5r_umr_update_xlt() is called.
This ensures that the counter accurately represents the number of pages
invalidated using UMR.
Fixes: a3de94e3d61e ("IB/mlx5: Introduce ODP diagnostic counters")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/560deb2433318e5947282b070c915f3c81fef77f.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When cur_qp isn't NULL, in order to avoid fetching the QP from
the radix tree again we check if the next cqe QP is identical to
the one we already have.
The bug however is that we are checking if the QP is identical by
checking the QP number inside the CQE against the QP number inside the
mlx5_ib_qp, but that's wrong since the QP number from the CQE is from
FW so it should be matched against mlx5_core_qp which is our FW QP
number.
Otherwise we could use the wrong QP when handling a CQE which could
cause the kernel trace below.
This issue is mainly noticeable over QPs 0 & 1, since for now they are
the only QPs in our driver whereas the QP number inside mlx5_ib_qp
doesn't match the QP number inside mlx5_core_qp.
BUG: kernel NULL pointer dereference, address: 0000000000000012
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP
CPU: 0 UID: 0 PID: 7927 Comm: kworker/u62:1 Not tainted 6.14.0-rc3+ #189
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
RIP: 0010:mlx5_ib_poll_cq+0x4c7/0xd90 [mlx5_ib]
Code: 03 00 00 8d 58 ff 21 cb 66 39 d3 74 39 48 c7 c7 3c 89 6e a0 0f b7 db e8 b7 d2 b3 e0 49 8b 86 60 03 00 00 48 c7 c7 4a 89 6e a0 <0f> b7 5c 98 02 e8 9f d2 b3 e0 41 0f b7 86 78 03 00 00 83 e8 01 21
RSP: 0018:ffff88810511bd60 EFLAGS: 00010046
RAX: 0000000000000010 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88885fa1b3c0 RDI: ffffffffa06e894a
RBP: 00000000000000b0 R08: 0000000000000000 R09: ffff88810511bc10
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88810d593000
R13: ffff88810e579108 R14: ffff888105146000 R15: 00000000000000b0
FS: 0000000000000000(0000) GS:ffff88885fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000012 CR3: 00000001077e6001 CR4: 0000000000370eb0
Call Trace:
<TASK>
? __die+0x20/0x60
? page_fault_oops+0x150/0x3e0
? exc_page_fault+0x74/0x130
? asm_exc_page_fault+0x22/0x30
? mlx5_ib_poll_cq+0x4c7/0xd90 [mlx5_ib]
__ib_process_cq+0x5a/0x150 [ib_core]
ib_cq_poll_work+0x31/0x90 [ib_core]
process_one_work+0x169/0x320
worker_thread+0x288/0x3a0
? work_busy+0xb0/0xb0
kthread+0xd7/0x1f0
? kthreads_online_cpu+0x130/0x130
? kthreads_online_cpu+0x130/0x130
ret_from_fork+0x2d/0x50
? kthreads_online_cpu+0x130/0x130
ret_from_fork_asm+0x11/0x20
</TASK>
Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/4ada09d41f1e36db62c44a9b25c209ea5f054316.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Change all variables storing mlx5_umem_mkc_find_best_pgsz() result to
unsigned long to support values larger than 31 and avoid overflow.
For example: If we try to register 4GB of memory that is contiguous in
physical memory, the driver will optimize the page_size and try to use
an mkey with 4GB entity size. The 'unsigned int' page_size variable will
overflow to '0' and we'll hit the WARN_ON() in alloc_cacheable_mr().
WARNING: CPU: 2 PID: 1203 at drivers/infiniband/hw/mlx5/mr.c:1124 alloc_cacheable_mr+0x22/0x580 [mlx5_ib]
Modules linked in: mlx5_ib mlx5_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre rdma_rxe rdma_ucm ib_uverbs ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm fuse ib_core [last unloaded: mlx5_core]
CPU: 2 UID: 70878 PID: 1203 Comm: rdma_resource_l Tainted: G W 6.14.0-rc4-dirty #43
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:alloc_cacheable_mr+0x22/0x580 [mlx5_ib]
Code: 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 41 52 53 48 83 ec 30 f6 46 28 04 4c 8b 77 08 75 21 <0f> 0b 49 c7 c2 ea ff ff ff 48 8d 65 d0 4c 89 d0 5b 41 5a 41 5c 41
RSP: 0018:ffffc900006ffac8 EFLAGS: 00010246
RAX: 0000000004c0d0d0 RBX: ffff888217a22000 RCX: 0000000000100001
RDX: 00007fb7ac480000 RSI: ffff8882037b1240 RDI: ffff8882046f0600
RBP: ffffc900006ffb28 R08: 0000000000000001 R09: 0000000000000000
R10: 00000000000007e0 R11: ffffea0008011d40 R12: ffff8882037b1240
R13: ffff8882046f0600 R14: ffff888217a22000 R15: ffffc900006ffe00
FS: 00007fb7ed013340(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb7ed1d8000 CR3: 00000001fd8f6006 CR4: 0000000000772eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0x81/0x130
? alloc_cacheable_mr+0x22/0x580 [mlx5_ib]
? report_bug+0xfc/0x1e0
? handle_bug+0x55/0x90
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? alloc_cacheable_mr+0x22/0x580 [mlx5_ib]
create_real_mr+0x54/0x150 [mlx5_ib]
ib_uverbs_reg_mr+0x17f/0x2a0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xca/0x140 [ib_uverbs]
ib_uverbs_run_method+0x6d0/0x780 [ib_uverbs]
? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
ib_uverbs_cmd_verbs+0x19b/0x360 [ib_uverbs]
? walk_system_ram_range+0x79/0xd0
? ___pte_offset_map+0x1b/0x110
? __pte_offset_map_lock+0x80/0x100
ib_uverbs_ioctl+0xac/0x110 [ib_uverbs]
__x64_sys_ioctl+0x94/0xb0
do_syscall_64+0x50/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fb7ecf0737b
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 2a 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffdbe03ecc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffdbe03edb8 RCX: 00007fb7ecf0737b
RDX: 00007ffdbe03eda0 RSI: 00000000c0181b01 RDI: 0000000000000003
RBP: 00007ffdbe03ed80 R08: 00007fb7ecc84010 R09: 00007ffdbe03eed4
R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffdbe03eed4
R13: 000000000000000c R14: 000000000000000c R15: 00007fb7ecc84150
</TASK>
Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/2479a4a3f6fd9bd032e1b6d396274a89c4c5e22f.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Drop the unused access_flags parameter.
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/4d769c51eb012c62b3a92fd916b7886c25b56fbf.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Fix double decrement of 'in_use' counter on push_mkey_locked() failure
while deregistering an MR.
If we fail to return an mkey to the cache in cache_ent_find_and_store()
it'll update the 'in_use' counter. Its caller, revoke_mr(), also updates
it, thus having double decrement.
Wrong value of 'in_use' counter will be exposed through debugfs and can
also cause wrong resizing of the cache when users try to set cache
entry size using the 'size' debugfs.
To address this issue, the 'in_use' counter is now decremented within
mlx5_revoke_mr() also after a successful call to
cache_ent_find_and_store() and not within cache_ent_find_and_store().
Other success or failure flows remains unchanged where it was also
decremented.
Fixes: 8c1185fef68c ("RDMA/mlx5: Change check for cacheable mkeys")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/97e979dff636f232ff4c83ce709c17c727da1fdb.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Destroy all previously created cache entries and work queue when rolling
back the MR cache initialization upon an error.
Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/c41d525fb3c72e28dd38511bf3aaccb5d584063e.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support to allow optional-counters binding to a QP, whereas when
a bind operation is requested depending on the counter optional-counter
binding state the driver will determine if to also add optional-counters
to this QP binding.
The optional-counter binding is done by simply adding a steering
rule for the specific optional-counter condition with the additional
match over that QP number.
Note that optional-counters per QP rules are handled on an earlier prio
than per device counters, and per device counter correctness is maintained
by core whereas it is responsible to sum active counters when checking device
counter and to add them to history count when they are deallocated.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/2cad1b891a6641ae61fe8d92f867e1059121813a.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Change mlx5 Makefile, fs.c and fs.h to support fs compilation regardless
of INFINIBAND_USER_ACCESS config.
In addition allow optional counters support regardless of the config.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://patch.msgid.link/b8dd220456a91538b22c3aff150ab021d7b9e1bf.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This will be useful for the next patches in the series since port number
is needed for optional counters binding and unbinding.
Note that this change is needed since when the operation is done qp->port
isn't necessarily initialized yet and can't be used.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/b6f6797844acbd517358e8d2a270ea9b3e6ecba1.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Whenever a new counter is created, save inside it the user requested
configuration for optional-counters binding, for manual configuration it
is requested directly by the user and for the automatic configuration it
depends on if the automatic binding was enabled with or without
optional-counters binding.
This argument will later be used by the driver to determine if to bind the
optional-counters as well or not when trying to bind this counter to a QP.
It indicates that when binding counters to a QP we also want the
currently enabled link optional-counters to be bound as well.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/82f1c357606a16932979ef9a5910122675c74a3a.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Change rdma_counter allocation to use rdma_zalloc_drv_obj() instead of,
explicitly allocating at core, in order to be contained inside driver
specific structures.
Adjust all drivers that use it to have their containing structure, and
add driver specific initialization operation.
This change is needed to allow upcoming patches to implement
optional-counters binding whereas inside each driver specific counter
struct his bound optional-counters will be maintained.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/a5a484f421fc2e5595158e61a354fba43272b02d.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add the following optional counters:
rdma_tx_packets,rdma_rx_bytes,rdma_rx_packets,rdma_tx_bytes.
Which counts all RDMA packets/bytes sent and received per link.
Note that since each direction packet and byte counter are shared,
the counter is only reset when both counters of that direction
are removed. But from user-perspective each can be enabled/disabled separately.
The counters can be enabled using:
sudo rdma stat set link rocep8s0f0/1 optional-counters rdma_tx_packets
And can be seen using:
rdma stat -j show link rocep8s0f0/1
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/9f2753ad636f21704416df64b47395c8991d1123.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Syzbot reported a slab-use-after-free with the following call trace:
==================================================================
BUG: KASAN: slab-use-after-free in nla_put+0xd3/0x150 lib/nlattr.c:1099
Read of size 5 at addr ffff888140ea1c60 by task syz.0.988/10025
CPU: 0 UID: 0 PID: 10025 Comm: syz.0.988
Not tainted 6.14.0-rc4-syzkaller-00859-gf77f12010f67 #0
Hardware name: Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0x16e/0x5b0 mm/kasan/report.c:521
kasan_report+0x143/0x180 mm/kasan/report.c:634
kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
__asan_memcpy+0x29/0x70 mm/kasan/shadow.c:105
nla_put+0xd3/0x150 lib/nlattr.c:1099
nla_put_string include/net/netlink.h:1621 [inline]
fill_nldev_handle+0x16e/0x200 drivers/infiniband/core/nldev.c:265
rdma_nl_notify_event+0x561/0xef0 drivers/infiniband/core/nldev.c:2857
ib_device_notify_register+0x22/0x230 drivers/infiniband/core/device.c:1344
ib_register_device+0x1292/0x1460 drivers/infiniband/core/device.c:1460
rxe_register_device+0x233/0x350 drivers/infiniband/sw/rxe/rxe_verbs.c:1540
rxe_net_add+0x74/0xf0 drivers/infiniband/sw/rxe/rxe_net.c:550
rxe_newlink+0xde/0x1a0 drivers/infiniband/sw/rxe/rxe.c:212
nldev_newlink+0x5ea/0x680 drivers/infiniband/core/nldev.c:1795
rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:709 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:724
____sys_sendmsg+0x53a/0x860 net/socket.c:2564
___sys_sendmsg net/socket.c:2618 [inline]
__sys_sendmsg+0x269/0x350 net/socket.c:2650
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f42d1b8d169
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 ...
RSP: 002b:00007f42d2960038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f42d1da6320 RCX: 00007f42d1b8d169
RDX: 0000000000000000 RSI: 00004000000002c0 RDI: 000000000000000c
RBP: 00007f42d1c0e2a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f42d1da6320 R15: 00007ffe399344a8
</TASK>
Allocated by task 10025:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
kasan_kmalloc include/linux/kasan.h:260 [inline]
__do_kmalloc_node mm/slub.c:4294 [inline]
__kmalloc_node_track_caller_noprof+0x28b/0x4c0 mm/slub.c:4313
__kmemdup_nul mm/util.c:61 [inline]
kstrdup+0x42/0x100 mm/util.c:81
kobject_set_name_vargs+0x61/0x120 lib/kobject.c:274
dev_set_name+0xd5/0x120 drivers/base/core.c:3468
assign_name drivers/infiniband/core/device.c:1202 [inline]
ib_register_device+0x178/0x1460 drivers/infiniband/core/device.c:1384
rxe_register_device+0x233/0x350 drivers/infiniband/sw/rxe/rxe_verbs.c:1540
rxe_net_add+0x74/0xf0 drivers/infiniband/sw/rxe/rxe_net.c:550
rxe_newlink+0xde/0x1a0 drivers/infiniband/sw/rxe/rxe.c:212
nldev_newlink+0x5ea/0x680 drivers/infiniband/core/nldev.c:1795
rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:709 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:724
____sys_sendmsg+0x53a/0x860 net/socket.c:2564
___sys_sendmsg net/socket.c:2618 [inline]
__sys_sendmsg+0x269/0x350 net/socket.c:2650
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 10035:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:576
poison_slab_object mm/kasan/common.c:247 [inline]
__kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2353 [inline]
slab_free mm/slub.c:4609 [inline]
kfree+0x196/0x430 mm/slub.c:4757
kobject_rename+0x38f/0x410 lib/kobject.c:524
device_rename+0x16a/0x200 drivers/base/core.c:4525
ib_device_rename+0x270/0x710 drivers/infiniband/core/device.c:402
nldev_set_doit+0x30e/0x4c0 drivers/infiniband/core/nldev.c:1146
rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:709 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:724
____sys_sendmsg+0x53a/0x860 net/socket.c:2564
___sys_sendmsg net/socket.c:2618 [inline]
__sys_sendmsg+0x269/0x350 net/socket.c:2650
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
This is because if rename device happens, the old name is freed in
ib_device_rename() with lock, but ib_device_notify_register() may visit
the dev name locklessly by event RDMA_REGISTER_EVENT or
RDMA_NETDEV_ATTACH_EVENT.
Fix this by hold devices_rwsem in ib_device_notify_register().
Reported-by: syzbot+f60349ba1f9f08df349f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=25bc6f0ed2b88b9eb9b8
Fixes: 9cbed5aab5ae ("RDMA/nldev: Add support for RDMA monitoring")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Link: https://patch.msgid.link/20250313092421.944658-1-wangliang74@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support for process_mad hook to retrieve the perf management counters.
Supports IB_PMA_PORT_COUNTERS and IB_PMA_PORT_COUNTERS_EXT counters.
Query the data from HW contexts and FW commands.
Signed-off-by: Preethi G <preethi.gurusiddalingeswaraswamy@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1741855464-27921-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
rxe_mr_do_atomic_op() returns enum resp_states numbers, so the ODP
counterpart must not return raw errno codes.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20250313064540.2619115-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the IB uverbs API calls uobj_get_uobj_read(), which in turn
uses the rdma_lookup_get_uobject() helper to retrieve user objects.
In case of failure, uobj_get_uobj_read() returns NULL, overriding the
error code from rdma_lookup_get_uobject(). The IB uverbs API then
translates this NULL to -EINVAL, masking the actual error and
complicating debugging. For example, applications calling ibv_modify_qp
that fails with EBUSY when retrieving the QP uobject will see the
overridden error code EINVAL instead, masking the actual error.
Furthermore, based on rdma-core commit:
"2a22f1ced5f3 ("Merge pull request #1568 from jakemoroni/master")"
Kernel's IB uverbs return values are either ignored and passed on as is
to application or overridden with other errnos in a few cases.
Thus, to improve error reporting and debuggability, propagate the
original error from rdma_lookup_get_uobject() instead of replacing it
with EINVAL.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/64f9d3711b183984e939962c2f83383904f97dfb.1740577869.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When running under Hyper-V, the master device to the RDMA device is always
bonded to this RDMA device. This is not user-configurable.
The master device can be unbind/bind from the kernel. During those events,
the RDMA device should set to the current netdev to reflect the change of
master device from those events.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1741821332-9392-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Change mana_get_primary_netdev_rcu() to mana_get_primary_netdev(), and
return the ndev with refcount held. The caller is responsible for dropping
the refcount.
Also drop the check for IFF_SLAVE as it is not necessary if the upper
device is present.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1741821332-9392-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Use a meaningful constant explicitly instead of hard-coding a literal.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://patch.msgid.link/20250312065937.1787241-1-matsuda-daisuke@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Use %u for unsigned type and %d for enum.
Signed-off-by: Guofeng Yue <yueguofeng@h-partners.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250311084857.3803665-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add an explanation on the newly added UCAP API.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/d0e095f9a7601437acc2d2fdf8705136d1edf1c5.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This patch adds RDMA_TRANSPORT_RX and RDMA_TRANSPORT_TX as a new flow
table type for matcher creation.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://patch.msgid.link/2287d8c50483e880450c7e8e08d9de34cdec1b14.1741261611.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Verify that the enabled UCAPs are supported by the device before
creating the ucontext.
If supported, create the ucontext with the associated capabilities.
Store the privileged ucontext UID on creation and remove it when
destroying the privileged ucontext. This allows the command interface
to recognize privileged commands through its UID.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/8b180583a207cb30deb7a2967934079749cdcc44.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support for file descriptor array attribute for GET_CONTEXT
commands.
Check that the file descriptor (fd) array represents fds for valid UCAPs.
Store the enabled UCAPs from the fd array as a bitmask in ib_ucontext.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/ebfb30bc947e2259b193c96a319c80e82599045b.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Create UCAP character devices when probing an IB device with supported
firmware capabilities.
If the RDMA_CTRL general object type is supported, check for specific
UCTX capabilities:
Create /dev/infiniband/mlx5_perm_ctrl_local for RDMA_UCAP_MLX5_CTRL_LOCAL
Create /dev/infiniband/mlx5_perm_ctrl_other_vhca for RDMA_UCAP_MLX5_CTRL_OTHER_VHCA
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/30ed40e7a12a694cf4ee257459ed61b145b7837d.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Implement a new User CAPabilities (UCAP) API to provide fine-grained
control over specific firmware features.
This approach offers more granular capabilities than the existing Linux
capabilities, which may be too generic for certain FW features.
This mechanism represents each capability as a character device with
root read-write access. Root processes can grant users special
privileges by allowing access to these character devices (e.g., using
chown).
UCAP character devices are located in /dev/infiniband and the class path
is /sys/class/infiniband_ucaps.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/5a1379187cd21178e8554afc81a3c941f21af22f.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
My static checker says this multiplication can overflow. I'm not an
expert in this code but the call tree would be:
ib_uverbs_handler_UVERBS_METHOD_QP_CREATE() <- reads cap from the user
-> ib_create_qp_user()
-> create_qp()
-> mana_ib_create_qp()
-> mana_ib_create_ud_qp()
-> create_shadow_queue()
It can't hurt to use safer interfaces.
Fixes: c8017f5b4856 ("RDMA/mana_ib: UD/GSI work requests")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/58439ac0-1ee5-4f96-a595-7ab83b59139b@stanley.mountain
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
---------------------------------------------------------------------
Hi,
This is preparation series targeted for mlx5-next, which will be used
later in RDMA.
This series adds RDMA transport steering logic which would allow the
vport group manager to catch control packets from VFs and forward them
to control SW to help with congestion control.
In addition, RDMA will provide new set of APIs to better control exposed
FW capabilities and this series is needed to make sure mlx5 command
interface will ensure that privileged commands can always proceed,
Thanks
Link: https://lore.kernel.org/all/cover.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next:
net/mlx5: fs, add RDMA TRANSPORT steering domain support
net/mlx5: Query ADV_RDMA capabilities
net/mlx5: Limit non-privileged commands
net/mlx5: Allow the throttle mechanism to be more dynamic
net/mlx5: Add RDMA_CTRL HW capabilities
|
|
Add RX and TX RDMA_TRANSPORT flow table namespace, and the ability
to create flow tables in those namespaces.
The RDMA_TRANSPORT RX and TX are per vport.
Packets will traverse through RDMA_TRANSPORT_RX after RDMA_RX and through
RDMA_TRANSPORT_TX before RDMA_TX, ensuring proper control and management.
RDMA_TRANSPORT domains are managed by the vport group manager.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/a6b550d9859a197eafa804b9a8d76916ca481da9.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Query ADV_RDMA capabilities which provide information for
advanced RDMA related features.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/e3e6ede03ea31cd201078dcdd4e407608e4a5a87.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Limit non-privileged UID commands to half of the available command slots
when privileged UIDs are present.
Privileged throttle commands will not be limited.
Use an xarray to store privileged UIDs. Add insert and remove functions
for privileged UIDs management.
Non-user commands (with uid 0) are not limited.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/d2f3dd9a0dbad3c9f2b4bb0723837995e4e06de2.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Previously, throttle commands were identified and limited based on
opcode. These commands were limited to half the command slots using a
semaphore, and callback commands checked the opcode to determine
semaphore release.
To allow exceptions, we introduce a variable to indicate when the
throttle lock is held. This allows scenarios where throttle commands
are not limited. Callback functions use this variable to determine
if the throttle semaphore needs to be released.
This patch contains no functional changes. It's a preparation for the
next patch.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Link: https://patch.msgid.link/055d975edeb816ac4c0fd1e665c6157d11947d26.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add RDMA_CTRL UCTX capabilities and add the RDMA_CTRL general object
type in hca_cap_2.
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/ef7eb24be9a6f247ab52e8b4480350072e5182f5.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
After the erdma_cep_put(new_cep) being called, new_cep will be freed,
and the following dereference will cause a UAF problem. Fix this issue.
Fixes: 920d93eac8b9 ("RDMA/erdma: Add connection management (CM) support")
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
pvrdma_modify_device() was added in 2016 as part of
commit 29c8d9eba550 ("IB: Add vmw_pvrdma driver")
but accidentally it was never wired into the device_ops struct.
After some discussion the best course seems to be just to remove it,
see discussion at:
https://lore.kernel.org/all/Z8TWF6coBUF3l_jk@gallifrey/
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250304215637.68559-1-linux@treblig.org
Acked-by: Vishnu Dasa <vishnu.dasa@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
capable() calls refer to enabled LSMs whether to permit or deny the
request. This is relevant in connection with SELinux, where a
capability check results in a policy decision and by default a denial
message on insufficient permission is issued.
It can lead to three undesired cases:
1. A denial message is generated, even in case the operation was an
unprivileged one and thus the syscall succeeded, creating noise.
2. To avoid the noise from 1. the policy writer adds a rule to ignore
those denial messages, hiding future syscalls, where the task
performs an actual privileged operation, leading to hidden limited
functionality of that task.
3. To avoid the noise from 1. the policy writer adds a rule to permit
the task the requested capability, while it does not need it,
violating the principle of least privilege.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://patch.msgid.link/20250302160657.127253-10-cgoettsche@seltendoof.de
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Bound infiniband iwcm and ucma sysctl writings between SYSCTL_ZERO
and SYSCTL_INT_MAX.
The proc_handler has thus been updated to proc_dointvec_minmax.
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
Link: https://patch.msgid.link/20250224095826.16458-6-nicolas.bouchinet@clip-os.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Commit 467f432a521a ("RDMA/core: Split port and device counter sysfs
attributes") accidentally almost exposed hw counters to non-init net
namespaces. It didn't expose them fully, as an attempt to read any of
those counters leads to a crash like this one:
[42021.807566] BUG: kernel NULL pointer dereference, address: 0000000000000028
[42021.814463] #PF: supervisor read access in kernel mode
[42021.819549] #PF: error_code(0x0000) - not-present page
[42021.824636] PGD 0 P4D 0
[42021.827145] Oops: 0000 [#1] SMP PTI
[42021.830598] CPU: 82 PID: 2843922 Comm: switchto-defaul Kdump: loaded Tainted: G S W I XXX
[42021.841697] Hardware name: XXX
[42021.849619] RIP: 0010:hw_stat_device_show+0x1e/0x40 [ib_core]
[42021.855362] Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 49 89 d0 4c 8b 5e 20 48 8b 8f b8 04 00 00 48 81 c7 f0 fa ff ff <48> 8b 41 28 48 29 ce 48 83 c6 d0 48 c1 ee 04 69 d6 ab aa aa aa 48
[42021.873931] RSP: 0018:ffff97fe90f03da0 EFLAGS: 00010287
[42021.879108] RAX: ffff9406988a8c60 RBX: ffff940e1072d438 RCX: 0000000000000000
[42021.886169] RDX: ffff94085f1aa000 RSI: ffff93c6cbbdbcb0 RDI: ffff940c7517aef0
[42021.893230] RBP: ffff97fe90f03e70 R08: ffff94085f1aa000 R09: 0000000000000000
[42021.900294] R10: ffff94085f1aa000 R11: ffffffffc0775680 R12: ffffffff87ca2530
[42021.907355] R13: ffff940651602840 R14: ffff93c6cbbdbcb0 R15: ffff94085f1aa000
[42021.914418] FS: 00007fda1a3b9700(0000) GS:ffff94453fb80000(0000) knlGS:0000000000000000
[42021.922423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42021.928130] CR2: 0000000000000028 CR3: 00000042dcfb8003 CR4: 00000000003726f0
[42021.935194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42021.942257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42021.949324] Call Trace:
[42021.951756] <TASK>
[42021.953842] [<ffffffff86c58674>] ? show_regs+0x64/0x70
[42021.959030] [<ffffffff86c58468>] ? __die+0x78/0xc0
[42021.963874] [<ffffffff86c9ef75>] ? page_fault_oops+0x2b5/0x3b0
[42021.969749] [<ffffffff87674b92>] ? exc_page_fault+0x1a2/0x3c0
[42021.975549] [<ffffffff87801326>] ? asm_exc_page_fault+0x26/0x30
[42021.981517] [<ffffffffc0775680>] ? __pfx_show_hw_stats+0x10/0x10 [ib_core]
[42021.988482] [<ffffffffc077564e>] ? hw_stat_device_show+0x1e/0x40 [ib_core]
[42021.995438] [<ffffffff86ac7f8e>] dev_attr_show+0x1e/0x50
[42022.000803] [<ffffffff86a3eeb1>] sysfs_kf_seq_show+0x81/0xe0
[42022.006508] [<ffffffff86a11134>] seq_read_iter+0xf4/0x410
[42022.011954] [<ffffffff869f4b2e>] vfs_read+0x16e/0x2f0
[42022.017058] [<ffffffff869f50ee>] ksys_read+0x6e/0xe0
[42022.022073] [<ffffffff8766f1ca>] do_syscall_64+0x6a/0xa0
[42022.027441] [<ffffffff8780013b>] entry_SYSCALL_64_after_hwframe+0x78/0xe2
The problem can be reproduced using the following steps:
ip netns add foo
ip netns exec foo bash
cat /sys/class/infiniband/mlx4_0/hw_counters/*
The panic occurs because of casting the device pointer into an
ib_device pointer using container_of() in hw_stat_device_show() is
wrong and leads to a memory corruption.
However the real problem is that hw counters should never been exposed
outside of the non-init net namespace.
Fix this by saving the index of the corresponding attribute group
(it might be 1 or 2 depending on the presence of driver-specific
attributes) and zeroing the pointer to hw_counters group for compat
devices during the initialization.
With this fix applied hw_counters are not available in a non-init
net namespace:
find /sys/class/infiniband/mlx4_0/ -name hw_counters
/sys/class/infiniband/mlx4_0/ports/1/hw_counters
/sys/class/infiniband/mlx4_0/ports/2/hw_counters
/sys/class/infiniband/mlx4_0/hw_counters
ip netns add foo
ip netns exec foo bash
find /sys/class/infiniband/mlx4_0/ -name hw_counters
Fixes: 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Maher Sanalla <msanalla@nvidia.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://patch.msgid.link/20250227165420.3430301-1-roman.gushchin@linux.dev
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Now that the crc32c() library function directly takes advantage of
architecture-specific optimizations, it is unnecessary to go through the
crypto API. Just use crc32c(). This is much simpler, and it improves
performance due to eliminating the crypto API overhead.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://patch.msgid.link/20250227051207.19470-1-ebiggers@kernel.org
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This is merge of shared branch between RDMA and net-next trees.
* mlx5-next: (550 commits)
net/mlx5: Change POOL_NEXT_SIZE define value and make it global
net/mlx5: Add new health syndrome error and crr bit offset
Linux 6.14-rc3
...
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The last use of one_qsfp_write() was removed in 2016's
commit 145dd2b39958 ("IB/hfi1: Always turn on CDRs for low power QSFP
modules")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250223215543.153312-1-linux@treblig.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In the function mana_ib_gd_create_dma_region if there are no dma blocks
to process the variable `err` remains uninitialized.
Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Kees Bakker <kees@ijzerbout.nl>
Link: https://patch.msgid.link/20250221195833.7516C16290A@bout3.ijzerbout.nl
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Change POOL_NEXT_SIZE define value from 0 to BIT(30), since this define
is used to request the available maximum sized flow table, and zero doesn't
make sense for it, whereas some places in the driver use zero explicitly
expecting the smallest table size possible but instead due to this
define they end up allocating the biggest table size unawarely.
In addition move the definition to "include/linux/mlx5/fs.h" to expose the
define to IB driver as well, while appropriately renaming it.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219085808.349923-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add new error value for trust lockdown in health syndrome enum.
Also, include the offset for crr bit in the health buffer layout.
These changes prepare for downstream patches that update health
event handling.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219085808.349923-2-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Enable 'fetch and add' and 'compare and swap' operations to be used with
ODP. This is comprised of the following steps:
1. Check the driver page table(umem_odp->dma_list) to see if the target
page is both readable and writable.
2. If not, then trigger page fault to map the page.
3. Convert its user space address to a kernel logical address using PFNs
in the driver page table(umem_odp->pfn_list).
4. Execute the operation.
Link: https://patch.msgid.link/r/20241220100936.2193541-6-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses
it to load payloads of requesting packets; responder uses it to process
Send, Write, and Read operaetions; completer uses it to copy data from
response packets of Read and Atomic operations to a user MR.
Allow these operations to be used with ODP by adding a subordinate function
rxe_odp_mr_copy(). It is comprised of the following steps:
1. Check the driver page table(umem_odp->dma_list) to see if pages being
accessed are present with appropriate permission.
2. If necessary, trigger page fault to map the pages.
3. Convert their user space addresses to kernel logical addresses using
PFNs in the driver page table(umem_odp->pfn_list).
4. Execute data copy to/from the pages.
Link: https://patch.msgid.link/r/20241220100936.2193541-5-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Allow userspace to register an ODP-enabled MR, in which case the flag
IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there is no
RDMA operation enabled right now. They will be supported later in the
subsequent two patches.
rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. It syncs
process address space from the CPU page table to the driver page table
(dma_list/pfn_list in umem_odp) when called with RXE_PAGEFAULT_SNAPSHOT
flag. Additionally, It can be used to trigger page fault when pages being
accessed are not present or do not have proper read/write permissions, and
possibly to prefetch pages in the future.
Link: https://patch.msgid.link/r/20241220100936.2193541-4-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
On page invalidation, an MMU notifier callback is invoked to unmap DMA
addresses and update the driver page table(umem_odp->dma_list). The
callback is registered when an ODP-enabled MR is created.
Link: https://patch.msgid.link/r/20241220100936.2193541-3-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
rxe_mr_init() and resp_states are going to be used in rxe_odp.c, which is
to be created in the subsequent patch.
Link: https://patch.msgid.link/r/20241220100936.2193541-2-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
A single scatter-gather entry is limited by a 32 bits "length" field
that is practically 4GB - PAGE_SIZE. This means that even when the
memory is physically contiguous, we might need more than one entry to
represent it. Additionally when using dmabuf, the sg_table might be
originated outside the subsystem and optimized for other needs.
For instance an SGT of 16GB GPU continuous memory might look like this:
(a real life example)
dma_address 34401400000, length fffff000
dma_address 345013ff000, length fffff000
dma_address 346013fe000, length fffff000
dma_address 347013fd000, length fffff000
dma_address 348013fc000, length 4000
Since ib_umem_find_best_pgsz works within SG entries, in the above case
we will result with the worst possible 4KB page size.
Fix this by taking into consideration only the alignment of addresses of
real discontinuity points rather than treating SG entries as such, and
adjust the page iterator to correctly handle cross SG entry pages.
There is currently an assumption that drivers do not ask for pages
bigger than maximal DMA size supported by their devices.
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250217141623.12428-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|