Age | Commit message (Collapse) | Author |
|
Add the following optional counters:
rdma_tx_packets,rdma_rx_bytes,rdma_rx_packets,rdma_tx_bytes.
Which counts all RDMA packets/bytes sent and received per link.
Note that since each direction packet and byte counter are shared,
the counter is only reset when both counters of that direction
are removed. But from user-perspective each can be enabled/disabled separately.
The counters can be enabled using:
sudo rdma stat set link rocep8s0f0/1 optional-counters rdma_tx_packets
And can be seen using:
rdma stat -j show link rocep8s0f0/1
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/9f2753ad636f21704416df64b47395c8991d1123.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Syzbot reported a slab-use-after-free with the following call trace:
==================================================================
BUG: KASAN: slab-use-after-free in nla_put+0xd3/0x150 lib/nlattr.c:1099
Read of size 5 at addr ffff888140ea1c60 by task syz.0.988/10025
CPU: 0 UID: 0 PID: 10025 Comm: syz.0.988
Not tainted 6.14.0-rc4-syzkaller-00859-gf77f12010f67 #0
Hardware name: Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0x16e/0x5b0 mm/kasan/report.c:521
kasan_report+0x143/0x180 mm/kasan/report.c:634
kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
__asan_memcpy+0x29/0x70 mm/kasan/shadow.c:105
nla_put+0xd3/0x150 lib/nlattr.c:1099
nla_put_string include/net/netlink.h:1621 [inline]
fill_nldev_handle+0x16e/0x200 drivers/infiniband/core/nldev.c:265
rdma_nl_notify_event+0x561/0xef0 drivers/infiniband/core/nldev.c:2857
ib_device_notify_register+0x22/0x230 drivers/infiniband/core/device.c:1344
ib_register_device+0x1292/0x1460 drivers/infiniband/core/device.c:1460
rxe_register_device+0x233/0x350 drivers/infiniband/sw/rxe/rxe_verbs.c:1540
rxe_net_add+0x74/0xf0 drivers/infiniband/sw/rxe/rxe_net.c:550
rxe_newlink+0xde/0x1a0 drivers/infiniband/sw/rxe/rxe.c:212
nldev_newlink+0x5ea/0x680 drivers/infiniband/core/nldev.c:1795
rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:709 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:724
____sys_sendmsg+0x53a/0x860 net/socket.c:2564
___sys_sendmsg net/socket.c:2618 [inline]
__sys_sendmsg+0x269/0x350 net/socket.c:2650
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f42d1b8d169
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 ...
RSP: 002b:00007f42d2960038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f42d1da6320 RCX: 00007f42d1b8d169
RDX: 0000000000000000 RSI: 00004000000002c0 RDI: 000000000000000c
RBP: 00007f42d1c0e2a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f42d1da6320 R15: 00007ffe399344a8
</TASK>
Allocated by task 10025:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
kasan_kmalloc include/linux/kasan.h:260 [inline]
__do_kmalloc_node mm/slub.c:4294 [inline]
__kmalloc_node_track_caller_noprof+0x28b/0x4c0 mm/slub.c:4313
__kmemdup_nul mm/util.c:61 [inline]
kstrdup+0x42/0x100 mm/util.c:81
kobject_set_name_vargs+0x61/0x120 lib/kobject.c:274
dev_set_name+0xd5/0x120 drivers/base/core.c:3468
assign_name drivers/infiniband/core/device.c:1202 [inline]
ib_register_device+0x178/0x1460 drivers/infiniband/core/device.c:1384
rxe_register_device+0x233/0x350 drivers/infiniband/sw/rxe/rxe_verbs.c:1540
rxe_net_add+0x74/0xf0 drivers/infiniband/sw/rxe/rxe_net.c:550
rxe_newlink+0xde/0x1a0 drivers/infiniband/sw/rxe/rxe.c:212
nldev_newlink+0x5ea/0x680 drivers/infiniband/core/nldev.c:1795
rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:709 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:724
____sys_sendmsg+0x53a/0x860 net/socket.c:2564
___sys_sendmsg net/socket.c:2618 [inline]
__sys_sendmsg+0x269/0x350 net/socket.c:2650
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 10035:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:576
poison_slab_object mm/kasan/common.c:247 [inline]
__kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2353 [inline]
slab_free mm/slub.c:4609 [inline]
kfree+0x196/0x430 mm/slub.c:4757
kobject_rename+0x38f/0x410 lib/kobject.c:524
device_rename+0x16a/0x200 drivers/base/core.c:4525
ib_device_rename+0x270/0x710 drivers/infiniband/core/device.c:402
nldev_set_doit+0x30e/0x4c0 drivers/infiniband/core/nldev.c:1146
rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:709 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:724
____sys_sendmsg+0x53a/0x860 net/socket.c:2564
___sys_sendmsg net/socket.c:2618 [inline]
__sys_sendmsg+0x269/0x350 net/socket.c:2650
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
This is because if rename device happens, the old name is freed in
ib_device_rename() with lock, but ib_device_notify_register() may visit
the dev name locklessly by event RDMA_REGISTER_EVENT or
RDMA_NETDEV_ATTACH_EVENT.
Fix this by hold devices_rwsem in ib_device_notify_register().
Reported-by: syzbot+f60349ba1f9f08df349f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=25bc6f0ed2b88b9eb9b8
Fixes: 9cbed5aab5ae ("RDMA/nldev: Add support for RDMA monitoring")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Link: https://patch.msgid.link/20250313092421.944658-1-wangliang74@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support for process_mad hook to retrieve the perf management counters.
Supports IB_PMA_PORT_COUNTERS and IB_PMA_PORT_COUNTERS_EXT counters.
Query the data from HW contexts and FW commands.
Signed-off-by: Preethi G <preethi.gurusiddalingeswaraswamy@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1741855464-27921-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
rxe_mr_do_atomic_op() returns enum resp_states numbers, so the ODP
counterpart must not return raw errno codes.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20250313064540.2619115-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently, the IB uverbs API calls uobj_get_uobj_read(), which in turn
uses the rdma_lookup_get_uobject() helper to retrieve user objects.
In case of failure, uobj_get_uobj_read() returns NULL, overriding the
error code from rdma_lookup_get_uobject(). The IB uverbs API then
translates this NULL to -EINVAL, masking the actual error and
complicating debugging. For example, applications calling ibv_modify_qp
that fails with EBUSY when retrieving the QP uobject will see the
overridden error code EINVAL instead, masking the actual error.
Furthermore, based on rdma-core commit:
"2a22f1ced5f3 ("Merge pull request #1568 from jakemoroni/master")"
Kernel's IB uverbs return values are either ignored and passed on as is
to application or overridden with other errnos in a few cases.
Thus, to improve error reporting and debuggability, propagate the
original error from rdma_lookup_get_uobject() instead of replacing it
with EINVAL.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/64f9d3711b183984e939962c2f83383904f97dfb.1740577869.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When running under Hyper-V, the master device to the RDMA device is always
bonded to this RDMA device. This is not user-configurable.
The master device can be unbind/bind from the kernel. During those events,
the RDMA device should set to the current netdev to reflect the change of
master device from those events.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1741821332-9392-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Change mana_get_primary_netdev_rcu() to mana_get_primary_netdev(), and
return the ndev with refcount held. The caller is responsible for dropping
the refcount.
Also drop the check for IFF_SLAVE as it is not necessary if the upper
device is present.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1741821332-9392-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Use a meaningful constant explicitly instead of hard-coding a literal.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://patch.msgid.link/20250312065937.1787241-1-matsuda-daisuke@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Use %u for unsigned type and %d for enum.
Signed-off-by: Guofeng Yue <yueguofeng@h-partners.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250311084857.3803665-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add an explanation on the newly added UCAP API.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/d0e095f9a7601437acc2d2fdf8705136d1edf1c5.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This patch adds RDMA_TRANSPORT_RX and RDMA_TRANSPORT_TX as a new flow
table type for matcher creation.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://patch.msgid.link/2287d8c50483e880450c7e8e08d9de34cdec1b14.1741261611.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Verify that the enabled UCAPs are supported by the device before
creating the ucontext.
If supported, create the ucontext with the associated capabilities.
Store the privileged ucontext UID on creation and remove it when
destroying the privileged ucontext. This allows the command interface
to recognize privileged commands through its UID.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/8b180583a207cb30deb7a2967934079749cdcc44.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support for file descriptor array attribute for GET_CONTEXT
commands.
Check that the file descriptor (fd) array represents fds for valid UCAPs.
Store the enabled UCAPs from the fd array as a bitmask in ib_ucontext.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/ebfb30bc947e2259b193c96a319c80e82599045b.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Create UCAP character devices when probing an IB device with supported
firmware capabilities.
If the RDMA_CTRL general object type is supported, check for specific
UCTX capabilities:
Create /dev/infiniband/mlx5_perm_ctrl_local for RDMA_UCAP_MLX5_CTRL_LOCAL
Create /dev/infiniband/mlx5_perm_ctrl_other_vhca for RDMA_UCAP_MLX5_CTRL_OTHER_VHCA
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/30ed40e7a12a694cf4ee257459ed61b145b7837d.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Implement a new User CAPabilities (UCAP) API to provide fine-grained
control over specific firmware features.
This approach offers more granular capabilities than the existing Linux
capabilities, which may be too generic for certain FW features.
This mechanism represents each capability as a character device with
root read-write access. Root processes can grant users special
privileges by allowing access to these character devices (e.g., using
chown).
UCAP character devices are located in /dev/infiniband and the class path
is /sys/class/infiniband_ucaps.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/5a1379187cd21178e8554afc81a3c941f21af22f.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
My static checker says this multiplication can overflow. I'm not an
expert in this code but the call tree would be:
ib_uverbs_handler_UVERBS_METHOD_QP_CREATE() <- reads cap from the user
-> ib_create_qp_user()
-> create_qp()
-> mana_ib_create_qp()
-> mana_ib_create_ud_qp()
-> create_shadow_queue()
It can't hurt to use safer interfaces.
Fixes: c8017f5b4856 ("RDMA/mana_ib: UD/GSI work requests")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/58439ac0-1ee5-4f96-a595-7ab83b59139b@stanley.mountain
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
---------------------------------------------------------------------
Hi,
This is preparation series targeted for mlx5-next, which will be used
later in RDMA.
This series adds RDMA transport steering logic which would allow the
vport group manager to catch control packets from VFs and forward them
to control SW to help with congestion control.
In addition, RDMA will provide new set of APIs to better control exposed
FW capabilities and this series is needed to make sure mlx5 command
interface will ensure that privileged commands can always proceed,
Thanks
Link: https://lore.kernel.org/all/cover.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next:
net/mlx5: fs, add RDMA TRANSPORT steering domain support
net/mlx5: Query ADV_RDMA capabilities
net/mlx5: Limit non-privileged commands
net/mlx5: Allow the throttle mechanism to be more dynamic
net/mlx5: Add RDMA_CTRL HW capabilities
|
|
Add RX and TX RDMA_TRANSPORT flow table namespace, and the ability
to create flow tables in those namespaces.
The RDMA_TRANSPORT RX and TX are per vport.
Packets will traverse through RDMA_TRANSPORT_RX after RDMA_RX and through
RDMA_TRANSPORT_TX before RDMA_TX, ensuring proper control and management.
RDMA_TRANSPORT domains are managed by the vport group manager.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/a6b550d9859a197eafa804b9a8d76916ca481da9.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Query ADV_RDMA capabilities which provide information for
advanced RDMA related features.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/e3e6ede03ea31cd201078dcdd4e407608e4a5a87.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Limit non-privileged UID commands to half of the available command slots
when privileged UIDs are present.
Privileged throttle commands will not be limited.
Use an xarray to store privileged UIDs. Add insert and remove functions
for privileged UIDs management.
Non-user commands (with uid 0) are not limited.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/d2f3dd9a0dbad3c9f2b4bb0723837995e4e06de2.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Previously, throttle commands were identified and limited based on
opcode. These commands were limited to half the command slots using a
semaphore, and callback commands checked the opcode to determine
semaphore release.
To allow exceptions, we introduce a variable to indicate when the
throttle lock is held. This allows scenarios where throttle commands
are not limited. Callback functions use this variable to determine
if the throttle semaphore needs to be released.
This patch contains no functional changes. It's a preparation for the
next patch.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Link: https://patch.msgid.link/055d975edeb816ac4c0fd1e665c6157d11947d26.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add RDMA_CTRL UCTX capabilities and add the RDMA_CTRL general object
type in hca_cap_2.
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/ef7eb24be9a6f247ab52e8b4480350072e5182f5.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
After the erdma_cep_put(new_cep) being called, new_cep will be freed,
and the following dereference will cause a UAF problem. Fix this issue.
Fixes: 920d93eac8b9 ("RDMA/erdma: Add connection management (CM) support")
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
pvrdma_modify_device() was added in 2016 as part of
commit 29c8d9eba550 ("IB: Add vmw_pvrdma driver")
but accidentally it was never wired into the device_ops struct.
After some discussion the best course seems to be just to remove it,
see discussion at:
https://lore.kernel.org/all/Z8TWF6coBUF3l_jk@gallifrey/
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250304215637.68559-1-linux@treblig.org
Acked-by: Vishnu Dasa <vishnu.dasa@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
capable() calls refer to enabled LSMs whether to permit or deny the
request. This is relevant in connection with SELinux, where a
capability check results in a policy decision and by default a denial
message on insufficient permission is issued.
It can lead to three undesired cases:
1. A denial message is generated, even in case the operation was an
unprivileged one and thus the syscall succeeded, creating noise.
2. To avoid the noise from 1. the policy writer adds a rule to ignore
those denial messages, hiding future syscalls, where the task
performs an actual privileged operation, leading to hidden limited
functionality of that task.
3. To avoid the noise from 1. the policy writer adds a rule to permit
the task the requested capability, while it does not need it,
violating the principle of least privilege.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://patch.msgid.link/20250302160657.127253-10-cgoettsche@seltendoof.de
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Bound infiniband iwcm and ucma sysctl writings between SYSCTL_ZERO
and SYSCTL_INT_MAX.
The proc_handler has thus been updated to proc_dointvec_minmax.
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
Link: https://patch.msgid.link/20250224095826.16458-6-nicolas.bouchinet@clip-os.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Commit 467f432a521a ("RDMA/core: Split port and device counter sysfs
attributes") accidentally almost exposed hw counters to non-init net
namespaces. It didn't expose them fully, as an attempt to read any of
those counters leads to a crash like this one:
[42021.807566] BUG: kernel NULL pointer dereference, address: 0000000000000028
[42021.814463] #PF: supervisor read access in kernel mode
[42021.819549] #PF: error_code(0x0000) - not-present page
[42021.824636] PGD 0 P4D 0
[42021.827145] Oops: 0000 [#1] SMP PTI
[42021.830598] CPU: 82 PID: 2843922 Comm: switchto-defaul Kdump: loaded Tainted: G S W I XXX
[42021.841697] Hardware name: XXX
[42021.849619] RIP: 0010:hw_stat_device_show+0x1e/0x40 [ib_core]
[42021.855362] Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 49 89 d0 4c 8b 5e 20 48 8b 8f b8 04 00 00 48 81 c7 f0 fa ff ff <48> 8b 41 28 48 29 ce 48 83 c6 d0 48 c1 ee 04 69 d6 ab aa aa aa 48
[42021.873931] RSP: 0018:ffff97fe90f03da0 EFLAGS: 00010287
[42021.879108] RAX: ffff9406988a8c60 RBX: ffff940e1072d438 RCX: 0000000000000000
[42021.886169] RDX: ffff94085f1aa000 RSI: ffff93c6cbbdbcb0 RDI: ffff940c7517aef0
[42021.893230] RBP: ffff97fe90f03e70 R08: ffff94085f1aa000 R09: 0000000000000000
[42021.900294] R10: ffff94085f1aa000 R11: ffffffffc0775680 R12: ffffffff87ca2530
[42021.907355] R13: ffff940651602840 R14: ffff93c6cbbdbcb0 R15: ffff94085f1aa000
[42021.914418] FS: 00007fda1a3b9700(0000) GS:ffff94453fb80000(0000) knlGS:0000000000000000
[42021.922423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42021.928130] CR2: 0000000000000028 CR3: 00000042dcfb8003 CR4: 00000000003726f0
[42021.935194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42021.942257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42021.949324] Call Trace:
[42021.951756] <TASK>
[42021.953842] [<ffffffff86c58674>] ? show_regs+0x64/0x70
[42021.959030] [<ffffffff86c58468>] ? __die+0x78/0xc0
[42021.963874] [<ffffffff86c9ef75>] ? page_fault_oops+0x2b5/0x3b0
[42021.969749] [<ffffffff87674b92>] ? exc_page_fault+0x1a2/0x3c0
[42021.975549] [<ffffffff87801326>] ? asm_exc_page_fault+0x26/0x30
[42021.981517] [<ffffffffc0775680>] ? __pfx_show_hw_stats+0x10/0x10 [ib_core]
[42021.988482] [<ffffffffc077564e>] ? hw_stat_device_show+0x1e/0x40 [ib_core]
[42021.995438] [<ffffffff86ac7f8e>] dev_attr_show+0x1e/0x50
[42022.000803] [<ffffffff86a3eeb1>] sysfs_kf_seq_show+0x81/0xe0
[42022.006508] [<ffffffff86a11134>] seq_read_iter+0xf4/0x410
[42022.011954] [<ffffffff869f4b2e>] vfs_read+0x16e/0x2f0
[42022.017058] [<ffffffff869f50ee>] ksys_read+0x6e/0xe0
[42022.022073] [<ffffffff8766f1ca>] do_syscall_64+0x6a/0xa0
[42022.027441] [<ffffffff8780013b>] entry_SYSCALL_64_after_hwframe+0x78/0xe2
The problem can be reproduced using the following steps:
ip netns add foo
ip netns exec foo bash
cat /sys/class/infiniband/mlx4_0/hw_counters/*
The panic occurs because of casting the device pointer into an
ib_device pointer using container_of() in hw_stat_device_show() is
wrong and leads to a memory corruption.
However the real problem is that hw counters should never been exposed
outside of the non-init net namespace.
Fix this by saving the index of the corresponding attribute group
(it might be 1 or 2 depending on the presence of driver-specific
attributes) and zeroing the pointer to hw_counters group for compat
devices during the initialization.
With this fix applied hw_counters are not available in a non-init
net namespace:
find /sys/class/infiniband/mlx4_0/ -name hw_counters
/sys/class/infiniband/mlx4_0/ports/1/hw_counters
/sys/class/infiniband/mlx4_0/ports/2/hw_counters
/sys/class/infiniband/mlx4_0/hw_counters
ip netns add foo
ip netns exec foo bash
find /sys/class/infiniband/mlx4_0/ -name hw_counters
Fixes: 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Maher Sanalla <msanalla@nvidia.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://patch.msgid.link/20250227165420.3430301-1-roman.gushchin@linux.dev
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Now that the crc32c() library function directly takes advantage of
architecture-specific optimizations, it is unnecessary to go through the
crypto API. Just use crc32c(). This is much simpler, and it improves
performance due to eliminating the crypto API overhead.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://patch.msgid.link/20250227051207.19470-1-ebiggers@kernel.org
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This is merge of shared branch between RDMA and net-next trees.
* mlx5-next: (550 commits)
net/mlx5: Change POOL_NEXT_SIZE define value and make it global
net/mlx5: Add new health syndrome error and crr bit offset
Linux 6.14-rc3
...
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The last use of one_qsfp_write() was removed in 2016's
commit 145dd2b39958 ("IB/hfi1: Always turn on CDRs for low power QSFP
modules")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250223215543.153312-1-linux@treblig.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In the function mana_ib_gd_create_dma_region if there are no dma blocks
to process the variable `err` remains uninitialized.
Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Kees Bakker <kees@ijzerbout.nl>
Link: https://patch.msgid.link/20250221195833.7516C16290A@bout3.ijzerbout.nl
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Change POOL_NEXT_SIZE define value from 0 to BIT(30), since this define
is used to request the available maximum sized flow table, and zero doesn't
make sense for it, whereas some places in the driver use zero explicitly
expecting the smallest table size possible but instead due to this
define they end up allocating the biggest table size unawarely.
In addition move the definition to "include/linux/mlx5/fs.h" to expose the
define to IB driver as well, while appropriately renaming it.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219085808.349923-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add new error value for trust lockdown in health syndrome enum.
Also, include the offset for crr bit in the health buffer layout.
These changes prepare for downstream patches that update health
event handling.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219085808.349923-2-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Enable 'fetch and add' and 'compare and swap' operations to be used with
ODP. This is comprised of the following steps:
1. Check the driver page table(umem_odp->dma_list) to see if the target
page is both readable and writable.
2. If not, then trigger page fault to map the page.
3. Convert its user space address to a kernel logical address using PFNs
in the driver page table(umem_odp->pfn_list).
4. Execute the operation.
Link: https://patch.msgid.link/r/20241220100936.2193541-6-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses
it to load payloads of requesting packets; responder uses it to process
Send, Write, and Read operaetions; completer uses it to copy data from
response packets of Read and Atomic operations to a user MR.
Allow these operations to be used with ODP by adding a subordinate function
rxe_odp_mr_copy(). It is comprised of the following steps:
1. Check the driver page table(umem_odp->dma_list) to see if pages being
accessed are present with appropriate permission.
2. If necessary, trigger page fault to map the pages.
3. Convert their user space addresses to kernel logical addresses using
PFNs in the driver page table(umem_odp->pfn_list).
4. Execute data copy to/from the pages.
Link: https://patch.msgid.link/r/20241220100936.2193541-5-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Allow userspace to register an ODP-enabled MR, in which case the flag
IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there is no
RDMA operation enabled right now. They will be supported later in the
subsequent two patches.
rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. It syncs
process address space from the CPU page table to the driver page table
(dma_list/pfn_list in umem_odp) when called with RXE_PAGEFAULT_SNAPSHOT
flag. Additionally, It can be used to trigger page fault when pages being
accessed are not present or do not have proper read/write permissions, and
possibly to prefetch pages in the future.
Link: https://patch.msgid.link/r/20241220100936.2193541-4-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
On page invalidation, an MMU notifier callback is invoked to unmap DMA
addresses and update the driver page table(umem_odp->dma_list). The
callback is registered when an ODP-enabled MR is created.
Link: https://patch.msgid.link/r/20241220100936.2193541-3-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
rxe_mr_init() and resp_states are going to be used in rxe_odp.c, which is
to be created in the subsequent patch.
Link: https://patch.msgid.link/r/20241220100936.2193541-2-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
A single scatter-gather entry is limited by a 32 bits "length" field
that is practically 4GB - PAGE_SIZE. This means that even when the
memory is physically contiguous, we might need more than one entry to
represent it. Additionally when using dmabuf, the sg_table might be
originated outside the subsystem and optimized for other needs.
For instance an SGT of 16GB GPU continuous memory might look like this:
(a real life example)
dma_address 34401400000, length fffff000
dma_address 345013ff000, length fffff000
dma_address 346013fe000, length fffff000
dma_address 347013fd000, length fffff000
dma_address 348013fc000, length 4000
Since ib_umem_find_best_pgsz works within SG entries, in the above case
we will result with the worst possible 4KB page size.
Fix this by taking into consideration only the alignment of addresses of
real discontinuity points rather than treating SG entries as such, and
adjust the page iterator to correctly handle cross SG entry pages.
There is currently an assumption that drivers do not ask for pages
bigger than maximal DMA size supported by their devices.
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250217141623.12428-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Fixes multiple occurrences of the misspelled word "occured" in the comments
of `iscsi_iser.c`, replacing them with the correct spelling "occurred".
This improves readability without affecting functionality.
Signed-off-by: Imanol <imvalient@protonmail.com>
Link: https://patch.msgid.link/20250217183048.9394-1-imvalient@protonmail.com
Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support of dmabuf MRs to mana_ib.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1739454861-4456-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix annoying logs when building tools in parallel
- Fix the Debian linux-headers package build again
- Fix the target triple detection for userspace programs on Clang
* tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
modpost: Fix a few typos in a comment
kbuild: userprogs: fix bitsize and target detection on clang
kbuild: fix linux-headers package build when $(CC) cannot link userspace
tools: fix annoying "mkdir -p ..." logs when building tools in parallel
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core api addition from Greg KH:
"Here is a driver core new api for 6.14-rc3 that is being added to
allow platform devices from stop being abused.
It adds a new 'faux_device' structure and bus and api to allow almost
a straight or simpler conversion from platform devices that were not
really a platform device. It also comes with a binding for rust, with
an example driver in rust showing how it's used.
I'm adding this now so that the patches that convert the different
drivers and subsystems can all start flowing into linux-next now
through their different development trees, in time for 6.15-rc1.
We have a number that are already reviewed and tested, but adding
those conversions now doesn't seem right. For now, no one is using
this, and it passes all build tests from 0-day and linux-next, so all
should be good"
* tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
rust/kernel: Add faux device bindings
driver core: add a faux bus for use when a simple device/bus is needed
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull serial driver fixes from Greg KH:
"Here are some small serial driver fixes for some reported problems.
Nothing major, just:
- sc16is7xx irq check fix
- 8250 fifo underflow fix
- serial_port and 8250 iotype fixes
Most of these have been in linux-next already, and all have passed
0-day testing"
* tag 'tty-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: 8250: Fix fifo underflow on flush
serial: 8250_pnp: Remove unneeded ->iotype assignment
serial: 8250_platform: Remove unneeded ->iotype assignment
serial: 8250_of: Remove unneeded ->iotype assignment
serial: port: Make ->iotype validation global in __uart_read_properties()
serial: port: Always update ->iotype in __uart_read_properties()
serial: port: Assign ->iotype correctly when ->iobase is set
serial: sc16is7xx: Fix IRQ number check behavior
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes, and new device ids, for
6.14-rc3. Lots of tiny stuff for reported problems, including:
- new device ids and quirks
- usb hub crash fix found by syzbot
- dwc2 driver fix
- dwc3 driver fixes
- uvc gadget driver fix
- cdc-acm driver fixes for a variety of different issues
- other tiny bugfixes
Almost all of these have been in linux-next this week, and all have
passed 0-day testing"
* tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
usb: typec: tcpm: PSSourceOffTimer timeout in PR_Swap enters ERROR_RECOVERY
usb: roles: set switch registered flag early on
usb: gadget: uvc: Fix unstarted kthread worker
USB: quirks: add USB_QUIRK_NO_LPM quirk for Teclast dist
usb: gadget: core: flush gadget workqueue after device removal
USB: gadget: f_midi: f_midi_complete to call queue_work
usb: core: fix pipe creation for get_bMaxPacketSize0
usb: dwc3: Fix timeout issue during controller enter/exit from halt state
USB: Add USB_QUIRK_NO_LPM quirk for sony xperia xz1 smartphone
USB: cdc-acm: Fill in Renesas R-Car D3 USB Download mode quirk
usb: cdc-acm: Fix handling of oversized fragments
usb: cdc-acm: Check control transfer buffer size before access
usb: xhci: Restore xhci_pci support for Renesas HCs
USB: pci-quirks: Fix HCCPARAMS register error for LS7A EHCI
USB: serial: option: drop MeiG Smart defines
USB: serial: option: fix Telit Cinterion FN990A name
USB: serial: option: add Telit Cinterion FN990B compositions
USB: serial: option: add MeiG Smart SLM828
usb: gadget: f_midi: fix MIDI Streaming descriptor lengths
usb: dwc2: gadget: remove of_node reference upon udc_stop
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq Kconfig cleanup from Borislav Petkov:
- Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS
* tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Borislav Petkov:
- Explicitly clear DEBUGCTL.LBR to prevent LBRs continuing being
enabled after handoff to the OS
- Check CPUID(0x23) leaf and subleafs presence properly
- Remove the PEBS-via-PT feature from being supported on hybrid systems
- Fix perf record/top default commands on systems without a raw PMU
registered
* tag 'perf_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Ensure LBRs are disabled when a CPU is starting
perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
perf/x86/intel: Clean up PEBS-via-PT on hybrid
perf/x86/rapl: Fix the error checking order
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Clarify what happens when a task is woken up from the wake queue and
make clear its removal from that queue is atomic
* tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Clarify wake_up_q()'s write to task->wake_q.next
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:
- Move a warning about a lld.ld breakage into the verbose setting as
said breakage has been fixed in the meantime
- Teach objtool to ignore dangling jump table entries added by Clang
* tag 'objtool_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Move dodgy linker warn to verbose
objtool: Ignore dangling jump table entries
|