Age | Commit message | Author
2025-03-13 | net: mana: Change the function signature of mana_get_primary_netdev_rcu | Long Li
Change mana_get_primary_netdev_rcu() to mana_get_primary_netdev(), and return the ndev with refcount held. The caller is responsible for dropping the refcount. Also drop the check for IFF_SLAVE as it is not necessary if the upper device is present. Signed-off-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/1741821332-9392-1-git-send-email-longli@linuxonhyperv.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
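The ownership contract described above (the lookup returns a device with a reference held, and the caller must drop it) can be sketched in userspace C. This is only an illustration: `fake_netdev`, `get_primary()` and `put_dev()` are hypothetical stand-ins, not the mana driver's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a net_device; only the refcount matters here. */
struct fake_netdev {
    int refcnt;
    struct fake_netdev *upper;   /* upper (primary) device, or NULL */
};

/* Return the upper device with one extra reference held, or NULL. */
static struct fake_netdev *get_primary(struct fake_netdev *lower)
{
    struct fake_netdev *ndev = lower->upper;

    if (ndev)
        ndev->refcnt++;          /* this reference is now owned by the caller */
    return ndev;
}

/* Drop a reference previously returned by get_primary(). */
static void put_dev(struct fake_netdev *ndev)
{
    assert(ndev->refcnt > 0);
    ndev->refcnt--;
}
```

Every successful `get_primary()` call must be paired with exactly one `put_dev()`, otherwise the count never returns to its starting value.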
2025-03-12 | RDMA/rxe: Improve readability of ODP pagefault interface | Daisuke Matsuda
Use a meaningful constant explicitly instead of hard-coding a literal. Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Link: https://patch.msgid.link/20250312065937.1787241-1-matsuda-daisuke@fujitsu.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-12 | RDMA/hns: Inappropriate format characters cleanup | Guofeng Yue
Use %u for unsigned type and %d for enum. Signed-off-by: Guofeng Yue <yueguofeng@h-partners.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250311084857.3803665-2-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
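As a small illustration of the rule above, the sketch below formats an unsigned value with %u and an enum with %d. The enum and function names are hypothetical and are not taken from the hns driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum demo_qp_state { DEMO_QP_RESET, DEMO_QP_INIT, DEMO_QP_READY };  /* hypothetical */

/*
 * %u matches unsigned int: with %d a value above INT_MAX would be
 * printed as a negative number. The enum is cast to int for %d.
 */
static int format_status(char *buf, size_t len, unsigned int qpn,
                         enum demo_qp_state state)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "qpn %u state %d", qpn, (int)state);
}
```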
2025-03-09 | docs: infiniband: document the UCAP API | Chiara Meiohas
Add an explanation on the newly added UCAP API. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/d0e095f9a7601437acc2d2fdf8705136d1edf1c5.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09 | RDMA/mlx5: Expose RDMA TRANSPORT flow table types to userspace | Patrisious Haddad
This patch adds RDMA_TRANSPORT_RX and RDMA_TRANSPORT_TX as a new flow table type for matcher creation. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://patch.msgid.link/2287d8c50483e880450c7e8e08d9de34cdec1b14.1741261611.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09 | RDMA/mlx5: Check enabled UCAPs when creating ucontext | Chiara Meiohas
Verify that the enabled UCAPs are supported by the device before creating the ucontext. If supported, create the ucontext with the associated capabilities. Store the privileged ucontext UID on creation and remove it when destroying the privileged ucontext. This allows the command interface to recognize privileged commands through its UID. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/8b180583a207cb30deb7a2967934079749cdcc44.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09 | RDMA/uverbs: Add support for UCAPs in context creation | Chiara Meiohas
Add support for file descriptor array attribute for GET_CONTEXT commands. Check that the file descriptor (fd) array represents fds for valid UCAPs. Store the enabled UCAPs from the fd array as a bitmask in ib_ucontext. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/ebfb30bc947e2259b193c96a319c80e82599045b.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
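The idea of storing the enabled capabilities as a bitmask can be sketched as follows. This is a userspace illustration under stated assumptions: the `demo_ucap` enum, its numeric values and `ucaps_to_mask()` are hypothetical, and the step that maps a file descriptor to a capability type is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability types; the real values are not given in the log. */
enum demo_ucap {
    DEMO_UCAP_CTRL_LOCAL = 0,
    DEMO_UCAP_CTRL_OTHER_VHCA = 1,
    DEMO_UCAP_MAX
};

/*
 * Fold a list of capability types into one bitmask.
 * Returns 0 on success, -1 if any type is unknown; *mask is only
 * written on success.
 */
static int ucaps_to_mask(const int *types, size_t n, uint64_t *mask)
{
    uint64_t m = 0;

    assert(mask != NULL);
    for (size_t i = 0; i < n; i++) {
        if (types[i] < 0 || types[i] >= DEMO_UCAP_MAX)
            return -1;
        m |= UINT64_C(1) << types[i];
    }
    *mask = m;
    return 0;
}
```

Duplicate entries are harmless because setting the same bit twice leaves the mask unchanged.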
2025-03-09 | RDMA/mlx5: Create UCAP char devices for supported device capabilities | Chiara Meiohas
Create UCAP character devices when probing an IB device with supported firmware capabilities. If the RDMA_CTRL general object type is supported, check for specific UCTX capabilities:
  Create /dev/infiniband/mlx5_perm_ctrl_local for RDMA_UCAP_MLX5_CTRL_LOCAL
  Create /dev/infiniband/mlx5_perm_ctrl_other_vhca for RDMA_UCAP_MLX5_CTRL_OTHER_VHCA

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/30ed40e7a12a694cf4ee257459ed61b145b7837d.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09 | RDMA/uverbs: Introduce UCAP (User CAPabilities) API | Chiara Meiohas
Implement a new User CAPabilities (UCAP) API to provide fine-grained control over specific firmware features. This approach offers more granular capabilities than the existing Linux capabilities, which may be too generic for certain FW features. This mechanism represents each capability as a character device with root read-write access. Root processes can grant users special privileges by allowing access to these character devices (e.g., using chown). UCAP character devices are located in /dev/infiniband and the class path is /sys/class/infiniband_ucaps. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/5a1379187cd21178e8554afc81a3c941f21af22f.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-08 | RDMA/mana_ib: Use safer allocation function() | Dan Carpenter
My static checker says this multiplication can overflow. I'm not an expert in this code but the call tree would be:

ib_uverbs_handler_UVERBS_METHOD_QP_CREATE() <- reads cap from the user
-> ib_create_qp_user()
   -> create_qp()
      -> mana_ib_create_qp()
         -> mana_ib_create_ud_qp()
            -> create_shadow_queue()

It can't hurt to use safer interfaces.

Fixes: c8017f5b4856 ("RDMA/mana_ib: UD/GSI work requests")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/58439ac0-1ee5-4f96-a595-7ab83b59139b@stanley.mountain
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
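The overflow concern can be illustrated in userspace. The sketch below assumes GCC or Clang (for `__builtin_mul_overflow`) and uses a hypothetical helper `alloc_array()`. It is not the change made to create_shadow_queue(); it only shows what an overflow-checked array allocation looks like.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate n zeroed elements of `size` bytes each.
 * Returns NULL instead of a too-small buffer when n * size
 * does not fit in size_t.
 */
static void *alloc_array(size_t n, size_t size)
{
    size_t bytes;

    assert(size > 0);
    if (__builtin_mul_overflow(n, size, &bytes))
        return NULL;             /* the product wrapped around: refuse */
    return calloc(1, bytes);
}
```

With a plain `malloc(n * size)`, a user-controlled `n` could wrap the product and yield a buffer far smaller than the caller expects.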
2025-03-08 | Add support and infrastructure for RDMA TRANSPORT | Leon Romanovsky
---------------------------------------------------------------------
Hi,

This is a preparation series targeted for mlx5-next, which will be used later in RDMA.

This series adds RDMA transport steering logic which would allow the vport group manager to catch control packets from VFs and forward them to control SW to help with congestion control.

In addition, RDMA will provide a new set of APIs to better control exposed FW capabilities, and this series is needed to make sure the mlx5 command interface will ensure that privileged commands can always proceed.

Thanks

Link: https://lore.kernel.org/all/cover.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>

* mlx5-next:
  net/mlx5: fs, add RDMA TRANSPORT steering domain support
  net/mlx5: Query ADV_RDMA capabilities
  net/mlx5: Limit non-privileged commands
  net/mlx5: Allow the throttle mechanism to be more dynamic
  net/mlx5: Add RDMA_CTRL HW capabilities
2025-03-08 | net/mlx5: fs, add RDMA TRANSPORT steering domain support | Patrisious Haddad
Add RX and TX RDMA_TRANSPORT flow table namespace, and the ability to create flow tables in those namespaces. The RDMA_TRANSPORT RX and TX are per vport. Packets will traverse through RDMA_TRANSPORT_RX after RDMA_RX and through RDMA_TRANSPORT_TX before RDMA_TX, ensuring proper control and management. RDMA_TRANSPORT domains are managed by the vport group manager. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/a6b550d9859a197eafa804b9a8d76916ca481da9.1740574103.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-08 | net/mlx5: Query ADV_RDMA capabilities | Patrisious Haddad
Query ADV_RDMA capabilities which provide information for advanced RDMA related features. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/e3e6ede03ea31cd201078dcdd4e407608e4a5a87.1740574103.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-08 | net/mlx5: Limit non-privileged commands | Chiara Meiohas
Limit non-privileged UID commands to half of the available command slots when privileged UIDs are present. Privileged throttle commands will not be limited. Use an xarray to store privileged UIDs. Add insert and remove functions for privileged UIDs management. Non-user commands (with uid 0) are not limited. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/d2f3dd9a0dbad3c9f2b4bb0723837995e4e06de2.1740574103.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
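The admission rule described above can be sketched as a pure function. This is a simplified model, not the mlx5 command interface: `cmd_budget` and `may_issue()` are hypothetical names, and the xarray lookup of privileged UIDs is reduced to a boolean argument.

```c
#include <assert.h>
#include <stdbool.h>

struct cmd_budget {
    unsigned int total_slots;       /* all command slots */
    unsigned int unpriv_in_flight;  /* non-privileged commands currently running */
    bool have_privileged;           /* at least one privileged UID is registered */
};

/* Decide whether a command from `uid` may start under this model. */
static bool may_issue(const struct cmd_budget *b, unsigned int uid,
                      bool uid_is_privileged)
{
    assert(b->total_slots > 0);
    if (uid == 0 || uid_is_privileged)
        return true;    /* non-user (uid 0) and privileged commands are not limited */
    if (!b->have_privileged)
        return true;    /* no privileged UID present: no extra limit applies */
    /* Non-privileged users may occupy at most half of the slots. */
    return b->unpriv_in_flight < b->total_slots / 2;
}
```

The model ignores overall slot exhaustion, which the real command interface has to handle separately.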
2025-03-08 | net/mlx5: Allow the throttle mechanism to be more dynamic | Chiara Meiohas
Previously, throttle commands were identified and limited based on opcode. These commands were limited to half the command slots using a semaphore, and callback commands checked the opcode to determine semaphore release. To allow exceptions, we introduce a variable to indicate when the throttle lock is held. This allows scenarios where throttle commands are not limited. Callback functions use this variable to determine if the throttle semaphore needs to be released. This patch contains no functional changes. It's a preparation for the next patch. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Link: https://patch.msgid.link/055d975edeb816ac4c0fd1e665c6157d11947d26.1740574103.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-08 | net/mlx5: Add RDMA_CTRL HW capabilities | Chiara Meiohas
Add RDMA_CTRL UCTX capabilities and add the RDMA_CTRL general object type in hca_cap_2. Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/ef7eb24be9a6f247ab52e8b4480350072e5182f5.1740574103.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-06 | RDMA/erdma: Prevent use-after-free in erdma_accept_newconn() | Cheng Xu
After erdma_cep_put(new_cep) is called, new_cep will be freed, and the following dereference will cause a use-after-free (UAF). Fix this issue.

Fixes: 920d93eac8b9 ("RDMA/erdma: Add connection management (CM) support")
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
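The general pattern behind this class of bug is to read whatever is still needed before the last reference is dropped. The sketch below is a userspace illustration with hypothetical names (`endpoint`, `ep_put()`, `close_and_report()`); it is not the actual erdma fix.

```c
#include <assert.h>
#include <stdlib.h>

struct endpoint {
    int refcnt;
    int state;
};

/* Allocate an endpoint holding one reference; NULL on allocation failure. */
static struct endpoint *ep_alloc(int state)
{
    struct endpoint *ep = malloc(sizeof(*ep));

    if (ep) {
        ep->refcnt = 1;
        ep->state = state;
    }
    return ep;
}

/* Drop one reference and free the object when the last one goes away. */
static void ep_put(struct endpoint *ep)
{
    assert(ep->refcnt > 0);
    if (--ep->refcnt == 0)
        free(ep);
}

/* Safe ordering: copy what is needed first, then drop the reference. */
static int close_and_report(struct endpoint *ep)
{
    int state = ep->state;   /* read while the reference is still held */

    ep_put(ep);              /* ep may be freed here; do not touch it again */
    return state;
}
```

Swapping the two statements in `close_and_report()` would read `ep->state` from freed memory, which is the use-after-free pattern the commit message describes.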
2025-03-06 | RDMA/vmw_pvrdma: Remove unused pvrdma_modify_device | Dr. David Alan Gilbert
pvrdma_modify_device() was added in 2016 as part of commit 29c8d9eba550 ("IB: Add vmw_pvrdma driver") but accidentally it was never wired into the device_ops struct. After some discussion the best course seems to be just to remove it, see discussion at: https://lore.kernel.org/all/Z8TWF6coBUF3l_jk@gallifrey/ Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20250304215637.68559-1-linux@treblig.org Acked-by: Vishnu Dasa <vishnu.dasa@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03 | RDMA/mlx5: Reorder capability check last | Christian Göttsche
capable() calls consult the enabled LSMs on whether to permit or deny the request. This is relevant in connection with SELinux, where a capability check results in a policy decision and, by default, a denial message is issued on insufficient permission. It can lead to three undesired cases:

1. A denial message is generated even when the operation was an unprivileged one and thus the syscall succeeded, creating noise.
2. To avoid the noise from 1., the policy writer adds a rule to ignore those denial messages, hiding future syscalls where the task performs an actual privileged operation, leading to hidden limited functionality of that task.
3. To avoid the noise from 1., the policy writer adds a rule to permit the task the requested capability although it does not need it, violating the principle of least privilege.

Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://patch.msgid.link/20250302160657.127253-10-cgoettsche@seltendoof.de
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
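Why the order of the checks matters can be shown with a mock. In the sketch below, `mock_capable()` stands in for a capability check that has a side effect (here a call counter, in the commit's scenario an audit denial message). All names are hypothetical and the code is not taken from mlx5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock privileged check: always denies, and counts how often it is asked. */
static bool mock_capable(int *calls)
{
    assert(calls != NULL);
    (*calls)++;
    return false;
}

/* Capability consulted first: it runs even for unprivileged requests. */
static bool allowed_check_first(bool needs_priv, int *calls)
{
    bool cap = mock_capable(calls);

    return !needs_priv || cap;
}

/* Capability consulted last: short-circuit evaluation skips it when unneeded. */
static bool allowed_check_last(bool needs_priv, int *calls)
{
    return !needs_priv || mock_capable(calls);
}
```

Both variants return the same decision. Only the second avoids the side effect when the request needs no privilege.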
2025-03-03 | RDMA/core: Fixes infiniband sysctl bounds | Nicolas Bouchinet
Bound infiniband iwcm and ucma sysctl writings between SYSCTL_ZERO and SYSCTL_INT_MAX. The proc_handler has thus been updated to proc_dointvec_minmax. Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr> Link: https://patch.msgid.link/20250224095826.16458-6-nicolas.bouchinet@clip-os.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>
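The effect of bounding a written value to the range 0 to INT_MAX can be sketched in userspace. The helper `parse_bounded()` below is hypothetical and only mimics the range check. It is not proc_dointvec_minmax(), and the error code is chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse `text` as a decimal integer and accept it only if
 * 0 <= value <= INT_MAX. Returns 0 on success, -EINVAL otherwise;
 * *out is only written on success.
 */
static int parse_bounded(const char *text, int *out)
{
    char *end;
    long long v;

    assert(text != NULL && out != NULL);
    errno = 0;
    v = strtoll(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0')
        return -EINVAL;          /* not a clean number, or out of long long range */
    if (v < 0 || v > INT_MAX)
        return -EINVAL;          /* outside the permitted bounds */
    *out = (int)v;
    return 0;
}
```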
2025-03-03 | RDMA/core: Don't expose hw_counters outside of init net namespace | Roman Gushchin
Commit 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") accidentally almost exposed hw counters to non-init net namespaces. It didn't expose them fully, as an attempt to read any of those counters leads to a crash like this one:

[42021.807566] BUG: kernel NULL pointer dereference, address: 0000000000000028
[42021.814463] #PF: supervisor read access in kernel mode
[42021.819549] #PF: error_code(0x0000) - not-present page
[42021.824636] PGD 0 P4D 0
[42021.827145] Oops: 0000 [#1] SMP PTI
[42021.830598] CPU: 82 PID: 2843922 Comm: switchto-defaul Kdump: loaded Tainted: G S W I XXX
[42021.841697] Hardware name: XXX
[42021.849619] RIP: 0010:hw_stat_device_show+0x1e/0x40 [ib_core]
[42021.855362] Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 49 89 d0 4c 8b 5e 20 48 8b 8f b8 04 00 00 48 81 c7 f0 fa ff ff <48> 8b 41 28 48 29 ce 48 83 c6 d0 48 c1 ee 04 69 d6 ab aa aa aa 48
[42021.873931] RSP: 0018:ffff97fe90f03da0 EFLAGS: 00010287
[42021.879108] RAX: ffff9406988a8c60 RBX: ffff940e1072d438 RCX: 0000000000000000
[42021.886169] RDX: ffff94085f1aa000 RSI: ffff93c6cbbdbcb0 RDI: ffff940c7517aef0
[42021.893230] RBP: ffff97fe90f03e70 R08: ffff94085f1aa000 R09: 0000000000000000
[42021.900294] R10: ffff94085f1aa000 R11: ffffffffc0775680 R12: ffffffff87ca2530
[42021.907355] R13: ffff940651602840 R14: ffff93c6cbbdbcb0 R15: ffff94085f1aa000
[42021.914418] FS: 00007fda1a3b9700(0000) GS:ffff94453fb80000(0000) knlGS:0000000000000000
[42021.922423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42021.928130] CR2: 0000000000000028 CR3: 00000042dcfb8003 CR4: 00000000003726f0
[42021.935194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42021.942257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42021.949324] Call Trace:
[42021.951756] <TASK>
[42021.953842] [<ffffffff86c58674>] ? show_regs+0x64/0x70
[42021.959030] [<ffffffff86c58468>] ? __die+0x78/0xc0
[42021.963874] [<ffffffff86c9ef75>] ? page_fault_oops+0x2b5/0x3b0
[42021.969749] [<ffffffff87674b92>] ? exc_page_fault+0x1a2/0x3c0
[42021.975549] [<ffffffff87801326>] ? asm_exc_page_fault+0x26/0x30
[42021.981517] [<ffffffffc0775680>] ? __pfx_show_hw_stats+0x10/0x10 [ib_core]
[42021.988482] [<ffffffffc077564e>] ? hw_stat_device_show+0x1e/0x40 [ib_core]
[42021.995438] [<ffffffff86ac7f8e>] dev_attr_show+0x1e/0x50
[42022.000803] [<ffffffff86a3eeb1>] sysfs_kf_seq_show+0x81/0xe0
[42022.006508] [<ffffffff86a11134>] seq_read_iter+0xf4/0x410
[42022.011954] [<ffffffff869f4b2e>] vfs_read+0x16e/0x2f0
[42022.017058] [<ffffffff869f50ee>] ksys_read+0x6e/0xe0
[42022.022073] [<ffffffff8766f1ca>] do_syscall_64+0x6a/0xa0
[42022.027441] [<ffffffff8780013b>] entry_SYSCALL_64_after_hwframe+0x78/0xe2

The problem can be reproduced using the following steps:

  ip netns add foo
  ip netns exec foo bash
  cat /sys/class/infiniband/mlx4_0/hw_counters/*

The panic occurs because casting the device pointer into an ib_device pointer using container_of() in hw_stat_device_show() is wrong and leads to memory corruption. However, the real problem is that hw counters should never be exposed outside of the init net namespace.

Fix this by saving the index of the corresponding attribute group (it might be 1 or 2 depending on the presence of driver-specific attributes) and zeroing the pointer to the hw_counters group for compat devices during the initialization.

With this fix applied, hw_counters are not available in a non-init net namespace:

  find /sys/class/infiniband/mlx4_0/ -name hw_counters
  /sys/class/infiniband/mlx4_0/ports/1/hw_counters
  /sys/class/infiniband/mlx4_0/ports/2/hw_counters
  /sys/class/infiniband/mlx4_0/hw_counters

  ip netns add foo
  ip netns exec foo bash
  find /sys/class/infiniband/mlx4_0/ -name hw_counters

Fixes: 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Maher Sanalla <msanalla@nvidia.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://patch.msgid.link/20250227165420.3430301-1-roman.gushchin@linux.dev
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
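The container_of() hazard mentioned in this commit can be illustrated with a userspace version of the macro. The struct names below are hypothetical stand-ins; the point is that the pointer arithmetic is only valid when the member really is embedded in the assumed outer type.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel macro: step back from member to container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_device {
    int id;
};

/* Hypothetical outer object that embeds a demo_device. */
struct demo_ib_device {
    int num_ports;
    struct demo_device dev;
};

/*
 * Correct only when `d` points at the `dev` member of a demo_ib_device.
 * If `d` is embedded in some other wrapper type, the result points at
 * unrelated memory and any dereference is undefined.
 */
static struct demo_ib_device *to_ibdev(struct demo_device *d)
{
    assert(d != NULL);
    return container_of(d, struct demo_ib_device, dev);
}
```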
2025-03-03 | RDMA/siw: Switch to using the crc32c library | Eric Biggers
Now that the crc32c() library function directly takes advantage of architecture-specific optimizations, it is unnecessary to go through the crypto API. Just use crc32c(). This is much simpler, and it improves performance due to eliminating the crypto API overhead. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://patch.msgid.link/20250227051207.19470-1-ebiggers@kernel.org Acked-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
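For readers unfamiliar with the checksum itself, here is a minimal bitwise software CRC32C (Castagnoli polynomial, reflected form 0x82F63B78) with the conventional initial value and final inversion. `crc32c_sw()` is a hypothetical name and is not the kernel's crc32c(), which takes a seed argument and relies on optimized implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32C over `len` bytes of `data`. */
static uint32_t crc32c_sw(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;          /* conventional initial value */

    assert(data != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)      /* process one bit per iteration */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;                         /* final inversion */
}
```

The standard check input "123456789" yields 0xE3069283 with this definition, which is a convenient self-test.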
2025-02-25 | Merge branch 'mlx5-next' into wip/leon-for-next | Leon Romanovsky
This is a merge of the shared branch between the RDMA and net-next trees.

* mlx5-next: (550 commits)
  net/mlx5: Change POOL_NEXT_SIZE define value and make it global
  net/mlx5: Add new health syndrome error and crr bit offset
  Linux 6.14-rc3
  ...

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-24 | RDMA/hfi1: Remove unused one_qsfp_write | Dr. David Alan Gilbert
The last use of one_qsfp_write() was removed in 2016's commit 145dd2b39958 ("IB/hfi1: Always turn on CDRs for low power QSFP modules") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20250223215543.153312-1-linux@treblig.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23 | RDMA/mana_ib: Ensure variable err is initialized | Kees Bakker
In the function mana_ib_gd_create_dma_region if there are no dma blocks to process the variable `err` remains uninitialized. Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter") Signed-off-by: Kees Bakker <kees@ijzerbout.nl> Link: https://patch.msgid.link/20250221195833.7516C16290A@bout3.ijzerbout.nl Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23 | net/mlx5: Change POOL_NEXT_SIZE define value and make it global | Patrisious Haddad
Change the POOL_NEXT_SIZE define value from 0 to BIT(30). This define is used to request the largest available flow table, so zero doesn't make sense for it: some places in the driver pass zero explicitly, expecting the smallest possible table size, but because of this define they unknowingly end up allocating the biggest table size.

In addition, move the definition to "include/linux/mlx5/fs.h" to expose the define to the IB driver as well, renaming it appropriately.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219085808.349923-3-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23 | net/mlx5: Add new health syndrome error and crr bit offset | Shahar Shitrit
Add new error value for trust lockdown in health syndrome enum. Also, include the offset for crr bit in the health buffer layout. These changes prepare for downstream patches that update health event handling. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250219085808.349923-2-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-21 | RDMA/rxe: Add support for the traditional Atomic operations with ODP | Daisuke Matsuda
Enable 'fetch and add' and 'compare and swap' operations to be used with ODP. This consists of the following steps:

1. Check the driver page table (umem_odp->dma_list) to see if the target page is both readable and writable.
2. If not, then trigger page fault to map the page.
3. Convert its user space address to a kernel logical address using PFNs in the driver page table (umem_odp->pfn_list).
4. Execute the operation.

Link: https://patch.msgid.link/r/20241220100936.2193541-6-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-21 | RDMA/rxe: Add support for Send/Recv/Write/Read with ODP | Daisuke Matsuda
rxe_mr_copy() is used widely to copy data to/from a user MR. The requester uses it to load payloads of requesting packets; the responder uses it to process Send, Write, and Read operations; the completer uses it to copy data from response packets of Read and Atomic operations to a user MR.

Allow these operations to be used with ODP by adding a subordinate function rxe_odp_mr_copy(). It consists of the following steps:

1. Check the driver page table (umem_odp->dma_list) to see if pages being accessed are present with appropriate permission.
2. If necessary, trigger page fault to map the pages.
3. Convert their user space addresses to kernel logical addresses using PFNs in the driver page table (umem_odp->pfn_list).
4. Execute data copy to/from the pages.

Link: https://patch.msgid.link/r/20241220100936.2193541-5-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-21 | RDMA/rxe: Allow registering MRs for On-Demand Paging | Daisuke Matsuda
Allow userspace to register an ODP-enabled MR, in which case the flag IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, no RDMA operation is enabled yet; they will be supported in the subsequent two patches.

rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. It syncs process address space from the CPU page table to the driver page table (dma_list/pfn_list in umem_odp) when called with the RXE_PAGEFAULT_SNAPSHOT flag. Additionally, it can be used to trigger page fault when pages being accessed are not present or do not have proper read/write permissions, and possibly to prefetch pages in the future.

Link: https://patch.msgid.link/r/20241220100936.2193541-4-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-21 | RDMA/rxe: Add page invalidation support | Daisuke Matsuda
On page invalidation, an MMU notifier callback is invoked to unmap DMA addresses and update the driver page table(umem_odp->dma_list). The callback is registered when an ODP-enabled MR is created. Link: https://patch.msgid.link/r/20241220100936.2193541-3-matsuda-daisuke@fujitsu.com Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-21 | RDMA/rxe: Move some code to rxe_loc.h in preparation for ODP | Daisuke Matsuda
rxe_mr_init() and resp_states are going to be used in rxe_odp.c, which is to be created in the subsequent patch. Link: https://patch.msgid.link/r/20241220100936.2193541-2-matsuda-daisuke@fujitsu.com Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-19 | RDMA/core: Fix best page size finding when it can cross SG entries | Michael Margolin
A single scatter-gather entry is limited by a 32-bit "length" field, which is practically 4GB - PAGE_SIZE. This means that even when the memory is physically contiguous, we might need more than one entry to represent it. Additionally, when using dmabuf, the sg_table might originate outside the subsystem and be optimized for other needs.

For instance, an SGT of 16GB of contiguous GPU memory might look like this (a real-life example):

  dma_address 34401400000, length fffff000
  dma_address 345013ff000, length fffff000
  dma_address 346013fe000, length fffff000
  dma_address 347013fd000, length fffff000
  dma_address 348013fc000, length 4000

Since ib_umem_find_best_pgsz works within SG entries, in the above case we end up with the worst possible 4KB page size.

Fix this by taking into consideration only the alignment of addresses of real discontinuity points rather than treating SG entries as such, and adjust the page iterator to correctly handle cross SG entry pages.

There is currently an assumption that drivers do not ask for pages bigger than the maximal DMA size supported by their devices.

Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250217141623.12428-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
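The difference between treating every SG entry boundary as a discontinuity and considering only real discontinuities can be reproduced with the addresses from the example above. The sketch below is a simplified model, not ib_umem_find_best_pgsz(): `sg_demo` and `best_pgsz()` are hypothetical, and the supported page size bitmap and the virtual address offset are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sg_demo {
    uint64_t addr;   /* DMA address of the entry */
    uint64_t len;    /* length in bytes */
};

/*
 * Largest power of two that divides every boundary address.
 * With merge_contiguous set, a boundary between two entries that are
 * physically adjacent is not counted as a discontinuity.
 */
static uint64_t best_pgsz(const struct sg_demo *sg, size_t n,
                          bool merge_contiguous)
{
    uint64_t mask = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t start = sg[i].addr;
        uint64_t end = sg[i].addr + sg[i].len;
        bool joins_prev = i > 0 && sg[i - 1].addr + sg[i - 1].len == start;
        bool joins_next = i + 1 < n && end == sg[i + 1].addr;

        if (!merge_contiguous || !joins_prev)
            mask |= start;
        if (!merge_contiguous || !joins_next)
            mask |= end;
    }
    return mask & (~mask + 1);   /* isolate the lowest set bit */
}
```

For the five entries listed in the commit message, the per-entry view is limited by the 4KB-aligned internal boundaries, while the merged view is limited only by the alignment of the overall start and end addresses.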
2025-02-18 | IB/iser: fix typos in iscsi_iser.c comments | Imanol
Fixes multiple occurrences of the misspelled word "occured" in the comments of `iscsi_iser.c`, replacing them with the correct spelling "occurred". This improves readability without affecting functionality. Signed-off-by: Imanol <imvalient@protonmail.com> Link: https://patch.msgid.link/20250217183048.9394-1-imvalient@protonmail.com Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-18 | RDMA/mana_ib: Implement DMABUF MR support | Konstantin Taranov
Add support of dmabuf MRs to mana_ib. Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/1739454861-4456-1-git-send-email-kotaranov@linux.microsoft.com Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-16 | Linux 6.14-rc3 | Linus Torvalds
2025-02-16 | Merge tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild | Linus Torvalds
Pull Kbuild fixes from Masahiro Yamada:

 - Fix annoying logs when building tools in parallel
 - Fix the Debian linux-headers package build again
 - Fix the target triple detection for userspace programs on Clang

* tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  modpost: Fix a few typos in a comment
  kbuild: userprogs: fix bitsize and target detection on clang
  kbuild: fix linux-headers package build when $(CC) cannot link userspace
  tools: fix annoying "mkdir -p ..." logs when building tools in parallel
2025-02-16 | Merge tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core | Linus Torvalds
Pull driver core api addition from Greg KH:
 "Here is a driver core new api for 6.14-rc3 that is being added to allow platform devices from stop being abused.

  It adds a new 'faux_device' structure and bus and api to allow almost a straight or simpler conversion from platform devices that were not really a platform device. It also comes with a binding for rust, with an example driver in rust showing how it's used.

  I'm adding this now so that the patches that convert the different drivers and subsystems can all start flowing into linux-next now through their different development trees, in time for 6.15-rc1.

  We have a number that are already reviewed and tested, but adding those conversions now doesn't seem right. For now, no one is using this, and it passes all build tests from 0-day and linux-next, so all should be good"

* tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  rust/kernel: Add faux device bindings
  driver core: add a faux bus for use when a simple device/bus is needed
2025-02-16 | Merge tag 'tty-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty | Linus Torvalds
Pull serial driver fixes from Greg KH:
 "Here are some small serial driver fixes for some reported problems. Nothing major, just:

   - sc16is7xx irq check fix
   - 8250 fifo underflow fix
   - serial_port and 8250 iotype fixes

  Most of these have been in linux-next already, and all have passed 0-day testing"

* tag 'tty-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: 8250: Fix fifo underflow on flush
  serial: 8250_pnp: Remove unneeded ->iotype assignment
  serial: 8250_platform: Remove unneeded ->iotype assignment
  serial: 8250_of: Remove unneeded ->iotype assignment
  serial: port: Make ->iotype validation global in __uart_read_properties()
  serial: port: Always update ->iotype in __uart_read_properties()
  serial: port: Assign ->iotype correctly when ->iobase is set
  serial: sc16is7xx: Fix IRQ number check behavior
2025-02-16 | Merge tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb | Linus Torvalds
Pull USB fixes from Greg KH:
 "Here are some small USB driver fixes, and new device ids, for 6.14-rc3. Lots of tiny stuff for reported problems, including:

   - new device ids and quirks
   - usb hub crash fix found by syzbot
   - dwc2 driver fix
   - dwc3 driver fixes
   - uvc gadget driver fix
   - cdc-acm driver fixes for a variety of different issues
   - other tiny bugfixes

  Almost all of these have been in linux-next this week, and all have passed 0-day testing"

* tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
  usb: typec: tcpm: PSSourceOffTimer timeout in PR_Swap enters ERROR_RECOVERY
  usb: roles: set switch registered flag early on
  usb: gadget: uvc: Fix unstarted kthread worker
  USB: quirks: add USB_QUIRK_NO_LPM quirk for Teclast dist
  usb: gadget: core: flush gadget workqueue after device removal
  USB: gadget: f_midi: f_midi_complete to call queue_work
  usb: core: fix pipe creation for get_bMaxPacketSize0
  usb: dwc3: Fix timeout issue during controller enter/exit from halt state
  USB: Add USB_QUIRK_NO_LPM quirk for sony xperia xz1 smartphone
  USB: cdc-acm: Fill in Renesas R-Car D3 USB Download mode quirk
  usb: cdc-acm: Fix handling of oversized fragments
  usb: cdc-acm: Check control transfer buffer size before access
  usb: xhci: Restore xhci_pci support for Renesas HCs
  USB: pci-quirks: Fix HCCPARAMS register error for LS7A EHCI
  USB: serial: option: drop MeiG Smart defines
  USB: serial: option: fix Telit Cinterion FN990A name
  USB: serial: option: add Telit Cinterion FN990B compositions
  USB: serial: option: add MeiG Smart SLM828
  usb: gadget: f_midi: fix MIDI Streaming descriptor lengths
  usb: dwc2: gadget: remove of_node reference upon udc_stop
  ...
2025-02-16 | Merge tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull irq Kconfig cleanup from Borislav Petkov:

 - Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS

* tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS
2025-02-16 | Merge tag 'perf_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull x86 perf fixes from Borislav Petkov:

 - Explicitly clear DEBUGCTL.LBR to prevent LBRs continuing being enabled after handoff to the OS
 - Check CPUID(0x23) leaf and subleafs presence properly
 - Remove the PEBS-via-PT feature from being supported on hybrid systems
 - Fix perf record/top default commands on systems without a raw PMU registered

* tag 'perf_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Ensure LBRs are disabled when a CPU is starting
  perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
  perf/x86/intel: Clean up PEBS-via-PT on hybrid
  perf/x86/rapl: Fix the error checking order
2025-02-16 | Merge tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull scheduler fix from Borislav Petkov:

 - Clarify what happens when a task is woken up from the wake queue and make clear its removal from that queue is atomic

* tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Clarify wake_up_q()'s write to task->wake_q.next
2025-02-16Merge tag 'objtool_urgent_for_v6.14_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Borislav Petkov: - Move a warning about an ld.lld breakage into the verbose setting as said breakage has been fixed in the meantime - Teach objtool to ignore dangling jump table entries added by Clang * tag 'objtool_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Move dodgy linker warn to verbose objtool: Ignore dangling jump table entries
2025-02-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Large set of fixes for vector handling, especially in the
     interactions between host and guest state.

     This fixes a number of bugs affecting actual deployments, and
     greatly simplifies the FP/SIMD/SVE handling. Thanks to Mark Rutland
     for dealing with this thankless task.

   - Fix an ugly race between vcpu and vgic creation/init, resulting in
     unexpected behaviours

   - Fix use of kernel VAs at EL2 when emulating timers with nVHE

   - Small set of pKVM improvements and cleanups

  x86:

   - Fix broken SNP support with KVM module built-in, ensuring the PSP
     module is initialized before KVM even when the module
     infrastructure cannot be used to order initcalls

   - Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being
     emulated by KVM to fix a NULL pointer dereference

   - Enter guest mode (L2) from KVM's perspective before initializing
     the vCPU's nested NPT MMU so that the MMU is properly tagged for
     L2, not L1

   - Load the guest's DR6 outside of the innermost .vcpu_run() loop, as
     the guest's value may be stale if a VM-Exit is handled in the
     fastpath"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  x86/sev: Fix broken SNP support with KVM module built-in
  KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
  crypto: ccp: Add external API interface for PSP module initialization
  KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()
  KVM: arm64: timer: Drop warning on failed interrupt signalling
  KVM: arm64: Fix alignment of kvm_hyp_memcache allocations
  KVM: arm64: Convert timer offset VA when accessed in HYP code
  KVM: arm64: Simplify warning in kvm_arch_vcpu_load_fp()
  KVM: arm64: Eagerly switch ZCR_EL{1,2}
  KVM: arm64: Mark some header functions as inline
  KVM: arm64: Refactor exit handlers
  KVM: arm64: Refactor CPTR trap deactivation
  KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
  KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
  KVM: arm64: Remove host FPSIMD saving for non-protected KVM
  KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state
  KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop
  KVM: nSVM: Enter guest mode before initializing nested NPT MMU
  KVM: selftests: Add CPUID tests for Hyper-V features that need in-kernel APIC
  KVM: selftests: Manage CPUID array in Hyper-V CPUID test's core helper
  ...
2025-02-16Merge tag 'mips-fixes_6.14_1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fixes from Thomas Bogendoerfer: "Fix for o32 ptrace/get_syscall_info" * tag 'mips-fixes_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: MIPS: fix mips_get_syscall_arg() for o32 MIPS: Export syscall stack arguments properly for remote use
2025-02-15Merge tag 'devicetree-fixes-for-6.14-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree fixes from Rob Herring: - Add bindings for QCom QCS8300 clocks, QCom SAR2130P qfprom, and powertip,{st7272|hx8238a} displays - Fix compatible for TI am62a7 dss - Add a kunit test for __of_address_resource_bounds() * tag 'devicetree-fixes-for-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: dt-bindings: display: Add powertip,{st7272|hx8238a} as DT Schema description dt-bindings: nvmem: qcom,qfprom: Add SAR2130P compatible dt-bindings: display: ti: Fix compatible for am62a7 dss of: address: Add kunit test for __of_address_resource_bounds() dt-bindings: clock: qcom: Add QCS8300 video clock controller dt-bindings: clock: qcom: Add CAMCC clocks for QCS8300 dt-bindings: clock: qcom: Add GPU clocks for QCS8300
2025-02-15Merge tag 'uml-for-linus-6.14-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull UML fixes from Richard Weinberger: - Align signal stack correctly - Convert to raw spinlocks where needed (irq and virtio) - FPU related fixes * tag 'uml-for-linus-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: um: convert irq_lock to raw spinlock um: virtio_uml: use raw spinlock um: virt-pci: don't use kmalloc() um: fix execve stub execution on old host OSs um: properly align signal stack on x86_64 um: avoid copying FP state from init_task um: add back support for FXSAVE registers
2025-02-15Merge tag 'trace-ring-buffer-v6.14-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull trace ring buffer fixes from Steven Rostedt:

 - Enable resize on mmap() error

   When a process mmaps a ring buffer, its size is locked and resizing
   is disabled. But if the user passes in a wrong parameter, the mmap()
   can fail after the resize was disabled and the mmap() exits with
   error without reenabling the ring buffer resize. This prevents the
   ring buffer from ever being resized after that. Reenable resizing of
   the ring buffer on mmap() error.

 - Have resizing return proper error and not always -ENOMEM

   If the ring buffer is mmapped by one task and another task tries to
   resize the buffer it will error with -ENOMEM. This is confusing to
   the user as there may be plenty of memory available. Have it return
   the error that actually happens (in this case -EBUSY) where the user
   can understand why the resize failed.

 - Test the sub-buffer array to validate persistent memory buffer

   On boot up, the initialization of the persistent memory buffer will
   do a validation check to see if the content of the data is valid,
   and if so, it will use the memory as is, otherwise it re-initializes
   it. There's meta data in this persistent memory that keeps track of
   which sub-buffer is the reader page and an array that states the
   order of the sub-buffers. The values in this array are indexes into
   the sub-buffers. The validator checks to make sure that all the
   entries in the array are within the sub-buffer list index, but it
   does not check for duplications.

   While working on this code, the array got corrupted and had
   duplicates, where not all the sub-buffers were accounted for. This
   passed the validator as all entries were valid, but the linked list
   was incorrect and could have caused a crash. The corruption only
   produced incorrect data, but it could have been more severe.

   To fix this, create a bitmask that covers all the sub-buffer indexes
   and set it to all zeros. While iterating the array checking the
   values of the array content, have it set a bit corresponding to the
   index in the array. If the bit was already set, then it is a
   duplicate and mark the buffer as invalid and reset it.

 - Prevent mmap()ing persistent ring buffer

   The persistent ring buffer uses vmap() to map the persistent memory.
   Currently, the mmap() logic only uses virt_to_page() to get the page
   from the ring buffer memory and use that to map to user space. This
   works because a normal ring buffer uses alloc_page() to allocate its
   memory. But because the persistent ring buffer uses vmap() it causes
   a kernel crash.

   Fixing this to work with vmap() is not hard, but since mmap() on
   persistent memory buffers never worked, just have the mmap() return
   -ENODEV (what was returned before mmap() for persistent memory ring
   buffers, as they never supported mmap). Normal buffers will still
   allow mmap(). Implementing mmap() for persistent memory ring buffers
   can wait till the next merge window.

 - Fix polling on persistent ring buffers

   There's a "buffer_percent" option (default set to 50), that is used
   to have reads of the ring buffer binary data block until the buffer
   fills to that percentage. The field "pages_touched" is incremented
   every time a new sub-buffer has content added to it. This field is
   used in the calculations to determine the amount of content in the
   buffer and if it exceeds the "buffer_percent" then it will wake the
   task polling on the buffer.

   As persistent ring buffers can be created by the content from a
   previous boot, the "pages_touched" field was not updated. This means
   that if a task were to poll on the persistent buffer, it would block
   even if the buffer was completely full. It would block even if the
   "buffer_percent" was zero, because with "pages_touched" as zero, it
   would be calculated as the buffer having no content. Update
   pages_touched when initializing the persistent ring buffer from a
   previous boot.

* tag 'trace-ring-buffer-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Update pages_touched to reflect persistent buffer content
  tracing: Do not allow mmap() of persistent ring buffer
  ring-buffer: Validate the persistent meta data subbuf array
  tracing: Have the error of __tracing_resize_ring_buffer() passed to user
  ring-buffer: Unlock resize on mmap error
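
The duplicate check described for the sub-buffer array can be illustrated with a small user-space sketch. This is not the kernel code: the function name subbuf_order_is_valid(), the MAX_SUBBUFS cap and the use of a single 64-bit word as the bitmask are assumptions made only to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap so that one 64-bit word can serve as the bitmask. */
#define MAX_SUBBUFS 64

/*
 * Return true if order[] holds only in-range indexes and no index
 * appears twice. With nr_subbufs entries, that also means every
 * sub-buffer is accounted for exactly once.
 */
static bool subbuf_order_is_valid(const int *order, size_t nr_subbufs)
{
    uint64_t seen = 0; /* bitmask starts as all zeros */

    assert(nr_subbufs <= MAX_SUBBUFS);

    for (size_t i = 0; i < nr_subbufs; i++) {
        /* Range check: the test the validator already performed. */
        if (order[i] < 0 || (size_t)order[i] >= nr_subbufs)
            return false;

        /* Duplicate check: a bit that is already set means a repeat. */
        uint64_t bit = UINT64_C(1) << order[i];
        if (seen & bit)
            return false;
        seen |= bit;
    }
    return true;
}
```

An array such as {0, 1, 1, 3} passes the range check alone but is rejected here, which matches the corruption case the message describes.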
2025-02-15ring-buffer: Update pages_touched to reflect persistent buffer contentSteven Rostedt
The pages_touched field represents the number of subbuffers in the ring
buffer that have content that can be read. This is used in accounting of
"dirty_pages" and "buffer_percent" to allow the user to wait for the
buffer to be filled to a certain amount before it reads the buffer in
blocking mode.

The persistent buffer never updated this value so it was set to zero,
and this accounting would take it as if it had no content. This would
cause user space to wait for content even though there's enough content
in the ring buffer that satisfies the buffer_percent.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home
Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
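
The effect of a stale pages_touched value on a blocking reader can be shown with a simplified model. This is a sketch under stated assumptions, not the ring buffer implementation: struct rb_model, its pages_read field and reader_should_wake() are hypothetical names, and the wake rule is only an approximation of the accounting described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-buffer counters. */
struct rb_model {
    size_t nr_subbufs;    /* total sub-buffers in the ring buffer */
    size_t pages_touched; /* sub-buffers that have had content added */
    size_t pages_read;    /* sub-buffers already consumed by a reader */
};

/*
 * Decide whether a blocked reader should be woken. "Dirty" sub-buffers
 * are those touched but not yet read. With buffer_percent == 0 any
 * content is enough; otherwise the dirty share must reach the percentage.
 */
static bool reader_should_wake(const struct rb_model *rb,
                               unsigned int buffer_percent)
{
    assert(buffer_percent <= 100);
    assert(rb->pages_read <= rb->pages_touched);

    size_t dirty = rb->pages_touched - rb->pages_read;

    if (buffer_percent == 0)
        return dirty > 0;

    return dirty * 100 >= (size_t)buffer_percent * rb->nr_subbufs;
}
```

In this model a full persistent buffer whose pages_touched was left at zero never wakes the reader, at 50 percent or at zero percent, while the same buffer with pages_touched set to the number of filled sub-buffers wakes it in both cases.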