path: root/drivers/infiniband/hw/mlx5/mr.c
2021-02-08RDMA/mlx5: Cleanup the synchronize_srcu() from the ODP flowYishai Hadas
Clean up the synchronize_srcu() call in the ODP flow, as it was found to be a very heavy time consumer during dereg_mr. For example, de-registration of 10000 ODP MRs, each covering a 2M hugepage, took 19.6 sec, compared to 172 ms for de-registration of the same number of non-ODP MRs. The new locking scheme uses the wait_event() mechanism, which follows the use count of the MR, instead of synchronize_srcu(). With that change, the above test took 95 ms, which is even better than the non-ODP flow. Once the SRCU usage was fully dropped, a lock was needed to protect the XA access. Using this mechanism also allows removing the num_deferred_work machinery and following the use count instead. Link: https://lore.kernel.org/r/20210202071309.2057998-1-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
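A minimal sketch of the use-count-plus-wait_event() pattern described above. The structure and names (demo_mr, usecount, free_wq) are illustrative assumptions, not the actual mlx5_ib_mr layout.

#include <linux/refcount.h>
#include <linux/wait.h>

struct demo_mr {
        refcount_t usecount;            /* held at 1 while the MR is registered */
        wait_queue_head_t free_wq;      /* woken when usecount drops to zero */
};

/* Fast-path users take a reference instead of entering an SRCU read side. */
static bool demo_mr_get(struct demo_mr *mr)
{
        return refcount_inc_not_zero(&mr->usecount);
}

static void demo_mr_put(struct demo_mr *mr)
{
        if (refcount_dec_and_test(&mr->usecount))
                wake_up(&mr->free_wq);
}

/* dereg waits out the remaining users rather than calling synchronize_srcu(). */
static void demo_mr_dereg(struct demo_mr *mr)
{
        demo_mr_put(mr);        /* drop the registration reference */
        wait_event(mr->free_wq, refcount_read(&mr->usecount) == 0);
        /* now safe to destroy the mkey and free the MR */
}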
2021-01-20RDMA/mlx5: Support dma-buf based userspace memory regionJianxin Xiong
Implement the new driver method 'reg_user_mr_dmabuf'. Utilize the core functions to import a dma-buf based memory region and update the mappings. Add code to handle dma-buf related page faults. Link: https://lore.kernel.org/r/1608067636-98073-5-git-send-email-jianxin.xiong@intel.com Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Acked-by: Christian Koenig <christian.koenig@amd.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-12-14RDMA/mlx5: Fix MR cache memory leakMaor Gottlieb
If the MR cache entry invalidation failed, then we detach this entry from the cache; therefore we must free the memory as well.

Allocation backtrace for the leaker:

  [<00000000d8e423b0>] alloc_cache_mr+0x23/0xc0 [mlx5_ib]
  [<000000001f21304c>] create_cache_mr+0x3f/0xf0 [mlx5_ib]
  [<000000009d6b45dc>] mlx5_ib_alloc_implicit_mr+0x41/0x210 [mlx5_ib]
  [<00000000879d0d68>] mlx5_ib_reg_user_mr+0x9e/0x6e0 [mlx5_ib]
  [<00000000be74bf89>] create_qp+0x2fc/0xf00 [ib_uverbs]
  [<000000001a532d22>] ib_uverbs_handler_UVERBS_METHOD_COUNTERS_READ+0x1d9/0x230 [ib_uverbs]
  [<0000000070f46001>] rdma_alloc_commit_uobject+0xb5/0x120 [ib_uverbs]
  [<000000006d8a0b38>] uverbs_alloc+0x2b/0xf0 [ib_uverbs]
  [<00000000075217c9>] ksys_ioctl+0x234/0x7d0
  [<00000000eb5c120b>] __x64_sys_ioctl+0x16/0x20
  [<00000000db135b48>] do_syscall_64+0x59/0x2e0

Fixes: 1769c4c57548 ("RDMA/mlx5: Always remove MRs from the cache before destroying them")
Link: https://lore.kernel.org/r/20201213132940.345554-2-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-12-07RDMA/mlx5: Assign dev to DM MRMaor Gottlieb
Currently, the DM MR registration flow doesn't set the mlx5_ib_dev pointer, which can cause a NULL pointer dereference if userspace dumps the MR via the rdma tool. Assign the IB device together with the other fields and remove the redundant reference to mlx5_ib_dev from mlx5_ib_mr. Cc: stable@vger.kernel.org Fixes: 6c29f57ea475 ("IB/mlx5: Device memory mr registration support") Link: https://lore.kernel.org/r/20201203190807.127189-1-leon@kernel.org Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-12-07RDMA/mlx5: Fix error unwinds for rereg_mrJason Gunthorpe
This is all a giant train wreck of error handling; in many cases the MR is left in some corrupted state where continuing on will lead to chaos, or various unwinds/ordering steps are missed.

rereg had three possible, completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action, carefully design the error handling to fit the action:

 - UMR access/PD update is a simple UMR; if it fails the MR isn't changed, so do nothing.

 - PAS update over UMR is multiple UMR operations. To keep everything sane, revoke access to the MKey while it is being changed and restore it once the MR is correct.

 - Recreating the mkey should completely build a parallel MR with a fully loaded PAS, then swap and destroy the old one. If it fails, the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem.

Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR, and we can trivially convert to/from ODP MRs using the logic to build a completely new MR.

This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, e.g. during a page size change.

Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-12-07RDMA/mlx5: Reorganize mlx5_ib_reg_user_mr()Jason Gunthorpe
This function handles an ODP and regular MR flow all mushed together, even though the two flows are quite different. Split them into two dedicated functions. Link: https://lore.kernel.org/r/20201130075839.278575-5-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-12-07RDMA/uverbs: Allow drivers to create a new HW object during rereg_mrJason Gunthorpe
mlx5 has an ugly flow where it tries to allocate a new MR and replace the existing MR in the same memory during rereg. This is very complicated and buggy. Instead of trying to replace in-place inside the driver, provide support from uverbs to change the entire HW object assigned to a handle during rereg_mr. Since destroying a MR is allowed to fail (ie if a MW is pointing at it) and can't be detected in advance, the algorithm creates a completely new uobject to hold the new MR and swaps the IDR entries of the two objects. The old MR in the temporary IDR entry is destroyed, and if it fails rereg_mr succeeds and destruction is deferred to FD release. This complexity is why this cannot live in a driver safely. Link: https://lore.kernel.org/r/20201130075839.278575-4-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-26RDMA/mlx5: Use PCI device for dma mappingsParav Pandit
DMA operation of the IB device is done using ib_device->dma_device. Instead of accessing parent of the IB device, use the PCI dma device which is setup to ib_device->dma_device during IB device registration. Link: https://lore.kernel.org/r/20201125064628.8431-1-leon@kernel.org Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02RDMA/mlx5: Use ib_umem_find_best_pgsz() for mkc'sJason Gunthorpe
Now that all the PAS arrays or UMR XLT's for mkcs are filled using rdma_for_each_block() we can use the common ib_umem_find_best_pgsz() algorithm. Link: https://lore.kernel.org/r/20201026132314.1336717-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
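A minimal sketch of how a driver feeds a supported-page-size bitmap into the common helper named above; the bitmap value and function name here are illustrative assumptions, not mlx5's actual capability mask.

#include <linux/bits.h>
#include <rdma/ib_umem.h>

/*
 * Pick the largest page size that both the (hypothetical) HW bitmap and the
 * umem layout allow. Returns 0 if no supported size fits, per the core API.
 */
static unsigned long demo_best_mkc_page_size(struct ib_umem *umem,
                                             unsigned long iova)
{
        /* e.g. every power of two from 4K to 1G -- placeholder, not mlx5's mask */
        unsigned long hw_pgsz_bitmap = GENMASK(30, 12);

        return ib_umem_find_best_pgsz(umem, hw_pgsz_bitmap, iova);
}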
2020-11-02RDMA/mlx5: Split mlx5_ib_update_xlt() into ODP and non-ODP casesJason Gunthorpe
Mixing these together is just a mess, make a dedicated version, mlx5_ib_update_mr_pas(), which directly loads the whole MTT for a non-ODP MR. The split out version can trivially use a simple loop with rdma_for_each_block() which allows using the core code to compute the MR pages and avoids seeking in the SGL list after each chunk as the __mlx5_ib_populate_pas() call required. Significantly speeds loading large MTTs. Link: https://lore.kernel.org/r/20201026132314.1336717-5-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
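A sketch of the simple loop shape the split-out non-ODP path enables: rdma_umem_for_each_dma_block() hands back aligned DMA addresses directly, so there is no manual seeking in the SGL between chunks. The flat MTT format here is a simplification; real mlx5 MTT entries also carry flag bits.

#include <rdma/ib_umem.h>
#include <rdma/ib_verbs.h>

/* Fill a flat MTT-like array with one DMA address per HW page (simplified). */
static void demo_fill_mtt(__be64 *mtt, struct ib_umem *umem,
                          unsigned long page_size)
{
        struct ib_block_iter biter;
        size_t i = 0;

        rdma_umem_for_each_dma_block(umem, &biter, page_size)
                mtt[i++] = cpu_to_be64(rdma_block_iter_dma_address(&biter));
}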
2020-11-02RDMA/mlx5: Split the WR setup out of mlx5_ib_update_xlt()Jason Gunthorpe
The memory allocation is quite complicated, and makes this function hard to understand. Refactor things so that a function call sets up the WR, SG, DMA mapping and buffer, further splitting that into buffer and DMA/wr. This also slightly changes the buffer allocation logic to try an order 0 page allocation (with OOM warnings on) before going to the emergency page. Link: https://lore.kernel.org/r/20201026132314.1336717-4-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
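A hedged sketch of the allocation order described above, under assumed names: try a normal order-0 page first (letting OOM warnings fire), and fall back to the single shared emergency page, serialized by a mutex.

#include <linux/gfp.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(demo_emergency_page_mutex);
static unsigned long demo_emergency_page;       /* assumed allocated once at init */

/* Returns a buffer for building the XLT; may be the shared emergency page. */
static unsigned long demo_get_xlt_buf(bool *used_emergency)
{
        unsigned long buf = __get_free_page(GFP_KERNEL);

        if (buf) {
                *used_emergency = false;
                return buf;
        }
        /* Last resort: the pre-allocated page, one user at a time. */
        mutex_lock(&demo_emergency_page_mutex);
        *used_emergency = true;
        return demo_emergency_page;
}

static void demo_put_xlt_buf(unsigned long buf, bool used_emergency)
{
        if (used_emergency)
                mutex_unlock(&demo_emergency_page_mutex);
        else
                free_page(buf);
}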
2020-11-02RDMA/mlx5: Move xlt_emergency_page_mutex into mr.cJason Gunthorpe
This is the only user, so remove the wrappers. Link: https://lore.kernel.org/r/20201026132314.1336717-3-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02RDMA/mlx5: Change mlx5_ib_populate_pas() to use rdma_for_each_block()Jason Gunthorpe
This routine converts the umem SGL into a list of fixed pages for DMA, which is exactly what rdma_umem_for_each_dma_block() is for, use the common code directly. Link: https://lore.kernel.org/r/20201026132314.1336717-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02RDMA/mlx5: Remove npages from mlx5_ib_cont_pages()Jason Gunthorpe
Most callers don't need this, and the few that do can get it as ib_umem_num_pages(umem). Link: https://lore.kernel.org/r/20201026131936.1335664-8-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02RDMA/mlx5: Remove ncont from mlx5_ib_cont_pages()Jason Gunthorpe
This is the same as ib_umem_num_dma_blocks(umem, 1UL << page_shift), have the callers compute it directly. Link: https://lore.kernel.org/r/20201026131936.1335664-7-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02RDMA/mlx5: Remove order from mlx5_ib_cont_pages()Jason Gunthorpe
Only alloc_mr_from_cache() needs order and can trivially compute it, so lift it to the one call site and remove the NULL arguments. Link: https://lore.kernel.org/r/20201026131936.1335664-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02RDMA/mlx5: Move mlx5_ib_cont_pages() to the creation of the mlx5_ib_mrJason Gunthorpe
For the user MR path, instead of calling this after getting the umem, call it as part of creating the struct mlx5_ib_mr and distill its output to a single page_shift stored inside the mr. This avoids passing around the tuple of its output. Based on the umem and page_shift, the output arguments can be computed using:

  count == ib_umem_num_pages(mr->umem)
  shift == mr->page_shift
  ncont == ib_umem_num_dma_blocks(mr->umem, 1 << mr->page_shift)
  order == order_base_2(ncont)

And since mr->page_shift == umem_odp->page_shift, for ODP umems:

  ncont == ib_umem_num_dma_blocks() == ib_umem_odp_num_pages()

Link: https://lore.kernel.org/r/20201026131936.1335664-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
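The identities above, written out as a small helper using the core ib_umem accessors; the geometry struct is a stand-in for illustration, not struct mlx5_ib_mr.

#include <linux/log2.h>
#include <rdma/ib_umem.h>

struct demo_mr_geometry {
        size_t count;           /* system pages in the umem */
        unsigned int shift;     /* chosen HW page shift */
        size_t ncont;           /* HW pages (DMA blocks) at that shift */
        unsigned int order;     /* cache bucket order */
};

static void demo_mr_geometry(struct ib_umem *umem, unsigned int page_shift,
                             struct demo_mr_geometry *g)
{
        g->count = ib_umem_num_pages(umem);
        g->shift = page_shift;
        g->ncont = ib_umem_num_dma_blocks(umem, 1UL << page_shift);
        g->order = order_base_2(g->ncont);
}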
2020-11-02RDMA/mlx5: Remove mlx5_ib_mr->npagesJason Gunthorpe
This is the same value as ib_umem_num_pages(mr->umem), use that instead. Link: https://lore.kernel.org/r/20201026131936.1335664-4-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02RDMA/mlx5: Fix corruption of reg_pages in mlx5_ib_rereg_user_mr()Jason Gunthorpe
reg_pages should always contain mr->npages since when the MR is finally de-reg'd it is always subtracted out. If there were any error exits then mlx5_ib_rereg_user_mr() would leave reg_pages adjusted, and this will cause it to be double subtracted eventually. The manipulation of reg_pages is inherently connected to the umem, so lift it out of set_mr_fields() and only adjust it around creating/destroying a umem. reg_pages is only used for diagnostics in sysfs. Fixes: 7d0cc6edcc70 ("IB/mlx5: Add MR cache for large UMR regions") Link: https://lore.kernel.org/r/20201026131936.1335664-3-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02RDMA/mlx5: Remove mlx5_ib_mr->orderJason Gunthorpe
This is only ever set to non-zero if the MR is from the cache, and if it is cached then the order is in cached_ent->order. Make it clearer that use_umr_mtt_update() only returns true for cached MRs and remove the redundant data. Link: https://lore.kernel.org/r/20201026131936.1335664-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01RDMA/mlx5: Sync device with CPU pages upon ODP MR registrationYishai Hadas
Sync the device with CPU pages upon ODP MR registration. mlx5 already has to zero the HW's version of the PAS list, so it may as well deliver a PAS list that matches the current CPU page table configuration. Link: https://lore.kernel.org/r/20200930163828.1336747-5-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01RDMA/mlx5: Extend advice MR to support non faulting modeYishai Hadas
Extend advice MR to support a non-faulting mode; this can improve performance by increasing the populated page tables in the device. Link: https://lore.kernel.org/r/20200930163828.1336747-4-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-18RDMA/mlx5: Clarify what the UMR is for when creating MRsJason Gunthorpe
Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks:

 - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking whether the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports.

 - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and the cache always require mlx5_ib_can_load_pas_with_umr().

 - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MRs because we don't store the PAS list size in other cases.

To be very clear, arrange things so any pre-created MRs in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags, the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW.

Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
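A hedged sketch of the kind of test mlx5_ib_can_reconfig_with_umr() performs: compare the current and requested access flags and allow the change only if every differing bit is UMR-modifiable on this hardware. The "always modifiable" subset and the capability mask parameter below are placeholder assumptions, not the driver's real capability fields.

#include <rdma/ib_verbs.h>

/* Flags assumed modifiable by UMR on every generation (illustrative subset). */
#define DEMO_UMR_ALWAYS_MODIFIABLE \
        (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ)

static bool demo_can_reconfig_with_umr(unsigned int current_flags,
                                       unsigned int target_flags,
                                       unsigned int hw_modifiable_mask)
{
        unsigned int diff = current_flags ^ target_flags;

        /* Only bits that actually change need to be UMR-modifiable. */
        return (diff & ~(DEMO_UMR_ALWAYS_MODIFIABLE | hw_modifiable_mask)) == 0;
}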
2020-09-18RDMA/mlx5: Make mkeys always owned by the kernel's PD when not enabledJason Gunthorpe
Any mkey that is not enabled and assigned to userspace should have the PD set to a kernel owned PD. When cache entries are created for the first time the PDN is set to 0, which is probably a kernel PD, but be explicit. When an MR is registered using the hybrid reg_create with UMR xlt & enable flow, the disabled mkey points at the user PD; keep it pointing at the kernel PD until a UMR enables it and sets the user PD. Fixes: 9ec4483a3f0f ("IB/mlx5: Move MRs to a kernel PD when freeing them to the MR cache") Link: https://lore.kernel.org/r/20200914112653.345244-4-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-18RDMA/mlx5: Use set_mkc_access_pd_addr_fields() in reg_create()Jason Gunthorpe
reg_create() open codes this helper, use the shared code. Link: https://lore.kernel.org/r/20200914112653.345244-3-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-18RDMA/mlx5: Remove dead check for EAGAIN after alloc_mr_from_cache()Jason Gunthorpe
alloc_mr_from_cache() no longer returns EAGAIN, this is just dead code now. Fixes: aad719dcf379 ("RDMA/mlx5: Allow MRs to be created in the cache synchronously") Link: https://lore.kernel.org/r/20200914112653.345244-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17RDMA: Clean MW allocation and free flowsLeon Romanovsky
Move allocation and destruction of memory windows under ib_core responsibility and clean drivers to ensure that no updates to MW ib_core structures are done in driver layer. Link: https://lore.kernel.org/r/20200902081623.746359-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-18RDMA/mlx5: Replace open-coded offsetofend() macroLeon Romanovsky
Clean mlx5_ib from open-coded implementations of offsetofend(). Link: https://lore.kernel.org/r/20200730081235.1581127-3-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-07-06RDMA: Remove the udata parameter from alloc_mr callbackGal Pressman
The MR allocation flow can only be initiated by kernel users, not from userspace, so a udata parameter is redundant. Link: https://lore.kernel.org/r/20200706120343.10816-4-galpress@amazon.com Signed-off-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-05-21RDMA/mlx5: Fix NULL pointer dereference in destroy_prefetch_workMaor Gottlieb
q_deferred_work isn't initialized when creating an explicit ODP memory region. This can lead to a NULL pointer dereference when the user performs an asynchronous MR prefetch. Fix it by initializing q_deferred_work for explicit ODP.

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 4 PID: 6074 Comm: kworker/u16:6 Not tainted 5.7.0-rc1-for-upstream-perf-2020-04-17_07-03-39-64 #1
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
  Workqueue: events_unbound mlx5_ib_prefetch_mr_work [mlx5_ib]
  RIP: 0010:__wake_up_common+0x49/0x120
  Code: 04 89 54 24 0c 89 4c 24 08 74 0a 41 f6 01 04 0f 85 8e 00 00 00 48 8b 47 08 48 83 e8 18 4c 8d 67 08 48 8d 50 18 49 39 d4 74 66 <48> 8b 70 18 31 db 4c 8d 7e e8 eb 17 49 8b 47 18 48 8d 50 e8 49 8d
  RSP: 0000:ffffc9000097bd88 EFLAGS: 00010082
  RAX: ffffffffffffffe8 RBX: ffff888454cd9f90 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff888454cd9f90
  RBP: ffffc9000097bdd0 R08: 0000000000000000 R09: ffffc9000097bdd0
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff888454cd9f98
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000003
  FS:  0000000000000000(0000) GS:ffff88846fd00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000044c19e002 CR4: 0000000000760ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   __wake_up_common_lock+0x7a/0xc0
   destroy_prefetch_work+0x5a/0x60 [mlx5_ib]
   mlx5_ib_prefetch_mr_work+0x64/0x80 [mlx5_ib]
   process_one_work+0x15b/0x360
   worker_thread+0x49/0x3d0
   kthread+0xf5/0x130
   ? rescuer_thread+0x310/0x310
   ? kthread_bind+0x10/0x10
   ret_from_fork+0x1f/0x30

Fixes: de5ed007a03d ("IB/mlx5: Fix implicit ODP race")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200521072504.567406-1-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
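The shape of the fix, as a hedged sketch: make sure the wait queue that the prefetch-work teardown wakes is initialized on the explicit-ODP registration path too. Field names mirror the commit text, but the surrounding structure is simplified and assumed.

#include <linux/atomic.h>
#include <linux/wait.h>

struct demo_odp_mr {
        atomic_t num_deferred_work;
        wait_queue_head_t q_deferred_work;      /* waited on at destroy time */
};

/*
 * Must run for both implicit and explicit ODP MRs; otherwise the later
 * wake_up()/wait in destroy_prefetch_work() trips over an uninitialized
 * wait queue head.
 */
static void demo_init_odp_mr(struct demo_odp_mr *mr)
{
        atomic_set(&mr->num_deferred_work, 0);
        init_waitqueue_head(&mr->q_deferred_work);
}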
2020-03-13RDMA/mlx5: Allow MRs to be created in the cache synchronouslyJason Gunthorpe
If the cache is completely out of MRs, and we are running in cache mode, then directly and synchronously create an MR that is compatible with the cache bucket, using a sleeping mailbox command. This ensures that the thread that is waiting for the MR absolutely will get one. When an MR allocated in this way is freed, it is compatible with the cache bucket and will be recycled back into it. Deletes the very buggy ent->compl scheme for creating a synchronous MR allocation. Link: https://lore.kernel.org/r/20200310082238.239865-13-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
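A hedged sketch of the fallback described above: if the bucket is empty, build a bucket-compatible MR synchronously with a sleeping command rather than relying on a completion scheme. All structure and function names are illustrative, not the driver's.

#include <linux/list.h>
#include <linux/spinlock.h>

struct demo_cache_ent {
        spinlock_t lock;
        struct list_head head;          /* free MRs in this bucket */
        unsigned int available_mrs;
};

struct demo_mr {
        struct list_head list;
};

/* Assumed helper: issues a sleeping mailbox command to create a compatible MR. */
struct demo_mr *demo_create_cache_mr_sync(struct demo_cache_ent *ent);

static struct demo_mr *demo_get_cache_mr(struct demo_cache_ent *ent)
{
        struct demo_mr *mr;

        spin_lock_irq(&ent->lock);
        mr = list_first_entry_or_null(&ent->head, struct demo_mr, list);
        if (mr) {
                list_del(&mr->list);
                ent->available_mrs--;
        }
        spin_unlock_irq(&ent->lock);
        if (mr)
                return mr;

        /*
         * Cache empty: create a bucket-compatible MR synchronously so this
         * caller is guaranteed to get one; it recycles back into the bucket
         * when freed.
         */
        return demo_create_cache_mr_sync(ent);
}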
2020-03-13RDMA/mlx5: Revise how the hysteresis scheme works for cache fillingJason Gunthorpe
Currently if the work queue is running then it is in 'hysteresis' mode and will fill until the cache reaches the high water mark. This implicit state is very tricky and doesn't interact with pending very well. Instead of self re-scheduling the work queue after the add_keys() has started to create the new MR, have the queue scheduled from reg_mr_callback() only after the requested MR has been added. This avoids the bad design of an in-rush of queue'd work doing back to back add_keys() until EAGAIN then sleeping. The add_keys() will be paced one at a time as they complete, slowly filling up the cache. Also, fix pending to be only manipulated under lock. Link: https://lore.kernel.org/r/20200310082238.239865-12-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Fix locking in MR cache work queueJason Gunthorpe
All of the members of mlx5_cache_ent must be accessed while holding the spinlock, add the missing spinlock in the __cache_work_func(). Using cache->stopped and flush_workqueue() is an inherently racy way to shutdown self-scheduling work on a queue. Replace it with ent->disabled under lock, and always check disabled before queuing any new work. Use cancel_work_sync() to shutdown the queue. Use READ_ONCE/WRITE_ONCE for dev->last_add to manage concurrency as coherency is less important here. Split fill_delay from the bitfield. C bitfield updates are not atomic and this is just a mess. Use READ_ONCE/WRITE_ONCE, but this could also use test_bit()/set_bit(). Link: https://lore.kernel.org/r/20200310082238.239865-11-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
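A hedged sketch of the shutdown pattern the commit describes: a per-bucket 'disabled' flag checked under the spinlock before any queue_work(), plus cancel_work_sync() at teardown, instead of a global 'stopped' flag and flush_workqueue(). Structure names are illustrative.

#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct demo_cache_ent {
        spinlock_t lock;
        bool disabled;                  /* set under lock at teardown */
        struct work_struct work;        /* self-rescheduling fill work */
};

/* Only queue new work while the entry is still enabled, under the lock. */
static void demo_queue_ent_work(struct demo_cache_ent *ent)
{
        spin_lock_irq(&ent->lock);
        if (!ent->disabled)
                queue_work(system_wq, &ent->work);
        spin_unlock_irq(&ent->lock);
}

static void demo_disable_ent(struct demo_cache_ent *ent)
{
        spin_lock_irq(&ent->lock);
        ent->disabled = true;           /* stops further self-rescheduling */
        spin_unlock_irq(&ent->lock);
        cancel_work_sync(&ent->work);   /* wait out anything already running */
}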
2020-03-13RDMA/mlx5: Lock access to ent->available_mrs/limit when doing queue_workJason Gunthorpe
Accesses to these members need to be locked. There is no reason not to hold a spinlock while calling queue_work(), so move the tests into a helper and always call it under lock. The helper should be called whenever available_mrs is adjusted. Link: https://lore.kernel.org/r/20200310082238.239865-10-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Fix MR cache size and limit debugfsJason Gunthorpe
The size_write function is supposed to adjust total_mrs to match the user's request, but lacks locking and safety checking. total_mrs can only be adjusted by at most available_mrs; MRs already assigned to users cannot be revoked. Ensure that the user provides a target value within the range of available_mrs and within the high/low water marks. limit_write has confusing and wrong sanity checking, and doesn't have the ability to deallocate on limit reduction. Since both functions use the same algorithm to adjust available_mrs, consolidate it into one function and write it correctly. Fix the locking by holding the spinlock for all accesses to ent->X. Always fail if the user provides a malformed string. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Link: https://lore.kernel.org/r/20200310082238.239865-9-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Always remove MRs from the cache before destroying themJason Gunthorpe
The cache bucket tracks the total number of MRs that exists, both inside and outside of the cache. Removing a MR from the cache (by setting cache_ent to NULL) without updating total_mrs will cause the tracking to leak and be inflated. Further fix the rereg_mr path to always destroy the MR. reg_create will always overwrite all the MR data in mlx5_ib_mr, so the MR must be completely destroyed, in all cases, before this function can be called. Detach the MR from the cache and unconditionally destroy it to avoid leaking HW mkeys. Fixes: afd1417404fb ("IB/mlx5: Use direct mkey destroy command upon UMR unreg failure") Fixes: 56e11d628c5d ("IB/mlx5: Added support for re-registration of MRs") Link: https://lore.kernel.org/r/20200310082238.239865-8-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Simplify how the MR cache bucket is locatedJason Gunthorpe
There are many bad APIs here that accept a cache bucket index instead of a bucket pointer. Many of the callers already have a bucket pointer, so this results in a lot of confusing uses of order2idx(). Pass the struct mlx5_cache_ent into add_keys(), remove_keys(), and alloc_cached_mr(). Once the MR is in the cache, store the cache bucket pointer directly in the MR, replacing the 'bool allocated_from_cache'. In the end there is only one place that needs to form an index from an order, alloc_mr_from_cache(). Increase the safety of this function by disallowing it from accessing cache entries in the ODP special area. Link: https://lore.kernel.org/r/20200310082238.239865-7-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Rename the tracking variables for the MR cacheJason Gunthorpe
The old names do not clearly indicate the intent. Link: https://lore.kernel.org/r/20200310082238.239865-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-13RDMA/mlx5: Replace spinlock protected write with atomic varSaeed Mahameed
The mkey variant calculation was spinlock protected to make it atomic; replace that with one atomic variable. Link: https://lore.kernel.org/r/20200310082238.239865-4-leon@kernel.org Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
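A minimal sketch of the change's shape, with assumed field names: a spinlock-protected counter becomes a single atomic counter.

#include <linux/atomic.h>
#include <linux/types.h>

/* Per-device mkey variant counter (illustrative struct, not mlx5_ib_dev). */
struct demo_dev {
        atomic_t mkey_var;
};

/* Was: spin_lock(); var = dev->mkey_var++; spin_unlock(); */
static u8 demo_next_mkey_variant(struct demo_dev *dev)
{
        return (u8)atomic_inc_return(&dev->mkey_var);
}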
2020-03-13{IB,net}/mlx5: Move asynchronous mkey creation to mlx5_ibMichael Guralnik
As mlx5_ib is the only user of the mlx5_core_create_mkey_cb, move the logic inside mlx5_ib and cleanup the code in mlx5_core. Signed-off-by: Michael Guralnik <michaelgur@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-03-13{IB,net}/mlx5: Assign mkey variant in mlx5_ib onlySaeed Mahameed
mkey variant is not required for mlx5_core use, move the mkey variant counter to mlx5_ib. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-03-13{IB,net}/mlx5: Setup mkey variant before mr create command invocationSaeed Mahameed
On reg_mr_callback() mlx5_ib is recalculating the mkey variant which is wrong and will lead to using a different key variant than the one submitted to firmware on create mkey command invocation. To fix this, we store the mkey variant before invoking the firmware command and use it later on completion (reg_mr_callback). Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-01-21Merge tag 'rds-odp-for-5.5' into rdma.git for-nextJason Gunthorpe
From https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma

Leon Romanovsky says:

====================
Use ODP MRs for kernel ULPs

The following series extends MR creation routines to allow creation of user MRs through kernel ULPs as a proxy. The immediate use case is to allow RDS to work over FS-DAX, which requires ODP (on-demand-paging) MRs to be created, and such MRs were not possible to create prior to this series.

The first part of this patchset extends RDMA to have a special verb, ib_reg_user_mr(). The common use case for this function is a userspace application that allocates memory for HCA access, where the responsibility to register the memory at the HCA is on a kernel ULP. This ULP acts as an agent for the userspace application.

The second part provides advise MR functionality for ULPs. This is an integral part of ODP flows and is used to trigger pagefaults in advance to prepare memory before running a working set.

The third part is the actual user of those in-kernel APIs.
====================

* tag 'rds-odp-for-5.5':
  net/rds: Use prefetch for On-Demand-Paging MR
  net/rds: Handle ODP mr registration/unregistration
  net/rds: Detect need of On-Demand-Paging memory registration
  RDMA/mlx5: Fix handling of IOVA != user_va in ODP paths
  IB/mlx5: Mask out unsupported ODP capabilities for kernel QPs
  RDMA/mlx5: Don't fake udata for kernel path
  IB/mlx5: Add ODP WQE handlers for kernel QPs
  IB/core: Add interface to advise_mr for kernel users
  IB/core: Introduce ib_reg_user_mr
  IB: Allow calls to ib_umem_get from kernel ULPs

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-01-16RDMA/mlx5: Set relaxed ordering when requestedMichael Guralnik
Enable relaxed ordering in the mkey context when requested. As relaxed ordering is not currently supported in UMR, disable UMR usage for relaxed ordering MRs. Link: https://lore.kernel.org/r/1578506740-22188-11-git-send-email-yishaih@mellanox.com Signed-off-by: Michael Guralnik <michaelgur@mellanox.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-01-16RDMA/mlx5: Fix handling of IOVA != user_va in ODP pathsJason Gunthorpe
Until recently it was not possible for userspace to specify a different IOVA, but with the new ibv_reg_mr_iova() library call this can be done.

To compute the user_va we must compute:

  user_va = (iova - iova_start) + user_va_start

while being cautious of overflow and other math problems.

The iova is not reliably stored in the mmkey when the MR is created. Only the cached creation path (the common one) set it, so it must also be set when creating uncached MRs.

Fix the weird use of iova when computing the starting page index in the MR. In the normal case, when iova == umem.address:

  iova & (~(BIT(page_shift) - 1)) ==
      ALIGN_DOWN(umem.address, odp->page_size) == ib_umem_start(odp)

And when iova is different, using it in math with a user_va is wrong.

Finally, do not allow an implicit ODP to be created with a non-zero IOVA as we have no support for that.

Fixes: 7bdf65d411c1 ("IB/mlx5: Handle page faults")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
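The mapping spelled out above, as a small hedged helper: translate a faulting IOVA into the matching user VA while checking bounds and addition overflow. The parameters are assumptions standing in for the driver's mmkey/umem fields.

#include <linux/errno.h>
#include <linux/overflow.h>
#include <linux/types.h>

/*
 * user_va = (iova - iova_start) + user_va_start, valid only when iova lies
 * within [iova_start, iova_start + length).
 */
static int demo_iova_to_user_va(u64 iova, u64 iova_start, u64 length,
                                u64 user_va_start, u64 *user_va)
{
        if (iova < iova_start || iova - iova_start >= length)
                return -EFAULT;
        if (check_add_overflow(user_va_start, iova - iova_start, user_va))
                return -EFAULT;
        return 0;
}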
2020-01-16IB: Allow calls to ib_umem_get from kernel ULPsMoni Shoua
So far the assumption was that ib_umem_get() and ib_umem_odp_get() are called from flows that start in UVERBS and therefore have a user context. This assumption restricts flows that are initiated by ULPs and need the service that ib_umem_get() provides. This patch changes ib_umem_get() and ib_umem_odp_get() to get the IB device directly, relying on the fact that both UVERBS and ULPs set that field correctly. Reviewed-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-01-03RDMA/mlx5: use true,false for bool variablezhengbin
Fixes coccicheck warning: drivers/infiniband/hw/mlx5/mr.c:150:2-26: WARNING: Assignment of 0/1 to bool variable drivers/infiniband/hw/mlx5/mr.c:1455:2-26: WARNING: Assignment of 0/1 to bool variable drivers/infiniband/hw/mlx5/qp.c:1874:6-20: WARNING: Assignment of 0/1 to bool variable Link: https://lore.kernel.org/r/1577176812-2238-6-git-send-email-zhengbin13@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-01-03IB/mlx5: Unify ODP MR code paths to allow extra flexibilityArtemy Kovalyov
Building the MR translation table in the ODP case requires additional flexibility, namely random access to DMA addresses. Make both direct and indirect ODP MRs use the same code path, separated from the non-ODP MR code path. With the restructuring, the correct page_shift is now used around __mlx5_ib_populate_pas(). Fixes: d2183c6f1958 ("RDMA/umem: Move page_shift from ib_umem to ib_odp_umem") Link: https://lore.kernel.org/r/20191222124649.52300-2-leon@kernel.org Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-30Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull hmm updates from Jason Gunthorpe:
 "This is another round of bug fixing and cleanup. This time the focus is on the driver pattern to use mmu notifiers to monitor a VA range. This code is lifted out of many drivers and hmm_mirror directly into the mmu_notifier core and written using the best ideas from all the driver implementations. This removes many bugs from the drivers and has a very pleasing diffstat. More drivers can still be converted, but that is for another cycle.

  - A shared branch with RDMA reworking the RDMA ODP implementation

  - New mmu_interval_notifier API. This is focused on the use case of monitoring a VA and simplifies the process for drivers

  - A common seq-count locking scheme built into the mmu_interval_notifier API usable by drivers that call get_user_pages() or hmm_range_fault() with the VA range

  - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen GntDev drivers to the new API. This deletes a lot of wonky driver code.

  - Two improvements for hmm_range_fault(), from testing done by Ralph"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
  mm/hmm: make full use of walk_page_range()
  xen/gntdev: use mmu_interval_notifier_insert
  mm/hmm: remove hmm_mirror and related
  drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
  drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
  drm/amdgpu: Call find_vma under mmap_sem
  nouveau: use mmu_interval_notifier instead of hmm_mirror
  nouveau: use mmu_notifier directly for invalidate_range_start
  drm/radeon: use mmu_interval_notifier_insert
  RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
  RDMA/odp: Use mmu_interval_notifier_insert()
  mm/hmm: define the pre-processor related parts of hmm.h even if disabled
  mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
  mm/mmu_notifier: add an interval tree notifier
  mm/mmu_notifier: define the header pre-processor parts even if disabled
  mm/hmm: allow snapshot of the special zero page
2019-11-23RDMA/odp: Use mmu_interval_notifier_insert()Jason Gunthorpe
Replace the internal interval tree based mmu notifier with the new common mmu_interval_notifier_insert() API. This removes a lot of code and fixes a deadlock that can be triggered in ODP:

  zap_page_range()
    mmu_notifier_invalidate_range_start()
      [..]
      ib_umem_notifier_invalidate_range_start()
        down_read(&per_mm->umem_rwsem)
    unmap_single_vma()
      [..]
      __split_huge_page_pmd()
        mmu_notifier_invalidate_range_start()
          [..]
          ib_umem_notifier_invalidate_range_start()
            down_read(&per_mm->umem_rwsem)   // DEADLOCK
        mmu_notifier_invalidate_range_end()
          up_read(&per_mm->umem_rwsem)
    mmu_notifier_invalidate_range_end()
      up_read(&per_mm->umem_rwsem)

The umem_rwsem is held across the range_start/end as the ODP algorithm for invalidate_range_end cannot tolerate changes to the interval tree. However, due to the nested invalidation regions the second down_read() can deadlock if there are competing writers. The new core code provides an alternative scheme to solve this problem.

Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count")
Link: https://lore.kernel.org/r/20191112202231.3856-6-jgg@ziepe.ca
Tested-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
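A hedged sketch of the replacement API's usage pattern: register an interval notifier over the MR's VA range and have its invalidate callback bump the collision sequence, so page-fault code can retry under the core's read_begin/read_retry scheme. The ops body is a skeleton with assumed names, not the ODP implementation (which must also zap HW mappings and serialize with its own lock).

#include <linux/mmu_notifier.h>

struct demo_odp {
        struct mmu_interval_notifier notifier;
};

static bool demo_invalidate(struct mmu_interval_notifier *mni,
                            const struct mmu_notifier_range *range,
                            unsigned long cur_seq)
{
        /* Mark cached mappings stale; real code would also zap HW mappings. */
        mmu_interval_set_seq(mni, cur_seq);
        return true;
}

static const struct mmu_interval_notifier_ops demo_ops = {
        .invalidate = demo_invalidate,
};

static int demo_register_range(struct demo_odp *odp, struct mm_struct *mm,
                               unsigned long start, unsigned long length)
{
        return mmu_interval_notifier_insert(&odp->notifier, mm, start, length,
                                            &demo_ops);
}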