summaryrefslogtreecommitdiff
path: root/drivers/infiniband
AgeCommit message (Collapse)Author
2019-06-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor SPDX change conflict. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22timekeeping: Use proper clock specifier names in functionsJason A. Donenfeld
This makes boot uniformly boottime and tai uniformly clocktai, to address the remaining oversights. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com
2019-06-21RDMA/efa: Print address on AH creation failureFiras Jahjah
For debugging purposes, print destination address if failed to create AH. Signed-off-by: Firas Jahjah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-21RDMA/efa: Be consistent with success flow return valueGal Pressman
The EFA driver is written with success oriented flows in mind, meaning that functions should mostly end with a return 0 statement. Error flows return their error value on their own instead of assuming that the function will return the error at the end. This commit fixes a bunch of functions that were not aligned with this behavior. Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-21RDMA/efa: Use API to get contiguous memory blocks aligned to device ↵Gal Pressman
supported page size Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when registering an MR instead of coding it in the driver. ib_umem_find_best_pgsz() is used to find the best suitable page size which replaces the existing efa_cont_pages() implementation. rdma_for_each_block() is used to iterate the umem in aligned contiguous memory blocks. Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20IB/{rdmavt, qib, hfi1}: Convert to new completion APIMike Marciniszyn
Convert all completions to use the new completion routine that fixes a race between post send and completion where fields from a SWQE can be read after SWQE has been freed. This patch also addresses issues reported in https://marc.info/?l=linux-kernel&m=155656897409107&w=2. The reserved operation path has no need for any barrier. The barrier for the other path is addressed by the smp_load_acquire() barrier. Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20RDMA/odp: Do not leak dma maps when working with huge pagesJason Gunthorpe
The ib_dma_unmap_page() must match the length of the ib_dma_map_page(), which is based on odp_shift. Otherwise iommu resources under this API will not be properly freed. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20RDMA/uverbs: Use offsetofend instead of opencodingJason Gunthorpe
Discovered this was available already. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20RDMa/hns: Don't stuck in endless timeout loopLeon Romanovsky
The "end" variable is declared as unsigned and can't be negative, it leads to the situation where timeout limit is not honored, so let's convert logic to ensure that loop is bounded. drivers/infiniband/hw/hns/hns_roce_hw_v1.c: In function _hns_roce_v1_clear_hem_: drivers/infiniband/hw/hns/hns_roce_hw_v1.c:2471:12: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] 2471 | if (end < 0) { | ^ Fixes: 669cefb654cb ("RDMA/hns: Remove jiffies operation in disable interrupt context") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20scsi: RDMA/srp: Fix a sleep-in-invalid-context bugBart Van Assche
The previous patch guarantees that srp_queuecommand() does not get invoked while reconnecting occurs. Hence remove the code from srp_queuecommand() that prevents command queueing while reconnecting. This patch avoids that the following can appear in the kernel log: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747 in_atomic(): 1, irqs_disabled(): 0, pid: 5600, name: scsi_eh_9 1 lock held by scsi_eh_9/5600: #0: (rcu_read_lock){....}, at: [<00000000cbb798c7>] __blk_mq_run_hw_queue+0xf1/0x1e0 Preemption disabled at: [<00000000139badf2>] __blk_mq_delay_run_hw_queue+0x78/0xf0 CPU: 9 PID: 5600 Comm: scsi_eh_9 Tainted: G W 4.15.0-rc4-dbg+ #1 Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 2.5.4 01/22/2016 Call Trace: dump_stack+0x67/0x99 ___might_sleep+0x16a/0x250 [ib_srp] __mutex_lock+0x46/0x9d0 srp_queuecommand+0x356/0x420 [ib_srp] scsi_dispatch_cmd+0xf6/0x3f0 scsi_queue_rq+0x4a8/0x5f0 blk_mq_dispatch_rq_list+0x73/0x440 blk_mq_sched_dispatch_requests+0x109/0x1a0 __blk_mq_run_hw_queue+0x131/0x1e0 __blk_mq_delay_run_hw_queue+0x9a/0xf0 blk_mq_run_hw_queue+0xc0/0x1e0 blk_mq_start_hw_queues+0x2c/0x40 scsi_run_queue+0x18e/0x2d0 scsi_run_host_queues+0x22/0x40 scsi_error_handler+0x18d/0x5f0 kthread+0x11c/0x140 ret_from_fork+0x24/0x30 Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Laurence Oberman <loberman@redhat.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Leon Romanovsky <leonro@mellanox.com> Cc: Doug Ledford <dledford@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-06-20RDMA: Check umem pointer validity prior to releaseLeon Romanovsky
Update ib_umem_release() to behave similarly to kfree() and allow submitting NULL pointer as safe input to this function. Fixes: a52c8e2469c3 ("RDMA: Clean destroy CQ in drivers do not return errors") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20RDMA/hns: reset function when removing moduleLang Cheng
During removing the driver, we needs to notify the roce engine to stop working immediately,and symmetrically recycle the hardware resources requested during initialization. The hardware provides a command called function clear that can package these operations,so that the driver can only focus on releasing resources that applied from the operating system. This patch implements the call of this command. Signed-off-by: Lang Cheng <chenglang@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20RDMA: Convert destroy_wq to be voidLeon Romanovsky
All callers of destroy WQ are always success and there is no need to check their return value, so convert destroy_wq to be void. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20RDMA/hns: Fix bug when wqe num is larger than 16KLijun Ou
hip08 can support up to 32768 wqes in one qp. currently if the wqe num is larger than 16384, the driver will lead a calltrace as follows. [21361.393725] Call trace: [21361.398605] hns_roce_v2_modify_qp+0xbcc/0x1360 [hns_roce_hw_v2] [21361.410627] hns_roce_modify_qp+0x1d8/0x2f8 [hns_roce] [21361.420906] _ib_modify_qp+0x70/0x118 [21361.428222] ib_modify_qp+0x14/0x1c [21361.435193] rt_ktest_modify_qp+0xb8/0x650 [rdma_test] [21361.445472] exec_modify_qp_cmd+0x110/0x4d8 [rdma_test] [21361.455924] rt_ktest_dispatch_cmd_3+0xa94/0x2edc [rdma_test] [21361.467422] rt_ktest_dispatch_cmd_2+0x9c/0x108 [rdma_test] [21361.478570] rt_ktest_dispatch_cmd+0x138/0x904 [rdma_test] [21361.489545] rt_ktest_dev_write+0x328/0x4b0 [rdma_test] [21361.499998] __vfs_write+0x38/0x15c [21361.506966] vfs_write+0xa8/0x1a0 [21361.513586] ksys_write+0x50/0xb0 [21361.520206] sys_write+0xc/0x14 [21361.526479] el0_svc_naked+0x30/0x34 [21361.533622] Code: 1ac10841 d37d7c22 0b000021 d37df021 (f86268c0) [21361.545815] ---[ end trace e2a1feb2c3d7f13c ]--- When the wqe num is larger than 16384, hns_roce_table_find will return an invalid mtt, this will lead an kernel paging requet error if the driver try to access it. It's the mtt design defect which can't support up to the max wqe num of hip08. This patch fixs it by replacing mtt with mtr for wqe. Fixes: 926a01dc000d ("RDMA/hns: Add QP operations support for hip08 SoC") Signed-off-by: Xi Wang <wangxi11@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20RDMA/hns: Add a group interfaces for optimizing buffers getting flowLijun Ou
Currently, the code for getting umem and kmem buffers exist many files, this patch adds a group interfaces to simplify the buffers getting flow. Signed-off-by: Xi Wang <wangxi11@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-20RDMA/hns: Add mtr support for mixed multihop addressingLijun Ou
Currently, the MTT(memory translate table) design required a buffer space must has the same hopnum, but the hip08 hw can support mixed hopnum config in a buffer space. This patch adds the MTR(memory translate region) design for supporting mixed multihop. Signed-off-by: Xi Wang <wangxi11@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-19RDMA/netlink: Resort policy arrayDoug Ledford
Sort the netlink policy array by netlink attribute name. This will make it easier in the future to find the entry you are looking for when you need to make changes, or to make sure you don't add the same entry twice. Fix the whitespace while we are there. Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18RDMA/mlx5: Enable decap and packet reformat on FDBMaor Gottlieb
If FDB flow tables support decap operation, enable it on creation, This allows to perform decapsulation of tunnelled packets by steering rules. If FDB flow tables support reformat operation, enable it on creation as well. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18RDMA/mlx5: Consider eswitch encap modeMaor Gottlieb
When flow steering is created, then the encap support should consider the eswitch encap mode. If the eswitch flow table (FDB) supports encap then it shouldn't be supported on NIC RX flow tables. Fixes: 4adda1122c490 ('RDMA/mlx5: Enable decap and packet reformat on flow tables') Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18Merge remote-tracking branch 'mlx5-next/mlx5-next' into HEADDoug Ledford
Take mlx5-next so we can take a dependent two patch series next. Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18RDMA/odp: Fix missed unlock in non-blocking invalidate_startJason Gunthorpe
If invalidate_start returns with EAGAIN then the umem_rwsem needs to be unlocked as no invalidate_end will be called. Cc: <stable@vger.kernel.org> Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count") Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18IB/hfi1: Spelling s/statisfied/satisfied/Geert Uytterhoeven
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18RDMA: Report available cdevs through RDMA_NLDEV_CMD_GET_CHARDEVJason Gunthorpe
Update the struct ib_client for all modules exporting cdevs related to the ibdevice to also implement RDMA_NLDEV_CMD_GET_CHARDEV. All cdevs are now autoloadable and discoverable by userspace over netlink instead of relying on sysfs. uverbs also exposes the DRIVER_ID for drivers that are able to support driver id binding in rdma-core. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18RDMA: Add NLDEV_GET_CHARDEV to allow char dev discovery and autoloadJason Gunthorpe
Allow userspace to issue a netlink query against the ib_device for something like "uverbs" and get back the char dev name, inode major/minor, and interface ABI information for "uverbs0". Since we are now in netlink this can also trigger a module autoload to make the uverbs device come into existence. Largely this will let us replace searching and reading inside sysfs to setup devices, and provides an alternative (using driver_id) to device name based provider binding for things like rxe. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18RDMA/efa: Handle mmap insertions overflowGal Pressman
When inserting a new mmap entry to the xarray we should check for 'mmap_page' overflow as it is limited to 32 bits. Fixes: 40909f664d27 ("RDMA/efa: Add EFA verbs implementation") Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18ipoib: correcly show a VF hardware addressDenis Kirjanov
in the case of IPoIB with SRIOV enabled hardware ip link show command incorrecly prints 0 instead of a VF hardware address. Before: 11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP mode DEFAULT group default qlen 256 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state disable, trust off, query_rss off ... After: 11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP mode DEFAULT group default qlen 256 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vf 0 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off v1->v2: just copy an address without modifing ifla_vf_mac v2->v3: update the changelog Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Honestly all the conflicts were simple overlapping changes, nothing really interesting to report. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17RDMA/efa: Fix success return value in case of errorGal Pressman
Existing code would mistakenly return success in case of error instead of a proper return value. Fixes: e9c6c5373088 ("RDMA/efa: Add common command handlers") Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Handle port down properly in pioMike Marciniszyn
The call to sc_buffer_alloc currently returns NULL (no buffer) or a buffer descriptor. There is a third case when the port is down. Currently that returns NULL and this prevents the caller from properly handling the sc_buffer_alloc() failure. A verbs code link test after the call is racy so the indication needs to come from the state check inside the allocation routine to be valid. Fix by encoding the ECOMM failure like SDMA. IS_ERR_OR_NULL() tests are added at all call sites. For verbs send, this needs to treat any error by returning a completion without any MMIO copy. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Handle wakeup of orphaned QPs for pioMike Marciniszyn
Once a send context is taken down due to a link failure, any QPs waiting for pio credits will stay on the waitlist indefinitely. Fix by wakeing up all QPs linked to piowait list. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Wakeup QPs orphaned on wait list after flushMike Marciniszyn
Once an SDMA engine is taken down due to a link failure, any waiting QPs that do not have outstanding descriptors in the ring will stay on the dmawait list as long as the port is down. Since there is no timer running, they will stay there for a long time. The fix is to wake up all iowaits linked to dmawait. The send engine will build and post packets that get flushed back. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Use aborts to trigger RC throttlingMike Marciniszyn
SDMA and pio flushes will cause a lot of packets to be transmitted after a link has gone down, using a lot of CPU to retransmit packets. Fix for RC QPs by recognizing the flush status and: - Forcing a timer start - Putting the QP into a "send one" mode Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Create inline to get extended headersMike Marciniszyn
This paves the way for another patch that reacts to a flush sdma completion for RC. Fixes: 81cd3891f021 ("IB/hfi1: Add support for 16B Management Packets") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Silence txreq allocation warningsMike Marciniszyn
The following warning can happen when a memory shortage occurs during txreq allocation: [10220.939246] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) [10220.939246] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0018.C4.072020161249 07/20/2016 [10220.939247] cache: mnt_cache, object size: 384, buffer size: 384, default order: 2, min order: 0 [10220.939260] Workqueue: hfi0_0 _hfi1_do_send [hfi1] [10220.939261] node 0: slabs: 1026568, objs: 43115856, free: 0 [10220.939262] Call Trace: [10220.939262] node 1: slabs: 820872, objs: 34476624, free: 0 [10220.939263] dump_stack+0x5a/0x73 [10220.939265] warn_alloc+0x103/0x190 [10220.939267] ? wake_all_kswapds+0x54/0x8b [10220.939268] __alloc_pages_slowpath+0x86c/0xa2e [10220.939270] ? __alloc_pages_nodemask+0x2fe/0x320 [10220.939271] __alloc_pages_nodemask+0x2fe/0x320 [10220.939273] new_slab+0x475/0x550 [10220.939275] ___slab_alloc+0x36c/0x520 [10220.939287] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939299] ? __get_txreq+0x54/0x160 [hfi1] [10220.939310] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939312] __slab_alloc+0x40/0x61 [10220.939323] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939325] kmem_cache_alloc+0x181/0x1b0 [10220.939336] hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939348] ? hfi1_verbs_send_dma+0x386/0xa10 [hfi1] [10220.939359] ? find_prev_entry+0xb0/0xb0 [hfi1] [10220.939371] hfi1_do_send+0x1d9/0x3f0 [hfi1] [10220.939372] process_one_work+0x171/0x380 [10220.939374] worker_thread+0x49/0x3f0 [10220.939375] kthread+0xf8/0x130 [10220.939377] ? max_active_store+0x80/0x80 [10220.939378] ? kthread_bind+0x10/0x10 [10220.939379] ret_from_fork+0x35/0x40 [10220.939381] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) The shortage is handled properly so the message isn't needed. Silence by adding the no warn option to the slab allocation. Fixes: 45842abbb292 ("staging/rdma/hfi1: move txreq header code") Cc: <stable@vger.kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Avoid hardlockup with flushlist_lockMike Marciniszyn
Heavy contention of the sde flushlist_lock can cause hard lockups at extreme scale when the flushing logic is under stress. Mitigate by replacing the item at a time copy to the local list with an O(1) list_splice_init() and using the high priority work queue to do the flushes. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Cc: <stable@vger.kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17RDMA: Move rdma_node_type to uapi/Jason Gunthorpe
This enum is exposed over the sysfs file 'node_type' and over netlink via RDMA_NLDEV_ATTR_DEV_NODE_TYPE, so declare it in the uapi headers. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17Merge tag 'v5.2-rc5' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17locking/lockdep: Rename lockdep_assert_held_exclusive() -> ↵Nikolay Borisov
lockdep_assert_held_write() All callers of lockdep_assert_held_exclusive() use it to verify the correct locking state of either a semaphore (ldisc_sem in tty, mmap_sem for perf events, i_rwsem of inode for dax) or rwlock by apparmor. Thus it makes sense to rename _exclusive to _write since that's the semantics callers care. Additionally there is already lockdep_assert_held_read(), which this new naming is more consistent with. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190531100651.3969-1-nborisov@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-14LSM: switch to blocking policy update notifiersJanne Karhunen
Atomic policy updaters are not very useful as they cannot usually perform the policy updates on their own. Since it seems that there is no strict need for the atomicity, switch to the blocking variant. While doing so, rename the functions accordingly. Signed-off-by: Janne Karhunen <janne.karhunen@gmail.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-13net/mlx5: Add EQ enable/disable APIYuval Avnery
Previously, EQ joined the chain notifier on creation. This forced the caller to be ready to handle events before creating the EQ through eq_create_generic interface. To help the caller control when the created EQ will be attached to the IRQ, add enable/disable API. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-13net/mlx5: Use a single IRQ for all async EQsAriel Levkovich
The patch modifies the IRQ allocation so that all async EQs are assigned to the same IRQ resulting in more available IRQs for completion EQs. The changes are using the support for IRQ sharing and EQ polling budget that was introduced in previous patches so when the shared interrupt is triggered, the kernel will serially call the handler of each of the sharing EQs with a certain budget of EQEs to poll in order to prevent starvation. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-13net/mlx5: Separate IRQ request/free from EQ life cycleYuval Avnery
Instead of requesting IRQ with eq creation, IRQs will be requested before EQ table creation. Instead of freeing the IRQs after EQ destroy, free IRQs after eq table destroy. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-13net/mlx5: Change interrupt handler to call chain notifierYuval Avnery
Multiple EQs may share the same IRQ in subsequent patches. Instead of calling the IRQ handler directly, the EQ will register to an atomic chain notfier. The Linux built-in shared IRQ is not used because it forces the caller to disable the IRQ and clear affinity before free_irq() can be called. This patch is the first step in the separation of IRQ and EQ logic. Signed-off-by: Yuval Avnery <yuvalav@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-13rdma: Remove nesJason Gunthorpe
This driver was first merged over 10 years ago and has not seen major activity by the authors in the last 7 years. However, in that time it has been patched 150 times to adapt it to changing kernel APIs. Further, the hardware has several issues, like not supporting 64 bit DMA, that make it rather uninteresting for use with modern systems and RDMA. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-11RDMA/ipoib: Remove check for ETH_SS_TESTKamal Heib
The default action for unlisted tests is "not-supported", so given that ipoib doesn't support ETH_SS_TEST, there is no need to check for it in the case statements, just let it get caught by the default: case. Fixes: e3614bc9dc44 ("IB/ipoib: Add readout of statistics using ethtool") Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-11RDMA: Convert CQ allocations to be under core responsibilityLeon Romanovsky
Ensure that CQ is allocated and freed by IB/core and not by drivers. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-11RDMA: Clean destroy CQ in drivers do not return errorsLeon Romanovsky
Like all other destroy commands, .destroy_cq() call is not supposed to fail. In all flows, the attempt to return earlier caused to memory leaks. This patch converts .destroy_cq() to do not return any errors. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-11RDMA/nes: Avoid memory allocation during CQ destroyLeon Romanovsky
The memory allocation call can fail and cause to early return from nes_desotroy_cq() function. This situation will cause to memory leak of struct nes_cq. Rewrite function to avoid memory allocation. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-11IB/hfi1: Correct tid qp rcd to match verbs contextMike Marciniszyn
The qp priv rcd pointer doesn't match the context being used for verbs causing issues when 9B and kdeth packets are processed by different receive contexts and hence different CPUs. When running on different CPUs the following panic can occur: WARNING: CPU: 3 PID: 2584 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0 list_del corruption. prev->next should be ffff9a7ac31f7a30, but was ffff9a7c3bc89230 CPU: 3 PID: 2584 Comm: z_wr_iss Kdump: loaded Tainted: P OE ------------ 3.10.0-862.2.3.el7_lustre.x86_64 #1 Call Trace: <IRQ> [<ffffffffb7b0d78e>] dump_stack+0x19/0x1b [<ffffffffb74916d8>] __warn+0xd8/0x100 [<ffffffffb749175f>] warn_slowpath_fmt+0x5f/0x80 [<ffffffffb7768671>] __list_del_entry+0xa1/0xd0 [<ffffffffc0c7a945>] process_rcv_qp_work+0xb5/0x160 [hfi1] [<ffffffffc0c7bc2b>] handle_receive_interrupt_nodma_rtail+0x20b/0x2b0 [hfi1] [<ffffffffc0c70683>] receive_context_interrupt+0x23/0x40 [hfi1] [<ffffffffb7540a94>] __handle_irq_event_percpu+0x44/0x1c0 [<ffffffffb7540c42>] handle_irq_event_percpu+0x32/0x80 [<ffffffffb7540ccc>] handle_irq_event+0x3c/0x60 [<ffffffffb7543a1f>] handle_edge_irq+0x7f/0x150 [<ffffffffb742d504>] handle_irq+0xe4/0x1a0 [<ffffffffb7b23f7d>] do_IRQ+0x4d/0xf0 [<ffffffffb7b16362>] common_interrupt+0x162/0x162 <EOI> [<ffffffffb775a326>] ? memcpy+0x6/0x110 [<ffffffffc109210d>] ? abd_copy_from_buf_off_cb+0x1d/0x30 [zfs] [<ffffffffc10920f0>] ? abd_copy_to_buf_off_cb+0x30/0x30 [zfs] [<ffffffffc1093257>] abd_iterate_func+0x97/0x120 [zfs] [<ffffffffc10934d9>] abd_copy_from_buf_off+0x39/0x60 [zfs] [<ffffffffc109b828>] arc_write_ready+0x178/0x300 [zfs] [<ffffffffb7b11032>] ? mutex_lock+0x12/0x2f [<ffffffffb7b11032>] ? mutex_lock+0x12/0x2f [<ffffffffc1164d05>] zio_ready+0x65/0x3d0 [zfs] [<ffffffffc04d725e>] ? tsd_get_by_thread+0x2e/0x50 [spl] [<ffffffffc04d1318>] ? taskq_member+0x18/0x30 [spl] [<ffffffffc115ef22>] zio_execute+0xa2/0x100 [zfs] [<ffffffffc04d1d2c>] taskq_thread+0x2ac/0x4f0 [spl] [<ffffffffb74cee80>] ? wake_up_state+0x20/0x20 [<ffffffffc115ee80>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs] [<ffffffffc04d1a80>] ? taskq_thread_spawn+0x60/0x60 [spl] [<ffffffffb74bae31>] kthread+0xd1/0xe0 [<ffffffffb74bad60>] ? insert_kthread_work+0x40/0x40 [<ffffffffb7b1f5f7>] ret_from_fork_nospec_begin+0x21/0x21 [<ffffffffb74bad60>] ? insert_kthread_work+0x40/0x40 Fix by reading the map entry in the same manner as the hardware so that the kdeth and verbs contexts match. Cc: <stable@vger.kernel.org> Fixes: 5190f052a365 ("IB/hfi1: Allow the driver to initialize QP priv struct") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11IB/hfi1: Close PSM sdma_progress sleep windowMike Marciniszyn
The call to sdma_progress() is called outside the wait lock. In this case, there is a race condition where sdma_progress() can return false and the sdma_engine can idle. If that happens, there will be no more sdma interrupts to cause the wakeup and the user_sdma xmit will hang. Fix by moving the lock to enclose the sdma_progress() call. Also, delete busycount. The need for this was removed by: commit bcad29137a97 ("IB/hfi1: Serve the most starved iowait entry first") Cc: <stable@vger.kernel.org> Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Gary Leshner <Gary.S.Leshner@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>