Age | Commit message | Author
2016-04-28 | RDMA/i40iw: Fix for checking if the QP is destroyed | Tatyana Nikolova
Fix for checking if the QP associated with a completion has been destroyed while processing CQ elements. If that is the case, move the CQ head to the next element and continue completion processing. Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Fix for using one sge for RDMA READ | Shiraz Saleem
A check is added to validate the requested sge number. iWARP doesn't support multiple sg elements for RDMA READ work requests. Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Fix for the size of kernel mode SQ | Shiraz Saleem
Fix to calculate the SQ size based on the max frag_count requested by the application instead of overwriting it with the max supported frag_count. Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Fix for a NOP WQE size | Mohammad Khan
Fix for filling in the WQE size for NOP. Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Correct STag mask to min of 14 bits | Chien Tin Tung
STag index mask is calculated incorrectly, missing the 14 bits minimum requirement. Add max macro to use either # of MRs or 14 bits in the mask size calculation. Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
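The mask calculation described above can be sketched as follows. This is a minimal illustration, not the driver's code: the names `MIN_STAG_INDEX_BITS`, `bits_for_count()` and `stag_index_mask()` are hypothetical, and the sketch assumes the mask width is the number of bits needed to index the MRs, floored at 14.

```c
#include <stdint.h>

/* Hypothetical name for the 14-bit minimum mentioned in the commit. */
#define MIN_STAG_INDEX_BITS 14u

/* Smallest number of bits b such that 2^b >= count. */
static unsigned int bits_for_count(uint32_t count)
{
    unsigned int bits = 0;

    while (bits < 32 && (UINT32_C(1) << bits) < count)
        bits++;
    return bits;
}

/* Index mask sized by the larger of the MR count and the 14-bit floor. */
static uint32_t stag_index_mask(uint32_t max_mr)
{
    unsigned int bits = bits_for_count(max_mr);

    if (bits < MIN_STAG_INDEX_BITS)
        bits = MIN_STAG_INDEX_BITS;
    return bits >= 32 ? UINT32_MAX : (UINT32_C(1) << bits) - 1;
}
```

With 1024 MRs the unclamped width would be 10 bits; the clamp raises it to 14, so the mask becomes 0x3FFF.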
2016-04-28 | RDMA/i40iw: Fixes for WQE alignment | Shiraz Saleem
Invalidation after every WQE write is changed to invalidate only if required. NOPs are padded so that WQE writes are aligned to 64B boundary. Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Adding queue drain functions | Ismail, Mustafa
Adding sq and rq drain functions, which block until all previously posted wr-s in the specified queue have completed. A completion object is signaled to unblock the thread, when the last cqe for the corresponding queue is processed. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Fix SD calculation for initial HMC creation | Ismail, Mustafa
Correct SD calculation by using base address returned from commit FPM. This alleviates any assumptions on resource ordering and alignment requirement. Also consolidate SD estimation code into i40iw_est_sd(). Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Fix endian issues and warnings | Ismail, Mustafa
Fix endian warnings and errors due to u32 stored to u16. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Add base memory management extensions | Ismail, Mustafa
Implement fast register mr, Local invalidate, send with invalidate and RDMA read with invalidate. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Initialize max enabled vfs variable | Ismail, Mustafa
Initialize max enabled vfs to max rdma vfs instead of 0. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Correct return code check in add_pble_pool | Ismail, Mustafa
Move return code check to immediately after i40iw_hmc_sd_one call where it is set instead of outside the then statement. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Add virtual channel message queue | Ismail, Mustafa
Queue users of virtual channel on a waitqueue until the channel is clear instead of failing the call when the channel is occupied. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Remove unused code and fix warning | Ismail, Mustafa
Remove unused code and fix warning. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Populate vendor_id and vendor_part_id fields | Ismail, Mustafa
Populate PCI info fields from PCI device structure. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Set vendor_err only if there is an actual error | Ismail, Mustafa
Add a check for cq_poll_info.error before setting vendor_err instead of always setting it. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Add qp table lock around AE processing | Ismail, Mustafa
QP may be freed during Async Event processing. Add a lock around QP table to prevent it. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Do not set self-referencing pointer to NULL after free | Ismail, Mustafa
iwqp->allocated_buffer is a self-referencing pointer to iwqp. Do not set iwqp->allocated_buffer to NULL after freeing it. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Correct max message size in query port | Ismail, Mustafa
Fix to correct max reported message size in query port. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Fix refused connections | Ismail, Mustafa
Make sure cm_node is set up before sending SYN packet and ORD/IRD negotiation. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Correct QP size calculation | Ismail, Mustafa
Include inline data size as part of SQ size calculation. RQ size calculation uses only number of SGEs and does not support 96 byte WQE size. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | RDMA/i40iw: Fix overflow of region length | Ismail, Mustafa
Change region_length to u64 as a region can be > 4GB. Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
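The overflow fixed above is easy to reproduce outside the kernel. The following is a minimal sketch with hypothetical function names; it only shows what happens to a length above 4GB when it is stored in a 32-bit versus a 64-bit variable.

```c
#include <stdint.h>

/* Pre-fix behaviour: the length is squeezed into 32 bits. */
static uint32_t region_length_u32(uint64_t bytes)
{
    return (uint32_t)bytes;   /* silently drops bits 32 and above */
}

/* Post-fix behaviour: a 64-bit length keeps the full value. */
static uint64_t region_length_u64(uint64_t bytes)
{
    return bytes;
}
```

A 5 GiB region (0x140000000 bytes) stored through the 32-bit path comes back as 1 GiB, while the 64-bit path preserves it.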
2016-04-28 | IB/hfi1: Serialize hrtimer function calls | Jubin John
hrtimer functions do not guarantee serialization, so we extend the cca_timer_lock to cover the hrtimer_forward_now() in the hrtimer callback handler and the hrtimer_start() in process_becn(). This prevents races between these two functions when they update the hrtimer state, which led to problems such as: kernel BUG at kernel/hrtimer.c:1282! encountered during validation of the CCA feature. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Fix MAD port poll for active cables | Dean Luick
A MAD directive to start polling must go through the normal link tuning and start steps in order to correctly handle active cables. Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Correctly report neighbor link down reason | Dean Luick
The code to save the link down reason for reporting to the SMA was in a location before the actual reason was read. Move the SMA link down reason assignment to a better location. Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Use the neighbor link down reason only when valid | Dean Luick
The 8051 uses a link down reason to inform the driver why the link went down. The neighbor planned link down reason code is only valid when a link down idle message is received by the 8051. Enhance the explanation on why the link went down. Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Ignore link downgrade with 0 lanes | Dean Luick
Versions of the 8051 firmware < 0.38 may report a link failure as a link downgrade with a width of 0 followed by a link down notification. Ignore the zero width downgrade notification - the driver should follow the link down path. Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Add RSM rule for user FECN handling | Dean Luick
Add a receive side mapping rule to extract expected user packets with the FECN bit set and place them in an eager buffer. This will allow user libraries to recognize that a FECN was sent when using header suppression and respond appropriately. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Create a routine to set a receive side mapping rule | Dean Luick
Move the rule setting code into its own routine for improved searchability and reuse. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Move QOS decision logic into its own function | Dean Luick
The decision to use QOS affects other resource allocation. Move the QOS decision logic into its own function so it can be called by other interested parties. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Extract RSM map table init from QOS | Dean Luick
Refactor the allocation, tracking, and writing of the RSM map table into its own set of routines. This will allow the map table to be passed to multiple users to fill in as needed. Start with the original user, QOS. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Reduce kernel context pio buffer allocation | Jianxin Xiong
The pio buffers were pooled evenly among all kernel contexts and user contexts. However, the demand from kernel contexts is much lower than user contexts. This patch reduces the allocation for kernel contexts and thus makes more credits available for PSM, helping performance. This is especially useful on high core-count systems where large numbers of contexts are used. A new context type SC_VL15 is added to distinguish the context used for VL15 from other kernel contexts. The reason is that VL15 needs to support 2KB sized packets while other kernel contexts need only support packets up to the size determined by "piothreshold", which has a default value of 256. The new allocation method allows triple buffering of the largest pio packets configured for these contexts. This is sufficient to maintain verbs performance. The largest pio packet size is 2048B for VL15 and "piothreshold" for other kernel contexts. A cap is applied to "piothreshold" to avoid excessive buffer allocation. The special case where SDMA is disabled is handled differently. In that case, the original pooling allocation is used to better support the much higher pio traffic. Note that if adaptive pio is disabled (piothreshold==0), the pio buffer size doesn't matter for non-VL15 kernel send contexts when SDMA is enabled because pio is not used at all on these contexts and thus the new allocation is still valid. If SDMA is disabled then pooling allocation is used, as described above. An adjustment is also made to the calculation of the credit return threshold for the kernel contexts. Instead of being based purely on the MTU size, a percentage-based threshold is also considered and the smaller of the two is chosen. This is necessary to ensure that with the reduced buffer allocation credits are returned in time to avoid unnecessary stalls in the send path.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mark Debbage <mark.debbage@intel.com> Reviewed-by: Jubin John <jubin.john@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
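The credit return threshold rule from the commit above (take the smaller of an MTU-based and a percentage-based value) can be sketched as follows. All names here are hypothetical, and the 64-byte credit size and the exact form of the two formulas are assumptions made for illustration, not taken from the commit text.

```c
#include <stdint.h>

#define PIO_BLOCK_SIZE 64u  /* assumed number of bytes covered by one credit */

/* MTU-based candidate: leave room for one MTU-sized packet, minimum 1. */
static uint32_t mtu_to_threshold(uint32_t credits, uint32_t mtu_bytes)
{
    uint32_t release = (mtu_bytes + PIO_BLOCK_SIZE - 1) / PIO_BLOCK_SIZE;

    return credits <= release ? 1 : credits - release;
}

/* Percentage-based candidate: a fixed fraction of the context's credits. */
static uint32_t percent_to_threshold(uint32_t credits, uint32_t percent)
{
    return (uint32_t)(((uint64_t)credits * percent) / 100);
}

/* The commit's rule: use whichever candidate is smaller. */
static uint32_t kernel_credit_threshold(uint32_t credits, uint32_t mtu_bytes,
                                        uint32_t percent)
{
    uint32_t by_mtu = mtu_to_threshold(credits, mtu_bytes);
    uint32_t by_pct = percent_to_threshold(credits, percent);

    return by_mtu < by_pct ? by_mtu : by_pct;
}
```

Under these assumptions, a context with 100 credits, a 2048-byte MTU and a 50 percent setting gets 68 from the MTU rule and 50 from the percentage rule, so 50 is used. With a small allocation the percentage rule tends to win, which matches the commit's motivation of returning credits early enough once buffers are reduced.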
2016-04-28 | IB/hfi1: Change default number of user contexts | Jubin John
Change the default number of user contexts to the number of real (non-HT) cpu cores in order to reduce the division of hfi1 hardware contexts in the case of high core counts with hyper-threading enabled. Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Use global defines for upper bits in opcode | Mike Marciniszyn
The awkward coding for setting the allowed_ops field was tripping an smatch warning. This patch uses the more appropriate defines from include/rdma to avoid the issue. As part of the patch remove a mask that was duplicated in rdmavt include files and use that mask as appropriate. Fixes: 8bea6b1cfe6f ("IB/rdmavt: Add create queue pair functionality") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Remove unreachable code | Mike Marciniszyn
Remove unreachable code from RC ack handling to fix an smatch error. Fixes: 633d27399514 ("staging/rdma/hfi1: use mod_timer when appropriate") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Fix double QSFP resource acquire on cache refresh | Dean Luick
The function refresh_qsfp_cache() acquires the i2c chain resource, but one caller already holds the resource. Change the acquire so all calls to refresh_qsfp_cache() are covered by the acquire and remove the acquire within refresh_qsfp_cache(). Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Guard against concurrent I2C access across all chains | Dean Luick
The discrete ASIC board design makes the two I2C chains not independent of each other. That is, only one chain can safely be accessed at a time. For discrete ASIC devices, adjust the resource locking so that access to one I2C chain will lock both of the chains. Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Remove module presence check outside pre-LNI checks | Easwar Hariharan
The pre-LNI SerDes and channel tuning algorithm already checks for module presence assertion for the relevant port types. The extraneous check removed in this patch blocks link up for port types for which the module presence assertion is not relevant. Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Always turn on CDRs for low power QSFP modules | Easwar Hariharan
Clock and data recovery mechanisms (CDRs) in active QSFP modules can be turned on or off to improve the bit error rate observed on the channel. Signal integrity and bit error rate requirements require us to always turn on any CDRs present in low power cables (power dissipation 2.5W or lower). However, we adhere to the platform designer's settings (provided in the platform configuration) for higher power cables (dissipation 3.5W or higher) if the platform designer has determined that the platform requires the CDRs to be turned on (or off) and is capable of supplying and cooling the higher power modules. This patch also introduces the get_qsfp_power_class function to centralize the bit twiddling required to determine the QSFP power class across the code. Reusing this function improves the readability of code that depends on knowing the power class of the cable, such as the active and optical channel tuning algorithm. Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
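The kind of bit twiddling that a helper such as get_qsfp_power_class centralizes might look like the sketch below. It is not the driver's implementation: the function names are hypothetical, and the byte layout (bits 7-6 encoding classes 1 to 4, bits 1-0 encoding classes 5 to 7 with 0 meaning unused) is an assumption based on the SFF-8636 extended identifier byte. Treating classes 1 to 3 as the "2.5W or lower" group is likewise an assumption.

```c
#include <stdint.h>

/* Decode a QSFP power class (1..7) from the assumed power byte layout. */
static int qsfp_power_class(uint8_t power_byte)
{
    unsigned int low = (power_byte >> 6) & 0x3;  /* classes 1-4, encoded 0-3 */
    unsigned int high = power_byte & 0x3;        /* classes 5-7, 0 = unused */

    if (high == 0)
        return (int)low + 1;    /* classes count from 1, encodings from 0 */
    return (int)high + 4;       /* encoding 1 maps to class 5, and so on */
}

/* Policy from the commit: low power modules always get their CDRs enabled. */
static int cdr_always_on(int power_class)
{
    return power_class <= 3;    /* assumed to correspond to 2.5W or lower */
}
```

Callers that need the power class, such as a channel tuning routine, would then compare against a class number instead of repeating the masks and shifts.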
2016-04-28 | IB/hfi1: Check P_KEY for all sent packets from user mode | Sebastian Sanchez
Add the P_KEY check for user-context mechanism for both PIO and SDMA. For PIO, the SendCtxtCheckEnable.DisallowKDETHPackets is set by default. When the P_KEY is set, SendCtxtCheckEnable.DisallowKDETHPackets is cleared. For SDMA, a software check was included. This change requires user processes to set the P_KEY before sending any packets, otherwise, the sent packet will fail. The original submission didn't have this check but it's required. Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mikto Haralanov <mitko.haralanov@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Adjust default MTU to be 10KB | Sebastian Sanchez
Increasing the default MTU size to 10KB improves performance for PSM. Change the default MTU to 10KB but constrain Verbs MTU to 8KB. Also update default MTU module parameter description to be HFI1_DEFAULT_MAX_MTU. Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Jubin John <jubin.john@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Simplify init_qpmap_table() | Dean Luick
Make init_qpmap_table() easier to understand by simplifying the loop indexing and writing each register when it is "full", removing the need for a follow-on register write. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
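The "write each register when it is full" loop shape described above can be sketched as follows. The table geometry (256 one-byte entries packed eight per 64-bit register) and the round-robin context assignment are assumptions for illustration, and the array store stands in for a hardware register write.

```c
#include <stdint.h>

#define QPMAP_ENTRIES   256u  /* assumed number of map entries */
#define ENTRIES_PER_REG 8u    /* assumed: eight 8-bit entries per 64-bit register */

/* Fill the map round-robin over [first_ctxt, last_ctxt]. */
static void fill_qpmap(uint64_t regs[QPMAP_ENTRIES / ENTRIES_PER_REG],
                       uint8_t first_ctxt, uint8_t last_ctxt)
{
    uint64_t reg = 0;
    uint8_t ctxt = first_ctxt;
    unsigned int i;

    for (i = 0; i < QPMAP_ENTRIES; i++) {
        unsigned int idx = i % ENTRIES_PER_REG;

        reg |= (uint64_t)ctxt << (8 * idx);
        ctxt = (ctxt >= last_ctxt) ? first_ctxt : (uint8_t)(ctxt + 1);

        if (idx == ENTRIES_PER_REG - 1) {
            /* Register is full: write it out and start the next one. */
            regs[i / ENTRIES_PER_REG] = reg;
            reg = 0;
        }
    }
}
```

Because the entry count is a multiple of the entries per register, the last register is written inside the loop and no trailing write is needed, which is the simplification the commit describes.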
2016-04-28 | IB/hfi1: Correctly obtain the full service class | Dean Luick
The function hdr2sc was using an unshifted mask to obtain the 5th bit of the service class. Correct the issue by using the shifted mask. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
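The difference between a shifted and an unshifted mask can be shown with a small sketch. The bit position and all names below are hypothetical and are not the hfi1 definitions; the point is only that an unshifted mask applied to the raw word tests the wrong bit.

```c
#include <stdint.h>

/* Hypothetical layout: the flag carrying SC bit 4 sits at bit 5 of a flags word. */
#define SC4_SHIFT 5
#define SC4_MASK  UINT64_C(0x1)            /* unshifted: valid after shifting right */
#define SC4_SMASK (SC4_MASK << SC4_SHIFT)  /* shifted: valid on the raw word */

/* Buggy form: tests bit 0 of the raw word instead of bit 5. */
static unsigned int sc_unshifted_mask(unsigned int sc_low4, uint64_t flags)
{
    return (sc_low4 & 0xf) | ((unsigned int)!!(flags & SC4_MASK) << 4);
}

/* Fixed form: the shifted mask selects the intended bit. */
static unsigned int sc_shifted_mask(unsigned int sc_low4, uint64_t flags)
{
    return (sc_low4 & 0xf) | ((unsigned int)!!(flags & SC4_SMASK) << 4);
}
```

With the flag set at bit 5 and low bits 0x3, the fixed form yields 0x13 while the buggy form yields 0x03; an unrelated bit 0 has the opposite effect.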
2016-04-28 | IB/hfi1: Fix QOS rule mappings | Dean Luick
The QOS RSM rule mappings are off by one, referencing a kernel receive context that does not exist. Correctly start the QOS RSM map entries at FIRST_KERNEL_CONTEXT rather than MIN_KERNEL_KCTXTS. Remove the cruft that hid this. Change the QP map table so all traffic not caught by QOS RSM goes to the control context rather than the first QOS context. Correct comments to match the actual code operation and intent. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Remove invalid QOS check | Dean Luick
Remove an invalid compare of the number of QOS RSM map table entries against the number of physical receive contexts. The RSM map table has its own size and has no relation to the number of physical receive contexts. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Fix QOS num_vl bit width | Dean Luick
The bit width for num_vls, n, needs to be calculated based on the pow2 rounded up of the number of vls. Otherwise num_vls of 3, 5, 6, and 7 will have misplaced QOS RSM map entries. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
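The arithmetic behind this fix can be checked with a short sketch. The helper names are hypothetical userspace stand-ins, and the sketch assumes the earlier calculation took a plain floor of log2, which is too narrow whenever num_vls is not a power of two.

```c
#include <stdint.h>

/* floor(log2(v)); v must be nonzero. */
static unsigned int floor_log2(uint32_t v)
{
    unsigned int n = 0;

    while (v >>= 1)
        n++;
    return n;
}

/* Smallest power of two >= v; v must be in 1..2^31. */
static uint32_t roundup_pow2(uint32_t v)
{
    uint32_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* Assumed pre-fix width: too small for 3, 5, 6 and 7 VLs. */
static unsigned int vl_bits_truncating(uint32_t num_vls)
{
    return floor_log2(num_vls);
}

/* Width after rounding up to a power of two first. */
static unsigned int vl_bits_rounded(uint32_t num_vls)
{
    return floor_log2(roundup_pow2(num_vls));
}
```

For 3 VLs the truncating form gives 1 bit, which can only distinguish two VLs, while the rounded form gives 2 bits. For 5, 6 and 7 VLs the results are 2 versus 3 bits.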
2016-04-28 | IB/hfi1: Fix i2c resource reservation checks | Dean Luick
The i2c and qsfp read/write routines should check for the resource reservation of the incoming argument target rather than the implicit target of the hardware HFI. Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28 | IB/hfi1: Fix sysfs file offset usage | Dean Luick
Two sysfs files do not pay attention to the file offset when reading data. Fix that. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
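What honoring the file offset means for a read handler can be sketched in plain C. The helper below is a hypothetical userspace stand-in, not the driver's code: it copies from a backing buffer starting at the requested offset and reports how many bytes were actually available.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy up to count bytes from src, starting at offset off, into dst.
 * Returns the number of bytes copied; 0 signals end of data.
 */
static size_t read_at_offset(char *dst, size_t count, size_t off,
                             const char *src, size_t src_len)
{
    if (off >= src_len)
        return 0;                 /* reading at or past the end */
    if (count > src_len - off)
        count = src_len - off;    /* clamp to the remaining bytes */
    memcpy(dst, src + off, count);
    return count;
}
```

A handler that ignores the offset returns the start of the data on every call, so a reader that loops until it sees 0 either never terminates or receives duplicated content.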
2016-04-28 | IB/rdmavt,hfi1,qib: Fix memory leak | Jubin John
rdi->ports has memory allocated in rvt_alloc_device(), but does not get freed because the hfi1 and qib drivers call ib_dealloc_device() directly instead of going through rdmavt. Add a rvt_dealloc_device() that frees rdi->ports and then calls ib_dealloc_device(). Switch hfi1 and qib drivers to calling rvt_dealloc_device() instead of ib_dealloc_device() directly. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
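The shape of this leak and its fix can be modelled in userspace. Everything prefixed with `demo_` below is a hypothetical stand-in: `demo_core_dealloc()` plays the role of ib_dealloc_device(), `demo_dealloc_device()` plays the role of the new rvt_dealloc_device(), and `live_allocs` is a crude counter used only to make the leak visible.

```c
#include <stdlib.h>

static int live_allocs;  /* number of outstanding demo allocations */

static void *demo_alloc(size_t n)
{
    void *p = calloc(1, n);

    if (p)
        live_allocs++;
    return p;
}

static void demo_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

struct demo_rdi {
    void **ports;  /* allocated by the wrapper layer, like rdi->ports */
};

/* Allocates the device and, separately, its ports array. */
static struct demo_rdi *demo_alloc_device(size_t nports)
{
    struct demo_rdi *rdi = demo_alloc(sizeof(*rdi));

    if (!rdi)
        return NULL;
    rdi->ports = demo_alloc(nports * sizeof(*rdi->ports));
    if (!rdi->ports) {
        demo_free(rdi);
        return NULL;
    }
    return rdi;
}

/* Core-layer release: knows nothing about ports, so calling it alone leaks. */
static void demo_core_dealloc(struct demo_rdi *rdi)
{
    demo_free(rdi);
}

/* Wrapper release: free what the wrapper allocated, then defer to the core. */
static void demo_dealloc_device(struct demo_rdi *rdi)
{
    demo_free(rdi->ports);
    demo_core_dealloc(rdi);
}
```

The design point is symmetry: the layer that allocates extra state in its alloc wrapper also owns a dealloc wrapper, and drivers call only the wrapper.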
2016-04-28 | IB/hfi1: Fix buffer cache races which may cause corruption | Mitko Haralanov
There are two possible causes for node/memory corruption, both of which are related to the cache eviction algorithm. One cause of corruption is the asynchronous nature of the MMU invalidation and the locking used when invalidating a node. The MMU invalidation routine would temporarily release the RB tree lock to avoid a deadlock. However, this would allow the eviction function to take the lock, resulting in the removal of cache nodes. If the node being removed by the eviction code is the same as the node being invalidated, the result is a use after free. The same is true in the other direction due to the temporary release of the eviction list lock in the eviction loop. Another corner case exists when dealing with the SDMA buffer cache that could cause corruption of kernel memory. The most common way in which this corruption exhibits itself is a linked list node corruption. In that case, the kernel will complain that a node with poisoned pointers is being removed. The fact that the pointers are already poisoned means that the node has already been removed from the list. The root cause of this corruption was a mishandling of the eviction list maintained by the driver. In order for this to happen four conditions need to be satisfied: 1. A node describing a user buffer already exists in the interval RB tree, 2. The beginning of the current user buffer matches that node but the buffer is bigger. This will cause the node to be extended. 3. The amount of cached buffers is close to or at the limit of the buffer cache size. 4. The node has dropped close to the end of the eviction list. This will cause the node to be considered for eviction. If all of the above conditions have been satisfied, it is possible for the eviction algorithm to evict the current node, which will free the node without the driver knowing.
To solve both issues described above: - the locking around the MMU invalidation loop and cache eviction loop has been improved so locks are not released in the loop body, - a new RB function is introduced which will "atomically" find and remove the matching node from the RB tree, preventing the MMU invalidation loop from touching it, and - the node being extended by the pin_vector_pages() function is removed from the eviction list prior to calling the eviction function. Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>