summaryrefslogtreecommitdiff
path: root/drivers/nvme
AgeCommit message (Collapse)Author
2019-04-18Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU and LKMM commits from Paul E. McKenney: - An LKMM commit adding support for synchronize_srcu_expedited() - A couple of straggling RCU flavor consolidation updates - Documentation updates. - Miscellaneous fixes - SRCU updates - RCU CPU stall-warning updates - Torture-test updates Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-12scsi: scsi_transport_fc: nvme: display FC-NVMe port rolesHannes Reinecke
Currently the FC-NVMe driver is leverating the SCSI FC transport class to access the remote ports. Which means that all FC-NVMe remote ports will be visible to the fc transport layer, but due to missing definitions the port roles will always be 'unknown'. This patch adds the missing definitions to the fc transport class to that the port roles are correctly displayed. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Giridhar Malavali <gmalavali@marvell.com> Reviewed-by: Himanshu Madhani <hmadhani@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-04-11nvmet: fix discover log page when offsets are usedKeith Busch
The nvme target hadn't been taking the Get Log Page offset parameter into consideration, and so has been returning corrupted log pages when offsets are used. Since many tools, including nvme-cli, split the log request to 4k, we've been breaking discovery log responses when more than 3 subsystems exist. Fix the returned data by internally generating the entire discovery log page and copying only the requested bytes into the user buffer. The command log page offset type has been modified to a native __le64 to make it easier to extract the value from a command. Signed-off-by: Keith Busch <keith.busch@intel.com> Tested-by: Minwoo Im <minwoo.im@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-11nvme-fc: correct csn initialization and increments on errorJames Smart
This patch fixes a long-standing bug that initialized the FC-NVME cmnd iu CSN value to 1. Early FC-NVME specs had the connection starting with CSN=1. By the time the spec reached approval, the language had changed to state a connection should start with CSN=0. This patch corrects the initialization value for FC-NVME connections. Additionally, in reviewing the transport, the CSN value is assigned to the new IU early in the start routine. It's possible that a later dma map request may fail, causing the command to never be sent to the controller. Change the location of the assignment so that it is immediately prior to calling the lldd. Add a comment block to explain the impacts if the lldd were to additionally fail sending the command. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-10nvme: cancel request synchronouslyMing Lei
nvme_cancel_request() is used in error handler, and it is always reliable to cancel request synchronously, and avoids possible race in which request may be completed after real hw queue is destroyed. One issue is reported by our customer on NVMe RDMA, in which freed ib queue pair may be used in nvme_rdma_complete_rq(). Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Bart Van Assche <bvanassche@acm.org> Cc: James Smart <james.smart@broadcom.com> Cc: linux-nvme@lists.infradead.org Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-05nvme: log the error status on Identify Namespace failureKenneth Heitke
Identify Namespace failures are logged as a warning but there is not an indication of the cause for the failure. Update the log message to include the error status. Signed-off-by: Kenneth Heitke <kenneth.heitke@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvmet: add safety check for subsystem lock during nvmet_ns_changedMax Gurtovoy
we need to make sure that subsystem lock is taken during ctrl's list traversing. nvmet_ns_changed function is not static and can be used from various callers simultaneously. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvmet: never fail double namespace enablementMax Gurtovoy
In case we create N namespaces while N < NVMET_MAX_NAMESPACES, we can perform "echo 1 > <nsid>/enable" as much as we want. In case N == NVMET_MAX_NAMESPACES we fail. Make sure we have the same flow for any N. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvme-pci: tidy up nvme_map_dataChristoph Hellwig
Remove two pointless local variables, remove ret assignment that is never used, move the use_sgl initialization closer to where it is used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: optimize mapping single segment requests using SGLsChristoph Hellwig
If the controller supports SGLs we can take another short cut for single segment request, given that we can always map those without another indirection structure, and thus don't need to create a scatterlist structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: optimize mapping of small single segment requestsChristoph Hellwig
If a request is single segment and fits into one or two PRP entries we do not have to create a scatterlist for it, but can just map the bio_vec directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: remove the inline scatterlist optimizationChristoph Hellwig
We'll have a better way to optimize for small I/O that doesn't require it soon, so remove the existing inline_sg case to make that optimization easier to implement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_dataChristoph Hellwig
This prepares for some bigger changes to the data mapping helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: do not build a scatterlist to map metadataChristoph Hellwig
We always have exactly one segment, so we can simply call dma_map_bvec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: only call nvme_unmap_data for requests transferring dataChristoph Hellwig
This mirrors how nvme_map_pci is called and will allow simplifying some checks in nvme_unmap_pci later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: merge nvme_free_iod into nvme_unmap_dataChristoph Hellwig
This means we now have a function that undoes everything nvme_map_data does and we can simplify the error handling a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: move the call to nvme_cleanup_cmd out of nvme_unmap_dataChristoph Hellwig
Cleaning up the command setup isn't related to unmapping data, and disentangling them will simplify error handling a bit down the road. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: remove nvme_init_iodChristoph Hellwig
nvme_init_iod should really be split into two parts: initialize a few general iod fields, which can easily be done at the beginning of nvme_queue_rq, and allocating the scatterlist if needed, which logically belongs into nvme_map_data with the code making use of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme-pci: remove unused nvme_iod memberKeith Busch
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvme-pci: remove q_dmadev from nvme_queueKeith Busch
We don't need to save the dma device as it's not used in the hot path and hasn't in a long time. Shrink the struct nvme_queue removing this unnecessary member. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvme-pci: use a flag for polled queuesKeith Busch
A negative value for the cq_vector used to mean the queue is either disabled or a polled queue. However, we have a queue enabled flag, so the cq_vector had been serving double duty. Don't overload the meaning of cq_vector. Use a flag specific to the polled queues instead. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvmet-tcp: implement C2HData SUCCESS optimizationSagi Grimberg
TP 8000 says that the use of the SUCCESS flag depends on weather the controller support disabling sq_head pointer updates. Given that we support it by default, makes sense that we go the extra mile to actually use the SUCCESS flag. When we create the C2HData PDU header, we check if sqhd_disabled is set on our queue, if so, we set the SUCCESS flag in the PDU header and skip sending a completion response capsule. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Oliver Smith-Denny <osmithde@cisco.com> Tested-by: Oliver Smith-Denny <osmithde@cisco.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvmet-fc: use zero-sized array and struct_size() in kzalloc()Gustavo A. R. Silva
Update the code to use a zero-sized array instead of a pointer in structure nvmet_fc_tgt_queue and use struct_size() in kzalloc(). Notice that one of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + sizeof(struct boo) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05nvmet: avoid double errno conversionsChristoph Hellwig
Use errno_to_nvme_status to convert from a negative errno to a nvme status field instead of going through a blk_status_t. Also remove the pointless status variable in nvmet_bdev_execute_write_zeroes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-04-05nvme: avoid double dereference to convert le to cpuMax Gurtovoy
Use le16_to_cpu instead of le16_to_cpup and le64_to_cpu instead of le64_to_cpup. This will also align the code to nvme-core driver convention. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-03-28nvmet: fix error flow during ns enableMax Gurtovoy
In case we fail to enable p2pmem on the current namespace, disable the backing store device before exiting. Cc: Stephen Bates <sbates@raithlin.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-03-28nvmet: fix building bvec from sg listMing Lei
There are two mistakes for building bvec from sg list for file backed ns: - use request data length to compute number of io vector, this way doesn't consider sg->offset, and the result may be smaller than required io vectors - bvec->bv_len isn't capped by sg->length This patch fixes this issue by building bvec from sg directly, given the whole IO stack is ready for multi-page bvec. Reported-by: Yi Zhang <yi.zhang@redhat.com> Fixes: 3a85a5de29ea ("nvme-loop: add a NVMe loopback host driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-03-28nvme-multipath: relax ANA state checkMartin George
When undergoing state transitions I/O might be requeued, hence we should always call nvme_mpath_set_live() to schedule requeue_work whenever the nvme device is live, independent on whether the old state was live or not. Signed-off-by: Martin George <marting@netapp.com> Signed-off-by: Gargi Srinivas <sring@netapp.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-03-28nvme-tcp: fix an endianess miss-annotationChristoph Hellwig
nvme_tcp_end_request just takes the status value and the converts it to little endian as well as shifting for the phase bit. Fixes: 43ce38a6d823 ("nvme-tcp: support C2HData with SUCCESS flag") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-03-26srcu: Remove cleanup_srcu_struct_quiesced()Paul E. McKenney
The cleanup_srcu_struct_quiesced() function was added because NVME used WQ_MEM_RECLAIM workqueues and SRCU did not, which meant that NVME workqueues waiting on SRCU workqueues could result in deadlocks during low-memory conditions. However, SRCU now also has WQ_MEM_RECLAIM workqueues, so there is no longer a potential for deadlock. Furthermore, it turns out to be extremely hard to use cleanup_srcu_struct_quiesced() correctly due to the fact that SRCU callback invocation accesses the srcu_struct structure's per-CPU data area just after callbacks are invoked. Therefore, the usual practice of using srcu_barrier() to wait for callbacks to be invoked before invoking cleanup_srcu_struct_quiesced() fails because SRCU's callback-invocation workqueue handler might be delayed, which can result in cleanup_srcu_struct_quiesced() being invoked (and thus freeing the per-CPU data) before the SRCU's callback-invocation workqueue handler is finished using that per-CPU data. Nor is this a theoretical problem: KASAN emitted use-after-free warnings because of this problem on actual runs. In short, NVME can now safely invoke cleanup_srcu_struct(), which avoids the use-after-free scenario. And cleanup_srcu_struct_quiesced() is quite difficult to use safely. This commit therefore removes cleanup_srcu_struct_quiesced(), switching its sole user back to cleanup_srcu_struct(). This effectively reverts the following pair of commits: f7194ac32ca2 ("srcu: Add cleanup_srcu_struct_quiesced()") 4317228ad9b8 ("nvme: Avoid flush dependency in delete controller flow") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org>
2019-03-16Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block layer changes from Jens Axboe: "This is a collection of both stragglers, and fixes that came in after I finalized the initial pull. This contains: - An MD pull request from Song, with a few minor fixes - Set of NVMe patches via Christoph - Pull request from Konrad, with a few fixes for xen/blkback - pblk fix IO calculation fix (Javier) - Segment calculation fix for pass-through (Ming) - Fallthrough annotation for blkcg (Mathieu)" * tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits) blkcg: annotate implicit fall through nvme-tcp: support C2HData with SUCCESS flag nvmet: ignore EOPNOTSUPP for discard nvme: add proper write zeroes setup for the multipath device nvme: add proper discard setup for the multipath device nvme: remove nvme_ns_config_oncs nvme: disable Write Zeroes for qemu controllers nvmet-fc: bring Disconnect into compliance with FC-NVME spec nvmet-fc: fix issues with targetport assoc_list list walking nvme-fc: reject reconnect if io queue count is reduced to zero nvme-fc: fix numa_node when dev is null nvme-fc: use nr_phys_segments to determine existence of sgl nvme-loop: init nvmet_ctrl fatal_err_work when allocate nvme: update comment to make the code easier to read nvme: put ns_head ref if namespace fails allocation nvme-trace: fix cdw10 buffer overrun nvme: don't warn on block content change effects nvme: add get-feature to admin cmds tracer md: Fix failed allocation of md_register_thread It's wrong to add len to sector_nr in raid10 reshape twice ...
2019-03-13nvme-tcp: support C2HData with SUCCESS flagSagi Grimberg
A C2HData PDU with the SUCCESS flag set indicates that the I/O was completed by the controller successfully and means that a subsequent completion response capsule PDU will be ommitted. If we see this flag, fisrt we check that LAST_PDU flag is set as well, and then we complete the request when the data transfer (and data digest verification if its on) is done. While we're at it, reuse a bit of code with nvme_fail_request. Reported-by: Steve Blightman <steve.blightman@oracle.com> Suggested-by: Oliver Smith-Denny <osmithde@cisco.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Oliver Smith-Denny <osmithde@cisco.com> Tested-by: Oliver Smith-Denny <osmithde@cisco.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvmet: ignore EOPNOTSUPP for discardChristoph Hellwig
NVMe DSM is a pure hint, so if the underlying device / file system does not support discard-like operations we should not fail the operation but rather return success. Fixes: 3b031d15995f ("nvmet: add error log support for bdev backend") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme: add proper write zeroes setup for the multipath deviceChristoph Hellwig
Add a gendisk argument to nvme_config_write_zeroes so that the call to nvme_update_disk_info for the multipath device node updates the proper request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme: add proper discard setup for the multipath deviceChristoph Hellwig
Add a gendisk argument to nvme_config_discard so that the call to nvme_update_disk_info for the multipath device node updates the proper request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme: remove nvme_ns_config_oncsChristoph Hellwig
Just opencode the two function calls in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme: disable Write Zeroes for qemu controllersChristoph Hellwig
Qemu started out with a broken implementation of Write Zeroes written by yours truly. Disable Write Zeroes on qemu for now, eventually we need to go back and make all the qemu quirks version specific, but that is left for another time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvmet-fc: bring Disconnect into compliance with FC-NVME specJames Smart
The FC-NVME spec, when finally approved, modified the disconnect LS such that the only scope available is the association. Rework the Disconnect LS processing to be in accordance with the change. Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvmet-fc: fix issues with targetport assoc_list list walkingJames Smart
There are two changes: 1) The logic in the __nvmet_fc_free_assoc() routine is bad. It uses "safe" routines assuming pointers will come back valid. However, the intervening next structure being linked can be removed from the list and the resulting safe pointers are bad, resulting in NULL ptrs being hit. Correct by scheduling a work element to perform the association delete, which can be done while under lock. 2) Prior patch that added the work element scheduling left a possible reference on the object if the work element couldn't be scheduled. Correct by doing the put on a failing schedule_work() call. Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme-fc: reject reconnect if io queue count is reduced to zeroJames Smart
If: - A successful connect has occurred with an io queue count greater than zero and namespaces detected and running. - An error or something occurs which causes a termination of the prior association and then starts a reconnect, - The reconnect then creates a new controller, but for whatever reason, nvme_set_queue_count() results in io queue count set to zero. This will skip io queue and tag set changes. - But... the controller will transition to live, calling nvme_start_ctrl, which calls nvme_start_queues(), which then releases I/Os into the transport which then sends them to the driver. As there are no queues, things eventually hit the driver looking for a handle, which was cleared when the original controller was reset, and it can't proceed. Worst case, things progress, but everything fails. In the failing scenario, the nvme_set_features(NVME_FEAT_NUM_QUEUES) command actually failed with a NVME_SC_INTERNAL error. For some reason, although nvme_set_queue_count() saw the error and set io queue count to zero, it doesn't return a failure status to the transport, which allows the transport to continue using the controller. Fix the problem by simply rejecting the new association if at least 1 I/O queue can't be created. The association reject will fail the reconnect attempt and fall into the reconnect retry policy. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme-fc: fix numa_node when dev is nullJames Smart
A recent change added a numa_node field to the nvme controller and has the transport assign the node using dev_to_node(). However, fcloop registers with a NULL device struct, so the dev_to_node() call oops. Revise the assignment to assign no node when device struct is null. Fixes: 103e515efa89b ("nvme: add a numa_node field to struct nvme_ctrl") Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> [hch: small coding style fixup] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme-fc: use nr_phys_segments to determine existence of sglJames Smart
For some nvme command, when issued by the nvme core layer, there is an internal buffer which can cause blk_rq_payload_bytes() to return a non-zero value yet there is no actual/real command payload and sg list. An example is the WRITE ZEROES command. To address this, when making choices on whether to dma map an sgl, use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes(). When there is a sgl, blk_rq_payload_bytes() will return the amount of data to be transferred by the sgl. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme-loop: init nvmet_ctrl fatal_err_work when allocateYufen Yu
After commit 4d43d395fe (workqueue: Try to catch flush_work() without INIT_WORK()), it can cause warning when delete nvme-loop device, trace like: [ 76.601272] Call Trace: [ 76.601646] ? del_timer+0x72/0xa0 [ 76.602156] __cancel_work_timer+0x1ae/0x270 [ 76.602791] cancel_work_sync+0x14/0x20 [ 76.603407] nvmet_ctrl_free+0x1b7/0x2f0 [nvmet] [ 76.604091] ? free_percpu+0x168/0x300 [ 76.604652] nvmet_sq_destroy+0x106/0x240 [nvmet] [ 76.605346] nvme_loop_destroy_admin_queue+0x30/0x60 [nvme_loop] [ 76.606220] nvme_loop_shutdown_ctrl+0xc3/0xf0 [nvme_loop] [ 76.607026] nvme_loop_delete_ctrl_host+0x19/0x30 [nvme_loop] [ 76.607871] nvme_do_delete_ctrl+0x75/0xb0 [ 76.608477] nvme_sysfs_delete+0x7d/0xc0 [ 76.609057] dev_attr_store+0x24/0x40 [ 76.609603] sysfs_kf_write+0x4c/0x60 [ 76.610144] kernfs_fop_write+0x19a/0x260 [ 76.610742] __vfs_write+0x1c/0x60 [ 76.611246] vfs_write+0xfa/0x280 [ 76.611739] ksys_write+0x6e/0x120 [ 76.612238] __x64_sys_write+0x1e/0x30 [ 76.612787] do_syscall_64+0xbf/0x3a0 [ 76.613329] entry_SYSCALL_64_after_hwframe+0x44/0xa9 We fix it by moving fatal_err_work init to nvmet_alloc_ctrl(), which may more reasonable. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme: update comment to make the code easier to readYufen Yu
After commit a686ed75c0fb ("nvme: introduce a helper function for controller deletion), nvme_delete_ctrl_sync no longer use flush_work. Update comment, accordingly. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme: put ns_head ref if namespace fails allocationSagi Grimberg
In case nvme_alloc_ns fails after we initialize ns_head but before we add the ns to the controller namespaces list we need to explicitly put the ns_head reference because when we teardown the controller we won't find it, causing us to leak a dangling subsystem eventually. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme-trace: fix cdw10 buffer overrunKeith Busch
The field is defined to be a 24 byte array, we don't need to multiply the sizeof() that field by the number of dwords it covers. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme: don't warn on block content change effectsKeith Busch
A write or flush IO passthrough command is expected to change the logical block content, so don't warn on these as no additional handling is necessary. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13nvme: add get-feature to admin cmds tracerMax Gurtovoy
This will print get-feature cmd in more informative way. For example, run "nvme get-feature /dev/nvme0 -n 1 -f 0x9 -c 10" will trace: nvme-3907 [008] .... 1763.635054: nvme_setup_cmd: nvme0: qid=0, cmdid=6, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_admin_get_features fid=0x9 sel=0x0 cdw11=0xa) <idle>-0 [001] d.h. 1763.635112: nvme_sq: nvme0: qid=0, head=27, tail=27 <idle>-0 [008] ..s. 1763.635121: nvme_complete_rq: nvme0: qid=0, cmdid=6, res=10, retries=0, flags=0x2, status=0 Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-08Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "Not a huge amount of changes in this round, the biggest one is that we finally have Mings multi-page bvec support merged. Apart from that, this pull request contains: - Small series that avoids quiescing the queue for sysfs changes that match what we currently have (Aleksei) - Series of bcache fixes (via Coly) - Series of lightnvm fixes (via Mathias) - NVMe pull request from Christoph. Nothing major, just SPDX/license cleanups, RR mp policy (Hannes), and little fixes (Bart, Chaitanya). - BFQ series (Paolo) - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection for the fast path (Jianchao) - fops->iopoll() added for async IO polling, this is a feature that the upcoming io_uring interface will use (Christoph, me) - Partition scan loop fixes (Dongli) - mtip32xx conversion from managed resource API (Christoph) - cdrom registration race fix (Guenter) - MD pull from Song, two minor fixes. - Various documentation fixes (Marcos) - Multi-page bvec feature. This brings a lot of nice improvements with it, like more efficient splitting, larger IOs can be supported without growing the bvec table size, and so on. (Ming) - Various little fixes to core and drivers" * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits) block: fix updating bio's front segment size block: Replace function name in string with __func__ nbd: propagate genlmsg_reply return code floppy: remove set but not used variable 'q' null_blk: fix checking for REQ_FUA block: fix NULL pointer dereference in register_disk fs: fix guard_bio_eod to check for real EOD errors blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map block: optimize bvec iteration in bvec_iter_advance block: introduce mp_bvec_for_each_page() for iterating over page block: optimize blk_bio_segment_split for single-page bvec block: optimize __blk_segment_map_sg() for single-page bvec block: introduce bvec_nth_page() iomap: wire up the iopoll method block: add bio_set_polled() helper block: wire up block device iopoll method fs: add an iopoll method to struct file_operations loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() loop: do not print warn message if partition scan is successful block: bounce: make sure that bvec table is updated ...
2019-03-05Merge branch 'irq-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The interrupt departement delivers this time: - New infrastructure to manage NMIs on platforms which have a sane NMI delivery, i.e. identifiable NMI vectors instead of a single lump. - Simplification of the interrupt affinity management so drivers don't have to implement ugly loops around the PCI/MSI enablement. - Speedup for interrupt statistics in /proc/stat - Provide a function to retrieve the default irq domain - A new interrupt controller for the Loongson LS1X platform - Affinity support for the SiFive PLIC - Better support for the iMX irqsteer driver - NUMA aware memory allocations for GICv3 - The usual small fixes, improvements and cleanups all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) irqchip/imx-irqsteer: Add multi output interrupts support irqchip/imx-irqsteer: Change to use reg_num instead of irq_group dt-bindings: irq: imx-irqsteer: Add multi output interrupts support dt-binding: irq: imx-irqsteer: Use irq number instead of group number irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code irqchip/gicv3-its: Use NUMA aware memory allocation for ITS tables irqdomain: Allow the default irq domain to be retrieved irqchip/sifive-plic: Implement irq_set_affinity() for SMP host irqchip/sifive-plic: Differentiate between PLIC handler and context irqchip/sifive-plic: Add warning in plic_init() if handler already present irqchip/sifive-plic: Pre-compute context hart base and enable base PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets genirq/affinity: Remove the leftovers of the original set support nvme-pci: Simplify interrupt allocation genirq/affinity: Add new callback for (re)calculating interrupt sets genirq/affinity: Store interrupt sets size in struct irq_affinity genirq/affinity: Code consolidation irqchip/irq-sifive-plic: Check and continue in case of an invalid cpuid. irqchip/i8259: Fix shutdown order by moving syscore_ops registration dt-bindings: interrupt-controller: loongson ls1x intc ...