summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-07-09nvme-fc: fix module unloads while lports still pendingJames Smart
Current code allows the module to be unloaded even if there are pending data structures, such as localports and controllers on the localports, that have yet to hit their reference counting to remove them. Fix by having exit entrypoint explicitly delete every controller, which in turn will remove references on the remoteports and localports causing them to be deleted as well. The exit entrypoint, after initiating the deletes, will wait for the last localport to be deleted before continuing. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-tcp: don't use sendpage for SLAB pagesMikhail Skorzhinskii
According to commit a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab objects") and previous discussion, tcp_sendpage should not be used for pages that is managed by SLAB, as SLAB is not taking page reference counters into consideration. Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-tcp: set the STABLE_WRITES flag when data digests are enabledMikhail Skorzhinskii
There was a few false alarms sighted on target side about wrong data digest while performing high throughput load to XFS filesystem shared through NVMoF TCP. This flag tells the rest of the kernel to ensure that the data buffer does not change while the write is in flight. It incurs a performance penalty, so only enable it when it is actually needed, i.e. when we are calculating data digests. Although even with this change in place, ext2 users can steel experience false positives, as ext2 is not respecting this flag. This may be apply to vfat as well. Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Signed-off-by: Mike Playle <mplayle@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvmet: print a hint while rejecting NSID 0 or 0xffffffffMikhail Skorzhinskii
Adding this hint for the sake of convenience. It was spotted that a few times people spent some time before understanding what is exactly wrong in configuration process. This should save a few time in such situations, especially for people who is not very confident with NVMe requirements. Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-multipath: do not select namespaces which are about to be removedHannes Reinecke
nvme_ns_remove() will first set the NVME_NS_REMOVING flag before removing it from the list at the very last step. So to avoid selecting a namespace in nvme_find_path() which is about to be removed check the NVME_NS_REMOVING flag, too, when selecting a new path. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-multipath: also check for a disabled path if there is a single siblingHannes Reinecke
When we have a singular list in nvme_round_robin_path() we still need to check its validity. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-multipath: factor out a nvme_path_is_disabled helperHannes Reinecke
Factor our a common helper to check if a path has been disabled by something other than the per-namespace ANA state. Signed-off-by: Hannes Reinecke <hare@suse.com> [hch: split from a bigger patch] Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme: set physical block size and optimal I/O sizeBart Van Assche
>From the NVMe 1.4 spec: NSFEAT bit 4 if set to 1: indicates that the fields NPWG, NPWA, NPDG, NPDA, and NOWS are defined for this namespace and should be used by the host for I/O optimization; [ ... ] Namespace Preferred Write Granularity (NPWG): This field indicates the smallest recommended write granularity in logical blocks for this namespace. This is a 0's based value. The size indicated should be less than or equal to Maximum Data Transfer Size (MDTS) that is specified in units of minimum memory page size. The value of this field may change if the namespace is reformatted. The size should be a multiple of Namespace Preferred Write Alignment (NPWA). Refer to section 8.25 for how this field is utilized to improve performance and endurance. [ ... ] Each Write, Write Uncorrectable, or Write Zeroes commands should address a multiple of Namespace Preferred Write Granularity (NPWG) (refer to Figure 245) and Stream Write Size (SWS) (refer to Figure 515) logical blocks (as expressed in the NLB field), and the SLBA field of the command should be aligned to Namespace Preferred Write Alignment (NPWA) (refer to Figure 245) for best performance. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme: add I/O characteristics fieldsBart Van Assche
Several new fields have been introduced in version 1.4 of the NVMe spec at offsets that were defined as reserved in version 1.3d of the NVMe spec. Update the definition of the nvme_id_ns data structure such that it is in sync with version 1.4 of the NVMe spec. This change preserves backwards compatibility. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvmet: export I/O characteristics attributes in IdentifyBart Van Assche
Make the NVMe NAWUN, NAWUPF, NACWU, NPWG, NPWA, NPDG and NOWS attributes available to initator systems for the block backend. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-trace: add delete completion and submission queue to admin cmds tracerTom Wu
The trace log for 'delete I/O submission queue' and 'delete I/O completion queue' command will look like as below: kworker/u49:1-3438 [003] .... 6693.070865: nvme_setup_cmd: nvme0: qid=0, cmdid=11, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_delete_sq sqid=1) kworker/u49:1-3438 [003] .... 6693.071171: nvme_setup_cmd: nvme0: qid=0, cmdid=8, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_delete_cq cqid=24) Signed-off-by: Tom Wu <tomwu@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-trace: fix spelling mistake "spcecific" -> "specific"Colin Ian King
There are two spelling mistakes in trace_seq_printf messages, fix these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-pci: limit max_hw_sectors based on the DMA max mapping sizeChristoph Hellwig
When running a NVMe device that is attached to a addressing challenged PCIe root port that requires bounce buffering, our request sizes can easily overflow the swiotlb bounce buffer size. Limit the maximum I/O size to the limit exposed by the DMA mapping subsystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Atish Patra <Atish.Patra@wdc.com> Tested-by: Atish Patra <Atish.Patra@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-07-09nvme-pci: check for NULL return from pci_alloc_p2pmem()Alan Mikhak
Modify nvme_alloc_sq_cmds() to call pci_free_p2pmem() to free the memory it allocated using pci_alloc_p2pmem() in case pci_p2pmem_virt_to_bus() returns null. Makes sure not to call pci_free_p2pmem() if pci_alloc_p2pmem() returned NULL, which can happen if CONFIG_PCI_P2PDMA is not configured. The current implementation is not expected to leak since pci_p2pmem_virt_to_bus() is expected to fail only if pci_alloc_p2pmem() returns null. However, checking the return value of pci_alloc_p2pmem() is more explicit. Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-pci: don't create a read hctx mapping without read queuesAlan Mikhak
Only request an IRQ mapping for read queues if at least one read queue is being allocted, as nvme_pci_map_queues() will later on ignore the unnecessary mapping request should nvme_dev_add() request such an IRQ mapping even though no read queues are being allocated. However, nvme_dev_add() can avoid making the request by checking the number of read queues without assuming. This would bring it more in line with nvme_setup_irqs() and nvme_calc_irq_sets(). Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-pci: don't fall back to a 32-bit DMA maskChristoph Hellwig
Since Linux 5.0 drivers can safely set the largest DMA mask supported by the device, and don't need fallbacks to work around the dma mapping implementations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-07-09nvme-pci: make nvme_dev_pm_ops staticYueHaibing
Fix sparse warning: drivers/nvme/host/pci.c:2926:25: warning: symbol 'nvme_dev_pm_ops' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-fcloop: resolve warnings on RCU usage and sleep warningsJames Smart
With additional debugging enabled, seeing warnings for suspicious RCU usage or Sleeping function called from invalid context. These both map to allocation of a work structure which is currently GFP_KERNEL, meaning it can sleep. For the RCU warning, the sequence was sleeping while holding the RCU lock. Convert the allocation to GFP_ATOMIC. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09nvme-fcloop: fix inconsistent lock state warningsJames Smart
With extra debug on, inconsistent lock state warnings are being called out as the tfcp_req->reqlock is being taken out without irq, while some calling sequences have the sequence in a softirq state. Change the lock taking/release to raise/drop irq. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-05blk-iolatency: fix STS_AGAIN handlingDennis Zhou
The iolatency controller is based on rq_qos. It increments on rq_qos_throttle() and decrements on either rq_qos_cleanup() or rq_qos_done_bio(). a3fb01ba5af0 fixes the double accounting issue where blk_mq_make_request() may call both rq_qos_cleanup() and rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the double decrement. The above works upstream as the only way we can get STS_AGAIN is from blk_mq_get_request() failing. The STS_AGAIN handling isn't a real problem as bio_endio() skipping only happens on reserved tag allocation failures which can only be caused by driver bugs and already triggers WARN. However, the fix creates a not so great dependency on how STS_AGAIN can be propagated. Internally, we (Facebook) carry a patch that kills read ahead if a cgroup is io congested or a fatal signal is pending. This combined with chained bios progagate their bi_status to the parent is not already set can can cause the parent bio to not clean up properly even though it was successful. This consequently leaks the inflight counter and can hang all IOs under that blkg. To nip the adverse interaction early, this removes the rq_qos_cleanup() callback in iolatency in favor of cleaning up always on the rq_qos_done_bio() path. Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios") Debugged-by: Tejun Heo <tj@kernel.org> Debugged-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-03block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROESChristoph Hellwig
Fix a regression introduced when removing bi_phys_segments for Write Zeroes requests, which need to have a segment count of zero, as they don't have a payload. Fixes: 14ccb66b3f58 ("block: remove the bi_phys_segments field in struct bio") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-02blk-mq: simplify blk_mq_make_request()Bart Van Assche
Move the blk_mq_bio_to_request() call in front of the if-statement. Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-02blk-mq: remove blk_mq_put_ctx()Bart Van Assche
No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends on preemption being disabled for its correctness. Since removing the CPU preemption calls does not measurably affect performance, simplify the blk-mq code by removing the blk_mq_put_ctx() function and also by not disabling preemption in blk_mq_get_ctx(). Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-01sbitmap: Replace cmpxchg with xchgPavel Begunkov
cmpxchg() with an immediate value could be replaced with less expensive xchg(). The same true if new value don't _depend_ on the old one. In the second block, atomic_cmpxchg() return value isn't checked, so after atomic_cmpxchg() -> atomic_xchg() conversion it could be replaced with atomic_set(). Comparison with atomic_read() in the second chunk was left as an optimisation (if that was the initial intention). Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-01block: fix .bi_size overflowMing Lei
'bio->bi_iter.bi_size' is 'unsigned int', which at most hold 4G - 1 bytes. Before 07173c3ec276 ("block: enable multipage bvecs"), one bio can include very limited pages, and usually at most 256, so the fs bio size won't be bigger than 1M bytes most of times. Since we support multi-page bvec, in theory one fs bio really can be added > 1M pages, especially in case of hugepage, or big writeback with too many dirty pages. Then there is chance in which .bi_size is overflowed. Fixes this issue by using bio_full() to check if the added segment may overflow .bi_size. Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 07173c3ec276 ("block: enable multipage bvecs") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-01Merge tag 'v5.2-rc6' into for-5.3/blockJens Axboe
Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak fix. Otherwise we end up having conflicts with future patches between for-5.3/block and master that touch this area. In particular, it makes the bio_full() fix hard to backport to stable. * tag 'v5.2-rc6': (482 commits) Linux 5.2-rc6 Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock" Bluetooth: Fix regression with minimum encryption key size alignment tcp: refine memory limit test in tcp_fragment() x86/vdso: Prevent segfaults due to hoisted vclock reads SUNRPC: Fix a credential refcount leak Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE" net :sunrpc :clnt :Fix xps refcount imbalance on the error path NFS4: Only set creation opendata if O_CREAT ARM: 8867/1: vdso: pass --be8 to linker if necessary KVM: nVMX: reorganize initial steps of vmx_set_nested_state KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries habanalabs: use u64_to_user_ptr() for reading user pointers nfsd: replace Jeff by Chuck as nfsd co-maintainer inet: clear num_timeout reqsk_alloc() PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present net: mvpp2: debugfs: Add pmap to fs dump ipv6: Default fib6_type to RTN_UNICAST when not set net: hns3: Fix inconsistent indenting net/af_iucv: always register net_device notifier ...
2019-06-29block: sed-opal: check size of shadow mbrJonas Rabenstein
Check whether the shadow mbr does fit in the provided space on the target. Also a proper firmware should handle this case and return an error we may prevent problems or even damage with crappy firmwares. Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: sed-opal: ioctl for writing to shadow mbrJonas Rabenstein
Allow modification of the shadow mbr. If the shadow mbr is not marked as done, this data will be presented read only as the device content. Only after marking the shadow mbr as done and unlocking a locking range the actual content is accessible. Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: sed-opal: add ioctl for done-mark of shadow mbrJonas Rabenstein
Enable users to mark the shadow mbr as done without completely deactivating the shadow mbr feature. This may be useful on reboots, when the power to the disk is not disconnected in between and the shadow mbr stores the required boot files. Of course, this saves also the (few) commands required to enable the feature if it is already enabled and one only wants to mark the shadow mbr as done. Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: never take page references for ITER_BVECChristoph Hellwig
If we pass pages through an iov_iter we always already have a reference in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take reference to pages by default for bvec backed iov_iters. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29direct-io: use bio_release_pages in dio_bio_completeChristoph Hellwig
Use bio_release_pages instead of duplicating it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block_dev: use bio_release_pages in bio_unmap_userChristoph Hellwig
Use bio_release_pages instead of duplicating it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block_dev: use bio_release_pages in blkdev_bio_end_ioChristoph Hellwig
Use bio_release_pages instead of duplicating it. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29iomap: use bio_release_pages in iomap_dio_bio_end_ioChristoph Hellwig
Use bio_release_pages instead of duplicating it. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: use bio_release_pages in bio_map_user_iovChristoph Hellwig
Use bio_release_pages instead of open coding it. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: use bio_release_pages in bio_unmap_userChristoph Hellwig
Use bio_release_pages instead of open coding it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: optionally mark pages dirty in bio_release_pagesChristoph Hellwig
A lot of callers of bio_release_pages also want to mark the released pages as dirty. Add a mark_dirty parameter to avoid a second relatively expensive bio_for_each_segment_all loop. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: move the BIO_NO_PAGE_REF check into bio_release_pagesChristoph Hellwig
Move the BIO_NO_PAGE_REF check into bio_release_pages instead of duplicating it in both callers. Also make the function available outside of bio.c so that we can reuse it in other direct I/O implementations. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: skd_main.c: Remove call to memset after dma_alloc_coherentFuqian Huang
In commit af7ddd8a627c ("Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping"), dma_alloc_coherent has already zeroed the memory. So memset is not needed. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: mtip32xx: Remove call to memset after dma_alloc_coherentFuqian Huang
In commit af7ddd8a627c ("Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping"), dma_alloc_coherent has already zeroed the memory. So memset is not needed. Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: sed-opal: "Never True" conditionsRevanth Rajashekar
'who' an unsigned variable in stucture opal_session_info can never be lesser than zero. Hence, the condition "who < OPAL_ADMIN1" can never be true. Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block: sed-opal: PSID reverttper capabilityRevanth Rajashekar
PSID is a 32 character password printed on the drive label, to prove its physical access. This PSID reverttper function is very useful to regain the control over the drive when it is locked and the user can no longer access it because of some failures. However, *all the data on the drive is completely erased*. This method is advisable only when the user is exhausted of all other recovery methods. PSID capabilities are described in: https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage-Opal_Feature_Set_PSID_v1.00_r1.00.pdf Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block, documentation: Document discard_zeroes_data, fua, ↵Bart Van Assche
max_discard_segments and write_zeroes_max_bytes Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block, documentation: Explain the word 'segments'Bart Van Assche
Several block layer users who are not kernel developers do not know that the word 'segment' refers to an element in a DMA scatter/gather list. Make the block layer documentation easier to understand by stating explicitly what the word 'segment' stands for. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block, documentation: Sort queue sysfs attribute names alphabeticallyBart Van Assche
Commit f9824952ee1c ("block: update sysfs documentation") # v5.0 broke the alphabetical order of the sysfs attribute names. List queue sysfs attribute names alphabetically. Cc: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29block, documentation: Fix wbt_lat_usec documentationBart Van Assche
Fix the spelling of the wbt_lat_usec sysfs attribute. Fixes: 87760e5eef35 ("block: hook up writeback throttling") # v4.10. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-29null_blk: fix type mismatch null_handle_cmd()Chaitanya Kulkarni
In null_handle_cmd() when device is configured as zoned, variable op is decalred as an int, where it is used to hold values of type REQ_OP_XXX which is of type enum req_opf. Change the type from int to enum req_opf. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28block, bfq: NULL out the bic when it's no longer validDouglas Anderson
In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like: 0 test_bit (addr=<optimized out>, nr=<optimized out>) 1 bfq_bfqq_busy (bfqq=<optimized out>) 2 bfq_select_queue (bfqd=<optimized out>) 3 __bfq_dispatch_request (hctx=<optimized out>) 4 bfq_dispatch_request (hctx=<optimized out>) 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28bcache: add reclaimed_journal_buckets to struct cache_setColy Li
Now we have counters for how many times jouranl is reclaimed, how many times cached dirty btree nodes are flushed, but we don't know how many jouranl buckets are really reclaimed. This patch adds reclaimed_journal_buckets into struct cache_set, this is an increasing only counter, to tell how many journal buckets are reclaimed since cache set runs. From all these three counters (reclaim, reclaimed_journal_buckets, flush_write), we can have idea how well current journal space reclaim code works. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-28bcache: performance improvement for btree_flush_write()Coly Li
This patch improves performance for btree_flush_write() in following ways, - Use another spinlock journal.flush_write_lock to replace the very hot journal.lock. We don't have to use journal.lock here, selecting candidate btree nodes takes a lot of time, hold journal.lock here will block other jouranling threads and drop the overall I/O performance. - Only select flushing btree node from c->btree_cache list. When the machine has a large system memory, mca cache may have a huge number of cached btree nodes. Iterating all the cached nodes will take a lot of CPU time, and most of the nodes on c->btree_cache_freeable and c->btree_cache_freed lists are cleared and have need to flush. So only travel mca list c->btree_cache to select flushing btree node should be enough for most of the cases. - Don't iterate whole c->btree_cache list, only reversely select first BTREE_FLUSH_NR btree nodes to flush. Iterate all btree nodes from c->btree_cache and select the oldest journal pin btree nodes consumes huge number of CPU cycles if the list is huge (push and pop a node into/out of a heap is expensive). The last several dirty btree nodes on the tail of c->btree_cache list are earlest allocated and cached btree nodes, they are relative to the oldest journal pin btree nodes. Therefore only flushing BTREE_FLUSH_NR btree nodes from tail of c->btree_cache probably includes the oldest journal pin btree nodes. In my testing, the above change decreases 50%+ CPU consumption when journal space is full. Some times IOPS drops to 0 for 5-8 seconds, comparing blocking I/O for 120+ seconds in previous code, this is much better. Maybe there is room to improve in future, but at this momment the fix looks fine and performs well in my testing. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>