summaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)Author
2017-06-27drbd: Drop unnecessary staticJulia Lawall
Drop static on a local variable, when the variable is initialized before any use, on every possible execution path through the function. The static has no benefit, and dropping it reduces the code size. The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @bad exists@ position p; identifier x; type T; @@ static T x@p; ... x = <+...x...+> @@ identifier x; expression e; type T; position p != bad.p; @@ -static T x@p; ... when != x when strict ?x = e; // </smpl> The change in code size is indicates by the following output from the size command. before: text data bss dec hex filename 67299 2291 1056 70646 113f6 drivers/block/drbd/drbd_nl.o after: text data bss dec hex filename 67283 2291 1056 70630 113e6 drivers/block/drbd/drbd_nl.o Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27mmc/block: remove a call to blk_queue_bounce_limitChristoph Hellwig
BLK_BOUNCE_ANY is the defauly now, so the call is superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27dm: don't set bounce limitChristoph Hellwig
Now all queues allocators come without abounce limit by default, dm doesn't have to override this anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27block: don't set bounce limit in blk_init_queueChristoph Hellwig
Instead move it to the callers. Those that either don't use bio_data() or page_address() or are specific to architectures that do not support highmem are skipped. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27block: don't set bounce limit in blk_init_allocated_queueChristoph Hellwig
And just move it into scsi_transport_sas which needs it due to low-level drivers directly derferencing bio_data, and into blk_init_queue_node, which will need a further push into the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27blk-mq: don't bounce by defaultChristoph Hellwig
For historical reasons we default to bouncing highmem pages for all block queues. But the blk-mq drivers are easy to audit to ensure that we don't need this - scsi and mtip32xx set explicit limits and everyone else doesn't have any particular ones. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27block: don't bother with bounce limits for make_request driversChristoph Hellwig
We only call blk_queue_bounce for request-based drivers, so stop messing with it for make_request based drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27blk-map: call blk_queue_bounce from blk_rq_append_bioChristoph Hellwig
This makes moves the knowledge about bouncing out of the callers into the block core (just like we do for the normal I/O path), and allows to unexport blk_queue_bounce. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27pktcdvd: remove the call to blk_queue_bounceChristoph Hellwig
pktcdvd is a make_request based stacking driver and thus doesn't have any addressing limits on it's own. It also doesn't use bio_data() or page_address(), so it doesn't need a lowmem bounce either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27nvme: add support for streams and directivesJens Axboe
This adds support for Directives in NVMe, particular for the Streams directive. Support for Directives is a new feature in NVMe 1.3. It allows a user to pass in information about where to store the data, so that it the device can do so most effiently. If an application is managing and writing data with different life times, mixing differently retentioned data onto the same locations on flash can cause write amplification to grow. This, in turn, will reduce performance and life time of the device. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27lightnvm: if LUNs are already allocated fix returnRakesh Pandit
While creating new device with NVM_DEV_CREATE if LUNs are already allocated ioctl would return -ENOMEM which is wrong. This patch propagates -EBUSY from nvm_reserve_luns which is correct response. Fixes: ade69e243 ("lightnvm: merge gennvm with core") Reviewed-by: Frans Klaver <fransklaver@gmail.com> Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: fail gracefully on irrec. errorJavier González
Due to user writes being decoupled from media writes because of the need of an intermediate write buffer, irrecoverable media write errors lead to pblk stalling; user writes fill up the buffer and end up in an infinite retry loop. In order to let user writes fail gracefully, it is necessary for pblk to keep track of its own internal state and prevent further writes from being placed into the write buffer. This patch implements a state machine to keep track of internal errors and, in case of failure, fail further user writes in an standard way. Depending on the type of error, pblk will do its best to persist buffered writes (which are already acknowledged) and close down on a graceful manner. This way, data might be recovered by re-instantiating pblk. Such state machine paves out the way for a state-based FTL log. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: set mempool and workqueue params.Javier González
Make constants to define sizes for internal mempools and workqueues. In this process, adjust the values to be more meaningful given the internal constrains of the FTL. In order to do this for workqueues, separate the current auxiliary workqueue into two dedicated workqueues to manage lines being closed and bad blocks. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: redesign GC algorithmJavier González
At the moment, in order to get enough read parallelism, we have recycled several lines at the same time. This approach has proven not to work well when reaching capacity, since we end up mixing valid data from all lines, thus not maintaining a sustainable free/recycled line ratio. The new design, relies on a two level workqueue mechanism. In the first level, we read the metadata for a number of lines based on the GC list they reside on (this is governed by the number of valid sectors in each line). In the second level, we recycle a single line at a time. Here, we issue reads in parallel, while a single GC write thread places data in the write buffer. This design allows to (i) only move data from one line at a time, thus maintaining a sane free/recycled ration and (ii) maintain the GC writer busy with recycled data. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: add lock assertions on helpersJavier González
Add lockdep assertions on helper functions. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: cleanup unnecessary codeJavier González
Cleanup unnecessary headers and code lines. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: set metadata list for all I/OsJavier González
Set a dma area for all I/Os in order to read/write from/to the metadata stored on the per-sector out-of-bound area. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: choose optimal victim GC lineJavier González
At the moment, we separate the closed lines on three different list based on their number of valid sectors. GC recycles lines from each list based on capacity. Lines from each list are taken in a FIFO fashion. Since the number of lines is limited (it corresponds to the number of blocks in a LUN, which is somewhere between 1000-2000), we can afford scanning the lists to choose the optimal line to be recycled. This helps specially in lines with a high number of valid sectors. If the number of blocks per LUN increases, we will consider a more efficient policy. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: decouple bad block from line allocJavier González
Decouple bad block discovery from line allocation logic. This allows to return meaningful error codes in case of bad block discovery failure. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: simplify meta. memory allocationJavier González
smeta size will always be suitable for a kmalloc allocation. Simplify the code and leave the vmalloc fallback only for emeta, where the pblk configuration has an impact. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: issue multiplane reads if possibleJavier González
If a read request is sequential and its size aligns with a multi-plane page size, use the multi-plane hint to process the I/O in parallel in the controller. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: delete redundant buffer pointerJavier González
After refactoring the metadata path, the backpointer controlling synced I/Os in a line becomes unnecessary; metadata is scheduled on the write thread, thus we know when the end of the line is reached and act on it directly. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: delete redundant debug line statJavier González
Remove a legacy variable that helped verifying the consistency of the run-time metadata for the free line list. With the new metadata layout, this check is no longer necessary. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: sched. metadata on write threadJavier González
At the moment, line metadata is persisted on a separate work queue, that is kicked each time that a line is closed. The assumption when designing this was that freeing the write thread from creating a new write request was better than the potential impact of writes colliding on the media (user I/O and metadata I/O). Experimentation has proven that this assumption is wrong; collision can cause up to 25% of bandwidth and introduce long tail latencies on the write thread, which potentially cause user write threads to spend more time spinning to get a free entry on the write buffer. This patch moves the metadata logic to the write thread. When a line is closed, remaining metadata is written in memory and is placed on a metadata queue. The write thread then takes the metadata corresponding to the previous line, creates the write request and schedules it to minimize collisions on the media. Using this approach, we see that we can saturate the media's bandwidth, which helps reducing both write latencies and the spinning time for user writer threads. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: rename read request poolJavier González
Read requests allocate some extra memory to store its per I/O context. Instead of requiring yet another memory pool for other type of requests, generalize this context allocation (and change naming accordingly). Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: generalize erase pathJavier González
Erase I/Os are scheduled with the following goals in mind: (i) minimize LUNs collisions with write I/Os, and (ii) even out the price of erasing on every write, instead of putting all the burden on when garbage collection runs. This works well on the current design, but is specific to the default mapping algorithm. This patch generalizes the erase path so that other mapping algorithms can select an arbitrary line to be erased instead. It also gets rid of the erase semaphore since it creates jittering for user writes. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: expose max sec per write on sysfsJavier González
Allow to configure the number of maximum sectors per write command through sysfs. This makes it easier to tune write command sizes for different controller configurations. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: add debug stat for read cache hitsJavier González
Add a new debug counter to measure cache hits on the read path Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: pblk: spare double cpu_to_le64 calc.Javier González
Spare a double calculation on the fast write path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: propagate right error code to targetJavier González
If nvme_alloc_request fails, propagate the right error, instead of assuming ENOMEM. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-26lightnvm: re-convert ppa format on I/O failureJavier González
In case of a failure when submitting a request, convert the ppa_list addresses to the target format so that it can interpret ppas for recovery Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-23mtip32xx: fix up the checking for internal command failureJens Axboe
This fixes up two commits that have touched this driver. The command status field is now a blk_status_t, so we can't check for < 0 and we definitely can't assume it's holding -Exxxx error values. All we care about here is whether ->status is zero or not. Check for that, and remove the various attempts at smart error reporting. Just log to dmesg what command failed, and the blk_status_t value. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 2a842acab109 ("block: introduce new block status code type") Fixes: 3f5e6a35774c ("mtip32xx: convert internal command issue to block IO path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-22Merge commit '8e8320c9315c' into for-4.13/blockJens Axboe
Pull in the fix for shared tags, as it conflicts with the pending changes in for-4.13/block. We already pulled in v4.12-rc5 to solve other conflicts or get fixes that went into 4.12, so not a lot of changes in this merge. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20block: Change argument type of scsi_req_init()Bart Van Assche
Since scsi_req_init() works on a struct scsi_request, change the argument type into struct scsi_request *. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20block: Make most scsi_req_init() calls implicitBart Van Assche
Instead of explicitly calling scsi_req_init() after blk_get_request(), call that function from inside blk_get_request(). Add an .initialize_rq_fn() callback function to the block drivers that need it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn() because it is too small to keep it as a separate function. Keep the scsi_req_init() call in ide_prep_sense() because it follows a blk_rq_init() call. References: commit 82ed4db499b8 ("block: split scsi_request out of struct request") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20null_blk: add support for shared tagsJens Axboe
Some storage drivers need to share tag sets between devices. It's useful to be able to model that with null_blk, to find hangs or performance issues. Add a 'shared_tags' bool module parameter that. If that is set to true and nr_devices is bigger than 1, all devices allocated will share the same tag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20Merge branch 'stable/for-jens-4.12' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus Pull xen-blkback fixes from Konrad: "Security and memory leak fixes in xen block driver."
2017-06-18nvme: host: unquiesce queue in nvme_kill_queues()Ming Lei
When nvme_kill_queues() is run, queues may be in quiesced state, so we forcibly unquiesce queues to avoid blocking dispatch, and I/O hang can be avoided in remove path. Peviously we use blk_mq_start_stopped_hw_queues() as counterpart of blk_mq_quiesce_queue(), now we have introduced blk_mq_unquiesce_queue(), so use it explicitly. Cc: linux-nvme@lists.infradead.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18blk-mq: use the introduced blk_mq_unquiesce_queue()Ming Lei
blk_mq_unquiesce_queue() is used for unquiescing the queue explicitly, so replace blk_mq_start_stopped_hw_queues() with it. For the scsi part, this patch takes Bart's suggestion to switch to block quiesce/unquiesce API completely. Cc: linux-nvme@lists.infradead.org Cc: linux-scsi@vger.kernel.org Cc: dm-devel@redhat.com Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18block: remove bio_clone() and all references.NeilBrown
bio_clone() is no longer used. Only bio_clone_bioset() or bio_clone_fast(). This is for the best, as bio_clone() used fs_bio_set, and filesystems are unlikely to want to use bio_clone(). So remove bio_clone() and all references. This includes a fix to some incorrect documentation. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18bcache: use kmalloc to allocate bio in bch_data_verify()NeilBrown
This function allocates a bio, then a collection of pages. It copes with failure. It currently uses a mempool() to allocate the bio, but alloc_page() to allocate the pages. These fail in different ways, so the usage is inconsistent. Change the bio_clone() to bio_clone_kmalloc() so that no pool is used either for the bio or the pages. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by : Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18xen-blkfront: remove bio splitting.NeilBrown
bios that are re-submitted will pass through blk_queue_split() when blk_queue_bio() is called, and this will split the bio if necessary. There is no longer any need to do this splitting in xen-blkfront. Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18lightnvm/pblk-read: use bio_clone_fast()NeilBrown
pblk_submit_read() uses bio_clone_bioset() but doesn't change the io_vec, so bio_clone_fast() is a better choice. It also uses fs_bio_set which is intended for filesystems. Using it in a device driver can deadlock. So allocate a new bioset, and and use bio_clone_fast(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18pktcdvd: use bio_clone_fast() instead of bio_clone()NeilBrown
pktcdvd doesn't change the bi_io_vec of the clone bio, so it is more efficient to use bio_clone_fast(), and not clone the bi_io_vec. This requires providing a bio_set, and it is safest to provide a dedicated bio_set rather than sharing fs_bio_set, which filesytems use. This new bio_set, pkt_bio_set, can also be use for the bio_split() call as the two allocations (bio_clone_fast, and bio_split) are independent, neither can block a bio allocated by the other. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18drbd: use bio_clone_fast() instead of bio_clone()NeilBrown
drbd does not modify the bi_io_vec of the cloned bio, so there is no need to clone that part. So bio_clone_fast() is the better choice. For bio_clone_fast() we need to specify a bio_set. We could use fs_bio_set, which bio_clone() uses, or drbd_md_io_bio_set, which drbd uses for metadata, but it is generally best to avoid sharing bio_sets unless you can be certain that there are no interdependencies. So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast(). Also remove a "XXX cannot fail ???" comment because it definitely cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for sleeping. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18rbd: use bio_clone_fast() instead of bio_clone()NeilBrown
bio_clone() makes a copy of the bi_io_vec, but rbd never changes that, so there is no need for a copy. bio_clone_fast() can be used instead, which avoids making the copy. This requires that we provide a bio_set. bio_clone() uses fs_bio_set, but it isn't, in general, safe to use the same bio_set at different levels of the stack, as that can lead to deadlocks. As filesystems use fs_bio_set, block devices shouldn't. As rbd never stacks, it is safe to have a single global bio_set for all rbd devices to use. So allocate that when the module is initialised, and use it with bio_clone_fast(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18blk: make the bioset rescue_workqueue optional.NeilBrown
This patch converts bioset_create() to not create a workqueue by default, so alloctions will never trigger punt_bios_to_rescuer(). It also introduces a new flag BIOSET_NEED_RESCUER which tells bioset_create() to preserve the old behavior. All callers of bioset_create() that are inside block device drivers, are given the BIOSET_NEED_RESCUER flag. biosets used by filesystems or other top-level users do not need rescuing as the bio can never be queued behind other bios. This includes fs_bio_set, blkdev_dio_pool, btrfs_bioset, xfs_ioend_bioset, and one allocated by target_core_iblock.c. biosets used by md/raid do not need rescuing as their usage was recently audited and revised to never risk deadlock. It is hoped that most, if not all, of the remaining biosets can end up being the non-rescued version. Reviewed-by: Christoph Hellwig <hch@lst.de> Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes) Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18blk: replace bioset_create_nobvec() with a flags arg to bioset_create()NeilBrown
"flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18blk: remove bio_set arg from blk_queue_split()NeilBrown
blk_queue_split() is always called with the last arg being q->bio_split, where 'q' is the first arg. Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses q->bio_split. This is inconsistent and unnecessary. Remove the last arg and always use q->bio_split inside blk_queue_split() Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed) Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18loop: Add PF_LESS_THROTTLE to block/loop device thread.NeilBrown
When a filesystem is mounted from a loop device, writes are throttled by balance_dirty_pages() twice: once when writing to the filesystem and once when the loop_handle_cmd() writes to the backing file. This double-throttling can trigger positive feedback loops that create significant delays. The throttling at the lower level is seen by the upper level as a slow device, so it throttles extra hard. The PF_LESS_THROTTLE flag was created to handle exactly this circumstance, though with an NFS filesystem mounted from a local NFS server. It reduces the throttling on the lower layer so that it can proceed largely unthrottled. To demonstrate this, create a filesystem on a loop device and write (e.g. with dd) several large files which combine to consume significantly more than the limit set by /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total time taken. When I do this directly on a device (no loop device) the total time for several runs (mkfs, mount, write 200 files, umount) is fairly stable: 28-35 seconds. When I do this over a loop device the times are much worse and less stable. 52-460 seconds. Half below 100seconds, half above. When I apply this patch, the times become stable again, though not as fast as the no-loop-back case: 53-72 seconds. There may be room for further improvement as the total overhead still seems too high, but this is a big improvement. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>