path: root/drivers/md
Age | Commit message | Author
2018-01-10 | dm mpath: Use blk_path_error | Keith Busch
Uses common code for determining if an error should be retried on alternate path. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
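The retry decision this commit centralizes can be sketched in userspace C as follows. This is only an illustration: the enum and function names are hypothetical stand-ins for the kernel's `blk_status_t` values and `blk_path_error()`, and the choice of which codes count as target-side failures is an assumption made for the example.

```c
#include <stdbool.h>

/* Hypothetical status codes for this sketch; the kernel uses blk_status_t. */
enum sketch_status {
	SKETCH_STS_OK,
	SKETCH_STS_NOTSUPP,    /* operation not supported by the device */
	SKETCH_STS_NOSPC,      /* device is out of space */
	SKETCH_STS_TARGET,     /* failure reported by the target itself */
	SKETCH_STS_NEXUS,      /* reservation conflict */
	SKETCH_STS_MEDIUM,     /* media error */
	SKETCH_STS_PROTECTION, /* integrity check failure */
	SKETCH_STS_TRANSPORT,  /* link or transport failure */
	SKETCH_STS_IOERR       /* generic I/O error */
};

/*
 * Return true when the error might be specific to the path that was used,
 * so retrying on an alternate path is worthwhile. Errors that the device
 * would report on any path are not retried. Only meaningful for non-OK
 * status values.
 */
static bool sketch_path_error(enum sketch_status error)
{
	switch (error) {
	case SKETCH_STS_NOTSUPP:
	case SKETCH_STS_NOSPC:
	case SKETCH_STS_TARGET:
	case SKETCH_STS_NEXUS:
	case SKETCH_STS_MEDIUM:
	case SKETCH_STS_PROTECTION:
		return false;
	default:
		/* Anything else could be a path failure. */
		return true;
	}
}
```

Keeping this test in one helper means every multipath implementation answers the question "is another path worth trying?" the same way.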
2018-01-09 | bcache: closures: move control bits one bit right | Michael Lyle
Otherwise, architectures that do negated adds of atomics (e.g. s390) to do atomic_sub fail in closure_set_stopped. Signed-off-by: Michael Lyle <mlyle@lyle.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 | bcache: fix writeback target calc on large devices | Michael Lyle
Bcache needs to scale the dirty data in the cache over the multiple backing disks in order to calculate writeback rates for each. The previous code did this by multiplying the target number of dirty sectors by the backing device size, and expected it to fit into a uint64_t; this blows up on relatively small backing devices. The new approach figures out the bdev's share in 16384ths of the overall cached data. This is chosen to cope well when bdevs drastically vary in size and to ensure that bcache can cross the petabyte boundary for each backing device. This has been improved based on Tang Junhui's feedback to ensure that every device gets a share of dirty data, no matter how small it is compared to the total backing pool. The existing mechanism is very limited; this is purely a bug fix to remove limits on volume size. However, there still needs to be a change to make this "fair" over many volumes where some are idle. Reported-by: Jack Douglas <jack@douglastechnology.co.uk> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
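A minimal sketch of the share-based calculation described above, assuming the share is derived from the backing device size relative to the sum of all backing device sizes. The function and parameter names are hypothetical, and this is not the kernel's code.

```c
#include <stdint.h>

#define SKETCH_SHARE_SHIFT 14 /* shares are expressed in 16384ths (2^14) */

/*
 * Scale the cache-wide dirty target down to one backing device.
 * total_bdev_sectors must be non-zero (caller contract in this sketch).
 */
static uint64_t sketch_dirty_target(uint64_t cache_dirty_target,
				    uint64_t bdev_sectors,
				    uint64_t total_bdev_sectors)
{
	/* Share of this device in 16384ths of the whole backing pool. */
	uint64_t share = (bdev_sectors << SKETCH_SHARE_SHIFT) / total_bdev_sectors;

	/* Every device gets at least one share, however small it is. */
	if (share < 1)
		share = 1;

	/* Multiplying by at most 2^14 keeps the product far from overflow. */
	return (cache_dirty_target * share) >> SKETCH_SHARE_SHIFT;
}
```

For example, a device that makes up half of the backing pool gets a share of 8192 and therefore half of the cache-wide target. Because the device size is only shifted by 14 bits before the division, the intermediate value stays within 64 bits for any realistic device size, which is the point of the fix.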
2018-01-08 | bcache: fix misleading error message in bch_count_io_errors() | Coly Li
Bcache only does recoverable I/O for read operations by calling cached_dev_read_error(). For write operations there is no I/O recovery for failed requests. But in bch_count_io_errors(), for both read and write I/Os, until the error counter reaches the I/O error limit, pr_err() always prints "IO error on %, recovering". For write requests this information is misleading, because there is no I/O recovery at all. This patch adds a parameter 'is_read' to bch_count_io_errors(), and only prints "recovering" by pr_err() when the bio direction is READ. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 | bcache: reduce cache_set devices iteration by devices_max_used | Coly Li
The devices member of struct cache_set references all bcache devices attached to this cache set. Treated as an array of pointers, the size of devices[] is given by the nr_uuids member of struct cache_set. nr_uuids is calculated in drivers/md/bcache/super.c:bch_cache_set_alloc() as bucket_bytes(c) / sizeof(struct uuid_entry). Bucket size is determined by the user space tool "make-bcache"; by default it is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So the default nr_uuids value is 4096 from the above calculation. Every time bcache code iterates over the bcache devices of a cache set, all 4096 pointers are checked even if only 1 bcache device is attached to the cache set, which is an unnecessary waste of time. This patch adds a member devices_max_used to struct cache_set. Its value is 1 + the maximum used index of devices[] in a cache set. When iterating over all valid bcache devices of a cache set, using c->devices_max_used in the for-loop may avoid a lot of useless checking. Personally, my motivation for this patch is not performance; I use it in bcache debugging, where it helps me narrow down the scope when checking the valid bcache devices of a cache set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
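The arithmetic and the bounded loop from this commit message can be sketched as below. The 128-byte entry size is an assumption implied by the numbers quoted above (1024 sectors of 512 bytes divided by 4096 entries); the struct and helper names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_SECTOR_BYTES     512u
#define SKETCH_BUCKET_SECTORS   1024u /* default bucket size quoted above */
#define SKETCH_UUID_ENTRY_BYTES 128u  /* assumption: 524288 / 4096 */
#define SKETCH_NR_UUIDS \
	((SKETCH_BUCKET_SECTORS * SKETCH_SECTOR_BYTES) / SKETCH_UUID_ENTRY_BYTES)

struct sketch_cache_set {
	void *devices[SKETCH_NR_UUIDS];
	size_t devices_max_used; /* 1 + highest index in use, 0 when empty */
};

/* Attach a device at slot idx and keep the upper bound up to date. */
static void sketch_attach(struct sketch_cache_set *c, size_t idx, void *dev)
{
	if (idx >= SKETCH_NR_UUIDS)
		return;
	c->devices[idx] = dev;
	if (idx >= c->devices_max_used)
		c->devices_max_used = idx + 1;
}

/* Count attached devices; report how many slots the loop had to visit. */
static size_t sketch_count_attached(const struct sketch_cache_set *c,
				    size_t *slots_checked)
{
	size_t i, n = 0;

	for (i = 0; i < c->devices_max_used; i++)
		if (c->devices[i])
			n++;
	*slots_checked = i;
	return n;
}
```

With a single device attached at index 2, the loop visits 3 slots instead of all 4096.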
2018-01-08 | bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct() | Zhai Zhaoxuan
The functions cached_dev_make_request() and flash_dev_make_request() call generic_start_io_acct() with (struct bcache_device)->disk when they start a closure. Then the function bio_complete() calls generic_end_io_acct() with (struct search)->orig_bio->bi_disk when the closure has finished. Since the `bi_disk` is not the bcache device, generic_end_io_acct() is called with the wrong device queue. This causes the "inflight" counter (in struct hd_struct) to keep increasing without ever decreasing. This patch fixes the problem by calling generic_end_io_acct() with (struct bcache_device)->disk. Signed-off-by: Zhai Zhaoxuan <kxuanobj@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
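The mismatch can be modeled with two counters. This is a simplified, hypothetical model (the real accounting lives in the block layer and tracks more than one number); it only shows why start and end must name the same disk.

```c
/* Hypothetical stand-in for a disk and its in-flight counter. */
struct sketch_disk {
	int inflight;
};

struct sketch_request {
	struct sketch_disk *bcache_disk; /* disk used when accounting starts */
	struct sketch_disk *bio_disk;    /* disk the original bio points at later */
};

/* Accounting starts against the bcache device. */
static void sketch_start(struct sketch_request *r)
{
	r->bcache_disk->inflight++;
}

/* Buggy completion: ends accounting against a different disk. */
static void sketch_complete_buggy(struct sketch_request *r)
{
	r->bio_disk->inflight--;
}

/* Fixed completion: ends accounting against the disk that started it. */
static void sketch_complete_fixed(struct sketch_request *r)
{
	r->bcache_disk->inflight--;
}
```

In the buggy variant the bcache disk's counter never returns to zero, which matches the ever-growing "inflight" value the commit describes.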
2018-01-08 | bcache: mark closure_sync() __sched | Kent Overstreet
[edit by mlyle: include sched/debug.h to get __sched] Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 | bcache: Fix, improve efficiency of closure_sync() | Kent Overstreet
Eliminates cases where sync can race and fail to complete / get stuck. Removes many status flags and simplifies entering-and-exiting closure sleeping behaviors. [mlyle: fixed conflicts due to changed return behavior in mainline. extended commit comment, and squashed down two commits that were mostly contradictory to get to this state. Changed __set_current_state to set_current_state per Jens review comment] Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 | bcache: allow quick writeback when backing idle | Michael Lyle
If the control system would wait for at least half a second, and there have been no requests hitting the backing disk for a while: use an alternate mode where we have at most one contiguous set of writebacks in flight at a time. (But don't otherwise delay). If front-end IO appears, it will still be quick, as it will only have to contend with one real operation in flight. But otherwise, we'll be sending data to the backing disk as quickly as it can accept it (with one op at a time). Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 | bcache: writeback: properly order backing device IO | Michael Lyle
Writeback keys are presently iterated and dispatched for writeback in order of the logical block address on the backing device. Multiple keys may be, in parallel, read from the cache device and then written back (especially when there is contiguous I/O). However, there was no guarantee with the existing code that the writes would be issued in LBA order, as the reads from the cache device are often re-ordered. In turn, when writing back quickly, the backing disk often has to seek backwards; this slows writeback and increases utilization. This patch introduces an ordering mechanism that guarantees that the original order of issue is maintained for the write portion of the I/O. Performance for writeback is significantly improved when there are multiple contiguous keys or high writeback rates. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Tested-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
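One way to picture such an ordering mechanism is a sequence number per writeback I/O: reads may complete in any order, but a write is only issued when its number is the next one due. The sketch below is a single-threaded illustration with hypothetical names; it is not the patch's implementation, which has to deal with concurrency and waiting.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_IO 16

struct sketch_orderer {
	unsigned next_seq;              /* sequence number allowed to write next */
	bool read_done[SKETCH_MAX_IO];  /* read finished, write not yet issued */
	unsigned issued[SKETCH_MAX_IO]; /* order in which writes were issued */
	size_t nr_issued;
};

/*
 * Called when the cache-device read for I/O number seq completes.
 * The write for seq is parked until all earlier writes have been issued.
 */
static void sketch_read_complete(struct sketch_orderer *o, unsigned seq)
{
	if (seq >= SKETCH_MAX_IO)
		return;
	o->read_done[seq] = true;

	/* Issue every write whose turn has come, in sequence order. */
	while (o->next_seq < SKETCH_MAX_IO && o->read_done[o->next_seq])
		o->issued[o->nr_issued++] = o->next_seq++;
}
```

If the reads complete in the order 2, 0, 1, the writes are still issued as 0, 1, 2, so the backing disk sees them in the original dispatch order.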
2018-01-08 | bcache: fix wrong return value in bch_debug_init() | Tang Junhui
In bch_debug_init(), ret is always 0, so the return value is useless. Change it to return 0 on success after calling debugfs_create_dir(), and a non-zero value otherwise. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 | bcache: segregate flash only volume write streams | Tang Junhui
In a scenario with some flash only volumes and some cached devices, when many tasks issue requests to these devices in writeback mode, the write IOs may fall into the same bucket as below: | cached data | flash data | cached data | cached data | flash data | Then, after writeback of these cached devices, the bucket would look like this: | free | flash data | free | free | flash data | So there is a lot of free space in this bucket, but since data of flash only volumes still exists, the bucket cannot be reclaimed, which wastes bucket space. In this patch, we segregate flash only volume write streams from cached devices, so data from flash only volumes and cached devices can be stored in different buckets. Compared to the v1 patch, this patch does not add an additional open bucket list, and it makes a best effort to segregate flash only volume write streams from cached devices; sectors of flash only volumes may still be mixed with dirty sectors of cached devices, but the number is very small. [mlyle: fixed commit log formatting, permissions, line endings] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 | bcache: Use PTR_ERR_OR_ZERO() | Vasyl Gomonovych
Fix ptr_ret.cocci warnings: drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
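The transformation the Coccinelle script suggests can be illustrated with userspace equivalents of the kernel's error-pointer helpers. The `sketch_` names are hypothetical re-implementations written for this example, assuming the usual convention that the top 4095 address values encode negative errno codes.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True when the pointer value lies in the reserved error range. */
static inline bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Decode the errno value from an error pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Open-coded form that the warning flags. */
static int sketch_open_coded(const void *ptr)
{
	if (sketch_is_err(ptr))
		return (int)sketch_ptr_err(ptr);
	return 0;
}

/* Single-helper form: the error code, or 0 for a valid pointer. */
static int sketch_ptr_err_or_zero(const void *ptr)
{
	return sketch_is_err(ptr) ? (int)sketch_ptr_err(ptr) : 0;
}
```

Both forms return the same value for every input; the helper just states the intent in one call.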
2018-01-08 | bcache: stop writeback thread after detaching | Tang Junhui
Currently, when a cached device is detached from the cache, the writeback thread is not stopped and the writeback_rate_update work is not canceled. For example, after the following command: echo 1 >/sys/block/sdb/bcache/detach you can still see the writeback thread. If you then attach the device to the cache again, bcache will create another writeback thread; for example, after the command below: echo ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach you will see 2 writeback threads. This patch stops the writeback thread and cancels the writeback_rate_update work when a cached device is detached from the cache. Compared with patch v1, this v2 patch moves the code down into the register lock for safety in case of any future changes, as Coly and Mike suggested. [edit by mlyle: commit log spelling/formatting] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08 | bcache: ret IOERR when read meets metadata error | Rui Hua
A read request might hit an error when searching the btree, but the error was not handled in cache_lookup(), and this kind of metadata failure will not go into cached_dev_read_error(); in the end, the upper layer will receive bi_status=0. In this patch we detect the metadata error by the return value of bch_btree_map_keys(). There are two potential paths that give rise to the error: 1. Because the btree is not totally cached in memory, we may get an error when reading a btree node from the cache device (see bch_btree_node_get()); the likely errno is -EIO or -ENOMEM. 2. When a read miss happens, bch_btree_insert_check_key() will be called to insert a "replace_key" into the btree (see cached_dev_cache_miss(), just for doing preparatory work before inserting the missed data into the cache device); a failure can also happen in this situation, and the likely errno is -ENOMEM. bch_btree_map_keys() will return MAP_DONE in the normal scenario, but we will get either -EIO or -ENOMEM in the above two cases. If this happens, we should NOT recover data from the backing device (when the cache device is dirty) because we don't know whether the bkeys the read request covered are all clean. And after that happens, s->iop.status is still its initial value (0) before we submit s->bio.bio; we set it to BLK_STS_IOERR, so it can go into cached_dev_read_error(), and finally it can be passed to the upper layer, or recovered by rereading from the backing device. [edit by mlyle: patch formatting, word-wrap, comment spelling, commit log format] Signed-off-by: Hua Rui <huarui.dev@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
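The status fix-up described above amounts to a small rule: a negative return from the btree walk must not leave the request status at zero. A hedged sketch, with hypothetical names and a made-up numeric value for the I/O error status:

```c
#include <errno.h>

#define SKETCH_STS_OK    0
#define SKETCH_STS_IOERR 10 /* hypothetical value, for this sketch only */

/*
 * map_ret: result of the btree walk (0 when it finished normally,
 *          a negative errno such as -EIO or -ENOMEM on metadata failure).
 * status:  the request's current status.
 * Returns the status the request should carry afterwards.
 */
static int sketch_lookup_status(int map_ret, int status)
{
	/* A metadata failure must not be reported upward as success. */
	if (map_ret < 0 && status == SKETCH_STS_OK)
		return SKETCH_STS_IOERR;

	/* Keep an already recorded status, and leave successes alone. */
	return status;
}
```

With the status set, the request follows the normal read-error path instead of completing with a zero status.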
2018-01-06 | dm mpath: factor out SCSI vs NVMe path selection | Mike Snitzer
Trying to do both SCSI and NVMe bio-based handling with branching in the same common code has proven too tedious on a code maintenance level. In addition it slightly hurts IO performance. Fix this by factoring out __map_bio() and __map_bio_nvme(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-06 | dm mpath: optimize NVMe bio-based support | Mike Snitzer
All code that deals with pg_init is not used with bio-based NVMe mode. This includes skipping initialization of pg_init related variables. Also, pg_init related members on 'struct multipath' have been grouped together. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-06 | dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages() | Ming Lei
The bio is always freed after running crypt_free_buffer_pages(), so it isn't necessary to clear bv->bv_page. Cc: Mike Snitzer <snitzer@redhat.com> Cc:dm-devel@redhat.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 | block: move bio_alloc_pages() to bcache | Ming Lei
bcache is the only user of bio_alloc_pages(), so move this function into bcache, and avoid it being misused in the future. Also rename it to bch_bio_alloc_pages() since it is bcache only. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 | bcache: comment on direct access to bvec table | Ming Lei
All direct accesses to the bvec table are safe even after multipage bvec is supported. Cc: linux-bcache@vger.kernel.org Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 | dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE | Ming Lei
For bio-based DM, some targets, such as the crypt target, aren't ready to deal with incoming bios bigger than 1 Mbyte. Cc: Mike Snitzer <snitzer@redhat.com> Cc:dm-devel@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
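The 1 Mbyte figure follows from the limit named in the subject line. The sketch below works through the arithmetic, assuming 256 pages per bio and 4096-byte pages (both are assumptions for this example; the page size is architecture dependent), and clamps a length given in 512-byte sectors. The names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_BIO_MAX_PAGES 256u  /* assumed pages per bio */
#define SKETCH_PAGE_SIZE     4096u /* assumed page size in bytes */
#define SKETCH_SECTOR_SHIFT  9     /* 512-byte sectors */

/* 256 * 4096 = 1048576 bytes, which is 2048 sectors. */
#define SKETCH_MAX_BIO_SECTORS \
	((SKETCH_BIO_MAX_PAGES * SKETCH_PAGE_SIZE) >> SKETCH_SECTOR_SHIFT)

/* Clamp an I/O length in sectors to the maximum a single bio may carry. */
static uint64_t sketch_clamp_io_sectors(uint64_t len_sectors)
{
	if (len_sectors > SKETCH_MAX_BIO_SECTORS)
		return SKETCH_MAX_BIO_SECTORS;
	return len_sectors;
}
```

A request for 4096 sectors (2 Mbyte) would be cut to 2048 sectors, and the rest handled as a further bio.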
2018-01-06 | block: convert to bio_first_bvec_all & bio_first_page_all | Ming Lei
This patch converts to bio_first_bvec_all() & bio_first_page_all() for retrieving the 1st bvec/page, and prepares for supporting multipage bvec. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-04 | dm mpath: implement NVMe bio-based support | Mike Snitzer
This DM multipath NVMe bio-based support requires CONFIG_NVME_MULTIPATH to not be set. In the future hopefully NVMe multipath and DM multipath can co-exist more seamlessly. But as is, if CONFIG_NVME_MULTIPATH=Y then all the individual NVMe paths will remain hidden to upper layers and as such DM multipath will not be able to manage them. Though NVMe's native multipathing doesn't multipath namespaces across subsystems; so technically a user _could_ use CONFIG_NVME_MULTIPATH=Y and also use DM multipath to multipath across subsystems. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-03 | dm mpath: move dm_bio_restore out of endio method | Mike Snitzer
Moving the dm_bio_restore() to process_queued_bios() avoids doing that work in multipath_end_io_bio(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 | md/r5cache: print more info of log recovery | Song Liu
Log recovery is critical for raid5 journal/cache. Printing information about each recovery by default will help the system admin monitor the status of the array. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-12-20 | dm mpath: optimize retrieval of bio_details from per-bio-data | Mike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 | dm mpath: remove unnecessary memset() calls for per-io-data | Mike Snitzer
All underlying members are initialized directly so the memset() calls are not needed. Also, initialize mpio->nr_bytes from the start since it never changes. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 | dm mpath: remove unused param from multipath_init_per_bio_data() | Mike Snitzer
'struct dm_bio_details *' isn't ever needed. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 | dm: optimize bio-based NVMe IO submission | Mike Snitzer
Upper level bio-based drivers that stack immediately on top of NVMe can leverage direct_make_request(). In addition DM's NVMe bio-based support will initially only ever have one NVMe device that it submits IO to at a time. There is no splitting needed. Enhance DM core so that DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these characteristics. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20 | dm: introduce DM_TYPE_NVME_BIO_BASED | Mike Snitzer
If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then all devices in the DM table do not support partial completions. Also, the table has a single immutable target that doesn't require DM core to split bios. This will enable adding NVMe optimizations to bio-based DM. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-17 | dm: simplify start of block stats accounting for bio-based | Mike Snitzer
No apparent need to call generic_start_io_acct() until the IO is ready for submission. start_io_acct() is the proper place to do this accounting -- it is also where DM accounts for pending IO and, if enabled, starts dm-stats accounting. Replace start_io_acct()'s part_round_stats() with generic_start_io_acct(). This eliminates needing to take part_stat_lock() multiple times when starting an IO on bio-based devices. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 | dm: remove redundant mapped_device member from clone_info structure | Mike Snitzer
'struct dm_io' already has the same pointer. So update all accesses from ci->md to ci->io->md. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 | dm: remove now unused bio-based io_pool and _io_cache | Mike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 | dm: improve performance by moving dm_io structure to per-bio-data | Mike Snitzer
Eliminates need for a separate mempool to allocate 'struct dm_io' objects from. As such, it saves an extra mempool allocation for each original bio that DM core is issued. This complicates the per-bio-data accessor functions by needing to conditionally add extra padding to get to a target's per-bio-data. But in the end this provides a decent performance improvement for all bio-based DM devices. On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based DM linear performance improved by 2% (went from 2665 to 2777 MB/s). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 | dm: rename 'bio' member of dm_io structure to 'orig_bio' | Mike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16 | dm: remove stale comment blocks | Mike Snitzer
These CRUD comments have worn out their welcome. The code is what it is, over time it'll hopefully get better. But these comments serve no purpose whatsoever. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-15 | Merge tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm | Linus Torvalds
Pull device mapper fixes from Mike Snitzer:
- fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.
- fix various targets to dm_register_target after module __init resources created; otherwise racing lvm2 commands could result in a NULL pointer during initialization of associated DM kernel module.
- fix regression in bio-based DM multipath queue_if_no_path handling.
- fix DM bufio's shrinker to reclaim more than one buffer per scan.
* tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
  dm mpath: fix bio-based multipath queue_if_no_path handling
  dm: fix various targets to dm_register_target after module __init resources created
  dm table: fix regression from improper dm_dev_internal.count refcount_t conversion
2017-12-13 | dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions() | Mike Snitzer
Rather than having DAX support be unique by setting it based on table type in dm_setup_md_queue(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm: fix __send_changing_extent_only() to send first bio and chain remainder | Mike Snitzer
__send_changing_extent_only() must follow the same pattern that was established with commit "dm: ensure bio submission follows a depth-first tree walk". That is: submit first bio up to split boundary and then split the remainder to further submissions. Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs | Mike Snitzer
alloc_multiple_bios() assumes it can allocate the requested number of bios but until now there was no guarantee that the mempools would be accommodating. Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure | Mike Snitzer
Now that all of DM has been revised and/or verified to no longer require the use of BIOSET_NEED_RESCUER the dm_offload code may be removed. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm: safely allocate multiple bioset bios | Mike Snitzer
DM targets can request multiple bios be sent to them by DM core (see: num_{flush,discard,write_same,write_zeroes}_bios). But until now these bios were allocated in an unsafe manner that could potentially exhaust the DM device's bioset -- in the face of multiple threads each trying to do multiple allocations from the same DM device's bioset. Fix __send_duplicate_bios() by using the new alloc_multiple_bios(). The allocation strategy used by alloc_multiple_bios() models that used by dm-crypt.c:crypt_alloc_buffer(). Neil Brown initially proposed this fix but the implementation has been revised enough that it is inappropriate to attribute the entirety of it to him. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm: remove unused 'num_write_bios' target interface | NeilBrown
No DM target provides num_write_bios and none has since dm-cache's brief use in 2013. Having the possibility of num_write_bios > 1 complicates bio allocation. So remove the interface and assume there is only one bio needed. If a target ever needs more, it must provide a suitable bioset and allocate itself based on its particular needs. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm: ensure bio submission follows a depth-first tree walk | NeilBrown
A dm device can, in general, represent a tree of targets, each of which handles a sub-range of the range of blocks handled by the parent. The bio sequencing managed by generic_make_request() requires that bios are generated and handled in a depth-first manner. Each call to a make_request_fn() may submit bios to a single member device, and may submit bios for a reduced region of the same device as the make_request_fn. In particular, any bios submitted to member devices must be expected to be processed in order, so a later one must never wait for an earlier one. This ordering is usually achieved by using bio_split() to reduce a bio to a size that can be completely handled by one target, and resubmitting the remainder to the originating device. bio_queue_split() shows the canonical approach. dm doesn't follow this approach, largely because it has needed to split bios since long before bio_split() was available. It currently can submit bios to separate targets within the one dm_make_request() call. Dependencies between these targets, as can happen with dm-snap, can cause deadlocks if either bio gets stuck behind the other in the queues managed by generic_make_request(). This requires the 'rescue' functionality provided by dm_offload_{start,end}. Some of this requirement can be removed by changing the order of bio submission to follow the canonical approach. That is, if dm finds that it needs to split a bio, the remainder should be sent to generic_make_request() rather than being handled immediately. This delays the handling until the first part is completely processed, so the deadlock problems do not occur. __split_and_process_bio() can be called both from dm_make_request() and from dm_wq_work(). When called from dm_wq_work() the current approach is perfectly satisfactory as each bio will be processed immediately. When called from dm_make_request(), current->bio_list will be non-NULL, and in this case it is best to create a separate "clone" bio for the remainder.
When we use bio_clone_bioset() to split off the front part of a bio and chain the two together and submit the remainder to generic_make_request(), it is important that the newly allocated bio is used as the head to be processed immediately, and the original bio gets "bio_advance()"d and sent to generic_make_request() as the remainder. Otherwise, if the newly allocated bio is used as the remainder, and if it then needs to be split again, then the next bio_clone_bioset() call will be made while holding a reference to a bio (result of the first clone) from the same bioset. This can potentially exhaust the bioset mempool and result in a memory allocation deadlock. Note that there is no race caused by reassigning cio.io->bio after already calling __map_bio(). This bio will only be dereferenced again after dec_pending() has found io->io_count to be zero, and this cannot happen before the dec_pending() call at the end of __split_and_process_bio(). To provide the clone bio when splitting, we use q->bio_split. This was previously being freed by bio-based dm to avoid having excess rescuer threads. As bio_split bio sets no longer create rescuer threads, there is little cost and much gain from restoring the q->bio_split bio set. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
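The canonical "handle the front, resubmit the remainder" pattern described above can be sketched on plain sector ranges. This is a userspace illustration with hypothetical names, assuming fixed-size targets; the loop stands in for generic_make_request() picking up the resubmitted remainder after the front piece has been fully processed.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PIECES 32

struct sketch_range {
	uint64_t start; /* first sector */
	uint64_t len;   /* number of sectors */
};

struct sketch_log {
	struct sketch_range piece[SKETCH_MAX_PIECES]; /* pieces in handling order */
	size_t n;
};

/* Sectors from start up to the end of the target that contains it. */
static uint64_t sketch_max_len(uint64_t start, uint64_t target_len)
{
	return target_len - (start % target_len);
}

/*
 * Split a request along target boundaries. Each iteration handles only the
 * front piece that fits in one target, then advances the original range,
 * which plays the role of the remainder that is resubmitted.
 * target_len must be non-zero (caller contract in this sketch).
 */
static void sketch_submit(struct sketch_range bio, uint64_t target_len,
			  struct sketch_log *log)
{
	while (bio.len > 0 && log->n < SKETCH_MAX_PIECES) {
		uint64_t front = sketch_max_len(bio.start, target_len);

		if (front > bio.len)
			front = bio.len;

		/* The front piece is processed completely before the rest. */
		log->piece[log->n].start = bio.start;
		log->piece[log->n].len = front;
		log->n++;

		/* Advance the original and treat it as the remainder. */
		bio.start += front;
		bio.len -= front;
	}
}
```

A request for sectors 6 to 15 over 8-sector targets is handled as sectors 6 to 7 first, then sectors 8 to 15, so no later piece is started before an earlier one has been dealt with.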
2017-12-13 | dm io: remove BIOSET_NEED_RESCUER flag from bios bioset | NeilBrown
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might do two allocations from the one bioset, and the second one could block until the first bio completes. dm_io() is called from make_request_fn() context. The closest it comes to multiple allocations is in chunk_io() in dm-snap-persistent. But there the code uses a separate thread to avoid problems. So BIOSET_NEED_RESCUER is not needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm crypt: remove BIOSET_NEED_RESCUER flag | NeilBrown
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might do two allocations from the one bioset, and the second one could block until the first bio completes. dm-crypt does allocate from this bioset inside the dm make_request_fn, but does so using GFP_NOWAIT so that the allocation will not block. So BIOSET_NEED_RESCUER is not needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm: fix comment above dm_accept_partial_bio | NeilBrown
Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET bios. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm raid: use rs_is_raid*() | Heinz Mauelshagen
Cleanup, no functional change. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm raid: simplify rs_get_progress() | Heinz Mauelshagen
No need to calculate the reshaping progress because mddev->curr_resync_completed holds it. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13 | dm raid: ensure 'a' chars during reshape | Heinz Mauelshagen
During reshape, 'A' chars were reported in status rather than 'a'. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>