summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2018-01-08bcache: Fix, improve efficiency of closure_sync()Kent Overstreet
Eliminates cases where sync can race and fail to complete / get stuck. Removes many status flags and simplifies entering-and-exiting closure sleeping behaviors. [mlyle: fixed conflicts due to changed return behavior in mainline. extended commit comment, and squashed down two commits that were mostly contradictory to get to this state. Changed __set_current_state to set_current_state per Jens review comment] Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: allow quick writeback when backing idleMichael Lyle
If the control system would wait for at least half a second, and there's been no reqs hitting the backing disk for awhile: use an alternate mode where we have at most one contiguous set of writebacks in flight at a time. (But don't otherwise delay). If front-end IO appears, it will still be quick, as it will only have to contend with one real operation in flight. But otherwise, we'll be sending data to the backing disk as quickly as it can accept it (with one op at a time). Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: writeback: properly order backing device IOMichael Lyle
Writeback keys are presently iterated and dispatched for writeback in order of the logical block address on the backing device. Multiple may be, in parallel, read from the cache device and then written back (especially when there are contiguous I/O). However-- there was no guarantee with the existing code that the writes would be issued in LBA order, as the reads from the cache device are often re-ordered. In turn, when writing back quickly, the backing disk often has to seek backwards-- this slows writeback and increases utilization. This patch introduces an ordering mechanism that guarantees that the original order of issue is maintained for the write portion of the I/O. Performance for writeback is significantly improved when there are multiple contiguous keys or high writeback rates. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Tested-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: fix wrong return value in bch_debug_init()Tang Junhui
in bch_debug_init(), ret is always 0, and the return value is useless, change it to return 0 if be success after calling debugfs_create_dir(), else return a non-zero value. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: segregate flash only volume write streamsTang Junhui
In such scenario that there are some flash only volumes , and some cached devices, when many tasks request these devices in writeback mode, the write IOs may fall to the same bucket as bellow: | cached data | flash data | cached data | cached data| flash data| then after writeback of these cached devices, the bucket would be like bellow bucket: | free | flash data | free | free | flash data | So, there are many free space in this bucket, but since data of flash only volumes still exists, so this bucket cannot be reclaimable, which would cause waste of bucket space. In this patch, we segregate flash only volume write streams from cached devices, so data from flash only volumes and cached devices can store in different buckets. Compare to v1 patch, this patch do not add a additionally open bucket list, and it is try best to segregate flash only volume write streams from cached devices, sectors of flash only volumes may still be mixed with dirty sectors of cached device, but the number is very small. [mlyle: fixed commit log formatting, permissions, line endings] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: Use PTR_ERR_OR_ZERO()Vasyl Gomonovych
Fix ptr_ret.cocci warnings: drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: stop writeback thread after detachingTang Junhui
Currently, when a cached device detaching from cache, writeback thread is not stopped, and writeback_rate_update work is not canceled. For example, after the following command: echo 1 >/sys/block/sdb/bcache/detach you can still see the writeback thread. Then you attach the device to the cache again, bcache will create another writeback thread, for example, after below command: echo ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach then you will see 2 writeback threads. This patch stops writeback thread and cancels writeback_rate_update work when cached device detaching from cache. Compare with patch v1, this v2 patch moves code down into the register lock for safety in case of any future changes as Coly and Mike suggested. [edit by mlyle: commit log spelling/formatting] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: ret IOERR when read meets metadata errorRui Hua
The read request might meet error when searching the btree, but the error was not handled in cache_lookup(), and this kind of metadata failure will not go into cached_dev_read_error(), finally, the upper layer will receive bi_status=0. In this patch we judge the metadata error by the return value of bch_btree_map_keys(), there are two potential paths give rise to the error: 1. Because the btree is not totally cached in memery, we maybe get error when read btree node from cache device (see bch_btree_node_get()), the likely errno is -EIO, -ENOMEM 2. When read miss happens, bch_btree_insert_check_key() will be called to insert a "replace_key" to btree(see cached_dev_cache_miss(), just for doing preparatory work before insert the missed data to cache device), a failure can also happen in this situation, the likely errno is -ENOMEM bch_btree_map_keys() will return MAP_DONE in normal scenario, but we will get either -EIO or -ENOMEM in above two cases. if this happened, we should NOT recover data from backing device (when cache device is dirty) because we don't know whether bkeys the read request covered are all clean. And after that happened, s->iop.status is still its initially value(0) before we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into cached_dev_read_error(), and finally it can be passed to upper layer, or recovered by reread from backing device. [edit by mlyle: patch formatting, word-wrap, comment spelling, commit log format] Signed-off-by: Hua Rui <huarui.dev@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06dm mpath: factor out SCSI vs NVMe path selectionMike Snitzer
Trying to do both SCSI and NVMe bio-based handling with branching in the same common code has proven too tedious on a code maintenance level. In addition it slightly hurts IO performance. Fix this by factoring out __map_bio() and __map_bio_nvme(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-06dm mpath: optimize NVMe bio-based supportMike Snitzer
All code that deals with pg_init is not used with bio-based NVMe mode. This includes skipping initialization of pg_init related variables. Also, pg_init related members on 'struct multipath' have been grouped together. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-06dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages()Ming Lei
The bio is always freed after running crypt_free_buffer_pages(), so it isn't necessary to clear bv->bv_page. Cc: Mike Snitzer <snitzer@redhat.com> Cc:dm-devel@redhat.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06block: move bio_alloc_pages() to bcacheMing Lei
bcache is the only user of bio_alloc_pages(), so move this function into bcache, and avoid it being misused in the future. Also rename it to bch_bio_allo_pages() since it is bcache only. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06bcache: comment on direct access to bvec tableMing Lei
All direct access to bvec table are safe even after multipage bvec is supported. Cc: linux-bcache@vger.kernel.org Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZEMing Lei
For BIO based DM, some targets aren't ready for dealing with bigger incoming bio than 1Mbyte, such as crypt target. Cc: Mike Snitzer <snitzer@redhat.com> Cc:dm-devel@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06block: convert to bio_first_bvec_all & bio_first_page_allMing Lei
This patch converts to bio_first_bvec_all() & bio_first_page_all() for retrieving the 1st bvec/page, and prepares for supporting multipage bvec. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-04dm mpath: implement NVMe bio-based supportMike Snitzer
This DM multipath NVMe bio-based support requires CONFIG_NVME_MULTIPATH to not be set. In the future hopefully NVMe multipath and DM multipath can co-exist more seemlessly. But as is, if CONFIG_NVME_MULTIPATH=Y then all the individal NVMe paths will remain hidden to upper layers and as such DM multipath will not be able to manage them. Though NVMe's native multipathing doesn't multipath namespaces across subsystems; so technically a user _could_ use CONFIG_NVME_MULTIPATH=Y and also use DM multipath to multipath across subsystems. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-01-03dm mpath: move dm_bio_restore out of endio methodMike Snitzer
Moving the dm_bio_restore() to process_queued_bios() avoids doing that work in multipath_end_io_bio(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20md/r5cache: print more info of log recoverySong Liu
Log recovery is critical for raid5 journal/cache. Printing information about each recovery by default will help the system admin monitor the status of the array. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-12-20dm mpath: optimize retrieval of bio_details from per-bio-dataMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20dm mpath: remove unnecessary memset() calls for per-io-dataMike Snitzer
All underlying members are initialized directly so the memset() calls are not needed. Also, initialize mpio->nr_bytes from the start since it never changes. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20dm mpath: remove unused param from multipath_init_per_bio_data()Mike Snitzer
'struct dm_bio_details *' isn't ever needed. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20dm: optimize bio-based NVMe IO submissionMike Snitzer
Upper level bio-based drivers that stack immediately ontop of NVMe can leverage direct_make_request(). In addition DM's NVMe bio-based will initially only ever have one NVMe device that it submits IO to at a time. There is no splitting needed. Enhance DM core so that DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these characteristics. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-20dm: introduce DM_TYPE_NVME_BIO_BASEDMike Snitzer
If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then all devices in the DM table do not support partial completions. Also, the table has a single immutable target that doesn't require DM core to split bios. This will enable adding NVMe optimizations to bio-based DM. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-17dm: simplify start of block stats accounting for bio-basedMike Snitzer
No apparent need to generic_start_io_acct() until before the IO is ready for submission. start_io_acct() is the proper place to do this accounting -- it is also where DM accounts for pending IO and, if enabled, starts dm-stats accounting. Replace start_io_acct()'s part_round_stats() with generic_start_io_acct(). This eliminates needing to take part_stat_lock() multiple times when starting an IO on bio-based devices. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16dm: remove redundant mapped_device member from clone_info structureMike Snitzer
'struct dm_io' already has the same pointer. So update all accesses from ci->md to ci->io->md. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16dm: remove now unused bio-based io_pool and _io_cacheMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16dm: improve performance by moving dm_io structure to per-bio-dataMike Snitzer
Eliminates need for a separate mempool to allocate 'struct dm_io' objects from. As such, it saves an extra mempool allocation for each original bio that DM core is issued. This complicates the per-bio-data accessor functions by needing to conditonally add extra padding to get to a target's per-bio-data. But in the end this provides a decent performance improvement for all bio-based DM devices. On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based DM linear performance improved by 2% (went from 2665 to 2777 MB/s). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16dm: rename 'bio' member of dm_io structure to 'orig_bio'Mike Snitzer
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-16dm: remove stale comment blocksMike Snitzer
These CRUD comments have worn out their welcome. The code is what it is, over time it'll hopefully get better. But these comments serve no purpose whatsoever. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-15Merge tag 'for-4.15/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - fix a particularly nasty DM core bug in a 4.15 refcount_t conversion. - fix various targets to dm_register_target after module __init resources created; otherwise racing lvm2 commands could result in a NULL pointer during initialization of associated DM kernel module. - fix regression in bio-based DM multipath queue_if_no_path handling. - fix DM bufio's shrinker to reclaim more than one buffer per scan. * tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm bufio: fix shrinker scans when (nr_to_scan < retain_target) dm mpath: fix bio-based multipath queue_if_no_path handling dm: fix various targets to dm_register_target after module __init resources created dm table: fix regression from improper dm_dev_internal.count refcount_t conversion
2017-12-13dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions()Mike Snitzer
Rather than having DAX support be unique by setting it based on table type in dm_setup_md_queue(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm: fix __send_changing_extent_only() to send first bio and chain remainderMike Snitzer
__send_changing_extent_only() must follow the same pattern that was established with commit "dm: ensure bio submission follows a depth-first tree walk". That is: submit first bio up to split boundary and then split the remainder to further submissions. Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOsMike Snitzer
alloc_multiple_bios() assumes it can allocate the requested number of bios but until now there was no gaurantee that the mempools would be accomodating. Suggested-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructureMike Snitzer
Now that all of DM has been revised and/or verified to no longer require the use of BIOSET_NEED_RESCUER the dm_offload code may be removed. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm: safely allocate multiple bioset biosMike Snitzer
DM targets can request multiple bios be sent to them by DM core (see: num_{flush,discard,write_same,write_zeroes}_bios). But until now these bios were allocated in an unsafe manner than could potentially exhaust the DM device's bioset -- in the face of multiple threads each trying to do multiple allocations from the same DM device's bioset. Fix __send_duplicate_bios() by using the new alloc_multiple_bios(). The allocation strategy used by alloc_multiple_bios() models that used by dm-crypt.c:crypt_alloc_buffer(). Neil Brown initially proposed this fix but the implementation has been revised enough that it inappropriate to attribute the entirety of it to him. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm: remove unused 'num_write_bios' target interfaceNeilBrown
No DM target provides num_write_bios and none has since dm-cache's brief use in 2013. Having the possibility of num_write_bios > 1 complicates bio allocation. So remove the interface and assume there is only one bio needed. If a target ever needs more, it must provide a suitable bioset and allocate itself based on its particular needs. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm: ensure bio submission follows a depth-first tree walkNeilBrown
A dm device can, in general, represent a tree of targets, each of which handles a sub-range of the range of blocks handled by the parent. The bio sequencing managed by generic_make_request() requires that bios are generated and handled in a depth-first manner. Each call to a make_request_fn() may submit bios to a single member device, and may submit bios for a reduced region of the same device as the make_request_fn. In particular, any bios submitted to member devices must be expected to be processed in order, so a later one must never wait for an earlier one. This ordering is usually achieved by using bio_split() to reduce a bio to a size that can be completely handled by one target, and resubmitting the remainder to the originating device. bio_queue_split() shows the canonical approach. dm doesn't follow this approach, largely because it has needed to split bios since long before bio_split() was available. It currently can submit bios to separate targets within the one dm_make_request() call. Dependencies between these targets, as can happen with dm-snap, can cause deadlocks if either bios gets stuck behind the other in the queues managed by generic_make_request(). This requires the 'rescue' functionality provided by dm_offload_{start,end}. Some of this requirement can be removed by changing the order of bio submission to follow the canonical approach. That is, if dm finds that it needs to split a bio, the remainder should be sent to generic_make_request() rather than being handled immediately. This delays the handling until the first part is completely processed, so the deadlock problems do not occur. __split_and_process_bio() can be called both from dm_make_request() and from dm_wq_work(). When called from dm_wq_work() the current approach is perfectly satisfactory as each bio will be processed immediately. When called from dm_make_request(), current->bio_list will be non-NULL, and in this case it is best to create a separate "clone" bio for the remainder. When we use bio_clone_bioset() to split off the front part of a bio and chain the two together and submit the remainder to generic_make_request(), it is important that the newly allocated bio is used as the head to be processed immediately, and the original bio gets "bio_advance()"d and sent to generic_make_request() as the remainder. Otherwise, if the newly allocated bio is used as the remainder, and if it then needs to be split again, then the next bio_clone_bioset() call will be made while holding a reference a bio (result of the first clone) from the same bioset. This can potentially exhaust the bioset mempool and result in a memory allocation deadlock. Note that there is no race caused by reassigning cio.io->bio after already calling __map_bio(). This bio will only be dereferenced again after dec_pending() has found io->io_count to be zero, and this cannot happen before the dec_pending() call at the end of __split_and_process_bio(). To provide the clone bio when splitting, we use q->bio_split. This was previously being freed by bio-based dm to avoid having excess rescuer threads. As bio_split bio sets no longer create rescuer threads, there is little cost and much gain from restoring the q->bio_split bio set. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm io: remove BIOSET_NEED_RESCUER flag from bios biosetNeilBrown
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might do two allocations from the one bioset, and the second one could block until the first bio completes. dm_io() is called from make_request_fn() context. The closest it comes to multiple allocations is in chunk_io() in dm-snap-persistent. But there the code uses a separate thread to avoid problems. So BIOSET_NEED_RESCUER is not needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm crypt: remove BIOSET_NEED_RESCUER flagNeilBrown
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might do two allocations from the one bioset, and the second one could block until the first bio completes. dm-crypt does allocate from this bioset inside the dm make_request_fn, but does so using GFP_NOWAIT so that the allocation will not block. So BIOSET_NEED_RESCUER is not needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm: fix comment above dm_accept_partial_bioNeilBrown
Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET bios. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm raid: use rs_is_raid*()Heinz Mauelshagen
Cleanup, no functional change. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm raid: simplify rs_get_progress()Heinz Mauelshagen
No need to calculate the reshaping progress because mddev->curr_resync_completed holds it. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm raid: ensure 'a' chars during reshapeHeinz Mauelshagen
During reshape, 'A' chars were reported in status rather than 'a'. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm raid: stop keeping raid set frozen altogetherHeinz Mauelshagen
In order to avoid redoing synchronization/recovery/reshape partially, the raid set got frozen until after all passed in table line flags had been cleared. The related table reload sequence had to be precisely followed, or reshaping may lead to data corruption caused by the active mapping carrying on with a reshape when the inactive mapping already had retrieved a stale reshape position. Harden by retrieving the actual resync/recovery/reshape position during resume whilst the active table is suspended thus avoiding to keep the raid set frozen altogether. This prevents superfluous redoing of an already resynchronized or recovered segment and, most importantly, potential for redoing of an already reshaped segment causing data corruption. Fixes: d39f0010e ("dm raid: fix raid_resume() to keep raid set frozen as needed") Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-13dm raid: validate current raid sets redundancyHeinz Mauelshagen
Verifying the current raid sets redundancy based on retrieved superblock content has to use the superblock's raid level (e.g. raid0), not the constructor requested one (e.g. raid10). Using the requested raid level of raid10 lead to a "divide error" on raid0 which defines data copies divided by to be zero. Also check for bogus data copies. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-11md/raid1,raid10: silence warning about wait-within-waitNeilBrown
If you prepare_to_wait() after a previous prepare_to_wait(), but before calling schedule(), you get warning: do not call blocking ops when !TASK_RUNNING; state=2 This is appropriate as it is often a bug. The event that the first prepare_to_wait() expects might wake up the schedule following the second prepare_to_wait(), which could be confusing. However if both prepare_to_wait()s are part of simple wait_event() loops, and if the inner one is rarely called, then there is no problem. The inner loop is too simple to get confused by a stray wakeup, and the outer loop won't spin unduly because the inner doesnt affect it often. This pattern occurs in both raid1.c and raid10.c in the use of flush_pending_writes(). The warning can be silenced by setting current->state to TASK_RUNNING. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-12-11md: introduce new personality funciton start()Song Liu
In do_md_run(), md threads should not wake up until the array is fully initialized in md_run(). However, in raid5_run(), raid5-cache may wake up mddev->thread to flush stripes that need to be written back. This design doesn't break badly right now. But it could lead to bad bug in the future. This patch tries to resolve this problem by splitting start up work into two personality functions, run() and start(). Tasks that do not require the md threads should go into run(), while task that require the md threads go into start(). r5l_load_log() is moved to raid5_start(), so it is not called until the md threads are started in do_md_run(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-12-08dm raid: bump target version to reflect numerous fixesMike Snitzer
Also update Documentation accordingly. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-08dm raid: small cleanup and remove unsed "struct raid_set" memberHeinz Mauelshagen
Move raid_resume()'s setting of 'rw' and 'in_sync' to just prior to mddev_resume(). Also, remove unused 'bitmap_loaded' member from "struct raid_set". No functional changes. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-12-08dm raid: fix rs_get_progress() synchronization state/ratioHeinz Mauelshagen
Fix various sync state issues causing racy/bogus sync ratio, sync_action ad health chars in dm_status() info output. Sync ratio could be N/N (i.e. 100%) shortly after raid set creation, i.e. creating a new RaidLV or upconverting a linear LV to raid1 thus: "0 2097152 raid raid1 2 Aa 2097162/2097152 recover 0 0 -" instead of: "0 2097152 raid raid1 2 Aa 0/2097152 idle 0 0 -" Sync action could be non-idle, when the MD thread was done with io. Health chars could be 'A' when they should be 'a' for a short time before a resynchonization started. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>