summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2018-08-11bcache: add missing SPDX headerColy Li
The SPDX header is missing fro closure.c, super.c and util.c, this patch adds SPDX header for GPL-2.0 into these files. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: move open brace at end of function definitions to next lineColy Li
This is not a preferred style to place open brace '{' at the end of function definition, checkpatch.pl reports error for such coding style. This patch moves them into the start of the next new line. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: add static const prefix to char * array declarationsColy Li
This patch declares char * array with const prefix in sysfs.c, which is suggested by checkpatch.pl. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: fix code comments styleColy Li
This patch fixes 3 style issues warned by checkpatch.pl, - Comment lines are not aligned - Comments use "/*" on subsequent lines - Comment lines use a trailing "*/" Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: do not check NULL pointer before calling kmem_cache_destroyColy Li
kmem_cache_destroy() is safe for NULL pointer as input, the NULL pointer checking is unncessary. This patch just removes the NULL pointer checking to make code simpler. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: prefer 'help' in KconfigColy Li
Current bcache Kconfig uses '---help---' as header of help information, for now 'help' is prefered. This patch fixes this style by replacing '---help---' by 'help' in bcache Kconfig file. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: fix typo 'succesfully' to 'successfully'Coly Li
This patch fixes typo 'succesfully' to correct 'successfully', which is suggested by checkpatch.pl. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: replace '%pF' by '%pS' in seq_printf()Coly Li
'%pF' and '%pf' are deprecated vsprintf pointer extensions, this patch replace them by '%pS', which is suggested by checkpatch.pl. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: fix indent by replacing blank by tabsColy Li
bch_btree_insert_check_key() has unaligned indent, or indent by blank characters. This patch makes the indent aligned and replace blank by tabs. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: replace printk() by pr_*() routinesColy Li
There are still many places in bcache use printk to display kernel message, which are suggested to be preplaced by pr_*() routines like pr_err(), pr_info(), or pr_notice(). This patch replaces all printk() with a proper pr_*() routine for bcache code. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: replace Symbolic permissions by octal permission numbersColy Li
Symbolic permission names are used in bcache, for now octal permission numbers are encouraged to use for readability. This patch replaces all symbolic permissions by octal permission numbers. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: style fixes for lines over 80 charactersColy Li
This patch fixes the lines over 80 characters into more lines, to minimize warnings by checkpatch.pl. There are still some lines exceed 80 characters, but it is better to be a single line and I don't change them. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: add identifier names to arguments of function definitionsColy Li
There are many function definitions do not have identifier argument names, scripts/checkpatch.pl complains warnings like this, WARNING: function definition argument 'struct bcache_device *' should also have an identifier name #16735: FILE: writeback.h:120: +void bch_sectors_dirty_init(struct bcache_device *); This patch adds identifier argument names to all bcache function definitions to fix such warnings. Signed-off-by: Coly Li <colyli@suse.de> Reviewed: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: style fix to add a blank line after declarationsColy Li
Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: style fix to replace 'unsigned' by 'unsigned int'Coly Li
This patch fixes warning reported by checkpatch.pl by replacing 'unsigned' with 'unsigned int'. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-10bcache: fix error setting writeback_rate through sysfs interfaceColy Li
Commit ea8c5356d390 ("bcache: set max writeback rate when I/O request is idle") changes struct bch_ratelimit member rate from uint32_t to atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c to set new writeback rate, after the input is converted from memory buf to long int by sysfs_strtoul_clamp(). The above change has a problem because there is an implicit return inside sysfs_strtoul_clamp() so the following atomic_long_set() won't be called. This error is detected by 0day system with following snipped smatch warnings: drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized symbol 'v'. 270 sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @271 atomic_long_set(&dc->writeback_rate.rate, v); This patch fixes the above error by using strtoul_safe_clamp() to convert the input buffer into a long int type result. Fixes: ea8c5356d390 ("bcache: set max writeback rate when I/O request is idle") Cc: Kai Krakow <kai@kaishome.de> Cc: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09dm cache metadata: set dirty on all cache blocks after a crashIlya Dryomov
Quoting Documentation/device-mapper/cache.txt: The 'dirty' state for a cache block changes far too frequently for us to keep updating it on the fly. So we treat it as a hint. In normal operation it will be written when the dm device is suspended. If the system crashes all cache blocks will be assumed dirty when restarted. This got broken in commit f177940a8091 ("dm cache metadata: switch to using the new cursor api for loading metadata") in 4.9, which removed the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk flag) when loading cache blocks. This results in data corruption on an unclean shutdown with dirty cache blocks on the fast device. After the crash those blocks are considered clean and may get evicted from the cache at any time. This can be demonstrated by doing a lot of reads to trigger individual evictions, but uncache is more predictable: ### Disable auto-activation in lvm.conf to be able to do uncache in ### time (i.e. see uncache doing flushing) when the fix is applied. # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb # vgcreate vg_cache /dev/vdb /dev/vdc # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2 0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ 0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ # dmsetup status vg_cache-lv_slowdev 0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw - ^^^^ 7065 * 64k = 441M yet to be written to the slow device # echo b >/proc/sysrq-trigger # vgchange -ay vg_cache # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2 0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ 0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ # lvconvert --uncache vg_cache/lv_slowdev Flushing 0 blocks for cache vg_cache/lv_slowdev. Logical volume "lv_cachedev" successfully removed Logical volume vg_cache/lv_slowdev is not cached. # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2 0fe00000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................ 0fe00010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................ This is the case with both v1 and v2 cache pool metatata formats. After applying this patch: # vgchange -ay vg_cache # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2 0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ 0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ # lvconvert --uncache vg_cache/lv_slowdev Flushing 3724 blocks for cache vg_cache/lv_slowdev. ... Flushing 71 blocks for cache vg_cache/lv_slowdev. Logical volume "lv_cachedev" successfully removed Logical volume vg_cache/lv_slowdev is not cached. # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2 0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ 0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ Cc: stable@vger.kernel.org Fixes: f177940a8091 ("dm cache metadata: switch to using the new cursor api for loading metadata") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-09bcache: trivial - remove tailing backslash in macro BTREE_FLAGShenghui Wang
Remove the tailing backslash in macro BTREE_FLAG in btree.h Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: make the pr_err statement used for ENOENT only in sysfs_attatch sectionShenghui Wang
The pr_err statement in the code for sysfs_attatch section would run for various error codes, which maybe confusing. E.g, Run the command twice: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach [the backing dev got attached on the first run] echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach In dmesg, after the command run twice, we can get: bcache: bch_cached_dev_attach() Can't attach sda6: already attached bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\ a8df5e8be891 : cache set not found The first statement in the message was right, but the second was confusing. bch_cached_dev_attach has various pr_ statements for various error codes, except ENOENT. After the change, rerun above command twice: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach In dmesg we only got: bcache: bch_cached_dev_attach() Can't attach sda6: already attached No confusing "cache set not found" message anymore. And for some not exist SET-UUID: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \ /sys/block/bcache0/bcache/attach In dmesg we can get: bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\ a8df5e8be898 : cache set not found Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: set max writeback rate when I/O request is idleColy Li
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") allows the writeback rate to be faster if there is no I/O request on a bcache device. It works well if there is only one bcache device attached to the cache set. If there are many bcache devices attached to a cache set, it may introduce performance regression because multiple faster writeback threads of the idle bcache devices will compete the btree level locks with the bcache device who have I/O requests coming. This patch fixes the above issue by only permitting fast writebac when all bcache devices attached on the cache set are idle. And if one of the bcache devices has new I/O request coming, minimized all writeback throughput immediately and let PI controller __update_writeback_rate() to decide the upcoming writeback rate for each bcache device. Also when all bcache devices are idle, limited wrieback rate to a small number is wast of thoughput, especially when backing devices are slower non-rotation devices (e.g. SATA SSD). This patch sets a max writeback rate for each backing device if the whole cache set is idle. A faster writeback rate in idle time means new I/Os may have more available space for dirty data, and people may observe a better write performance then. Please note bcache may change its cache mode in run time, and this patch still works if the cache mode is switched from writeback mode and there is still dirty data on cache. Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") Cc: stable@vger.kernel.org #4.16+ Signed-off-by: Coly Li <colyli@suse.de> Tested-by: Kai Krakow <kai@kaishome.de> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Cc: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: add code comments for bset.cColy Li
This patch tries to add code comments in bset.c, to make some tricky code and designment to be more comprehensible. Most information of this patch comes from the discussion between Kent and I, he offers very informative details. If there is any mistake of the idea behind the code, no doubt that's from me misrepresentation. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: fix mistaken comments in request.cColy Li
This patch updates code comment in bch_keylist_realloc() by fixing incorrected function names, to make the code to be more comprehennsible. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: fix mistaken code comments in bcache.hColy Li
This patch updates the code comment in struct cache with correct array names, to make the code to be more comprehensible. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: add a comment in super.cColy Li
This patch adds a line of code comment in super.c:register_bdev(), to make code to be more comprehensible. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: avoid unncessary cache prefetch bch_btree_node_get()Coly Li
In bch_btree_node_get() the read-in btree node will be partially prefetched into L1 cache for following bset iteration (if there is). But if the btree node read is failed, the perfetch operations will waste L1 cache space. This patch checkes whether read operation and only does cache prefetch when read I/O succeeded. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: display rate debug parameters to 0 when writeback is not runningColy Li
When writeback is not running, writeback rate should be 0, other value is misleading. And the following dyanmic writeback rate debug parameters should be 0 too, rate, proportional, integral, change otherwise they are misleading when writeback is not running. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: do not check return value of debugfs_create_dir()Coly Li
Greg KH suggests that normal code should not care about debugfs. Therefore no matter successful or failed of debugfs_create_dir() execution, it is unncessary to check its return value. There are two functions called debugfs_create_dir() and check the return value, which are bch_debug_init() and closure_debug_init(). This patch changes these two functions from int to void type, and ignore return values of debugfs_create_dir(). This patch does not fix exact bug, just makes things work as they should. Signed-off-by: Coly Li <colyli@suse.de> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: stable@vger.kernel.org Cc: Kai Krakow <kai@kaishome.de> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-08dm snapshot: remove stale FIXME in snapshot_map()Mike Snitzer
Commit ae1093be ("dm snapshot: use mutex instead of rw_semaphore") eliminated the need to worry about read vs write locking. So remove a FIXME in snapshot_map() that is concerned about selectively taking a write lock. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-08dm snapshot: improve performance by switching out_of_order_list to rbtreeDavid Jeffery
copy_complete()'s processing of out_of_order_list can result in quadratic complexity in the worst case. As such it was the source of consuming too much cpu and the source of significant loss in performance. Fix this by converting out_of_order_list to an rbtree. This improved a dm-snapshot test copy workload from 32 seconds to 4 seconds. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Brett Hull <bhull@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-08dm kcopyd: avoid softlockup in run_complete_jobJohn Pittman
It was reported that softlockups occur when using dm-snapshot ontop of slow (rbd) storage. E.g.: [ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177] ... [ 4048.034151] Workqueue: kcopyd do_work [dm_mod] [ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot] ... [ 4048.034190] Call Trace: [ 4048.034196] ? __chunk_is_tracked+0x70/0x70 [dm_snapshot] [ 4048.034200] run_complete_job+0x5f/0xb0 [dm_mod] [ 4048.034205] process_jobs+0x91/0x220 [dm_mod] [ 4048.034210] ? kcopyd_put_pages+0x40/0x40 [dm_mod] [ 4048.034214] do_work+0x46/0xa0 [dm_mod] [ 4048.034219] process_one_work+0x171/0x370 [ 4048.034221] worker_thread+0x1fc/0x3f0 [ 4048.034224] kthread+0xf8/0x130 [ 4048.034226] ? max_active_store+0x80/0x80 [ 4048.034227] ? kthread_bind+0x10/0x10 [ 4048.034231] ret_from_fork+0x35/0x40 [ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks Fix this by calling cond_resched() after run_complete_job()'s callout to the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above trace). Signed-off-by: John Pittman <jpittman@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-07dm cache metadata: save in-core policy_hint_size to on-disk superblockMike Snitzer
policy_hint_size starts as 0 during __write_initial_superblock(). It isn't until the policy is loaded that policy_hint_size is set in-core (cmd->policy_hint_size). But it never got recorded in the on-disk superblock because __commit_transaction() didn't deal with transfering the in-core cmd->policy_hint_size to the on-disk superblock. The in-core cmd->policy_hint_size gets initialized by metadata_open()'s __begin_transaction_flags() which re-reads all superblock fields. Because the superblock's policy_hint_size was never properly stored, when the cache was created, hints_array_available() would always return false when re-activating a previously created cache. This means __load_mappings() always considered the hints invalid and never made use of the hints (these hints served to optimize). Another detremental side-effect of this oversight is the cache_check utility would fail with: "invalid hint width: 0" Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-07dm thin: stop no_space_timeout worker when switching to write-modeHou Tao
Now both check_for_space() and do_no_space_timeout() will read & write pool->pf.error_if_no_space. If these functions run concurrently, as shown in the following case, the default setting of "queue_if_no_space" can get lost. precondition: * error_if_no_space = false (aka "queue_if_no_space") * pool is in Out-of-Data-Space (OODS) mode * no_space_timeout worker has been queued CPU 0: CPU 1: // delete a thin device process_delete_mesg() // check_for_space() invoked by commit() set_pool_mode(pool, PM_WRITE) pool->pf.error_if_no_space = \ pt->requested_pf.error_if_no_space // timeout, pool is still in OODS mode do_no_space_timeout // "queue_if_no_space" config is lost pool->pf.error_if_no_space = true pool->pf.mode = new_mode Fix it by stopping no_space_timeout worker when switching to write mode. Fixes: bcc696fac11f ("dm thin: stay in out-of-data-space mode once no_space_timeout expires") Cc: stable@vger.kernel.org Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-05Merge tag 'v4.18-rc6' into for-4.19/block2Jens Axboe
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-02md/raid5: fix data corruption of replacements after originals droppedBingJing Chang
During raid5 replacement, the stripes can be marked with R5_NeedReplace flag. Data can be read from being-replaced devices and written to replacing spares without reading all other devices. (It's 'replace' mode. s.replacing = 1) If a being-replaced device is dropped, the replacement progress will be interrupted and resumed with pure recovery mode. However, existing stripes before being interrupted cannot read from the dropped device anymore. It prints lots of WARN_ON messages. And it results in data corruption because existing stripes write problematic data into its replacement device and update the progress. \# Erase disks (1MB + 2GB) dd if=/dev/zero of=/dev/sda bs=1MB count=2049 dd if=/dev/zero of=/dev/sdb bs=1MB count=2049 dd if=/dev/zero of=/dev/sdc bs=1MB count=2049 dd if=/dev/zero of=/dev/sdd bs=1MB count=2049 mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152 \# Ensure array stores non-zero data dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB \# Start replacement mdadm /dev/md0 -a /dev/sdd mdadm /dev/md0 --replace /dev/sda Then, Hot-plug out /dev/sda during recovery, and wait for recovery done. echo check > /sys/block/md0/md/sync_action cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0. Soon after you hot-plug out /dev/sda, you will see many WARN_ON messages. The replacement recovery will be interrupted shortly. After the recovery finishes, it will result in data corruption. Actually, it's just an unhandled case of replacement. In commit <f94c0b6658c7> (md/raid5: fix interaction of 'replace' and 'recovery'.), if a NeedReplace device is not UPTODATE then that is an error, the commit just simply print WARN_ON but also mark these corrupted stripes with R5_WantReplace. (it means it's ready for writes.) To fix this case, we can leverage 'sync and replace' mode mentioned in commit <9a3e1101b827> (md/raid5: detect and handle replacements during recovery.). We can add logics to detect and use 'sync and replace' mode for these stripes. Reported-by: Alex Chen <alexchen@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-01md: Avoid namespace collision with bitmap APIAndy Shevchenko
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods. On the other hand MD bitmap API is special case. Adding 'md' prefix to it to avoid name space collision. No functional changes intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Shaohua Li <shli@kernel.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-08-01dm: Avoid namespace collision with bitmap APIAndy Shevchenko
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods. On the other hand DM bitmap API is special case. Adding 'dm' prefix to it to avoid potential name space collision. No functional changes intended. Suggested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2018-07-31dm kcopyd: return void from dm_kcopyd_copy()Mike Snitzer
dm_kcopyd_copy() only ever returns 0 so there is no need for callers to account for possible failure. Same goes for dm_kcopyd_zero(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-30md/dm-writecache: Don't request pointer dummy_addr when not requiredHuaisheng Ye
Function persistent_memory_claim doesn't need to get local pointer dummy_addr from direct_access. Using NULL instead of having to pass in a useless local pointer that caller then just throw away. Suggested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Huaisheng Ye <yehs1@lenovo.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-30dm thin: include metadata_low_watermark threshold in pool statusAndy Grover
The metadata low watermark threshold is set by the kernel. But the kernel depends on userspace to extend the thinpool metadata device when the threshold is crossed. Since the metadata low watermark threshold is not visible to userspace, upon receiving an event, userspace cannot tell that the kernel wants the metadata device extended, instead of some other eventing condition. Making it visible (but not settable) enables userspace to affirmatively know the kernel is asking for a metadata device extension, by comparing metadata_low_watermark against nr_free_blocks_metadata, also reported in status. Current solutions like dmeventd have their own thresholds for extending the data and metadata devices, and both devices are checked against their thresholds on each event. This lessens the value of the kernel-set threshold, since userspace will either extend the metadata device sooner, when receiving another event; or will receive the metadata lowater event and do nothing, if dmeventd's threshold is less than the kernel's. (This second case is dangerous. The metadata lowater event will not be re-sent, so no further event will be generated before the metadata device is out if space, unless some other event causes userspace to recheck its thresholds.) Signed-off-by: Andy Grover <agrover@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm writecache: report start_sector in status lineMikulas Patocka
Fixes: d284f8248c7 ("dm writecache: support optional offset for start of device") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm crypt: convert essiv from ahash to shashKees Cook
In preparing to remove all stack VLA usage from the kernel[1], remove the discouraged use of AHASH_REQUEST_ON_STACK in favor of the smaller SHASH_DESC_ON_STACK by converting from ahash-wrapped-shash to direct shash. The stack allocation will be made a fixed size in a later patch to the crypto subsystem. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm crypt: use wake_up_process() instead of a wait queueMikulas Patocka
This is a small simplification of dm-crypt - use wake_up_process() instead of a wait queue in a case where only one process may be waiting. dm-writecache uses a similar pattern. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm integrity: recalculate checksums on creationMikulas Patocka
When using external metadata device and internal hash, recalculate the checksums when the device is created - so that dm-integrity doesn't have to overwrite the device. The superblock stores the last position when the recalculation ended, so that it is properly restarted. Integrity tags that haven't been recalculated yet are ignored. Also bump the target version. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm integrity: flush journal on suspend when using separate metadata deviceMikulas Patocka
Flush the journal on suspend when using separate data and metadata devices, so that the metadata device can be discarded and the table can be reloaded with a linear target pointing to the data device. NOTE: the journal is deliberately not flushed when using the same device for metadata and data, so that the journal replay code is tested. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm integrity: use version 2 for separate metadataMikulas Patocka
Use version "2" in the superblock when data and metadata devices are separate, so that the device is not accidentally read by older kernel version. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm integrity: allow separate metadata deviceMikulas Patocka
Add the ability to store DM integrity metadata on a separate device. This feature is activated with the option "meta_device:/dev/device". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm integrity: add ic->start in get_data_sector()Mikulas Patocka
A small refactoring. Add the variable ic->start to the result returned by get_data_sector() and not in the callers. This is a prerequisite for the commit that adds the ability to use an external metadata device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm integrity: report provided data sectors in the statusMikulas Patocka
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm integrity: implement fair range locksMikulas Patocka
dm-integrity locks a range of sectors to prevent concurrent I/O or journal writeback. These locks were not fair - so that many small overlapping I/Os could starve a large I/O indefinitely. Fix this by making the range locks fair. The ranges that are waiting are added to the list "wait_list". If a new I/O overlaps some of the waiting I/Os, it is not dispatched, but it is also added to that wait list. Entries on the wait list are processed in first-in-first-out order, so that an I/O can't starve indefinitely. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-27dm integrity: decouple common code in dm_integrity_map_continue()Mikulas Patocka
Decouple how dm_integrity_map_continue() responds to being out of free sectors and when add_new_range() fails. This has no functional change, but helps prepare for the next commit. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>