summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2023-09-27md: simplify md_seq_opsYu Kuai
Before this patch, the implementation is hacky and hard to understand: 1) md_seq_start set pos to 1; 2) md_seq_show found pos is 1, then print Personalities; 3) md_seq_next found pos is 1, then it update pos to the first mddev; 4) md_seq_show found pos is not 1 or 2, show mddev; 5) md_seq_next found pos is not 1 or 2, update pos to next mddev; 6) loop 4-5 until the last mddev, then md_seq_next update pos to 2; 7) md_seq_show found pos is 2, then print unused devices; 8) md_seq_next found pos is 2, stop; This patch remove the magic value and use seq_list_start/next/stop() directly, and move printing "Personalities" to md_seq_start(), "unsed devices" to md_seq_stop(): 1) md_seq_start print Personalities, and then set pos to first mddev; 2) md_seq_show show mddev; 3) md_seq_next update pos to next mddev; 4) loop 2-3 until the last mddev; 5) md_seq_stop print unsed devices; Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230927061241.1552837-3-yukuai1@huaweicloud.com
2023-09-27md: factor out a helper from mddev_put()Yu Kuai
There are no functional changes, prepare to simplify md_seq_ops in next patch. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230927061241.1552837-2-yukuai1@huaweicloud.com
2023-09-25md: replace deprecated strncpy with memcpyJustin Stitt
`strncpy` is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. There are three such strncpy uses that this patch addresses: The respective destination buffers are: 1) mddev->clevel 2) clevel 3) mddev->metadata_type We expect mddev->clevel to be NUL-terminated due to its use with format strings: | ret = sprintf(page, "%s\n", mddev->clevel); Furthermore, we can see that mddev->clevel is not expected to be NUL-padded as `md_clean()` merely set's its first byte to NULL -- not the entire buffer: | static void md_clean(struct mddev *mddev) | { | mddev->array_sectors = 0; | mddev->external_size = 0; | ... | mddev->level = LEVEL_NONE; | mddev->clevel[0] = 0; | ... A suitable replacement for this instance is `memcpy` as we know the number of bytes to copy and perform manual NUL-termination at a specified offset. This really decays to just a byte copy from one buffer to another. `strscpy` is also a considerable replacement but using `slen` as the length argument would result in truncation of the last byte unless something like `slen + 1` was provided which isn't the most idiomatic strscpy usage. For the next case, the destination buffer `clevel` is expected to be NUL-terminated based on its usage within kstrtol() which expects NUL-terminated strings. Note that, in context, this code removes a trailing newline which is seemingly not required as kstrtol() can handle trailing newlines implicitly. However, there exists further usage of clevel (or buf) that would also like to have the newline removed. All in all, with similar reasoning to the first case, let's just use memcpy as this is just a byte copy and NUL-termination is handled manually. The third and final case concerning `mddev->metadata_type` is more or less the same as the other two. We expect that it be NUL-terminated based on its usage with seq_printf: | seq_printf(seq, " super external:%s", | mddev->metadata_type); ... and we can surmise that NUL-padding isn't required either due to how it is handled in md_clean(): | static void md_clean(struct mddev *mddev) | { | ... | mddev->metadata_type[0] = 0; | ... So really, all these instances have precisely calculated lengths and purposeful NUL-termination so we can just use memcpy to remove ambiguity surrounding strncpy. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230925-strncpy-drivers-md-md-c-v1-1-2b0093b89c2b@google.com
2023-09-22md/md-linear: Annotate struct linear_conf with __counted_byKees Cook
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct linear_conf. Additionally, since the element count member must be set before accessing the annotated flexible array member, move its initialization earlier. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230915200328.never.064-kees@kernel.org
2023-09-22md: don't check 'mddev->pers' and 'pers->quiesce' from suspend_lo_store()Yu Kuai
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to remove such checking. This will also allow the array to be suspended even before the array is ran. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-8-yukuai1@huaweicloud.com
2023-09-22md: don't check 'mddev->pers' from suspend_hi_store()Yu Kuai
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to remove such checking. This will also allow the array to be suspended even before the array is ran. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-7-yukuai1@huaweicloud.com
2023-09-22md-bitmap: suspend array earlier in location_store()Yu Kuai
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's safe to call mddev_suspend() earlier. This will also be helper to refactor mddev_suspend() later. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-6-yukuai1@huaweicloud.com
2023-09-22md-bitmap: remove the checking of 'pers->quiesce' from location_store()Yu Kuai
After commit 4d27e927344a ("md: don't quiesce in mddev_suspend()"), there is no need to check 'pers->quiesce' before calling mddev_suspend(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-5-yukuai1@huaweicloud.com
2023-09-22md: don't rely on 'mddev->pers' to be set in mddev_suspend()Yu Kuai
'active_io' used to be initialized while the array is running, and 'mddev->pers' is set while the array is running as well. Hence caller must hold 'reconfig_mutex' and guarantee 'mddev->pers' is set before calling mddev_suspend(). Now that 'active_io' is initialized when mddev is allocated, such restriction doesn't exist anymore. In the meantime, follow up patches will refactor mddev_suspend(), hence add checking for 'mddev->pers' to prevent null-ptr-deref. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-4-yukuai1@huaweicloud.com
2023-09-22md: initialize 'writes_pending' while allocating mddevYu Kuai
Currently 'writes_pending' is initialized in pers->run for raid1/5/10, and it's freed while deleing mddev, instead of pers->free. pers->run can be called multiple times before mddev is deleted, and a helper mddev_init_writes_pending() is used to prevent 'writes_pending' to be initialized multiple times, this usage is safe but a litter weird. On the other hand, 'writes_pending' is only initialized for raid1/5/10, however, it's used in common layer, for example: array_state_store set_in_sync if (!mddev->in_sync) -> in_sync is used for all levels // access writes_pending There might be some implicit dependency that I don't recognized to make sure 'writes_pending' can only be accessed for raid1/5/10, but there are no comments about that. By the way, it make sense to initialize 'writes_pending' in common layer because there are already three levels use it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com
2023-09-22md: initialize 'active_io' while allocating mddevYu Kuai
'active_io' is used for mddev_suspend() and it's initialized in md_run(), this restrict that 'reconfig_mutex' must be held and "mddev->pers" must be set before calling mddev_suspend(). Initialize 'active_io' early so that mddev_suspend() is safe to call once mddev is allocated, this will be helpful to refactor mddev_suspend() in following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-2-yukuai1@huaweicloud.com
2023-09-22md: delay remove_and_add_spares() for read only array to md_start_sync()Yu Kuai
Before this patch, for read-only array: md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, then it will call remove_and_add_spares() directly to try to remove and add rdevs from array. After this patch: 1) md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, and the worker 'sync_work' is not pending, and there are rdevs can be added or removed, then it will queue new work md_start_sync(); 2) md_start_sync() will call remove_and_add_spares() and exist; This change make sure that array reconfiguration is independent from daemon, and it'll be much easier to synchronize it with io, consier that io may rely on daemon thread to be done. Also fix a problem that 'pers->spars_active' is called after remove_and_add_spares(), which order is wrong, because spares must active first, and then remove_and_add_spares() can add spares to the array, like what read-write case does: 1) daemon set 'MD_RECOVERY_RUNNING', register new sync thread to do recovery; 2) recovery is done, md_do_sync() set 'MD_RECOVERY_DONE' before return; 3) daemon call 'pers->spars_active', and clear 'MD_RECOVERY_RUNNING'; 4) in the next round of daemon, call remove_and_add_spares() to add spares to the array. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com
2023-09-22md: factor out a helper rdev_addable() from remove_and_add_spares()Yu Kuai
There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-7-yukuai1@huaweicloud.com
2023-09-22md: factor out a helper rdev_is_spare() from remove_and_add_spares()Yu Kuai
There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-6-yukuai1@huaweicloud.com
2023-09-22md: factor out a helper rdev_removeable() from remove_and_add_spares()Yu Kuai
There are no functional changes, just to make the code simpler and prepare to delay remove_and_add_spares() to md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-5-yukuai1@huaweicloud.com
2023-09-22md: delay choosing sync action to md_start_sync()Yu Kuai
Before this patch, for read-write array: 1) md_check_recover() found that something need to be done, and it'll try to grab 'reconfig_mutex'. The case that md_check_recover() need to do something: - array is not suspend; - super_block need to be updated; - 'MD_RECOVERY_NEEDED' or 'MD_RECOVERY_DONE' is set; - unusual case related to safemode; 2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set, md_check_recover() will try to choose a sync action, and then queue a work md_start_sync(). 3) md_start_sync() register sync_thread; After this patch, 1) is the same; 2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set, queue a work md_start_sync() directly; 3) md_start_sync() will try to choose a sync action, and then register sync_thread(); Because 'MD_RECOVERY_RUNNING' is cleared when sync_thread is done, 2) and 3) and md_do_sync() is always ran in serial and they can never concurrent, this change should not introduce any behavior change for now. Also fix a problem that md_start_sync() can clear 'MD_RECOVERY_RUNNING' without protection in error path, which might affect the logical in md_check_recovery(). The advantage to change this is that array reconfiguration is independent from daemon now, and it'll be much easier to synchronize it with io, consider that io may rely on daemon thread to be done. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-4-yukuai1@huaweicloud.com
2023-09-22md: factor out a helper to choose sync action from md_check_recovery()Yu Kuai
There are no functional changes, on the one hand make the code cleaner, on the other hand prevent following checkpatch error in the next patch to delay choosing sync action to md_start_sync(). ERROR: do not use assignment in if condition + } else if ((spares = remove_and_add_spares(mddev, NULL))) { Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-3-yukuai1@huaweicloud.com
2023-09-22md: use separate work_struct for md_start_sync()Yu Kuai
It's a little weird to borrow 'del_work' for md_start_sync(), declare a new work_struct 'sync_work' for md_start_sync(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825031622.1530464-2-yukuai1@huaweicloud.com
2023-09-20dm zoned: free dmz->ddev array in dmz_put_zoned_devicesFedor Pchelkin
Commit 4dba12881f88 ("dm zoned: support arbitrary number of devices") made the pointers to additional zoned devices to be stored in a dynamically allocated dmz->ddev array. However, this array is not freed. Rename dmz_put_zoned_device to dmz_put_zoned_devices and fix it to free the dmz->ddev array when cleaning up zoned device information. Remove NULL assignment for all dmz->ddev elements and just free the dmz->ddev array instead. Found by Linux Verification Center (linuxtesting.org). Fixes: 4dba12881f88 ("dm zoned: support arbitrary number of devices") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-09-15Merge tag 'for-6.6/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM core retrieve_deps() UAF race due to missing locking of a DM table's list of devices that is managed using dm_{get,put}_device. - Revert DM core's half-baked RCU optimization if IO submitter has set REQ_NOWAIT. Can be revisited, and properly justified, after comprehensively auditing all of DM to also pass GFP_NOWAIT for any allocations if REQ_NOWAIT used. * tag 'for-6.6/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: don't attempt to queue IO under RCU protection dm: fix a race condition in retrieve_deps
2023-09-15dm: don't attempt to queue IO under RCU protectionJens Axboe
dm looks up the table for IO based on the request type, with an assumption that if the request is marked REQ_NOWAIT, it's fine to attempt to submit that IO while under RCU read lock protection. This is not OK, as REQ_NOWAIT just means that we should not be sleeping waiting on other IO, it does not mean that we can't potentially schedule. A simple test case demonstrates this quite nicely: int main(int argc, char *argv[]) { struct iovec iov; int fd; fd = open("/dev/dm-0", O_RDONLY | O_DIRECT); posix_memalign(&iov.iov_base, 4096, 4096); iov.iov_len = 4096; preadv2(fd, &iov, 1, 0, RWF_NOWAIT); return 0; } which will instantly spew: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:306 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5580, name: dm-nowait preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 INFO: lockdep is turned off. CPU: 7 PID: 5580 Comm: dm-nowait Not tainted 6.6.0-rc1-g39956d2dcd81 #132 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x11d/0x1b0 __might_resched+0x3c3/0x5e0 ? preempt_count_sub+0x150/0x150 mempool_alloc+0x1e2/0x390 ? mempool_resize+0x7d0/0x7d0 ? lock_sync+0x190/0x190 ? lock_release+0x4b7/0x670 ? internal_get_user_pages_fast+0x868/0x2d40 bio_alloc_bioset+0x417/0x8c0 ? bvec_alloc+0x200/0x200 ? internal_get_user_pages_fast+0xb8c/0x2d40 bio_alloc_clone+0x53/0x100 dm_submit_bio+0x27f/0x1a20 ? lock_release+0x4b7/0x670 ? blk_try_enter_queue+0x1a0/0x4d0 ? dm_dax_direct_access+0x260/0x260 ? rcu_is_watching+0x12/0xb0 ? blk_try_enter_queue+0x1cc/0x4d0 __submit_bio+0x239/0x310 ? __bio_queue_enter+0x700/0x700 ? kvm_clock_get_cycles+0x40/0x60 ? ktime_get+0x285/0x470 submit_bio_noacct_nocheck+0x4d9/0xb80 ? should_fail_request+0x80/0x80 ? preempt_count_sub+0x150/0x150 ? lock_release+0x4b7/0x670 ? __bio_add_page+0x143/0x2d0 ? iov_iter_revert+0x27/0x360 submit_bio_noacct+0x53e/0x1b30 submit_bio_wait+0x10a/0x230 ? submit_bio_wait_endio+0x40/0x40 __blkdev_direct_IO_simple+0x4f8/0x780 ? blkdev_bio_end_io+0x4c0/0x4c0 ? stack_trace_save+0x90/0xc0 ? __bio_clone+0x3c0/0x3c0 ? lock_release+0x4b7/0x670 ? lock_sync+0x190/0x190 ? atime_needs_update+0x3bf/0x7e0 ? timestamp_truncate+0x21b/0x2d0 ? inode_owner_or_capable+0x240/0x240 blkdev_direct_IO.part.0+0x84a/0x1810 ? rcu_is_watching+0x12/0xb0 ? lock_release+0x4b7/0x670 ? blkdev_read_iter+0x40d/0x530 ? reacquire_held_locks+0x4e0/0x4e0 ? __blkdev_direct_IO_simple+0x780/0x780 ? rcu_is_watching+0x12/0xb0 ? __mark_inode_dirty+0x297/0xd50 ? preempt_count_add+0x72/0x140 blkdev_read_iter+0x2a4/0x530 do_iter_readv_writev+0x2f2/0x3c0 ? generic_copy_file_range+0x1d0/0x1d0 ? fsnotify_perm.part.0+0x25d/0x630 ? security_file_permission+0xd8/0x100 do_iter_read+0x31b/0x880 ? import_iovec+0x10b/0x140 vfs_readv+0x12d/0x1a0 ? vfs_iter_read+0xb0/0xb0 ? rcu_is_watching+0x12/0xb0 ? rcu_is_watching+0x12/0xb0 ? lock_release+0x4b7/0x670 do_preadv+0x1b3/0x260 ? do_readv+0x370/0x370 __x64_sys_preadv2+0xef/0x150 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f5af41ad806 Code: 41 54 41 89 fc 55 44 89 c5 53 48 89 cb 48 83 ec 18 80 3d e4 dd 0d 00 00 74 7a 45 89 c1 49 89 ca 45 31 c0 b8 47 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 be 00 00 00 48 85 c0 79 4a 48 8b 0d da 55 RSP: 002b:00007ffd3145c7f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000147 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5af41ad806 RDX: 0000000000000001 RSI: 00007ffd3145c850 RDI: 0000000000000003 RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000008 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 R13: 00007ffd3145c850 R14: 000055f5f0431dd8 R15: 0000000000000001 </TASK> where in fact it is dm itself that attempts to allocate a bio clone with GFP_NOIO under the rcu read lock, regardless of the request type. Fix this by getting rid of the special casing for REQ_NOWAIT, and just use the normal SRCU protected table lookup. Get rid of the bio based table locking helpers at the same time, as they are now unused. Cc: stable@vger.kernel.org Fixes: 563a225c9fd2 ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-09-14md: Put the right device in md_seq_nextMariusz Tkaczyk
If there are multiple arrays in system and one mddevice is marked with MD_DELETED and md_seq_next() is called in the middle of removal then it _get()s proper device but it may _put() deleted one. As a result, active counter may never be zeroed for mddevice and it cannot be removed. Put the device which has been _get with previous md_seq_next() call. Cc: stable@vger.kernel.org Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed") Reported-by: AceLan Kao <acelan@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217798 Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914152416.10819-1-mariusz.tkaczyk@linux.intel.com
2023-09-14dm: fix a race condition in retrieve_depsMikulas Patocka
There's a race condition in the multipath target when retrieve_deps races with multipath_message calling dm_get_device and dm_put_device. retrieve_deps walks the list of open devices without holding any lock but multipath may add or remove devices to the list while it is running. The end result may be memory corruption or use-after-free memory access. See this description of a UAF with multipath_message(): https://listman.redhat.com/archives/dm-devel/2022-October/052373.html Fix this bug by introducing a new rw semaphore "devices_lock". We grab devices_lock for read in retrieve_deps and we grab it for write in dm_get_device and dm_put_device. Reported-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Tested-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-09-11md/raid1: fix error: ISO C90 forbids mixed declarationsNigel Croxon
There is a compile error when this commit is added: md: raid1: fix potential OOB in raid1_remove_disk() drivers/md/raid1.c: In function 'raid1_remove_disk': drivers/md/raid1.c:1844:9: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] 1844 |         struct raid1_info *p = conf->mirrors + number;     |         ^~~~~~ That's because the new code was inserted before the struct. The change is move the struct command above this commit. Fixes: 8b0472b50bcf ("md: raid1: fix potential OOB in raid1_remove_disk()") Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/46d929d0-2aab-4cf2-b2bf-338963e8ba5a@redhat.com
2023-09-08md: fix warning for holder mismatch from export_rdev()Yu Kuai
Commit a1d767191096 ("md: use mddev->external to select holder in export_rdev()") fix the problem that 'claim_rdev' is used for blkdev_get_by_dev() while 'rdev' is used for blkdev_put(). However, if mddev->external is changed from 0 to 1, then 'rdev' is used for blkdev_get_by_dev() while 'claim_rdev' is used for blkdev_put(). And this problem can be reporduced reliably by following: New file: mdadm/tests/23rdev-lifetime devname=${dev0##*/} devt=`cat /sys/block/$devname/dev` pid="" runtime=2 clean_up_test() { pill -9 $pid echo clear > /sys/block/md0/md/array_state } trap 'clean_up_test' EXIT add_by_sysfs() { while true; do echo $devt > /sys/block/md0/md/new_dev done } remove_by_sysfs(){ while true; do echo remove > /sys/block/md0/md/dev-${devname}/state done } echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed" add_by_sysfs & pid="$pid $!" remove_by_sysfs & pid="$pid $!" sleep $runtime exit 0 Test cmd: ./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime Test result: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 960 at block/bdev.c:618 blkdev_put+0x27c/0x330 Modules linked in: multipath md_mod loop CPU: 0 PID: 960 Comm: test Not tainted 6.5.0-rc2-00121-g01e55c376936-dirty #50 RIP: 0010:blkdev_put+0x27c/0x330 Call Trace: <TASK> export_rdev.isra.23+0x50/0xa0 [md_mod] mddev_unlock+0x19d/0x300 [md_mod] rdev_attr_store+0xec/0x190 [md_mod] sysfs_kf_write+0x52/0x70 kernfs_fop_write_iter+0x19a/0x2a0 vfs_write+0x3b5/0x770 ksys_write+0x74/0x150 __x64_sys_write+0x22/0x30 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fix the problem by recording if 'rdev' is used as holder. Fixes: a1d767191096 ("md: use mddev->external to select holder in export_rdev()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825025532.1523008-3-yukuai1@huaweicloud.com
2023-09-08md: don't dereference mddev after export_rdev()Yu Kuai
Except for initial reference, mddev->kobject is referenced by rdev->kobject, and if the last rdev is freed, there is no guarantee that mddev is still valid. Hence mddev should not be used anymore after export_rdev(). This problem can be triggered by following test for mdadm at very low rate: New file: mdadm/tests/23rdev-lifetime devname=${dev0##*/} devt=`cat /sys/block/$devname/dev` pid="" runtime=2 clean_up_test() { pill -9 $pid echo clear > /sys/block/md0/md/array_state } trap 'clean_up_test' EXIT add_by_sysfs() { while true; do echo $devt > /sys/block/md0/md/new_dev done } remove_by_sysfs(){ while true; do echo remove > /sys/block/md0/md/dev-${devname}/state done } echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed" add_by_sysfs & pid="$pid $!" remove_by_sysfs & pid="$pid $!" sleep $runtime exit 0 Test cmd: ./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime Test result: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bcb: 0000 [#4] PREEMPT SMP CPU: 0 PID: 1292 Comm: test Tainted: G D W 6.5.0-rc2-00121-g01e55c376936 #562 RIP: 0010:md_wakeup_thread+0x9e/0x320 [md_mod] Call Trace: <TASK> mddev_unlock+0x1b6/0x310 [md_mod] rdev_attr_store+0xec/0x190 [md_mod] sysfs_kf_write+0x52/0x70 kernfs_fop_write_iter+0x19a/0x2a0 vfs_write+0x3b5/0x770 ksys_write+0x74/0x150 __x64_sys_write+0x22/0x30 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fix this problem by don't dereference mddev after export_rdev(). Fixes: 3ce94ce5d05a ("md: fix duplicate filename for rdev") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825025532.1523008-2-yukuai1@huaweicloud.com
2023-08-29Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
2023-08-17md: raid0: account for split bio in iostat accountingDavid Jeffery
When a bio is split by md raid0, the newly created bio will not be tracked by md for I/O accounting. Only the portion of I/O still assigned to the original bio which was reduced by the split will be accounted for. This results in md iostat data sometimes showing I/O values far below the actual amount of data being sent through md. md_account_bio() needs to be called for all bio generated by the bio split. A simple example of the issue was generated using a raid0 device on partitions to the same device. Since all raid0 I/O then goes to one device, it makes it easy to see a gap between the md device and its sd storage. Reading an lvm device on top of the md device, the iostat output (some 0 columns and extra devices removed to make the data more compact) was: Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read md2 0.00 0.00 0.00 0.00 0 sde 0.00 0.00 0.00 0.00 0 md2 1364.00 411496.00 0.00 0.00 411496 sde 1734.00 646144.00 0.00 0.00 646144 md2 1699.00 510680.00 0.00 0.00 510680 sde 2155.00 802784.00 0.00 0.00 802784 md2 803.00 241480.00 0.00 0.00 241480 sde 1016.00 377888.00 0.00 0.00 377888 md2 0.00 0.00 0.00 0.00 0 sde 0.00 0.00 0.00 0.00 0 I/O was generated doing large direct I/O reads (12M) with dd to a linear lvm volume on top of the 4 leg raid0 device. The md2 reads were showing as roughly 2/3 of the reads to the sde device containing all of md2's raid partitions. The sum of reads to sde was 1826816 kB, which was the expected amount as it was the amount read by dd. With the patch, the total reads from md will match the reads from sde and be consistent with the amount of I/O generated. Fixes: 10764815ff47 ("md: add io accounting for raid0 and raid5") Signed-off-by: David Jeffery <djeffery@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com
2023-08-17md/raid0: Fix performance regression for large sequential writesJan Kara
Commit f00d7c85be9e ("md/raid0: fix up bio splitting.") among other things changed how bio that needs to be split is submitted. Before this commit, we have split the bio, mapped and submitted each part. After this commit, we map only the first part of the split bio and submit the second part unmapped. Due to bio sorting in __submit_bio_noacct() this results in the following request ordering: 9,0 18 1181 0.525037895 15995 Q WS 1479315464 + 63392 Split off chunk-sized (1024 sectors) request: 9,0 18 1182 0.629019647 15995 X WS 1479315464 / 1479316488 Request is unaligned to the chunk so it's split in raid0_make_request(). This is the first part mapped and punted to bio_list: 8,0 18 7053 0.629020455 15995 A WS 739921928 + 1016 <- (9,0) 1479315464 Now raid0_make_request() returns, second part is postponed on bio_list. __submit_bio_noacct() resorts the bio_list, mapped request is submitted to the underlying device: 8,0 18 7054 0.629022782 15995 G WS 739921928 + 1016 Now we take another request from the bio_list which is the remainder of the original huge request. Split off another chunk-sized bit from it and the situation repeats: 9,0 18 1183 0.629024499 15995 X WS 1479316488 / 1479317512 8,16 18 6998 0.629025110 15995 A WS 739921928 + 1016 <- (9,0) 1479316488 8,16 18 6999 0.629026728 15995 G WS 739921928 + 1016 ... 9,0 18 1184 0.629032940 15995 X WS 1479317512 / 1479318536 [libnetacq-write] 8,0 18 7059 0.629033294 15995 A WS 739922952 + 1016 <- (9,0) 1479317512 8,0 18 7060 0.629033902 15995 G WS 739922952 + 1016 ... This repeats until we consume the whole original huge request. Now we finally get to processing the second parts of the split off requests (in reverse order): 8,16 18 7181 0.629161384 15995 A WS 739952640 + 8 <- (9,0) 1479377920 8,0 18 7239 0.629162140 15995 A WS 739952640 + 8 <- (9,0) 1479376896 8,16 18 7186 0.629163881 15995 A WS 739951616 + 8 <- (9,0) 1479375872 8,0 18 7242 0.629164421 15995 A WS 739951616 + 8 <- (9,0) 1479374848 ... I guess it is obvious that this IO pattern is extremely inefficient way to perform sequential IO. It also makes bio_list to grow to rather long lengths. Change raid0_make_request() to map both parts of the split bio. Since we know we are provided with at most chunk-sized bios, we will always need to split the incoming bio at most once. Fixes: f00d7c85be9e ("md/raid0: fix up bio splitting.") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz Signed-off-by: Song Liu <song@kernel.org>
2023-08-17md/raid0: Factor out helper for mapping and submitting a bioJan Kara
Factor out helper function for mapping and submitting a bio out of raid0_make_request(). We will use it later for submitting both parts of a split bio. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230814092720.3931-1-jack@suse.cz Signed-off-by: Song Liu <song@kernel.org>
2023-08-17md raid1: allow writebehind to work on any leg device set WriteMostlyHeinz Mauelshagen
As the WriteMostly flag can be set on any component device of a RAID1 array, remove the constraint that it only works if set on the first one. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Tested-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/2a9592bf3340f34bf588eec984b23ee219f3985e.1692013451.git.heinzm@redhat.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-17md/raid1: hold the barrier until handle_read_error() finishesXueshi Hu
handle_read_error() will call allow_barrier() to match the former barrier raising. However, it should put the allow_barrier() at the end to avoid a concurrent raid reshape. Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().") Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com> Link: https://lore.kernel.org/r/20230814135356.1113639-4-xueshi.hu@smartx.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-17md/raid1: free the r1bio before waiting for blocked rdevXueshi Hu
Raid1 reshape will change mempool and r1conf::raid_disks which are needed to free r1bio. allow_barrier() make a concurrent raid1_reshape() possible. So, free the in-flight r1bio before waiting blocked rdev. Fixes: 6bfe0b499082 ("md: support blocking writes to an array on device failure") Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com> Link: https://lore.kernel.org/r/20230814135356.1113639-3-xueshi.hu@smartx.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-17md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()Xueshi Hu
After allow_barrier, a concurrent raid1_reshape() will replace old mempool and r1conf::raid_disks. Move allow_barrier() to the end of raid_end_bio_io(), so that r1bio can be freed safely. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com> Link: https://lore.kernel.org/r/20230814135356.1113639-2-xueshi.hu@smartx.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-15md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()Yu Kuai
r5l_flush_stripe_to_raid() will check if the list 'flushing_ios' is empty, and then submit 'flush_bio', however, r5l_log_flush_endio() is clearing the list first and then clear the bio, which will cause null-ptr-deref: T1: submit flush io raid5d handle_active_stripes r5l_flush_stripe_to_raid // list is empty // add 'io_end_ios' to the list bio_init submit_bio // io1 T2: io1 is done r5l_log_flush_endio list_splice_tail_init // clear the list T3: submit new flush io ... r5l_flush_stripe_to_raid // list is empty // add 'io_end_ios' to the list bio_init bio_uninit // clear bio->bi_blkg submit_bio // null-ptr-deref Fix this problem by clearing bio before clearing the list in r5l_log_flush_endio(). Fixes: 0dd00cba99c3 ("raid5-cache: fully initialize flush_bio when needed") Reported-and-tested-by: Corey Hickey <bugfood-ml@fatooh.org> Closes: https://lore.kernel.org/all/cddd7213-3dfd-4ab7-a3ac-edd54d74a626@fatooh.org/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2023-08-15md: Hold mddev->reconfig_mutex when trying to get mddev->sync_threadLi Lingfeng
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mdev to md_unregister_thread(). Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-15md/raid10: fix a 'conf->barrier' leakage in raid10_takeover()Yu Kuai
After commit b39f35ebe86d ("md: don't quiesce in mddev_suspend()"), 'conf->barrier' will be leaked in the case that raid10 takeover raid0: level_store pers->takeover -> raid10_takeover raid10_takeover_raid0 WRITE_ONCE(conf->barrier, 1) mddev_suspend // still raid0 mddev->pers = pers // switch to raid10 mddev_resume // resume without suspend After the above commit, mddev_resume() will not decrease 'conf->barrier' that is set in raid10_takeover_raid0(). Fix this problem by not setting 'conf->barrier' in raid10_takeover_raid0(). By the way, this problem is found while I'm trying to make mddev_suspend/resume() to be independent from raid personalities. raid10 is the only personality to use reference count in the quiesce() callback and this problem is only related to raid10. Fixes: b39f35ebe86d ("md: don't quiesce in mddev_suspend()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/r/20230731022800.1424902-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-15md: raid1: fix potential OOB in raid1_remove_disk()Zhang Shurong
If rddev->raid_disk is greater than mddev->raid_disks, there will be an out-of-bounds in raid1_remove_disk(). We have already found similar reports as follows: 1) commit d17f744e883b ("md-raid10: fix KASAN warning") 2) commit 1ebc2cec0b7d ("dm raid: fix KASAN warning in raid5_remove_disk") Fix this bug by checking whether the "number" variable is valid. Signed-off-by: Zhang Shurong <zhang_shurong@foxmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/tencent_0D24426FAC6A21B69AC0C03CE4143A508F09@qq.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-15md/raid5-cache: fix a deadlock in r5l_exit_log()Yu Kuai
Commit b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") introduce a new problem: // caller hold reconfig_mutex r5l_exit_log flush_work(&log->disable_writeback_work) r5c_disable_writeback_async wait_event /* * conf->log is not NULL, and mddev_trylock() * will fail, wait_event() can never pass. */ conf->log = NULL Fix this problem by setting 'config->log' to NULL before wake_up() as it used to be, so that wait_event() from r5c_disable_writeback_async() can exist. In the meantime, move forward md_unregister_thread() so that null-ptr-deref this commit fixed can still be fixed. Fixes: b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230708091727.1417894-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-09bio-integrity: update the payload size in bio_integrity_add_page()Jinyoung Choi
Previously, the bip's bi_size has been set before an integrity pages were added. If a problem occurs in the process of adding pages for bip, the bi_size mismatch problem must be dealt with. When the page is successfully added to bvec, the bi_size is updated. The parts affected by the change were also contained in this commit. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02fs: add CONFIG_BUFFER_HEADChristoph Hellwig
Add a new config option that controls building the buffer_head code, and select it from all file systems and stacking drivers that need it. For the block device nodes and alternative iomap based buffered I/O path is provided when buffer_head support is not enabled, and iomap needs a a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call into the buffer_head code when it doesn't exist. Otherwise this is just Kconfig and ifdef changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27md/md-bitmap: hold 'reconfig_mutex' in backlog_store()Yu Kuai
Several reasons why 'reconfig_mutex' should be held: 1) rdev_for_each() is not safe to be called without the lock, because rdev can be removed concurrently. 2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should not be called concurrently. 3) mddev_suspend() from mddev_destroy/create_serial_pool() should be protected by the lock. Fixes: 10c92fca636e ("md-bitmap: create and destroy wb_info_pool with the change of backlog") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230706083727.608914-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/md-bitmap: remove unnecessary local variable in backlog_store()Yu Kuai
Local variable is definied first in the beginning of backlog_store(), there is no need to define it again. Fixes: 8c13ab115b57 ("md/bitmap: don't set max_write_behind if there is no write mostly device") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230706083727.608914-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: use dereference_rdev_and_rrdev() to get devicesLi Nan
Commit 2ae6aaf76912 ("md/raid10: fix io loss while replacement replace rdev") reads replacement first to prevent io loss. However, there are same issue in wait_blocked_dev() and raid10_handle_discard(), too. Fix it by using dereference_rdev_and_rrdev() to get devices. Fixes: d30588b2731f ("md/raid10: improve raid10 discard request") Fixes: f2e7e269a752 ("md/raid10: pull the code that wait for blocked dev into one function") Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-4-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: factor out dereference_rdev_and_rrdev()Li Nan
Factor out a helper to get 'rdev' and 'replacement' from config->mirrors. Just to make code cleaner and prepare to fix the bug of io loss while 'replacement' replace 'rdev'. There is no functional change. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: check replacement and rdev to prevent submit the same io twiceLi Nan
After commit 4ca40c2ce099 ("md/raid10: Allow replacement device to be replace old drive."), 'rdev' and 'replacement' could appear to be identical. There are already checks for that in wait_blocked_dev() and raid10_write_request(). Add check for raid10_handle_discard() now. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-2-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid1: Avoid lock contention from wake_up()Jack Wang
wake_up is called unconditionally in a few paths such as make_request(), which cause lock contention under high concurrency workload like below raid1_end_write_request wake_up __wake_up_common_lock spin_lock_irqsave Improve performance by only call wake_up() if waitqueue is not empty Fio test script: [global] name=random reads and writes ioengine=libaio direct=1 readwrite=randrw rwmixread=70 iodepth=64 buffered=0 filename=/dev/md0 size=1G runtime=30 time_based randrepeat=0 norandommap refill_buffers ramp_time=10 bs=4k numjobs=400 group_reporting=1 [job1] Test result with 2 ramdisk in raid1 on a Intel Broadwell 56 cores server. Before this patch With this patch READ BW=4621MB/s BW=7337MB/s WRITE BW=1980MB/s BW=3144MB/s The patch is inspired by Yu Kuai's change for raid10: https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230705113227.148494-1-jinpu.wang@ionos.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md: restore 'noio_flag' for the last mddev_resume()Yu Kuai
memalloc_noio_save() is called for the first mddev_suspend(), and repeated mddev_suspend() only increase 'suspended'. However, memalloc_noio_restore() is also called for the first mddev_resume(), which means that memory reclaim will be enabled before the last mddev_resume() is called, while the array is still suspended. Fix this problem by restore 'noio_flag' for the last mddev_resume(). Fixes: 78f57ef9d50a ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md: don't quiesce in mddev_suspend()Yu Kuai
Some levels doesn't implement "pers->quiesce", for example raid0_quiesce() is empty, and now that all levels will drop 'active_io' until io is done, wait for 'active_io' to be 0 is enough to make sure all normal io is done, and percpu_ref_kill() for 'active_io' will make sure no new normal io can be dispatched. There is no need to call "pers->quiesce" anymore from mddev_suspend(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230628012931.88911-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md: remove redundant check in fix_read_error()Li Nan
In fix_read_error(), 'success' will be checked immediately after assigning it, if it is set to 1 then the loop will break. Checking it again in condition of loop is redundant. Clean it up. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230623173236.2513554-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>