summaryrefslogtreecommitdiff
path: root/drivers/md/raid10.c
AgeCommit message (Collapse)Author
2025-02-14md/raid*: Fix the set_queue_limits implementationsBart Van Assche
queue_limits_cancel_update() must only be called if queue_limits_start_update() is called first. Remove the queue_limits_cancel_update() calls from the raid*_set_limits() functions because there is no corresponding queue_limits_start_update() call. Cc: Christoph Hellwig <hch@lst.de> Fixes: c6e56cf6b2e7 ("block: move integrity information into queue_limits") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-raid/20250212171108.3483150-1-bvanassche@acm.org/ Signed-off-by: Yu Kuai <yukuai@kernel.org>
2025-01-17block: Add common atomic writes enable flagJohn Garry
Currently only stacked devices need to explicitly enable atomic writes by setting BLK_FEAT_ATOMIC_WRITES_STACKED flag. This does not work well for device mapper stacking devices, as there many sets of limits are stacked and what is the 'bottom' and 'top' device can swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set for many queue limits, which is messy. Generalize enabling atomic writes enabling by ensuring that all devices must explicitly set a flag - that includes NVMe, SCSI sd, and md raid. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13md/md-bitmap: move bitmap_{start, end}write to md upper layerYu Kuai
There are two BUG reports that raid5 will hang at bitmap_startwrite([1],[2]), root cause is that bitmap start write and end write is unbalanced, it's not quite clear where, and while reviewing raid5 code, it's found that bitmap operations can be optimized. For example, for a 4 disks raid5, with chunksize=8k, if user issue a IO (0 + 48k) to the array: ┌────────────────────────────────────────────────────────────┐ │chunk 0 │ │ ┌────────────┬─────────────┬─────────────┬────────────┼ │ sh0 │A0: 0 + 4k │A1: 8k + 4k │A2: 16k + 4k │A3: P │ │ ┼────────────┼─────────────┼─────────────┼────────────┼ │ sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P │ ┼──────┴────────────┴─────────────┴─────────────┴────────────┼ │chunk 1 │ │ ┌────────────┬─────────────┬─────────────┬────────────┤ │ sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P │C3: 40k + 4k│ │ ┼────────────┼─────────────┼─────────────┼────────────┼ │ sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P │D3: 44k + 4k│ └──────┴────────────┴─────────────┴─────────────┴────────────┘ Before this patch, 4 stripe head will be used, and each sh will attach bio for 3 disks, and each attached bio will trigger bitmap_startwrite() once, which means total 12 times. - 3 times (0 + 4k), for (A0, A1 and A2) - 3 times (4 + 4k), for (B0, B1 and B2) - 3 times (8 + 4k), for (C0, C1 and C3) - 3 times (12 + 4k), for (D0, D1 and D3) After this patch, md upper layer will calculate that IO range (0 + 48k) is corresponding to the bitmap (0 + 16k), and call bitmap_startwrite() just once. Noted that this patch will align bitmap ranges to the chunks, for example, if user issue a IO (0 + 4k) to array: - Before this patch, 1 time (0 + 4k), for A0; - After this patch, 1 time (0 + 8k) for chunk 0; Usually, one bitmap bit will represent more than one disk chunk, and this doesn't have any difference. And even if user really created a array that one chunk contain multiple bits, the overhead is that more data will be recovered after power failure. Also remove STRIPE_BITMAP_PENDING since it's not used anymore. [1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/ [2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2025-01-13md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()Yu Kuai
For the case that IO failed for one rdev, the bit will be mark as NEEDED in following cases: 1) If badblocks is set and rdev is not faulty; 2) If rdev is faulty; Case 1) is useless because synchronize data to badblocks make no sense. Case 2) can be replaced with mddev->degraded. Also remove R1BIO_Degraded, R10BIO_Degraded and STRIPE_DEGRADED since case 2) no longer use them. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250109015145.158868-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2025-01-13md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()Yu Kuai
behind_write is only used in raid1, prepare to refactor bitmap_{start/end}write(), there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20250109015145.158868-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-11-30Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - Use correct srcu list traversal (Breno) - Scatter-gather support for metadata (Keith) - Fabrics shutdown race condition fix (Nilay) - Persistent reservations updates (Guixin) - Add the required bits for MD atomic write support for raid0/1/10 - Correct return value for unknown opcode in ublk - Fix deadlock with zone revalidation - Fix for the io priority request vs bio cleanups - Use the correct unsigned int type for various limit helpers - Fix for a race in loop - Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make it easier for actual humans to read - Fix potential UAF when iterating tags - A few fixes for bfq-iosched UAF issues - Fix for brd discard not decrementing the allocated page count - Various little fixes and cleanups * tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits) brd: decrease the number of allocated pages which discarded block, bfq: fix bfqq uaf in bfq_limit_depth() block: Don't allow an atomic write be truncated in blkdev_write_iter() mq-deadline: don't call req_get_ioprio from the I/O completion handler block: Prevent potential deadlock in blk_revalidate_disk_zones() block: Remove extra part pointer NULLify in blk_rq_init() nvme: tuning pr code by using defined structs and macros nvme: introduce change ptpl and iekey definition block: return bool from get_disk_ro and bdev_read_only block: remove a duplicate definition for bdev_read_only block: return bool from blk_rq_aligned block: return unsigned int from blk_lim_dma_alignment_and_pad block: return unsigned int from queue_dma_alignment block: return unsigned int from bdev_io_opt block: req->bio is always set in the merge code block: don't bother checking the data direction for merges block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()" md/raid10: Atomic write support md/raid1: Atomic write support ...
2024-11-19md/raid10: Atomic write supportJohn Garry
Set BLK_FEAT_ATOMIC_WRITES_STACKED to enable atomic writes. For an attempt to atomic write to a region which has bad blocks, error the write as we just cannot do this. It is unlikely to find devices which support atomic writes and bad blocks. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-18Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe updates via Keith: - Use uring_cmd helper (Pavel) - Host Memory Buffer allocation enhancements (Christoph) - Target persistent reservation support (Guixin) - Persistent reservation tracing (Guixen) - NVMe 2.1 specification support (Keith) - Rotational Meta Support (Matias, Wang, Keith) - Volatile cache detection enhancment (Guixen) - MD updates via Song: - Maintainers update - raid5 sync IO fix - Enhance handling of faulty and blocked devices - raid5-ppl atomic improvement - md-bitmap fix - Support for manually defining embedded partition tables - Zone append fixes and cleanups - Stop sending the queued requests in the plug list to the driver ->queue_rqs() handle in reverse order. - Zoned write plug cleanups - Cleanups disk stats tracking and add support for disk stats for passthrough IO - Add preparatory support for file system atomic writes - Add lockdep support for queue freezing. Already found a bunch of issues, and some fixes for that are in here. More will be coming. - Fix race between queue stopping/quiescing and IO queueing - ublk recovery improvements - Fix ublk mmap for 64k pages - Various fixes and cleanups * tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits) MAINTAINERS: Update git tree for mdraid subsystem block: make struct rq_list available for !CONFIG_BLOCK block/genhd: use seq_put_decimal_ull for diskstats decimal values block: don't reorder requests in blk_mq_add_to_batch block: don't reorder requests in blk_add_rq_to_plug block: add a rq_list type block: remove rq_list_move virtio_blk: reverse request order in virtio_queue_rqs nvme-pci: reverse request order in nvme_queue_rqs btrfs: validate queue limits block: export blk_validate_limits nvmet: add tracing of reservation commands nvme: parse reservation commands's action and rtype to string nvmet: report ns's vwc not present md/raid5: Increase r5conf.cache_name size block: remove the ioprio field from struct request block: remove the write_hint field from struct request nvme: check ns's volatile write cache not present nvme: add rotational support nvme: use command set independent id ns if available ...
2024-11-11md/raid10: Handle bio_split() errorsJohn Garry
Add proper bio_split() error handling. For any error, call raid_end_bio_io() and return. Except for discard, where we end the bio directly. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-7-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-05md/raid10: don't wait for Faulty rdev in wait_blocked_rdev()Yu Kuai
Faulty rdev should never be accessed anymore, hence there is no point to wait for bad block to be acknowledged in this case while handling write request. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-7-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-10-17md/raid10: fix null ptr dereference in raid10_size()Yu Kuai
In raid10_run() if raid10_set_queue_limits() succeed, the return value is set to zero, and if following procedures failed raid10_run() will return zero while mddev->private is still NULL, causing null ptr dereference in raid10_size(). Fix the problem by only overwrite the return value if raid10_set_queue_limits() failed. Fixes: 3d8466ba68d4 ("md/raid10: use the atomic queue limit update APIs") Cc: stable@vger.kernel.org Reported-and-tested-by: ValdikSS <iam@valdikss.org.ru> Closes: https://lore.kernel.org/all/0dd96820-fe52-4841-bc58-dbf14d6bfcc8@valdikss.org.ru/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241009014914.1682037-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: merge md_bitmap_resize() into bitmap_operationsYu Kuai
So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-36-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: pass in mddev directly for md_bitmap_resize()Yu Kuai
And move the condition "if (mddev->bitmap)" into md_bitmap_resize() as well, on the one hand make code cleaner, on the other hand try not to access bitmap directly. Since we are here, also change the parameter 'init' from int to bool. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-35-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()Yu Kuai
Add a parameter 'bool sync' to distinguish them, and md_bitmap_unplug_async() won't be exported anymore, hence bitmap_operations only need one op to cover them. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-32-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operationsYu Kuai
So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-30-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operationsYu Kuai
So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-29-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operationsYu Kuai
So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-28-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync()Yu Kuai
For internal callers, aborted are always set to false, while for external callers, aborted are always set to true. Hence there is no need to always pass in true for exported api. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-27-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operationsYu Kuai
So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Also fix lots of code style. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-26-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operationsYu Kuai
So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. And change the type of 'success' and 'behind' from int to bool. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-25-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-27md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operationsYu Kuai
So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. And change the type of 'behind' from int to bool. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-24-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2024-08-15md: convert comma to semicolonChen Ni
Replace a comma between expression statements by a semicolon. Fixes: 5e5702898e93 ("md/raid10: Handle read errors during recovery better.") Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Link: https://lore.kernel.org/r/20240716025852.400259-1-nichen@iscas.ac.cn Signed-off-by: Song Liu <song@kernel.org>
2024-06-26md: set md-specific flags for all queue limitsChristoph Hellwig
The md driver wants to enforce a number of flags for all devices, even when not inheriting them from the underlying devices. To make sure these flags survive the queue_limits_set calls that md uses to update the queue limits without deriving them form the previous limits add a new md_init_stacking_limits helper that calls blk_set_stacking_limits and sets these flags. Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe
Pull in block limits branch, which exists as a shared branch for both the block and SCSI tree. * for-6.11/block-limits: (26 commits) block: move integrity information into queue_limits block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags block: bypass the STABLE_WRITES flag for protection information block: don't require stable pages for non-PI metadata block: use kstrtoul in flag_store block: factor out flag_{store,show} helper for integrity block: remove the blk_flush_integrity call in blk_integrity_unregister block: remove the blk_integrity_profile structure dm-integrity: use the nop integrity profile md/raid1: don't free conf on raid0_run failure md/raid0: don't free conf on raid0_run failure block: initialize integrity buffer to zero before writing it to media block: add special APIs for run-time disabling of discard and friends block: remove unused queue limits API sr: convert to the atomic queue limits API sd: convert to the atomic queue limits API sd: cleanup zoned queue limits initialization sd: factor out a sd_discard_mode helper sd: simplify the disable case in sd_config_discard sd: add a sd_disable_write_same helper ...
2024-06-14block: move integrity information into queue_limitsChristoph Hellwig
Move the integrity information into the queue limits so that it can be set atomically with other queue limits, and that the sysfs changes to the read_verify and write_generate flags are properly synchronized. This also allows to provide a more useful helper to stack the integrity fields, although it still is separate from the main stacking function as not all stackable devices want to inherit the integrity settings. Even with that it greatly simplifies the code in md and dm. Note that the integrity field is moved as-is into the queue limits. While there are good arguments for removing the separate blk_integrity structure, this would cause a lot of churn and might better be done at a later time if desired. However the integrity field in the queue_limits structure is now unconditional so that various ifdefs can be avoided or replaced with IS_ENABLED(). Given that tiny size of it that seems like a worthwhile trade off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12md: pass in max_sectors for pers->sync_request()Yu Kuai
For different sync_action, sync_thread will use different max_sectors, see details in md_sync_max_sectors(), currently both md_do_sync() and pers->sync_request() in eatch iteration have to get the same max_sectors. Hence pass in max_sectors for pers->sync_request() to prevent redundant code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-12-yukuai1@huaweicloud.com
2024-06-10md: change the return value type of md_write_start to voidLi Nan
Commit cc27b0c78c79 ("md: fix deadlock between mddev_suspend() and md_write_start()") aborted md_write_start() with false when mddev is suspended, which fixed a deadlock if calling mddev_suspend() with holding reconfig_mutex(). Since mddev_suspend() now includes lockdep_assert_not_held(), it no longer holds the reconfig_mutex. This makes previous abort unnecessary. Now, remove unnecessary abort and change function return value to void. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-2-linan666@huaweicloud.com
2024-03-11Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...
2024-03-06md: remove mddev->queueChristoph Hellwig
Just use the request_queue from the gendisk pointer in the relatively few places that sill need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-11-hch@lst.de
2024-03-06md/raid10: use the atomic queue limit update APIsChristoph Hellwig
Build the queue limits outside the queue and apply them using queue_limits_set. To make the code more obvious also split the queue limits handling into separate helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-9-hch@lst.de
2024-03-06md: add a mddev_is_dm helperChristoph Hellwig
Add a helper to check for a DM-mapped MD device instead of using the obfuscated ->gendisk or ->queue NULL checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-4-hch@lst.de
2024-03-06md: add a mddev_add_trace_msg helperChristoph Hellwig
Add a small wrapper around blk_add_trace_msg that hides some argument dereferences and the check for a DM-mapped MD device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-3-hch@lst.de
2024-03-06md: add a mddev_trace_remap helperChristoph Hellwig
Add a helper to trace bio remapping that hides some argument dereferences and the check for a DM-mapped MD device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-2-hch@lst.de
2024-02-29md/raid1-10: factor out a new helper raid1_should_read_first()Yu Kuai
If resync is in progress, read_balance() should find the first usable disk, otherwise, data could be inconsistent after resync is done. raid1 and raid10 implement the same checking, hence factor out the checking to make code cleaner. Noted that raid1 is using 'mddev->recovery_cp', which is updated after all resync IO is done, while raid10 is using 'conf->next_resync', which is inaccurate because raid10 update it before submitting resync IO. Fortunately, raid10 read IO can't concurrent with resync IO, hence there is no problem. And this patch also switch raid10 to use 'mddev->recovery_cp'. Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-7-yukuai1@huaweicloud.com
2024-02-29md: add a new helper rdev_has_badblock()Yu Kuai
The current api is_badblock() must pass in 'first_bad' and 'bad_sectors', however, many caller just want to know if there are badblocks or not, and these caller must define two local variable that will never be used. Add a new helper rdev_has_badblock() that will only return if there are badblocks or not, remove unnecessary local variables and replace is_badblock() with the new helper in many places. There are no functional changes, and the new helper will also be used later to refactor read_balance(). Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com
2024-02-15md: Don't register sync_thread for reshape directlyYu Kuai
Currently, if reshape is interrupted, then reassemble the array will register sync_thread directly from pers->run(), in this case 'MD_RECOVERY_RUNNING' is set directly, however, there is no guarantee that md_do_sync() will be executed, hence stop_sync_thread() will hang because 'MD_RECOVERY_RUNNING' can't be cleared. Last patch make sure that md_do_sync() will set MD_RECOVERY_DONE, however, following hang can still be triggered by dm-raid test shell/lvconvert-raid-reshape.sh occasionally: [root@fedora ~]# cat /proc/1982/stack [<0>] stop_sync_thread+0x1ab/0x270 [md_mod] [<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod] [<0>] raid_presuspend+0x1e/0x70 [dm_raid] [<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod] [<0>] __dm_destroy+0x2a5/0x310 [dm_mod] [<0>] dm_destroy+0x16/0x30 [dm_mod] [<0>] dev_remove+0x165/0x290 [dm_mod] [<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod] [<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod] [<0>] vfs_ioctl+0x21/0x60 [<0>] __x64_sys_ioctl+0xb9/0xe0 [<0>] do_syscall_64+0xc6/0x230 [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 Meanwhile mddev->recovery is: MD_RECOVERY_RUNNING | MD_RECOVERY_INTR | MD_RECOVERY_RESHAPE | MD_RECOVERY_FROZEN Fix this problem by remove the code to register sync_thread directly from raid10 and raid5. And let md_check_recovery() to register sync_thread. Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-5-yukuai1@huaweicloud.com
2023-12-15md: factor out a helper exceed_read_errors() to check read_errorsLi Nan
Move check_decay_read_errors() to raid1-10.c and factor out a helper exceed_read_errors() to check if read_errors exceeds the limit, so that raid1 can also use it. There are no functional changes. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231215023852.3478228-2-linan666@huaweicloud.com
2023-11-27md/raid10: remove rcu protection to access rdev from confYu Kuai
Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid10. This patch also cleanup the code to handle the case that replacement replace rdev while IO is still inflight. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-3-yukuai1@huaweicloud.com
2023-11-27md: remove flag RemoveSynchronizedYu Kuai
rcu is not used correctly here, because synchronize_rcu() is called before replacing old value, for example: remove_and_add_spares // other path synchronize_rcu // called before replacing old value set_bit(RemoveSynchronized) rcu_read_lock() rdev = conf->mirros[].rdev pers->hot_remove_disk conf->mirros[].rdev = NULL; if (!test_bit(RemoveSynchronized)) synchronize_rcu /* * won't be called, and won't wait * for concurrent readers to be done. */ // access rdev after remove_and_add_spares() rcu_read_unlock() Fortunately, there is a separate rcu protection to prevent such rdev to be freed: md_kick_rdev_from_array //other path rcu_read_lock() rdev = conf->mirros[].rdev list_del_rcu(&rdev->same_set) rcu_read_unlock() /* * rdev can be removed from conf, but * rdev won't be freed. */ synchronize_rcu() free rdev Hence remove this useless flag and prepare to remove rcu protection to access rdev from 'conf'. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com
2023-09-22md: initialize 'writes_pending' while allocating mddevYu Kuai
Currently 'writes_pending' is initialized in pers->run for raid1/5/10, and it's freed while deleing mddev, instead of pers->free. pers->run can be called multiple times before mddev is deleted, and a helper mddev_init_writes_pending() is used to prevent 'writes_pending' to be initialized multiple times, this usage is safe but a litter weird. On the other hand, 'writes_pending' is only initialized for raid1/5/10, however, it's used in common layer, for example: array_state_store set_in_sync if (!mddev->in_sync) -> in_sync is used for all levels // access writes_pending There might be some implicit dependency that I don't recognized to make sure 'writes_pending' can only be accessed for raid1/5/10, but there are no comments about that. By the way, it make sense to initialize 'writes_pending' in common layer because there are already three levels use it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com
2023-08-15md: Hold mddev->reconfig_mutex when trying to get mddev->sync_threadLi Lingfeng
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mdev to md_unregister_thread(). Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-08-15md/raid10: fix a 'conf->barrier' leakage in raid10_takeover()Yu Kuai
After commit b39f35ebe86d ("md: don't quiesce in mddev_suspend()"), 'conf->barrier' will be leaked in the case that raid10 takeover raid0: level_store pers->takeover -> raid10_takeover raid10_takeover_raid0 WRITE_ONCE(conf->barrier, 1) mddev_suspend // still raid0 mddev->pers = pers // switch to raid10 mddev_resume // resume without suspend After the above commit, mddev_resume() will not decrease 'conf->barrier' that is set in raid10_takeover_raid0(). Fix this problem by not setting 'conf->barrier' in raid10_takeover_raid0(). By the way, this problem is found while I'm trying to make mddev_suspend/resume() to be independent from raid personalities. raid10 is the only personality to use reference count in the quiesce() callback and this problem is only related to raid10. Fixes: b39f35ebe86d ("md: don't quiesce in mddev_suspend()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/r/20230731022800.1424902-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: use dereference_rdev_and_rrdev() to get devicesLi Nan
Commit 2ae6aaf76912 ("md/raid10: fix io loss while replacement replace rdev") reads replacement first to prevent io loss. However, there are same issue in wait_blocked_dev() and raid10_handle_discard(), too. Fix it by using dereference_rdev_and_rrdev() to get devices. Fixes: d30588b2731f ("md/raid10: improve raid10 discard request") Fixes: f2e7e269a752 ("md/raid10: pull the code that wait for blocked dev into one function") Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-4-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: factor out dereference_rdev_and_rrdev()Li Nan
Factor out a helper to get 'rdev' and 'replacement' from config->mirrors. Just to make code cleaner and prepare to fix the bug of io loss while 'replacement' replace 'rdev'. There is no functional change. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: check replacement and rdev to prevent submit the same io twiceLi Nan
After commit 4ca40c2ce099 ("md/raid10: Allow replacement device to be replace old drive."), 'rdev' and 'replacement' could appear to be identical. There are already checks for that in wait_blocked_dev() and raid10_write_request(). Add check for raid10_handle_discard() now. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20230701080529.2684932-2-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md: remove redundant check in fix_read_error()Li Nan
In fix_read_error(), 'success' will be checked immediately after assigning it, if it is set to 1 then the loop will break. Checking it again in condition of loop is redundant. Clean it up. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230623173236.2513554-3-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: optimize fix_read_errorLi Nan
We dereference r10_bio->read_slot too many times in fix_read_error(). Optimize it by using a variable to store read_slot. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230623173236.2513554-2-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2023-07-27md/raid10: switch to use md_account_bio() for io accountingYu Kuai
Make sure that 'active_io' will represent inflight io instead of io that is dispatching, and io accounting from all levels will be consistent. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621165110.1498313-6-yukuai1@huaweicloud.com
2023-06-23raid10: avoid spin_lock from fastpath from raid10_unplug()Yu Kuai
Commit 0c0be98bbe67 ("md/raid10: prevent unnecessary calls to wake_up() in fast path") missed one place, for example, with: fio -direct=1 -rw=write/randwrite -iodepth=1 ... Plug and unplug are called for each io, then wake_up() from raid10_unplug() will cause lock contention as well. Avoid this contention by using wake_up_barrier() instead of wake_up(), where spin_lock is not held if waitqueue is empty. Fio test script: [global] name=random reads and writes ioengine=libaio direct=1 readwrite=randrw rwmixread=70 iodepth=64 buffered=0 filename=/dev/md0 size=1G runtime=30 time_based randrepeat=0 norandommap refill_buffers ramp_time=10 bs=4k numjobs=400 group_reporting=1 [job1] Test result with ramdisk raid10(By Ali): Before this patch With this patch READ IOPS=2033k IOPS=3642k WRITE IOPS=871k IOPS=1561K By the way, in this scenario, blk_plug_cb() will be allocated and freed for each io, this seems need to be optimized as well. Reported-and-tested-by: Ali Gholami Rudi <aligrudi@gmail.com> Closes: https://lore.kernel.org/all/20231606122233@laper.mirepesht/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com
2023-06-23md/raid10: fix the condition to call bio_end_io_acct()Li Nan
/sys/block/[device]/queue/iostats is used to control whether to count io stat. Write 0 to it will clear queue_flags QUEUE_FLAG_IO_STAT which means iostats is disabled. If we disable iostats and later endable it, the io issued during this period will be counted incorrectly, inflight will be decreased to -1. //T1 set iostats echo 0 > /sys/block/md0/queue/iostats clear QUEUE_FLAG_IO_STAT //T2 issue io if (QUEUE_FLAG_IO_STAT) -> false bio_start_io_acct inflight++ echo 1 > /sys/block/md0/queue/iostats set QUEUE_FLAG_IO_STAT //T3 io end if (QUEUE_FLAG_IO_STAT) -> true bio_end_io_acct inflight-- -> -1 Also, if iostats is enabled while issuing io but disabled while io end, inflight will never be decreased. Fix it by checking start_time when io end. If start_time is not 0, call bio_end_io_acct(). Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230609094320.2397604-1-linan666@huaweicloud.com