2024-03-05  md: export helper md_is_rdwr()  (Yu Kuai)
There are no functional changes for now; this prepares to fix a deadlock for dm-raid456.
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-4-yukuai1@huaweicloud.com
2024-03-05  md: export helpers to stop sync_thread  (Yu Kuai)
Add new helpers:
  void md_idle_sync_thread(struct mddev *mddev);
  void md_frozen_sync_thread(struct mddev *mddev);
  void md_unfrozen_sync_thread(struct mddev *mddev);
The helpers will be used in dm-raid in later patches to fix regressions and prevent calling md_reap_sync_thread() directly.
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-3-yukuai1@huaweicloud.com
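A rough sketch of how a dm-raid caller might use these helpers around suspend and resume, assuming the prototypes above; the exact call sites belong to the later dm-raid fixes, and the raid_presuspend()/raid_resume() placement shown here is illustrative:

    /* sketch: keep the sync_thread frozen across a suspend/resume cycle */
    static void raid_presuspend(struct dm_target *ti)
    {
            struct raid_set *rs = ti->private;

            /* park the sync_thread and keep MD_RECOVERY_FROZEN set */
            md_frozen_sync_thread(&rs->md);
    }

    static void raid_resume(struct dm_target *ti)
    {
            struct raid_set *rs = ti->private;

            /* clear MD_RECOVERY_FROZEN and let recovery continue */
            md_unfrozen_sync_thread(&rs->md);
    }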
2024-03-05  md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume  (Yu Kuai)
After commit 9dbd1aa3a81c ("dm raid: add reshaping support to the target"), raid_ctr() sets MD_RECOVERY_FROZEN before md_run() and expects to keep the array frozen until resume. However, md_run() will clear the flag by setting mddev->recovery to 0.
Before commit 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()"), dm-raid actually relied on suspending to prevent starting a new sync_thread.
Fix this problem by keeping 'MD_RECOVERY_FROZEN' for dm-raid in md_run().
Fixes: 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()")
Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240305072306.2562024-2-yukuai1@huaweicloud.com
2024-03-01  nbd: use the atomic queue limits API in nbd_set_size  (Christoph Hellwig)
Use queue_limits_start_update / queue_limits_commit_update to update all the limits in one go and with proper sanity checking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
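For reference, the general shape of the new API as it would be used here (a minimal sketch, not the exact nbd code; the limit fields touched are assumptions based on the description above):

    struct queue_limits lim;
    int error;

    /* take a snapshot of the current limits, held under the limits lock */
    lim = queue_limits_start_update(nbd->disk->queue);
    lim.logical_block_size = blksize;
    lim.physical_block_size = blksize;
    /* validate and apply all changes atomically */
    error = queue_limits_commit_update(nbd->disk->queue, &lim);
    if (error)
            return error;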
2024-03-01  nbd: freeze the queue for queue limits updates  (Christoph Hellwig)
nbd currently updates the logical and physical block sizes as well as the discard_sectors on a live queue. Freeze the queue first to make sure there are no commands in flight that can see torn or inconsistent limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
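The pattern, roughly, is to bracket the limits update with a queue freeze so no in-flight request can observe half-updated values (sketch, assuming the update lives in nbd_set_size()):

    blk_mq_freeze_queue(nbd->disk->queue);
    /* apply queue_limits_start_update()/queue_limits_commit_update() here */
    blk_mq_unfreeze_queue(nbd->disk->queue);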
2024-03-01  nbd: don't clear discard_sectors in nbd_config_put  (Christoph Hellwig)
nbd_config_put currently clears discard_sectors when a device stops being used. This is pretty odd behavior and different from the sector size configuration, which is simply left in place and then reconfigured when nbd_set_size is called as part of configuring the device. Change nbd_set_size to clear discard_sectors if discard is not supported, so that all the queue limits changes are handled in one place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229143846.1047223-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01  pktcdvd: don't set max_hw_sectors on the underlying device  (Christoph Hellwig)
pktcdvd has set max_hw_sectors on the queue of the underlying device, which it doesn't own (and never resets), since the driver was merged. This can create all kinds of problems, as the underlying driver doesn't even know about pktcdvd changing the limit. As the stated purpose is to not create I/Os larger than a single frame, and pktcdvd never builds bios larger than that, just set REQ_NOMERGE on the bios it submits so that larger I/Os never get built.
Note: I don't have packet writing hardware, so this is compile tested only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240229144408.1047967-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
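A minimal sketch of the approach: mark each frame-sized bio pktcdvd submits to the underlying device as non-mergeable instead of clamping the lower queue's limits. The surrounding function name is illustrative, not the exact pktcdvd code:

    static void pkt_submit_frame(struct pktcdvd_device *pd, struct bio *bio)
    {
            /*
             * The bio is already at most one frame; forbid the block layer
             * from merging it into something larger on the lower queue.
             */
            bio->bi_opf |= REQ_NOMERGE;
            submit_bio(bio);
    }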
2024-03-01  dm: use queue_limits_set  (Christoph Hellwig)
Use queue_limits_set, which validates the limits and takes care of updating the readahead settings, instead of directly assigning them to the queue. For that, make sure all limits are actually updated before the assignment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01  block: add a queue_limits_stack_bdev helper  (Christoph Hellwig)
Add a small wrapper around blk_stack_limits that allows passing a bdev for the bottom device and prints an error in case of a misaligned device. The name fits into the new queue limits API, and the intent is to eventually replace disk_stack_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228225653.947152-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
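Roughly how a stacking driver would fold a component device into its limits with the new helper (a sketch; the argument order shown — limits, bottom bdev, offset, name prefix — and the component_bdev/data_offset variables are assumptions):

    struct queue_limits lim;

    blk_set_stacking_limits(&lim);
    /* stack in one underlying device, warning if it is misaligned */
    queue_limits_stack_bdev(&lim, component_bdev, data_offset,
                            disk->disk_name);
    queue_limits_set(disk->queue, &lim);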
2024-03-01  block: add a queue_limits_set helper  (Christoph Hellwig)
Add a small wrapper around queue_limits_commit_update for stacking drivers that don't want to update existing limits, but set an entirely new set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228225653.947152-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
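Conceptually the helper boils down to taking the limits lock and committing a caller-built set of limits in one go; a sketch of that idea, which may differ in detail from the upstream body:

    int queue_limits_set(struct request_queue *q, struct queue_limits *lim)
    {
            mutex_lock(&q->limits_lock);
            /* validate and publish the caller-built limits atomically */
            return queue_limits_commit_update(q, lim);
    }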
2024-03-01  Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block  (Jens Axboe)
Pull MD updates from Song:
 "The major changes are:
  1. Refactor raid1 read_balance, by Yu Kuai and Paul Luse.
  2. Clean up and fix for md_ioctl, by Li Nan.
  3. Other small fixes, by Gui-Dong Han and Heming Zhao."
* tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (22 commits)
  md/raid1: factor out helpers to choose the best rdev from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1: fix choose next idle in read_balance()
  md/raid1: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: factor out helpers to add rdev to conf
  md: add a new helper rdev_has_badblock()
  md/raid5: fix atomicity violation in raid5_cache_count
  md/md-bitmap: fix incorrect usage for sb_index
  md: check mddev->pers before calling md_set_readonly()
  md: clean up openers check in do_md_stop() and md_set_readonly()
  md: sync blockdev before stopping raid or setting readonly
  md: factor out a helper to sync mddev
  md: Don't clear MD_CLOSING when the raid is about to stop
  md: return directly before setting did_set_md_closing
  md: clean up invalid BUG_ON in md_ioctl
  ...
2024-02-29  Merge branch 'raid1-read_balance' into md-6.9  (Song Liu)
From: Yu Kuai <yukuai3@huawei.com>
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
The original idea is that Paul wanted to optimize raid1 read performance ([1]); however, we think that the original code for read_balance() is quite complex, and we don't want to add more complexity. Hence we decided to refactor read_balance() first, to make the code cleaner and easier to follow up on.
Before this patchset, read_balance() has many local variables and many branches; it tries to consider all the scenarios in one iteration. The idea of this patchset is to divide that work into 4 different steps:
 1) If resync is in progress, find the first usable disk (patch 5). Otherwise:
 2) Loop through all disks, skipping slow disks and disks with bad blocks, and choose the best disk (patch 10). If no disk is found:
 3) Look for disks with bad blocks and choose the one that can read the most sectors (patch 8). If no disk is found:
 4) Choose the first found slow disk with no bad blocks, or the slow disk that can read the most sectors (patch 7).
Note that steps 3) and 4) are rarely taken code paths, and performance is not a concern there. After this patchset, we'll continue to optimize read_balance() for step 2), specifically how to choose the best rdev to read from.
[1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/
Yu Kuai (11):
  md: add a new helper rdev_has_badblock()
  md/raid1: factor out helpers to add rdev to conf
  md/raid1: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: fix choose next idle in read_balance()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out helpers to choose the best rdev from read_balance()
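Expressed as pseudocode, the resulting read_balance() structure looks roughly like this. Helper names for steps 1, 3 and 4 are taken from the patch subjects; choose_best_rdev() for step 2, the arguments and the control flow are simplifications, not the exact upstream code:

    static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
                            int *max_sectors)
    {
            int disk;

            /* step 1: resync in progress -> read from the first usable disk */
            if (raid1_should_read_first(conf->mddev, r1_bio->sector,
                                        r1_bio->sectors))
                    return read_first_rdev(conf, r1_bio, max_sectors);

            /* step 2: normal case, pick the best idle/closest disk */
            disk = choose_best_rdev(conf, r1_bio);
            if (disk >= 0)
                    return disk;

            /* step 3: only disks with bad blocks cover this range */
            disk = choose_bb_rdev(conf, r1_bio, max_sectors);
            if (disk >= 0)
                    return disk;

            /* step 4: fall back to a slow (write-mostly) disk */
            return choose_slow_rdev(conf, r1_bio, max_sectors);
    }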
2024-02-29  md/raid1: factor out helpers to choose the best rdev from read_balance()  (Yu Kuai)
The way that the best rdev is chosen:
1) If the read is sequential from one rdev:
   - if the rdev is rotational, use this rdev;
   - if the rdev is non-rotational, use this rdev until the total read length exceeds the disk's optimal I/O size;
2) If the read is not sequential:
   - if there is an idle disk, use it; otherwise:
   - if the array has a non-rotational disk, choose the rdev with minimal inflight IO;
   - if all the underlying disks are rotational, choose the rdev with the closest IO position.
There are no functional changes; this just makes the code cleaner and prepares for the following refactor.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-12-yukuai1@huaweicloud.com
2024-02-29  md/raid1: factor out the code to manage sequential IO  (Yu Kuai)
There is no functional change for now; this makes read_balance() cleaner and prepares to fix problems and refactor the handling of sequential IO.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-11-yukuai1@huaweicloud.com
2024-02-29  md/raid1: factor out choose_bb_rdev() from read_balance()  (Yu Kuai)
read_balance() is hard to understand because it is overlong and has too many states and branches. This patch factors out the case of reading from an rdev with bad blocks; there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-10-yukuai1@huaweicloud.com
2024-02-29  md/raid1: factor out choose_slow_rdev() from read_balance()  (Yu Kuai)
read_balance() is hard to understand because it is overlong and has too many states and branches. This patch factors out the case of reading from a slow rdev; there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-9-yukuai1@huaweicloud.com
2024-02-29  md/raid1: factor out read_first_rdev() from read_balance()  (Yu Kuai)
read_balance() is hard to understand because it is overlong and has too many states and branches. This patch factors out the case of reading from the first rdev; there are no functional changes.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-8-yukuai1@huaweicloud.com
2024-02-29  md/raid1-10: factor out a new helper raid1_should_read_first()  (Yu Kuai)
If resync is in progress, read_balance() should find the first usable disk; otherwise data could be inconsistent after resync is done. raid1 and raid10 implement the same check, hence factor it out to make the code cleaner.
Note that raid1 uses 'mddev->recovery_cp', which is updated after all resync IO is done, while raid10 uses 'conf->next_resync', which is inaccurate because raid10 updates it before submitting resync IO. Fortunately, raid10 read IO can't run concurrently with resync IO, hence there is no problem. This patch also switches raid10 to use 'mddev->recovery_cp'.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-7-yukuai1@huaweicloud.com
2024-02-29  md/raid1-10: add a helper raid1_check_read_range()  (Yu Kuai)
The checking and handling of bad blocks appears many times during read_balance() in raid1 and raid10. This helper will be used in later patches to simplify read_balance() a lot.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-6-yukuai1@huaweicloud.com
2024-02-29  md/raid1: fix choose next idle in read_balance()  (Yu Kuai)
Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added the "choose next idle" case in read_balance():

 read_balance:
  for_each_rdev
   if (next_seq_sect == this_sector || dist == 0)
    -> sequential reads
    best_disk = disk;
    if (...)
     choose_next_idle = 1
     continue;

  for_each_rdev
   -> iterate next rdev
   if (pending == 0)
    best_disk = disk;
    -> choose the next idle disk
    break;

   if (choose_next_idle)
    -> keep using this rdev if there is no other idle disk
    continue;

However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.") removed the code:

 -       /* If device is idle, use it */
 -       if (pending == 0) {
 -               best_disk = disk;
 -               break;
 -       }

Hence "choose next idle" never works now. Fix this problem as follows:
1) don't set best_disk in this case; read_balance() will choose the best disk after iterating all the disks;
2) add 'pending' so that another idle disk will be chosen;
3) add a new local variable 'sequential_disk' to record the disk; if there is no other idle disk, 'sequential_disk' will be chosen.
Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-5-yukuai1@huaweicloud.com
2024-02-29  md/raid1: record nonrot rdevs while adding/removing rdevs to conf  (Yu Kuai)
For raid1, each read will iterate all the rdevs from the conf and check if any rdev is non-rotational, then choose the rdev with minimal inflight IO if so, or the rdev with the closest distance otherwise.
Disk nonrot info can be changed through the sysfs entry:
 /sys/block/[disk_name]/queue/rotational
However, this should only be used for testing, and users really shouldn't do it in real life. Record the number of non-rotational disks in the conf to avoid checking each rdev in the IO fast path and to simplify read_balance() a little bit.
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com
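A sketch of the idea: keep a running count when rdevs are added to the conf, so the read path only has to look at one integer. The helper name and the nonrot_disks field are illustrative, not necessarily the upstream names:

    static void raid1_add_conf(struct r1conf *conf, struct md_rdev *rdev,
                               int disk)
    {
            struct raid1_info *info = conf->mirrors + disk;

            info->rdev = rdev;
            /* count non-rotational members once, instead of per read */
            if (bdev_nonrot(rdev->bdev))
                    conf->nonrot_disks++;
    }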
2024-02-29  md/raid1: factor out helpers to add rdev to conf  (Yu Kuai)
There are no functional changes; just make the code cleaner and prepare to record disk non-rotational information while adding and removing rdevs to the conf.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-3-yukuai1@huaweicloud.com
2024-02-29  md: add a new helper rdev_has_badblock()  (Yu Kuai)
The current API is_badblock() requires passing in 'first_bad' and 'bad_sectors'; however, many callers just want to know whether there are bad blocks or not, and these callers must define two local variables that will never be used.
Add a new helper rdev_has_badblock() that only returns whether there are bad blocks or not, remove the unnecessary local variables, and replace is_badblock() with the new helper in many places.
There are no functional changes, and the new helper will also be used later to refactor read_balance().
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-2-yukuai1@huaweicloud.com
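What that looks like at a typical call site, sketched from the description above (the surrounding goto label is illustrative):

    /* before: throwaway outputs just to ask a yes/no question */
    sector_t first_bad;
    int bad_sectors;

    if (is_badblock(rdev, sector, sectors, &first_bad, &bad_sectors))
            goto retry;

    /* after: the helper answers the yes/no question directly */
    if (rdev_has_badblock(rdev, sector, sectors))
            goto retry;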
2024-02-28  ublk: add UBLK_CMD_DEL_DEV_ASYNC  (Ming Lei)
The current command UBLK_CMD_DEL_DEV won't return until the device is released. This looks more reliable, but makes userspace more difficult to implement, especially regarding ordering: unmap the command buffer (which holds one ublkc reference), ublkc close, io_uring_file_unregister, ublkb close.
Add UBLK_CMD_DEL_DEV_ASYNC so that device deletion doesn't wait for the release; then userspace needn't worry about the above order. Actually, both loop and nbd are deleted in this async way.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240223075539.89945-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-28  ublk: improve getting & putting ublk device  (Ming Lei)
First, convert the get_device() and put_device() calls into ublk_get_device() and ublk_put_device(). Second, annotate ublk_get_device() & ublk_put_device() as noinline for tracing, especially since device deletion hangs are often triggered when an incorrect order is used for ublkc mmap, ublkc close, io_uring_sqe_unregister_file, ublkb close.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240223075539.89945-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-28  blk-mq: don't change nr_hw_queues and nr_maps for kdump kernel  (Ming Lei)
For most architectures, 'nr_cpus=1' is passed for the kdump kernel, so nr_hw_queues for each mapping is supposed to be 1 already. More importantly, overriding it may cause trouble for the driver, because blk-mq and the driver would see different queue mappings, since the driver sets up its hardware queues before allocating the blk-mq tag set.
So don't override nr_hw_queues and nr_maps for the kdump kernel.
Cc: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240228040857.306483-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  md/raid5: fix atomicity violation in raid5_cache_count  (Gui-Dong Han)
In raid5_cache_count():

 if (conf->max_nr_stripes < conf->min_nr_stripes)
         return 0;
 return conf->max_nr_stripes - conf->min_nr_stripes;

In raid5_set_cache_size():

 ...
 conf->min_nr_stripes = size;
 ...
 while (size > conf->max_nr_stripes)
         conf->min_nr_stripes = conf->max_nr_stripes;
 ...

Due to intermediate value updates in raid5_set_cache_size(), concurrent execution of raid5_cache_count() and raid5_set_cache_size() may lead to inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes. The check above is ineffective, as the values could change immediately after being checked, raising the risk of conf->min_nr_stripes exceeding conf->max_nr_stripes and potentially causing an integer overflow.
This possible bug was found by an experimental static analysis tool developed by our team. The tool analyzes locking APIs to extract function pairs that can be concurrently executed, and then analyzes the instructions in the paired functions to identify possible concurrency bugs, including data races and atomicity violations. The above possible bug was reported when our tool analyzed the source code of Linux 6.2.
To resolve this issue, introduce local variables 'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the values remain stable throughout the check. Adding locks in raid5_cache_count() would not resolve the atomicity violation, as raid5_set_cache_size() may hold intermediate values of conf->min_nr_stripes while unlocked.
With this patch applied, our tool no longer reports the bug, with the kernel configuration allyesconfig for x86_64. Due to the lack of associated hardware, we cannot test the patch at runtime and have only verified it according to the code logic.
Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <2045gemini@gmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240112071017.16313-1-2045gemini@gmail.com
Signed-off-by: Song Liu <song@kernel.org>
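The fix boils down to snapshotting both values once before the comparison; a sketch of the shape of the fixed function (READ_ONCE() is used here for the snapshot, which may differ in detail from the applied patch):

    static unsigned long raid5_cache_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
    {
            struct r5conf *conf = shrink->private_data;
            int max_stripes = READ_ONCE(conf->max_nr_stripes);
            int min_stripes = READ_ONCE(conf->min_nr_stripes);

            /* both values are read exactly once, so the check and the
             * subtraction always operate on the same snapshot */
            if (max_stripes < min_stripes)
                    return 0;
            return max_stripes - min_stripes;
    }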
2024-02-27  ubd: open the backing files in ubd_add  (Christoph Hellwig)
Opening the backing device only when the block device is opened is a bit weird, as no one configures block devices not to use them. Open them at add time, close them at remove time, and remove the now superfluous opened counter, as remove can simply check disk_openers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  ubd: remove the queue pointer in struct ubd  (Christoph Hellwig)
No need for it now, everything goes through the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  ubd: move set_disk_ro to ubd_add  (Christoph Hellwig)
No need to delay this until open time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  ubd: move setting the variable queue limits to ubd_add  (Christoph Hellwig)
No reason to delay this until open time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  ubd: move setting the nonrot flag to ubd_add  (Christoph Hellwig)
No reason to delay this until open time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  ubd: remove ubd_disk_register  (Christoph Hellwig)
Fold it into the only caller to remove lots of references to the global ubd_devs array.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  ubd: remove the ubd_gendisk array  (Christoph Hellwig)
And add a disk pointer to the ubd structure instead, to keep all the per-device information together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20240222072417.3773131-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  xen-blkfront: atomically update queue limits  (Christoph Hellwig)
Pass the initial queue limits to blk_mq_alloc_disk and use the blkif_set_queue_limits API to update the limits on reconnect.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  xen-blkfront: don't redundantly set max_segments in blkif_recover  (Christoph Hellwig)
blkif_set_queue_limits already sets the max_segments limit, so don't do it a second time. Also remove a comment about a long-fixed bug in blk_mq_update_nr_hw_queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  xen-blkfront: rely on the default discard granularity  (Christoph Hellwig)
The block layer now sets the discard granularity to the physical block size by default. Take advantage of that in xen-blkfront and only set the discard granularity if explicitly specified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27  xen-blkfront: set max_discard/secure erase limits to UINT_MAX  (Christoph Hellwig)
Currently xen-blkfront sets the max discard limit to the capacity of the device, which is suboptimal when the capacity changes. Just set it to UINT_MAX, which has the same effect and is simpler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20240221125845.3610668-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-26  md/md-bitmap: fix incorrect usage for sb_index  (Heming Zhao)
Commit d7038f951828 ("md-bitmap: don't use ->index for pages backing the bitmap file") removed page->index from the bitmap code, but left wrong logic for clustered md: the current code never sets the slot offset for cluster nodes, which can sometimes cause a crash in a clustered environment.
Call trace (partly):
 md_bitmap_file_set_bit+0x110/0x1d8 [md_mod]
 md_bitmap_startwrite+0x13c/0x240 [md_mod]
 raid1_make_request+0x6b0/0x1c08 [raid1]
 md_handle_request+0x1dc/0x368 [md_mod]
 md_submit_bio+0x80/0xf8 [md_mod]
 __submit_bio+0x178/0x300
 submit_bio_noacct_nocheck+0x11c/0x338
 submit_bio_noacct+0x134/0x614
 submit_bio+0x28/0xdc
 submit_bh_wbc+0x130/0x1cc
 submit_bh+0x1c/0x28
Fixes: d7038f951828 ("md-bitmap: don't use ->index for pages backing the bitmap file")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240223121128.28985-1-heming.zhao@suse.com
2024-02-26  md: check mddev->pers before calling md_set_readonly()  (Li Nan)
If 'mddev->pers' is NULL, there is nothing to do in md_set_readonly(). Except for md_ioctl(), the other two callers of md_set_readonly() have already checked 'mddev->pers'. To simplify the code, move the check of 'mddev->pers' to the caller.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-10-linan666@huaweicloud.com
2024-02-26  md: clean up openers check in do_md_stop() and md_set_readonly()  (Li Nan)
Before stopping or setting readonly, mddev_set_closing_and_sync_blockdev() is always called to check the openers. So there is no longer any need to check them again in do_md_stop() and md_set_readonly(). Clean it up.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-9-linan666@huaweicloud.com
2024-02-26  md: sync blockdev before stopping raid or setting readonly  (Li Nan)
Commit a05b7ea03d72 ("md: avoid crash when stopping md array races with closing other open fds.") added a blockdev sync before stopping the raid and setting it readonly. Later, in commit 260fa034ef7a ("md: avoid deadlock when dirty buffers during md_stop."), it was moved to the ioctl path, and array_state_store() was missed. Add the blockdev sync to array_state_store() now.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-8-linan666@huaweicloud.com
2024-02-26  md: factor out a helper to sync mddev  (Li Nan)
There are no functional changes; this prepares to sync the mddev in array_state_store().
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-7-linan666@huaweicloud.com
2024-02-26  md: Don't clear MD_CLOSING when the raid is about to stop  (Li Nan)
The raid should not be opened anymore when it is about to be stopped. However, other processes can open it again if the flag MD_CLOSING is cleared before exiting. From now on, this flag will no longer be cleared when the raid is about to be stopped.
Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-6-linan666@huaweicloud.com
2024-02-26  md: return directly before setting did_set_md_closing  (Li Nan)
There is nothing to do at 'out' before setting 'did_set_md_closing' in md_ioctl(). Return directly; this will help us remove 'did_set_md_closing' later.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-5-linan666@huaweicloud.com
2024-02-26  md: clean up invalid BUG_ON in md_ioctl  (Li Nan)
'disk->private_data' is set to the mddev in md_alloc() and never set to NULL, and users need to open the mddev before submitting an ioctl. So the mddev cannot have been freed during the ioctl, and there is no need to check it here. Clean it up.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-4-linan666@huaweicloud.com
2024-02-26  md: changed the switch of RAID_VERSION to if  (Li Nan)
There is only one case of this 'switch'. Change it to 'if'.
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-3-linan666@huaweicloud.com
2024-02-26  md: merge the check of capabilities into md_ioctl_valid()  (Li Nan)
There is no functional change; this is just to make the code cleaner.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240226031444.3606764-2-linan666@huaweicloud.com
2024-02-24  bdev: remove SLAB_MEM_SPREAD flag usage  (Chengming Zhou)
The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1; remove its usage so we can delete it from slab. No functional change.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20240224134646.829105-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-24  block/blk-mq: Don't complete locally if capacities are different  (Qais Yousef)
The logic in blk_mq_complete_need_ipi() assumes SMP systems where all CPUs have equal compute capacities and only the LLC can make a difference to perceived performance. But this assumption falls apart on HMP systems where the LLC is shared, but the CPUs have different capacities. Staying local then can have a big performance impact if the IO request was done from a CPU with higher capacity but the interrupt is serviced on a lower capacity CPU.
Use the new cpus_equal_capacity() function to check if we need to send an IPI.
Without the patch I see the BLOCK softirq always running on little cores (where the hardirq is serviced). With it I can see it running on all cores.
This was noticed after the topology change [1] where now on a big.LITTLE we truly get that the LLC is shared between all cores, whereas in the past it was misrepresented for historical reasons. The logic exposed a missing dependency on capacities for such systems, where there can be a big performance difference between the CPUs.
This of course introduced a noticeable change in behavior depending on how the topology is presented, leading to regressions in some workloads, as the performance of the BLOCK softirq on little cores can be noticeably worse on some platforms.
Worth noting that we could have checked for capacities being greater than or equal instead of for equality. That would always favour higher performance. But we opted for equality instead, to match the performance of the requester without making an assumption that can lead to power trade-offs, which these systems tend to be sensitive about. If the requester would like to run faster, it's better to rely on the scheduler (via some facility) to place it on a faster core; and then, if the interrupt triggers on a CPU with a different capacity, we'll make sure to match the performance the requester is supposed to run at.
[1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240223155749.2958009-3-qyousef@layalina.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
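The gist of the change in blk_mq_complete_need_ipi(), sketched from the description above (the surrounding checks are abbreviated and may differ slightly from the applied patch):

    static inline bool blk_mq_complete_need_ipi(struct request *rq)
    {
            int cpu = raw_smp_processor_id();

            /*
             * Complete locally only if this is the submitting CPU, or if the
             * two CPUs share the cache domain *and* have equal capacity.
             */
            if (cpu == rq->mq_ctx->cpu ||
                (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
                 cpus_share_cache(cpu, rq->mq_ctx->cpu) &&
                 cpus_equal_capacity(cpu, rq->mq_ctx->cpu)))
                    return false;

            /* otherwise send an IPI, unless the submitting CPU is offline */
            return cpu_online(rq->mq_ctx->cpu);
    }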