linux-arm.git - Russell King's ARM Linux kernel tree

Age	Commit message (Collapse)	Author
2025-03-06	md: improve return types of badblocks handling functions	Zheng Qixing
	rdev_set_badblocks() only indicates success/failure, so convert its return type from int to boolean for better semantic clarity. rdev_clear_badblocks() return value is never used by any caller, convert it to void. This removes unnecessary value returns. Also update narrow_write_error() in both raid1 and raid10 to use boolean return type to match rdev_set_badblocks(). Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-12-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: return boolean from badblocks_set() and badblocks_clear()	Zheng Qixing
	Change the return type of badblocks_set() and badblocks_clear() from int to bool, indicating success or failure. Specifically: - _badblocks_set() and _badblocks_clear() functions now return true for success and false for failure. - All calls to these functions are updated to handle the new boolean return type. - This change improves code clarity and ensures a more consistent handling of success and failure states. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: fix missing bad blocks on retry in _badblocks_check()	Zheng Qixing
	The bad blocks check would miss bad blocks when retrying under contention, as checking parameters are not reset. These stale values from the previous attempt could lead to incorrect scanning in the subsequent retry. Move seqlock to outer function and reinitialize checking state for each retry. This ensures a clean state for each check attempt, preventing any missed bad blocks. Fixes: 3ea3354cb9f0 ("badblocks: improve badblocks_check() for multiple ranges handling") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-10-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: fix merge issue when new badblocks align with pre+1	Li Nan
	There is a merge issue when adding badblocks as follow: echo 0 10 > bad_blocks echo 30 10 > bad_blocks echo 20 10 > bad_blocks cat bad_blocks 0 10 20 10 //should be merged with (30 10) 30 10 In this case, if new badblocks does not intersect with prev, it is added by insert_at(). If there is an intersection with prev+1, the merge will be processed in the next re_insert loop. However, when the end of the new badblocks is exactly equal to the offset of prev+1, no further re_insert loop occurs, and the two badblocks are not merge. Fix it by inc prev, badblocks can be merged during the subsequent code. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-9-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: try can_merge_front before overlap_front	Li Nan
	Regardless of whether overlap_front() returns true or false, can_merge_front() will be executed first. Therefore, move can_merge_front() in front of can_merge_front() to simplify code. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-8-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: fix the using of MAX_BADBLOCKS	Li Nan
	The number of badblocks cannot exceed MAX_BADBLOCKS, but it should be allowed to equal MAX_BADBLOCKS. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Fixes: c3c6a86e9efc ("badblocks: add helper routines for badblock ranges handling") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-7-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: return error if any badblock set fails	Li Nan
	_badblocks_set() returns success if at least one badblock is set successfully, even if others fail. This can lead to data inconsistencies in raid, where a failed badblock set should trigger the disk to be kicked out to prevent future reads from failed write areas. _badblocks_set() should return error if any badblock set fails. Instead of relying on 'rv', directly returning 'sectors' for clearer logic. If all badblocks are successfully set, 'sectors' will be 0, otherwise it indicates the number of badblocks that have not been set yet, thus signaling failure. By the way, it can also fix an issue: when a newly set unack badblock is included in an existing ack badblock, the setting will return an error. ··· echo "0 100" /sys/block/md0/md/dev-loop1/bad_blocks echo "0 100" /sys/block/md0/md/dev-loop1/unacknowledged_bad_blocks -bash: echo: write error: No space left on device ``` After fix, it will return success. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-6-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: return error directly when setting badblocks exceeds 512	Li Nan
	In the current handling of badblocks settings, a lot of processing has been done for scenarios where the number of badblocks exceeds 512. This makes the code look quite complex and also introduces some issues, For example, if there is 512 badblocks already: for((i=0; i<510; i++)); do ((sector=i*2)); echo "$sector 1" > bad_blocks; done echo 2100 10 > bad_blocks echo 2200 10 > bad_blocks Set new one, exceed 512: echo 2000 500 > bad_blocks Expected: 2000 500 Actual: 2100 400 In fact, a disk shouldn't have too many badblocks, and for disks with 512 badblocks, attempting to set more bad blocks doesn't make much sense. At that point, the more appropriate action would be to replace the disk. Therefore, to resolve these issues and simplify the code somewhat, return error directly when setting badblocks exceeds 512. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-5-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: attempt to merge adjacent badblocks during ack_all_badblocks	Li Nan
	If ack and unack badblocks are adjacent, they will not be merged and will remain as two separate badblocks. Even after the bad blocks are written to disk and both become ack, they will still remain as two independent bad blocks. This is not ideal as it wastes the limited space for badblocks. Therefore, during ack_all_badblocks(), attempt to merge badblocks if they are adjacent. Fixes: aa511ff8218b ("badblocks: switch to the improved badblock handling code") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-4-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: factor out a helper try_adjacent_combine	Li Nan
	Factor out try_adjacent_combine(), and it will be used in the later patch. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-3-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	badblocks: Fix error shitf ops	Li Nan
	'bb->shift' is used directly in badblocks. It is wrong, fix it. Fixes: 3ea3354cb9f0 ("badblocks: improve badblocks_check() for multiple ranges handling") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-2-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	block: Correctly initialize BLK_INTEGRITY_NOGENERATE and BLK_INTEGRITY_NOVERIFY	Anuj Gupta
	Currently, BLK_INTEGRITY_NOGENERATE and BLK_INTEGRITY_NOVERIFY are not explicitly set during integrity initialization. This can lead to incorrect reporting of read_verify and write_generate sysfs values, particularly when a device does not support integrity. Ensure that these flags are correctly initialized by default. Reported-by: M Nikhil <nikh1092@linux.ibm.com> Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/ Fixes: 9f4aa46f2a74 ("block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250305063033.1813-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	block: ensure correct integrity capability propagation in stacked devices	Anuj Gupta
	queue_limits_stack_integrity() incorrectly sets BLK_INTEGRITY_DEVICE_CAPABLE for a DM device even when none of its underlying devices support integrity. This happens because the flag is inherited unconditionally. Ensure that integrity capabilities are correctly propagated only when the underlying devices actually support integrity. Reported-by: M Nikhil <nikh1092@linux.ibm.com> Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/ Fixes: c6e56cf6b2e7 ("block: move integrity information into queue_limits") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250305063033.1813-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06	md/raid10: wait barrier before returning discard request with REQ_NOWAIT	Xiao Ni
	raid10_handle_discard should wait barrier before returning a discard bio which has REQ_NOWAIT. And there is no need to print warning calltrace if a discard bio has REQ_NOWAIT flag. Quality engineer usually checks dmesg and reports error if dmesg has warning/error calltrace. Fixes: c9aa889b035f ("md: raid10 add nowait support") Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/linux-raid/20250306094938.48952-1-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05	blk-throttle: carry over directly	Ming Lei
	Now ->carryover_bytes[] and ->carryover_ios[] only covers limit/config update. Actually the carryover bytes/ios can be carried to ->bytes_disp[] and ->io_disp[] directly, since the carryover is one-shot thing and only valid in current slice. Then we can remove the two fields and simplify code much. Type of ->bytes_disp[] and ->io_disp[] has to change as signed because the two fields may become negative when updating limits or config, but both are big enough for holding bytes/ios dispatched in single slice Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05	blk-throttle: don't take carryover for prioritized processing of metadata	Ming Lei
	Commit 29390bb5661d ("blk-throttle: support prioritized processing of metadata") takes bytes/ios carryover for prioritized processing of metadata. Turns out we can support it by charging it directly without trimming slice, and the result is same with carryover. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05	blk-throttle: remove last_bytes_disp and last_ios_disp	Ming Lei
	The two fields are not used any more, so remove them. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05	blk-throttle: fix lower bps rate by throtl_trim_slice()	Yu Kuai
	The bio submission time may be a few jiffies more than the expected waiting time, due to 'extra_bytes' can't be divided in tg_within_bps_limit(), and also due to timer wakeup delay. In this case, adjust slice_start to jiffies will discard the extra wait time, causing lower rate than expected. Current in-tree code already covers deviation by rounddown(), but turns out it is not enough, because jiffies - slice_start can be a multiple of throtl_slice. For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes per slice. If user issues two 21 bytes IO, then wait time will be 30ms for the first IO: bytes_allowed = 20, extra_bytes = 1; jiffy_wait = 1 + 2 = 3 jiffies and consider extra 1 jiffies by timer, throtl_trim_slice() will be called at: jiffies = 40ms slice_start = 0ms, slice_end= 40ms bytes_disp = 21 In this case, before the patch, real rate in the first two slices is 10.5 bytes per slice, and slice will be updated to: jiffies = 40ms slice_start = 40ms, slice_end = 60ms, bytes_disp = 0; Hence the second IO will have to wait another 30ms; With the patch, the real rate in the first slice is 20 bytes per slice, which is the same as expected, and slice will be updated: jiffies=40ms, slice_start = 20ms, slice_end = 60ms, bytes_disp = 1; And now, there is still 19 bytes allowed in the second slice, and the second IO will only have to wait 10ms; This problem will cause blktests throtl/001 failure in case of CONFIG_HZ_100=y, fix it by preserving one extra finished slice in throtl_trim_slice(). Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy") Reported-by: Ming Lei <ming.lei@redhat.com> Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227120645.812815-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05	md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb	Su Yue
	In clustermd, separate write-intent-bitmaps are used for each cluster node: 0 4k 8k 12k ------------------------------------------------------------------- \| idle \| md super \| bm super [0] + bits \| \| bm bits[0, contd] \| bm super[1] + bits \| bm bits[1, contd] \| \| bm super[2] + bits \| bm bits [2, contd] \| bm super[3] + bits \| \| bm bits [3, contd] \| \| \| So in node 1, pg_index in __write_sb_page() could equal to bitmap->storage.file_pages. Then bitmap_limit will be calculated to 0. md_super_write() will be called with 0 size. That means the first 4k sb area of node 1 will never be updated through filemap_write_page(). This bug causes hang of mdadm/clustermd_tests/01r1_Grow_resize. Here use (pg_index % bitmap->storage.file_pages) to make calculation of bitmap_limit correct. Fixes: ab99a87542f1 ("md/md-bitmap: fix writing non bitmap pages") Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Link: https://lore.kernel.org/linux-raid/20250303033918.32136-1-glass.su@suse.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05	md/raid1,raid10: don't ignore IO flags	Yu Kuai
	If blk-wbt is enabled by default, it's found that raid write performance is quite bad because all IO are throttled by wbt of underlying disks, due to flag REQ_IDLE is ignored. And turns out this behaviour exist since blk-wbt is introduced. Other than REQ_IDLE, other flags should not be ignored as well, for example REQ_META can be set for filesystems, clearing it can cause priority reverse problems; And REQ_NOWAIT should not be cleared as well, because io will wait instead of failing directly in underlying disks. Fix those problems by keep IO flags from master bio. Fises: f51d46d0e7cb ("md: add support for REQ_NOWAIT") Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism") Fixes: 5404bc7a87b9 ("[PATCH] Allow file systems to differentiate between data and meta reads") Link: https://lore.kernel.org/linux-raid/20250227121657.832356-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05	md/raid5: merge reshape_progress checking inside get_reshape_loc()	Yu Kuai
	During code review, it's found that other than raid5_bitmap_sector(), reshape_progress is always checked before get_reshape_loc(), while raid5_bitmap_sector() should check as well to prevent holding the lock 'conf->device_lock'. Hence merge that checking inside get_reshape_loc(). Link: https://lore.kernel.org/linux-raid/20250227120452.808503-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05	md: fix mddev uaf while iterating all_mddevs list	Yu Kuai
	While iterating all_mddevs list from md_notify_reboot() and md_exit(), list_for_each_entry_safe is used, and this can race with deletint the next mddev, causing UAF: t1: spin_lock //list_for_each_entry_safe(mddev, n, ...) mddev_get(mddev1) // assume mddev2 is the next entry spin_unlock t2: //remove mddev2 ... mddev_free spin_lock list_del spin_unlock kfree(mddev2) mddev_put(mddev1) spin_lock //continue dereference mddev2->all_mddevs The old helper for_each_mddev() actually grab the reference of mddev2 while holding the lock, to prevent from being freed. This problem can be fixed the same way, however, the code will be complex. Hence switch to use list_for_each_entry, in this case mddev_put() can free the mddev1 and it's not safe as well. Refer to md_seq_show(), also factor out a helper mddev_put_locked() to fix this problem. Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com Fixes: f26514342255 ("md: stop using for_each_mddev in md_notify_reboot") Fixes: 16648bac862f ("md: stop using for_each_mddev in md_exit") Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org> Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-03-05	md: switch md-cluster to use md_submodle_head	Yu Kuai
	To make code cleaner, and prepare to add kconfig for bitmap. Also remove the unsed global variables pers_lock, md_cluster_ops and md_cluster_mod, and exported symbols register_md_cluster_operations(), unregister_md_cluster_operations() and md_cluster_ops. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05	md: don't export md_cluster_ops	Yu Kuai
	Add a new field 'cluster_ops' and initialize it md_setup_cluster(), so that the gloable variable 'md_cluter_ops' doesn't need to be exported. Also prepare to switch md-cluster to use md_submod_head. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-7-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05	md/md-cluster: cleanup md_cluster_ops reference	Yu Kuai
	md_cluster_ops->slot_number() is implemented inside md-cluster.c, just call it directly. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-6-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05	md: switch personalities to use md_submodule_head	Yu Kuai
	Remove the global list 'pers_list', and switch to use md_submodule_head, which is managed by xarry. Prepare to unify registration and unregistration for all sub modules. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05	md: introduce struct md_submodule_head and APIs	Yu Kuai
	Prepare to unify registration and unregistration of md personalities and md-cluster, also prepare for add kconfig for md-bitmap. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05	md: only include md-cluster.h if necessary	Yu Kuai
	md-cluster is only supportted by raid1 and raid10, there is no need to include md-cluster.h for other personalities. Also move APIs that is only used in md-cluster.c from md.h to md-cluster.h. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-3-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05	md: merge common code into find_pers()	Yu Kuai
	- pers_lock() are held and released from caller - try_module_get() is called from caller - error message from caller Merge above code into find_pers(), and rename it to get_pers(), also add a wrapper to module_put() as put_pers(). Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-2-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-04	ublk: enforce ublks_max only for unprivileged devices	Uday Shankar
	Commit 403ebc877832 ("ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev"), claimed ublks_max was added to prevent a DoS situation with an untrusted user creating too many ublk devices. If that's the case, ublks_max should only restrict the number of unprivileged ublk devices in the system. Enforce the limit only for unprivileged ublk devices, and rename variables accordingly. Leave the external-facing parameter name unchanged, since changing it may break systems which use it (but still update its documentation to reflect its new meaning). As a result of this change, in a system where there are only normal (non-unprivileged) devices, the maximum number of such devices is increased to 1 << MINORBITS, or 1048576. That ought to be enough for anyone, right? Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228-ublks_max-v1-1-04b7379190c0@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04	loop: Remove struct loop_func_table	Zhu Yanjun
	The struct is introduced in the commit 754d96798fab ("loop: remove loop.h"), but it is not used now. So remove it. No functional changes. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250227163343.55952-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	ublk: add DMA alignment limit	Ming Lei
	The in-tree ublk driver doesn't need DMA alignment limit because there is one data copy between request pages and the userspace buffer. However, ublk is going to support zero copy, then DMA alignment limit is required, because same IO buffer is forwarded to backend which may have specific buffer DMA alignment limit, so the limit has to be exposed from the frontend driver to client application. Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250227103707.2640014-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	block: split struct bio_integrity_payload	Christoph Hellwig
	Many of the fields in struct bio_integrity_payload are only needed for the default integrity buffer in the block layer, and the variable sized array at the end of the structure makes it very hard to embed into caller allocated structures. Reduce struct bio_integrity_payload to the minimal structure needed in common code and create two separate containing structures for the automatically generated payload and the caller allocated payload. The latter is a simple wrapper for struct bio_integrity_payload and the bvecs, while the former contains the additional fields moved out of struct bio_integrity_payload. Always use a dedicated mempool for automatic integrity metadata instead of depending on bio_set that is submitter controlled and thus often doesn't have the mempool initialized and stop using mempools for the submitter buffers as they aren't in the NOIO I/O submission path where we need to guarantee forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	block: move the block layer auto-integrity code into a new file	Christoph Hellwig
	The code that automatically creates a integrity payload and generates and verifies the checksums for bios that don't have submitter-provided integrity payload currently sits right in the middle of the block integrity metadata infrastructure. Split it into a separate file to make the different layers clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250225154449.422989-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	block: mark bounce buffering as incompatible with integrity	Christoph Hellwig
	None of the few drivers still using the legacy block layer bounce buffering support integrity metadata. Explicitly mark the features as incompatible and stop creating the slab and mempool for integrity buffers for the bounce bio_set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250225154449.422989-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	null_blk: do partial IO for bad blocks	Shin'ichiro Kawasaki
	The current null_blk implementation checks if any bad blocks exist in the target blocks of each IO. If so, the IO fails and data is not transferred for all of the IO target blocks. However, when real storage devices have bad blocks, the devices may transfer data partially up to the first bad blocks (e.g., SAS drives). Especially, when the IO is a write operation, such partial IO leaves partially written data on the device. To simulate such partial IO using null_blk, introduce the new parameter 'badblocks_partial_io'. When this parameter is set, null_handle_badblocks() returns the number of the sectors for the partial IO as its third pointer argument. Pass the returned number of sectors to the following calls to null_handle_memory_backend() in null_process_cmd() and null_zone_write(). Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-6-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	null_blk: pass transfer size to null_handle_rq()	Shin'ichiro Kawasaki
	As preparation to support partial data transfer, add a new argument to null_handle_rq() to pass the number of sectors to transfer. While at it, rename the function from null_handle_rq to null_handle_data_transfer. This commit does not change the behavior. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-5-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	null_blk: replace null_process_cmd() call in null_zone_write()	Shin'ichiro Kawasaki
	As a preparation to support partial data transfer due to badblocks, replace the null_process_cmd() call in null_zone_write() with equivalent calls to null_handle_badblocks() and null_handle_memory_backed(). This commit does not change behavior. It will enable null_handle_badblocks() to return the size of partial data transfer in the following commit, allowing null_zone_write() to move write pointers appropriately. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-4-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	null_blk: introduce badblocks_once parameter	Shin'ichiro Kawasaki
	When IO errors happen on real storage devices, the IOs repeated to the same target range can success by virtue of recovery features by devices, such as reserved block assignment. To simulate such IO errors and recoveries, introduce the new parameter badblocks_once parameter. When this parameter is set to 1, the specified badblocks are cleared after the first IO error, so that the next IO to the blocks succeed. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-3-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	null_blk: generate null_blk configfs features string	Shin'ichiro Kawasaki
	The null_blk configfs file 'features' provides a string that lists available null_blk features for userspace programs to reference. The string is defined as a long constant in the code, which tends to be forgotten for updates. It also causes checkpatch.pl to report "WARNING: quoted string split across lines". To avoid these drawbacks, generate the feature string on the fly. Refer to the ca_name field of each element in the nullb_device_attrs table and concatenate them in the given buffer. Also, sorted nullb_device_attrs table elements in alphabetical order. Of note is that the feature "index" was missing before this commit. This commit adds it to the generated string. Suggested-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-2-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03	ublk: complete command synchronously on error	Caleb Sander Mateos
	In case of an error, ublk's ->uring_cmd() functions currently return -EIOCBQUEUED and immediately call io_uring_cmd_done(). -EIOCBQUEUED and io_uring_cmd_done() are intended for asynchronous completions. For synchronous completions, the ->uring_cmd() function can just return the negative return code directly. This skips io_uring_cmd_del_cancelable(), and deferring the completion to task work. So return the error code directly from __ublk_ch_uring_cmd() and ublk_ctrl_uring_cmd(). Update ublk_ch_uring_cmd_cb(), which currently ignores the return value from __ublk_ch_uring_cmd(), to call io_uring_cmd_done() for synchronous completions. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250225212456.2902549-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-25	blk-wbt: Cleanup a comment in wb_timer_fn	Tang Yizhou
	The original comment contains a grammatical error. Rewrite it into a more easily understandable sentence. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://lore.kernel.org/r/20250213100611.209997-3-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-25	blk-wbt: Fix some comments	Tang Yizhou
	wbt_wait() no longer uses a spinlock as a parameter. Update the function comments accordingly. RWB_UNKNOWN_BUMP is used when we gradually adjust scale_steps toward the center state, which is a value of 0. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250213100611.209997-2-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24	loop: take the file system minimum dio alignment into account	Christoph Hellwig
	The loop driver currently uses the logical block size of the underlying bdev as the lower bound of the loop device block size. While this works for many cases, it fails for file systems made up of multiple devices with different logical block sizes (e.g. XFS with a RT device that has a larger logical block size), or when the file systems doesn't support direct I/O writes at the sector size granularity (e.g. because it does out of place writes with a file system block size larger than the sector size). Fix this by querying the minimum direct I/O alignment from statx when available. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24	loop: check in LO_FLAGS_DIRECT_IO in loop_default_blocksize	Christoph Hellwig
	We can't go below the minimum direct I/O size no matter if direct I/O is enabled by passing in an O_DIRECT file descriptor or due to the explicit flag. Now that LO_FLAGS_DIRECT_IO is set earlier after assigning a backing file, loop_default_blocksize can check it instead of the O_DIRECT flag to handle both conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24	loop: set LO_FLAGS_DIRECT_IO in loop_assign_backing_file	Christoph Hellwig
	Assigning LO_FLAGS_DIRECT_IO from the O_DIRECT flag is related to assigning a new backing file. Move the assignment in preparation of using the flag more and earlier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24	loop: factor out a loop_assign_backing_file helper	Christoph Hellwig
	Split the code for setting up a backing file into a helper in preparation of adding more code to this path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-21	block: Remove commented out code	Thorsten Blum
	Remove commented out code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250219205328.28462-2-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18	Revert "driver: block: release the lo_work_lock before queue_work"	Zhaoyang Huang
	This reverts commit ad934fc1784802fd1408224474b25ee5289fadfc. loop_queue_work should be strictly serialized to loop_process_work since the lo_worker could be freed without noticing new work has been queued again. Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Link: https://lore.kernel.org/r/20250218065835.19503-1-zhaoyang.huang@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-15	md/raid1: fix memory leak in raid1_run() if no active rdev	Zheng Qixing
	When `raid1_set_limits()` fails or when the array has no active `rdev`, the allocated memory for `conf` is not properly freed. Add raid1_free() call to properly free the conf in error path. Fixes: 799af947ed13 ("md/raid1: don't free conf on raid0_run failure") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250215020137.3703757-1-zhengqixing@huaweicloud.com Singed-off-by: Yu Kuai <yukuai3@huawei.com>