Age | Commit message (Collapse) | Author |
|
submit_bio() always uses bio->bi_bdev->bd_has_submit_bio to decide if
disk's ->submit_bio() is called, and bio->bi_bdev could point to one
partition device.
So we have to sync part bdev's ->bd_has_submit_bio with disk's.
Reported-by: Changhui Zhong <czhong@redhat.com>
Link: https://lore.kernel.org/linux-block/ZEdItaPqif8fp85H@ovpn-8-24.pek2.redhat.com/T/#t
Fixes: 9f4107b07b17 ("block: store bdev->bd_disk->fops->submit_bio state in bdev")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230425034154.110099-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The code for setting a block device capacity (bd_nr_sectors field of
struct block_device) is duplicated in set_capacity() and
bdev_set_nr_sectors(). Clean this up by making bdev_set_nr_sectors()
a block layer internal function defined in block/bdev.c instead of
having this function statically defined in block/partitions/core.c.
With this change, set_capacity() implementation can be simplified to
only calling bdev_set_nr_sectors().
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230424131318.79935-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We tested and found an alarm caused by nbd_ioctl arg without verification.
The UBSAN warning calltrace like below:
UBSAN: Undefined behaviour in fs/buffer.c:1709:35
signed integer overflow:
-9223372036854775808 - 1 cannot be represented in type 'long long int'
CPU: 3 PID: 2523 Comm: syz-executor.0 Not tainted 4.19.90 #1
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x170/0x1dc lib/dump_stack.c:118
ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
handle_overflow+0x188/0x1dc lib/ubsan.c:192
__ubsan_handle_sub_overflow+0x34/0x44 lib/ubsan.c:206
__block_write_full_page+0x94c/0xa20 fs/buffer.c:1709
block_write_full_page+0x1f0/0x280 fs/buffer.c:2934
blkdev_writepage+0x34/0x40 fs/block_dev.c:607
__writepage+0x68/0xe8 mm/page-writeback.c:2305
write_cache_pages+0x44c/0xc70 mm/page-writeback.c:2240
generic_writepages+0xdc/0x148 mm/page-writeback.c:2329
blkdev_writepages+0x2c/0x38 fs/block_dev.c:2114
do_writepages+0xd4/0x250 mm/page-writeback.c:2344
The reason for triggering this warning is __block_write_full_page()
-> i_size_read(inode) - 1 overflow.
inode->i_size is assigned in __nbd_ioctl() -> nbd_set_size() -> bytesize.
We think it is necessary to limit the size of arg to prevent errors.
Moreover, __nbd_ioctl() -> nbd_add_socket(), arg will be cast to int.
Assuming the value of arg is 0x80000000000000001) (on a 64-bit machine),
it will become 1 after the coercion, which will return unexpected results.
Fix it by adding checks to prevent passing in too large numbers.
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230206145805.2645671-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 2d786e66c966 ("block: ublk: switch to ioctl command encoding")
starts to reset local variable of 'ret' as zero, then if any failure
happens when handling the three IO commands, 0 can be returned to ublk
server.
Fix it by returning -EINVAL in case of command handling failure.
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 2d786e66c966 ("block: ublk: switch to ioctl command encoding")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230420091104.1092972-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Locking range start and locking range length
attributes may be require to satisfy restrictions
exposed by OPAL2 geometry feature reporting.
Geometry reporting feature is described in TCG OPAL SSC,
section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity
and LowestAlignedLBA).
4.3.5.2.1.1 RangeStart Behavior:
[ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ]
When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:
a) the AlignmentRequired (ALIGN above) column in the LockingInfo
table is TRUE;
b) RangeStart is non-zero; and
c) StartAlignment is non-zero, then the method SHALL fail and
return an error status code INVALID_PARAMETER.
4.3.5.2.1.2 RangeLength Behavior:
If RangeStart is zero, then
[ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ]
If RangeStart is non-zero, then
[ LengthAlignment = (RangeLength modulo AlignmentGranularity) ]
When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:
a) the AlignmentRequired (ALIGN above) column in the LockingInfo
table is TRUE;
b) RangeLength is non-zero; and
c) LengthAlignment is non-zero, then the method SHALL fail and
return an error status code INVALID_PARAMETER
In userspace we stuck to logical block size reported by general
block device (via sysfs or ioctl), but we can not read
'AlignmentGranularity' or 'LowestAlignedLBA' anywhere else and
we need to get those values from sed-opal interface otherwise
we will not be able to report or avoid locking range setup
INVALID_PARAMETER errors above.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Link: https://lore.kernel.org/r/20230411090931.9193-2-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make sure to check device queue mode in the null_validate_conf() and
return error for NULL_Q_RQ as we don't allow legacy I/O path, without
this patch we get OOPs when queue mode is set to 1 from configfs,
following are repro steps :-
modprobe null_blk nr_devices=0
mkdir config/nullb/nullb0
echo 1 > config/nullb/nullb0/memory_backed
echo 4096 > config/nullb/nullb0/blocksize
echo 20480 > config/nullb/nullb0/size
echo 1 > config/nullb/nullb0/queue_mode
echo 1 > config/nullb/nullb0/power
Entering kdb (current=0xffff88810acdd080, pid 2372) on processor 42 Oops: (null)
due to oops @ 0xffffffffc041c329
CPU: 42 PID: 2372 Comm: sh Tainted: G O N 6.3.0-rc5lblk+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:null_add_dev.part.0+0xd9/0x720 [null_blk]
Code: 01 00 00 85 d2 0f 85 a1 03 00 00 48 83 bb 08 01 00 00 00 0f 85 f7 03 00 00 80 bb 62 01 00 00 00 48 8b 75 20 0f 85 6d 02 00 00 <48> 89 6e 60 48 8b 75 20 bf 06 00 00 00 e8 f5 37 2c c1 48 8b 75 20
RSP: 0018:ffffc900052cbde0 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff88811084d800 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888100042e00
RBP: ffff8881053d8200 R08: ffffc900052cbd68 R09: ffff888105db2000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000002
R13: ffff888104765200 R14: ffff88810eec1748 R15: ffff88810eec1740
FS: 00007fd445fd1740(0000) GS:ffff8897dfc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000060 CR3: 0000000166a00000 CR4: 0000000000350ee0
DR0: ffffffff8437a488 DR1: ffffffff8437a489 DR2: ffffffff8437a48a
DR3: ffffffff8437a48b DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
nullb_device_power_store+0xd1/0x120 [null_blk]
configfs_write_iter+0xb4/0x120
vfs_write+0x2ba/0x3c0
ksys_write+0x5f/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fd4460c57a7
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffd3792a4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd4460c57a7
RDX: 0000000000000002 RSI: 000055b43c02e4c0 RDI: 0000000000000001
RBP: 000055b43c02e4c0 R08: 000000000000000a R09: 00007fd44615b4e0
R10: 00007fd44615b3e0 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fd446198520 R14: 0000000000000002 R15: 00007fd446198700
</TASK>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Link: https://lore.kernel.org/r/20230416220339.43845-1-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All ublk commands(control, IO) should have taken ioctl command encoding
from the beginning, because ioctl command encoding defines each code
uniquely, so driver can figure out wrong command sent from userspace
easily; 2) it might help security subsystem for audit uring cmd[1].
Unfortunately we didn't do that way, and it could be one lesson for
ublk driver.
So switch to ioctl command encoding now, we still support commands encoded
in old way, but they become legacy definition. Any new command should take
ioctl encoding.
See ublksrv code for switching to ioctl command encoding in [2].
[1] https://lore.kernel.org/io-uring/CAHC9VhSVzujW9LOj5Km80AjU0EfAuukoLrxO6BEfnXeK_s6bAg@mail.gmail.com/
[2] https://github.com/ming1/ubdsrv/commits/ioctl_cmd_encoding
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ken Kurematsu <k.kurematsu@nskint.co.jp>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230418131810.855959-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit b12e5c6c755a accidentally changes blk_kick_flush to do a head
insert into the requeue list, fix this up.
Fixes: b12e5c6c755a ("blk-mq: pass a flags argument to blk_mq_add_to_requeue_list")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230416073553.966161-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When the weighted sum is zero the calculation of limit causes
a division by zero error. Fix this by continuing to the next level.
This was discovered by running as root:
stress-ng --ioprio 0
Fixes divison by error oops:
[ 521.450556] divide error: 0000 [#1] SMP NOPTI
[ 521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1
[ 521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[ 521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400
[ 521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44
[ 521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046
[ 521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000
[ 521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978
[ 521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0
[ 521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18
[ 521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970
[ 521.454549] FS: 00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000
[ 521.454912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0
[ 521.455491] PKRU: 55555554
[ 521.455619] Call Trace:
[ 521.455736] <TASK>
[ 521.455837] ? bfq_request_merge+0x3a/0xc0
[ 521.456027] ? elv_merge+0x115/0x140
[ 521.456191] bfq_limit_depth+0xc8/0x240
[ 521.456366] __blk_mq_alloc_requests+0x21a/0x2c0
[ 521.456577] blk_mq_submit_bio+0x23c/0x6c0
[ 521.456766] __submit_bio+0xb8/0x140
[ 521.457236] submit_bio_noacct_nocheck+0x212/0x300
[ 521.457748] submit_bio_noacct+0x1a6/0x580
[ 521.458220] submit_bio+0x43/0x80
[ 521.458660] ext4_io_submit+0x23/0x80
[ 521.459116] ext4_do_writepages+0x40a/0xd00
[ 521.459596] ext4_writepages+0x65/0x100
[ 521.460050] do_writepages+0xb7/0x1c0
[ 521.460492] __filemap_fdatawrite_range+0xa6/0x100
[ 521.460979] file_write_and_wait_range+0xbf/0x140
[ 521.461452] ext4_sync_file+0x105/0x340
[ 521.461882] __x64_sys_fsync+0x67/0x100
[ 521.462305] ? syscall_exit_to_user_mode+0x2c/0x1c0
[ 521.462768] do_syscall_64+0x3b/0xc0
[ 521.463165] entry_SYSCALL_64_after_hwframe+0x5a/0xc4
[ 521.463621] RIP: 0033:0x5640b6c56590
[ 521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This fixes a build error when CONFIG_FAULT_INJECTION_CONFIGFS=y and
CONFIG_CONFIGFS_FS=m.
Since the fault-injection library cannot built as a module, avoid building
configfs as a module.
Fixes: 4668c7a2940d ("fault-inject: allow configuration via configfs")
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202304150025.K0hczLR4-lkp@intel.com/
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We have a long chain of memory dereferencing just to whether or not
this disk has a special submit_bio helper. As that's not necessarily
the common case, add a bd_has_submit_bio state in the bdev to avoid
traversing this memory dependency chain if we don't need to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This moves struct device out-of-line as it's just used at open/close
time, so we can keep some of the commonly used fields closer together.
On a standard setup, it also reduces the size from 864 bytes to 848
bytes. Yes, struct device is a pig...
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.4/block
Pull MD updates from Song:
"- md/bitmap: Optimal last page size, by Jon Derrick
- Various raid10 fixes, by Yu Kuai and Li Nan
- md: add error_handlers for raid0 and linear, by Mariusz Tkaczyk"
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid5: remove unused working_disks variable
md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
md/raid10: fix memleak of md thread
md/raid10: fix memleak for 'conf->bio_split'
md/raid10: fix leak of 'r10bio->remaining' for recovery
md/raid10: don't BUG_ON() in raise_barrier()
md: fix soft lockup in status_resync
md: add error_handlers for raid0 and linear
md: Use optimal I/O size for last bitmap page
md: Fix types in sb writer
md: Move sb writer loop to its own function
md/raid10: Fix typo in comment (replacment -> replacement)
md: make kobj_type structures constant
md/raid10: fix null-ptr-deref in raid10_sync_request
md/raid10: fix task hung in raid10d
|
|
for-6.4/block
Pull NVMe updates from Christoph:
"nvme updates for Linux 6.4
- drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas)
- validate nvmet module parameters (Chaitanya Kulkarni)
- fence TCP socket on receive error (Chris Leech)
- fix async event trace event (Keith Busch)
- minor cleanups (Chaitanya Kulkarni, zhenwei pi)
- fix and cleanup nvmet Identify handling (Damien Le Moal,
Christoph Hellwig)
- fix double blk_mq_complete_request race in the timeout handler
(Lei Yin)
- fix irq locking in nvme-fcloop (Ming Lei)
- remove queue mapping helper for rdma devices (Sagi Grimberg)"
* tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme:
nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage"
blk-mq-rdma: remove queue mapping helper for rdma devices
nvme-rdma: minor cleanup in nvme_rdma_create_cq()
nvme: fix double blk_mq_complete_request for timeout request with low probability
nvme: fix async event trace event
nvme-apple: return directly instead of else
nvme-apple: return directly instead of else
nvmet-tcp: validate idle poll modparam value
nvmet-tcp: validate so_priority modparam value
nvme-tcp: fence TCP socket on receive error
nvmet: remove nvmet_req_cns_error_complete
nvmet: rename nvmet_execute_identify_cns_cs_ns
nvmet: fix Identify Identification Descriptor List handling
nvmet: cleanup nvmet_execute_identify()
nvmet: fix I/O Command Set specific Identify Controller
nvmet: fix Identify Active Namespace ID list handling
nvmet: fix Identify Controller handling
nvmet: fix Identify Namespace handling
nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns()
nvme-pci: drop redundant pci_enable_pcie_error_reporting()
|
|
clang with W=1 reports
drivers/md/raid5.c:7719:6: error: variable 'working_disks'
set but not used [-Werror,-Wunused-but-set-variable]
int working_disks = 0;
^
This variable is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230327132324.1769595-1-trix@redhat.com
|
|
error
handle_read_error() will resumit r10_bio by raid10_read_request(), which
will call bio_start_io_acct() again, while bio_end_io_acct() will only
be called once.
Fix the problem by don't account io again from handle_read_error().
Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting")
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230314012258.2395894-1-yukuai1@huaweicloud.com
|
|
In raid10_run(), if setup_conf() succeed and raid10_run() failed before
setting 'mddev->thread', then in the error path 'conf->thread' is not
freed.
Fix the problem by setting 'mddev->thread' right after setup_conf().
Fixes: 43a521238aca ("md-cluster: choose correct label when clustered layout is not supported")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-7-yukuai1@huaweicloud.com
|
|
In the error path of raid10_run(), 'conf' need be freed, however,
'conf->bio_split' is missed and memory will be leaked.
Since there are 3 places to free 'conf', factor out a helper to fix the
problem.
Fixes: fc9977dd069e ("md/raid10: simplify the splitting of requests.")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-6-yukuai1@huaweicloud.com
|
|
raid10_sync_request() will add 'r10bio->remaining' for both rdev and
replacement rdev. However, if the read io fails, recovery_request_write()
returns without issuing the write io, in this case, end_sync_request()
is only called once and 'remaining' is leaked, cause an io hang.
Fix the problem by decreasing 'remaining' according to if 'bio' and
'repl_bio' is valid.
Fixes: 24afd80d99f8 ("md/raid10: handle recovery of replacement devices.")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-5-yukuai1@huaweicloud.com
|
|
If raise_barrier() is called the first time in raid10_sync_request(), which
means the first non-normal io is handled, raise_barrier() should wait for
all dispatched normal io to be done. This ensures that normal io won't
starve.
However, BUG_ON() if this is broken is too aggressive. This patch replace
BUG_ON() with WARN and fall back to not force.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-4-yukuai1@huaweicloud.com
|
|
status_resync() will calculate 'curr_resync - recovery_active' to show
user a progress bar like following:
[============>........] resync = 61.4%
'curr_resync' and 'recovery_active' is updated in md_do_sync(), and
status_resync() can read them concurrently, hence it's possible that
'curr_resync - recovery_active' can overflow to a huge number. In this
case status_resync() will be stuck in the loop to print a large amount
of '=', which will end up soft lockup.
Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case,
this way resync in progress will be reported to user.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com
|
|
After the commit 9631abdbf406c("md: Set MD_BROKEN for RAID1 and RAID10")
MD_BROKEN must be set if array is failed because state_store() checks it.
If it is set then -EBUSY is returned to userspace.
For raid0 and linear MD_BROKEN is not set by error_handler(). As a result
mdadm is unable to trigger clean-up actions. It is a regression.
This patch adds appropriate error_handler for raid0 and linear. The
error handler sets MD_BROKEN for this device.
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com
|
|
If the bitmap space has enough room, size the I/O for the last bitmap
page write to the optimal I/O size for the storage device. The expanded
write is checked that it won't overrun the data or metadata.
The drive this was tested against has higher latencies when there are
sub-4k writes due to device-side read-mod-writes of its atomic 4k write
unit. This change helps increase performance by sizing the last bitmap
page I/O for the device's preferred write unit, if it is given.
Example Intel/Solidigm P5520
Raid10, Chunk-size 64M, bitmap-size 57228 bits
$ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme{0,1,2,3}n1
--assume-clean --bitmap=internal --bitmap-chunk=64M
$ fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --bs=4k --runtime=60
Without patch:
write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets
With patch:
write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets
Biosnoop:
Without patch:
Time Process PID Device LBA Size Lat
1.410377 md0_raid10 6900 nvme0n1 W 16 4096 0.02
1.410387 md0_raid10 6900 nvme2n1 W 16 4096 0.02
1.410374 md0_raid10 6900 nvme3n1 W 16 4096 0.01
1.410381 md0_raid10 6900 nvme1n1 W 16 4096 0.02
1.410411 md0_raid10 6900 nvme1n1 W 115346512 4096 0.01
1.410418 md0_raid10 6900 nvme0n1 W 115346512 4096 0.02
1.410915 md0_raid10 6900 nvme2n1 W 24 3584 0.43 <--
1.410935 md0_raid10 6900 nvme3n1 W 24 3584 0.45 <--
1.411124 md0_raid10 6900 nvme1n1 W 24 3584 0.64 <--
1.411147 md0_raid10 6900 nvme0n1 W 24 3584 0.66 <--
1.411176 md0_raid10 6900 nvme3n1 W 2019022184 4096 0.01
1.411189 md0_raid10 6900 nvme2n1 W 2019022184 4096 0.02
With patch:
Time Process PID Device LBA Size Lat
5.747193 md0_raid10 727 nvme0n1 W 16 4096 0.01
5.747192 md0_raid10 727 nvme1n1 W 16 4096 0.02
5.747195 md0_raid10 727 nvme3n1 W 16 4096 0.01
5.747202 md0_raid10 727 nvme2n1 W 16 4096 0.02
5.747229 md0_raid10 727 nvme3n1 W 1196223704 4096 0.02
5.747224 md0_raid10 727 nvme0n1 W 1196223704 4096 0.01
5.747279 md0_raid10 727 nvme0n1 W 24 4096 0.01 <--
5.747279 md0_raid10 727 nvme1n1 W 24 4096 0.02 <--
5.747284 md0_raid10 727 nvme3n1 W 24 4096 0.02 <--
5.747291 md0_raid10 727 nvme2n1 W 24 4096 0.02 <--
5.747314 md0_raid10 727 nvme2n1 W 2234636712 4096 0.01
5.747317 md0_raid10 727 nvme1n1 W 2234636712 4096 0.02
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-4-jonathan.derrick@linux.dev
|
|
Page->index is a pgoff_t and multiplying could cause overflows on a
32-bit architecture. In the sb writer, this is used to calculate and
verify the sector being used, and is multiplied by a sector value. Using
sector_t will cast it to a u64 type and is the more appropriate type for
the unit. Additionally, the integer size unit is converted to a sector
unit in later calculations, and is now corrected to be an unsigned type.
Finally, clean up the calculations using variable aliases to improve
readabiliy.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-3-jonathan.derrick@linux.dev
|
|
Preparatory patch for optimal I/O size calculation. Move the sb writer
loop routine into its own function for clarity.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230224183323.638-2-jonathan.derrick@linux.dev
|
|
Replace replacment with replacement.
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230214064013.2373851-1-yijiangshan@kylinos.cn
|
|
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definitions to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230214-kobj_type-md-v1-1-d6853f707f11@weissschuh.net
|
|
init_resync() inits mempool and sets conf->have_replacemnt at the beginning
of sync, close_sync() frees the mempool when sync is completed.
After [1] recovery might be skipped and init_resync() is called but
close_sync() is not. null-ptr-deref occurs with r10bio->dev[i].repl_bio.
The following is one way to reproduce the issue.
1) create a array, wait for resync to complete, mddev->recovery_cp is set
to MaxSector.
2) recovery is woken and it is skipped. conf->have_replacement is set to
0 in init_resync(). close_sync() not called.
3) some io errors and rdev A is set to WantReplacement.
4) a new device is added and set to A's replacement.
5) recovery is woken, A have replacement, but conf->have_replacemnt is
0. r10bio->dev[i].repl_bio will not be alloced and null-ptr-deref
occurs.
Fix it by not calling init_resync() if recovery skipped.
[1] commit 7e83ccbecd60 ("md/raid10: Allow skipping recovery when clean arrays are assembled")
Fixes: 7e83ccbecd60 ("md/raid10: Allow skipping recovery when clean arrays are assembled")
Cc: stable@vger.kernel.org
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230222041000.3341651-3-linan666@huaweicloud.com
|
|
commit fe630de009d0 ("md/raid10: avoid deadlock on recovery.") allowed
normal io and sync io to exist at the same time. Task hung will occur as
below:
T1 T2 T3 T4
raid10d
handle_read_error
allow_barrier
conf->nr_pending--
-> 0
//submit sync io
raid10_sync_request
raise_barrier
->will not be blocked
...
//submit to drivers
raid10_read_request
wait_barrier
conf->nr_pending++
-> 1
//retry read fail
raid10_end_read_request
reschedule_retry
add to retry_list
conf->nr_queued++
-> 1
//sync io fail
end_sync_read
__end_sync_read
reschedule_retry
add to retry_list
conf->nr_queued++
-> 2
...
handle_read_error
get form retry_list
conf->nr_queued--
freeze_array
wait nr_pending == nr_queued+1
->1 ->2
//task hung
retry read and sync io will be added to retry_list(nr_queued->2) if they
fails. raid10d() called handle_read_error() and hung in freeze_array().
nr_queued will not decrease because raid10d is blocked, nr_pending will
not increase because conf->barrier is not released.
Fix it by moving allow_barrier() after raid10_read_request().
raise_barrier() will wait for nr_waiting to become 0. Therefore, sync io
and regular io will not be issued at the same time.
Also remove the check of nr_queued in stop_waiting_barrier. It can be 0
but don't need to be blocking. Remove the check for MD_RECOVERY_RUNNING as
the check is redundent.
Fixes: fe630de009d0 ("md/raid10: avoid deadlock on recovery.")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230222041000.3341651-2-linan666@huaweicloud.com
|
|
The null_blk driver has multiple driver-specific fault injection
mechanisms. Each fault injection configuration can only be specified by a
module parameter and cannot be reconfigured without reloading the driver.
Also, each configuration is common to all devices and is initialized every
time a new device is added.
This change adds the following subdirectories for each null_blk device.
/sys/kernel/config/nullb/<disk>/timeout_inject
/sys/kernel/config/nullb/<disk>/requeue_inject
/sys/kernel/config/nullb/<disk>/init_hctx_fault_inject
Each fault injection attribute can be dynamically set per device by a
corresponding file in these directories.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://lore.kernel.org/r/20230327143733.14599-3-akinobu.mita@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This provides a helper function to allow configuration of fault-injection
for configfs-based drivers.
The config items created by this function have the same interface as the
one created under debugfs by fault_create_debugfs_attr().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://lore.kernel.org/r/20230327143733.14599-2-akinobu.mita@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__blk_mq_run_hw_queue just contains a WARN_ON_ONCE for calls from
interrupt context and a blk_mq_run_dispatch_ops-protected call to
blk_mq_sched_dispatch_requests. Open code the call to
blk_mq_sched_dispatch_requests in both callers, and move the WARN_ON_ONCE
to blk_mq_run_hw_queue where it can be extended to all !async calls,
while the other call is from workqueue context and thus obviously does
not need the assert.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Only blk_mq_run_hw_queue can call __blk_mq_delay_run_hw_queue with
async=false, so move the handling there.
With this __blk_mq_delay_run_hw_queue can be merged into
blk_mq_delay_run_hw_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For the in-context dispatch, blk_mq_hctx_stopped is alredy checked in
blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection.
For the async dispatch case having a check before scheduling the work
still makes sense to avoid needless workqueue scheduling, so just keep it
for that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests
under blk_mq_run_dispatch_ops() protection, so remove the duplicate check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__blk_mq_sched_dispatch_requests currently has duplicated logic
for the cases where requests are on the hctx dispatch list or not.
Merge the two with a new need_dispatch variable and remove a few
pointless local variables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the boolean at_head argument with the same flags that are already
passed to blk_mq_insert_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of passing a bool at_head, pass down the full flags from the
blk_mq_insert_request interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the boolean at_head argument with the same flags that are already
passed to blk_mq_insert_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the at_head bool with a flags argument that so far only contains
a single BLK_MQ_INSERT_AT_HEAD value. This makes it much easier to grep
for head insertions into the blk-mq dispatch queues.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_mq_add_to_requeue_list takes a bool parameter to control how to kick
the requeue list at the end of the function. Move the call to
blk_mq_kick_requeue_list to the callers that want it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_mq_request_bypass_insert takes a bool parameter to control how to run
the queue at the end of the function. Move the blk_mq_run_hw_queue call
to the callers that want it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_mq_insert_request takes two bool parameters to control how to run
the queue at the end of the function. Move the blk_mq_run_hw_queue call
to the callers that want it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Due to the wildly different behavior based on the bypass_insert argument,
not a whole lot of code in __blk_mq_try_issue_directly is actually shared
between blk_mq_try_issue_directly and blk_mq_request_issue_directly.
Remove __blk_mq_try_issue_directly and fold the code into the two callers
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Factor out a helper from __blk_mq_try_issue_directly in preparation
of folding that function into its two callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Split the RQF_DONTPREP and RQF_SOFTBARRIER in separate branches to make
the code more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While both passthrough and flush requests call directly into
blk_mq_request_bypass_insert, the parameters aren't the same.
Split the handling into two separate conditionals and turn the whole
function into an if/elif/elif/else flow instead of the gotos.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just call blk_mq_add_to_requeue_list directly from the two callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove this very small helper and fold it into the only caller.
Note that this moves the trace_block_rq_insert out of ctx->lock, matching
the other calls to this tracepoint.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is no good point in keeping the __blk_mq_insert_request around
for two function calls and a singler caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|