Age  Commit message  Author
2025-07-12  md/raid5: unset WQ_CPU_INTENSIVE for raid5 unbound workqueue  (Ryo Takakura)
When specified with WQ_CPU_INTENSIVE, the workqueue doesn't participate in concurrency management. This is already the case for WQ_UNBOUND workqueues, given that they are assigned to their own worker threads. Unset WQ_CPU_INTENSIVE, as the flag has no effect when used together with WQ_UNBOUND. Signed-off-by: Ryo Takakura <ryotkkr98@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/linux-raid/20250601013702.64640-1-ryotkkr98@gmail.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
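For context, a minimal sketch of what allocating such an unbound workqueue looks like once the flag is dropped; the workqueue name, the extra flags and the max_active value here are illustrative, not taken from the raid5 code:

  #include <linux/errno.h>
  #include <linux/workqueue.h>

  static struct workqueue_struct *example_wq;

  static int example_setup(void)
  {
          /*
           * WQ_UNBOUND workqueues run on their own worker pools and never take
           * part in per-CPU concurrency management, so adding WQ_CPU_INTENSIVE
           * here would be a no-op.
           */
          example_wq = alloc_workqueue("example_unbound", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
          if (!example_wq)
                  return -ENOMEM;
          return 0;
  }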
2025-07-12  md: remove/add redundancy group only in level change  (Xiao Ni)
del_gendisk is called in a synchronous way now, so there is no need to handle the redundancy group separately in the stop path. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20250611073108.25463-4-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-12  md: Don't clear MD_CLOSING until mddev is freed  (Xiao Ni)
UNTIL_STOP is used to avoid the mddev being freed on the last close before disks are added to it. It should be cleared when stopping an array, as mentioned in commit efeb53c0e572 ("md: Allow md devices to be created by name."). So reset ->hold_active to 0 in md_clean. And MD_CLOSING should be kept until the mddev is freed to avoid a reopen. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20250611073108.25463-3-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-12  md: call del_gendisk in control path  (Xiao Ni)
Now del_gendisk and put_disk are called asynchronously in workqueue work. The asynchronous approach has a problem: the device node can still exist for a short window after the mdadm --stop command returns. A udev rule can then open this device node and create the struct mddev in the kernel again. So call del_gendisk in the control path and leave put_disk in md_kobj_release to avoid a use-after-free of the gendisk. del_gendisk can't be called with reconfig_mutex held; if it is, a deadlock can happen, because del_gendisk waits for all sysfs file accesses to finish while sysfs file accesses wait for the reconfig mutex. So call del_gendisk after releasing the reconfig mutex. But there is still a window in which sysfs can be accessed between mddev_unlock and del_gendisk, so some actions (e.g. adding a disk or changing the level) can happen and lead to unexpected results. MD_DELETED is used to resolve this problem. MD_DELETED is set before releasing the reconfig mutex and it should be checked by those sysfs accesses which need the reconfig mutex. For sysfs accesses which don't need the reconfig mutex, del_gendisk will wait for them to finish. But this check doesn't need to be done in mddev_lock_nointr. There are ten places that call it:
* Five of them are in dm raid, which we don't need to care about; MD_DELETED is only used for md raid.
* stop_sync_thread, md_do_sync and md_start_sync are related to sync requests, and stopping an array needs to wait for the sync thread to finish anyway.
* md_ioctl: md_open is called before md_ioctl, so ->openers is incremented and stopping the array will fail. So MD_DELETED doesn't need to be checked here.
* md_set_readonly: It needs to call mddev_set_closing_and_sync_blockdev when setting readonly or read_auto, so stopping the array will fail too because MD_CLOSING is already set.
Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20250611073108.25463-2-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-11  loop: Avoid updating block size under exclusive owner  (Jan Kara)
Syzbot came up with a reproducer where a loop device's block size is changed underneath a mounted filesystem. This causes a mismatch between the block device block size and the block size stored in the superblock, causing confusion in various places such as fs/buffer.c. The particular issue triggered by syzbot was a warning in __getblk_slow() due to the requested buffer size not matching the block device block size. Fix the problem by getting an exclusive hold of the loop device to change its block size. This fails if somebody (such as a filesystem) already has exclusive ownership of the block device and thus prevents modifying the loop device under an exclusive owner which doesn't expect it. Reported-by: syzbot+01ef7a8da81a975e1ccd@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+01ef7a8da81a975e1ccd@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20250711163202.19623-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-11  nvme-pci: don't allocate dma_vec for IOVA mappings  (Christoph Hellwig)
Not only do IOVA mappings not need the separate dma_vec tracking, the IOVA unmap path also never frees it and thus leaks the allocation. Fixes: b8b7570a7ec8 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping") Reported-by: Klara Modin <klarasmodin@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Klara Modin <klarasmodin@gmail.com> Link: https://lore.kernel.org/r/20250711112250.633269-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-09  nbd: fix lockdep deadlock warning  (Ming Lei)
nbd grabs the device lock nbd->config_lock when updating nr_hw_queues, and this causes the following lock dependency:
-> #2 (&disk->open_mutex){+.+.}-{4:4}:
       lock_acquire kernel/locking/lockdep.c:5871 [inline]
       lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828
       __mutex_lock_common kernel/locking/mutex.c:602 [inline]
       __mutex_lock+0x166/0x1292 kernel/locking/mutex.c:747
       mutex_lock_nested+0x14/0x1c kernel/locking/mutex.c:799
       __del_gendisk+0x132/0xac6 block/genhd.c:706
       del_gendisk+0xf6/0x19a block/genhd.c:819
       nbd_dev_remove+0x3c/0xf2 drivers/block/nbd.c:268
       nbd_dev_remove_work+0x1c/0x26 drivers/block/nbd.c:284
       process_one_work+0x96a/0x1f32 kernel/workqueue.c:3238
       process_scheduled_works kernel/workqueue.c:3321 [inline]
       worker_thread+0x5ce/0xde8 kernel/workqueue.c:3402
       kthread+0x39c/0x7d4 kernel/kthread.c:464
       ret_from_fork_kernel+0x2a/0xbb2 arch/riscv/kernel/process.c:214
       ret_from_fork_kernel_asm+0x16/0x18 arch/riscv/kernel/entry.S:327
-> #1 (&set->update_nr_hwq_lock){++++}-{4:4}:
       lock_acquire kernel/locking/lockdep.c:5871 [inline]
       lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828
       down_write+0x9c/0x19a kernel/locking/rwsem.c:1577
       blk_mq_update_nr_hw_queues+0x3e/0xb86 block/blk-mq.c:5041
       nbd_start_device+0x140/0xb2c drivers/block/nbd.c:1476
       nbd_genl_connect+0xae0/0x1b24 drivers/block/nbd.c:2201
       genl_family_rcv_msg_doit+0x206/0x2e6 net/netlink/genetlink.c:1115
       genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
       genl_rcv_msg+0x514/0x78e net/netlink/genetlink.c:1210
       netlink_rcv_skb+0x206/0x3be net/netlink/af_netlink.c:2534
       genl_rcv+0x36/0x4c net/netlink/genetlink.c:1219
       netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
       netlink_unicast+0x4f0/0x82c net/netlink/af_netlink.c:1339
       netlink_sendmsg+0x85e/0xdd6 net/netlink/af_netlink.c:1883
       sock_sendmsg_nosec net/socket.c:712 [inline]
       __sock_sendmsg+0xcc/0x160 net/socket.c:727
       ____sys_sendmsg+0x63e/0x79c net/socket.c:2566
       ___sys_sendmsg+0x144/0x1e6 net/socket.c:2620
       __sys_sendmsg+0x188/0x246 net/socket.c:2652
       __do_sys_sendmsg net/socket.c:2657 [inline]
       __se_sys_sendmsg net/socket.c:2655 [inline]
       __riscv_sys_sendmsg+0x70/0xa2 net/socket.c:2655
       syscall_handler+0x94/0x118 arch/riscv/include/asm/syscall.h:112
       do_trap_ecall_u+0x396/0x530 arch/riscv/kernel/traps.c:341
       handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197
-> #0 (&nbd->config_lock){+.+.}-{4:4}:
       check_noncircular+0x132/0x146 kernel/locking/lockdep.c:2178
       check_prev_add kernel/locking/lockdep.c:3168 [inline]
       check_prevs_add kernel/locking/lockdep.c:3287 [inline]
       validate_chain kernel/locking/lockdep.c:3911 [inline]
       __lock_acquire+0x12b2/0x24ea kernel/locking/lockdep.c:5240
       lock_acquire kernel/locking/lockdep.c:5871 [inline]
       lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828
       __mutex_lock_common kernel/locking/mutex.c:602 [inline]
       __mutex_lock+0x166/0x1292 kernel/locking/mutex.c:747
       mutex_lock_nested+0x14/0x1c kernel/locking/mutex.c:799
       refcount_dec_and_mutex_lock+0x60/0xd8 lib/refcount.c:118
       nbd_config_put+0x3a/0x610 drivers/block/nbd.c:1423
       nbd_release+0x94/0x15c drivers/block/nbd.c:1735
       blkdev_put_whole+0xac/0xee block/bdev.c:721
       bdev_release+0x3fe/0x600 block/bdev.c:1144
       blkdev_release+0x1a/0x26 block/fops.c:684
       __fput+0x382/0xa8c fs/file_table.c:465
       ____fput+0x1c/0x26 fs/file_table.c:493
       task_work_run+0x16a/0x25e kernel/task_work.c:227
       resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
       exit_to_user_mode_loop+0x118/0x134 kernel/entry/common.c:114
       exit_to_user_mode_prepare include/linux/entry-common.h:330 [inline]
       syscall_exit_to_user_mode_work include/linux/entry-common.h:414 [inline]
       syscall_exit_to_user_mode include/linux/entry-common.h:449 [inline]
       do_trap_ecall_u+0x3f0/0x530 arch/riscv/kernel/traps.c:355
       handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197
Also, it isn't necessary to hold nbd->config_lock here, because blk_mq_update_nr_hw_queues() grabs the tagset lock to synchronize everything. Fix the issue by releasing ->config_lock and retrying in case nr_hw_queues is updated concurrently. Fixes: 98e68f67020c ("block: prevent adding/deleting disk during updating nr_hw_queues") Reported-by: syzbot+2bcecf3c38cb3e8fdc8d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6855034f.a00a0220.137b3.0031.GAE@google.com Reviewed-by: Yu Kuai <yukuai3@huawei.com> Cc: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250709111744.2353050-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08  drbd: add missing kref_get in handle_write_conflicts  (Sarah Newman)
With `two-primaries` enabled, DRBD tries to detect "concurrent" writes and handle write conflicts, so that even if you write to the same sector simultaneously on both nodes, they end up with identical data once the writes are completed. In handling "superseeded" writes, we forgot a kref_get, resulting in a premature drbd_destroy_device, a use after free, and further on in kernel crashes. Relevance: no one should use DRBD as a random data generator, and apparently all users of "two-primaries" handle concurrent writes correctly in the layer above. That is, cluster file systems use some distributed lock manager, and live migration in virtualization environments stops writes on one node before starting writes on the other node. Which means that, other than for "test cases", this code path is never taken in real life. FYI, in DRBD 9 things are handled differently nowadays. We still detect "write conflicts", but no longer try to be smart about them. We decided to disconnect hard instead: upper layers must not submit concurrent writes. If they do, that's their fault. Signed-off-by: Sarah Newman <srn@prgmr.com> Signed-off-by: Lars Ellenberg <lars@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20250627095728.800688-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08  block: mtip32xx: Fix usage of dma_map_sg()  (Thomas Fourier)
dma_map_sg() can fail and, in case of failure, returns 0. If it fails, mtip_hw_submit_io() returns an error. dma_unmap_sg() requires the nents parameter to be the same as the one passed to dma_map_sg(), so this patch saves the nents in command->scatter_ents. Fixes: 88523a61558a ("block: Add driver for Micron RealSSD pcie flash cards") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250627121123.203731-2-fourier.thomas@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
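As a rough illustration of the pattern the fix follows; the structure and function names here are simplified stand-ins, not the actual mtip32xx code:

  #include <linux/dma-mapping.h>
  #include <linux/scatterlist.h>

  struct example_cmd {
          struct scatterlist *sg;
          int scatter_ents;       /* nents passed to dma_map_sg(), saved for unmap */
  };

  static int example_map(struct device *dev, struct example_cmd *cmd, int nents,
                         enum dma_data_direction dir)
  {
          int mapped = dma_map_sg(dev, cmd->sg, nents, dir);

          if (!mapped)            /* dma_map_sg() returns 0 on failure */
                  return -ENOMEM;

          /* dma_unmap_sg() must later be called with the original nents value */
          cmd->scatter_ents = nents;
          return mapped;
  }

  static void example_unmap(struct device *dev, struct example_cmd *cmd,
                            enum dma_data_direction dir)
  {
          dma_unmap_sg(dev, cmd->sg, cmd->scatter_ents, dir);
  }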
2025-07-08  Documentation: remove reference to pktcdvd in cdrom documentation  (Jens Axboe)
pktcdvd got killed in a previous commit, remove the reference to it as well in the cdrom documentation. Fixes: 1cea5180f2f8 ("block: remove pktcdvd driver") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08  nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping  (Christoph Hellwig)
The current version of the blk_rq_dma_map support in nvme-pci tries to reconstruct the DMA mappings from the on-the-wire descriptors if they are needed for unmapping. While this is not the case for the direct mapping fast path and the IOVA path, it is needed for the non-IOVA slow path, e.g. when the interconnect is not DMA coherent, when using swiotlb bounce buffering, or with an IOMMU mapping that can't coalesce. While the reconstruction is easy and works fine for the SGL path, where the on-the-wire representation maps 1:1 to DMA mappings, the code to reconstruct the DMA mapping ranges from PRPs can't always work, as a given PRP layout can come from different DMA mappings, and the current code doesn't even always get that right. Give up on this approach and track the actual DMA mapping when it is actually needed again. Fixes: 7ce3c1dd78fc ("nvme-pci: convert the data mapping to blk_rq_dma_map") Reported-by: Ben Copeland <ben.copeland@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250707125223.3022531-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-07  block: remove pktcdvd driver  (Jens Axboe)
This driver has long outlived its utility, and it's broken and unloved. The main use case for this was direct mount with UDF of CD-RW drives that required 32kb packets. It would collect writes into that size and write them out in multiples of that. That's not a common use case anymore; the world has moved on from those kinds of media. To make matters worse, it's actively breaking setups where it's not even required or useful. Link: https://lore.kernel.org/linux-block/fxg6dksau4jsk3u5xldlyo2m7qgiux6vtdrz5rywseotsouqdv@urcrwz6qtd3r/ Link: https://lore.kernel.org/linux-block/dcc4836e-6da9-4208-ad27-bbd44b3a2063@kernel.dk/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-04  ublk: introduce and use ublk_set_canceling helper  (Uday Shankar)
For performance reasons (minimizing the number of cache lines accessed in the hot path), we store the "canceling" state redundantly - there is one flag in the device, which can be considered the source of truth, and per-queue copies of that flag. This redundancy can cause confusion, and opens the door to bugs where the state is set inconsistently. Try to guard against these bugs by introducing a ublk_set_canceling helper which is the sole mutator of both the per-device and per-queue canceling state. This helper always sets the state consistently. Use the helper in all places where we need to modify the canceling state. No functional changes are expected. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250703-ublk_too_many_quiesce-v2-2-3527b5339eeb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
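A minimal sketch of what such a sole-mutator helper can look like; the struct layout and field names below are assumptions for illustration, not the actual ublk definitions:

  #include <linux/types.h>

  /* Hypothetical, simplified state: one device-level flag plus per-queue copies. */
  struct example_queue {
          bool canceling;
  };

  struct example_device {
          unsigned int nr_queues;
          bool canceling;
          struct example_queue *queues;
  };

  /* The only place allowed to touch the canceling state, keeping it consistent. */
  static void example_set_canceling(struct example_device *dev, bool canceling)
  {
          unsigned int i;

          dev->canceling = canceling;
          for (i = 0; i < dev->nr_queues; i++)
                  dev->queues[i].canceling = canceling;
  }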
2025-07-04  ublk: speed up ublk server exit handling  (Uday Shankar)
Recently, we've observed a few cases where a ublk server is able to complete restart more quickly than the driver can process the exit of the previous ublk server. The new ublk server comes up, attempts recovery of the preexisting ublk devices, and observes them still in state UBLK_S_DEV_LIVE. While this is possible due to the asynchronous nature of io_uring cleanup and should therefore be handled properly in the ublk server, it is still preferable to make ublk server exit handling faster if possible, as we should strive for it to not be a limiting factor in how fast a ublk server can restart and provide service again. Analysis of the issue showed that the vast majority of the time spent in handling the ublk server exit was in calls to blk_mq_quiesce_queue, which is essentially just a (relatively expensive) call to synchronize_rcu. The ublk server exit path currently issues an unnecessarily large number of calls to blk_mq_quiesce_queue, for two reasons:
1. It tries to call blk_mq_quiesce_queue once per ublk_queue. However, blk_mq_quiesce_queue targets the request_queue of the underlying ublk device, of which there is only one. So the number of calls is larger than necessary by a factor of nr_hw_queues.
2. In practice, it calls blk_mq_quiesce_queue _more_ than once per ublk_queue. This is because of a data race where we read ubq->canceling without any locking when deciding if we should call ublk_start_cancel. It is thus possible for two calls to ublk_uring_cmd_cancel_fn against the same ublk_queue to both call ublk_start_cancel against the same ublk_queue.
Fix this by making the "canceling" flag a per-device state. This actually matches the existing code better, as there are several places where the flag is set or cleared for all queues simultaneously, and there is the general expectation that cancellation corresponds with ublk server exit. This per-device canceling flag is then checked under a (new) lock (addressing the data race (2) above), and the queue is only quiesced if it is cleared (addressing (1) above). The result is just one call to blk_mq_quiesce_queue per ublk device. To minimize the number of cache lines that are accessed in the hot path, the per-queue canceling flag is kept. The values of the per-device canceling flag and all per-queue canceling flags should always match. In our setup, where one ublk server handles I/O for 128 ublk devices, each having 24 hardware queues of depth 4096, here are the results before and after this patch, where teardown time is measured from the first call to io_ring_ctx_wait_and_kill to the return from the last ublk_ch_release:
                                          before    after
number of calls to blk_mq_quiesce_queue:  6469      256
teardown time:                            11.14s    2.44s
There are still some potential optimizations here, but this takes care of a big chunk of the ublk server exit handling delay. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250703-ublk_too_many_quiesce-v2-1-3527b5339eeb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-03  zram: pass buffer offset to zcomp_available_show()  (Sergey Senozhatsky)
In most cases zcomp_available_show() is the only emitting function that is called from the sysfs read() handler, so it assumes that there is a whole PAGE_SIZE buffer to work with. There is an exception, however: recomp_algorithm_show(). In recomp_algorithm_show() we prepend the buffer with a priority number before we pass it to zcomp_available_show(), so it cannot assume PAGE_SIZE anymore and must take recomp_algorithm_show()'s modifications into consideration. Therefore we need to pass the buffer offset to zcomp_available_show(). Also convert it to use sysfs_emit_at(), to stay aligned with the rest of zram's sysfs read() handlers. In practice we are never even close to using the whole PAGE_SIZE buffer, so this is not a critical bug, but still. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/20250627071840.1394242-1-senozhatsky@chromium.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
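A simplified sketch of the offset-passing pattern described above; the function and attribute names are illustrative, not the exact zram code:

  #include <linux/device.h>
  #include <linux/sysfs.h>

  /* The emitting helper can no longer assume it owns the whole PAGE_SIZE
   * buffer, so it takes the current offset and continues with sysfs_emit_at(). */
  static ssize_t example_available_show(char *buf, ssize_t at)
  {
          at += sysfs_emit_at(buf, at, "lzo lzo-rle [lz4] zstd\n");
          return at;
  }

  static ssize_t example_recomp_show(struct device *dev,
                                     struct device_attribute *attr, char *buf)
  {
          ssize_t at = 0;

          /* The caller has already consumed part of the buffer... */
          at += sysfs_emit_at(buf, at, "#1: ");
          /* ...so the helper must start emitting at 'at', not at offset 0. */
          at = example_available_show(buf, at);
          return at;
  }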
2025-07-03  block: zram: replace scnprintf() with sysfs_emit() in *_show() functions  (Rahul Kumar)
Replace scnprintf() with sysfs_emit() or sysfs_emit_at() in sysfs *_show() functions in zram_drv.c to follow the kernel's guidelines from Documentation/filesystems/sysfs.rst. This improves consistency, safety, and makes the code easier to maintain and update in the future. Signed-off-by: Rahul Kumar <rk0006818@gmail.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/20250627035256.1120740-1-rk0006818@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
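A generic before/after sketch of this kind of conversion; the attribute shown is made up, not one of zram's:

  #include <linux/device.h>
  #include <linux/kernel.h>
  #include <linux/sysfs.h>

  /* Before: scnprintf() needs the buffer size spelled out by the caller. */
  static ssize_t old_style_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
  {
          return scnprintf(buf, PAGE_SIZE, "%d\n", 42);
  }

  /* After: sysfs_emit() knows sysfs buffers are PAGE_SIZE and validates the
   * buffer it is given, matching Documentation/filesystems/sysfs.rst. */
  static ssize_t new_style_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
  {
          return sysfs_emit(buf, "%d\n", 42);
  }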
2025-07-02  bcache: switch from pages to folios in read_super()  (Matthew Wilcox (Oracle))
Retrieve a folio from the page cache instead of a page. This removes a hidden call to compound_head(). Then be sure to call folio_put() instead of put_page() to release it. That doesn't save any calls to compound_head(), it just moves them around. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Coly Li <colyli@kernel.org> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250702024848.343370-1-colyli@kernel.org [axboe: commit message massaging] Signed-off-by: Jens Axboe <axboe@kernel.dk>
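A sketch of the page-to-folio pattern, assuming the superblock is read through an address_space in the page cache; this mirrors the idea but does not reproduce the bcache code:

  #include <linux/err.h>
  #include <linux/mm.h>
  #include <linux/pagemap.h>

  /* Read one page-sized superblock through the page cache, folio-style. */
  static void *example_read_super(struct address_space *mapping, struct folio **res)
  {
          struct folio *folio;

          /* read_mapping_folio() returns the folio uptodate, or an ERR_PTR */
          folio = read_mapping_folio(mapping, 0, NULL);
          if (IS_ERR(folio))
                  return NULL;

          *res = folio;           /* caller drops the reference with folio_put() */
          return folio_address(folio);
  }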
2025-07-01  virtio: blk/scsi: use block layer helpers to calculate num of queues  (Daniel Wagner)
The calculation of the upper limit for queues does not depend solely on the number of possible CPUs; for example, the isolcpus kernel command-line option must also be considered. To account for this, the block layer provides a helper function to retrieve the maximum number of queues. Use it to set an appropriate upper queue number limit. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-5-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01  scsi: use block layer helpers to calculate num of queues  (Daniel Wagner)
The calculation of the upper limit for queues does not depend solely on the number of online CPUs; for example, the isolcpus kernel command-line option must also be considered. To account for this, the block layer provides a helper function to retrieve the maximum number of queues. Use it to set an appropriate upper queue number limit. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-4-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01  nvme-pci: use block layer helpers to calculate num of queues  (Daniel Wagner)
The calculation of the upper limit for queues does not depend solely on the number of possible CPUs; for example, the isolcpus kernel command-line option must also be considered. To account for this, the block layer provides a helper function to retrieve the maximum number of queues. Use it to set an appropriate upper queue number limit. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-3-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01  blk-mq: add number of queue calc helper  (Daniel Wagner)
Add two variants of helper functions that calculate the correct number of queues to use. Two variants are needed because some drivers base their maximum number of queues on the possible CPU mask, while others use the online CPU mask. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-2-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
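A hedged sketch of the intended call pattern for the driver patches above; the helper name and signature below are assumptions based on the series description and may not match the final API exactly:

  #include <linux/blk-mq.h>

  /*
   * Assumed helper: the block layer caps the count by possible CPUs minus
   * isolated ones, and the driver additionally caps it by its own hardware
   * limit.  An online-CPU variant exists for drivers that prefer it.
   */
  static unsigned int example_nr_io_queues(unsigned int driver_hw_max)
  {
          return blk_mq_num_possible_queues(driver_hw_max);
  }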
2025-07-01  lib/group_cpus: Let group_cpu_evenly() return the number of initialized masks  (Daniel Wagner)
group_cpu_evenly() might have allocated fewer groups than requested:
group_cpu_evenly()
  __group_cpus_evenly()
    alloc_nodes_groups()
      # allocated total groups may be less than numgrps when
      # active total CPU number is less than numgrps
In this case, the caller will do an out-of-bounds access because it assumes the returned masks have numgrps entries. Return the number of groups created so the caller can limit the access range accordingly. Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-1-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
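A sketch of the fixed calling convention; the output-parameter signature is an assumption based on the commit description, not a verified prototype:

  #include <linux/cpumask.h>
  #include <linux/group_cpus.h>
  #include <linux/printk.h>
  #include <linux/slab.h>

  static void example_assign_groups(unsigned int numgrps)
  {
          unsigned int nr_masks, i;
          struct cpumask *masks = group_cpus_evenly(numgrps, &nr_masks);

          if (!masks)
                  return;

          /* Only nr_masks entries are valid; indexing up to numgrps could overrun. */
          for (i = 0; i < nr_masks; i++)
                  pr_info("group %u: %*pbl\n", i, cpumask_pr_args(&masks[i]));

          kfree(masks);
  }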
2025-06-30  ublk: cache-align struct ublk_io  (Caleb Sander Mateos)
struct ublk_io is already 56 bytes on 64-bit architectures, so round it up to a full cache line (typically 64 bytes). This ensures a single ublk_io doesn't span multiple cache lines and prevents false sharing if consecutive ublk_io's are accessed by different daemon tasks. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-15-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
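Illustrative only: padding a roughly 56-byte structure out to a full cache line so two instances never share a line; the fields below are made up, not the real struct ublk_io:

  #include <linux/cache.h>
  #include <linux/types.h>

  struct example_io {
          u64     addr;
          u32     flags;
          int     res;
          void    *private;
          /* ... */
  } ____cacheline_aligned_in_smp;       /* rounds the size up to a cache line */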
2025-06-30  ublk: remove ubq checks from ublk_{get,put}_req_ref()  (Caleb Sander Mateos)
ublk_get_req_ref() and ublk_put_req_ref() currently call ublk_need_req_ref(ubq) to check whether the ublk device features require reference counting of its requests. However, all callers already know that reference counting is required: - __ublk_check_and_get_req() is only called from ublk_check_and_get_req() if user copy is enabled, and from ublk_register_io_buf() if zero copy is enabled - ublk_io_release() is only called for requests registered by ublk_register_io_buf(), which requires zero copy - ublk_ch_read_iter() and ublk_ch_write_iter() only call ublk_put_req_ref() if ublk_check_and_get_req() succeeded, which requires user copy to be enabled So drop the ublk_need_req_ref() check and the ubq argument in ublk_get_req_ref() and ublk_put_req_ref(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-14-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task  (Caleb Sander Mateos)
ublk_io_release() performs an expensive atomic refcount decrement. This atomic operation is unnecessary in the common case where the request's buffer is registered and unregistered on the daemon task before handling UBLK_IO_COMMIT_AND_FETCH_REQ for the I/O. So if ublk_io_release() is called on the daemon task and task_registered_buffers is positive, just decrement task_registered_buffers (nonatomically). ublk_sub_req_ref() will apply this decrement when it atomically subtracts from io->ref. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-13-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task  (Caleb Sander Mateos)
ublk_register_io_buf() performs an expensive atomic refcount increment, as well as a lot of pointer chasing to look up the struct request. Create a separate ublk_daemon_register_io_buf() for the daemon task to call. Initialize ublk_io's reference count to a large number, introduce a field task_registered_buffers to count the buffers registered on the daemon task, and atomically subtract the large number minus task_registered_buffers in ublk_commit_and_fetch(). Also obtain the struct request directly from ublk_io's req field instead of looking it up on the tagset. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-12-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
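A rough sketch of the refcount trick described in this patch and the previous one; the initial value, struct layout and function names are illustrative, not the ublk code:

  #include <linux/refcount.h>

  #define EXAMPLE_INIT_REF        256     /* stands in for the "large number" */

  struct example_io {
          refcount_t      ref;
          unsigned int    task_registered_buffers;  /* touched only on the daemon task */
  };

  static void example_prep(struct example_io *io)
  {
          refcount_set(&io->ref, EXAMPLE_INIT_REF);
          io->task_registered_buffers = 0;
  }

  /* Daemon task: registering a buffer bumps a plain counter, no atomic op. */
  static void example_daemon_register_buf(struct example_io *io)
  {
          io->task_registered_buffers++;
  }

  /*
   * Commit/fetch time: fold the plain counter back into the refcount with a
   * single atomic subtraction of (initial value - buffers still registered).
   * Returns true when the last reference is gone.
   */
  static bool example_commit(struct example_io *io)
  {
          unsigned int sub = EXAMPLE_INIT_REF - io->task_registered_buffers;

          io->task_registered_buffers = 0;
          return refcount_sub_and_test(sub, &io->ref);
  }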
2025-06-30  ublk: return early if blk_should_fake_timeout()  (Caleb Sander Mateos)
Make the unlikely case blk_should_fake_timeout() return early to reduce the indentation of the successful path. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-11-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: allow UBLK_IO_(UN)REGISTER_IO_BUF on any task  (Caleb Sander Mateos)
Currently, UBLK_IO_REGISTER_IO_BUF and UBLK_IO_UNREGISTER_IO_BUF are only permitted on the ublk_io's daemon task. But this restriction is unnecessary. ublk_register_io_buf() calls __ublk_check_and_get_req() to look up the request from the tagset and atomically take a reference on the request without accessing the ublk_io. ublk_unregister_io_buf() doesn't use the q_id or tag at all. So allow these opcodes even on tasks other than io->task. Handle UBLK_IO_UNREGISTER_IO_BUF before obtaining the ubq and io since the buffer index being unregistered is not necessarily related to the specified q_id and tag. Add a feature flag UBLK_F_BUF_REG_OFF_DAEMON that userspace can use to determine whether the kernel supports off-daemon buffer registration. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-10-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: don't take ublk_queue in ublk_unregister_io_buf()  (Caleb Sander Mateos)
UBLK_IO_UNREGISTER_IO_BUF currently requires a valid q_id and tag to be passed in the ublksrv_io_cmd. However, only the addr (registered buffer index) is actually used to unregister the buffer. There is no check that the q_id and tag are for the ublk request whose buffer is registered at the given index. To prepare to allow userspace to omit the q_id and tag, check the UBLK_F_SUPPORT_ZERO_COPY flag on the ublk_device instead of the ublk_queue. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-9-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: consolidate UBLK_IO_FLAG_{ACTIVE,OWNED_BY_SRV} checks  (Caleb Sander Mateos)
UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_OWNED_BY_SRV are mutually exclusive. So just check that UBLK_IO_FLAG_OWNED_BY_SRV is set in __ublk_ch_uring_cmd(); that implies UBLK_IO_FLAG_ACTIVE is unset. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-7-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: remove task variable from __ublk_ch_uring_cmd()  (Caleb Sander Mateos)
The variable is computed from a simple expression and used once, so just replace it with the expression. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-6-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: handle UBLK_IO_FETCH_REQ earlier  (Caleb Sander Mateos)
Check for UBLK_IO_FETCH_REQ early in __ublk_ch_uring_cmd() and skip the rest of the checks in this case. This allows removing the checks for NULL io->task and UBLK_IO_FLAG_OWNED_BY_SRV unset in io->flags, which are only allowed for FETCH. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: check cmd_op first  (Caleb Sander Mateos)
In preparation for skipping some of the other checks for certain IO opcodes, move the cmd_op check earlier. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: remove struct ublk_rq_data  (Caleb Sander Mateos)
__ublk_check_and_get_req() attempts to atomically look up the struct request for a ublk I/O and take a reference on it. However, the request can be freed between the lookup on the tagset in blk_mq_tag_to_rq() and the increment of its reference count in ublk_get_req_ref(), for example if an elevator switch happens concurrently. Fix the potential use after free by moving the reference count from ublk_rq_data to ublk_io. Move the fields buf_index and buf_ctx_handle too to reduce the number of cache lines touched when dispatching and completing a ublk I/O, allowing ublk_rq_data to be removed entirely. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 62fe99cef94a ("ublk: add read()/write() support for ublk char device") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  ublk: use vmalloc for ublk_device's __queues  (Caleb Sander Mateos)
struct ublk_device's __queues points to an allocation with up to UBLK_MAX_NR_QUEUES (4096) queues, each of which has:
- struct ublk_queue (48 bytes)
- a tail array of up to UBLK_MAX_QUEUE_DEPTH (4096) struct ublk_io's, 32 bytes each
This means the full allocation can exceed 512 MB, which may well be impossible to service with contiguous physical pages. Switch to kvcalloc() and kvfree(), since there is no need for physically contiguous memory. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
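A minimal sketch of the allocation pattern; the function names and sizes are illustrative:

  #include <linux/slab.h>

  /*
   * A per-queue array can exceed what contiguous physical pages reliably
   * provide, so allocate with kvcalloc() (which falls back to vmalloc) and
   * free with kvfree().
   */
  static void *example_alloc_queues(unsigned int nr_queues, size_t queue_size)
  {
          return kvcalloc(nr_queues, queue_size, GFP_KERNEL);
  }

  static void example_free_queues(void *queues)
  {
          kvfree(queues);
  }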
2025-06-30  nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS  (Christoph Hellwig)
The current use of an always_inline helper is a bit convoluted. Instead use macros that represent the arithmetics used for building up the PRP chain. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE  (Christoph Hellwig)
Having a define in kiB units is a bit weird. Also update the comment now that there is no scatterlist limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: convert the data mapping to blk_rq_dma_map  (Christoph Hellwig)
Use the blk_rq_dma_map API to DMA map requests instead of scatterlists. This removes the need to allocate a scatterlist covering every segment, and thus the overall transfer length limit based on the scatterlist allocation. Instead the DMA mapping is done by iterating the bio_vec chain in the request directly. The unmap is handled differently depending on how we mapped:
- when using an IOMMU only a single IOVA is used, and it is stored in iova_state
- for direct mappings that don't use swiotlb and are cache coherent, unmap is not needed at all
- for direct mappings that are not cache coherent or use swiotlb, the physical addresses are rebuilt from the PRPs or SGL segments
The latter unfortunately adds a fair amount of code to the driver, but it is code not used in the fast path. The conversion only covers the data mapping path, and still uses a scatterlist for the multi-segment metadata case. I plan to convert that as soon as we have good test coverage for the multi-segment metadata path. Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API conversion for nvme-pci, Kanchan Joshi for bringing back the single segment optimization, Leon Romanovsky for shepherding this through a gazillion rebases and Nitesh Shetty for various improvements. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: remove superfluous arguments  (Christoph Hellwig)
The call chain in the prep_rq and completion paths passes around a lot of nvme_dev, nvme_queue and nvme_command arguments that can be trivially derived from the passed in struct request. Remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: merge the simple PRP and SGL setup into a common helper  (Christoph Hellwig)
nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic. Merge them into a single helper that makes use of the previously added use_sgl tristate. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250625113531.522027-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  nvme-pci: refactor nvme_pci_use_sgls  (Christoph Hellwig)
Move the average segment size into a separate helper, and return a tristate to distinguish the case where we can use SGLs from the case where we have to use SGLs. This will allow simplifying the code and making more efficient decisions in follow-on changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250625113531.522027-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  block: add scatterlist-less DMA mapping helpers  (Christoph Hellwig)
Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with the wasteful scatterlist structure. Instead it uses the mapping iterator to either add segments to the IOVA for IOMMU operations, or just maps them one by one for the direct mapping. For the IOMMU case, instead of a scatterlist with an entry for each segment, only a single [dma_addr,len] pair needs to be stored for processing a request, and for the direct mapping the per-segment allocation shrinks from [page,offset,len,dma_addr,dma_len] to just [dma_addr,len]. One big difference to the scatterlist API, which could be considered a downside, is that the IOVA collapsing only works when the driver sets a virt_boundary that matches the IOMMU granule. For NVMe this is done already, so it works perfectly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  block: don't merge different kinds of P2P transfers in a single bio  (Christoph Hellwig)
To get the DMA mapping helpers out of having to check every segment for its P2P status, ensure that bios either contain P2P transfers or non-P2P transfers, and that a P2P bio only contains ranges from a single device. This means we do the page zone access in the bio add path, where it should still be page hot, and will only have to do the fairly expensive P2P topology lookup once per bio down in the DMA mapping path, and only for already marked bios. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  dm: Check for forbidden splitting of zone write operations  (Damien Le Moal)
DM targets must not split zone append and write operations using dm_accept_partial_bio() as doing so is forbidden for zone append BIOs, breaks zone append emulation using regular write BIOs and potentially creates deadlock situations with queue freeze operations. Modify dm_accept_partial_bio() to add the missing BUG_ON() checks for all these cases, that is, check that the BIO is a write or write zeroes operation. This change packs all the zone related checks together under a static_branch_unlikely(&zoned_enabled), and they are done only if the target is a zoned device. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250625093327.548866-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
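A rough sketch of the kind of guard being added; this is not the dm.c code, and the emulated_append flag stands in for DM's internal zone append emulation state:

  #include <linux/bio.h>
  #include <linux/blkdev.h>
  #include <linux/bug.h>

  static void example_accept_partial_checks(struct bio *bio, bool emulated_append)
  {
          if (!bdev_is_zoned(bio->bi_bdev))
                  return;

          /* Zone append BIOs must never be split. */
          BUG_ON(bio_op(bio) == REQ_OP_ZONE_APPEND);
          /* Nor the regular writes used to emulate zone append. */
          BUG_ON(emulated_append);
          /* Only regular write and write-zeroes BIOs may be partially accepted. */
          BUG_ON(bio_op(bio) != REQ_OP_WRITE && bio_op(bio) != REQ_OP_WRITE_ZEROES);
  }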
2025-06-30  dm: dm-crypt: Do not partially accept write BIOs with zoned targets  (Damien Le Moal)
Read and write operations issued to a dm-crypt target may be split according to the dm-crypt internal limits defined by the max_read_size and max_write_size module parameters (default is 128 KB). The intent is to improve processing time of large BIOs by splitting them into smaller operations that can be parallelized on different CPUs. For zoned dm-crypt targets, this BIO splitting is still done but without the parallel execution to ensure that the issuing order of write operations to the underlying devices remains sequential. However, the splitting itself causes other problems:
1) Since dm-crypt relies on the block layer zone write plugging to handle zone append emulation using regular write operations, the remainder of a split write BIO will always be plugged into the target zone write plug. Once the on-going write BIO finishes, this remainder BIO is unplugged and issued from the zone write plug work. If this remainder BIO itself needs to be split, the remainder will be re-issued and plugged again, but that causes a call to blk_queue_enter(), which may block if a queue freeze operation was initiated. This results in a deadlock as DM submission still holds BIOs that the queue freeze side is waiting for.
2) dm-crypt relies on the emulation done by the block layer using regular write operations for processing zone append operations. This still requires properly returning the written sector as the BIO sector of the original BIO. However, this can be done correctly if and only if there is a single clone BIO used for processing the original zone append operation issued by the user. If the size of a zone append operation is larger than dm-crypt max_write_size, then the original BIO will be split and processed as a chain of regular write operations. Such chaining results in an incorrect written sector being returned to the zone append issuer using the original BIO sector. This in turn results in file system data corruption using xfs or btrfs.
Fix this by modifying get_max_request_size() to always return the size of the BIO to avoid it being split with dm_accept_partial_bio() in crypt_map(). get_max_request_size() is renamed to get_max_request_sectors() to clarify the unit of the value returned, and its interface is changed to take a struct dm_target pointer and a pointer to the struct bio being processed. In addition to this change, to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt device max_hw_sectors limit to be at most BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture). This forces DM core to split write BIOs before passing them to crypt_map(), thus guaranteeing that dm-crypt can always accept an entire write BIO without needing to split it. This change does not have any effect on the read path of dm-crypt. Read operations can still be split and the BIO fragments processed in parallel. There is also no impact on the performance of the write path given that all zone write BIOs were already processed inline instead of in parallel. This change also does not affect in any way regular dm-crypt block devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250625093327.548866-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  dm: Always split write BIOs to zoned device limits  (Damien Le Moal)
Any zoned DM target that requires zone append emulation will use the block layer zone write plugging. In such a case, DM target drivers must not split BIOs using dm_accept_partial_bio() as doing so can potentially lead to deadlocks with queue freeze operations. Regular write operations used to emulate zone append operations also cannot be split by the target driver as that would result in an invalid written sector value being returned using the BIO sector. In order for zoned DM target drivers to avoid such incorrect BIO splitting, we must ensure that large BIOs are split before being passed to the map() function of the target, thus guaranteeing that the limits for the mapped device are not exceeded. dm-crypt and dm-flakey are the only target drivers supporting zoned devices and using dm_accept_partial_bio(). In the case of dm-crypt, this function is used to split BIOs to the internal max_write_size limit (which will be suppressed in a different patch). However, crypt_alloc_buffer() uses a bioset allowing only up to BIO_MAX_VECS (256) vectors in a BIO. The dm-crypt device max_segments limit, which is not set and so defaults to BLK_MAX_SEGMENTS (128), must thus be respected and write BIOs split accordingly. In the case of dm-flakey, since zone append emulation is not required, the block layer zone write plugging is not used and no splitting of BIOs is required. Modify the function dm_zone_bio_needs_split() to use the block layer helper function bio_needs_zone_write_plugging() to force a call to bio_split_to_limits() in dm_split_and_process_bio(). This allows DM target drivers to avoid using dm_accept_partial_bio() for write operations on zoned DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250625093327.548866-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  block: Introduce bio_needs_zone_write_plugging()  (Damien Le Moal)
In preparation for fixing device mapper zone write handling, introduce the inline helper function bio_needs_zone_write_plugging() to test if a BIO requires handling through zone write plugging using the function blk_zone_plug_bio(). This function returns true for any write (op_is_write(bio) == true) operation directed at a zoned block device using zone write plugging, that is, a block device with a disk that has a zone write plug hash table. This helper allows simplifying the check on entry to blk_zone_plug_bio() and is used to protect calls to it for blk-mq devices and DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  block: Make REQ_OP_ZONE_FINISH a write operation  (Damien Le Moal)
REQ_OP_ZONE_FINISH is defined as "12", which makes op_is_write(REQ_OP_ZONE_FINISH) return false, despite the fact that a zone finish operation modifies a zone (transitioning it to full) and so should be considered a write operation (albeit one that does not transfer any data to the device). Fix this by redefining REQ_OP_ZONE_FINISH to be an odd number (13), and redefining REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL using sequential odd numbers from that new value. Fixes: 6c1b1da58f8c ("block: add zone open, close and finish operations") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
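The reason odd numbers matter is that op_is_write() only inspects the lowest bit of the opcode. A small illustrative sketch; example_op_is_write mirrors that low-bit test rather than copying the kernel helper verbatim:

  #include <linux/blk_types.h>

  /* Write-like opcodes are identified purely by their lowest bit being set. */
  static inline bool example_op_is_write(blk_opf_t op)
  {
          return op & 1;
  }

  /*
   * Before the fix REQ_OP_ZONE_FINISH was 12 (even), so this returned false
   * for an operation that does modify the zone.  With the opcode moved to 13
   * (odd), the invariant that all write-like opcodes are odd holds again.
   */
  static bool example_zone_finish_is_write(void)
  {
          return example_op_is_write(REQ_OP_ZONE_FINISH);
  }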
2025-06-30  block: Increase BLK_DEF_MAX_SECTORS_CAP  (Damien Le Moal)
Back in 2015, commit d2be537c3ba3 ("block: bump BLK_DEF_MAX_SECTORS to 2560") increased the default maximum size of a block device I/O to 2560 sectors (1280 KiB) to "accommodate a 10-data-disk stripe write with chunk size 128k". This choice is rather arbitrary and, since then, improvements to the block layer have software RAID drivers correctly advertising their stripe width through chunk_sectors, and abuses of BLK_DEF_MAX_SECTORS_CAP by drivers (to set the HW limit rather than the default user-controlled maximum I/O size) have been fixed. Since many block devices can benefit from a larger value of BLK_DEF_MAX_SECTORS_CAP, and in particular HDDs, increase this value to 4 MiB, or 8192 sectors. And given that BLK_DEF_MAX_SECTORS_CAP is only used in the block layer and should not be used by drivers directly, move this macro definition to the block layer internal header file block/blk.h. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250618060045.37593-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
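The new default works out as follows; a small sketch of the arithmetic only, using an illustrative macro name since the real definition now lives in block/blk.h:

  #include <linux/blkdev.h>     /* SECTOR_SHIFT (9, i.e. 512-byte sectors) */

  /* 4 MiB expressed in 512-byte sectors: (4 << 20) >> 9 == 8192 */
  #define EXAMPLE_DEF_MAX_SECTORS_CAP     ((4 << 20) >> SECTOR_SHIFT)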
2025-06-29  Linux 6.16-rc4  (Linus Torvalds)