summaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)Author
2025-04-24ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmdMing Lei
ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but we may have scheduled task work via io_uring_cmd_complete_in_task() for dispatching request, then kernel crash can be triggered. Fix it by not trying to canceling the command if ublk block request is started. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Reported-by: Jared Holzman <jholzman@nvidia.com> Tested-by: Jared Holzman <jholzman@nvidia.com> Closes: https://lore.kernel.org/linux-block/d2179120-171b-47ba-b664-23242981ef19@nvidia.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATAMing Lei
We call io_uring_cmd_complete_in_task() to schedule task_work for handling UBLK_U_IO_NEED_GET_DATA. This way is really not necessary because the current context is exactly the ublk queue context, so call ublk_dispatch_req() directly for handling UBLK_U_IO_NEED_GET_DATA. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Tested-by: Jared Holzman <jholzman@nvidia.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-22nvmet: fix out-of-bounds access in nvmet_enable_portRichard Weinberger
When trying to enable a port that has no transport configured yet, nvmet_enable_port() uses NVMF_TRTYPE_MAX (255) to query the transports array, causing an out-of-bounds access: [ 106.058694] BUG: KASAN: global-out-of-bounds in nvmet_enable_port+0x42/0x1da [ 106.058719] Read of size 8 at addr ffffffff89dafa58 by task ln/632 [...] [ 106.076026] nvmet: transport type 255 not supported Since commit 200adac75888, NVMF_TRTYPE_MAX is the default state as configured by nvmet_ports_make(). Avoid this by checking for NVMF_TRTYPE_MAX before proceeding. Fixes: 200adac75888 ("nvme: Add PCI transport type") Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
2025-04-17Merge tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme into block-6.15Jens Axboe
Pull NVMe fixes from Christoph: "nvme fixes for Linux 6.15 - fix scan failure for non-ANA multipath controllers (Hannes Reinecke) - fix multipath sysfs links creation for some cases (Hannes Reinecke) - PCIe endpoint fixes (Damien Le Moal) - use NULL instead of 0 in the auth code (Damien Le Moal)" * tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme: nvmet: pci-epf: cleanup link state management nvmet: pci-epf: clear CC and CSTS when disabling the controller nvmet: pci-epf: always fully initialize completion entries nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free() nvme-multipath: sysfs links may not be created for devices nvme: fixup scan failure for non-ANA multipath controllers
2025-04-16Merge tag 'md-6.15-20250416' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into block-6.15 Pull MD fixes from Yu: "- fix raid10 missing discard IO accounting (Yu Kuai) - fix bitmap stats for bitmap file (Zheng Qixing) - fix oops while reading all member disks failed during check/repair (Meir Elisha)" * tag 'md-6.15-20250416' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md/raid1: Add check for missing source disk in process_checks() md/md-bitmap: fix stats collection for external bitmaps md/raid10: fix missing discard IO accounting
2025-04-16ublk: simplify aborting ublk requestMing Lei
Now ublk_abort_queue() is moved to ublk char device release handler, meantime our request queue is "quiesced" because either ->canceling was set from uring_cmd cancel function or all IOs are inflight and can't be completed by ublk server, things becomes easy much: - all uring_cmd are done, so we needn't to mark io as UBLK_IO_FLAG_ABORTED for handling completion from uring_cmd - ublk char device is closed, no one can hold IO request reference any more, so we can simply complete this request or requeue it for ublk_nosrv_should_reissue_outstanding. Reviewed-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-8-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16ublk: remove __ublk_quiesce_dev()Ming Lei
Remove __ublk_quiesce_dev() and open code for updating device state as QUIESCED. We needn't to drain inflight requests in __ublk_quiesce_dev() any more, because all inflight requests are aborted in ublk char device release handler. Also we needn't to set ->canceling in __ublk_quiesce_dev() any more because it is done unconditionally now in ublk_ch_release(). Reviewed-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16ublk: improve detection and handling of ublk server exitUday Shankar
There are currently two ways in which ublk server exit is detected by ublk_drv: 1. uring_cmd cancellation. If there are any outstanding uring_cmds which have not been completed to the ublk server when it exits, io_uring calls the uring_cmd callback with a special cancellation flag as the issuing task is exiting. 2. I/O timeout. This is needed in addition to the above to handle the "saturated queue" case, when all I/Os for a given queue are in the ublk server, and therefore there are no outstanding uring_cmds to cancel when the ublk server exits. There are a couple of issues with this approach: - It is complex and inelegant to have two methods to detect the same condition - The second method detects ublk server exit only after a long delay (~30s, the default timeout assigned by the block layer). This delays the nosrv behavior from kicking in and potential subsequent recovery of the device. The second issue is brought to light with the new test_generic_06 which will be added in following patch. It fails before this fix: selftests: ublk: test_generic_06.sh dev id is 0 dd: error writing '/dev/ublkb0': Input/output error 1+0 records in 0+0 records out 0 bytes copied, 30.0611 s, 0.0 kB/s DEAD dd took 31 seconds to exit (>= 5s tolerance)! generic_06 : [FAIL] Fix this by instead detecting and handling ublk server exit in the character file release callback. This has several advantages: - This one place can handle both saturated and unsaturated queues. Thus, it replaces both preexisting methods of detecting ublk server exit. - It runs quickly on ublk server exit - there is no 30s delay. - It starts the process of removing task references in ublk_drv. This is needed if we want to relax restrictions in the driver like letting only one thread serve each queue There is also the disadvantage that the character file release callback can also be triggered by intentional close of the file, which is a significant behavior change. Preexisting ublk servers (libublksrv) are dependent on the ability to open/close the file multiple times. To address this, only transition to a nosrv state if the file is released while the ublk device is live. This allows for programs to open/close the file multiple times during setup. It is still a behavior change if a ublk server decides to close/reopen the file while the device is LIVE (i.e. while it is responsible for serving I/O), but that would be highly unusual. This behavior is in line with what is done by FUSE, which is very similar to ublk in that a userspace daemon is providing services traditionally provided by the kernel. With this change in, the new test (and all other selftests, and all ublksrv tests) pass: selftests: ublk: test_generic_06.sh dev id is 0 dd: error writing '/dev/ublkb0': Input/output error 1+0 records in 0+0 records out 0 bytes copied, 0.0376731 s, 0.0 kB/s DEAD generic_04 : [PASS] Signed-off-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16ublk: move device reset into ublk_ch_release()Ming Lei
ublk_ch_release() is called after ublk char device is closed, when all uring_cmd are done, so it is perfect fine to move ublk device reset to ublk_ch_release() from ublk_ctrl_start_recovery(). This way can avoid to grab the exiting daemon task_struct too long. However, reset of the following ublk IO flags has to be moved until ublk io_uring queues are ready: - ubq->canceling For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before device is recovered - ubq->fail_io For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before device is recovered - ublk_io->flags For preventing using io->cmd With this way, recovery is simplified a lot. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_ioMing Lei
Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping request queue as quiesced. This way is fragile because queue quiesce crosses syscalls or process contexts. Switch to rely on ubq->canceling for dealing with ublk_nosrv_dev_should_queue_io(), because it has been used for this purpose during io_uring context exiting, and it can be reused before recovering too. In ublk_queue_rq(), the request will be added to requeue list without kicking off requeue in case of ubq->canceling, and finally requests added in requeue list will be dispatched from either ublk_stop_dev() or ublk_ctrl_end_recovery(). Meantime we have to move reset of ubq->canceling from ublk_ctrl_start_recovery() to ublk_ctrl_end_recovery(), when IO handling can be recovered completely. Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used in same context. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250416035444.99569-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16ublk: add ublk_force_abort_dev()Ming Lei
Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io() in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in single function. Meantime not change device state to QUIESCED any more, since the disk is going to be removed soon. Reviewed-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16ublk: properly serialize all FETCH_REQsUday Shankar
Most uring_cmds issued against ublk character devices are serialized because each command affects only one queue, and there is an early check which only allows a single task (the queue's ubq_daemon) to issue uring_cmds against that queue. However, this mechanism does not work for FETCH_REQs, since they are expected before ubq_daemon is set. Since FETCH_REQs are only used at initialization and not in the fast path, serialize them using the per-ublk-device mutex. This fixes a number of data races that were previously possible if a badly behaved ublk server decided to issue multiple FETCH_REQs against the same qid/tag concurrently. Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16md/raid1: Add check for missing source disk in process_checks()Meir Elisha
During recovery/check operations, the process_checks function loops through available disks to find a 'primary' source with successfully read data. If no suitable source disk is found after checking all possibilities, the 'primary' index will reach conf->raid_disks * 2. Add an explicit check for this condition after the loop. If no source disk was found, print an error message and return early to prevent further processing without a valid primary source. Link: https://lore.kernel.org/linux-raid/20250408143808.1026534-1-meir.elisha@volumez.com Signed-off-by: Meir Elisha <meir.elisha@volumez.com> Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-04-16nvmet: pci-epf: cleanup link state managementDamien Le Moal
Since the link_up boolean field of struct nvmet_pci_epf_ctrl is always set to true when nvmet_pci_epf_start_ctrl() is called, assign true to this field in nvmet_pci_epf_start_ctrl(). Conversely, since this field is set to false when nvmet_pci_epf_stop_ctrl() is called, set this field to false directly inside that function. While at it, also add information messages to notify the user of the PCI link state changes to help troubleshoot any link stability issues without needing to enable debug messages. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-16nvmet: pci-epf: clear CC and CSTS when disabling the controllerDamien Le Moal
When a host shuts down the controller when shutting down but does so without first disabling the controller, the enable bit remains set in the controller configuration register. When the host restarts and attempts to enable the controller again, the nvmet_pci_epf_poll_cc_work() function is unable to detect the change from 0 to 1 of the enable bit, and thus the controller is not enabled again, which result in a device scan timeout on the host. This problem also occurs if the host shuts down uncleanly or if the PCIe link goes down: as the CC.EN value is not reset, the controller is not enabled again when the host restarts. Fix this by introducing the function nvmet_pci_epf_clear_ctrl_config() to clear the CC and CSTS registers of the controller when the PCIe link is lost (nvmet_pci_epf_stop_ctrl() function), or when starting the controller fails (nvmet_pci_epf_enable_ctrl() fails). Also use this function in nvmet_pci_epf_init_bar() to simplify the initialization of the CC and CSTS registers. Furthermore, modify the function nvmet_pci_epf_disable_ctrl() to clear the CC.EN bit and write this updated value to the BAR register when the controller is shutdown by the host, to ensure that upon restart, we can detect the host setting CC.EN. Fixes: 0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-16nvmet: pci-epf: always fully initialize completion entriesDamien Le Moal
For a command that is normally processed through the command request execute() function, the completion entry for the command is initialized by __nvmet_req_complete() and nvmet_pci_epf_cq_work() only needs to set the status field and the phase of the completion entry before posting the entry to the completion queue. However, for commands that are failed due to an internal error (e.g. the command data buffer allocation fails), the command request execute() function is not called and __nvmet_req_complete() is never executed for the command, leaving the command completion entry uninitialized. For such command failed before calling req->execute(), the host ends up seeing completion entries with an invalid submission queue ID and command ID. Avoid such issue by always fully initilizing a command completion entry in nvmet_pci_epf_cq_work(), setting the entry submission queue head, ID and command ID. Fixes: 0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-16nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free()Damien Le Moal
When compiling with C=1, the following sparse warning is generated: auth.c:243:23: warning: Using plain integer as NULL pointer Avoid this warning by using NULL to instead of 0 to set the sq tls_key pointer. Fixes: fa2e0f8bbc68 ("nvmet-tcp: support secure channel concatenation") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-16nvme-multipath: sysfs links may not be created for devicesHannes Reinecke
When rapidly rescanning for new namespaces nvme_mpath_add_sysfs_link() may be called for a block device not added to sysfs. But NVME_NS_SYSFS_ATTR_LINK had already been set, so when checking this device a second time we will fail to create the link. Fix this by exchanging the order of the block device check and the NVME_NS_SYSFS_ATTR_LINK bit check. Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy") Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>** Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-16nvme: fixup scan failure for non-ANA multipath controllersHannes Reinecke
Commit 62baf70c3274 caused the ANA log page to be re-read, even on controllers that do not support ANA. While this should generally harmless, some controllers hang on the unsupported log page and never finish probing. Fixes: 62baf70c3274 ("nvme: re-read ANA log page after ns scan completes") Signed-off-by: Hannes Reinecke <hare@kernel.org> Tested-by: Srikanth Aithal <sraithal@amd.com> [hch: more detailed commit message] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2025-04-15ublk: don't suggest CONFIG_BLK_DEV_UBLK=YCaleb Sander Mateos
The CONFIG_BLK_DEV_UBLK help text suggests setting the config option to Y so task_work_add() can be used to dispatch I/O, improving performance. However, this mechanism was removed in commit 29dc5d06613f2 ("ublk: kill queuing request by task_work_add"). So remove this paragraph from the config help text. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250416004111.3242817-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-15loop: stop using vfs_iter_{read,write} for buffered I/OChristoph Hellwig
vfs_iter_{read,write} always perform direct I/O when the file has the O_DIRECT flag set, which breaks disabling direct I/O using the LOOP_SET_STATUS / LOOP_SET_STATUS64 ioctls. This was recenly reported as a regression, but as far as I can tell was only uncovered by better checking for block sizes and has been around since the direct I/O support was added. Fix this by using the existing aio code that calls the raw read/write iter methods instead. Note that despite the comments there is no need for block drivers to ever call flush_dcache_page themselves, and the call is a left-over from prehistoric times. Fixes: ab1cb278bc70 ("block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO") Reported-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20250409130940.3685677-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-15loop: LOOP_SET_FD: send uevents for partitionsThomas Weißschuh
Remove the suppression of the uevents before scanning for partitions. The partitions inherit their suppression settings from their parent device, which lead to the uevents being dropped. This is similar to the same changes for LOOP_CONFIGURE done in commit bb430b694226 ("loop: LOOP_CONFIGURE: send uevents for partitions"). Fixes: 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v3-1-60ff69ac6088@linutronix.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-15loop: properly send KOBJ_CHANGED uevent for disk deviceThomas Weißschuh
The original commit message and the wording "uncork" in the code comment indicate that it is expected that the suppressed event instances are automatically sent after unsuppressing. This is not the case, instead they are discarded. In effect this means that no "changed" events are emitted on the device itself by default. While each discovered partition does trigger a changed event on the device, devices without partitions don't have any event emitted. This makes udev miss the device creation and prompted workarounds in userspace. See the linked util-linux/losetup bug. Explicitly emit the events and drop the confusingly worded comments. Link: https://github.com/util-linux/util-linux/issues/2434 Fixes: 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v2-1-0c4e6a923b2a@linutronix.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-15loop: aio inherit the ioprio of original requestYunlong Xing
Set cmd->iocb.ki_ioprio to the ioprio of loop device's request. The purpose is to inherit the original request ioprio in the aio flow. Signed-off-by: Yunlong Xing <yunlong.xing@unisoc.com> Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250414030159.501180-1-yunlong.xing@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-11null_blk: Use strscpy() instead of strscpy_pad() in null_add_dev()Thorsten Blum
blk_mq_alloc_disk() already zero-initializes the destination buffer, making strscpy() sufficient for safely copying the disk's name. The additional NUL-padding performed by strscpy_pad() is unnecessary. If the destination buffer has a fixed length, strscpy() automatically determines its size using sizeof() when the argument is omitted. This makes the explicit size argument unnecessary. The source string is also NUL-terminated and meets the __must_be_cstr() requirement of strscpy(). No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250410154727.883207-1-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-10Merge tag 'nvme-6.15-2025-04-10' of git://git.infradead.org/nvme into block-6.15Jens Axboe
Pull NVMe updates from Christoph: "nvme updates for Linux 6.15 - nvmet fc/fcloop refcounting fixes (Daniel Wagner) - fix missed namespace/ANA scans (Hannes Reinecke) - fix a use after free in the new TCP netns support (Kuniyuki Iwashima) - fix a NULL instead of false review in multipath (Uday Shankar)" * tag 'nvme-6.15-2025-04-10' of git://git.infradead.org/nvme: nvmet-fc: put ref when assoc->del_work is already scheduled nvmet-fc: take tgtport reference only once nvmet-fc: update tgtport ref per assoc nvmet-fc: inline nvmet_fc_free_hostport nvmet-fc: inline nvmet_fc_delete_assoc nvmet-fcloop: add ref counting to lport nvmet-fcloop: replace kref with refcount nvmet-fcloop: swap list_add_tail arguments nvme-tcp: fix use-after-free of netns by kernel TCP socket. nvme: multipath: fix return value of nvme_available_path nvme: re-read ANA log page after ns scan completes nvme: requeue namespace scan on missed AENs
2025-04-09ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd *Caleb Sander Mateos
The ublk_ctrl_*() handlers all take struct io_uring_cmd *cmd but only use it to get struct ublksrv_ctrl_cmd *header from the io_uring SQE. Since the caller ublk_ctrl_uring_cmd() has already computed header, pass it instead of cmd. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250409012928.3527198-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-09ublk: don't fail request for recovery & reissue in case of ubq->cancelingMing Lei
ubq->canceling is set with request queue quiesced when io_uring context is exiting. USER_RECOVERY or !RECOVERY_FAIL_IO requires request to be re-queued and re-dispatch after device is recovered. However commit d796cea7b9f3 ("ublk: implement ->queue_rqs()") still may fail any request in case of ubq->canceling, this way breaks USER_RECOVERY or !RECOVERY_FAIL_IO. Fix it by calling __ublk_abort_rq() in case of ubq->canceling. Reviewed-by: Uday Shankar <ushankar@purestorage.com> Reported-by: Uday Shankar <ushankar@purestorage.com> Closes: https://lore.kernel.org/linux-block/Z%2FQkkTRHfRxtN%2FmB@dev-ushankar.dev.purestorage.com/ Fixes: d796cea7b9f3 ("ublk: implement ->queue_rqs()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250409011444.2142010-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-09ublk: fix handling recovery & reissue in ublk_abort_queue()Ming Lei
Commit 8284066946e6 ("ublk: grab request reference when the request is handled by userspace") doesn't grab request reference in case of recovery reissue. Then the request can be requeued & re-dispatch & failed when canceling uring command. If it is one zc request, the request can be freed before io_uring returns the zc buffer back, then cause kernel panic: [ 126.773061] BUG: kernel NULL pointer dereference, address: 00000000000000c8 [ 126.773657] #PF: supervisor read access in kernel mode [ 126.774052] #PF: error_code(0x0000) - not-present page [ 126.774455] PGD 0 P4D 0 [ 126.774698] Oops: Oops: 0000 [#1] SMP NOPTI [ 126.775034] CPU: 13 UID: 0 PID: 1612 Comm: kworker/u64:55 Not tainted 6.14.0_blk+ #182 PREEMPT(full) [ 126.775676] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 [ 126.776275] Workqueue: iou_exit io_ring_exit_work [ 126.776651] RIP: 0010:ublk_io_release+0x14/0x130 [ublk_drv] Fixes it by always grabbing request reference for aborting the request. Reported-by: Caleb Sander Mateos <csander@purestorage.com> Closes: https://lore.kernel.org/linux-block/CADUfDZodKfOGUeWrnAxcZiLT+puaZX8jDHoj_sfHZCOZwhzz6A@mail.gmail.com/ Fixes: 8284066946e6 ("ublk: grab request reference when the request is handled by userspace") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250409011444.2142010-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-09nvmet-fc: put ref when assoc->del_work is already scheduledDaniel Wagner
Do not leak the tgtport reference when the work is already scheduled. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fc: take tgtport reference only onceDaniel Wagner
The reference counting code can be simplified. Instead taking a tgtport refrerence at the beginning of nvmet_fc_alloc_hostport and put it back if not a new hostport object is allocated, only take it when a new hostport object is allocated. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fc: update tgtport ref per assocDaniel Wagner
We need to take for each unique association a reference. nvmet_fc_alloc_hostport for each newly created association. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fc: inline nvmet_fc_free_hostportDaniel Wagner
No need for this tiny helper with only one user, let's inline it. And since the hostport ref counter needs to stay in sync, it's not optional anymore to give back the reference. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fc: inline nvmet_fc_delete_assocDaniel Wagner
No need for this tiny helper with only one user, just inline it. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fcloop: add ref counting to lportDaniel Wagner
The fcloop_lport objects live time is controlled by the user interface add_local_port and del_local_port. nport, rport and tport objects are pointing to the lport objects but here is no clear tracking. Let's introduce an explicit ref counter for the lport objects and prepare the stage for restructuring how lports are used. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fcloop: replace kref with refcountDaniel Wagner
The kref wrapper is not really adding any value ontop of refcount. Thus replace the kref API with the refcount API. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fcloop: swap list_add_tail argumentsDaniel Wagner
The newly element to be added to the list is the first argument of list_add_tail. This fix is missing dcfad4ab4d67 ("nvmet-fcloop: swap the list_add_tail arguments"). Fixes: 437c0b824dbd ("nvme-fcloop: add target to host LS request support") Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvme-tcp: fix use-after-free of netns by kernel TCP socket.Kuniyuki Iwashima
Commit 1be52169c348 ("nvme-tcp: fix selinux denied when calling sock_sendmsg") converted sock_create() in nvme_tcp_alloc_queue() to sock_create_kern(). sock_create_kern() creates a kernel socket, which does not hold a reference to netns. If the code does not manage the netns lifetime properly, use-after-free could happen. Also, TCP kernel socket with sk_net_refcnt 0 has a socket leak problem: it remains FIN_WAIT_1 if it misses FIN after close() because tcp_close() stops all timers. To fix such problems, let's hold netns ref by sk_net_refcnt_upgrade(). We had the same issue in CIFS, SMC, etc, and applied the same solution, see commit ef7134c7fc48 ("smb: client: Fix use-after-free of network namespace.") and commit 9744d2bf1976 ("smc: Fix use-after-free in tcp_write_timer_handler()."). Fixes: 1be52169c348 ("nvme-tcp: fix selinux denied when calling sock_sendmsg") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-08nvme: multipath: fix return value of nvme_available_pathUday Shankar
The function returns bool so we should return false, not NULL. No functional changes are expected. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-08nvme: re-read ANA log page after ns scan completesHannes Reinecke
When scanning for new namespaces we might have missed an ANA AEN. The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchonous Event Information - Notice': Asymmetric Namespace Access Change) states: A controller shall not send this even if an Attached Namespace Attribute Changed asynchronous event [...] is sent for the same event. so we need to re-read the ANA log page after we rescanned the namespace list to update the ANA states of the new namespaces. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-07nvme: requeue namespace scan on missed AENsHannes Reinecke
Scanning for namespaces can take some time, so if the target is reconfigured while the scan is running we may miss a Attached Namespace Attribute Changed AEN. Check if the NVME_AER_NOTICE_NS_CHANGED bit is set once the scan has finished, and requeue scanning to pick up any missed change. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-06md/md-bitmap: fix stats collection for external bitmapsZheng Qixing
The bitmap_get_stats() function incorrectly returns -ENOENT for external bitmaps. Remove the external bitmap check as the statistics should be available regardless of bitmap storage location. Return -EINVAL only for invalid bitmap with no storage (neither in superblock nor in external file). Note: "bitmap_info.external" here refers to a bitmap stored in a separate file (bitmap_file), not to external metadata. Fixes: 8d28d0ddb986 ("md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250403015322.2873369-1-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-04-06md/raid10: fix missing discard IO accountingYu Kuai
md_account_bio() is not called from raid10_handle_discard(), now that we handle bitmap inside md_account_bio(), also fix missing bitmap_startwrite for discard. Test whole disk discard for 20G raid10: Before: Device d/s dMB/s drqm/s %drqm d_await dareq-sz md0 48.00 16.00 0.00 0.00 5.42 341.33 After: Device d/s dMB/s drqm/s %drqm d_await dareq-sz md0 68.00 20462.00 0.00 0.00 2.65 308133.65 Link: https://lore.kernel.org/linux-raid/20250325015746.3195035-1-yukuai1@huaweicloud.com Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org>
2025-04-03Merge tag 'block-6.15-20250403' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - PCI endpoint target cleanup (Damien) - Early import for uring_cmd fixed buffer (Caleb) - Multipath documentation and notification improvements (John) - Invalid pci sq doorbell write fix (Maurizio) - Queue init locking fix - Remove dead nsegs parameter from blk_mq_get_new_requests() * tag 'block-6.15-20250403' of git://git.kernel.dk/linux: block: don't grab elevator lock during queue initialization nvme-pci: skip nvme_write_sq_db on empty rqlist nvme-multipath: change the NVME_MULTIPATH config option nvme: update the multipath warning in nvme_init_ns_head nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io() nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request() nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer nvmet: pci-epf: Keep completion queues mapped block: remove unused nseg parameter
2025-04-03Merge tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linuxLinus Torvalds
Pull more io_uring updates from Jens Axboe: "Set of fixes/updates for io_uring that should go into this release. The ublk bits could've gone via either tree - usually I put them in block, but they got a bit mixed this series with the zero-copy supported that ended up dipping into both trees. This contains: - Fix for sendmsg zc, include in pinned pages accounting like we do for the other zc types - Series for ublk fixing request aborting, doing various little cleanups, fixing some zc issues, and adding queue_rqs support - Another ublk series doing some code cleanups - Series cleaning up the io_uring send path, mostly in preparation for registered buffers - Series doing little MSG_RING cleanups - Fix for the newly added zc rx, fixing len being 0 for the last invocation of the callback - Add vectored registered buffer support for ublk. With that, then ublk also supports this feature in the kernel revision where it could generically introduced for rw/net - A bunch of selftest additions for ublk. This is the majority of the diffstat - Silence a KCSAN data race warning for io-wq - Various little cleanups and fixes" * tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux: (44 commits) io_uring: always do atomic put from iowq selftests: ublk: enable zero copy for stripe target io_uring: support vectored kernel fixed buffer block: add for_each_mp_bvec() io_uring: add validate_fixed_range() for validate fixed buffer selftests: ublk: kublk: fix an error log line selftests: ublk: kublk: use ioctl-encoded opcodes io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0 io_uring/net: avoid import_ubuf for regvec send io_uring/rsrc: check size when importing reg buffer io_uring: cleanup {g,s]etsockopt sqe reading io_uring: hide caches sqes from drivers io_uring: make zcrx depend on CONFIG_IO_URING io_uring: add req flag invariant build assertion Documentation: ublk: remove dead footnote selftests: ublk: specify io_cmd_buf pointer type ublk: specify io_cmd_buf pointer type io_uring: don't pass ctx to tw add remote helper io_uring/msg: initialise msg request opcode io_uring/msg: rename io_double_lock_ctx() ...
2025-04-03Merge tag 'rtc-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "We see a net reduction of the number of lines of code thanks to the removal of a now unused driver and a testing tool that is not used anymore. Apart from this, the max31335 driver gets support for a new part number and pm8xxx gets UEFI support. Core: - setdate is removed as it has better replacements - skip alarms with a second resolution when we know the RTC doesn't support those. Subsystem: - remove unnecessary private struct members - use devm_pm_set_wake_irq were relevant Drivers: - ds1307: stop disabling alarms on probe for DS1337, DS1339, DS1341 and DS3231 - max31335: add max31331 support - pcf50633 is removed as support for the related SoC has been removed - pcf85063: properly handle POR failures" * tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits) rtc: remove 'setdate' test program selftest: rtc: skip some tests if the alarm only supports minutes rtc: mt6397: drop unused defines rtc: pcf85063: replace dev_err+return with return dev_err_probe rtc: pcf85063: do a SW reset if POR failed rtc: max31335: Add driver support for max31331 dt-bindings: rtc: max31335: Add max31331 support rtc: cros-ec: Avoid a couple of -Wflex-array-member-not-at-end warnings dt-bindings: rtc: pcf2127: Reference spi-peripheral-props.yaml rtc: rzn1: implement one-second accuracy for alarms rtc: pcf50633: Remove rtc: pm8xxx: implement qcom,no-alarm flag for non-HLOS owned alarm rtc: pm8xxx: mitigate flash wear rtc: pm8xxx: add support for uefi offset dt-bindings: rtc: qcom-pm8xxx: document qcom,no-alarm flag rtc: rv3032: drop WADA rtc: rv3032: fix EERD location rtc: pm8xxx: switch to devm_device_init_wakeup rtc: pm8xxx: fix possible race condition rtc: mpfs: switch to devm_device_init_wakeup ...
2025-04-03Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linuxLinus Torvalds
Pull ARM and clkdev updates from Russell King: - Simplify ARM_MMU_KEEP usage - Add Rust support for ARM architecture version 7 - Align IPIs reported in /proc/interrupts - require linker to support KEEP within OVERLAY - add KEEP() for ARM vectors - add __printf() attribute for clkdev functions * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux: ARM: 9445/1: clkdev: Mark some functions with __printf() attribute ARM: 9444/1: add KEEP() keyword to ARM_VECTORS ARM: 9443/1: Require linker to support KEEP within OVERLAY for DCE ARM: 9442/1: smp: Fix IPI alignment in /proc/interrupts ARM: 9441/1: rust: Enable Rust support for ARMv7 ARM: 9439/1: arm32: simplify ARM_MMU_KEEP usage
2025-04-02Merge tag 'firewire-updates-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 Pull firewire update from Takashi Sakamoto: "A single commit to use the common helper function for on-stack trailing array to enqueue any isochronous packet by the requests from userspace applications" * tag 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: firewire: core: avoid -Wflex-array-member-not-at-end warning
2025-04-02Merge tag 'for-6.15/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - dm-crypt: switch to using the crc32 library - dm-verity, dm-integrity, dm-crypt: documentation improvement - dm-vdo fixes - dm-stripe: enable inline crypto passthrough - dm-integrity: set ti->error on memory allocation failure - dm-bufio: remove unused return value - dm-verity: do forward error correction on metadata I/O errors - dm: fix unconditional IO throttle caused by REQ_PREFLUSH - dm cache: prevent BUG_ON by blocking retries on failed device resumes - dm cache: support shrinking the origin device - dm: restrict dm device size to 2^63-512 bytes - dm-delay: support zoned devices - dm-verity: support block number limits for different ioprio classes - dm-integrity: fix non-constant-time tag verification (security bug) - dm-verity, dm-ebs: fix prefetch-vs-suspend race * tag 'for-6.15/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (27 commits) dm-ebs: fix prefetch-vs-suspend race dm-verity: fix prefetch-vs-suspend race dm-integrity: fix non-constant-time tag verification dm-verity: support block number limits for different ioprio classes dm-delay: support zoned devices dm: restrict dm device size to 2^63-512 bytes dm cache: support shrinking the origin device dm cache: prevent BUG_ON by blocking retries on failed device resumes dm vdo indexer: reorder uds_request to reduce padding dm: fix unconditional IO throttle caused by REQ_PREFLUSH dm vdo: rework processing of loaded refcount byte arrays dm vdo: remove remaining ring references dm-verity: do forward error correction on metadata I/O errors dm-bufio: remove unused return value dm-integrity: set ti->error on memory allocation failure dm: Enable inline crypto passthrough for striped target dm vdo slab-depot: read refcount blocks in large chunks at load time dm vdo vio-pool: allow variable-sized metadata vios dm vdo vio-pool: support pools with multiple data blocks per vio dm vdo vio-pool: add a pool pointer to pooled_vio ...
2025-04-02Merge tag 'libnvdimm-for-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Ira Weiny: "Most of the code changes are to remove dead code. The bug fixes are minor, Syzkaller and one for broken devices which are unlikely to be in the field. So no need to backport them. - two patches to remove dead code: nd_attach_ndns() and nd_region_conflict() have not been used since 2017 and 2019 respectively - Fix divide-by-0 if device returns a broken LSA value - Fix Syzkaller reported bug" * tag 'libnvdimm-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: libnvdimm/labels: Fix divide error in nd_label_data_init() libnvdimm: Remove unused nd_attach_ndns libnvdimm: Remove unused nd_region_conflict acpi: nfit: fix narrowing conversion in acpi_nfit_ctl