2025-05-02ublk: don't log uring_cmd cmd_op in ublk_dispatch_req()Caleb Sander Mateos
cmd_op is either UBLK_U_IO_FETCH_REQ, UBLK_U_IO_COMMIT_AND_FETCH_REQ, or UBLK_U_IO_NEED_GET_DATA. Which one it is isn't particularly interesting, and it is already recorded by the log line in __ublk_ch_uring_cmd(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-6-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02ublk: take const ubq pointer in ublk_get_iod()Caleb Sander Mateos
ublk_get_iod() doesn't modify the struct ublk_queue it is passed. Clarify that by making the argument a const pointer. Move the function definition earlier in the file so it doesn't need a forward declaration. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02ublk: remove misleading "ubq" in "ubq_complete_io_cmd()"Caleb Sander Mateos
ubq_complete_io_cmd() doesn't interact with a ublk queue, so "ubq" in the name is confusing. Most likely "ubq" was meant to be "ublk". Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02ublk: fix "immepdately" typo in commentCaleb Sander Mateos
Looks like "immediately" was intended. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02ublk: factor out ublk_commit_and_fetchUday Shankar
Move the logic for the UBLK_IO_COMMIT_AND_FETCH_REQ opcode into its own function. This also allows us to mark ublk_queue pointers as const for that operation, which can help prevent data races since we may allow concurrent operation on one ublk_queue in the future. Also open code ublk_commit_completion in ublk_commit_and_fetch to reduce the number of parameters/avoid a redundant lookup. [Restore __ublk_ch_uring_cmd() req variable used in commit d6aa0c178bf8 ("ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA")] Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250430225234.2676781-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02block: avoid hctx spinlock for plug with multiple queuesCaleb Sander Mateos
blk_mq_flush_plug_list() has a fast path if all requests in the plug are destined for the same request_queue. It calls ->queue_rqs() with the whole batch of requests, falling back on ->queue_rq() for any requests not handled by ->queue_rqs(). However, if the requests are destined for multiple queues, blk_mq_flush_plug_list() has a slow path that calls blk_mq_dispatch_list() repeatedly to filter the requests by ctx/hctx. Each queue's requests are inserted into the hctx's dispatch list under a spinlock, then __blk_mq_sched_dispatch_requests() takes them out of the dispatch list (taking the spinlock again), and finally blk_mq_dispatch_rq_list() calls ->queue_rq() on each request. Acquiring the hctx spinlock twice and calling ->queue_rq() instead of ->queue_rqs() makes the slow path significantly more expensive. Thus, batching more requests into a single plug (e.g. io_uring_enter syscall) can counterintuitively hurt performance by causing the plug to span multiple queues. We have observed 2-3% of CPU time spent acquiring the hctx spinlock alone on workloads issuing requests to multiple NVMe devices in the same io_uring SQE batches. Add a medium path in blk_mq_flush_plug_list() for plugs that don't have elevators or come from a schedule, but do span multiple queues. Filter the requests by queue and call ->queue_rqs()/->queue_rq() on the list of requests destined to each request_queue. With this change, we no longer see any CPU time spent in _raw_spin_lock from blk_mq_flush_plug_list and throughput increases accordingly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-4-csander@purestorage.com [axboe: fix whitespace damage] Signed-off-by: Jens Axboe <axboe@kernel.dk>
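The shape of the new medium path is roughly the following sketch (simplified; the helper names approximate the blk-mq internals described above rather than quoting the actual diff, and the list is assumed non-empty on entry):

    static void blk_mq_dispatch_multiple_queues(struct rq_list *rqs, unsigned int depth)
    {
            do {
                    struct request_queue *this_q = rq_list_peek(rqs)->q;
                    struct rq_list matched = {}, rest = {};
                    struct request *rq;

                    /* split off the requests destined to this queue */
                    while ((rq = rq_list_pop(rqs)))
                            rq_list_add_tail(rq->q == this_q ? &matched : &rest, rq);

                    /* ->queue_rqs() with ->queue_rq() fallback, no hctx spinlock */
                    blk_mq_dispatch_queue_requests(&matched, depth);
                    *rqs = rest;
            } while (!rq_list_empty(rqs));
    }

Each pass peels one queue's requests off the plug list and hands them to the same batched-dispatch helper the single-queue fast path uses, which is what eliminates the per-request hctx spinlock round trips.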
2025-05-02block: factor out blk_mq_dispatch_queue_requests() helperCaleb Sander Mateos
Factor out the logic from blk_mq_flush_plug_list() that calls ->queue_rqs() with a fallback to ->queue_rq() into a helper function blk_mq_dispatch_queue_requests(). This is in preparation for using this code with other lists of requests. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02block: take rq_list instead of plug in dispatch functionsCaleb Sander Mateos
blk_mq_plug_issue_direct(), __blk_mq_flush_plug_list(), and blk_mq_dispatch_plug_list() take a struct blk_plug * but only use its mq_list. Pass the struct rq_list * instead in preparation for calling them with other lists of requests. Drop "plug" from the function names as they are no longer plug-specific. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-01Documentation: Document the new zoned loop block device driverDamien Le Moal
Introduce the zoned_loop.rst documentation file under admin-guide/blockdev to document the zoned loop block device driver. An overview of the driver is provided and its usage to create and delete zoned devices described. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250407075222.170336-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-01block: new zoned loop block device driverDamien Le Moal
The zoned loop block device driver allows a user to create emulated zoned block devices using one regular file per zone as backing storage. Compared to null_blk or scsi_debug, it has the advantage of allowing the emulation of large zoned devices without requiring the same amount of memory as the capacity of the emulated device. Furthermore, zoned devices emulated with this driver can be re-started after a host reboot without any loss of the state of the device zones, which is something that null_blk and scsi_debug do not support. This initial implementation is simple and does not support zone resource limits. That is, a zoned loop block device's limits for the maximum number of open zones and maximum number of active zones are always 0. This driver can be compiled either in-kernel or as a module, named "zloop". Compilation of this driver depends on the block layer support for zoned block devices (CONFIG_BLK_DEV_ZONED must be set). Using the zloop driver to create and delete zoned block devices is done by writing commands to the zoned loop control character device file (/dev/zloop-control). Creating a device is done with: $ echo "add [options]" > /dev/zloop-control The options available for the "add" operation can be listed by reading the zloop-control device file: $ cat /dev/zloop-control add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u remove id=%d The options available allow controlling the zoned device total capacity, zone size, zone capacity of sequential zones, total number of conventional zones, base directory for the zone backing files, number of I/O queues and the maximum queue depth of I/O queues. Deleting a device is done using the "remove" command: $ echo "remove id=0" > /dev/zloop-control This implementation passes various tests using zonefs and fio (t/zbd tests) and provides a state machine for zone conditions that is compliant with the T10 ZBC and NVMe ZNS specifications. Co-developed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250407075222.170336-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
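As a usage illustration, the same "add" command can also be issued programmatically. A minimal sketch in C; the option values and the resulting /dev/zloop0 node name are examples inferred from the interface above, not taken from the patch:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* example values: 1 GiB device with 64 MiB zones */
            const char *cmd =
                    "add id=0,capacity_mb=1024,zone_size_mb=64,base_dir=/var/local/zloop";
            int fd = open("/dev/zloop-control", O_WRONLY);

            if (fd < 0 || write(fd, cmd, strlen(cmd)) < 0) {
                    perror("zloop-control");
                    return 1;
            }
            close(fd);
            return 0;       /* the device should now appear as /dev/zloop0 */
    }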
2025-05-01Merge tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme into block-6.15Jens Axboe
Pull NVMe fixes from Christoph: "nvme fixes for Linux 6.15 - fix queue unquiesce check on PCI slot_reset (Keith Busch) - fix premature queue removal and I/O failover in nvme-tcp (Michael Liang) - don't restore null sk_state_change (Alistair Francis) - select CONFIG_TLS where needed (Alistair Francis) - always free derived key data (Hannes Reinecke) - more quirks (Wentao Guan)" * tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme: nvmet-auth: always free derived key data nvmet-tcp: don't restore null sk_state_change nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS nvme-tcp: fix premature queue removal and I/O failover nvme-pci: add quirks for WDC Blue SN550 15b7:5009 nvme-pci: add quirks for device 126f:1001 nvme-pci: fix queue unquiesce check on slot_reset
2025-04-30nvmet-auth: always free derived key dataHannes Reinecke
After calling nvme_auth_derive_tls_psk() we need to free the resulting psk data, as either TLS is disabled (and we don't need the data anyway) or the psk data has been copied into the resulting key (and can be freed, too). Fixes: fa2e0f8bbc68 ("nvmet-tcp: support secure channel concatenation") Reported-by: Yi Zhang <yi.zhang@redhat.com> Suggested-by: Maurizio Lombardi <mlombard@bsdbackstore.eu> Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-30nvmet-tcp: don't restore null sk_state_changeAlistair Francis
queue->state_change is set as part of nvmet_tcp_set_queue_sock(), but if the TCP connection isn't established when nvmet_tcp_set_queue_sock() is called then queue->state_change isn't set and sock->sk->sk_state_change isn't replaced. As such we don't need to restore sock->sk->sk_state_change if queue->state_change is NULL. This avoids NULL pointer dereferences such as this: [ 286.462026][ C0] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 286.462814][ C0] #PF: supervisor instruction fetch in kernel mode [ 286.463796][ C0] #PF: error_code(0x0010) - not-present page [ 286.464392][ C0] PGD 8000000140620067 P4D 8000000140620067 PUD 114201067 PMD 0 [ 286.465086][ C0] Oops: Oops: 0010 [#1] SMP KASAN PTI [ 286.465559][ C0] CPU: 0 UID: 0 PID: 1628 Comm: nvme Not tainted 6.15.0-rc2+ #11 PREEMPT(voluntary) [ 286.466393][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 [ 286.467147][ C0] RIP: 0010:0x0 [ 286.467420][ C0] Code: Unable to access opcode bytes at 0xffffffffffffffd6. [ 286.467977][ C0] RSP: 0018:ffff8883ae008580 EFLAGS: 00010246 [ 286.468425][ C0] RAX: 0000000000000000 RBX: ffff88813fd34100 RCX: ffffffffa386cc43 [ 286.469019][ C0] RDX: 1ffff11027fa68b6 RSI: 0000000000000008 RDI: ffff88813fd34100 [ 286.469545][ C0] RBP: ffff88813fd34160 R08: 0000000000000000 R09: ffffed1027fa682c [ 286.470072][ C0] R10: ffff88813fd34167 R11: 0000000000000000 R12: ffff88813fd344c3 [ 286.470585][ C0] R13: ffff88813fd34112 R14: ffff88813fd34aec R15: ffff888132cdd268 [ 286.471070][ C0] FS: 00007fe3c04c7d80(0000) GS:ffff88840743f000(0000) knlGS:0000000000000000 [ 286.471644][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 286.472543][ C0] CR2: ffffffffffffffd6 CR3: 000000012daca000 CR4: 00000000000006f0 [ 286.473500][ C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 286.474467][ C0] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 [ 286.475453][ C0] Call Trace: [ 286.476102][ C0] <IRQ> [ 286.476719][ C0] tcp_fin+0x2bb/0x440 [ 286.477429][ C0] tcp_data_queue+0x190f/0x4e60 [ 286.478174][ C0] ? __build_skb_around+0x234/0x330 [ 286.478940][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.479659][ C0] ? __pfx_tcp_data_queue+0x10/0x10 [ 286.480431][ C0] ? tcp_try_undo_loss+0x640/0x6c0 [ 286.481196][ C0] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90 [ 286.482046][ C0] ? kvm_clock_get_cycles+0x14/0x30 [ 286.482769][ C0] ? ktime_get+0x66/0x150 [ 286.483433][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.484146][ C0] tcp_rcv_established+0x6e4/0x2050 [ 286.484857][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.485523][ C0] ? ipv4_dst_check+0x160/0x2b0 [ 286.486203][ C0] ? __pfx_tcp_rcv_established+0x10/0x10 [ 286.486917][ C0] ? lock_release+0x217/0x2c0 [ 286.487595][ C0] tcp_v4_do_rcv+0x4d6/0x9b0 [ 286.488279][ C0] tcp_v4_rcv+0x2af8/0x3e30 [ 286.488904][ C0] ? raw_local_deliver+0x51b/0xad0 [ 286.489551][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.490198][ C0] ? __pfx_tcp_v4_rcv+0x10/0x10 [ 286.490813][ C0] ? __pfx_raw_local_deliver+0x10/0x10 [ 286.491487][ C0] ? __pfx_nf_confirm+0x10/0x10 [nf_conntrack] [ 286.492275][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.492900][ C0] ip_protocol_deliver_rcu+0x8f/0x370 [ 286.493579][ C0] ip_local_deliver_finish+0x297/0x420 [ 286.494268][ C0] ip_local_deliver+0x168/0x430 [ 286.494867][ C0] ? __pfx_ip_local_deliver+0x10/0x10 [ 286.495498][ C0] ? __pfx_ip_local_deliver_finish+0x10/0x10 [ 286.496204][ C0] ? ip_rcv_finish_core+0x19a/0x1f20 [ 286.496806][ C0] ? 
lock_release+0x217/0x2c0 [ 286.497414][ C0] ip_rcv+0x455/0x6e0 [ 286.497945][ C0] ? __pfx_ip_rcv+0x10/0x10 [ 286.498550][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.499137][ C0] ? __pfx_ip_rcv_finish+0x10/0x10 [ 286.499763][ C0] ? lock_release+0x217/0x2c0 [ 286.500327][ C0] ? dl_scaled_delta_exec+0xd1/0x2c0 [ 286.500922][ C0] ? __pfx_ip_rcv+0x10/0x10 [ 286.501480][ C0] __netif_receive_skb_one_core+0x166/0x1b0 [ 286.502173][ C0] ? __pfx___netif_receive_skb_one_core+0x10/0x10 [ 286.502903][ C0] ? lock_acquire+0x2b2/0x310 [ 286.503487][ C0] ? process_backlog+0x372/0x1350 [ 286.504087][ C0] ? lock_release+0x217/0x2c0 [ 286.504642][ C0] process_backlog+0x3b9/0x1350 [ 286.505214][ C0] ? process_backlog+0x372/0x1350 [ 286.505779][ C0] __napi_poll.constprop.0+0xa6/0x490 [ 286.506363][ C0] net_rx_action+0x92e/0xe10 [ 286.506889][ C0] ? __pfx_net_rx_action+0x10/0x10 [ 286.507437][ C0] ? timerqueue_add+0x1f0/0x320 [ 286.507977][ C0] ? sched_clock_cpu+0x68/0x540 [ 286.508492][ C0] ? lock_acquire+0x2b2/0x310 [ 286.509043][ C0] ? kvm_sched_clock_read+0xd/0x20 [ 286.509607][ C0] ? handle_softirqs+0x1aa/0x7d0 [ 286.510187][ C0] handle_softirqs+0x1f2/0x7d0 [ 286.510754][ C0] ? __pfx_handle_softirqs+0x10/0x10 [ 286.511348][ C0] ? irqtime_account_irq+0x181/0x290 [ 286.511937][ C0] ? __dev_queue_xmit+0x85d/0x3450 [ 286.512510][ C0] do_softirq.part.0+0x89/0xc0 [ 286.513100][ C0] </IRQ> [ 286.513548][ C0] <TASK> [ 286.513953][ C0] __local_bh_enable_ip+0x112/0x140 [ 286.514522][ C0] ? __dev_queue_xmit+0x85d/0x3450 [ 286.515072][ C0] __dev_queue_xmit+0x872/0x3450 [ 286.515619][ C0] ? nft_do_chain+0xe16/0x15b0 [nf_tables] [ 286.516252][ C0] ? __pfx___dev_queue_xmit+0x10/0x10 [ 286.516817][ C0] ? selinux_ip_postroute+0x43c/0xc50 [ 286.517433][ C0] ? __pfx_selinux_ip_postroute+0x10/0x10 [ 286.518061][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.518606][ C0] ? ip_output+0x164/0x4a0 [ 286.519149][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.519671][ C0] ? ip_finish_output2+0x17d5/0x1fb0 [ 286.520258][ C0] ip_finish_output2+0xb4b/0x1fb0 [ 286.520787][ C0] ? __pfx_ip_finish_output2+0x10/0x10 [ 286.521355][ C0] ? __ip_finish_output+0x15d/0x750 [ 286.521890][ C0] ip_output+0x164/0x4a0 [ 286.522372][ C0] ? __pfx_ip_output+0x10/0x10 [ 286.522872][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.523402][ C0] ? _raw_spin_unlock_irqrestore+0x4c/0x60 [ 286.524031][ C0] ? __pfx_ip_finish_output+0x10/0x10 [ 286.524605][ C0] ? __ip_queue_xmit+0x999/0x2260 [ 286.525200][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.525744][ C0] ? ipv4_dst_check+0x16a/0x2b0 [ 286.526279][ C0] ? lock_release+0x217/0x2c0 [ 286.526793][ C0] __ip_queue_xmit+0x1883/0x2260 [ 286.527324][ C0] ? __skb_clone+0x54c/0x730 [ 286.527827][ C0] __tcp_transmit_skb+0x209b/0x37a0 [ 286.528374][ C0] ? __pfx___tcp_transmit_skb+0x10/0x10 [ 286.528952][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.529472][ C0] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90 [ 286.530152][ C0] ? trace_hardirqs_on+0x12/0x120 [ 286.530691][ C0] tcp_write_xmit+0xb81/0x88b0 [ 286.531224][ C0] ? mod_memcg_state+0x4d/0x60 [ 286.531736][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.532253][ C0] __tcp_push_pending_frames+0x90/0x320 [ 286.532826][ C0] tcp_send_fin+0x141/0xb50 [ 286.533352][ C0] ? __pfx_tcp_send_fin+0x10/0x10 [ 286.533908][ C0] ? __local_bh_enable_ip+0xab/0x140 [ 286.534495][ C0] inet_shutdown+0x243/0x320 [ 286.535077][ C0] nvme_tcp_alloc_queue+0xb3b/0x2590 [nvme_tcp] [ 286.535709][ C0] ? do_raw_spin_lock+0x129/0x260 [ 286.536314][ C0] ? __pfx_nvme_tcp_alloc_queue+0x10/0x10 [nvme_tcp] [ 286.536996][ C0] ? 
do_raw_spin_unlock+0x54/0x1e0 [ 286.537550][ C0] ? _raw_spin_unlock+0x29/0x50 [ 286.538127][ C0] ? do_raw_spin_lock+0x129/0x260 [ 286.538664][ C0] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 286.539249][ C0] ? nvme_tcp_alloc_admin_queue+0xd5/0x340 [nvme_tcp] [ 286.539892][ C0] ? __wake_up+0x40/0x60 [ 286.540392][ C0] nvme_tcp_alloc_admin_queue+0xd5/0x340 [nvme_tcp] [ 286.541047][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.541589][ C0] nvme_tcp_setup_ctrl+0x8b/0x7a0 [nvme_tcp] [ 286.542254][ C0] ? _raw_spin_unlock_irqrestore+0x4c/0x60 [ 286.542887][ C0] ? __pfx_nvme_tcp_setup_ctrl+0x10/0x10 [nvme_tcp] [ 286.543568][ C0] ? trace_hardirqs_on+0x12/0x120 [ 286.544166][ C0] ? _raw_spin_unlock_irqrestore+0x35/0x60 [ 286.544792][ C0] ? nvme_change_ctrl_state+0x196/0x2e0 [nvme_core] [ 286.545477][ C0] nvme_tcp_create_ctrl+0x839/0xb90 [nvme_tcp] [ 286.546126][ C0] nvmf_dev_write+0x3db/0x7e0 [nvme_fabrics] [ 286.546775][ C0] ? rw_verify_area+0x69/0x520 [ 286.547334][ C0] vfs_write+0x218/0xe90 [ 286.547854][ C0] ? do_syscall_64+0x9f/0x190 [ 286.548408][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120 [ 286.549037][ C0] ? syscall_exit_to_user_mode+0x93/0x280 [ 286.549659][ C0] ? __pfx_vfs_write+0x10/0x10 [ 286.550259][ C0] ? do_syscall_64+0x9f/0x190 [ 286.550840][ C0] ? syscall_exit_to_user_mode+0x8e/0x280 [ 286.551516][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120 [ 286.552180][ C0] ? syscall_exit_to_user_mode+0x93/0x280 [ 286.552834][ C0] ? ksys_read+0xf5/0x1c0 [ 286.553386][ C0] ? __pfx_ksys_read+0x10/0x10 [ 286.553964][ C0] ksys_write+0xf5/0x1c0 [ 286.554499][ C0] ? __pfx_ksys_write+0x10/0x10 [ 286.555072][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120 [ 286.555698][ C0] ? syscall_exit_to_user_mode+0x93/0x280 [ 286.556319][ C0] ? do_syscall_64+0x54/0x190 [ 286.556866][ C0] do_syscall_64+0x93/0x190 [ 286.557420][ C0] ? rcu_read_unlock+0x17/0x60 [ 286.557986][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.558526][ C0] ? lock_release+0x217/0x2c0 [ 286.559087][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.559659][ C0] ? count_memcg_events.constprop.0+0x4a/0x60 [ 286.560476][ C0] ? exc_page_fault+0x7a/0x110 [ 286.561064][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.561647][ C0] ? lock_release+0x217/0x2c0 [ 286.562257][ C0] ? do_user_addr_fault+0x171/0xa00 [ 286.562839][ C0] ? do_user_addr_fault+0x4a2/0xa00 [ 286.563453][ C0] ? irqentry_exit_to_user_mode+0x84/0x270 [ 286.564112][ C0] ? rcu_is_watching+0x11/0xb0 [ 286.564677][ C0] ? irqentry_exit_to_user_mode+0x84/0x270 [ 286.565317][ C0] ? 
trace_hardirqs_on_prepare+0xdb/0x120 [ 286.565922][ C0] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 286.566542][ C0] RIP: 0033:0x7fe3c05e6504 [ 286.567102][ C0] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 [ 286.568931][ C0] RSP: 002b:00007fff76444f58 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 286.569807][ C0] RAX: ffffffffffffffda RBX: 000000003b40d930 RCX: 00007fe3c05e6504 [ 286.570621][ C0] RDX: 00000000000000cf RSI: 000000003b40d930 RDI: 0000000000000003 [ 286.571443][ C0] RBP: 0000000000000003 R08: 00000000000000cf R09: 000000003b40d930 [ 286.572246][ C0] R10: 0000000000000000 R11: 0000000000000202 R12: 000000003b40cd60 [ 286.573069][ C0] R13: 00000000000000cf R14: 00007fe3c07417f8 R15: 00007fe3c073502e [ 286.573886][ C0] </TASK> Closes: https://lore.kernel.org/linux-nvme/5hdonndzoqa265oq3bj6iarwtfk5dewxxjtbjvn5uqnwclpwt6@a2n6w3taxxex/ Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-30nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLSAlistair Francis
Ensure that TLS support is enabled in the kernel when CONFIG_NVME_TARGET_TCP_TLS is enabled. Without this the code compiles, but does not actually work unless something else enables CONFIG_TLS. Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall") Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-30nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLSAlistair Francis
Ensure that TLS support is enabled in the kernel when CONFIG_NVME_TCP_TLS is enabled. Without this the code compiles, but does not actually work unless something else enables CONFIG_TLS. Fixes: be8e82caa68 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
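Both of these fixes (target and host side) amount to the same one-line Kconfig change. Sketched below for the host option; the surrounding lines are reproduced from memory and may not match the tree exactly:

    config NVME_TCP_TLS
            bool "NVMe over Fabrics TCP TLS encryption support"
            depends on NVME_TCP
            select NVME_KEYRING
            select TLS              # the fix: pull in the TLS record layer
            help
              Enables TLS encryption for NVMe TCP using the netlink handshake API.

Without the select, the TLS upcall code compiles but the kernel's TLS record layer may simply be absent at runtime.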
2025-04-30nvme-tcp: fix premature queue removal and I/O failoverMichael Liang
This patch addresses a data corruption issue observed in nvme-tcp during testing. In an NVMe native multipath setup, when an I/O timeout occurs, all inflight I/Os are canceled almost immediately after the kernel socket is shut down. These canceled I/Os are reported as host path errors, triggering a failover that succeeds on a different path. However, at this point, the original I/O may still be outstanding in the host's network transmission path (e.g., the NIC's TX queue). From the user-space application's perspective, the buffer associated with the I/O is considered completed, since the I/O has been acked on the different path, and may be reused for new I/O requests. Because nvme-tcp enables zero-copy by default in the transmission path, this can lead to corrupted data being sent to the original target, ultimately causing data corruption. We can reproduce this data corruption by injecting a delay on one path and triggering an I/O timeout. To prevent this issue, this change ensures that all inflight transmissions are fully completed from the host's perspective before returning from queue stop. To handle concurrent I/O timeouts from multiple namespaces under the same controller, always wait in queue stop regardless of the queue's state. This aligns with the behavior of queue stopping in other NVMe fabric transports. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Michael Liang <mliang@purestorage.com> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Randy Jennings <randyj@purestorage.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-29nvme-pci: add quirks for WDC Blue SN550 15b7:5009Wentao Guan
Add two quirks for the WDC Blue SN550 (PCI ID 15b7:5009) based on user reports and hardware analysis: - NVME_QUIRK_NO_DEEPEST_PS: liaozw reported the problem and worked around it with nvme_core.default_ps_max_latency_us=0, so add the quirk. I also found some reports in the links below. - NVME_QUIRK_BROKEN_MSI: after getting the lspci output from Jack Rio, I think the disk also needs NVME_QUIRK_BROKEN_MSI, as described in commit d5887dc6b6c0 ("nvme-pci: Add quirk for broken MSIs"); as Sean said in the linked thread, it matches the MSI 1/32 and MSI-X 17 signature. Log: lspci -nn | grep -i memory 03:00.0 Non-Volatile memory controller [0108]: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) [15b7:5009] (rev 01) lspci -v -d 15b7:5009 03:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express]) Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD Flags: bus master, fast devsel, latency 0, IRQ 35, IOMMU group 10 Memory at fe800000 (64-bit, non-prefetchable) [size=16K] Memory at fe804000 (64-bit, non-prefetchable) [size=256] Capabilities: [80] Power Management version 3 Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+ Capabilities: [b0] MSI-X: Enable+ Count=17 Masked- Capabilities: [c0] Express Endpoint, MSI 00 Capabilities: [100] Advanced Error Reporting Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00 Capabilities: [1b8] Latency Tolerance Reporting Capabilities: [300] Secondary PCI Express Capabilities: [900] L1 PM Substates Kernel driver in use: nvme dmesg | grep nvme [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE= [ 0.059301] Kernel command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE= [ 0.542430] nvme nvme0: pci function 0000:03:00.0 [ 0.560426] nvme nvme0: allocated 32 MiB host memory buffer. [ 0.562491] nvme nvme0: 16/0/0 default/read/poll queues [ 0.567764] nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9 [ 6.388726] EXT4-fs (nvme0n1p7): mounted filesystem ro with ordered data mode. Quota mode: none. [ 6.893421] EXT4-fs (nvme0n1p7): re-mounted r/w. Quota mode: none. [ 7.125419] Adding 16777212k swap on /dev/nvme0n1p8. Priority:-2 extents:1 across:16777212k SS [ 7.157588] EXT4-fs (nvme0n1p6): mounted filesystem r/w with ordered data mode. Quota mode: none. [ 7.165021] EXT4-fs (nvme0n1p9): mounted filesystem r/w with ordered data mode. Quota mode: none. [ 8.036932] nvme nvme0: using unchecked data buffer [ 8.096023] block nvme0n1: No UUID available providing old NGUID Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5887dc6b6c054d0da3cd053afc15b7be1f45ff6 Link: https://lore.kernel.org/all/20240422162822.3539156-1-sean.anderson@linux.dev/ Reported-by: liaozw <hedgehog-002@163.com> Closes: https://bbs.deepin.org.cn/post/286300 Reported-by: rugk <rugk+github@posteo.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=208123 Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-29nvme-pci: add quirks for device 126f:1001Wentao Guan
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for device [126f:1001]. It is similar to commit e89086c43f05 ("drivers/nvme: Add quirks for device 126f:2262"). One difference, justified by the dmesg output below, is the use of NVME_QUIRK_IGNORE_DEV_SUBNQN. dmesg | grep -i nvme0: nvme nvme0: pci function 0000:01:00.0 nvme nvme0: missing or invalid SUBNQN field. nvme nvme0: 12/0/0 default/read/poll queues Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e89086c43f0500bc7c4ce225495b73b8ce234c1f Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
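Both quirk commits boil down to new match entries in nvme-pci's PCI ID table. A sketch of the SN550 entry in the usual table style (the 126f:1001 entry is analogous, with the flag set described above):

    static const struct pci_device_id nvme_id_table[] = {
            /* ... existing entries ... */
            { PCI_DEVICE(0x15b7, 0x5009),   /* WDC Blue SN550 */
                    .driver_data = NVME_QUIRK_NO_DEEPEST_PS |
                                   NVME_QUIRK_BROKEN_MSI, },
            /* ... */
    };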
2025-04-29nvme-pci: fix queue unquiesce check on slot_resetKeith Busch
A zero return means the reset was successfully scheduled. We don't want to unquiesce the queues while the reset_work is pending, as that will just flush out requeued requests to a failed completion. Fixes: 71a5bb153be104 ("nvme: ensure disabling pairs with unquiesce") Reported-by: Dhankaran Singh Ajravat <dhankaran@meta.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-29ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_reqMing Lei
__ublk_check_and_get_req() is only called from ublk_check_and_get_req() and ublk_register_io_buf(), and the same check is already covered at the two call sites. So remove the check from __ublk_check_and_get_req(). Suggested-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250429022941.1718671-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-29ublk: enhance check for register/unregister io buffer commandMing Lei
A simple check of UBLK_IO_FLAG_OWNED_BY_SRV can easily catch incorrect io buffer register/unregister requests, so check it before starting to register/unregister the io buffer. Also, only allow the io buffer register/unregister uring_cmds in case of UBLK_F_SUPPORT_ZERO_COPY, and mark the 'ublk_queue *' argument of ublk_register_io_buf as const. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250429022941.1718671-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-29ublk: decouple zero copy from user copyMing Lei
UBLK_F_USER_COPY and UBLK_F_SUPPORT_ZERO_COPY are two different features, and shouldn't be coupled together. Commit 1f6540e2aabb ("ublk: zc register/unregister bvec") enables user copy automatically in case of UBLK_F_SUPPORT_ZERO_COPY, which isn't correct. So decouple zero copy from user copy, and use an independent helper to check each one. Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250429022941.1718671-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
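The decoupling suggests one helper per feature flag, roughly as below (a sketch; the real helper names in drivers/block/ublk_drv.c may differ):

    static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
    {
            return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY;
    }

    static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
    {
            return ubq->flags & UBLK_F_USER_COPY;
    }

    /* the kernel copies I/O data itself only when neither feature is set */
    static inline bool ublk_need_map_io(const struct ublk_queue *ubq)
    {
            return !ublk_support_user_copy(ubq) && !ublk_support_zero_copy(ubq);
    }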
2025-04-29selftests: ublk: fix UBLK_F_NEED_GET_DATAMing Lei
Commit 57e13a2e8cd2 ("selftests: ublk: support user recovery") starts to support UBLK_F_NEED_GET_DATA for covering the recovery feature; however, the ublk utility implementation isn't done correctly. Fix it by supporting UBLK_F_NEED_GET_DATA correctly. Also add test generic_07 to cover UBLK_F_NEED_GET_DATA. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 57e13a2e8cd2 ("selftests: ublk: support user recovery") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250429022941.1718671-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28brd: use memcpy_{to,from}_page in brd_rw_bvecChristoph Hellwig
Use the proper helpers to copy to/from potential highmem pages, which do a local instead of an atomic kmap underneath and perform flush_dcache_page where needed. This also simplifies the code so much that the separate read/write helpers are no longer required. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
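Combined with the page-boundary split from the next entry below, the copy path collapses to something like this sketch (not the literal diff; brd_lookup_page() stands in for brd's backing-page lookup):

    static void brd_rw_bvec(struct brd_device *brd, struct bio *bio)
    {
            struct bio_vec bv = bio_iter_iovec(bio, bio->bi_iter);
            sector_t sector = bio->bi_iter.bi_sector;
            u32 offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
            struct page *page = brd_lookup_page(brd, sector);
            void *kaddr = kmap_local_page(page) + offset;

            if (op_is_write(bio_op(bio)))
                    /* pull data out of the bio's (possibly highmem) page */
                    memcpy_from_page(kaddr, bv.bv_page, bv.bv_offset, bv.bv_len);
            else
                    /* push data into the bio's page; flush_dcache_page is done for us */
                    memcpy_to_page(bv.bv_page, bv.bv_offset, kaddr, bv.bv_len);
            kunmap_local(kaddr);
    }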
2025-04-28brd: split I/O at page boundariesChristoph Hellwig
A lot of complexity in brd stems from the fact that it tries to handle I/O spanning two backing pages. Instead limit the size of a single bvec iteration so that it never crosses a page boundary and remove all the now unneeded code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28brd: use bvec_kmap_local in brd_do_bvecChristoph Hellwig
Use the proper helper to kmap a bvec in brd_do_bvec instead of directly accessing the bvec fields and use the deprecated kmap_atomic API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28brd: remove the sector variable in brd_submit_bioChristoph Hellwig
The bvec iter iterates over the sector already, no need to duplicate the work. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28brd: pass a bvec pointer to brd_do_bvecChristoph Hellwig
Pass the bvec to brd_do_bvec instead of marshalling the information into individual arguments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250428141014.2360063-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24Merge branch 'block-6.15' into for-6.16/blockJens Axboe
Merge 6.15 block fixes - both to get the fixes causing issues with XFS testing, but also to make it easier for 6.16 ublk patches to avoid conflicts. * block-6.15: ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA block: don't autoload drivers on blk-cgroup configuration block: don't autoload drivers on stat block: remove the backing_inode variable in bdev_statx block: move blkdev_{get,put} _no_open prototypes out of blkdev.h block: never reduce ra_pages in blk_apply_bdi_limits selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx' selftests: ublk: fix recover test block: hoist block size validation code to a separate function block: fix race between set_blocksize and read paths nvmet: fix out-of-bounds access in nvmet_enable_port
2025-04-24ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmdMing Lei
ublk_cancel_cmd() calls io_uring_cmd_done() to complete the uring_cmd, but we may have scheduled task work via io_uring_cmd_complete_in_task() for dispatching the request, and then a kernel crash can be triggered. Fix it by not trying to cancel the command if the ublk block request has been started. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Reported-by: Jared Holzman <jholzman@nvidia.com> Tested-by: Jared Holzman <jholzman@nvidia.com> Closes: https://lore.kernel.org/linux-block/d2179120-171b-47ba-b664-23242981ef19@nvidia.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATAMing Lei
We call io_uring_cmd_complete_in_task() to schedule task_work for handling UBLK_U_IO_NEED_GET_DATA. This is not necessary because the current context is already the ublk queue context, so call ublk_dispatch_req() directly to handle UBLK_U_IO_NEED_GET_DATA. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Tested-by: Jared Holzman <jholzman@nvidia.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: don't autoload drivers on blk-cgroup configurationChristoph Hellwig
Loading a driver just to configure blk-cgroup doesn't make sense, as that assumes an already existing device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: don't autoload drivers on statChristoph Hellwig
blkdev_get_no_open can trigger the legacy autoload of block drivers. A simple stat of a block device has not historically done that, so disable this behavior again. Fixes: 9abcfbd235f5 ("block: Add atomic write support for statx") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: remove the backing_inode variable in bdev_statxChristoph Hellwig
backing_inode is only used once, so remove it and update the comment describing the bdev lookup to be a bit more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: move blkdev_{get,put} _no_open prototypes out of blkdev.hChristoph Hellwig
These are only to be used by block internal code. Remove the comment as we grew more users due to reworking block device node opening. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24block: never reduce ra_pages in blk_apply_bdi_limitsChristoph Hellwig
When the user increases the read-ahead size through sysfs, this value currently gets lost if the device is reprobed, including on a resume from suspend. As there is no hardware limitation for the read-ahead size, there is no real need to reset it or track a separate hardware limitation like for max_sectors. This restores the pre-atomic-queue-limits behavior in the sd driver, as sd did not use blk_queue_io_opt and thus never updated the read-ahead size to the value based on the optimal I/O size, but it changes behavior for all other drivers. As the new behavior seems useful and sd is the driver for which the read-ahead size tweaks are most useful, that seems like a worthwhile trade-off. Fixes: 804e498e0496 ("sd: convert to the atomic queue limits API") Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutilsUday Shankar
Some distributions, such as centos stream 9, still have a version of coreutils which does not yet support the %Hr and %Lr formats for stat(1) [1, 2]. Running ublk selftests on these distributions results in the following error in tests that use the _get_disk_dev_t helper: line 23: ?r: syntax error: operand expected (error token is "?r") To better accommodate older distributions, rewrite _get_disk_dev_t to use the much older %t and %T formats for stat instead. [1] https://github.com/coreutils/coreutils/blob/v9.0/NEWS#L114 [2] https://pkgs.org/download/coreutils Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250423-ublk_selftests-v1-2-7d060e260e76@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24Merge tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme into block-6.15Jens Axboe
Pull NVMe fix from Christoph: "nvme fixes for Linux 6.15 - fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger)" * tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme: nvmet: fix out-of-bounds access in nvmet_enable_port
2025-04-23selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'Ming Lei
'delay_us' shouldn't be in 'struct dev_ctx' since it is now handled by the per-target command line & 'struct fault_inject_ctx'. So remove it. Fixes: 81586652bb1f ("selftests: ublk: add generic_06 for covering fault inject") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250421235947.715272-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-23selftests: ublk: fix recover testMing Lei
When adding the recovery test: - a 'break' was missing when handling the '-g' argument - the test name in test_generic_05.sh was wrong Fix both. Fixes: 57e13a2e8cd2 ("selftests: ublk: support user recovery") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250421235947.715272-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-23block: hoist block size validation code to a separate functionDarrick J. Wong
Hoist the block size validation code to bdev_validate_blocksize so that we can call it from filesystems that don't care about the bdev pagecache manipulations of set_blocksize. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>
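A sketch of what the hoisted helper plausibly looks like (the validation rules are inferred from long-standing set_blocksize behavior, not copied from the patch):

    /* validate a block size without touching the bdev pagecache */
    int bdev_validate_blocksize(struct block_device *bdev, int block_size)
    {
            /* must be a power of two in the supported range */
            if (blk_validate_block_size(block_size))
                    return -EINVAL;
            /* cannot be smaller than what the device can address */
            if (block_size < bdev_logical_block_size(bdev))
                    return -EINVAL;
            return 0;
    }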
2025-04-23block: fix race between set_blocksize and read pathsDarrick J. Wong
With the new large sector size support, it's now the case that set_blocksize can change i_blksize and the folio order in a manner that conflicts with a concurrent reader and causes a kernel crash. Specifically, let's say that udev-worker calls libblkid to detect the labels on a block device. The read call can create an order-0 folio to read the first 4096 bytes from the disk. But then udev is preempted. Next, someone tries to mount an 8k-sectorsize filesystem from the same block device. The filesystem calls set_blocksize, which sets i_blksize to 8192 and the minimum folio order to 1. Now udev resumes, still holding the order-0 folio it allocated. It then tries to schedule a read bio and do_mpage_readahead tries to create bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because the page size is 4096 but the blocksize is 8192 so no bufferheads are attached and the bh walk never sets bdev. We then submit the bio with a NULL block device and crash. Therefore, truncate the page cache after flushing but before updating i_blksize. However, that's not enough -- we also need to lock out file IO and page faults during the update. Take both the i_rwsem and the invalidate_lock in exclusive mode for invalidations, and in shared mode for read/write operations. I don't know if this is the correct fix, but xfs/259 found it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>
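The ordering the fix describes, sketched (simplified from the message above, not the literal diff):

    /* lock out file I/O and page faults while changing the block size */
    inode_lock(inode);                      /* excludes read()/write() */
    filemap_invalidate_lock(mapping);       /* excludes page faults */

    sync_blockdev(bdev);                    /* flush dirty pages first... */
    kill_bdev(bdev);                        /* ...then truncate the page cache */
    inode->i_blkbits = blksize_bits(size);  /* only now change the block size */

    filemap_invalidate_unlock(mapping);
    inode_unlock(inode);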
2025-04-22ublk: remove unnecessary ubq checksCaleb Sander Mateos
ublk_init_queues() ensures that all nr_hw_queues queues are initialized, with each ublk_queue's q_id set to its index. And ublk_init_queues() is called before ublk_add_chdev(), which creates the cdev. It is therefore impossible for the !ubq || ub_cmd->q_id != ubq->q_id condition to hit in __ublk_ch_uring_cmd(). Remove it to avoid some branches in the I/O path. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250416170154.3621609-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-22nvmet: fix out-of-bounds access in nvmet_enable_portRichard Weinberger
When trying to enable a port that has no transport configured yet, nvmet_enable_port() uses NVMF_TRTYPE_MAX (255) to query the transports array, causing an out-of-bounds access: [ 106.058694] BUG: KASAN: global-out-of-bounds in nvmet_enable_port+0x42/0x1da [ 106.058719] Read of size 8 at addr ffffffff89dafa58 by task ln/632 [...] [ 106.076026] nvmet: transport type 255 not supported Since commit 200adac75888, NVMF_TRTYPE_MAX is the default state as configured by nvmet_ports_make(). Avoid this by checking for NVMF_TRTYPE_MAX before proceeding. Fixes: 200adac75888 ("nvme: Add PCI transport type") Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
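The fix is essentially a guard at the top of nvmet_enable_port(); a sketch:

    /* a port with no transport configured still holds the NVMF_TRTYPE_MAX
     * default from nvmet_ports_make() and must not index the array */
    if (port->disc_addr.trtype == NVMF_TRTYPE_MAX)
            return -EINVAL;

    ops = nvmet_transports[port->disc_addr.trtype];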
2025-04-21ublk: Add UBLK_U_CMD_UPDATE_SIZEOmri Mann
Currently ublk only allows the size of the ublkb block device to be set via UBLK_CMD_SET_PARAMS before UBLK_CMD_START_DEV is triggered. This means extendable user-space block devices cannot be supported without stopping and restarting the underlying ublkb block device, causing I/O interruption. This patch adds a new ublk command UBLK_U_CMD_UPDATE_SIZE to allow the ublk block device to be resized on-the-fly. Feature flag UBLK_F_UPDATE_SIZE is also added to indicate support. Signed-off-by: Omri Mann <omri@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/2a370ab1-d85b-409d-b762-f9f3f6bdf705@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
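Handling the new command plausibly reduces to a capacity update on the live gendisk; a sketch with an assumed helper name and locking:

    /* resize the ublk block device without stopping it */
    static void ublk_ctrl_set_size(struct ublk_device *ub, u64 new_sectors)
    {
            mutex_lock(&ub->mutex);
            ub->params.basic.dev_sectors = new_sectors;
            /* updates the gendisk capacity and emits a resize uevent */
            set_capacity_and_notify(ub->ub_disk, new_sectors);
            mutex_unlock(&ub->mutex);
    }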
2025-04-21block: blk-rq-qos: guard rq-qos helpers by static keyJens Axboe
Even if blk-rq-qos isn't used or configured, dipping into the queue to fetch ->rq_qos is a noticeable slowdown and visible in profiles. Add an unlikely static key around blk-rq-qos, to avoid fetching this cacheline if blk-iolatency or blk-wbt isn't configured or used. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
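This is the usual static-branch pattern; a sketch with approximated names (the real guard covers all the rq-qos hooks, not just this one):

    DEFINE_STATIC_KEY_FALSE(block_rq_qos);  /* flipped on when a policy attaches */

    static inline void rq_qos_done(struct request_queue *q, struct request *rq)
    {
            /* a nop branch until blk-wbt or blk-iolatency is configured,
             * so the q->rq_qos cacheline is never even fetched */
            if (static_branch_unlikely(&block_rq_qos) && q->rq_qos)
                    __rq_qos_done(q->rq_qos, rq);
    }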
2025-04-21block: ensure that struct blk_mq_alloc_data is fully initializedJens Axboe
On x86, rep stos will be emitted to clear the blk_mq_alloc_data struct, as not all members are being explicitly initialized. Depending on the type of CPU, this is a noticeable slowdown compared to just ensuring that the struct is fully initialized when set up. For the 4 spots that set up a struct blk_mq_alloc_data on the stack, ensure all members are being initialized. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
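The pattern, sketched for one of the four spots (member names are partly from memory and may not match the struct exactly):

    /* naming every member lets the compiler emit individual stores
     * instead of a rep stos over the unnamed remainder */
    struct blk_mq_alloc_data data = {
            .q              = q,
            .flags          = flags,
            .shallow_depth  = 0,
            .cmd_flags      = opf,
            .rq_flags       = 0,
            .nr_tags        = 1,
            .cached_rqs     = NULL,
            .ctx            = NULL,
            .hctx           = NULL,
    };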
2025-04-21block: Simplify blk_mq_dispatch_rq_list() and its callersBart Van Assche
The 'nr_budgets' argument of blk_mq_dispatch_rq_list() is either the number of elements in the 'list' argument or zero. Instead of passing the number of list elements to blk_mq_dispatch_rq_list(), pass a boolean argument that indicates whether or not blk_mq_dispatch_rq_list() should request the block driver for a budget for each request in 'list'. Remove the code for counting list elements from blk_mq_dispatch_rq_list() callers where possible. Remove the code that decrements nr_budgets from blk_mq_dispatch_rq_list() because it is superfluous. Each request that is processed by blk_mq_dispatch_rq_list() is in one of these two states if 'get_budget' is false: * Either the request is on 'list' and the budget for the request has to be released from the error path. * Or the request is not on 'list' and q->mq_ops->queue_rq() has already released the budget (ret != BLK_STS_OK) or q->mq_ops->queue_rq() will release the budget asynchronously (ret == BLK_STS_OK). Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20250415205134.3650042-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-20gcc-15: disable '-Wunterminated-string-initialization' entirely for nowLinus Torvalds
I had left the warning around but as a non-fatal error to get my gcc-15 builds going, but fixed up some of the most annoying warning cases so that it wouldn't be *too* verbose. Because I like the _concept_ of the warning, even if I detested the implementation to shut it up. It turns out the implementation to shut it up is even more broken than I thought, and my "shut up most of the warnings" patch just caused fatal errors on gcc-14 instead. I had tested with clang, but when I upgrade my development environment, I try to do it on all machines because I hate having different systems to maintain, and hadn't realized that gcc-14 now had issues. The ACPI case is literally why I wanted to have a *type* that doesn't trigger the warning (see commit d5d45a7f2619: "gcc-15: make 'unterminated string initialization' just a warning"), instead of marking individual places as "__nonstring". But gcc-14 doesn't like that __nonstring location that shut gcc-15 up, because it's on an array of char arrays, not on one single array: drivers/acpi/tables.c:399:1: error: 'nonstring' attribute ignored on objects of type 'const char[][4]' [-Werror=attributes] 399 | static const char table_sigs[][ACPI_NAMESEG_SIZE] __initconst __nonstring = { | ^~~~~~ and my attempts to nest it properly with a type had failed, because of how gcc doesn't like marking the types as having attributes, only symbols. There may be some trick to it, but I was already annoyed by the bad attribute design, now I'm just entirely fed up with it. I wish gcc had a proper way to say "this type is a *byte* array, not a string". The obvious thing would be to distinguish between "char []" and an explicitly unsigned "unsigned char []" (as opposed to an implicitly unsigned char, which is typically an architecture-specific default, but for the kernel is universal thanks to '-funsigned-char'). But any "we can typedef an 8-bit type to not become a string just because it's an array" model would be fine. But "__attribute__((nonstring))" is sadly not that sane model. Reported-by: Chris Clayton <chris2553@googlemail.com> Fixes: 4b4bd8c50f48 ("gcc-15: acpi: sprinkle random '__nonstring' crumbles around") Fixes: d5d45a7f2619 ("gcc-15: make 'unterminated string initialization' just a warning") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
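For context, the construct at issue, illustrated (both declarations use the kernel's __nonstring wrapper for __attribute__((nonstring))):

    /* a byte array deliberately left without NUL termination; gcc-15
     * warns about the initializer unless the array is annotated */
    static const char hex_digits[16] __nonstring = "0123456789abcdef";

    /* the ACPI case: an array of 4-byte signatures; gcc-15 wants the
     * annotation here too, but gcc-14 rejects it with "attribute ignored"
     * because it sits on an array of char arrays */
    static const char table_sigs[][4] __nonstring = { "RSDT", "XSDT", "SSDT" };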
2025-04-20Linux 6.15-rc3v6.15-rc3Linus Torvalds