Age | Commit message | Author
2025-03-20 | Merge tag 'nvme-6.15-2025-03-20' of git://git.infradead.org/nvme into for-6.15/block | Jens Axboe
Pull NVMe updates from Keith:
"nvme updates for Linux 6.15
- Secure concatenation for TCP transport (Hannes)
- Multipath sysfs visibility (Nilay)
- Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li)
- Correct use of 64-bit BARs for pci-epf target (Niklas)
- Socket fix for selinux when used in containers (Peijie)"
* tag 'nvme-6.15-2025-03-20' of git://git.infradead.org/nvme: (22 commits)
nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
nvme-tcp: fix selinux denied when calling sock_sendmsg
nvmet: pci-epf: Always configure BAR0 as 64-bit
nvmet: Remove duplicate uuid_copy
nvme: zns: Simplify nvme_zone_parse_entry()
nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls
nvmet-fc: Remove unused functions
nvme-pci: remove stale comment
nvme-fc: Utilise min3() to simplify queue count calculation
nvme-multipath: Add visibility for queue-depth io-policy
nvme-multipath: Add visibility for numa io-policy
nvme-multipath: Add visibility for round-robin io-policy
nvmet: add tls_concat and tls_key debugfs entries
nvmet-tcp: support secure channel concatenation
nvmet: Add 'sq' argument to alloc_ctrl_args
nvme-fabrics: reset admin connection for secure concatenation
nvme-tcp: request secure channel concatenation
nvme-keyring: add nvme_tls_psk_refresh()
nvme: add nvme_auth_derive_tls_psk()
nvme: add nvme_auth_generate_digest()
...
2025-03-20 | nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi) | Li Haoran
This patch replaces max(a, min(b, c)) by clamp(val, lo, hi) in the nvme driver. The clamp() macro explicitly expresses the intent of constraining a value within bounds, improving code readability. Signed-off-by: Li Haoran <li.haoran7@zte.com.cn> Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
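The equivalence this commit relies on can be sketched in userspace C. The helpers below (`min_int`, `max_int`, `bound_nested`, `clamp_int`) are hypothetical stand-ins written for this illustration; they are not the kernel's type-checked `clamp()` macro, and the sketch assumes plain `int` operands with `lo <= hi`.

```c
#include <assert.h>

/* Hypothetical userspace helpers; the kernel's min()/max()/clamp() are macros. */
static inline int min_int(int a, int b) { return a < b ? a : b; }
static inline int max_int(int a, int b) { return a > b ? a : b; }

/* Old style: the bounds are expressed through nested max()/min() calls. */
static inline int bound_nested(int val, int lo, int hi)
{
	return max_int(lo, min_int(val, hi));
}

/* New style: "keep val within [lo, hi]" is stated directly. */
static inline int clamp_int(int val, int lo, int hi)
{
	assert(lo <= hi); /* both forms agree only when the bounds are ordered */
	return val < lo ? lo : (val > hi ? hi : val);
}
```

Both helpers return the same result for every input that satisfies `lo <= hi`; the second form simply names the operation.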
2025-03-20 | nvme-tcp: fix selinux denied when calling sock_sendmsg | Peijie Shao
In a SELinux-enabled kernel, sock_create() initializes the security label of the socket using the security label of the calling process; this typically works well. However, in a containerized environment like Kubernetes, a problem arises when a privileged container (domain spc_t) connects to an NVMe target and mounts the NVMe device as persistent storage for unprivileged containers (domain container_t). This is because the container_t domain cannot access resources labeled with spc_t, resulting in sock_sendmsg() returning -EACCES. The solution is to use sock_create_kern() instead of sock_create(), which labels the socket context as kernel_t. Access control will then be handled by the VFS layer rather than the socket itself. Signed-off-by: Peijie Shao <shaopeijie@cestc.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvmet: pci-epf: Always configure BAR0 as 64-bit | Niklas Cassel
NVMe PCIe Transport Specification 1.1, section 2.1.10, claims that the BAR0 type is Implementation Specific. However, in NVMe 1.1, the type is required to be 64-bit. Thus, to make our PCI EPF work on as many host systems as possible, always configure the BAR0 type to be 64-bit. In the rare case that the underlying PCI EPC does not support configuring BAR0 as 64-bit, the call to pci_epc_set_bar() will fail, and we will return a failure back to the user. This should not be a problem, as most PCI EPCs support configuring a BAR as 64-bit (and those EPCs with .only_64bit set to true in epc_features only support configuring the BAR as 64-bit). Tested-by: Damien Le Moal <dlemoal@kernel.org> Fixes: 0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver") Signed-off-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvmet: Remove duplicate uuid_copy | Mike Christie
We do uuid_copy twice in nvmet_alloc_ctrl so this patch deletes one of the calls. Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme: zns: Simplify nvme_zone_parse_entry() | Damien Le Moal
Instead of passing a pointer to a struct nvme_ctrl and a pointer to a struct nvme_ns_head as the first two arguments of nvme_zone_parse_entry(), pass only a pointer to a struct nvme_ns, as both the controller structure and the ns head structure can be inferred from the namespace structure. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls | Chen Ni
'destroy_workqueue()' already drains the queue before destroying it, so there is no need to flush it explicitly. Remove the redundant 'flush_workqueue()' calls. This was generated with coccinelle: @@ expression E; @@ - flush_workqueue(E); destroy_workqueue(E); Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvmet-fc: Remove unused functions | WangYuli
The functions nvmet_fc_iodnum() and nvmet_fc_fodnum() are currently unused. Since commit c53432030d86 ("nvme-fabrics: Add target support for FC transport") introduced them, they have never been used in practice. Remove them to resolve the compiler warnings. Fix the following errors with clang-19 when W=1e: drivers/nvme/target/fc.c:177:1: error: unused function 'nvmet_fc_iodnum' [-Werror,-Wunused-function] 177 | nvmet_fc_iodnum(struct nvmet_fc_ls_iod *iodptr) | ^~~~~~~~~~~~~~~ drivers/nvme/target/fc.c:183:1: error: unused function 'nvmet_fc_fodnum' [-Werror,-Wunused-function] 183 | nvmet_fc_fodnum(struct nvmet_fc_fcp_iod *fodptr) | ^~~~~~~~~~~~~~~ 2 errors generated. make[8]: *** [scripts/Makefile.build:207: drivers/nvme/target/fc.o] Error 1 make[7]: *** [scripts/Makefile.build:465: drivers/nvme/target] Error 2 make[6]: *** [scripts/Makefile.build:465: drivers/nvme] Error 2 make[6]: *** Waiting for unfinished jobs.... Fixes: c53432030d86 ("nvme-fabrics: Add target support for FC transport") Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme-pci: remove stale comment | Baruch Siach
The ns variable has been removed in commit 62451a2b2e7e ("nvme: separate command prep and issue"). Drop reference to ns in comment. Fixes: 62451a2b2e7e ("nvme: separate command prep and issue") Signed-off-by: Baruch Siach <baruch@tkos.co.il> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme-fc: Utilise min3() to simplify queue count calculation | Qasim Ijaz
Refactor nvme_fc_create_io_queues() and nvme_fc_recreate_io_queues() to use the min3() macro to find the minimum between 3 values instead of multiple min()'s. This shortens the code and makes it easier to read. Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
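As a rough illustration of the refactoring, the sketch below replaces a nested two-way minimum with a three-way helper. The names `min_uint`, `min3_uint` and `io_queue_count`, and the three inputs (requested queues, online CPUs, hardware queue limit), are assumptions made for this example; they are not taken from the driver source.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for the kernel's min() and min3() macros. */
static inline unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static inline unsigned int min3_uint(unsigned int a, unsigned int b,
				     unsigned int c)
{
	return min_uint(a, min_uint(b, c));
}

/*
 * Assumed shape of the calculation: the I/O queue count is limited by the
 * smallest of three values. Before the change this would read
 * min_uint(min_uint(requested, online_cpus), hw_max).
 */
static unsigned int io_queue_count(unsigned int requested,
				   unsigned int online_cpus,
				   unsigned int hw_max)
{
	return min3_uint(requested, online_cpus, hw_max);
}
```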
2025-03-20 | nvme-multipath: Add visibility for queue-depth io-policy | Nilay Shroff
This patch adds nvme native multipath visibility for the queue-depth io-policy. It adds a new attribute file named "queue_depth" under the namespace device path node, which prints the number of active/in-flight I/O requests currently queued for the given path. For instance, if we have a shared namespace accessible from two different controllers/paths, then accessing the head block node of the shared namespace would show the following output: $ ls -l /sys/block/nvme1n1/multipath/ nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1 nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1 In the above example, nvme1n1 is the head gendisk node created for a shared namespace, and the namespace is accessible from the nvme1c1n1 and nvme1c3n1 paths. For the queue-depth io-policy we can then refer to the "queue_depth" attribute file created under each namespace path: $ cat /sys/block/nvme1n1/multipath/nvme1c1n1/queue_depth 518 $ cat /sys/block/nvme1n1/multipath/nvme1c3n1/queue_depth 504 From the above output, we can infer that the I/O workload targeted at nvme1n1 uses the two paths nvme1c1n1 and nvme1c3n1, and the current queue depth of each path is 518 and 504 respectively. Reading the "queue_depth" file when the configured io-policy is anything but queue-depth shows no output. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
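A program that consumes such an attribute has to cope with the empty read described above. The following is a minimal userspace sketch, assuming only that the attribute is a small text file ending in a newline; `read_attr` is a hypothetical helper written for this example, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one sysfs-style attribute file into buf and strip the trailing
 * newline. Returns the number of characters stored, 0 if the file is empty
 * (what the commit describes for a non-matching io-policy), or -1 if the
 * file cannot be opened.
 */
static int read_attr(const char *path, char *buf, size_t len)
{
	FILE *f;
	size_t n;

	assert(len > 0);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, len - 1, f);
	fclose(f);
	buf[n] = '\0';
	n = strcspn(buf, "\n"); /* index of the first newline, or the length */
	buf[n] = '\0';
	return (int)n;
}
```

With the paths from the commit message, a caller would pass something like `/sys/block/nvme1n1/multipath/nvme1c1n1/queue_depth` and treat a return value of 0 as "the queue-depth policy is not active".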
2025-03-20 | nvme-multipath: Add visibility for numa io-policy | Nilay Shroff
This patch adds nvme native multipath visibility for the numa io-policy. It adds a new attribute file named "numa_nodes" under the namespace gendisk device path node, which prints the list of numa nodes preferred by the given namespace path. The numa nodes value is a comma-delimited list of nodes or an A-B range of nodes. For instance, if we have a shared namespace accessible from two different controllers/paths, then accessing the head node of the shared namespace would show the following output: $ ls -l /sys/block/nvme1n1/multipath/ nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1 nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1 In the above example, nvme1n1 is the head gendisk node created for a shared namespace, and this namespace is accessible from the nvme1c1n1 and nvme1c3n1 paths. For the numa io-policy we can then refer to the "numa_nodes" attribute file created under each namespace path: $ cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes 0-1 $ cat /sys/block/nvme1n1/multipath/nvme1c3n1/numa_nodes 2-3 From the above output, we infer that an I/O workload targeted at nvme1n1 and running on numa nodes 0 and 1 would prefer using path nvme1c1n1. Similarly, an I/O workload running on numa nodes 2 and 3 would prefer using path nvme1c3n1. Reading the "numa_nodes" file when the configured io-policy is anything but numa shows no output. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme-multipath: Add visibility for round-robin io-policy | Nilay Shroff
This patch adds nvme native multipath visibility for the round-robin io-policy. It creates a "multipath" sysfs directory under the head gendisk device node directory and then, from the "multipath" directory, adds a link to each namespace path device the head node refers to. For instance, if we have a shared namespace accessible from two different controllers/paths, then we create a soft link to each path device from the head disk node as shown below: $ ls -l /sys/block/nvme1n1/multipath/ nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1 nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1 In the above example, nvme1n1 is the head gendisk node created for a shared namespace, and the namespace is accessible from the nvme1c1n1 and nvme1c3n1 paths. For the round-robin I/O policy, we can easily infer from the above output that an I/O workload targeted at nvme1n1 would toggle across paths nvme1c1n1 and nvme1c3n1. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvmet: add tls_concat and tls_key debugfs entries | Hannes Reinecke
Add debugfs entries to display the 'concat' and 'tls_key' controller attributes. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvmet-tcp: support secure channel concatenation | Hannes Reinecke
Evaluate the SC_C flag during DH-HMAC-CHAP negotiation to check if secure concatenation as specified in the NVMe Base Specification v2.1, section 8.3.4.3: "Secure Channel Concatenation" is requested. If requested, the generated PSK is inserted into the keyring once negotiation has finished, allowing for an encrypted connection once the admin queue is restarted. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvmet: Add 'sq' argument to alloc_ctrl_args | Hannes Reinecke
For secure concatenation the result of the TLS handshake will be stored in the 'sq' struct, so add it to the alloc_ctrl_args struct. Cc: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme-fabrics: reset admin connection for secure concatenation | Hannes Reinecke
When secure concatenation is requested, the connection needs to be reset to enable TLS encryption on the new connection. That implies that the original connection used for the DH-CHAP negotiation really shouldn't be used, and we should reset as soon as the DH-CHAP negotiation has succeeded on the admin queue. Based on an idea from Sagi. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme-tcp: request secure channel concatenation | Hannes Reinecke
Add a fabrics option 'concat' to request secure channel concatenation as specified in the NVMe Base Specification v2.1, section 8.3.4.3: Secure Channel Concatenation. When secure channel concatenation is enabled, a 'generated PSK' is inserted into the keyring such that it's available after reset. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme-keyring: add nvme_tls_psk_refresh() | Hannes Reinecke
Add a function to refresh a generated PSK in the specified keyring. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme: add nvme_auth_derive_tls_psk() | Hannes Reinecke
Add a function to derive the TLS PSK as specified in TP8018. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme: add nvme_auth_generate_digest() | Hannes Reinecke
Add a function to calculate the PSK digest as specified in TP8018. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | nvme: add nvme_auth_generate_psk() | Hannes Reinecke
Add a function to generate an NVMe PSK from the shared credentials negotiated by DH-HMAC-CHAP. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | crypto,fs: Separate out hkdf_extract() and hkdf_expand() | Hannes Reinecke
Separate out the HKDF functions into a separate module to make them available to other callers. Also add a testsuite to the module with test vectors from RFC 5869 (and additional vectors for SHA384 and SHA512) to ensure the integrity of the algorithm. Signed-off-by: Hannes Reinecke <hare@kernel.org> Acked-by: Eric Biggers <ebiggers@kernel.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20 | docs: sysfs-block: Clarify integrity sysfs attributes | Milan Broz
The /sys/block/<disk>/integrity fields are historically set if T10 Protection Information is enabled. They are not set if some upper layer uses integrity metadata. Document this. Signed-off-by: Milan Broz <gmazyland@gmail.com> Co-developed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250318154447.370786-1-gmazyland@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 | block/blk-iocost: ensure 'ret' is set on error | Jens Axboe
In case blkg_conf_open_bdev_frozen() fails, ioc_qos_write() jumps to the error path without assigning a value to 'ret'. Ensure that it inherits the error from the passed back error value. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503200454.QWpwKeJu-lkp@intel.com/ Fixes: 9730763f4756 ("block: correct locking order for protecting blk-wbt parameters") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 | block: correct locking order for protecting blk-wbt parameters | Nilay Shroff
The commit '245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock")' introduced q->elevator_lock to protect updates to blk-wbt parameters when writing to the sysfs attribute wbt_lat_usec and the cgroup attribute io.cost.qos. However, both these attributes also acquire q->rq_qos_mutex, leading to the following lockdep warning: ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc5+ #138 Not tainted ------------------------------------------------------ bash/5902 is trying to acquire lock: c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238 but task is already holding lock: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&q->elevator_lock){+.+.}-{4:4}: __mutex_lock+0xf0/0xa58 ioc_qos_write+0x16c/0x85c cgroup_file_write+0xc4/0x32c kernfs_fop_write_iter+0x1b8/0x29c vfs_write+0x410/0x584 ksys_write+0x84/0x140 system_call_exception+0x134/0x360 system_call_vectored_common+0x15c/0x2ec -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}: __lock_acquire+0x1b6c/0x2ae0 lock_acquire+0x140/0x430 __mutex_lock+0xf0/0xa58 wbt_init+0x164/0x238 queue_wb_lat_store+0x1dc/0x20c queue_attr_store+0x12c/0x164 sysfs_kf_write+0x6c/0xb0 kernfs_fop_write_iter+0x1b8/0x29c vfs_write+0x410/0x584 ksys_write+0x84/0x140 system_call_exception+0x134/0x360 system_call_vectored_common+0x15c/0x2ec other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->elevator_lock); lock(&q->rq_qos_mutex); lock(&q->elevator_lock); lock(&q->rq_qos_mutex); *** DEADLOCK *** 6 locks held by bash/5902: #0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140 #1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c #2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c #3:
c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 #4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40 #5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c stack backtrace: CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138 Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries Call Trace: [c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable) [c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604 [c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c [c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0 [c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430 [c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58 [c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238 [c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c [c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164 [c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0 [c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c [c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584 [c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140 [c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360 [c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec From the above log it's apparent that the method which writes to the sysfs attribute wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_qos_mutex. However, the other method, which writes to io.cost.qos, acquires q->rq_qos_mutex first, and then acquires q->elevator_lock. So this could potentially cause a deadlock.
A closer look at ioc_qos_write shows that correcting the lock order is non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev and released in blkg_conf_exit. The function blkg_conf_open_bdev is responsible for parsing user input and finding the corresponding block device (bdev) from the user-provided major:minor number. Since we do not know the bdev until blkg_conf_open_bdev completes, we cannot simply move the q->elevator_lock acquisition before blkg_conf_open_bdev. To address this, we introduce new helpers blkg_conf_open_bdev_frozen and blkg_conf_exit_frozen, which are just wrappers around blkg_conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally freezes the queue, acquires q->elevator_lock and ensures the correct locking order is followed between q->elevator_lock and q->rq_qos_mutex. Similarly, the helper blkg_conf_exit_frozen, in addition to unfreezing the queue, ensures that we release the locks in the correct order. By using these helpers, we now maintain the same locking order in all code paths where we update blk-wbt parameters. Fixes: 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250319105518.468941-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
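The general rule behind this fix, that every path must take the two locks in the same order, can be sketched in userspace with pthread mutexes. Everything below is an assumption made for illustration: `struct fake_queue` and the four functions are hypothetical, queue freezing is not modelled, and this is not how the kernel helpers are implemented.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the two request_queue locks named in the commit. */
struct fake_queue {
	pthread_mutex_t elevator_lock;
	pthread_mutex_t rq_qos_mutex;
	int wbt_lat_usec;
};

static void fake_queue_init(struct fake_queue *q)
{
	pthread_mutex_init(&q->elevator_lock, NULL);
	pthread_mutex_init(&q->rq_qos_mutex, NULL);
	q->wbt_lat_usec = 0;
}

/* Single place that defines the order: elevator_lock, then rq_qos_mutex. */
static void queue_lock_both(struct fake_queue *q)
{
	pthread_mutex_lock(&q->elevator_lock);
	pthread_mutex_lock(&q->rq_qos_mutex);
}

/* Release in the reverse order of acquisition. */
static void queue_unlock_both(struct fake_queue *q)
{
	pthread_mutex_unlock(&q->rq_qos_mutex);
	pthread_mutex_unlock(&q->elevator_lock);
}

/* Models the sysfs writer path. */
static void sysfs_store(struct fake_queue *q, int val)
{
	queue_lock_both(q);
	q->wbt_lat_usec = val;
	queue_unlock_both(q);
}

/* Models the cgroup writer path; it goes through the same ordering helper. */
static void cgroup_store(struct fake_queue *q, int val)
{
	queue_lock_both(q);
	q->wbt_lat_usec = val;
	queue_unlock_both(q);
}
```

Because both writers acquire the locks through one helper, no thread can hold the second lock while waiting for the first, which is the circular dependency lockdep reported.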
2025-03-19 | block: release q->elevator_lock in ioc_qos_write | Nilay Shroff
The ioc_qos_write method acquires q->elevator_lock to protect updates to blk-wbt parameters. Once these updates are complete, the lock should be released before returning from ioc_qos_write. However, in one code path, the release of q->elevator_lock was mistakenly omitted, potentially leading to a lock leak. This commit fixes the issue by ensuring that q->elevator_lock is properly released in all return paths of ioc_qos_write. Fixes: 245618f8e45f ("block: protect wbt_lat_usec using q->elevator_lock") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503171650.cc082b66-lkp@intel.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250319105518.468941-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
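A common way to guarantee that a lock is released on every return path, and that the error code is always assigned, is a single exit label. The sketch below shows that pattern in userspace; `cfg_lock`, `cfg_value` and `cfg_write` are hypothetical names, and the code is not derived from ioc_qos_write itself.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static int cfg_value;

/* Store a non-negative value; reject anything else with -EINVAL. */
static int cfg_write(int val)
{
	int ret;

	pthread_mutex_lock(&cfg_lock);
	if (val < 0) {
		ret = -EINVAL; /* assign the error before jumping to the exit */
		goto out_unlock;
	}
	cfg_value = val;
	ret = 0;
out_unlock:
	pthread_mutex_unlock(&cfg_lock); /* one release point for every path */
	return ret;
}
```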
2025-03-19 | ublk: remove io_cmds list in ublk_queue | Uday Shankar
The current I/O dispatch mechanism - queueing I/O by adding it to the io_cmds list (and poking task_work as needed), then dispatching it in ublk server task context by reversing io_cmds and completing the io_uring command associated to each one - was introduced by commit 7d4a93176e014 ("ublk_drv: don't forward io commands in reserve order") to ensure that the ublk server received I/O in the same order that the block layer submitted it to ublk_drv. This mechanism was only needed for the "raw" task_work submission mechanism, since the io_uring task work wrapper maintains FIFO ordering (using quite a similar mechanism in fact). The "raw" task_work submission mechanism is no longer supported in ublk_drv as of commit 29dc5d06613f2 ("ublk: kill queuing request by task_work_add"), so the explicit llist/reversal is no longer needed - it just duplicates logic already present in the underlying io_uring APIs. Remove it. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250318-ublk_io_cmds-v1-1-c1bb74798fef@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
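The reason a reversal was needed at all is that a push-at-head list hands entries back newest first. A minimal single-threaded sketch of that effect follows; `struct node`, `push` and `reverse` are hypothetical and leave out the atomic operations a real lock-less list requires.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int tag;
	struct node *next;
};

/* Push at the head, as a lock-less list does: the newest entry ends up first. */
static void push(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/* Reverse the list in place so entries come out in submission order. */
static struct node *reverse(struct node *head)
{
	struct node *prev = NULL;

	while (head) {
		struct node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

Pushing tags 1, 2, 3 leaves 3 at the head; after `reverse()` the consumer sees 1, 2, 3 again, which is the FIFO behaviour the commit says the io_uring task work wrapper already provides.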
2025-03-18 | blk-cgroup: improve policy registration error handling | Chen Linxuan
This patch improves the error codes returned by blkcg_policy_register(). 1. Move the validation check for the cpd/pd_alloc_fn and cpd/pd_free_fn function pairs to the start of blkcg_policy_register(). This ensures we immediately return -EINVAL if the function pairs are not correctly provided, rather than returning -ENOSPC after locking and unlocking mutexes unnecessarily. Those locks should not cause any contention problems, as a policy registration error is a very cold path. 2. Return -ENOMEM when cpd_alloc_fn() fails. Co-authored-by: Wen Tao <wentao@uniontech.com> Signed-off-by: Wen Tao <wentao@uniontech.com> Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/3E333A73B6B6DFC0+20250317022924.150907-1-chenlinxuan@uniontech.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
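The first point, validating arguments before taking any lock so that the caller gets the most specific error, can be sketched as follows. The `struct policy` layout, the slot table, `MAX_POLICIES` and the dummy callbacks are all assumptions made for this example; only the error codes mirror the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <pthread.h>

#define MAX_POLICIES 2 /* hypothetical table size */

struct policy {
	void *(*pd_alloc_fn)(void);
	void (*pd_free_fn)(void *);
};

static pthread_mutex_t register_lock = PTHREAD_MUTEX_INITIALIZER;
static const struct policy *policies[MAX_POLICIES];

/* Dummy callbacks used only to build example policies. */
static void *dummy_alloc(void) { return NULL; }
static void dummy_free(void *p) { (void)p; }

static int policy_register(const struct policy *pol)
{
	int i, ret = -ENOSPC;

	/* Cheap check first: alloc and free must be provided as a pair. */
	if (!pol->pd_alloc_fn != !pol->pd_free_fn)
		return -EINVAL;

	pthread_mutex_lock(&register_lock);
	for (i = 0; i < MAX_POLICIES; i++) {
		if (!policies[i]) {
			policies[i] = pol;
			ret = 0;
			break;
		}
	}
	pthread_mutex_unlock(&register_lock);
	return ret;
}
```

With the check placed first, a malformed policy is rejected with -EINVAL even when the table is full, instead of being reported as -ENOSPC.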
2025-03-18 | loop: move vfs_fsync() out of loop_update_dio() | Ming Lei
If vfs_fsync() is called with the queue frozen, the queue freeze lock may be connected with FS internal locks, and a lockdep warning can be triggered because the queue freeze lock is connected with too many global or sub-system locks. Fix the warning by moving vfs_fsync() out of loop_update_dio(): - vfs_fsync() is only needed when switching to dio - only loop_change_fd() and loop_configure() may switch from buffered IO to direct IO, so call vfs_fsync() directly there. This is safe because either the loop device is unbound, or the new file isn't attached - for the other two cases of set_status and set_block_size, direct IO can only be turned off, so there is no need to call vfs_fsync() Cc: Christoph Hellwig <hch@infradead.org> Reported-by: Kun Hu <huk23@m.fudan.edu.cn> Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn> Closes: https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250318072955.3893805-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-18 | block: Make request_queue lockdep splats show up earlier | Thomas Hellström
In recent kernels, there are lockdep splats around the struct request_queue::io_lockdep_map, similar to [1], but they typically don't show up until reclaim with writeback happens. Having multiple kernel versions released with a known risk of kernel deadlock during reclaim writeback should IMHO be addressed and backported to -stable with the highest priority. In order to have these lockdep splats show up earlier, preferably during system initialization, prime the struct request_queue::io_lockdep_map as GFP_KERNEL reclaim-tainted. This will instead lead to lockdep splats looking similar to [2], but without the need for reclaim + writeback happening. [1]: [ 189.762244] ====================================================== [ 189.762432] WARNING: possible circular locking dependency detected [ 189.762441] 6.14.0-rc6-xe+ #6 Tainted: G U [ 189.762450] ------------------------------------------------------ [ 189.762459] kswapd0/119 is trying to acquire lock: [ 189.762467] ffff888110ceb710 (&q->q_usage_counter(io)#26){++++}-{0:0}, at: __submit_bio+0x76/0x230 [ 189.762485] but task is already holding lock: [ 189.762494] ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00 [ 189.762507] which lock already depends on the new lock.
[ 189.762519] the existing dependency chain (in reverse order) is: [ 189.762529] -> #2 (fs_reclaim){+.+.}-{0:0}: [ 189.762540] fs_reclaim_acquire+0xc5/0x100 [ 189.762548] kmem_cache_alloc_lru_noprof+0x4a/0x480 [ 189.762558] alloc_inode+0xaa/0xe0 [ 189.762566] iget_locked+0x157/0x330 [ 189.762573] kernfs_get_inode+0x1b/0x110 [ 189.762582] kernfs_get_tree+0x1b0/0x2e0 [ 189.762590] sysfs_get_tree+0x1f/0x60 [ 189.762597] vfs_get_tree+0x2a/0xf0 [ 189.762605] path_mount+0x4cd/0xc00 [ 189.762613] __x64_sys_mount+0x119/0x150 [ 189.762621] x64_sys_call+0x14f2/0x2310 [ 189.762630] do_syscall_64+0x91/0x180 [ 189.762637] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 189.762647] -> #1 (&root->kernfs_rwsem){++++}-{3:3}: [ 189.762659] down_write+0x3e/0xf0 [ 189.762667] kernfs_remove+0x32/0x60 [ 189.762676] sysfs_remove_dir+0x4f/0x60 [ 189.762685] __kobject_del+0x33/0xa0 [ 189.762709] kobject_del+0x13/0x30 [ 189.762716] elv_unregister_queue+0x52/0x80 [ 189.762725] elevator_switch+0x68/0x360 [ 189.762733] elv_iosched_store+0x14b/0x1b0 [ 189.762756] queue_attr_store+0x181/0x1e0 [ 189.762765] sysfs_kf_write+0x49/0x80 [ 189.762773] kernfs_fop_write_iter+0x17d/0x250 [ 189.762781] vfs_write+0x281/0x540 [ 189.762790] ksys_write+0x72/0xf0 [ 189.762798] __x64_sys_write+0x19/0x30 [ 189.762807] x64_sys_call+0x2a3/0x2310 [ 189.762815] do_syscall_64+0x91/0x180 [ 189.762823] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 189.762833] -> #0 (&q->q_usage_counter(io)#26){++++}-{0:0}: [ 189.762845] __lock_acquire+0x1525/0x2760 [ 189.762854] lock_acquire+0xca/0x310 [ 189.762861] blk_mq_submit_bio+0x8a2/0xba0 [ 189.762870] __submit_bio+0x76/0x230 [ 189.762878] submit_bio_noacct_nocheck+0x323/0x430 [ 189.762888] submit_bio_noacct+0x2cc/0x620 [ 189.762896] submit_bio+0x38/0x110 [ 189.762904] __swap_writepage+0xf5/0x380 [ 189.762912] swap_writepage+0x3c7/0x600 [ 189.762920] shmem_writepage+0x3da/0x4f0 [ 189.762929] pageout+0x13f/0x310 [ 189.762937] shrink_folio_list+0x61c/0xf60 [ 189.763261] 
evict_folios+0x378/0xcd0 [ 189.763584] try_to_shrink_lruvec+0x1b0/0x360 [ 189.763946] shrink_one+0x10e/0x200 [ 189.764266] shrink_node+0xc02/0x1490 [ 189.764586] balance_pgdat+0x563/0xb00 [ 189.764934] kswapd+0x1e8/0x430 [ 189.765249] kthread+0x10b/0x260 [ 189.765559] ret_from_fork+0x44/0x70 [ 189.765889] ret_from_fork_asm+0x1a/0x30 [ 189.766198] other info that might help us debug this: [ 189.767089] Chain exists of: &q->q_usage_counter(io)#26 --> &root->kernfs_rwsem --> fs_reclaim [ 189.767971] Possible unsafe locking scenario: [ 189.768555] CPU0 CPU1 [ 189.768849] ---- ---- [ 189.769136] lock(fs_reclaim); [ 189.769421] lock(&root->kernfs_rwsem); [ 189.769714] lock(fs_reclaim); [ 189.770016] rlock(&q->q_usage_counter(io)#26); [ 189.770305] *** DEADLOCK *** [ 189.771167] 1 lock held by kswapd0/119: [ 189.771453] #0: ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00 [ 189.771770] stack backtrace: [ 189.772351] CPU: 4 UID: 0 PID: 119 Comm: kswapd0 Tainted: G U 6.14.0-rc6-xe+ #6 [ 189.772353] Tainted: [U]=USER [ 189.772354] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [ 189.772354] Call Trace: [ 189.772355] <TASK> [ 189.772356] dump_stack_lvl+0x6e/0xa0 [ 189.772359] dump_stack+0x10/0x18 [ 189.772360] print_circular_bug.cold+0x17a/0x1b7 [ 189.772363] check_noncircular+0x13a/0x150 [ 189.772365] ? __pfx_stack_trace_consume_entry+0x10/0x10 [ 189.772368] __lock_acquire+0x1525/0x2760 [ 189.772368] ? ret_from_fork_asm+0x1a/0x30 [ 189.772371] lock_acquire+0xca/0x310 [ 189.772372] ? __submit_bio+0x76/0x230 [ 189.772375] ? lock_release+0xd5/0x2c0 [ 189.772376] blk_mq_submit_bio+0x8a2/0xba0 [ 189.772378] ? __submit_bio+0x76/0x230 [ 189.772380] __submit_bio+0x76/0x230 [ 189.772382] ? trace_hardirqs_on+0x1e/0xe0 [ 189.772384] submit_bio_noacct_nocheck+0x323/0x430 [ 189.772386] ? submit_bio_noacct_nocheck+0x323/0x430 [ 189.772387] ? __might_sleep+0x58/0xa0 [ 189.772390] submit_bio_noacct+0x2cc/0x620 [ 189.772391] ? 
count_memcg_events+0x68/0x90 [ 189.772393] submit_bio+0x38/0x110 [ 189.772395] __swap_writepage+0xf5/0x380 [ 189.772396] swap_writepage+0x3c7/0x600 [ 189.772397] shmem_writepage+0x3da/0x4f0 [ 189.772401] pageout+0x13f/0x310 [ 189.772406] shrink_folio_list+0x61c/0xf60 [ 189.772409] ? isolate_folios+0xe80/0x16b0 [ 189.772410] ? mark_held_locks+0x46/0x90 [ 189.772412] evict_folios+0x378/0xcd0 [ 189.772414] ? evict_folios+0x34a/0xcd0 [ 189.772415] ? lock_is_held_type+0xa3/0x130 [ 189.772417] try_to_shrink_lruvec+0x1b0/0x360 [ 189.772420] shrink_one+0x10e/0x200 [ 189.772421] shrink_node+0xc02/0x1490 [ 189.772423] ? shrink_node+0xa08/0x1490 [ 189.772424] ? shrink_node+0xbd8/0x1490 [ 189.772425] ? mem_cgroup_iter+0x366/0x480 [ 189.772427] balance_pgdat+0x563/0xb00 [ 189.772428] ? balance_pgdat+0x563/0xb00 [ 189.772430] ? trace_hardirqs_on+0x1e/0xe0 [ 189.772431] ? finish_task_switch.isra.0+0xcb/0x330 [ 189.772433] ? __switch_to_asm+0x33/0x70 [ 189.772437] kswapd+0x1e8/0x430 [ 189.772438] ? __pfx_autoremove_wake_function+0x10/0x10 [ 189.772440] ? __pfx_kswapd+0x10/0x10 [ 189.772441] kthread+0x10b/0x260 [ 189.772443] ? __pfx_kthread+0x10/0x10 [ 189.772444] ret_from_fork+0x44/0x70 [ 189.772446] ? __pfx_kthread+0x10/0x10 [ 189.772447] ret_from_fork_asm+0x1a/0x30 [ 189.772450] </TASK> [2]: [ 8.760253] ====================================================== [ 8.760254] WARNING: possible circular locking dependency detected [ 8.760255] 6.14.0-rc6-xe+ #7 Tainted: G U [ 8.760256] ------------------------------------------------------ [ 8.760257] (udev-worker)/674 is trying to acquire lock: [ 8.760259] ffff888100e39148 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_remove+0x32/0x60 [ 8.760265] but task is already holding lock: [ 8.760266] ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760272] which lock already depends on the new lock. 
[ 8.760272] the existing dependency chain (in reverse order) is: [ 8.760273] -> #2 (&q->q_usage_counter(io)#27){++++}-{0:0}: [ 8.760276] blk_alloc_queue+0x30a/0x350 [ 8.760279] blk_mq_alloc_queue+0x6b/0xe0 [ 8.760281] scsi_alloc_sdev+0x276/0x3c0 [ 8.760284] scsi_probe_and_add_lun+0x22a/0x440 [ 8.760286] __scsi_scan_target+0x109/0x230 [ 8.760288] scsi_scan_channel+0x65/0xc0 [ 8.760290] scsi_scan_host_selected+0xff/0x140 [ 8.760292] do_scsi_scan_host+0xa7/0xc0 [ 8.760293] do_scan_async+0x1c/0x160 [ 8.760295] async_run_entry_fn+0x32/0x150 [ 8.760299] process_one_work+0x224/0x5f0 [ 8.760302] worker_thread+0x1d4/0x3e0 [ 8.760304] kthread+0x10b/0x260 [ 8.760306] ret_from_fork+0x44/0x70 [ 8.760309] ret_from_fork_asm+0x1a/0x30 [ 8.760312] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 8.760315] fs_reclaim_acquire+0xc5/0x100 [ 8.760317] kmem_cache_alloc_lru_noprof+0x4a/0x480 [ 8.760319] alloc_inode+0xaa/0xe0 [ 8.760322] iget_locked+0x157/0x330 [ 8.760323] kernfs_get_inode+0x1b/0x110 [ 8.760325] kernfs_get_tree+0x1b0/0x2e0 [ 8.760327] sysfs_get_tree+0x1f/0x60 [ 8.760329] vfs_get_tree+0x2a/0xf0 [ 8.760332] path_mount+0x4cd/0xc00 [ 8.760334] __x64_sys_mount+0x119/0x150 [ 8.760336] x64_sys_call+0x14f2/0x2310 [ 8.760338] do_syscall_64+0x91/0x180 [ 8.760340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760342] -> #0 (&root->kernfs_rwsem){++++}-{3:3}: [ 8.760345] __lock_acquire+0x1525/0x2760 [ 8.760347] lock_acquire+0xca/0x310 [ 8.760348] down_write+0x3e/0xf0 [ 8.760350] kernfs_remove+0x32/0x60 [ 8.760351] sysfs_remove_dir+0x4f/0x60 [ 8.760353] __kobject_del+0x33/0xa0 [ 8.760355] kobject_del+0x13/0x30 [ 8.760356] elv_unregister_queue+0x52/0x80 [ 8.760358] elevator_switch+0x68/0x360 [ 8.760360] elv_iosched_store+0x14b/0x1b0 [ 8.760362] queue_attr_store+0x181/0x1e0 [ 8.760364] sysfs_kf_write+0x49/0x80 [ 8.760366] kernfs_fop_write_iter+0x17d/0x250 [ 8.760367] vfs_write+0x281/0x540 [ 8.760370] ksys_write+0x72/0xf0 [ 8.760372] __x64_sys_write+0x19/0x30 [ 8.760374] x64_sys_call+0x2a3/0x2310 [ 
8.760376] do_syscall_64+0x91/0x180 [ 8.760377] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760380] other info that might help us debug this: [ 8.760380] Chain exists of: &root->kernfs_rwsem --> fs_reclaim --> &q->q_usage_counter(io)#27 [ 8.760384] Possible unsafe locking scenario: [ 8.760384] CPU0 CPU1 [ 8.760385] ---- ---- [ 8.760385] lock(&q->q_usage_counter(io)#27); [ 8.760387] lock(fs_reclaim); [ 8.760388] lock(&q->q_usage_counter(io)#27); [ 8.760390] lock(&root->kernfs_rwsem); [ 8.760391] *** DEADLOCK *** [ 8.760391] 6 locks held by (udev-worker)/674: [ 8.760392] #0: ffff8881209ac420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x72/0xf0 [ 8.760398] #1: ffff88810c80f488 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x136/0x250 [ 8.760402] #2: ffff888125d1d330 (kn->active#101){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x13f/0x250 [ 8.760406] #3: ffff888110dc7bb0 (&q->sysfs_lock){+.+.}-{3:3}, at: queue_attr_store+0x148/0x1e0 [ 8.760411] #4: ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760416] #5: ffff888110dc76b8 (&q->q_usage_counter(queue)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760421] stack backtrace: [ 8.760422] CPU: 7 UID: 0 PID: 674 Comm: (udev-worker) Tainted: G U 6.14.0-rc6-xe+ #7 [ 8.760424] Tainted: [U]=USER [ 8.760425] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [ 8.760426] Call Trace: [ 8.760427] <TASK> [ 8.760428] dump_stack_lvl+0x6e/0xa0 [ 8.760431] dump_stack+0x10/0x18 [ 8.760433] print_circular_bug.cold+0x17a/0x1b7 [ 8.760437] check_noncircular+0x13a/0x150 [ 8.760441] ? save_trace+0x54/0x360 [ 8.760445] __lock_acquire+0x1525/0x2760 [ 8.760446] ? irqentry_exit+0x3a/0xb0 [ 8.760448] ? sysvec_apic_timer_interrupt+0x57/0xc0 [ 8.760452] lock_acquire+0xca/0x310 [ 8.760453] ? kernfs_remove+0x32/0x60 [ 8.760457] down_write+0x3e/0xf0 [ 8.760459] ? 
kernfs_remove+0x32/0x60 [ 8.760460] kernfs_remove+0x32/0x60 [ 8.760462] sysfs_remove_dir+0x4f/0x60 [ 8.760464] __kobject_del+0x33/0xa0 [ 8.760466] kobject_del+0x13/0x30 [ 8.760467] elv_unregister_queue+0x52/0x80 [ 8.760470] elevator_switch+0x68/0x360 [ 8.760472] elv_iosched_store+0x14b/0x1b0 [ 8.760475] queue_attr_store+0x181/0x1e0 [ 8.760479] ? lock_acquire+0xca/0x310 [ 8.760480] ? kernfs_fop_write_iter+0x13f/0x250 [ 8.760482] ? lock_is_held_type+0xa3/0x130 [ 8.760485] sysfs_kf_write+0x49/0x80 [ 8.760487] kernfs_fop_write_iter+0x17d/0x250 [ 8.760489] vfs_write+0x281/0x540 [ 8.760494] ksys_write+0x72/0xf0 [ 8.760497] __x64_sys_write+0x19/0x30 [ 8.760499] x64_sys_call+0x2a3/0x2310 [ 8.760502] do_syscall_64+0x91/0x180 [ 8.760504] ? trace_hardirqs_off+0x5d/0xe0 [ 8.760506] ? handle_softirqs+0x479/0x4d0 [ 8.760508] ? hrtimer_interrupt+0x13f/0x280 [ 8.760511] ? irqentry_exit_to_user_mode+0x8b/0x260 [ 8.760513] ? clear_bhb_loop+0x15/0x70 [ 8.760515] ? clear_bhb_loop+0x15/0x70 [ 8.760516] ? clear_bhb_loop+0x15/0x70 [ 8.760518] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760520] RIP: 0033:0x7aa3bf2f5504 [ 8.760522] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 [ 8.760523] RSP: 002b:00007ffc1e3697d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 8.760526] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007aa3bf2f5504 [ 8.760527] RDX: 0000000000000003 RSI: 00007ffc1e369ae0 RDI: 000000000000001c [ 8.760528] RBP: 00007ffc1e369800 R08: 00007aa3bf3f51c8 R09: 00007ffc1e3698b0 [ 8.760528] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [ 8.760529] R13: 00007ffc1e369ae0 R14: 0000613ccf21f2f0 R15: 00007aa3bf3f4e80 [ 8.760533] </TASK> v2: - Update a code comment to increase readability (Ming Lei). 
Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250318095548.5187-1-thomas.hellstrom@linux.intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
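The "Chain exists of" line in the first report above describes a cycle in the lock dependency graph. As a rough illustration only (this is not how lockdep is implemented; the adjacency matrix and the names `reaches`, `would_deadlock`, and the lock indices are hypothetical), a depth-first search can tell whether taking a new lock while holding another would close such a cycle:

```c
#include <assert.h>
#include <stdbool.h>

#define NLOCKS 3

/* Hypothetical indices for the lock classes named in the report. */
enum { Q_USAGE_COUNTER, KERNFS_RWSEM, FS_RECLAIM };

/* dep[a][b] is true when lock b was acquired while lock a was held. */
static bool reaches(bool dep[NLOCKS][NLOCKS], int from, int to,
		    bool seen[NLOCKS])
{
	if (from == to)
		return true;
	if (seen[from])
		return false;
	seen[from] = true;
	for (int next = 0; next < NLOCKS; next++)
		if (dep[from][next] && reaches(dep, next, to, seen))
			return true;
	return false;
}

/* Acquiring "wanted" while holding "held" closes a cycle if the
 * recorded dependencies already lead from "wanted" back to "held". */
static bool would_deadlock(bool dep[NLOCKS][NLOCKS], int held, int wanted)
{
	bool seen[NLOCKS] = { false };

	return reaches(dep, wanted, held, seen);
}
```

With the edges q_usage_counter -> kernfs_rwsem -> fs_reclaim recorded, `would_deadlock(dep, FS_RECLAIM, Q_USAGE_COUNTER)` reports the cycle that kswapd hits in the first trace.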
2025-03-18block: fix a comment in the queue_attrs[] arrayChristoph Hellwig
queue_ra_entry uses limits_lock just like the attributes above it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250312150127.703534-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-13block: protect debugfs attribute method hctx_busy_showNilay Shroff
The hctx_busy_show method in debugfs is currently unprotected. This method iterates over all started requests in a tagset and prints them. However, the tags can be updated concurrently via the sysfs attributes 'nr_requests' or 'scheduler' (elevator switch), leading to potential race conditions. Since sysfs attributes 'nr_requests' and 'scheduler' are already protected using q->elevator_lock, extend this protection to the debugfs 'busy' attribute as well to ensure consistency. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313115235.3707600-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-13block: remove unnecessary goto labels in debugfs attribute read methodsNilay Shroff
In some debugfs attribute read methods, failure to acquire the mutex lock results in jumping to a label before returning an error code. However this is unnecessary, as we can return the failure code directly, improving code readability and reducing complexity. This commit removes the goto labels and ensures that the method returns immediately upon failing to acquire the mutex lock. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313115235.3707600-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
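The pattern described above can be sketched in user-space C. This is only an illustration under assumptions: `fake_mutex`, `lock_interruptible`, `unlock`, and `attr_show` are hypothetical names, and the acquire function merely imitates an interruptible mutex lock that can fail with an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel mutex; not a real API. */
struct fake_mutex {
	bool locked;
	bool interrupted;	/* simulates a signal arriving while waiting */
};

/* Returns 0 on success or -EINTR when the wait is interrupted. */
static int lock_interruptible(struct fake_mutex *m)
{
	if (m->interrupted)
		return -EINTR;
	m->locked = true;
	return 0;
}

static void unlock(struct fake_mutex *m)
{
	m->locked = false;
}

/* Read method: return the error directly instead of jumping to a label. */
static int attr_show(struct fake_mutex *m, int *out)
{
	int res = lock_interruptible(m);

	if (res)
		return res;	/* nothing to clean up, so no goto is needed */
	*out = 42;		/* placeholder for the value being shown */
	unlock(m);
	return 0;
}
```

Because nothing has been acquired when the lock attempt fails, there is no cleanup to share with the success path, which is why an early return reads more simply than a label.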
2025-03-13block: protect debugfs attrs using elevator_lock instead of sysfs_lockNilay Shroff
Currently, the block debugfs attributes (tags, tags_bitmap, sched_tags, and sched_tags_bitmap) are protected using q->sysfs_lock. However, these attributes are updated in multiple scenarios:
- During driver probe method
- During an elevator switch/update
- During an nr_hw_queues update
- When writing to the sysfs attribute nr_requests

All these update paths (except driver probe method, which doesn't require any protection) are already protected using q->elevator_lock. To ensure consistency and proper synchronization, replace q->sysfs_lock with q->elevator_lock for protecting these debugfs attributes.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313115235.3707600-2-nilay@linux.ibm.com [axboe: some commit message rewording/fixes] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-13block: remove unused parameter 'q' parameter in __blk_rq_map_sg()Anuj Gupta
request_queue param is no longer used by blk_rq_map_sg and __blk_rq_map_sg. Remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-13Merge tag 'md-6.15-20250312' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.15/block Merge MD changes from Yu: "- fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai)" * tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md/raid10: wait barrier before returning discard request with REQ_NOWAIT md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb md/raid1,raid10: don't ignore IO flags md/raid5: merge reshape_progress checking inside get_reshape_loc() md: fix mddev uaf while iterating all_mddevs list md: switch md-cluster to use md_submodle_head md: don't export md_cluster_ops md/md-cluster: cleanup md_cluster_ops reference md: switch personalities to use md_submodule_head md: introduce struct md_submodule_head and APIs md: only include md-cluster.h if necessary md: merge common code into find_pers() md/raid1: fix memory leak in raid1_run() if no active rdev md: ensure resync is prioritized over recovery
2025-03-12block: fix adding folio to bioMing Lei
Folios larger than 4GB are possible on some architectures; aarch64, for example, supports 16GB hugepages. In that case the 'offset' of a folio cannot be held in an 'unsigned int', which causes a warning in bio_add_folio_nofail() and IO failure. Fix it by adjusting 'page' and trimming 'offset' so that `->bi_offset` won't overflow and the folio can be added to the bio successfully. Fixes: ed9832bc08db ("block: introduce folio awareness and add a bigger size from folio") Cc: Kundan Kumar <kundan.kumar@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Gavin Shan <gshan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250312145136.2891229-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
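The idea of adjusting the page and trimming the offset can be sketched as plain arithmetic. This is a user-space illustration, not the kernel code: `SKETCH_PAGE_SIZE`, `struct page_ref`, and `trim_offset` are hypothetical, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size for this sketch */

/* Hypothetical result: index of the page inside the folio plus an
 * in-page offset that always fits in 32 bits. */
struct page_ref {
	uint64_t page_index;
	uint32_t offset;
};

/* Split a byte offset into (page, offset-within-page) so that the
 * remaining offset is always smaller than one page. */
static struct page_ref trim_offset(uint64_t byte_offset)
{
	struct page_ref ref;

	ref.page_index = byte_offset / SKETCH_PAGE_SIZE;
	ref.offset = (uint32_t)(byte_offset % SKETCH_PAGE_SIZE);
	return ref;
}
```

For an offset of 5GiB + 100 bytes, storing the raw value in 32 bits would silently drop the high bits, whereas the split form keeps the full position as page 1310720 with an in-page offset of 100.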
2025-03-12block: remove unused parameterGuixin Liu
The blk_mq_map_queue()'s request_queue param is not used anymore, remove it, same with blk_get_flush_queue(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://lore.kernel.org/r/20250312084722.129680-1-kanie@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0Coly Li
In _badblocks_check(), there are lines of code like this, 1246 sectors -= len; [snipped] 1251 WARN_ON(sectors < 0); The WARN_ON() at line 1251 doesn't make sense because sectors is of type unsigned long long and can never be < 0. Fix it by directly checking whether sectors is less than len. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Coly Li <colyli@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250309160556.42854-1-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
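A small user-space sketch shows why the check has to happen before the subtraction. The helper name `consume_sectors` is hypothetical and the kernel fix uses a WARN_ON() rather than a return value; only the comparison `sectors < len` comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Subtract len from *sectors, refusing to do so when it would wrap.
 * An unsigned value can never be negative, so the test must compare
 * against len before subtracting rather than against 0 afterwards. */
static bool consume_sectors(uint64_t *sectors, uint64_t len)
{
	if (*sectors < len)
		return false;
	*sectors -= len;
	return true;
}
```

Without the guard, subtracting 5 from an unsigned 3 wraps around to a very large value, and a later `< 0` test can never fire.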
2025-03-10block: make sure ->nr_integrity_segments is cloned in blk_rq_prep_cloneMing Lei
Make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone(), otherwise requests cloned by device-mapper multipath will not have the proper nr_integrity_segments values set, then BUG() is hit from sg_alloc_table_chained(). Fixes: b0fd271d5fba ("block: add request clone interface (v2)") Cc: stable@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250310115453.2271109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: protect hctx attributes/params using q->elevator_lockNilay Shroff
Currently, hctx attributes (nr_tags, nr_reserved_tags, and cpu_list) are protected using `q->sysfs_lock`. However, these attributes can be updated in multiple scenarios:
- During the driver's probe method.
- When updating nr_hw_queues.
- When writing to the sysfs attribute nr_requests, which can modify nr_tags.

The nr_requests attribute is already protected using q->elevator_lock, but none of the update paths actually use q->sysfs_lock to protect hctx attributes. So to ensure proper synchronization, replace q->sysfs_lock with q->elevator_lock when reading hctx attributes through sysfs.

Additionally, blk_mq_update_nr_hw_queues allocates and updates hctx. The allocation of hctx is protected using q->elevator_lock, however, updating hctx params happens without any protection, so safeguard hctx param update path by also using q->elevator_lock.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250306093956.2818808-1-nilay@linux.ibm.com [axboe: wrap comment at 80 chars] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: protect read_ahead_kb using q->limits_lockNilay Shroff
The bdi->ra_pages could be updated under q->limits_lock because it's usually calculated from the queue limits by queue_limits_commit_update. So protect reading/writing the sysfs attribute read_ahead_kb using q->limits_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-8-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: protect wbt_lat_usec using q->elevator_lockNilay Shroff
The wbt latency and state could be updated while initializing the elevator or exiting the elevator. It could also be updated while configuring IO latency QoS parameters using cgroup. The elevator code path is now protected with q->elevator_lock. So we should protect the access to sysfs attribute wbt_lat_usec using q->elevator_lock instead of q->sysfs_lock. While we're at it, also protect ioc_qos_write(), which configures wbt parameters via cgroup, using q->elevator_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-7-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: protect nr_requests update using q->elevator_lockNilay Shroff
The sysfs attribute nr_requests could be simultaneously updated from elevator switch/update or nr_hw_queue update code path. The update to nr_requests for each of those code paths runs holding q->elevator_lock. So we should protect access to sysfs attribute nr_requests using q->elevator_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-6-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: introduce a dedicated lock for protecting queue elevator updatesNilay Shroff
A queue's elevator can be updated either when modifying nr_hw_queues or through the sysfs scheduler attribute. Currently, elevator switching/updating is protected using q->sysfs_lock, but this has led to lockdep splats[1] due to inconsistent lock ordering between q->sysfs_lock and the freeze-lock in multiple block layer call sites. As the scope of q->sysfs_lock is not well-defined, its (mis)use has resulted in numerous lockdep warnings. To address this, introduce a new q->elevator_lock, dedicated specifically to protecting elevator switches/updates, and use it instead of q->sysfs_lock for that purpose. While at it, make elv_iosched_load_module() a static function, as it is only called from elv_iosched_store(). Also, remove redundant parameters from the elv_iosched_load_module() function signature. [1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-5-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: remove q->sysfs_lock for attributes which don't need itNilay Shroff
There are a few sysfs attributes in the block layer which don't really need q->sysfs_lock to be held while they are accessed. The reason is that reading or writing a value from/to such attributes is either atomic or can be easily protected using READ_ONCE()/WRITE_ONCE(). Moreover, sysfs attributes are inherently protected by sysfs/kernfs internal locking.

So this change helps segregate all existing sysfs attributes for which we can avoid acquiring q->sysfs_lock. For read-only attributes we removed q->sysfs_lock from the show method. For read/write attributes we removed q->sysfs_lock from both the show and store methods.

We audited all block sysfs attributes and found the following attributes which shouldn't require q->sysfs_lock protection:

1. io_poll: Write to this attribute is ignored, so we don't need q->sysfs_lock.
2. io_poll_delay: Write to this attribute is a NOP, so we don't need q->sysfs_lock.
3. io_timeout: Write to this attribute updates q->rq_timeout and read of this attribute returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once when we init the queue (under blk_mq_init_allocated_queue()), even before the disk is added. So we don't need to protect it with q->sysfs_lock. As this attribute is not directly correlated with anything else, simply using READ_ONCE/WRITE_ONCE should be enough.
4. nomerges: Write to this attribute file updates two q->flags: QUEUE_FLAG_NOMERGES and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge, which doesn't run with q->sysfs_lock held anyway. Moreover, the q->flags are updated/accessed with bitops, which are atomic. So protecting it with q->sysfs_lock is not necessary.
5. rq_affinity: Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi() using the test_bit macro. As read/write to q->flags uses bitops, which are atomic, protecting it with q->sysfs_lock is not necessary.
6. nr_zones: Write to this attribute happens in the driver probe method (except nvme) before the disk is added and outside of q->sysfs_lock or any other lock. Moreover, nr_zones is defined as "unsigned int", so reading this attribute, even when it's simultaneously being updated on another cpu, should not return a torn value on any architecture supported by linux. So we can avoid using q->sysfs_lock or any other lock/protection while reading this attribute.
7. discard_zeroes_data: Reading of this attribute always returns 0, so we don't require holding q->sysfs_lock.
8. write_same_max_bytes: Reading of this attribute always returns 0, so we don't require holding q->sysfs_lock.

Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
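The two lock-free access patterns named above (atomic bit operations on a flags word, and single-word loads/stores) can be approximated in user space with C11 atomics. This is an analogy under assumptions, not the kernel implementation: READ_ONCE()/WRITE_ONCE() and kernel bitops are different primitives, and `struct sketch_queue`, the `FLAG_*` bit numbers, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the real QUEUE_FLAG_* values are not
 * reproduced here. */
enum { FLAG_NOMERGES = 0, FLAG_NOXMERGES = 1 };

struct sketch_queue {
	atomic_ulong flags;			/* changed only by atomic bit ops */
	_Atomic unsigned int rq_timeout;	/* one word, no lock needed */
};

static void flag_set(struct sketch_queue *q, int bit)
{
	atomic_fetch_or(&q->flags, 1UL << bit);
}

static void flag_clear(struct sketch_queue *q, int bit)
{
	atomic_fetch_and(&q->flags, ~(1UL << bit));
}

static bool flag_test(struct sketch_queue *q, int bit)
{
	return (atomic_load(&q->flags) & (1UL << bit)) != 0;
}

/* Relaxed store/load: a rough user-space analogue of WRITE_ONCE/READ_ONCE. */
static void timeout_store(struct sketch_queue *q, unsigned int v)
{
	atomic_store_explicit(&q->rq_timeout, v, memory_order_relaxed);
}

static unsigned int timeout_load(struct sketch_queue *q)
{
	return atomic_load_explicit(&q->rq_timeout, memory_order_relaxed);
}
```

Each helper touches a single word with one indivisible operation, so a concurrent reader sees either the old or the new value and never a torn one, which is the property the commit message relies on.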
2025-03-10block: move q->sysfs_lock and queue-freeze under show/store methodNilay Shroff
In preparation to further simplify and group sysfs attributes which don't require locking or require some form of locking other than q->limits_lock, move acquire/release of q->sysfs_lock and queue freeze/unfreeze under each attribute's respective show/store method. While we are at it, also remove ->load_module(), as it's used to load the module before the queue is frozen. Now that queue-freeze has moved under ->store(), we can load the module directly from the attribute's store method before we actually start freezing the queue. Currently, ->load_module() is only used by the "scheduler" attribute, so we now load the relevant elevator module before we start freezing the queue in elv_iosched_store(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: acquire q->limits_lock while reading sysfs attributesNilay Shroff
There are a few sysfs attributes (RW) whose store method is protected with q->limits_lock, while the corresponding show method runs holding q->sysfs_lock. That doesn't make sense, as the show method of these attributes should also run holding q->limits_lock instead of q->sysfs_lock. Hence update the show method of these sysfs attributes so that reading them acquires q->limits_lock instead of q->sysfs_lock. Similarly, there are a few sysfs attributes (RO) whose show method is currently protected with q->sysfs_lock, although updates to these attributes can occur through the atomic limit update APIs such as queue_limits_start_update() and queue_limits_commit_update(), which run holding q->limits_lock. So reading these attributes while holding q->sysfs_lock doesn't make sense. Hence update the show method of these sysfs attributes (RO) so that they run holding q->limits_lock instead of q->sysfs_lock. We have defined a new macro QUEUE_LIM_RO_ENTRY() which uses the new ->show_limit() method, which runs holding q->limits_lock. All existing sysfs attributes (RO) which need protection using q->limits_lock while reading have now been updated to use this new macro for initialization. Also, the existing QUEUE_LIM_RW_ENTRY() is updated to use the new ->show_limit() method for reading attributes instead of the existing ->show() method. As ->show_limit() runs holding q->limits_lock, the existing sysfs attributes (RW) requiring protection are now inherently protected using q->limits_lock instead of q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06badblocks: use sector_t instead of int to avoid truncation of badblocks lengthZheng Qixing
The badblocks length is truncated when bad blocks are set as follows:

  echo "2055 4294967299" > bad_blocks
  cat bad_blocks
  2055 3

Change the type of the 'sectors' argument from 'int' to 'sector_t'. This avoids truncation of the badblocks length for large sector counts by replacing 'int' with 'sector_t' (u64), enabling proper handling of larger disk sizes and ensuring compatibility with 64-bit sector addressing.

Fixes: 9e0e252a048b ("badblocks: Add core badblock management code") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
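The value 3 in the example above is what remains of 4294967299 once only the low 32 bits are kept (4294967299 = 2^32 + 3). A minimal sketch, assuming a 32-bit narrowing; the helper `truncate_len` is hypothetical, and it uses `uint32_t` rather than `int` so that the conversion is well defined in standard C:

```c
#include <assert.h>
#include <stdint.h>

/* Narrowing a 64-bit sector count to 32 bits keeps only the low
 * 32 bits, which is how a large length can silently become small. */
static uint32_t truncate_len(uint64_t sectors)
{
	return (uint32_t)sectors;
}
```

Keeping the length in a 64-bit type end to end, as the commit does with sector_t, avoids the narrowing step altogether.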