summaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)Author
2024-07-12md/raid1: set max_sectors during early return from choose_slow_rdev()Mateusz Jończyk
Linux 6.9+ is unable to start a degraded RAID1 array with one drive, when that drive has a write-mostly flag set. During such an attempt, the following assertion in bio_split() is hit: BUG_ON(sectors <= 0); Call Trace: ? bio_split+0x96/0xb0 ? exc_invalid_op+0x53/0x70 ? bio_split+0x96/0xb0 ? asm_exc_invalid_op+0x1b/0x20 ? bio_split+0x96/0xb0 ? raid1_read_request+0x890/0xd20 ? __call_rcu_common.constprop.0+0x97/0x260 raid1_make_request+0x81/0xce0 ? __get_random_u32_below+0x17/0x70 ? new_slab+0x2b3/0x580 md_handle_request+0x77/0x210 md_submit_bio+0x62/0xa0 __submit_bio+0x17b/0x230 submit_bio_noacct_nocheck+0x18e/0x3c0 submit_bio_noacct+0x244/0x670 After investigation, it turned out that choose_slow_rdev() does not set the value of max_sectors in some cases and because of it, raid1_read_request calls bio_split with sectors == 0. Fix it by filling in this variable. This bug was introduced in commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()") but apparently hidden until commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()") shortly thereafter. Cc: stable@vger.kernel.org # 6.9.x+ Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()") Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Paul Luse <paul.e.luse@linux.intel.com> Cc: Xiao Ni <xni@redhat.com> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/ -- Tested on both Linux 6.10 and 6.9.8. Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems: ./test --dev=loop --no-error --raidtype=raid1 (on 6.9.8 there was one failure, caused by external bitmap support not compiled in). Notes: - I was reliably getting deadlocks when adding / removing devices on such an array - while the array was loaded with fsstress with 20 concurrent processes. When the array was idle or loaded with fsstress with 8 processes, no such deadlocks happened in my tests. This occurred also on unpatched Linux 6.8.0 though, but not on 6.1.97-rc1, so this is likely an independent regression (to be investigated). - I was also getting deadlocks when adding / removing the bitmap on the array in similar conditions - this happened on Linux 6.1.97-rc1 also though. fsstress with 8 concurrent processes did cause it only once during many tests. - in my testing, there was once a problem with hot adding an internal bitmap to the array: mdadm: Cannot add bitmap while array is resyncing or reshaping etc. mdadm: failed to set internal bitmap. even though no such reshaping was happening according to /proc/mdstat. This seems unrelated, though. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240711202316.10775-1-mat.jonczyk@o2.pl
2024-07-12md-cluster: fix no recovery job when adding/re-adding a diskHeming Zhao
The commit db5e653d7c9f ("md: delay choosing sync action to md_start_sync()") delays the start of the sync action. In a clustered environment, this will cause another node to first activate the spare disk and skip recovery. As a result, no nodes will perform recovery when a disk is added or re-added. Before db5e653d7c9f: ``` node1 node2 ---------------------------------------------------------------- md_check_recovery + md_update_sb | sendmsg: METADATA_UPDATED + md_choose_sync_action process_metadata_update | remove_and_add_spares //node1 has not finished adding + call mddev->sync_work //the spare disk:do nothing md_start_sync starts md_do_sync md_do_sync + grabbed resync_lockres:DLM_LOCK_EX + do syncing job md_check_recovery sendmsg: METADATA_UPDATED process_metadata_update //activate spare disk ... ... md_do_sync waiting to grab resync_lockres:EX ``` After db5e653d7c9f: (note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery starts md_start_sync, setting the INTR action will exacerbate the delay in node1 calling the md_do_sync function.) ``` node1 node2 ---------------------------------------------------------------- md_check_recovery + md_update_sb | sendmsg: METADATA_UPDATED + calls mddev->sync_work process_metadata_update //node1 has not finished adding //the spare disk:do nothing md_start_sync + md_choose_sync_action | remove_and_add_spares + calls md_do_sync md_check_recovery md_update_sb sendmsg: METADATA_UPDATED process_metadata_update //activate spare disk ... ... ... ... md_do_sync + grabbed resync_lockres:EX + raid1_sync_request skip sync under conf->fullsync:0 md_do_sync 1. waiting to grab resync_lockres:EX 2. when node1 could grab EX lock, node1 will skip resync under recovery_offset:MaxSector ``` How to trigger: ```(commands @node1) # to easily watch the recovery status echo 2000 > /proc/sys/dev/raid/speed_limit_max ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max" mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda mdadm --manage /dev/md0 --add /dev/sdc === "cat /proc/mdstat" on both node, there are no recovery action. === ``` How to fix: because md layer code logic is hard to restore for speeding up sync job on local node, we add new cluster msg to pending the another node to active disk. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Acked-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com
2024-07-12md-cluster: fix hanging issue while a new disk addingHeming Zhao
The commit 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") is correct in terms of code syntax but not suite real clustered code logic. When a timeout occurs while adding a new disk, if recv_daemon() bypasses the unlock for ack_lockres:CR, another node will be waiting to grab EX lock. This will cause the cluster to hang indefinitely. How to fix: 1. In dlm_lock_sync(), change the wait behaviour from forever to a timeout, This could avoid the hanging issue when another node fails to handle cluster msg. Another result of this change is that if another node receives an unknown msg (e.g. a new msg_type), the old code will hang, whereas the new code will timeout and fail. This could help cluster_md handle new msg_type from different nodes with different kernel/module versions (e.g. The user only updates one leg's kernel and monitors the stability of the new kernel). 2. The old code for __sendmsg() always returns 0 (success) under the design (must successfully unlock ->message_lockres). This commit makes this function return an error number when an error occurs. Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Acked-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com
2024-07-10floppy: add missing MODULE_DESCRIPTION() macroJeff Johnson
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/floppy.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Denis Efremov <efremov@linux.com> Link: https://lore.kernel.org/r/20240602-md-block-floppy-v1-1-bc628ea5eb84@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10loop: add missing MODULE_DESCRIPTION() macroJeff Johnson
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/loop.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240602-md-block-loop-v1-1-b9b7e2603e72@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10ublk_drv: add missing MODULE_DESCRIPTION() macroJeff Johnson
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/ublk_drv.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240602-md-block-ublk_drv-v1-1-995474cafff0@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-10xen/blkback: add missing MODULE_DESCRIPTION() macroJeff Johnson
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/xen-blkback/xen-blkback.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240602-md-block-xen-blkback-v1-1-6ff5b58bdee1@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09block/rnbd: Constify struct kobj_typeChristophe JAILLET
'struct kobj_type' is not modified in this driver. It is only used with kobject_init_and_add() which takes a "const struct kobj_type *" parameter. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 4082 792 8 4882 1312 drivers/block/rnbd/rnbd-srv-sysfs.o After: ===== text data bss dec hex filename 4210 672 8 4890 131a drivers/block/rnbd/rnbd-srv-sysfs.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/e3d454173ffad30726c9351810d3aa7b75122711.1720462252.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09loop: Don't bother validating blocksizeJohn Garry
The block queue limits validation does this for us now. The loop_configure() -> WARN_ON_ONCE() call is dropped, as an invalid block size would trigger this now. We don't want userspace to be able to directly trigger WARNs. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Link: https://lore.kernel.org/r/20240708091651.177447-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09virtio_blk: Don't bother validating blocksizeJohn Garry
The block queue limits validation does this for us now. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240708091651.177447-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09null_blk: Don't bother validating blocksizeJohn Garry
The block queue limits validation does this for us now. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Link: https://lore.kernel.org/r/20240708091651.177447-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09virtio_blk: Fix default logical block size fallbackJohn Garry
If we fail to read a logical block size in virtblk_read_limits() -> virtio_cread_feature(), then we default to what is in lim->logical_block_size, but that would be 0. We can deal with lim->logical_block_size = 0 later in the blk_mq_alloc_disk(), but the code in virtblk_read_limits() needs a proper default, so give a default of SECTOR_SIZE. Fixes: 27e32cd23fed ("block: pass a queue_limits argument to blk_mq_alloc_disk") Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240708091651.177447-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-08Merge tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme into ↵Jens Axboe
for-6.11/block Pull NVMe updates from Keith: "nvme updates for Linux 6.11 - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng)" * tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme: (21 commits) nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id nvme-multipath: implement "queue-depth" iopolicy nvme-multipath: prepare for "queue-depth" iopolicy nvme-pci: do not directly handle subsys reset fallout lpfc_nvmet: implement 'host_traddr' nvme-fcloop: implement 'host_traddr' nvmet-fc: implement host_traddr() nvmet-rdma: implement host_traddr() nvmet-tcp: implement host_traddr() nvmet: add 'host_traddr' callback for debugfs nvmet: add debugfs support mailmap: add entry for Weiwen Hu nvme: rename CDR/MORE/DNR to NVME_STATUS_* nvme: fix status magic numbers nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_err nvme: split device add from initialization nvme: fc: split controller bringup handling nvme: rdma: split controller bringup handling nvme: tcp: split controller bringup handling ...
2024-07-08nvmet-auth: fix nvmet_auth hash error handlingGaosheng Cui
If we fail to call nvme_auth_augmented_challenge, or fail to kmalloc for shash, we should free the memory allocation for challenge, so add err path out_free_challenge to fix the memory leak. Fixes: 7a277c37d352 ("nvmet-auth: Diffie-Hellman key exchange support") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-08nvme: implement ->get_unique_idChristoph Hellwig
Implement the get_unique_id method to allow pNFS SCSI layout access to NVMe namespaces. This is the server side implementation of RFC 9561 "Using the Parallel NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe) Storage Devices". Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-05block: Remove REQ_OP_ZONE_RESET_ALL emulationDamien Le Moal
Now that device mapper can handle resetting all zones of a mapped zoned device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers support this operation. With this, the request queue feature BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in blk-zone.c can be removed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05dm: handle REQ_OP_ZONE_RESET_ALLDamien Le Moal
This commit implements processing of the REQ_OP_ZONE_RESET_ALL operation for zoned mapped devices. Given that this operation always has a BIO sector of 0 and a 0 size, processing through the regular BIO __split_and_process_bio() function does not work because this function would always select the first target. Instead, handling of this operation is implemented using the function __send_zone_reset_all(). Similarly to the __send_empty_flush() function, the new __send_zone_reset_all() function manually goes through all targets of a mapped device table doing the following: 1) If the target can natively support REQ_OP_ZONE_RESET_ALL, __send_duplicate_bios() is used to forward the reset all operation to the target. This case is handled with the __send_zone_reset_all_native() function. 2) For other targets, the function __send_zone_reset_all_emulated() is executed to emulate the execution of REQ_OP_ZONE_RESET_ALL using regular REQ_OP_ZONE_RESET operations. Targets that can natively support REQ_OP_ZONE_RESET_ALL are identified using the new target field zone_reset_all_supported. This boolean is set to true in for targets that have reliable zone limits, that is, targets that map all sequential write required zones of their zoned device(s). Setting this field is handled in dm_set_zones_restrictions() and device_get_zone_resource_limits(). For targets with unreliable zone limits, REQ_OP_ZONE_RESET_ALL must be emulated (case 2 above). This is implemented with __send_zone_reset_all_emulated() and is similar to the block layer function blkdev_zone_reset_all_emulated(): first a report zones is done for the zones of the target to identify zones that need reset, that is, any sequential write required zone that is not already empty. This is done using a bitmap and the function dm_zone_get_reset_bitmap() which sets to 1 the bit corresponding to a zone that needs reset. Next, this zone bitmap is inspected and a clone BIO modified to use the REQ_OP_ZONE_RESET operation issued for any zone with its bit set in the zone bitmap. This implementation is more efficient than what the block layer does with blkdev_zone_reset_all_emulated(), which is always used for DM zoned devices currently: as we can natively use REQ_OP_ZONE_RESET_ALL on targets mapping all sequential write required zones, resetting all zones of a zoned mapped device can be much faster compared to always emulating this operation using regular per-zone reset. In the worst case, this implementation is as-efficient as the block layer emulation. This reduction in the time it takes to reset all zones of a zoned mapped device depends directly on the mapped device targets mapping (reliable zone limits or not). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05dm: Refactor is_abnormal_io()Damien Le Moal
Use a single switch-case to simplify is_abnormal_io() and make this function more readable and easier to modify. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05null_blk: Introduce the zone_full parameterDamien Le Moal
Allow creating a zoned null_blk device with the initial state of its sequential write required zones to be FULL. This is convenient to avoid having to first write these zones to perform read performance evaluation or test zone management operations such as zone reset (and zone reset all). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05loop: remove the unused inode variable in loop_configureChristoph Hellwig
Remove the inode variable now that the last user is gone. Fixes: a17ece76bcfe ("loop: regularize upgrading the block size for direct I/O") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240705053114.2042976-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04null_blk: don't initialize static 'g_virt_boundary' to falseZhu Yanjun
No functional changes intended. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://lore.kernel.org/r/20240704010638.324349-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04md/raid5: recheck if reshape has finished with device_lock heldBenjamin Marzinski
When handling an IO request, MD checks if a reshape is currently happening, and if so, where the IO sector is in relation to the reshape progress. MD uses conf->reshape_progress for both of these tasks. When the reshape finishes, conf->reshape_progress is set to MaxSector. If this occurs after MD checks if the reshape is currently happening but before it calls ahead_of_reshape(), then ahead_of_reshape() will end up comparing the IO sector against MaxSector. During a backwards reshape, this will make MD think the IO sector is in the area not yet reshaped, causing it to use the previous configuration, and map the IO to the sector where that data was before the reshape. This bug can be triggered by running the lvm2 lvconvert-raid-reshape-linear_to_raid6-single-type.sh test in a loop, although it's very hard to reproduce. Fix this by factoring the code that checks where the IO sector is in relation to the reshape out to a helper called get_reshape_loc(), which reads reshape_progress and reshape_safe while holding the device_lock, and then rechecks if the reshape has finished before calling ahead_of_reshape with the saved values. Also use the helper during the REQ_NOWAIT check to see if the location is inside of the reshape region. Fixes: fef9c61fdfabf ("md/raid5: change reshape-progress measurement to cope with reshaping backwards.") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240702151802.1632010-1-bmarzins@redhat.com
2024-07-04md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctlYu Kuai
Commit 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.") explained in the commit message that failed device must be reomoved from the personality first by md_check_recovery(), before it can be removed from the array. That's the reason the commit add the code to wait for MD_RECOVERY_NEEDED. However, this is not the case now, because remove_and_add_spares() is called directly from hot_remove_disk() from ioctl path, hence failed device(marked faulty) can be removed from the personality by ioctl. On the other hand, the commit introduced a performance problem that if MD_RECOVERY_NEEDED is set and the array is not running, ioctl will wait for 5s before it can return failure to user. Since the waiting is not needed now, fix the problem by removing the waiting. Fixes: 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.") Reported-by: Mateusz Kusiak <mateusz.kusiak@linux.intel.com> Closes: https://lore.kernel.org/all/814ff6ee-47a2-4ba0-963e-cf256ee4ecfa@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240627112321.3044744-1-yukuai1@huaweicloud.com
2024-07-04md-cluster: Constify struct md_cluster_operationsChristophe JAILLET
'struct md_cluster_operations' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 51941 1442 80 53463 d0d7 drivers/md/md-cluster.o After: ===== text data bss dec hex filename 52133 1246 80 53459 d0d3 drivers/md/md-cluster.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.1719173852.git.christophe.jaillet@wanadoo.fr
2024-07-04md: Remove unneeded semicolonYang Li
./drivers/md/md.c:630:21-22: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9344 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240618010759.85416-1-yang.lee@linux.alibaba.com
2024-07-04md/raid5: fix spares errors about rcu usageYu Kuai
As commit ad8606702f26 ("md/raid5: remove rcu protection to access rdev from conf") explains, rcu protection can be removed, however, there are three places left, there won't be any real problems. drivers/md/raid5.c:8071:24: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:8071:24: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:8071:24: struct md_rdev * drivers/md/raid5.c:7569:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7569:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7569:25: struct md_rdev * drivers/md/raid5.c:7573:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7573:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7573:25: struct md_rdev * Fixes: ad8606702f26 ("md/raid5: remove rcu protection to access rdev from conf") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240615085143.1648223-1-yukuai1@huaweicloud.com
2024-07-02xen-blkfront: fix sector_size propagation to the block layerChristoph Hellwig
Ensure that info->sector_size and info->physical_sector_size are set before the call to blkif_set_queue_limits by doing away with the local variables and arguments that propagate them. Thanks to Marek Marczykowski-Górecki and Jürgen Groß for root causing the issue. Fixes: ba3f67c11638 ("xen-blkfront: atomically update queue limits") Reported-by: Rusty Bird <rustybird@net-c.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20240625055238.7934-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02null_blk: Fix description of the fua parameterDamien Le Moal
The description of the fua module parameter is defined using MODULE_PARM_DESC() with the first argument passed being "zoned". That is the wrong name, obviously. Fix that by using the correct "fua" parameter name so that "modinfo null_blk" displays correct information. Fixes: f4f84586c8b9 ("null_blk: Introduce fua attribute") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240702073234.206458-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02nvme-multipath: implement "queue-depth" iopolicyThomas Song
The round-robin path selector is inefficient in cases where there is a difference in latency between paths. In the presence of one or more high latency paths the round-robin selector continues to use the high latency path equally. This results in a bias towards the highest latency path and can cause a significant decrease in overall performance as IOs pile on the highest latency path. This problem is acute with NVMe-oF controllers. The queue-depth path selector sends I/O down the path with the lowest number of requests in its request queue. Paths with lower latency will clear requests more quickly and have less requests queued compared to higher latency paths. The goal of this path selector is to make more use of lower latency paths which will bring down overall IO latency and increase throughput and performance. Signed-off-by: Thomas Song <tsong@purestorage.com> [emilne: commandeered patch developed by Thomas Song @ Pure Storage] Co-developed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ewan D. Milne <emilne@redhat.com> Co-developed-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com> Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/ Tested-by: Marco Patalano <mpatalan@redhat.com> Tested-by: Jyoti Rani <jrani@purestorage.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Randy Jennings <randyj@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-01nvme: don't set io_opt if NOWS is zeroChristoph Hellwig
NOWS is one of the annoying "0's based values" in NVMe, where 0 means one and we thus can't detect if it isn't set. Thus a NOWS value of 0 means that the Namespace Optimal Write Size is a single LBA, which is clearly bogus. Ignore the value in that case and don't propagate an io_opt value to the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240701051800.1245240-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28rnbd-cnt: don't set QUEUE_FLAG_SAME_FORCEChristoph Hellwig
QUEUE_FLAG_SAME_FORCE has been set by rnbd-cnt since the initial merge. There is no good reason for a driver to force exact core delivery, which is tunable for very specific workloads and not a driver setting. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28rnbd: don't set QUEUE_FLAG_SAME_COMPChristoph Hellwig
QUEUE_FLAG_SAME_COMP is already set by default. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28mpt3sas_scsih: don't set QUEUE_FLAG_NOMERGESChristoph Hellwig
Setting QUEUE_FLAG_NOMERGES was added in commit d1b01d14b7ba ("scsi: mpt3sas: Set NVMe device queue depth as 128") without any explanation. Drivers should second guess the block layer merge decisions, so remove the flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28megaraid_sas: don't set QUEUE_FLAG_NOMERGESChristoph Hellwig
Setting QUEUE_FLAG_NOMERGES was added in commit 15dd03811d99dcf ("scsi: megaraid_sas: NVME Interface detection and prop settings") without any explanation. Drivers should second guess the block layer merge decisions, so remove the flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28loop: don't set QUEUE_FLAG_NOMERGESChristoph Hellwig
QUEUE_FLAG_NOMERGES isn't really a driver interface, but a user tunable. There also isn't any good reason to set it in the loop driver. The original commit adding it (5b5e20f421c0b6d "block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop") claims that "It doesn't make sense to enable merge because the I/O submitted to backing file is handled page by page." which of course isn't true for multi-page bvec now, and it never has been for direct I/O, for which commit 40326d8a33d ("block/loop: allow request merge for directio mode") alredy disabled the nomerges flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28bcache: work around a __bitwise to bool conversion sparse warningChristoph Hellwig
Sparse is a bit dumb about bitwise operation on __bitwise types used in boolean contexts. Add a !! to explicitly propagate to boolean without a warning. Fixes: fcf865e357f8 ("block: convert features and flags to __bitwise types") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240628131657.667797-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-27loop: Fix a race between loop detach and loop openGulam Mohamed
1. Userspace sends the command "losetup -d" which uses the open() call to open the device 2. Kernel receives the ioctl command "LOOP_CLR_FD" which calls the function loop_clr_fd() 3. If LOOP_CLR_FD is the first command received at the time, then the AUTOCLEAR flag is not set and deletion of the loop device proceeds ahead and scans the partitions (drop/add partitions) if (disk_openers(lo->lo_disk) > 1) { lo->lo_flags |= LO_FLAGS_AUTOCLEAR; loop_global_unlock(lo, true); return 0; } 4. Before scanning partitions, it will check to see if any partition of the loop device is currently opened 5. If any partition is opened, then it will return EBUSY: if (disk->open_partitions) return -EBUSY; 6. So, after receiving the "LOOP_CLR_FD" command and just before the above check for open_partitions, if any other command (like blkid) opens any partition of the loop device, then the partition scan will not proceed and EBUSY is returned as shown in above code 7. But in "__loop_clr_fd()", this EBUSY error is not propagated 8. We have noticed that this is causing the partitions of the loop to remain stale even after the loop device is detached resulting in the IO errors on the partitions Fix: Defer the detach of loop device to release function, which is called when the last close happens, by setting the lo_flags to LO_FLAGS_AUTOCLEAR at the time of detach i.e in loop_clr_fd() function. Test case involves the following two scripts: script1.sh: while [ 1 ]; do losetup -P -f /home/opt/looptest/test10.img blkid /dev/loop0p1 done script2.sh: while [ 1 ]; do losetup -d /dev/loop0 done Without fix, the following IO errors have been observed: kernel: __loop_clr_fd: partition scan of loop0 failed (rc=-16) kernel: I/O error, dev loop0, sector 20971392 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 kernel: I/O error, dev loop0, sector 108868 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 kernel: Buffer I/O error on dev loop0p1, logical block 27201, async page read Signed-off-by: Gulam Mohamed <gulam.mohamed@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240618164042.343777-1-gulam.mohamed@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26nvme-multipath: prepare for "queue-depth" iopolicyJohn Meneghini
This patch prepares for the introduction of a new iopolicy by breaking up the nvme_find_path() code path into sub-routines. Signed-off-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-26block: move dma_pad_mask into queue_limitsChristoph Hellwig
dma_pad_mask is a queue_limits by all ways of looking at it, so move it there and set it through the atomic queue limits APIs. Add a little helper that takes the alignment and pad into account to simplify the code that is touched a bit. Note that there never was any need for the > check in blk_queue_update_dma_pad, this probably was just copy and paste from dma_update_dma_alignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240626142637.300624-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26md: set md-specific flags for all queue limitsChristoph Hellwig
The md driver wants to enforce a number of flags for all devices, even when not inheriting them from the underlying devices. To make sure these flags survive the queue_limits_set calls that md uses to update the queue limits without deriving them form the previous limits add a new md_init_stacking_limits helper that calls blk_set_stacking_limits and sets these flags. Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26block: change rq_integrity_vec to respect the iteratorMikulas Patocka
If we allocate a bio that is larger than NVMe maximum request size, attach integrity metadata to it and send it to the NVMe subsystem, the integrity metadata will be corrupted. Splitting the bio works correctly. The function bio_split will clone the bio, trim the iterator of the first bio and advance the iterator of the second bio. However, the function rq_integrity_vec has a bug - it returns the first vector of the bio's metadata and completely disregards the metadata iterator that was advanced when the bio was split. Thus, the second bio uses the same metadata as the first bio and this leads to metadata corruption. This commit changes rq_integrity_vec, so that it calls mp_bvec_iter_bvec instead of returning the first vector. mp_bvec_iter_bvec reads the iterator and uses it to build a bvec for the current position in the iterator. The "queue_max_integrity_segments(rq->q) > 1" check was removed, because the updated rq_integrity_vec function works correctly with multiple segments. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/49d1afaa-f934-6ed2-a678-e0d428c63a65@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26nvme-pci: do not directly handle subsys reset falloutKeith Busch
Scheduling reset_work after a nvme subsystem reset is expected to fail on pcie, but this also prevents potential handling the platform's pcie services may provide that might successfully recovering the link without re-enumeration. Such examples include AER, DPC, and power's EEH. Provide a pci specific operation that safely initiates a subsystem reset, and instead of scheduling reset work, read back the status register to trigger a pcie read error. Since this only affects pci, the other fabrics drivers subscribe to a generic nvmf subsystem reset that is exactly the same as before. The loop fabric doesn't use it because nvmet doesn't support setting that property anyway. And since we're using the magic NSSR value in two places now, provide a symbolic define for it. Reported-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24lpfc_nvmet: implement 'host_traddr'Hannes Reinecke
Implement the 'host_traddr' callback to display the host transport address for nvmet debugfs. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvme-fcloop: implement 'host_traddr'Hannes Reinecke
Implement the 'host_traddr' callback to display the host transport address for nvmet debugfs. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvmet-fc: implement host_traddr()Hannes Reinecke
Implement callback to display the host transport address by adding a callback 'host_traddr' for nvmet_fc_target_template. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvmet-rdma: implement host_traddr()Hannes Reinecke
Implement callback to display the host transport address. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvmet-tcp: implement host_traddr()Hannes Reinecke
Implement callback to display the host transport address. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvmet: add 'host_traddr' callback for debugfsHannes Reinecke
We want to display the transport address of the connected host in debugfs, but this is a property of the transport. So add a callback 'host_traddr' to allow the transport drivers to fill in the data. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvmet: add debugfs supportHannes Reinecke
Add a debugfs hierarchy to display the configured subsystems and the controllers attached to the subsystems. Suggested-by: Redouane BOUFENGHOUR <redouane.boufenghour@shadow.tech> Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvme: rename CDR/MORE/DNR to NVME_STATUS_*Weiwen Hu
CDR/MORE/DNR fields are not belonging to SC in the NVMe spec, rename them to NVME_STATUS_* to avoid confusion. Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>