linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2022-11-24	drbd: remove call to memset before free device/resource/connection	Wang ShaoBo
	This revert c2258ffc56f2 ("drbd: poison free'd device, resource and connection structs"), add memset is odd here for debugging, there are some methods to accurately show what happened, such as kdump. Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Link: https://lore.kernel.org/r/20221124015817.2729789-2-bobo.shaobowang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-24	block: mq-deadline: Do not break sequential write streams to zoned HDDs	Damien Le Moal
	mq-deadline ensures an in order dispatching of write requests to zoned block devices using a per zone lock (a bit). This implies that for any purely sequential write workload, the drive is exercised most of the time at a maximum queue depth of one. However, when such sequential write workload crosses a zone boundary (when sequentially writing multiple contiguous zones), zone write locking may prevent the last write to one zone to be issued (as the previous write is still being executed) but allow the first write to the following zone to be issued (as that zone is not yet being writen and not locked). This result in an out of order delivery of the sequential write commands to the device every time a zone boundary is crossed. While such behavior does not break the sequential write constraint of zoned block devices (and does not generate any write error), some zoned hard-disks react badly to seeing these out of order writes, resulting in lower write throughput. This problem can be addressed by always dispatching the first request of a stream of sequential write requests, regardless of the zones targeted by these sequential writes. To do so, the function deadline_skip_seq_writes() is introduced and used in deadline_next_request() to select the next write command to issue if the target device is an HDD (blk_queue_nonrot() being false). deadline_fifo_request() is modified using the new deadline_earlier_request() and deadline_is_seq_write() helpers to ignore requests in the fifo list that have a preceding request in lba order that is sequential. With this fix, a sequential write workload executed with the following fio command: fio --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \ --size=68719476736 --ioengine=libaio --iodepth=32 --rw=write \ --bs=65536 results in an increase from 225 MB/s to 250 MB/s of the write throughput of an SMR HDD (11% increase). Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-24	block: mq-deadline: Fix dd_finish_request() for zoned devices	Damien Le Moal
	dd_finish_request() tests if the per prio fifo_list is not empty to determine if request dispatching must be restarted for handling blocked write requests to zoned devices with a call to blk_mq_sched_mark_restart_hctx(). While simple, this implementation has 2 problems: 1) Only the priority level of the completed request is considered. However, writes to a zone may be blocked due to other writes to the same zone using a different priority level. While this is unlikely to happen in practice, as writing a zone with different IO priorirites does not make sense, nothing in the code prevents this from happening. 2) The use of list_empty() is dangerous as dd_finish_request() does not take dd->lock and may run concurrently with the insert and dispatch code. Fix these 2 problems by testing the write fifo list of all priority levels using the new helper dd_has_write_work(), and by testing each fifo list using list_empty_careful(). Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23	blk-crypto: Add a missing include directive	Bart Van Assche
	Allow the compiler to verify consistency of function declarations and function definitions. This patch fixes the following sparse errors: block/blk-crypto-profile.c:241:14: error: no previous prototype for ‘blk_crypto_get_keyslot’ [-Werror=missing-prototypes] 241 \| blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile profile, \| ^~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:318:6: error: no previous prototype for ‘blk_crypto_put_keyslot’ [-Werror=missing-prototypes] 318 \| void blk_crypto_put_keyslot(struct blk_crypto_keyslot slot) \| ^~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:344:6: error: no previous prototype for ‘__blk_crypto_cfg_supported’ [-Werror=missing-prototypes] 344 \| bool __blk_crypto_cfg_supported(struct blk_crypto_profile profile, \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:373:5: error: no previous prototype for ‘__blk_crypto_evict_key’ [-Werror=missing-prototypes] 373 \| int __blk_crypto_evict_key(struct blk_crypto_profile profile, \| ^~~~~~~~~~~~~~~~~~~~~~ Cc: Eric Biggers <ebiggers@google.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221123172923.434339-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23	elevator: remove an outdated comment in elevator_change	Jinlong Chen
	mq is no longer a special case. Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/cbf47824fc726440371e74c867bf635ae1b671a3.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23	elevator: update the document of elevator_match	Jinlong Chen
	elevator_match does not care about elevator_features any more. Remove related descriptions from its document. Fixes: ffb86425ee2c ("block: don't check for required features in elevator_match") Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/a58424555202c07a9ccf7f60c3ad7e247da09e25.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23	elevator: printk a warning if switching to a new io scheduler fails	Jinlong Chen
	printk a warning to indicate that the io scheduler has been set to none if switching to a new io scheduler fails. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/d51ed0fb457db7a4f9cbb0dbce36d534e22be457.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23	elevator: update the document of elevator_switch	Jinlong Chen
	We no longer support falling back to the old io scheduler if switching to the new one fails. Update the document to indicate that. Fixes: a1ce35fa4985 ("block: remove dead elevator code") Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/94250961689ba7d2e67a7d9e7995a11166fedb31.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22	drbd: use consistent license	Christoph Böhmwalder
	DRBD currently has a mix of GPL-2.0 and GPL-2.0-or-later SPDX license identifiers. We have decided to stick with GPL 2.0 only, so consistently use that identifier. Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20221122134301.69258-5-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22	lru_cache: remove unused lc_private, lc_set, lc_index_of	Joel Colledge
	Signed-off-by: Joel Colledge <joel.colledge@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20221122134301.69258-4-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22	lru_cache: remove compiled out code	Christoph Böhmwalder
	Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20221122134301.69258-3-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22	lru_cache: use atomic operations when accessing lc->flags, always	Lars Ellenberg
	Or, depending on the way locking is implemented at the call sites, some updates could be lost (has not been observed). Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20221122134301.69258-2-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22	block: fix missing nr_hw_queues update in blk_mq_realloc_tag_set_tags	Shin'ichiro Kawasaki
	The commit ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags") cleaned up the function blk_mq_realloc_tag_set_tags. After this change, the function does not update nr_hw_queues of struct blk_mq_tag_set when new nr_hw_queues value is smaller than original. This results in failure of queue number change of block devices. To avoid the failure, add the missing nr_hw_queues update. Fixes: ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags") Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/linux-block/20221118140640.featvt3fxktfquwh@shindev/ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221122084917.2034220-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21	blk-crypto: move internal only declarations to blk-crypto-internal.h	Christoph Hellwig
	blk_crypto_get_keyslot, blk_crypto_put_keyslot, __blk_crypto_evict_key and __blk_crypto_cfg_supported are only used internally by the blk-crypto code, so move the out of blk-crypto-profile.h, which is included by drivers that supply blk-crypto functionality. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042944.1009870-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21	blk-crypto: add a blk_crypto_config_supported_natively helper	Christoph Hellwig
	Add a blk_crypto_config_supported_natively helper that wraps __blk_crypto_cfg_supported to retrieve the crypto_profile from the request queue. With this fscrypt can stop including blk-crypto-profile.h and rely on the public consumer interface in blk-crypto.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042944.1009870-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21	blk-crypto: don't use struct request_queue for public interfaces	Christoph Hellwig
	Switch all public blk-crypto interfaces to use struct block_device arguments to specify the device they operate on instead of th request_queue, which is a block layer implementation detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221114042944.1009870-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21	nvmet: expose firmware revision to configfs	Aleksandr Miloserdov
	Allow user to set currently active firmware revision Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com> Reviewed-by: Dmitriy Bogdanov <d.bogdanov@yadro.com> Signed-off-by: Aleksandr Miloserdov <a.miloserdov@yadro.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-21	nvmet: expose IEEE OUI to configfs	Aleksandr Miloserdov
	Allow user to set OUI for the controller vendor. Reviewed-by: Konstantin Shelekhin <k.shelekhin@yadro.com> Reviewed-by: Dmitriy Bogdanov <d.bogdanov@yadro.com> Signed-off-by: Aleksandr Miloserdov <a.miloserdov@yadro.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-18	nvme: rename the queue quiescing helpers	Christoph Hellwig
	Naming the nvme helpers that wrap the block quiesce functionality _start/_stop is rather confusing. Switch to using the quiesce naming used by the block layer instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-11-16	blk-cgroup: Flush stats at blkgs destruction path	Waiman Long
	As noted by Michal, the blkg_iostat_set's in the lockless list hold reference to blkg's to protect against their removal. Those blkg's hold reference to blkcg. When a cgroup is being destroyed, cgroup_rstat_flush() is only called at css_release_work_fn() which is called when the blkcg reference count reaches 0. This circular dependency will prevent blkcg from being freed until some other events cause cgroup_rstat_flush() to be called to flush out the pending blkcg stats. To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush() function to flush stats for a given css and cpu and call it at the blkgs destruction path, blkcg_destroy_blkgs(), whenever there are still some pending stats to be flushed. This will ensure that blkcg reference count can reach 0 ASAP. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221105005902.407297-4-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	blk-cgroup: Optimize blkcg_rstat_flush()	Waiman Long
	For a system with many CPUs and block devices, the time to do blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It can be especially problematic as interrupt is disabled during the flush. It was reported that it might take seconds to complete in some extreme cases leading to hard lockup messages. As it is likely that not all the percpu blkg_iostat_set's has been updated since the last flush, those stale blkg_iostat_set's don't need to be flushed in this case. This patch optimizes blkcg_rstat_flush() by keeping a lockless list of recently updated blkg_iostat_set's in a newly added percpu blkcg->lhead pointer. The blkg_iostat_set is added to a lockless list on the update side in blk_cgroup_bio_start(). It is removed from the lockless list when flushed in blkcg_rstat_flush(). Due to racing, it is possible that blk_iostat_set's in the lockless list may have no new IO stats to be flushed, but that is OK. To protect against destruction of blkg, a percpu reference is gotten when putting into the lockless list and put back when removed. When booting up an instrumented test kernel with this patch on a 2-socket 96-thread system with cgroup v2, out of the 2051 calls to cgroup_rstat_flush() after bootup, 1788 of the calls were exited immediately because of empty lockless list. After an all-cpu kernel build, the ratio became 6295424/6340513. That was more than 99%. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221105005902.407297-3-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path	Waiman Long
	For blkcg_css_alloc(), the only error that will be returned is -ENOMEM. Simplify error handling code by returning this error directly instead of setting an intermediate "ret" variable. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20221105005902.407297-2-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	block: don't allow a disk link holder to itself	Yu Kuai
	After creating a dm device, then user can reload such dm with itself, and dead loop will be triggered because dm keep looking up to itself. Test procedures: 1) dmsetup create test --table "xxx sda", assume dm-0 is created 2) dmsetup suspend test 3) dmsetup reload test --table "xxx dm-0" 4) dmsetup resume test Test result: BUG: TASK stack guard page was hit at 00000000736a261f (stack is 000000008d12c88d..00000000c8dd82d5) stack guard page: 0000 [#1] PREEMPT SMP CPU: 29 PID: 946 Comm: systemd-udevd Not tainted 6.1.0-rc3-next-20221101-00006-g17640ca3b0ee #1295 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 RIP: 0010:dm_prepare_ioctl+0xf/0x1e0 Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9 RSP: 0018:ffffc90002090000 EFLAGS: 00010206 RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000 RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000 RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331 R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dm_blk_ioctl+0x50/0x1c0 ? dm_prepare_ioctl+0xe0/0x1e0 dm_blk_ioctl+0x88/0x1c0 dm_blk_ioctl+0x88/0x1c0 ......(a lot of same lines) dm_blk_ioctl+0x88/0x1c0 dm_blk_ioctl+0x88/0x1c0 blkdev_ioctl+0x184/0x3e0 __x64_sys_ioctl+0xa3/0x110 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fddf7306577 Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 88 8 RSP: 002b:00007ffd0b2ec318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005634ef478320 RCX: 00007fddf7306577 RDX: 0000000000000000 RSI: 0000000000005331 RDI: 0000000000000007 RBP: 0000000000000007 R08: 00005634ef4843e0 R09: 0000000000000080 R10: 00007fddf75cfb38 R11: 0000000000000246 R12: 00000000030d4000 R13: 0000000000000000 R14: 0000000000000000 R15: 00005634ef48b800 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:dm_prepare_ioctl+0xf/0x1e0 Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9 RSP: 0018:ffffc90002090000 EFLAGS: 00010206 RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000 RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000 RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331 R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fix the problem by forbidding a disk to create link to itself. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-11-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	block: store the holder kobject in bd_holder_disk	Yu Kuai
	We hold a reference to the holder kobject for each bd_holder_disk, so to make the code a bit more robust, use a reference to it instead of the block_device. As long as no one clears ->bd_holder_dir in before freeing the disk, this isn't strictly required, but it does make the code more clear and more robust. Orignally-From: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-10-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	block: fix use after free for bd_holder_dir	Yu Kuai
	Currently, the caller of bd_link_disk_holer() get 'bdev' by blkdev_get_by_dev(), which will look up 'bdev' by inode number 'dev'. Howerver, it's possible that del_gendisk() can be called currently, and 'bd_holder_dir' can be freed before bd_link_disk_holer() access it, thus use after free is triggered. t1: t2: bdev = blkdev_get_by_dev del_gendisk kobject_put(bd_holder_dir) kobject_free() bd_link_disk_holder Fix the problem by checking disk is still live and grabbing a reference to 'bd_holder_dir' first in bd_link_disk_holder(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-9-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	block: remove delayed holder registration	Christoph Hellwig
	Now that dm has been fixed to track of holder registrations before add_disk, the somewhat buggy block layer code can be safely removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221115141054.1051801-8-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	dm: track per-add_disk holder relations in DM	Christoph Hellwig
	dm is a bit special in that it opens the underlying devices. Commit 89f871af1b26 ("dm: delay registering the gendisk") tried to accommodate that by allowing to add the holder to the list before add_gendisk and then just add them to sysfs once add_disk is called. But that leads to really odd lifetime problems and error handling problems as we can't know the state of the kobjects and don't unwind properly. To fix this switch to just registering all existing table_devices with the holder code right after add_disk, and remove them before calling del_gendisk. Fixes: 89f871af1b26 ("dm: delay registering the gendisk") Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	dm: make sure create and remove dm device won't race with open and close table	Yu Kuai
	open_table_device() and close_table_device() is protected by table_devices_lock, hence use it to protect add_disk() and del_gendisk(). Prepare to track per-add_disk holder relations in dm. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221115141054.1051801-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	dm: cleanup close_table_device	Christoph Hellwig
	Take the list unlink and free into close_table_device so that no half torn down table_devices exist. Also remove the check for a NULL bdev as that can't happen - open_table_device never adds a table_device to the list that does not have a valid block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	dm: cleanup open_table_device	Christoph Hellwig
	Move all the logic for allocation the table_device and linking it into the list into the open_table_device. This keeps the code tidy and ensures that the table_devices only exist in fully initialized state. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	dm: remove free_table_devices	Christoph Hellwig
	free_table_devices just warns and frees all table_device structures when the target removal did not remove them. This should never happen, but if it did, just freeing the structure without deleting them from the list or cleaning up the resources would not help at all. So just WARN on a non-empty list instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	block: clear ->slave_dir when dropping the main slave_dir reference	Christoph Hellwig
	Zero out the pointer to ->slave_dir so that the holder code doesn't incorrectly treat the object as alive when add_disk failed or after del_gendisk was called. Fixes: 89f871af1b26 ("dm: delay registering the gendisk") Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	sbitmap: Try each queue to wake up at least one waiter	Gabriel Krisman Bertazi
	Jan reported the new algorithm as merged might be problematic if the queue being awaken becomes empty between the waitqueue_active inside sbq_wake_ptr check and the wake up. If that happens, wake_up_nr will not wake up any waiter and we loose too many wake ups. In order to guarantee progress, we need to wake up at least one waiter here, if there are any. This now requires trying to wake up from every queue. Instead of walking through all the queues with sbq_wake_ptr, this call moves the wake up inside that function. In a previous version of the patch, I found that updating wake_index several times when walking through queues had a measurable overhead. This ensures we only update it once, at the end. Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221115224553.23594-4-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	wait: Return number of exclusive waiters awaken	Gabriel Krisman Bertazi
	Sbitmap code will need to know how many waiters were actually woken for its batched wakeups implementation. Return the number of woken exclusive waiters from __wake_up() to facilitate that. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221115224553.23594-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	sbitmap: Advance the queue index before waking up a queue	Gabriel Krisman Bertazi
	When a queue is awaken, the wake_index written by sbq_wake_ptr currently keeps pointing to the same queue. On the next wake up, it will thus retry the same queue, which is unfair to other queues, and can lead to starvation. This patch, moves the index update to happen before the queue is returned, such that it will now try a different queue first on the next wake up, improving fairness. Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20221115224553.23594-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	block: remove blkdev_writepages	Christoph Hellwig
	While the block device code should switch to implementing ->writepages instead of ->writepage eventually, the current implementation is entirely pointless as it does the same looping over ->writepage as the generic code if no ->writepages is present. Remove blkdev_writepages so that we can eventually unexport generic_writepages. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221116132035.2192924-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	io_uring/rw: enable bio caches for IRQ rw	Pavel Begunkov
	Now we can use IOCB_ALLOC_CACHE not only for iopoll'ed reads/write but also for normal IRQ driven I/O. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fb8bd092ed5a4a3b037e84e4777074d07aa5639a.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	bio: shrink max number of pcpu cached bios	Pavel Begunkov
	The downside of the bio pcpu cache is that bios of a cpu will be never freed unless there is new I/O issued from that cpu. We currently keep max 512 bios, which feels too much, half it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bc198e8efb27d8c740d80c8ce477432729075096.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	bio: add pcpu caching for non-polling bio_put	Pavel Begunkov
	This patch extends REQ_ALLOC_CACHE to IRQ completions, whenever currently it's only limited to iopoll. Instead of guarding the list with irq toggling on alloc, which is expensive, it keeps an additional irq-safe list from which bios are spliced in batches to ammortise overhead. On the put side it toggles irqs, but in many cases they're already disabled and so cheap. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c2306de96b900ab9264f4428ec37768ddcf0da36.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	bio: split pcpu cache part of bio_put into a helper	Pavel Begunkov
	Extract a helper out of bio_put for recycling into percpu caches. It's a preparation patch without functional changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e97ab2026a89098ee1bfdd09bcb9451fced95f87.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	bio: don't rob starving biosets of bios	Pavel Begunkov
	Biosets keep a mempool, so as long as requests complete we can always can allocate and have forward progress. Percpu bio caches break that assumptions as we may complete into the cache of one CPU and after try and fail to allocate with another CPU. We also can't grab from another CPU's cache without tricky sync. If we're allocating with a bio while the mempool is undersaturated, remove REQ_ALLOC_CACHE flag, so on put it will go straight to mempool. It might try to free into mempool more requests than required, but assuming than there is no memory starvation in the system it'll stabilise and never hit that path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/aa150caf9c263fa92269e86d7826cc8fa65f38de.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	mempool: introduce mempool_is_saturated	Pavel Begunkov
	Introduce a helper mempool_is_saturated(), which tells if the mempool is under-filled or not. We need it to figure out whether it should be freed right into the mempool or could be cached with top level caches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/636aed30be8c35d78f45e244998bc6209283cccc.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16	nvmet: fix a memory leak in nvmet_auth_set_key	Sagi Grimberg
	When changing dhchap secrets we need to release the old secrets as well. kmemleak complaint: -- unreferenced object 0xffff8c7f44ed8180 (size 64): comm "check", pid 7304, jiffies 4295686133 (age 72034.246s) hex dump (first 32 bytes): 44 48 48 43 2d 31 3a 30 30 3a 4c 64 4c 4f 64 71 DHHC-1:00:LdLOdq 79 56 69 67 77 48 55 32 6d 5a 59 4c 7a 35 59 38 yVigwHU2mZYLz5Y8 backtrace: [<00000000b6fc5071>] kstrdup+0x2e/0x60 [<00000000f0f4633f>] 0xffffffffc0e07ee6 [<0000000053006c05>] 0xffffffffc0dff783 [<00000000419ae922>] configfs_write_iter+0xb1/0x120 [<000000008183c424>] vfs_write+0x2be/0x3c0 [<000000009005a2a5>] ksys_write+0x5f/0xe0 [<00000000cd495c89>] do_syscall_64+0x38/0x90 [<00000000f2a84ac5>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16	nvme: return err on nvme_init_non_mdts_limits fail	Joel Granados
	In nvme_init_non_mdts_limits function we were returning 0 when kzalloc failed; it now returns -ENOMEM. Fixes: 5befc7c26e5a ("nvme: implement non-mdts command limits") Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16	nvme: avoid fallback to sequential scan due to transient issues	Uday Shankar
	Currently, if nvme_scan_ns_list fails, nvme_scan_work will fall back to a sequential scan. nvme_scan_ns_list can fail for a variety of reasons, e.g. a transient transport issue, and the resulting sequential scan can be extremely expensive on controllers reporting an NN value close to the maximum allowed (> 4 billion). Avoid sequential scans wherever possible by only falling back to them in two cases: - When the NVMe version supported (VS) value reported by the device is older than NVME_VS(1, 1, 0), before which support of Identify NS List not required. - When the Identify NS List command fails with the DNR bit set in the status. This is to accommodate (non-compliant) devices which report a VS value which implies support for Identify NS List, but nevertheless do not support the command. Such devices will most likely fail the command with the DNR bit set. The third case is when the device claims support for Identify NS List but the command fails with DNR not set. In such cases, fallback to sequential scan is potentially expensive and likely unnecessary, as a retry of the list scan should succeed. So this change skips the fallback in this third case. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16	nvme-rdma: stop auth work after tearing down queues in error recovery	Sagi Grimberg
	when starting error recovery there might be a authentication work running, and it involves I/O commands. Given the controller is tearing down there is no chance for the I/O to complete other than timing out which may unnecessarily take a full io timeout. So first tear down the queues, fail/cancel all inflight I/O (including potentially authentication) and only then stop authentication. This ensures that failover is not stalled due to blocked authentication I/O. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16	nvme-tcp: stop auth work after tearing down queues in error recovery	Sagi Grimberg
	when starting error recovery there might be a authentication work running, and it involves I/O commands. Given the controller is tearing down there is no chance for the I/O to complete other than timing out which may unnecessarily take a full io timeout. So first tear down the queues, fail/cancel all inflight I/O (including potentially authentication) and only then stop authentication. This ensures that failover is not stalled due to blocked authentication I/O. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16	nvme-auth: have dhchap_auth_work wait for queues auth to complete	Sagi Grimberg
	It triggered the queue authentication work elements in parallel, but the ctrl authentication work itself completes when all of them completes. Hence wait for queues auth completions. This also makes nvme_auth_stop simply a sync cancel of ctrl dhchap_auth_work. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16	nvme-auth: remove redundant auth_work flush	Sagi Grimberg
	only ctrl deletion calls nvme_auth_free, which was stopped prior in the teardown stage, so there is no possibility that it should ever run when nvme_auth_free is called. As a result, we can remove a local chap pointer variable. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-11-16	nvme-auth: convert dhchap_auth_list to an array	Sagi Grimberg
	We know exactly how many dhchap contexts we will need, there is no need to hold a list that we need to protect with a mutex. Convert to a dynamically allocated array. And dhchap_context access state is maintained by the chap itself. Make dhchap_auth_mutex protect only the ctrl host_key and ctrl_key in a fine-grained lock such that there is no long lasting acquisition of the lock and no need to take/release this lock when flushing authentication works. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>