summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2021-10-17Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Bigger than usual for this point in time, the majority is fixing some issues around BDI lifetimes with the move from the request_queue to the disk in this release. In detail: - Series on draining fs IO for del_gendisk() (Christoph) - NVMe pull request via Christoph: - fix the abort command id (Keith Busch) - nvme: fix per-namespace chardev deletion (Adam Manzanares) - brd locking scope fix (Tetsuo) - BFQ fix (Paolo)" * tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block: block, bfq: reset last_bfqq_created on group change block: warn when putting the final reference on a registered disk brd: reduce the brd_devices_mutex scope kyber: avoid q->disk dereferences in trace points block: keep q_usage_counter in atomic mode after del_gendisk block: drain file system I/O on del_gendisk block: split bio_queue_enter from blk_queue_enter block: factor out a blk_try_enter_queue helper block: call submit_bio_checks under q_usage_counter nvme: fix per-namespace chardev deletion block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs nvme-pci: Fix abort command id
2021-10-17block, bfq: reset last_bfqq_created on group changePaolo Valente
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created queues"), BFQ maintains a per-group pointer to the last bfq_queue created. If such a queue, say bfqq, happens to move to a different group, then bfqq is no more a valid last bfq_queue created for its previous group. That pointer must then be cleared. Not resetting such a pointer may also cause UAF, if bfqq happens to also be freed after being moved to a different group. This commit performs this missing reset. As such it fixes commit 430a67f9d616 ("block, bfq: merge bursts of newly-created queues"). Such a missing reset is most likely the cause of the crash reported in [1]. With some analysis, we found that this crash was due to the above UAF. And such UAF did go away with this commit applied [1]. Anyway, before this commit, that crash happened to be triggered in conjunction with commit 2d52c58b9c9b ("block, bfq: honor already-setup queue merges"). The latter was then reverted by commit ebc69e897e17 ("Revert "block, bfq: honor already-setup queue merges""). Yet commit 2d52c58b9c9b ("block, bfq: honor already-setup queue merges") contains no error related with the above UAF, and can then be restored. [1] https://bugzilla.kernel.org/show_bug.cgi?id=214503 Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Tested-by: Grzegorz Kowal <custos.mentis@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20211015144336.45894-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-17block: warn when putting the final reference on a registered diskChristoph Hellwig
Warn when the last reference on a live disk is put without calling del_gendisk first. There are some BDI related bug reports that look like a case of this, so make sure we have the proper instrumentation to catch it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211014130231.1468538-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15kyber: avoid q->disk dereferences in trace pointsChristoph Hellwig
q->disk becomes invalid after the gendisk is removed. Work around this by caching the dev_t for the tracepoints. The real fix would be to properly tear down the I/O schedulers with the gendisk, but that is a much more invasive change. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012093301.GA27795@lst.de Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15block: keep q_usage_counter in atomic mode after del_gendiskChristoph Hellwig
Don't switch back to percpu mode to avoid the double RCU grace period when tearing down SCSI devices. After removing the disk only passthrough commands can be send anyway. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15block: drain file system I/O on del_gendiskChristoph Hellwig
Instead of delaying draining of file system I/O related items like the blk-qos queues, the integrity read workqueue and timeouts only when the request_queue is removed, do that when del_gendisk is called. This is important for SCSI where the upper level drivers that control the gendisk are separate entities, and the disk can be freed much earlier than the request_queue, or can even be unbound without tearing down the queue. Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk") Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20210929071241.934472-5-hch@lst.de Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15block: split bio_queue_enter from blk_queue_enterChristoph Hellwig
To prepare for fixing a gendisk shutdown race, open code the blk_queue_enter logic in bio_queue_enter. This also removes the pointless flags translation. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20210929071241.934472-4-hch@lst.de Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15block: factor out a blk_try_enter_queue helperChristoph Hellwig
Factor out the code to try to get q_usage_counter without blocking into a separate helper. Both to improve code readability and to prepare for splitting bio_queue_enter from blk_queue_enter. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20210929071241.934472-3-hch@lst.de Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15block: call submit_bio_checks under q_usage_counterChristoph Hellwig
Ensure all bios check the current values of the queue under freeze protection, i.e. to make sure the zero capacity set by del_gendisk is actually seen before dispatching to the driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210929071241.934472-2-hch@lst.de Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-09Merge tag 'block-5.15-2021-10-09' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Two small fixes for this release: - Add missing QUEUE_FLAG_HCTX_ACTIVE in the debugfs handling (Johannes) - Fix double free / UAF issue in __alloc_disk_node (Tetsuo)" * tag 'block-5.15-2021-10-09' of git://git.kernel.dk/linux-block: block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output block: genhd: fix double kfree() in __alloc_disk_node()
2021-10-04block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs outputJohannes Thumshirn
While debugging an issue we've found that $DEBUGFS/block/$disk/state doesn't decode QUEUE_FLAG_HCTX_ACTIVE but only displays its numerical value. Add QUEUE_FLAG(HCTX_ACTIVE) to the blk_queue_flag_name array so it'll get decoded properly. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/4351076388918075bd80ef07756f9d2ce63be12c.1633332053.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-02Merge tag 'block-5.15-2021-10-01' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A few block fixes for this release: - Revert a BFQ commit that causes breakage for people. Unfortunately it was auto-selected for stable as well, so now 5.14.7 suffers from it too. Hopefully stable will pick up this revert quickly too, so we can remove the issue on that end as well. - Add a quirk for Apple NVMe controllers, which due to their non-compliance broke due to the introduction of command sequences (Keith) - Use shifts in nbd, fixing a __divdi3 issue (Nick)" * tag 'block-5.15-2021-10-01' of git://git.kernel.dk/linux-block: nbd: use shifts rather than multiplies Revert "block, bfq: honor already-setup queue merges" nvme: add command id quirk for apple controllers
2021-10-02block: genhd: fix double kfree() in __alloc_disk_node()Tetsuo Handa
syzbot is reporting use-after-free read at bdev_free_inode() [1], for kfree() from __alloc_disk_node() is called before bdev_free_inode() (which is called after RCU grace period) reads bdev->bd_disk and calls kfree(bdev->bd_disk). Fix use-after-free read followed by double kfree() problem by making sure that bdev->bd_disk is NULL when calling iput(). Link: https://syzkaller.appspot.com/bug?extid=8281086e8a6fbfbd952a [1] Reported-by: syzbot <syzbot+8281086e8a6fbfbd952a@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/e6dd13c5-8db0-4392-6e78-a42ee5d2a1c4@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-28Revert "block, bfq: honor already-setup queue merges"Jens Axboe
This reverts commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b. We have had several folks complain that this causes hangs for them, which is especially problematic as the commit has also hit stable already. As no resolution seems to be forthcoming right now, revert the patch. Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503 Fixes: 2d52c58b9c9b ("block, bfq: honor already-setup queue merges") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-25Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Thirty-three fixes, I'm afraid. Essentially the build up from the last couple of weeks while I've been dealling with Linux Plumbers conference infrastructure issues. It's mostly the usual assortment of spelling fixes and minor corrections. The only core relevant changes are to the sd driver to reduce the spin up message spew and fix a small memory leak on the freeing path" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (33 commits) scsi: ses: Retry failed Send/Receive Diagnostic commands scsi: target: Fix spelling mistake "CONFLIFT" -> "CONFLICT" scsi: lpfc: Fix gcc -Wstringop-overread warning, again scsi: lpfc: Use correct scnprintf() limit scsi: lpfc: Fix sprintf() overflow in lpfc_display_fpin_wwpn() scsi: core: Remove 'current_tag' scsi: acornscsi: Remove tagged queuing vestiges scsi: fas216: Kill scmd->tag scsi: qla2xxx: Restore initiator in dual mode scsi: ufs: core: Unbreak the reset handler scsi: sd_zbc: Support disks with more than 2**32 logical blocks scsi: ufs: core: Revert "scsi: ufs: Synchronize SCSI and UFS error handling" scsi: bsg: Fix device unregistration scsi: sd: Make sd_spinup_disk() less noisy scsi: ufs: ufs-pci: Fix Intel LKF link stability scsi: mpt3sas: Clean up some inconsistent indenting scsi: megaraid: Clean up some inconsistent indenting scsi: sr: Fix spelling mistake "does'nt" -> "doesn't" scsi: Remove SCSI CDROM MAINTAINERS entry scsi: megaraid: Fix Coccinelle warning ...
2021-09-24block: hold ->invalidate_lock in blkdev_fallocateMing Lei
When running ->fallocate(), blkdev_fallocate() should hold mapping->invalidate_lock to prevent page cache from being accessed, otherwise stale data may be read in page cache. Without this patch, blktests block/009 fails sometimes. With this patch, block/009 can pass always. Also as Jan pointed out, no pages can be created in the discarded area while you are holding the invalidate_lock, so remove the 2nd truncate_bdev_range(). Cc: Jan Kara <jack@suse.cz> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210923023751.1441091-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24block: don't call rq_qos_ops->done_bio if the bio isn't trackedMing Lei
rq_qos framework is only applied on request based driver, so: 1) rq_qos_done_bio() needn't to be called for bio based driver 2) rq_qos_done_bio() needn't to be called for bio which isn't tracked, such as bios ended from error handling code. Especially in bio_endio(): 1) request queue is referred via bio->bi_bdev->bd_disk->queue, which may be gone since request queue refcount may not be held in above two cases 2) q->rq_qos may be freed in blk_cleanup_queue() when calling into __rq_qos_done_bio() Fix the potential kernel panic by not calling rq_qos_ops->done_bio if the bio isn't tracked. This way is safe because both ioc_rqos_done_bio() and blkcg_iolatency_done_bio() are nop if the bio isn't tracked. Reported-by: Yu Kuai <yukuai3@huawei.com> Cc: tj@kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210924110704.1541818-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15blk-cgroup: fix UAF by grabbing blkcg lock before destroying blkg pdLi Jinlin
KASAN reports a use-after-free report when doing fuzz test: [693354.104835] ================================================================== [693354.105094] BUG: KASAN: use-after-free in bfq_io_set_weight_legacy+0xd3/0x160 [693354.105336] Read of size 4 at addr ffff888be0a35664 by task sh/1453338 [693354.105607] CPU: 41 PID: 1453338 Comm: sh Kdump: loaded Not tainted 4.18.0-147 [693354.105610] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.81 07/02/2018 [693354.105612] Call Trace: [693354.105621] dump_stack+0xf1/0x19b [693354.105626] ? show_regs_print_info+0x5/0x5 [693354.105634] ? printk+0x9c/0xc3 [693354.105638] ? cpumask_weight+0x1f/0x1f [693354.105648] print_address_description+0x70/0x360 [693354.105654] kasan_report+0x1b2/0x330 [693354.105659] ? bfq_io_set_weight_legacy+0xd3/0x160 [693354.105665] ? bfq_io_set_weight_legacy+0xd3/0x160 [693354.105670] bfq_io_set_weight_legacy+0xd3/0x160 [693354.105675] ? bfq_cpd_init+0x20/0x20 [693354.105683] cgroup_file_write+0x3aa/0x510 [693354.105693] ? ___slab_alloc+0x507/0x540 [693354.105698] ? cgroup_file_poll+0x60/0x60 [693354.105702] ? 0xffffffff89600000 [693354.105708] ? usercopy_abort+0x90/0x90 [693354.105716] ? mutex_lock+0xef/0x180 [693354.105726] kernfs_fop_write+0x1ab/0x280 [693354.105732] ? cgroup_file_poll+0x60/0x60 [693354.105738] vfs_write+0xe7/0x230 [693354.105744] ksys_write+0xb0/0x140 [693354.105749] ? __ia32_sys_read+0x50/0x50 [693354.105760] do_syscall_64+0x112/0x370 [693354.105766] ? syscall_return_slowpath+0x260/0x260 [693354.105772] ? do_page_fault+0x9b/0x270 [693354.105779] ? prepare_exit_to_usermode+0xf9/0x1a0 [693354.105784] ? enter_from_user_mode+0x30/0x30 [693354.105793] entry_SYSCALL_64_after_hwframe+0x65/0xca [693354.105875] Allocated by task 1453337: [693354.106001] kasan_kmalloc+0xa0/0xd0 [693354.106006] kmem_cache_alloc_node_trace+0x108/0x220 [693354.106010] bfq_pd_alloc+0x96/0x120 [693354.106015] blkcg_activate_policy+0x1b7/0x2b0 [693354.106020] bfq_create_group_hierarchy+0x1e/0x80 [693354.106026] bfq_init_queue+0x678/0x8c0 [693354.106031] blk_mq_init_sched+0x1f8/0x460 [693354.106037] elevator_switch_mq+0xe1/0x240 [693354.106041] elevator_switch+0x25/0x40 [693354.106045] elv_iosched_store+0x1a1/0x230 [693354.106049] queue_attr_store+0x78/0xb0 [693354.106053] kernfs_fop_write+0x1ab/0x280 [693354.106056] vfs_write+0xe7/0x230 [693354.106060] ksys_write+0xb0/0x140 [693354.106064] do_syscall_64+0x112/0x370 [693354.106069] entry_SYSCALL_64_after_hwframe+0x65/0xca [693354.106114] Freed by task 1453336: [693354.106225] __kasan_slab_free+0x130/0x180 [693354.106229] kfree+0x90/0x1b0 [693354.106233] blkcg_deactivate_policy+0x12c/0x220 [693354.106238] bfq_exit_queue+0xf5/0x110 [693354.106241] blk_mq_exit_sched+0x104/0x130 [693354.106245] __elevator_exit+0x45/0x60 [693354.106249] elevator_switch_mq+0xd6/0x240 [693354.106253] elevator_switch+0x25/0x40 [693354.106257] elv_iosched_store+0x1a1/0x230 [693354.106261] queue_attr_store+0x78/0xb0 [693354.106264] kernfs_fop_write+0x1ab/0x280 [693354.106268] vfs_write+0xe7/0x230 [693354.106271] ksys_write+0xb0/0x140 [693354.106275] do_syscall_64+0x112/0x370 [693354.106280] entry_SYSCALL_64_after_hwframe+0x65/0xca [693354.106329] The buggy address belongs to the object at ffff888be0a35580 which belongs to the cache kmalloc-1k of size 1024 [693354.106736] The buggy address is located 228 bytes inside of 1024-byte region [ffff888be0a35580, ffff888be0a35980) [693354.107114] The buggy address belongs to the page: [693354.107273] page:ffffea002f828c00 count:1 mapcount:0 mapping:ffff888107c17080 index:0x0 compound_mapcount: 0 [693354.107606] flags: 0x17ffffc0008100(slab|head) [693354.107760] raw: 0017ffffc0008100 ffffea002fcbc808 ffffea0030bd3a08 ffff888107c17080 [693354.108020] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000 [693354.108278] page dumped because: kasan: bad access detected [693354.108511] Memory state around the buggy address: [693354.108671] ffff888be0a35500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [693354.116396] ffff888be0a35580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.124473] >ffff888be0a35600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.132421] ^ [693354.140284] ffff888be0a35680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.147912] ffff888be0a35700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.155281] ================================================================== blkgs are protected by both queue and blkcg locks and holding either should stabilize them. However, the path of destroying blkg policy data is only protected by queue lock in blkcg_activate_policy()/blkcg_deactivate_policy(). Other tasks can get the blkg policy data before the blkg policy data is destroyed, and use it after destroyed, which will result in a use-after-free. CPU0 CPU1 blkcg_deactivate_policy spin_lock_irq(&q->queue_lock) bfq_io_set_weight_legacy spin_lock_irq(&blkcg->lock) blkg_to_bfqg(blkg) pd_to_bfqg(blkg->pd[pol->plid]) ^^^^^^blkg->pd[pol->plid] != NULL bfqg != NULL pol->pd_free_fn(blkg->pd[pol->plid]) pd_to_bfqg(blkg->pd[pol->plid]) bfqg_put(bfqg) kfree(bfqg) blkg->pd[pol->plid] = NULL spin_unlock_irq(q->queue_lock); bfq_group_set_weight(bfqg, val, 0) bfqg->entity.new_weight ^^^^^^trigger uaf here spin_unlock_irq(&blkcg->lock); Fix by grabbing the matching blkcg lock before trying to destroy blkg policy data. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Li Jinlin <lijinlin3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210914042605.3260596-1-lijinlin3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15blkcg: fix memory leak in blk_iolatency_initYanfei Xu
BUG: memory leak unreferenced object 0xffff888129acdb80 (size 96): comm "syz-executor.1", pid 12661, jiffies 4294962682 (age 15.220s) hex dump (first 32 bytes): 20 47 c9 85 ff ff ff ff 20 d4 8e 29 81 88 ff ff G...... ..).... 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff82264ec8>] kmalloc include/linux/slab.h:591 [inline] [<ffffffff82264ec8>] kzalloc include/linux/slab.h:721 [inline] [<ffffffff82264ec8>] blk_iolatency_init+0x28/0x190 block/blk-iolatency.c:724 [<ffffffff8225b8c4>] blkcg_init_queue+0xb4/0x1c0 block/blk-cgroup.c:1185 [<ffffffff822253da>] blk_alloc_queue+0x22a/0x2e0 block/blk-core.c:566 [<ffffffff8223b175>] blk_mq_init_queue_data block/blk-mq.c:3100 [inline] [<ffffffff8223b175>] __blk_mq_alloc_disk+0x25/0xd0 block/blk-mq.c:3124 [<ffffffff826a9303>] loop_add+0x1c3/0x360 drivers/block/loop.c:2344 [<ffffffff826a966e>] loop_control_get_free drivers/block/loop.c:2501 [inline] [<ffffffff826a966e>] loop_control_ioctl+0x17e/0x2e0 drivers/block/loop.c:2516 [<ffffffff81597eec>] vfs_ioctl fs/ioctl.c:51 [inline] [<ffffffff81597eec>] __do_sys_ioctl fs/ioctl.c:874 [inline] [<ffffffff81597eec>] __se_sys_ioctl fs/ioctl.c:860 [inline] [<ffffffff81597eec>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860 [<ffffffff843fa745>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff843fa745>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae Once blk_throtl_init() queue init failed, blkcg_iolatency_exit() will not be invoked for cleanup. That leads a memory leak. Swap the blk_throtl_init() and blk_iolatency_init() calls can solve this. Reported-by: syzbot+01321b15cc98e6bf96d6@syzkaller.appspotmail.com Fixes: 19688d7f9592 (block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls) Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210915072426.4022924-1-yanfei.xu@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14block: flush the integrity workqueue in blk_integrity_unregisterLihong Kou
When the integrity profile is unregistered there can still be integrity reads queued up which could see a NULL verify_fn as shown by the race window below: CPU0 CPU1 process_one_work nvme_validate_ns bio_integrity_verify_fn nvme_update_ns_info nvme_update_disk_info blk_integrity_unregister ---set queue->integrity as 0 bio_integrity_process --access bi->profile->verify_fn(bi is a pointer of queue->integity) Before calling blk_integrity_unregister in nvme_update_disk_info, we must make sure that there is no work item in the kintegrityd_wq. Just call blk_flush_integrity to flush the work queue so the bug can be resolved. Signed-off-by: Lihong Kou <koulihong@huawei.com> [hch: split up and shortened the changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20210914070657.87677-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14block: check if a profile is actually registered in blk_integrity_unregisterChristoph Hellwig
While clearing the profile itself is harmless, we really should not clear the stable writes flag if it wasn't set due to a registered integrity profile. Reported-by: Lihong Kou <koulihong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20210914070657.87677-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14scsi: bsg: Fix device unregistrationZenghui Yu
device_initialize() is used to take a refcount on the device. However, put_device() is not called during device teardown. This leads to a leak of private data of the driver core, dev_name(), etc. This is reported by kmemleak at boot time if we compile kernel with DEBUG_TEST_DRIVER_REMOVE. Fix memory leaks during unregistration and implement a release function. Link: https://lore.kernel.org/r/20210911105306.1511-1-yuzenghui@huawei.com Fixes: ead09dd3aed5 ("scsi: bsg: Simplify device registration") Reviewed-by: Johan Hovold <johan@kernel.org> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-12blk-mq: avoid to iterate over stale requestMing Lei
blk-mq can't run allocating driver tag and updating ->rqs[tag] atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is released. So there is chance to iterating over one stale request just after the tag is allocated and before updating ->rqs[tag]. scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi in-flight requests after scsi host is blocked, so no new scsi command can be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request may have been kept in another slot of ->rqs[], meantime the slot can be allocated out but ->rqs[] isn't updated yet. Then this in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in handling scsi error. Fixes the issue by not iterating over stale request. Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Reported-by: luojiaxing <luojiaxing@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210906065003.439019-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-11Merge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request from Christoph: - fix nvmet command set reporting for passthrough controllers (Adam Manzanares) - update a MAINTAINERS email address (Chaitanya Kulkarni) - set QUEUE_FLAG_NOWAIT for nvme-multipth (me) - handle errors from add_disk() (Luis Chamberlain) - update the keep alive interval when kato is modified (Tatsuya Sasaki) - fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke) - do not reset transport on data digest errors in nvme-tcp (Daniel Wagner) - only call synchronize_srcu when clearing current path (Daniel Wagner) - revalidate paths during rescan (Hannes Reinecke) - Split out the fs/block_dev into block/fops.c and block/bdev.c, which has been long overdue. Do this now before -rc1, to avoid annoying conflicts due to this (Christoph) - blk-throtl use-after-free fix (Li) - Improve plug depth for multi-device plugs, greatly increasing md resync performance (Song) - blkdev_show() locking fix (Tetsuo) - n64cart error check fix (Yang) * tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block: n64cart: fix return value check in n64cart_probe() blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues block: move fs/block_dev.c to block/bdev.c block: split out operations on block special files blk-throttle: fix UAF by deleteing timer in blk_throtl_exit() block: genhd: don't call blkdev_show() with major_names_lock held nvme: update MAINTAINERS email address nvme: add error handling support for add_disk() nvme: only call synchronize_srcu when clearing current path nvme: update keep alive interval when kato is modified nvme-tcp: Do not reset transport on data digest errors nvmet: fixup buffer overrun in nvmet_subsys_attr_serial() nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req nvmet: looks at the passthrough controller when initializing CAP nvme: move nvme_multi_css into nvme.h nvme-multipath: revalidate paths during rescan nvme-multipath: set QUEUE_FLAG_NOWAIT
2021-09-07blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queuesSong Liu
Limiting number of request to BLK_MAX_REQUEST_COUNT at blk_plug hurts performance for large md arrays. [1] shows resync speed of md array drops for md array with more than 16 HDDs. Fix this by allowing more request at plug queue. The multiple_queue flag is used to only apply higher limit to multiple queue cases. [1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/ Tested-by: Marcin Wanat <marcin.wanat@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07block: move fs/block_dev.c to block/bdev.cChristoph Hellwig
Move it together with the rest of the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07block: split out operations on block special filesChristoph Hellwig
Add a new block/fops.c for all the file and address_space operations that provide the block special file support. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de [axboe: correct trailing whitespace while at it] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()Li Jinlin
The pending timer has been set up in blk_throtl_init(). However, the timer is not deleted in blk_throtl_exit(). This means that the timer handler may still be running after freeing the timer, which would result in a use-after-free. Fix by calling del_timer_sync() to delete the timer in blk_throtl_exit(). Signed-off-by: Li Jinlin <lijinlin3@huawei.com> Link: https://lore.kernel.org/r/20210907121242.2885564-1-lijinlin3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07block: genhd: don't call blkdev_show() with major_names_lock heldTetsuo Handa
If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other combinations), lockdep complains circular locking dependency at __loop_clr_fd(), for major_names_lock serves as a locking dependency aggregating hub across multiple block modules. ====================================================== WARNING: possible circular locking dependency detected 5.14.0+ #757 Tainted: G E ------------------------------------------------------ systemd-udevd/7568 is trying to acquire lock: ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560 but task is already holding lock: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #6 (&lo->lo_mutex){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_killable_nested+0x17/0x20 lo_open+0x23/0x50 [loop] blkdev_get_by_dev+0x199/0x540 blkdev_open+0x58/0x90 do_dentry_open+0x144/0x3a0 path_openat+0xa57/0xda0 do_filp_open+0x9f/0x140 do_sys_openat2+0x71/0x150 __x64_sys_openat+0x78/0xa0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #5 (&disk->open_mutex){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 bd_register_pending_holders+0x20/0x100 device_add_disk+0x1ae/0x390 loop_add+0x29c/0x2d0 [loop] blk_request_module+0x5a/0xb0 blkdev_get_no_open+0x27/0xa0 blkdev_get_by_dev+0x5f/0x540 blkdev_open+0x58/0x90 do_dentry_open+0x144/0x3a0 path_openat+0xa57/0xda0 do_filp_open+0x9f/0x140 do_sys_openat2+0x71/0x150 __x64_sys_openat+0x78/0xa0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #4 (major_names_lock){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 blkdev_show+0x19/0x80 devinfo_show+0x52/0x60 seq_read_iter+0x2d5/0x3e0 proc_reg_read_iter+0x41/0x80 vfs_read+0x2ac/0x330 ksys_read+0x6b/0xd0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&p->lock){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 seq_read_iter+0x37/0x3e0 generic_file_splice_read+0xf3/0x170 splice_direct_to_actor+0x14e/0x350 do_splice_direct+0x84/0xd0 do_sendfile+0x263/0x430 __se_sys_sendfile64+0x96/0xc0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#3){.+.+}-{0:0}: lock_acquire+0xbe/0x1f0 lo_write_bvec+0x96/0x280 [loop] loop_process_work+0xa68/0xc10 [loop] process_one_work+0x293/0x480 worker_thread+0x23d/0x4b0 kthread+0x163/0x180 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: lock_acquire+0xbe/0x1f0 process_one_work+0x280/0x480 worker_thread+0x23d/0x4b0 kthread+0x163/0x180 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: validate_chain+0x1f0d/0x33e0 __lock_acquire+0x92d/0x1030 lock_acquire+0xbe/0x1f0 flush_workqueue+0x8c/0x560 drain_workqueue+0x80/0x140 destroy_workqueue+0x47/0x4f0 __loop_clr_fd+0xb4/0x400 [loop] blkdev_put+0x14a/0x1d0 blkdev_close+0x1c/0x20 __fput+0xfd/0x220 task_work_run+0x69/0xc0 exit_to_user_mode_prepare+0x1ce/0x1f0 syscall_exit_to_user_mode+0x26/0x60 do_syscall_64+0x4c/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 2 locks held by systemd-udevd/7568: #0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0 #1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop] stack backtrace: CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G E 5.14.0+ #757 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 Call Trace: dump_stack_lvl+0x79/0xbf print_circular_bug+0x5d6/0x5e0 ? stack_trace_save+0x42/0x60 ? save_trace+0x3d/0x2d0 check_noncircular+0x10b/0x120 validate_chain+0x1f0d/0x33e0 ? __lock_acquire+0x953/0x1030 ? __lock_acquire+0x953/0x1030 __lock_acquire+0x92d/0x1030 ? flush_workqueue+0x70/0x560 lock_acquire+0xbe/0x1f0 ? flush_workqueue+0x70/0x560 flush_workqueue+0x8c/0x560 ? flush_workqueue+0x70/0x560 ? sched_clock_cpu+0xe/0x1a0 ? drain_workqueue+0x41/0x140 drain_workqueue+0x80/0x140 destroy_workqueue+0x47/0x4f0 ? blk_mq_freeze_queue_wait+0xac/0xd0 __loop_clr_fd+0xb4/0x400 [loop] ? __mutex_unlock_slowpath+0x35/0x230 blkdev_put+0x14a/0x1d0 blkdev_close+0x1c/0x20 __fput+0xfd/0x220 task_work_run+0x69/0xc0 exit_to_user_mode_prepare+0x1ce/0x1f0 syscall_exit_to_user_mode+0x26/0x60 do_syscall_64+0x4c/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f0fd4c661f7 Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006 RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000 R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050 Commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope") is for breaking "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling a different block module results in forming circular locking dependency due to shared major_names_lock mutex. The simplest fix is to call probe function without holding major_names_lock [1], but Christoph Hellwig does not like such idea. Therefore, instead of holding major_names_lock in blkdev_show(), introduce a different lock for blkdev_show() in order to break "sb_writers#$N => &p->lock => major_names_lock" dependency chain. Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-06Merge tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Was going to send this one in later this week, but given that -Werror is now enabled (or at least available), the mq-deadline fix really should go in for the folks hitting that. - Ensure dd_queued() is only there if needed (Geert) - Fix a kerneldoc warning for bio_alloc_kiocb() - BFQ fix for queue merging - loop locking fix (Tetsuo)" * tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block: loop: reduce the loop_ctl_mutex scope bio: fix kerneldoc documentation for bio_alloc_kiocb() block, bfq: honor already-setup queue merges block/mq-deadline: Move dd_queued() to fix defined but not used warning
2021-09-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03mm: remove flush_kernel_dcache_pageChristoph Hellwig
flush_kernel_dcache_page is a rather confusing interface that implements a subset of flush_dcache_page by not being able to properly handle page cache mapped pages. The only callers left are in the exec code as all other previous callers were incorrect as they could have dealt with page cache pages. Replace the calls to flush_kernel_dcache_page with calls to flush_dcache_page, which for all architectures does either exactly the same thing, can contains one or more of the following: 1) an optimization to defer the cache flush for page cache pages not mapped into userspace 2) additional flushing for mapped page cache pages if cache aliases are possible Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: Geoff Levand <geoff@infradead.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03bio: fix kerneldoc documentation for bio_alloc_kiocb()Jens Axboe
Apparently the last fixup got butter fingered a bit, the correct variable name is 'nr_vecs', not 'nr_iovecs'. Link: https://lore.kernel.org/lkml/20210903164939.02f6e8c5@canb.auug.org.au/ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (ufs, qla2xxx, target, smartpqi, lpfc, mpt3sas). The core change causing the most churn was replacing the command request field request with a macro, allowing us to offset map to it and remove the redundant field; the same was also done for the tag field. The most impactful change is the final removal of scsi_ioctl, which has been deprecated for over a decade" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits) scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1 scsi: ufs: ufs-exynos: Fix static checker warning scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Copyright updates for 14.0.0.1 patches scsi: lpfc: Update lpfc version to 14.0.0.1 scsi: lpfc: Add bsg support for retrieving adapter cmf data scsi: lpfc: Add cmf_info sysfs entry scsi: lpfc: Add debugfs support for cm framework buffers scsi: lpfc: Add support for maintaining the cm statistics buffer scsi: lpfc: Add rx monitoring statistics scsi: lpfc: Add support for the CM framework scsi: lpfc: Add cmfsync WQE support scsi: lpfc: Add support for cm enablement buffer scsi: lpfc: Add cm statistics buffer support scsi: lpfc: Add EDC ELS support scsi: lpfc: Expand FPIN and RDF receive logging scsi: lpfc: Add MIB feature enablement support scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware scsi: fc: Add EDC ELS definition ...
2021-09-02block, bfq: honor already-setup queue mergesPaolo Valente
The function bfq_setup_merge prepares the merging between two bfq_queues, say bfqq and new_bfqq. To this goal, it assigns bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives, the process that generated that I/O is disassociated from bfqq and associated with new_bfqq (merging is actually a redirection). In this respect, bfq_setup_merge increases new_bfqq->ref in advance, adding the number of processes that are expected to be associated with new_bfqq. Unfortunately, the stable-merging mechanism interferes with this setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and before all the expected processes have been associated with bfqq->new_bfqq, bfqq may happen to be stably merged with a different queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq gets changed. So, some of the processes that have been already accounted for in the ref counter of the previous new_bfqq will not be associated with that queue. This creates an unbalance, because those references will never be decremented. This commit fixes this issue by reestablishing the previous, natural behaviour: once bfqq->new_bfqq has been set, it will not be changed until all expected redirections have occurred. Signed-off-by: Davide Zini <davidezini2@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02block/mq-deadline: Move dd_queued() to fix defined but not used warningGeert Uytterhoeven
If CONFIG_BLK_DEBUG_FS=n: block/mq-deadline.c:274:12: warning: ‘dd_queued’ defined but not used [-Wunused-function] 274 | static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio) | ^~~~~~~~~ Fix this by moving dd_queued() just before the sole function that calls it. Fixes: 7b05bf771084ff78 ("Revert "block/mq-deadline: Prioritize high-priority requests"") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics") Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210830091128.1854266-1-geert@linux-m68k.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-31Merge tag 'for-5.15-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The highlights of this round are integrations with fs-verity and idmapped mounts, the rest is usual mix of minor improvements, speedups and cleanups. There are some patches outside of btrfs, namely updating some VFS interfaces, all straightforward and acked. Features: - fs-verity support, using standard ioctls, backward compatible with read-only limitation on inodes with previously enabled fs-verity - idmapped mount support - make mount with rescue=ibadroots more tolerant to partially damaged trees - allow raid0 on a single device and raid10 on two devices, degenerate cases but might be useful as an intermediate step during conversion to other profiles - zoned mode block group auto reclaim can be disabled via sysfs knob Performance improvements: - continue readahead of node siblings even if target node is in memory, could speed up full send (on sample test +11%) - batching of delayed items can speed up creating many files - fsync/tree-log speedups - avoid unnecessary work (gains +2% throughput, -2% run time on sample load) - reduced lock contention on renames (on dbench +4% throughput, up to -30% latency) Fixes: - various zoned mode fixes - preemptive flushing threshold tuning, avoid excessive work on almost full filesystems Core: - continued subpage support, preparation for implementing remaining features like compression and defragmentation; with some limitations, write is now enabled on 64K page systems with 4K sectors, still considered experimental - no readahead on compressed reads - inline extents disabled - disabled raid56 profile conversion and mount - improved flushing logic, fixing early ENOSPC on some workloads - inode flags have been internally split to read-only and read-write incompat bit parts, used by fs-verity - new tree items for fs-verity - descriptor item - Merkle tree item - inode operations extended to be namespace-aware - cleanups and refactoring Generic code changes: - fs: new export filemap_fdatawrite_wbc - fs: removed sync_inode - block: bio_trim argument type fixups - vfs: add namespace-aware lookup" * tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits) btrfs: reset replace target device to allocation state on close btrfs: zoned: fix ordered extent boundary calculation btrfs: do not do preemptive flushing if the majority is global rsv btrfs: reduce the preemptive flushing threshold to 90% btrfs: tree-log: check btrfs_lookup_data_extent return value btrfs: avoid unnecessarily logging directories that had no changes btrfs: allow idmapped mount btrfs: handle ACLs on idmapped mounts btrfs: allow idmapped INO_LOOKUP_USER ioctl btrfs: allow idmapped SUBVOL_SETFLAGS ioctl btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids btrfs: allow idmapped SNAP_DESTROY ioctls btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls btrfs: check whether fsgid/fsuid are mapped during subvolume creation btrfs: allow idmapped permission inode op btrfs: allow idmapped setattr inode op btrfs: allow idmapped tmpfile inode op btrfs: allow idmapped symlink inode op btrfs: allow idmapped mkdir inode op ...
2021-08-30Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull support for struct bio recycling from Jens Axboe: "This adds bio recycling support for polled IO, allowing quick reuse of a bio for high IOPS scenarios via a percpu bio_set list. It's good for almost a 10% improvement in performance, bumping our per-core IO limit from ~3.2M IOPS to ~3.5M IOPS" * tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block: bio: improve kerneldoc documentation for bio_alloc_kiocb() block: provide bio_clear_hipri() helper block: use the percpu bio cache in __blkdev_direct_IO io_uring: enable use of bio alloc cache block: clear BIO_PERCPU_CACHE flag if polling isn't supported bio: add allocation cache abstraction fs: add kiocb alloc cache flag bio: optimize initialization of a bio
2021-08-30Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Nothing major in here - lots of good cleanups and tech debt handling, which is also evident in the diffstats. In particular: - Add disk sequence numbers (Matteo) - Discard merge fix (Ming) - Relax disk zoned reporting restrictions (Niklas) - Bio error handling zoned leak fix (Pavel) - Start of proper add_disk() error handling (Luis, Christoph) - blk crypto fix (Eric) - Non-standard GPT location support (Dmitry) - IO priority improvements and cleanups (Damien)o - blk-throtl improvements (Chunguang) - diskstats_show() stack reduction (Abd-Alrhman) - Loop scheduler selection (Bart) - Switch block layer to use kmap_local_page() (Christoph) - Remove obsolete disk_name helper (Christoph) - block_device refcounting improvements (Christoph) - Ensure gendisk always has a request queue reference (Christoph) - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)" * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits) sg: pass the device name to blk_trace_setup block, bfq: cleanup the repeated declaration blk-crypto: fix check for too-large dun_bytes blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN blk-zoned: allow zone management send operations without CAP_SYS_ADMIN block: mark blkdev_fsync static block: refine the disk_live check in del_gendisk mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA mmc: block: Support alternative_gpt_sector() operation partitions/efi: Support non-standard GPT location block: Add alternative_gpt_sector() operation bio: fix page leak bio_add_hw_page failure block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT block: remove a pointless call to MINOR() in device_add_disk null_blk: add error handling support for add_disk() virtio_blk: add error handling support for add_disk() block: add error handling for device_add_disk / add_disk block: return errors from disk_alloc_events block: return errors from blk_integrity_add block: call blk_register_queue earlier in device_add_disk ...
2021-08-30Merge tag 'irq-core-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates to the interrupt core and driver subsystems: Core changes: - The usual set of small fixes and improvements all over the place, but nothing stands out MSI changes: - Further consolidation of the PCI/MSI interrupt chip code - Make MSI sysfs code independent of PCI/MSI and expose the MSI interrupts of platform devices in the same way as PCI exposes them. Driver changes: - Support for ARM GICv3 EPPI partitions - Treewide conversion to generic_handle_domain_irq() for all chained interrupt controllers - Conversion to bitmap_zalloc() throughout the irq chip drivers - The usual set of small fixes and improvements" * tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) platform-msi: Add ABI to show msi_irqs of platform devices genirq/msi: Move MSI sysfs handling from PCI to MSI core genirq/cpuhotplug: Demote debug printk to KERN_DEBUG irqchip/qcom-pdc: Trim unused levels of the interrupt hierarchy irqdomain: Export irq_domain_disconnect_hierarchy() irqchip/gic-v3: Fix priority comparison when non-secure priorities are used irqchip/apple-aic: Fix irq_disable from within irq handlers pinctrl/rockchip: drop the gpio related codes gpio/rockchip: drop irq_gc_lock/irq_gc_unlock for irq set type gpio/rockchip: support next version gpio controller gpio/rockchip: use struct rockchip_gpio_regs for gpio controller gpio/rockchip: add driver for rockchip gpio dt-bindings: gpio: change items restriction of clock for rockchip,gpio-bank pinctrl/rockchip: add pinctrl device to gpio bank struct pinctrl/rockchip: separate struct rockchip_pin_bank to a head file pinctrl/rockchip: always enable clock for gpio controller genirq: Fix kernel doc indentation EDAC/altera: Convert to generic_handle_domain_irq() powerpc: Bulk conversion to generic_handle_domain_irq() nios2: Bulk conversion to generic_handle_domain_irq() ...
2021-08-27Merge tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Revert the mq-deadline priority handling, it's causing serious performance regressions. While experimental patches exists to fix this up, it's too late to do so now. Revert it and re-do it properly for 5.15 instead. - Fix a NULL vs IS_ERR() regression in this release (Dan) - Fix a mq-deadline accounting regression in this release (Bart) - Mark cryptoloop as deprecated. It's broken and dm-crypt fully supports it, and it's actively intefering with loop. Plan on removal for 5.16 (Christoph) * tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block: cryptoloop: add a deprecation warning pd: fix a NULL vs IS_ERR() check Revert "block/mq-deadline: Prioritize high-priority requests" mq-deadline: Fix request accounting
2021-08-26Revert "block/mq-deadline: Prioritize high-priority requests"Jens Axboe
This reverts commit fb926032b3209300f9dc454a36b8299582ae545c. Zhen reports that this commit slows down mq-deadline on a 128 thread box, going from 258K IOPS to 170-180K. My testing shows that Optane gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box. Looking in detail at the code, the main culprit here is needing to sum percpu counters in the dispatch hot path, leading to very high CPU utilization there. To make matters worse, the code currently needs to sum 2 percpu counters, and it does so in the most naive way of iterating possible CPUs _twice_. Since we're close to release, revert this commit and we can re-do it with regular per-priority counters instead for the 5.15 kernel. Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/ Reported-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25block, bfq: cleanup the repeated declarationShaokun Zhang
Function 'bfq_entity_to_bfqq' is declared twice, so remove the repeated declaration and blank line. Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Link: https://lore.kernel.org/r/1629872391-46399-1-git-send-email-zhangshaokun@hisilicon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25blk-crypto: fix check for too-large dun_bytesEric Biggers
dun_bytes needs to be less than or equal to the IV size of the encryption mode, not just less than or equal to BLK_CRYPTO_MAX_IV_SIZE. Currently this doesn't matter since blk_crypto_init_key() is never actually passed invalid values, but we might as well fix this. Fixes: a892c8d52c02 ("block: Inline encryption support for blk-mq") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20210825055918.51975-1-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24mq-deadline: Fix request accountingBart Van Assche
The block layer may call the I/O scheduler .finish_request() callback without having called the .insert_requests() callback. Make sure that the mq-deadline I/O statistics are correct if the block layer inserts an I/O request that bypasses the I/O scheduler. This patch prevents that lower priority I/O is delayed longer than necessary for mixed I/O priority workloads. Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com> Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Tested-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMINNiklas Cassel
A user space process should not need the CAP_SYS_ADMIN capability set in order to perform a BLKREPORTZONE ioctl. Getting the zone report is required in order to get the write pointer. Neither read() nor write() requires CAP_SYS_ADMIN, so it is reasonable that a user space process that can read/write from/to the device, also can get the write pointer. (Since e.g. writes have to be at the write pointer.) Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: stable@vger.kernel.org # v4.10+ Link: https://lore.kernel.org/r/20210811110505.29649-3-Niklas.Cassel@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24blk-zoned: allow zone management send operations without CAP_SYS_ADMINNiklas Cassel
Zone management send operations (BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE) should be allowed under the same permissions as write(). (write() does not require CAP_SYS_ADMIN). Additionally, other ioctls like BLKSECDISCARD and BLKZEROOUT only check if the fd was successfully opened with FMODE_WRITE. (They do not require CAP_SYS_ADMIN). Currently, zone management send operations require both CAP_SYS_ADMIN and that the fd was successfully opened with FMODE_WRITE. Remove the CAP_SYS_ADMIN requirement, so that zone management send operations match the access control requirement of write(), BLKSECDISCARD and BLKZEROOUT. Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: stable@vger.kernel.org # v4.10+ Link: https://lore.kernel.org/r/20210811110505.29649-2-Niklas.Cassel@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24block: refine the disk_live check in del_gendiskChristoph Hellwig
hidden gendisks will never be marked live. Fixes: 40b3a52ffc5b ("block: add a sanity check for a live disk in del_gendisk") Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210824144310.1487816-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24partitions/efi: Support non-standard GPT locationDmitry Osipenko
Support looking up GPT at a non-standard location specified by a block device driver. Acked-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210820004536.15791-3-digetx@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24bio: fix page leak bio_add_hw_page failurePavel Begunkov
__bio_iov_append_get_pages() doesn't put not appended pages on bio_add_hw_page() failure, so potentially leaking them, fix it. Also, do the same for __bio_iov_iter_get_pages(), even though it looks like it can't be triggered by userspace in this case. Fixes: 0512a75b98f8 ("block: Introduce REQ_OP_ZONE_APPEND") Cc: stable@vger.kernel.org # 5.8+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1edfa6a2ffd66d55e6345a477df5387d2c1415d0.1626653825.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>