summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2021-06-04dm writecache: interrupt writeback if suspendedMikulas Patocka
If the DM device is suspended, interrupt the writeback sequence so that there is no excessive suspend delay. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04dm writecache: don't split bios when overwriting contiguous cache contentMikulas Patocka
If dm-writecache overwrites existing cached data, it splits the incoming bio into many block-sized bios. The I/O scheduler does merge these bios into one large request but this needless splitting and merging causes performance degradation. Fix this by avoiding bio splitting if the cache target area that is being overwritten is contiguous. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04dm kcopyd: avoid spin_lock_irqsave from process contextMikulas Patocka
The functions "pop", "push_head", "do_work" can only be called from process context. Therefore, replace spin_lock_irq{save,restore} with spin_{lock,unlock}_irq. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04dm kcopyd: avoid useless atomic operationsMikulas Patocka
The functions set_bit and clear_bit are atomic. We don't need atomicity when making flags for dm-kcopyd. So, change them to direct manipulation of the flags. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04dm space map disk: cache a small number of index entriesJoe Thornber
The disk space map stores it's index entries in a btree, these are accessed very frequently, so having a few cached makes a big difference to performance. With this change provisioning a new block takes roughly 20% less cpu. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04dm space maps: improve performance with inc/dec on ranges of blocksJoe Thornber
When we break sharing on btree nodes we typically need to increment the reference counts to every value held in the node. This can cause a lot of repeated calls to the space maps. Fix this by changing the interface to the space map inc/dec methods to take ranges of adjacent blocks to be operated on. For installations that are using a lot of snapshots this will reduce cpu overhead of fundamental operations such as provisioning a new block, or deleting a snapshot, by as much as 10 times. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04dm space maps: don't reset space map allocation cursor when committingJoe Thornber
Current commit code resets the place where the search for free blocks will begin back to the start of the metadata device. There are a couple of repercussions to this: - The first allocation after the commit is likely to take longer than normal as it searches for a free block in an area that is likely to have very few free blocks (if any). - Any free blocks it finds will have been recently freed. Reusing them means we have fewer old copies of the metadata to aid recovery from hardware error. Fix these issues by leaving the cursor alone, only resetting when the search hits the end of the metadata device. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-04dm btree: improve btree residencyJoe Thornber
This commit improves the residency of btrees built in the metadata for dm-thin and dm-cache. When inserting a new entry into a full btree node the current code splits the node into two. This can result in very many half full nodes, particularly if the insertions are occurring in an ascending order (as happens in dm-thin with large writes). With this commit, when we insert into a full node we first try and move some entries to a neighbouring node that has space, failing that it tries to split two neighbouring nodes into three. Results are given below. 'Residency' is how full nodes are on average as a percentage. Average instruction counts for the operations are given to show the extra processing has little overhead. +--------------------------+--------------------------+ | Before | After | +------------+-----------+-----------+--------------+-----------+--------------+ | Test | Phase | Residency | Instructions | Residency | Instructions | +------------+-----------+-----------+--------------+-----------+--------------+ | Ascending | insert | 50 | 1876 | 96 | 1930 | | | overwrite | 50 | 1789 | 96 | 1746 | | | lookup | 50 | 778 | 96 | 778 | | Descending | insert | 50 | 3024 | 96 | 3181 | | | overwrite | 50 | 1789 | 96 | 1746 | | | lookup | 50 | 778 | 96 | 778 | | Random | insert | 68 | 3800 | 84 | 3736 | | | overwrite | 68 | 4254 | 84 | 3911 | | | lookup | 68 | 779 | 84 | 779 | | Runs | insert | 63 | 2546 | 82 | 2815 | | | overwrite | 63 | 2013 | 82 | 1986 | | | lookup | 63 | 778 | 82 | 779 | +------------+-----------+-----------+--------------+-----------+--------------+ Ascending - keys are inserted in ascending order. Descending - keys are inserted in descending order. Random - keys are inserted in random order. Runs - keys are split into ascending runs of ~20 length. Then the runs are shuffled. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> # contains_key() fix Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-06-01block: move bd_mutex to struct gendiskChristoph Hellwig
Replace the per-block device bd_mutex with a per-gendisk open_mutex, thus simplifying locking wherever we deal with partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01md: convert to blk_alloc_disk/blk_cleanup_diskChristoph Hellwig
Convert the md driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01dm: convert to blk_alloc_disk/blk_cleanup_diskChristoph Hellwig
Convert the dm driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01bcache: convert to blk_alloc_disk/blk_cleanup_diskChristoph Hellwig
Convert the bcache driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-28Merge tag 'block-5.13-2021-05-28' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request (Christoph): - fix a memory leak in nvme_cdev_add (Guoqing Jiang) - fix inline data size comparison in nvmet_tcp_queue_response (Hou Pu) - fix false keep-alive timeout when a controller is torn down (Sagi Grimberg) - fix a nvme-tcp Kconfig dependency (Sagi Grimberg) - short-circuit reconnect retries for FC (Hannes Reinecke) - decode host pathing error for connect (Hannes Reinecke) - MD pull request (Song): - Fix incorrect chunk boundary assert (Christoph) - Fix s390/dasd verification panic (Stefan) * tag 'block-5.13-2021-05-28' of git://git.kernel.dk/linux-block: nvmet: fix false keep-alive timeout when a controller is torn down nvmet-tcp: fix inline data size comparison in nvmet_tcp_queue_response nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME md/raid5: remove an incorrect assert in in_chunk_boundary s390/dasd: add missing discipline function nvme-fabrics: decode host pathing error for connect nvme-fc: short-circuit reconnect retries nvme: fix potential memory leaks in nvme_cdev_add
2021-05-25md/raid5: remove an incorrect assert in in_chunk_boundaryChristoph Hellwig
Now that the original bdev is stored in the bio this assert is incorrect and will trigger for any partitioned raid5 device. Reported-by: Florian Dazinger <spam02@dazinger.net> Tested-by: Florian Dazinger <spam02@dazinger.net> Cc: stable@vger.kernel.org # 5.12 Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio"), Reviewed-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-05-25dm snapshot: properly fix a crash when an origin has no snapshotsMikulas Patocka
If an origin target has no snapshots, o->split_boundary is set to 0. This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split(). Fix this by initializing chunk_size, and in turn split_boundary, to rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits into "unsigned" type. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-25dm snapshot: revert "fix a crash when an origin has no snapshots"Mikulas Patocka
Commit 7ee06ddc4038f936b0d4459d37a7d4d844fb03db ("dm snapshot: fix a crash when an origin has no snapshots") introduced a regression in snapshot merging - causing the lvm2 test lvcreate-cache-snapshot.sh got stuck in an infinite loop. Even though commit 7ee06ddc4038f936b0d4459d37a7d4d844fb03db was marked for stable@ the stable team was notified to _not_ backport it. Fixes: 7ee06ddc4038 ("dm snapshot: fix a crash when an origin has no snapshots") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-25dm verity: fix require_signatures module_param permissionsJohn Keeping
The third parameter of module_param() is permissions for the sysfs node but it looks like it is being used as the initial value of the parameter here. In fact, false here equates to omitting the file from sysfs and does not affect the value of require_signatures. Making the parameter writable is not simple because going from false->true is fine but it should not be possible to remove the requirement to verify a signature. But it can be useful to inspect the value of this parameter from userspace, so change the permissions to make a read-only file in sysfs. Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-13dm integrity: fix sparse warningsMikulas Patocka
Use the types __le* instead of __u* to fix sparse warnings. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-13dm integrity: revert to not using discard filler when recalulatingMikulas Patocka
Revert the commit 7a5b96b4784454ba258e83dc7469ddbacd3aaac3 ("dm integrity: use discard support when recalculating"). There's a bug that when we write some data beyond the current recalculate boundary, the checksum will be rewritten with the discard filler later. And the data will no longer have integrity protection. There's no easy fix for this case. Also, another problematic case is if dm-integrity is used to detect bitrot (random device errors, bit flips, etc); dm-integrity should detect that even for unused sectors. With commit 7a5b96b4784 it can happen that such change is undetected (because discard filler is not a valid checksum). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-13dm snapshot: fix crash with transient storage and zero chunk sizeMikulas Patocka
The following commands will crash the kernel: modprobe brd rd_size=1048576 dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0" dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0" The reason is that when we test for zero chunk size, we jump to the label bad_read_metadata without setting the "r" variable. The function snapshot_ctr destroys all the structures and then exits with "r == 0". The kernel then crashes because it falsely believes that snapshot_ctr succeeded. In order to fix the bug, we set the variable "r" to -EINVAL. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-10dm snapshot: fix a crash when an origin has no snapshotsMikulas Patocka
If an origin target has no snapshots, o->split_boundary is set to 0. This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split(). Fix this by initializing chunk_size, and in turn split_boundary, to rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits into "unsigned" type. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge yet more updates from Andrew Morton: "This is everything else from -mm for this merge window. 90 patches. Subsystems affected by this patch series: mm (cleanups and slub), alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat, checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov, panic, delayacct, gdb, resource, selftests, async, initramfs, ipc, drivers/char, and spelling" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (90 commits) mm: fix typos in comments mm: fix typos in comments treewide: remove editor modelines and cruft ipc/sem.c: spelling fix fs: fat: fix spelling typo of values kernel/sys.c: fix typo kernel/up.c: fix typo kernel/user_namespace.c: fix typos kernel/umh.c: fix some spelling mistakes include/linux/pgtable.h: few spelling fixes mm/slab.c: fix spelling mistake "disired" -> "desired" scripts/spelling.txt: add "overflw" scripts/spelling.txt: Add "diabled" typo scripts/spelling.txt: add "overlfow" arm: print alloc free paths for address in registers mm/vmalloc: remove vwrite() mm: remove xlate_dev_kmem_ptr() drivers/char: remove /dev/kmem for good mm: fix some typos and code style problems ipc/sem.c: mundane typo fixes ...
2021-05-06include: remove pagemap.h from blkdev.hMatthew Wilcox (Oracle)
My UEK-derived config has 1030 files depending on pagemap.h before this change. Afterwards, just 326 files need to be rebuilt when I touch pagemap.h. I think blkdev.h is probably included too widely, but untangling that dependency is harder and this solves my problem. x86 allmodconfig builds, but there may be implicit include problems on other architectures. Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Dan Williams <dan.j.williams@intel.com> [nvdimm] Acked-by: Jens Axboe <axboe@kernel.dk> [block] Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Martin K. Petersen <martin.petersen@oracle.com> [scsi] Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-01Merge tag 'for-5.13/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Improve scalability of DM's device hash by switching to rbtree - Extend DM ioctl's DM_LIST_DEVICES_CMD handling to include UUID and allow filtering based on name or UUID prefix. - Various small fixes for typos, warnings, unused function, or needlessly exported interfaces. - Remove needless request_queue NULL pointer checks in DM thin and cache targets. - Remove unnecessary loop in DM core's __split_and_process_bio(). - Remove DM core's dm_vcalloc() and just use kvcalloc or kvmalloc_array instead (depending whether zeroing is useful). - Fix request-based DM's double free of blk_mq_tag_set in device remove after table load fails. - Improve DM persistent data performance on non-x86 by fixing packed structs to have a stated alignment. Also remove needless extra work from redundant calls to sm_disk_get_nr_free() and a paranoid BUG_ON() that caused duplicate checksum calculation. - Fix missing goto in DM integrity's bitmap_flush_interval error handling. - Add "reset_recalculate" feature flag to DM integrity. - Improve DM integrity by leveraging discard support to avoid needless re-writing of metadata and also use discard support to improve hash recalculation. - Fix race with DM raid target's reshape and MD raid4/5/6 resync that resulted in inconsistant reshape state during table reloads. - Update DM raid target to temove unnecessary discard limits for raid0 and raid10 now that MD has optimized discard handling for both raid levels. * tag 'for-5.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm raid: remove unnecessary discard limits for raid0 and raid10 dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails dm integrity: use discard support when recalculating dm integrity: increase RECALC_SECTORS to improve recalculate speed dm integrity: don't re-write metadata if discarding same blocks dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences dm raid: fix fall-through warning in rs_check_takeover() for Clang dm clone metadata: remove unused function dm integrity: fix missing goto in bitmap_flush_interval error handling dm: replace dm_vcalloc() dm space map common: fix division bug in sm_ll_find_free_block() dm persistent data: packed struct should have an aligned() attribute too dm btree spine: remove paranoid node_check call in node_prep_for_write() dm space map disk: remove redundant calls to sm_disk_get_nr_free() dm integrity: add the "reset_recalculate" feature flag dm persistent data: remove unused return from exit_shadow_spine() dm cache: remove needless request_queue NULL pointer checks dm thin: remove needless request_queue NULL pointer check dm: unexport dm_{get,put}_table_device dm ebs: fix a few typos ...
2021-04-30dm raid: remove unnecessary discard limits for raid0 and raid10Mike Snitzer
Commit 29efc390b946 ("md/md0: optimize raid0 discard handling") and commit d30588b2731f ("md/raid10: improve raid10 discard request") remove MD raid0's and raid10's inability to properly handle large discards. So eliminate associated constraints from dm-raid's support. Depends-on: 29efc390b946 ("md/md0: optimize raid0 discard handling") Depends-on: d30588b2731f ("md/raid10: improve raid10 discard request") Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30dm rq: fix double free of blk_mq_tag_set in dev remove after table load failsBenjamin Block
When loading a device-mapper table for a request-based mapped device, and the allocation/initialization of the blk_mq_tag_set for the device fails, a following device remove will cause a double free. E.g. (dmesg): device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device device-mapper: ioctl: unable to set up device queue for new table. Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0305e098835de000 TEID: 0305e098835de803 Fault in home space mode while using kernel ASCE. AS:000000025efe0007 R3:0000000000000024 Oops: 0038 ilc:3 [#1] SMP Modules linked in: ... lots of modules ... Supported: Yes, External CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G W X 5.3.18-53-default #1 SLE15-SP3 Hardware name: IBM 8561 T01 7I2 (LPAR) Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000 000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000 000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640 00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8 Krnl Code: 000000025e368eb8: c4180041e100 lgrl %r1,25eba50b8 000000025e368ebe: ecba06b93a55 risbg %r11,%r10,6,185,58 #000000025e368ec4: e3b010000008 ag %r11,0(%r1) >000000025e368eca: e310b0080004 lg %r1,8(%r11) 000000025e368ed0: a7110001 tmll %r1,1 000000025e368ed4: a7740129 brc 7,25e369126 000000025e368ed8: e320b0080004 lg %r2,8(%r11) 000000025e368ede: b904001b lgr %r1,%r11 Call Trace: [<000000025e368eca>] kfree+0x42/0x330 [<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8 [<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod] [<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod] [<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod] [<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod] [<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod] [<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod] [<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0 [<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40 [<000000025e8c15ac>] system_call+0xd8/0x2c8 Last Breaking-Event-Address: [<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8 Kernel panic - not syncing: Fatal exception: panic_on_oops When allocation/initialization of the blk_mq_tag_set fails in dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer is not reset to NULL; so when dev_remove() later gets into dm_mq_cleanup_mapped_device() it sees the pointer and tries to uninitialize and free it again. Fix this by setting the pointer to NULL in dm_mq_init_request_queue() error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device(). Cc: <stable@vger.kernel.org> # 4.6+ Fixes: 1c357a1e86a4 ("dm: allocate blk_mq_tag_set rather than embed in mapped_device") Signed-off-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30dm integrity: use discard support when recalculatingMikulas Patocka
If we have discard support we don't have to recalculate hash - we can just fill the metadata with the discard pattern. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30dm integrity: increase RECALC_SECTORS to improve recalculate speedMikulas Patocka
Increase RECALC_SECTORS because it improves recalculate speed slightly (from 390kiB/s to 410kiB/s). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30dm integrity: don't re-write metadata if discarding same blocksMikulas Patocka
If we discard already discarded blocks we do not need to write discard pattern to the metadata, because it is already there. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-28Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: - MD changes via Song: - raid5 POWER fix - raid1 failure fix - UAF fix for md cluster - mddev_find_or_alloc() clean up - Fix NULL pointer deref with external bitmap - Performance improvement for raid10 discard requests - Fix missing information of /proc/mdstat - rsxx const qualifier removal (Arnd) - Expose allocated brd pages (Calvin) - rnbd via Gioh Kim: - Change maintainer - Change domain address of maintainers' email - Add polling IO mode and document update - Fix memory leak and some bug detected by static code analysis tools - Code refactoring - Series of floppy cleanups/fixes (Denis) - s390 dasd fixes (Julian) - kerneldoc fixes (Lee) - null_blk double free (Lv) - null_blk virtual boundary addition (Max) - Remove xsysace driver (Michal) - umem driver removal (Davidlohr) - ataflop fixes (Dan) - Revalidate disk removal (Christoph) - Bounce buffer cleanups (Christoph) - Mark lightnvm as deprecated (Christoph) - mtip32xx init cleanups (Shixin) - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang) * tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits) async_xor: increase src_offs when dropping destination page drivers/block/null_blk/main: Fix a double free in null_init. md/raid1: properly indicate failure when ending a failed write request md-cluster: fix use-after-free issue when removing rdev nvme: introduce generic per-namespace chardev nvme: cleanup nvme_configure_apst nvme: do not try to reconfigure APST when the controller is not live nvme: add 'kato' sysfs attribute nvme: sanitize KATO setting nvmet: avoid queuing keep-alive timer if it is disabled brd: expose number of allocated pages in debugfs ataflop: fix off by one in ataflop_probe() ataflop: potential out of bounds in do_format() drbd: Fix fall-through warnings for Clang block/rnbd: Use strscpy instead of strlcpy block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name block/rnbd-clt: Remove max_segment_size block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev Documentation/ABI/rnbd-clt: Add description for nr_poll_queues ...
2021-04-27Merge tag 'cfi-v5.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull CFI on arm64 support from Kees Cook: "This builds on last cycle's LTO work, and allows the arm64 kernels to be built with Clang's Control Flow Integrity feature. This feature has happily lived in Android kernels for almost 3 years[1], so I'm excited to have it ready for upstream. The wide diffstat is mainly due to the treewide fixing of mismatched list_sort prototypes. Other things in core kernel are to address various CFI corner cases. The largest code portion is the CFI runtime implementation itself (which will be shared by all architectures implementing support for CFI). The arm64 pieces are Acked by arm64 maintainers rather than coming through the arm64 tree since carrying this tree over there was going to be awkward. CFI support for x86 is still under development, but is pretty close. There are a handful of corner cases on x86 that need some improvements to Clang and objtool, but otherwise works well. Summary: - Clean up list_sort prototypes (Sami Tolvanen) - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)" * tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: arm64: allow CONFIG_CFI_CLANG to be selected KVM: arm64: Disable CFI for nVHE arm64: ftrace: use function_nocfi for ftrace_call arm64: add __nocfi to __apply_alternatives arm64: add __nocfi to functions that jump to a physical address arm64: use function_nocfi with __pa_symbol arm64: implement function_nocfi psci: use function_nocfi for cpu_resume lkdtm: use function_nocfi treewide: Change list_sort to use const pointers bpf: disable CFI in dispatcher functions kallsyms: strip ThinLTO hashes from static functions kthread: use WARN_ON_FUNCTION_MISMATCH workqueue: use WARN_ON_FUNCTION_MISMATCH module: ensure __cfi_check alignment mm: add generic function_nocfi macro cfi: add __cficanonical add support for Clang CFI
2021-04-23md/raid1: properly indicate failure when ending a failed write requestPaul Clements
This patch addresses a data corruption bug in raid1 arrays using bitmaps. Without this fix, the bitmap bits for the failed I/O end up being cleared. Since we are in the failure leg of raid1_end_write_request, the request either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded). Fixes: eeba6809d8d5 ("md/raid1: end bio when the device faulty") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Paul Clements <paul.clements@us.sios.com> Signed-off-by: Song Liu <song@kernel.org>
2021-04-23md-cluster: fix use-after-free issue when removing rdevHeming Zhao
md_kick_rdev_from_array will remove rdev, so we should use rdev_for_each_safe to search list. How to trigger: env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns). ``` node2=192.168.0.3 for i in {1..20}; do echo ==== $i `date` ====; mdadm -Ss && ssh ${node2} "mdadm -Ss" wipefs -a /dev/sda /dev/sdb mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \ /dev/sdb --assume-clean ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" mdadm --wait /dev/md0 ssh ${node2} "mdadm --wait /dev/md0" mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda sleep 1 done ``` Crash stack: ``` stack segment: 0000 [#1] SMP ... ... RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod] ... ... RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207 RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50 RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246 RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800 FS: 0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: raid1d+0x5c/0xd40 [raid1] ? finish_task_switch+0x75/0x2a0 ? lock_timer_base+0x67/0x80 ? try_to_del_timer_sync+0x4d/0x80 ? del_timer_sync+0x41/0x50 ? schedule_timeout+0x254/0x2d0 ? md_start_sync+0xe0/0xe0 [md_mod] ? md_thread+0x127/0x160 [md_mod] md_thread+0x127/0x160 [md_mod] ? wait_woken+0x80/0x80 kthread+0x10d/0x130 ? kthread_park+0xa0/0xa0 ret_from_fork+0x1f/0x40 ``` Fixes: dbb64f8635f5d ("md-cluster: Fix adding of new disk with new reload code") Fixes: 659b254fa7392 ("md-cluster: remove a disk asynchronously from cluster environment") Cc: stable@vger.kernel.org Reviewed-by: Gang He <ghe@suse.com> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2021-04-21dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload ↵Heinz Mauelshagen
sequences If fast table reloads occur during an ongoing reshape of raid4/5/6 devices the target may race reading a superblock vs the the MD resync thread; causing an inconclusive reshape state to be read in its constructor. lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause BUG_ON() to trigger in md_run(), e.g.: "kernel BUG at drivers/md/raid5.c:7567!". Scenario triggering the bug: 1. the MD sync thread calls end_reshape() from raid5_sync_request() when done reshaping. However end_reshape() _only_ updates the reshape position to MaxSector keeping the changed layout configuration though (i.e. any delta disks, chunk sector or RAID algorithm changes). That inconclusive configuration is stored in the superblock. 2. dm-raid constructs a mapping, loading named inconsistent superblock as of step 1 before step 3 is able to finish resetting the reshape state completely, and calls md_run() which leads to mentioned bug in raid5.c. 3. the MD RAID personality's finish_reshape() is called; which resets the reshape information on chunk sectors, delta disks, etc. This explains why the bug is rarely seen on multi-core machines, as MD's finish_reshape() superblock update races with the dm-raid constructor's superblock load in step 2. Fix identifies inconclusive superblock content in the dm-raid constructor and resets it before calling md_run(), factoring out identifying checks into rs_is_layout_change() to share in existing rs_reshape_requested() and new rs_reset_inclonclusive_reshape(). Also enhance a comment and remove an empty line. Cc: stable@vger.kernel.org Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-20dm raid: fix fall-through warning in rs_check_takeover() for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm clone metadata: remove unused functionJiapeng Chong
Fix the following clang warning: drivers/md/dm-clone-metadata.c:279:19: warning: unused function 'superblock_write_lock' [-Wunused-function]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm integrity: fix missing goto in bitmap_flush_interval error handlingTian Tao
Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode") Cc: stable@vger.kernel.org Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm: replace dm_vcalloc()Matthew Wilcox (Oracle)
Use kvcalloc or kvmalloc_array instead (depending whether zeroing is useful). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm space map common: fix division bug in sm_ll_find_free_block()Joe Thornber
This division bug meant the search for free metadata space could skip the final allocation bitmap's worth of entries. Fix affects DM thinp, cache and era targets. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Tested-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm persistent data: packed struct should have an aligned() attribute tooJoe Thornber
Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to byte-by-byte access. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm btree spine: remove paranoid node_check call in node_prep_for_write()Joe Thornber
Remove this extra BUG_ON() that calls node_check() -- which avoids extra crc checking. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm space map disk: remove redundant calls to sm_disk_get_nr_free()Joe Thornber
Both sm_disk_new_block and sm_disk_commit are needlessly calling sm_disk_get_nr_free(). Looks like old queries used for some debugging. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-15md/bitmap: wait for external bitmap writes to complete during tear downSudhakar Panneerselvam
NULL pointer dereference was observed in super_written() when it tries to access the mddev structure. [The below stack trace is from an older kernel, but the problem described in this patch applies to the mainline kernel.] [ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000 [ 1194.488000] RIP: 0010:super_written+0x29/0xe1 [ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046 [ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048 [ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200 [ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000 [ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200 [ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1194.588117] FS: 0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000 [ 1194.604264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0 [ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1194.663316] PKRU: 55555554 [ 1194.674090] Call Trace: [ 1194.683735] <IRQ> [ 1194.692948] bio_endio+0xae/0x135 [ 1194.703580] blk_update_request+0xad/0x2fa [ 1194.714990] blk_update_bidi_request+0x20/0x72 [ 1194.726578] __blk_end_bidi_request+0x2c/0x4d [ 1194.738373] __blk_end_request_all+0x31/0x49 [ 1194.749344] blk_flush_complete_seq+0x377/0x383 [ 1194.761550] flush_end_io+0x1dd/0x2a7 [ 1194.772910] blk_finish_request+0x9f/0x13c [ 1194.784544] scsi_end_request+0x180/0x25c [ 1194.796149] scsi_io_completion+0xc8/0x610 [ 1194.807503] scsi_finish_command+0xdc/0x125 [ 1194.818897] scsi_softirq_done+0x81/0xde [ 1194.830062] blk_done_softirq+0xa4/0xcc [ 1194.841008] __do_softirq+0xd9/0x29f [ 1194.851257] irq_exit+0xe6/0xeb [ 1194.861290] do_IRQ+0x59/0xe3 [ 1194.871060] common_interrupt+0x1c6/0x382 [ 1194.881988] </IRQ> [ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5 [ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43 [ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f [ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000 [ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2 [ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003 [ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a [ 1194.988607] ? cpuidle_enter_state+0xcc/0x2a5 [ 1194.999077] cpuidle_enter+0x17/0x19 [ 1195.008395] call_cpuidle+0x23/0x3a [ 1195.017718] do_idle+0x172/0x1d5 [ 1195.026358] cpu_startup_entry+0x73/0x75 [ 1195.035769] start_secondary+0x1b9/0x20b [ 1195.044894] secondary_startup_64+0xa5/0xa5 [ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78 [ 1195.096354] CR2: 00000000000002b8 bio in the above stack is a bitmap write whose completion is invoked after the tear down sequence sets the mddev structure to NULL in rdev. During tear down, there is an attempt to flush the bitmap writes, but for external bitmaps, there is no explicit wait for all the bitmap writes to complete. For instance, md_bitmap_flush() is called to flush the bitmap writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush() could generate new bitmap writes for which there is no explicit wait to complete those writes. The call to md_bitmap_update_sb() will return simply for external bitmaps and the follow-up call to md_update_sb() is conditional and may not get called for external bitmaps. This results in a kernel panic when the completion routine, super_written() is called which tries to reference mddev in the rdev that has been set to NULL(in unbind_rdev_from_array() by tear down sequence). The solution is to call md_super_wait() for external bitmaps after the last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there are no pending bitmap writes before proceeding with the tear down. Cc: stable@vger.kernel.org Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com> Reviewed-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: do not return existing mddevs from mddev_find_or_allocChristoph Hellwig
Instead of returning an existing mddev, just for it to be discarded later directly return -EEXIST. Rename the function to mddev_alloc now that it doesn't find an existing mddev. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: refactor mddev_find_or_allocChristoph Hellwig
Allocate the new mddev first speculatively, which greatly simplifies the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: factor out a mddev_alloc_unit helper from mddev_findChristoph Hellwig
Split out a self contained helper to find a free minor for the md "unit" number. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-14dm verity fec: fix misaligned RS roots IOJaegeuk Kim
commit df7b59ba9245 ("dm verity: fix FEC for RS roots unaligned to block size") introduced the possibility for misaligned roots IO relative to the underlying device's logical block size. E.g. Android's default RS roots=2 results in dm_bufio->block_size=1024, which causes the following EIO if the logical block size of the device is 4096, given v->data_dev_block_bits=12: E sd 0 : 0:0:0: [sda] tag#30 request not aligned to the logical block size E blk_update_request: I/O error, dev sda, sector 10368424 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 E device-mapper: verity-fec: 254:8: FEC 9244672: parity read failed (block 18056): -5 Fix this by onlu using f->roots for dm_bufio blocksize IFF it is aligned to v->data_dev_block_bits. Fixes: df7b59ba9245 ("dm verity: fix FEC for RS roots unaligned to block size") Cc: stable@vger.kernel.org Signed-off-by: Jaegeuk Kim <jaegeuk@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-11bcache: fix a regression of code compiling failure in debug.cColy Li
The patch "bcache: remove PTR_CACHE" introduces a compiling failure in debug.c with following error message, In file included from drivers/md/bcache/bcache.h:182:0, from drivers/md/bcache/debug.c:9: drivers/md/bcache/debug.c: In function 'bch_btree_verify': drivers/md/bcache/debug.c:53:19: error: 'c' undeclared (first use in this function) bio_set_dev(bio, c->cache->bdev); ^ This patch fixes the regression by replacing c->cache->bdev by b->c-> cache->bdev. Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210411134316.80274-8-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11bcache: Use 64-bit arithmetic instead of 32-bitGustavo A. R. Silva
Cast multiple variables to (int64_t) in order to give the compiler complete information about the proper arithmetic to use. Notice that these variables are being used in contexts that expect expressions of type int64_t (64 bit, signed). And currently, such expressions are being evaluated using 32-bit arithmetic. Fixes: d0cf9503e908 ("octeontx2-pf: ethtool fec mode support") Addresses-Coverity-ID: 1501724 ("Unintentional integer overflow") Addresses-Coverity-ID: 1501725 ("Unintentional integer overflow") Addresses-Coverity-ID: 1501726 ("Unintentional integer overflow") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210411134316.80274-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11md: bcache: Trivial typo fixes in the file journal.cBhaskar Chowdhury
s/condidate/candidate/ s/folowing/following/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210411134316.80274-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>