summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2021-06-14md/raid5: avoid device_lock in read_one_chunk()Gal Ofri
There is a lock contention on device_lock in read_one_chunk(). device_lock is taken to sync conf->active_aligned_reads and conf->quiesce. read_one_chunk() takes the lock, then waits for quiesce=0 (resumed) before incrementing active_aligned_reads. raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits for active_aligned_reads to be zero before setting quiesce=1 (suspended). Introduce a fast (lockless) path in read_one_chunk(): activate aligned read without taking device_lock. In case quiesce starts while activating the aligned-read in fast path, deactivate it and revert to old behavior (take device_lock and wait for quiesce to finish). Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to gaurantee that read_one_chunk() does not miss an ongoing quiesce. My setups: 1. 8 local nvme drives (each up to 250k iops). 2. 8 ram disks (brd). Each setup with raid6 (6+2), 1024 io threads on a 96 cpu-cores (48 per socket) system. Record both iops and cpu spent on this contention with rand-read-4k. Record bw with sequential-read-128k. Note: in most cases cpu is still busy but due to "new" bottlenecks. nvme: | iops | cpu | bw ----------------------------------------------- without patch | 1.6M | ~50% | 5.5GB/s with patch | 2M (throttled) | 0% | 16GB/s (throttled) ram (brd): | iops | cpu | bw ----------------------------------------------- without patch | 2M | ~80% | 24GB/s with patch | 4M | 0% | 55GB/s CC: Song Liu <song@kernel.org> CC: Neil Brown <neilb@suse.de> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Gal Ofri <gal.ofri@storing.io> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md: add comments in md_integrity_registerGuoqing Jiang
Given it is not obvious for the error handling, let's try to add some comments here to make it clear. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md: check level before create and exit io_acct_setGuoqing Jiang
The bio_set (io_acct_set) is used by personalities to clone bio and trace the timestamp of bio. Some personalities such as raid1/10 don't need the bio_set, so add check to not create it unconditionally. Also update the comment for md_account_bio to make it more clear. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md: Constify attribute_group structsRikard Falkeborn
The attribute_group structs are never modified, they're only passed to sysfs_create_group() and sysfs_remove_group(). Make them const to allow the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md: mark some personalities as deprecatedGuoqing Jiang
Mark the three personalities (linear, fault and multipath) as deprecated because: 1. people can use dm multipath or nvme multipath. 2. linear is already deprecated in MODULE_ALIAS. 3. no one actively using fault. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md/raid10: enable io accountingGuoqing Jiang
For raid10, we record the start time between split bio and clone bio, and finish the accounting in the final endio. Also introduce start_time in r10bio accordingly. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md/raid1: enable io accountingGuoqing Jiang
For raid1, we record the start time between split bio and clone bio, and finish the accounting in the final endio. Also introduce start_time in r1bio accordingly. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md/raid1: rename print_msg with r1bio_existedGuoqing Jiang
The caller of raid1_read_request could pass NULL or a valid pointer for "struct r1bio *r1_bio", so it actually means whether r1_bio is existed or not. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md/raid5: avoid redundant bio clone in raid5_read_one_chunkGuoqing Jiang
After enable io accounting, chunk read bio could be cloned twice which is not good. To avoid such inefficiency, let's clone align_bio from io_acct_set too, then we need only call md_account_bio in make_request unconditionally. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md/raid5: move checking badblock before clone bio in raid5_read_one_chunkGuoqing Jiang
We don't need to clone bio if the relevant region has badblock. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md: add io accounting for raid0 and raid5Guoqing Jiang
We introduce a new bioset (io_acct_set) for raid0 and raid5 since they don't own clone infrastructure to accounting io. And the bioset is added to mddev instead of to raid0 and raid5 layer, because with this way, we can put common functions to md.h and reuse them in raid0 and raid5. Also struct md_io_acct is added accordingly which includes io start_time, the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct to get related io status. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-06-14md: revert io stats accountingGuoqing Jiang
The commit 41d2d848e5c0 ("md: improve io stats accounting") could cause double fault problem per the report [1], and also it is not correct to change ->bi_end_io if md don't own it, so let's revert it. And io stats accounting will be replemented in later commits. [1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t Fixes: 41d2d848e5c0 ("md: improve io stats accounting") Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
2021-05-13dm integrity: fix sparse warningsMikulas Patocka
Use the types __le* instead of __u* to fix sparse warnings. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-13dm integrity: revert to not using discard filler when recalulatingMikulas Patocka
Revert the commit 7a5b96b4784454ba258e83dc7469ddbacd3aaac3 ("dm integrity: use discard support when recalculating"). There's a bug that when we write some data beyond the current recalculate boundary, the checksum will be rewritten with the discard filler later. And the data will no longer have integrity protection. There's no easy fix for this case. Also, another problematic case is if dm-integrity is used to detect bitrot (random device errors, bit flips, etc); dm-integrity should detect that even for unused sectors. With commit 7a5b96b4784 it can happen that such change is undetected (because discard filler is not a valid checksum). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-13dm snapshot: fix crash with transient storage and zero chunk sizeMikulas Patocka
The following commands will crash the kernel: modprobe brd rd_size=1048576 dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0" dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0" The reason is that when we test for zero chunk size, we jump to the label bad_read_metadata without setting the "r" variable. The function snapshot_ctr destroys all the structures and then exits with "r == 0". The kernel then crashes because it falsely believes that snapshot_ctr succeeded. In order to fix the bug, we set the variable "r" to -EINVAL. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-10dm snapshot: fix a crash when an origin has no snapshotsMikulas Patocka
If an origin target has no snapshots, o->split_boundary is set to 0. This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split(). Fix this by initializing chunk_size, and in turn split_boundary, to rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits into "unsigned" type. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-05-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge yet more updates from Andrew Morton: "This is everything else from -mm for this merge window. 90 patches. Subsystems affected by this patch series: mm (cleanups and slub), alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat, checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov, panic, delayacct, gdb, resource, selftests, async, initramfs, ipc, drivers/char, and spelling" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (90 commits) mm: fix typos in comments mm: fix typos in comments treewide: remove editor modelines and cruft ipc/sem.c: spelling fix fs: fat: fix spelling typo of values kernel/sys.c: fix typo kernel/up.c: fix typo kernel/user_namespace.c: fix typos kernel/umh.c: fix some spelling mistakes include/linux/pgtable.h: few spelling fixes mm/slab.c: fix spelling mistake "disired" -> "desired" scripts/spelling.txt: add "overflw" scripts/spelling.txt: Add "diabled" typo scripts/spelling.txt: add "overlfow" arm: print alloc free paths for address in registers mm/vmalloc: remove vwrite() mm: remove xlate_dev_kmem_ptr() drivers/char: remove /dev/kmem for good mm: fix some typos and code style problems ipc/sem.c: mundane typo fixes ...
2021-05-06include: remove pagemap.h from blkdev.hMatthew Wilcox (Oracle)
My UEK-derived config has 1030 files depending on pagemap.h before this change. Afterwards, just 326 files need to be rebuilt when I touch pagemap.h. I think blkdev.h is probably included too widely, but untangling that dependency is harder and this solves my problem. x86 allmodconfig builds, but there may be implicit include problems on other architectures. Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Dan Williams <dan.j.williams@intel.com> [nvdimm] Acked-by: Jens Axboe <axboe@kernel.dk> [block] Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: Martin K. Petersen <martin.petersen@oracle.com> [scsi] Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-01Merge tag 'for-5.13/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Improve scalability of DM's device hash by switching to rbtree - Extend DM ioctl's DM_LIST_DEVICES_CMD handling to include UUID and allow filtering based on name or UUID prefix. - Various small fixes for typos, warnings, unused function, or needlessly exported interfaces. - Remove needless request_queue NULL pointer checks in DM thin and cache targets. - Remove unnecessary loop in DM core's __split_and_process_bio(). - Remove DM core's dm_vcalloc() and just use kvcalloc or kvmalloc_array instead (depending whether zeroing is useful). - Fix request-based DM's double free of blk_mq_tag_set in device remove after table load fails. - Improve DM persistent data performance on non-x86 by fixing packed structs to have a stated alignment. Also remove needless extra work from redundant calls to sm_disk_get_nr_free() and a paranoid BUG_ON() that caused duplicate checksum calculation. - Fix missing goto in DM integrity's bitmap_flush_interval error handling. - Add "reset_recalculate" feature flag to DM integrity. - Improve DM integrity by leveraging discard support to avoid needless re-writing of metadata and also use discard support to improve hash recalculation. - Fix race with DM raid target's reshape and MD raid4/5/6 resync that resulted in inconsistant reshape state during table reloads. - Update DM raid target to temove unnecessary discard limits for raid0 and raid10 now that MD has optimized discard handling for both raid levels. * tag 'for-5.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm raid: remove unnecessary discard limits for raid0 and raid10 dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails dm integrity: use discard support when recalculating dm integrity: increase RECALC_SECTORS to improve recalculate speed dm integrity: don't re-write metadata if discarding same blocks dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences dm raid: fix fall-through warning in rs_check_takeover() for Clang dm clone metadata: remove unused function dm integrity: fix missing goto in bitmap_flush_interval error handling dm: replace dm_vcalloc() dm space map common: fix division bug in sm_ll_find_free_block() dm persistent data: packed struct should have an aligned() attribute too dm btree spine: remove paranoid node_check call in node_prep_for_write() dm space map disk: remove redundant calls to sm_disk_get_nr_free() dm integrity: add the "reset_recalculate" feature flag dm persistent data: remove unused return from exit_shadow_spine() dm cache: remove needless request_queue NULL pointer checks dm thin: remove needless request_queue NULL pointer check dm: unexport dm_{get,put}_table_device dm ebs: fix a few typos ...
2021-04-30dm raid: remove unnecessary discard limits for raid0 and raid10Mike Snitzer
Commit 29efc390b946 ("md/md0: optimize raid0 discard handling") and commit d30588b2731f ("md/raid10: improve raid10 discard request") remove MD raid0's and raid10's inability to properly handle large discards. So eliminate associated constraints from dm-raid's support. Depends-on: 29efc390b946 ("md/md0: optimize raid0 discard handling") Depends-on: d30588b2731f ("md/raid10: improve raid10 discard request") Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30dm rq: fix double free of blk_mq_tag_set in dev remove after table load failsBenjamin Block
When loading a device-mapper table for a request-based mapped device, and the allocation/initialization of the blk_mq_tag_set for the device fails, a following device remove will cause a double free. E.g. (dmesg): device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device device-mapper: ioctl: unable to set up device queue for new table. Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0305e098835de000 TEID: 0305e098835de803 Fault in home space mode while using kernel ASCE. AS:000000025efe0007 R3:0000000000000024 Oops: 0038 ilc:3 [#1] SMP Modules linked in: ... lots of modules ... Supported: Yes, External CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G W X 5.3.18-53-default #1 SLE15-SP3 Hardware name: IBM 8561 T01 7I2 (LPAR) Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000 000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000 000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640 00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8 Krnl Code: 000000025e368eb8: c4180041e100 lgrl %r1,25eba50b8 000000025e368ebe: ecba06b93a55 risbg %r11,%r10,6,185,58 #000000025e368ec4: e3b010000008 ag %r11,0(%r1) >000000025e368eca: e310b0080004 lg %r1,8(%r11) 000000025e368ed0: a7110001 tmll %r1,1 000000025e368ed4: a7740129 brc 7,25e369126 000000025e368ed8: e320b0080004 lg %r2,8(%r11) 000000025e368ede: b904001b lgr %r1,%r11 Call Trace: [<000000025e368eca>] kfree+0x42/0x330 [<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8 [<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod] [<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod] [<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod] [<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod] [<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod] [<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod] [<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0 [<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40 [<000000025e8c15ac>] system_call+0xd8/0x2c8 Last Breaking-Event-Address: [<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8 Kernel panic - not syncing: Fatal exception: panic_on_oops When allocation/initialization of the blk_mq_tag_set fails in dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer is not reset to NULL; so when dev_remove() later gets into dm_mq_cleanup_mapped_device() it sees the pointer and tries to uninitialize and free it again. Fix this by setting the pointer to NULL in dm_mq_init_request_queue() error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device(). Cc: <stable@vger.kernel.org> # 4.6+ Fixes: 1c357a1e86a4 ("dm: allocate blk_mq_tag_set rather than embed in mapped_device") Signed-off-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30dm integrity: use discard support when recalculatingMikulas Patocka
If we have discard support we don't have to recalculate hash - we can just fill the metadata with the discard pattern. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30dm integrity: increase RECALC_SECTORS to improve recalculate speedMikulas Patocka
Increase RECALC_SECTORS because it improves recalculate speed slightly (from 390kiB/s to 410kiB/s). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-30dm integrity: don't re-write metadata if discarding same blocksMikulas Patocka
If we discard already discarded blocks we do not need to write discard pattern to the metadata, because it is already there. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-28Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: - MD changes via Song: - raid5 POWER fix - raid1 failure fix - UAF fix for md cluster - mddev_find_or_alloc() clean up - Fix NULL pointer deref with external bitmap - Performance improvement for raid10 discard requests - Fix missing information of /proc/mdstat - rsxx const qualifier removal (Arnd) - Expose allocated brd pages (Calvin) - rnbd via Gioh Kim: - Change maintainer - Change domain address of maintainers' email - Add polling IO mode and document update - Fix memory leak and some bug detected by static code analysis tools - Code refactoring - Series of floppy cleanups/fixes (Denis) - s390 dasd fixes (Julian) - kerneldoc fixes (Lee) - null_blk double free (Lv) - null_blk virtual boundary addition (Max) - Remove xsysace driver (Michal) - umem driver removal (Davidlohr) - ataflop fixes (Dan) - Revalidate disk removal (Christoph) - Bounce buffer cleanups (Christoph) - Mark lightnvm as deprecated (Christoph) - mtip32xx init cleanups (Shixin) - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang) * tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits) async_xor: increase src_offs when dropping destination page drivers/block/null_blk/main: Fix a double free in null_init. md/raid1: properly indicate failure when ending a failed write request md-cluster: fix use-after-free issue when removing rdev nvme: introduce generic per-namespace chardev nvme: cleanup nvme_configure_apst nvme: do not try to reconfigure APST when the controller is not live nvme: add 'kato' sysfs attribute nvme: sanitize KATO setting nvmet: avoid queuing keep-alive timer if it is disabled brd: expose number of allocated pages in debugfs ataflop: fix off by one in ataflop_probe() ataflop: potential out of bounds in do_format() drbd: Fix fall-through warnings for Clang block/rnbd: Use strscpy instead of strlcpy block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name block/rnbd-clt: Remove max_segment_size block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev Documentation/ABI/rnbd-clt: Add description for nr_poll_queues ...
2021-04-27Merge tag 'cfi-v5.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull CFI on arm64 support from Kees Cook: "This builds on last cycle's LTO work, and allows the arm64 kernels to be built with Clang's Control Flow Integrity feature. This feature has happily lived in Android kernels for almost 3 years[1], so I'm excited to have it ready for upstream. The wide diffstat is mainly due to the treewide fixing of mismatched list_sort prototypes. Other things in core kernel are to address various CFI corner cases. The largest code portion is the CFI runtime implementation itself (which will be shared by all architectures implementing support for CFI). The arm64 pieces are Acked by arm64 maintainers rather than coming through the arm64 tree since carrying this tree over there was going to be awkward. CFI support for x86 is still under development, but is pretty close. There are a handful of corner cases on x86 that need some improvements to Clang and objtool, but otherwise works well. Summary: - Clean up list_sort prototypes (Sami Tolvanen) - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)" * tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: arm64: allow CONFIG_CFI_CLANG to be selected KVM: arm64: Disable CFI for nVHE arm64: ftrace: use function_nocfi for ftrace_call arm64: add __nocfi to __apply_alternatives arm64: add __nocfi to functions that jump to a physical address arm64: use function_nocfi with __pa_symbol arm64: implement function_nocfi psci: use function_nocfi for cpu_resume lkdtm: use function_nocfi treewide: Change list_sort to use const pointers bpf: disable CFI in dispatcher functions kallsyms: strip ThinLTO hashes from static functions kthread: use WARN_ON_FUNCTION_MISMATCH workqueue: use WARN_ON_FUNCTION_MISMATCH module: ensure __cfi_check alignment mm: add generic function_nocfi macro cfi: add __cficanonical add support for Clang CFI
2021-04-23md/raid1: properly indicate failure when ending a failed write requestPaul Clements
This patch addresses a data corruption bug in raid1 arrays using bitmaps. Without this fix, the bitmap bits for the failed I/O end up being cleared. Since we are in the failure leg of raid1_end_write_request, the request either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded). Fixes: eeba6809d8d5 ("md/raid1: end bio when the device faulty") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Paul Clements <paul.clements@us.sios.com> Signed-off-by: Song Liu <song@kernel.org>
2021-04-23md-cluster: fix use-after-free issue when removing rdevHeming Zhao
md_kick_rdev_from_array will remove rdev, so we should use rdev_for_each_safe to search list. How to trigger: env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns). ``` node2=192.168.0.3 for i in {1..20}; do echo ==== $i `date` ====; mdadm -Ss && ssh ${node2} "mdadm -Ss" wipefs -a /dev/sda /dev/sdb mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \ /dev/sdb --assume-clean ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" mdadm --wait /dev/md0 ssh ${node2} "mdadm --wait /dev/md0" mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda sleep 1 done ``` Crash stack: ``` stack segment: 0000 [#1] SMP ... ... RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod] ... ... RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207 RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50 RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246 RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800 FS: 0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: raid1d+0x5c/0xd40 [raid1] ? finish_task_switch+0x75/0x2a0 ? lock_timer_base+0x67/0x80 ? try_to_del_timer_sync+0x4d/0x80 ? del_timer_sync+0x41/0x50 ? schedule_timeout+0x254/0x2d0 ? md_start_sync+0xe0/0xe0 [md_mod] ? md_thread+0x127/0x160 [md_mod] md_thread+0x127/0x160 [md_mod] ? wait_woken+0x80/0x80 kthread+0x10d/0x130 ? kthread_park+0xa0/0xa0 ret_from_fork+0x1f/0x40 ``` Fixes: dbb64f8635f5d ("md-cluster: Fix adding of new disk with new reload code") Fixes: 659b254fa7392 ("md-cluster: remove a disk asynchronously from cluster environment") Cc: stable@vger.kernel.org Reviewed-by: Gang He <ghe@suse.com> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2021-04-21dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload ↵Heinz Mauelshagen
sequences If fast table reloads occur during an ongoing reshape of raid4/5/6 devices the target may race reading a superblock vs the the MD resync thread; causing an inconclusive reshape state to be read in its constructor. lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause BUG_ON() to trigger in md_run(), e.g.: "kernel BUG at drivers/md/raid5.c:7567!". Scenario triggering the bug: 1. the MD sync thread calls end_reshape() from raid5_sync_request() when done reshaping. However end_reshape() _only_ updates the reshape position to MaxSector keeping the changed layout configuration though (i.e. any delta disks, chunk sector or RAID algorithm changes). That inconclusive configuration is stored in the superblock. 2. dm-raid constructs a mapping, loading named inconsistent superblock as of step 1 before step 3 is able to finish resetting the reshape state completely, and calls md_run() which leads to mentioned bug in raid5.c. 3. the MD RAID personality's finish_reshape() is called; which resets the reshape information on chunk sectors, delta disks, etc. This explains why the bug is rarely seen on multi-core machines, as MD's finish_reshape() superblock update races with the dm-raid constructor's superblock load in step 2. Fix identifies inconclusive superblock content in the dm-raid constructor and resets it before calling md_run(), factoring out identifying checks into rs_is_layout_change() to share in existing rs_reshape_requested() and new rs_reset_inclonclusive_reshape(). Also enhance a comment and remove an empty line. Cc: stable@vger.kernel.org Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-20dm raid: fix fall-through warning in rs_check_takeover() for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm clone metadata: remove unused functionJiapeng Chong
Fix the following clang warning: drivers/md/dm-clone-metadata.c:279:19: warning: unused function 'superblock_write_lock' [-Wunused-function]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm integrity: fix missing goto in bitmap_flush_interval error handlingTian Tao
Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode") Cc: stable@vger.kernel.org Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm: replace dm_vcalloc()Matthew Wilcox (Oracle)
Use kvcalloc or kvmalloc_array instead (depending whether zeroing is useful). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm space map common: fix division bug in sm_ll_find_free_block()Joe Thornber
This division bug meant the search for free metadata space could skip the final allocation bitmap's worth of entries. Fix affects DM thinp, cache and era targets. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Tested-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm persistent data: packed struct should have an aligned() attribute tooJoe Thornber
Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to byte-by-byte access. Cc: stable@vger.kernel.org Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm btree spine: remove paranoid node_check call in node_prep_for_write()Joe Thornber
Remove this extra BUG_ON() that calls node_check() -- which avoids extra crc checking. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-19dm space map disk: remove redundant calls to sm_disk_get_nr_free()Joe Thornber
Both sm_disk_new_block and sm_disk_commit are needlessly calling sm_disk_get_nr_free(). Looks like old queries used for some debugging. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-15md/bitmap: wait for external bitmap writes to complete during tear downSudhakar Panneerselvam
NULL pointer dereference was observed in super_written() when it tries to access the mddev structure. [The below stack trace is from an older kernel, but the problem described in this patch applies to the mainline kernel.] [ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000 [ 1194.488000] RIP: 0010:super_written+0x29/0xe1 [ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046 [ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048 [ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200 [ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000 [ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200 [ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1194.588117] FS: 0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000 [ 1194.604264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0 [ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1194.663316] PKRU: 55555554 [ 1194.674090] Call Trace: [ 1194.683735] <IRQ> [ 1194.692948] bio_endio+0xae/0x135 [ 1194.703580] blk_update_request+0xad/0x2fa [ 1194.714990] blk_update_bidi_request+0x20/0x72 [ 1194.726578] __blk_end_bidi_request+0x2c/0x4d [ 1194.738373] __blk_end_request_all+0x31/0x49 [ 1194.749344] blk_flush_complete_seq+0x377/0x383 [ 1194.761550] flush_end_io+0x1dd/0x2a7 [ 1194.772910] blk_finish_request+0x9f/0x13c [ 1194.784544] scsi_end_request+0x180/0x25c [ 1194.796149] scsi_io_completion+0xc8/0x610 [ 1194.807503] scsi_finish_command+0xdc/0x125 [ 1194.818897] scsi_softirq_done+0x81/0xde [ 1194.830062] blk_done_softirq+0xa4/0xcc [ 1194.841008] __do_softirq+0xd9/0x29f [ 1194.851257] irq_exit+0xe6/0xeb [ 1194.861290] do_IRQ+0x59/0xe3 [ 1194.871060] common_interrupt+0x1c6/0x382 [ 1194.881988] </IRQ> [ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5 [ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43 [ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f [ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000 [ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2 [ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003 [ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a [ 1194.988607] ? cpuidle_enter_state+0xcc/0x2a5 [ 1194.999077] cpuidle_enter+0x17/0x19 [ 1195.008395] call_cpuidle+0x23/0x3a [ 1195.017718] do_idle+0x172/0x1d5 [ 1195.026358] cpu_startup_entry+0x73/0x75 [ 1195.035769] start_secondary+0x1b9/0x20b [ 1195.044894] secondary_startup_64+0xa5/0xa5 [ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78 [ 1195.096354] CR2: 00000000000002b8 bio in the above stack is a bitmap write whose completion is invoked after the tear down sequence sets the mddev structure to NULL in rdev. During tear down, there is an attempt to flush the bitmap writes, but for external bitmaps, there is no explicit wait for all the bitmap writes to complete. For instance, md_bitmap_flush() is called to flush the bitmap writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush() could generate new bitmap writes for which there is no explicit wait to complete those writes. The call to md_bitmap_update_sb() will return simply for external bitmaps and the follow-up call to md_update_sb() is conditional and may not get called for external bitmaps. This results in a kernel panic when the completion routine, super_written() is called which tries to reference mddev in the rdev that has been set to NULL(in unbind_rdev_from_array() by tear down sequence). The solution is to call md_super_wait() for external bitmaps after the last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there are no pending bitmap writes before proceeding with the tear down. Cc: stable@vger.kernel.org Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com> Reviewed-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: do not return existing mddevs from mddev_find_or_allocChristoph Hellwig
Instead of returning an existing mddev, just for it to be discarded later directly return -EEXIST. Rename the function to mddev_alloc now that it doesn't find an existing mddev. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: refactor mddev_find_or_allocChristoph Hellwig
Allocate the new mddev first speculatively, which greatly simplifies the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-15md: factor out a mddev_alloc_unit helper from mddev_findChristoph Hellwig
Split out a self contained helper to find a free minor for the md "unit" number. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2021-04-14dm verity fec: fix misaligned RS roots IOJaegeuk Kim
commit df7b59ba9245 ("dm verity: fix FEC for RS roots unaligned to block size") introduced the possibility for misaligned roots IO relative to the underlying device's logical block size. E.g. Android's default RS roots=2 results in dm_bufio->block_size=1024, which causes the following EIO if the logical block size of the device is 4096, given v->data_dev_block_bits=12: E sd 0 : 0:0:0: [sda] tag#30 request not aligned to the logical block size E blk_update_request: I/O error, dev sda, sector 10368424 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 E device-mapper: verity-fec: 254:8: FEC 9244672: parity read failed (block 18056): -5 Fix this by onlu using f->roots for dm_bufio blocksize IFF it is aligned to v->data_dev_block_bits. Fixes: df7b59ba9245 ("dm verity: fix FEC for RS roots unaligned to block size") Cc: stable@vger.kernel.org Signed-off-by: Jaegeuk Kim <jaegeuk@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-04-11bcache: fix a regression of code compiling failure in debug.cColy Li
The patch "bcache: remove PTR_CACHE" introduces a compiling failure in debug.c with following error message, In file included from drivers/md/bcache/bcache.h:182:0, from drivers/md/bcache/debug.c:9: drivers/md/bcache/debug.c: In function 'bch_btree_verify': drivers/md/bcache/debug.c:53:19: error: 'c' undeclared (first use in this function) bio_set_dev(bio, c->cache->bdev); ^ This patch fixes the regression by replacing c->cache->bdev by b->c-> cache->bdev. Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210411134316.80274-8-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11bcache: Use 64-bit arithmetic instead of 32-bitGustavo A. R. Silva
Cast multiple variables to (int64_t) in order to give the compiler complete information about the proper arithmetic to use. Notice that these variables are being used in contexts that expect expressions of type int64_t (64 bit, signed). And currently, such expressions are being evaluated using 32-bit arithmetic. Fixes: d0cf9503e908 ("octeontx2-pf: ethtool fec mode support") Addresses-Coverity-ID: 1501724 ("Unintentional integer overflow") Addresses-Coverity-ID: 1501725 ("Unintentional integer overflow") Addresses-Coverity-ID: 1501726 ("Unintentional integer overflow") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210411134316.80274-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11md: bcache: Trivial typo fixes in the file journal.cBhaskar Chowdhury
s/condidate/candidate/ s/folowing/following/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210411134316.80274-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11md: bcache: avoid -Wempty-body warningsArnd Bergmann
building with 'make W=1' shows a harmless warning for each user of the EBUG_ON() macro: drivers/md/bcache/bset.c: In function 'bch_btree_sort_partial': drivers/md/bcache/util.h:30:55: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 30 | #define EBUG_ON(cond) do { if (cond); } while (0) | ^ drivers/md/bcache/bset.c:1312:9: note: in expansion of macro 'EBUG_ON' 1312 | EBUG_ON(oldsize >= 0 && bch_count_data(b) != oldsize); | ^~~~~~~ Reword the macro slightly to avoid the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210411134316.80274-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11bcache: use NULL instead of using plain integer as pointerYang Li
This fixes the following sparse warnings: drivers/md/bcache/features.c:22:16: warning: Using plain integer as NULL pointer Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210411134316.80274-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11bcache: remove PTR_CACHEChristoph Hellwig
Remove the PTR_CACHE inline and replace it with a direct dereference of c->cache. (Coly Li: fix the typo from PTR_BUCKET to PTR_CACHE in commit log) Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210411134316.80274-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-11bcache: reduce redundant code in bch_cached_dev_run()Zhiqiang Liu
In bch_cached_dev_run(), free(env[1])|free(env[2])|free(buf) show up three times. This patch introduce out tag in which free(env[1])|free(env[2])|free(buf) are only called one time. If we need to call free() when errors occur, we can set error code to ret, and then goto out tag directly. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20210411134316.80274-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-08treewide: Change list_sort to use const pointersSami Tolvanen
list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches. Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com