summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2023-02-14dm: correct block comments format.Heinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: address indent/space issuesHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: address space issues relative to switch/while/for/...Heinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: avoid initializing static variablesHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: enclose complex macros into parentheses where possibleHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: avoid assignment in if conditionsHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: change "unsigned" to "unsigned int"Heinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: use fsleep() instead of msleep() for deterministic sleep durationHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: prefer kmap_local_page() instead of deprecated kmap_atomic()Heinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: add missing SPDX-License-IndentifiersHeinz Mauelshagen
'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has deprecated its use. Suggested-by: John Wiele <jwiele@redhat.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: send just one event on resize, not twoMikulas Patocka
Device mapper sends an uevent when the device is suspended, using the function set_capacity_and_notify. However, this causes a race condition with udev. Udev skips scanning dm devices that are suspended. If we send an uevent while we are suspended, udev will be racing with device mapper resume code. If the device mapper resume code wins the race, udev will process the uevent after the device is resumed and it will properly scan the device. However, if udev wins the race, it will receive the uevent, find out that the dm device is suspended and skip scanning the device. This causes bugs such as systemd unmounting the device - see https://bugzilla.redhat.com/show_bug.cgi?id=2158628 This commit fixes this race. We replace the function set_capacity_and_notify with set_capacity, so that the uevent is not sent at this point. In do_resume, we detect if the capacity has changed and we pass a boolean variable need_resize_uevent to dm_kobject_uevent. dm_kobject_uevent adds "RESIZE=1" to the uevent if need_resize_uevent is set. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Peter Rajnoha <prajnoha@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-13dm: Remove completion function scaffoldingHerbert Xu
This patch removes the temporary scaffolding now that the comletion function signature has been converted. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-02-13dm: Add scaffolding to change completion function signatureHerbert Xu
This patch adds temporary scaffolding so that the Crypto API completion function can take a void * instead of crypto_async_request. Once affected users have been converted this can be removed. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-02-12dm table: check that a dm device doesn't reference itselfBenjamin Marzinski
If a DM device's table references itself, it will crash the kernel with an infinite recursion. Check for a self-reference in dm_get_device(). This is a quick check, but it won't catch more complicated circular references. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-12dm raid: fix some spelling mistakes in commentsYu Zhe
Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-08md: account io_acct_set usage with active_ioXiao Ni
io_acct_set was enabled for raid0/raid5 io accounting. bios that contain md_io_acct are allocated in the i/o path. There isn't a good method to monitor if these bios are all finished and freed. In the takeover process, io_acct_set (which is used for bios with md_io_acct) need to be freed. However, if some bios finish after io_acct_set is freed, it may trigger the following panic: [ 6973.767999] RIP: 0010:mempool_free+0x52/0x80 [ 6973.786098] Call Trace: [ 6973.786549] md_end_io_acct+0x31/0x40 [ 6973.787227] blk_update_request+0x224/0x380 [ 6973.787994] blk_mq_end_request+0x1a/0x130 [ 6973.788739] blk_complete_reqs+0x35/0x50 [ 6973.789456] __do_softirq+0xd7/0x2c8 [ 6973.790114] ? sort_range+0x20/0x20 [ 6973.790763] run_ksoftirqd+0x2a/0x40 [ 6973.791400] smpboot_thread_fn+0xb5/0x150 [ 6973.792114] kthread+0x10b/0x130 [ 6973.792724] ? set_kthread_struct+0x50/0x50 [ 6973.793491] ret_from_fork+0x1f/0x40 Fix this by increasing and decreasing active_io for each bio with md_io_acct so that mddev_suspend() will wait until all bios from io_acct_set finish before freeing io_acct_set. Reported-by: Fine Fan <ffan@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2023-02-02drivers/md: Remove "select SRCU"Paul E. McKenney
Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: <dm-devel@redhat.com> Cc: <linux-raid@vger.kernel.org> Acked-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de>
2023-02-02dm verity: stop using WQ_UNBOUND for verify_wqNathan Huckleberry
Setting WQ_UNBOUND increases scheduler latency on ARM64. This is likely due to the asymmetric architecture of ARM64 processors. I've been unable to reproduce the results that claim WQ_UNBOUND gives a performance boost on x86-64. This flag is causing performance issues for multiple subsystems within Android. Notably, the same slowdown exists for decompression with EROFS. | open-prebuilt-camera | WQ_UNBOUND | ~WQ_UNBOUND | |-----------------------|------------|---------------| | verity wait time (us) | 11746 | 119 (-98%) | | erofs wait time (us) | 357805 | 174205 (-51%) | | sha256 ramdisk random read | WQ_UNBOUND | ~WQ_UNBOUND | |----------------------------|-----------=---|-------------| | arm64 (accelerated) | bw=42.4MiB/s | bw=212MiB/s | | arm64 (generic) | bw=16.5MiB/s | bw=48MiB/s | | x86_64 (generic) | bw=233MiB/s | bw=230MiB/s | Using a alloc_workqueue() @max_active arg of num_online_cpus() only made sense with WQ_UNBOUND. Switch the @max_active arg to 0 (aka default, which is 256 per-cpu). Also, eliminate 'wq_flags' since it really doesn't serve a purpose. Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Nathan Huckleberry <nhuck@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02dm integrity: Remove bi_sector that's only used by commented debug codeJiapeng Chong
drivers/md/dm-integrity.c:1738:13: warning: variable 'bi_sector' set but not used. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3895 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02dm crypt: Slightly simplify crypt_set_keyring_key()Christophe JAILLET
Use strchr() instead of strpbrk() when there is only 1 element in the set of characters to look for. This potentially saves a few cycles, but gcc does already account for optimizing this pattern thanks to it's fold_builtin_strpbrk(). Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02dm ioctl: drop always-false conditionSergey Shtylyov
The expression 'indata[3] > ULONG_MAX' always evaluates to false since indata[] is declared as an array of *unsigned long* elements and #define ULONG_MAX represents the max value of that exact type... Note that gcc seems to be able to detect the dead code here and eliminate this check anyway... Found by Linux Verification Center (linuxtesting.org) with the SVACE static analysis tool. Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-02dm flakey: fix logic when corrupting a bioMikulas Patocka
If "corrupt_bio_byte" is set to corrupt reads and corrupt_bio_flags is used, dm-flakey would erroneously return all writes as errors. Likewise, if "corrupt_bio_byte" is set to corrupt writes, dm-flakey would return errors for all reads. Fix the logic so that if fc->corrupt_bio_byte is non-zero, dm-flakey will not abort reads on writes with an error. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-01md: use MD_RESYNC_* whenever possibleHou Tao
Just replace magic numbers by MD_RESYNC_* enumerations. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2023-02-01dm flakey: fix a bug with 32-bit highmem systemsMikulas Patocka
The function page_address does not work with 32-bit systems with high memory. Use bvec_kmap_local/kunmap_local instead. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-01dm flakey: don't corrupt the zero pageMikulas Patocka
When we need to zero some range on a block device, the function __blkdev_issue_zero_pages submits a write bio with the bio vector pointing to the zero page. If we use dm-flakey with corrupt bio writes option, it will corrupt the content of the zero page which results in crashes of various userspace programs. Glibc assumes that memory returned by mmap is zeroed and it uses it for calloc implementation; if the newly mapped memory is not zeroed, calloc will return non-zeroed memory. Fix this bug by testing if the page is equal to ZERO_PAGE(0) and avoiding the corruption in this case. Cc: stable@vger.kernel.org Fixes: a00f5276e266 ("dm flakey: Properly corrupt multi-page bios.") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-01md: Free writes_pending in md_stopXiao Ni
dm raid calls md_stop to stop the raid device. It needs to free the writes_pending here. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2023-02-01md: Change active_io to percpuXiao Ni
Now the type of active_io is atomic. It's used to count how many ios are in the submitting process and it's added and decreased very time. But it only needs to check if it's zero when suspending the raid. So we can switch atomic to percpu to improve the performance. After switching active_io to percpu type, we use the state of active_io to judge if the raid device is suspended. And we don't need to wake up ->sb_wait in md_handle_request anymore. It's done in the callback function which is registered when initing active_io. The argument mddev->suspended is only used to count how many users are trying to set raid to suspend state. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2023-02-01md: Factor out is_md_suspended helperXiao Ni
This helper function will be used in next patch. It's easy for understanding. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2023-02-01md: don't update recovery_cp when curr_resync is ACTIVEHou Tao
Don't update recovery_cp when curr_resync is MD_RESYNC_ACTIVE, otherwise md may skip the resync of the first 3 sectors if the resync procedure is interrupted before the first calling of ->sync_request() as shown below: md_do_sync thread control thread // setup resync mddev->recovery_cp = 0 j = 0 mddev->curr_resync = MD_RESYNC_ACTIVE // e.g., set array as idle set_bit(MD_RECOVERY_INTR, &&mddev_recovery) // resync loop // check INTR before calling sync_request !test_bit(MD_RECOVERY_INTR, &mddev->recovery // resync interrupted // update recovery_cp from 0 to 3 // the resync of three 3 sectors will be skipped mddev->recovery_cp = 3 Fixes: eac58d08d493 ("md: Use enum for overloaded magic numbers used by mddev->curr_resync") Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2023-01-30dm cache: Add some documentation to dm-cache-background-tracker.hJoe Thornber
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-01-30dm cache: free background tracker's queued work in btracker_destroyJoe Thornber
Otherwise the kernel can BUG with: [ 2245.426978] ============================================================================= [ 2245.435155] BUG bt_work (Tainted: G B W ): Objects remaining in bt_work on __kmem_cache_shutdown() [ 2245.445233] ----------------------------------------------------------------------------- [ 2245.445233] [ 2245.454879] Slab 0x00000000b0ce2b30 objects=64 used=2 fp=0x000000000a3c6a4e flags=0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff) [ 2245.467300] CPU: 7 PID: 10805 Comm: lvm Kdump: loaded Tainted: G B W 6.0.0-rc2 #19 [ 2245.476078] Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.5.6 10/06/2021 [ 2245.483646] Call Trace: [ 2245.486100] <TASK> [ 2245.488206] dump_stack_lvl+0x34/0x48 [ 2245.491878] slab_err+0x95/0xcd [ 2245.495028] __kmem_cache_shutdown.cold+0x31/0x136 [ 2245.499821] kmem_cache_destroy+0x49/0x130 [ 2245.503928] btracker_destroy+0x12/0x20 [dm_cache] [ 2245.508728] smq_destroy+0x15/0x60 [dm_cache_smq] [ 2245.513435] dm_cache_policy_destroy+0x12/0x20 [dm_cache] [ 2245.518834] destroy+0xc0/0x110 [dm_cache] [ 2245.522933] dm_table_destroy+0x5c/0x120 [dm_mod] [ 2245.527649] __dm_destroy+0x10e/0x1c0 [dm_mod] [ 2245.532102] dev_remove+0x117/0x190 [dm_mod] [ 2245.536384] ctl_ioctl+0x1a2/0x290 [dm_mod] [ 2245.540579] dm_ctl_ioctl+0xa/0x20 [dm_mod] [ 2245.544773] __x64_sys_ioctl+0x8a/0xc0 [ 2245.548524] do_syscall_64+0x5c/0x90 [ 2245.552104] ? syscall_exit_to_user_mode+0x12/0x30 [ 2245.556897] ? do_syscall_64+0x69/0x90 [ 2245.560648] ? do_syscall_64+0x69/0x90 [ 2245.564394] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 2245.569447] RIP: 0033:0x7fe52583ec6b ... [ 2245.646771] ------------[ cut here ]------------ [ 2245.651395] kmem_cache_destroy bt_work: Slab cache still has objects when called from btracker_destroy+0x12/0x20 [dm_cache] [ 2245.651408] WARNING: CPU: 7 PID: 10805 at mm/slab_common.c:478 kmem_cache_destroy+0x128/0x130 Found using: lvm2-testsuite --only "cache-single-split.sh" Ben bisected and found that commit 0495e337b703 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock") first exposed dm-cache's incomplete cleanup of its background tracker work objects. Reported-by: Benjamin Marzinski <bmarzins@redhat.com> Tested-by: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-01-30dm: improve shrinker debug namesMike Snitzer
Commit e33c267ab70d ("mm: shrinkers: provide shrinkers with names") chose some fairly bad names for DM's shrinkers. Fixes: e33c267ab70d ("mm: shrinkers: provide shrinkers with names") Signed-off-by : Mike Snitzer <snitzer@kernel.org>
2023-01-27Merge tag 'hardening-v6.2-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST - Reorganize gcc-plugin includes for GCC 13 - Silence bcache memcpy run-time false positive warnings * tag 'hardening-v6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: bcache: Silence memcpy() run-time false positive warnings gcc-plugins: Reorganize gimple includes for GCC 13 kunit: memcpy: Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST
2023-01-25bcache: Silence memcpy() run-time false positive warningsKees Cook
struct bkey has internal padding in a union, but it isn't always named the same (e.g. key ## _pad, key_p, etc). This makes it extremely hard for the compiler to reason about the available size of copies done against such keys. Use unsafe_memcpy() for now, to silence the many run-time false positive warnings: memcpy: detected field-spanning write (size 264) of single field "&i->j" at drivers/md/bcache/journal.c:152 (size 240) memcpy: detected field-spanning write (size 24) of single field "&b->key" at drivers/md/bcache/btree.c:939 (size 16) memcpy: detected field-spanning write (size 24) of single field "&temp.key" at drivers/md/bcache/extents.c:428 (size 16) Reported-by: Alexandre Pereira <alexpereira@disroot.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216785 Acked-by: Coly Li <colyli@suse.de> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: linux-bcache@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230106060229.never.047-kees@kernel.org
2023-01-12md: fix incorrect declaration about claim_rdev in md_import_deviceAdrian Huang
Commit fb541ca4c365 ("md: remove lock_bdev / unlock_bdev") removes wrappers for blkdev_get/blkdev_put. However, the uninitialized local static variable of pointer type 'claim_rdev' in md_import_device() is NULL, which leads to the following warning call trace: WARNING: CPU: 22 PID: 1037 at block/bdev.c:577 bd_prepare_to_claim+0x131/0x150 CPU: 22 PID: 1037 Comm: mdadm Not tainted 6.2.0-rc3+ #69 .. RIP: 0010:bd_prepare_to_claim+0x131/0x150 .. Call Trace: <TASK> ? _raw_spin_unlock+0x15/0x30 ? iput+0x6a/0x220 blkdev_get_by_dev.part.0+0x4b/0x300 md_import_device+0x126/0x1d0 new_dev_store+0x184/0x240 md_attr_store+0x80/0xf0 kernfs_fop_write_iter+0x128/0x1c0 vfs_write+0x2be/0x3c0 ksys_write+0x5f/0xe0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc It turns out the md device cannot be used: md: could not open device unknown-block(259,0). md: md127 stopped. Fix the issue by declaring the local static variable of struct type and passing the pointer of the variable to blkdev_get_by_dev(). Fixes: fb541ca4c365 ("md: remove lock_bdev / unlock_bdev") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2023-01-05bcache: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helperGustavo A. R. Silva
Zero-length arrays are deprecated and we are moving towards adopting C99 flexible-array members, instead. So, replace zero-length arrays declarations in anonymous union with the new DECLARE_FLEX_ARRAY() helper macro. This helper allows for flexible-array members in unions. Link: https://github.com/KSPP/linux/issues/193 Link: https://github.com/KSPP/linux/issues/213 Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2023-01-04block: handle bio_split_to_limits() NULL returnJens Axboe
This can't happen right now, but in preparation for allowing bio_split_to_limits() returning NULL if it ended the bio, check for it in all the callers. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-13Merge tag 'for-6.2/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Fix use-after-free races due to missing resource cleanup during DM target destruction in DM targets: thin-pool, cache, integrity and clone. - Fix ABBA deadlocks in DM thin-pool and cache targets due to their use of a bufio client (that has a shrinker whose locking can cause the incorrect locking order). - Fix DM cache target to set its needs_check flag after first aborting the metadata (whereby using reset persistent-data objects to update the superblock with, otherwise the superblock update could be dropped due to aborting metadata). This was found with code-inspection when comparing with the equivalent in DM thinp code. - Fix DM thin-pool's presume to continue resuming the device even if the pool in is fail mode -- otherwise bios may never be failed up the IO stack (which will prevent resetting the thin-pool target via table reload) - Fix DM thin-pool's metadata to use proper btree root (from previous transaction) if metadata commit failed. - Add 'waitfor' module param to DM module (dm_mod) to allow dm-init to wait for the specified device before continuing with its DM target initialization. * tag 'for-6.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm thin: Use last transaction's pmd->root when commit failed dm init: add dm-mod.waitfor to wait for asynchronously probed block devices dm ioctl: fix a couple ioctl codes dm ioctl: a small code cleanup in list_version_get_info dm thin: resume even if in FAIL mode dm cache: set needs_check flag after aborting metadata dm cache: Fix ABBA deadlock between shrink_slab and dm_cache_metadata_abort dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata dm integrity: Fix UAF in dm_integrity_dtr() dm cache: Fix UAF in destroy() dm clone: Fix UAF in clone_dtr() dm thin: Fix UAF in run_timer_softirq()
2022-12-13Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - Refactor PCIe probing and reset (Christoph Hellwig) - Various fabrics authentication fixes and improvements (Sagi Grimberg) - Avoid fallback to sequential scan due to transient issues (Uday Shankar) - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - Force reconnect when number of queue changes in nvmet (Daniel Wagner) - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET) - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni) - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig) - Cleanup the nvme-pci removal path (Christoph Hellwig) - Use kstrtobool() instead of strtobool (Christophe JAILLET) - Allow unprivileged passthrough of Identify Controller (Joel Granados) - Support io stats on the mpath device (Sagi Grimberg) - Minor nvmet cleanup (Sagi Grimberg) - MD pull requests via Song: - Code cleanups (Christoph) - Various fixes - Floppy pull request from Denis: - Fix a memory leak in the init error path (Yuan) - Series fixing some batch wakeup issues with sbitmap (Gabriel) - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg) - Fix for partition read on an exclusively opened bdev (Jan) - Series of elevator API cleanups (Jinlong, Christoph) - Series of fixes and cleanups for blk-iocost (Kemeng) - Series of fixes and cleanups for blk-throttle (Kemeng) - Series adding concurrent support for sync queues in BFQ (Yu) - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp) - Misc drbd fixes (Wang) - blk-wbt fixes and tweaks for enable/disable (Yu) - Fixes for mq-deadline for zoned devices (Damien) - Add support for read-only and offline zones for null_blk (Shin'ichiro) - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph) - Series enabling bio alloc caching for IRQ based IO (Pavel) - Series enabling userspace peer-to-peer DMA (Logan) - BFQ waker fixes (Khazhismel) - Series fixing elevator refcount issues (Christoph, Jinlong) - Series cleaning up references around queue destruction (Christoph) - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao) - Series untangling the queue kobject and queue references (Christoph) - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph) * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits) blktrace: Fix output non-blktrace event when blk_classic option enabled block: sed-opal: Don't include <linux/kernel.h> sed-opal: allow using IOC_OPAL_SAVE for locking too blk-cgroup: Fix typo in comment block: remove bio_set_op_attrs nvmet: don't open-code NVME_NS_ATTR_RO enumeration nvme-pci: use the tagset alloc/free helpers nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers nvme: consolidate setting the tagset flags nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set block: bio_copy_data_iter nvme-pci: split out a nvme_pci_ctrl_is_dead helper nvme-pci: return early on ctrl state mismatch in nvme_reset_work nvme-pci: rename nvme_disable_io_queues nvme-pci: cleanup nvme_suspend_queue nvme-pci: remove nvme_pci_disable nvme-pci: remove nvme_disable_admin_queue nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl nvme: use nvme_wait_ready in nvme_shutdown_ctrl ...
2022-12-12Merge tag 'random-6.2-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random*_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. * tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...
2022-12-08dm thin: Use last transaction's pmd->root when commit failedZhihao Cheng
Recently we found a softlock up problem in dm thin pool btree lookup code due to corrupted metadata: Kernel panic - not syncing: softlockup: hung tasks CPU: 7 PID: 2669225 Comm: kworker/u16:3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: dm-thin do_worker [dm_thin_pool] Call Trace: <IRQ> dump_stack+0x9c/0xd3 panic+0x35d/0x6b9 watchdog_timer_fn.cold+0x16/0x25 __run_hrtimer+0xa2/0x2d0 </IRQ> RIP: 0010:__relink_lru+0x102/0x220 [dm_bufio] __bufio_new+0x11f/0x4f0 [dm_bufio] new_read+0xa3/0x1e0 [dm_bufio] dm_bm_read_lock+0x33/0xd0 [dm_persistent_data] ro_step+0x63/0x100 [dm_persistent_data] btree_lookup_raw.constprop.0+0x44/0x220 [dm_persistent_data] dm_btree_lookup+0x16f/0x210 [dm_persistent_data] dm_thin_find_block+0x12c/0x210 [dm_thin_pool] __process_bio_read_only+0xc5/0x400 [dm_thin_pool] process_thin_deferred_bios+0x1a4/0x4a0 [dm_thin_pool] process_one_work+0x3c5/0x730 Following process may generate a broken btree mixed with fresh and stale btree nodes, which could get dm thin trapped in an infinite loop while looking up data block: Transaction 1: pmd->root = A, A->B->C // One path in btree pmd->root = X, X->Y->Z // Copy-up Transaction 2: X,Z is updated on disk, Y write failed. // Commit failed, dm thin becomes read-only. process_bio_read_only dm_thin_find_block __find_block dm_btree_lookup(pmd->root) The pmd->root points to a broken btree, Y may contain stale node pointing to any block, for example X, which gets dm thin trapped into a dead loop while looking up Z. Fix this by setting pmd->root in __open_metadata(), so that dm thin will use the last transaction's pmd->root if commit failed. Fetch a reproducer in [Link]. Linke: https://bugzilla.kernel.org/show_bug.cgi?id=216790 Cc: stable@vger.kernel.org Fixes: 991d9fa02da0 ("dm: add thin provisioning target") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-12-07block: remove bio_set_op_attrsChristoph Hellwig
This macro is obsolete, so replace the last few uses with open coded bi_opf assignments. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de <mailto:colyli@suse.de>> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221206144057.720846-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-02dm init: add dm-mod.waitfor to wait for asynchronously probed block devicesPeter Korsgaard
Just calling wait_for_device_probe() is not enough to ensure that asynchronously probed block devices are available (E.G. mmc, usb), so add a "dm-mod.waitfor=<device1>[,..,<deviceN>]" parameter to get dm-init to explicitly wait for specific block devices before initializing the tables with logic similar to the rootwait logic that was introduced with commit cc1ed7542c8c ("init: wait for asynchronously scanned block devices"). E.G. with dm-verity on mmc using: dm-mod.waitfor="PARTLABEL=hash-a,PARTLABEL=root-a" [ 0.671671] device-mapper: init: waiting for all devices to be available before creating mapped devices [ 0.671679] device-mapper: init: waiting for device PARTLABEL=hash-a ... [ 0.710695] mmc0: new HS200 MMC card at address 0001 [ 0.711158] mmcblk0: mmc0:0001 004GA0 3.69 GiB [ 0.715954] mmcblk0boot0: mmc0:0001 004GA0 partition 1 2.00 MiB [ 0.722085] mmcblk0boot1: mmc0:0001 004GA0 partition 2 2.00 MiB [ 0.728093] mmcblk0rpmb: mmc0:0001 004GA0 partition 3 512 KiB, chardev (249:0) [ 0.738274] mmcblk0: p1 p2 p3 p4 p5 p6 p7 [ 0.751282] device-mapper: init: waiting for device PARTLABEL=root-a ... [ 0.751306] device-mapper: init: all devices available [ 0.751683] device-mapper: verity: sha256 using implementation "sha256-generic" [ 0.759344] device-mapper: ioctl: dm-0 (vroot) is ready [ 0.766540] VFS: Mounted root (squashfs filesystem) readonly on device 254:0. Signed-off-by: Peter Korsgaard <peter@korsgaard.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-12-02md: fold unbind_rdev_from_array into md_kick_rdev_from_arrayChristoph Hellwig
unbind_rdev_from_array is only called from md_kick_rdev_from_array, so merge it into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-12-02md: mark md_kick_rdev_from_array staticChristoph Hellwig
md_kick_rdev_from_array is only used in md.c, so unexport it and mark the symbol static. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-12-02md: remove lock_bdev / unlock_bdevChristoph Hellwig
These wrappers for blkdev_get / blkdev_put just horribly confuse the code with their odd naming. Remove them and improve the error unwinding in md_import_device with the now folded code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-12-01dm ioctl: fix a couple ioctl codesMikulas Patocka
Change the ioctl codes from DM_DEV_ARM_POLL to DM_DEV_ARM_POLL_CMD and from DM_GET_TARGET_VERSION to DM_GET_TARGET_VERSION_CMD. Note that the "cmd" field of "struct _ioctls" is never used, thus this commit doesn't fix any bug, it just makes the code consistent. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-12-01dm ioctl: a small code cleanup in list_version_get_infoMikulas Patocka
No need to modify info->vers if we overwrite it immediately. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-12-01dm thin: resume even if in FAIL modeLuo Meng
If a thinpool set fail_io while suspending, resume will fail with: device-mapper: resume ioctl on vg-thinpool failed: Invalid argument The thin-pool also can't be removed if an in-flight bio is in the deferred list. This can be easily reproduced using: echo "offline" > /sys/block/sda/device/state dd if=/dev/zero of=/dev/mapper/thin bs=4K count=1 dmsetup suspend /dev/mapper/pool mkfs.ext4 /dev/mapper/thin dmsetup resume /dev/mapper/pool The root cause is maybe_resize_data_dev() will check fail_io and return error before called dm_resume. Fix this by adding FAIL mode check at the end of pool_preresume(). Cc: stable@vger.kernel.org Fixes: da105ed5fd7e ("dm thin metadata: introduce dm_pool_abort_metadata") Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-12-01dm cache: set needs_check flag after aborting metadataMike Snitzer
Otherwise the commit that will be aborted will be associated with the metadata objects that will be torn down. Must write needs_check flag to metadata with a reset block manager. Found through code-inspection (and compared against dm-thin.c). Cc: stable@vger.kernel.org Fixes: 028ae9f76f29 ("dm cache: add fail io mode and needs_check flag") Signed-off-by: Mike Snitzer <snitzer@kernel.org>