summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2022-05-22md: don't unregister sync_thread with reconfig_mutex heldGuoqing Jiang
Unregister sync_thread doesn't need to hold reconfig_mutex since it doesn't reconfigure array. And it could cause deadlock problem for raid5 as follows: 1. process A tried to reap sync thread with reconfig_mutex held after echo idle to sync_action. 2. raid5 sync thread was blocked if there were too many active stripes. 3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer) which causes the number of active stripes can't be decreased. 4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able to hold reconfig_mutex. More details in the link: https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t And add one parameter to md_reap_sync_thread since it could be called by dm-raid which doesn't hold reconfig_mutex. Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <song@kernel.org>
2022-05-16dax: add .recovery_write dax_operationJane Chu
Introduce dax_recovery_write() operation. The function is used to recover a dax range that contains poison. Typical use case is when a user process receives a SIGBUS with si_code BUS_MCEERR_AR indicating poison(s) in a dax range, in response, the user process issues a pwrite() to the page-aligned dax range, thus clears the poison and puts valid data in the range. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16dax: introduce DAX_RECOVERY_WRITE dax access modeJane Chu
Up till now, dax_direct_access() is used implicitly for normal access, but for the purpose of recovery write, dax range with poison is requested. To make the interface clear, introduce enum dax_access_mode { DAX_ACCESS, DAX_RECOVERY_WRITE, } where DAX_ACCESS is used for normal dax access, and DAX_RECOVERY_WRITE is used for dax recovery write. Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-11dm: pass NULL bdev to bio_alloc_cloneMike Snitzer
Most DM targets will remap the clone bio passed to their ->map function using bio_set_bdev(). So this change to pass NULL bdev to bio_alloc_clone avoids clone-time work that sets up resources for a bdev association that will not be used in practice (e.g. clone issued to underlying device will not use DM device's blk-cgroups resources). But clone->bi_bdev is still initialized following bio_alloc_clone to preserve DM target expectations that clone->bi_bdev will be set. Follow-up work is needed to audit DM targets to remove accesses to a clone->bi_bdev that the target didn't initialize with bio_set_dev(). Depends-on: 7ecc56c62b27 ("block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone") Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm cache metadata: remove unnecessary variable in __dump_mappingGuo Zhengkui
Fix the following coccicheck warning: drivers/md/dm-cache-metadata.c:1512:5-6: Unneeded variable: "r". Return "0" on line 1520. Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm mpath: provide high-resolution timer to HST for bio-basedGabriel Krisman Bertazi
The precision loss of reading IO start_time with jiffies_to_nsecs instead of using a high resolution timer degrades HST path prediction for BIO-based mpath on high load workloads. Below, I show the utilization percentage of a 10 disk multipath with asymmetrical disk access cost, while being exercised by a randwrite FIO benchmark with high submission queue depth (depth=64). It is possible to see that the HST path selection degrades heavily for high-iops in BIO-mpath, underutilizing the slower paths way beyond expected. This seems to be caused by the start_time truncation, which makes some IO to seem much slower than it actually is. In this scenario ST outperforms HST for bio-mpath, but not for mq-mpath, which already uses ktime_get_ns(). The third column shows utilization with this patch applied. It is easy to see that now HST prediction is much closer to the ideal distribution (calculated considering the real cost of each path). | | ST | HST (orig) | HST(ktime) | Best | | sdd | 0.17 | 0.20 | 0.17 | 0.18 | | sde | 0.17 | 0.20 | 0.17 | 0.18 | | sdf | 0.17 | 0.20 | 0.17 | 0.18 | | sdg | 0.06 | 0.00 | 0.06 | 0.04 | | sdh | 0.03 | 0.00 | 0.03 | 0.02 | | sdi | 0.03 | 0.00 | 0.03 | 0.02 | | sdj | 0.02 | 0.00 | 0.01 | 0.01 | | sdk | 0.02 | 0.00 | 0.01 | 0.01 | | sdl | 0.17 | 0.20 | 0.17 | 0.18 | | sdm | 0.17 | 0.20 | 0.17 | 0.18 | This issue was originally discussed [1] when we first merged HST, and this patch was left as a low hanging fruit to be solved later. Regarding the implementation, as suggested by Mike in that mail thread, in order to avoid the overhead of ktime_get_ns for other selectors, this patch adds a flag for the selector code to request the high-resolution timer. I tested this using the same benchmark used in the original HST submission. Full test and benchmark scripts are available here: https://people.collabora.com/~krisman/HST-BIO-MPATH/ [1] https://lore.kernel.org/lkml/85tv0am9de.fsf@collabora.com/T/ Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> [snitzer: cleaned up various implementation details] Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm crypt: make printing of the key constant-timeMikulas Patocka
The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to report the current key to userspace. However, this is not a constant-time operation and it may leak information about the key via timing, via cache access patterns or via the branch predictor. Change dm-crypt's key printing to use "%c" instead of "%02x". Also introduce hex2asc() that carefully avoids any branching or memory accesses when converting a number in the range 0 ... 15 to an ascii character. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm integrity: fix error code in dm_integrity_ctr()Dan Carpenter
The "r" variable shadows an earlier "r" that has function scope. It means that we accidentally return success instead of an error code. Smatch has a warning for this: drivers/md/dm-integrity.c:4503 dm_integrity_ctr() warn: missing error code 'r' Fixes: 7eada909bfd7 ("dm: add integrity target") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-09dm stats: add cond_resched when looping over entriesMikulas Patocka
dm-stats can be used with a very large number of entries (it is only limited by 1/4 of total system memory), so add rescheduling points to the loops that iterate over the entries. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: improve abnormal bio processingMike Snitzer
Read/write/flush are the most common operations, optimize switch in is_abnormal_io() for those cases. Follows same pattern established in block perf-wip commit ("block: optimise blk_may_split for normal rw") Also, push is_abnormal_io() check and blk_queue_split() down from dm_submit_bio() to dm_split_and_process_bio() and set new 'is_abnormal_io' flag in clone_info. Optimize __split_and_process_bio and __process_abnormal_io by leveraging ci.is_abnormal_io flag. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: simplify bio-based IO accounting furtherMike Snitzer
Now that io splitting is recorded prior to, or during, ->map IO accounting can happen immediately rather than defer until after bio splitting in dm_split_and_process_bio(). Remove the DM_IO_START_ACCT flag and also remove dm_io's map_task member because there is no longer any need to wait for splitting to occur before accounting. Also move dm_io struct's 'flags' member to consolidate struct holes. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: put all polled dm_io instances into a single listMing Lei
Now that bio_split() isn't used by DM's bio splitting, it is a bit overkill to link dm_io into an hlist given there is only single dm_io in the list. Convert to using a single list for holding all dm_io instances associated with this bio. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: improve dm_io reference countingMing Lei
Currently each dm_io's reference counter is grabbed before calling __map_bio(), this way isn't efficient since we can move this grabbing to initialization time inside alloc_io(). Meantime it becomes typical async io reference counter model: one is for submission side, the other is for completion side, and the io won't be completed until both sides are done. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: don't grab target io reference in dm_zone_map_bioMing Lei
dm_zone_map_bio() is only called from __map_bio in which the io's reference is grabbed already, and the reference won't be released until the bio is submitted, so not necessary to do it dm_zone_map_bio any more. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: improve bio splitting and associated IO accountingMing Lei
The current DM code (ab)uses late assignment of dm_io->orig_bio (after __map_bio() returns and any bio splitting is complete) to indicate the FS bio has been processed and can be accounted. This results in awkward waiting until ->orig_bio is set in dm_submit_bio_remap(). Also the bio splitting was implemented using bio_split()+bio_chain() -- a well-worn pattern but it requires bio cloning purely for the benefit of more natural IO accounting. The bio_split() result was stored in ->orig_bio to represent the mapped part of the original FS bio. DM has switched to the bdev based IO accounting interface. DM's IO accounting can be implemented in terms of the original FS bio (now stored early in ->orig_bio) via access to its sectors/bio_op. And if/when splitting is needed, set a new DM_IO_WAS_SPLIT flag and use new dm_io fields of .sector_offset & .sectors to allow IO accounting for split bios _without_ needing to clone a new bio to store in ->orig_bio. Signed-off-by: Ming Lei <ming.lei@redhat.com> Co-developed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: switch to bdev based IO accounting interfacesMing Lei
DM splits flush with data into empty flush followed by bio with data payload, switch dm_io_acct() to use bdev_{start,end}_io_acct() to do this accoiunting more naturally (rather than temporarily changing the bio's bi_size). This will allow DM to more easily account bios that are split (in following commit). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: pass dm_io instance to dm_io_acct directlyMing Lei
All the other 4 parameters are retrieved from the 'dm_io' instance, so it's not necessary to pass all four to dm_io_acct(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: don't pass bio to __dm_start_io_acct and dm_end_io_acctMing Lei
dm->orig_bio is always passed to __dm_start_io_acct and dm_end_io_acct, so it isn't necessary to take one bio parameter for the two helpers. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: use bio_sectors in dm_aceept_partial_bioMike Snitzer
Rename 'bi_size' to 'bio_sectors' given bi_size is being stored in sectors. Also, use bio_sectors() rather than open-coding it. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: simplify basic targetsMike Snitzer
Remove needless factoring and remap bi_sector regardless of bio_sectors() being non-zero. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: conditionally enable branching for less used featuresMike Snitzer
Use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bioMike Snitzer
If a bio is marked REQ_NOWAIT optimize dm_submit_bio()'s dm_table RCU usage to dm_{get,put}_live_table_fast. DM core offers protection against blocking (via suspend) if REQ_NOWAIT. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: move hot dm_io members to same cacheline as dm_target_ioMike Snitzer
Just saves some cacheline bouncing for members accessed during cloned bio submission and completion. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: add local variables to clone_endio and __map_bioMike Snitzer
Avoid redundant dereferences in both functions. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: mark various branches unlikelyMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: simplify dm_start_io_acctMike Snitzer
Pull common DM_IO_ACCOUNTED check out to beginning of dm_start_io_acct. Also, use dm_tio_is_normal (and move it to dm-core.h). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: simplify dm_io access in dm_split_and_process_bioMike Snitzer
Use local variable instead of redudant access using ci.io Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: factor out dm_io_set_error and __dm_io_dec_pendingMike Snitzer
Also eliminate need to use errno_to_blk_status(). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io biosetMike Snitzer
A bioset's per-cpu alloc cache may have broader utility in the future but for now constrain it to being tightly coupled to QUEUE_FLAG_POLL. Also change dm_io_complete() to use bio_clear_polled() so that it properly clears all associated bio state on requeue. This commit improves DM's hipri bio polling (REQ_POLLED) perf by 7 - 20% depending on the system. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-03raid5: don't set the discard_alignment queue limitChristoph Hellwig
The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by raid5 is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-03dm-zoned: don't set the discard_alignment queue limitChristoph Hellwig
The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by dm-zoned is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-25md: Replace role magic numbers with defined constantsDavid Sloan
There are several instances where magic numbers are used in md.c instead of the defined constants in md_p.h. This patch set improves code readability by replacing all occurrences of 0xffff, 0xfffe, and 0xfffd when relating to md roles with their equivalent defined constant. Signed-off-by: David Sloan <david.sloan@eideticom.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/raid0: Ignore RAID0 layout if the second zone has only one devicePascal Hambourg
The RAID0 layout is irrelevant if all members have the same size so the array has only one zone. It is *also* irrelevant if the array has two zones and the second zone has only one device, for example if the array has two members of different sizes. So in that case it makes sense to allow assembly even when the layout is undefined, like what is done when the array has only one zone. Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Pascal Hambourg <pascal@plouf.fr.eu.org> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/raid5: Annotate functions that hold device_lock with __must_holdLogan Gunthorpe
A handful of functions note the device_lock must be held with a comment but this is not comprehensive. Many other functions hold the lock when taken so add an __must_hold() to each call to annotate when the lock is held. This makes it a bit easier to analyse device_lock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/raid5-ppl: Annotate with rcu_dereference_protected()Logan Gunthorpe
To suppress the last remaining sparse warnings about accessing rdev, add rcu_dereference_protected calls to a couple places in raid5-ppl. All of these places are called under raid5_run and therefore are occurring before the array has started and is thus safe. There's no sensible check to do for the second argument of rcu_dereference_protected() so a comment is added instead. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/raid5: Annotate rdev/replacement access when mddev_lock is heldLogan Gunthorpe
The mddev_lock should be held during raid5_remove_disk() which is when the rdev/replacement pointers are modified. So any access to these pointers marked __rcu should be safe whenever the mddev_lock is held. There are numerous such access that currently produce sparse warnings. Add a helper function, rdev_mdlock_deref() that wraps rcu_dereference_protected() in all these instances. This annotation fixes a number of sparse warnings. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/raid5: Annotate rdev/replacement accesses when nr_pending is elevatedLogan Gunthorpe
There are a number of accesses to __rcu variables that should be safe because nr_pending in the disk is known to be elevated. Create a wrapper around rcu_dereference_protected() to annotate these accesses and verify that nr_pending is non-zero. This fixes a number of sparse warnings. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/raid5: Add __rcu annotation to struct disk_infoLogan Gunthorpe
rdev and replacement are protected in some circumstances with rcu_dereference and synchronize_rcu (in raid5_remove_disk()). However, they were not annotated with __rcu so a sparse warning is emitted for every rcu_dereference() call. Add the __rcu annotation and fix up the initialization with RCU_INIT_POINTER, all pointer modifications with rcu_assign_pointer(), a few cases where the pointer value is tested with rcu_access_pointer() and one case where READ_ONCE() is used instead of rcu_dereference(), a case in print_raid5_conf() that should have rcu_dereference() and rcu_read_[un]lock() calls. Additional sparse issues will be fixed up in further commits. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/raid5: Un-nest struct raid5_percpu definitionLogan Gunthorpe
Sparse reports many warnings of the form: drivers/md/raid5.c:1476:16: warning: dereference of noderef expression This is because all struct raid5_percpu definitions get marked as __percpu when really only the pointer in r5conf should have that annotation. Fix this by moving the defnition of raid5_precpu out of the definition of struct r5conf. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/raid5: Cleanup setup_conf() error returnsLogan Gunthorpe
Be more careful about the error returns. Most errors in this function are actually ENOMEM, but it forcibly returns EIO if conf has been allocated. Instead return ret and ensure it is set appropriately before each goto abort. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md: replace deprecated strlcpy & remove duplicated lineHeming Zhao
This commit includes two topics: 1> replace deprecated strlcpy change strlcpy to strscpy for strlcpy is marked as deprecated in Documentation/process/deprecated.rst 2> remove duplicated strlcpy line in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the history: - commit cf921cc19cf7 ("Add node recovery callbacks") introduced the first usage of strlcpy(). - commit b97e92574c0b ("Use separate bitmaps for each nodes in the cluster") introduced the second strlcpy(). this time, the two strlcpy() are same, we can remove anyone safely. - commit d3b178adb3a3 ("md: Skip cluster setup for dm-raid") added dm-raid special handling. And the "nodes" value is the key of this patch. but from this patch, strlcpy() which was introduced by b97e92574c0bf become necessary. - commit 3c462c880b52 ("md: Increment version for clustered bitmaps") used clustered major version to only handle in clustered env. this patch could look a polishment for clustered code logic. So cf921cc19cf7 became useless after d3b178adb3a3a, we could remove it safely. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md/bitmap: don't set sb values if can't pass sanity checkHeming Zhao
If bitmap area contains invalid data, kernel will crash then mdadm triggers "Segmentation fault". This is cluster-md speical bug. In non-clustered env, mdadm will handle broken metadata case. In clustered array, only kernel space handles bitmap slot info. But even this bug only happened in clustered env, current sanity check is wrong, the code should be changed. How to trigger: (faulty injection) dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb mdadm -Ss echo aaa > magic.txt == below modifying slot 2 bitmap data == dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize mdadm -A /dev/md0 /dev/sda /dev/sdb == kernel crashes. mdadm outputs "Segmentation fault" == Reason of kernel crash: In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T() trigger "divide error". Crash log: kernel: md: md0 stopped. kernel: md/raid1:md0: not clean -- starting background reconstruction kernel: md/raid1:md0: active with 2 out of 2 mirrors kernel: dlm: ... ... kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1 kernel: md0: invalid bitmap file superblock: bad magic kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2 kernel: md-cluster: Could not gather bitmaps from slot 2 kernel: divide error: 0000 [#1] SMP NOPTI kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246 kernel: ... ... kernel: Call Trace: kernel: ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0] kernel: md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a] kernel: load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0] kernel: md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a] kernel: do_md_run+0x30/0x100 [md_mod 24ea..d3a] kernel: md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a] kernel: ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a] kernel: ? blkdev_ioctl+0xb1/0x2b0 kernel: block_ioctl+0x3b/0x40 kernel: __x64_sys_ioctl+0x7f/0xb0 kernel: do_syscall_64+0x59/0x80 kernel: ? exit_to_user_mode_prepare+0x1ab/0x230 kernel: ? syscall_exit_to_user_mode+0x18/0x40 kernel: ? do_syscall_64+0x69/0x80 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae kernel: RIP: 0033:0x7f4a15fa722b kernel: ... ... kernel: ---[ end trace 8afa7612f559c868 ]--- kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md: fix an incorrect NULL check in md_reload_sbXiaomeng Tong
The bug is here: if (!rdev || rdev->desc_nr != nr) { The list iterator value 'rdev' will *always* be set and non-NULL by rdev_for_each_rcu(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element found (In fact, it will be a bogus pointer to an invalid struct object containing the HEAD). Otherwise it will bypass the check and lead to invalid memory access passing the check. To fix the bug, use a new variable 'iter' as the list iterator, while using the original variable 'pdev' as a dedicated pointer to point to the found element. Cc: stable@vger.kernel.org Fixes: 70bcecdb1534 ("md-cluster: Improve md_reload_sb to be less error prone") Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md: fix an incorrect NULL check in does_sb_need_changingXiaomeng Tong
The bug is here: if (!rdev) The list iterator value 'rdev' will *always* be set and non-NULL by rdev_for_each(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element found. Otherwise it will bypass the NULL check and lead to invalid memory access passing the check. To fix the bug, use a new variable 'iter' as the list iterator, while using the original variable 'rdev' as a dedicated pointer to point to the found element. Cc: stable@vger.kernel.org Fixes: 2aa82191ac36 ("md-cluster: Perform a lazy update") Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25raid5: introduce MD_BROKENMariusz Tkaczyk
Raid456 module had allowed to achieve failed state. It was fixed by fb73b357fb9 ("raid5: block failing device if raid will be failed"). This fix introduces a bug, now if raid5 fails during IO, it may result with a hung task without completion. Faulty flag on the device is necessary to process all requests and is checked many times, mainly in analyze_stripe(). Allow to set faulty on drive again and set MD_BROKEN if raid is failed. As a result, this level is allowed to achieve failed state again, but communication with userspace (via -EBUSY status) will be preserved. This restores possibility to fail array via #mdadm --set-faulty command and will be fixed by additional verification on mdadm side. Reproduction steps: mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1 mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean mkfs.xfs /dev/md126 -f mount /dev/md126 /mnt/root/ fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 & echo 1 > /sys/block/nvme2n1/device/device/remove echo 1 > /sys/block/nvme1n1/device/device/remove [ 1475.787779] Call Trace: [ 1475.793111] __schedule+0x2a6/0x700 [ 1475.799460] schedule+0x38/0xa0 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456] [ 1475.813856] ? finish_wait+0x80/0x80 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456] [ 1475.828281] ? finish_wait+0x80/0x80 [ 1475.834727] ? finish_wait+0x80/0x80 [ 1475.841127] ? finish_wait+0x80/0x80 [ 1475.847480] md_handle_request+0x119/0x190 [ 1475.854390] md_make_request+0x8a/0x190 [ 1475.861041] generic_make_request+0xcf/0x310 [ 1475.868145] submit_bio+0x3c/0x160 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390 [ 1475.889149] iomap_apply+0xff/0x310 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.909974] iomap_dio_rw+0x2f2/0x490 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.923680] ? atime_needs_update+0x77/0xe0 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs] [ 1475.953403] aio_read+0xd5/0x180 [ 1475.959395] ? _cond_resched+0x15/0x30 [ 1475.965907] io_submit_one+0x20b/0x3c0 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180 [ 1475.979335] ? do_io_getevents+0x7c/0xc0 [ 1475.986009] do_syscall_64+0x5b/0x1a0 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 1476.000255] RIP: 0033:0x7f11fc27978d [ 1476.006631] Code: Bad RIP value. [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds. Cc: stable@vger.kernel.org Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed") Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>
2022-04-25md: Set MD_BROKEN for RAID1 and RAID10Mariusz Tkaczyk
There is no direct mechanism to determine raid failure outside personality. It is done by checking rdev->flags after executing md_error(). If "faulty" flag is not set then -EBUSY is returned to userspace. -EBUSY means that array will be failed after drive removal. Mdadm has special routine to handle the array failure and it is executed if -EBUSY is returned by md. There are at least two known reasons to not consider this mechanism as correct: 1. drive can be removed even if array will be failed[1]. 2. -EBUSY seems to be wrong status. Array is not busy, but removal process cannot proceed safe. -EBUSY expectation cannot be removed without breaking compatibility with userspace. In this patch first issue is resolved by adding support for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in next commit. The idea is to set the MD_BROKEN if we are sure that raid is in failed state now. This is done in each error_handler(). In md_error() MD_BROKEN flag is checked. If is set, then -EBUSY is returned to userspace. As in previous commit, it causes that #mdadm --set-faulty is able to fail array. Previously proposed workaround is valid if optional functionality[1] is disabled. [1] commit 9a567843f7ce("md: allow last device to be forcibly removed from RAID1/RAID10.") Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>
2022-04-23Merge tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Just two small regression fixes for bcache" * tag 'block-5.18-2022-04-22' of git://git.kernel.dk/linux-block: bcache: fix wrong bdev parameter when calling bio_alloc_clone() in do_bio_hook() bcache: put bch_bio_map() back to correct location in journal_write_unlocked()
2022-04-19bcache: fix wrong bdev parameter when calling bio_alloc_clone() in do_bio_hook()Coly Li
Commit abfc426d1b2f ("block: pass a block_device to bio_clone_fast") calls the modified bio_alloc_clone() in bcache code as: bio_init_clone(bio->bi_bdev, bio, orig_bio, GFP_NOIO); But the first parameter is wrong, where bio->bi_bdev should be orig_bio->bi_bdev. The wrong bi_bdev panics the kernel when submitting cache bio. This patch fixes the wrong bdev parameter usage and avoid the panic. Fixes: abfc426d1b2f ("block: pass a block_device to bio_clone_fast") Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220419160425.4148-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-19bcache: put bch_bio_map() back to correct location in journal_write_unlocked()Coly Li
Commit a7c50c940477 ("block: pass a block_device and opf to bio_reset") moves bch_bio_map() inside journal_write_unlocked() next to the location where the modified bio_reset() was called. This change is wrong because calling bch_bio_map() immediately after bio_reset(), a BUG_ON(!bio->bi_iter.bi_size) inside bch_bio_map() will be triggered and panic the kernel. This patch puts bch_bio_map() back to its original correct location in journal_write_unlocked() and avoid the BUG_ON(). Fixes: a7c50c940477 ("block: pass a block_device and opf to bio_reset") Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220419160425.4148-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARDChristoph Hellwig
Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>