path: root/drivers/md
Age    Commit message    Author
2017-05-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: tty: fix comment for __tty_alloc_driver() init/main: properly align the multi-line comment init/main: Fix double "the" in comment Fix dead URLs to ftp.kernel.org drivers: Clean up duplicated email address treewide: Fix typo in xml/driver-api/basics.xml tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall" selftests/timers: Spelling s/privledges/privileges/ HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/ net: phy: dp83848: Fix Typo UBI: Fix typos Documentation: ftrace.txt: Correct nice value of 120 priority net: fec: Fix typo in error msg and comment treewide: Fix typos in printk
2017-05-01Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ was initially a fork of CFQ, but subsequently changed to implement fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant to be used on desktop type single drives, providing good fairness. From Paolo. - Add Kyber IO scheduler. This is a full multiqueue aware scheduler, using a scalable token based algorithm that throttles IO based on live completion IO stats, similarly to blk-wbt. From Omar. - A series from Jan, moving users to separately allocated backing devices. This continues the work of separating backing device life times, solving various problems with hot removal. - A series of updates for lightnvm, mostly from Javier. Includes a 'pblk' target that exposes an open channel SSD as a physical block device. - A series of fixes and improvements for nbd from Josef. - A series from Omar, removing queue sharing between devices on mostly legacy drivers. This helps us clean up other bits, if we know that a queue only has a single device backing. This has been overdue for more than a decade. - Fixes for the blk-stats, and improvements to unify the stats and user windows. This both improves blk-wbt, and enables other users to register a need to receive IO stats for a device. From Omar. - blk-throttle improvements from Shaohua. This provides a scalable framework for implementing scalable prioritization - particularly for blk-mq, but applicable to any type of block device. The interface is marked experimental for now. - Bucketized IO stats for IO polling from Stephen Bates. This improves efficiency of polled workloads in the presence of mixed block size IO. - A few fixes for opal, from Scott. - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics. From a variety of folks, mostly Sagi and James Smart. - A series from Bart, improving our exposed info and capabilities from the blk-mq debugfs support.
- A series from Christoph, cleaning up how we handle WRITE_ZEROES. - A series from Christoph, cleaning up the block layer handling of how we track errors in a request. On top of being a nice cleanup, it also shrinks the size of struct request a bit. - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was never used by platforms, and the latter has outlived its usefulness. - Various little bug fixes and cleanups from a wide variety of folks. * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits) block: hide badblocks attribute by default blk-mq: unify hctx delay_work and run_work block: add kblock_mod_delayed_work_on() blk-mq: unify hctx delayed_run_work and run_work nbd: fix use after free on module unload MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler blk-mq-sched: alloate reserved tags out of normal pool mtip32xx: use runtime tag to initialize command header scsi: Implement blk_mq_ops.show_rq() blk-mq: Add blk_mq_ops.show_rq() blk-mq: Show operation, cmd_flags and rq_flags names blk-mq: Make blk_flags_show() callers append a newline character blk-mq: Move the "state" debugfs attribute one level down blk-mq: Unregister debugfs attributes earlier blk-mq: Only unregister hctxs for which registration succeeded blk-mq-debugfs: Rename functions for registering and unregistering the mq directory blk-mq: Let blk_mq_debugfs_register() look up the queue name blk-mq: Register <dev>/queue/mq after having registered <dev>/queue ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset ide-pm: always pass 0 error to __blk_end_request_all ..
2017-04-20blk-mq: remove the error argument to blk_mq_complete_requestChristoph Hellwig
Now that all drivers that call blk_mq_complete_requests have a ->complete callback we can remove the direct call to blk_mq_end_request, as well as the error argument to blk_mq_complete_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20dm mpath: don't check for req->errorsChristoph Hellwig
We'll get all proper errors reported through ->end_io and ->errors will go away soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20dm rq: don't pass irrelevant error code to blk_mq_complete_requestChristoph Hellwig
dm never uses rq->errors, so there is no need to pass an error argument to blk_mq_complete_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Here's a pull request for 4.11-rc, fixing a set of issues mostly centered around the new scheduling framework. These have been brewing for a while, but split up into what we absolutely need in 4.11, and what we can defer until 4.12. These are well tested, on both single queue and multiqueue setups, and with and without shared tags. They fix several hangs that have happened in testing. This is obviously larger than I would have preferred at this point in time, but I don't think we can shave much off this and still get the desired results. In detail, this pull request contains: - a set of five fixes for NVMe, mostly from Christoph and one from Roland. - a series from Bart, fixing issues with dm-mq and SCSI shared tags and scheduling. Note that one of those patches' commit messages may read like an optimization, but it is in fact an important fix for queue restarts in particular. - a series from Omar, most importantly fixing a hang with multiple hardware queues when we fail to get a driver tag. Another important fix in there is for resizing hardware queues, which nbd does when handling multiple sockets for one connection.
- fixing an imbalance in putting the ctx for hctx request allocations from Minchan" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: Restart a single queue if tag sets are shared dm rq: Avoid that request processing stalls sporadically scsi: Avoid that SCSI queues get stuck blk-mq: Introduce blk_mq_delay_run_hw_queue() blk-mq: remap queues when adding/removing hardware queues blk-mq-sched: fix crash in switch error path blk-mq-sched: set up scheduler tags when bringing up new queues blk-mq-sched: refactor scheduler initialization blk-mq: use the right hctx when getting a driver tag fails nvmet: fix byte swap in nvmet_parse_io_cmd nvmet: fix byte swap in nvmet_execute_write_zeroes nvmet: add missing byte swap in nvmet_get_smart_log nvme: add missing byte swap in nvme_setup_discard nvme: Correct NVMF enum values to match NVMe-oF rev 1.0 block: do not put mq context in blk_mq_alloc_request_hctx
2017-04-08block: remove the discard_zeroes_data flagChristoph Hellwig
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08dm kcopyd: switch to use REQ_OP_WRITE_ZEROESChristoph Hellwig
It seems like the code currently passes whatever it was using for writes to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't need any payload. Untested, and confused by the code, maybe someone who understands it better than me can help.. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08dm: support REQ_OP_WRITE_ZEROESChristoph Hellwig
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08dm io: discards don't take a payloadChristoph Hellwig
Fix up do_region to not allocate a bio_vec for discards. We've got rid of the discard payload allocated by the caller years ago. Obviously this wasn't actually harmful given how long it's been there, but it's still good to avoid the pointless allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08md: support REQ_OP_WRITE_ZEROESChristoph Hellwig
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07Merge branch 'for-linus' into for-4.12/blockJens Axboe
We've added a considerable amount of fixes for stalls and issues with the blk-mq scheduling in the 4.11 series since forking off the for-4.12/block branch. We need to do improvements on top of that for 4.12, so pull in the previous fixes to make our lives easier going forward. Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07dm rq: Avoid that request processing stalls sporadicallyBart Van Assche
While running the srp-test software I noticed that request processing stalls sporadically at the beginning of a test, namely when mkfs is run against a dm-mpath device. Every time when that happened the following command was sufficient to resume request processing: echo run >/sys/kernel/debug/block/dm-0/state This patch avoids that such request processing stalls occur. The test I ran is as follows: while srp-test/run_tests -d -r 30 -t 02-mq; do :; done Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07Merge tag 'dm-4.11-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - two stable fixes for the verity target's FEC support - a stable fix for raid target's raid1 support (when no bitmap is used) - a 4.11 cache metadata v2 format fix to properly test blocks are clean * tag 'dm-4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity fec: fix bufio leaks dm raid: fix NULL pointer dereference for raid1 without bitmap dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty dm verity fec: limit error correction recursion
2017-04-07block: trace completion of all bios.NeilBrown
Currently only dm and md/raid5 bios trigger trace_block_bio_complete(). Now that we have bio_chain() and bio_inc_remaining(), it is not possible, in general, for a driver to know when the bio is really complete. Only bio_endio() knows that. So move the trace_block_bio_complete() call to bio_endio(). Now trace_block_bio_complete() pairs with trace_block_bio_queue(). Any bio for which a 'queue' event is traced, will subsequently generate a 'complete' event. There are a few cases where completion tracing is not wanted. 1/ If blk_update_request() has already generated a completion trace event at the 'request' level, there is no point generating one at the bio level too. In this case the bi_sector and bi_size will have changed, so the bio level event would be wrong. 2/ If the bio hasn't actually been queued yet, but is being aborted early, then a trace event could be confusing. Some filesystems call bio_endio() but do not want tracing. 3/ The bio_integrity code interposes itself by replacing bi_end_io, then restoring it and calling bio_endio() again. This would produce two identical trace events if left like that. To handle these, we introduce a flag BIO_TRACE_COMPLETION and only produce the trace event when this is set. We address point 1 above by clearing the flag in blk_update_request(). We address point 2 above by only setting the flag when generic_make_request() is called. We address point 3 above by clearing the flag after generating a completion event. When bio_split() is used on a bio, particularly in blk_queue_split(), there is an extra complication. A new bio is split off the front, and may be handled directly without going through generic_make_request(). The old bio, which has been advanced, is passed to generic_make_request(), so it will trigger a trace event a second time. Probably the best result when a split happens is to see a single 'queue' event for the whole bio, then multiple 'complete' events - one for each component.
To achieve this we can: - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split() - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set. This way, the split-off bio won't create a queue event, the original won't either even if it is re-submitted to generic_make_request(), but both will produce completion events, each for their own range. So if generic_make_request() is called (which generates a QUEUED event), then bio_endio() will create a single COMPLETE event for each range that the bio is split into, unless the driver has explicitly requested it not to. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
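The flag handling described in this message can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct sketch_bio`, the `sketch_*` functions, and the two event counters are hypothetical stand-ins for the bio, `generic_make_request()`, `bio_split()`, `bio_endio()`, and the tracepoints. None of it is the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a bio: only the fields the sketch needs. */
struct sketch_bio {
    bool trace_completion;   /* models BIO_TRACE_COMPLETION */
    unsigned int sectors;    /* remaining size of this bio */
};

/* Counters standing in for the 'queue' and 'complete' trace events. */
int sketch_queue_events;
int sketch_complete_events;

/* Models generic_make_request(): emit 'queue' once, then set the flag. */
void sketch_submit(struct sketch_bio *bio)
{
    assert(bio != NULL);
    if (!bio->trace_completion) {
        sketch_queue_events++;
        bio->trace_completion = true;
    }
}

/* Models bio_split(): the front part inherits the flag from the original. */
struct sketch_bio sketch_split(struct sketch_bio *bio, unsigned int front)
{
    struct sketch_bio split;

    assert(bio != NULL && front < bio->sectors);
    split.trace_completion = bio->trace_completion;
    split.sectors = front;
    bio->sectors -= front;   /* the original bio is advanced */
    return split;
}

/* Models bio_endio(): emit 'complete' only if flagged, then clear the flag. */
void sketch_endio(struct sketch_bio *bio)
{
    assert(bio != NULL);
    if (bio->trace_completion) {
        sketch_complete_events++;
        bio->trace_completion = false;
    }
}
```

In this model, submitting a bio, splitting it, and re-submitting the remainder yields one 'queue' event and one 'complete' event per part, and a repeated `sketch_endio()` call on the same bio adds nothing.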
2017-03-31dm verity fec: fix bufio leaksSami Tolvanen
Buffers read through dm_bufio_read() were not released in all code paths. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
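A common way to make sure a buffer is released on every path is a single exit label. The sketch below illustrates that pattern only. It is not the dm-verity code, and `sketch_read()`, `sketch_release()`, and the `sketch_held` counter are hypothetical stand-ins for `dm_bufio_read()`, the matching release call, and the number of buffers currently held.

```c
#include <assert.h>

/* Number of buffers currently held; a leak leaves this above zero. */
int sketch_held;

/* Hypothetical stand-in for acquiring a buffer. */
void sketch_read(void)
{
    sketch_held++;
}

/* Hypothetical stand-in for releasing a buffer. */
void sketch_release(void)
{
    assert(sketch_held > 0);
    sketch_held--;
}

/* Every return path goes through 'out', so the buffer is always released. */
int sketch_process(int fail_midway)
{
    int ret = 0;

    sketch_read();
    if (fail_midway) {
        ret = -1;        /* error path: must not return directly */
        goto out;
    }
    /* normal processing of the buffer would happen here */
out:
    sketch_release();
    return ret;
}
```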
2017-03-31dm raid: fix NULL pointer dereference for raid1 without bitmapDmitry Bilunov
Commit 4257e08 ("dm raid: support to change bitmap region size") introduced a bitmap resize call during preresume phase. User can create a DM device with "raid" target configured as raid1 with no metadata devices to hold superblock/bitmap info. It can be achieved using the following sequence: truncate -s 32M /dev/shm/raid-test LOOP=$(losetup --show -f /dev/shm/raid-test) dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0" dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024" dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1" This results in the following crash: [ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1 [ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB [ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors [ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 [ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod] [ 4029.114816] PGD 0 … [ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015 [ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000 [ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod] [ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246 [ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000 [ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000 [ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff [ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058 [ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 4029.115223] FS: 00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000 [ 4029.115245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0 [ 4029.115281] Call Trace: [ 
4029.115291] ? raid_iterate_devices+0x63/0x80 [dm_raid] [ 4029.115309] ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod] [ 4029.115329] ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod] [ 4029.115346] raid_preresume+0x81/0x2e0 [dm_raid] [ 4029.115361] dm_table_resume_targets+0x47/0xe0 [dm_mod] [ 4029.115378] dm_resume+0xa8/0xd0 [dm_mod] [ 4029.115391] dev_suspend+0x123/0x250 [dm_mod] [ 4029.115405] ? table_load+0x350/0x350 [dm_mod] [ 4029.115419] ctl_ioctl+0x1c2/0x490 [dm_mod] [ 4029.115433] dm_ctl_ioctl+0xe/0x20 [dm_mod] [ 4029.115447] do_vfs_ioctl+0x8d/0x5a0 [ 4029.115459] ? ____fput+0x9/0x10 [ 4029.115470] ? task_work_run+0x79/0xa0 [ 4029.115481] SyS_ioctl+0x3c/0x70 [ 4029.115493] entry_SYSCALL_64_fastpath+0x13/0x94 The raid_preresume() function incorrectly assumes that the raid_set has a bitmap enabled if RT_FLAG_RS_BITMAP_LOADED is set. But RT_FLAG_RS_BITMAP_LOADED is getting set in __load_dirty_region_bitmap() even if there is no bitmap present (and bitmap_load() happily returns 0 even if a bitmap isn't present). So the only way forward in the near-term is to check if the bitmap is present by seeing if mddev->bitmap is not NULL after bitmap_load() has been called. By doing so the above NULL pointer is avoided. Fixes: 4257e08 ("dm raid: support to change bitmap region size") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru> Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
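The check described above can be sketched in isolation. This is an illustrative model, not the dm-raid patch: `struct sketch_mddev`, `struct sketch_bitmap`, `SKETCH_FLAG_BITMAP_LOADED`, and `sketch_preresume()` are hypothetical names for the mddev, its bitmap, the RT_FLAG_RS_BITMAP_LOADED flag, and the preresume step.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_bitmap {
    unsigned long chunks;
};

struct sketch_mddev {
    struct sketch_bitmap *bitmap;   /* NULL when the array has no bitmap */
};

#define SKETCH_FLAG_BITMAP_LOADED 0x1u

/*
 * Returns 1 if the bitmap was resized and 0 if there was nothing to do.
 * The "loaded" flag alone is not trusted: the pointer is checked as well,
 * because the flag can be set even when no bitmap exists.
 */
int sketch_preresume(struct sketch_mddev *mddev, unsigned int flags,
                     unsigned long new_chunks)
{
    assert(mddev != NULL);
    if (!(flags & SKETCH_FLAG_BITMAP_LOADED))
        return 0;
    if (mddev->bitmap == NULL)
        return 0;                   /* avoids the NULL dereference */
    mddev->bitmap->chunks = new_chunks;
    return 1;
}
```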
2017-03-31blk-mq: constify struct blk_mq_opsEric Biggers
Constify all instances of blk_mq_ops, as they are never modified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
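As a small illustration of what constifying an operations table means, here is a sketch with a hypothetical `struct sketch_ops` in place of `struct blk_mq_ops`. The callback and its behaviour are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table with a single callback. */
struct sketch_ops {
    int (*queue_rq)(int tag);
};

static int sketch_queue_rq(int tag)
{
    return tag + 1;   /* arbitrary behaviour for the example */
}

/* Declared const: the table is never modified after initialization. */
static const struct sketch_ops sketch_mq_ops = {
    .queue_rq = sketch_queue_rq,
};

/* Callers take a pointer to const, so they cannot alter the table. */
int sketch_dispatch(const struct sketch_ops *ops, int tag)
{
    assert(ops != NULL && ops->queue_rq != NULL);
    return ops->queue_rq(tag);
}
```

An attempt such as `sketch_mq_ops.queue_rq = NULL;` is rejected at compile time, which is the kind of mistake a const table guards against.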
2017-03-28Fix dead URLs to ftp.kernel.orgSeongJae Park
URLs to ftp.kernel.org still exist even though the service has been shut down [0]. This commit fixes the URLs to use www.kernel.org instead. [0] https://www.kernel.org/shutting-down-ftp-services.html Signed-off-by: SeongJae Park <sj38.park@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-20dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirtyJoe Thornber
The dm_bitset_cursor_begin() call was using the incorrect nr_entries. Also, the last dm_bitset_cursor_next() must be avoided if we're at the end of the cursor. Fixes: 7f1b21591a6 ("dm cache metadata: use cursor api in blocks_are_clean_separate_dirty()") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
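The two points in this message (start the cursor with the right entry count, and do not step past the last entry) can be shown with a toy cursor. Everything here is an assumption made for illustration: `struct sketch_cursor` and the `sketch_cursor_*` functions are hypothetical and are not the dm_bitset cursor API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cursor over an array of dirty bits. */
struct sketch_cursor {
    const bool *bits;
    size_t nr;    /* number of entries the cursor may visit */
    size_t pos;
};

int sketch_cursor_begin(struct sketch_cursor *c, const bool *bits, size_t nr)
{
    assert(c != NULL);
    if (nr == 0)
        return -1;
    c->bits = bits;
    c->nr = nr;
    c->pos = 0;
    return 0;
}

/* Stepping past the last entry is an error in this model. */
int sketch_cursor_next(struct sketch_cursor *c)
{
    assert(c != NULL);
    if (c->pos + 1 >= c->nr)
        return -1;
    c->pos++;
    return 0;
}

/* Returns 1 if no bit is set, 0 if some bit is set, -1 on a cursor error. */
int sketch_all_clean(const bool *dirty, size_t nr)
{
    struct sketch_cursor c;

    if (sketch_cursor_begin(&c, dirty, nr))
        return -1;
    for (size_t i = 0; i < nr; i++) {
        if (c.bits[c.pos])
            return 0;
        /* Skip the final 'next': there is nothing after the last entry. */
        if (i + 1 < nr && sketch_cursor_next(&c))
            return -1;
    }
    return 1;
}
```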
2017-03-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shli/md Pull MD fixes from Shaohua Li: - fix a parity calculation bug of raid5 cache by Song - fix a potential deadlock issue by me - fix two endian issues by Jason - fix a disk limitation issue by Neil - other small fixes and cleanup * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/raid1: fix a trivial typo in comments md/r5cache: fix set_syndrome_sources() for data in cache md: fix incorrect use of lexx_to_cpu in does_sb_need_changing md: fix super_offset endianness in super_1_rdev_size_change md/raid1/10: fix potential deadlock md: don't impose the MD_SB_DISKS limit on arrays without metadata. md: move funcs from pers->resize to update_size md-cluster: remove useless memset from gather_all_resync_info md-cluster: free md_cluster_info if node leave cluster md: delete dead code md/raid10: submit bio directly to replacement disk
2017-03-16dm verity fec: limit error correction recursionSami Tolvanen
If the hash tree itself is sufficiently corrupt in addition to data blocks, it's possible for error correction to end up in a deep recursive loop, which eventually causes a kernel panic. This change limits the recursion to a reasonable level during a single I/O operation. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.5+
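Bounding recursion per I/O can be sketched with a depth counter carried in the per-I/O state. This is a simplified model, not the dm-verity implementation: `struct sketch_io`, `sketch_decode()`, and the limit `SKETCH_MAX_LEVEL` are hypothetical, and the value 4 is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_LEVEL 4   /* assumed limit, chosen only for this example */

struct sketch_io {
    int level;   /* current error-correction depth for this I/O */
};

/*
 * corrupt_depth models how many nested hash-tree blocks also need
 * correction. Returns false when the recursion limit is reached.
 */
bool sketch_decode(struct sketch_io *io, int corrupt_depth)
{
    bool ok;

    assert(io != NULL);
    if (io->level >= SKETCH_MAX_LEVEL)
        return false;        /* give up instead of recursing further */
    if (corrupt_depth == 0)
        return true;         /* nothing deeper needs correcting */

    io->level++;
    ok = sketch_decode(io, corrupt_depth - 1);
    io->level--;
    return ok;
}
```

With this limit, a chain of three nested corrupt blocks is still corrected, while a deeper chain fails cleanly and leaves `level` back at zero.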
2017-03-14md/raid1: fix a trivial typo in commentsZhilong Liu
raid1.c: fix a trivial typo in comments of freeze_array(). Cc: Jack Wang <jack.wang.usish@gmail.com> Cc: Guoqing Jiang <gqjiang@suse.com> Cc: John Stoffel <john@stoffel.org> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-14md/r5cache: fix set_syndrome_sources() for data in cacheSong Liu
Before this patch, a device with R5_InJournal set is included in prexor (SYNDROME_SRC_WANT_DRAIN) but not in reconstruct (SYNDROME_SRC_WRITTEN), which breaks parity calculation. With srctype == SYNDROME_SRC_WRITTEN, we need to include both devices with non-null ->written and devices with R5_InJournal. This fixes logic in 1e6d690(md/r5cache: caching phase of r5cache) Cc: stable@vger.kernel.org (v4.10+) Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-11blk: Ensure users for current->bio_list can see the full list.NeilBrown
Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()") changed current->bio_list so that it did not contain *all* of the queued bios, but only those submitted by the currently running make_request_fn. There are two places which walk the list and requeue selected bios, and others that check if the list is empty. These are no longer correct. So redefine current->bio_list to point to an array of two lists, which contain all queued bios, and adjust various code to test or walk both lists. Signed-off-by: NeilBrown <neilb@suse.com> Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()") Signed-off-by: Jens Axboe <axboe@fb.com>
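Code that used to test or walk a single list now has to cover both. A minimal sketch of that idea follows, with hypothetical types (`struct sketch_qbio`, `struct sketch_bio_list`) in place of the kernel's bio and bio_list. Which bios land in which of the two lists is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_qbio {
    struct sketch_qbio *next;
};

struct sketch_bio_list {
    struct sketch_qbio *head;
};

/* "Is anything queued?" must look at both lists, not just the first. */
bool sketch_all_empty(const struct sketch_bio_list lists[2])
{
    assert(lists != NULL);
    return lists[0].head == NULL && lists[1].head == NULL;
}

/* Walking the queued bios likewise visits both lists. */
size_t sketch_count_all(const struct sketch_bio_list lists[2])
{
    size_t n = 0;

    assert(lists != NULL);
    for (int i = 0; i < 2; i++)
        for (const struct sketch_qbio *b = lists[i].head; b != NULL; b = b->next)
            n++;
    return n;
}
```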
2017-03-10md: fix incorrect use of lexx_to_cpu in does_sb_need_changingJason Yan
The sb->layout is of type __le32, so we should use le32_to_cpu. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-10md: fix super_offset endianness in super_1_rdev_size_changeJason Yan
The sb->super_offset should be little-endian, but the rdev->sb_start is in host byte order, so fix this by adding cpu_to_le64. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
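The byte-order conversion involved can be illustrated in userspace. The helpers below are hypothetical (`sketch_put_le64`, `sketch_get_le64`) and only model the on-disk result of `cpu_to_le64()` and `le64_to_cpu()`: the stored bytes are least-significant first, whatever the host byte order is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a host-order value as 8 little-endian bytes, as an on-disk field holds it. */
void sketch_put_le64(unsigned char out[8], uint64_t host)
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(host >> (8 * i));
}

/* Read 8 little-endian bytes back into a host-order value. */
uint64_t sketch_get_le64(const unsigned char in[8])
{
    uint64_t host = 0;

    assert(in != NULL);
    for (int i = 0; i < 8; i++)
        host |= (uint64_t)in[i] << (8 * i);
    return host;
}
```

Comparing a host-order value with an on-disk field is only meaningful once both are in the same representation, so one side has to be converted before the comparison.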
2017-03-09drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.hMasanari Iida
Link: http://lkml.kernel.org/r/20170226060230.11555-1-standby24x7@gmail.com Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Coly Li <colyli@suse.de> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09md/raid1/10: fix potential deadlockShaohua Li
Neil Brown pointed out a potential deadlock in raid 10 code with bio_split/chain. The raid1 code could have the same issue, but recent barrier rework makes it less likely to happen. The deadlock happens in the sequence below: 1. generic_make_request(bio), this will set current->bio_list 2. raid10_make_request will split bio to bio1 and bio2 3. __make_request(bio1), wait_barrier, add underlying disk bio to current->bio_list 4. __make_request(bio2), wait_barrier If raise_barrier happens between 3 & 4, since wait_barrier runs at 3, raise_barrier waits for IO completion from 3. And since raise_barrier sets barrier, 4 waits for raise_barrier. But IO from 3 can't be dispatched because raid10_make_request() hasn't finished yet. The solution is to adjust the IO ordering. Quotes from Neil: " It is much safer to: if (need to split) { split = bio_split(bio, ...) bio_chain(...) make_request_fn(split); generic_make_request(bio); } else make_request_fn(mddev, bio); This way we first process the initial section of the bio (in 'split') which will queue some requests to the underlying devices. These requests will be queued in generic_make_request. Then we queue the remainder of the bio, which will be added to the end of the generic_make_request queue. Then we return. generic_make_request() will pop the lower-level device requests off the queue and handle them first. Then it will process the remainder of the original bio once the first section has been fully processed. " Note, this only happens in read path. In write path, the bio is flushed to underlying disks either by blk flush (from schedule) or offloaded to raid1/10d. It's queued in current->bio_list. Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org (v3.14+, only the raid10 part) Suggested-by: NeilBrown <neilb@suse.com> Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Shaohua Li <shli@fb.com>
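The ordering argument in the quoted pseudocode can be checked with a toy FIFO. This is a deliberately small model and not the md code: `struct sketch_queue` stands in for the pending list drained by generic_make_request(), and the two item values are hypothetical labels for "I/O to an underlying device" and "remainder of the original bio".

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_QLEN 8

enum sketch_item {
    SKETCH_LOWER_IO = 1,    /* request queued for an underlying device */
    SKETCH_REMAINDER = 2    /* rest of the original bio, handled later */
};

/* Toy FIFO standing in for the pending list. */
struct sketch_queue {
    int items[SKETCH_QLEN];
    int head, tail;
};

void sketch_push(struct sketch_queue *q, int item)
{
    assert(q != NULL && q->tail < SKETCH_QLEN);
    q->items[q->tail++] = item;
}

/* Returns 0 when the queue is empty. */
int sketch_pop(struct sketch_queue *q)
{
    assert(q != NULL);
    return q->head < q->tail ? q->items[q->head++] : 0;
}

/* Handle the split part first, then queue the remainder behind it. */
void sketch_make_request(struct sketch_queue *q, int need_split)
{
    if (need_split) {
        sketch_push(q, SKETCH_LOWER_IO);    /* from make_request_fn(split) */
        sketch_push(q, SKETCH_REMAINDER);   /* from generic_make_request(bio) */
    } else {
        sketch_push(q, SKETCH_LOWER_IO);
    }
}
```

Because the remainder is queued after the lower-level request, the lower-level I/O is popped and dispatched first, so in this model nothing waits on I/O that is stuck behind the remainder.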
2017-03-09md: don't impose the MD_SB_DISKS limit on arrays without metadata.NeilBrown
These arrays, created with "mdadm --build" don't benefit from a limit. The default will be used, which is '0' and is interpreted as "don't impose a limit". Reported-by: ian_bruce@mail.ru Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-09md: move funcs from pers->resize to update_sizeGuoqing Jiang
raid1_resize and raid5_resize should also check the mddev->queue if run underneath dm-raid. And both set_capacity and revalidate_disk are used in pers->resize such as raid1, raid10 and raid5. So move them from personality file to common code. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-09md-cluster: remove useless memset from gather_all_resync_infoGuoqing Jiang
This memset is not needed. The lvb is already zeroed because it was recently allocated by lockres_init, which uses kzalloc(), and read_resync_info() doesn't need it to be zero anyway. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-09md-cluster: free md_cluster_info if node leave clusterGuoqing Jiang
To avoid a memory leak, we need to free the cinfo which is allocated when the node joins the cluster. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-09md: delete dead codeShaohua Li
Nobody is using mddev_check_plugged(), so delete the dead code Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-09md/raid10: submit bio directly to replacement diskShaohua Li
Commit 57c67df(md/raid10: submit IO from originating thread instead of md thread) submits bio directly for normal disks but not for replacement disks. There is no reason not to do the same for replacement disks. Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-03Merge branch 'WIP.sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove 
<linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...
2017-03-02Merge tag 'dm-4.11-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - a dm-raid stable@ fix for possible corruption when triggering a raid reshape via lvm2; and an additional small patch on top to bump version of the dm-raid target outside of the stable@ fix - a dm-raid fix for a 'dm-4.11-changes' regression introduced by a commit that was meant to only clean up confusing branching. * tag 'dm-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm raid: bump the target version dm raid: fix data corruption on reshape request dm raid: fix raid "check" regression due to improper cleanup in raid_message()
2017-03-02sched/headers: Prepare to move the get_task_struct()/put_task_struct() and ↵Ingo Molnar
related APIs from <linux/sched.h> to <linux/sched/task.h> But first update usage sites with the new header dependency. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare to use <linux/rcuupdate.h> instead of ↵Ingo Molnar
<linux/rculist.h> in <linux/sched.h> We don't actually need the full rculist.h header in sched.h anymore, we will be able to include the smaller rcupdate.h header instead. But first update code that relied on the implicit header inclusion. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/task_stack.h> We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/task_stack.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h>Ingo Molnar
Update the .c files that depend on these APIs. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare to move signal wakeup & sigpending methods from ↵Ingo Molnar
<linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/clock.h> We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which will have to be picked up from other headers and .c files. Create a trivial placeholder <linux/sched/clock.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()David Howells
rcu_dereference_key() and user_key_payload() are currently being used in two different, incompatible ways: (1) As a wrapper to rcu_dereference() - when only the RCU read lock is used to protect the key. (2) As a wrapper to rcu_dereference_protected() - when the key semaphore is used to protect the key and the key may be being modified. Fix this by splitting both of the key wrappers to produce: (1) RCU accessors for keys when caller has the key semaphore locked: dereference_key_locked() user_key_payload_locked() (2) RCU accessors for keys when caller holds the RCU read lock: dereference_key_rcu() user_key_payload_rcu() This should fix the following warning in the NFS idmapper: =============================== [ INFO: suspicious RCU usage. ] 4.10.0 #1 Tainted: G W ------------------------------- ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 1 lock held by mount.nfs/5987: #0: (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4] stack backtrace: CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G W 4.10.0 #1 Call Trace: dump_stack+0xe8/0x154 (unreliable) lockdep_rcu_suspicious+0x140/0x190 nfs_idmap_get_key+0x380/0x420 [nfsv4] nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4] decode_getfattr_attrs+0xfac/0x16b0 [nfsv4] decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4] nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4] rpcauth_unwrap_resp+0xe8/0x140 [sunrpc] call_decode+0x29c/0x910 [sunrpc] __rpc_execute+0x140/0x8f0 [sunrpc] rpc_run_task+0x170/0x200 [sunrpc] nfs4_call_sync_sequence+0x68/0xa0 [nfsv4] _nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4] nfs4_lookup_root+0xe0/0x350 [nfsv4] nfs4_lookup_root_sec+0x70/0xa0 [nfsv4] nfs4_find_root_sec+0xc4/0x100 [nfsv4] nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4] nfs4_get_rootfh+0x6c/0x190 [nfsv4] nfs4_server_common_setup+0xc4/0x260 [nfsv4] nfs4_create_server+0x278/0x3c0 [nfsv4] nfs4_remote_mount+0x50/0xb0 [nfsv4] 
mount_fs+0x74/0x210 vfs_kern_mount+0x78/0x220 nfs_do_root_mount+0xb0/0x140 [nfsv4] nfs4_try_mount+0x60/0x100 [nfsv4] nfs_fs_mount+0x5ec/0xda0 [nfs] mount_fs+0x74/0x210 vfs_kern_mount+0x78/0x220 do_mount+0x254/0xf70 SyS_mount+0x94/0x100 system_call+0x38/0xe0 Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
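The split into _locked and _rcu accessors described in the KEYS commit above can be modelled in a few lines. This is only a toy sketch in Python, not kernel code: the Key class, the key_sem() and rcu_read_lock() context managers and ProtectionError are hypothetical stand-ins, and only the two accessor names come from the commit message. It shows each accessor checking for the one protection it is documented to need, which is the kind of mismatch the lockdep warning above reports.

```python
from contextlib import contextmanager


class ProtectionError(RuntimeError):
    """Stands in for a lockdep 'suspicious RCU usage' report (hypothetical)."""


class _Rcu:
    depth = 0  # nesting depth of the modelled RCU read lock


class Key:
    def __init__(self, payload):
        self._payload = payload
        self.sem_held = False  # models the key semaphore


@contextmanager
def rcu_read_lock():
    _Rcu.depth += 1
    try:
        yield
    finally:
        _Rcu.depth -= 1


@contextmanager
def key_sem(key):
    key.sem_held = True
    try:
        yield
    finally:
        key.sem_held = False


def user_key_payload_rcu(key):
    # For callers that hold only the RCU read lock.
    if _Rcu.depth == 0:
        raise ProtectionError("RCU read lock not held")
    return key._payload


def user_key_payload_locked(key):
    # For callers that hold the key semaphore.
    if not key.sem_held:
        raise ProtectionError("key semaphore not held")
    return key._payload
```

A caller that holds only the RCU read lock, as nfs_idmap_get_key does in the trace, would use user_key_payload_rcu(); in this model, calling user_key_payload_locked() from that context raises ProtectionError.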
2017-02-28dm raid: bump the target versionMike Snitzer
This version bump reflects that the reshape corruption fix (commit 92a39f6cc "dm raid: fix data corruption on reshape request") is present. Done as a separate fix because the above referenced commit is marked for stable and target version bumps in a stable@ fix are a recipe for the fix to never get backported to stable@ kernels (because of target version number conflicts). Also, move RESUME_STAY_FROZEN_FLAGS up with the rest of the _FLAGS definitions now that we don't need to worry about stable@ conflicts as a result of missing context. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-28dm raid: fix data corruption on reshape requestHeinz Mauelshagen
The lvm2 sequence to manage dm-raid constructor flags that trigger a rebuild or a reshape is defined as: 1) load table with flags (e.g. rebuild/delta_disks/data_offset) 2) clear out the flags in lvm2 metadata 3) store the lvm2 metadata, reload the table to reset the flags previously established during the initial load (1) -- in order to prevent repeatedly requesting a rebuild or a reshape on activation. Currently, loading an inactive table with rebuild/reshape flags specified will cause dm-raid to rebuild/reshape on resume and thus start updating the raid metadata (about the progress). When the second table reload (to reset the flags) occurs, the constructor accesses the volatile progress state kept in the raid superblocks. Because the active mapping is still processing the rebuild/reshape, that position will be stale by the time the device is resumed. In the reshape case, this causes data corruption by processing already reshaped stripes again. In the rebuild case, it does _not_ cause data corruption but instead involves superfluous rebuilds. Fix by keeping the raid set frozen during the first resume and then allowing the rebuild/reshape during the second resume. Fixes: 9dbd1aa3a ("dm raid: add reshaping support to the target") Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.8+
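The stale-position problem described in the reshape fix above can be illustrated with a small simulation. This is a simplified model in Python rather than dm-raid code: run_sequence(), the stripe counts and the step sizes are all assumptions chosen for illustration. It counts how often each stripe is reshaped when the first resume is allowed to start the reshape, and when the raid set stays frozen until the second resume.

```python
def run_sequence(total_stripes, freeze_first_resume):
    """Return, per stripe, how many times it was reshaped (toy model)."""
    times_reshaped = [0] * total_stripes
    superblock_pos = 0  # progress as recorded in the raid superblocks

    def reshape(start, count):
        # Reshape up to `count` stripes from `start` and record the progress.
        nonlocal superblock_pos
        end = min(start + count, total_stripes)
        for stripe in range(start, end):
            times_reshaped[stripe] += 1
        superblock_pos = end

    # 1) Table loaded with reshape flags, then the first resume.
    if not freeze_first_resume:
        reshape(0, 3)  # the active mapping starts reshaping right away

    # 3) Table reload: the constructor reads the progress position now.
    constructor_pos = superblock_pos

    if not freeze_first_resume:
        # The active mapping keeps going, so constructor_pos becomes stale.
        reshape(superblock_pos, 2)

    # Second resume: the new table continues from what its constructor saw.
    reshape(constructor_pos, total_stripes)
    return times_reshaped


print(run_sequence(8, freeze_first_resume=False))  # [1, 1, 1, 2, 2, 1, 1, 1]
print(run_sequence(8, freeze_first_resume=True))   # [1, 1, 1, 1, 1, 1, 1, 1]
```

In the unfrozen run, stripes 3 and 4 are processed twice, which corresponds to the corruption case in the commit message. With the first resume frozen, every stripe is reshaped exactly once.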
2017-02-28dm raid: fix raid "check" regression due to improper cleanup in raid_message()Mike Snitzer
While cleaning up awkward branching in raid_message() a raid set "check" regression was introduced because "check" needs both MD_RECOVERY_SYNC and MD_RECOVERY_REQUESTED flags set. Fix this regression by explicitly setting both flags for the "check" case (like is also done for the "repair" case, but redundant set_bit()s are perfectly fine because it adds clarity to what is needed in response to both messages -- in addition this isn't fast path code). Fixes: 105db59912 ("dm raid: cleanup awkward branching in raid_message() option processing") Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-02-24Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull md updates from Shaohua Li: "Mainly fixes bugs and improves performance: - Improve scalability for raid1 from Coly - Improve raid5-cache read performance, disk efficiency and IO pattern from Song and me - Fix a race condition of disk hotplug for linear from Coly - A few cleanup patches from Ming and Byungchul - Fix a memory leak from Neil - Fix WRITE SAME IO failure from me - Add doc for raid5-cache from me" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits) md/raid1: fix write behind issues introduced by bio_clone_bioset_partial md/raid1: handle flush request correctly md/linear: shutup lockdep warnning md/raid1: fix a use-after-free bug RAID1: avoid unnecessary spin locks in I/O barrier code RAID1: a new I/O barrier implementation to remove resync window md/raid5: Don't reinvent the wheel but use existing llist API md: fast clone bio in bio_clone_mddev() md: remove unnecessary check on mddev md/raid1: use bio_clone_bioset_partial() in case of write behind md: fail if mddev->bio_set can't be created block: introduce bio_clone_bioset_partial() md: disable WRITE SAME if it fails in underlayer disks md/raid5-cache: exclude reclaiming stripes in reclaim check md/raid5-cache: stripe reclaim only counts valid stripes MD: add doc for raid5-cache Documentation: move MD related doc into a separate dir md: ensure md devices are freed before module is unloaded. md/r5cache: improve journal device efficiency md/r5cache: enable chunk_aligned_read with write back cache ...
2017-02-24Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates and fixes from Jens Axboe: - NVMe updates and fixes that missed the first pull request. This includes bug fixes, and support for autonomous power management. - Fix from Christoph for missing clear of the request payload, causing a problem with (at least) the storvsc driver. - Further fixes for the queue/bdi life time issues from Jan. - The Kconfig mq scheduler update from me. - Fixing a use-after-free in dm-rq, spotted by Bart, introduced in this merge window. - Three fixes for nbd from Josef. - Bug fix from Omar, fixing a bug in sas transport code that oopses when bsg ioctls are used. - Improvements to the queue restart and tag wait from Omar. - Set of fixes for the sed/opal code from Scott. - Three trivial patches to cciss from Tobin. * 'for-linus' of git://git.kernel.dk/linux-block: (41 commits) dm-rq: don't dereference request payload after ending request blk-mq-sched: separate mark hctx and queue restart operations blk-mq: use sbq wait queues instead of restart for driver tags block/sed-opal: Propagate original error message to userland. nvme/pci: re-check security protocol support after reset block/sed-opal: Introduce free_opal_dev to free the structure and clean up state nvme: detect NVMe controller in recent MacBooks nvme-rdma: add support for host_traddr nvmet-rdma: Fix error handling nvmet-rdma: use nvme cm status helper nvme-rdma: move nvme cm status helper to .h file nvme-fc: don't bother to validate ioccsz and iorcsz nvme/pci: No special case for queue busy on IO nvme/core: Fix race kicking freed request_queue nvme/pci: Disable on removal when disconnected nvme: Enable autonomous power state transitions nvme: Add a quirk mechanism that uses identify_ctrl nvme: make nvmf_register_transport require a create_ctrl callback nvme: Use CNS as 8-bit field and avoid endianness conversion nvme: add semicolon in nvme_command setting ...