summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2015-07-17Merge tag 'dm-4.2-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - revert a request-based DM core change that caused IO latency to increase and adversely impact both throughput and system load - fix for a use after free bug in DM core's device cleanup - a couple DM btree removal fixes (used by dm-thinp) - a DM thinp fix for order-5 allocation failure - a DM thinp fix to not degrade to read-only metadata mode when in out-of-data-space mode for longer than the 'no_space_timeout' - fix a long-standing oversight in both dm-thinp and dm-cache by now exporting 'needs_check' in status if it was set in metadata - fix an embarrassing dm-cache busy-loop that caused worker threads to eat cpu even if no IO was actively being issued to the cache device * tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: avoid calls to prealloc_free_structs() if possible dm cache: avoid preallocation if no work in writeback_some_dirty_blocks() dm cache: do not wake_worker() in free_migration() dm cache: display 'needs_check' in status if it is set dm thin: display 'needs_check' in status if it is set dm thin: stay in out-of-data-space mode once no_space_timeout expires dm: fix use after free crash due to incorrect cleanup sequence Revert "dm: only run the queue on completion if congested or no requests pending" dm btree: silence lockdep lock inversion in dm_btree_del() dm thin: allocate the cell_sort_array dynamically dm btree remove: fix bug in redistribute3
2015-07-16dm cache: avoid calls to prealloc_free_structs() if possibleMike Snitzer
If no work was performed then prealloc_data_structs() wasn't ever called so there isn't any need to call prealloc_free_structs(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()Mike Snitzer
Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs() if the policy doesn't have any dirty blocks ready for writeback. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16dm cache: do not wake_worker() in free_migration()Mike Snitzer
All methods that queue work call wake_worker() as you'd expect. E.g. cell_defer, defer_bio, quiesce_migration (which is called by writeback, promote, demote_then_promote, invalidate, discard, etc). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16dm cache: display 'needs_check' in status if it is setMike Snitzer
There is currently no way to see that the needs_check flag has been set in the metadata. Display 'needs_check' in the cache status if it is set in the cache metadata. Also, update cache documentation. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16dm thin: display 'needs_check' in status if it is setMike Snitzer
There is currently no way to see that the needs_check flag has been set in the metadata. Display 'needs_check' in the thin-pool status if it is set in the thinp metadata. Also, update thinp documentation. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16dm thin: stay in out-of-data-space mode once no_space_timeout expiresMike Snitzer
This fixes an issue where running out of data space would cause the thin-pool's metadata to become read-only. There was no reason to make metadata read-only -- calling set_pool_mode() with PM_READ_ONLY was a misguided way to error all queued and future write IOs. We can accomplish the same by degrading from PM_OUT_OF_DATA_SPACE to PM_OUT_OF_DATA_SPACE with error_if_no_space enabled. Otherwise, the use of PM_READ_ONLY could cause a race where commit() was started before the PM_READ_ONLY transition but dm_pool_commit_metadata() would go on to fail because the block manager had transitioned to read-only. The return of -EPERM from dm_pool_commit_metadata(), due to attempting to commit while in read-only mode, caused the thin-pool to set 'needs_check' because a metadata_operation_failed(). This needless cascade of failures makes life for users more difficult than needed. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-13dm: fix use after free crash due to incorrect cleanup sequenceMikulas Patocka
Linux 4.2-rc1 Commit 0f20972f7bf6 ("dm: factor out a common cleanup_mapped_device()") moved a common cleanup code to a separate function. Unfortunately, that commit incorrectly changed the order of cleanup, so that it destroys the mapped_device's srcu structure 'io_barrier' before destroying its workqueue. The function that is executed on the workqueue (dm_wq_work) uses the srcu structure, thus it may use it after being freed. That results in a crash in the LVM test suite's mirror-vgreduce-removemissing.sh test. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 0f20972f7bf6 ("dm: factor out a common cleanup_mapped_device()") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-11bcache: don't embed 'return' statements in closure macrosJens Axboe
This is horribly confusing, it breaks the flow of the code without it being apparent in the caller. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de>
2015-07-08Revert "dm: only run the queue on completion if congested or no requests ↵Mike Snitzer
pending" This reverts commit 9a0e609e3fd8a95c96629b9fbde6b8c5b9a1456a. (Resolved a conflict during revert due to commit bfebd1cdb4 that came after) This revert is motivated by a couple failure reports on request-based DM multipath testbeds: 1) Netapp reported that their multipath fault injection test under heavy IO load can stall longer than 300 seconds. 2) IBM reported elevated lock contention in their testbed (likely due to increased back pressure due to IO not being dispatched as quickly): https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+
2015-07-06dm btree: silence lockdep lock inversion in dm_btree_del()Joe Thornber
Allocate memory using GFP_NOIO when deleting a btree. dm_btree_del() can be called via an ioctl and we don't want to recurse into the FS or block layer. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-07-05dm thin: allocate the cell_sort_array dynamicallyJoe Thornber
Given the pool's cell_sort_array holds 8192 pointers it triggers an order 5 allocation via kmalloc. This order 5 allocation is prone to failure as system memory gets more fragmented over time. Fix this by allocating the cell_sort_array using vmalloc. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-07-05dm btree remove: fix bug in redistribute3Dennis Yang
redistribute3() shares entries out across 3 nodes. Some entries were being moved the wrong way, breaking the ordering. This manifested as a BUG() in dm-btree-remove.c:shift() when entries were removed from the btree. For additional context see: https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html Signed-off-by: Dennis Yang <shinrairis@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-07-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...
2015-06-30MAINTAINERS: BCACHE: Kent Overstreet has changed email addressJoe Perches
Kent's email address in MAINTAINERS seems to be invalid. This was his last sign-off address, so use that if appropriate. Fix the S: status entry while there. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30bcache: use kvfree() in various placesPekka Enberg
Use kvfree() instead of open-coding it. Signed-off-by: Pekka Enberg <penberg@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-29Merge tag 'md/4.2' of git://neil.brown.name/mdLinus Torvalds
Pull md updates from Neil Brown: "A mixed bag - a few bug fixes - some performance improvement that decrease lock contention - some clean-up Nothing major" * tag 'md/4.2' of git://neil.brown.name/md: md: clear Blocked flag on failed devices when array is read-only. md: unlock mddev_lock on an error path. md: clear mddev->private when it has been freed. md: fix a build warning md/raid5: ignore released_stripes check md/raid5: per hash value and exclusive wait_for_stripe md/raid5: split wait_for_stripe and introduce wait_for_quiescent wait: introduce wait_event_exclusive_cmd md: convert to kstrto*() md/raid10: make sync_request_write() call bio_copy_data()
2015-06-26Merge tag 'dm-4.2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Apologies for not pressing this request-based DM partial completion issue further, it was an oversight on my part. We'll have to get it fixed up properly and revisit for a future release. - Revert block and DM core changes the removed request-based DM's ability to handle partial request completions -- otherwise with the current SCSI LLDs these changes could lead to silent data corruption. - Fix two DM version bumps that were missing from the initial 4.2 DM pull request (enabled userspace lvm2 to know certain changes have been made)" * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache policy smq: fix "default" version to be 1.4.0 dm: bump the ioctl version to 4.32.0 Revert "block, dm: don't copy bios for request clones" Revert "dm: do not allocate any mempools for blk-mq request-based DM"
2015-06-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge second patchbomb from Andrew Morton: - most of the rest of MM - lots of misc things - procfs updates - printk feature work - updates to get_maintainer, MAINTAINERS, checkpatch - lib/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits) exit,stats: /* obey this comment */ coredump: add __printf attribute to cn_*printf functions coredump: use from_kuid/kgid when formatting corename fs/reiserfs: remove unneeded cast NILFS2: support NFSv2 export fs/befs/btree.c: remove unneeded initializations fs/minix: remove unneeded cast init/do_mounts.c: add create_dev() failure log kasan: remove duplicate definition of the macro KASAN_FREE_PAGE fs/efs: femove unneeded cast checkpatch: emit "NOTE: <types>" message only once after multiple files checkpatch: emit an error when there's a diff in a changelog checkpatch: validate MODULE_LICENSE content checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr() checkpatch: fix processing of MEMSET issues checkpatch: suggest using ether_addr_equal*() checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files checkpatch: remove local from codespell path checkpatch: add --showfile to allow input via pipe to show filenames ...
2015-06-26dm cache policy smq: fix "default" version to be 1.4.0Mike Snitzer
Commit bccab6a0 ("dm cache: switch the "default" cache replacement policy from mq to smq") should've incremented the "default" policy's version number to 1.4.0 rather than reverting to version 1.0.0. Reported-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26Revert "block, dm: don't copy bios for request clones"Mike Snitzer
This reverts commit 5f1b670d0bef508a5554d92525f5f6d00d640b38. Justification for revert as reported in this dm-devel post: https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html this change should not be pushed to mainline yet. Firstly, Christoph has a newer version of the patch that fixes silent data corruption problem: https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html And the new version still depends on LLDDs to always complete requests to the end when error happens, while block API doesn't enforce such a requirement. If the assumption is ever broken, the inconsistency between request and bio (e.g. rq->__sector and rq->bio) will cause silent data corruption: https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26Revert "dm: do not allocate any mempools for blk-mq request-based DM"Mike Snitzer
This reverts commit cbc4e3c1350beb47beab8f34ad9be3d34a20c705. Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-25drivers/md/md.c: use strreplace()Rasmus Villemoes
There's no point in starting over when we meet a '/'. This also eliminates a stack variable and a little .text. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25Merge tag 'dm-4.2-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - DM core cleanups: * blk-mq request-based DM no longer uses any mempools now that partial completions are no longer handled as part of cloned requests - DM raid cleanups and support for MD raid0 - DM cache core advances and a new stochastic-multi-queue (smq) cache replacement policy * smq is the new default dm-cache policy - DM thinp cleanups and much more efficient large discard support - DM statistics support for request-based DM and nanosecond resolution timestamps - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt * tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits) dm stats: add support for request-based DM devices dm stats: collect and report histogram of IO latencies dm stats: support precise timestamps dm stats: fix divide by zero if 'number_of_areas' arg is zero dm cache: switch the "default" cache replacement policy from mq to smq dm space map metadata: fix occasional leak of a metadata block on resize dm thin metadata: fix a race when entering fail mode dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages dm thin: range discard support dm thin metadata: add dm_thin_remove_range() dm thin metadata: add dm_thin_find_mapped_range() dm btree: add dm_btree_remove_leaves() dm stats: Use kvfree() in dm_kvfree() dm cache: age and write back cache entries even without active IO dm cache: prefix all DMERR and DMINFO messages with cache device name dm cache: add fail io mode and needs_check flag dm cache: wake the worker thread every time we free a migration object dm cache: add stochastic-multi-queue (smq) policy dm cache: boost promotion of blocks that will be overwritten dm cache: defer whole cells ...
2015-06-25Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
2015-06-25Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO update from Jens Axboe: "Nothing really major in here, mostly a collection of smaller optimizations and cleanups, mixed with various fixes. In more detail, this contains: - Addition of policy specific data to blkcg for block cgroups. From Arianna Avanzini. - Various cleanups around command types from Christoph. - Cleanup of the suspend block I/O path from Christoph. - Plugging updates from Shaohua and Jeff Moyer, for blk-mq. - Eliminating atomic inc/dec of both remaining IO count and reference count in a bio. From me. - Fixes for SG gap and chunk size support for data-less (discards) IO, so we can merge these better. From me. - Small restructuring of blk-mq shared tag support, freeing drivers from iterating hardware queues. From Keith Busch. - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the IOPS mode the default for non-rotational storage" * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits) cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL cfq-iosched: fix sysfs oops when attempting to read unconfigured weights cfq-iosched: move group scheduling functions under ifdef cfq-iosched: fix the setting of IOPS mode on SSDs blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file block, cgroup: implement policy-specific per-blkcg data block: Make CFQ default to IOPS mode on SSDs block: add blk_set_queue_dying() to blkdev.h blk-mq: Shared tag enhancements block: don't honor chunk sizes for data-less IO block: only honor SG gap prevention for merges that contain data block: fix returnvar.cocci warnings block, dm: don't copy bios for request clones block: remove management of bi_remaining when restoring original bi_end_io block: replace trylock with mutex_lock in blkdev_reread_part() block: export blkdev_reread_part() and __blkdev_reread_part() suspend: simplify block I/O handling block: collapse bio bit space block: remove unused BIO_RW_BLOCK and BIO_EOF flags block: remove BIO_EOPNOTSUPP ...
2015-06-25md: clear Blocked flag on failed devices when array is read-only.Neil Brown
The Blocked flag indicates that a device has failed but that this fact hasn't been recorded in the metadata yet. Writes to such devices cannot be allowed until the metadata has been updated. On a read-only array, the Blocked flag will never be cleared. This prevents the device being removed from the array. If the metadata is being handled by the kernel (i.e. !mddev->external), then we can be sure that if the array is switch to writable, then a metadata update will happen and will record the failure. So we don't need the flag set. If metadata is externally managed, it is upto the external manager to clear the 'blocked' flag. Reported-by: XiaoNi <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-25md: unlock mddev_lock on an error path.NeilBrown
This error path retuns while still holding the lock - bad. Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") Cc: stable@vger.kernel.org (v4.0+) Signed-off-by: NeilBrown <neilb@suse.com>
2015-06-25md: clear mddev->private when it has been freed.NeilBrown
If ->private is set when ->run is called, it is assumed to be a 'config' prepared as part of 'reshape'. So it is important when we free that config, that we also clear ->private. This is not often a problem as the mddev will normally be discarded shortly after the config us freed. However if an 'assemble' races with a final close, the assemble can use the old mddev which has a stale ->private. This leads to any of various sorts of crashes. So clear ->private after calling ->free(). Reported-by: Nate Clark <nate@neworld.us> Cc: stable@vger.kernel.org (v4.0+) Fixes: afa0f557cb15 ("md: rename ->stop to ->free") Signed-off-by: NeilBrown <neilb@suse.com>
2015-06-23vfs: add seq_file_path() helperMiklos Szeredi
Turn seq_path(..., &file->f_path, ...); into seq_file_path(..., file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23vfs: add file_path() helperMiklos Szeredi
Turn d_path(&file->f_path, ...); into file_path(file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-17dm stats: add support for request-based DM devicesMikulas Patocka
This makes it possible to use dm stats with DM multipath. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17dm stats: collect and report histogram of IO latenciesMikulas Patocka
Add an option to dm statistics to collect and report a histogram of IO latencies. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17dm stats: support precise timestampsMikulas Patocka
Make it possible to use precise timestamps with nanosecond granularity in dm statistics. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17dm stats: fix divide by zero if 'number_of_areas' arg is zeroMikulas Patocka
If the number_of_areas argument was zero the kernel would crash on div-by-zero. Add better input validation. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.12+
2015-06-17dm cache: switch the "default" cache replacement policy from mq to smqMike Snitzer
The Stochastic multiqueue (SMQ) policy (vs MQ) offers the promise of less memory utilization, improved performance and increased adaptability in the face of changing workloads. SMQ also does not have any cumbersome tuning knobs. Users may switch from "mq" to "smq" simply by appropriately reloading a DM table that is using the cache target. Doing so will cause all of the mq policy's hints to be dropped. Also, performance of the cache may degrade slightly until smq recalculates the origin device's hotspots that should be cached. In the future the "mq" policy will just silently make use of "smq" and the mq code will be removed. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2015-06-17dm space map metadata: fix occasional leak of a metadata block on resizeJoe Thornber
The metadata space map has a simplified 'bootstrap' mode that is operational when extending the space maps. Whilst in this mode it's possible for some refcount decrement operations to become queued (eg, as a result of shadowing one of the bitmap indexes). These decrements were not being applied when switching out of bootstrap mode. The effect of this bug was the leaking of a 4k metadata block. This is detected by the latest version of thin_check as a non fatal error. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2015-06-17md: fix a build warningFiro Yang
Warning like this: drivers/md/md.c: In function "update_array_info": drivers/md/md.c:6394:26: warning: logical not is only applied to the left hand side of comparison [-Wlogical-not-parentheses] !mddev->persistent != info->not_persistent|| Fix it as Neil Brown said: mddev->persistent != !info->not_persistent || Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17md/raid5: ignore released_stripes checkShaohua Li
conf->released_stripes list isn't always related to where there are free stripes pending. Active stripes can be in the list too. And even free stripes were active very recently. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17md/raid5: per hash value and exclusive wait_for_stripeYuanhan Liu
I noticed heavy spin lock contention at get_active_stripe() with fsmark multiple thread write workloads. Here is how this hot contention comes from. We have limited stripes, and it's a multiple thread write workload. Hence, those stripes will be taken soon, which puts later processes to sleep for waiting free stripes. When enough stripes(>= 1/4 total stripes) are released, all process are woken, trying to get the lock. But there is one only being able to get this lock for each hash lock, making other processes spinning out there for acquiring the lock. Thus, it's effectiveless to wakeup all processes and let them battle for a lock that permits one to access only each time. Instead, we could make it be a exclusive wake up: wake up one process only. That avoids the heavy spin lock contention naturally. To do the exclusive wake up, we've to split wait_for_stripe into multiple wait queues, to make it per hash value, just like the hash lock. Here are some test results I have got with this patch applied(all test run 3 times): `fsmark.files_per_sec' ===================== next-20150317 this patch ------------------------- ------------------------- metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params ------------------------- ------------------------- -------- ------------------------------ 25.600 ±0.0 92.700 ±2.5 262.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose 25.600 ±0.0 77.800 ±0.6 203.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose 32.000 ±0.0 93.800 ±1.7 193.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose 32.000 ±0.0 81.233 ±1.7 153.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose 48.800 ±14.5 99.667 ±2.0 104.2% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose 6.400 ±0.0 12.800 ±0.0 100.0% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose 63.133 ±8.2 82.800 ±0.7 31.2% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose 245.067 ±0.7 306.567 ±7.9 25.1% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose 17.533 ±0.3 21.000 ±0.8 19.8% ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose 188.167 ±1.9 215.033 ±3.1 14.3% ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync 254.500 ±1.8 290.733 ±2.4 14.2% ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync `time.system_time' ===================== next-20150317 this patch ------------------------- ------------------------- metric_value ±stddev metric_value ±stddev change testbox/benchmark/testcase-params ------------------------- ------------------------- -------- ------------------------------ 7235.603 ±1.2 185.163 ±1.9 -97.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose 7666.883 ±2.9 202.750 ±1.0 -97.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose 14567.893 ±0.7 421.230 ±0.4 -97.1% ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose 3697.667 ±14.0 148.190 ±1.7 -96.0% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose 5572.867 ±3.8 310.717 ±1.4 -94.4% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose 5565.050 ±0.5 313.277 ±1.5 -94.4% ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose 2420.707 ±17.1 171.043 ±2.7 -92.9% ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose 3743.300 ±4.6 379.827 ±3.5 -89.9% ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose 3308.687 ±6.3 363.050 ±2.0 -89.0% ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose Where, 1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark 1t, 64t: where 't' means thread 4M: means the single file size, corresponding to the '-s' option of fsmark 40G, 30G, 120G: means the total test size 4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means the size of one ramdisk. So, it would be 48G in total. And we made a raid on those ramdisk As you can see, though there are no much performance gain for hard disk workload, the system time is dropped heavily, up to 97%. And as expected, the performance increased a lot, up to 260%, for fast device(ram disk). v2: use bits instead of array to note down wait queue need to wake up. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17md/raid5: split wait_for_stripe and introduce wait_for_quiescentYuanhan Liu
I noticed heavy spin lock contention at get_active_stripe(), introduced at being wake up stage, where a bunch of processes try to re-hold the spin lock again. After giving some thoughts on this issue, I found the lock could be relieved(and even avoided) if we turn the wait_for_stripe to per waitqueue for each lock hash and make the wake up exclusive: wake up one process each time, which avoids the lock contention naturally. Before go hacking with wait_for_stripe, I found it actually has 2 usages: for the array to enter or leave the quiescent state, and also to wait for an available stripe in each of the hash lists. So this patch splits the first usage off into a separate wait_queue, wait_for_quiescent, and the next patch will turn the second usage into one waitqueue for each hash value, and make it exclusive, to relieve the lock contention. v2: wake_up(wait_for_quiescent) when (active_stripes == 0) Commit log refactor suggestion from Neil. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17md: convert to kstrto*()Alexey Dobriyan
Convert away from deprecated simple_strto*() functions. Add "fit into sector_t" checks. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17md/raid10: make sync_request_write() call bio_copy_data()Kent Overstreet
Refactor sync_request_write() of md/raid10 to use bio_copy_data() instead of open coding bio_vec iterations. Cc: Christoph Hellwig <hch@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <mlin@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-12md: make sure MD_RECOVERY_DONE is clear before starting recovery/resyncNeilBrown
MD_RECOVERY_DONE is normally cleared by md_check_recovery after a resync etc finished. However it is possible for raid5_start_reshape to race and start a reshape before MD_RECOVERY_DONE is cleared. This can lean to multiple reshapes running at the same time, which isn't good. To make sure it is cleared before starting a reshape, and also clear it when reaping a thread, just to be safe. Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-12md: Close race when setting 'action' to 'idle'.NeilBrown
Checking ->sync_thread without holding the mddev_lock() isn't really safe, even after flushing the workqueue which ensures md_start_sync() has been run. While this code is waiting for the lock, md_check_recovery could reap the thread itself, and then start another thread (e.g. recovery might finish, then reshape starts). When this thread gets the lock md_start_sync() hasn't run so it doesn't get reaped, but MD_RECOVERY_RUNNING gets cleared. This allows two threads to start which leads to confusion. So don't both if MD_RECOVERY_RUNNING isn't set, but if it is do the flush and the test and the reap all under the mddev_lock to avoid any race with md_check_recovery. Signed-off-by: NeilBrown <neilb@suse.de> Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") Cc: stable@vger.kernel.org (v4.0+)
2015-06-12md: don't return 0 from array_state_storeNeilBrown
Returning zero from a 'store' function is bad. The return value should be either len length of the string or an error. So use 'len' if 'err' is zero. Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel (v4.0+)
2015-06-11dm thin metadata: fix a race when entering fail modeJoe Thornber
In dm_thin_find_block() the ->fail_io flag was checked outside the metadata device's root_lock, causing dm_thin_find_block() to race with the setting of this flag. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11dm thin: fail messages with EOPNOTSUPP when pool cannot handle messagesMike Snitzer
Use EOPNOTSUPP, rather than EINVAL, error code when user attempts to send the pool a message. Otherwise usespace is led to believe the message failed due to invalid argument. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11dm thin: range discard supportJoe Thornber
Previously REQ_DISCARD bios have been split into block sized chunks before submission to the thin target. There are a couple of issues with this: - If the block size is small, a large discard request can get broken up into a great many bios which is both slow and causes a lot of memory pressure. - The thin pool block size and the discard granularity for the underlying data device need to be compatible if we want to passdown the discard. This patch relaxes the block size granularity for thin devices. It makes use of the recent range locking added to the bio_prison to quiesce a whole range of thin blocks before unmapping them. Once a thin range has been unmapped the discard can then be passed down to the data device for those sub ranges where the data blocks are no longer used (ie. they weren't shared in the first place). This patch also doesn't make any apologies about open-coding portions of block core as a means to supporting async discard completions in the near-term -- if/when late bio splitting lands it'll all get cleaned up. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11dm thin metadata: add dm_thin_remove_range()Joe Thornber
Removes a range of blocks from the btree. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>