Age | Commit message (Collapse) | Author |
|
Commit 3d5f6733 ("dm thin metadata: speed up discard of partially mapped
volumes"), or some other dm-thinp change during the Linux 4.5
development window, really should've bumped these target versions.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
This field is always set in tandem with ->pers, and when it is tested
->pers is also tested. So ->ready is not needed.
It was needed once, but code rearrangement and locking changes have
removed that needed.
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
md_new_event had removed sysfs_notify since 'commit 72a23c211e45
("Make sure all changes to md/sync_action are notified.")', so we
can use md_new_event and delete md_new_event_inintr.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
And propagate the error up the stack so we can add the stripe
to no_stripes_list and retry our log operation later. This avoids
blocking raid5d due to reclaim, an it allows to get rid of the
deadlock-prone GFP_NOFAIL allocation.
shli: add missing mempool_destroy()
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
We only have a limited number in flight, so use a page based mempool.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
This allows us to make guaranteed forward progress.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Add support for journal disk hot add/remove. Mostly trival checks in md
part. The raid5 part is a little tricky. For hot-remove, we can't wait
pending write as it's called from raid5d. The wait will cause deadlock.
We simplily fail the hot-remove. A hot-remove retry can success
eventually since if journal disk is faulty all pending write will be
failed and finish. For hot-add, since an array supporting journal but
without journal disk will be marked read-only, we are safe to hot add
journal without stopping IO (should be read IO, while journal only
handles write IO).
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
get_seconds() API is not y2038 safe on 32 bit systems and the API
is deprecated. Replace it with calls to ktime_get_real_seconds()
API instead. Change mddev structure types to time64_t accordingly.
32 bit signed timestamps will overflow in the year 2038.
Change the user interface mdu_array_info_s structure timestamps:
ctime and utime values used in ioctls GET_ARRAY_INFO and
SET_ARRAY_INFO to unsigned int. This will extend the field to last
until the year 2106.
The long term plan is to get rid of ctime and utime values in
this structure as this information can be read from the on-disk
meta data directly.
Clamp the tim64_t timestamps to positive values with a max of U32_MAX
when returning from GET_ARRAY_INFO ioctl to accommodate above changes
in the data type of timestamps to unsigned int.
v0.90 on disk meta data uses u32 for maintaining time stamps.
So this will also last until year 2106.
Assumption is that the usage of v0.90 will be deprecated by
year 2106.
Timestamp fields in the on disk meta data for v1.0 version already
use 64 bit data types. Remove the truncation of the bits while
writing to or reading from these from the disk.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which
means we cannot have devices with more than 2TB, and the code that
is trying to handle compatibility support for large devices in
md version 0.90 is meaningless but also causes a compile-time warning:
drivers/md/md.c: In function 'super_90_load':
drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow]
drivers/md/md.c: In function 'super_90_rdev_size_change':
drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow]
This adds a check for CONFIG_LBDAF to avoid even getting into this
code path, and also adds an explicit cast to let the compiler know
it doesn't have to warn about the truncation.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Once the I/O completed we don't need the meta page anymore. As the iounits
can live on for a long time this reduces memory pressure a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
It's only used for one kind of move, so make that explicit. Also clean
up the code a bit by using list_for_each_safe.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
MD_CHANGE_CLEAN had been replaced with MD_CHANGE_PENDING after
commit 070dc6 ("md: resolve confusion of MD_CHANGE_CLEAN"),
so make the change accordingly.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
1. fix unbalanced parentheses.
2. add more description about that MD_CLUSTER_SEND_LOCKED_ALREADY
will be cleared after set it in add_new_disk.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Communication can happen through multiple threads. It is possible that
one thread steps over another threads sequence. So, we use mutexes to
protect both the send and receive sequences.
Send communication is locked through state bit, MD_CLUSTER_SEND_LOCK.
Communication is locked with bit manipulation in order to allow
"lock and hold" for the add operation. In case of an add operation,
if the lock is held, MD_CLUSTER_SEND_LOCKED_ALREADY is set.
When md_update_sb() calls metadata_update_start(), it checks
(in a single statement to avoid races), if the communication
is already locked. If yes, it merely returns zero, else it
locks the token lockresource.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Reloading of superblock must be performed under reconfig_mutex. However,
this cannot be done with md_reload_sb because it would deadlock with
the message DLM lock. So, we defer it in md_check_recovery() which is
executed by mddev->thread.
This introduces a new flag, MD_RELOAD_SB, which if set, will reload the
superblock. And good_device_nr is also added to 'struct mddev' which is
used to get the num of the good device within cluster raid.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
For clustered raid, we need to do extra actions when change
bitmap to none.
1. check if all the bitmap lock could be get or not, if yes then
we can continue the change since cluster raid is only active
in current node. Otherwise return fail and unlock the related
bitmap locks
2. set nodes to 0 and then leave cluster environment.
3. release other nodes's bitmap lock.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
If a spare device was marked faulty, it would not be reflected
in receiving nodes because it would mark it as activated and continue.
Continue the operation, so it may be set as faulty.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
The remove disk message does not need metadata_update_start(), but
can be an independent message.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
For cluster raid, if one disk couldn't be reach in one node, then
other nodes would receive the REMOVE message for the disk.
In receiving node, we can't call md_kick_rdev_from_array to remove
the disk from array synchronously since the disk might still be busy
in this node. So let's set a ClusterRemove flag on the disk, then
let the thread to do the removal job eventually.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
If a RESYNCING message with (0,0) has been sent before, do not send it
again. This avoids a resync ping pong between the nodes. We read
the bitmap lockresource's LVB to figure out the previous value
of the RESYNCING message.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
The stripe_add_to_batch_list() function is called only if
stripe_can_batch() returned true, so there is no need for double check.
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Neil Brown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... so virt_to_phys(p) & (PAGE_SIZE - 1) is a very odd way to
spell offset_in_page(p).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.
This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty was never triggering the searched_from_start path.
But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Added a safeguard in the shutdown case. At least while not being
attached it is also possible to trigger a kernel bug by writing into
writeback_running. This change adds the same check before trying to
wake up the thread for that case.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Allows to use register, not register_quiet in udev to avoid "device_busy" error.
The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
<g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.
See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion.
Cc: Denis Bychkov <manover@gmail.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Gabriel de Perthuis <g2p.code@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In bcache_init() function it forgot to unregister reboot notifier if
bcache fails to unregister a block device. This commit fixes this.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This bug can be reproduced by the following script:
#!/bin/bash
bcache_sysfs="/sys/fs/bcache"
function clear_cache()
{
if [ ! -e $bcache_sysfs ]; then
echo "no bcache sysfs"
exit
fi
cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
sleep 5
sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
}
for ((i=0;i<10;i++)); do
clear_cache
done
The warning messages look like below:
[ 275.948611] ------------[ cut here ]------------
[ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W
--------------- )
[ 275.979253] Hardware name: Tecal RH2285
[ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
[ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
[ 276.089315] Call Trace:
[ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
[ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
[ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
[ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
[ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90
[ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
[ 276.338241] ------------[ cut here ]------------
[ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720
bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- )
[ 276.386017] Hardware name: Tecal RH2285
[ 276.401430] Couldn't create device <-> cache set symlinks
[ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1
[ 276.482169] Call Trace:
[ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
[ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90
[ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---
We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach()
function when we attach a backing device first time. After detaching this
backing device, this flag will be true and sysfs_remove_link() isn't called in
bcache_device_unlink(). Then when we attach this backing device again,
sysfs_create_link() will return EEXIST error in bcache_device_link().
So the fix is trival and we clear this flag in bcache_device_link().
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Subject : [PATCH v2] bcache: fix a livelock in btree lock
Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)
This commit tries to fix a livelock in bcache. This livelock might
happen when we causes a huge number of cache misses simultaneously.
When we get a cache miss, bcache will execute the following path.
->cached_dev_make_request()
->cached_dev_read()
->cached_lookup()
->bch->btree_map_keys()
->btree_root() <------------------------
->bch_btree_map_keys_recurse() |
->cache_lookup_fn() |
->cached_dev_cache_miss() |
->bch_btree_insert_check_key() -|
[If btree->seq is not equal to seq + 1, we should return
EINTR and traverse btree again.]
In bch_btree_insert_check_key() function we first need to check upgrade
flag (op->lock == -1), and when this flag is true we need to release
read btree->lock and try to take write btree->lock. During taking and
releasing this write lock, btree->seq will be monotone increased in
order to prevent other threads modify this in cache miss (see btree.h:74).
But if there are some cache misses caused by some requested, we could
meet a livelock because btree->seq is always changed by others. Thus no
one can make progress.
This commit will try to take write btree->lock if it encounters a race
when we traverse btree. Although it sacrifice the scalability but we
can ensure that only one can modify the btree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Joshua Schmid <jschmid@suse.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
md currently doesn't allow a 'sync_action' such as 'reshape' to be set
while MD_RECOVERY_NEEDED is set.
This s a problem, particularly since commit 738a273806ee as that can
cause ->check_shape to call mddev_resume() which sets
MD_RECOVERY_NEEDED. So by the time we come to start 'reshape' it is
very likely that MD_RECOVERY_NEEDED is still set.
Testing for this flag is not really needed and is in any case very
racy as it can be set at any moment - asynchronously. Any race
between setting a sync_action and setting MD_RECOVERY_NEEDED must
already be handled properly in some locked code, probably
md_check_recovery(), so remove the test here.
The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case
so we should test it again after getting mddev_lock().
As this fixes a race and a regression which can cause 'reshape' to
fail, it is suitable for -stable kernels since 4.1
Reported-by: Xiao Ni <xni@redhat.com>
Fixes: 738a273806ee ("md/raid5: fix allocation of 'scribble' array.")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Commit 2910ff17d154baa5eb50e362a91104e831eb2bb6
introduced a regression which would remove a recently added spare via
slot_store. Revert part of the patch which touches slot_store() and add
the disk directly using pers->hot_add_disk()
Fixes: 2910ff17d154 ("md: remove_and_add_spares() to activate specific
rdev")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
The patch c7bfced9a6716ff66c9d61f934bb60af08d4688c committed to 4.4-rc
causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with
this BUG, the reason is that we attempt to suspend a device that is
already suspended. See also
https://bugzilla.redhat.com/show_bug.cgi?id=1283491
This patch fixes the bug by changing functions mddev_suspend and
mddev_resume to always nest.
The number of nested calls to mddev_nested_suspend is kept in the
variable mddev->suspended.
[neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend]
kernel BUG at drivers/md/md.c:317!
CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1
task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001000000000000001111 Not tainted
r00-03 000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810
r04-07 0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810
r08-11 000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff
r12-15 0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4
r16-19 00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001
r20-23 0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1
r24-27 00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000
r28-31 0000000000000001 0000000047014840 0000000047014930 0000000000000001
sr00-03 0000000007040800 0000000000000000 0000000000000000 0000000007040800
sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390
IIR: 03ffe01f ISR: 0000000000000000 IOR: 00000000102b2748
CPU: 3 CR30: 0000000047014000 CR31: 0000000000000000
ORIG_R28: 00000000000000b0
IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod]
IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod]
RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1]
Backtrace:
[<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1]
[<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid]
[<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod]
[<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod]
[<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod]
[<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod]
[<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod]
[<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558
Fixes: c7bfced9a671 ("md: suspend i/o during runtime blk_integrity_unregister")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
Neil pointed out setting journal disk role to raid_disks will confuse
reshape if we support reshape eventually. Switching the role to 0 (we
should be fine as long as the value >=0) and skip sysfs file creation to
avoid error.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
The commit c31df25f20e3 ("md/raid10: make sync_request_write() call
bio_copy_data()") replaced manual data copying with bio_copy_data() but
it doesn't work as intended. The source bio (fbio) is already processed,
so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt. Because of
this, bio_copy_data() either does not copy anything, or worse, copies
data from the ->bi_next bio if it is set. This causes wrong data to be
written to drives during resync and sometimes lockups/crashes in
bio_copy_data():
[ 517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
[ 517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
[ 517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
[ 517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
[ 517.468529] RIP: 0010:[<ffffffff812e1888>] [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
[ 517.478164] RSP: 0018:ffff880150dfbc98 EFLAGS: 00000246
[ 517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
[ 517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
[ 517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
[ 517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
[ 517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
[ 517.525412] FS: 0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
[ 517.534844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
[ 517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 517.566144] Stack:
[ 517.568626] ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
[ 517.577659] 0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
[ 517.586715] ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
[ 517.595773] Call Trace:
[ 517.598747] [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
[ 517.605610] [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
[ 517.611987] [<ffffffff814ff206>] md_thread+0x106/0x140
[ 517.618072] [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
[ 517.624252] [<ffffffff814ff100>] ? super_1_load+0x520/0x520
[ 517.630817] [<ffffffff8109ef89>] kthread+0xc9/0xe0
[ 517.636506] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
[ 517.643653] [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
[ 517.649929] [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: Shaohua Li <shli@kernel.org>
Cc: stable@vger.kernel.org (v4.2+)
Fixes: c31df25f20e3 ("md/raid10: make sync_request_write() call bio_copy_data()")
Signed-off-by: NeilBrown <neilb@suse.com>
|
|
When a thin pool is being destroyed delayed work items are
cancelled using cancel_delayed_work(), which doesn't guarantee that on
return the delayed item isn't running. This can cause the work item to
requeue itself on an already destroyed workqueue. Fix this by using
cancel_delayed_work_sync() which guarantees that on return the work item
is not running anymore.
Fixes: 905e51b39a555 ("dm thin: commit outstanding data every second")
Fixes: 85ad643b7e7e5 ("dm thin: add timeout to stop out-of-data-space mode holding IO forever")
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
Remove the unused struct block_op pointer that was inadvertantly
introduced, via cut-and-paste of previous brb_op() code, as part of
commit 50dd842ad.
(Cc'ing stable@ because commit 50dd842ad did)
Fixes: 50dd842ad ("dm space map metadata: fix ref counting bug when bootstrapping a new space map")
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
If ignore_zero_blocks is enabled dm-verity will return zeroes for blocks
matching a zero hash without validating the content.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Add support for correcting corrupted blocks using Reed-Solomon.
This code uses RS(255, N) interleaved across data and hash
blocks. Each error-correcting block covers N bytes evenly
distributed across the combined total data, so that each byte is a
maximum distance away from the others. This makes it possible to
recover from several consecutive corrupted blocks with relatively
small space overhead.
In addition, using verity hashes to locate erasures nearly doubles
the effectiveness of error correction. Being able to detect
corrupted blocks also improves performance, because only corrupted
blocks need to corrected.
For a 2 GiB partition, RS(255, 253) (two parity bytes for each
253-byte block) can correct up to 16 MiB of consecutive corrupted
blocks if erasures can be located, and 8 MiB if they cannot, with
16 MiB space overhead.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
verity_for_bv_block() will be re-used by optional dm-verity object.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Prepare for an optional verity object to make use of existing dm-verity
structures and functions.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Prepare for extending dm-verity with an optional object. Follows the
naming convention used by other DM targets (e.g. dm-cache and dm-era).
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Move optional argument parsing into a separate function to make it
easier to add more of them without making verity_ctr even longer.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Handle dm-verity salting in one place to simplify the code.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Eliminates code duplication within insert().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Anup Limbu <anuplimbu14@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The option DM_DEBUG_BLOCK_STACK_TRACING is moved from persistent-data
directory to device mapper directory because it will now be used by
persistent-data and bufio. When the option is enabled, each bufio buffer
stores the stacktrace of the last dm_bufio_get(), dm_bufio_read() or
dm_bufio_new() call that increased the hold count to 1. The buffer's
stacktrace is printed if the buffer was not released before the bufio
client is destroyed.
When DM_DEBUG_BLOCK_STACK_TRACING is enabled, any bufio buffer leaks are
considered warnings - i.e. the kernel continues afterwards. If not
enabled, buffer leaks are considered BUGs and the kernel with crash.
Reasoning on this disposition is: if we only ever warned on buffer leaks
users would generally ignore them and the problematic code would never
get fixed.
Successfully used to find source of bufio leaks fixed with commit
fce079f63c3 ("dm btree: fix bufio buffer leaks in dm_btree_del() error
path").
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
A small code cleanup in new_read() - return NULL instead of b (although
b is NULL at this point). This function is not returning pointer to the
buffer, it is returning a pointer to the bufffer's data, thus it makes
no sense to return the variable b.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
There is no need to record stack trace and immediately print it. Just
use dump_stack() to print the current stack.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|