Age | Commit message | Author |
|
If we're about to do a fast fsync for an inode and btrfs_inode_in_log()
returns false, it's possible that we had an ordered extent in progress
(btrfs_finish_ordered_io() not run yet) when we noticed that the inode's
last_trans field was not greater than the id of the last committed
transaction. If, shortly after and before we checked for any ongoing
ordered extents, that ordered extent completed and removed itself from
the inode's ordered tree, we end up not logging the inode, losing some
data if a power failure or crash happens after the fsync handler returns
and before the transaction is committed.
Fix this by checking first if there are any ongoing ordered extents
before comparing the inode's last_trans with the id of the last committed
transaction - when it completes, an ordered extent always updates the
inode's last_trans before it removes itself from the inode's ordered
tree (at btrfs_finish_ordered_io()).
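A hedged sketch of that ordering (the names below are illustrative placeholders, not the actual btrfs symbols):
static bool fast_fsync_can_skip_logging(struct example_inode *inode,
                                        u64 last_committed_transid)
{
    /* 1) Look for ordered extents still in flight first. */
    if (example_inode_has_ordered_extents(inode))
        return false;
    /*
     * 2) Only then trust last_trans: a completing ordered extent updates
     *    last_trans before removing itself from the ordered tree, so if
     *    no ordered extent was found above, last_trans already reflects it.
     */
    return example_inode_last_trans(inode) <= last_committed_transid;
}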
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
In the listxattrs handler, we were not listing all the xattrs that are
packed in the same btree item, which happens when multiple xattrs have
names that produce the same crc32c checksum.
Fix this by processing all of them.
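As a rough sketch of the required iteration (struct and helper names are stand-ins, not the real btrfs accessors), each btree item can hold several xattr entries back to back and all of them must be emitted:
struct example_xattr_entry {    /* stand-in for the on-disk dir item header */
    u32 name_len;
    u32 value_len;
    /* name and value bytes follow */
};

static void list_xattrs_in_item(const char *item_data, u32 item_size,
                                void (*emit)(const char *name, u32 name_len))
{
    u32 cur = 0;

    while (cur < item_size) {
        const struct example_xattr_entry *e =
            (const struct example_xattr_entry *)(item_data + cur);

        emit(item_data + cur + sizeof(*e), e->name_len);
        /* advance past this entry's header, name and value */
        cur += sizeof(*e) + e->name_len + e->value_len;
    }
}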
The following test case for xfstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/attr
# real QA test starts here
_supported_fs generic
_supported_os Linux
_require_scratch
_require_attrs
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a few xattrs. The first 3 xattrs have a name
# that when given as input to a crc32c function result in the same checksum.
# This made btrfs list only one of the xattrs through listxattrs system call
# (because it packs xattrs with the same name checksum into the same btree
# item).
touch $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile
# Now call getfattr with --dump, which calls the listxattrs system call.
# It should list all the xattrs we have set before.
$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
While running a test with a mix of buffered IO and direct IO against
the same files I hit a deadlock reported by the following trace:
[11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
[11642.142452] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.146332] kworker/u32:3 D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
[11642.149771] 0 15282 2 0x00000000
[11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
[11642.154074] ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
[11642.156722] ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
[11642.159205] 0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
[11642.161403] Call Trace:
[11642.162129] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.163396] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.164871] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.167020] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.167931] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.182320] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.183762] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.185308] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.186782] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.188217] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.189626] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.190803] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.192158] [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.193379] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.194831] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.197068] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.199188] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.200723] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.202465] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.203836] [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.205624] [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.207057] [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.208529] [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.210375] [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
[11642.212132] [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
[11642.213837] [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
[11642.215457] [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
[11642.217095] [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.218324] [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.219466] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.220801] [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.222032] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.223190] [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.224394] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.226295] 2 locks held by kworker/u32:3/15282:
[11642.227273] #0: ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.229412] #1: ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
[11642.232872] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.235776] kworker/u32:8 D ffff88020de5f848 0 15289 2 0x00000000
[11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
[11642.238670] ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
[11642.240475] ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
[11642.242154] 0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
[11642.243715] Call Trace:
[11642.244390] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.245432] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.246392] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.247479] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.248551] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.249968] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.251043] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.252202] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.253210] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.254307] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.256118] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.257131] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.258200] [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.259168] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.260516] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.261841] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.263531] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.264747] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.266148] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.267264] [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.268280] [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
[11642.269407] [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
[11642.270476] [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
[11642.271547] [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
[11642.272588] [<ffffffff81194821>] wb_workfn+0x201/0x341
[11642.273523] [<ffffffff81194821>] ? wb_workfn+0x201/0x341
[11642.274479] [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.275497] [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.276518] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.277520] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.278517] [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.279371] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.280468] [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.281607] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.282604] 3 locks held by kworker/u32:8/15289:
[11642.283423] #0: ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.285629] #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.287538] #2: (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
[11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
[11642.290547] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.292864] fdm-stress D ffff88022c107c20 0 26848 26591 0x00000000
[11642.294118] ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
[11642.295602] ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
[11642.297098] ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
[11642.298433] Call Trace:
[11642.298896] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.299738] [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
[11642.300833] [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.301943] [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
[11642.303270] [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
[11642.304552] [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
[11642.305782] [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
[11642.306878] [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
[11642.307729] [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
[11642.308602] [<ffffffff8116efbb>] SyS_write+0x50/0x7e
[11642.309410] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.310403] 3 locks held by fdm-stress/26848:
[11642.311108] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.312578] #1: (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.314170] #2: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
[11642.317842] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.319959] fdm-stress D ffff8801964ffa68 0 26849 26591 0x00000000
[11642.321312] ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
[11642.322555] ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
[11642.323715] ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
[11642.325096] Call Trace:
[11642.325532] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.326303] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.327180] [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
[11642.328114] [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
[11642.329051] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.330053] [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
[11642.330952] [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
[11642.331869] [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
[11642.332925] [<ffffffff81074075>] ? wake_up_q+0x47/0x47
[11642.333736] [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
[11642.334672] [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
[11642.335858] [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
[11642.336854] [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.337820] [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
[11642.339026] [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
[11642.340214] [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
[11642.341123] [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
[11642.341934] [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
[11642.342936] [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
[11642.343772] [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
[11642.344673] [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
[11642.346024] [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
[11642.346873] [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
[11642.347720] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.350222] 4 locks held by fdm-stress/26849:
[11642.350898] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.352375] #1: (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
[11642.354072] #2: (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
[11642.355647] #3: (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
[11642.358508] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.368625] fdm-stress D ffff88021f167688 0 26850 26591 0x00000000
[11642.369716] ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
[11642.370950] ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
[11642.372210] 0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
[11642.373430] Call Trace:
[11642.373853] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.374623] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.375948] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.376862] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.377637] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.378610] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.379457] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.380366] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.381353] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.382255] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.383162] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.383945] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.384875] [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.385749] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.386721] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.387596] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.389030] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.389973] [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
[11642.390939] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.392271] [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
[11642.393305] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.394239] [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.395045] [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.395991] [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.397144] [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.398392] [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
[11642.399363] [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
[11642.400445] [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
[11642.401309] [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
[11642.402213] [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
[11642.403139] [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
[11642.404360] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.406187] [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
[11642.407070] [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
[11642.407990] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.409192] [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
[11642.410146] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.411291] [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
[11642.412263] [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
[11642.413057] [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
[11642.413897] [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
[11642.414708] [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
[11642.415573] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.416572] 1 lock held by fdm-stress/26850:
[11642.417345] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
[11642.419698] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.421807] fdm-stress D ffff880196483d28 0 26851 26591 0x00000000
[11642.422878] ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
[11642.424149] ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
[11642.425374] ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
[11642.426591] Call Trace:
[11642.427013] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.427856] [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
[11642.428852] [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
[11642.429743] [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.430911] [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.432102] [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
[11642.433259] [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.434431] [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
[11642.436079] [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
[11642.437009] [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
[11642.437860] [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
[11642.438723] [<ffffffff81171435>] iterate_supers+0x75/0xc2
[11642.439597] [<ffffffff81197d00>] sys_sync+0x52/0x80
[11642.440454] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.441533] 3 locks held by fdm-stress/26851:
[11642.442370] #0: (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
[11642.444043] #1: (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
[11642.446010] #2: (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
This happened because under specific timings the path for direct IO reads
can deadlock with concurrent buffered writes. The diagram below shows how
this happens for an example file that has the following layout:
[ extent A ] [ extent B ] [ ....
0K 4K 8K
CPU 1:
  DIO read against range [0K, 8K[ starts
  btrfs_direct_IO()
    --> calls btrfs_get_blocks_direct() which finds the extent map for
        extent A and leaves the range [0K, 4K[ locked in the inode's
        io tree

CPU 2:
  buffered write against range [4K, 8K[ starts
  __btrfs_buffered_write()
    --> dirties page at 4K

CPU 3:
  a user space task calls sync (for example) or writepages() is invoked
  by mm
  writepages()
    run_delalloc_range()
      cow_file_range()
        --> ordered extent X for the buffered write is created and
            writeback starts

CPU 1:
  --> calls btrfs_get_blocks_direct() again, without submitting first a
      bio for reading extent A, and finds the extent map for extent B
  --> calls lock_extent_direct()
      --> locks range [4K, 8K[
      --> finds ordered extent X covering range [4K, 8K[
      --> unlocks range [4K, 8K[

CPU 2:
  buffered write against range [0K, 8K[ starts
  __btrfs_buffered_write()
    prepare_pages()
      --> locks pages with offsets 0 and 4K
    lock_and_cleanup_extent_if_need()
      --> blocks attempting to lock range [0K, 8K[ in the inode's io
          tree, because the range [0, 4K[ is already locked by the
          direct IO task at CPU 1

CPU 1:
  --> calls btrfs_start_ordered_extent(oe X)
  btrfs_start_ordered_extent(oe X)
    --> At this point writeback for ordered extent X has not finished yet
    filemap_fdatawrite_range()
      btrfs_writepages()
        extent_writepages()
          extent_write_cache_pages()
            --> finds page with offset 0 with the writeback tag (and not
                dirty)
            --> tries to lock it
            --> deadlock, task at CPU 2 has the page locked and is
                blocked on the io range [0, 4K[ that was locked earlier
                by this task
So fix this by falling back to a buffered read in the direct IO read path
when an ordered extent for a buffered write is found.
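A hedged sketch of that fallback (placeholder names, not the exact btrfs DIO setup code; the specific error value used to trigger the fallback is shown as -ENOTBLK for illustration):
static int lock_range_for_dio_read(struct example_inode *inode,
                                   u64 start, u64 end)
{
    example_lock_io_range(inode, start, end);
    if (!example_range_has_ordered_extent(inode, start, end))
        return 0;              /* safe to issue the direct IO read */

    example_unlock_io_range(inode, start, end);
    /*
     * An ordered extent from a buffered write covers part of the range.
     * Waiting for its writeback here can deadlock with a buffered write
     * that holds page locks, so make the caller fall back to a buffered
     * read instead.
     */
    return -ENOTBLK;
}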
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When using the same file as the source and destination for a dedup
(extent_same ioctl) operation, we were allowing the destination offset
to go beyond the file's size, which doesn't make sense and is not
allowed when the source and destination files are not the same file.
This made the deduplication operation succeed only when the source range
corresponded to a hole, a prealloc extent or an extent with all bytes
having a value of 0x00. It also left a file hole (between i_size and the
destination offset) without the corresponding file extent items, which
can be reproduced with the following steps for example:
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
wrote 404990/404990 bytes at offset 304457
395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)
$ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
Deduping 2 total files
(28672, 24576): /mnt/sdi/foobar
(929792, 24576): /mnt/sdi/foobar
1 files asked to be deduped
i: 0, status: 0, bytes_deduped: 24576
24576 total bytes deduped in this operation
$ umount /mnt/sdi
$ btrfsck /dev/sdi
Checking filesystem on /dev/sdi
UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
checking extents
checking free space cache
checking fs roots
root 5 inode 257 errors 100, file extent discount
Found file extent holes:
start: 712704, len: 217088
found 540673 bytes used err is 1
total csum bytes: 400
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123675
file data blocks allocated: 671744
referenced 671744
btrfs-progs v4.2.3
So fix this by not allowing the destination to go beyond the file's size,
just as we do for the case where the source and destination files are not
the same.
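A small sketch of the added check (placeholder names for the inode/size accessors):
static int example_check_dedup_range(struct example_inode *src, u64 src_off,
                                     struct example_inode *dst, u64 dst_off,
                                     u64 len)
{
    if (src_off + len > example_i_size(src))
        return -EINVAL;
    /* this check was missing when src and dst are the same inode */
    if (dst_off + len > example_i_size(dst))
        return -EINVAL;
    return 0;
}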
A test for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We have two cases where we end up deleting a file at log replay time
when we should not. For this to happen the file must have been renamed
and a directory inode must have been fsynced/logged.
Two examples that exercise these two cases are listed below.
Case 1)
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir -p /mnt/a/b
$ mkdir /mnt/c
$ touch /mnt/a/b/foo
$ sync
$ mv /mnt/a/b/foo /mnt/c/
# Create file bar just to make sure the fsync on directory a/ does
# something and it's not a no-op.
$ touch /mnt/a/bar
$ xfs_io -c "fsync" /mnt/a
< power fail / crash >
The next time the filesystem is mounted, the log replay procedure
deletes file foo.
Case 2)
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/a
$ mkdir /mnt/b
$ mkdir /mnt/c
$ touch /mnt/a/foo
$ ln /mnt/a/foo /mnt/b/foo_link
$ touch /mnt/b/bar
$ sync
$ unlink /mnt/b/foo_link
$ mv /mnt/b/bar /mnt/c/
$ xfs_io -c "fsync" /mnt/a/foo
< power fail / crash >
The next time the filesystem is mounted, the log replay procedure
deletes file bar.
The files are deleted because, when we log inodes other than the fsync
target inode, we ignore their last_unlink_trans value and leave the log
without enough information to later replay the rename operations.
So fix this by looking at the last_unlink_trans values and falling back
to a transaction commit when they are greater than the id of the last
committed transaction.
Also, when logging other inodes (for case 1 we logged descendants of the
fsync target inode, while for case 2 we logged ancestors) we need to care
about concurrent tasks updating the last_unlink_trans of inodes we are
logging (which was already an existing problem in
check_parent_dirs_for_sync()). Since we cannot acquire their inode mutex
(vfs' struct inode ->i_mutex), as that causes deadlocks with other
concurrent operations that acquire the i_mutex of 2 inodes (other fsyncs
or renames for example), we need to serialize on the log_mutex of the
inode we are logging. A task setting a new value for an inode's
last_unlink_trans must acquire the inode's log_mutex and it must do this
update before doing the actual unlink operation (which is already the
case except when deleting a snapshot). Conversely, the task logging the
inode must first log the inode and then check the inode's
last_unlink_trans value while holding its log_mutex: if that value is not
greater than the id of the last committed transaction, it logged a safe
state of the inode's items; otherwise the state it has logged might not
be safe (the concurrent task might have just updated last_unlink_trans
but hasn't done yet the unlink operation) and therefore a transaction
commit must be done.
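A hedged sketch of the resulting protocol (placeholder names; the real fields are the btrfs inode's log_mutex and last_unlink_trans):
/* Writer side: update last_unlink_trans under log_mutex, before the unlink. */
static void example_record_unlink(struct example_inode *inode, u64 transid)
{
    example_mutex_lock(&inode->log_mutex);
    inode->last_unlink_trans = transid;
    example_mutex_unlock(&inode->log_mutex);
}

/* Logging side: log first, then check last_unlink_trans under log_mutex. */
static bool example_logged_state_is_safe(struct example_inode *inode,
                                         u64 last_committed_transid)
{
    bool safe;

    example_mutex_lock(&inode->log_mutex);
    example_log_inode_items(inode);
    safe = inode->last_unlink_trans <= last_committed_transid;
    example_mutex_unlock(&inode->log_mutex);

    return safe;   /* false => fall back to a full transaction commit */
}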
Test cases for xfstests follow in separate patches.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If we delete a snapshot, fsync its parent directory and crash/power fail
before the next transaction commit, on the next mount when we attempt to
replay the log tree of the root containing the parent directory we will
fail and prevent the filesystem from mounting, which is solvable by wiping
out the log trees with the btrfs-zero-log tool but very inconvenient as
we will lose any data and metadata fsynced before the parent directory
was fsynced.
For example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ btrfs subvolume snapshot /mnt /mnt/testdir/snap
$ btrfs subvolume delete /mnt/testdir/snap
$ xfs_io -c "fsync" /mnt/testdir
< crash / power failure and reboot >
$ mount /dev/sdc /mnt
mount: mount(2) failed: No such file or directory
And in dmesg/syslog we get the following message and trace:
[192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
[192066.363010] ------------[ cut here ]------------
[192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
[192066.367250] BTRFS: Transaction aborted (error -2)
[192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G W 4.4.0-rc6-btrfs-next-20+ #1
[192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[192066.380889] 0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
[192066.382561] ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
[192066.384191] ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
[192066.385827] Call Trace:
[192066.386373] [<ffffffff81257570>] dump_stack+0x4e/0x79
[192066.387387] [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2
[192066.388429] [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.389236] [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50
[192066.389884] [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.390621] [<ffffffff81184b55>] ? iput+0xb0/0x266
[192066.391200] [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[192066.391930] [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs]
[192066.392715] [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs]
[192066.393510] [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs]
[192066.394241] [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs]
[192066.394958] [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs]
[192066.395628] [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs]
[192066.396790] [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs]
[192066.397891] [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs]
[192066.398897] [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs]
[192066.399823] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
[192066.400739] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
[192066.401700] [<ffffffff811714b9>] mount_fs+0x67/0x131
[192066.402482] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
[192066.403930] [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs]
[192066.404831] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
[192066.405726] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
[192066.406621] [<ffffffff811714b9>] mount_fs+0x67/0x131
[192066.407401] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
[192066.408247] [<ffffffff8118ae36>] do_mount+0x893/0x9d2
[192066.409047] [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c
[192066.409842] [<ffffffff8118b187>] SyS_mount+0x75/0xa1
[192066.410621] [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b
[192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
[192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
[192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
[192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
[192066.444613] BTRFS: open_ctree failed
This happens because when we are replaying the log and processing the
directory entry pointing to the snapshot in the subvolume tree, we treat
its btrfs_dir_item item as having a location with a key type matching
BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
object id refers to a root number and not to an inode in the root
containing the parent directory.
So fix this by triggering a transaction commit if an fsync against the
parent directory is requested after deleting a snapshot. This is the
simplest approach for a rare use case. Any alternative that avoids the
transaction commit would require more code to explicitly delete the
snapshot at log replay time (factoring out common code from ioctl.c:
btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
log tree of the snapshot's root from the log root of the root of tree
roots, amongst other steps.
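One plausible way to trigger that commit, consistent with the last_unlink_trans/log_mutex scheme described in the previous patch (names here are placeholders, not the exact btrfs functions):
static void example_record_snapshot_destroy(struct example_trans *trans,
                                            struct example_inode *dir)
{
    /* mark the parent directory so that a later fsync of it cannot rely
     * on the log tree and falls back to committing the transaction */
    example_mutex_lock(&dir->log_mutex);
    dir->last_unlink_trans = trans->transid;
    example_mutex_unlock(&dir->log_mutex);
}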
A test case for xfstests that triggers the issue follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_target flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create a snapshot at the root of our filesystem (mount point path), delete it,
# fsync the mount point path, crash and mount to replay the log. This should
# succeed and after the filesystem is mounted the snapshot should not be visible
# anymore.
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT
_flakey_drop_and_remount
[ -e $SCRATCH_MNT/snap1 ] && \
echo "Snapshot snap1 still exists after log replay"
# Similar scenario as above, but this time the snapshot is created inside a
# directory and not directly under the root (mount point path).
mkdir $SCRATCH_MNT/testdir
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
_flakey_drop_and_remount
[ -e $SCRATCH_MNT/testdir/snap2 ] && \
echo "Snapshot snap2 still exists after log replay"
_unmount_flakey
echo "Silence is golden"
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6
Btrfs patchsets for 4.6
|
|
Commit 04b38d601239b4 ("vfs: pull btrfs clone API to vfs layer")
added a duplicated line (in cifsfs.c) which causes a sparse compile
warning.
Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
If the block size of a filesystem is smaller than PAGE_SIZE, then
at this point in time DAX cannot be used, because we can't
guarantee extents are page sized or aligned without further work.
Hence disallow setting the DAX flag on an inode if the block size is
too small. Also, be defensive and check the block size when reading
an inode in from disk.
In future, we want to allow DAX to work on any filesystem, so this
is temporary while we sort out the correct combination of extent size
hints and allocation alignment configurations needed to guarantee
page sized and aligned extent allocation for DAX enabled files.
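A minimal sketch of the constraint being enforced (self-contained illustration, not the actual xfs helpers):
static bool example_blocksize_allows_dax(unsigned int fs_blocksize,
                                         unsigned long page_size)
{
    /* extents must be page sized and page aligned for DAX mappings */
    return fs_blocksize >= page_size;
}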
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
When we set or clear the XFS_DIFLAG2_DAX flag, we should also
set/clear the S_DAX flag in the VFS inode. To do this, we need to
ensure that we first flush and remove any cached entries in the
radix tree to ensure the correct data access method is used when we
next try to read or write data. We have to be especially careful
here to lock out page faults so they don't race with the flush and
invalidation before we change the access mode.
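A hedged sketch of that sequence (simplified; the lock and helper usage is assumed from the description above rather than copied from the xfs source):
static int example_switch_dax_mode(struct xfs_inode *ip, struct inode *inode,
                                   bool want_dax)
{
    int error;

    xfs_ilock(ip, XFS_MMAPLOCK_EXCL);      /* keeps new page faults out */
    error = filemap_write_and_wait(inode->i_mapping);
    if (!error)
        error = invalidate_inode_pages2(inode->i_mapping);
    if (!error) {
        if (want_dax)                      /* mirrors XFS_DIFLAG2_DAX */
            inode->i_flags |= S_DAX;
        else
            inode->i_flags &= ~S_DAX;
    }
    xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
    return error;
}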
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Only regular files can use DAX for data operations, so we should
restrict setting it on the VFS inode to regular files. Setting it on
metadata inodes may cause the VFS to do the wrong thing for such
inodes, so avoid potential problems by restricting the scope of the
flag to what we know is supported.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Only file data can use DAX, so we should only be able to set this
flag on regular files. However, the flag also serves as an "inherit"
flag at file create time when set on directories, so limit the
FS_IOC_FSSETXATTR ioctl to only set this flag on regular files and
directories.
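As a sketch, the ioctl-side validation could look like this (the helper itself is illustrative; FS_XFLAG_DAX is the uapi xflag name):
static int example_validate_dax_xflag(const struct inode *inode, u32 xflags)
{
    /* FS_XFLAG_DAX is an inheritance hint on directories, real on files */
    if ((xflags & FS_XFLAG_DAX) &&
        !(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
        return -EINVAL;
    return 0;
}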
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
We need to finish doing the CRC checks before we can allow writes to
happen, and we currently process the inodes in order. This means a call
to jffs2_get_ino_cache() for each possible inode# up to c->highest_ino.
There may be a lot of lookups which fail, if the inode# space is used
sparsely. And the inode# space is *often* used sparsely, if a file
system contains a lot of stuff that was put there in the original
image, followed by lots of creation and deletion of new files.
Instead of processing them numerically with a lookup each time, just
walk the hash buckets instead.
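A hedged sketch of the bucket walk (field names as in jffs2's in-memory inode cache; exact details may differ):
static void example_for_each_inocache(struct jffs2_sb_info *c,
                                      void (*fn)(struct jffs2_inode_cache *ic))
{
    int bucket;

    for (bucket = 0; bucket < c->inocache_hashsize; bucket++) {
        struct jffs2_inode_cache *ic;

        /* every cached inode is on exactly one hash chain, so this
         * visits each one once without probing unused inode numbers */
        for (ic = c->inocache_list[bucket]; ic; ic = ic->next)
            fn(ic);
    }
}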
[fix locking typo reported by Dan Carpenter]
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
|
|
Games with ordering and barriers are way too brittle. Just
bump ->d_seq before and after updating ->d_inode and ->d_flags
type bits, so that verifying ->d_seq would guarantee they are
coherent.
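A hedged sketch of the idea (not the exact dcache code; dentry->d_seq is the existing seqcount used by RCU-walk):
static void example_set_inode_and_type(struct dentry *dentry,
                                       struct inode *inode,
                                       unsigned int type_flags)
{
    raw_write_seqcount_begin(&dentry->d_seq);   /* readers start retrying */
    dentry->d_inode = inode;
    dentry->d_flags = (dentry->d_flags & ~DCACHE_ENTRY_TYPE) | type_flags;
    raw_write_seqcount_end(&dentry->d_seq);     /* publish both together */
}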
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This issue is caused by commit 02323db17e3a7 ("cifs: fix
cifs_uniqueid_to_ino_t not to ever return 0"): when BITS_PER_LONG
is 64 on s390x, the corresponding cifs_uniqueid_to_ino_t()
function will cast the 64-bit fileid to 32 bits by using (ino_t)fileid,
because ino_t (a typedef of __kernel_ino_t) is an unsigned int there.
It's defined in arch/s390/include/uapi/asm/posix_types.h:
#ifndef __s390x__
typedef unsigned long __kernel_ino_t;
...
#else /* __s390x__ */
typedef unsigned int __kernel_ino_t;
So the #ifdef condition is wrong for s390x. We can still use a single
cifs_uniqueid_to_ino_t() function by comparing sizeof(ino_t) and
sizeof(u64) to choose the correct behaviour accordingly.
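A self-contained illustration of the size-based approach (stand-in types; in the kernel the real types are ino_t and u64):
#include <stdint.h>

typedef uint32_t example_ino_t;   /* ino_t is 32-bit on s390x */

static inline example_ino_t example_uniqueid_to_ino(uint64_t fileid)
{
    if (sizeof(example_ino_t) < sizeof(uint64_t)) {
        example_ino_t ino = (example_ino_t)fileid;

        /* fold the high bits in so distinct 64-bit ids rarely collide */
        ino ^= (example_ino_t)(fileid >>
            (sizeof(uint64_t) - sizeof(example_ino_t)) * 8);
        return ino;
    }
    return (example_ino_t)fileid;
}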
Signed-off-by: Yadan Fan <ydfan@suse.com>
CC: stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
For interim responses we only need to parse the header and update
the number of credits. Currently this is done for all SMB2+ commands
except SMB2_READ, which is wrong. Fix this by adding such processing.
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
When opening a file, SMB2_open() attempts to parse the lease state from the
SMB2 CREATE Response. However, the parsing code was not careful to ensure
that the create contexts are not empty or invalid, which can lead to
out-of-bounds memory access. This can be seen easily by trying
to read a file from an OSX 10.11 SMB3 server. Here is sample crash output:
BUG: unable to handle kernel paging request at ffff8800a1a77cc6
IP: [<ffffffff8828a734>] SMB2_open+0x804/0x960
PGD 8f77067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in:
CPU: 3 PID: 2876 Comm: cp Not tainted 4.5.0-rc3.x86_64.1+ #14
Hardware name: NETGEAR ReadyNAS 314 /ReadyNAS 314 , BIOS 4.6.5 10/11/2012
task: ffff880073cdc080 ti: ffff88005b31c000 task.ti: ffff88005b31c000
RIP: 0010:[<ffffffff8828a734>] [<ffffffff8828a734>] SMB2_open+0x804/0x960
RSP: 0018:ffff88005b31fa08 EFLAGS: 00010282
RAX: 0000000000000015 RBX: 0000000000000000 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff88007eb8c8b0
RBP: ffff88005b31fad8 R08: 666666203d206363 R09: 6131613030383866
R10: 3030383866666666 R11: 00000000000002b0 R12: ffff8800660fd800
R13: ffff8800a1a77cc2 R14: 00000000424d53fe R15: ffff88005f5a28c0
FS: 00007f7c8a2897c0(0000) GS:ffff88007eb80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8800a1a77cc6 CR3: 000000005b281000 CR4: 00000000000006e0
Stack:
ffff88005b31fa70 ffffffff88278789 00000000000001d3 ffff88005f5a2a80
ffffffff00000003 ffff88005d029d00 ffff88006fde05a0 0000000000000000
ffff88005b31fc78 ffff88006fde0780 ffff88005b31fb2f 0000000100000fe0
Call Trace:
[<ffffffff88278789>] ? cifsConvertToUTF16+0x159/0x2d0
[<ffffffff8828cf68>] smb2_open_file+0x98/0x210
[<ffffffff8811e80c>] ? __kmalloc+0x1c/0xe0
[<ffffffff882685f4>] cifs_open+0x2a4/0x720
[<ffffffff88122cef>] do_dentry_open+0x1ff/0x310
[<ffffffff88268350>] ? cifsFileInfo_get+0x30/0x30
[<ffffffff88123d92>] vfs_open+0x52/0x60
[<ffffffff88131dd0>] path_openat+0x170/0xf70
[<ffffffff88097d48>] ? remove_wait_queue+0x48/0x50
[<ffffffff88133a29>] do_filp_open+0x79/0xd0
[<ffffffff8813f2ca>] ? __alloc_fd+0x3a/0x170
[<ffffffff881240c4>] do_sys_open+0x114/0x1e0
[<ffffffff881241a9>] SyS_open+0x19/0x20
[<ffffffff8896e257>] entry_SYSCALL_64_fastpath+0x12/0x6a
Code: 4d 8d 6c 07 04 31 c0 4c 89 ee e8 47 6f e5 ff 31 c9 41 89 ce 44 89 f1 48 c7 c7 28 b1 bd 88 31 c0 49 01 cd 4c 89 ee e8 2b 6f e5 ff <45> 0f b7 75 04 48 c7 c7 31 b1 bd 88 31 c0 4d 01 ee 4c 89 f6 e8
RIP [<ffffffff8828a734>] SMB2_open+0x804/0x960
RSP <ffff88005b31fa08>
CR2: ffff8800a1a77cc6
---[ end trace d9f69ba64feee469 ]---
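As a hedged sketch of the kind of validation needed (all structure and field names here are made up for illustration, not the cifs ones):
#include <stddef.h>
#include <stdint.h>

struct example_ctx_hdr {            /* stand-in for an SMB2 create context */
    uint32_t next_offset;           /* 0 marks the last context */
    /* name/data offsets and lengths follow in the real wire format */
};

static int example_walk_create_contexts(const uint8_t *rsp, size_t rsp_len,
                                        size_t ctx_off, size_t ctx_len)
{
    size_t pos = ctx_off;

    if (ctx_len == 0)
        return 0;                               /* nothing to parse */
    if (ctx_off > rsp_len || ctx_len > rsp_len - ctx_off)
        return -1;                              /* points outside the response */

    while (pos + sizeof(struct example_ctx_hdr) <= ctx_off + ctx_len) {
        const struct example_ctx_hdr *h =
            (const struct example_ctx_hdr *)(rsp + pos);

        /* ... inspect the context (e.g. the lease state) here ... */

        if (h->next_offset == 0)
            break;
        pos += h->next_offset;
    }
    return 0;
}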
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
|
|
When AIO DIO fails e.g. due to IO error, we must not convert unwritten
extents as that will expose uninitialized data. Handle this case
by clearing unwritten flag from io_end in case of error and thus
preventing extent conversion.
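A hedged sketch of the completion-side logic (placeholder names and flags; the real change lives in the filesystem's DIO end_io handling):
static void example_dio_end_io(struct example_ioend *ioend, int error)
{
    if (error)
        ioend->flags &= ~EXAMPLE_IOEND_UNWRITTEN;  /* keep extents unwritten */

    if (ioend->flags & EXAMPLE_IOEND_UNWRITTEN)
        example_convert_unwritten_extents(ioend);
    example_complete_ioend(ioend, error);
}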
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_last(): ELOOP failure exit should be done after leaving RCU mode
should_follow_link(): validate ->d_seq after having decided to follow
namei: ->d_inode of a pinned dentry is stable only for positives
do_last(): don't let a bogus return value from ->open() et.al. to confuse us
fs: return -EOPNOTSUPP if clone is not supported
hpfs: don't truncate the file when delete fails
|
|
... or we risk seeing a bogus value of d_is_symlink() there.
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... otherwise d_is_symlink() above might have nothing to do with
the inode value we've got.
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
both do_last() and walk_component() risk picking a NULL inode out
of dentry about to become positive, *then* checking its flags and
seeing that it's not negative anymore and using (already stale by
then) value they'd fetched earlier. Usually ends up oopsing soon
after that...
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... into returning a positive to path_openat(), which would interpret that
as "symlink had been encountered" and proceed to corrupt memory, etc.
It can only happen due to a bug in some ->open() instance or in some LSM
hook, etc., so we report any such event *and* make sure it doesn't trick
us into further unpleasantness.
Cc: stable@vger.kernel.org # v3.6+, at least
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
-EBADF is a rather confusing error if an operation is not supported,
and nfsd gets rather upset about it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The delete operation can allocate additional space on the HPFS filesystem
due to btree split. The HPFS driver checks in advance if there is
available space, so that it won't corrupt the btree if we run out of space
during splitting.
If there is not enough available space, the HPFS driver attempted to
truncate the file, but this results in a deadlock since the commit
7dd29d8d865efdb00c0542a5d2c87af8c52ea6c7 ("HPFS: Introduce a global mutex
and lock it on every callback from VFS").
This patch removes the code that tries to truncate the file and -ENOSPC is
returned instead. If the user hits -ENOSPC on delete, he should try to
delete other files (that are stored in a leaf btree node), so that the
delete operation will make some space for deleting the file stored in
non-leaf btree node.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Merge fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
dax: move writeback calls into the filesystems
dax: give DAX clearing code correct bdev
ext4: online defrag not supported with DAX
ext2, ext4: only set S_DAX for regular inodes
block: disable block device DAX by default
ocfs2: unlock inode if deleting inode from orphan fails
mm: ASLR: use get_random_long()
drivers: char: random: add get_random_long()
mm: numa: quickly fail allocations for NUMA balancing on full nodes
mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext2/4 DAX fix from Ted Ts'o:
"This fixes a file system corruption bug with DAX"
* tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()
|
|
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry. For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect. The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.
Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().
Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.
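A sketch of the resulting page_mkwrite handler (hedged; prototypes follow the fault API of this kernel generation):
static int example_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    /* ext4_dax_fault() starts a journal handle when block allocation may
     * be needed and then calls __dax_fault(), so reuse it here */
    return ext4_dax_fault(vma, vmf);
}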
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.
Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response
to sync(2).
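A hedged sketch of a filesystem ->writepages hook after this change (the bdev-choosing helper is illustrative; the three-argument dax_writeback_mapping_range() form is the one this series introduces):
static int example_dax_writepages(struct address_space *mapping,
                                  struct writeback_control *wbc)
{
    /* the filesystem, not the generic code, knows the right device:
     * e.g. a raw block device or an XFS real-time file */
    struct block_device *bdev = example_pick_bdev(mapping->host);

    return dax_writeback_mapping_range(mapping, bdev, wbc);
}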
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases. This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.
Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block. This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption. This was observed with xfstests ext4/307 and
ext4/308.
Fix this by only allowing online defrag for non-DAX files.
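A small sketch of the guard (helper name illustrative; the check sits at the start of the move-extents path):
static int example_defrag_check_dax(struct inode *orig_inode,
                                    struct inode *donor_inode)
{
    /* move_extent_per_page() goes through the page cache, which DAX
     * bypasses, so refuse instead of risking data corruption */
    if (IS_DAX(orig_inode) || IS_DAX(donor_inode))
        return -EOPNOTSUPP;
    return 0;
}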
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. Any dirty data associated
with the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().
With the current code, though, this isn't always true. For example,
ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
data stored as dirty page cache entries. For these types of inodes,
having S_DAX set doesn't really make sense since their I/O doesn't
actually happen through the DAX code path.
Instead, only allow S_DAX to be set for regular inodes for ext2 and
ext4. This allows us to have strict DAX vs non-DAX paths in the
writeback code.
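A sketch of the flag-setting rule (the mount-option parameter is illustrative):
static void example_set_dax_inode_flag(struct inode *inode, bool dax_mount)
{
    /* only regular files take the DAX I/O path; directories and special
     * inodes keep using the page cache for their dirty data */
    if (dax_mount && S_ISREG(inode->i_mode))
        inode->i_flags |= S_DAX;
    else
        inode->i_flags &= ~S_DAX;
}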
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The recent *sync enabling discovered that we are inserting into the
block_device pagecache contrary to the expectations of the dirty data
tracking for dax mappings. This can lead to data corruption.
We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.
dump_stack+0x85/0xc2
dax_writeback_mapping_range+0x60/0xe0
blkdev_writepages+0x3f/0x50
do_writepages+0x21/0x30
__filemap_fdatawrite_range+0xc6/0x100
filemap_write_and_wait+0x4a/0xa0
set_blocksize+0x70/0xd0
sb_set_blocksize+0x1d/0x50
ext4_fill_super+0x75b/0x3360
mount_bdev+0x180/0x1b0
ext4_mount+0x15/0x20
mount_fs+0x38/0x170
Mark the support broken so it's disabled by default, but otherwise still
available for testing.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When doing append direct io cleanup, if deleting the inode fails, it
returns without unlocking the inode, which will cause a deadlock on the inode.
This issue was introduced by commit cf1776a9e834 ("ocfs2: fix a tiny
race when truncate dio orohaned entry").
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long(). Also address a shifting bug which, in
the case of x86, removed the entropy mask for mmap_rnd_bits values > 31 bits.
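A sketch of the intended use in an arch's mmap randomization helper (parameter names are illustrative):
static unsigned long example_mmap_rnd(unsigned int mmap_rnd_bits,
                                      unsigned int page_shift)
{
    /* draw a full unsigned long of entropy, then mask; no 32-bit cast */
    unsigned long rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);

    return rnd << page_shift;
}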
Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When testing with fsstress, kworker and user threads were both blocked:
INFO: task kworker/u16:1:16580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:1 D ffff8803f2595390 0 16580 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-251:0)
ffff8802730e5760 0000000000000046 ffff880274729fc0 0000000000012440
ffff8802730e5fd8 ffff8802730e4010 0000000000012440 0000000000012440
ffff8802730e5fd8 0000000000012440 ffff880274729fc0 ffff88026eb50000
Call Trace:
[<ffffffff816fe9d9>] schedule+0x29/0x70
[<ffffffff816ff895>] rwsem_down_read_failed+0xa5/0xf9
[<ffffffff81378584>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0694feb>] f2fs_write_data_page+0x31b/0x420 [f2fs]
[<ffffffffa0690f1a>] __f2fs_writepage+0x1a/0x50 [f2fs]
[<ffffffffa06922a0>] f2fs_write_data_pages+0xe0/0x290 [f2fs]
[<ffffffff811473b3>] do_writepages+0x23/0x40
[<ffffffff811cc3ee>] __writeback_single_inode+0x4e/0x250
[<ffffffff811cd4f1>] writeback_sb_inodes+0x2c1/0x470
[<ffffffff811cd73e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff811cda0b>] wb_writeback+0x1fb/0x2d0
[<ffffffff811cdb7c>] wb_do_writeback+0x9c/0x220
[<ffffffff811ce232>] bdi_writeback_workfn+0x72/0x1c0
[<ffffffff8106b74e>] process_one_work+0x1de/0x5b0
[<ffffffff8106e78f>] worker_thread+0x11f/0x3e0
[<ffffffff810750ce>] kthread+0xde/0xf0
[<ffffffff817093f8>] ret_from_fork+0x58/0x90
fsstress thread stack:
[<ffffffff81139f0e>] sleep_on_page+0xe/0x20
[<ffffffff81139ef7>] __lock_page+0x67/0x70
[<ffffffff8113b100>] find_lock_page+0x50/0x80
[<ffffffff8113b24f>] find_or_create_page+0x3f/0xb0
[<ffffffffa06983a9>] sync_node_pages+0x259/0x810 [f2fs]
[<ffffffffa068d874>] write_checkpoint+0x1a4/0xce0 [f2fs]
[<ffffffffa0686b0c>] f2fs_sync_fs+0x7c/0xd0 [f2fs]
[<ffffffffa067c813>] f2fs_sync_file+0x143/0x5f0 [f2fs]
[<ffffffff811d301b>] vfs_fsync_range+0x2b/0x40
[<ffffffff811d304c>] vfs_fsync+0x1c/0x20
[<ffffffff811d3291>] do_fsync+0x41/0x70
[<ffffffff811d32d3>] SyS_fdatasync+0x13/0x20
[<ffffffff817094a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
The reason for this issue is:
CPU0:                                  CPU1:
 - f2fs_write_data_pages
                                        - f2fs_sync_fs
                                         - write_checkpoint
                                          - block_operations
                                           - f2fs_lock_all
                                            - down_write(sbi->cp_rwsem)
 - lock_page(page)
 - f2fs_write_data_page
                                            - sync_node_pages
                                             - flush_inline_data
                                              - pagecache_get_page(page, GFP_LOCK)
 - f2fs_lock_op
  - down_read(sbi->cp_rwsem)
This patch changes flush_inline_data to use trylock_page to fix this ABBA
deadlock issue.
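A hedged sketch of the changed behaviour inside flush_inline_data (surrounding details elided):
static void example_flush_inline_page(struct page *page)
{
    /* cp_rwsem is held for writing here, so never sleep on a page lock:
     * the page owner may be waiting for cp_rwsem (the ABBA above) */
    if (!trylock_page(page))
        return;

    /* ... write the inline data back while the page lock is held ... */

    unlock_page(page);
}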
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add a new helper f2fs_flush_merged_bios to clean up redundant code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add a new helper f2fs_update_data_blkaddr to clean up redundant code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
For now, flow of GCing an encrypted data page:
1) try to grab meta page in meta inode's mapping with index of old block
address of that data page
2) load data of ciphertext into meta page
3) allocate new block address
4) write the meta page into new block address
5) update block address pointer in direct node page.
Other readers/writers will use f2fs_wait_on_encrypted_page_writeback to
check for and wait on GCed encrypted data cached in the meta page being
written back, in order to avoid inconsistency among the data page cache,
the meta page cache and the on-disk data when updating.
However, we will use the new block address updated in step 5) as an index
to look up the meta page in the inner bio buffer. That is wrong, and we
will never find the GCing meta page, since we used the old block address
as the index of that page in step 1).
This patch fixes the issue by swapping the order of step 1) and step 3),
so that step 1) grabs the page with the index generated in step 3).
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Al Viro has cleaned up the way ops are processed and waited for, and
orangefs.txt now has an overview of how it works. Several recent
related commits have added to the comments in the code as well.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
orangefs contains a helper function to calculate the difference
between two timeval structures. We are trying to remove all
instances of timespec from the kernel, and this one is not
used at all, so let's remove it now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
The new orangefs code uses a helper function to copy a time field from
struct iattr into its private structures. This will conflict with the
move to 64-bit timestamps in the kernel and is generally not necessary.
This replaces the conversion with a simple cast to time64_t that shows
what is going on. As the orangefs-internal representation already uses
64-bit timestamps, there should be no ambiguity to negative values,
and the cast ensures that we treat them as times before 1970 on both
32-bit and 64-bit architectures, rather than times after 2038. This
patch keeps that behavior.
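A sketch of the resulting conversion (the attribute structure and field names are stand-ins for the orangefs-internal ones):
static inline void example_copy_times_from_iattr(struct example_attr *attr,
                                                 const struct iattr *iattr)
{
    /* a plain cast keeps pre-1970 values negative on both 32-bit and
     * 64-bit architectures instead of wrapping past 2038 */
    attr->atime = (time64_t)iattr->ia_atime.tv_sec;
    attr->mtime = (time64_t)iattr->ia_mtime.tv_sec;
    attr->ctime = (time64_t)iattr->ia_ctime.tv_sec;
}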
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
|
|
# Conflicts:
# fs/btrfs/file.c
|