Age | Commit message (Collapse) | Author |
|
Commit 88740da18[1] introduced a function to compute the maximum
height of the inode btree back in 1994. Back then, apparently, the
freespace and inode btrees shared the same geometry; however, it has
long since been the case that the inode and freespace btrees have
different record and key sizes. Therefore, we must use m_inobt_mnr if
we want a correct calculation/log reservation/etc.
(Yes, this bug has been around for 21 years and ten months.)
(Yes, I was in middle school when this bug was committed.)
[1] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=88740da18ddd9d7ba3ebaa9502fefc6ef2fd19cd
Historical-research-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
If a crash occurs immediately after a filesystem grow operation, the
updated superblock geometry is found only in the log. After we
recover the log, the superblock is reread and re-initialised and so
has the new geometry in memory. If the new geometry has more AGs
than prior to the grow operation, then the new AGs will not have
in-memory xfs_perag structurea associated with them.
This will result in an oops when the first metadata buffer from a
new AG is looked up in the buffer cache, as the block lies within
the new geometry but then fails to find a perag structure on lookup.
This is easily fixed by simply re-initialising the perag structure
after re-reading the superblock at the conclusion of the first pahse
of log recovery.
This, however, does not fix the case of log recovery requiring
access to metadata in the newly grown space. Fortunately for us,
because the in-core superblock has not been updated, this will
result in detection of access beyond the end of the filesystem
and so recovery will fail at that point. If this proves to be
a problem, then we can address it separately to the current
reported issue.
Reported-by: Alex Lyakas <alex@zadarastorage.com>
Tested-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
|
|
XFS uses CRC verification over a sub-range of the head of the log to
detect and handle torn writes. This torn log write detection currently
runs unconditionally at mount time, regardless of whether the log is
dirty or clean. This is problematic in cases where a filesystem might
end up being moved across different, incompatible (i.e., opposite
byte-endianness) architectures.
The problem lies in the fact that log data is not necessarily written in
an architecture independent format. For example, certain bits of data
are written in native endian format. Further, the size of certain log
data structures differs (i.e., struct xlog_rec_header) depending on the
word size of the cpu. This leads to false positive crc verification
errors and ultimately failed mounts when a cleanly unmounted filesystem
is mounted on a system with an incompatible architecture from data that
was written near the head of the log.
Update the log head/tail discovery code to run torn write detection only
when the log is not clean. This means something other than an unmount
record resides at the head of the log and log recovery is imminent. It
is a requirement to run log recovery on the same type of host that had
written the content of the dirty log and therefore CRC failures are
legitimate corruptions in that scenario.
Reported-by: Jan Beulich <JBeulich@suse.com>
Tested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Once the record at the head of the log is identified and verified, the
in-core log state is updated based on the record. This includes
information such as the current head block and cycle, the start block of
the last record written to the log, the tail lsn, etc.
Once torn write detection is conditional, this logic will need to be
reused. Factor the code to update the in-core log data structures into a
new helper function. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Once the mount sequence has identified the head and tail blocks of the
physical log, the record at the head of the log is located and examined
for an unmount record to determine if the log is clean. This currently
occurs after torn write verification of the head region of the log.
This must ultimately be separated from torn write verification and may
need to be called again if the log head is walked back due to a torn
write (to determine whether the new head record is an unmount record).
Separate this logic into a new helper function. This patch does not
change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The code that locates the log record at the head of the log is buried in
the log head verification function. This is fine when torn write
verification occurs unconditionally, but this behavior is problematic
for filesystems that might be moved across systems with different
architectures.
In preparation for separating examination of the log head for unmount
records from torn write detection, lift the record location logic out of
the log verification function and into the caller. This patch does not
change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fix from Sage Weil:
"This is a final commit we missed to align the protocol compatibility
with the feature bits.
It decodes a few extra fields in two different messages and reports
EIO when they are used (not yet supported)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
|
|
Replace the current NULL-terminated array of default groups with a linked
list. This gets rid of lots of nasty code to size and/or dynamically
allocate the array.
While we're at it also provide a conveniant helper to remove the default
groups.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Felipe Balbi <balbi@kernel.org> [drivers/usb/gadget]
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
|
|
... and kill need_lookup thing
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
bugger off on negatives a bit earlier, simplify the tests
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
for the sake of namei.c fixes
|
|
Pull block fixes from Jens Axboe:
"Round 2 of this. I cut back to the bare necessities, the patch is
still larger than it usually would be at this time, due to the number
of NVMe fixes in there. This pull request contains:
- The 4 core fixes from Ming, that fix both problems with exceeding
the virtual boundary limit in case of merging, and the gap checking
for cloned bio's.
- NVMe fixes from Keith and Christoph:
- Regression on larger user commands, causing problems with
reading log pages (for instance). This touches both NVMe,
and the block core since that is now generally utilized also
for these types of commands.
- Hot removal fixes.
- User exploitable issue with passthrough IO commands, if !length
is given, causing us to fault on writing to the zero
page.
- Fix for a hang under error conditions
- And finally, the current series regression for umount with cgroup
writeback, where the final flush would happen async and hence open
up window after umount where the device wasn't consistent. fsck
right after umount would show this. From Tejun"
* 'for-linus2' of git://git.kernel.dk/linux-block:
block: support large requests in blk_rq_map_user_iov
block: fix blk_rq_get_max_sectors for driver private requests
nvme: fix max_segments integer truncation
nvme: set queue limits for the admin queue
writeback: flush inode cgroup wb switches instead of pinning super_block
NVMe: Fix 0-length integrity payload
NVMe: Don't allow unsupported flags
NVMe: Move error handling to failed reset handler
NVMe: Simplify device reset failure
NVMe: Fix namespace removal deadlock
NVMe: Use IDA for namespace disk naming
NVMe: Don't unmap controller registers on reset
block: merge: get the 1st and last bvec via helpers
block: get the 1st and last bvec via helpers
block: check virt boundary in bio_will_gap()
block: bio: introduce helpers to get the 1st and last bvec
|
|
Pull jffs2 fixes from David Woodhouse:
"This contains two important JFFS2 fixes marked for stable:
- a lock ordering problem between the page lock and the internal
f->sem mutex, which was causing occasional deadlocks in garbage
collection
- a scan failure causing moved directories to sometimes end up
appearing to have hard links.
There are also a couple of trivial MAINTAINERS file updates"
* tag 'for-linus-20160304' of git://git.infradead.org/linux-mtd:
MAINTAINERS: add maintainer entry for FREESCALE GPMI NAND driver
Fix directory hardlinks from deleted directories
jffs2: Fix page lock / f->sem deadlock
Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin"
MAINTAINERS: update Han's email
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
"Filipe nailed down a problem where tree log replay would do some work
that orphan code wasn't expecting to be done yet, leading to BUG_ON"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix loading of orphan roots leading to BUG_ON
|
|
Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This adds a flag that tells the file system that this is a high priority
request for which it's worth to poll the hardware. The flag is purely
advisory and can be ignored if not supported.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
New syscalls that take an flag argument. No flags are added yet in this
patch.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This way we can set kiocb flags also from the sync read/write path for
the read_iter/write_iter operations. For now there is no way to pass
flags to plain read/write operations as there is no real need for that,
and all flags passed are explicitly rejected for these files.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
When looking for orphan roots during mount we can end up hitting a
BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is
replayed and qgroups are enabled. This is because after a log tree is
replayed, a transaction commit is made, which triggers qgroup extent
accounting which in turn does backref walking which ends up reading and
inserting all roots in the radix tree fs_info->fs_root_radix, including
orphan roots (deleted snapshots). So after the log tree is replayed, when
finding orphan roots we hit the BUG_ON with the following trace:
[118209.182438] ------------[ cut here ]------------
[118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
[118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse
processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata
virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G W 4.5.0-rc5-btrfs-next-24+ #1
[118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
[118209.186318] RIP: 0010:[<ffffffffa04237d7>] [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318] RSP: 0018:ffff8800af34faa8 EFLAGS: 00010246
[118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
[118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
[118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
[118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
[118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
[118209.186318] FS: 00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
[118209.186318] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
[118209.186318] Stack:
[118209.186318] fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
[118209.186318] fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
[118209.186318] 0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
[118209.186318] Call Trace:
[118209.186318] [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
[118209.186318] [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
[118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318] [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
[118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
[118209.186318] [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
[118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131
[118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
[118209.186318] [<ffffffff81195637>] do_mount+0x8a6/0x9e8
[118209.186318] [<ffffffff8119598d>] SyS_mount+0x77/0x9f
[118209.186318] [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
[118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b
4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
[118209.186318] RIP [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
[118209.186318] RSP <ffff8800af34faa8>
[118209.230735] ---[ end trace 83938f987d85d477 ]---
So fix this by not treating the error -EEXIST, returned when attempting
to insert a root already inserted by the backref walking code, as an error.
The following test case for xfstests reproduces the bug:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_target flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
_run_btrfs_util_prog quota enable $SCRATCH_MNT
# Create 2 directories with one file in one of them.
# We use these just to trigger a transaction commit later, moving the file from
# directory a to directory b and doing an fsync against directory a.
mkdir $SCRATCH_MNT/a
mkdir $SCRATCH_MNT/b
touch $SCRATCH_MNT/a/f
sync
# Create our test file with 2 4K extents.
$XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
# Create a snapshot and delete it. This doesn't really delete the snapshot
# immediately, just makes it inaccessible and invisible to user space, the
# snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
# which is woke up at the next transaction commit.
# A root orphan item is inserted into the tree of tree roots, so that if a
# power failure happens before the dedicated kernel thread does the snapshot
# deletion, the next time the filesystem is mounted it resumes the snapshot
# deletion.
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap
# Now overwrite half of the extents we wrote before. Because we made a snapshpot
# before, which isn't really deleted yet (since no transaction commit happened
# after we did the snapshot delete request), the non overwritten extents get
# referenced twice, once by the default subvolume and once by the snapshot.
$XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
# Now move file f from directory a to directory b and fsync directory a.
# The fsync on the directory a triggers a transaction commit (because a file
# was moved from it to another directory) and the file fsync leaves a log tree
# with file extent items to replay.
mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
echo "File digest before power failure:"
md5sum $SCRATCH_MNT/foobar | _filter_scratch
# Now simulate a power failure and mount the filesystem to replay the log tree.
# After the log tree was replayed, we used to hit a BUG_ON() when processing
# the root orphan item for the deleted snapshot. This is because when processing
# an orphan root the code expected to be the first code inserting the root into
# the fs_info->fs_root_radix radix tree, while in reallity it was the second
# caller attempting to do it - the first caller was the transaction commit that
# took place after replaying the log tree, when updating the qgroup counters.
_flakey_drop_and_remount
echo "File digest before after failure:"
# Must match what he got before the power failure.
md5sum $SCRATCH_MNT/foobar | _filter_scratch
_unmount_flakey
status=0
exit
Fixes: 2d9e97761087 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
block_dev's .writepages/.writepage already handles
wbc_init_bio/wbc_account_io. We only set the SB_I_CGROUPWB bit to
suppport writeback cgroup support.
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching. Before 5ff8eaac1636 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching. 5ff8eaac1636 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.
This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches. wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.
v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
check in inode_switch_wbs() as Jan an Al suggested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Overlayfs must update uid/gid after chown, otherwise functions
like inode_owner_or_capable() will check user against stale uid.
Catched by xfstests generic/087, it chowns file and calls utimes.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
|
|
After rename file dentry still holds reference to lower dentry from
previous location. This doesn't matter for data access because data comes
from upper dentry. But this stale lower dentry taints dentry at new
location and turns it into non-pure upper. Such file leaves visible
whiteout entry after remove in directory which shouldn't have whiteouts at
all.
Overlayfs already tracks pureness of file location in oe->opaque. This
patch just uses that for detecting actual path type.
Comment from Vivek Goyal's patch:
Here are the details of the problem. Do following.
$ mkdir upper lower work merged upper/dir/
$ touch lower/test
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=
work merged
$ mv merged/test merged/dir/
$ rm merged/dir/test
$ ls -l merged/dir/
/usr/bin/ls: cannot access merged/dir/test: No such file or directory
total 0
c????????? ? ? ? ? ? test
Basic problem seems to be that once a file has been unlinked, a whiteout
has been left behind which was not needed and hence it becomes visible.
Whiteout is visible because parent dir is of not type MERGE, hence
od->is_real is set during ovl_dir_open(). And that means ovl_iterate()
passes on iterate handling directly to underlying fs. Underlying fs does
not know/filter whiteouts so it becomes visible to user.
Why did we leave a whiteout to begin with when we should not have.
ovl_do_remove() checks for OVL_TYPE_PURE_UPPER() and does not leave
whiteout if file is pure upper. In this case file is not found to be pure
upper hence whiteout is left.
So why file was not PURE_UPPER in this case? I think because dentry is
still carrying some leftover state which was valid before rename. For
example, od->numlower was set to 1 as it was a lower file. After rename,
this state is not valid anymore as there is no such file in lower.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Viktor Stanchev <me@viktorstanchev.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=109611
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
|
|
ovl_remove_upper() should do d_drop() only after it successfully
removes the dir, otherwise a subsequent getcwd() system call will
fail, breaking userspace programs.
This is to fix: https://bugzilla.kernel.org/show_bug.cgi?id=110491
Signed-off-by: Rui Wang <rui.y.wang@intel.com>
Reviewed-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
|
|
This adds missing .d_select_inode into alternative dentry_operations.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Fixes: 7c03b5d45b8e ("ovl: allow distributed fs as lower layer")
Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Tested-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # 4.2+
|
|
When dqget() in __dquot_initialize() fails e.g. due to IO error,
__dquot_initialize() will pass an array of uninitialized pointers to
dqput_all() and thus can lead to deference of random data. Fix the
problem by properly initializing the array.
CC: stable@vger.kernel.org
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
f2fs_lock_all() calls down_write_nest_lock() to acquire a rw_sem and check
a mutex, but down_write_nest_lock() is designed for two rw_sem accoring to the
comment in include/linux/rwsem.h. And, other than f2fs, it is just called in
mm/mmap.c with two rwsem.
So, it looks it is used wrongly by f2fs. And, it causes the below compile
warning on -rt kernel too.
In file included from fs/f2fs/xattr.c:25:0:
fs/f2fs/f2fs.h: In function 'f2fs_lock_all':
fs/f2fs/f2fs.h:962:34: warning: passing argument 2 of 'down_write_nest_lock' from incompatible pointer type [-Wincompatible-pointer-types]
f2fs_down_write(&sbi->cp_rwsem, &sbi->cp_mutex);
^
fs/f2fs/f2fs.h:27:55: note: in definition of macro 'f2fs_down_write'
#define f2fs_down_write(x, y) down_write_nest_lock(x, y)
^
In file included from include/linux/rwsem.h:22:0,
from fs/f2fs/xattr.c:21:
include/linux/rwsem_rt.h:138:20: note: expected 'struct rw_semaphore *' but argument is of type 'struct mutex *'
static inline void down_write_nest_lock(struct rw_semaphore *sem,
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If f2fs was corrupted with missing dot dentries in root dirctory,
it needs to recover them after fsck.f2fs set F2FS_INLINE_DOTS flag
in directory inode when fsck.f2fs detects missing dot dentries.
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Yong Sheng <shengyong1@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Pull cifs fixes from Steve French:
"Various small CIFS/SMB3 fixes for stable:
Fixes address oops that can occur when accessing Macs with SMB3, and
another problem found to Samba when read responses queued (e.g. with
gluster under Samba)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix duplicate line introduced by clone_file_range patch
Fix cifs_uniqueid_to_ino_t() function for s390x
CIFS: Fix SMB2+ interim response processing for read requests
cifs: fix out-of-bounds access in lease parsing
|
|
The exit path will do some final updates to the VM of an exiting process
to inform others of the fact that the process is going away.
That happens, for example, for robust futex state cleanup, but also if
the parent has asked for a TID update when the process exits (we clear
the child tid field in user space).
However, at the time we do those final VM accesses, we've already
stopped accepting signals, so the usual "stop waiting for userfaults on
signal" code in fs/userfaultfd.c no longer works, and the process can
become an unkillable zombie waiting for something that will never
happen.
To solve this, just make handle_userfault() abort any user fault
handling if we're already in the exit path past the signal handling
state being dead (marked by PF_EXITING).
This VM special case is pretty ugly, and it is possible that we should
look at finalizing signals later (or move the VM final accesses
earlier). But in the meantime this is a fairly minimally intrusive fix.
Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We want the fixes in here, and others are sending us pull requests based
on this kernel tree.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull d_inode/d_flags race fix from Al Viro.
I love this fix. Not only does it fix the race in the dentry type
handling, it entirely gets rid of the nasty and subtle memory ordering
rules for d_type and d_inode, and replaces them with the basic dentry
locking rules (sequence numbers under RCU, d_lock elsewhere).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use ->d_seq to get coherency between ->d_inode and ->d_flags
|
|
Just use the t_blk_res field directly instead of obsfucating the reference
by a macro.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
inode32/inode64 allocator behavior with respect to mount, remount
and growfs is a little tricky.
The inode32 mount option should only enable the inode32 allocator
heuristics if the filesystem is large enough for 64-bit inodes to
exist. Today, it has this behavior on the initial mount, but a
remount with inode32 unconditionally changes the allocation
heuristics, even for a small fs.
Also, an inode32 mounted small filesystem should transition to the
inode32 allocator if the filesystem is subsequently grown to a
sufficient size. Today that does not happen.
This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a
single new function, and moves the "is the maximum inode number big
enough to matter" test into that function, so it doesn't rely on the
caller to get it right - which remount did not do, previously.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
busyp->bno is printed with a %llu format specifier when the
intention is to print a hexadecimal value. Trivial fix to
use %llx instead. Found with smatch static analysis:
fs/xfs/xfs_discard.c:229 xfs_discard_extents() warn: '0x'
prefix is confusing together with '%llu' specifier
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Perform basic sanitization of remount options by
passing the option string and a dummy mount structure
through xfs_parseargs and returning the result.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
This should be a no-op change, just switch to token parsing
like every other respectable filesystem does.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
This plugs 2 trivial leaks in xfs_attr_shortform_list and
xfs_attr3_leaf_list_int.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The maximum size of a backchannel message on RPC-over-RDMA depends
on the connection's inline threshold. Today that threshold is
typically 1024 bytes, making the maximum message size 996 bytes.
The Linux server's CREATE_SESSION operation checks that the size
of callback Calls can be as large as 1044 bytes, to accommodate
RPCSEC_GSS. Thus CREATE_SESSION fails if a client advertises the
true message size maximum of 996 bytes.
But the server's backchannel currently does not support RPCSEC_GSS.
The actual maximum size it needs is much smaller. It is safe to
reduce the limit to enable NFSv4.1 on RDMA backchannel operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
The server does indeed now support NFSv4.1 on RDMA transports. It
does not support shifting an RDMA-capable TCP transport (such as
iWARP) to RDMA mode.
Reported-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Remember free allocated client when meeting unsupported state protect how.
Fixes: 50c7b948adbd ("nfsd: minor consolidation of mach_cred handling code")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
A number of spots in the xdr decoding follow a pattern like
n = be32_to_cpup(p++);
READ_BUF(n + 4);
where n is a u32. The only bounds checking is done in READ_BUF itself,
but since it's checking (n + 4), it won't catch cases where n is very
large, (u32)(-4) or higher. I'm not sure exactly what the consequences
are, but we've seen crashes soon after.
Instead, just break these up into two READ_BUF()s.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
When logging that an inode exists, for example as part of a directory
fsync operation, we were collecting any ordered extents for the inode but
we ended up doing nothing with them except tagging them as processed, by
setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
collecting and processing them. This created a time window where a second
fsync against the inode, using the fast path, ended up not logging the
checksums for the new extents but it logged the extents since they were
part of the list of modified extents. This happened because the ordered
extents were not collected and checksums were not yet added to the csum
tree - the ordered extents have not gone through btrfs_finish_ordered_io()
yet (which is where we add them to the csum tree by calling
inode.c:add_pending_csums()).
So fix this by not collecting an inode's ordered extents if we are logging
it with the LOG_INODE_EXISTS mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
If we're about to do a fast fsync for an inode and btrfs_inode_in_log()
returns false, it's possible that we had an ordered extent in progress
(btrfs_finish_ordered_io() not run yet) when we noticed that the inode's
last_trans field was not greater than the id of the last committed
transaction, but shortly after, before we checked if there were any
ongoing ordered extents, the ordered extent had just completed and
removed itself from the inode's ordered tree, in which case we end up not
logging the inode, losing some data if a power failure or crash happens
after the fsync handler returns and before the transaction is committed.
Fix this by checking first if there are any ongoing ordered extents
before comparing the inode's last_trans with the id of the last committed
transaction - when it completes, an ordered extent always updates the
inode's last_trans before it removes itself from the inode's ordered
tree (at btrfs_finish_ordered_io()).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
In the listxattrs handler, we were not listing all the xattrs that are
packed in the same btree item, which happens when multiple xattrs have
a name that when crc32c hashed produce the same checksum value.
Fix this by processing them all.
The following test case for xfstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/attr
# real QA test starts here
_supported_fs generic
_supported_os Linux
_require_scratch
_require_attrs
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a few xattrs. The first 3 xattrs have a name
# that when given as input to a crc32c function result in the same checksum.
# This made btrfs list only one of the xattrs through listxattrs system call
# (because it packs xattrs with the same name checksum into the same btree
# item).
touch $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile
# Now call getfattr with --dump, which calls the listxattrs system call.
# It should list all the xattrs we have set before.
$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
While running a test with a mix of buffered IO and direct IO against
the same files I hit a deadlock reported by the following trace:
[11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
[11642.142452] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.146332] kworker/u32:3 D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
[11642.149771] 0 15282 2 0x00000000
[11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
[11642.154074] ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
[11642.156722] ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
[11642.159205] 0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
[11642.161403] Call Trace:
[11642.162129] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.163396] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.164871] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.167020] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.167931] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.182320] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.183762] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.185308] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.186782] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.188217] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.189626] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.190803] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.192158] [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.193379] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.194831] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.197068] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.199188] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.200723] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.202465] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.203836] [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.205624] [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.207057] [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.208529] [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.210375] [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
[11642.212132] [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
[11642.213837] [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
[11642.215457] [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
[11642.217095] [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.218324] [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.219466] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.220801] [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.222032] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.223190] [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.224394] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.226295] 2 locks held by kworker/u32:3/15282:
[11642.227273] #0: ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.229412] #1: ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
[11642.232872] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.235776] kworker/u32:8 D ffff88020de5f848 0 15289 2 0x00000000
[11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
[11642.238670] ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
[11642.240475] ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
[11642.242154] 0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
[11642.243715] Call Trace:
[11642.244390] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.245432] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.246392] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.247479] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.248551] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.249968] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.251043] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.252202] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.253210] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.254307] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.256118] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.257131] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.258200] [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.259168] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.260516] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.261841] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.263531] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.264747] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.266148] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.267264] [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.268280] [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
[11642.269407] [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
[11642.270476] [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
[11642.271547] [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
[11642.272588] [<ffffffff81194821>] wb_workfn+0x201/0x341
[11642.273523] [<ffffffff81194821>] ? wb_workfn+0x201/0x341
[11642.274479] [<ffffffff8106483e>] process_one_work+0x256/0x48b
[11642.275497] [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
[11642.276518] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.277520] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
[11642.278517] [<ffffffff8106a500>] kthread+0xd4/0xdc
[11642.279371] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.280468] [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
[11642.281607] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
[11642.282604] 3 locks held by kworker/u32:8/15289:
[11642.283423] #0: ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.285629] #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
[11642.287538] #2: (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
[11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
[11642.290547] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.292864] fdm-stress D ffff88022c107c20 0 26848 26591 0x00000000
[11642.294118] ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
[11642.295602] ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
[11642.297098] ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
[11642.298433] Call Trace:
[11642.298896] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.299738] [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
[11642.300833] [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.301943] [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
[11642.303270] [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
[11642.304552] [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
[11642.305782] [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
[11642.306878] [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
[11642.307729] [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
[11642.308602] [<ffffffff8116efbb>] SyS_write+0x50/0x7e
[11642.309410] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.310403] 3 locks held by fdm-stress/26848:
[11642.311108] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.312578] #1: (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.314170] #2: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
[11642.317842] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.319959] fdm-stress D ffff8801964ffa68 0 26849 26591 0x00000000
[11642.321312] ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
[11642.322555] ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
[11642.323715] ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
[11642.325096] Call Trace:
[11642.325532] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.326303] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.327180] [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
[11642.328114] [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
[11642.329051] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.330053] [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
[11642.330952] [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
[11642.331869] [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
[11642.332925] [<ffffffff81074075>] ? wake_up_q+0x47/0x47
[11642.333736] [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
[11642.334672] [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
[11642.335858] [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
[11642.336854] [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
[11642.337820] [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
[11642.339026] [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
[11642.340214] [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
[11642.341123] [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
[11642.341934] [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
[11642.342936] [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
[11642.343772] [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
[11642.344673] [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
[11642.346024] [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
[11642.346873] [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
[11642.347720] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.350222] 4 locks held by fdm-stress/26849:
[11642.350898] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
[11642.352375] #1: (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
[11642.354072] #2: (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
[11642.355647] #3: (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
[11642.358508] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.368625] fdm-stress D ffff88021f167688 0 26850 26591 0x00000000
[11642.369716] ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
[11642.370950] ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
[11642.372210] 0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
[11642.373430] Call Trace:
[11642.373853] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.374623] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.375948] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
[11642.376862] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
[11642.377637] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
[11642.378610] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
[11642.379457] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
[11642.380366] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
[11642.381353] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
[11642.382255] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
[11642.383162] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
[11642.383945] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
[11642.384875] [<ffffffff8111829f>] __lock_page+0x66/0x68
[11642.385749] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
[11642.386721] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
[11642.387596] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
[11642.389030] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
[11642.389973] [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
[11642.390939] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
[11642.392271] [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
[11642.393305] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
[11642.394239] [<ffffffff811236bc>] do_writepages+0x23/0x2c
[11642.395045] [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
[11642.395991] [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
[11642.397144] [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
[11642.398392] [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
[11642.399363] [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
[11642.400445] [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
[11642.401309] [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
[11642.402213] [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
[11642.403139] [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
[11642.404360] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.406187] [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
[11642.407070] [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
[11642.407990] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.409192] [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
[11642.410146] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
[11642.411291] [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
[11642.412263] [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
[11642.413057] [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
[11642.413897] [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
[11642.414708] [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
[11642.415573] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.416572] 1 lock held by fdm-stress/26850:
[11642.417345] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
[11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
[11642.419698] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
[11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11642.421807] fdm-stress D ffff880196483d28 0 26851 26591 0x00000000
[11642.422878] ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
[11642.424149] ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
[11642.425374] ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
[11642.426591] Call Trace:
[11642.427013] [<ffffffff8147b541>] schedule+0x82/0x9a
[11642.427856] [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
[11642.428852] [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
[11642.429743] [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.430911] [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.432102] [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
[11642.433259] [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
[11642.434431] [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
[11642.436079] [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
[11642.437009] [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
[11642.437860] [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
[11642.438723] [<ffffffff81171435>] iterate_supers+0x75/0xc2
[11642.439597] [<ffffffff81197d00>] sys_sync+0x52/0x80
[11642.440454] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
[11642.441533] 3 locks held by fdm-stress/26851:
[11642.442370] #0: (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
[11642.444043] #1: (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
[11642.446010] #2: (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
This happened because under specific timings the path for direct IO reads
can deadlock with concurrent buffered writes. The diagram below shows how
this happens for an example file that has the following layout:
[ extent A ] [ extent B ] [ ....
0K 4K 8K
CPU 1 CPU 2 CPU 3
DIO read against range
[0K, 8K[ starts
btrfs_direct_IO()
--> calls btrfs_get_blocks_direct()
which finds the extent map for the
extent A and leaves the range
[0K, 4K[ locked in the inode's
io tree
buffered write against
range [4K, 8K[ starts
__btrfs_buffered_write()
--> dirties page at 4K
a user space
task calls sync
for e.g or
writepages() is
invoked by mm
writepages()
run_delalloc_range()
cow_file_range()
--> ordered extent X
for the buffered
write is created
and
writeback starts
--> calls btrfs_get_blocks_direct()
again, without submitting first
a bio for reading extent A, and
finds the extent map for extent B
--> calls lock_extent_direct()
--> locks range [4K, 8K[
--> finds ordered extent X
covering range [4K, 8K[
--> unlocks range [4K, 8K[
buffered write against
range [0K, 8K[ starts
__btrfs_buffered_write()
prepare_pages()
--> locks pages with
offsets 0 and 4K
lock_and_cleanup_extent_if_need()
--> blocks attempting to
lock range [0K, 8K[ in
the inode's io tree,
because the range [0, 4K[
is already locked by the
direct IO task at CPU 1
--> calls
btrfs_start_ordered_extent(oe X)
btrfs_start_ordered_extent(oe X)
--> At this point writeback for ordered
extent X has not finished yet
filemap_fdatawrite_range()
btrfs_writepages()
extent_writepages()
extent_write_cache_pages()
--> finds page with offset 0
with the writeback tag
(and not dirty)
--> tries to lock it
--> deadlock, task at CPU 2
has the page locked and
is blocked on the io range
[0, 4K[ that was locked
earlier by this task
So fix this by falling back to a buffered read in the direct IO read path
when an ordered extent for a buffered write is found.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When using the same file as the source and destination for a dedup
(extent_same ioctl) operation we were allowing it to dedup to a
destination offset beyond the file's size, which doesn't make sense and
it's not allowed for the case where the source and destination files are
not the same file. This made de deduplication operation successful only
when the source range corresponded to a hole, a prealloc extent or an
extent with all bytes having a value of 0x00. This was also leaving a
file hole (between i_size and destination offset) without the
corresponding file extent items, which can be reproduced with the
following steps for example:
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
wrote 404990/404990 bytes at offset 304457
395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)
$ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
Deduping 2 total files
(28672, 24576): /mnt/sdi/foobar
(929792, 24576): /mnt/sdi/foobar
1 files asked to be deduped
i: 0, status: 0, bytes_deduped: 24576
24576 total bytes deduped in this operation
$ umount /mnt/sdi
$ btrfsck /dev/sdi
Checking filesystem on /dev/sdi
UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
checking extents
checking free space cache
checking fs roots
root 5 inode 257 errors 100, file extent discount
Found file extent holes:
start: 712704, len: 217088
found 540673 bytes used err is 1
total csum bytes: 400
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123675
file data blocks allocated: 671744
referenced 671744
btrfs-progs v4.2.3
So fix this by not allowing the destination to go beyond the file's size,
just as we do for the same where the source and destination files are not
the same.
A test for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|