|
The __jbd2_journal_remove_checkpoint() doesn't require an elevated
b_count; indeed, until the jh structure gets released by the call to
jbd2_journal_put_journal_head(), the bh's b_count is elevated by
virtue of the existence of the jh structure.
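A sketch of the resulting pattern (illustrative, not the actual diff; jbd2 API names as in fs/jbd2/checkpoint.c):
	/* The jh pins bh->b_count for as long as it exists, so no
	 * surrounding get_bh()/put_bh() pair is needed here: */
	__jbd2_journal_remove_checkpoint(jh);
	jbd2_journal_put_journal_head(jh);	/* bh may be released here */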
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Having done a full regression test, we can now drop the
DELALLOC_RESERVED state flag.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented
because it was too hard to make sure the mballoc and get_block flags
could be reliably passed down through all of the codepaths that end up
calling ext4_mb_new_blocks().
Since then, mb_flags has been passed down through most of the code
paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky
as it used to be.
This commit plumbs in the last of what is required, and then adds a
WARN_ON check to make sure we haven't missed anything. If this passes
a full regression test run, we can then drop
EXT4_STATE_DELALLOC_RESERVED.
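A minimal sketch of the kind of check this adds (illustrative; assuming it lands in ext4_mb_new_blocks() and uses the existing ext4_test_inode_state() helper):
	/* All callers now pass the flag via ar->flags, so the inode
	 * state flag should never be set by the time we get here: */
	WARN_ON(ext4_test_inode_state(ar->inode,
				      EXT4_STATE_DELALLOC_RESERVED));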
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Instead of initializing the allocation_request structure in
ext4_alloc_branch(), set it up in ext4_ind_map_blocks(), and then pass
it to ext4_alloc_branch() and ext4_splice_branch().
This allows ext4_ind_map_blocks() to pass flags in the allocation
request structure without having to add Yet Another argument to
ext4_alloc_branch().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but the wait step is missing from the current implementation,
so fix it in the same way as io_destroy().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently udf_iget() (triggered by NFS) can race with udf_new_inode()
leading to two inode structures with the same inode number:
nfsd: iget_locked() creates inode
nfsd: try to read from disk, block on that.
udf_new_inode(): allocate inode with that inumber
udf_new_inode(): insert it into icache, set it up and dirty
udf_write_inode(): write inode into buffer cache
nfsd: get CPU again, look into buffer cache, see nice and sane on-disk
inode, set the in-core inode from it
Fix the problem by putting inode into icache in locked state (I_NEW set)
and unlocking it only after it's fully set up.
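A sketch of the fix in udf_new_inode() (error handling is illustrative; insert_inode_locked() hashes the inode with I_NEW set):
	inode->i_ino = ino;
	if (insert_inode_locked(inode) < 0) {
		/* lost the race for this inode number */
		iput(inode);
		return ERR_PTR(-EIO);
	}
	/* fully set up the inode fields here, then publish it: */
	mark_inode_dirty(inode);
	unlock_new_inode(inode);	/* clears I_NEW, wakes iget_locked() waiters */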
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Boilerplate code in udf_{create,mknod,symlink} is taken to a new helper.
The symlink case is converted to use the unique id calculated by
udf_new_inode() - no point in finding a new one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently UDF doesn't initialize i_generation in any way and thus NFS
can easily get reallocated inodes from stale file handles. Luckily UDF
already has a unique object identifier associated with each inode -
i_unique. Use that for initialization of i_generation.
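The change amounts to a one-liner in inode setup, roughly (iinfo standing for the in-core UDF inode info, UDF_I(inode), which carries the on-disk unique id):
	inode->i_generation = iinfo->i_unique;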
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
NFS can easily ask for inodes that are already deleted. Currently UDF
happily returns such inodes, which is a bug. Return -ESTALE if
udf_read_inode() is asked to read a deleted inode.
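A sketch of the check (assuming, for illustration, that a deleted on-disk inode is recognized by a zero link count; link_count is a hypothetical local):
	/* in udf_read_inode(), after decoding the on-disk fe/efe: */
	if (link_count == 0)	/* deleted on disk */
		return -ESTALE;	/* don't resurrect it for a stale NFS handle */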
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently __udf_read_inode() doesn't return anything and we find out
whether reading an inode succeeded by checking whether the inode is bad or
not. udf_iget() returns NULL on failure and an inode pointer otherwise.
Make these two functions properly propagate errors up the call stack and
use the return value in callers.
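Callers then switch from NULL checks to the usual ERR_PTR convention, roughly:
	inode = udf_iget(sb, &loc);
	if (IS_ERR(inode))
		return PTR_ERR(inode);	/* was: if (!inode) return -EIO; */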
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
We did not implement any bound on the number of indirect ICBs we follow
when loading an inode. Thus a corrupted medium could cause the kernel to go
into an infinite loop, possibly causing a stack overflow.
Fix the possible stack overflow by removing recursion from
__udf_read_inode() and limiting the number of indirect ICBs we follow to avoid
infinite loops.
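A sketch of the iterative form (UDF_MAX_ICB_NESTING and the indirect-ICB helpers are illustrative names, not the actual code):
	#define UDF_MAX_ICB_NESTING 1024

	unsigned int indirections = 0;
reread:
	/* read the ICB at 'loc' into bh */
	if (icb_is_indirect(bh)) {			/* hypothetical helper */
		if (++indirections > UDF_MAX_ICB_NESTING) {
			udf_err(sb, "too many indirect ICBs\n");
			return -EIO;
		}
		loc = indirect_icb_location(bh);	/* hypothetical helper */
		brelse(bh);
		goto reread;				/* loop, don't recurse */
	}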
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
There's no good reason to separate these since udf_fill_inode() is
called only from __udf_read_inode() and both do part of the same thing.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
If we are writing back the inode of an unlinked directory, its link count
ends up being (u16)-1. Although the inode is deleted, udf_iget() can load
the inode when NFS uses a stale file handle and get confused.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
This patch adds three inline functions to clean up dirty casting code.
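The helpers are likely along these lines (a sketch reconstructed from the description, assuming they wrap the existing F2FS_SB() superblock cast):
	static inline struct f2fs_sb_info *F2FS_I_SB(struct inode *inode)
	{
		return F2FS_SB(inode->i_sb);
	}

	static inline struct f2fs_sb_info *F2FS_M_SB(struct address_space *mapping)
	{
		return F2FS_I_SB(mapping->host);
	}

	static inline struct f2fs_sb_info *F2FS_P_SB(struct page *page)
	{
		return F2FS_M_SB(page->mapping);
	}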
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs bug fixes from Jaegeuk Kim:
"This series includes patches to:
- fix recovery routines
- fix bugs related to inline_data/xattr
- fix when casting the dentry names
- handle EIO or ENOMEM correctly
- fix memory leak
- fix lock coverage"
* tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (28 commits)
f2fs: reposition unlock_new_inode to prevent accessing invalid inode
f2fs: fix wrong casting for dentry name
f2fs: simplify by using a literal
f2fs: truncate stale block for inline_data
f2fs: use macro for code readability
f2fs: introduce need_do_checkpoint for readability
f2fs: fix incorrect calculation with total/free inode num
f2fs: remove rename and use rename2
f2fs: skip if inline_data was converted already
f2fs: remove rewrite_node_page
f2fs: avoid double lock in truncate_blocks
f2fs: prevent checkpoint during roll-forward
f2fs: add WARN_ON in f2fs_bug_on
f2fs: handle EIO not to break fs consistency
f2fs: check s_dirty under cp_mutex
f2fs: unlock_page when node page is redirtied out
f2fs: introduce f2fs_cp_error for readability
f2fs: give a chance to mount again when encountering errors
f2fs: trigger release_dirty_inode in f2fs_put_super
f2fs: don't skip checkpoint if there is no dirty node pages
...
|
|
Thanks to Dan Carpenter for extending smatch to find bugs like this.
(This was found using a development version of smatch.)
Fixes: 36de928641ee48b2078d3fe9514242aaa2f92013
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:
[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)
The necessary conditions that lead to such crash are:
* an incremental fsync (when the inode doesn't have the
BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
a file extent item ending at offset X;
* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
to a file truncate operation that reduces the file to a size smaller
than X;
* a ranged fsync call happens (via an msync for example), with a range that
doesn't cover the whole file and the end of this range, let's call it Y, is
smaller than X;
* btrfs_log_inode sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
calls btrfs_truncate_inode_items() to remove all items from the log
tree that are associated with our file;
* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
file extent item it removed is the one ending at offset X, where X > 0 and
X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
parameter set to X;
* btrfs_ordered_update_i_size() sees that X is greater than the current ordered
size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
ordered operation with a range covering the offset X, calling a BUG_ON() if
such ordered operation exists. This assumption is made because the disk_i_size
is only increased after the corresponding file extent item is added to the
btree (btrfs_finish_ordered_io);
* But because our fsync covers only a limited range, such an ordered extent might
exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
extent to finish when calling btrfs_wait_ordered_range();
And then by the time btrfs_ordered_update_i_size() is called, via:
btrfs_sync_file() ->
btrfs_log_dentry_safe() ->
btrfs_log_inode_parent() ->
btrfs_log_inode() ->
btrfs_truncate_inode_items() ->
btrfs_ordered_update_i_size()
We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().
So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in-memory inode's disk_i_size.
Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
While writing to a file, in inode.c:cow_file_range() (and same applies to
submit_compressed_extents()), after reserving an extent for the file data,
we create a new extent map for the written range and insert it into the
extent map cache. After that, we create an ordered operation, but if it
fails (due to a transient/temporary ENOMEM), we return without dropping
that extent map, which points to a reserved extent that is freed when we
return. A subsequent incremental fsync (when the btrfs inode doesn't have
the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
logs a file extent item based on that extent map, which points to a disk
extent that doesn't contain valid data - it was freed by us earlier, at this
point it might contain any random/garbage data.
Therefore, if we reach an error condition when cowing a file range after
we added the new extent map to the cache, drop it from the cache before
returning.
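Sketched error path in cow_file_range() (labels and exact cleanup calls are illustrative):
	out_drop_extent_cache:
		/* remove the now-stale extent map so a later incremental
		 * fsync can't log an extent item pointing at freed space */
		btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
	out_reserve:
		btrfs_free_reserved_extent(root, ins.objectid, ins.offset);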
Some sequence of steps that lead to this:
$ mkfs.btrfs -f /dev/sdd
$ mount -o commit=9999 /dev/sdd /mnt
$ cd /mnt
$ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
$ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
$ sync
$ od -t x1 foo
0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
*
0020000
$ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
# Now this write + fsync fail with -ENOMEM, which was returned by
# btrfs_add_ordered_extent() in inode.c:cow_file_range().
$ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
$ xfs_io -c "fsync" foo
fsync: Cannot allocate memory
# Now do a new write + fsync, which will succeed. Our previous
# -ENOMEM was a transient/temporary error.
$ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
$ xfs_io -c "fsync" foo
# Our file content (in page cache) is now:
$ od -t x1 foo
0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
*
0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000
# Now reboot the machine, and mount the fs, so that fsync log replay
# takes place.
# The file content is now weird, in particular the first 8Kb, which
# do not match our data before nor after the sync command above.
$ od -t x1 foo
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000
# In fact these first 4Kb are a duplicate of the last 4kb block.
# The last write got an extent map/file extent item that points to
# the same disk extent that we got in the write+fsync that failed
# with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
# verify that:
$ btrfs-debug-tree /dev/sdd
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
extent data disk byte 12582912 nr 8192
extent data offset 0 nr 8192 ram 8192
item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 8192 ram 8192
item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
extent data disk byte 12582912 nr 4096
extent data offset 0 nr 4096 ram 4096
$ umount /dev/sdd
$ btrfsck /dev/sdd
Checking filesystem on /dev/sdd
UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
checking extents
extent item 12582912 has multiple extent items
ref mismatch on [12582912 4096] extent item 1, found 2
Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
backpointer mismatch on [12582912 4096]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
root 5 inode 257 errors 1000, some csum missing
found 131074 bytes used err is 1
total csum bytes: 4
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123404
file data blocks allocated: 274432
referenced 274432
Btrfs v3.14.1-96-gcc7fd5a-dirty
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
This fixes an Oopsable race when starting up the callback server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
This fixes an Oopsable race when starting lockd.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events. After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves. This small patch fixes the
problem in testing.
Thanks to Zach for helping to look into this, and suggesting the fix.
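The fix boils down to a read barrier between fetching the ring tail and reading the events it indexes, roughly:
	/* in aio_read_events_ring(): */
	ring = kmap_atomic(ctx->ring_pages[0]);
	head = ring->head;
	tail = ring->tail;
	kunmap_atomic(ring);

	/*
	 * Ensure the tail read above happens before we read the events
	 * it claims are present (pairs with the producer-side barrier
	 * in aio_complete()).
	 */
	smp_rmb();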
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
|
|
Due to a race condition on the inode cache, the following scenario can appear:
[Thread a] [Thread b]
->f2fs_mkdir
->f2fs_add_link
->__f2fs_add_link
->init_inode_metadata failed here
->gc_thread_func
->f2fs_gc
->do_garbage_collect
->gc_data_segment
->f2fs_iget
->iget_locked
->wait_on_inode
->unlock_new_inode
->move_data_page
->make_bad_inode
->iput
When we fail in create/symlink/mkdir/mknod/tmpfile, the newly allocated
inode should be set as bad to avoid being accessed by other threads. But in
the above scenario, f2fs can access the invalid inode before it has been
set as bad.
This patch fixes the potential problem; the issue was found by code review.
change log from v1:
o Add condition judgment in gc_data_segment() suggested by Changman Lee.
o use iget_failed to simplify code.
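The error path then becomes (sketch):
	/* instead of:
	 *	make_bad_inode(inode);
	 *	iput(inode);
	 * fail the still-locked (I_NEW) inode so no other thread can
	 * use it before it is marked bad:
	 */
	iget_failed(inode);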
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This commit adds some statistics to the extent status tree shrinker. The
purpose of adding these is to collect more details when we encounter a
stall caused by the extent status tree shrinker. Here we count the
following statistics:
stats:
the number of all objects on all extent status trees
the number of reclaimable objects on lru list
cache hits/misses
the last sorted interval
the number of inodes on lru list
average:
scan time for shrinking some objects
the number of shrunk objects
maximum:
the inode that has max nr. of objects on lru list
the maximum scan time for shrinking some objects
The output looks like below:
$ cat /proc/fs/ext4/sda1/es_shrinker_info
stats:
28228 objects
6341 reclaimable objects
5281/631 cache hits/misses
586 ms last sorted interval
250 inodes on lru list
average:
153 us scan time
128 shrunk objects
maximum:
255 inode (255 objects, 198 reclaimable)
125723 us max scan time
If the lru list has never been sorted, the following line will not be
printed:
586 ms last sorted interval
If there is an empty lru list, the following lines also will not be
printed:
250 inodes on lru list
...
maximum:
255 inode (255 objects, 198 reclaimable)
0 us max scan time
Meanwhile in this commit a new trace point is defined to print some
details in __ext4_es_shrink().
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This commit improves the trace points of the extents status tree. We
rename trace_ext4_es_shrink_enter in ext4_es_count() because it is also
used in ext4_es_scan() and we cannot tell the two apart in the results.
Further, this commit fixes a variable name in a trace point in order to
keep consistency with the others.
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
get_blocks is renamed to get_block.
Signed-off-by: Seunghun Lee <waydi1@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
xfs_collapse_file_space() currently writes back the entire file
undergoing collapse range to settle things down for the extent shift
algorithm. While this prevents changes to the extent list during the
collapse operation, the writeback itself is not enough to prevent
unnecessary collapse failures.
The current shift algorithm uses the extent index to iterate the in-core
extent list. If a post-eof delalloc extent persists after the writeback
(e.g., a prior zero range op where the end of the range aligns with eof
can separate the post-eof blocks such that they are not written back and
converted), xfs_bmap_shift_extents() becomes confused over the encoded
br_startblock value and fails the collapse.
As with the full writeback, this is a temporary fix until the algorithm
is improved to cope with a volatile extent list and avoid attempts to
shift post-eof extents.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
If we have delalloc extents on a file before we run a collapse range
operation, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.
However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....
And when writeback runs, it can convert delalloc extents in the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.
The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.
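The workaround amounts to widening the pre-collapse writeback to the whole file, roughly (VFS_I() being XFS's xfs_inode to VFS inode accessor):
	/* flush the entire file, not just the range being collapsed,
	 * so writeback can't convert delalloc extents and shift the
	 * extent indexes under us */
	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
	if (error)
		return error;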
Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining extents downward.
The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.
Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.
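The exit path then becomes conditional, roughly:
	/* only log (and thus dirty the transaction) if we actually
	 * modified the inode's bmap */
	if (logflags)
		xfs_trans_log_inode(tp, ip, logflags);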
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Now that we are not doing silly things with dirtying buffers beyond EOF
and are using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.
Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
cc: stable@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads. This is different from the other filesystems who
only invalidate pages during DIO writes.
truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page. This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.
Buffered reads will find an up-to-date page with zeros instead of
the data actually on disk.
This patch fixes things by using invalidate_inode_pages2_range
instead. It preserves the page cache invalidation, but won't zero
any pages.
[dchinner: catch error and warn if it fails. Comment.]
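A sketch of the replacement call in the XFS direct IO read path (PAGE_CACHE_SHIFT was the page-index shift of that era; the WARN_ON_ONCE reflects the [dchinner] note above):
	ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
				pos >> PAGE_CACHE_SHIFT, -1);
	/* with the locking we hold, this invalidation should not fail,
	 * but if it does we want to know rather than zero pages */
	WARN_ON_ONCE(ret);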
cc: stable@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.
The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_dirty_buffers(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.
Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.
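The core of the copied-and-modified buffer loop, roughly (inode being the page's host inode; the surrounding locking and radix-tree tagging follow __set_page_dirty_buffers()):
	struct buffer_head *head = page_buffers(page);
	struct buffer_head *bh = head;
	loff_t eof = i_size_read(inode);
	loff_t off = page_offset(page);

	do {
		if (off < eof)		/* never dirty buffers beyond EOF */
			set_buffer_dirty(bh);
		off += bh->b_size;
		bh = bh->b_this_page;
	} while (bh != head);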
cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Enable by default the block_validity feature, which checks for
collisions between newly allocated blocks and critical system
metadata.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
__wait_cp_io() is only called by jbd2_log_do_checkpoint(). Fold it in
to make it a bit easier to understand.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
__process_buffer() is only called by jbd2_log_do_checkpoint(), and it
had a very complex locking protocol: it would be called with the
j_list_lock held, and would sometimes exit with the lock still held (if
the return code was 0) or release the lock.
This was confusing both to humans and to smatch (which erroneously
complained that the lock was taken twice).
Folding __process_buffer() to the caller allows us to simplify the
control flow, making the resulting function easier to read and reason
about, and dropping the compiled size of fs/jbd2/checkpoint.c by 150
bytes (over 4% of the text size).
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Make the function name less redundant.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Reuse the path object in ext4_move_extents() so we don't unnecessarily
free and reallocate it.
Also clean up the get_ext_path() wrapper so that it has the same
semantics of freeing the path object on error as ext4_ext_find_extent().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now that the semantics of ext4_ext_find_extent() are much cleaner,
it's safe and more efficient to reuse the path object across the
multiple calls to ext4_ext_find_extent() in ext4_ext_shift_extents().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This adds additional safety in case for some reason we end up reusing a
path structure which isn't big enough for the current depth of the inode.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Teach ext4_ext_drop_refs() to accept a NULL argument, much like
kfree(). This allows us to drop a lot of checks to make sure path is
non-NULL before calling ext4_ext_drop_refs().
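The function then starts with a kfree()-style NULL check, roughly:
	void ext4_ext_drop_refs(struct ext4_ext_path *path)
	{
		int depth, i;

		if (!path)
			return;		/* tolerate NULL, like kfree() */
		depth = path->p_depth;
		for (i = 0; i <= depth; i++, path++) {
			brelse(path->p_bh);	/* brelse(NULL) is a no-op */
			path->p_bh = NULL;
		}
	}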
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In nearly all of the calls to ext4_ext_find_extent() where the caller
is trying to recycle the path object, ext4_ext_drop_refs() gets called
to release the buffer heads before the path object gets overwritten.
To simplify things for the callers, and to avoid the possibility of a
memory leak, make ext4_ext_find_extent() responsible for dropping the
buffers.
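Inside ext4_ext_find_extent() this amounts to, roughly:
	/* a recycled path must have its buffer references dropped
	 * before the array is overwritten, else the bhs leak */
	if (path)
		ext4_ext_drop_refs(path);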
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Drop EXT4_EX_NOFREE_ON_ERR from ext4_ext_create_new_leaf(),
ext4_split_extent(), ext4_convert_unwritten_extents_endio().
This requires fixing all of their callers to allow ext4_ext_find_extent()
to potentially free the struct ext4_ext_path object in case of an error,
and there are interlocking dependencies all the way up to
ext4_ext_map_blocks(), ext4_swap_extents(), and
ext4_ext_remove_space().
Once this is done, we can drop the EXT4_EX_NOFREE_ON_ERR flag since it
is no longer necessary.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Transfer responsibility of freeing struct ext4_ext_path on error to
ext4_ext_find_extent().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|