Age | Commit message (Collapse) | Author |
|
lockdep report following warning in test:
[25176.843958] =================================
[25176.844519] [ INFO: inconsistent lock state ]
[25176.845047] 4.1.0-rc3 #22 Tainted: G W
[25176.845591] ---------------------------------
[25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
[25176.847246] (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
[25176.847838] {SOFTIRQ-ON-W} state was registered at:
[25176.848396] [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
[25176.848955] [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
[25176.849491] [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
[25176.850029] [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
[25176.850575] [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
[25176.851110] [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
[25176.851660] [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
[25176.852189] [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
[25176.852771] [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
[25176.853315] [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
[25176.853868] [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
[25176.854406] [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
[25176.854935] irq event stamp: 51506
[25176.855511] hardirqs last enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
[25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
[25176.856642] softirqs last enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
[25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
[25176.857746]
other info that might help us debug this:
[25176.858845] Possible unsafe locking scenario:
[25176.859981] CPU0
[25176.860537] ----
[25176.861059] lock(&wr_ctx->wr_lock);
[25176.861705] <Interrupt>
[25176.862272] lock(&wr_ctx->wr_lock);
[25176.862881]
*** DEADLOCK ***
Reason:
Above warning is caused by:
Interrupt
-> bio_endio()
-> ...
-> scrub_put_ctx()
-> scrub_free_ctx() *1
-> ...
-> mutex_lock(&wr_ctx->wr_lock);
scrub_put_ctx() is allowed to be called in end_bio interrupt, but
in code design, it will never call scrub_free_ctx(sctx) in interrupe
context(above *1), because btrfs_scrub_dev() get one additional
reference of sctx->refs, which makes scrub_free_ctx() only called
withine btrfs_scrub_dev().
Now the code runs out of our wish, because free sequence in
scrub_pending_bio_dec() have a gap.
Current code:
-----------------------------------+-----------------------------------
scrub_pending_bio_dec() | btrfs_scrub_dev
-----------------------------------+-----------------------------------
atomic_dec(&sctx->bios_in_flight); |
wake_up(&sctx->list_wait); |
| scrub_put_ctx()
| -> atomic_dec_and_test(&sctx->refs)
scrub_put_ctx(sctx); |
-> atomic_dec_and_test(&sctx->refs)|
-> scrub_free_ctx() |
-----------------------------------+-----------------------------------
We expected:
-----------------------------------+-----------------------------------
scrub_pending_bio_dec() | btrfs_scrub_dev
-----------------------------------+-----------------------------------
atomic_dec(&sctx->bios_in_flight); |
wake_up(&sctx->list_wait); |
scrub_put_ctx(sctx); |
-> atomic_dec_and_test(&sctx->refs)|
| scrub_put_ctx()
| -> atomic_dec_and_test(&sctx->refs)
| -> scrub_free_ctx()
-----------------------------------+-----------------------------------
Fix:
Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
in interrupt context.
Tested by check tracelog in debug.
Changelog v1->v2:
Use workqueue instead of adjust function call sequence in v1,
because v1 will introduce a bug pointed out by:
Filipe David Manana <fdmanana@gmail.com>
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
The extent-same code rejects requests with an unaligned length. This
poses a problem when we want to dedupe the tail extent of files as we
skip cloning the portion between i_size and the extent boundary.
If we don't clone the entire extent, it won't be deleted. So the
combination of these behaviors winds up giving us worst-case dedupe on
many files.
We can fix this by allowing a length that extents to i_size and
internally aligining those to the end of the block. This is what
btrfs_ioctl_clone() so we can just copy that check over.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
max_to_defrag represents the number of pages to defrag rather than the last
page of the file range to be defragged.
Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the
defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag
should actually have the value 4. Instead in the current code we end up
setting it to 6.
Now, this does not (yet) cause an issue since the first part of the while loop
condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control
to flow out of the while loop before any buggy behavior is actually caused. So
the patch just makes sure that max_to_defrag ends up having the right value
rather than fixing a bug. I did run the xfstests suite to make sure that the
code does not regress.
Changelog: v1->v2:
Provide a much descriptive commit message.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Read-ahead is done for the pages in the range [ra_index, ra_index + cluster -
1]. So the next read-ahead should be starting from the page at index 'ra_index
+ cluster' (unless we deemed that the extent at 'ra_index + cluster' as
non-defraggable) rather than from the page at index 'ra_index +
max_cluster'. This patch fixes this. I did run the xfstests suite to make sure
that the code does not regress.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
When allocating a new chunk or removing one we need to update num_devs
device items and insert or remove a chunk item in the chunk tree, so
in the worst case the space needed in the chunk space_info is:
btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
btrfs_calc_trans_metadata_size(chunk_root, 1)
That is, in the worst case we need to cow num_devs paths and cow 1 other
path that can result in splitting every node and leaf, and each path
consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
which were unnecessary since updating the existing device items does
not result in splitting the nodes and leaf since after updating them
they remain with the same size.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
We don't need to attach ordered extents that have completed to the current
transaction. Doing so only makes us hold memory for longer than necessary
and delaying the iput of the inode until the transaction is committed (for
each created ordered extent we do an igrab and then schedule an asynchronous
iput when the ordered extent's reference count drops to 0), preventing the
inode from being evictable until the transaction commits.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
Commit 3a8b36f37806 ("Btrfs: fix data loss in the fast fsync path") added
a performance regression for that causes an unnecessary sync of the log
trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
against a file, without no writes or any metadata updates to the inode in
between them and if a transaction is committed before the second fsync is
called.
Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
after a test sysbench test that measured a -62% decrease of file io
requests per second for that tests' workload.
The test is:
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
mkfs -t btrfs /dev/sda2
mount -t btrfs /dev/sda2 /fs/sda2
cd /fs/sda2
for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
--file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
--file-num=1024 run
A test on kvm guest, running a debug kernel gave me the following results:
Without 3a8b36f378060d: 16.01 reqs/sec
With 3a8b36f378060d: 3.39 reqs/sec
With 3a8b36f378060d and this patch: 16.04 reqs/sec
Reported-by: Huang Ying <ying.huang@intel.com>
Tested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.2
|
|
When aborting volatile_writes, let's drop its flag and give up any further
volatile_writes.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We cannot provide an efficient implementation due to the headers
on the data blocks, so there doesn't seem much point in having it.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
This patch implements fallocate's FALLOC_FL_INSERT_RANGE for Ext4.
1) Make sure that both offset and len are block size aligned.
2) Update the i_size of inode by len bytes.
3) Compute the file's logical block number against offset. If the computed
block number is not the starting block of the extent, split the extent
such that the block number is the starting block of the extent.
4) Shift all the extents which are lying between [offset, last allocated extent]
towards right by len bytes. This step will make a hole of len bytes
at offset.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
|
|
|
|
In recovery procedure for superblock, we try to write data of valid
superblock into invalid one for recovery, work should be finished here,
but then still we will write the valid one with its original data.
This operation is not needed. Let's skip doing this unnecessary work.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
For the first mount of f2fs image with realtime discard option, we will
disable discard option if device is not supported, but for remount
operation, our discard option can still be set, this should be avoided.
This patch moves configuring of discard option to parse_options() to fix
this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We want the fixes in this branch as well for testing and merge
resolution.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
frequently called for buffers that are already part of the running
transaction - most frequently it is the case for bitmaps, inode table
blocks, and superblock. Since in such cases we have nothing to do, it is
unfortunate we still grab reference to journal head, lock the bh, lock
bh_state only to find out there's nothing to do.
Improving this is a bit subtle though since until we find out journal
head is attached to the running transaction, it can disappear from under
us because checkpointing / commit decided it's no longer needed. We deal
with this by protecting journal_head slab with RCU. We still have to be
careful about journal head being freed & reallocated within slab and
about exposing journal head in consistent state (in particular
b_modified and b_frozen_data must be in correct state before we allow
user to touch the buffer).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Check for the simple case of unjournaled buffer first, handle it and
bail out. This allows us to remove one if and unindent the difficult case
by one tab. The result is easier to read.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We were acquiring bh_state_lock when allocation of buffer failed in
do_get_write_access() only to be able to jump to a label that releases
the lock and does all other checks that don't make sense for this error
path. Just jump into the right label instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
needs_copy is set only in one place in do_get_write_access(), just move
the frozen buffer copying into that place and factor it out to a
separate function to make do_get_write_access() slightly more readable.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
[ Added another sparse fix for EXT4_IOC_GET_ENCRYPTION_POLICY while
we're at it. --tytso ]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
One-line fix to cast quota value to s64 before comparison.
By default the quantity is treated as u64.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
During a source code review of fs/ext4/extents.c I noted identical
consecutive lines. An assertion is repeated for inode1 and never done
for inode2. This is not in keeping with the rest of the code in the
ext4_swap_extents function and appears to be a bug.
Assert that the inode2 mutex is not locked.
Signed-off-by: David Moore <dmoorefo@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
|
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
the allocation group is suitable for use or not. However we might get
various errors and fail while initializing new group including -EIO
which would never get propagated up the call chain. This might lead to
an endless loop at writeback when we're trying to find a good group to
allocate from and we fail to initialize new group (read error for
example).
Fix this by returning proper error code from ext4_mb_good_group() and
using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
we will always return only the first occurred error from
ext4_mb_good_group() and we only propagate it back to the caller if we
do not get any other errors and we fail to allocate any blocks.
Note that with other modes than errors=continue, we will fail
immediately in ext4_mb_good_group() in case of error, however with
errors=continue we should try to continue using the file system, that's
why we're not going to fail immediately when we see an error from
ext4_mb_good_group(), but rather when we fail to find a suitable block
group to allocate from due to an problem in group initialization.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Currently on the machines with page size > block size when initializing
block group buddy cache we initialize it for all the block group bitmaps
in the page. However in the case of read error, checksum error, or if
a single bitmap is in any way corrupted we would fail to initialize all
of the bitmaps. This is problematic because we will not have access to
the other allocation groups even though those might be perfectly fine
and usable.
Fix this by reading all the bitmaps instead of error out on the first
problem and simply skip the bitmaps which were either not read properly,
or are not valid.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If we want to rely on the buffer_verified() flag of the block bitmap
buffer, we have to set it consistently. However currently if we're
initializing uninitialized block bitmap in
ext4_read_block_bitmap_nowait() we're not going to set buffer verified
at all.
We can do this by simply setting the flag on the buffer, but I think
it's actually better to run ext4_validate_block_bitmap() to make sure
that what we did in the ext4_init_block_bitmap() is right.
So run ext4_validate_block_bitmap() even after the block bitmap
initialization. Also bail out early from ext4_validate_block_bitmap() if
we see corrupt bitmap, since we already know it's corrupt and we do not
need to verify that.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
On failure, v9fs_session_init() returns with the v9fs_session_info
struct partially initialized and expects the caller to invoke
v9fs_session_close() to clean it up; however, it doesn't track whether
the bdi is initialized or not and curiously invokes bdi_destroy() in
both vfs_session_init() failure path too.
A. If v9fs_session_init() fails before the bdi is initialized, the
follow-up v9fs_session_close() will invoke bdi_destroy() on an
uninitialized bdi.
B. If v9fs_session_init() fails after the bdi is initialized,
bdi_destroy() will be called twice on the same bdi - once in the
failure path of v9fs_session_init() and then by
v9fs_session_close().
A is broken no matter what. B used to be okay because bdi_destroy()
allowed being invoked multiple times on the same bdi, which BTW was
broken in its own way - if bdi_destroy() was invoked on an initialiezd
but !registered bdi, it'd fail to free percpu counters. Since
f0054bb1e1f3 ("writeback: move backing_dev_info->wb_lock and
->worklist into bdi_writeback"), this no longer work - bdi_destroy()
on an initialized but not registered bdi works correctly but multiple
invocations of bdi_destroy() is no longer allowed.
The obvious culprit here is v9fs_session_init()'s odd and broken error
behavior. It should simply clean up after itself on failures. This
patch makes the following updates to v9fs_session_init().
* @rc -> @retval error return propagation removed. It didn't serve
any purpose. Just use @rc.
* Move addition to v9fs_sessionlist to the end of the function so that
incomplete sessions are not put on the list or iterated and error
path doesn't have to worry about it.
* Update error handling so that it cleans up after itself.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This basically reverts 47def82672b3 (jbd2: Remove __GFP_NOFAIL from jbd2
layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
to open coding the endless loop around the allocator rather than
removing the dependency on the non failing allocation. So the
deprecation was a clear failure and the reality tells us that
__GFP_NOFAIL is not even close to go away.
It is still true that __GFP_NOFAIL allocations are generally discouraged
and new uses should be evaluated and an alternative (pre-allocations or
reservations) should be considered but it doesn't make any sense to lie
the allocator about the requirements. Allocator can take steps to help
making a progress if it knows the requirements.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: David Rientjes <rientjes@google.com>
|
|
When testing pnfs layout, nfsd got error NFS4ERR_SEQ_MISORDERED.
It is caused by nfs return NFS4ERR_DELAY before validate_seqid(),
don't update the sequnce id, but nfsd updates the sequnce id !!!
According to RFC5661 20.9.3,
" If CB_SEQUENCE returns an error, then the state of the slot
(sequence ID, cached reply) MUST NOT change. "
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
nfsd enters a infinite loop and prints message every 10 seconds:
May 31 18:33:52 test-server kernel: Error sending entire callback!
May 31 18:34:01 test-server kernel: Error sending entire callback!
This is caused by a cb_layoutreturn getting error -10008
(NFS4ERR_DELAY), the client crashing, and then nfsd entering the
infinite loop:
bc_sendto --> call_timeout --> nfsd4_cb_done --> nfsd4_cb_layout_done
with error -10008 --> rpc_delay(task, HZ/100) --> bc_sendto ...
Reproduced using xfstests 074 with nfs client's kdump on,
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT set, and client's blkmapd down:
1. nfs client's write operation will get the layout of file,
and then send getdeviceinfo,
2. but layout segment is not recorded by client because blkmapd is down,
3. client writes data by sending WRITE to server,
4. nfs server recalls the layout of the file before WRITE,
5. network error causes the client reset the session and return NFS4ERR_DELAY,
6. so client's WRITE operation is waiting the reply.
If the task hangs 120s, the client will crash.
7. so that, the next bc_sendto will fail with TIMEOUT,
and cb_status is NFS4ERR_DELAY.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Conflicts:
fs/xfs/xfs_attr_inactive.c
|
|
Instead of the confusing flags argument pass a boolean flag to indicate if
we want to release or regrant a log reservation.
Also ensure that xfs_log_done always drop the reference on the log ticket,
to both simplify the code and make the logic in xfs_trans_roll easier
to understand.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The flags argument to xfs_trans_commit is not useful for most callers, as
a commit of a transaction without a permanent log reservation must pass
0 here, and all callers for a transaction with a permanent log reservation
except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove
the flags argument from the public xfs_trans_commit interfaces, and
introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
that regrants a log reservation instead of releasing it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
XFS_TRANS_ABORT. Both of them are a direct product of the transaction
state, and can be deducted:
- any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
- any transaction with a permanent log reservation needs
XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
log reservation is invalid.
So just remove the flags argument and do the right thing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The flags value always was 0 or XFS_TRANS_ABORT. Switch to a bool
parameter to allow further cleanups.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
We have three remaining callers of xfs_trans_dup:
- xfs_itruncate_extents which open codes xfs_trans_roll
- xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves
attaching them to it's callers, but otherwise is identical to
xfs_trans_roll
- xfs_dir_ialloc looks at the log reservations in the old xfs_trans
structure instead of the log reservation parameters, but otherwise
is identical to xfs_trans_roll.
By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch
these three remaining users over to xfs_trans_roll and mark xfs_trans_dup
static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
|
|
The inode allocator enables random sparse inode chunk allocations in
DEBUG mode to facilitate testing. Sparse inode allocations are not
always possible, however, depending on the fs geometry. For example,
there is no possibility for a sparse inode allocation on filesystems
where the block size is large enough to fit one or more inode chunks
within a single block.
Fix up the DEBUG mode sparse inode allocation logic to trigger random
sparse allocations only when the geometry of the fs allows it.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The kbuild test robot reports the following compilation failure with a
32-bit kernel configuration:
fs/built-in.o: In function `xfs_ifree_cluster':
>> xfs_inode.c:(.text+0x17ac84): undefined reference to `__umoddi3'
This is due to the use of the modulus operator on a 64-bit variable in
the ASSERT() added as part of the following commit:
xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()
This ASSERT() simply checks that the offset of the inode in a sparse
cluster is appropriately aligned. Since the maximum inode record offset
is 63 (for a 64 inode record) and the calculated offset here should be
something less than that, just use a 32-bit variable to store the offset
and call the do_mod() helper.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
|
|
Add initial DAX support to XFS. To do this we need a new mount
option to turn DAX on filesystem, and we need to propagate this into
the inode flags whenever an inode is instantiated so that the
per-inode checks throughout the code Do The Right Thing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
DAX does not do buffered IO (can't buffer direct access!) and hence
all read/write IO is vectored through the direct IO path. Hence we
need to add the DAX IO path callouts to the direct IO
infrastructure.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
When we truncate a DAX file, we need to call through the DAX page
truncation path rather than through block_truncate_page() so that
mappings and block zeroing are all handled correctly. Otherwise,
truncate does not need to change.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Add initial support for DAX block zeroing operations to XFS. DAX
cannot use buffered IO through the page cache for zeroing, nor do we
need to issue IO for uncached block zeroing. In both cases, we can
simply call out to the dax block zeroing function.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Add the initial support for DAX file operations to XFS. This
includes the necessary block allocation and mmap page fault hooks
for DAX to function.
Note that there are changes to the splice interfaces to ensure that
for DAX splice avoids direct page cache manipulations and instead
takes the DAX IO paths for read/write operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Some filesystems cannot call dax_fault() directly because they have
different locking and/or allocation constraints in the page fault IO
path. To handle this, we need to follow the same model as the
generic block_page_mkwrite code, where the internals are exposed via
__block_page_mkwrite() so that filesystems can wrap the correct
locking and operations around the outside.
This is loosely based on a patch originally from Matthew Willcox.
Unlike the original patch, it does not change ext4 code, error
returns or unwritten extent conversion handling. It also adds a
__dax_mkwrite() wrapper for .page_mkwrite implementations to do the
right thing, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.
Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.
Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Lock ordering for the new mmap lock needs to be:
mmap_sem
sb_start_pagefault
i_mmap_lock
page lock
<fault processsing>
Right now xfs_vm_page_mkwrite gets this the wrong way around,
While technically it cannot deadlock due to the current freeze
ordering, it's still a landmine that might explode if we change
anything in future. Hence we need to nest the locks correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Previously we allocated bounce pages using a combination of
alloc_page() and mempool_alloc() with the __GFP_WAIT bit set.
Instead, use mempool_alloc() with GFP_NOWAIT. The mempool_alloc()
function will try using alloc_pages() initially, and then only use the
mempool reserve of pages if alloc_pages() is unable to fulfill the
request.
This minimizes the the impact on the mm layer when we need to do a
large amount of writeback of encrypted files, as Jaeguk Kim had
reported that under a heavy fio workload on a system with restricted
amounts memory (which unfortunately, includes many mobile handsets),
he had observed the the OOM killer getting triggered several times.
Using GFP_NOWAIT
If the mempool_alloc() function fails, we will retry the page
writeback at a later time; the function of the mempool is to ensure
that we can writeback at least 32 pages at a time, so we can more
efficiently dispatch I/O under high memory pressure situations. In
the future we should make this be a tunable so we can determine the
best tradeoff between permanently sequestering memory and the ability
to quickly launder pages so we can free up memory quickly when
necessary.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Zygo Blaxell and other users have reported occasional hangs while an
inode is being evicted, leading to traces like the following:
[ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
[ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2
[ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000
[ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
[ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
[ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
[ 5281.981396] Call Trace:
[ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83
[ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
[ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
[ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
[ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
[ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148
[ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5
[ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
[ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
[ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
[ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[ 5282.001299] 1 lock held by rm/20488:
[ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
This happens when we have readahead, which calls readpages(), happening
right before the inode eviction handler is invoked. So the reason is
essentially:
1) readpages() is called while a reference on the inode is held, so
eviction can not be triggered before readpages() returns. It also
locks one or more ranges in the inode's io_tree (which is done at
extent_io.c:__do_contiguous_readpages());
2) readpages() submits several read bios, all with an end io callback
that runs extent_io.c:end_bio_extent_readpage() and that is executed
by other task when a bio finishes, corresponding to a work queue
(fs_info->end_io_workers) worker kthread. This callback unlocks
the ranges in the inode's io_tree that were previously locked in
step 1;
3) readpages() returns, the reference on the inode is dropped;
4) One or more of the read bios previously submitted are still not
complete (their end io callback was not yet invoked or has not
yet finished execution);
5) Inode eviction is triggered (through an unlink call for example).
The inode reference count was not incremented before submitting
the read bios, therefore this is possible;
6) The eviction handler starts executing and enters the loop that
iterates over all extent states in the inode's io_tree;
7) The loop picks one extent state record and uses its ->start and
->end fields, after releasing the inode's io_tree spinlock, to
call lock_extent_bits() and clear_extent_bit(). The call to lock
the range [state->start, state->end] blocks because the whole
range or a part of it was locked by the previous call to
readpages() and the corresponding end io callback, which unlocks
the range was not yet executed;
8) The end io callback for the read bio is executed and unlocks the
range [state->start, state->end] (or a superset of that range).
And at clear_extent_bit() the extent_state record state is used
as a second argument to split_state(), which sets state->start to
a larger value;
9) The task executing the eviction handler is woken up by the task
executing the bio's end io callback (through clear_state_bit) and
the eviction handler locks the range
[old value for state->start, state->end]. Shortly after, when
calling clear_extent_bit(), it unlocks the range
[new value for state->start, state->end], so it ends up unlocking
only part of the range that it locked, leaving an extent state
record in the io_tree that represents the unlocked subrange;
10) The eviction handler loop, in its next iteration, gets the
extent_state record for the subrange that it did not unlock in the
previous step and then tries to lock it, resulting in an hang.
So fix this by not using the ->start and ->end fields of an existing
extent_state record. This is a simple solution, and an alternative
could be to bump the inode's reference count before submitting each
read bio and having it dropped in the bio's end io callback. But that
would be a more invasive/complex change and would not protect against
other possible places that are not holding a reference on the inode
as well. Something to consider in the future.
Many thanks to Zygo Blaxell for reporting, in the mailing list, the
issue, a set of scripts to trigger it and testing this fix.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|