Age | Commit message (Collapse) | Author |
|
It's a buggy, unnecessary wrapper that is duplicating
truncate_pagecache_range().
When replacing the call in xfs_change_file_space(), also ensure that
the length being allocated/freed is always positive before making
any changes. These checks are done in the lower extent manipulation
functions, too, but we need to do them before any page cache
operations.
Reported-by: Andrew Dahl <adahl@sgi.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-By: Andrew Dahl <adahl@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
This pulls in the 3.7-rc5 fixes into tty-next to make it easier to test.
|
|
When unmounting, gfs2 does a full dlm_unlock operation on every
cached lock. This can create a very large amount of work and can
take a long time to complete. However, the vast majority of these
dlm unlock operations are unnecessary because after all the unlocks
are done, gfs2 leaves the dlm lockspace, which automatically clears
the locks of the leaving node, without unlocking each one individually.
So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
remove the locks implicitly. The one exception is when the lock's lvb is
being used. In this case, dlm_unlock is called because it may update the
lvb of the resource.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
For verification purposes, AGFLs need to be initialised to a known
set of values. For upcoming CRC changes, they are also headers that
need to be initialised. Currently, growfs does neither for the AGFLs
- it ignores them completely. Add initialisation of the AGFL to be
full of invalid block numbers (NULLAGBLOCK) to put the
infrastructure in place needed for CRC support.
Includes a comment clarification from Jeff Liu.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
When writing the new AG headers to disk, we can't attach write
verifiers because they have a dependency on the struct xfs-perag
being attached to the buffer to be fully initialised and growfs
can't fully initialise them until later in the process.
The simplest way to avoid this problem is to use uncached buffers
for writing the new headers. These buffers don't have the xfs-perag
attached to them, so it's simple to detect in the write verifier and
be able to skip the checks that need the xfs-perag.
This enables us to attach the appropriate buffer ops to the buffer
and hence calculate CRCs on the way to disk. IT also means that the
buffer is torn down immediately, and so the first access to the AG
headers will re-read the header from disk and perform full
verification of the buffer. This way we also can catch corruptions
due to problems that went undetected in growfs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Factor xfs_btree_init_block() to be independent of the btree cursor,
and use the function to initialise btree blocks in the growfs code.
This makes adding support for different format btree blocks simple.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Added when debugging recent attribute tree problems to more finely
trace code execution through the maze of twisty passages that makes
up the attr code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Error handling in xfs_buf_ioapply_map() does not handle IO reference
counts correctly. We increment the b_io_remaining count before
building the bio, but then fail to decrement it in the failure case.
This leads to the buffer never running IO completion and releasing
the reference that the IO holds, so at unmount we can leak the
buffer. This leak is captured by this assert failure during unmount:
XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273
This is not a new bug - the b_io_remaining accounting has had this
problem for a long, long time - it's just very hard to get a
zero length bio being built by this code...
Further, the buffer IO error can be overwritten on a multi-segment
buffer by subsequent bio completions for partial sections of the
buffer. Hence we should only set the buffer error status if the
buffer is not already carrying an error status. This ensures that a
partial IO error on a multi-segment buffer will not be lost. This
part of the problem is a regression, however.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
When we shut down the filesystem, it might first be detected in
writeback when we are allocating a inode size transaction. This
happens after we have moved all the pages into the writeback state
and unlocked them. Unfortunately, if we fail to set up the
transaction we then abort writeback and try to invalidate the
current page. This then triggers are BUG() in block_invalidatepage()
because we are trying to invalidate an unlocked page.
Fixing this is a bit of a chicken and egg problem - we can't
allocate the transaction until we've clustered all the pages into
the IO and we know the size of it (i.e. whether the last block of
the IO is beyond the current EOF or not). However, we don't want to
hold pages locked for long periods of time, especially while we lock
other pages to cluster them into the write.
To fix this, we need to make a clear delineation in writeback where
errors can only be handled by IO completion processing. That is,
once we have marked a page for writeback and unlocked it, we have to
report errors via IO completion because we've already started the
IO. We may not have submitted any IO, but we've changed the page
state to indicate that it is under IO so we must now use the IO
completion path to report errors.
To do this, add an error field to xfs_submit_ioend() to pass it the
error that occurred during the building on the ioend chain. When
this is non-zero, mark each ioend with the error and call
xfs_finish_ioend() directly rather than building bios. This will
immediately push the ioends through completion processing with the
error that has occurred.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
In certain circumstances, a double split of an attribute tree is
needed to insert or replace an attribute. In rare situations, this
can go wrong, leaving the attribute tree corrupted. In this case,
the attr being replaced is the last attr in a leaf node, and the
replacement is larger so doesn't fit in the same leaf node.
When we have the initial condition of a node format attribute
btree with two leaves at index 1 and 2. Call them L1 and L2. The
leaf L1 is completely full, there is not a single byte of free space
in it. L2 is mostly empty. The attribute being replaced - call it X
- is the last attribute in L1.
The way an attribute replace is executed is that the replacement
attribute - call it Y - is first inserted into the tree, but has an
INCOMPLETE flag set on it so that list traversals ignore it. Once
this transaction is committed, a second transaction it run to
atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
traversal will now find Y and skip X. Once that transaction is
committed, attribute X is then removed.
So, the initial condition is:
+--------+ +--------+
| L1 | | L2 |
| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |
| fsp: 0 | | fsp: N |
|--------| |--------|
| attr A | | attr 1 |
|--------| |--------|
| attr B | | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr X | | attr n |
+--------+ +--------+
So now we go to replace X, and see that L1:fsp = 0 - it is full so
we can't insert Y in the same leaf. So we record the the location of
attribute X so we can track it for later use, then we split L1 into
L1 and L3 and reblance across the two leafs. We end with:
+--------+ +--------+ +--------+
| L1 | | L3 | | L2 |
| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: N |
|--------| |--------| |--------|
| attr A | | attr X | | attr 1 |
|--------| +--------+ |--------|
| attr B | | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
And we track that the original attribute is now at L3:0.
We then try to insert Y into L1 again, and find that there isn't
enough room because the new attribute is larger than the old one.
Hence we have to split again to make room for Y. We end up with
this:
+--------+ +--------+ +--------+ +--------+
| L1 | | L4 | | L3 | | L2 |
| fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: J | | fsp: N |
|--------| |--------| |--------| |--------|
| attr A | | attr Y | | attr X | | attr 1 |
|--------| + INCOMP + +--------+ |--------|
| attr B | +--------+ | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
And now we have the new (incomplete) attribute @ L4:0, and the
original attribute at L3:0. At this point, the first transaction is
committed, and we move to the flipping of the flags.
This is where we are supposed to end up with this:
+--------+ +--------+ +--------+ +--------+
| L1 | | L4 | | L3 | | L2 |
| fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: J | | fsp: N |
|--------| |--------| |--------| |--------|
| attr A | | attr Y | | attr X | | attr 1 |
|--------| +--------+ + INCOMP + |--------|
| attr B | +--------+ | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
But that doesn't happen properly - the attribute tracking indexes
are not pointing to the right locations. What we end up with is both
the old attribute to be removed pointing at L4:0 and the new
attribute at L4:1. On a debug kernel, this assert fails like so:
XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725
because the new attribute location does not exist. On a production
kernel, this goes unnoticed and the code proceeds ahead merrily and
removes L4 because it thinks that is the block that is no longer
needed. This leaves the hash index node pointing to entries
L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the
leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free
space, and so everything is busted. This corruption is caused by the
removal of the old attribute triggering a join - it joins everything
correctly but then frees the wrong block.
xfs_repair will report something like:
bad sibling back pointer for block 4 in attribute fork for inode 131
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 8 for inode 131, would reset to 3
bad anextents 4 for inode 131, would reset to 0
The problem lies in the assignment of the old/new blocks for
tracking purposes when the double leaf split occurs. The first split
tries to place the new attribute inside the current leaf (i.e.
"inleaf == true") and moves the old attribute (X) to the new block.
This sets up the old block/index to L1:X, and newly allocated
block to L3:0. It then moves attr X to the new block and tries to
insert attr Y at the old index. That fails, so it splits again.
With the second split, the rebalance ends up placing the new attr in
the second new block - L4:0 - and this is where the code goes wrong.
What is does is it sets both the new and old block index to the
second new block. Hence it inserts attr Y at the right place (L4:0)
but overwrites the current location of the attr to replace that is
held in the new block index (currently L3:0). It over writes it with
L4:1 - the index we later assert fail on.
Hopefully this table will show this in a foramt that is a bit easier
to understand:
Split old attr index new attr index
vanilla patched vanilla patched
before 1st L1:26 L1:26 N/A N/A
after 1st L3:0 L3:0 L1:26 L1:26
after 2nd L4:0 L3:0 L4:1 L4:0
^^^^ ^^^^
wrong wrong
The fix is surprisingly simple, for all this analysis - just stop
the rebalance on the out-of leaf case from overwriting the new attr
index - it's already correct for the double split case.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
For filesystems with only a single resource group, we need to be careful
that the allocation loop will not land up with a NULL resource group. This
fixes a bug in a previous patch where the gfs2_rgrpd_get_next() function
was being used instead of gfs2_rgrpd_get_first()
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
Since we now have a dirty_inode that takes care of manipulating the
inode buffer and writing from the inode to the buffer, we can
eliminate some unnecessary buffer manipulations in gfs2_unlink_inode
that are now redundant.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch changes the gfs2_dir_add function so that it uses
the dirty_inode function (via mark_inode_dirty) rather than manually
updating the dinode.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch fixes an issue relating to not having enough revokes
available when truncating journaled data files. In order to ensure
that we do no run out, the truncation is broken into separate pieces
if it is large enough.
Tested using fsx on a journaled data file.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
During a directory entry lookup of a hashed directory, if the
hash-based lookup functions fail and we fall back to a linear scan,
don't try to verify the dirent checksum on the internal nodes of the
hash tree because they don't store a checksum in a hidden dirent like
the leaf nodes do.
Reported-by: George Spelvin <linux@horizon.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
It can be legitimately triggered via procfs access. Now, at least
2 of 3 of get_files_struct() callers in procfs are useless, but
when and if we get rid of those we can always add WARN_ON() here.
BUG_ON() at that spot is simply wrong.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
If there is no space for a checksum in a directory leaf node,
previously we would use EXT4_ERROR_INODE() which would mark the file
system as inconsistent. While it would be nice to use e2fsck -D, it
certainly isn't required, so just print a warning using
ext4_warning().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
|
|
Pull cifs fixes from Jeff Layton.
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Do not lookup hashed negative dentry in cifs_atomic_open
cifs: fix potential buffer overrun in cifs.idmap handling code
|
|
jffs2_write_begin() first acquires the page lock, then f->sem. This
causes an AB-BA deadlock with jffs2_garbage_collect_live(), which first
acquires f->sem, then the page lock:
jffs2_garbage_collect_live
mutex_lock(&f->sem) (A)
jffs2_garbage_collect_dnode
jffs2_gc_fetch_page
read_cache_page_async
do_read_cache_page
lock_page(page) (B)
jffs2_write_begin
grab_cache_page_write_begin
find_lock_page
lock_page(page) (B)
mutex_lock(&f->sem) (A)
We fix this by restructuring jffs2_write_begin() to take f->sem before
the page lock. However, we make sure that f->sem is not held when
calling jffs2_reserve_space(), as this is not permitted by the locking
rules.
The deadlock above was observed multiple times on an SoC with a dual
ARMv7 (Cortex-A9), running the long-term 3.4.11 kernel; it occurred
when using scp to copy files from a host system to the ARM target
system. The fix was heavily tested on the same target system.
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Acked-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
|
|
Merge misc fixes from Andrew Morton:
"Five fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (5 patches)
h8300: add missing L1_CACHE_SHIFT
mm: bugfix: set current->reclaim_state to NULL while returning from kswapd()
fanotify: fix missing break
revert "epoll: support for disabling items, and a self-test app"
checkpatch: improve network block comment style checking
|
|
Pull xfs bugfixes from Ben Myers:
- fix for large transactions spanning multiple iclog buffers
- zero the allocation_args structure on the stack before using it to
determine whether to use a worker for allocation
- move allocation stack switch to xfs_bmapi_allocate in order to
prevent deadlock on AGF buffers
- growfs no longer reads in garbage for new secondary superblocks
- silence a build warning
- ensure that invalid buffers never get written to disk while on free
list
- don't vmap inode cluster buffers during free
- fix buffer shutdown reference count mismatch
- fix reading of wrapped log data
* tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfs:
xfs: fix reading of wrapped log data
xfs: fix buffer shudown reference count mismatch
xfs: don't vmap inode cluster buffers during free
xfs: invalidate allocbt blocks moved to the free list
xfs: silence uninitialised f.file warning.
xfs: growfs: don't read garbage for new secondary superblocks
xfs: move allocation stack switch up to xfs_bmapi_allocate
xfs: introduce XFS_BMAPI_STACK_SWITCH
xfs: zero allocation_args on the kernel stack
xfs: only update the last_sync_lsn when a transaction completes
|
|
Anders Blomdell noted in 2010 that Fanotify lost events and provided a
test case. Eric Paris confirmed it was a bug and posted a fix to the
list
https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/RrJfTfyW2BE
but never applied it. Repeated attempts over time to actually get him
to apply it have never had a reply from anyone who has raised it
So apply it anyway
Signed-off-by: Alan Cox <alan@linux.intel.com>
Reported-by: Anders Blomdell <anders.blomdell@control.lth.se>
Cc: Eric Paris <eparis@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Revert commit 03a7beb55b9f ("epoll: support for disabling items, and a
self-test app") pending resolution of the issues identified by Michael
Kerrisk, copied below.
We'll revisit this for 3.8.
: I've taken a look at this patch as it currently stands in 3.7-rc1, and
: done a bit of testing. (By the way, the test program
: tools/testing/selftests/epoll/test_epoll.c does not compile...)
:
: There are one or two places where the behavior seems a little strange,
: so I have a question or two at the end of this mail. But other than
: that, I want to check my understanding so that the interface can be
: correctly documented.
:
: Just to go though my understanding, the problem is the following
: scenario in a multithreaded application:
:
: 1. Multiple threads are performing epoll_wait() operations,
: and maintaining a user-space cache that contains information
: corresponding to each file descriptor being monitored by
: epoll_wait().
:
: 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
: a file descriptor from the epoll interest list, and
: delete the corresponding record from the user-space cache.
:
: 3. The problem with (2) is that some other thread may have
: previously done an epoll_wait() that retrieved information
: about the fd in question, and may be in the middle of using
: information in the cache that relates to that fd. Thus,
: there is a potential race.
:
: 4. The race can't solved purely in user space, because doing
: so would require applying a mutex across the epoll_wait()
: call, which would of course blow thread concurrency.
:
: Right?
:
: Your solution is the EPOLL_CTL_DISABLE operation. I want to
: confirm my understanding about how to use this flag, since
: the description that has accompanied the patches so far
: has been a bit sparse
:
: 0. In the scenario you're concerned about, deleting a file
: descriptor means (safely) doing the following:
: (a) Deleting the file descriptor from the epoll interest list
: using EPOLL_CTL_DEL
: (b) Deleting the corresponding record in the user-space cache
:
: 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
: conjunction with EPOLLONESHOT.
:
: 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
: conjunction is a logical error.
:
: 3. The correct way to code multithreaded applications using
: EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
:
: a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
: should EPOLLONESHOT.
:
: b. When a thread wants to delete a file descriptor, it
: should do the following:
:
: [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
: [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
: was zero, then the file descriptor can be safely
: deleted by the thread that made this call.
: [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
: then the descriptor is in use. In this case, the calling
: thread should set a flag in the user-space cache to
: indicate that the thread that is using the descriptor
: should perform the deletion operation.
:
: Is all of the above correct?
:
: The implementation depends on checking on whether
: (events & ~EP_PRIVATE_BITS) == 0
: This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always
: set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
: causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
: cleared.
:
: A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
: is only useful in conjunction with EPOLLONESHOT. However, as things
: stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
: not have EPOLLONESHOT set in 'events' This results in the following
: (slightly surprising) behavior:
:
: (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
: (the indicator that the file descriptor can be safely deleted).
: (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
:
: This doesn't seem particularly useful, and in fact is probably an
: indication that the user made a logic error: they should only be using
: epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
: EPOLLONESHOT was set in 'events'. If that is correct, then would it
: not make sense to return an error to user space for this case?
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paton J. Lewis" <palewis@adobe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch makes ext4 really support SEEK_DATA/SEEK_HOLE flags. Block-mapped
and extent-mapped files are fully implemented together because ext4_map_blocks
hides this differences.
After applying this patch, it will cause a failure in xfstest #285 when the file
is block-mapped due to block-mapped file isn't support fallocate(2).
I had tried to use ext4_ext_walk_space() to retrieve the offset for a
extent-mapped file. But finally I decide to keep using ext4_map_blocks() to
support SEEK_DATA/SEEK_HOLE because ext4_map_blocks() can hide the difference
between block-mapped file and extent-mapped file. Moreover, in next step,
extent status tree will track all extent status, and we can get all mappings
from this tree. So I think that using ext4_map_blocks() is a better choice.
CC: Hugh Dickins <hughd@google.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
This patch adds some tracepoints in extent status tree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
This patch lets ext4 maintain extent status tree.
Currently it only tracks delay extent status in extent status tree. When a
delay allocation is issued, the related delay extent will be inserted into
extent status tree. When a delay extent is written out or invalidated, it will
be removed from this tree.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Let ext4 initialize extent status tree of an inode.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
This patch adds operations on a extent status tree.
CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Create a new mount workqueue and delayed_work to enable background
scanning and freeing of eofblocks inodes. The scanner kicks in once
speculative preallocation occurs and stops requeueing itself when
no eofblocks inodes exist.
The scan interval is based on the new
'speculative_prealloc_lifetime' tunable (default to 5m). The
background scanner performs unfiltered, best effort scans (which
skips inodes under lock contention or with a dirty cache mapping).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Support minimum file size filtering in the eofblocks scan. The
caller must set the XFS_EOF_FLAGS_MINFILESIZE flags bit and minimum
file size value in bytes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Enhance the eofblocks scan code to filter based on multiply specified
inode id values. When multiple inode id values are specified, only
inodes that match all id values are selected.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Support inode ID filtering in the eofblocks scan. The caller must
set the associated XFS_EOF_FLAGS_*ID bit and ID field.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
The XFS_IOC_FREE_EOFBLOCKS ioctl allows users to invoke an EOFBLOCKS
scan. The xfs_eofblocks structure is defined to support the command
parameters (scan mode).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
xfs_inodes_free_eofblocks() implements scanning functionality for
EOFBLOCKS inodes. It uses the AG iterator to walk the tagged inodes
and free post-EOF blocks via the xfs_inode_free_eofblocks() execute
function. The scan can be invoked in best-effort mode or wait
(force) mode.
A best-effort scan (default) handles all inodes that do not have a
dirty cache and we successfully acquire the io lock via trylock. In
wait mode, we continue to cycle through an AG until all inodes are
handled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Turn xfs_free_eofblocks() into a non-static function, return EAGAIN to
indicate trylock failure and make sure this error is not propagated in
xfs_release().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
This check is used in multiple places to determine whether we
should check for (and potentially free) post EOF blocks on an
inode. Add a helper to consolidate the check.
Note that when we remove an inode from the cache (xfs_inactive()),
we are required to trim post-EOF blocks even if the inode is marked
preallocated or append-only to maintain correct space accounting.
The 'force' parameter to xfs_can_free_eofblocks() specifies whether
we should ignore the prealloc/append-only status of the inode.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Genericize xfs_inode_ag_walk() to support an optional radix tree tag
and args argument for the execute function. Create a new wrapper
called xfs_inode_ag_iterator_tag() that performs a tag based walk
of perag's and inodes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Add the XFS_ICI_EOFBLOCKS_TAG inode tag to identify inodes with
speculatively preallocated blocks beyond EOF. An inode is tagged
when speculative preallocation occurs and untagged either via
truncate down or when post-EOF blocks are freed via release or
reclaim.
The tag management is intentionally not aggressive to prefer
simplicity over the complexity of handling all the corner cases
under which post-EOF blocks could be freed (i.e., forward
truncation, fallocate, write error conditions, etc.). This means
that a tagged inode may or may not have post-EOF blocks after a
period of time. The tag is eventually cleared when the inode is
released or reclaimed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
This patch adds two structures that supports extent status tree, extent_status
and ext4_es_tree. Currently extent_status is used to track a delay extent for
an inode, which record the start block and the length of the delay extent.
ext4_es_tree is used to store all extent_status for an inode in memory.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
There are some places in ext4_fill_super() where we would not return
proper error code if something fails. The confusion is caused probably
due to the fact that we have two "kind-of" return variables 'ret'and
'err'.
'ret' is used to return error code from ext4_fill_super() where err is
used to store return values from other functions within ext4_fill_super().
However some places were missing the obligatory 'ret = err'. We could
put the assignment where it is missing, but we can have better "future
proof" solution. Or we could convert the code to use just one, but it
would require more rewrites.
This commit fixes the problem by returning value from 'err' variable if
it is set and 'ret' otherwise in error handling branch of the
ext4_fill_super(). The reasoning is that 'ret' value is often set to
default "-EINVAL" or explicit value, where 'err' is used to store
return value from other functions and should be otherwise zero.
https://bugzilla.kernel.org/show_bug.cgi?id=48431
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
In ext4_xattr_set_acl(), if ext4_journal_start() returns an error,
posix_acl_release() will not be called for 'acl' which may result in a
memory leak.
This patch fixes that.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
729f52c6be51013 introduced function ext4_get_block_write_nolock() that
is very similar to _ext4_get_block(). Eliminate code duplication
by passing different flags to _ext4_get_block()
Tested: xfs tests
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Tested: xfs tests
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
When ext4_ext_handle_uninitialized_extents(), we will directly return
from ext4_ext_map_blocks(). The trace point of
trace_ext4_ext_map_blocks_exit isn't called, and the user doesn't see
any result. This patch tries to fix this problem.
Meanwhile in ext4_ext_handle_uninitialized_extents it returns errors
or the number of allocated blocks. So 'ret' variable can be removed
due to previously modifications.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
|
|
When we use trace_ext4_ext/ind_map_blocks_exit, print the value of
map->m_flags in order that we can understand the extent's current
status.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
In trace_ext4_ext_handle_uninitialized_extents we don't care about the
value of map->m_flags because this value is probably 0, and we prefer
to get the value of flags because we can know how to handle this
extent in this function.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
We should warn user then the discard request fails. However we need to
exclude -EOPNOTSUPP case since parts of the device might not support it
while other parts can. So print the kernel warning when the error !=
-EOPNOTSUPP is returned from ext4_issue_discard().
We should also handle error cases in batched discard, again excluding
EOPNOTSUPP.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Notify user when mounting the file system with -o discard option, but
the device does not support discard. Obviously we do not want to fail
the mount or disable the options, because the underlying device might
change in future even without file system remount.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|