Age | Commit message (Collapse) | Author |
|
Currently userspace has no way of determining that a filesystem is
CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to
indicate that the filesystem has v5 superblock support enabled.
This will allow xfs_info to correctly report the state of the
filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 74137fff067961c9aca1e14d073805c3de8549bd)
|
|
When CRCs are enabled, the number of blocks needed to hold a remote
symlink on a 1k block size filesystem may be 2 instead of 1. The
transaction reservation for the allocated blocks was not taking this
into account and only allocating one block. Hence when trying to
read or invalidate such symlinks, we are mapping a hole where there
should be a block and things go bad at that point.
Fix the reservation to use the correct block count, clean up the
block count calculation similar to the remote attribute calculation,
and add a debug guard to detect when we don't write the entire
symlink to disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 321a95839e65db3759a07a3655184b0283af90fe)
|
|
A long time ago in a galaxy far away....
.. the was a commit made to fix some ilinux specific "fragmented
buffer" log recovery problem:
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603
That problem occurred when a contiguous dirty region of a buffer was
split across across two pages of an unmapped buffer. It's been a
long time since that has been done in XFS, and the changes to log
the entire inode buffers for CRC enabled filesystems has
re-introduced that corner case.
And, of course, it turns out that the above commit didn't actually
fix anything - it just ensured that log recovery is guaranteed to
fail when this situation occurs. And now for the gory details.
xfstest xfs/085 is failing with this assert:
XFS (vdb): bad number of regions (0) in inode log format
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583
Largely undocumented factoid #1: Log recovery depends on all log
buffer format items starting with this format:
struct foo_log_format {
__uint16_t type;
__uint16_t size;
....
As recoery uses the size field and assumptions about 32 bit
alignment in decoding format items. So don't pay much attention to
the fact log recovery thinks that it decoding an inode log format
item - it just uses them to determine what the size of the item is.
But why would it see a log format item with a zero size? Well,
luckily enough xfs_logprint uses the same code and gives the same
error, so with a bit of gdb magic, it turns out that it isn't a log
format that is being decoded. What logprint tells us is this:
Oper (130): tid: a0375e1a len: 28 clientid: TRANS flags: none
BUF: #regs: 2 start blkno: 144 (0x90) len: 16 bmap size: 2 flags: 0x4000
Oper (131): tid: a0375e1a len: 4096 clientid: TRANS flags: none
BUF DATA
----------------------------------------------------------------------------
Oper (132): tid: a0375e1a len: 4096 clientid: TRANS flags: none
xfs_logprint: unknown log operation type (4e49)
**********************************************************************
* ERROR: data block=2 *
**********************************************************************
That we've got a buffer format item (oper 130) that has two regions;
the format item itself and one dirty region. The subsequent region
after the buffer format item and it's data is them what we are
tripping over, and the first bytes of it at an inode magic number.
Not a log opheader like there is supposed to be.
That means there's a problem with the buffer format item. It's dirty
data region is 4096 bytes, and it contains - you guessed it -
initialised inodes. But inode buffers are 8k, not 4k, and we log
them in their entirety. So something is wrong here. The buffer
format item contains:
(gdb) p /x *(struct xfs_buf_log_format *)in_f
$22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
blf_data_map = {0xffffffff, 0xffffffff, .... }}
Two regions, and a signle dirty contiguous region of 64 bits. 64 *
128 = 8k, so this should be followed by a single 8k region of data.
And the blf_flags tell us that the type of buffer is a
XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
buffer. So, it should be followed by 8k of inode data.
But we know that the next region has a header of:
(gdb) p /x *ohead
$25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
oh_flags = 0x0, oh_res2 = 0x0}
and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
long enough to hold all the logged data. There must be another
region. There is - there's a following opheader for another 4k of
data that contains the other half of the inode cluster data - the
one we assert fail on because it's not a log format header.
So why is the second part of the data not being accounted to the
correct buffer log format structure? It took a little more work with
gdb to work out that the buffer log format structure was both
expecting it to be there but hadn't accounted for it. It was at that
point I went to the kernel code, as clearly this wasn't a bug in
xfs_logprint and the kernel was writing bad stuff to the log.
First port of call was the buffer item formatting code, and the
discontiguous memory/contiguous dirty region handling code
immediately stood out. I've wondered for a long time why the code
had this comment in it:
vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
vecp->i_len = nbits * XFS_BLF_CHUNK;
vecp->i_type = XLOG_REG_TYPE_BCHUNK;
/*
* You would think we need to bump the nvecs here too, but we do not
* this number is used by recovery, and it gets confused by the boundary
* split here
* nvecs++;
*/
vecp++;
And it didn't account for the extra vector pointer. The case being
handled here is that a contiguous dirty region lies across a
boundary that cannot be memcpy()d across, and so has to be split
into two separate operations for xlog_write() to perform.
What this code assumes is that what is written to the log is two
consecutive blocks of data that are accounted in the buf log format
item as the same contiguous dirty region and so will get decoded as
such by the log recovery code.
The thing is, xlog_write() knows nothing about this, and so just
does it's normal thing of adding an opheader for each vector. That
means the 8k region gets written to the log as two separate regions
of 4k each, but because nvecs has not been incremented, the buf log
format item accounts for only one of them.
Hence when we come to log recovery, we process the first 4k region
and then expect to come across a new item that starts with a log
format structure of some kind that tells us whenteh next data is
going to be. Instead, we hit raw buffer data and things go bad real
quick.
So, the commit from 2002 that commented out nvecs++ is just plain
wrong. It breaks log recovery completely, and it would seem the only
reason this hasn't been since then is that we don't log large
contigous regions of multi-page unmapped buffers very often. Never
would be a closer estimate, at least until the CRC code came along....
So, lets fix that by restoring the nvecs accounting for the extra
region when we hit this case.....
.... and there's the problemin log recovery it is apparently working
around:
XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135
Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty
regions being broken up into multiple regions by the log formatting
code. That's an easy fix, though - if the number of contiguous dirty
bits exceeds the length of the region being copied out of the log,
only account for the number of dirty bits that region covers, and
then loop again and copy more from the next region. It's a 2 line
fix.
Now xfstests xfs/085 passes, we have one less piece of mystery
code, and one more important piece of knowledge about how to
structure new log format items..
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 709da6a61aaf12181a8eea8443919ae5fc1b731d)
|
|
XFS has failed to kill suid/sgid bits correctly when truncating
files of non-zero size since commit c4ed4243 ("xfs: split
xfs_setattr") introduced in the 3.1 kernel. Fix it.
Fix it.
cc: stable kernel <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 56c19e89b38618390addfc743d822f99519055c6)
|
|
Lockdep reports:
=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #3 Not tainted
---------------------------------------------
setquota/28368 is trying to acquire lock:
(sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
but task is already holding lock:
(sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be
allocated.
xfs_qm_scall_setqlim() is starting a transaction and then not
passing it into xfs_qm_dqet() and so it starts it's own transaction
when allocating the dquot. Splat!
Fix this by not allocating the dquot in xfs_qm_scall_setqlim()
inside the setqlim transaction. This requires getting the dquot
first (and allocating it if necessary) then dropping and relocking
the dquot before joining it to the setqlim transaction.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f648167f3ac79018c210112508c732ea9bf67c7b)
|
|
Darrick J. Wong <darrick.wong@oracle.com> reports:
> I have a kvm-based testing setup that netboots VMs over NFS, the
> client end of which seems to have broken somehow in 3.10-rc1. The
> server's exports file looks like this:
>
> /storage/mtr/x64 192.168.122.0/24(ro,sync,no_root_squash,no_subtree_check)
>
> On the client end (inside the VM), the initrd runs the following
> command to try to mount the rootfs over NFS:
>
> # mount -o nolock -o ro -o retrans=10 192.168.122.1:/storage/mtr/x64/ /root
>
> (Note: This is the busybox mount command.)
>
> The mount fails with -EINVAL.
Commit 4580a92d44 "NFS: Use server-recommended security flavor by
default (NFSv3)" introduced a behavior regression for NFS mounts
done via a legacy binary mount(2) call.
Ensure that a default security flavor is specified for legacy binary
mount requests, since they do not invoke nfs_select_flavor() in the
kernel.
Busybox uses klibc's nfsmount command, which performs NFS mounts
using the legacy binary mount data format. /sbin/mount.nfs is not
affected by this regression.
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
When the directory freespace index grows to a second block (2017
4k data blocks in the directory), the initialisation of the second
new block header goes wrong. The write verifier fires a corruption
error indicating that the block number in the header is zero. This
was being tripped by xfs/110.
The problem is that the initialisation of the new block is done just
fine in xfs_dir3_free_get_buf(), but the caller then users a dirv2
structure to zero on-disk header fields that xfs_dir3_free_get_buf()
has already zeroed. These lined up with the block number in the dir
v3 header format.
While looking at this, I noticed that the struct xfs_dir3_free_hdr()
had 4 bytes of padding in it that wasn't defined as padding or being
zeroed by the initialisation. Add a pad field declaration and fully
zero the on disk and in-core headers in xfs_dir3_free_get_buf() so
that this is never an issue in the future. Note that this doesn't
change the on-disk layout, just makes the 32 bits of padding in the
layout explicit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
XFS has failed to kill suid/sgid bits correctly when truncating
files of non-zero size since commit c4ed4243 ("xfs: split
xfs_setattr") introduced in the 3.1 kernel. Fix it.
Fix it.
cc: stable kernel <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Currently userspace has no way of determining that a filesystem is
CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to
indicate that the filesystem has v5 superblock support enabled.
This will allow xfs_info to correctly report the state of the
filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Currently, swapping extents from one inode to another is a simple
act of switching data and attribute forks from one inode to another.
This, unfortunately in no longer so simple with CRC enabled
filesystems as there is owner information embedded into the BMBT
blocks that are swapped between inodes. Hence swapping the forks
between inodes results in the inodes having mapping blocks that
point to the wrong owner and hence are considered corrupt.
To fix this we need an extent tree block or record based swap
algorithm so that the BMBT block owner information can be updated
atomically in the swap transaction. This is a significant piece of
new work, so for the moment simply don't allow swap extent
operations to succeed on CRC enabled filesystems.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
A long time ago in a galaxy far away....
.. the was a commit made to fix some ilinux specific "fragmented
buffer" log recovery problem:
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603
That problem occurred when a contiguous dirty region of a buffer was
split across across two pages of an unmapped buffer. It's been a
long time since that has been done in XFS, and the changes to log
the entire inode buffers for CRC enabled filesystems has
re-introduced that corner case.
And, of course, it turns out that the above commit didn't actually
fix anything - it just ensured that log recovery is guaranteed to
fail when this situation occurs. And now for the gory details.
xfstest xfs/085 is failing with this assert:
XFS (vdb): bad number of regions (0) in inode log format
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583
Largely undocumented factoid #1: Log recovery depends on all log
buffer format items starting with this format:
struct foo_log_format {
__uint16_t type;
__uint16_t size;
....
As recoery uses the size field and assumptions about 32 bit
alignment in decoding format items. So don't pay much attention to
the fact log recovery thinks that it decoding an inode log format
item - it just uses them to determine what the size of the item is.
But why would it see a log format item with a zero size? Well,
luckily enough xfs_logprint uses the same code and gives the same
error, so with a bit of gdb magic, it turns out that it isn't a log
format that is being decoded. What logprint tells us is this:
Oper (130): tid: a0375e1a len: 28 clientid: TRANS flags: none
BUF: #regs: 2 start blkno: 144 (0x90) len: 16 bmap size: 2 flags: 0x4000
Oper (131): tid: a0375e1a len: 4096 clientid: TRANS flags: none
BUF DATA
----------------------------------------------------------------------------
Oper (132): tid: a0375e1a len: 4096 clientid: TRANS flags: none
xfs_logprint: unknown log operation type (4e49)
**********************************************************************
* ERROR: data block=2 *
**********************************************************************
That we've got a buffer format item (oper 130) that has two regions;
the format item itself and one dirty region. The subsequent region
after the buffer format item and it's data is them what we are
tripping over, and the first bytes of it at an inode magic number.
Not a log opheader like there is supposed to be.
That means there's a problem with the buffer format item. It's dirty
data region is 4096 bytes, and it contains - you guessed it -
initialised inodes. But inode buffers are 8k, not 4k, and we log
them in their entirety. So something is wrong here. The buffer
format item contains:
(gdb) p /x *(struct xfs_buf_log_format *)in_f
$22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
blf_data_map = {0xffffffff, 0xffffffff, .... }}
Two regions, and a signle dirty contiguous region of 64 bits. 64 *
128 = 8k, so this should be followed by a single 8k region of data.
And the blf_flags tell us that the type of buffer is a
XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
buffer. So, it should be followed by 8k of inode data.
But we know that the next region has a header of:
(gdb) p /x *ohead
$25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
oh_flags = 0x0, oh_res2 = 0x0}
and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
long enough to hold all the logged data. There must be another
region. There is - there's a following opheader for another 4k of
data that contains the other half of the inode cluster data - the
one we assert fail on because it's not a log format header.
So why is the second part of the data not being accounted to the
correct buffer log format structure? It took a little more work with
gdb to work out that the buffer log format structure was both
expecting it to be there but hadn't accounted for it. It was at that
point I went to the kernel code, as clearly this wasn't a bug in
xfs_logprint and the kernel was writing bad stuff to the log.
First port of call was the buffer item formatting code, and the
discontiguous memory/contiguous dirty region handling code
immediately stood out. I've wondered for a long time why the code
had this comment in it:
vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
vecp->i_len = nbits * XFS_BLF_CHUNK;
vecp->i_type = XLOG_REG_TYPE_BCHUNK;
/*
* You would think we need to bump the nvecs here too, but we do not
* this number is used by recovery, and it gets confused by the boundary
* split here
* nvecs++;
*/
vecp++;
And it didn't account for the extra vector pointer. The case being
handled here is that a contiguous dirty region lies across a
boundary that cannot be memcpy()d across, and so has to be split
into two separate operations for xlog_write() to perform.
What this code assumes is that what is written to the log is two
consecutive blocks of data that are accounted in the buf log format
item as the same contiguous dirty region and so will get decoded as
such by the log recovery code.
The thing is, xlog_write() knows nothing about this, and so just
does it's normal thing of adding an opheader for each vector. That
means the 8k region gets written to the log as two separate regions
of 4k each, but because nvecs has not been incremented, the buf log
format item accounts for only one of them.
Hence when we come to log recovery, we process the first 4k region
and then expect to come across a new item that starts with a log
format structure of some kind that tells us whenteh next data is
going to be. Instead, we hit raw buffer data and things go bad real
quick.
So, the commit from 2002 that commented out nvecs++ is just plain
wrong. It breaks log recovery completely, and it would seem the only
reason this hasn't been since then is that we don't log large
contigous regions of multi-page unmapped buffers very often. Never
would be a closer estimate, at least until the CRC code came along....
So, lets fix that by restoring the nvecs accounting for the extra
region when we hit this case.....
.... and there's the problemin log recovery it is apparently working
around:
XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135
Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty
regions being broken up into multiple regions by the log formatting
code. That's an easy fix, though - if the number of contiguous dirty
bits exceeds the length of the region being copied out of the log,
only account for the number of dirty bits that region covers, and
then loop again and copy more from the next region. It's a 2 line
fix.
Now xfstests xfs/085 passes, we have one less piece of mystery
code, and one more important piece of knowledge about how to
structure new log format items..
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
When CRCs are enabled, the number of blocks needed to hold a remote
symlink on a 1k block size filesystem may be 2 instead of 1. The
transaction reservation for the allocated blocks was not taking this
into account and only allocating one block. Hence when trying to
read or invalidate such symlinks, we are mapping a hole where there
should be a block and things go bad at that point.
Fix the reservation to use the correct block count, clean up the
block count calculation similar to the remote attribute calculation,
and add a debug guard to detect when we don't write the entire
symlink to disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
We write the superblock every 30s or so which results in the
verifier being called. Right now that results in this output
every 30s:
XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
Use of these features in this kernel is at your own risk!
And spamming the logs.
We don't need to check for whether we support v5 superblocks or
whether there are feature bits we don't support set as these are
only relevant when we first mount the filesytem. i.e. on superblock
read. Hence for the write verification we can just skip all the
checks (and hence verbose output) altogether.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
We need to pass the full open mode flags to nfs_may_open() when doing
a delegated open.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
|
|
Add support for clocks CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM,
thereby enabling wakeup alarm timers via file descriptors.
Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
|
|
Pull CIFS fixes from Steve French:
"Fixes for a couple of DFS problems, a problem with extended security
negotiation and two other small cifs fixes"
* 'for-3.10' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix composing of mount options for DFS referrals
cifs: stop printing the unc= option in /proc/mounts
cifs: fix error handling when calling cifs_parse_devname
cifs: allow sec=none mounts to work against servers that don't support extended security
cifs: fix potential buffer overrun when composing a new options string
cifs: only set ops for inodes in I_NEW state
|
|
Suppress the messages releating to processing the ext4 orphan list
("truncating inode" and "deleting unreferenced inode") unless the
debug option is on, since otherwise they end up taking up space in the
log that could be used for more useful information.
Tested by opening several files, unlinking them, then
crashing the system, rebooting the system and examining
/var/log/messages.
Addresses the problem described in http://crbug.com/220976
Signed-off-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Al Viro complained of a ton of bogosity with regards to the jbd2 block
tag header checksum. This one checksum is 16 bits, so cut off the
upper 16 bits and treat it as a 16-bit value and don't mess around
with be32* conversions. Fortunately metadata checksumming is still
"experimental" and not in a shipping e2fsprogs, so there should be few
users affected by this.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
This commit tries to use kmem_cache_zalloc instead of kmem_cache_alloc/
memset when a new journal head is alloctated.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
|
|
Correct spelling typo in various part of drivers
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
Expand information about posix-timers in /proc/<pid>/timers by adding
info about clock, with which the timer was created. I.e. in the forth
line of timer info after "notify:" line go "ClockID: <clock_id>".
Signed-off-by: Pavel Tikhomirov <snorcht@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Link: http://lkml.kernel.org/r/1368742323-46949-2-git-send-email-snorcht@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
If a file is linked with other files, it should be checkpointed at every fsync
calls.
For this, we use set_cp_file() with FADVISE_CP_BIT, but previously we didn't
cover the flag by the global lock.
This patch fixes that the inode page stores this correctly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
- iget/iput flow in the dentry recovery process
1. *dir* = f2fs_iget
2. set FI_DELAY_IPUT to *dir*
3. add *dir* to the dirty_dir_list
- __f2fs_add_link
- recover_dentry)
4. iput *dir* by remove_dirty_dir_inode
- sync_dirty_dir_inodes
- write_chekcpoint
If *dir*'s i_count is not 1 (i.e., root dir), remove_dirty_dir_inode is called
later and then iput is triggered again due to the FI_DELAY_IPUT flag.
So, let's unset the flag properly once iput is triggered.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
The error scenario is:
1. create /a
(1.a link /a /b)
2. sync
3. unlinke /a
4. create /a
5. fsync /a
6. Sudden power-off
When the f2fs recovers the fsynced dentry, /a, we discover an exsiting dentry at
f2fs_find_entry() in recover_dentry().
In such the case, we should unlink the existing dentry and its inode
and then recover newly created dentry.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
If there remains some unwritten blocks from the recovery, we should not call
iput on that directory inode.
Otherwise, we can loose some dentry blocks after the recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
when there is an error from kthread_run, then return proper error
rather than returning -ENOMEM.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
There are various functions with common code which could be separated
out to make common routines. So, made new routines and in order to
retain the same call path and no major changes, written some macros
to access those routines.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
There is no need to initialize few pointers in f2fs_parent_dir
as the values are not checked and instead directly initialized
values are used.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Some, counters are needed only for the statistical information
while debugging.
So, those can be controlled using CONFIG_F2FS_STAT_FS,
pushing the usage for few variables under this flag.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
The on-disk block address is defined as __le32, but in-memory block address,
block_t, does as u64.
Let's synchronize them to 32 bits.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
There is an error path where "dir" is an ERR_PTR.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Use the following helper function committed by Al.
commit 7de9c6ee3ecffd99e1628e81a5ea5468f7581a1f
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Sat Oct 23 11:11:40 2010 -0400
new helper: ihold()
...
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
If -ENOSPC is met during f2fs_link, we should not make the inode as bad.
The inode is still alive.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This patch adds error handling codes of check_index_in_prev_nodes and its
caller, do_recover_data.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This patch fixes the following deadlock bug during the recovery.
INFO: task mount:1322 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount D ffffffff81125870 0 1322 1266 0x00000000
ffff8801207e39d8 0000000000000046 ffff88012ab1dee0 0000000000000046
ffff8801207e3a08 ffff880115903f40 ffff8801207e3fd8 ffff8801207e3fd8
ffff8801207e3fd8 ffff880115903f40 ffff8801207e39d8 ffff88012fc94520
Call Trace:
[<ffffffff81125870>] ? __lock_page+0x70/0x70
[<ffffffff816a92d9>] schedule+0x29/0x70
[<ffffffff816a93af>] io_schedule+0x8f/0xd0
[<ffffffff8112587e>] sleep_on_page+0xe/0x20
[<ffffffff816a649a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81125867>] __lock_page+0x67/0x70
[<ffffffff8106c7b0>] ? autoremove_wake_function+0x40/0x40
[<ffffffff81126857>] find_lock_page+0x67/0x80
[<ffffffff8112698f>] find_or_create_page+0x3f/0xb0
[<ffffffffa03901a8>] ? sync_inode_page+0xa8/0xd0 [f2fs]
[<ffffffffa038fdf7>] get_node_page+0x67/0x180 [f2fs]
[<ffffffffa039818b>] recover_fsync_data+0xacb/0xff0 [f2fs]
[<ffffffff816aaa1e>] ? _raw_spin_unlock+0x3e/0x40
[<ffffffffa0389634>] f2fs_fill_super+0x7d4/0x850 [f2fs]
[<ffffffff81184cf9>] mount_bdev+0x1c9/0x210
[<ffffffffa0388e60>] ? validate_superblock+0x180/0x180 [f2fs]
[<ffffffffa0387635>] f2fs_mount+0x15/0x20 [f2fs]
[<ffffffff81185a13>] mount_fs+0x43/0x1b0
[<ffffffff81145ba0>] ? __alloc_percpu+0x10/0x20
[<ffffffff811a0796>] vfs_kern_mount+0x76/0x120
[<ffffffff811a2cb7>] do_mount+0x237/0xa10
[<ffffffff81140b9b>] ? strndup_user+0x5b/0x80
[<ffffffff811a3520>] SyS_mount+0x90/0xe0
[<ffffffff816b3502>] system_call_fastpath+0x16/0x1b
The bug is triggered when check_index_in_prev_nodes tries to get the direct
node page by calling get_node_page.
At this point, if the direct node page is already locked by get_dnode_of_data,
its caller, we got a deadlock condition.
This patch adds additional condition check for the reuse of locked direct node
pages prior to the get_node_page call.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
While an orphan inode has zero link_count, f2fs_gc is able to select the inode
for foreground gc.
- f2fs_gc
- do_garbage_collect
- gc_data_segment
: f2fs_iget is failed
: get_valid_blocks() != 0, so that retry
--> here we got the infinite loop.
This patch resolved this issue.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Introduce a simple macro function for readability.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This patch tries to avoid the following deadlock condition of which the reclaim
path can trigger f2fs_balance_fs again.
=================================
[ INFO: inconsistent lock state ]
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/41 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&sbi->gc_mutex){+.+.?.}, at: f2fs_balance_fs+0xe6/0x100 [f2fs]
{RECLAIM_FS-ON-W} state was registered at:
[<ffffffff810aa5a9>] mark_held_locks+0xb9/0x140
[<ffffffff810aae85>] lockdep_trace_alloc+0x85/0xf0
[<ffffffff8113ab2c>] __alloc_pages_nodemask+0x7c/0x9b0
[<ffffffff81175aa8>] alloc_pages_current+0xb8/0x180
[<ffffffff811319cf>] __page_cache_alloc+0xaf/0xd0
[<ffffffff8113225c>] find_or_create_page+0x4c/0xb0
[<ffffffffa021359e>] find_data_page+0x14e/0x210 [f2fs]
[<ffffffffa021161b>] f2fs_gc+0x9eb/0xd90 [f2fs]
[<ffffffffa0218fae>] f2fs_balance_fs+0xee/0x100 [f2fs]
[<ffffffffa020848c>] f2fs_setattr+0x6c/0x200 [f2fs]
[<ffffffff811ae51b>] notify_change+0x1db/0x3a0
[<ffffffff8118fbd0>] do_truncate+0x60/0xa0
[<ffffffff8118fd95>] vfs_truncate+0x185/0x1b0
[<ffffffff8118fe1c>] do_sys_truncate+0x5c/0xa0
[<ffffffff8118ffee>] SyS_truncate+0xe/0x10
[<ffffffff816e2b42>] system_call_fastpath+0x16/0x1b
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
If we met an error during the dentry recovery, we should not conduct checkpoint.
Otherwise, some errorneous dentry blocks overwrites the existing blocks that
contain the remaining recovery information.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
If we got an error after lock_page, we should unlock it before exit.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
The allocated page used by the recovery is not on HIGHMEM, so that we don't
need to use kmap/kunmap.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Few things can be changed in the default mkwrite function
1) Make file_update_time at the start before acquiring any lock
2) the condition page_offset(page) >= i_size_read(inode) should be
changed to page_offset(page) > i_size_read
3) Move wait_on_page_writeback.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
We can do this, since now we use a global mutex, f2fs_stat_mutex to protect its
list operations.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Code cleanup without behavior changed.
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Majianpeng reported a lockdep splat for f2fs. It turns out mutex_lock_all()
acquires an array of locks (in global/local lock style).
Any such operation is always serialized using cp_mutex, therefore there is no
fs_lock[] lock-order issue; tell lockdep about this using the
mutex_lock_nest_lock() primitive.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This patch adds some trivial debugging messages in the recovery process.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
I found a bug when testing power-off-recovery as follows.
[Bug Scenario]
1. create a file
2. fsync the file
3. reboot w/o any sync
4. try to recover the file
- found its fsync mark
- found its dentry mark
: try to recover its dentry
- get its file name
- get its parent inode number
: here we got zero value
The reason why we get the wrong parent inode number is that we didn't
synchronize the inode page with its newly created inode information perfectly.
Especially, previous f2fs stores fi->i_pino and writes it to the cached
node page in a wrong order, which incurs the zero-valued i_pino during the
recovery.
So, this patch modifies the creation flow to fix the synchronization order of
inode page with its inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This patch is for passing a locked node page to get_dnode_of_data.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
If get_dnode_of_data gets a locked node page, let's skip redundant
get_node_page calls.
This is for the futher enhancement.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|