Age | Commit message (Collapse) | Author |
|
The extents marked in pin_down_extent() will be unpinned later in
unpin_extent_range(), which decrements total_bytes_pinned.
pin_down_extent() must increment the counter to avoid underflowing it.
Also adjust btrfs_free_tree_block() to avoid accounting for the same
extent twice.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The value of flags is one of DATA/METADATA/SYSTEM, they must exist at
when add_pinned_bytes is called.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are a few places where we pass in a negative num_bytes, so make it
signed for clarity. Also move it up in the file since later patches will
need it there.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The XATTR_ITEM is a type of a directory item so we use the common
validator helper. Unlike other dir items, it can have data. The way the
name len validation is currently implemented does not reflect that. We'd
have to adjust by the data_len when comparing the read and item limits.
However, this will not work for multi-item xattr dir items.
Example from tree dump of generic/337:
item 7 key (257 XATTR_ITEM 751495445) itemoff 15667 itemsize 147
location key (0 UNKNOWN.0 0) type XATTR
transid 8 data_len 3 name_len 11
name: user.foobar
data 123
location key (0 UNKNOWN.0 0) type XATTR
transid 8 data_len 6 name_len 13
name: user.WvG1c1Td
data qwerty
location key (0 UNKNOWN.0 0) type XATTR
transid 8 data_len 5 name_len 19
name: user.J3__T_Km3dVsW_
data hello
At the point of btrfs_is_name_len_valid call we don't have access to the
data_len value of the 2nd and 3rd sub-item. So simple btrfs_dir_data_len(leaf,
di) would always return 3, although we'd need to get 6 and 5 respectively to
get the claculations right. (read_end + name_len + data_len vs item_end)
We'd have to also pass data_len externally, which is not point of the
name validation. The last check is supposed to test if there's at least
one dir item space after the one we're processing. I don't think this is
particularly useful, validation of the next item would catch that too.
So the check is removed and we don't weaken the validation. Now tests
btrfs/048, btrfs/053, generic/273 and generic/337 pass.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Wen reports significant memory leaks with DIF and O_DIRECT:
"With nvme devive + T10 enabled, On a system it has 256GB and started
logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
leaking.
/proc/meminfo | grep SUnreclaim...
SUnreclaim: 6752128 kB
SUnreclaim: 6874880 kB
SUnreclaim: 7238080 kB
....
SUnreclaim: 22307264 kB
SUnreclaim: 22485888 kB
SUnreclaim: 22720256 kB
When testcases with T10 enabled call into __blkdev_direct_IO_simple,
code doesn't free memory allocated by bio_integrity_alloc. The patch
fixes the issue. HTX has been run with +60 hours without failure."
Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
doesn't go through the regular bio free. This means that any ancillary
data allocated with the bio through the stack is not freed. Hence, we
can leak the integrity data associated with the bio, if the device is
using DIF/DIX.
Fix this by providing a bio_uninit() and export it, so that we can use
it to free this data. Note that this is a minimal fix for this issue.
Any current user of bio's that are allocated outside of
bio_alloc_bioset() suffers from this issue, most notably some drivers.
We will fix those in a more comprehensive patch for 4.13. This also
means that the commit marked as being fixed by this isn't the real
culprit, it's just the most obvious one out there.
Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NFS client bugfixes from Trond Myklebust:
"Bugfixes include:
- stable fix for exclusive create if the server supports the umask
attribute
- trunking detection should handle ERESTARTSYS/EINTR
- stable fix for a race in the LAYOUTGET function
- stable fix to revert "nfs_rename() handle -ERESTARTSYS dentry left
behind"
- nfs4_callback_free_slot() cannot call nfs4_slot_tbl_drain_complete()"
* tag 'nfs-for-4.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: nfs4_callback_free_slot() cannot call nfs4_slot_tbl_drain_complete()
Revert "NFS: nfs_rename() handle -ERESTARTSYS dentry left behind"
NFSv4.1: Fix a race in nfs4_proc_layoutget
NFS: Trunking detection should handle ERESTARTSYS/EINTR
NFSv4.2: Don't send mode again in post-EXCLUSIVE4_1 SETATTR with umask
|
|
Some architectures (at least PPC) doesn't like get/put_user with
64-bit types on a 32-bit system. Use the variably sized copy
to/from user variants instead.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When copying up a file that has multiple hard links we need to break any
association with the origin file. This makes copy-up be essentially an
atomic replace.
The new file has nothing to do with the old one (except having the same
data and metadata initially), so don't set the overlay.origin attribute.
We can relax this in the future when we are able to index upper object by
origin.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3a1e819b4e80 ("ovl: store file handle of lower inode on copy up")
|
|
Nothing prevents mischief on upper layer while we are busy copying up the
data.
Move the lookup right before the looked up dentry is actually used.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 01ad3eb8a073 ("ovl: concurrent copy up of regular files")
Cc: <stable@vger.kernel.org> # v4.11
|
|
The current code works only for the case where we have exactly one slot,
which is no longer true.
nfs4_free_slot() will automatically declare the callback channel to be
drained when all slots have been returned.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
This reverts commit 920b4530fb80430ff30ef83efe21ba1fa5623731 which could
call d_move() without holding the directory's i_mutex, and reverts commit
d4ea7e3c5c0e341c15b073016dbf3ab6c65f12f3 "NFS: Fix old dentry rehash after
move", which was a follow-up fix.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 920b4530fb80 ("NFS: nfs_rename() handle -ERESTARTSYS dentry left behind")
Cc: stable@vger.kernel.org # v4.10+
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If the task calling layoutget is signalled, then it is possible for the
calls to nfs4_sequence_free_slot() and nfs4_layoutget_prepare() to race,
in which case we leak a slot.
The fix is to move the call to nfs4_sequence_free_slot() into the
nfs4_layoutget_release() so that it gets called at task teardown time.
Fixes: 2e80dbe7ac51 ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Currently, it will return EIO in those cases.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Use memdup_user() helper instead of open-coding to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Now that all callers of the pmem api have been converted to dax helpers that
call back to the pmem driver, we can remove include/linux/pmem.h and
asm/pmem.h.
Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Define a set of write life time hints:
RWH_WRITE_LIFE_NOT_SET No hint information set
RWH_WRITE_LIFE_NONE No hints about write life time
RWH_WRITE_LIFE_SHORT Data written has a short life time
RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
RWH_WRITE_LIFE_LONG Data written has a long life time
RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time
The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.
Add an fcntl interface for querying these flags, and also for
setting them as well:
F_GET_RW_HINT Returns the read/write hint set on the
underlying inode.
F_SET_RW_HINT Set one of the above write hints on the
underlying inode.
F_GET_FILE_RW_HINT Returns the read/write hint set on the
file descriptor.
F_SET_FILE_RW_HINT Set one of the above write hints on the
file descriptor.
The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.
Sample program testing/implementing basic setting/getting of write
hints is below.
Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode has a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.
This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.
/*
* writehint.c: get or set an inode write hint
*/
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdbool.h>
#include <inttypes.h>
#ifndef F_GET_RW_HINT
#define F_LINUX_SPECIFIC_BASE 1024
#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };
int main(int argc, char *argv[])
{
uint64_t hint;
int fd, ret;
if (argc < 2) {
fprintf(stderr, "%s: file <hint>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
return 2;
}
if (argc > 2) {
hint = atoi(argv[2]);
ret = fcntl(fd, F_SET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_SET_RW_HINT");
return 4;
}
}
ret = fcntl(fd, F_GET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_GET_RW_HINT");
return 3;
}
printf("%s: hint %s\n", argv[1], str[hint]);
close(fd);
return 0;
}
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
... and lose HAVE_ARCH_...; if copy_{to,from}_user() on an
architecture sucks badly enough to make it a problem, we have
a worse problem.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
Merge in arm64 ACPI RAS support (APEI/GHES) from Tyler Baicar.
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Merge misc fixes from Andrew Morton:
"8 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
fs/exec.c: account for argv/envp pointers
ocfs2: fix deadlock caused by recursive locking in xattr
slub: make sysfs file removal asynchronous
lib/cmdline.c: fix get_options() overflow while parsing ranges
fs/dax.c: fix inefficiency in dax_writeback_mapping_range()
autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAIL
mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappings
mm, thp: remove cond_resched from __collapse_huge_page_copy
|
|
When limiting the argv/envp strings during exec to 1/4 of the stack limit,
the storage of the pointers to the strings was not included. This means
that an exec with huge numbers of tiny strings could eat 1/4 of the stack
limit in strings and then additional space would be later used by the
pointers to the strings.
For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721
single-byte strings would consume less than 2MB of stack, the max (8MB /
4) amount allowed, but the pointers to the strings would consume the
remaining additional stack space (1677721 * 4 == 6710884).
The result (1677721 + 6710884 == 8388605) would exhaust stack space
entirely. Controlling this stack exhaustion could result in
pathological behavior in setuid binaries (CVE-2017-1000365).
[akpm@linux-foundation.org: additional commenting from Kees]
Fixes: b6a2fea39318 ("mm: variable length argument support")
Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Another deadlock path caused by recursive locking is reported. This
kind of issue was introduced since commit 743b5f1434f5 ("ocfs2: take
inode lock in ocfs2_iop_set/get_acl()"). Two deadlock paths have been
fixed by commit b891fa5024a9 ("ocfs2: fix deadlock issue when taking
inode lock at vfs entry points"). Yes, we intend to fix this kind of
case in incremental way, because it's hard to find out all possible
paths at once.
This one can be reproduced like this. On node1, cp a large file from
home directory to ocfs2 mountpoint. While on node2, run
setfacl/getfacl. Both nodes will hang up there. The backtraces:
On node1:
__ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
ocfs2_write_begin+0x43/0x1a0 [ocfs2]
generic_perform_write+0xa9/0x180
__generic_file_write_iter+0x1aa/0x1d0
ocfs2_file_write_iter+0x4f4/0xb40 [ocfs2]
__vfs_write+0xc3/0x130
vfs_write+0xb1/0x1a0
SyS_write+0x46/0xa0
On node2:
__ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
ocfs2_xattr_set+0x12e/0xe80 [ocfs2]
ocfs2_set_acl+0x22d/0x260 [ocfs2]
ocfs2_iop_set_acl+0x65/0xb0 [ocfs2]
set_posix_acl+0x75/0xb0
posix_acl_xattr_set+0x49/0xa0
__vfs_setxattr+0x69/0x80
__vfs_setxattr_noperm+0x72/0x1a0
vfs_setxattr+0xa7/0xb0
setxattr+0x12d/0x190
path_setxattr+0x9f/0xb0
SyS_setxattr+0x14/0x20
Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
exported by commit 439a36b8ef38 ("ocfs2/dlmglue: prepare tracking logic
to avoid recursive cluster lock").
Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com
Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Eric Ren <zren@suse.com>
Reported-by: Thomas Voegtle <tv@lio96.de>
Tested-by: Thomas Voegtle <tv@lio96.de>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
dax_writeback_mapping_range() fails to update iteration index when
searching radix tree for entries needing cache flushing. Thus each
pagevec worth of entries is searched starting from the start which is
inefficient and prone to livelocks. Update index properly.
Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz
Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If a positive status is passed with the AUTOFS_DEV_IOCTL_FAIL ioctl,
autofs4_d_automount() will return
ERR_PTR(status)
with that status to follow_automount(), which will then dereference an
invalid pointer.
So treat a positive status the same as zero, and map to ENOENT.
See comment in systemd src/core/automount.c::automount_send_ready().
Link: http://lkml.kernel.org/r/871sqwczx5.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Ian Kent <raven@themaw.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull xfs fixes from Darrick Wong:
"I have one more bugfix for you for 4.12-rc7 to fix a disk corruption
problem:
- don't allow swapon on files on the realtime device, because the
swap code will swap pages out to blocks on the data device, thereby
corrupting the filesystem"
* tag 'xfs-4.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: don't allow bmap on rt files
|
|
Pull cifs fixes from Steve French:
"Various small fixes for stable"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix some return values in case of error in 'crypt_message'
cifs: remove redundant return in cifs_creation_time_get
CIFS: Improve readdir verbosity
CIFS: check if pages is null rather than bv for a failed allocation
CIFS: Set ->should_dirty in cifs_user_readv()
|
|
bmap returns a dumb LBA address but not the block device that goes with
that LBA. Swapfiles don't care about this and will blindly assume that
the data volume is the correct blockdev, which is totally bogus for
files on the rt subvolume. This results in the swap code doing IOs to
arbitrary locations on the data device(!) if the passed in mapping is a
realtime file, so just turn off bmap for rt files.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more ufs fixes from Al Viro:
"More UFS fixes, unfortunately including build regression fix for the
64-bit s_dsize commit. Fixed in this pile:
- trivial bug in signedness of 32bit timestamps on ufs1
- ESTALE instead of ufs_error() when doing open-by-fhandle on
something deleted
- build regression on 32bit in ufs_new_fragments() - calculating that
many percents of u64 pulls libgcc stuff on some of those. Mea
culpa.
- fix hysteresis loop broken by typo in 2.4.14.7 (right next to the
location of previous bug).
- fix the insane limits of said hysteresis loop on filesystems with
very low percentage of reserved blocks. If it's 5% or less, just
use the OPTSPACE policy.
- calculate those limits once and mount time.
This tree does pass xfstests clean (both ufs1 and ufs2) and it _does_
survive cross-builds.
Again, my apologies for missing that, especially since I have noticed
a related percentage-of-64bit issue in earlier patches (when dealing
with amount of reserved blocks). Self-LART applied..."
* 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs: fix the logics for tail relocation
ufs_iget(): fail with -ESTALE on deleted inode
fix signedness of timestamps on ufs1
|
|
Call verify_dir_item before memcmp_extent_buffer reading name from
dir_item.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_del_root_ref calls btrfs_search_slot and reads name from root_ref.
Call btrfs_is_name_len_valid before memcmp.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In btrfs_get_name, there's btrfs_search_slot and reads name from
inode_ref/root_ref.
Call btrfs_is_name_len_valid in btrfs_get_name.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since iterate_dir_item checks name_len in its own way,
so use btrfs_is_name_len_valid not 'verify_dir_item' to make more strict
name_len check.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switched ENAMETOOLONG to EIO ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In btrfs_log_inode, btrfs_search_forward gets the buffer and then
btrfs_check_ref_name_override will read name from ref/extref for the
first time.
Call btrfs_is_name_len_valid before reading name.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
replay_xattr_deletes calls btrfs_search_slot to get buffer and reads
name.
Call verify_dir_item to check name_len in replay_xattr_deletes to avoid
reading out of boundary.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
replay_one_buffer first reads buffers and dispatches items accroding to
the item type.
In this patch, add_inode_ref handles inode_ref and inode_extref.
Then add_inode_ref calls ref_get_fields and extref_get_fields to read
ref/extref name for the first time.
So checking name_len before reading those two is fine.
add_inode_ref also calls inode_in_dir to match ref/extref in parent_dir.
The call graph includes btrfs_match_dir_item_name to read dir_item name
in the parent dir.
Checking first dir_item is not enough. Change it to verify every
dir_item while doing matches.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Originally, verify_dir_item verifies name_len of dir_item with fixed
values but not item boundary.
If corrupted name_len was not bigger than the fixed value, for example
255, the function will think the dir_item is fine. And then reading
beyond boundary will cause crash.
Example:
1. Corrupt one dir_item name_len to be 255.
2. Run 'ls -lar /mnt/test/ > /dev/null'
dmesg:
[ 48.451449] BTRFS info (device vdb1): disk space caching is enabled
[ 48.451453] BTRFS info (device vdb1): has skinny extents
[ 48.489420] general protection fault: 0000 [#1] SMP
[ 48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
[ 48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
[ 48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
[ 48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
[ 48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
[ 48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
[ 48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
[ 48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
[ 48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
[ 48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
[ 48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
[ 48.490008] FS: 00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
[ 48.490008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
[ 48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 48.490008] Call Trace:
[ 48.490008] btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
[ 48.490008] iterate_dir+0x181/0x1b0
[ 48.490008] SyS_getdents+0xa7/0x150
[ 48.490008] ? fillonedir+0x150/0x150
[ 48.490008] entry_SYSCALL_64_fastpath+0x18/0xad
[ 48.490008] RIP: 0033:0x7fef5032546b
[ 48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
[ 48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
[ 48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
[ 48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
[ 48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
[ 48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
[ 48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
[ 48.499455] ---[ end trace 321920d8e8339505 ]---
Fix it by adding a parameter @slot and check name_len with item boundary
by calling btrfs_is_name_len_valid.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
rev
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Introduce function btrfs_is_name_len_valid.
The function compares parameter @name_len with item boundary then
returns true if name_len is valid.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We should really just wait in wait_dev_flush and let the caller decide
what to do with the error value.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Similar to what submit_bio_wait does, we should account for IO while
waiting for a bio completion. This has marginal visible effects, flush
bio is short-lived.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.
Instead, we can allocate the bio at the same time we allocate the device
and reuse it each time we need to flush the barriers. The bio is reset
before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).
The bio used to be submitted through the integrity checker which will
find out that bio has no data attached and call submit_bio.
Status of the bio in flight needs to be tracked separately in case the
device caches get switched off between write and wait.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
An incremental send can contain unlink operations with an invalid target
path when we rename some directory inode A, then rename some file inode B
to the old name of inode A and directory inode A is an ancestor of inode B
in the parent snapshot (but not anymore in the send snapshot).
Consider the following example scenario where this issue happens.
Parent snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
| |--- file1 (ino 259)
| |--- file3 (ino 261)
|
|--- dir3/ (ino 262)
|--- file22 (ino 260)
|--- dir4/ (ino 263)
Send snapshot:
. (ino 256)
|
|--- dir1/ (ino 257)
|--- dir2/ (ino 258)
|--- dir3 (ino 260)
|--- file3/ (ino 262)
|--- dir4/ (ino 263)
|--- file11 (ino 269)
|--- file33 (ino 261)
When attempting to apply the corresponding incremental send stream, an
unlink operation contains an invalid path which makes the receiver fail.
The following is verbose output of the btrfs receive command:
receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
utimes
utimes dir1
utimes dir1/dir2
link dir1/dir3/dir4/file11 -> dir1/dir2/file1
unlink dir1/dir2/file1
utimes dir1/dir2
truncate dir1/dir3/dir4/file11 size=0
utimes dir1/dir3/dir4/file11
rename dir1/dir3 -> o262-7-0
link dir1/dir3 -> o262-7-0/file22
unlink dir1/dir3/file22
ERROR: unlink dir1/dir3/file22 failed. Not a directory
The following steps happen during the computation of the incremental send
stream the lead to this issue:
1) Before we start processing the new and deleted references for inode
260, we compute the full path of the deleted reference
("dir1/dir3/file22") and cache it in the list of deleted references
for our inode.
2) We then start processing the new references for inode 260, for which
there is only one new, located at "dir1/dir3". When processing this
new reference, we check that inode 262, which was not yet processed,
collides with the new reference and because of that we orphanize
inode 262 so its new full path becomes "o262-7-0".
3) After the orphanization of inode 262, we create the new reference for
inode 260 by issuing a link command with a target path of "dir1/dir3"
and a source path of "o262-7-0/file22".
4) We then start processing the deleted references for inode 260, for
which there is only one with the base name of "file22", and issue
an unlink operation containing the target path computed at step 1,
which is wrong because that path no longer exists and should be
replaced with "o262-7-0/file22".
So fix this issue by recomputing the full path of deleted references if
when we processed the new references for an inode we ended up orphanizing
any other inode that is an ancestor of our inode in the parent snapshot.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently an incremental snapshot can generate link operations which
contain an invalid target path. Such case happens when in the send
snapshot a file was renamed, a new hard link added for it and some
other inode (with a lower number) got renamed to the former name of
that file. Example:
Parent snapshot
. (ino 256)
|
|--- f1 (ino 257)
|--- f2 (ino 258)
|--- f3 (ino 259)
Send snapshot
. (ino 256)
|
|--- f2 (ino 257)
|--- f3 (ino 258)
|--- f4 (ino 259)
|--- f5 (ino 258)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, inode 258 is orphanized (renamed to
"o258-7-0"), because its current reference has the same name as the
new reference for inode 257;
2) When processing inode 258, we iterate over all its new references,
which have the names "f3" and "f5". The first iteration sees name
"f5" and renames the inode from its orphan name ("o258-7-0") to
"f5", while the second iteration sees the name "f3" and, incorrectly,
issues a link operation with a target name matching the orphan name,
which no longer exists. The first iteration had reset the current
valid path of the inode to "f5", but in the second iteration we lost
it because we found another inode, with a higher number of 259, which
has a reference named "f3" as well, so we orphanized inode 259 and
recomputed the current valid path of inode 258 to its old orphan
name because inode 259 could be an ancestor of inode 258 and therefore
the current valid path could contain the pre-orphanization name of
inode 259. However in this case inode 259 is not an ancestor of inode
258 so the current valid path should not be recomputed.
This makes the receiver fail with the following error:
ERROR: link f3 -> o258-7-0 failed: No such file or directory
So fix this by not recomputing the current valid path for an inode
whenever we find a colliding reference from some not yet processed inode
(inode number higher then the one currently being processed), unless
that other inode is an ancestor of the one we are currently processing.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While punching a hole in a range that is not aligned with the sector size
(currently the same as the page size) we can end up leaving an extent map
in memory with a length that is smaller then the sector size or with a
start offset that is not aligned to the sector size. Both cases are not
expected and can lead to problems. This issue is easily detected
after the patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of
inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
following for example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
$ xfs_io -c "fpunch 60K 90K" /mnt/foo
$ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
$ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
$ umount /mnt
After the unmount operation we can see several warnings emmitted due to
underflows related to space reservation counters:
[ 2837.443299] ------------[ cut here ]------------
[ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se
rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene
ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1
[ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.462379] Call Trace:
[ 2837.462379] dump_stack+0x68/0x92
[ 2837.462379] __warn+0xc2/0xdd
[ 2837.462379] warn_slowpath_null+0x1d/0x1f
[ 2837.462379] btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.462379] destroy_inode+0x3d/0x55
[ 2837.462379] evict+0x177/0x17e
[ 2837.462379] dispose_list+0x50/0x71
[ 2837.462379] evict_inodes+0x132/0x141
[ 2837.462379] generic_shutdown_super+0x3f/0xeb
[ 2837.462379] kill_anon_super+0x12/0x1c
[ 2837.462379] btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.462379] deactivate_locked_super+0x30/0x68
[ 2837.462379] deactivate_super+0x36/0x39
[ 2837.462379] cleanup_mnt+0x58/0x76
[ 2837.462379] __cleanup_mnt+0x12/0x14
[ 2837.462379] task_work_run+0x77/0x9b
[ 2837.462379] prepare_exit_to_usermode+0x9d/0xc5
[ 2837.462379] syscall_return_slowpath+0x196/0x1b9
[ 2837.462379] entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
[ 2837.596256] ------------[ cut here ]------------
[ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1
[ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.663359] Call Trace:
[ 2837.663359] dump_stack+0x68/0x92
[ 2837.663359] __warn+0xc2/0xdd
[ 2837.663359] warn_slowpath_null+0x1d/0x1f
[ 2837.663359] btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.663359] close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.663359] ? evict_inodes+0x132/0x141
[ 2837.663359] btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.663359] generic_shutdown_super+0x6a/0xeb
[ 2837.663359] kill_anon_super+0x12/0x1c
[ 2837.663359] btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.663359] deactivate_locked_super+0x30/0x68
[ 2837.663359] deactivate_super+0x36/0x39
[ 2837.663359] cleanup_mnt+0x58/0x76
[ 2837.663359] __cleanup_mnt+0x12/0x14
[ 2837.663359] task_work_run+0x77/0x9b
[ 2837.663359] prepare_exit_to_usermode+0x9d/0xc5
[ 2837.663359] syscall_return_slowpath+0x196/0x1b9
[ 2837.663359] entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
[ 2837.745595] ------------[ cut here ]------------
[ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
[ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1
[ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.758526] Call Trace:
[ 2837.758925] dump_stack+0x68/0x92
[ 2837.759383] __warn+0xc2/0xdd
[ 2837.759383] warn_slowpath_null+0x1d/0x1f
[ 2837.759383] btrfs_free_block_groups+0x261/0x3eb [btrfs]
[ 2837.759383] close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.759383] ? evict_inodes+0x132/0x141
[ 2837.759383] btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.759383] generic_shutdown_super+0x6a/0xeb
[ 2837.759383] kill_anon_super+0x12/0x1c
[ 2837.759383] btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.759383] deactivate_locked_super+0x30/0x68
[ 2837.759383] deactivate_super+0x36/0x39
[ 2837.759383] cleanup_mnt+0x58/0x76
[ 2837.759383] __cleanup_mnt+0x12/0x14
[ 2837.759383] task_work_run+0x77/0x9b
[ 2837.759383] prepare_exit_to_usermode+0x9d/0xc5
[ 2837.759383] syscall_return_slowpath+0x196/0x1b9
[ 2837.759383] entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
[ 2837.778235] ------------[ cut here ]------------
[ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
[ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1
[ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.800118] Call Trace:
[ 2837.800515] dump_stack+0x68/0x92
[ 2837.801015] __warn+0xc2/0xdd
[ 2837.801471] warn_slowpath_null+0x1d/0x1f
[ 2837.801698] btrfs_free_block_groups+0x348/0x3eb [btrfs]
[ 2837.801698] close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.801698] ? evict_inodes+0x132/0x141
[ 2837.801698] btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.801698] generic_shutdown_super+0x6a/0xeb
[ 2837.801698] kill_anon_super+0x12/0x1c
[ 2837.801698] btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.801698] deactivate_locked_super+0x30/0x68
[ 2837.801698] deactivate_super+0x36/0x39
[ 2837.801698] cleanup_mnt+0x58/0x76
[ 2837.801698] __cleanup_mnt+0x12/0x14
[ 2837.801698] task_work_run+0x77/0x9b
[ 2837.801698] prepare_exit_to_usermode+0x9d/0xc5
[ 2837.801698] syscall_return_slowpath+0x196/0x1b9
[ 2837.801698] entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
[ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
[ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0
What happens in the above example is the following:
1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
This results in the creation of an extent map with a length of 2Kb
starting at file offset 148Kb, through find_first_non_hole() ->
btrfs_get_extent().
2) The second write (first write after the hole punch operation), sets
the range [50Kb, 152Kb[ to delalloc.
3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
map covering the range [148Kb, 150Kb[ and ends up calling
set_extent_bit() for the same range, which results in splitting an
existing extent state record, covering the range [148Kb, 152Kb[ into
two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
[150Kb, 152Kb[.
4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
callback being invoked against the two 2Kb extent state records that
cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
with a length argument of 2048 bytes. That function rounds up the length
to a sector size aligned length, so it ends up considering a length of
4096 bytes, and then calls calc_csum_metadata_size() which results in
decrementing the inode's csum_bytes counter by 4096 bytes, so after
it stays a value of 0 bytes. Then the same happens when
btrfs_clear_bit_hook() is called against the second extent state that
has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
rounded up to 4096 and calc_csum_metadata_size() ends up being called
to decrement 4096 bytes from the inode's csum_bytes counter, which
at that time has a value of 0, leading to an underflow, which is
exactly what triggers the first warning, at btrfs_destroy_inode().
All the other warnings relate to several space accounting counters
that underflow as well due to similar reasons.
A similar case but where the hole punching operation creates an extent map
with a start offset not aligned to the sector size is the following:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
$ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
$ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
$ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
$ umount /mnt
During the unmount operation we get similar traces for the same reasons as
in the first example.
So fix the hole punching operation to make sure it never creates extent
maps with a length that is not aligned to the sector size nor with a start
offset that is not aligned to the sector size, as this breaks all
assumptions and it's a land mine.
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
Cc: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|