Age | Commit message (Collapse) | Author |
|
The "notify_timeout" rbd device option is never used, so get rid of
it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
Add the ability to map an rbd image read-only, by specifying either
"read_only" or "ro" as an option on the rbd "command line." Also
allow the inverse to be explicitly specified using "read_write" or
"rw".
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
The rbd options don't really apply to the ceph client. So don't
store a pointer to it in the ceph_client structure, and put them
(a struct, not a pointer) into the rbd_dev structure proper.
Pass the rbd device structure to rbd_client_create() so it can
assign rbd_dev->rbdc if successful, and have it return an error code
instead of the rbd client pointer.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
This just rearranges things a bit more in rbd_header_from_disk()
so that the snapshot sizes are initialized right after the buffer
to hold them is allocated and doing a little further consolidation
that follows from that. Also adds a few simple comments.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
The only thing the on-disk snap_names_len field is needed is to
size the buffer allocated to hold a copy of the snapshot names
for an rbd image.
So don't bother saving it in the in-core rbd_image_header structure.
Just use a local variable to hold the required buffer size while
it's needed.
Move the code that actually copies the snapshot names up closer
to where the required length is saved.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
In rbd_header_from_disk() the object prefix buffer is sized based on
the maximum size it's block_name equivalent on disk could be.
Instead, only allocate enough to hold null-terminated string from
the on-disk header--or the maximum size of no NUL is found.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
There is only caller of __rbd_client_find(), and it somewhat
clumsily gets the appropriate lock and gets a reference to the
existing ceph_client structure if it's found.
Instead, have that function handle its own locking, and acquire the
reference if found while it holds the lock. Drop the underscores
from the name because there's no need to signify anything special
about this function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
|
|
Using list_move_tail() instead of list_del() + list_add_tail().
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
This fixes a bug that went in with this commit:
commit f6e0c99092cca7be00fca4080cfc7081739ca544
Author: Alex Elder <elder@inktank.com>
Date: Thu Aug 2 11:29:46 2012 -0500
rbd: simplify __rbd_init_snaps_header()
The problem is that a new rbd snapshot needs to go either after an
existing snapshot entry, or at the *end* of an rbd device's snapshot
list. As originally coded, it is placed at the beginning. This was
based on the assumption the list would be empty (so it wouldn't
matter), but in fact if multiple new snapshots are added to an empty
list in one shot the list will be non-empty after the first one is
added.
This addresses http://tracker.newdream.net/issues/3063
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
In the on-disk image header structure there is a field "block_name"
which represents what we now call the "object prefix" for an rbd
image. Rename this field "object_prefix" to be consistent with
modern usage.
This appears to be the only remaining vestige of the use of "block"
in symbols that represent objects in the rbd code.
This addresses http://tracker.newdream.net/issues/1761
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
|
|
Make ceph_monc_do_poolop() static to remove the following sparse warning:
* net/ceph/mon_client.c:616:5: warning: symbol 'ceph_monc_do_poolop' was not
declared. Should it be static?
Also drops the 'ceph_monc_' prefix, now being a private function.
Signed-off-by: Iulius Curt <icurt@ixiacom.com>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
This is unused; use monc->client->have_fsid.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
A recent change to /sbin/mountall causes any trailing '/' character
in the "device" (or fs_spec) field in /etc/fstab to be stripped. As
a result, an entry for a ceph mount that intends to mount the root
of the name space ends up with now path portion, and the ceph mount
option processing code rejects this.
That is, an entry in /etc/fstab like:
cephserver:port:/ /mnt ceph defaults 0 0
provides to the ceph code just "cephserver:port:" as the "device,"
and that gets rejected.
Although this is a bug in /sbin/mountall, we can have the ceph mount
code support an empty/nonexistent path, interpreting it to mean the
root of the name space.
RFC 5952 offers recommendations for how to express IPv6 addresses,
and recommends the usage found in RFC 3986 (which specifies the
format for URI's) for representing both IPv4 and IPv6 addresses that
include port numbers. (See in particular the definition of
"authority" found in the Appendix of RFC 3986.)
According to those standards, no host specification will ever
contain a '/' character. As a result, it is sufficient to scan a
provided "device" from an /etc/fstab entry for the first '/'
character, and if it's found, treat that as the beginning of the
path. If no '/' character is present, we can treat the entire
string as the monitor host specification(s), and assume the path
to be the root of the name space. We'll still require a ':' to
separate the host portion from the (possibly empty) path portion.
This means that we can more formally define how ceph will interpret
the "device" it's provided when processing a mount request:
"device" will look like:
<server_spec>[,<server_spec>...]:[<path>]
where
<server_spec> is <ip>[:<port>]
<path> is optional, but if present must begin with '/'
This addresses http://tracker.newdream.net/issues/2919
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
|
|
Right now rbd_read_header() both reads the header object for an rbd
image and decodes its contents. It does this repeatedly if needed,
in order to ensure a complete and intact header is obtained.
Separate this process into two steps--reading of the raw header
data (in new function, rbd_dev_v1_header_read()) and separately
decoding its contents (in rbd_header_from_disk()). As a result,
the latter function no longer requires its allocated_snaps argument.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
Add checks on the validity of the snap_count and snap_names_len
field values in rbd_dev_ondisk_valid(). This eliminates the
need to do them in rbd_header_from_disk().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
The only caller of rbd_header_from_disk() is rbd_read_header().
It passes as allocated_snaps the number of snapshots it will
have received from the server for the snapshot context that
rbd_header_from_disk() is to interpret. The first time through
it provides 0--mainly to extract the number of snapshots from
the snapshot context header--so that it can allocate an
appropriately-sized buffer to receive the entire snapshot
context from the server in a second request.
rbd_header_from_disk() will not fill in the array of snapshot ids
unless the number in the snapshot matches the number the caller
had allocated.
This patch adjusts that logic a little further to be more efficient.
rbd_read_header() doesn't even examine the snapshot context unless
the snapshot count (stored in header->total_snaps) matches the
number of snapshots allocated. So rbd_header_from_disk() doesn't
need to allocate or fill in the snapshot context field at all in
that case.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
This just moves code around for the most part. It was pulled out as
a separate patch to avoid cluttering up some upcoming patches which
are more substantive. The point is basically to group everything
related to initializing the snapshot context together.
The only functional change is that rbd_header_from_disk() now
ensures the (in-core) header it is passed is zero-filled. This
allows a simpler error handling path in rbd_header_from_disk().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
Fix a few spots in rbd_header_from_disk() to use sizeof (object)
rather than sizeof (type). Use a local variable to record sizes
to shorten some lines and improve readability.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
Fix a number of spots where a pointer value that is known to
have become invalid but was not reset to null.
Also, toss in a change so we use sizeof (object) rather than
sizeof (type).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
The snap_names_len field of an rbd_image_header structure is defined
with type size_t. That field is used as both the source and target
of 64-bit byte-order swapping operations though, so it's best to
define it with type u64 instead.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
The purpose of __rbd_init_snaps_header() is to compare a new
snapshot context with an rbd device's list of existing snapshots.
It updates the list by adding any new snapshots or removing any
that are not present in the new snapshot context.
The code as written is a little confusing, because it traverses both
the existing snapshot list and the set of snapshots in the snapshot
context in reverse. This was done based on an assumption about
snapshots that is not true--namely that a duplicate snapshot name
could cause an error in intepreting things if they were not
processed in ascending order.
These precautions are not necessary, because:
- all snapshots are uniquely identified by their snapshot id
- a new snapshot cannot be created if the rbd device has another
snapshot with the same name
(It is furthermore not currently possible to rename a snapshot.)
This patch re-implements __rbd_init_snaps_header() so it passes
through both the existing snapshot list and the entries in the
snapshot context in forward order. It still does the same thing
as before, but I find the logic considerably easier to understand.
By going forward through the names in the snapshot context, there
is no longer a need for the rbd_prev_snap_name() helper function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
Pull TTY changes from Greg Kroah-Hartman:
"As we skipped the merge window for 3.6-rc1 for the tty tree,
everything is now settled down and working properly, so we are ready
for 3.7-rc1. Here's the patchset, it's big, but the large changes are
removing a firmware file and adding a staging tty driver (it depended
on the tty core changes, so it's going through this tree instead of
the staging tree.)
All of these patches have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fix up more-or-less trivial conflicts in
- drivers/char/pcmcia/synclink_cs.c:
tty NULL dereference fix vs tty_port_cts_enabled() helper function
- drivers/staging/{Kconfig,Makefile}:
add-add conflict (dgrp driver added close to other staging drivers)
- drivers/staging/ipack/devices/ipoctal.c:
"split ipoctal_channel from iopctal" vs "TTY: use tty_port_register_device"
* tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (235 commits)
tty/serial: Add kgdb_nmi driver
tty/serial/amba-pl011: Quiesce interrupts in poll_get_char
tty/serial/amba-pl011: Implement poll_init callback
tty/serial/core: Introduce poll_init callback
kdb: Turn KGDB_KDB=n stubs into static inlines
kdb: Implement disable_nmi command
kernel/debug: Mask KGDB NMI upon entry
serial: pl011: handle corruption at high clock speeds
serial: sccnxp: Make 'default' choice in switch last
serial: sccnxp: Remove mask termios caps for SW flow control
serial: sccnxp: Report actual baudrate back to core
serial: samsung: Add poll_get_char & poll_put_char
Powerpc 8xx CPM_UART setting MAXIDL register proportionaly to baud rate
Powerpc 8xx CPM_UART maxidl should not depend on fifo size
Powerpc 8xx CPM_UART too many interrupts
Powerpc 8xx CPM_UART desynchronisation
serial: set correct baud_base for EXSYS EX-41092 Dual 16950
serial: omap: fix the reciever line error case
8250: blacklist Winbond CIR port
8250_pnp: do pnp probe before legacy probe
...
|
|
This reverts commit 0885ef5b5601e9b007c383e77c172769b1f214fd
After applying the above patch, the performance slowed down because the dirty
page flush can only be done by one task, so revert it.
The following is the test result of sysbench:
Before After
24MB/s 39MB/s
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
Everybody is just making stuff up, and it's just used to see if we really do
need to alloc a chunk, and since we do this when we already know we really
do it's just a waste of space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
So we have lots of places where we try to preallocate chunks in order to
make sure we have enough space as we make our allocations. This has
historically meant that we're constantly tweaking when we should allocate a
new chunk, and historically we have gotten this horribly wrong so we way
over allocate either metadata or data. To try and keep this from happening
we are going to make it so that the block group item insertion is done out
of band at the end of a transaction. This will allow us to create chunks
even if we are trying to make an allocation for the extent tree. With this
patch my enospc tests run faster (didn't expect this) and more efficiently
use the disk space (this is what I wanted). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
For immutable bio vecs, I've been auditing and removing bi_idx
references. These were harmless, but removing them will make auditing
easier.
scrub_bio_end_io_worker() was open coding a bio_reset() - but this
doesn't appear to have been needed for anything as right after it does a
bio_put(), and perusing the code it doesn't appear anything else was
holding a reference to the bio.
The other use end_bio_extent_readpage() was just for a pr_debug() -
changed it to something that might be a bit more useful.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Stefan Behrens <sbehrens@giantdisaster.de>
|
|
When we wrote some data by compress mode into a btrfs filesystem which was full
of the fragments, the kernel will report:
BTRFS warning (device xxx): Aborting unused transaction.
The reason is:
We can not find a long enough free space to store the compressed data because
of the fragmentary free space, and the compressed data can not be splited,
so the kernel outputed the above message.
In fact, btrfs can deal with this problem very well: it fall back to
uncompressed IO, split the uncompressed data into small ones, and then
store them into to the fragmentary free space. So we shouldn't output the
above warning message.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
Wade Cline reported a problem where he was getting garbage and warnings when
writing to a preallocated range via O_DIRECT. This is because we weren't
creating our normal pinned extent_map for the range we were writing to,
which was causing all sorts of issues. This patch fixes the problem and
makes his testcase much happier. Thanks,
Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Sage reported the following lockdep backtrace
=====================================
[ BUG: bad unlock balance detected! ]
3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
-------------------------------------
btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
[<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
but there are no more locks to release!
other info that might help us debug this:
1 lock held by btrfs-cleaner/7607:
#0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]
stack backtrace:
Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
Call Trace:
[<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
[<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
[<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
[<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
[<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff810b2a95>] lock_release+0xd5/0x220
[<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
[<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
[<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
[<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
[<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
[<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
[<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
[<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
[<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
[<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
[<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
[<ffffffff810791ee>] kthread+0xae/0xc0
[<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[<ffffffff81635430>] ? retint_restore_args+0x13/0x13
[<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
[<ffffffff8163e740>] ? gs_change+0x13/0x13
This is because the throttle stuff can commit the transaction, which expects to
be the one stopping the intwrite stuff, but we've already done it in the
__btrfs_end_transaction. Moving the sb_end_intewrite after this logic makes the
lockdep go away. Thanks,
Tested-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
This is the change of the kernel side.
Translation of logical to inode used to have an upper limit 4k on
inode container's size, but the limit is not large enough for a data
with a great many of refs, so when resolving logical address,
we can end up with
"ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"
This changes to regard 64k as the upper limit and use vmalloc instead of
kmalloc to get memory more easily.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
|
We already have a helper, iterate_inodes_from_logical(), for logical resolve,
so just use it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
|
In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.
It is possible to lose our errors because
(-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.
I'm not sure if it is on purpose, it just looks too hacky if it is.
I'd rather use a real flag and a 'ret' to catch errors.
Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
|
|
We've added a new field 'sequence' to delayed ref node, so update related
tracepoints.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
|
As ref cache has been removed from btrfs, there is no user on
its lock and its check.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
|
When we delete a inode, we will remove all the delayed items including delayed
inode update, and then truncate all the relative metadata. If there is lots of
metadata, we will end the current transaction, and start a new transaction to
truncate the left metadata. In this way, we will leave a inode item that its
link counter is > 0, and also may leave some directory index items in fs/file tree
after the current transaction ends. In other words, the metadata in this fs/file tree
is inconsistent. If we create a snapshot for this tree now, we will find a inode with
corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
because its link counter is not 0.
We fix this problem by updating the inode item before the current transaction ends.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
Usecase:
watch 'grep btrfs < /proc/slabinfo'
easy to watch all caches in one go.
Signed-off-by: David Sterba <dsterba@suse.cz>
|
|
I noticed I was seeing large lags when running my torrent test in a vm on my
laptop. While trying to make it lag less I noticed that our overcommit math
was taking into account the number of bytes we wanted to reclaim, not the
number of bytes we actually wanted to allocate, which means we wouldn't
overcommit as often. This patch fixes the overcommit math and makes
shrink_delalloc() use that logic so that it will stop looping faster. We
still have pretty high spikes of latency, but the test now takes 3 minutes
less time (about 5% faster). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Mitch reported a problem where you could get an ENOSPC error when untarring
a kernel git tree onto a 16gb file system with compress-force=zlib. This is
because compression is a huge pain, it will return from ->writepages()
without having actually created any ordered extents. To get around this we
check to see if the async submit counter is up, and if it is wait until it
drops to 0 before doing our normal ordered wait dance. With this patch I
can now untar a kernel git tree onto a 16gb file system without getting
ENOSPC errors. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:
We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.
Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
|
|
ulist_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
|
|
btrfs_iget() never return NULL.
So, NULL check is unnecessary.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
|
|
When we ran fsstress(a program in xfstests), the filesystem hung up when it
is full. It was because the space reserved in btrfs_fallocate() was wrong,
btrfs_fallocate() just used the size of the pre-allocation to reserve the
space, didn't took the block size aligning into account, so the size of
the reserved space was less than the allocated space, it caused the over
reserve problem and made the filesystem hung up when invoking cow_file_range().
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
Though we dump the stack information when aborting a unused transaction
handle, we don't know the correct place where we decide to abort the
transaction handle if one function has several place where the transaction
abort function is invoked and jumps to the same place after this call.
And beside that we also don't know the reason why we jump to abort
the current handle. So I modify the transaction abort function and make
it output the function name, line and error information.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
We should insert/update 6 items(root ref, root backref, dir item, dir index,
root item and parent inode) when creating a snapshot, not 5 items, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
The snapshot should be the image of the fs tree before it was created,
so the metadata of the snapshot should not exist in the its tree. But now, we
found the directory item and directory name index is in both the snapshot tree
and the fs tree. It introduces some problems and makes the users feel strange:
# mkfs.btrfs /dev/sda1
# mount /dev/sda1 /mnt
# mkdir /mnt/1
# cd /mnt/1
# btrfs subvolume snapshot /mnt snap0
# ls -a /mnt/1/snap0/1
. .. [no other file/dir]
# ll /mnt/1/snap0/
total 0
drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
^^^
There is no file/dir in it, but it's size is 10
# cd /mnt/1/snap0/1/snap0
[Enter a unexisted directory successfully...]
There is nothing in the directory 1 in snap0, but btrfs told the length of
this directory is 10. Beside that, we can enter an unexisted directory, it is
very strange to the users.
# btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
# ll /mnt/1/snap0/1/
total 0
[None]
# ll /mnt/snap1/1/
total 0
drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0
And the source of snap1 did have any directory in Directory 1, but snap1 have
a snap0, it is different between the source and the snapshot.
So I think we should insert directory item and directory name index and update
the parent inode as the last step of snapshot creation, and do not leave the
useless metadata in the file tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
The ordered extent allocation is in the fast path of the IO, so use a slab
to improve the speed of the allocation.
"Size of the struct is 280, so this will fall into the size-512 bucket,
giving 8 objects per page, while own slab will pack 14 objects into a page.
Another benefit I see is to check for leaked objects when the module is
removed (and the cache destroy takes place)."
-- David Sterba
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
If a snapshot is created while we are writing some data into the file,
the i_size of the corresponding file in the snapshot will be wrong, it will
be beyond the end of the last file extent. And btrfsck will report:
root 256 inode 257 errors 100
Steps to reproduce:
# mkfs.btrfs <partition>
# mount <partition> <mnt>
# cd <mnt>
# dd if=/dev/zero of=tmpfile bs=4M count=1024 &
# for ((i=0; i<4; i++))
> do
> btrfs sub snap . $i
> done
This because the algorithm of disk_i_size update is wrong. Though there are
some ordered extents behind the current one which we use to update disk_i_size,
it doesn't mean those extents will be dealt with in the same transaction. So
We shouldn't use the offset of those extents to update disk_i_size. Or we will
get the wrong i_size in the snapshot.
We fix this problem by recording the max real i_size. If we find there is a
ordered extent which is in front of the current one and doesn't complete, we
will record the end of the current one into that ordered extent. Surely, if
the current extent holds the end of other extent(it must be greater than
the current one because it is behind the current one), we will record the
number that the current extent holds. In this way, we can exclude the ordered
extents that may not be dealth with in the same transaction, and be easy to
know the real disk_i_size.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|
|
If we create several snapshots at the same time, the following BUG_ON() will be
triggered.
kernel BUG at fs/btrfs/extent-tree.c:6047!
Steps to reproduce:
# mkfs.btrfs <partition>
# mount <partition> <mnt>
# cd <mnt>
# for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
# for ((i=0; i<4; i++))
> do
> mkdir $i
> for ((j=0; j<200; j++))
> do
> btrfs sub snap . $i/$j
> done &
> done
The reason is:
Before transaction commit, some operations changed the fs tree and new tree
blocks were allocated because of COW. We used the implicit non-shared back
reference for those newly allocated tree blocks because they were not shared by
two or more trees.
And then we created the first snapshot for the fs tree, according to the back
reference rules, we also used implicit back refs for the child tree blocks of
the root node of the fs tree, now those child nodes/leaves were shared by two
trees.
Then We didn't deal with the delayed references, and continued to change the fs
tree(created the second snapshot and inserted the dir item of the new snapshot
into the fs tree). According to the rules of the back reference, we added full
back refs for those tree blocks whose parents have be shared by two trees.
Now some newly allocated tree blocks had two types of the references.
As we know, the delayed reference system handles these delayed references from
back to front, and the full delayed reference is inserted after the implicit
ones. So when we dealt with the back references of those newly allocated tree
blocks, the full references was dealt with at first. And if the first reference
is a shared back reference and the tree block that the reference points to is
newly allocated, It would be considered as a tree block which is shared by two
or more trees when it is allocated and should be a full back reference not a
implicit one, the flag of its reference also should be set to FULL_BACKREF.
But in fact, it was a non-shared tree block with a implicit reference at
beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
was triggered.
We have several methods to fix this bug:
1. deal with delayed references after the snapshot is created and before we
change the source tree of the snapshot. This is the easiest and safest way.
2. modify the sort method of the delayed reference tree, make the full delayed
references be inserted before the implicit ones. It is also very easy, but
I don't know if it will introduce some problems or not.
3. modify select_delayed_ref() and make it select the implicit delayed reference
at first. This way is not so good because it may wastes CPU time if we have
lots of delayed references.
4. set the flags to FULL_BACKREF, this method is a little complex comparing with
the 1st way.
I chose the 1st way to fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
|