Age | Commit message (Collapse) | Author |
|
We are using a variable named 'log_ref_ver' of type int to indicate if we
are processing an extref item or not, using a value of 1 if so, otherwise
0. This is an odd name and type, so rename it to 'is_extref_item' and
change its type to bool.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During log replay, at add_inode_ref(), if we have an extref item that
contains multiple extrefs and one of them points to a directory that does
not exist in the subvolume tree, we are supposed to ignore it and process
the remaining extrefs encoded in the extref item, since each extref can
point to a different parent inode. However when that happens we just
return from the function and ignore the remaining extrefs.
The problem has been around since extrefs were introduced, in commit
f186373fef00 ("btrfs: extended inode refs"), but it's hard to hit in
practice because getting extref items encoding multiple extref requires
getting a hash collision when computing the offset of the extref's
key. The offset if computed like this:
key.offset = btrfs_extref_hash(dir_ino, name->name, name->len);
and btrfs_extref_hash() is just a wrapper around crc32c().
Fix this by moving to next iteration of the loop when we don't find
the parent directory that an extref points to.
Fixes: f186373fef00 ("btrfs: extended inode refs")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
During log replay, at add_inode_ref(), we return -ENOENT if our current
inode isn't found on the subvolume tree or if a parent directory isn't
found. The error comes from btrfs_iget_logging() <- btrfs_iget() <-
btrfs_read_locked_inode().
The single caller of add_inode_ref(), replay_one_buffer(), ignores an
-ENOENT error because it expects that error to mean only that a parent
directory wasn't found and that is ok.
Before commit 5f61b961599a ("btrfs: fix inode lookup error handling during
log replay") we were converting any error when getting a parent directory
to -ENOENT and any error when getting the current inode to -EIO, so our
caller would fail log replay in case we can't find the current inode.
After that commit however in case the current inode is not found we return
-ENOENT to the caller and therefore it ignores the critical fact that the
current inode was not found in the subvolume tree.
Fix this by converting -ENOENT to 0 when we don't find a parent directory,
returning -ENOENT when we don't find the current inode and making the
caller, replay_one_buffer(), not ignore -ENOENT anymore.
Fixes: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay")
CC: stable@vger.kernel.org # 6.16
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[TEST FAILURE WITH EXPERIMENTAL FEATURES]
When running test case generic/508, the test case will fail with the new
btrfs shutdown support:
generic/508 - output mismatch (see /home/adam/xfstests/results//generic/508.out.bad)
--- tests/generic/508.out 2022-05-11 11:25:30.806666664 +0930
+++ /home/adam/xfstests/results//generic/508.out.bad 2025-07-02 14:53:22.401824212 +0930
@@ -1,2 +1,6 @@
QA output created by 508
Silence is golden
+Before:
+After : stat.btime = Thu Jan 1 09:30:00 1970
+Before:
+After : stat.btime = Wed Jul 2 14:53:22 2025
...
(Run 'diff -u /home/adam/xfstests/tests/generic/508.out /home/adam/xfstests/results//generic/508.out.bad' to see the entire diff)
Ran: generic/508
Failures: generic/508
Failed 1 of 1 tests
Please note that the test case requires shutdown support, thus the test
case will be skipped using the current upstream kernel, as it doesn't
have shutdown ioctl support.
[CAUSE]
The direct cause the 0 time stamp in the log tree:
leaf 30507008 items 2 free space 16057 generation 9 owner TREE_LOG
leaf 30507008 flags 0x1(WRITTEN) backref revision 1
checksum stored e522548d
checksum calced e522548d
fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0
chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398
item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 9 transid 9 size 0 nbytes 0
block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
atime 1751432947.492000000 (2025-07-02 14:39:07)
ctime 1751432947.492000000 (2025-07-02 14:39:07)
mtime 1751432947.492000000 (2025-07-02 14:39:07)
otime 0.0 (1970-01-01 09:30:00) <<<
But the old fs tree has all the correct time stamp:
btrfs-progs v6.12
fs tree key (FS_TREE ROOT_ITEM 0)
leaf 30425088 items 2 free space 16061 generation 5 owner FS_TREE
leaf 30425088 flags 0x1(WRITTEN) backref revision 1
checksum stored 48f6c57e
checksum calced 48f6c57e
fs uuid 57d45451-481e-43e4-aa93-289ad707a3a0
chunk uuid d52bd3fd-5163-4337-98a7-7986993ad398
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 3 transid 0 size 0 nbytes 16384
block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
sequence 0 flags 0x0(none)
atime 1751432947.0 (2025-07-02 14:39:07)
ctime 1751432947.0 (2025-07-02 14:39:07)
mtime 1751432947.0 (2025-07-02 14:39:07)
otime 1751432947.0 (2025-07-02 14:39:07) <<<
The root cause is that fill_inode_item() in tree-log.c is only
populating a/c/m time, not the otime (or btime in statx output).
Part of the reason is that, the vfs inode only has a/c/m time, no native
btime support yet.
[FIX]
Thankfully btrfs has its otime stored in btrfs_inode::i_otime_sec and
btrfs_inode::i_otime_nsec.
So what we really need is just fill the otime time stamp in
fill_inode_item() of tree-log.c
There is another fill_inode_item() in inode.c, which is doing the proper
otime population.
Fixes: 94edf4ae43a5 ("Btrfs: don't bother committing delayed inode updates when fsyncing")
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The token versions of set/get accessors will be removed, use the normal
helpers.
There's additional overhead of the token helpers that update the cached
address in case it moves to another page/folio. The normal versions
don't need to do that.
Note this is similar to fill_inode_item() in inode.c but with slight
differences. The two functions could be deduplicated eventually.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is an exported function and therefore it should have a 'btrfs_'
prefix, to make it clear it's btrfs specific, avoid future name collisions
with code outside btrfs, and make its naming consistent with most other
btrfs exported functions.
So add a 'btrfs_' prefix to it and make it return bool instead of int,
since all we need is to return true or false.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The __add_inode_ref() function is quite big and with too much nesting, so
move the code that processes inode extrefs into a helper function, to make
the function easier to read and reduce the level of indentation too.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The __add_inode_ref() function is quite big and with too much nesting, so
move the code that processes inode refs into a helper function, to make
the function easier to read and reduce the level of indentation too.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function btrfs_lookup_inode_extref(` no longer requires transaction
handle, insert length, or COW flag, as the only caller now performs a
read-only lookup using trans == NULL, ins_len == 0 and cow == 0.
This function was introduced in the early days where extref feature was
introduced by commit f186373fef00 ("btrfs: extended inode refs").
Then some cleanup was done in commit 33b98f227111 ("btrfs: cleanup:
removed unused 'btrfs_get_inode_ref_index'"), which removed the only
caller passing trans and other COW specific options.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The dirty_log_pages tree is used for tree logging and marks extents
based on log_transid. The bits could be renamed to resemble the
LOG1/LOG2 naming used for the BTRFS_FS_LOG1_ERR bits.
The DIRTY bit is renamed to LOG1 and NEW to LOG2.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of using a bare atomic, use the refcount_t type, which despite
being a structure that contains only an atomic, has an API that checks
for underflows and other hazards. This doesn't change the size of the
extent_buffer structure.
This removes the need to do things like this:
WARN_ON(atomic_read(&eb->refs) == 0);
if (atomic_dec_and_test(&eb->refs)) {
(...)
}
And do just:
if (refcount_dec_and_test(&eb->refs)) {
(...)
}
Since refcount_dec_and_test() already triggers a warning when we decrement
a ref count that has a value of 0 (or below zero).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The argument has boolean semantics, so change its type from int to bool,
making it more clear.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are two callers of btrfs_del_inode_ref() that declare a local index
variable and then pass a pointer for it to btrfs_del_inode_ref(), but then
don't use that index at all. Since btrfs_del_inode_ref() accepts a NULL
index pointer, pass NULL instead and stop declaring those useless index
variables.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of allocating the scratch eb after joining the log transaction,
allocate it before so that we're not delaying log commits for longer
than necessary, as allocating the scratch eb means allocating an
extent_buffer structure, which comes from a dedicated kmem_cache, plus
pages/folios to attach to the eb. Both of these allocations may take time
when we're under memory pressure.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of allocating the path after joining the log transaction, allocate
it before so that we're not delaying log commits for the rare cases where
the allocation takes a significant time (under memory pressure and all
slabs are full, there's the need to allocate a new page, etc).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of allocating the path after joining the log transaction, allocate
it before so that we're not delaying log commits for the rare cases where
the allocation takes a significant time (under memory pressure and all
slabs are full, there's the need to allocate a new page, etc).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are supposed to be able to join a log transaction at that point, since
we have determined that the inode was logged in the current transaction
with the call to inode_logged(). So ASSERT() we joined a log transaction
and also warn if we didn't in case assertions are disabled (the kernel
config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed
and reported if it ever happens.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to use btrfs_delete_one_dir_name() at del_logged_dentry()
because we are processing a dir index key which can contain only a single
name, unlike dir item keys which can encode multiple names in case of name
hash collisions. We have explicitly looked up for a dir index key by
calling btrfs_lookup_dir_index_item() and we don't log dir item keys
anymore (since commit 339d03542484 ("btrfs: only copy dir index keys when
logging a directory")). So simplify and use btrfs_del_item() directly
instead of btrfs_delete_one_dir_name().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are supposed to be able to join a log transaction at that point, since
we have determined that the inode was logged in the current transaction
with the call to inode_logged(). So ASSERT() we joined a log transaction
and also warn if we didn't in case assertions are disabled (the kernel
config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed
and reported if it ever happens.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have this fuzzy logic at btrfs_recover_log_trees() where we don't
abort the transaction and exit immediately after each function call that
returned an error, and instead have if-then-else logic or check if the
previous function call returned success before calling the next function.
Make the flow more straightforward by immediately aborting the transaction
and exiting after each function call failure. This also allows to avoid
two consecutive if statements that test the same conditions:
if (!ret && wc.stage == LOG_WALK_REPLAY_ALL) {
(...)
}
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to call btrfs_release_path() before calling
btrfs_init_root_free_objectid() as we have released the path already at
the top of the loop and the previous call to fixup_inode_link_counts()
also releases the path. So remove it to simplify the code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we failed walking a log tree during replay, we have a missing
transaction abort to prevent committing a transaction where we didn't
fully replay all the changes from a log tree and therefore can leave the
respective subvolume tree in some inconsistent state. So add the missing
transaction abort.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have a single line doing a transaction abort in case either we got an
error from btrfs_get_fs_root() different from -ENOENT or we got an error
from btrfs_pin_extent_for_log_replay(), making it hard to figure out which
function call failed when looking at a transaction abort massages and
stack trace in dmesg. Change this to have an explicit transaction abort
for each one of the two cases.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of recording that a new subvolume was created in a directory after
we add the entry do the directory, record it before adding the entry. This
is to avoid races where after creating the entry and before recording the
new subvolume in the directory (the call to btrfs_record_new_subvolume()),
another task logs the directory, so we end up with a log tree where we
logged a directory that has an entry pointing to a root that was not yet
committed, resulting in an invalid entry if the log is persisted and
replayed later due to a power failure or crash.
Also state this requirement in the function comment for
btrfs_record_new_subvolume(), similar to what we do for the
btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy().
Fixes: 45c4102f0d82 ("btrfs: avoid transaction commit on any fsync after subvolume creation")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When replaying log trees we use read_one_inode() to get an inode, which is
just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for
btrfs_iget(). But read_one_inode() always returns NULL for any error
that btrfs_iget_logging() / btrfs_iget() may return and this is a problem
because:
1) In many callers of read_one_inode() we convert the NULL into -EIO,
which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT
for example, besides -EIO and other errors. So during log replay we
may end up reporting a false -EIO, which is confusing since we may
not have had any IO error at all;
2) When replaying directory deletes, at replay_dir_deletes(), we assume
the NULL returned from read_one_inode() means that the inode doesn't
exist and then proceed as if no error had happened. This is wrong
because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an
actual error and the target inode may exist in the target subvolume
root - this may later result in the log replay code failing at a
later stage (if we are "lucky") or succeed but leaving some
inconsistency in the filesystem.
So fix this by not ignoring errors from btrfs_iget_logging() and as
a consequence remove the read_one_inode() wrapper and just use
btrfs_iget_logging() directly. Also since btrfs_iget_logging() is
supposed to be called only against subvolume roots, just like
read_one_inode() which had a comment about it, add an assertion to
btrfs_iget_logging() to check that the target root corresponds to a
subvolume root.
Fixes: 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At __inode_add_ref() when processing extrefs, if we jump into the next
label we have an undefined value of victim_name.len, since we haven't
initialized it before we did the goto. This results in an invalid memory
access in the next iteration of the loop since victim_name.len was not
initialized to the length of the name of the current extref.
Fix this by initializing victim_name.len with the current extref's name
length.
Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
replay
During log replay, at __add_inode_ref(), when we are searching for inode
ref keys we totally ignore if btrfs_search_slot() returns an error. This
may make a log replay succeed when there was an actual error and leave
some metadata inconsistency in a subvolume tree. Fix this by checking if
an error was returned from btrfs_search_slot() and if so, return it to
the caller.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If there's an unexpected (invalid) extent type, we just silently ignore
it. This means a corruption or some bug somewhere, so instead return
-EUCLEAN to the caller, making log replay fail, and print an error message
with relevant information.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In a few places where we call read_one_inode(), if we get a NULL pointer
we end up jumping into an error path, or fallthrough in case of
__add_inode_ref(), where we then do something like this:
iput(&inode->vfs_inode);
which results in an invalid inode pointer that triggers an invalid memory
access, resulting in a crash.
Fix this by making sure we don't do such dereferences.
Fixes: b4c50cbb01a1 ("btrfs: return a btrfs_inode from read_one_inode()")
CC: stable@vger.kernel.org # 6.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Using the helper makes it a bit more clear that we're accessing the
first list entry.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we fsync a file (or directory) that has no more hard links, because
while a process had a file descriptor open on it, the file's last hard
link was removed and then the process did an fsync against the file
descriptor, after a power failure or crash the file still exists after
replaying the log.
This behaviour is incorrect since once an inode has no more hard links
it's not accessible anymore and we insert an orphan item into its
subvolume's tree so that the deletion of all its items is not missed in
case of a power failure or crash.
So after log replay the file shouldn't exist anymore, which is also the
behaviour on ext4, xfs, f2fs and other filesystems.
Fix this by not ignoring inodes with zero hard links at
btrfs_log_inode_parent() and by committing an inode's delayed inode when
we are not doing a fast fsync (either BTRFS_INODE_COPY_EVERYTHING or
BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). This
last step is necessary because when removing the last hard link we don't
delete the corresponding ref (or extref) item, instead we record the
change in the inode's delayed inode with the BTRFS_DELAYED_NODE_DEL_IREF
flag, so that when the delayed inode is committed we delete the ref/extref
item from the inode's subvolume tree - otherwise the logging code will log
the last hard link and therefore upon log replay the inode is not deleted.
The base code for a fstests test case that reproduces this bug is the
following:
. ./common/dmflakey
_require_scratch
_require_dm_target flakey
_require_mknod
_scratch_mkfs >>$seqres.full 2>&1 || _fail "mkfs failed"
_require_metadata_journaling $SCRATCH_DEV
_init_flakey
_mount_flakey
touch $SCRATCH_MNT/foo
# Commit the current transaction and persist the file.
_scratch_sync
# A fifo to communicate with a background xfs_io process that will
# fsync the file after we deleted its hard link while it's open by
# xfs_io.
mkfifo $SCRATCH_MNT/fifo
tail -f $SCRATCH_MNT/fifo | \
$XFS_IO_PROG $SCRATCH_MNT/foo >>$seqres.full &
XFS_IO_PID=$!
# Give some time for the xfs_io process to open a file descriptor for
# the file.
sleep 1
# Now while the file is open by the xfs_io process, delete its only
# hard link.
rm -f $SCRATCH_MNT/foo
# Now that it has no more hard links, make the xfs_io process fsync it.
echo "fsync" > $SCRATCH_MNT/fifo
# Terminate the xfs_io process so that we can unmount.
echo "quit" > $SCRATCH_MNT/fifo
wait $XFS_IO_PID
unset XFS_IO_PID
# Simulate a power failure and then mount again the filesystem to
# replay the journal/log.
_flakey_drop_and_remount
# We don't expect the file to exist anymore, since it was fsynced when
# it had no more hard links.
[ -f $SCRATCH_MNT/foo ] && echo "file foo still exists"
_unmount_flakey
# success, all done
echo "Silence is golden"
status=0
exit
A test case for fstests will be submitted soon.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's a pointless label as we don't have to do anything under it other
than return from the function. So remove it and directly return from the
function where we used to goto.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no point in checking if the inode is a directory as
ctx->log_new_dentries is only set in case we are logging a directory down
the call chain of btrfs_log_inode(). So remove that check making the logic
more simple and while at it add a comment about why use a local variable
to track if we later need to log new dentries.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we don't need to log new directory dentries, there's no point in having
an else branch just to set 'ret' to zero, as it's already zero because
every time it gets a non-zero value we jump into one of the exit labels.
So remove it, which reduces source code size and the module text size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1813855 163737 16920 1994512 1e6f10 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1813807 163737 16920 1994464 1e6ee0 fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of using memcmp(), which requires copying both file extent items
from each extent buffer into a local buffer, use memcmp_extent_buffer() so
that we only need to copy one of the file extent items and directly use
the extent buffer of the other file extent item for the comparison.
This reduces code size, saves one memory copy and reduces stack usage.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function is exclusively used for log replay since commit
3eb423442483 ("btrfs: remove outdated logic from overwrite_item() and add
assertion"), so update the comment so that it doesn't say it can be used
for logging. Also some minor rewording for clarity and while at it
reformat the affected text so that it fits closer to the 80 characters
limit for comments.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of referring to path->nodes[0] and path->slots[0] multiple times,
which is verbose and confusing since we have an 'eb' and 'slot' variables
as well, introduce local variables 'dst_eb' to point to path->nodes[0] and
'dst_slot' to have path->slots[0], reducing verbosity and making it more
obvious about which extent buffer and slot we are referring to.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to allocate memory and copy from both the destination and
source extent buffers to compare if the items are equal, we can instead
use memcmp_extent_buffer() which allows to do only one memory allocation
and copy instead of two.
So use memcmp_extent_buffer() instead of memcmp(), allowing us to avoid
one memory allocation, which can fail or be slow while under memory heavy
pressure, avoid the memory copying and reducing code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
fixup_inode_link_count() mostly wants to use a btrfs_inode, plus it's an
internal function so it should take btrfs_inode instead of a VFS inode.
Change the argument type to btrfs_inode, avoiding several BTRFS_I() calls
too.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All callers of read_one_inode() are mostly interested in the btrfs_inode
structure rather than the VFS inode, so make read_one_inode() return
the btrfs_inode instead, avoiding lots of BTRFS_I() calls.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All callers of btrfs_iget_logging() are interested in the btrfs_inode
structure rather than the VFS inode, so make btrfs_iget_logging() return
the btrfs_inode instead, avoiding lots of BTRFS_I() calls.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The btrfs_key is defined as objectid/type/offset and the keys are also
printed like that. For better readability, update all key
initializations to match this order.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|