Age | Commit message (Collapse) | Author |
|
These functions have very little to do with de_thread move them out
of de_thread an into flush_old_exec proper so it can be more clearly
seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
This makes the code clearer and makes it easier to implement a mutex
that is not taken over any locations that may block indefinitely waiting
for userspace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
Make it clear that current only needs to be computed once in
flush_old_exec. This may have some efficiency improvements and it
makes the code easier to change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
UDP was originally disabled in 6da1a034362f for NFSv4. Later in
b24ee6c64ca7 UDP is by default disabled by NFS_DISABLE_UDP_SUPPORT=y for
all NFS versions. Therefore remove v4 from error message.
Fixes: b24ee6c64ca7 ("NFS: allow deprecation of NFS UDP protocol")
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
UDP is disabled by default in commit b24ee6c64ca7 ("NFS: allow
deprecation of NFS UDP protocol"), but the default mount options
is still udp, change it to tcp to avoid the "Unsupported transport
protocol udp" error if no protocol is specified when mount nfs.
Fixes: b24ee6c64ca7 ("NFS: allow deprecation of NFS UDP protocol")
Signed-off-by: Liwei Song <liwei.song@windriver.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Add SPDX License Identifier to all .gitignore files.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When some events have directory id and some object id,
fanotify_event_has_fid() becomes mostly useless and confusing because we
usually need to know which type of file handle the event has. So just
drop the function and use fanotify_event_object_fh() instead.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
For some events, we are going to report both child and parent fid's,
so pass fsid and file handle as arguments to copy_fid_to_user(),
which is going to be called with parent and child file handles.
Link: https://lore.kernel.org/r/20200319151022.31456-13-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Dirent events are going to be supported in two flavors:
1. Directory fid info + mask that includes the specific event types
(e.g. FAN_CREATE) and an optional FAN_ONDIR flag.
2. Directory fid info + name + mask that includes only FAN_DIR_MODIFY.
To request the second event flavor, user needs to set the event type
FAN_DIR_MODIFY in the mark mask.
The first flavor is supported since kernel v5.1 for groups initialized
with flag FAN_REPORT_FID. It is intended to be used for watching
directories in "batch mode" - the watcher is notified when directory is
changed and re-scans the directory content in response. This event
flavor is stored more compactly in the event queue, so it is optimal
for workloads with frequent directory changes.
The second event flavor is intended to be used for watching large
directories, where the cost of re-scan of the directory on every change
is considered too high. The watcher getting the event with the directory
fid and entry name is expected to call fstatat(2) to query the content of
the entry after the change.
Legacy inotify events are reported with name and event mask (e.g. "foo",
FAN_CREATE | FAN_ONDIR). That can lead users to the conclusion that
there is *currently* an entry "foo" that is a sub-directory, when in fact
"foo" may be negative or non-dir by the time user gets the event.
To make it clear that the current state of the named entry is unknown,
when reporting an event with name info, fanotify obfuscates the specific
event types (e.g. create,delete,rename) and uses a common event type -
FAN_DIR_MODIFY to describe the change. This should make it harder for
users to make wrong assumptions and write buggy filesystem monitors.
At this point, name info reporting is not yet implemented, so trying to
set FAN_DIR_MODIFY in mark mask will return -EINVAL.
Link: https://lore.kernel.org/r/20200319151022.31456-12-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Breakup the union and make them both inherit from abstract fanotify_event.
fanotify_path_event, fanotify_fid_event and fanotify_perm_event inherit
from fanotify_event.
type field in abstract fanotify_event determines the concrete event type.
fanotify_path_event, fanotify_fid_event and fanotify_perm_event are
allocated from separate memcache pools.
Rename fanotify_perm_event casting macro to FANOTIFY_PERM(), so that
FANOTIFY_PE() and FANOTIFY_FE() can be used as casting macros to
fanotify_path_event and fanotify_fid_event.
[JK: Cleanup FANOTIFY_PE() and FANOTIFY_FE() to be proper inline
functions and remove requirement that fanotify_event is the first in
event structures]
Link: https://lore.kernel.org/r/20200319151022.31456-11-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently, struct fanotify_fid groups fsid and file handle and is
unioned together with struct path to save space. Also there is fh_type
and fh_len directly in struct fanotify_event to avoid padding overhead.
In the follwing patches, we will be adding more event types and this
packing makes code difficult to follow. So unpack everything and create
struct fanotify_fh which groups members logically related to file handle
to make code easier to follow. In the following patch we will pack
things again differently to make events smaller.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
create_fd() is never used with invalid path. Also the only thing it
needs to know from fanotify_event is the path. Simplify the function to
take path directly and assume it is correct.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
The missing 'return' work may make it hard for other developers to
understand it.
Signed-off-by: Chucheng Luo <luochucheng@vivo.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The write pointer of zones in the read-only consition is defined as
invalid by the SCSI ZBC and ATA ZAC specifications. It is thus not
possible to determine the correct size of a read-only zone file on
mount. Fix this by handling read-only zones in the same manner as
offline zones by disabling all accesses to the zone (read and write)
and initializing the inode size of the read-only zone to 0).
For zones found to be in the read-only condition at runtime, only
disable write access to the zone and keep the size of the zone file to
its last updated value to allow the user to recover previously written
data.
Also fix zonefs documentation file to reflect this change.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
|
|
There is no good reason for __bdevname to exist. Just open code
printing the string in the callers. For three of them the format
string can be trivially merged into existing printk statements,
and in init/do_mounts.c we can at least do the scnprintf once at
the start of the function, and unconditional of CONFIG_BLOCK to
make the output for tiny configfs a little more helpful.
Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reading from a debugfs file at a nonzero position, without first reading
at position 0, leaks uninitialized memory to userspace.
It's a bit tricky to do this, since lseek() and pread() aren't allowed
on these files, and write() doesn't update the position on them. But
writing to them with splice() *does* update the position:
#define _GNU_SOURCE 1
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
int pipes[2], fd, n, i;
char buf[32];
pipe(pipes);
write(pipes[1], "0", 1);
fd = open("/sys/kernel/debug/fault_around_bytes", O_RDWR);
splice(pipes[0], NULL, fd, NULL, 1, 0);
n = read(fd, buf, sizeof(buf));
for (i = 0; i < n; i++)
printf("%02x", buf[i]);
printf("\n");
}
Output:
5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a5a30
Fix the infoleak by making simple_attr_read() always fill
simple_attr::get_buf if it hasn't been filled yet.
Reported-by: syzbot+fcab69d1ada3e8d6f06b@syzkaller.appspotmail.com
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: acaefc25d21f ("[PATCH] libfs: add simple attribute files")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200308023849.988264-1-ebiggers@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Change the logic of FAN_ONDIR in two ways that are similar to the logic
of FAN_EVENT_ON_CHILD, that was fixed in commit 54a307ba8d3c ("fanotify:
fix logic of events on child"):
1. The flag is meaningless in ignore mask
2. The flag refers only to events in the mask of the mark where it is set
This is what the fanotify_mark.2 man page says about FAN_ONDIR:
"Without this flag, only events for files are created." It doesn't
say anything about setting this flag in ignore mask to stop getting
events on directories nor can I think of any setup where this capability
would be useful.
Currently, when marks masks are merged, the FAN_ONDIR flag set in one
mark affects the events that are set in another mark's mask and this
behavior causes unexpected results. For example, a user adds a mark on a
directory with mask FAN_ATTRIB | FAN_ONDIR and a mount mark with mask
FAN_OPEN (without FAN_ONDIR). An opendir() of that directory (which is
inside that mount) generates a FAN_OPEN event even though neither of the
marks requested to get open events on directories.
Link: https://lore.kernel.org/r/20200319151022.31456-10-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
With inotify, when a watch is set on a directory and on its child, an
event on the child is reported twice, once with wd of the parent watch
and once with wd of the child watch without the filename.
With fanotify, when a watch is set on a directory and on its child, an
event on the child is reported twice, but it has the exact same
information - either an open file descriptor of the child or an encoded
fid of the child.
The reason that the two identical events are not merged is because the
object id used for merging events in the queue is the child inode in one
event and parent inode in the other.
For events with path or dentry data, use the victim inode instead of the
watched inode as the object id for event merging, so that the event
reported on parent will be merged with the event reported on the child.
Link: https://lore.kernel.org/r/20200319151022.31456-9-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
The event inode field is used only for comparison in queue merges and
cannot be dereferenced after handle_event(), because it does not hold a
refcount on the inode.
Replace it with an abstract id to do the same thing.
Link: https://lore.kernel.org/r/20200319151022.31456-8-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:
- Make kfree_rcu() use kfree_bulk() for added performance
- RCU updates
- Callback-overload handling updates
- Tasks-RCU KCSAN and sparse updates
- Locking torture test and RCU torture test updates
- Documentation updates
- Miscellaneous fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
We always punt async buffered writes to an io-wq helper, as the core
kernel does not have IOCB_NOWAIT support for that. Most buffered async
writes complete very quickly, as it's just a copy operation. This means
that doing multiple locking roundtrips on the shared wqe lock for each
buffered write is wasteful. Additionally, buffered writes are hashed
work items, which means that any buffered write to a given file is
serialized.
Keep identicaly hashed work items contiguously in @wqe->work_list, and
track a tail for each hash bucket. On dequeue of a hashed item, splice
all of the same hash in one go using the tracked tail. Until the batch
is done, the caller doesn't have to synchronize with the wqe or worker
locks again.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of passing both dentry and path and having to figure out which
one to use, pass data/data_type to simplify the code.
Link: https://lore.kernel.org/r/20200319151022.31456-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Create helpers to access path and inode from different data types.
Link: https://lore.kernel.org/r/20200319151022.31456-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
snprintf() is a hard-to-use function, and it's especially difficult to
use it properly for concatenating substrings in a buffer with a limited
size. Since snprintf() returns the would-be-output size, not the actual
size, the subsequent use of snprintf() may point to the incorrect
position easily. Also, returning the value from snprintf() directly to
sysfs show function would pass a bogus value that is higher than the
actually truncated string.
That said, although the current code doesn't actually overflow the
buffer with PAGE_SIZE, it's a usage that shouldn't be done. Or it's
worse; this gives a wrong confidence as if it were doing safe
operations.
This patch replaces such snprintf() calls with a safer version,
scnprintf(). It returns the actual output size, hence it's more
intuitive and the code does what's expected.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Zygo reported a deadlock where a task was stuck in the inode logical
resolve code. The deadlock looks like this
Task 1
btrfs_ioctl_logical_to_ino
->iterate_inodes_from_logical
->iterate_extent_inodes
->path->search_commit_root isn't set, so a transaction is started
->resolve_indirect_ref for a root that's being deleted
->search for our key, attempt to lock a node, DEADLOCK
Task 2
btrfs_drop_snapshot
->walk down to a leaf, lock it, walk up, lock node
->end transaction
->start transaction
-> wait_cur_trans
Task 3
btrfs_commit_transaction
->wait_event(cur_trans->write_wait, num_writers == 1) DEADLOCK
We are holding a transaction open in btrfs_ioctl_logical_to_ino while we
try to resolve our references. btrfs_drop_snapshot() holds onto its
locks while it stops and starts transaction handles, because it assumes
nobody is going to touch the root now. Commit just does what commit
does, waiting for the writers to finish, blocking any new trans handles
from starting.
Fix this by making the backref code not try to resolve backrefs of roots
that are currently being deleted. This will keep us from walking into a
snapshot that's currently being deleted.
This problem was harder to hit before because we rarely broke out of the
snapshot delete halfway through, but with my delayed ref throttling code
it happened much more often. However we've always been able to do this,
so it's not a new problem.
Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We always search the commit root of the extent tree for looking up back
references, however we track the reloc roots based on their current
bytenr.
This is wrong, if we commit the transaction between relocating tree
blocks we could end up in this code in build_backref_tree
if (key.objectid == key.offset) {
/*
* Only root blocks of reloc trees use backref
* pointing to itself.
*/
root = find_reloc_root(rc, cur->bytenr);
ASSERT(root);
cur->root = root;
break;
}
find_reloc_root() is looking based on the bytenr we had in the commit
root, but if we've COWed this reloc root we will not find that bytenr,
and we will trip over the ASSERT(root).
Fix this by using the commit_root->start bytenr for indexing the commit
root. Then we change the __update_reloc_root() caller to be used when
we switch the commit root for the reloc root during commit.
This fixes the panic I was seeing when we started throttling relocation
for delayed refs.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are two bugs here, but fixing them independently would just result
in pain if you happened to bisect between the two patches.
First is how we handle the -EAGAIN from relocate_tree_block(). We don't
set error, unless we happen to be the first node, which makes no sense,
I have no idea what the code was trying to accomplish here.
We in fact _do_ want err set here so that we know we need to restart in
relocate_block_group(). Also we need finish_pending_nodes() to not
actually call link_to_upper(), because we didn't actually relocate the
block.
And then if we do get -EAGAIN we do not want to set our backref cache
last_trans to the one before ours. This would force us to update our
backref cache if we didn't cross transaction ids, which would mean we'd
have some nodes updated to their new_bytenr, but still able to find
their old bytenr because we're searching the same commit root as the
last time we went through relocate_tree_blocks.
Fixing these two things keeps us from panicing when we start breaking
out of relocate_tree_blocks() either for delayed ref flushing or enospc.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since we're not only checking for metadata reservations but also if we
need to throttle our delayed ref generation, reorder
reserve_metadata_space() above the select_one_root() call in
relocate_tree_block().
The reason we want this is because select_reloc_root() will mess with
the backref cache, and if we're going to bail we want to be able to
cleanly remove this node from the backref cache and come back along to
regenerate it. Move it up so this is the first thing we do to make
restarting cleaner.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Here we are just searching down to the bytenr we're building the backref
tree for, and all of it's paths to the roots. These bytenrs are not
guaranteed to be anywhere near each other, so readahead just generates
extra latency.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Readahead will generate a lot of extra reads for adjacent nodes, but
when running delayed refs we have no idea if the next ref is going to be
adjacent or not, so this potentially just generates a lot of extra IO.
To make matters worse each ref is truly just looking for one item, it
doesn't generally search forward, so we simply don't need it here.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With BTRFS_SUBVOL_CREATE_ASYNC support remove it's no longer required to
pass the async_transid parameter so remove it and any code using it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_ioctl_snap_create_transid no longer takes a transid argument, so
remove it and rename the function to __btrfs_ioctl_snap_create to
reflect it's an internal, worker function.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This functionality was deprecated in kernel 5.4. Since no one has
complained of the impending removal it's time we did so.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that we have proper root ref counting everywhere we can kill the
subvol_srcu.
* removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes
* the refcount_t used for the references checks for accidental 0->1
in cases where the root lifetime would not be properly protected
* there's a leak detector for roots to catch unfreed roots at umount
time
* SRCU served us well over the years but is was not a proper
synchronization mechanism for some cases
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The radix root is primarily protected by the fs_roots_radix_lock, so use
that to lookup and get a ref on all of our fs roots in
btrfs_cleanup_fs_roots. The tree reference is taken in the protected
section as before.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that all the users of roots take references for them we can drop the
extra root ref we've been taking. Before we had roots at 2 refs for the
life of the file system, one for the radix tree, and one simply for
existing. Now that we have proper ref accounting in all places that use
roots we can drop this extra ref simply for existing as we no longer
need it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At the point we add a root to the dead roots list we have no open inodes
for that root, so we need to hold a ref on that root to keep it from
disappearing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we make sure all the inodes have refs on their root we don't have to
worry about the root disappearing while we have open inodes.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are a few different ways to free roots, either you allocated them
yourself and you just do
free_extent_buffer(root->node);
free_extent_buffer(root->commit_node);
btrfs_put_root(root);
Which is the pattern for log roots. Or for snapshots/subvolumes that
are being dropped you simply call btrfs_free_fs_root() which does all
the cleanup for you.
Unify this all into btrfs_put_root(), so that we don't free up things
associated with the root until the last reference is dropped. This
makes the root freeing code much more significant.
The only caveat is at close_ctree() time we have to free the extent
buffers for all of our main roots (extent_root, chunk_root, etc) because
we have to drop the btree_inode and we'll run into issues if we hold
onto those nodes until ->kill_sb() time. This will be addressed in the
future when we kill the btree_inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are going to make root life be controlled soley by refcounting, and
inodes will be one of the things that hold a ref on the root. This
means we need to handle dropping the ino_cache_inode outside of the root
freeing logic, so move it into btrfs_drop_and_free_fs_root() so it is
cleaned up properly on unmount.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
I'm going to make the entire destruction of btrfs_root's controlled by
their refcount, so it will be helpful to notice if we're leaking their
eb's on umount.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This was pretty subtle, we default to reloc roots having 0 root refs, so
if we crash in the middle of the relocation they can just be deleted.
If we successfully complete the relocation operations we'll set our root
refs to 1 in prepare_to_merge() and then go on to merge_reloc_roots().
At prepare_to_merge() time if any of the reloc roots have a 0 reference
still, we will remove that reloc root from our reloc root rb tree, and
then clean it up later.
However this only happens if we successfully start a transaction. If
we've aborted previously we will skip this step completely, and only
have reloc roots with a reference count of 0, but were never properly
removed from the reloc control's rb tree.
This isn't a problem per-se, our references are held by the list the
reloc roots are on, and by the original root the reloc root belongs to.
If we end up in this situation all the reloc roots will be added to the
dirty_reloc_list, and then properly dropped at that point. The reloc
control will be free'd and the rb tree is no longer used.
There were two options when fixing this, one was to remove the BUG_ON(),
the other was to make prepare_to_merge() handle the case where we
couldn't start a trans handle.
IMO this is the cleaner solution. I started with handling the error in
prepare_to_merge(), but it turned out super ugly. And in the end this
BUG_ON() simply doesn't matter, the cleanup was happening properly, we
were just panicing because this BUG_ON() only matters in the success
case. So I've opted to just remove it and add a comment where it was.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We previously were relying on root->reloc_root to be cleaned up by the
drop snapshot, or the error handling. However if btrfs_drop_snapshot()
failed it wouldn't drop the ref for the root. Also we sort of depend on
the right thing to happen with moving reloc roots between lists and the
fs root they belong to, which makes it hard to figure out who owns the
reference.
Fix this by explicitly holding a reference on the reloc root for
roo->reloc_root. This means that we hold two references on reloc roots,
one for whichever reloc_roots list it's attached to, and the
root->reloc_root we're on.
This makes it easier to reason out who owns a reference on the root, and
when it needs to be dropped.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The DEAD_RELOC_TREE flag is in place in order to avoid a use after free
in init_reloc_root, tracking the presence of reloc_root. However adding
the explicit tree references in previous patches makes the use after
free impossible because at this point we no longer have a reloc_control
set on the fs_info and thus cannot enter the function.
So move this to be coupled with clearing the root->reloc_root so we're
consistent with all other operations of the reloc root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we have an error while processing the reloc roots we could leak roots
that were added to rc->reloc_roots before we hit the error. We could
have also not removed the reloc tree mapping from our rb_tree, so clean
up any remaining nodes in the reloc root rb_tree.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use rbtree_postorder_for_each_entry_safe ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We previously were checking if the root had a dead root before accessing
root->reloc_root in order to avoid a use-after-free type bug. However
this scenario happens after we've unset the reloc control, so we would
have been saved if we'd simply checked for fs_info->reloc_control. At
this point during relocation we no longer need to be creating new reloc
roots, so simply move this check above the reloc_root checks to avoid
any future races and confusion.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we do merge_reloc_roots() we could insert a few roots onto the dirty
subvol roots list, where we hold a ref on them. If we fail to start the
transaction we need to run clean_dirty_subvols() in order to cleanup the
refs.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we fail to load an fs root, or fail to start a transaction we can
bail without unsetting the reloc control, which leads to problems later
when we free the reloc control but still have it attached to the file
system.
In the normal path we'll end up calling unset_reloc_control() twice, but
all it does is set fs_info->reloc_control = NULL, and we can only have
one balance at a time so it's not racey.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we have an error while building the backref tree in relocation we'll
process all the pending edges and then free the node. However if we
integrated some edges into the cache we'll lose our link to those edges
by simply freeing this node, which means we'll leak memory and
references to any roots that we've found.
Instead we need to use remove_backref_node(), which walks through all of
the edges that are still linked to this node and free's them up and
drops any root references we may be holding.
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
tree leaves
In relocation, we need to locate all parent tree leaves referring to one
data extent, thus we have a complex mechanism to iterate throught extent
tree and subvolume trees to locate the related leaves.
However this is already done in backref.c, we have
btrfs_find_all_leafs(), which can return a ulist containing all leaves
referring to that data extent.
Use btrfs_find_all_leafs() to replace find_data_references().
There is a special handling for v1 space cache data extents, where we
need to delete the v1 space cache data extents, to avoid those data
extents to hang the data relocation.
In this patch, the special handling is done by re-iterating the root
tree leaf. Although it's a little less efficient than the old handling,
considering we can reuse a lot of code, it should be acceptable.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|