Age | Commit message (Collapse) | Author |
|
Sort mount options by the primary name, followed by the 'no-'
counterpart if it exists. Group the deprecated and debugging options.
Enum and token defintions are synced.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs has two mount options for SSD optimizations: ssd and ssd_spread.
Presently there is an option to disable all SSD optimizations, but there
isn't an option to disable just ssd_spread.
This patch adds a mount option nossd_spread that disables ssd_spread
only.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since userspace transaction have been removed we no longer have use
for this field so delete it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that the userspace transaction ioctls have been removed,
TRANS_USERSPACE is no longer used hence we can remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that the userspace transaction IOCTL have been removed, this member
is no longer used so just remove it
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 3558d4f88ec8 ("btrfs: Deprecate userspace transaction ioctls")
marked the beginning of the end of userspace transaction. This commit
finishes the job! There are no known users and ceph does not use the
ioctl anymore.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Sage Weil <sage@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
are created with quota enabled
When multiple pending snapshots referring to the same source subvolume
are executed, enabled quota will cause root item corruption, where root
items are using old bytenr (no backref in extent tree).
This can be triggered by fstests btrfs/152.
The cause is when source subvolume is still dirty, extra commit
(simplied transaction commit) of qgroup_account_snapshot() can skip
dirty roots not recorded in current transaction, making root item of
source subvolume not updated.
Fix it by forcing recording source subvolume in current transaction
before qgroup sub-transaction commit.
Reported-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When performing an unlock on an extent buffer we'd like to order the
decrement of extent_buffer::blocking_writers with waking up any
waiters. In such situations it's sufficient to use smp_mb__after_atomic
rather than the heavy smp_mb. On architectures where atomic operations
are fully ordered (such as x86 or s390) unconditionally executing
a heavyweight smp_mb instruction causes a severe hit to performance
while bringin no improvements in terms of correctness.
The better thing is to use the appropriate smp_mb__after_atomic routine
which will do the correct thing (invoke a full smp_mb or in the case
of ordered atomics insert a compiler barrier). Put another way,
an RMW atomic op + smp_load__after_atomic equals, in terms of
semantics, to a full smp_mb. This ensures that none of the problems
described in the accompanying comment of waitqueue_active occur.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Some functions can filter metadata by the generation. Add a define that
will annotate such arguments.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The called function name is self explanatory.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The current implementation of btrfs_page_exists_in_range() gives the
wrong answer if the workingset code has stored a shadow entry in the
page cache. The filemap_range_has_page() function does not have this
problem, and it's shared code, so use it instead.
eigned-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Refactor the call to EXT4_ERROR_INODE() into ext4_xattr_check_block().
This simplifies the code, and fixes a problem where not all callers of
ext4_xattr_check_block() were not resulting in ext4_error() getting
called when the xattr block is corrupted.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:
WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
[..]
CPU: 27 PID: 1783 Comm: dma-collision Tainted: G O 4.15.0-rc2+ #984
[..]
Call Trace:
set_page_dirty_lock+0x40/0x60
bio_set_pages_dirty+0x37/0x50
iomap_dio_actor+0x2b7/0x3b0
? iomap_dio_zero+0x110/0x110
iomap_apply+0xa4/0x110
iomap_dio_rw+0x29e/0x3b0
? iomap_dio_zero+0x110/0x110
? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
xfs_file_read_iter+0xa0/0xc0 [xfs]
__vfs_read+0xf9/0x170
vfs_read+0xa6/0x150
SyS_pread64+0x93/0xb0
entry_SYSCALL_64_fastpath+0x1f/0x96
...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Block device inodes never have S_DAX set, so kill the check for DAX and
diversion to dax_writeback_mapping_range().
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations are there in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions, dax_writeback_mapping_range() may not be referenced in the
FS_DAX=n case.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
If a page is already locked, attempting to dirty it leads to a deadlock
in lock_page(). This is what currently happens to ITER_BVEC pages when
a dio-enabled loop device is backed by ceph:
$ losetup --direct-io /dev/loop0 /mnt/cephfs/img
$ xfs_io -c 'pread 0 4k' /dev/loop0
Follow other file systems and only dirty ITER_IOVEC pages.
Cc: stable@kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Previously, mount -l would show data=<mode> even if the ext4 default
journaling mode was being used. Change this to be consistent with the
rest of the options.
Ext4 already did the right thing when the journaling mode being used
matched the one specified in the superblock's default mount options. The
reason it failed to do the right thing for the ext4 defaults is that,
when set, they were never included in sbi->s_def_mount_opt (unlike the
superblock's defaults, which were).
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Don't show init_itable=n in /proc/fs/ext4/<dev>/options when filesystem
is mounted with noinit_itable.
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously, /proc/fs/ext4/<dev>/options would only show binary options
if they were set (1 in the options bit mask). E.g. it would show "grpid"
if it was set, but it would not show "nogrpid" if grpid was not set.
This seems sensible, but when an option is absent from the file, it can
be hard for the unfamiliar to know what is being used. E.g. if there
isn't a (no)grpid entry, nogrpid is in effect. But if there isn't a
(no)auto_da_alloc entry, auto_da_alloc is in effect. If there isn't a
(minixdf|bsddf) entry, it turns out bsddf is in effect. It all depends
on how the option is implemented.
It's clearer to be explicit, so print the corresponding option
regardless of whether it means a 1 or a 0 in the bit mask.
Note that options which do not have an explicit disable option aren't
indicated as being disabled even with this change (e.g. dax).
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Replace kset with generic kobject provided by kobject_create_and_add(),
since the latter is sufficient.
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Make cleanup of ext4_feat kobject consistent with similar objects.
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If some metadata block, such as an allocation bitmap, overlaps the
superblock, it's very likely that if the file system is mounted
read/write, the results will not be pretty. So disallow r/w mounts
for file systems corrupted in this particular way.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
The extended attribute code now uses the crc32c checksum for hashing
purposes, so we should just always always initialize it. We also want
to prevent NULL pointer dereferences if one of the metadata checksum
features is enabled after the file sytsem is originally mounted.
This issue has been assigned CVE-2018-1094.
https://bugzilla.kernel.org/show_bug.cgi?id=199183
https://bugzilla.redhat.com/show_bug.cgi?id=1560788
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
If the root directory has an i_links_count of zero, then when the file
system is mounted, then when ext4_fill_super() notices the problem and
tries to call iput() the root directory in the error return path,
ext4_evict_inode() will try to free the inode on disk, before all of
the file system structures are set up, and this will result in an OOPS
caused by a NULL pointer dereference.
This issue has been assigned CVE-2018-1092.
https://bugzilla.kernel.org/show_bug.cgi?id=199179
https://bugzilla.redhat.com/show_bug.cgi?id=1560777
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Currently d_move(from, to) does the following:
* name/parent of from <- old name/parent of to, from hashed there
* to is unhashed
* name of to is preserved
* if from used to be detached, to gets detached
* if from used to be attached, parent of to <- old parent of from.
That's both user-visibly bogus and complicates reasoning a lot.
Much saner semantics would be
* name/parent of from <- name/parent of to, from hashed there.
* to is unhashed
* name/parent of to is unchanged.
The price, of course, is that old parent of from might lose a reference.
However,
* all potentially cross-directory callers of d_move() have both
parents pinned directly; typically, dentries themselves are grabbed
only after we have grabbed and locked both parents. IOW, the decrement
of old parent's refcount in case of d_move() won't reach zero.
* __d_move() from d_splice_alias() is done to detached alias.
No refcount decrements in that case
* __d_move() from __d_unalias() *can* get the refcount to zero.
So let's grab a reference to alias' old parent before calling __d_unalias()
and dput() it after we'd dropped rename_lock.
That does make d_splice_alias() potentially blocking. However, it has
no callers in non-sleepable contexts (and the case where we'd grown
that dget/dput pair is _very_ rare, so performance is not an issue).
Another thing that needs adjustment is unlocking in the end of __d_move();
folded it in. And cleaned the remnants of bogus ordering from the
"lock them in the beginning" counterpart - it's never been right and
now (well, for 7 years now) we have that thing always serialized on
rename_lock anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Those parts of fs/dcache.c are pretty much self-contained.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
shrink_dentry_list() holds dentry->d_lock and needs to acquire
dentry->d_inode->i_lock. This cannot be done with a spin_lock()
operation because it's the reverse of the regular lock order.
To avoid ABBA deadlocks it is done with a trylock loop.
Trylock loops are problematic in two scenarios:
1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
preemptible. As a consequence the i_lock holder can be preempted
by a higher priority task. If that task executes the trylock loop
it will do so forever and live lock.
2) In virtual machines trylock loops are problematic as well. The
VCPU on which the i_lock holder runs can be scheduled out and a
task on a different VCPU can loop for a whole time slice. In the
worst case this can lead to starvation. Commits 47be61845c77
("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9
("shrink_dentry_list(): take parent's d_lock earlier") are
addressing exactly those symptoms.
Avoid the trylock loop by using dentry_kill(). When pruning ancestors,
the same code applies that is used to kill a dentry in dput(). This
also has the benefit that the locking order is now the same. First
the inode is locked, then the parent.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
In case when trylock in there fails, deal with it directly in
dentry_kill(). Note that in cases when we drop and retake
->d_lock, we need to recheck whether to retain the dentry.
Another thing is that dropping/retaking ->d_lock might have
ended up with negative dentry turning into positive; that,
of course, can happen only once...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Turn the "trylock failed" part into uninlined __lock_parent().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
all remaining callers hold either a reference or ->i_lock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
In case of trylock failure don't re-add to the list - drop the locks
and carefully get them in the right order. For shrink_dentry_list(),
somebody having grabbed a reference to dentry means that we can
kick it off-list, so if we find dentry being modified under us we
don't need to play silly buggers with retries anyway - off the list
it is.
The locking logics taken out into a helper of its own; lock_parent()
is no longer used for dentries that can be killed under us.
[fix from Eric Biggers folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
ext4 isn't validating the sizes of xattrs where the value of the xattr
is stored in an external inode. This is problematic because
->e_value_size is a u32, but ext4_xattr_get() returns an int. A very
large size is misinterpreted as an error code, which ext4_get_acl()
translates into a bogus ERR_PTR() for which IS_ERR() returns false,
causing a crash.
Fix this by validating that all xattrs are <= INT_MAX bytes.
This issue has been assigned CVE-2018-1095.
https://bugzilla.kernel.org/show_bug.cgi?id=199185
https://bugzilla.redhat.com/show_bug.cgi?id=1560793
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
|
|
This patch spits out the time taken by the various steps in the
journal recover process. Previously, the journal recovery time
didn't account for finding the journal head in the log which takes
up a significant portion of time.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Today if we run xfs_fsr and crash[1], log replay can fail because
the recovery code tries to instantiate the donor inode from
disk to replay the swapext, but it's been deleted and we get
verifier failures when we try to read the inode off disk with
i_mode == 0.
This fixes both sides: We don't log the swapext change if the
inode has been deleted, and we don't try to recover it either.
[1] or if systemd doesn't cleanly unmount root, as it is wont
to do ...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Instead of zeroing out fallocated blocks in gfs2_iomap_alloc, zero them
out in fallocate_chunk, much higher up the call stack. This gets rid of
gfs2's abuse of the IOMAP_ZERO flag as well as the gfs2 specific zeronew
buffer flag. I can't think of a reason why zeroing out the blocks in
gfs2_iomap_alloc would have any benefits: there is no additional locking
at that level that would add protection to the newly allocated blocks.
While at it, change fallocate over from gs2_block_map to gfs2_iomap_begin.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
|
|
Trivial fix to spelling mistake in debug message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
|
|
mount_crypt_stat is assigned to
&ecryptfs_superblock_to_private(ecryptfs_dentry->d_sb)->mount_crypt_stat,
and mount_crypt_stat is not the first object in struct ecryptfs_sb_info.
mount_crypt_stat is therefore never NULL. At the same time, no crash
in ecryptfs_lookup() has been reported, and the lookup functions in
other file systems don't check if d_sb is NULL either.
Given that, remove the NULL check.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
|
|
Reserve an F2FS feature flag and inode flag for fs-verity. This is an
in-development feature that is planned be discussed at LSF/MM 2018 [1].
It will provide file-based integrity and authenticity for read-only
files. Most code will be in a filesystem-independent module, with
smaller changes needed to individual filesystems that opt-in to
supporting the feature. An early prototype supporting F2FS is available
[2]. Reserving the F2FS on-disk bits for fs-verity will prevent users
of the prototype from conflicting with other new F2FS features.
Note that we're reserving the inode flag in f2fs_inode.i_advise, which
isn't really appropriate since it's not a hint or advice. But
->i_advise is already being used to hold the 'encrypt' flag; and F2FS's
->i_flags uses the generic FS_* values, so it seems ->i_flags can't be
used for an F2FS-specific flag without additional work to remove the
assumption that ->i_flags uses the generic flags namespace.
[1] https://marc.info/?l=linux-fsdevel&m=151690752225644
[2] https://git.kernel.org/pub/scm/linux/kernel/git/mhalcrow/linux.git/log/?h=fs-verity-dev
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We want the hyperv fix in here for merging and testing.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
And use it in a few more places rather than opencoding the values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|