Age | Commit message (Collapse) | Author |
|
Create refcount update intent/done log items to record redo
information in the log. Because we need to roll transactions between
updating the bmbt mapping and updating the reverse mapping, we also
have to track the status of the metadata updates that will be recorded
in the post-roll transactions, just in case we crash before committing
the final transaction. This mechanism enables log recovery to finish
what was already started.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Implement the generic btree operations required to manipulate refcount
btree blocks. The implementation is similar to the bmapbt, though it
will only allocate and free blocks from the AG.
Since the refcount root and level fields are separate from the
existing roots and levels array, they need a separate logging flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: fix logging of AGF refcount btree fields]
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Every time we allocate or free a data extent, we might need to split
the refcount btree. Reserve some blocks in the transaction to handle
this possibility. Even though the deferred refcount code can roll a
transaction to avoid overloading the transaction, we can still exceed
the reservation.
Certain pathological workloads (1k blocks, no cowextsize hint, random
directio writes), cause a perfect storm wherein a refcount adjustment
of a large range of blocks causes full tree splits in two separate
extents in two separate refcount tree blocks; allocating new refcount
tree blocks causes rmap btree splits; and all the allocation activity
causes the freespace btrees to split, blowing the reservation.
(Reproduced by generic/167 over NFS atop XFS)
Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick.wong@oracle.com: add commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Modify the growfs code to initialize new refcount btree blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Start constructing the refcount btree implementation by establishing
the on-disk format and everything needed to read, write, and
manipulate the refcount btree blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Since XFS reserves a small amount of space in each AG as the minimum
free space needed for an operation, save some more space in case we
touch the refcount btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Add new per-AG refcount btree definitions to the per-AG structures.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Define all the tracepoints we need to inspect the refcount btree
runtime operation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
If the size of an inline directory is so small that it doesn't
even cover the required header size, return an error to userspace
instead of ASSERTing and returning 0 like everything's ok.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Add a new fallocate mode flag that explicitly unshares blocks on
filesystems that support such features. The new flag can only
be used with an allocate-mode fallocate call.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Using list_move() instead of list_del() + list_add().
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
Accessing / causes failuire if the client has caps that restrict path
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
Signed-off-by: Yan, Zheng <zyan@redhat.com>
|
|
If O_DIRECT writes are racing with buffered writes, then
the call to invalidate_inode_pages2_range() can call ceph_releasepage()
on dirty pages.
Most filesystems hold inode_lock() across O_DIRECT writes so they do not
suffer this race, but cephfs deliberately drops the lock, and opens a window
for the race.
This race can be triggered with the generic/036 test from the xfstests
test suite. It doesn't happen every time, but it does happen often.
As the possibilty is expected, remove the warning, and instead include
the PageDirty() status in the debug message.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
|
|
This call can fail if there are dirty pages. The preceding call to
filemap_write_and_wait_range() will normally remove dirty pages, but
as inode_lock() is not held over calls to ceph_direct_read_write(), it
could race with non-direct writes and pages could be dirtied
immediately after filemap_write_and_wait_range() returns
If there are dirty pages, they will be removed by the subsequent call
to truncate_inode_pages_range(), so having them here is not a problem.
If the 'ret' value is left holding an error, then in the async IO case
(aio_req is not NULL) the loop that would normally call
ceph_osdc_start_request() will see the error in 'ret' and abort all
requests. This doesn't seem like correct behaviour.
So use separate 'ret2' instead of overloading 'ret'.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
|
|
If start_page() fails to add a page to page cache or fails to send
OSD request. It should cal put_page() (instead of free_page()) for
relevant pages.
Besides, start_page() need to cancel fscache readpage if it fails
to send OSD request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reported-by: Zhi Zhang <zhang.david2011@gmail.com>
|
|
Don't let userspace filesystem give bogus values for the size of xattr and
xattr list.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Three sets of overlapping changes. Nothing serious.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
After the call to __blkdev_direct_IO the final reference to the file
might have been dropped by aio_complete already, and the call to
file_accessed might cause a use after free.
Instead update the access time before the I/O, similar to how we
update the time stamps before writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-and-tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
For 32-bit architectures we need to cast first_block to u64 before
shifting it left.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Fix various inconsistencies in the documentation associated with various
functions.
In the case of fs/ubifs/lprops.c, the second parameter of
ubifs_get_lp_stats was renamed from st to lst in commit 84abf972ccff
("UBIFS: add re-mount debugging checks")
In the case of fs/ubifs/lpt_commit.c, the excess variables have never
existed in the associated functions since the code was introduced into the
kernel.
The others appear to be straightforward typos.
Issues detected using Coccinelle (http://coccinelle.lip6.fr/)
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
When an extended attribute is changed, xattr_len of host inode is
recalculated. ui->data_len is updated before computation and result
is wrong. This patch adds a temporary variable to fix computation.
To reproduce the issue:
~# > a.txt
~# attr -s an-attr -V a-value a.txt
~# attr -s an-attr -V a-bit-bigger-value a.txt
Now host inode xattr_len is wrong. Forcing dbg_check_filesystem()
generates the following error:
[ 130.620140] UBIFS (ubi0:2): background thread "ubifs_bgt0_2" started, PID 565
[ 131.470790] UBIFS error (ubi0:2 pid 564): check_inodes: inode 646 has xattr size 240, but calculated size is 256
[ 131.481697] UBIFS (ubi0:2): dump of the inode 646 sitting in LEB 29:114688
[ 131.488953] magic 0x6101831
[ 131.492876] crc 0x9fce9091
[ 131.496836] node_type 0 (inode node)
[ 131.501193] group_type 1 (in node group)
[ 131.505788] sqnum 9278
[ 131.509191] len 160
[ 131.512549] key (646, inode)
[ 131.516688] creat_sqnum 9270
[ 131.520133] size 0
[ 131.523264] nlink 1
[ 131.526398] atime 1053025857.0
[ 131.530574] mtime 1053025857.0
[ 131.534714] ctime 1053025906.0
[ 131.538849] uid 0
[ 131.542009] gid 0
[ 131.545140] mode 33188
[ 131.548636] flags 0x1
[ 131.551977] xattr_cnt 1
[ 131.555108] xattr_size 240
[ 131.558420] xattr_names 12
[ 131.561670] compr_type 0x1
[ 131.564983] data len 0
[ 131.568125] UBIFS error (ubi0:2 pid 564): dbg_check_filesystem: file-system check failed with error -22
[ 131.578074] CPU: 0 PID: 564 Comm: mount Not tainted 4.4.12-g3639bea54a #24
[ 131.585352] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 131.591918] [<c00151c0>] (unwind_backtrace) from [<c0012acc>] (show_stack+0x10/0x14)
[ 131.600177] [<c0012acc>] (show_stack) from [<c01c950c>] (dbg_check_filesystem+0x464/0x4d0)
[ 131.608934] [<c01c950c>] (dbg_check_filesystem) from [<c019f36c>] (ubifs_mount+0x14f8/0x2130)
[ 131.617991] [<c019f36c>] (ubifs_mount) from [<c00d7088>] (mount_fs+0x14/0x98)
[ 131.625572] [<c00d7088>] (mount_fs) from [<c00ed674>] (vfs_kern_mount+0x4c/0xd4)
[ 131.633435] [<c00ed674>] (vfs_kern_mount) from [<c00efb5c>] (do_mount+0x988/0xb50)
[ 131.641471] [<c00efb5c>] (do_mount) from [<c00f004c>] (SyS_mount+0x74/0xa0)
[ 131.648837] [<c00f004c>] (SyS_mount) from [<c000fe20>] (ret_fast_syscall+0x0/0x3c)
[ 131.665315] UBIFS (ubi0:2): background thread "ubifs_bgt0_2" stops
Signed-off-by: Pascal Eberhard <pascal.eberhard@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
...to make the code more consistent since we use
move already in other places.
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Adds RENAME_EXCHANGE to UBIFS, the operation itself
is completely disjunct from a regular rename() that's
why we dispatch very early in ubifs_reaname().
RENAME_EXCHANGE used by the renameat2() system call
allows the caller to exchange two paths atomically.
Both paths have to exist and have to be on the same
filesystem.
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Adds RENAME_WHITEOUT support to UBIFS, we implement
it in the same way as ext4 and xfs do.
For an overview of other ways to implement it please
refere to commit 7dcf5c3e4527 ("xfs: add RENAME_WHITEOUT support").
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
This patchs adds O_TMPFILE support to UBIFS.
A temp file is a reference to an unlinked inode, a user
holding the reference can use it. As soon it is being closed
all data vanishes.
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The two invocations share little code.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
In preparation for posix acl support, rework fuse to use xattr handlers and
the generic setxattr/getxattr/listxattr callbacks. Split the xattr code
out into it's own file, and promote symbols to module-global scope as
needed.
Functionally these changes have no impact, as fuse still uses a single
handler for all xattrs which uses the old callbacks.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Only two flags: "default_permissions" and "allow_other". All other flags
are handled via bitfields. So convert these two as well. They don't
change during the lifetime of the filesystem, so this is quite safe.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Make sure userspace filesystem is returning a well formed list of xattr
names (zero or more nonzero length, null terminated strings).
[Michael Theall: only verify in the nonzero size case]
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
|
|
And check for valid nsec value before passing into timespec64_to_jiffies().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Store in memory pointed to by ->d_fsdata. Use ->d_init() to allocate the
storage. Need to use RCU freeing because the data is used in RCU lookup
mode.
We could cast ->d_fsdata directly on 64bit archs, but I don't think this is
worth the extra complexity.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add a new INIT flag, FUSE_POSIX_ACL, for negotiating ACL support with
userspace. When it is set in the INIT response, ACL support will be
enabled. ACL support also implies "default_permissions".
When ACL support is enabled, the kernel will cache and have responsibility
for enforcing ACLs. ACL xattrs will be passed to userspace, which is
responsible for updating the ACLs in the filesystem, keeping the file mode
in sync, and inheritance of default ACLs when new filesystem nodes are
created.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Only userspace filesystem can do the killing of suid/sgid without races.
So introduce an INIT flag and negotiate support for this.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Fuse allowed VFS to set mode in setattr in order to clear suid/sgid on
chown and truncate, and (since writeback_cache) write. The problem with
this is that it'll potentially restore a stale mode.
The poper fix would be to let the filesystems do the suid/sgid clearing on
the relevant operations. Possibly some are already doing it but there's no
way we can detect this.
So fix this by refreshing and recalculating the mode. Do this only if
ATTR_KILL_S[UG]ID is set to not destroy performance for writes. This is
still racy but the size of the window is reduced.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
|
|
Without "default_permissions" the userspace filesystem's lookup operation
needs to perform the check for search permission on the directory.
If directory does not allow search for everyone (this is quite rare) then
userspace filesystem has to set entry timeout to zero to make sure
permissions are always performed.
Changing the mode bits of the directory should also invalidate the
(previously cached) dentry to make sure the next lookup will have a chance
of updating the timeout, if needed.
Reported-by: Jean-Pierre André <jean-pierre.andre@wanadoo.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
|
|
This patch add update_ckpt_flags() to clean up the flow.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
While we call ->writepages, there are two cases:
a. we didn't writeout any dirty pages, since they are writebacked by other
thread concurrently.
b. we writeout dirty pages, and have already submitted bio to block layer.
In these cases, we don't need to do additional bio flushing unnecessarily,
it may split bio in cache into smaller one.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In sync_node_pages, we won't check and commit last merged pages in private
bio cache of f2fs, as these pages were taged as writeback, someone who is
waiting for writebacking of the page will be blocked until the cache was
committed by someone else.
We need to commit node type bio cache to avoid potential deadlock or long
delay of waiting writeback.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
There exists almost same codes when get the value of pre_version
and cur_version in function validate_checkpoint, this patch adds
get_checkpoint_version to clean up redundant codes.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|