Age | Commit message (Collapse) | Author |
|
Split out a xlog_add_buffer_cancelled helper which does the low-level
manipulation of the buffer cancelation table, and in that helper call
xlog_find_buffer_cancelled instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Don't bother to allocate memory and convert the log item when we
only need the block number and the length. Just extract them directly
and call xlog_buf_readahead separately in each branch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Add a little helper to readahead a buffer if it hasn't been cancelled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
This list contains pretty much everything that is not a buffer. The
comment calls it item_list, which is a much better name than inode
list, so switch the actual variable name to that as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Replace the somewhat convoluted use of xlog_peek_buffer_cancelled and
xlog_check_buffer_cancelled with two obvious helpers:
xlog_is_buffer_cancelled, which returns true if there is a buffer in
the cancellation table, and
xlog_put_buffer_cancelled, which also decrements the reference count
of the buffer cancellation table.
Both share a little helper to look up the entry.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
There are a couple places where we directly call printk_once() and one
of them doesn't follow the standard xfs subsystem printk format as a
result.
#define printk_once variants to go with our existing printk_ratelimited
#defines so we can do one-shot printks in a consistent manner.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
I ran into a linker warning in XFS that originates from a mismatch
between libelf, binutils and objtool when certain files in the kernel
are built with "gcc -g":
x86_64-linux-ld: fs/xfs/xfs_trace.o: unable to initialize decompress status for section .debug_info
After some discussion, nobody could identify why xfs sets this flag
here. CONFIG_XFS_DEBUG used to enable lots of unrelated settings, but
now its main purpose is to enable extra consistency checks and assertions
that are unrelated to the debug info.
Remove the Makefile logic to set the flag here. If anyone relies
on the debug info, this can simply be enabled again with the global
CONFIG_DEBUG_INFO option.
Dave Chinner writes:
I'm pretty sure it was needed for the original kgdb integration back
in the early 2000s. That was when SGI used to patch their XFS dev
tree with kgdb and debug symbols were needed by the custom kgdb
modules that were ported across from the Irix kernel debugger.
ISTR that the early kcrash kernel dump analysis tools (again,
originated from the Irix "icrash" kernel dump tools) had custom XFS
debug scripts that needed also the debug info to work correctly...
Which is a long way of saying "we don't need it anymore" instead of
"nobody knows why it was set"... :)
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20200409074130.GD21033@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Since the "no-allocation" reservations has been removed, the resblks
value should be larger than zero, so remove the unnecessary check.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Simplify the setting of the flags value, and only consider
quota enforcement stuff here.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The check XFS_IS_QUOTA_RUNNING() has been done when enter the
xfs_qm_vop_create_dqattach() function, it will return directly
if the result is false, so the followed XFS_IS_QUOTA_RUNNING()
assertion is unnecessary. If we truly care about this, the check
also can be added to the condition of next if statements.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The initial value of variable udqp is NULL, and we only set the
flag XFS_QMOPT_PQUOTA in xfs_qm_vop_dqalloc() function, so only
the pdqp value is initialized and the udqp value is still NULL.
Since the udqp value is NULL in the rest part of xfs_ioctl_setattr()
function, it is meaningless and do nothing. So remove it from
xfs_ioctl_setattr().
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
We share an inode between gquota and pquota with the older
superblock that doesn't have separate pquotino, and for the
need_alloc == false case we don't need to call xfs_dir_ialloc()
function, so add the check if reserved free disk blocks is
needed.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The two if statements have same condition, and the mask value
does not change in xfs_setattr_nonsize(), so combine them.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The trace event xfs_dquot_dqalloc does not depend on the
value uq, so remove the condition, and trace quota allocations
for all quota types.
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
When we're sorting recovered log items ahead of recovering them and
encounter a log item of unknown type, actually print the type code when
we're rejecting the whole transaction to aid in debugging.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In order for users to determine if a file is currently operating in DAX
state (effective DAX). Define a statx attribute value and set that
attribute if the effective DAX flag is set.
To go along with this we propose the following addition to the statx man
page:
STATX_ATTR_DAX
The file is in the DAX (cpu direct access) state. DAX state
attempts to minimize software cache effects for both I/O and
memory mappings of this file. It requires a file system which
has been configured to support DAX.
DAX generally assumes all accesses are via cpu load / store
instructions which can minimize overhead for small accesses, but
may adversely affect cpu utilization for large transfers.
File I/O is done directly to/from user-space buffers and memory
mapped I/O may be performed with direct memory mappings that
bypass kernel page cache.
While the DAX property tends to result in data being transferred
synchronously, it does not give the same guarantees of O_SYNC
where data and the necessary metadata are transferred together.
A DAX file may support being mapped with the MAP_SYNC flag,
which enables a program to use CPU cache flush instructions to
persist CPU store operations without an explicit fsync(2). See
mmap(2) for more information.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
cachefiles_read_or_alloc_pages()
The patch which changed cachefiles from calling ->bmap() to using the
bmap() wrapper overwrote the running return value with the result of
calling bmap(). This causes an assertion failure elsewhere in the code.
Fix this by using ret2 rather than ret to hold the return value.
The oops looks like:
kernel BUG at fs/nfs/fscache.c:468!
...
RIP: 0010:__nfs_readpages_from_fscache+0x18b/0x190 [nfs]
...
Call Trace:
nfs_readpages+0xbf/0x1c0 [nfs]
? __alloc_pages_nodemask+0x16c/0x320
read_pages+0x67/0x1a0
__do_page_cache_readahead+0x1cf/0x1f0
ondemand_readahead+0x172/0x2b0
page_cache_async_readahead+0xaa/0xe0
generic_file_buffered_read+0x852/0xd50
? mem_cgroup_commit_charge+0x6e/0x140
? nfs4_have_delegation+0x19/0x30 [nfsv4]
generic_file_read_iter+0x100/0x140
? nfs_revalidate_mapping+0x176/0x2b0 [nfs]
nfs_file_read+0x6d/0xc0 [nfs]
new_sync_read+0x11a/0x1c0
__vfs_read+0x29/0x40
vfs_read+0x8e/0x140
ksys_read+0x61/0xd0
__x64_sys_read+0x1a/0x20
do_syscall_64+0x60/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5d148267e0
Fixes: 10d83e11a582 ("cachefiles: drop direct usage of ->bmap method.")
Reported-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: David Wysochanski <dwysocha@redhat.com>
cc: Carlos Maiolino <cmaiolino@redhat.com>
|
|
The prepare_to_wait() and finish_wait() calls in io_uring_cancel_files()
are mismatched. Currently I don't see any issues related this bug, just
find it by learning codes.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs fixes from David Sterba:
"A few more stability fixes, minor build warning fixes and git url
fixup:
- fix partial loss of prealloc extent past i_size after fsync
- fix potential deadlock due to wrong transaction handle passing via
journal_info
- fix gcc 4.8 struct intialization warning
- update git URL in MAINTAINERS entry"
* tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: btrfs: fix git repo URL
btrfs: fix gcc-4.8 build warning for struct initializer
btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info
btrfs: fix partial loss of prealloc extent past i_size after fsync
|
|
Pull iomap fix from Darrick Wong:
"Hoist the check for an unrepresentable FIBMAP return value into
ioctl_fibmap.
The internal kernel function can handle 64-bit values (and is needed
to fix a regression on ext4 + jbd2). It is only the userspace ioctl
that is so old that it cannot deal"
* tag 'iomap-5.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fibmap: Warn and return an error in case of block > INT_MAX
|
|
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix handling of backchannel binding in BIND_CONN_TO_SESSION
Bugfixes:
- Fix a credential use-after-free issue in pnfs_roc()
- Fix potential posix_acl refcnt leak in nfs3_set_acl
- defer slow parts of rpc_free_client() to a workqueue
- Fix an Oopsable race in __nfs_list_for_each_server()
- Fix trace point use-after-free race
- Regression: the RDMA client no longer responds to server disconnect
requests
- Fix return values of xdr_stream_encode_item_{present, absent}
- _pnfs_return_layout() must always wait for layoutreturn completion
Cleanups:
- Remove unreachable error conditions"
* tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a race in __nfs_list_for_each_server()
NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION
SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
NFSv4: Remove unreachable error condition due to rpc_run_task()
SUNRPC: Remove unreachable error condition
xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
xprtrdma: Fix trace point use-after-free race
xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
... and the same for its compat counterpart
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Pull io_uring fixes from Jens Axboe:
- Fix for statx not grabbing the file table, making AT_EMPTY_PATH fail
- Cover a few cases where async poll can handle retry, eliminating the
need for an async thread
- fallback request busy/free fix (Bijan)
- syzbot reported SQPOLL thread exit fix for non-preempt (Xiaoguang)
- Fix extra put of req for sync_file_range (Pavel)
- Always punt splice async. We'll improve this for 5.8, but wanted to
eliminate the inode mutex lock from the non-blocking path for 5.7
(Pavel)
* tag 'io_uring-5.7-2020-05-01' of git://git.kernel.dk/linux-block:
io_uring: punt splice async because of inode mutex
io_uring: check non-sync defer_list carefully
io_uring: fix extra put in sync_file_range()
io_uring: use cond_resched() in io_ring_ctx_wait_and_kill()
io_uring: use proper references for fallback_req locking
io_uring: only force async punt if poll based retry can't handle it
io_uring: enable poll retry for any file with ->read_iter / ->write_iter
io_uring: statx must grab the file table for valid fd
|
|
Nonblocking do_splice() still may wait for some time on an inode mutex.
Let's play safe and always punt it async.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_req_defer() do double-checked locking. Use proper helpers for that,
i.e. list_empty_careful().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
[ 40.179474] refcount_t: underflow; use-after-free.
[ 40.179499] WARNING: CPU: 6 PID: 1848 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0
...
[ 40.179612] RIP: 0010:refcount_warn_saturate+0xae/0xf0
[ 40.179617] Code: 28 44 0a 01 01 e8 d7 01 c2 ff 0f 0b 5d c3 80 3d 15 44 0a 01 00 75 91 48 c7 c7 b8 f5 75 be c6 05 05 44 0a 01 01 e8 b7 01 c2 ff <0f> 0b 5d c3 80 3d f3 43 0a 01 00 0f 85 6d ff ff ff 48 c7 c7 10 f6
[ 40.179619] RSP: 0018:ffffb252423ebe18 EFLAGS: 00010286
[ 40.179623] RAX: 0000000000000000 RBX: ffff98d65e929400 RCX: 0000000000000000
[ 40.179625] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff
[ 40.179627] RBP: ffffb252423ebe18 R08: 0000000000000001 R09: 000000000000055d
[ 40.179629] R10: 0000000000000c8c R11: 0000000000000001 R12: 0000000000000000
[ 40.179631] R13: ffff98d68c434400 R14: ffff98d6a9cbaa20 R15: ffff98d6a609ccb8
[ 40.179634] FS: 0000000000000000(0000) GS:ffff98d6af580000(0000) knlGS:0000000000000000
[ 40.179636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 40.179638] CR2: 00000000033e3194 CR3: 000000006480a003 CR4: 00000000003606e0
[ 40.179641] Call Trace:
[ 40.179652] io_put_req+0x36/0x40
[ 40.179657] io_free_work+0x15/0x20
[ 40.179661] io_worker_handle_work+0x2f5/0x480
[ 40.179667] io_wqe_worker+0x2a9/0x360
[ 40.179674] ? _raw_spin_unlock_irqrestore+0x24/0x40
[ 40.179681] kthread+0x12c/0x170
[ 40.179685] ? io_worker_handle_work+0x480/0x480
[ 40.179690] ? kthread_park+0x90/0x90
[ 40.179695] ret_from_fork+0x35/0x40
[ 40.179702] ---[ end trace 85027405f00110aa ]---
Opcode handler must never put submission ref, but that's what
io_sync_file_range_finish() do. use io_steal_work() there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While working on to make io_uring sqpoll mode support syscalls that need
struct files_struct, I got cpu soft lockup in io_ring_ctx_wait_and_kill(),
while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
cpu_relax();
above loop never has an chance to exit, it's because preempt isn't enabled
in the kernel, and the context calling io_ring_ctx_wait_and_kill() and
io_sq_thread() run in the same cpu, if io_sq_thread calls a cond_resched()
yield cpu and another context enters above loop, then io_sq_thread() will
always in runqueue and never exit.
Use cond_resched() can fix this issue.
Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use ctx->fallback_req address for test_and_set_bit_lock() and
clear_bit_unlock().
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We do blocking retry from our poll handler, if the file supports polled
notifications. Only mark the request as needing an async worker if we
can't poll for it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We can have files like eventfd where it's perfectly fine to do poll
based retry on them, right now io_file_supports_async() doesn't take
that into account.
Pass in data direction and check the f_op instead of just always needing
an async worker.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When opening user access to only perform reads, only open read access.
When opening user access to only perform writes, only open write
access.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2e73bc57125c2c6ab12a587586a4eed3a47105fc.1585898438.git.christophe.leroy@c-s.fr
|
|
The struct nfs_server gets put on the cl_superblocks list before
the server->super field has been initialised, in which case the
call to nfs_sb_active() will Oops. Add a check to ensure that
we skip such a list entry.
Fixes: 3c9e502b59fb ("NFS: Add a helper nfs_client_for_each_server()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
We better warn the fibmap user and not return a truncated and therefore
an incorrect block map address if the bmap() returned block address
is greater than INT_MAX (since user supplied integer pointer).
It's better to pr_warn() all user of ioctl_fibmap() and return a proper
error code rather than silently letting a FS corruption happen if the
user tries to fiddle around with the returned block map address.
We fix this by returning an error code of -ERANGE and returning 0 as the
block mapping address in case if it is > INT_MAX.
Now iomap_bmap() could be called from either of these two paths.
Either when a user is calling an ioctl_fibmap() interface to get
the block mapping address or by some filesystem via use of bmap()
internal kernel API.
bmap() kernel API is well equipped with handling of u64 addresses.
WARN condition in iomap_bmap_actor() was mainly added to warn all
the fibmap users. But now that we have directly added this warning
for all fibmap users and also made sure to return 0 as block map address
in case if addr > INT_MAX.
So we can now remove this logic from iomap_bmap_actor().
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Some older compilers like gcc-4.8 warn about mismatched curly braces in
a initializer:
fs/btrfs/backref.c: In function 'is_shared_data_backref':
fs/btrfs/backref.c:394:9: error: missing braces around
initializer [-Werror=missing-braces]
struct prelim_ref target = {0};
^
fs/btrfs/backref.c:394:9: error: (near initialization for
'target.rbnode') [-Werror=missing-braces]
Use the GNU empty initializer extension to avoid this.
Fixes: ed58f2e66e84 ("btrfs: backref, don't add refs from shared block when resolving normal backref")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As of now during open(), we don't pass bunch of flags to underlying
filesystem. O_TRUNC is one of these. Normally this is not a problem as VFS
calls ->setattr() with zero size and underlying filesystem sets file size
to 0.
But when overlayfs is running on top of virtiofs, it has an optimization
where it does not send setattr request to server if dectects that
truncation is part of open(O_TRUNC). It assumes that server already zeroed
file size as part of open(O_TRUNC).
fuse_do_setattr() {
if (attr->ia_valid & ATTR_OPEN) {
/*
* No need to send request to userspace, since actual
* truncation has already been done by OPEN. But still
* need to truncate page cache.
*/
}
}
IOW, fuse expects O_TRUNC to be passed to it as part of open flags.
But currently overlayfs does not pass O_TRUNC to underlying filesystem
hence fuse/virtiofs breaks. Setup overlayfs on top of virtiofs and
following does not zero the file size of a file is either upper only or has
already been copied up.
fd = open(foo.txt, O_TRUNC | O_WRONLY);
There are two ways to fix this. Either pass O_TRUNC to underlying
filesystem or clear ATTR_OPEN from attr->ia_valid so that fuse ends up
sending a SETATTR request to server. Miklos is concerned that O_TRUNC might
have side affects so it is better to clear ATTR_OPEN for now. Hence this
patch clears ATTR_OPEN from attr->ia_valid.
I found this problem while running unionmount-testsuite. With this patch,
unionmount-testsuite passes with overlayfs on top of virtiofs.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: bccece1ead36 ("ovl: allow remote upper")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
ovl_setattr() can be passed an attr which has ATTR_FILE set and
attr->ia_file is a file pointer to overlay file. This is done in
open(O_TRUNC) path.
We should either replace with attr->ia_file with underlying file object or
clear ATTR_FILE so that underlying filesystem does not end up using
overlayfs file object pointer.
There are no good use cases yet so for now clear ATTR_FILE. fuse seems to
be one user which can use this. But it can work even without this. So it
is not mandatory to pass ATTR_FILE to fuse.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: bccece1ead36 ("ovl: allow remote upper")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
With the introduction of exchange_tids thread_group_leader and
has_group_leader_pid have become equivalent. Further at this point in the
code a thread group has exactly two threads, the previous thread_group_leader
that is waiting to be reaped and tsk. So we know it is impossible for tsk to
be the thread_group_leader.
This is also the last user of has_group_leader_pid so removing this check
will allow has_group_leader_pid to be removed.
So remove the "BUG_ON(has_group_leader_pid)" that will never fire.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
Pull vfs fixes from Al Viro:
"Two old bugs..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
propagate_one(): mnt_set_mountpoint() needs mount_lock
dlmfs_file_write(): fix the bogosity in handling non-zero *ppos
|
|
Commit 6fcf0c72e4b9, a fix to get_tree_bdev() put a missing blkdev_put() in
the wrong place, before a warnf() that displays the bdev under
consideration rather after it.
This results in a silent lockup in printk("%pg") called via warnf() from
get_tree_bdev() under some circumstances when there's a race with the
blockdev being frozen. This can be caused by xfstests/tests/generic/085 in
combination with Lukas Czerner's ext4 mount API conversion patchset. It
looks like it ought to occur with other users of get_tree_bdev() such as
XFS, but apparently doesn't.
Fix this by switching the order of the lines.
Fixes: 6fcf0c72e4b9 ("vfs: add missing blkdev_put() in get_tree_bdev()")
Reported-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ian Kent <raven@themaw.net>
cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.
Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.
When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.
This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization"). Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.
The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently. Fix
this just so there are no little surprise corner cases waiting to bite
people.
-- Oleg points out this could be an issue in next_tgid in proc where
has_group_leader_pid is called, and reording some of the assignments
should fix that.
-- Oleg points out this will break the 10 year old hack in __exit_signal.c
> /*
> * This can only happen if the caller is de_thread().
> * FIXME: this is the temporary hack, we should teach
> * posix-cpu-timers to handle this case correctly.
> */
> if (unlikely(has_group_leader_pid(tsk)))
> posix_cpu_timers_exit_group(tsk);
The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.
Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
Currently, if the client sends BIND_CONN_TO_SESSION with
NFS4_CDFC4_FORE_OR_BOTH but only gets NFS4_CDFS4_FORE back it ignores
that it wasn't able to enable a backchannel.
To make sure, the client sends BIND_CONN_TO_SESSION as the first
operation on the connections (ie., no other session compounds haven't
been sent before), and if the client's request to bind the backchannel
is not satisfied, then reset the connection and retry.
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
|
|
Commit 64e90a8acb859 ("Introduce STATIC_USERMODEHELPER to mediate
call_usermodehelper()") added the optiont to disable all
call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to
an empty string. When this is done, and crashdump is triggered, it
will crash on null pointer dereference, since we make assumptions
over what call_usermodehelper_exec() did.
This has been reported by Sergey when one triggers a a coredump
with the following configuration:
```
CONFIG_STATIC_USERMODEHELPER=y
CONFIG_STATIC_USERMODEHELPER_PATH=""
kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
```
The way disabling the umh was designed was that call_usermodehelper_exec()
would just return early, without an error. But coredump assumes
certain variables are set up for us when this happens, and calls
ile_start_write(cprm.file) with a NULL file.
[ 2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 2.819859] #PF: supervisor read access in kernel mode
[ 2.820035] #PF: error_code(0x0000) - not-present page
[ 2.820188] PGD 0 P4D 0
[ 2.820305] Oops: 0000 [#1] SMP PTI
[ 2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7
[ 2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
[ 2.821150] RIP: 0010:do_coredump+0xd80/0x1060
[ 2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff
ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff
ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8
0 0f 84 9c 01 00 00 44
[ 2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246
[ 2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a
[ 2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000
[ 2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00
[ 2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0
[ 2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000
[ 2.824285] FS: 00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000
[ 2.824767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0
[ 2.825479] Call Trace:
[ 2.825790] get_signal+0x11e/0x720
[ 2.826087] do_signal+0x1d/0x670
[ 2.826361] ? force_sig_info_to_task+0xc1/0xf0
[ 2.826691] ? force_sig_fault+0x3c/0x40
[ 2.826996] ? do_trap+0xc9/0x100
[ 2.827179] exit_to_usermode_loop+0x49/0x90
[ 2.827359] prepare_exit_to_usermode+0x77/0xb0
[ 2.827559] ? invalid_op+0xa/0x30
[ 2.827747] ret_from_intr+0x20/0x20
[ 2.827921] RIP: 0033:0x55e2c76d2129
[ 2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01
c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89
e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0
0 00 00 00 00 0f 1f 40
[ 2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246
[ 2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718
[ 2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001
[ 2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00
[ 2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040
[ 2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 2.829964] CR2: 0000000000000020
[ 2.830149] ---[ end trace ceed83d8c68a1bf1 ]---
```
Cc: <stable@vger.kernel.org> # v4.11+
Fixes: 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()")
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795
Reported-by: Tony Vroon <chainsaw@gentoo.org>
Reported-by: Sergey Kvachonok <ravenexp@gmail.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fixes:
- transaction leak when deleting unused block group
- log cleanup after transaction abort
- fix block group leak when removing fails
- transaction leak if relocation recovery fails
- fix SPDX header
* tag 'for-5.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix transaction leak in btrfs_recover_relocation
btrfs: fix block group leak when removing fails
btrfs: drop logs when we've aborted a transaction
btrfs: fix memory leak of transaction when deleting unused block group
btrfs: discard: Use the correct style for SPDX License Identifier
|
|
Clay reports that OP_STATX fails for a test case with a valid fd
and empty path:
-- Test 0: statx:fd 3: SUCCEED, file mode 100755
-- Test 1: statx:path ./uring_statx: SUCCEED, file mode 100755
-- Test 2: io_uring_statx:fd 3: FAIL, errno 9: Bad file descriptor
-- Test 3: io_uring_statx:path ./uring_statx: SUCCEED, file mode 100755
This is due to statx not grabbing the process file table, hence we can't
lookup the fd in async context. If the fd is valid, ensure that we grab
the file table so we can grab the file from async context.
Cc: stable@vger.kernel.org # v5.6
Reported-by: Clay Harris <bugs@claycon.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
fs_info::journal_info
[BUG]
One run of btrfs/063 triggered the following lockdep warning:
============================================
WARNING: possible recursive locking detected
5.6.0-rc7-custom+ #48 Not tainted
--------------------------------------------
kworker/u24:0/7 is trying to acquire lock:
ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
but task is already holding lock:
ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sb_internal#2);
lock(sb_internal#2);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by kworker/u24:0/7:
#0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
#1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
#2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
#3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]
stack backtrace:
CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
Call Trace:
dump_stack+0xc2/0x11a
__lock_acquire.cold+0xce/0x214
lock_acquire+0xe6/0x210
__sb_start_write+0x14e/0x290
start_transaction+0x66c/0x890 [btrfs]
btrfs_join_transaction+0x1d/0x20 [btrfs]
find_free_extent+0x1504/0x1a50 [btrfs]
btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
btrfs_copy_root+0x213/0x580 [btrfs]
create_reloc_root+0x3bd/0x470 [btrfs]
btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
record_root_in_trans+0x191/0x1d0 [btrfs]
btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
start_transaction+0x16e/0x890 [btrfs]
btrfs_join_transaction+0x1d/0x20 [btrfs]
btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
finish_ordered_fn+0x15/0x20 [btrfs]
btrfs_work_helper+0x116/0x9a0 [btrfs]
process_one_work+0x632/0xb80
worker_thread+0x80/0x690
kthread+0x1a3/0x1f0
ret_from_fork+0x27/0x50
It's pretty hard to reproduce, only one hit so far.
[CAUSE]
This is because we're calling btrfs_join_transaction() without re-using
the current running one:
btrfs_finish_ordered_io()
|- btrfs_join_transaction() <<< Call #1
|- btrfs_record_root_in_trans()
|- btrfs_reserve_extent()
|- btrfs_join_transaction() <<< Call #2
Normally such btrfs_join_transaction() call should re-use the existing
one, without trying to re-start a transaction.
But the problem is, in btrfs_join_transaction() call #1, we call
btrfs_record_root_in_trans() before initializing current::journal_info.
And in btrfs_join_transaction() call #2, we're relying on
current::journal_info to avoid such deadlock.
[FIX]
Call btrfs_record_root_in_trans() after we have initialized
current::journal_info.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we have an inode with a prealloc extent that starts at an offset
lower than the i_size and there is another prealloc extent that starts at
an offset beyond i_size, we can end up losing part of the first prealloc
extent (the part that starts at i_size) and have an implicit hole if we
fsync the file and then have a power failure.
Consider the following example with comments explaining how and why it
happens.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our test file with 2 consecutive prealloc extents, each with a
# size of 128Kb, and covering the range from 0 to 256Kb, with a file
# size of 0.
$ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
$ xfs_io -c "falloc -k 128K 128K" /mnt/foo
# Fsync the file to record both extents in the log tree.
$ xfs_io -c "fsync" /mnt/foo
# Now do a redudant extent allocation for the range from 0 to 64Kb.
# This will merely increase the file size from 0 to 64Kb. Instead we
# could also do a truncate to set the file size to 64Kb.
$ xfs_io -c "falloc 0 64K" /mnt/foo
# Fsync the file, so we update the inode item in the log tree with the
# new file size (64Kb). This also ends up setting the number of bytes
# for the first prealloc extent to 64Kb. This is done by the truncation
# at btrfs_log_prealloc_extents().
# This means that if a power failure happens after this, a write into
# the file range 64Kb to 128Kb will not use the prealloc extent and
# will result in allocation of a new extent.
$ xfs_io -c "fsync" /mnt/foo
# Now set the file size to 256K with a truncate and then fsync the file.
# Since no changes happened to the extents, the fsync only updates the
# i_size in the inode item at the log tree. This results in an implicit
# hole for the file range from 64Kb to 128Kb, something which fsck will
# complain when not using the NO_HOLES feature if we replay the log
# after a power failure.
$ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo
So instead of always truncating the log to the inode's current i_size at
btrfs_log_prealloc_extents(), check first if there's a prealloc extent
that starts at an offset lower than the i_size and with a length that
crosses the i_size - if there is one, just make sure we truncate to a
size that corresponds to the end offset of that prealloc extent, so
that we don't lose the part of that extent that starts at i_size if a
power failure happens.
A test case for fstests follows soon.
Fixes: 31d11b83b96f ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
... to protect the modification of mp->m_count done by it. Most of
the places that modify that thing also have namespace_lock held,
but not all of them can do so, so we really need mount_lock here.
Kudos to Piotr Krysiuk <piotras@gmail.com>, who'd spotted a related
bug in pivot_root(2) (fixed unnoticed in 5.3); search for other
similar turds has caught out this one.
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|