Age | Commit message (Collapse) | Author |
|
Add an extra delegation state to allow the stateid to remain in the idr
tree until the last reference has been released. This will be necessary
to ensure uniqueness once the client_mutex is removed.
[jlayton: reset the sc_type under the state_lock in unhash_delegation]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
No need to pass the delegation pointer in here as it's only used to get
the nfs4_file pointer.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
state_lock is a heavily contended global lock. We don't want to grab
that while simultaneously holding the inode->i_lock.
Add a new per-nfs4_file lock that we can use to protect the
per-nfs4_file delegation list. Hold that while walking the list in the
break_deleg callback and queue the workqueue job for each one.
The workqueue job can then take the state_lock and do the list
manipulations without the i_lock being held prior to starting the
rpc call.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
It's just an obfuscated INIT_WORK call. Just make the work_func_t a
non-static symbol and use a normal INIT_WORK call.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
It is currently not possible for various wait_on_bit functions
to implement a timeout.
While the "action" function that is called to do the waiting
could certainly use schedule_timeout(), there is no way to carry
forward the remaining timeout after a false wake-up.
As false-wakeups a clearly possible at least due to possible
hash collisions in bit_waitqueue(), this is a real problem.
The 'action' function is currently passed a pointer to the word
containing the bit being waited on. No current action functions
use this pointer. So changing it to something else will be a
little noisy but will have no immediate effect.
This patch changes the 'action' function to take a pointer to
the "struct wait_bit_key", which contains a pointer to the word
containing the bit so nothing is really lost.
It also adds a 'private' field to "struct wait_bit_key", which
is initialized to zero.
An action function can now implement a timeout with something
like
static int timed_out_waiter(struct wait_bit_key *key)
{
unsigned long waited;
if (key->private == 0) {
key->private = jiffies;
if (key->private == 0)
key->private -= 1;
}
waited = jiffies - key->private;
if (waited > 10 * HZ)
return -EAGAIN;
schedule_timeout(waited - 10 * HZ);
return 0;
}
If any other need for context in a waiter were found it would be
easy to use ->private for some other purpose, or even extend
"struct wait_bit_key".
My particular need is to support timeouts in nfs_release_page()
to avoid deadlocks with loopback mounted NFS.
While wait_on_bit_timeout() would be a cleaner interface, it
will not meet my need. I need the timeout to be sensitive to
the state of the connection with the server, which could change.
So I need to use an 'action' interface.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota fix from Jan Kara:
"Fix locking of dquot shrinker"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: missing lock in dqcache_shrink_scan()
|
|
In __set_test_and_free we will check whether all segment are free in one section
When free one segment, in order to set section to free status.
But the searching region of segmap is from start segno to last segno of f2fs,
it's not necessary. So let's just only check all segment bitmap of target
section.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Valid data within i_size in page cache will be copied to ICB cache when we
writeback the page by invoking udf_adinicb_writepage, so the copy in
udf_adinicb_write_end is redundant.
After we remove the copy, it's better to use simple_write_end directly in
udf_adinicb_aops instead of udf_adinicb_write_end.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
This patch cleans up udf_translate_to_linux() a bit by using globally defined
macros instead of custom code.
We can use sprintf(buf, "%04X", ...) there as well, but this one faster.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
type and id were removed and qid added to quota_send_warning in commit
431f19744d15
("userns: Convert quota netlink aka quota_send_warning")
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Fix checkpatch warning
WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Drop cast on the result of kmem_cache_alloc.
The semantic patch that makes this change is as follows:
// <smpl>
@@
type T;
@@
- (T *)
(\(kmalloc\|kzalloc\|kcalloc\|kmem_cache_alloc\|kmem_cache_zalloc\|
kmem_cache_alloc_node\|kmalloc_node\|kzalloc_node\)(...))
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Remove dqptr_sem to make quota code scalable: Remove the dqptr_sem,
accessing inode->i_dquot now protected by dquot_srcu, and changing
inode->i_dquot is now serialized by dq_data_lock.
Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Simplify the remove_inode_dquot_ref() to make it more obvious
that now we keep one reference for each dquot from inodes.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Avoid unnecessary dqget()/dqput() calls in __dquot_initialize(),
that will introduce global lock contention otherwise.
Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
dqptr_sem will go away. Protect Q_GETFMT quotactl by
dqonoff_mutex instead. This is also enough to make sure
quota info will not go away while we are looking at it.
Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Commit 1ab6c4997e04 (fs: convert fs shrinkers to new scan/count API)
accidentally removed locking from quota shrinker. Fix it -
dqcache_shrink_scan() should use dq_list_lock to protect the
scan on free_dquots list.
CC: stable@vger.kernel.org
Fixes: 1ab6c4997e04a00c50c6d786c2f046adc0d1f5de
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"This contains miscellaneous fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: replace count*size kzalloc by kcalloc
fuse: release temporary page if fuse_writepage_locked() failed
fuse: restructure ->rename2()
fuse: avoid scheduling while atomic
fuse: handle large user and group ID
fuse: inode: drop cast
fuse: ignore entry-timeout on LOOKUP_REVAL
fuse: timeout comparison fix
|
|
Now ext4_has_inline_data() is used in wide spread codepaths. So we need
to make it as a inline function to avoid burning some CPU cycles.
Change in text size:
text data bss dec hex filename
before: 326110 19258 5528 350896 55ab0 fs/ext4/ext4.o
after: 326227 19258 5528 351013 55b25 fs/ext4/ext4.o
I use the following script to measure the CPU usage.
#!/bin/bash
shm_base='/dev/shm'
img=${shm_base}/ext4-img
mnt=/mnt/loop
e2fsprgs_base=$HOME/e2fsprogs
mkfs=${e2fsprgs_base}/misc/mke2fs
fsck=${e2fsprgs_base}/e2fsck/e2fsck
sudo umount $mnt
dd if=/dev/zero of=$img bs=4k count=3145728
${mkfs} -t ext4 -O inline_data -F $img
sudo mount -t ext4 -o loop $img $mnt
# start testing...
testdir="${mnt}/testdir"
mkdir $testdir
cd $testdir
echo "start testing..."
for ((cnt=0;cnt<100;cnt++)); do
for ((i=0;i<5;i++)); do
for ((j=0;j<5;j++)); do
for ((k=0;k<5;k++)); do
for ((l=0;l<5;l++)); do
mkdir -p $i/$j/$k/$l
echo "$i-$j-$k-$l" > $i/$j/$k/$l/testfile
done
done
done
done
ls -R $testdir > /dev/null
rm -rf $testdir/*
done
The result of `perf top -G -U` is as below.
vanilla:
13.92% [ext4] [k] ext4_do_update_inode
9.36% [ext4] [k] __ext4_get_inode_loc
4.07% [ext4] [k] ftrace_define_fields_ext4_writepages
3.83% [ext4] [k] __ext4_handle_dirty_metadata
3.42% [ext4] [k] ext4_get_inode_flags
2.71% [ext4] [k] ext4_mark_iloc_dirty
2.46% [ext4] [k] ftrace_define_fields_ext4_direct_IO_enter
2.26% [ext4] [k] ext4_get_inode_loc
2.22% [ext4] [k] ext4_has_inline_data
[...]
After applied the patch, we don't see ext4_has_inline_data() because it
has been inlined and perf couldn't sample it. Although it doesn't mean
that the CPU cycles can be saved but at least the overhead of function
calls can be eliminated. So IMHO we'd better inline this function.
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
There is no kind of file which does not supply a page reading function.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently punch hole code on files with direct/indirect mapping has some
problems which may lead to a data loss. For example (from Jan Kara):
fallocate -n -p 10240000 4096
will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096.
Also the code is a bit weird and it's not using infrastructure provided
by indirect.c, but rather creating it's own way.
This patch fixes the issues as well as making the operation to run 4
times faster from my testing (punching out 60GB file). It uses similar
approach used in ext4_ind_truncate() which takes advantage of
ext4_free_branches() function.
Also rename the ext4_free_hole_blocks() to something more sensible, like
the equivalent we have for extent mapped files. Call it
ext4_ind_remove_space().
This has been tested mostly with fsx and some xfstests which are testing
punch hole but does not require unwritten extents which are not
supported with direct/indirect mapping. Not problems showed up even with
1024k block size.
CC: stable@vger.kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Commit 27dd43854227b ("ext4: introduce reserved space") reserves 2% of
the file system space to make sure metadata allocations will always
succeed. Given that, tracking the reservation of metadata blocks is
no longer necessary.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for
ext4 developers only. (You have to modify fs/ext4/ext4.h to enable
it.)
Rearrange how we initialize data structures to avoid calling
ext4_count_free_clusters() until the multiblock allocator has been
initialized.
This also allows us to only call ext4_count_free_clusters() once, and
simplifies the code somewhat.
(Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a
!CONFIG_SMP compile breakage in the original patch.)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
|
|
Create log attributes to export the current runtime state of the log to
sysfs. Note that the filesystem should be frozen for consistency across
attributes.
The following per-mount attributes are created: log_head_lsn,
log_tail_lsn, reserve_grant_head and write_grant_head. These represent
the physical log head, tail and reserve and write grant heads
respectively. Attribute values are exported in the following format:
"cycle:[block,byte]"
... where cycle represents the log cycle and [block,bytes] represents
either the basic block or byte offset of the log, depending on the
attribute. Log sequence number (LSN) values are encoded in basic blocks
and grant heads are encoded in bytes. All values are in decimal format.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Embed a kobject into the xfs log data structure (xlog). This creates a
'log' subdirectory for every XFS mount instance in sysfs. The lifecycle
of the log kobject is tied to the lifecycle of the log.
Also define a set of generic attribute handlers associated with the log
kobject in preparation for the addition of attributes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Embed a base kobject into xfs_mount. This creates a kobject associated
with each XFS mount and a subdirectory in sysfs with the name of the
filesystem. The subdirectory lifecycle matches that of the mount. Also
add the new xfs_sysfs.[c,h] source files with some XFS sysfs
infrastructure to facilitate attribute creation.
Note that there are currently no attributes exported as part of the
xfs_mount kobject. It exists solely to serve as a per-mount container
for child objects.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Create a sysfs kset to contain all sub-objects associated with the XFS
module. The kset is created and removed on module initialization and
removal respectively. The kset uses fs_obj as a parent. This leads to
the creation of a /sys/fs/xfs directory when the kset exists.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_mountfs() has a couple failure conditions that do not jump to the
correct labels. Specifically:
- xfs_initialize_perag_data() failure does not deallocate the log even
though it occurs after log initialization
- xfs_mount_reset_sbqflags() failure returns the error directly rather
than jump to the error sequence
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
|
|
When quota is on, it is expected that unused quota inodes have a
value of NULLFSINO. The changes to support a separate project quota
in 3.12 broken this rule for non-project quota inode enabled
filesystem, as the code now refuses to write the group quota inode
if neither group or project quotas are enabled. This regression was
introduced by commit d892d58 ("xfs: Start using pquotaino from the
superblock").
In this case, we should be writing NULLFSINO rather than nothing to
ensure that we leave the group quota inode in a valid state while
quotas are enabled.
Failure to do so doesn't cause a current kernel to break - the
separate project quota inodes introduced translation code to always
treat a zero inode as NULLFSINO. This was introduced by commit
0102629 ("xfs: Initialize all quota inodes to be NULLFSINO") with is
also in 3.12 but older kernels do not do this and hence taking a
filesystem back to an older kernel can result in quotas failing
initialisation at mount time. When that happens, we see this in
dmesg:
[ 1649.215390] XFS (sdb): Mounting Filesystem
[ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
[ 1649.316902] XFS (sdb): Ending clean mount
By ensuring that we write NULLFSINO to quota inodes that aren't
active, we avoid this problem. We have to be really careful when
determining if the quota inodes are active or not, because we don't
want to write a NULLFSINO if the quota inodes are active and we
simply aren't updating them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The allocation stack switch at xfs_bmapi_allocate() has served it's
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.
Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.
This is important because we still have the issue of older kernels
having smaller stacks and that they are still supported and are
demonstrating a wide range of different stack overflows. Red Hat
has several open bugs for allocation based stack overflows from
directory modifications and direct IO block allocation and these
problems still need to be solved. If we can solve them upstream,
then distro's won't need to bake their own unique solutions.
To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.
A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.
Finally, we have to move the kswapd state to the allocation workqueue
work when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.
Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performacne overhead caused by switching stacks. The worse case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.
The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:
37) 1768 64 kmem_cache_alloc+0x13b/0x170
about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.
The amount of stack being used is much less than I've previously be
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.
Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
This reverts commit 1f6d64829db78a7e1d63e15c9f48f0a5d2b5a679.
This commit resulted in regressions in performance in low
memory situations where kswapd was doing writeback of delayed
allocation blocks. It resulted in significant parallelism of the
kswapd work and with the special kswapd flags meant that hundreds of
active allocation could dip into kswapd specific memory reserves and
avoid being throttled. This cause a large amount of performance
variation, as well as random OOM-killer invocations that didn't
previously exist.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
|
|
As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to
have put_reqs_available() called from irq context. While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU. This
lead to aio_complete() corrupting the available io requests count when run
under a heavy O_DIRECT workloads as reported by Robert Elliott. Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.
Many thanks to Robert and folks for testing and tracking this down.
Reported-by: Robert Elliot <Elliott@hp.com>
Tested-by: Robert Elliot <Elliott@hp.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kenel.org
|
|
kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
|
tmp_page to be freed if fuse_write_file_get() returns NULL.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
|
this makes __choose_mds() choose mds according caps
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"More bug fixes for ext4 -- most importantly, a fix for a bug
introduced in 3.15 that can end up triggering a file system corruption
error after a journal replay.
It shouldn't lead to any actual data corruption, but it is scary and
can force file systems to be remounted read-only, etc"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix potential null pointer dereference in ext4_free_inode
ext4: fix a potential deadlock in __ext4_es_shrink()
ext4: revert commit which was causing fs corruption after journal replays
ext4: disable synchronous transaction batching if max_batch_time==0
ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()
ext4: clarify error count warning messages
ext4: fix unjournalled bg descriptor while initializing inode bitmap
|
|
* bugfixes:
NFS: Don't reset pg_moreio in __nfs_pageio_add_request
NFS: Remove 2 unused variables
nfs: handle multiple reqs in nfs_wb_page_cancel
nfs: handle multiple reqs in nfs_page_async_flush
nfs: change find_request to find_head_request
nfs: nfs_page should take a ref on the head req
nfs: mark nfs_page reqs with flag for extra ref
nfs: only show Posix ACLs in listxattr if actually present
Conflicts:
fs/nfs/write.c
|
|
Once we've started sending unstable NFS writes, we do not want to
clear pg_moreio, or we may end up sending the very last request as
a stable write if the commit lists are still empty.
Do, however, reset pg_moreio in the case where we end up having to
recoalesce the write if an attempt to use pNFS failed.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Use macro definition
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
This patch does away with the cast on void * as it is unnecessary.
The following Coccinelle semantic patch was used for making the change:
@r@
expression x;
void* e;
type T;
identifier f;
@@
(
*((T *)e)
|
((T *)x)[...]
|
((T *)x)->f
|
- (T *)
e
)
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Fix checkpatch warnings:
"WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable"
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
The current CB_COMPOUND handling code tries to compare the principal
name of the request with the cl_hostname in the client. This is not
guaranteed to ever work, particularly if the client happened to mount
a CNAME of the server or a non-fqdn.
Fix this by instead comparing the cr_principal string with the acceptor
name that we get from gssd. In the event that gssd didn't send one
down (i.e. it was too old), then we fall back to trying to use the
cl_hostname as we do today.
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Nothing checks its return value.
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
We got a report of the following warning in Fedora:
BUG: sleeping function called from invalid context at mm/slub.c:969
in_atomic(): 1, irqs_disabled(): 0, pid: 533, name: bash
3 locks held by bash/533:
#0: (&sp->so_delegreturn_mutex){+.+...}, at: [<ffffffffa033da62>] nfs4_proc_lock+0x262/0x910 [nfsv4]
#1: (&nfsi->rwsem){.+.+.+}, at: [<ffffffffa033da6a>] nfs4_proc_lock+0x26a/0x910 [nfsv4]
#2: (&sb->s_type->i_lock_key#23){+.+...}, at: [<ffffffff812998dc>] flock_lock_file_wait+0x8c/0x3a0
CPU: 0 PID: 533 Comm: bash Not tainted 3.15.0-0.rc1.git1.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000000 00000000d664ff3c ffff880078b69a70 ffffffff817e82e0
0000000000000000 ffff880078b69a98 ffffffff810cf1a4 0000000000000050
0000000000000050 ffff88007cc01a00 ffff880078b69ad8 ffffffff8121449e
Call Trace:
[<ffffffff817e82e0>] dump_stack+0x4d/0x66
[<ffffffff810cf1a4>] __might_sleep+0x184/0x240
[<ffffffff8121449e>] kmem_cache_alloc_trace+0x4e/0x330
[<ffffffffa0331124>] ? nfs4_release_lockowner+0x74/0x110 [nfsv4]
[<ffffffffa0331124>] nfs4_release_lockowner+0x74/0x110 [nfsv4]
[<ffffffffa0352340>] nfs4_put_lock_state+0x90/0xb0 [nfsv4]
[<ffffffffa0352375>] nfs4_fl_release_lock+0x15/0x20 [nfsv4]
[<ffffffff81297515>] locks_free_lock+0x45/0x90
[<ffffffff8129996c>] flock_lock_file_wait+0x11c/0x3a0
[<ffffffffa033da6a>] ? nfs4_proc_lock+0x26a/0x910 [nfsv4]
[<ffffffffa033301e>] do_vfs_lock+0x1e/0x30 [nfsv4]
[<ffffffffa033da79>] nfs4_proc_lock+0x279/0x910 [nfsv4]
[<ffffffff810dbb26>] ? local_clock+0x16/0x30
[<ffffffff810f5a3f>] ? lock_release_holdtime.part.28+0xf/0x200
[<ffffffffa02f820c>] do_unlk+0x8c/0xc0 [nfs]
[<ffffffffa02f85c5>] nfs_flock+0xa5/0xf0 [nfs]
[<ffffffff8129a6f6>] locks_remove_file+0xb6/0x1e0
[<ffffffff812159d8>] ? kfree+0xd8/0x2d0
[<ffffffff8123bc63>] __fput+0xd3/0x210
[<ffffffff8123bdee>] ____fput+0xe/0x10
[<ffffffff810bfb6d>] task_work_run+0xcd/0xf0
[<ffffffff81019cd1>] do_notify_resume+0x61/0x90
[<ffffffff817fbea2>] int_signal+0x12/0x17
The problem is that NFSv4 is trying to do an allocation from
fl_release_private (in order to send a RELEASE_LOCKOWNER call). That
function can be called while holding the inode->i_lock, and it's
currently set up to do __GFP_WAIT allocations. v4.1 code has a
similar problem.
This patch adds a work_struct to the nfs4_lock_state and has the code
queue the free_lock_state operation to nfsiod.
Reported-by: Josh Stone <jistone@redhat.com>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
Do the following set of ops with a file on a NFSv4 mount:
exec 3>>/file/on/nfsv4
flock -x 3
exec 3>&-
You'll see the LOCK request go across the wire, but no LOCKU when the
file is closed.
What happens is that the fd is passed across a fork, and the final close
is done in a different process than the opener. That makes
__nfs4_find_lock_state miss finding the correct lock state because it
uses the fl_pid as a search key. A new one is created, and the locking
code treats it as a delegation stateid (because NFS_LOCK_INITIALIZED
isn't set).
The root cause of this breakage seems to be commit 77041ed9b49a9e
(NFSv4: Ensure the lockowners are labelled using the fl_owner and/or
fl_pid).
That changed it so that flock lockowners are allocated based on the
fl_pid. I think this is incorrect. flock locks should be "owned" by the
struct file, and that is already accounted for in the fl_owner field of
the lock request when it comes through nfs_flock.
This patch basically reverts the above commit and with it, a LOCKU is
sent in the above reproducer.
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
If file is not opened by anyone, we do layout return on close
in delegation return.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|