Age | Commit message (Collapse) | Author |
|
Introduce a new transaction space reservation XFS_SB_LOG_RES() for
those transactions that need to modify the superblock on disk.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Convert the calculation for end of quotaoff log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Convert the calculation of quota off transaction log space reservation
from runtime to mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
The disk quota allocation log space reservation is calcuated at runtime,
this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
For adjusting quota limits transactions, we calculate out the log space
reservation at runtime, this patch does it at mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
For the transaction that write the incore superblock changes of quota flags
to disk, it would reserve the same log space to clear/reset quota flags
transaction, hence we can use XFS_TRANS_SBCHANGE_LOG_RES() for it as well.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
The transaction log space for clearing/reseting the quota flags
is calculated out at runtime, this patch can figure it out at
mount time.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Refining the existing reservations with xfs_calc_buf_res() in xfs_trans.c
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
This patch allocates a block reservation structure before growing
or shrinking a file. Without this structure, the grow or shink code
can reference the bad pointer.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
The intent here is to split the processing of the glock lru
list into two parts, so that the selection of glocks and the
disposal are separate functions. The plan is then, that further
updates can then be made to these functions in the future
to improve the selection of glocks and also the efficiency of
glock disposal.
The new feature which this patch brings is sorting the
glocks to be disposed of into glock number (and thus also
disk block number) order. Not all glocks will need i/o in
order to dispose of them, but some will, and at least we'll
generate mostly disk block order i/o now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
Add a new helper xfs_calc_buf_res() to calcuate out the transaction space
reservations per item. xfs_buf_log_overhead() is used to figure out the
extra space for struct xfs_buf_log_format that gets written into the log
for every buffer as well as a log opheader, i.e. struct xlog_op_header.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.
It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.
The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.
This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself. All the extra contention just slows
down the operations and doesn't get things done any faster.
This commit changes things to use a wait queue instead. As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.
These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.
This adds a plugging callback to gather those smaller writes full stripe
writes.
Example on flash:
fio job to do 64K writes in batches of 3 (which makes a full stripe):
With plugging: 450MB/s
Without plugging: 220MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk. Pages are cached when:
* They are read in during a read/modify/write cycle
* They are written during a read/modify/write cycle
* They are involved in a parity rebuild
Pages are not cached if we're doing a full stripe write. We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.
This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.
The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.
Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct
Without the stripe cache -- 2.1MB/s
With the stripe cache 21MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.
Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.
But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.
Scrub is unable to repair crc errors on raid5/6 chunks.
Discard does not work on raid5/6 (yet)
The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This reverts commit 324d003b0cd82151adbaecefef57b73f7959a469.
The deadlock turned out to be caused by a workqueue limitation that has
now been worked around in the RPC code (see comment in rpc_free_task).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Pull NFS client bugfixes from Trond Myklebust:
- Error reporting in nfs_xdev_mount incorrectly maps all errors to
ENOMEM
- Fix an NFSv4 refcounting issue
- Fix a mount failure when the server reboots during NFSv4 trunking
discovery
- NFSv4.1 mounts may need to run the lease recovery thread.
- Don't silently fail setattr() requests on mountpoints
- Fix a SUNRPC socket/transport livelock and priority queue issue
- We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session.
* tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session
SUNRPC: When changing the queue priority, ensure that we change the owner
NFS: Don't silently fail setattr() requests on mountpoints
NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery
NFSv4: Fix NFSv4 trunking discovery
NFSv4: Fix NFSv4 reference counting for trunked sessions
NFS: Fix error reporting in nfs_xdev_mount
|
|
Use the same adaptive readdirplus mechanism as NFS:
http://permalink.gmane.org/gmane.linux.nfs/49299
If the user space implementation wants to disable readdirplus
temporarily, it could just return ENOTSUPP. Then kernel will
recall it with readdir.
Signed-off-by: Feng Shuo <steve.shuo.feng@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
|
Commit c69e8d9c0 added rcu lock to fuse/dir.c It was assuming
that 'task' is some other process but in fact this parameter always
equals to 'current'. Inline this parameter to make it more readable
and remove RCU lock as it is not needed when access current process
credentials.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
|
NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It
usually means that the server is busy handling an unfinished RPC
request. Just sleep for a second and then retry.
We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return
value. If the NFS server has outstanding callbacks, we just want to
similarly sleep & retry.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
|
|
Ensure that any setattr and getattr requests for junctions and/or
mountpoints are sent to the server. Ever since commit
0ec26fd0698 (vfs: automount should ignore LOOKUP_FOLLOW), we have
silently dropped any setattr requests to a server-side mountpoint.
For referrals, we have silently dropped both getattr and setattr
requests.
This patch restores the original behaviour for setattr on mountpoints,
and tries to do the same for referrals, provided that we have a
filehandle...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
|
|
|
|
Don't send an extra wakeup to kjournald in the case where we
already have the proper target in j_commit_request, i.e. that
transaction has already been requested for commit.
commit deeeaf13 "jbd2: fix fsync() tid wraparound bug" changed
the logic leading to a wakeup, but it caused some extra wakeups
which were found to lead to a measurable performance regression.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[tytso@mit.edu: reworked check to make it clearer]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: stable@vger.kernel.org
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Pull xfs bugfixes from Ben Myers:
"Here are fixes for returning EFSCORRUPTED on probe of a non-xfs
filesystem, the stack switch in xfs_bmapi_allocate, a crash in
_xfs_buf_find, speculative preallocation as the filesystem nears
ENOSPC, an unmount hang, a race with AIO, and a regression with
xfs_fsr:
- fix return value when filesystem probe finds no XFS magic, a
regression introduced in 9802182.
- fix stack switch in __xfs_bmapi_allocate by moving the check for
stack switch up into xfs_bmapi_write.
- fix oops in _xfs_buf_find by validating that the requested block is
within the filesystem bounds.
- limit speculative preallocation near ENOSPC.
- fix an unmount hang in xfs_wait_buftarg by freeing the
xfs_buf_log_item in xfs_buf_item_unlock.
- fix a possible use after free with AIO.
- fix xfs_swap_extents after removal of xfs_flushinval_pages, a
regression introduced in commit fb59581404a."
* tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs:
xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
xfs: Fix possible use-after-free with AIO
xfs: fix shutdown hang on invalid inode during create
xfs: limit speculative prealloc near ENOSPC thresholds
xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
xfs: pull up stack_switch check into xfs_bmapi_write
xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
|
|
In func svc_export_parse, the uuid which used kmemdup to alloc will be
changed in func export_update.So the later kfree don't free this memory.
And it can't be free in func svc_export_parse because other place still
used.So put this operation in func svc_export_put.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Instead of using a list of buffers to write ahead of the journal
flush, this now uses a list of inodes and calls ->writepages
via filemap_fdatawrite() in order to achieve the same thing. For
most use cases this results in a shorter ordered write list,
as well as much larger i/os being issued.
The ordered write list is sorted by inode number before writing
in order to retain the disk block ordering between inodes as
per the previous code.
The previous ordered write code used to conflict in its assumptions
about how to write out the disk blocks with mpage_writepages()
so that with this updated version we can also use mpage_writepages()
for GFS2's ordered write, writepages implementation. So we will
also send larger i/os from writeback too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
The freeze code has not been looked at a lot recently. Upstream has
moved on, and this is an attempt to catch us back up again. There
is a vfs level interface for the freeze code which can be called
from our (obsolete, but kept for backward compatibility purposes)
sysfs freeze interface. This means freezing this way vs. doing it
from the ioctl should now work in identical fashion.
As a result of this, the freeze function is only called once
and we can drop our own special purpose code for counting the
number of freezes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
The locking in gfs2_attach_bufdata() was type specific (data/meta)
which made the function rather confusing. This patch moves the core
of gfs2_attach_bufdata() into trans.c renaming it gfs2_alloc_bufdata()
and moving the locking into gfs2_trans_add_data()/gfs2_trans_add_meta()
As a result all of the locking related to adding data and metadata to
the journal is now in these two functions. This should help to clarify
what is going on, and give us some opportunities to simplify in
some cases.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This patch copies the body of gfs2_trans_add_bh into the two newly
added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can
then move the .lo_add functions from lops.c into trans.c and call
them directly.
As a result of this, we no longer need to use the .lo_add functions
at all, so that is removed from the log operations structure.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
There is little common content in gfs2_trans_add_bh() between the data
and meta classes by the time that the functions which it calls are
taken into account. The intent here is to split this into two
separate functions. Stage one is to introduce gfs2_trans_add_data()
and gfs2_trans_add_meta() and update the callers accordingly.
Later patches will then pull in the content of gfs2_trans_add_bh()
and its dependent functions in order to clean up the code in this
area.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This moves the lo_add function for revokes into trans.c, removing
a function call and making the code easier to read.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
This breaks out the LRU scanning function from the shrinker in
preparation for adding other callers to the LRU scanner.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
Conflicts:
drivers/devfreq/exynos4_bus.c
Sync with Linus' tree to be able to apply patches that are
against newer code (mvneta).
|
|
brelse() and ext4_journal_force_commit() are both inlined and able
to handle NULL.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
After commit 978fef9 (create __ext4_insert_dentry for dir entry
insertion), 'reclen' is not used anymore.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Commit b0336e8d (ext4: calculate and verify checksums of directory
leaf blocks) and commit dbe89444 (ext4: Calculate and verify checksums
for htree nodes) forget to release buffer when checksum failed, at
some places.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
In two places we call WARN_ON() before we print out the debug message,
however we agreed that the WARN_ON() is unnecessary at those places so
remove them.
Also use ext4_warning() instead of ext4_msg() and printk().
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Remove unused variable flags from dump_completed_IO(). The code is
only exercised when EXT4FS_DEBUG is defined.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
|
|
So far ext4_writepage() skipped writing pages that had any delayed or
unwritten buffers attached. When blocksize < pagesize this breaks
data=ordered mode guarantees as we can have a page with one freshly
allocated buffer whose allocation is part of the committing
transaction and another buffer in the page which is delayed or
unwritten. So fix this problem by calling ext4_bio_writepage()
anyway. It will submit mapped buffers and leave others alone.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
So far ext4_bio_writepage() unconditionally cleared dirty bit on all
buffers underlying the page. That implicitely assumes we can write all
buffers. So far that is true because callers call into
ext4_bio_writepage() make sure all buffers in the page are mapped but:
a) it's a data corruption bug waiting to happen
b) in data=ordered mode when blocksize < pagesize we do need to write
pages that may have only some of dirty buffers mapped.
So change ext4_bio_writepage() to skip buffers that cannot be written without
clearing their dirty bit.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
This is always called with a valid "sg" pointer. My static checker
complains because the call to sg_init_table() dereferences "sg" before
we reach the checks.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
|
|
Commit fb59581404ab7ec5075299065c22cb211a9262a9 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and truncate_pagecache_range() directly.
But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'
Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Commit fb59581404ab7ec5075299065c22cb211a9262a9 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and truncate_pagecache_range() directly.
But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'
Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|
|
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
|