Age | Commit message (Collapse) | Author |
|
The 'follows' should match the seq for the snap context for the given snap
cap, which is the context under which we have been dirtying and writing
data and metadata. The snapshot that _contains_ those updates thus
_follows_ that context's seq #.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: prevent possible memory corruption in cifs_demultiplex_thread
cifs: eliminate some more premature cifsd exits
cifs: prevent cifsd from exiting prematurely
[CIFS] ntlmv2/ntlmssp remove-unused-function CalcNTLMv2_partial_mac_key
cifs: eliminate redundant xdev check in cifs_rename
Revert "[CIFS] Fix ntlmv2 auth with ntlmssp"
Revert "missing changes during ntlmv2/ntlmssp auth and sign"
Revert "Eliminate sparse warning - bad constant expression"
Revert "[CIFS] Eliminate unused variable warning"
|
|
When adding the readdir results to the cache, ceph_set_dentry_offset was
clobbered our just-set offset. This can cause the readdir result offsets
to get out of sync with the server. Add an argument to the helper so
that it does not.
This bug was introduced by 1cd3935bedccf592d44343890251452a6dd74fc4.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
We should not use dotlversion for the dotu inode operations
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
|
|
We should use the cached dentry operation only if caching mode is enabled
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
|
|
NULL fid should be handled in cases where we endup calling v9fs_dir_release()
before even we instantiate the fid in filp.
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
|
|
This was introduced by 7cadb63d58a932041afa3f957d5cbb6ce69dcee5
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
|
|
Four memory leak fixes in the 9P code.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
|
|
The NFSv4 client's callback server calls svc_gss_principal(), which
is defined in the auth_rpcgss.ko
The NFSv4 server has the same dependency, and in addition calls
svcauth_gss_flavor(), gss_mech_get_by_pseudoflavor(),
gss_pseudoflavor_to_service() and gss_mech_put() from the same module.
The module auth_rpcgss itself has no dependencies aside from sunrpc,
so we only need to select RPCSEC_GSS.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Hi,
An NFS client executes a statfs("file", &buff) call.
"file" exists / existed, the client has read / written it,
but it has already closed it.
user_path(pathname, &path) looks up "file" successfully in the
directory-cache and restarts the aging timer of the directory-entry.
Even if "file" has already been removed from the server, because the
lookupcache=positive option I use, keeps the entries valid for a while.
nfs_statfs() returns ESTALE if "file" has already been removed from the
server.
If the user application repeats the statfs("file", &buff) call, we
are stuck: "file" remains young forever in the directory-cache.
Signed-off-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
|
|
Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
|
|
The do_vfs_lock function on fs/nfs/file.c is only called if NLM is
not being used, via the -onolock mount option. Therefore it cannot
really be "out of sync with lock manager" when the local locking
function called returns an error, as there will be no corresponding
call to the NLM. For details, simply check the if/else on do_setlk
and do_unlk on fs/nfs/file.c.
Signed-Off-By: Fabio Olive Leite <fleite@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Cast the value before shifting so that we don't run out of bits with a
32-bit unsigned long. This fixes wrapping of high file offsets into the
low 4GB of a file on disk, and the subsequent data corruption for large
files.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Fix the reconnect encoding to encode the cap record when the MDS does not
have the FLOCK capability (i.e., pre v0.22).
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
A wrong parameter was passed to the kunmap.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
When we release a root dentry, particularly after a splice, the parent
(actually our) inode was evaluating to NULL and was getting dereferenced
by ceph_snap(). This is reproduced by something as simple as
mount -t ceph monhost:/a/b mnt
mount -t ceph monhost:/a mnt2
ls mnt2
A splice_dentry() would kill the old 'b' inode's root dentry, and we'd
crash while releasing it.
Fix by checking for both the ROOT and NULL cases explicitly. We only need
to invalidate the parent dir when we have a correct parent to invalidate.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: log IO completion workqueue is a high priority queue
xfs: prevent reading uninitialized stack memory
|
|
This patch tries to handle the case in which list 'dlm->tracking_list' is
empty, to avoid accessing an invalid pointer. It fixes the following oops:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1287
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
In ocfs2_dx_dir_rebalance(), we need to rejournal_acess the blocks after
calling ocfs2_insert_extent() since growing an extent tree may trigger
ocfs2_extend_trans(), which makes previous journal_access meaningless.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
This patch change mutex_lock to a new subclass and
add a new inode lock subclass for the target inode
which caused this lockdep warning.
=============================================
[ INFO: possible recursive locking detected ]
2.6.35+ #5
---------------------------------------------
reflink/11086 is trying to acquire lock:
(Meta){+++++.}, at: [<ffffffffa06f9d65>] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
but task is already holding lock:
(Meta){+++++.}, at: [<ffffffffa06f9aa0>] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2]
other info that might help us debug this:
6 locks held by reflink/11086:
#0: (&sb->s_type->i_mutex_key#15/1){+.+.+.}, at: [<ffffffff820e09ec>] lookup_create+0x26/0x97
#1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa06f99a0>] ocfs2_reflink_ioctl+0x4d3/0x1229 [ocfs2]
#2: (Meta){+++++.}, at: [<ffffffffa06f9aa0>] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2]
#3: (&oi->ip_xattr_sem){+.+.+.}, at: [<ffffffffa06f9b58>] ocfs2_reflink_ioctl+0x68b/0x1229 [ocfs2]
#4: (&oi->ip_alloc_sem){+.+.+.}, at: [<ffffffffa06f9b67>] ocfs2_reflink_ioctl+0x69a/0x1229 [ocfs2]
#5: (&sb->s_type->i_mutex_key#15/2){+.+...}, at: [<ffffffffa06f9d4f>] ocfs2_reflink_ioctl+0x882/0x1229 [ocfs2]
stack backtrace:
Pid: 11086, comm: reflink Not tainted 2.6.35+ #5
Call Trace:
[<ffffffff82063dd9>] validate_chain+0x56e/0xd68
[<ffffffff82062275>] ? mark_held_locks+0x49/0x69
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82065a81>] lock_acquire+0xc6/0xed
[<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
[<ffffffffa06c9ade>] __ocfs2_cluster_lock+0x975/0xa0d [ocfs2]
[<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
[<ffffffffa06e107b>] ? ocfs2_wait_for_recovery+0x15/0x8a [ocfs2]
[<ffffffffa06cb6ea>] ocfs2_inode_lock_full_nested+0x1ac/0xdc5 [ocfs2]
[<ffffffffa06f9d65>] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
[<ffffffff820623a0>] ? trace_hardirqs_on_caller+0x10b/0x12f
[<ffffffff82060193>] ? debug_mutex_free_waiter+0x4f/0x53
[<ffffffffa06f9d65>] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
[<ffffffffa06ce24a>] ? ocfs2_file_lock_res_init+0x66/0x78 [ocfs2]
[<ffffffff820bb2d2>] ? might_fault+0x40/0x8d
[<ffffffffa06df9f6>] ocfs2_ioctl+0x61a/0x656 [ocfs2]
[<ffffffff820ee5d3>] ? mntput_no_expire+0x1d/0xb0
[<ffffffff820e07b3>] ? path_put+0x2c/0x31
[<ffffffff820e53ac>] vfs_ioctl+0x2a/0x9d
[<ffffffff820e5903>] do_vfs_ioctl+0x45d/0x4ae
[<ffffffff8233a7f6>] ? _raw_spin_unlock+0x26/0x2a
[<ffffffff8200299c>] ? sysret_check+0x27/0x62
[<ffffffff820e59ab>] sys_ioctl+0x57/0x7a
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
As the name shows, we shouldn't have any lock in
ocfs2_xattr_get_nolock. so lift ip_xattr_sem to the caller.
This should be safe for us since the only 2 callers are:
1. ocfs2_xattr_get which will lock the resources.
2. ocfs2_mknod which don't need this locking.
And this also resolves the following lockdep warning.
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.35+ #5
-------------------------------------------------------
reflink/30027 is trying to acquire lock:
(&oi->ip_alloc_sem){+.+.+.}, at: [<ffffffffa0673b67>] ocfs2_reflink_ioctl+0x69a/0x1226 [ocfs2]
but task is already holding lock:
(&oi->ip_xattr_sem){++++..}, at: [<ffffffffa0673b58>] ocfs2_reflink_ioctl+0x68b/0x1226 [ocfs2]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&oi->ip_xattr_sem){++++..}:
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82065a81>] lock_acquire+0xc6/0xed
[<ffffffff82339650>] down_read+0x34/0x47
[<ffffffffa0691cb8>] ocfs2_xattr_get_nolock+0xa0/0x4e6 [ocfs2]
[<ffffffffa069d64f>] ocfs2_get_acl_nolock+0x5c/0x132 [ocfs2]
[<ffffffffa069d9c7>] ocfs2_init_acl+0x60/0x243 [ocfs2]
[<ffffffffa066499d>] ocfs2_mknod+0xae8/0xfea [ocfs2]
[<ffffffffa0665041>] ocfs2_create+0x9d/0x105 [ocfs2]
[<ffffffff820e1c83>] vfs_create+0x9b/0xf4
[<ffffffff820e20bb>] do_last+0x2fd/0x5be
[<ffffffff820e31c0>] do_filp_open+0x1fb/0x572
[<ffffffff820d6cf6>] do_sys_open+0x5a/0xe7
[<ffffffff820d6dac>] sys_open+0x1b/0x1d
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
-> #2 (jbd2_handle){+.+...}:
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82065a81>] lock_acquire+0xc6/0xed
[<ffffffffa0604ff8>] start_this_handle+0x4a3/0x4bc [jbd2]
[<ffffffffa06051d6>] jbd2__journal_start+0xba/0xee [jbd2]
[<ffffffffa0605218>] jbd2_journal_start+0xe/0x10 [jbd2]
[<ffffffffa065ca34>] ocfs2_start_trans+0xb7/0x19b [ocfs2]
[<ffffffffa06645f3>] ocfs2_mknod+0x73e/0xfea [ocfs2]
[<ffffffffa0665041>] ocfs2_create+0x9d/0x105 [ocfs2]
[<ffffffff820e1c83>] vfs_create+0x9b/0xf4
[<ffffffff820e20bb>] do_last+0x2fd/0x5be
[<ffffffff820e31c0>] do_filp_open+0x1fb/0x572
[<ffffffff820d6cf6>] do_sys_open+0x5a/0xe7
[<ffffffff820d6dac>] sys_open+0x1b/0x1d
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
-> #1 (&journal->j_trans_barrier){.+.+..}:
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82064fa9>] lock_release_non_nested+0x1e5/0x24b
[<ffffffff82065999>] lock_release+0x158/0x17a
[<ffffffff823389f6>] __mutex_unlock_slowpath+0xbf/0x11b
[<ffffffff82338a5b>] mutex_unlock+0x9/0xb
[<ffffffffa0679673>] ocfs2_free_ac_resource+0x31/0x67 [ocfs2]
[<ffffffffa067c6bc>] ocfs2_free_alloc_context+0x11/0x1d [ocfs2]
[<ffffffffa0633de0>] ocfs2_write_begin_nolock+0x141e/0x159b [ocfs2]
[<ffffffffa0635523>] ocfs2_write_begin+0x11e/0x1e7 [ocfs2]
[<ffffffff820a1297>] generic_file_buffered_write+0x10c/0x210
[<ffffffffa0653624>] ocfs2_file_aio_write+0x4cc/0x6d3 [ocfs2]
[<ffffffff820d822d>] do_sync_write+0xc2/0x106
[<ffffffff820d897b>] vfs_write+0xae/0x131
[<ffffffff820d8e55>] sys_write+0x47/0x6f
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
-> #0 (&oi->ip_alloc_sem){+.+.+.}:
[<ffffffff82063f92>] validate_chain+0x727/0xd68
[<ffffffff82064d6d>] __lock_acquire+0x79a/0x7f1
[<ffffffff82065a81>] lock_acquire+0xc6/0xed
[<ffffffff82339694>] down_write+0x31/0x52
[<ffffffffa0673b67>] ocfs2_reflink_ioctl+0x69a/0x1226 [ocfs2]
[<ffffffffa06599f6>] ocfs2_ioctl+0x61a/0x656 [ocfs2]
[<ffffffff820e53ac>] vfs_ioctl+0x2a/0x9d
[<ffffffff820e5903>] do_vfs_ioctl+0x45d/0x4ae
[<ffffffff820e59ab>] sys_ioctl+0x57/0x7a
[<ffffffff8200296b>] system_call_fastpath+0x16/0x1b
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Track negative dentries by recording the generation number of the parent
directory in d_fsdata. The generation number for the parent directory is
recorded in the inode_info, which increments every time the lock on the
directory is dropped.
If the generation number of the parent directory and the negative dentry
matches, there is no need to perform the revalidate, else a revalidate
is forced. This improves performance in situations where nodes look for
the same non-existent file multiple times.
Thanks Mark for explaining the DLM sequence.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Durring orphan scan, if we are slot 0, and we are replaying
orphan_dir:0001, the general process is that for every file
in this dir:
1. we will iget orphan_dir:0001, since there is no inode for it.
we will have to create an inode and read it from the disk.
2. do the normal work, such as delete_inode and remove it from
the dir if it is allowed.
3. call iput orphan_dir:0001 when we are done. In this case,
since we have no dcache for this inode, i_count will
reach 0, and VFS will have to call clear_inode and in
ocfs2_clear_inode we will checkpoint the inode which will let
ocfs2_cmt and journald begin to work.
4. We loop back to 1 for the next file.
So you see, actually for every deleted file, we have to read the
orphan dir from the disk and checkpoint the journal. It is very
time consuming and cause a lot of journal checkpoint I/O.
A better solution is that we can have another reference for these
inodes in ocfs2_super. So if there is no other race among
nodes(which will let dlmglue to checkpoint the inode), for step 3,
clear_inode won't be called and for step 1, we may only need to
read the inode for the 1st time. This is a big win for us.
So this patch will try to cache system inodes of other slots so
that we will have one more reference for these inodes and avoid
the extra inode read and journal checkpoint.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
generic_check_addressable() erroneously shifts pages down by a block
factor when it should be shifting up. To prevent overflow, we shift
blocks down to pages.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
The OCFS2 developers have already done all of the hard work to allow
volumes larger than 16 TiB. But there is still a "sanity check" in
fs/ocfs2/super.c that prevents the mounting of such volumes, even when
the cluster size and journal options would allow it.
This patch replaces that sanity check with a more sophisticated one to
mount a huge volume provided that (a) it is addressable by the raw
word/address size of the system (borrowing a test from ext4); (b) the
volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is
set on the journal.
I factored out the sanity check into its own function. I also moved it
from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier,
and the journal will not have been initialized yet.
This patch is one of a pair, and it depends on the other ("JBD2: Allow
feature checks before journal recovery").
I have tested this patch on small volumes, huge volumes, and huge
volumes without 64-bit block support in the journal. All of them appear
to work or to fail gracefully, as appropriate.
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Before we start accessing a huge (> 16 TiB) OCFS2 volume, we need to
confirm that its journal supports 64-bit offsets. In particular, we
need to check the journal's feature bits before recovering the journal.
This is not possible with JBD2 at present, because the journal
superblock (where the feature bits reside) is not loaded from disk until
the journal is recovered.
This patch loads the journal superblock in
jbd2_journal_check_used_features() if it has not already been loaded,
allowing us to check the feature bits before journal recovery.
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
As part of adding support for OCFS2 to mount huge volumes, we need to
check that the sector_t and page cache of the system are capable of
addressing the entire volume.
An identical check already appears in ext3 and ext4. This patch moves
the addressability check into its own function in fs/libfs.c and
modifies ext3 and ext4 to invoke it.
[Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
merge-2
|
|
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
ocfs2_sync_inode() is used only from ocfs2_sync_file(). But all data has
already been written before calling ocfs2_sync_file() and ocfs2 doesn't use
inode's private_list for tracking metadata buffers thus sync_mapping_buffers()
is superfluous as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Thanks for the comments. I have incorportated them all.
CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
Statistics now look like -
ocfs2_write_ctxt: 2144 - 2136 = 8
ocfs2_inode_info: 1960 - 1848 = 112
ocfs2_journal: 168 - 160 = 8
ocfs2_lock_res: 336 - 304 = 32
ocfs2_refcount_tree: 512 - 472 = 40
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
In ocfs2, actually we don't allow any direct write pass i_size,
see the function ocfs2_prepare_inode_for_write. So we don't
need the bogus simple_setsize.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
Now orphan scan worker has no trace log, so it is
very hard to tell whether it is finished or blocked.
So add 2 mlog trace log so that we can tell whether
the current orphan scan worker is blocked or not.
It does help when I analyzed a orphan scan bug.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
The reason why we need this ioctl is to offer the none-privileged
end-user a possibility to get filesys info gathering.
We use OCFS2_IOC_INFO to manipulate the new ioctl, userspace passes a
structure to kernel containing an array of request pointers and request
count, such as,
* From userspace:
struct ocfs2_info_blocksize oib = {
.ib_req = {
.ir_magic = OCFS2_INFO_MAGIC,
.ir_code = OCFS2_INFO_BLOCKSIZE,
...
}
...
}
struct ocfs2_info_clustersize oic = {
...
}
uint64_t reqs[2] = {(unsigned long)&oib,
(unsigned long)&oic};
struct ocfs2_info info = {
.oi_requests = reqs,
.oi_count = 2,
}
ret = ioctl(fd, OCFS2_IOC_INFO, &info);
* In kernel:
Get the request pointers from *info*, then handle each request one bye one.
Idea here is to make the spearated request small enough to guarantee
a better backward&forward compatibility since a small piece of request
would be less likely to be broken if filesys on raw disk get changed.
Currently, the following 7 requests are supported per the requirement from
userspace tool o2info, and I believe it will grow over time:-)
OCFS2_INFO_CLUSTERSIZE
OCFS2_INFO_BLOCKSIZE
OCFS2_INFO_MAXSLOTS
OCFS2_INFO_LABEL
OCFS2_INFO_UUID
OCFS2_INFO_FS_FEATURES
OCFS2_INFO_JOURNAL_SIZE
This ioctl is only specific to OCFS2.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
|
The workqueue implementation in 2.6.36-rcX has changed, resulting
in the workqueues no longer having dedicated threads for work
processing. This has caused severe livelocks under heavy parallel
create workloads because the log IO completions have been getting
held up behind metadata IO completions. Hence log commits would
stall, memory allocation would stall because pages could not be
cleaned, and lock contention on the AIL during inode IO completion
processing was being seen to slow everything down even further.
By making the log Io completion workqueue a high priority workqueue,
they are queued ahead of all data/metadata IO completions and
processed before the data/metadata completions. Hence the log never
gets stalled, and operations needed to clean memory can continue as
quickly as possible. This avoids the livelock conditions and allos
the system to keep running under heavy load as per normal.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
An execve with a very large total of argument/environment strings
can take a really long time in the execve system call. It runs
uninterruptibly to count and copy all the strings. This change
makes it abort the exec quickly if sent a SIGKILL.
Note that this is the conservative change, to interrupt only for
SIGKILL, by using fatal_signal_pending(). It would be perfectly
correct semantics to let any signal interrupt the string-copying in
execve, i.e. use signal_pending() instead of fatal_signal_pending().
We'll save that change for later, since it could have user-visible
consequences, such as having a timer set too quickly make it so that
an execve can never complete, though it always happened to work before.
Signed-off-by: Roland McGrath <roland@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This adds a preemption point during the copying of the argument and
environment strings for execve, in copy_strings(). There is already
a preemption point in the count() loop, so this doesn't add any new
points in the abstract sense.
When the total argument+environment strings are very large, the time
spent copying them can be much more than a normal user time slice.
So this change improves the interactivity of the rest of the system
when one process is doing an execve with very large arguments.
Signed-off-by: Roland McGrath <roland@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
check the size of the argument/environment area on the stack.
When it is unworkably large, shift_arg_pages() hits its BUG_ON.
This is exploitable with a very large RLIMIT_STACK limit, to
create a crash pretty easily.
Check that the initial stack is not too large to make it possible
to map in any executable. We're not checking that the actual
executable (or intepreter, for binfmt_elf) will fit. So those
mappings might clobber part of the initial stack mapping. But
that is just userland lossage that userland made happen, not a
kernel problem.
Signed-off-by: Roland McGrath <roland@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Range check cpu in blk_cpu_to_group
scatterlist: prevent invalid free when alloc fails
writeback: Fix lost wake-up shutting down writeback thread
writeback: do not lose wakeup events when forking bdi threads
cciss: fix reporting of max queue depth since init
block: switch s390 tape_block and mg_disk to elevator_change()
block: add function call to switch the IO scheduler from a driver
fs/bio-integrity.c: return -ENOMEM on kmalloc failure
bio-integrity.c: remove dependency on __GFP_NOFAIL
BLOCK: fix bio.bi_rw handling
block: put dev->kobj in blk_register_queue fail path
cciss: handle allocation failure
cfq-iosched: Documentation help for new tunables
cfq-iosched: blktrace print per slice sector stats
cfq-iosched: Implement tunable group_idle
cfq-iosched: Do group share accounting in IOPS when slice_idle=0
cfq-iosched: Do not idle if slice_idle=0
cciss: disable doorbell reset on reset_devices
blkio: Fix return code for mkdir calls
|
|
The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12
bytes of uninitialized stack memory, because the fsxattr struct
declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero)
the 12-byte fsx_pad member before copying it back to the user. This
patch takes care of it.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
This flag was only set for barrier buffers, which we don't submit
anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
fat already uses synchronous discards, no need to add I/O barriers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
ext4 already uses synchronous discards, no need to add I/O barriers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the
EOPNOTSUPP detection for barriers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Currently JBD2 relies blkdev_issue_flush() draining the queue when ASYNC_COMMIT
feature is set. This property is going away so make JBD2 wait for buffers it
needs on its own before submitting the cache flush.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the
EOPNOTSUPP detection for barriers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.
tj: nilfs is now fixed to wait for discard completion. Updated this
patch accordingly and dropped warning about it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Switch to the WRITE_FLUSH_FUA flag for log writes and remove the EOPNOTSUPP
detection for barriers. Note that reiserfs had a fairly different code
path for barriers before as it wa the only filesystem actually making use
of them. The new code always uses the old non-barrier codepath and just
sets the WRITE_FLUSH_FUA explicitly for the journal commits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|