summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2010-10-21cifs: convert cifs_tcp_ses_lock from a rwlock to a spinlockSuresh Jayaraman
cifs_tcp_ses_lock is a rwlock with protects the cifs_tcp_ses_list, server->smb_ses_list and the ses->tcon_list. It also protects a few ref counters in server, ses and tcon. In most cases the critical section doesn't seem to be large, in a few cases where it is slightly large, there seem to be really no benefit from concurrent access. I briefly considered RCU mechanism but it appears to me that there is no real need. Replace it with a spinlock and get rid of the last rwlock in the cifs code. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-21pipe: fix failure to return error code on ->confirm()Nicolas Kaiser
The arguments were transposed, we want to assign the error code to 'ret', which is being returned. Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-21UBIFS: do not allocate unneeded scan bufferArtem Bityutskiy
In 'ubifs_replay_journal()' we allocate 'sbuf' for scanning the log. However, we already have 'c->sbuf' for these purposes, so do not allocate yet another one. This reduces UBIFS memory consumption while recovering. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-10-21UBIFS: do not forget to cancel timersArtem Bityutskiy
This is a bug-fix: when we unmount, and we are currently in R/O mode because of an error - we do not sync write-buffers, which means we also do not cancel write-buffer timers we may possibly have armed. This patch fixes the issue. The issue can easily be reproduced by enabling UBIFS failure debug mode (echo 4 > /sys/module/ubifs/parameters/debug_tsts) and unmounting as soon as a failure happen. At some point the system oopses because we have an armed hrtimer but UBIFS is unmounted already. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-10-21UBIFS: remove a bit of unneeded codeArtem Bityutskiy
This is a clean-up patch which: 1. Removes explicite 'hrtimer_cancel()' after 'ubifs_wbuf_sync()' in 'ubifs_remount_ro()', because the timers will be canceled by 'ubifs_wbuf_sync()', no need to cancel them for the second time. 2. Remove "if (c->jheads)" check from 'ubifs_put_super()', because at journal heads must always be allocated there, since we checked earlier that we were mounted R/W, and the olny situation when journal heads are not allocated is when mounter or re-mounted R/O. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-10-20ceph: do not carry i_lock for readdir from dcacheSage Weil
We were taking dcache_lock inside of i_lock, which introduces a dependency not found elsewhere in the kernel, complicationg the vfs locking scalability work. Since we don't actually need it here anyway, remove it. We only need i_lock to test for the I_COMPLETE flag, so be careful to do so without dcache_lock held. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20fs/ceph/xattr.c: Use kmemdupJulia Lawall
Convert a sequence of kmalloc and memcpy to use kmemdup. The semantic patch that performs this transformation is: (http://coccinelle.lip6.fr/) // <smpl> @@ expression a,flag,len; expression arg,e1,e2; statement S; @@ a = - \(kmalloc\|kzalloc\)(len,flag) + kmemdup(arg,len,flag) <... when != a if (a == NULL || ...) S ...> - memcpy(a,arg,len+1); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl.Greg Farnum
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: fix debugfs warningsRandy Dunlap
Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning: fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20ceph: switch from BKL to lock_flocks()Sage Weil
Switch from using the BKL explicitly to the new lock_flocks() interface. Eventually this will turn into a spinlock. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: preallocate flock state without locks heldGreg Farnum
When the lock_kernel() turns into lock_flocks() and a spinlock, we won't be able to do allocations with the lock held. Preallocate space without the lock, and retry if the lock state changes out from underneath us. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: use mapping->nrpages to determine if mapping is emptySage Weil
This is simpler and faster. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: only invalidate on check_caps if we actually have pagesSage Weil
The i_rdcache_gen value only implies we MAY have cached pages; actually check the mapping to see if it's worth bothering with an invalidate. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: do not hide .snap in root directorySage Weil
Snaps in the root directory are now supported by the MDS, and harmless on older versions. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: factor out libceph from Ceph file systemYehuda Sadeh
This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph-rbd: osdc support for osd call and rollback operationsYehuda Sadeh
This will be used for rbd snapshots administration. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20ceph: messenger and osdc changes for rbdYehuda Sadeh
Allow the messenger to send/receive data in a bio. This is added so that we wouldn't need to copy the data into pages or some other buffer when doing IO for an rbd block device. We can now have trailing variable sized data for osd ops. Also osd ops encoding is more modular. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: refactor osdc requests creation functionsYehuda Sadeh
The osd requests creation are being decoupled from the vino parameter, allowing clients using the osd to use other arbitrary object names that are not necessarily vino based. Also, calc_raw_layout now takes a snap id. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20ceph: lookup pool in osdmap by nameYehuda Sadeh
Implement a pool lookup by name. This will be used by rbd. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-19NFSv4: Don't call nfs4_reclaim_complete() on receiving NFS4ERR_STALE_CLIENTIDTrond Myklebust
If the server sends us an NFS4ERR_STALE_CLIENTID while the state management thread is busy reclaiming state, we do want to treat all state that wasn't reclaimed before the STALE_CLIENTID as if a network partition occurred (see the edge conditions described in RFC3530 and RFC5661). What we do not want to do is to send an nfs4_reclaim_complete(), since we haven't yet even started reclaiming state after the server rebooted. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-10-19NFSv4: Don't call nfs4_state_mark_reclaim_reboot() from error handlersTrond Myklebust
In the case of a server reboot, the state recovery thread starts by calling nfs4_state_end_reclaim_reboot() in order to avoid edge conditions when the server reboots while the client is in the middle of recovery. However, if the client has already marked the nfs4_state as requiring reboot recovery, then the above behaviour will cause the recovery thread to treat the open as if it was part of such an edge condition: the open will be recovered as if it was part of a lease expiration (and all the locks will be lost). Fix is to remove the call to nfs4_state_mark_reclaim_reboot from nfs4_async_handle_error(), and nfs4_handle_exception(). Instead we leave it to the recovery thread to do this for us. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-10-19NFSv4: Fix open recoveryTrond Myklebust
NFSv4 open recovery is currently broken: since we do not clear the state->flags states before attempting recovery, we end up with the 'can_open_cached()' function triggering. This again leads to no OPEN call being put on the wire. Reported-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-10-19NFS: Don't SIGBUS if nfs_vm_page_mkwrite races with a cache invalidationTrond Myklebust
In the case where we lock the page, and then find out that the page has been thrown out of the page cache, we should just return VM_FAULT_NOPAGE. This is what block_page_mkwrite() does in these situations. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-10-19Merge branch 'devel-stable' into develRussell King
2010-10-19cifs: cancel_delayed_work() + flush_scheduled_work() -> ↵Tejun Heo
cancel_delayed_work_sync() flush_scheduled_work() is going away. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-19Clean up two declarations of blob_lenShirish Pargaonkar
- Eliminate double declaration of variable blob_len - Modify function build_ntlmssp_auth_blob to return error code as well as length of the blob. Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-19fix rawctl compat ioctls breakage on amd64 and itanicAl Viro
RAW_SETBIND and RAW_GETBIND 32bit versions are fscked in interesting ways. 1) fs/compat_ioctl.c has COMPATIBLE_IOCTL(RAW_SETBIND) followed by HANDLE_IOCTL(RAW_SETBIND, raw_ioctl). The latter is ignored. 2) on amd64 (and itanic) the damn thing is broken - we have int + u64 + u64 and layouts on i386 and amd64 are _not_ the same. raw_ioctl() would work there, but it's never called due to (1). As it is, i386 /sbin/raw definitely doesn't work on amd64 boxen. 3) switching to raw_ioctl() as is would *not* work on e.g. sparc64 and ppc64, which would be rather sad, seeing that normal userland there is 32bit. The thing is, slapping __packed on the struct in question does not DTRT - it eliminates *all* padding. The real solution is to use compat_u64. 4) of course, all that stuff has no business being outside of raw.c in the first place - there should be ->compat_ioctl() for /dev/rawctl instead of messing with compat_ioctl.c. [akpm@linux-foundation.org: coding-style fixes] [arnd@arndb.de: port to 2.6.36] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-19Merge branch 'v2.6.36-rc8' into for-2.6.37/barrierJens Axboe
Conflicts: block/blk-core.c drivers/block/loop.c mm/swapfile.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-19block: fix accounting bug on cross partition mergesYasuaki Ishimatsu
/proc/diskstats would display a strange output as follows. $ cat /proc/diskstats |grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. | hd_struct->in_flight --------------------------- sda1 | -1 sda2 | 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. When reloading partition tables, quiesce IO to ensure that no request references to the partition struct exists. When it is safe to free the partition table, the IO for that device is restarted again. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-18xfs: semaphore cleanupThomas Gleixner
Get rid of init_MUTEX[_LOCKED]() and use sema_init() instead. (Ported to current XFS code by <aelder@sgi.com>.) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: Extend project quotas to support 32bit project idsArkadiusz Mi?kiewicz
This patch adds support for 32bit project quota identifiers. On disk format is backward compatible with 16bit projid numbers. projid on disk is now kept in two 16bit values - di_projid_lo (which holds the same position as old 16bit projid value) and new di_projid_hi (takes existing padding) and converts from/to 32bit value on the fly. xfs_admin (for existing fs), mkfs.xfs (for new fs) needs to be used to enable PROJID32BIT support. Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: remove xfs_buf wrappersChristoph Hellwig
Stop having two different names for many buffer functions and use the more descriptive xfs_buf_* names directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: remove xfs_cred.hChristoph Hellwig
We're not actually passing around credentials inside XFS for a while now, so remove all xfs_cred.h with it's cred_t typedef and all instances of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: remove xfs_globals.hChristoph Hellwig
This header only provides one extern that isn't actually declared anywhere, and shadowed by a macro. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: remove xfs_version.hChristoph Hellwig
It used to have a place when it contained an automatically generated CVS version, but these days it's entirely superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: remove xfs_refcache.hChristoph Hellwig
This header has been completely unused for a couple of years. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: fix the xfs_trans_committedChristoph Hellwig
Use the correct prototype for xfs_trans_committed instead of casting it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: remove unused t_callback field in struct xfs_transChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: fix bogus m_maxagi check in xfs_igetChristoph Hellwig
These days inode64 should only control which AGs we allocate new inodes from, while we still try to support reading all existing inodes. To make this actually work the check ontop of xfs_iget needs to be relaxed to allow inodes in all allocation groups instead of just those that we allow allocating inodes from. Note that we can't simply remove the check - it prevents us from accessing invalid data when fed invalid inode numbers from NFS or bulkstat. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: do not use xfs_mod_incore_sb_batch for per-cpu countersChristoph Hellwig
Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb and remove support for per-cpu counters from xfs_mod_incore_sb_batch to simplify it. And added benefit is that we don't have to take m_sb_lock for transactions that only modify per-cpu counters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: do not use xfs_mod_incore_sb for per-cpu countersChristoph Hellwig
Export xfs_icsb_modify_counters and always use it for modifying the per-cpu counters. Remove support for per-cpu counters from xfs_mod_incore_sb to simplify it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: remove XFS_MOUNT_NO_PERCPU_SBChristoph Hellwig
Fail the mount if we can't allocate memory for the per-CPU counters. This is consistent with how we handle everything else in the mount path and makes the superblock counter modification a lot simpler. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: pack xfs_buf structure more tightlyDave Chinner
pahole reports the struct xfs_buf has quite a few holes in it, so packing the structure better will reduce the size of it by 16 bytes. Also, move all the fields used in cache lookups into the first cacheline. Before on x86_64: /* size: 320, cachelines: 5 */ /* sum members: 298, holes: 6, sum holes: 22 */ After on x86_64: /* size: 304, cachelines: 5 */ /* padding: 6 */ /* last cacheline: 48 bytes */ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: convert buffer cache hash to rbtreeDave Chinner
The buffer cache hash is showing typical hash scalability problems. In large scale testing the number of cached items growing far larger than the hash can efficiently handle. Hence we need to move to a self-scaling cache indexing mechanism. I have selected rbtrees for indexing becuse they can have O(log n) search scalability, and insert and remove cost is not excessive, even on large trees. Hence we should be able to cache large numbers of buffers without incurring the excessive cache miss search penalties that the hash is imposing on us. To ensure we still have parallel access to the cache, we need multiple trees. Rather than hashing the buffers by disk address to select a tree, it seems more sensible to separate trees by typical access patterns. Most operations use buffers from within a single AG at a time, so rather than searching lots of different lists, separate the buffer indexes out into per-AG rbtrees. This means that searches during metadata operation have a much higher chance of hitting cache resident nodes, and that updates of the tree are less likely to disturb trees being accessed on other CPUs doing independent operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: serialise inode reclaim within an AGDave Chinner
Memory reclaim via shrinkers has a terrible habit of having N+M concurrent shrinker executions (N = num CPUs, M = num kswapds) all trying to shrink the same cache. When the cache they are all working on is protected by a single spinlock, massive contention an slowdowns occur. Wrap the per-ag inode caches with a reclaim mutex to serialise reclaim access to the AG. This will block concurrent reclaim in each AG but still allow reclaim to scan multiple AGs concurrently. Allow shrinkers to move on to the next AG if it can't get the lock, and if we can't get any AG, then start blocking on locks. To prevent reclaimers from continually scanning the same inodes in each AG, add a cursor that tracks where the last reclaim got up to and start from that point on the next reclaim. This should avoid only ever scanning a small number of inodes at the satart of each AG and not making progress. If we have a non-shrinker based reclaim pass, ignore the cursor and reset it to zero once we are done. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: batch inode reclaim lookupDave Chinner
Batch and optimise the per-ag inode lookup for reclaim to minimise scanning overhead. This involves gang lookups on the radix trees to get multiple inodes during each tree walk, and tighter validation of what inodes can be reclaimed without blocking befor we take any locks. This is based on ideas suggested in a proof-of-concept patch posted by Nick Piggin. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: implement batched inode lookups for AG walkingDave Chinner
With the reclaim code separated from the generic walking code, it is simple to implement batched lookups for the generic walk code. Separate out the inode validation from the execute operations and modify the tree lookups to get a batch of inodes at a time. Reclaim operations will be optimised separately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: split out inode walk inode grabbingDave Chinner
When doing read side inode cache walks, the code to validate and grab an inode is common to all callers. Split it out of the execute callbacks in preparation for batching lookups. Similarly, split out the inode reference dropping from the execute callbacks into the main lookup look to be symmetric with the grab. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: split inode AG walking into separate code for reclaimDave Chinner
The reclaim walk requires different locking and has a slightly different walk algorithm, so separate it out so that it can be optimised separately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18xfs: remove buftarg hash for external devicesDave Chinner
For RT and external log devices, we never use hashed buffers on them now. Remove the buftarg hash tables that are set up for them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>