path: root/fs
2014-09-17  btrfs: remove obsolete comment in btrfs_clean_one_deleted_snapshot  (David Sterba)
The comment applied when there was a BUG_ON. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17  nfsd4: clarify how grace period ends  (J. Bruce Fields)
The grace period is ended in two steps--first userland is notified that the grace period is now long enough that any clients who have not yet reclaimed can be safely forgotten, then we flip the switch that forbids reclaims and allows new opens. I had to think a bit to convince myself that the ordering was right here. Document it. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-17  nfsd4: stop grace_time update at end of grace period  (J. Bruce Fields)
The attempt to automatically set a new grace period time at the end of the grace period isn't really helpful. We'll probably shut down and reboot before we actually make use of the new grace period time anyway. So may as well leave it up to the init system to get this right. This just confuses people when they see /proc/fs/nfsd/nfsv4gracetime change from what they set it to. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-17  nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients  (Jeff Layton)
In the case of v4.0 clients, we may call into the "create" client tracking operation multiple times (once for each openowner). Upcalling for each one of those is wasteful and slow however. We can skip doing further "create" operations after the first one if we know that one has already been done. v4.1+ clients generally only call into this function once (on RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the STABLE bit is set. Doing so would make it impossible for nfsdcltrack to lift the grace period early since the timestamp has a different meaning in the case where the client is expected to issue a RECLAIM_COMPLETE.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17  nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls  (Jeff Layton)
The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag, which basically results in an upcall every time we call into the client tracking ops. Change it to set this bit on a successful "check" or "create" request, and clear it on a "remove" request. Also, check to see if that bit is set before upcalling on a "check" or "remove" request, and skip upcalling appropriately, depending on its state. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
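A minimal sketch of the flag pattern described above, assuming the usual bit-ops style on cl_flags; NFSD4_CLIENT_STABLE and cl_flags come from the commit text, while cltrack_upcall() is a hypothetical stand-in for the real upcall helpers:

    /* skip the "create" upcall once the client is known to the tracker */
    static void cltrack_create_sketch(struct nfs4_client *clp)
    {
        if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
            return;                             /* already recorded, no upcall */
        if (cltrack_upcall(clp, "create") == 0)     /* hypothetical helper */
            set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
    }

    /* only clients that were recorded need a "remove" upcall; clear the bit */
    static void cltrack_remove_sketch(struct nfs4_client *clp)
    {
        if (test_and_clear_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
            cltrack_upcall(clp, "remove");
    }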
2014-09-17  nfsd: serialize nfsdcltrack upcalls for a particular client  (Jeff Layton)
In a later patch, we want to add a flag that will allow us to reduce the need for upcalls. In order to handle that correctly, we'll need to ensure that racing upcalls for the same client can't occur. In practice it should be rare for this to occur with a well-behaved client, but it is possible. Convert one of the bits in the cl_flags field to be an upcall bitlock, and use it to ensure that upcalls for the same client are serialized. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17  nfsd: pass extra info in env vars to upcalls to allow for early grace period end  (Jeff Layton)
In order to support lifting the grace period early, we must tell nfsdcltrack what sort of client the "create" upcall is for. We can't reliably tell if a v4.0 client has completed reclaiming, so we can only lift the grace period once all the v4.1+ clients have issued a RECLAIM_COMPLETE and if there are no v4.0 clients. Also, in order to lift the grace period, we have to tell userland when the grace period started so that it can tell whether a RECLAIM_COMPLETE has been issued for each client since then. Since this is all optional info, we pass it along in environment variables to the "init" and "create" upcalls. By doing this, we don't need to revise the upcall format. The UMH upcall can simply make use of this info if it happens to be present. If it's not then it can just avoid lifting the grace period early. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17  nfsd: add a v4_end_grace file to /proc/fs/nfsd  (Jeff Layton)
Allow a privileged userland process to end the v4 grace period early. Writing "Y", "y", or "1" to the file will cause the v4 grace period to be lifted. The basic idea with this will be to allow the userland client tracking program to lift the grace period once it knows that no more clients will be reclaiming state. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17  lockd: add a /proc/fs/lockd/nlm_end_grace file  (Jeff Layton)
Add a new procfile that will allow a (privileged) userland process to end the NLM grace period early. The basic idea here will be to have sm-notify write to this file, if it sent out no NOTIFY requests when it runs. In that situation, we can generally expect that there will be no reclaim requests so the grace period can be lifted early. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17  nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE  (Jeff Layton)
As stated in RFC 5661, section 18.51.3: Once a RECLAIM_COMPLETE is done, there can be no further reclaim operations for locks whose scope is defined as having completed recovery. Once the client sends RECLAIM_COMPLETE, the server will not allow the client to do subsequent reclaims of locking state for that scope and, if these are attempted, will return NFS4ERR_NO_GRACE. Ensure that we enforce that requirement. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17  nfsd: remove redundant boot_time parm from grace_done client tracking op  (Jeff Layton)
Since it's stored in nfsd_net, we don't need to pass it in separately. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17  lockd: move lockd's grace period handling into its own module  (Jeff Layton)
Currently, all of the grace period handling is part of lockd. Eventually though we'd like to be able to build v4-only servers, at which point we'll need to put all of this elsewhere. Move the code itself into fs/nfs_common and have it build a grace.ko module. Then, rejigger the Kconfig options so that both nfsd and lockd enable it automatically. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17  ocfs2: Don't use MAXQUOTAS value  (Jan Kara)
MAXQUOTAS value defines maximum number of quota types VFS supports. This isn't necessarily the number of types ocfs2 supports and with addition of project quotas these two numbers stop matching. So make ocfs2 use its private definition. CC: Mark Fasheh <mfasheh@suse.com> CC: Joel Becker <jlbec@evilplan.org> CC: ocfs2-devel@oss.oracle.com Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-17  reiserfs: Don't use MAXQUOTAS value  (Jan Kara)
MAXQUOTAS value defines maximum number of quota types VFS supports. This isn't necessarily the number of types reiserfs supports and with addition of project quotas these two numbers stop matching. So make reiserfs use its private definition. CC: reiserfs-devel@vger.kernel.org CC: Jeff Mahoney <jeffm@suse.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-17  ext3: Don't use MAXQUOTAS value  (Jan Kara)
MAXQUOTAS value defines maximum number of quota types VFS supports. This isn't necessarily the number of types ext3 supports and with addition of project quotas these two numbers stop matching. So make ext3 use its private definition. CC: linux-ext4@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
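The three quota patches above share one idea: size per-filesystem arrays with a private constant rather than the VFS-wide MAXQUOTAS. A minimal sketch, assuming ext3's two supported quota types (user and group); the struct is a simplified stand-in, not the real ext3_sb_info:

    #define EXT3_MAXQUOTAS 2                /* user and group quotas only */

    struct ext3_sb_info_sketch {
        char *s_qf_names[EXT3_MAXQUOTAS];   /* sized by the fs, not by the VFS */
        int   s_jquota_fmt;
    };

This way, adding a new VFS quota type (such as project quotas) no longer silently grows the filesystem's own arrays.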
2014-09-17  udf: Fix race between write(2) and close(2)  (Jan Kara)
Currently write(2) updating i_size and close(2) of the file can race in such a way that udf_truncate_tail_extent() called from udf_file_release() sees old i_size but already new extents added by the running write call. This results in complaints like: UDF-fs: warning (device vdb2): udf_truncate_tail_extent: Too long extent after EOF in inode 877: i_size: 0 lbcount: 1073739776 extent 0+1073739776 UDF-fs: error (device vdb2): udf_truncate_tail_extent: Extent after EOF in inode 877 Fix the problem by grabbing i_mutex in udf_file_release() to be sure i_size is consistent with current state of extent list. Also avoid truncating tail extent unnecessarily when the file is still open for writing. Signed-off-by: Jan Kara <jack@suse.cz>
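A rough sketch of the locking described above, assuming the 3.17-era convention where i_mutex serializes i_size updates against release(); udf_truncate_tail_extent() is the real helper named in the commit, the rest is illustrative:

    static int udf_release_file_sketch(struct inode *inode, struct file *filp)
    {
        if ((filp->f_mode & FMODE_WRITE) &&
            atomic_read(&inode->i_writecount) == 1) {
            /* last writer: i_size and the extent list must be seen together */
            mutex_lock(&inode->i_mutex);
            udf_truncate_tail_extent(inode);
            mutex_unlock(&inode->i_mutex);
        }
        return 0;
    }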
2014-09-16  Btrfs: set inode's logged_trans/last_log_commit after ranged fsync  (Filipe Manana)
When a ranged fsync finishes if there are still extent maps in the modified list, still set the inode's logged_trans and last_log_commit. This is important in case an inode is fsync'ed and unlinked in the same transaction, to ensure its inode ref gets deleted from the log and the respective dentries in its parent are deleted too from the log (if the parent directory was fsync'ed in the same transaction). Instead make btrfs_inode_in_log() return false if the list of modified extent maps isn't empty. This is an incremental on top of the v4 version of the patch: "Btrfs: fix fsync data loss after a ranged fsync" which was added to its v5, but didn't make it on time. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-09-16  ext4: explicitly inform user about orphan list cleanup  (Dmitry Monakhov)
A production filesystem is likely compiled/mounted without jbd debugging, so orphan list clearing would otherwise be silent.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-16  jbd2: jbd2_log_wait_for_space: improve error detection  (Dmitry Monakhov)
If EIO happens after we have dropped j_state_lock, we won't notice that the journal has been aborted. So it is reasonable to move this check to after we have grabbed the j_checkpoint_mutex and re-grabbed the j_state_lock. This patch helps to prevent false positive complaints after EIO.
#DMESG:
__jbd2_log_wait_for_space: needed 8448 blocks and only had 8386 space available
__jbd2_log_wait_for_space: no way to get more journal space in ram1-8
------------[ cut here ]------------
WARNING: CPU: 15 PID: 6739 at fs/jbd2/checkpoint.c:168 __jbd2_log_wait_for_space+0x188/0x200()
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 15 PID: 6739 Comm: fsstress Tainted: G W 3.17.0-rc2-00429-g684de57 #139
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
 00000000000000a8 ffff88077aaab878 ffffffff815c1a8c 00000000000000a8
 0000000000000000 ffff88077aaab8b8 ffffffff8106ce8c ffff88077aaab898
 ffff8807c57e6000 ffff8807c57e6028 0000000000002100 ffff8807c57e62f0
Call Trace:
 [<ffffffff815c1a8c>] dump_stack+0x51/0x6d
 [<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0
 [<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20
 [<ffffffff812419f8>] __jbd2_log_wait_for_space+0x188/0x200
 [<ffffffff8123be9a>] start_this_handle+0x4da/0x7b0
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff810aba87>] ? lockdep_init_map+0xe7/0x180
 [<ffffffff8123c5bc>] jbd2__journal_start+0xdc/0x1d0
 [<ffffffff811f2414>] ? __ext4_new_inode+0x7f4/0x1330
 [<ffffffff81222a38>] __ext4_journal_start_sb+0xf8/0x110
 [<ffffffff811f2414>] __ext4_new_inode+0x7f4/0x1330
 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
 [<ffffffff812025bb>] ext4_create+0x8b/0x150
 [<ffffffff8117fe3b>] vfs_create+0x7b/0xb0
 [<ffffffff8118097b>] do_last+0x7db/0xcf0
 [<ffffffff8117e31d>] ? inode_permission+0x4d/0x50
 [<ffffffff811845d2>] path_openat+0x242/0x590
 [<ffffffff81191a76>] ? __alloc_fd+0x36/0x140
 [<ffffffff81184a6a>] do_filp_open+0x4a/0xb0
 [<ffffffff81191b61>] ? __alloc_fd+0x121/0x140
 [<ffffffff81172f20>] do_sys_open+0x170/0x220
 [<ffffffff8117300e>] SyS_open+0x1e/0x20
 [<ffffffff811715d6>] SyS_creat+0x16/0x20
 [<ffffffff815c7e12>] system_call_fastpath+0x16/0x1b
---[ end trace cd71c831f82059db ]---
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-16  jbd2: free bh when descriptor block checksum fails  (Darrick J. Wong)
Free the buffer head if the journal descriptor block fails checksum verification. This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum verify error in do_one_pass". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Cc: stable@vger.kernel.org
2014-09-16  ext4: check EA value offset when loading  (Darrick J. Wong)
When loading extended attributes, check each entry's value offset to make sure it doesn't collide with the entries. Without this check it is easy to crash the kernel by mounting a malicious FS containing a file with an EA wherein e_value_offs = 0 and e_value_size > 0 and then deleting the EA, which corrupts the name list. (See the f_ea_value_crash test's FS image in e2fsprogs for an example.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
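A hedged sketch of the kind of bounds check described (not the exact ext4 fix): an EA value must start past the entry descriptors and must not run off the end of the block. Field names follow ext4's on-disk xattr entry; the helper and its arguments are illustrative:

    static int ea_value_sane(struct ext4_xattr_entry *entry,
                             void *entries_end, void *value_start,
                             void *block_end)
    {
        void *value = value_start + le16_to_cpu(entry->e_value_offs);

        if (value < entries_end ||
            value + le32_to_cpu(entry->e_value_size) > block_end)
            return 0;   /* value collides with the entry table or overflows */
        return 1;
    }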
2014-09-16  KEYS: Remove key_type::match in favour of overriding default by match_preparse  (David Howells)
A previous patch added a ->match_preparse() method to the key type. This is allowed to override the function called by the iteration algorithm. Therefore, we can just set a default that simply checks for an exact match of the key description with the original criterion data and allow match_preparse to override it as needed. The key_type::match op is then redundant and can be removed, as can the user_match() function. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com>
2014-09-16  Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes  (Linus Torvalds)
Pull gfs2 fixes from Steven Whitehouse: "Here are a number of small fixes for GFS2. There is a fix for FIEMAP on large sparse files, a negative dentry hashing fix, a fix for flock, and a bug fix relating to d_splice_alias usage. There are also (patches 1 and 5) a couple of updates which are less critical, but small and low risk"
* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: fix d_splice_alias() misuses
  GFS2: Don't use MAXQUOTAS value
  GFS2: Hash the negative dentry during inode lookup
  GFS2: Request demote when a "try" flock fails
  GFS2: Change maxlen variables to size_t
  GFS2: fs/gfs2/super.c: replace seq_printf by seq_puts
2014-09-16  vfs: workaround gcc <4.6 build error in link_path_walk()  (James Hogan)
Commit d6bb3e9075bb ("vfs: simplify and shrink stack frame of link_path_walk()") introduced build problems with GCC versions older than 4.6 due to the initialisation of a member of an anonymous union in struct qstr without enclosing braces. This hits GCC bug 10676 [1] (which was fixed in GCC 4.6 by [2]), and causes the following build error: fs/namei.c: In function 'link_path_walk': fs/namei.c:1778: error: unknown field 'hash_len' specified in initializer This is worked around by adding explicit braces. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676 [2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=159206 Fixes: d6bb3e9075bb (vfs: simplify and shrink stack frame of link_path_walk()) Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-metag@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
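A self-contained illustration of GCC bug 10676 and the brace workaround; the struct below is a simplified stand-in for struct qstr:

    struct qstr_like {
        union {
            struct { unsigned int hash, len; };
            unsigned long long hash_len;
        };
        const char *name;
    };

    /* accepted by old and new GCC alike: braces around the union initializer */
    static struct qstr_like ok = { { .hash_len = 42 }, .name = "x" };

    /* rejected by GCC < 4.6 ("unknown field 'hash_len' specified in initializer"):
     * static struct qstr_like bad = { .hash_len = 42, .name = "x" };
     */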
2014-09-16  Fix mfsymlinks file size check  (Steve French)
If the mfsymlinks file size has changed (e.g. the file no longer represents an emulated symlink) we were not returning an error properly. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Stefan Metzmacher <metze@samba.org>
2014-09-16  f2fs: fix double lock for inode page during roll-forward recovery  (Jaegeuk Kim)
If the inode is the same and its data index needs to be truncated, we can fall into a double lock for its inode page via get_dnode_of_data. The error case is like this:
 1. write data 1, 2, 3, 4, 5 in inode #4.
 2. write data 100, 102, 103, 104, 105 in dnode #6 of inode #4.
 3. sync
 4. update data 100->106 in dnode #6.
 5. fsync inode #4.
 6. power-cut
-> Then,
 1. go back to #3's checkpoint
 2. in do_recover_data, get_dnode_of_data() gets inode #4.
 3. detect 100->106 in dnode #6.
 4. check_index_in_prev_nodes tries to truncate 100 in dnode #6.
 5. to trigger truncate_hole, get_dnode_of_data should grab inode #4.
 6. detect *kernel hang*
This patch should resolve that bug.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16  f2fs: fix a race condition in next_free_nid  (Huang Ying)
The nm_i->fcnt check is executed before taking the spin_lock, so if another thread deletes the last free_nid from the list, a wrong nid may be returned. Fix the race condition by moving the nm_i->fcnt check inside the spin_lock.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
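A generic sketch of the fix: the emptiness test must happen under the same lock that protects the list, otherwise the test can be stale by the time the lock is held. Names mirror the commit text and f2fs (nm_i->fcnt, free_nid_list), but the function is illustrative, not the real alloc_nid():

    static bool pick_free_nid_sketch(struct f2fs_nm_info *nm_i, nid_t *nid)
    {
        struct free_nid *i;

        spin_lock(&nm_i->free_nid_list_lock);
        if (!nm_i->fcnt) {                  /* now checked under the lock */
            spin_unlock(&nm_i->free_nid_list_lock);
            return false;
        }
        i = list_first_entry(&nm_i->free_nid_list, struct free_nid, list);
        *nid = i->nid;
        spin_unlock(&nm_i->free_nid_list_lock);
        return true;
    }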
2014-09-16  f2fs: use nm_i->next_scan_nid as default for next_free_nid  (Huang Ying)
Now, if there is no free nid in nm_i->free_nid_list, 0 may be saved into next_free_nid of the checkpoint, which can cause useless scanning on the next mount. nm_i->next_scan_nid should be a better default value than 0.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16  f2fs: give an option to enable in-place-updates during fsync to users  (Jaegeuk Kim)
If a user writes F2FS_IPU_FSYNC:4 to /sys/fs/f2fs/ipu_policy, only f2fs_sync_file starts to try in-place updates. Then, if the number of dirty pages is over /sys/fs/f2fs/min_fsync_blocks, it keeps the usual out-of-order manner; otherwise, it triggers in-place updates. This may be useful for storage showing very high random write performance, for example when
  Seq. writes (Data) + wait + Seq. writes (Node)
is much slower than
  Rand. writes (Data)
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16  f2fs: expand counting dirty pages in the inode page cache  (Jaegeuk Kim)
Previously f2fs only counted dirty dentry pages, but there is no reason not to expand the scope. This patch renames the dirty page management accordingly and counts dirty pages in each inode info as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16  Update version number displayed by modinfo for cifs.ko  (Steve French)
Update cifs.ko version to 2.05.
Signed-off-by: Steve French <smfrench@gmail.com>
2014-09-16  cifs: remove dead code  (Arnd Bergmann)
cifs provides two dummy functions 'sess_auth_lanman' and 'sess_auth_kerberos' for the case in which the respective features are not defined. However, the caller is also under an #ifdef, so we just get warnings about unused code: fs/cifs/sess.c:1109:1: warning: 'sess_auth_kerberos' defined but not used [-Wunused-function] sess_auth_kerberos(struct sess_data *sess_data) Removing the dead functions gets rid of the warnings without any downsides that I can see. (Yalin Wang reported the identical problem and fix so added him) Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com> Signed-off-by: Steve French <smfrench@gmail.com>
2014-09-16  Revert "cifs: No need to send SIGKILL to demux_thread during umount"  (Steve French)
This reverts commit 52a36244443eabb594bdb63622ff2dd7a083f0e2. Causes rmmod to fail for at least 7 seconds after unmount which makes automated testing a little harder when reloading cifs.ko between test runs. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> CC: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Steve French <smfrench@gmail.com>
2014-09-15  pnfs/blocklayout: include vmalloc.h for __vmalloc  (Stephen Rothwell)
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-15  vfs: simplify and shrink stack frame of link_path_walk()  (Linus Torvalds)
Commit 9226b5b440f2 ("vfs: avoid non-forwarding large load after small store in path lookup") made link_path_walk() always access the "hash_len" field as a single 64-bit entity, in order to avoid mixed size accesses to the members. However, what I didn't notice was that that effectively means that the whole "struct qstr this" is now basically redundant. We already explicitly track the "const char *name", and if we just use "u64 hash_len" instead of "long len", there is nothing else left of the "struct qstr". We do end up wanting the "struct qstr" if we have a filesystem with a "d_hash()" function, but that's a rare case, and we might as well then just squirrel away the name and hash_len at that point. End result: fewer live variables in the loop, a smaller stack frame, and better code generation. And we don't need to pass in pointer variables to helper functions any more, because the return value contains all the relevant information. So this removes more lines than it adds, and the source code is clearer too.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-15  [SMB3] Fix oops when creating symlinks on smb3  (Steve French)
We were not checking for symlink support properly for SMB2/SMB3 mounts, so we could oops when trying to create a symlink on an smb2/smb3 mount that uses mfsymlinks.
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org> # 3.14+
CC: Sachin Prabhu <sprabhu@redhat.com>
2014-09-14  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)
Pull vfs fixes from Al Viro: "double iput() on failure exit in lustre, racy removal of spliced dentries from ->s_anon in __d_materialise_dentry() plus a bunch of assorted RCU pathwalk fixes"
The RCU pathwalk fixes end up fixing a couple of cases where we incorrectly dropped out of RCU walking, due to incorrect initialization and testing of the sequence locks in some corner cases. Since dropping out of RCU walk mode forces the slow locked accesses, those corner cases slowed down quite dramatically.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  be careful with nd->inode in path_init() and follow_dotdot_rcu()
  don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()
  fix bogus read_seqretry() checks introduced in b37199e
  move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)
  [fix] lustre: d_make_root() does iput() on dentry allocation failure
2014-09-14  vfs: avoid non-forwarding large load after small store in path lookup  (Linus Torvalds)
The performance regression that Josef Bacik reported in the pathname lookup (see commit 99d263d4c5b2 "vfs: fix bad hashing of dentries") made me look at performance stability of the dcache code, just to verify that the problem was actually fixed. That turned up a few other problems in this area. There are a few cases where we exit RCU lookup mode and go to the slow serializing case when we shouldn't; Al has fixed those and they'll come in with the next VFS pull. But my performance verification also shows that link_path_walk() turns out to have a very unfortunate 32-bit store of the length and hash of the name we look up, followed by a 64-bit read of the combined hash_len field. That screws up the processor store-to-load forwarding, causing an unnecessary hiccup in this critical routine. It's caused by the ugly calling convention for the "hash_name()" function, and easily fixed by just making hash_name() fill in the whole 'struct qstr' rather than passing it a pointer to just the hash value. With that, the profile for this function looks much smoother.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
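A simplified sketch of the access pattern being fixed, using a stand-in for struct qstr; not the actual namei.c code:

    struct qstr_like {
        union {
            struct { unsigned int hash, len; };
            unsigned long long hash_len;
        };
        const char *name;
    };

    /* old pattern: two narrow (32-bit) stores ... */
    static inline void fill_split(struct qstr_like *q, unsigned int h, unsigned int l)
    {
        q->hash = h;
        q->len  = l;
    }
    /* ... followed elsewhere by a wide (64-bit) load of q->hash_len, which the
     * store buffer cannot forward, stalling the pipeline */

    /* fixed pattern: one wide store, so the later wide load forwards cleanly */
    static inline void fill_combined(struct qstr_like *q, unsigned long long hash_len)
    {
        q->hash_len = hash_len;
    }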
2014-09-14  [CIFS] Fix setting time before epoch (negative time values)  (Steve French)
xfstest generic/258 sets the time on a file to a negative value (before 1970), which fails since do_div can not handle negative numbers. In addition, 'normal' division of 64-bit values does not build on 32-bit arch, so we have to work around this by special-casing negative values in cifs_NTtimeToUnix. Samba server also has a bug with this (see Samba bugzilla 7771) but it works to a Windows server.
Signed-off-by: Steve French <smfrench@gmail.com>
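A hedged sketch of the special-casing described above: do_div() takes an unsigned 64-bit dividend (and a plain 64-bit '/' won't link on 32-bit builds), so a pre-epoch value is divided as a magnitude and the sign is reapplied. Not the exact cifs_NTtimeToUnix() code:

    /* t is in 100ns units relative to the Unix epoch and may be negative */
    static void nt100ns_to_ts_sketch(s64 t, s64 *sec, long *nsec)
    {
        u64 abs_t = (t < 0) ? (u64)-t : (u64)t;
        u32 rem = do_div(abs_t, 10000000);      /* abs_t is now whole seconds */

        *sec  = (t < 0) ? -(s64)abs_t : (s64)abs_t;
        *nsec = (t < 0) ? -(long)(rem * 100) : (long)(rem * 100);
    }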
2014-09-14  be careful with nd->inode in path_init() and follow_dotdot_rcu()  (Al Viro)
in the former we simply check if dentry is still valid after picking its ->d_inode; in the latter we fetch ->d_inode in the same places where we fetch dentry and its ->d_seq, under the same checks. Cc: stable@vger.kernel.org # 2.6.38+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-14  don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()  (Al Viro)
return the value instead, and have path_init() do the assignment. Broken by "vfs: Fix absolute RCU path walk failures due to uninitialized seq number", which was Cc-stable with 2.6.38+ as destination. This one should go where it went. To avoid dummy value returned in case when root is already set (it would do no harm, actually, since the only caller that doesn't ignore the return value is guaranteed to have nd->root *not* set, but it's more obvious that way), lift the check into callers. And do the same to set_root(), to keep them in sync. Cc: stable@vger.kernel.org # 2.6.38+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-13  fix bogus read_seqretry() checks introduced in b37199e  (Al Viro)
read_seqretry() returns true on mismatch, not on match... Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-13  move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)  (Al Viro)
and lock the right list there Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-13  vfs: fix bad hashing of dentries  (Linus Torvalds)
Josef Bacik found a performance regression between 3.2 and 3.10 and narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long' accesses for dcache name comparison and hashing"). He reports:
 "The test case is essentially
      for (i = 0; i < 1000000; i++)
          mkdir("a$i");
  On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k dir/sec with 3.10. This is because we spend waaaaay more time in __d_lookup on 3.10 than in 3.2.
  The new hashing function for strings is suboptimal for < sizeof(unsigned long) string names (and hell even > sizeof(unsigned long) string names that I've tested). I broke out the old hashing function and the new one into a userspace helper to get real numbers and this is what I'm getting:
      Old hash table had 1000000 entries, 0 dupes, 0 max dupes
      New hash table had 12628 entries, 987372 dupes, 900 max dupes
      We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
  My test does the hash, and then does the d_hash into a integer pointer array the same size as the dentry hash table on my system, and then just increments the value at the address we got to see how many entries we overlap with.
  As you can see the old hash function ended up with all 1 million entries in their own bucket, whereas the new one they are only distributed among ~12.5k buckets, which is why we're using so much more CPU in __d_lookup".
The reason for this hash regression is two-fold:
 - On 64-bit architectures the down-mixing of the original 64-bit word-at-a-time hash into the final 32-bit hash value is very simplistic and suboptimal, and just adds the two 32-bit parts together. In particular, because there is no bit shuffling and the mixing boundary is also a byte boundary, similar character patterns in the low and high word easily end up just canceling each other out.
 - the old byte-at-a-time hash mixed each byte into the final hash as it hashed the path component name, resulting in the low bits of the hash generally being a good source of hash data. That is not true for the word-at-a-time case, and the hash data is distributed among all the bits.
The fix is the same in both cases: do a better job of mixing the bits up and using as much of the hash data as possible. We already have the "hash_32|64()" functions to do that.
Reported-by: Josef Bacik <jbacik@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
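A sketch of the idea behind the fix on 64-bit, using the real hash_64() helper from <linux/hash.h>; the wrapper name is illustrative, not necessarily the exact namei.c change:

    #include <linux/hash.h>

    /* fold a word-at-a-time hash down to 32 bits by mixing all 64 bits,
     * instead of just adding the two 32-bit halves together */
    static inline unsigned int fold_hash_sketch(unsigned long hash)
    {
        return hash_64(hash, 32);
    }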
2014-09-12  GFS2: fix d_splice_alias() misuses  (Al Viro)
Callers of d_splice_alias(dentry, inode) don't need iput(), neither on success nor on failure. Either the reference to inode is stored in a previously negative dentry, or it's dropped. In either case inode reference the caller used to hold is consumed. __gfs2_lookup() does iput() in case when d_splice_alias() has failed. Double iput() if we ever hit that. And gfs2_create_inode() ends up not only with double iput(), but with link count dropped to zero - on an inode it has just found in directory. Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
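A sketch of the calling convention the fix relies on: d_splice_alias() consumes the inode reference on both success and failure, so the caller must not iput() afterwards. Illustrative lookup skeleton; some_fs_iget() is hypothetical, not gfs2 code:

    static struct dentry *lookup_sketch(struct inode *dir, struct dentry *dentry)
    {
        struct inode *inode = some_fs_iget(dir, dentry);    /* hypothetical */

        if (IS_ERR(inode))
            return ERR_CAST(inode);
        /* correct: hand the inode reference over unconditionally */
        return d_splice_alias(inode, dentry);
        /* buggy pattern being removed: additionally calling iput(inode) when
         * d_splice_alias() fails, leading to a double iput() */
    }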
2014-09-12  Merge tag 'nfs-for-3.17-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs  (Linus Torvalds)
Pull NFS client fixes from Trond Myklebust: "Highlights:
 - fix a kernel warning when removing /proc/net/nfsfs
 - revert commit 49a4bda22e18 due to Oopses
 - fix a typo in the pNFS file layout commit code"
* tag 'nfs-for-3.17-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  pnfs: fix filelayout_retry_commit when idx > 0
  nfs: revert "nfs4: queue free_lock_state job submission to nfsiod"
  nfs: fix kernel warning when removing proc entry
2014-09-12  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs  (Linus Torvalds)
Pull btrfs fixes from Chris Mason: "Filipe is doing a careful pass through fsync problems, and these are the fixes so far. I'll have one more for rc6 that we're still testing. My big commit is fixing up some inode hash races that Al Viro found (thanks Al)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: use insert_inode_locked4 for inode creation
  Btrfs: fix fsync data loss after a ranged fsync
  Btrfs: kfree()ing ERR_PTRs
  Btrfs: fix crash while doing a ranged fsync
  Btrfs: fix corruption after write/fsync failure + fsync + log recovery
  Btrfs: fix autodefrag with compression
2014-09-12  nfs41: change PNFS_LAYOUTRET_ON_SETATTR to only return on truncation to smaller size  (Peng Tao)
Both blocks layout and objects layout want to use it to avoid CB_LAYOUTRECALL, but that should only happen if the client is doing a truncation to a smaller size. For other cases, we let the server decide if it wants to recall the client's layouts. Change PNFS_LAYOUTRET_ON_SETATTR to follow this logic and not send layoutreturn unnecessarily.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12  NFS: Move NFS v3 acl functions to nfs3_fs.h  (Anna Schumaker)
This code is internal to the v3 module, so other parts of the client shouldn't have any knowledge of it. nfs3_getxattr(), nfs3_setxattr(), and nfs3_removexattr() no longer exist anywhere so I remove the declarations while I'm here. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12  NFS: Remove v3 not compiled check from validate_mount_data()  (Anna Schumaker)
This check is already performed by the module loading code - if the module can't be found then -EPROTONOSUPPORT will be returned. Let's handle v3 this way, too. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>