summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-03-10Btrfs: skip readonly root for snapshot-aware defragmentWang Shilong
Btrfs send is assuming readonly root won't change, let's skip readonly root. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: switch to btrfs_previous_extent_item()Wang Shilong
Since we have introduced btrfs_previous_extent_item() to search previous extent item, just switch into it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: skip submitting barrier for missing deviceHidetoshi Seto
I got an error on v3.13: BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.) how to reproduce: > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2 > wipefs -a /dev/sdf2 > mount -o degraded /dev/sdf1 /mnt > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt The reason of the error is that barrier_all_devices() failed to submit barrier to the missing device. However it is clear that we cannot do anything on missing device, and also it is not necessary to care chunks on the missing device. This patch stops sending/waiting barrier if device is missing. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: unlock extent and pages on error in cow_file_rangeJosef Bacik
When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I made it so we just return an error instead of unlocking all of our various stuff. This is a mistake and causes us to hang when we run into this. This patch fixes this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: balance delayed inode updatesJosef Bacik
While trying to reproduce a delayed ref problem I noticed the box kept falling over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's. Turns out this is because we only throttle delayed inode updates in btrfs_dirty_inode, which doesn't actually get called that often, especially when all you are doing is creating a bunch of files. So balance delayed inode updates everytime we create a new inode. With this patch we no longer use up all of our ram with delayed inode updates. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: add simple debugfs interfaceDavid Sterba
Help during debugging to export various interesting infromation and tunables without the need of extra mount options or ioctls. Usage: * declare your variable in sysfs.h, and include where you need it * define the variable in sysfs.c and make it visible via debugfs_create_TYPE Depends on CONFIG_DEBUG_FS. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: lower memory requirements in common caseDavid Sterba
The fs_path structure uses an inline buffer and falls back to a chain of allocations, but vmalloc is not necessary because PATH_MAX fits into PAGE_SIZE. The size of fs_path has been reduced to 256 bytes from PAGE_SIZE, usually 4k. Experimental measurements show that most paths on a single filesystem do not exceed 200 bytes, and these get stored into the inline buffer directly, which is now 230 bytes. Longer paths are kmalloced when needed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: make some tree searches in send.c more efficientFilipe David Borba Manana
We have this pattern where we do search for a contiguous group of items in a tree and everytime we find an item, we process it, then we release our path, increment the offset of the search key, do another full tree search and repeat these steps until a tree search can't find more items we're interested in. Instead of doing these full tree searches after processing each item, just process the next item/slot in our leaf and don't release the path. Since all these trees are read only and we always use the commit root for a search and skip node/leaf locks, we're not affecting concurrency on the trees. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: use right extent item position in send when finding extent clonesFilipe David Borba Manana
This was a leftover from the commit: 74dd17fbe3d65829e75d84f00a9525b2ace93998 (Btrfs: fix btrfs send for inline items and compression) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove BUG_ON from name_cache_deleteDavid Sterba
If cleaning the name cache fails, we could try to proceed at the cost of some memory leak. This is not expected to happen often. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove BUG from process_all_refsDavid Sterba
There are only 2 static callers, the BUG would normally be never reached, but let's be nice. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: squeeze bitfilelds in fs_pathDavid Sterba
We know that buf_len is at most PATH_MAX, 4k, and can merge it with the reversed member. This saves 3 bytes in favor of inline_buf. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove virtual_mem member from fs_pathDavid Sterba
We don't need to keep track of that, it's available via is_vmalloc_addr. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove prepared member from fs_pathDavid Sterba
The member is used only to return value back from fs_path_prepare_for_add, we can do it locally and save 8 bytes for the inline_buf path. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: replace check with an assert in gen_unique_nameDavid Sterba
The buffer passed to snprintf can hold the fully expanded format string, 64 = 3x largest ULL + 3x char + trailing null. I don't think that removing the check entirely is a good idea, hence the ASSERT. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: more send support for parent/child dir relationship inversionFilipe David Borba Manana
The commit titled "Btrfs: fix infinite path build loops in incremental send" didn't cover a particular case where the parent-child relationship inversion of directories doesn't imply a rename of the new parent directory. This was due to a simple logic mistake, a logical and instead of a logical or. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44 $ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44 $ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3 $ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send A patch to update the test btrfs/030 from xfstests, so that it covers this case, will be submitted soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix send dealing with file renames and directory movesFilipe David Borba Manana
This fixes a case that the commit titled: Btrfs: fix infinite path build loops in incremental send didn't cover. If the parent-child relationship between 2 directories is inverted, both get renamed, and the former parent has a file that got renamed too (but remains a child of that directory), the incremental send operation would use the file's old path after sending an unlink operation for that old path, causing receive to fail on future operations like changing owner, permissions or utimes of the corresponding inode. This is not a regression from the commit mentioned before, as without that commit we would fall into the issues that commit fixed, so it's just one case that wasn't covered before. Simple steps to reproduce this issue are: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d $ touch /mnt/btrfs/a/b/c/d/file $ mkdir -p /mnt/btrfs/a/b/x $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2 $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2 $ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send A patch to update the test btrfs/030 from xfstests, so that it covers this case, will be submitted soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: only add roots if necessary in find_parent_nodes()Wang Shilong
find_all_leafs() dosen't need add all roots actually, add roots only if we need, this can avoid unnecessary ulist dance. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctlHugo Mills
The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit and 64-bit systems. This means that it is impossible to use btrfs receive on a system with a 64-bit kernel and 32-bit userspace, because the structure size (and hence the ioctl number) is different. This patch adds a compatibility structure and ioctl to deal with the above case. Signed-off-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: add missing error check in incremental sendFilipe David Borba Manana
Function wait_for_parent_move() returns negative value if an error happened, 0 if we don't need to wait for the parent's move, and 1 if the wait is needed. Before this change an error return value was being treated like the return value 1, which was not correct. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix use-after-free in the finishing procedure of the device replaceMiao Xie
During device replace test, we hit a null pointer deference (It was very easy to reproduce it by running xfstests' btrfs/011 on the devices with the virtio scsi driver). There were two bugs that caused this problem: - We might allocate new chunks on the replaced device after we updated the mapping tree. And we forgot to replace the source device in those mapping of the new chunks. - We might get the mapping information which including the source device before the mapping information update. And then submit the bio which was based on that mapping information after we freed the source device. For the first bug, we can fix it by doing mapping tree update and source device remove in the same context of the chunk mutex. The chunk mutex is used to protect the allocable device list, the above method can avoid the new chunk allocation, and after we remove the source device, all the new chunks will be allocated on the new device. So it can fix the first bug. For the second bug, we need make sure all flighting bios are finished and no new bios are produced during we are removing the source device. To fix this problem, we introduced a global @bio_counter, we not only inc/dec @bio_counter outsize of map_blocks, but also inc it before submitting bio and dec @bio_counter when ending bios. Since Raid56 is a little different and device replace dosen't support raid56 yet, it is not addressed in the patch and I add comments to make sure we will fix it in the future. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix unprotected alloc list insertion during the finishing procedure ↵Miao Xie
of replace the alloc list of the filesystem is protected by ->chunk_mutex, we need get that mutex when we insert the new device into the list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Return EXDEV for cross file system snapshotKusanagi Kouichi
EXDEV seems an appropriate error if an operation fails bacause it crosses file system boundaries. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: don't mix the ordered extents of all files together during logging ↵Miao Xie
the inodes There was a problem in the old code: If we failed to log the csum, we would free all the ordered extents in the log list including those ordered extents that were logged successfully, it would make the log committer not to wait for the completion of the ordered extents. This patch doesn't insert the ordered extents that is about to be logged into a global list, instead, we insert them into a local list. If we log the ordered extents successfully, we splice them with the global list, or we will throw them away, then do full sync. It can also reduce the lock contention and the traverse time of list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Merge branch 'fortglx/3.15/time' of ↵Thomas Gleixner
git://git.linaro.org/people/john.stultz/linux into timers/core - support CLOCK_BOOTTIME clock in timerfd - Add missing header file Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-10get rid of fget_light()Al Viro
instead of returning the flags by reference, we can just have the low-level primitive return those in lower bits of unsigned long, with struct file * derived from the rest. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10vfs: atomic f_pos accesses as per POSIXLinus Torvalds
Our write() system call has always been atomic in the sense that you get the expected thread-safe contiguous write, but we haven't actually guaranteed that concurrent writes are serialized wrt f_pos accesses, so threads (or processes) that share a file descriptor and use "write()" concurrently would quite likely overwrite each others data. This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says: "2.9.7 Thread Interactions with Regular File Operations All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: [...]" and one of the effects is the file position update. This unprotected file position behavior is not new behavior, and nobody has ever cared. Until now. Yongzhi Pan reported unexpected behavior to Michael Kerrisk that was due to this. This resolves the issue with a f_pos-specific lock that is taken by read/write/lseek on file descriptors that may be shared across threads or processes. Reported-by: Yongzhi Pan <panyongzhi@gmail.com> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10ocfs2 syncs the wrong range...Al Viro
Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-09Merge tag 'nfs-for-3.14-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - Fix another nfs4_sequence corruptor in RELEASE_LOCKOWNER - Fix an Oopsable delegation callback race - Fix another bad stateid infinite loop - Fail the data server I/O is the stateid represents a lost lock - Fix an Oopsable sunrpc trace event" * tag 'nfs-for-3.14-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Fix oops when trace sunrpc_task events in nfs client NFSv4: Fail the truncate() if the lock/open stateid is invalid NFSv4.1 Fail data server I/O if stateid represents a lost lock NFSv4: Fix the return value of nfs4_select_rw_stateid NFSv4: nfs4_stateid_is_current should return 'true' for an invalid stateid NFS: Fix a delegation callback race NFSv4: Fix another nfs4_sequence corruptor
2014-03-08kernfs: cache atomic_write_len in kernfs_open_fileTejun Heo
While implementing atomic_write_len, 4d3773c4bb41 ("kernfs: implement kernfs_ops->atomic_write_len") moved data copy from userland inside kernfs_get_active() and kernfs_open_file->mutex so that kernfs_ops->atomic_write_len can be accessed before copying buffer from userland; unfortunately, this could lead to locking order inversion involving mmap_sem if copy_from_user() takes a page fault. ====================================================== [ INFO: possible circular locking dependency detected ] 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G W ------------------------------------------------------- trinity-c236/10658 is trying to acquire lock: (&of->mutex#2){+.+.+.}, at: [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 but task is already holding lock: (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_sem){++++++}: [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<mm/memory.c:4188>] might_fault+0x7e/0xb0 [<arch/x86/include/asm/uaccess.h:713 fs/kernfs/file.c:291>] kernfs_fop_write+0xd8/0x190 [<fs/read_write.c:473>] vfs_write+0xe3/0x1d0 [<fs/read_write.c:523 fs/read_write.c:515>] SyS_write+0x5d/0xa0 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 -> #0 (&of->mutex#2){+.+.+.}: [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&mm->mmap_sem); lock(&of->mutex#2); lock(&mm->mmap_sem); lock(&of->mutex#2); *** DEADLOCK *** 1 lock held by trinity-c236/10658: #0: (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0 stack backtrace: CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000 0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8 ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8 Call Trace: [<lib/dump_stack.c:52>] dump_stack+0x52/0x7f [<kernel/locking/lockdep.c:1213>] print_circular_bug+0x129/0x160 [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560 [<include/linux/spinlock.h:343 mm/slub.c:1933>] ? deactivate_slab+0x511/0x550 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0 [<mm/mmap.c:1552>] ? mmap_region+0x24a/0x5c0 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<kernel/sched/core.c:2477>] ? get_parent_ip+0x11/0x50 [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430 [<mm/util.c:397>] ? vm_mmap_pgoff+0x6e/0xe0 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0 [<kernel/rcu/update.c:97>] ? __rcu_read_unlock+0x44/0xb0 [<fs/file.c:641>] ? dup_fd+0x3c0/0x3c0 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2 Fix it by caching atomic_write_len in kernfs_open_file during open so that it can be determined without accessing kernfs_ops in kernfs_fop_write(). This restores the structure of kernfs_fop_write() before 4d3773c4bb41 with updated @len determination logic. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> References: http://lkml.kernel.org/g/53113485.2090407@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-08kernfs: fix off by one error.Richard Cochran
The hash values 0 and 1 are reserved for magic directory entries, but the code only prevents names hashing to 0. This patch fixes the test to also prevent hash value 1. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Cc: <stable@vger.kernel.org> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-09jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()Theodore Ts'o
It's not needed until we start trying to modifying fields in the journal_head which are protected by j_list_lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-09jbd2: minimize region locked by j_list_lock in journal_get_create_access()Theodore Ts'o
It's not needed until we start trying to modifying fields in the journal_head which are protected by j_list_lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-09jbd2: check jh->b_transaction without taking j_list_lockTheodore Ts'o
jh->b_transaction is adequately protected for reading by the jbd_lock_bh_state(bh), so we don't need to take j_list_lock in __journal_try_to_free_buffer(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-08jbd2: add transaction to checkpoint list earlierTheodore Ts'o
We don't otherwise need j_list_lock during the rest of commit phase #7, so add the transaction to the checkpoint list at the very end of commit phase #6. This allows us to drop j_list_lock earlier, which is a good thing since it is a super hot lock. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-08jbd2: calculate statistics without holding j_state_lock and j_list_lockTheodore Ts'o
The two hottest locks, and thus the biggest scalability bottlenecks, in the jbd2 layer, are the j_list_lock and j_state_lock. This has inspired some people to do some truly unnatural things[1]. [1] https://www.usenix.org/system/files/conference/fast14/fast14-paper_kang.pdf We don't need to be holding both j_state_lock and j_list_lock while calculating the journal statistics, so move those calculations to the very end of jbd2_journal_commit_transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-08jbd2: don't hold j_state_lock while calling wake_up()Theodore Ts'o
The j_state_lock is one of the hottest locks in the jbd2 layer and thus one of its scalability bottlenecks. We don't need to be holding the j_state_lock while we are calling wake_up(&journal->j_wait_commit), so release the lock a little bit earlier. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-08jbd2: don't unplug after writing revoke recordsTheodore Ts'o
During commit process, keep the block device plugged after we are done writing the revoke records, until we are finished writing the rest of the commit records in the journal. This will allow most of the journal blocks to be written in a single I/O operation, instead of separating the the revoke blocks from the rest of the journal blocks. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-07Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Small collection of fixes for 3.14-rc. It contains: - Three minor update to blk-mq from Christoph. - Reduce number of unaligned (< 4kb) in-flight writes on mtip32xx to two. From Micron. - Make the blk-mq CPU notify spinlock raw, since it can't be a sleeper spinlock on RT. From Mike Galbraith. - Drop now bogus BUG_ON() for bio iteration with blk integrity. From Nic Bellinger. - Properly propagate the SYNC flag on requests. From Shaohua" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: add REQ_SYNC early rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock bio-integrity: Drop bio_integrity_verify BUG_ON in post bip->bip_iter world blk-mq: support partial I/O completions blk-mq: merge blk_mq_insert_request and blk_mq_run_request blk-mq: remove blk_mq_alloc_rq mtip32xx: Reduce the number of unaligned writes to 2
2014-03-07afs: don't use PREPARE_WORKTejun Heo
PREPARE_[DELAYED_]WORK() are being phased out. They have few users and a nasty surprise in terms of reentrancy guarantee as workqueue considers work items to be different if they don't have the same work function. afs_call->async_work is multiplexed with multiple work functions. Introduce afs_async_workfn() which invokes afs_call->async_workfn and always use it as the work function and update the users to set the ->async_workfn field instead of overriding the work function using PREPARE_WORK(). It would probably be best to route this with other related updates through the workqueue tree. Compile tested. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: linux-afs@lists.infradead.org
2014-03-07GFS2: Convert gfs2_lm_withdraw to use fs_errJoe Perches
vprintk use is not prefixed by a KERN_<LEVEL>, so emit these messages at KERN_ERR level. Using %pV can save some code and allow fs_err to be used, so do it. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-03-07GFS2: Use fs_<level> more oftenJoe Perches
Convert a couple of uses of pr_<level> to fs_<level> Add and use fs_emerg. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-03-07GFS2: Use pr_<level> more consistentlyJoe Perches
Add pr_fmt, remove embedded "GFS2: " prefixes. This now consistently emits lower case "gfs2: " for each message. Other miscellanea around these changes: o Add missing newlines o Coalesce formats o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-03-07GFS2: Move recovery variables to journal structure in memoryBob Peterson
If multiple nodes fail and their recovery work runs simultaneously, they would use the same unprotected variables in the superblock. For example, they would stomp on each other's revoked blocks lists, which resulted in file system metadata corruption. This patch moves the necessary variables so that each journal has its own separate area for tracking its journal replay. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-03-07xfs: inode log reservations are still too smallDave Chinner
Back in commit 23956703 ("xfs: inode log reservations are too small"), the reservation size was increased to take into account the difference in size between the in-memory BMBT block headers and the on-disk BMDR headers. This solved a transaction overrun when logging the inode size. Recently, however, we've seen a number of these same overruns on kernels with the above fix in it. All of them have been by 4 bytes, so we must still not be accounting for something correctly. Through inspection it turns out the above commit didn't take into account everything it should have. That is, it only accounts for a single log op_hdr structure, when it can actually require up to four op_hdrs - one for each region (log iovec) that is formatted. These regions are the inode log format header, the inode core, and the two forks that can be held in the literal area of the inode. This means we are not accounting for 36 bytes of log space that the transaction can use, and hence when we get inodes in certain formats with particular fragmentation patterns we can overrun the transaction. Fix this by adding the correct accounting for log op_headers in the transaction. Tested-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: xfs_check_page_type buffer checks need helpDave Chinner
xfs_aops_discard_page() was introduced in the following commit: xfs: truncate delalloc extents when IO fails in writeback ... to clean up left over delalloc ranges after I/O failure in ->writepage(). generic/224 tests for this scenario and occasionally reproduces panics on sub-4k blocksize filesystems. The cause of this is failure to clean up the delalloc range on a page where the first buffer does not match one of the expected states of xfs_check_page_type(). If a buffer is not unwritten, delayed or dirty&mapped, xfs_check_page_type() stops and immediately returns 0. The stress test of generic/224 creates a scenario where the first several buffers of a page with delayed buffers are mapped & uptodate and some subsequent buffer is delayed. If the ->writepage() happens to fail for this page, xfs_aops_discard_page() incorrectly skips the entire page. This then causes later failures either when direct IO maps the range and finds the stale delayed buffer, or we evict the inode and find that the inode still has a delayed block reservation accounted to it. We can easily fix this xfs_aops_discard_page() failure by making xfs_check_page_type() check all buffers, but this breaks xfs_convert_page() more than it is already broken. Indeed, xfs_convert_page() wants xfs_check_page_type() to tell it if the first buffers on the pages are of a type that can be aggregated into the contiguous IO that is already being built. xfs_convert_page() should not be writing random buffers out of a page, but the current behaviour will cause it to do so if there are buffers that don't match the current specification on the page. Hence for xfs_convert_page() we need to: a) return "not ok" if the first buffer on the page does not match the specification provided to we don't write anything; and b) abort it's buffer-add-to-io loop the moment we come across a buffer that does not match the specification. Hence we need to fix both xfs_check_page_type() and xfs_convert_page() to work correctly with pages that have mixed buffer types, whilst allowing xfs_aops_discard_page() to scan all buffers on the page for a type match. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: avoid AGI/AGF deadlock scenario for inode chunk allocationBrian Foster
The inode chunk allocation path can lead to deadlock conditions if a transaction is dirtied with an AGF (to fix up the freelist) for an AG that cannot satisfy the actual allocation request. This code path is written to try and avoid this scenario, but it can be reproduced by running xfstests generic/270 in a loop on a 512b fs. An example situation is: - process A attempts an inode allocation on AG 3, modifies the freelist, fails the allocation and ultimately moves on to AG 0 with the AG 3 AGF held - process B is doing a free space operation (i.e., truncate) and acquires the AG 0 AGF, waits on the AG 3 AGF - process A acquires the AG 0 AGI, waits on the AG 0 AGF (deadlock) The problem here is that process A acquired the AG 3 AGF while moving on to AG 0 (and releasing the AG 3 AGI with the AG 3 AGF held). xfs_dialloc() makes one pass through each of the AGs when attempting to allocate an inode chunk. The expectation is a clean transaction if a particular AG cannot satisfy the allocation request. xfs_ialloc_ag_alloc() is written to support this through use of the minalignslop allocation args field. When using the agi->agi_newino optimization, we attempt an exact bno allocation request based on the location of the previously allocated chunk. minalignslop is set to inform the allocator that we will require alignment on this chunk, and thus to not allow the request for this AG if the extra space is not available. Suppose that the AG in question has just enough space for this request, but not at the requested bno. xfs_alloc_fix_freelist() will proceed as normal as it determines the request should succeed, and thus it is allowed to modify the agf. xfs_alloc_ag_vextent() ultimately fails because the requested bno is not available. In response, the caller moves on to a NEAR_BNO allocation request for the same AG. The alignment is set, but the minalignslop field is never reset. This increases the overall requirement of the request from the first attempt. If this delta is the difference between allocation success and failure for the AG, xfs_alloc_fix_freelist() rejects this request outright the second time around and causes the allocation request to unnecessarily fail for this AG. To address this situation, reset the minalignslop field immediately after use and prevent it from leaking into subsequent requests. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: use NOIO contexts for vm_map_ramDave Chinner
When we map pages in the buffer cache, we can do so in GFP_NOFS contexts. However, the vmap interfaces do not provide any method of communicating this information to memory reclaim, and hence we get lockdep complaining about it regularly and occassionally see hangs that may be vmap related reclaim deadlocks. We can also see these same problems from anywhere where we use vmalloc for a large buffer (e.g. attribute code) inside a transaction context. A typical lockdep report shows up as a reclaim state warning like so: [14046.101458] ================================= [14046.102850] [ INFO: inconsistent lock state ] [14046.102850] 3.14.0-rc4+ #2 Not tainted [14046.102850] --------------------------------- [14046.102850] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [14046.102850] kswapd0/14 [HC0[0]:SC0[0]:HE1:SE1] takes: [14046.102850] (&xfs_dir_ilock_class){++++?+}, at: [<791a04bb>] xfs_ilock+0xff/0x16a [14046.102850] {RECLAIM_FS-ON-W} state was registered at: [14046.102850] [<7904cdb1>] mark_held_locks+0x81/0xe7 [14046.102850] [<7904d390>] lockdep_trace_alloc+0x5c/0xb4 [14046.102850] [<790c2c28>] kmem_cache_alloc_trace+0x2b/0x11e [14046.102850] [<790ba7f4>] vm_map_ram+0x119/0x3e6 [14046.102850] [<7914e124>] _xfs_buf_map_pages+0x5b/0xcf [14046.102850] [<7914ed74>] xfs_buf_get_map+0x67/0x13f [14046.102850] [<7917506f>] xfs_attr_rmtval_set+0x396/0x4d5 [14046.102850] [<7916e8bb>] xfs_attr_leaf_addname+0x18f/0x37d [14046.102850] [<7916ed9e>] xfs_attr_set_int+0x2f5/0x3e8 [14046.102850] [<7916eefc>] xfs_attr_set+0x6b/0x74 [14046.102850] [<79168355>] xfs_xattr_set+0x61/0x81 [14046.102850] [<790e5b10>] generic_setxattr+0x59/0x68 [14046.102850] [<790e4c06>] __vfs_setxattr_noperm+0x58/0xce [14046.102850] [<790e4d0a>] vfs_setxattr+0x8e/0x92 [14046.102850] [<790e4ddd>] setxattr+0xcf/0x159 [14046.102850] [<790e5423>] SyS_lsetxattr+0x88/0xbb [14046.102850] [<79268438>] sysenter_do_call+0x12/0x36 Now, we can't completely remove these traces - mainly because vm_map_ram() will do GFP_KERNEL allocation and that generates the above warning before we get into the reclaim code, but we can turn them all into false positive warnings. To do that, use the method that DM and other IO context code uses to avoid this problem: there is a process flag to tell memory reclaim not to do IO that we can set appropriately. That prevents GFP_KERNEL context reclaim being done from deep inside the vmalloc code in places we can't directly pass a GFP_NOFS context to. That interface has a pair of wrapper functions: memalloc_noio_save() and memalloc_noio_restore(). Adding them around vm_map_ram and the vzalloc call in kmem_alloc_large() will prevent deadlocks and most lockdep reports for this issue. Also, convert the vzalloc() call in kmem_alloc_large() to use __vmalloc() so that we can pass the correct gfp context to the data page allocation routine inside __vmalloc() so that it is clear that GFP_NOFS context is important to this vmalloc call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07xfs: don't leak EFSBADCRC to userspaceDave Chinner
While the verifier routines may return EFSBADCRC when a buffer has a bad CRC, we need to translate that to EFSCORRUPTED so that the higher layers treat the error appropriately and we return a consistent error to userspace. This fixes a xfs/005 regression. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-06GFS2: global conversion to pr_foo()Fabian Frederick
-All printk(KERN_foo converted to pr_foo(). -Messages updated to fit in 80 columns. -fs_macros converted as well. -fs_printk removed. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>