summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-12-20fs: configfs: Factor out configfs_do_depend_item()Krzysztof Opasiak
configfs_depend_item() is quite complicated and should be split up into smaller functions. This also allow to share this code with other functions. Signed-off-by: Krzysztof Opasiak <k.opasiak@samsung.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-12-20fs: configfs: Drop unused parameter from configfs_undepend_item()Krzysztof Opasiak
subsys parameter is never used by configfs_undepend_item() so there is no point in passing it to this function. Signed-off-by: Krzysztof Opasiak <k.opasiak@samsung.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2015-12-19x86/cpufeature: Remove unused and seldomly used cpu_has_xx macrosBorislav Petkov
Those are stupid and code should use static_cpu_has_safe() or boot_cpu_has() instead. Kill the least used and unused ones. The remaining ones need more careful inspection before a conversion can happen. On the TODO. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de Cc: David Sterba <dsterba@suse.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Matt Mackall <mpm@selenic.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-18Merge branch 'for-linus-4.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "A couple of small fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: check prepare_uptodate_page() error code earlier Btrfs: check for empty bitmap list in setup_cluster_bitmaps btrfs: fix misleading warning when space cache failed to load Btrfs: fix transaction handle leak in balance Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
2015-12-18proc: fix -ESRCH error when writing to /proc/$pid/coredump_filterColin Ian King
Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") removed the setting of ret after the get_proc_task call and incorrectly left it as -ESRCH. Instead, return 0 when successful. Example breakage: echo 0 > /proc/self/coredump_filter bash: echo: write error: No such process Fixes: 774636e19ed51 ("proc: convert to kstrto*()/kstrto*_from_user()") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-18Merge branch 'freespace-tree' into for-linus-4.5Chris Mason
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-18GFS2: Don't do glock put on when inode creation failsBob Peterson
Currently the error path of function gfs2_inode_lookup calls function gfs2_glock_put corresponding to an earlier call to gfs2_glock_get for the inode glock. That's wrong because the error path also calls iget_failed() which eventually calls iput, which eventually calls gfs2_evict_inode, which does another gfs2_glock_put. This double-put can cause the glock reference count to get off. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-18GFS2: Always use iopen glock for gl_deletesBob Peterson
Before this patch, when function try_rgrp_unlink queued a glock for delete_work to reclaim the space, it used the inode glock to do so. That's different from the iopen callback which uses the iopen glock for the same purpose. We should be consistent and always use the iopen glock. This may also save us reference counting problems with the inode glock, since clear_glock does an extra glock_put() for the inode glock. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-18GFS2: Release iopen glock in gfs2_create_inode error casesBob Peterson
Some error cases in gfs2_create_inode were not unlocking the iopen glock, getting the reference count off. This adds the proper unlock. The error logic in function gfs2_create_inode was also convoluted, so this patch simplifies it. It also takes care of a bug in which gfs2_qa_delete() was not called in an error case. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-18GFS2: Truncate address space mapping when deleting an inodeBob Peterson
In function gfs2_delete_inode() we write and flush the mapping for a glock, among other things. We truncate the mapping for the inode, but we never truncate the mapping for the glock. This patch makes it also truncate the metamapping. This avoid cases where the glock is reused by another process who is trying to recreate an inode in its place using the same block. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-12-18GFS2: Wait for iopen glock dequeuesBob Peterson
This patch changes every glock_dq for iopen glocks into a dq_wait. This makes sure that iopen glocks do not outlive the inode itself. In turn, that ensures that anyone trying to unlink the glock will be able to find the inode when it receives a remote iopen callback. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-12-18fs: make locks.c explicitly non-modularPaul Gortmaker
The Kconfig currently controlling compilation of this code is: config FILE_LOCKING bool "Enable POSIX file locking API" if EXPERT ...meaning that it currently is not being built as a module by anyone. Lets remove the couple traces of modularity so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering gets bumped to one level earlier when we use the more appropriate fs_initcall here. However we've made similar changes before without any fallout and none is expected here either. Cc: Jeff Layton <jlayton@poochiereds.net> Acked-by: Jeff Layton <jlayton@poochiereds.net> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18Btrfs: fix locking bugs when defragging leavesFilipe Manana
When running fstests btrfs/070, with a higher number of fsstress operations, I ran frequently into two different locking bugs when defragging directories. The first bug produced the following traces: [133860.229792] ------------[ cut here ]------------ [133860.251062] WARNING: CPU: 2 PID: 26057 at fs/btrfs/locking.c:46 btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]() [133860.253576] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport [133860.282566] CPU: 2 PID: 26057 Comm: btrfs Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [133860.284393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [133860.286827] 0000000000000000 ffff880207697b78 ffffffff812566f4 0000000000000000 [133860.288341] ffff880207697bb0 ffffffff8104d0a6 ffffffffa052d4c1 ffff880178f60e00 [133860.294219] ffff880178f60e00 0000000000000000 00000000000000f6 ffff880207697bc0 [133860.295831] Call Trace: [133860.306518] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [133860.307473] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [133860.308619] [<ffffffffa052d4c1>] ? btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs] [133860.310068] [<ffffffff8104d172>] warn_slowpath_null+0x1a/0x1c [133860.312552] [<ffffffffa052d4c1>] btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs] [133860.314630] [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs] [133860.323596] [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs] [133860.325233] [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs] [133860.332427] [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs] [133860.337259] [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs] [133860.340147] [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs] [133860.344833] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [133860.346343] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.353248] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.354242] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7 [133860.355232] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174 [133860.356237] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6 [133860.358587] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [133860.360195] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [133860.361380] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [133860.363578] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [133860.366217] ---[ end trace 2cadb2f653437e49 ]--- [133860.367399] ------------[ cut here ]------------ [133860.368162] kernel BUG at fs/btrfs/locking.c:307! [133860.369430] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [133860.370205] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport [133860.370205] CPU: 2 PID: 26057 Comm: btrfs Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [133860.370205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [133860.370205] task: ffff8800aec6db40 ti: ffff880207694000 task.ti: ffff880207694000 [133860.370205] RIP: 0010:[<ffffffffa052d466>] [<ffffffffa052d466>] btrfs_assert_tree_locked+0x10/0x14 [btrfs] [133860.370205] RSP: 0018:ffff880207697bc0 EFLAGS: 00010246 [133860.370205] RAX: 0000000000000000 RBX: ffff880178f60e00 RCX: 0000000000000000 [133860.370205] RDX: ffff88023ec4fb50 RSI: 00000000ffffffff RDI: ffff880178f60e00 [133860.370205] RBP: ffff880207697bc0 R08: 0000000000000001 R09: 0000000000000000 [133860.370205] R10: 0000160000000000 R11: ffffffff81651000 R12: ffff880178f60e00 [133860.370205] R13: 0000000000000000 R14: 00000000000000f6 R15: ffff8801ff409000 [133860.370205] FS: 00007f763efd48c0(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [133860.370205] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [133860.370205] CR2: 0000000002158048 CR3: 000000003fd6c000 CR4: 00000000000006e0 [133860.370205] Stack: [133860.370205] ffff880207697bd8 ffffffffa052d4d0 0000000000000000 ffff880207697be8 [133860.370205] ffffffffa04d5787 ffff880207697c80 ffffffffa04d99cb ffff8801ff409590 [133860.370205] ffff880207697ca8 000000f507697c80 ffff880183c11bb8 0000000000000000 [133860.370205] Call Trace: [133860.370205] [<ffffffffa052d4d0>] btrfs_set_lock_blocking_rw+0x66/0xbd [btrfs] [133860.370205] [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs] [133860.370205] [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs] [133860.370205] [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs] [133860.370205] [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs] [133860.370205] [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs] [133860.370205] [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs] [133860.370205] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [133860.370205] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.370205] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.370205] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7 [133860.370205] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174 [133860.370205] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6 [133860.370205] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [133860.370205] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [133860.370205] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [133860.370205] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f This bug happened because we assumed that by setting keep_locks to 1 in our search path, our path after a call to btrfs_search_slot() would have all nodes locked, which is not always true because unlock_up() (called by btrfs_search_slot()) will unlock a node in a path if the slot of the node below it doesn't point to the last item or beyond the last item. For example, when the tree has a heigth of 2 and path->slots[0] has a value smaller than btrfs_header_nritems(path->nodes[0]) - 1, the node at level 2 will be unlocked (also because lowest_unlock is set to 1 due to the fact that the value passed as ins_len to btrfs_search_slot is 0). This resulted in btrfs_find_next_key(), called before btrfs_realloc_node(), to release out path and call again btrfs_search_slot(), but this time with the cow parameter set to 0, meaning the resulting path got only read locks. Therefore when we called btrfs_realloc_node(), with path->nodes[1] having a read lock, it resulted in the warning and BUG_ON when calling btrfs_set_lock_blocking() against the node, as that function expects the node to have a write lock. The second bug happened often when the first bug didn't happen, and made us hang and hitting the following warning at fs/btrfs/locking.c: 251 void btrfs_tree_lock(struct extent_buffer *eb) 252 { 253 WARN_ON(eb->lock_owner == current->pid); This happened because the tree search we made at btrfs_defrag_leaves() before calling btrfs_find_next_key() locked a leaf and all the other nodes in the path, so btrfs_find_next_key() had no need to release the path and make a new search (with path->lowest_level set to 1). This made btrfs_realloc_node() attempt to write lock the same leaf again, resulting in a hang/deadlock. So fix these issues by calling btrfs_find_next_key() after calling btrfs_realloc_node() and setting the search path's lowest_level to 1 to avoid the hang/deadlock when attempting to write lock the leaves at btrfs_realloc_node(). Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17Btrfs: add free space tree mount optionOmar Sandoval
Now we can finally hook up everything so we can actually use free space tree. The free space tree is enabled by passing the space_cache=v2 mount option. On the first mount with the this option set, the free space tree will be created and the FREE_SPACE_TREE read-only compat bit will be set. Any time the filesystem is mounted from then on, we must use the free space tree. The clear_cache option will also clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: wire up the free space tree to the extent treeOmar Sandoval
The free space tree is updated in tandem with the extent tree. There are only a handful of places where we need to hook in: 1. Block group creation 2. Block group deletion 3. Delayed refs (extent creation and deletion) 4. Block group caching Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add free space tree sanity testsOmar Sandoval
This tests the operations on the free space tree trying to excercise all of the main cases for both formats. Between this and xfstests, the free space tree should have pretty good coverage. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: implement the free space B-treeOmar Sandoval
The free space cache has turned out to be a scalability bottleneck on large, busy filesystems. When the cache for a lot of block groups needs to be written out, we can get extremely long commit times; if this happens in the critical section, things are especially bad because we block new transactions from happening. The main problem with the free space cache is that it has to be written out in its entirety and is managed in an ad hoc fashion. Using a B-tree to store free space fixes this: updates can be done as needed and we get all of the benefits of using a B-tree: checksumming, RAID handling, well-understood behavior. With the free space tree, we get commit times that are about the same as the no cache case with load times slower than the free space cache case but still much faster than the no cache case. Free space is represented with extents until it becomes more space-efficient to use bitmaps, giving us similar space overhead to the free space cache. The operations on the free space tree are: adding and removing free space, handling the creation and deletion of block groups, and loading the free space for a block group. We can also create the free space tree by walking the extent tree and clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: introduce the free space B-tree on-disk formatOmar Sandoval
The on-disk format for the free space tree is straightforward. Each block group is represented in the free space tree by a free space info item that stores accounting information: whether the free space for this block group is stored as bitmaps or extents and how many extents of free space exist for this block group (regardless of which format is being used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys with no corresponding item, and bitmaps instead have the FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an array of bytes. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: refactor caching_thread()Omar Sandoval
We're also going to load the free space tree from caching_thread(), so we should refactor some of the common code. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add helpers for read-only compat bitsOmar Sandoval
We're finally going to add one of these for the free space tree, so let's add the same nice helpers that we have for the incompat bits. While we're add it, also add helpers to clear the bits. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add extent buffer bitmap sanity testsOmar Sandoval
Sanity test the extent buffer bitmap operations (test, set, and clear) against the equivalent standard kernel operations. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add extent buffer bitmap operationsOmar Sandoval
These are going to be used for the free space tree bitmap items. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17f2fs: add a tracepoint for sync_dirty_inodesChao Yu
This patch adds a tracepoint for sync_dirty_inodes. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-17f2fs: optimize the flow of f2fs_map_blocksFan Li
check map->m_len right after it changes to avoid excess call to update dnode_of_data. Signed-off-by: Fan li <fanofcode.li@samsung.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-17f2fs: support data flush in backgroundChao Yu
Previously, when finishing a checkpoint, we have persisted all fs meta info including meta inode, node inode, dentry page of directory inode, so, after a sudden power cut, f2fs can recover from last checkpoint with full directory structure. But during checkpoint, we didn't flush dirty pages of regular and symlink inode, so such dirty datas still in memory will be lost in that moment of power off. In order to reduce the chance of lost data, this patch enables f2fs_balance_fs_bg with the ability of data flushing. It will try to flush user data before starting a checkpoint. So user's data written after last checkpoint which may not be fsynced could be saved. When we mount with data_flush option, after every period of cp_interval (could be configured in sysfs: /sys/fs/f2fs/device/cp_interval) seconds user data could be flushed into device once f2fs_balance_fs_bg was called in kworker thread or gc thread. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-17f2fs: stat dirty regular/symlink inodesChao Yu
Add to stat dirty regular and symlink inode for showing in debugfs. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-17Btrfs: fix leaking of ordered extents after direct IO write errorFilipe Manana
When doing a direct IO write, __blockdev_direct_IO() can call the btrfs_get_blocks_direct() callback one or more times before it calls the btrfs_submit_direct() callback. However it can fail after calling the first callback and before calling the second callback, which is a problem because the first one creates ordered extents and the second one is the one that submits bios that cover the ordered extents created by the first one. That means the ordered extents will never complete nor have any of the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in subsequent operations (such as other direct IO writes, buffered writes or hole punching) that lock the same IO range and lookup for ordered extents in the range to hang forever waiting for those ordered extents because they can not complete ever, since no bio was submitted. Fix this by tracking a range of created ordered extents that don't have yet corresponding bios submitted and completing the ordered extents in the range if __blockdev_direct_IO() fails with an error. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17Btrfs: fix deadlock between direct IO write and defrag/readpagesFilipe Manana
If readpages() (triggered by defrag or buffered reads) is called while a direct IO write is in progress, we have a small time window where we can deadlock, resulting in traces like the following being generated: [84723.212993] INFO: task fio:2849 blocked for more than 120 seconds. [84723.214310] Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84723.217313] fio D ffff88023ec75218 0 2849 2835 0x00000000 [84723.218778] ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200 [84723.220458] ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff [84723.230597] 0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a [84723.232085] Call Trace: [84723.232625] [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c [84723.233529] [<ffffffff8147856a>] schedule+0x7d/0x95 [84723.234398] [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b [84723.235384] [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28 [84723.236426] [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf [84723.237502] [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d [84723.238807] [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab [84723.242012] [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf [84723.243064] [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33 [84723.244116] [<ffffffff810afa2e>] ? ktime_get+0x41/0x52 [84723.245029] [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b [84723.245942] [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b [84723.246596] [<ffffffff81478953>] bit_wait_io+0x39/0x45 [84723.247503] [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d [84723.248540] [<ffffffff8111684f>] __lock_page+0x66/0x68 [84723.249558] [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a [84723.250844] [<ffffffff81124a04>] lock_page+0x2c/0x2f [84723.251871] [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa [84723.253274] [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146 [84723.254757] [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15 [84723.256378] [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs] [84723.258556] [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111 [84723.260064] [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb [84723.261479] [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs] [84723.262961] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [84723.264449] [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33 [84723.265614] [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33 [84723.266769] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [84723.268264] [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs] [84723.270954] [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [84723.272465] [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128 [84723.273734] [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs] [84723.275101] [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5 [84723.276200] [<ffffffff8116cfab>] vfs_write+0xa0/0xe4 [84723.277298] [<ffffffff8116d79d>] SyS_write+0x50/0x7e [84723.278327] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [84723.279595] INFO: lockdep is turned off. [84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds. [84723.380323] Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84723.383003] btrfs D ffff88023ed75218 0 2923 2859 0x00000000 [84723.384277] ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200 [84723.385748] ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30 [84723.387152] ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a [84723.388620] Call Trace: [84723.389105] [<ffffffff8147856a>] schedule+0x7d/0x95 [84723.391882] [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs] [84723.393718] [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31 [84723.395659] [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs] [84723.397383] [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs] [84723.398852] [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs] [84723.400561] [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72 [84723.401787] [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs] [84723.403121] [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs] [84723.404583] [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs] [84723.406007] [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4 [84723.407502] [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e [84723.408937] [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e [84723.410487] [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f [84723.411710] [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs] [84723.413007] [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs] [84723.414085] [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs] [84723.415307] [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs] [84723.416532] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [84723.417731] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [84723.418699] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [84723.421532] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7 [84723.422629] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174 [84723.423712] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6 [84723.424801] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [84723.425968] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [84723.427063] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [84723.428138] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f Consider the following logical and physical file layout: logical: ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ... 4K 8K 16K physical: ... 12853248 12857344 1103101952 ... (= 12853248 + 4K) Extents A and B are physically adjacent. The following diagram shows a sequence of events that lead to the deadlock when we attempt to do a direct IO write against the file range [4K, 16K[ and a defrag is triggered simultaneously. CPU 1 CPU 2 btrfs_direct_IO() btrfs_get_blocks_direct() creates ordered extent A, covering the 4k prealloc extent A (range [4K, 8K[) btrfs_defrag_file() page_cache_sync_readahead([0K, 1M[) btrfs_readpages() extent_readpages() locks all pages in the file range [0K, 128K[ through calls to add_to_page_cache_lru() __do_contiguous_readpages() finds ordered extent A waits for it to complete btrfs_get_blocks_direct() called again lock_extent_direct(range [8K, 16K[) finds a page in range [8K, 16K[ through btrfs_page_exists_in_range() invalidate_inode_pages2_range([8K, 16K[) --> tries to lock pages that are already locked by the task at CPU 2 --> our task, running __blockdev_direct_IO(), hangs waiting to lock the pages and the submit bio callback, btrfs_submit_direct(), ends up never being called, resulting in the ordered extent A never completing (because a corresponding bio is never submitted) and CPU 2 will wait for it forever while holding the pages locked ---> deadlock! Fix this by removing the page invalidation approach when attempting to lock the range for IO from the callback btrfs_get_blocks_direct() and falling back buffered IO. This was a rare case anyway and well behaved applications do not mix concurrent direct IO writes with buffered reads anyway, being a concurrent defrag the only normal case that could lead to the deadlock. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17Btrfs: fix error path when failing to submit bio for direct IO writeFilipe Manana
Commit 61de718fceb6 ("Btrfs: fix memory corruption on failure to submit bio for direct IO") fixed problems with the error handling code after we fail to submit a bio for direct IO. However there were 2 problems that it did not address when the failure is due to memory allocation failures for direct IO writes: 1) We considered that there could be only one ordered extent for the whole IO range, which is not always true, as we can have multiple; 2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent, which can make other tasks running btrfs_wait_logged_extents() hang forever, since they wait for that bit to be set. The general assumption is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set and it precedes setting the bit BTRFS_ORDERED_COMPLETE. Fix these issues by moving part of the btrfs_endio_direct_write() handler into a new helper function and having that new helper function called when we fail to allocate memory to submit the bio (and its private object) for a direct IO write. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-17Btrfs: fix memory leaks after transaction is abortedFilipe Manana
When a transaction is aborted, or its commit fails before writing the new superblock and calling btrfs_finish_extent_commit(), we leak reference counts on the block groups attached to the transaction's delete_bgs list, because btrfs_finish_extent_commit() is never called for those two cases. Fix this by dropping their references at btrfs_put_transaction(), which is called when transactions are aborted (by making the transaction kthread commit the transaction) or if their commits fail. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-17Btrfs: fix race when finishing dev replace leading to transaction abortFilipe Manana
During the final phase of a device replace operation, I ran into a transaction abort that resulted in the following trace: [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]() [23919.664742] BTRFS: Transaction aborted (error -2) [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs] [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1 [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98 [23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48 [23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0 [23919.696716] Call Trace: [23919.698669] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [23919.700597] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [23919.701958] [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] [23919.703612] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [23919.705047] [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] [23919.706967] [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs] [23919.708611] [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs] [23919.710099] [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs] [23919.711970] [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs] [23919.713602] [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194 [23919.714756] [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78 [23919.716155] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0 [23919.718918] [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0 [23919.724170] [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff [23919.725482] [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b [23919.726790] [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6 [23919.728428] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [23919.729642] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [23919.730782] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [23919.731847] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [23919.733330] ---[ end trace 166ef301a335832a ]--- This is due to a race between device replace and chunk allocation, which the following diagram illustrates: CPU 1 CPU 2 btrfs_dev_replace_finishing() at this point dev_replace->tgtdev->devid == BTRFS_DEV_REPLACE_DEVID (0ULL) ... btrfs_start_transaction() btrfs_commit_transaction() btrfs_fallocate() btrfs_alloc_data_chunk_ondemand() btrfs_join_transaction() --> starts a new transaction do_chunk_alloc() lock fs_info->chunk_mutex btrfs_alloc_chunk() --> creates extent map for the new chunk with em->bdev->map->stripes[i]->dev->devid == X (X > 0) --> extent map is added to fs_info->mapping_tree --> initial phase of bg A allocation completes unlock fs_info->chunk_mutex lock fs_info->chunk_mutex btrfs_dev_replace_update_device_in_mapping_tree() --> iterates fs_info->mapping_tree and replaces the device in every extent map's map->stripes[] with dev_replace->tgtdev, which still has an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) btrfs_end_transaction() btrfs_create_pending_block_groups() --> starts final phase of bg A creation (update device, extent, and chunk trees, etc) btrfs_finish_chunk_alloc() btrfs_update_device() --> attempts to update a device item with ID == 0ULL (BTRFS_DEV_REPLACE_DEVID) which is the current ID of bg A's em->bdev->map->stripes[i]->dev->devid --> doesn't find such item returns -ENOENT --> the device id should have been X and not 0ULL got -ENOENT from btrfs_finish_chunk_alloc() and aborts current transaction finishes setting up the target device, namely it sets tgtdev->devid to the value of srcdev->devid, which is X (and X > 0) frees the srcdev unlock fs_info->chunk_mutex So fix this by taking the device list mutex when processing the chunk's extent map stripes to update the device items. This avoids getting the wrong device id and use-after-free problems if the task finishing a chunk allocation grabs the replaced device, which is freed while the dev replace task is holding the device list mutex. This happened while running fstest btrfs/071. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-16f2fs: introduce new option for controlling data flushChao Yu
Add a new option 'data_flush' to enable data flush functionality. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-16f2fs: record dirty status of regular/symlink inodeChao Yu
Maintain regular/symlink inode which has dirty pages in global dirty list and record their total dirty pages count like the way of handling directory inode. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-16f2fs: introduce __f2fs_commit_superChao Yu
Introduce __f2fs_commit_super to include duplicated codes in f2fs_commit_super for cleanup. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-16f2fs: relocate tracepoint of write_checkpointJaegeuk Kim
It needs to relocate its location to see exact trace logs. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-16f2fs: don't grab super block buffer header all the timeChao Yu
We have already got one copy of valid super block in memory, do not grab buffer header of super block all the time. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-16f2fs: backup raw_super in sbiYunlei He
f2fs use fields of f2fs_super_block struct directly in a grabbed buffer. Once the buffer happen to be destroyed (e.g. through dd), it may bring in unpredictable effect on f2fs. This patch fixes to allocate additional buffer to store datas of super block rather than using grabbed block buffer directly. Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-16f2fs: fix to reset variable correctllyFan Li
f2fs_map_blocks will set m_flags and m_len to 0, so we don't need to reset m_flags ourselves, but have to reset m_len to correct value before use it again. Signed-off-by: Fan li <fanofcode.li@samsung.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-16nfsd: don't hold ls_mutex across a layout recallJeff Layton
We do need to serialize layout stateid morphing operations, but we currently hold the ls_mutex across a layout recall which is pretty ugly. It's also unnecessary -- once we've bumped the seqid and copied it, we don't need to serialize the rest of the CB_LAYOUTRECALL vs. anything else. Just drop the mutex once the copy is done. This was causing a "workqueue leaked lock or atomic" warning and an occasional deadlock. There's more work to be done here but this fixes the immediate regression. Fixes: cc8a55320b5f "nfsd: serialize layout stateid morphing operations" Cc: stable@vger.kernel.org Reported-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-12-15f2fs: introduce __remove_dirty_inodeChao Yu
Introduce __remove_dirty_inode to clean up codes in remove_dirty_dir_inode. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-15f2fs: introduce dirty list node in inode infoChao Yu
Add a new dirt list node member in inode info for linking the inode to global dirty list in superblock, instead of old implementation which allocate slab cache memory as an entry to inode. It avoids memory pressure due to slab cache allocation, and also makes codes more clean. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-15f2fs: rename {add,remove,release}_dirty_inode to {add,remove,release}_ino_entryChao Yu
remove_dirty_dir_inode will be renamed to remove_dirty_inode as a generic function in following patch for removing directory/regular/symlink inode in global dirty list. Here rename ino management related functions for readability, also in order to avoid name conflict. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-15Merge branch 'for-chris-4.4' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.4
2015-12-15Btrfs: check prepare_uptodate_page() error code earlierChris Mason
prepare_pages() may end up calling prepare_uptodate_page() twice if our write only spans a single page. But if the first call returns an error, our page will be unlocked and its not safe to call it again. This bug goes all the way back to 2011, and it's not something commonly hit. While we're here, add a more explicit check for the page being truncated away. The bare lock_page() alone is protected only by good thoughts and i_mutex, which we're sure to regret eventually. Reported-by: Dave Jones <dsj@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-15Btrfs: check for empty bitmap list in setup_cluster_bitmapsChris Mason
Dave Jones found a warning from kasan in setup_cluster_bitmaps() ================================================================== BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at addr ffff88039bef6828 Read of size 8 by task nfsd/1009 page:ffffea000e6fbd80 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x8000000000000000() page dumped because: kasan: bad access detected CPU: 1 PID: 1009 Comm: nfsd Tainted: G W 4.4.0-rc3-backup-debug+ #1 ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e 0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0 ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280 Call Trace: [<ffffffffa680a43e>] dump_stack+0x4b/0x6d [<ffffffffa62638d1>] kasan_report_error+0x501/0x520 [<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0 [<ffffffffa6263948>] kasan_report+0x58/0x60 [<ffffffffa6814b00>] ? rb_last+0x10/0x40 [<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0 [<ffffffffa6262ead>] __asan_load8+0x5d/0x70 [<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0 [<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400 [<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640 [<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0 [<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0 [<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40 [<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520 Andrey noticed this was because we were doing list_first_entry on a list that might be empty. Rework the tests a bit so we don't do that. Signed-off-by: Chris Mason <clm@fb.com> Reprorted-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Reported-by: Dave Jones <dsj@fb.com>
2015-12-14f2fs: do more integrity verification for superblockChao Yu
Do more sanity check for superblock during ->mount. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-12-14udf: limit the maximum number of TD redirectionsVegard Nossum
Filesystem fuzzing revealed that we could get stuck in the udf_process_sequence() loop. The maximum limit was chosen arbitrarily but fixes the problem I saw. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
2015-12-14gfs2: clear journal live bit in gfs2_log_flushBenjamin Marzinski
When gfs2 was unmounting filesystems or changing them to read-only it was clearing the SDF_JOURNAL_LIVE bit before the final log flush. This caused a race. If an inode glock got demoted in the gap between clearing the bit and the shutdown flush, it would be unable to reserve log space to clear out the active items list in inode_go_sync, causing an error in inode_go_inval because the glock was still dirty. To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the shutdown log flush. This means that, because of the locking on the log blocks, either inode_go_sync will be able to reserve space to clean the glock before the shutdown flush, or the shutdown flush will clean the glock itself, before inode_go_sync fails to reserve the space. Either way, the glock will be clean before inode_go_inval. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14gfs2: change gfs2 readdir cookieBenjamin Marzinski
gfs2 currently returns 31 bits of filename hash as a cookie that readdir uses for an offset into the directory. When there are a large number of directory entries, the likelihood of a collision goes up way too quickly. GFS2 will now return cookies that are guaranteed unique for a while, and then fail back to using 30 bits of filename hash. Specifically, the directory leaf blocks are divided up into chunks based on the minimum size of a gfs2 directory entry (48 bytes). Each entry's cookie is based off the chunk where it starts, in the linked list of leaf blocks that it hashes to (there are 131072 hash buckets). Directory entries will have unique names until they take reach chunk 8192. Assuming the largest filenames possible, and the least efficient spacing possible, this new method will still be able to return unique names when the previous method has statistically more than a 99% chance of a collision. The non-unique names it fails back to are guaranteed to not collide with the unique names. unique cookies will be in this format: - 1 bit "0" to make sure the the returned cookie is positive - 17 bits for the hash table index - 1 bit for the mode "0" - 13 bits for the offset non-unique cookies will be in this format: - 1 bit "0" to make sure the the returned cookie is positive - 17 bits for the hash table index - 1 bit for the mode "1" - 13 more bits of the name hash Another benefit of location based cookies, is that once a directory's exhash table is fully extended (so that multiple hash table indexs do not use the same leaf blocks), gfs2 can skip sorting the directory entries until it reaches the non-unique ones, and then it only needs to sort these. This provides a significant speed up for directory reads of very large directories. The only issue is that for these cookies to continue to point to the correct entry as files are added and removed from the directory, gfs2 must keep the entries at the same offset in the leaf block when they are split (see my previous patch). This means that until all the nodes in a cluster are running with code that will split the directory leaf blocks this way, none of the nodes can use the new cookie code. To deal with this, gfs2 now has the mount option loccookie, which, if set, will make it return these new location based cookies. This option must not be set until all nodes in the cluster are at least running this version of the kernel code, and you have guaranteed that there are no outstanding cookies required by other software, such as NFS. gfs2 uses some of the extra space at the end of the gfs2_dirent structure to store the calculated readdir cookies. This keeps us from needing to allocate a seperate array to hold these values. gfs2 recomputes the cookie stored in de_cookie for every readdir call. The time it takes to do so is small, and if gfs2 expected this value to be saved on disk, the new code wouldn't work correctly on filesystems created with an earlier version of gfs2. One issue with adding de_cookie to the union in the gfs2_dirent structure is that it caused the union to align itself to a 4 byte boundary, instead of its previous 2 byte boundary. This changed the offset of de_rahead. To solve that, I pulled de_rahead out of the union, since it does not need to be there. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>