summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-11-02NFS: Enable client side NFSv4.1 backchannel to use other transportsChuck Lever
Forechannel transports get their own "bc_up" method to create an endpoint for the backchannel service. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Anna Schumaker: Add forward declaration of struct net to xprt.h] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02pNFS/flexfiles: Add support for FF_FLAGS_NO_IO_THRU_MDSTrond Myklebust
For loosely coupled pNFS/flexfiles systems, there is often no advantage at all in going through the MDS for I/O, since the MDS is subject to the same limitations as all other clients when talking to DSes. If a DS is unresponsive, I/O through the MDS will fail. For such systems, the only scalable solution is to have the pNFS clients retry doing pNFS, and so the protocol now provides a flag that allows the pNFS server to signal this. If LAYOUTGET returns FF_FLAGS_NO_IO_THRU_MDS, then we should assume that the MDS wants the client to retry using these devices, even if they were previously marked as being unavailable. To do so, we add a helper, ff_layout_mark_devices_valid() that will be called from layoutget. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-02pNFS/flexfiles: When mirrored, retry failed reads by switching mirrorsTrond Myklebust
If the pNFS/flexfiles file is mirrored, and a read to one mirror fails, then we should bump the mirror index, so that we retry to a different mirror. Once we've iterated through all mirrors and all failed, we can return the layout and issue a new LAYOUTGET. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-02xfs: clear PF_NOFREEZE for xfsaild kthreadJiri Kosina
Since xfsaild has been converted to kthread in 0030807c, it calls try_to_freeze() during every AIL push iteration. It however doesn't set itself as freezable, and therefore this try_to_freeze() will never do anything. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark xfsaild freezable (as it can generate I/O during suspend). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-01mm: get rid of 'vmalloc_info' from /proc/meminfoLinus Torvalds
It turns out that at least some versions of glibc end up reading /proc/meminfo at every single startup, because glibc wants to know the amount of memory the machine has. And while that's arguably insane, it's just how things are. And it turns out that it's not all that expensive most of the time, but the vmalloc information statistics (amount of virtual memory used in the vmalloc space, and the biggest remaining chunk) can be rather expensive to compute. The 'get_vmalloc_info()' function actually showed up on my profiles as 4% of the CPU usage of "make test" in the git source repository, because the git tests are lots of very short-lived shell-scripts etc. It turns out that apparently this same silly vmalloc info gathering shows up on the facebook servers too, according to Dave Jones. So it's not just "make test" for git. We had two patches to just cache the information (one by me, one by Ingo) to mitigate this issue, but the whole vmalloc information of of rather dubious value to begin with, and people who *actually* want to know what the situation is wrt the vmalloc area should just look at the much more complete /proc/vmallocinfo instead. In fact, according to my testing - and perhaps more importantly, according to that big search engine in the sky: Google - there is nothing out there that actually cares about those two expensive fields: VmallocUsed and VmallocChunk. So let's try to just remove them entirely. Actually, this just removes the computation and reports the numbers as zero for now, just to try to be minimally intrusive. If this breaks anything, we'll obviously have to re-introduce the code to compute this all and add the caching patches on top. But if given the option, I'd really prefer to just remove this bad idea entirely rather than add even more code to work around our historical mistake that likely nobody really cares about. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-01Merge branch 'fs-file-descriptor-optimization'Linus Torvalds
Merge file descriptor allocation speedup. Eric Dumazet has a test-case for a fairly common network deamon load pattern: openign and closing a lot of sockets that each have very little work done on them. It turns out that in that case, the cost of just finding the correct file descriptor number can be a dominating factor. We've long had a trivial optimization for allocating file descriptors sequentially, but that optimization ends up being not very effective when other file descriptors are being closed concurrently, and the fd patterns are not some simple FIFO pattern. In such cases we ended up spending a lot of time just scanning the bitmap of open file descriptors in order to find the next file descriptor number to open. This trivial patch-series mitigates that by simply introducing a second-level bitmap of which words in the first bitmap are already fully allocated. That cuts down the cost of scanning by an order of magnitude in some pathological (but realistic) cases. The second patch is an even more trivial patch to avoid unnecessarily dirtying the cacheline for the close-on-exec bit array that normally ends up being all empty. * fs-file-descriptor-optimization: vfs: conditionally clear close-on-exec flag vfs: Fix pathological performance case for __alloc_fd()
2015-10-31vfs: conditionally clear close-on-exec flagLinus Torvalds
We clear the close-on-exec flag when opening and closing files, and the bit was almost always already clear before. Avoid dirtying the cacheline if the clearning isn't necessary. That avoids unnecessary cacheline dirtying and bouncing in multi-socket environments. Eric Dumazet has a file descriptor benchmark that goes 4% faster from this on his two-socket machine. It's probably partly superlinear improvement due to getting slightly less spinlock contention on the file_lock spinlock due to less work in the critical section. Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31vfs: Fix pathological performance case for __alloc_fd()Linus Torvalds
Al Viro points out that: > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > with some use patterns. The cacheline pingpong in the bitmap is probably > > inevitable, unless we accept considerably heavier memory footprint, > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > to trigger - close(3);open(...); will have the next open() after that > > scanning the entire in-use bitmap. And Eric Dumazet has a somewhat realistic multithreaded microbenchmark that opens and closes a lot of sockets with minimal work per socket. This patch largely fixes it. We keep a 2nd-level bitmap of the open file bitmaps, showing which words are already full. So then we can traverse that second-level bitmap to efficiently skip already allocated file descriptors. On his benchmark, this improves performance by up to an order of magnitude, by avoiding the excessive open file bitmap scanning. Tested-and-acked-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs bug fixes from Miklos Szeredi: "This contains fixes for bugs that appeared in earlier kernels (all are marked for -stable)" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: free lower_mnt array in ovl_put_super ovl: free stack of paths in ovl_fill_super ovl: fix open in stacked overlay ovl: fix dentry reference leak ovl: use O_LARGEFILE in ovl_copy_up()
2015-10-29fs/ext4: remove unnecessary new_valid_dev checkYaowei Bai
As new_valid_dev always returns 1, so !new_valid_dev check is not needed, remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-29gfs2: Remove gl_spin defineAndreas Gruenbacher
Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in gl_lockref). Remove that define to make the references to gl_lockref.lock more obvious. Signed-off-by: Andreas Gruenbacher <andreas.gruenbacher@gmail.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-10-28fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in ↵Tejun Heo
bdi_split_work_to_wbs() bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue() to walk @bdi->wb_list. To set up the initial iteration condition, it uses list_entry_rcu() to calculate the entry pointer corresponding to the list head; however, this isn't an actual RCU dereference and using list_entry_rcu() for it ended up breaking a proposed list_entry_rcu() change because it was feeding an non-lvalue pointer into the macro. Don't use the RCU variant for simple pointer offsetting. Use list_entry() instead. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Darren Hart <dvhart@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Patrick Marlier <patrick.marlier@gmail.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pranith kumar <bobby.prani@gmail.com> Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-27namei: permit linking with CAP_FOWNER in usernsDirk Steinmetz
Attempting to hardlink to an unsafe file (e.g. a setuid binary) from within an unprivileged user namespace fails, even if CAP_FOWNER is held within the namespace. This may cause various failures, such as a gentoo installation within a lxc container failing to build and install specific packages. This change permits hardlinking of files owned by mapped uids, if CAP_FOWNER is held for that namespace. Furthermore, it improves consistency by using the existing inode_owner_or_capable(), which is aware of namespaced capabilities as of 23adbe12ef7d3 ("fs,userns: Change inode_capable to capable_wrt_inode_uidgid"). Signed-off-by: Dirk Steinmetz <public@rsjtdrjgfuzkfg.com> This is hitting us in Ubuntu during some dpkg upgrades in containers. When upgrading a file dpkg creates a hard link to the old file to back it up before overwriting it. When packages upgrade suid files owned by a non-root user the link isn't permitted, and the package upgrade fails. This patch fixes our problem. Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2015-10-26btrfs: qgroup: Fix a rebase bug which will cause qgroup double freeQu Wenruo
When rebasing my patchset, I forgot to pick up a cleanup patch to remove old hotfix in 4.2 release. Witouth the cleanup, it will screw up new qgroup reserve framework and always cause minus reserved number. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: qgroup: Fix a race in delayed_ref which leads to abort transQu Wenruo
Between btrfs_allocerved_file_extent() and btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs are run and delayed ref head maybe freed before btrfs_add_delayed_qgroup_reserve(). This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT, and cause transaction to be aborted. This patch will record qgroup reserve space info into delayed_ref_head at btrfs_add_delayed_ref(), to eliminate the race window. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: clear PF_NOFREEZE in cleaner_kthread()Jiri Kosina
cleaner_kthread() kthread calls try_to_freeze() at the beginning of every cleanup attempt. This operation can't ever succeed though, as the kthread hasn't marked itself as freezable. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark cleaner_kthread() freezable (as my understanding is that it can generate filesystem I/O during suspend). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: qgroup: Don't copy extent buffer to do qgroup rescanQu Wenruo
Ancient qgroup code call memcpy() on a extent buffer and use it for leaf iteration. As extent buffer contains lock, pointers to pages, it's never sane to do such copy. The following bug may be caused by this insane operation: [92098.841309] general protection fault: 0000 [#1] SMP [92098.841338] Modules linked in: ... [92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted 4.3.0-rc1 #1 [92098.841868] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs] [92098.842261] Call Trace: [92098.842277] [<ffffffffc035a5d8>] ? read_extent_buffer+0xb8/0x110 [btrfs] [92098.842304] [<ffffffffc0396d00>] ? btrfs_find_all_roots+0x60/0x70 [btrfs] [92098.842329] [<ffffffffc039af3d>] btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs] Where btrfs_qgroup_rescan_worker+0x28d is btrfs_disk_key_to_cpu(), called in reading key from the copied extent_buffer. This patch will use btrfs_clone_extent_buffer() to a better copy of extent buffer to deal such case. Reported-by: Stephane Lesimple <stephane_btrfs@lesimple.fr> Suggested-by: Filipe Manana <fdmanana@kernel.org> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: add balance filters limits, stripes and usage to supported maskDavid Sterba
Enable the extended 'limit' syntax (a range), the new 'stripes' and extended 'usage' syntax (a range) filters in the filters mask. The patch comes separate and not within the series that introduced the new filters because the patch adding the mask was merged in a late rc. The integration branch was based on an older rc and could not merge the patch due to the missing changes. Prerequisities: * btrfs: check unsupported filters in balance arguments * btrfs: extend balance filter limit to take minimum and maximum * btrfs: add balance filter for stripes * btrfs: extend balance filter usage to take minimum and maximum Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: extend balance filter usage to take minimum and maximumDavid Sterba
Similar to the 'limit' filter, we can enhance the 'usage' filter to accept a range. The change is backward compatible, the range is applied only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag. We don't have a usecase yet, the current syntax has been sufficient. The enhancement should provide parity with other range-like filters. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: add balance filter for stripesGabríel Arthúr Pétursson
Balance block groups which have the given number of stripes, defined by a range min..max. This is useful to selectively rebalance only chunks that do not span enough devices, applies to RAID0/10/5/6. Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is> [ renamed bargs members, added to the UAPI, wrote the changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: extend balance filter limit to take minimum and maximumDavid Sterba
The 'limit' filter is underdesigned, it should have been a range for [min,max], with some relaxed semantics when one of the bounds is missing. Besides that, using a full u64 for a single value is a waste of bytes. Let's fix both by extending the use of the u64 bytes for the [min,max] range. This can be done in a backward compatible way, the range will be interpreted only if the appropriate flag is set (BTRFS_BALANCE_ARGS_LIMIT_RANGE). Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: fix use after free iterating extrefsChris Mason
The code for btrfs inode-resolve has never worked properly for files with enough hard links to trigger extrefs. It was trying to get the leaf out of a path after freeing the path: btrfs_release_path(path); leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, slot); The fix here is to use the extent buffer we cloned just a little higher up to avoid deadlocks caused by using the leaf in the path. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v3.7+ cc: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26btrfs: check unsupported filters in balance argumentsDavid Sterba
We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26fs/binfmt_elf_fdpic.c: fix brk area overlap with stack on NOMMURich Felker
On NOMMU archs, the FDPIC ELF loader sets up the usable brk range to overlap with all but the last PAGE_SIZE bytes of the stack. This leads to catastrophic memory reuse/corruption if brk is used. Fix by setting the brk area to zero size to disable its use. Signed-off-by: Rich Felker <dalias@libc.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Ungerer <gerg@uclinux.org>
2015-10-25Btrfs: fix regression running delayed references when using qgroupsFilipe Manana
In the kernel 4.2 merge window we had a big changes to the implementation of delayed references and qgroups which made the no_quota field of delayed references not used anymore. More specifically the no_quota field is not used anymore as of: commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.") Leaving the no_quota field actually prevents delayed references from getting merged, which in turn cause the following BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are enabled: static int run_delayed_tree_ref(...) { (...) BUG_ON(node->ref_mod != 1); (...) } This happens on a scenario like the following: 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota. 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref2 is incompatible due to Ref2->no_quota != Ref3->no_quota. 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref3 is incompatible due to Ref3->no_quota != Ref4->no_quota. 5) We run delayed references, trigger merging of delayed references, through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs(). 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and all other conditions are satisfied too. So Ref1 gets a ref_mod value of 2. 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and all other conditions are satisfied too. So Ref2 gets a ref_mod value of 2. 8) Ref1 and Ref2 aren't merged, because they have different values for their no_quota field. 9) Delayed reference Ref1 is picked for running (select_delayed_ref() always prefers references with an action == BTRFS_ADD_DELAYED_REF). So run_delayed_tree_ref() is called for Ref1 which triggers the BUG_ON because Ref1->red_mod != 1 (equals 2). So fix this by removing the no_quota field, as it's not used anymore as of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism."). The use of no_quota was also buggy in at least two places: 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting no_quota to 0 instead of 1 when the following condition was true: is_fstree(ref_root) || !fs_info->quota_enabled 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to reset a node's no_quota when the condition "!is_fstree(root_objectid) || !root->fs_info->quota_enabled" was true but we did it only in an unused local stack variable, that is, we never reset the no_quota value in the node itself. This fixes the remainder of problems several people have been having when running delayed references, mostly while a balance is running in parallel, on a 4.2+ kernel. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Also, this fixes deadlock issue when using the clone ioctl with qgroups enabled, as reported by Elias Probst in the mailing list. The deadlock happens because after calling btrfs_insert_empty_item we have our path holding a write lock on a leaf of the fs/subvol tree and then before releasing the path we called check_ref() which did backref walking, when qgroups are enabled, and tried to read lock the same leaf. The trace for this case is the following: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds. (...) Call Trace: [<ffffffff86999201>] schedule+0x74/0x83 [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea [<ffffffff86137ed7>] ? wait_woken+0x74/0x74 [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810 [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127 [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667 [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6 [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0 [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65 [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88 [<ffffffff863e852e>] check_ref+0x64/0xc4 [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77 [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb (...) The problem goes away by eleminating check_ref(), which no longer is needed as its purpose was to get a value for the no_quota field of a delayed reference (this patch removes the no_quota field as mentioned earlier). Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-10-25Btrfs: fix regression when running delayed referencesFilipe Manana
In the kernel 4.2 merge window we had a refactoring/rework of the delayed references implementation in order to fix certain problems with qgroups. However that rework introduced one more regression that leads to the following trace when running delayed references for metadata: [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832! [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2 [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000 [35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs] [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293 [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000 [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000 [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8 [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000 [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000 [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000 [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0 [35908.065201] Stack: [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000 [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8 [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000 [35908.065201] Call Trace: [35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs] [35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs] [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs] [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs] [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs] [35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs] [35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs] [35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs] [35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs] [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7 [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31 [35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs] [35908.065201] RSP <ffff88010c4cbb08> [35908.310885] ---[ end trace fe4299baf0666457 ]--- This happens because the new delayed references code no longer merges delayed references that have different sequence values. The following steps are an example sequence leading to this issue: 1) Transaction N starts, fs_info->tree_mod_seq has value 0; 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for bytenr A is created, with a value of 1 and a seq value of 0; 3) fs_info->tree_mod_seq is incremented to 1; 4) Extent buffer A is deleted through btrfs_del_items(), which calls btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The later returns the metadata extent associated to extent buffer A to the free space cache (the range is not pinned), because the extent buffer was created in the current transaction (N) and writeback never happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set in the extent buffer). This creates the delayed reference Ref2 for bytenr A, with a value of -1 and a seq value of 1; 5) Delayed reference Ref2 is not merged with Ref1 when we create it, because they have different sequence numbers (decided at add_delayed_ref_tail_merge()); 6) fs_info->tree_mod_seq is incremented to 2; 7) Some task attempts to allocate a new extent buffer (done at extent-tree.c:find_free_extent()), but due to heavy fragmentation and running low on metadata space the clustered allocation fails and we fall back to unclustered allocation, which finds the extent at offset A, so a new extent buffer at offset A is allocated. This creates delayed reference Ref3 for bytenr A, with a value of 1 and a seq value of 2; 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they all have different seq values; 9) We start running the delayed references (__btrfs_run_delayed_refs()); 10) The delayed Ref1 is the first one being applied, which ends up creating an inline extent backref in the extent tree; 10) Next the delayed reference Ref3 is selected for execution, and not Ref2, because select_delayed_ref() always gives a preference for positive references (that have an action of BTRFS_ADD_DELAYED_REF); 11) When running Ref3 we encounter alreay the inline extent backref in the extent tree at insert_inline_extent_backref(), which makes us hit the following BUG_ON: BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID); This is always true because owner corresponds to the level of the extent buffer/btree node in the btree. For the scenario described above we hit the BUG_ON because we never merge references that have different seq values. We used to do the merging before the 4.2 kernel, more specifically, before the commmits: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.") This issue became more exposed after the following change that was added to 4.2 as well: cffc3374e567 ("Btrfs: fix order by which delayed references are run") Which in turn fixed another regression by the two commits previously mentioned. So fix this by bringing back the delayed reference merge code, with the proper adaptations so that it operates against the new data structure (linked list vs old red black tree implementation). This issue was hit running fstest btrfs/063 in a loop. Several people have reported this issue in the mailing list when running on kernels 4.2+. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-10-24Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer fixes from Jens Axboe: "A final set of fixes for 4.3. It is (again) bigger than I would have liked, but it's all been through the testing mill and has been carefully reviewed by multiple parties. Each fix is either a regression fix for this cycle, or is marked stable. You can scold me at KS. The pull request contains: - Three simple fixes for NVMe, fixing regressions since 4.3. From Arnd, Christoph, and Keith. - A single xen-blkfront fix from Cathy, fixing a NULL dereference if an error is returned through the staste change callback. - Fixup for some bad/sloppy code in nbd that got introduced earlier in this cycle. From Markus Pargmann. - A blk-mq tagset use-after-free fix from Junichi. - A backing device lifetime fix from Tejun, fixing a crash. - And finally, a set of regression/stable fixes for cgroup writeback from Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() NVMe: Fix memory leak on retried commands block: don't release bdi while request_queue has live references nvme: use an integer value to Linux errno values blk-mq: fix use-after-free in blk_mq_free_tag_set() nvme: fix 32-bit build warning writeback: fix incorrect calculation of available memory for memcg domains writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions writeback: bdi_writeback iteration must not skip dying ones writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration nbd: Add locking for tasks xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
2015-10-24Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have two more small fixes this week: Qu's fix avoids unneeded COW during fallocate, and Christian found a memory leak in the error handling of an earlier fix" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix possible leak in btrfs_ioctl_balance() btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
2015-10-23i2c-dev: Fix typo in ioctl name referenceJean Delvare
The ioctl is named I2C_RDWR for "I2C read/write". But references to it were misspelled "rdrw". Fix them. Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2015-10-23nfsd: ensure that seqid morphing operations are atomic wrt to copiesJeff Layton
Bruce points out that the increment of the seqid in stateids is not serialized in any way, so it's possible for racing calls to bump it twice and end up sending the same stateid. While we don't have any reports of this problem it _is_ theoretically possible, and could lead to spurious state recovery by the client. In the current code, update_stateid is always followed by a memcpy of that stateid, so we can combine the two operations. For better atomicity, we add a spinlock to the nfs4_stid and hold that when bumping the seqid and copying the stateid. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23nfsd: serialize layout stateid morphing operationsJeff Layton
In order to allow the client to make a sane determination of what happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we must ensure that the seqids return accurately represent the order of operations. The simplest way to do that is to ensure that operations on a single stateid are serialized. This patch adds a mutex to the layout stateid, and locks it when checking the layout stateid's seqid. The mutex is held over the entire operation and released after the seqid is bumped. Note that in the case of CB_LAYOUTRECALL we must move the increment of the seqid and setting into a new cb "prepare" operation. The lease infrastructure will call the lm_break callback with a spinlock held, so and we can't take the mutex in that codepath. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23nfsd: improve client_has_state to check for unused openownersJ. Bruce Fields
At least in the v4.0 case openowners can hang around for a while after last close, but they shouldn't really block (for example), a new mount with a different principal. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23nfsd: fix clid_inuse on mount with security changeJ. Bruce Fields
In bakeathon testing Solaris client was getting CLID_INUSE error when doing a krb5 mount soon after an auth_sys mount, or vice versa. That's not really necessary since in this case the old client doesn't have any state any more: http://tools.ietf.org/html/rfc7530#page-103 "when the server gets a SETCLIENTID for a client ID that currently has no state, or it has state but the lease has expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST allow the SETCLIENTID and confirm the new client ID if followed by the appropriate SETCLIENTID_CONFIRM." This doesn't fix the problem completely since our client_has_state() check counts openowners left around to handle close replays, which we should probably just remove in this case. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23nfsd: move include of state.h from trace.c to trace.hJeff Layton
Any file which includes trace.h will need to include state.h, even if they aren't using any state tracepoints. Ensure that we include any headers that might be needed in trace.h instead of relying on the *.c files to have the right ones. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23lockd: get rid of reference-counted NSM RPC clientsAndrey Ryabinin
Currently we have reference-counted per-net NSM RPC client which created on the first monitor request and destroyed after the last unmonitor request. It's needed because RPC client need to know 'utsname()->nodename', but utsname() might be NULL when nsm_unmonitor() called. So instead of holding the rpc client we could just save nodename in struct nlm_host and pass it to the rpc_create(). Thus ther is no need in keeping rpc client until last unmonitor request. We could create separate RPC clients for each monitor/unmonitor requests. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23ocfs2/dlm: unlock lockres spinlock before dlm_lockres_putJoseph Qi
dlm_lockres_put will call dlm_lockres_release if it is the last reference, and then it may call dlm_print_one_lock_resource and take lockres spinlock. So unlock lockres spinlock before dlm_lockres_put to avoid deadlock. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-22locks: cleanup posix_lock_inode_wait and flock_lock_inode_waitBenjamin Coddington
All callers use locks_lock_inode_wait() instead. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22Move locks API users to locks_lock_inode_wait()Benjamin Coddington
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct locks API function, use the check within locks_lock_inode_wait(). This allows for some later cleanup. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22locks: introduce locks_lock_inode_wait()Benjamin Coddington
Users of the locks API commonly call either posix_lock_file_wait() or flock_lock_file_wait() depending upon the lock type. Add a new function locks_lock_inode_wait() which will check and call the correct function for the type of lock passed in. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22pstore: Fix return type of pstore_is_mounted()Geliang Tang
This patch changes return type of pstore_is_mounted from int to bool. Signed-off-by: Geliang Tang <geliangtang@163.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-22f2fs: fix to skip shrinking extent nodesChao Yu
In f2fs_shrink_extent_tree we should stop shrink flow if we have already shrunk enough nodes in extent cache. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22f2fs: fix error path of ->symlinkChao Yu
Now, in ->symlink of f2fs, we kept the fixed invoking order between f2fs_add_link and page_symlink since we should init node info firstly in f2fs_add_link, then such node info can be used in page_symlink. But we didn't fix to release meta info which was done before page_symlink in our error path, so this will leave us corrupt symlink entry in its parent's dentry page. Fix this issue by adding f2fs_unlink in the error path for removing such linking. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22f2fs: fix to clear GCed flag for atomic written pageChao Yu
Atomic write page can be GCed, after committing this kind of page, we should clear the GCed flag for it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22pstore: add pstore unregisterGeliang Tang
pstore doesn't support unregistering yet. It was marked as TODO. This patch adds some code to fix it: 1) Add functions to unregister kmsg/console/ftrace/pmsg. 2) Add a function to free compression buffer. 3) Unmap the memory and free it. 4) Add a function to unregister pstore filesystem. Signed-off-by: Geliang Tang <geliangtang@163.com> Acked-by: Kees Cook <keescook@chromium.org> [Removed __exit annotation from ramoops_remove(). Reported by Arnd Bergmann] Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-21f2fs: don't need to submit bio on error caseJaegeuk Kim
If commit_atomic_write is failed, we don't need to submit any bio. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-21f2fs: fix leakage of inmemory atomic pagesJaegeuk Kim
If we got failure during commit_atomic_write, abort_volatile_write will be called, but will not drop the inmemory pages due to no FI_ATOMIC_FILE. Actually, there is no reason to check the flag in abort_volatile_write. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-21Merge branch 'allocator-fixes' into for-linus-4.4Chris Mason
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21Btrfs: don't do extra bitmap search in one bit caseJosef Bacik
When we make ctl->unit allocations from a bitmap there is no point in searching for the next 0 in the bitmap. If we've found a bit we're done and can just exit the loop. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21Btrfs: keep track of largest extent in bitmapsJosef Bacik
We can waste a lot of time searching through bitmaps when we are heavily fragmented trying to find large contiguous areas that don't exist in the bitmap. So keep track of the max extent size when we do a full search of a bitmap so that next time around we can just skip the expensive searching if our max size is less than what we are looking for. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21Btrfs: don't keep trying to build clusters if we are fragmentedJosef Bacik
If we are extremely fragmented then we won't be able to create a free_cluster. So if this happens set last_ptr->fragmented so that all future allcations will give up trying to create a cluster. When we unpin extents we will unset ->fragmented if we free up a sufficient amount of space in a block group. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>