linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2015-11-04	GFS2: Protect freeing directory hash table with i_lock spin_lock	Bob Peterson
	This patch changes function gfs2_dir_hash_inval so it uses the i_lock spin_lock to protect the in-core hash table, i_hash_cache. This will prevent double-frees due to a race between gfs2_evict_inode and inode invalidation. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-03	Merge branch 'core-rcu-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: "The main changes in this cycle were: - Improvements to expedited grace periods (Paul E McKenney) - Performance improvements to and locktorture tests for percpu-rwsem (Oleg Nesterov, Paul E McKenney) - Torture-test changes (Paul E McKenney, Davidlohr Bueso) - Documentation updates (Paul E McKenney) - Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() rcu: Better hotplug handling for synchronize_sched_expedited() rcu: Enable stall warnings for synchronize_rcu_expedited() rcu: Add tasks to expedited stall-warning messages rcu: Add online/offline info to expedited stall warning message rcu: Consolidate expedited CPU selection rcu: Prepare for consolidating expedited CPU selection cpu: Remove try_get_online_cpus() rcu: Stop excluding CPU hotplug in synchronize_sched_expedited() rcu: Stop silencing lockdep false positive for expedited grace periods rcu: Switch synchronize_sched_expedited() to IPI locktorture: Fix module unwind when bad torture_type specified torture: Forgive non-plural arguments rcutorture: Fix unused-function warning for torturing_tasks() rcutorture: Fix module unwind when bad torture_type specified rcu_sync: Cleanup the CONFIG_PROVE_RCU checks locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read() locking/percpu-rwsem: Fix the comments outdated by rcu_sync locking/percpu-rwsem: Make use of the rcu_sync infrastructure locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe ...
2015-11-03	Merge branch 'core-debug-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull wchan kernel address hiding from Ingo Molnar: "This fixes a wchan related information leak in /proc/PID/stat. There's a bit of an ABI twist to it: instead of setting the wchan field to 0 (which is our usual technique) we set it conditionally to a 0/1 flag to keep ABI compatibility with older procps versions that only fetches /proc/PID/wchan (symbolic names) if the absolute wchan address is nonzero" * 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/proc, core/debug: Don't expose absolute kernel addresses via wchan
2015-11-03	nfs: Fix GETATTR bitmap verification	Andreas Gruenbacher
	When decoding GETATTR replies, the client checks the attribute bitmap for which attributes the server has sent. It misses bits at the word boundaries, though; fix that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-03	nfs: Remove unused xdr page offsets in getacl/setacl arguments	Andreas Gruenbacher
	The arguments passed around for getacl and setacl xdr encoding, struct nfs_setaclargs and struct nfs_getaclargs, both contain an array of pages, an offset into the first page, and the length of the page data. The offset is unused as it is always zero; remove it. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-03	fs/nfs: remove unnecessary new_valid_dev check	Yaowei Bai
	As new_valid_dev always returns 1, so !new_valid_dev check is not needed, remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-03	dlm: make posix locks interruptible	Eric Ren
	Replace wait_event_killable with wait_event_interruptible so that a program waiting for a posix lock can be interrupted by a signal. With the killable version, a program was not interruptible by a signal if it had a signal handler set for it, overriding the default action of terminating the process. Signed-off-by: Eric Ren <zren@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>
2015-11-03	Btrfs: fix hole punching when using the no-holes feature	Filipe Manana
	When we are using the no-holes feature, if we punch a hole into a file range that already contains a hole which overlaps the range we are passing to fallocate(), we end up removing the extent map that represents the existing hole without adding a new one. This happens because with the no-holes feature we do not have explicit extent items to represent holes and therefore the call to __btrfs_drop_extents(), made from btrfs_punch_hole(), returns an end offset to the variable drop_end that is smaller than the end of the range passed to fallocate(), while it drops all existing extent maps in that range. Normally having a missing extent map is not a problem, for example for a readpages() operation we just end up building the extent map by looking at the fs/subvol tree for a matching extent item (or a lack of one for implicit holes). However for an fsync that uses the fast path, which needs to look at the list of modified extent maps, this means the fsync will not record information about the complete hole we had before the fallocate() call into the log tree, resulting in a file with content/layout that does not match what we had neither before nor after the hole punch operation. The following test case for fstests reproduces the issue. It fails without this change because we get a file with a different digest after the fsync log replay and also with a different extent/hole layout. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/punch . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_xfs_io_command "fpunch" _require_xfs_io_command "fiemap" _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV # This test was motivated by an issue found in btrfs when the btrfs # no-holes feature is enabled (introduced in kernel 3.14). So enable # the feature if the fs being tested is btrfs. if [ $FSTYP == "btrfs" ]; then _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes" fi rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create out test file with some data and then fsync it. # We do the fsync only to make sure the last fsync we do in this test # triggers the fast code path of btrfs' fsync implementation, a # condition necessary to trigger the bug btrfs had. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \ -c "fsync" \ $SCRATCH_MNT/foobar \| _filter_xfs_io # Now punch a hole against the range [96K, 128K[. $XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar # Punch another hole against a range that overlaps the previous range # and ends beyond eof. $XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar # Punch another hole against a range that overlaps the first range # ([96K, 128K[) and ends at eof. $XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar # Fsync our file. We want to verify that, after a power failure and # mounting the filesystem again, the file content reflects all the hole # punch operations. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar echo "File digest before power failure:" md5sum $SCRATCH_MNT/foobar \| _filter_scratch echo "Fiemap before power failure:" $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar \| _filter_fiemap # Silently drop all writes and umount to simulate a crash/power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again, mount to trigger log replay and validate file # contents. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey echo "File digest after log replay:" # Must match the same digest we got before the power failure. md5sum $SCRATCH_MNT/foobar \| _filter_scratch echo "Fiemap after log replay:" # Must match the same extent listing we got before the power failure. $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar \| _filter_fiemap _unmount_flakey status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-03	Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state	chandan
	When executing generic/001 in a loop on a ppc64 machine (with both sectorsize and nodesize set to 64k), the following call trace is observed, WARNING: at /root/repos/linux/fs/btrfs/locking.c:253 Modules linked in: CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d #54 task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000 NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000 REGS: c0000000f600a820 TRAP: 0700 Not tainted (4.3.0-rc5-13676-ga5e681d) MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI> CR: 24444884 XER: 00000000 CFAR: c0000000004a3b30 SOFTE: 1 GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0 GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005 GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030 GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800 GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049 GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000 GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000 GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0 NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0 LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80 Call Trace: [c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable) [c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80 [c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00 [c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120 [c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620 [c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0 [c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420 [c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030 [c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250 [c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590 [c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780 [c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250 [c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00 [c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0 [c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570 [c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0 [c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30 [c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0 [c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30 [c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0 [c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0 [c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0 [c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100 [c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80 [c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68 Instruction dump: fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040 812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c The above call trace is seen even on x86_64; albeit very rarely and that too with nodesize set to 64k and with nospace_cache mount option being used. The reason for the above call trace is, btrfs_remove_chunk check_system_chunk Allocate chunk if required For each physical stripe on underlying device, btrfs_free_dev_extent ... Take lock on Device tree's root node btrfs_cow_block("dev tree's root node"); btrfs_reserve_extent find_free_extent index = BTRFS_RAID_DUP; have_caching_bg = false; When in LOOP_CACHING_NOWAIT state, Assume we find a block group which is being cached; Hence have_caching_bg is set to true When repeating the search for the next RAID index, we set have_caching_bg to false. Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we allocate a chunk and try to add entries corresponding to the chunk's physical stripe into the device tree. When doing so the task deadlocks itself waiting for the blocking lock on the root node of the device tree. This commit fixes the issue by introducing a new local variable to help indicate as to whether a block group of any RAID type is being cached. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-03	btrfs: Fix a data space underflow warning	Qu Wenruo
	Even with quota disabled, generic/127 will trigger a kernel warning by underflow data space info. The bug is caused by buffered write, which in case of short copy, the start parameter for btrfs_delalloc_release_space() is wrong, and round_up/down() in btrfs_delalloc_release() extents the range to page aligned, decreasing one more page than expected. This patch will fix it by passing correct start. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-02	Merge tag 'nfs-rdma-4.4-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma	Trond Myklebust
	NFS: NFSoRDMA Client Side Changes In addition to a variety of bugfixes, these patches are mostly geared at enabling both swap and backchannel support to the NFS over RDMA client. Signed-off-by: Anna Schumake <Anna.Schumaker@Netapp.com>
2015-11-02	pstore: fix code comment to match code	Geliang Tang
	Fix code comment about kmsg_dump register so it matches the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-11-02	NFS: Enable client side NFSv4.1 backchannel to use other transports	Chuck Lever
	Forechannel transports get their own "bc_up" method to create an endpoint for the backchannel service. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Anna Schumaker: Add forward declaration of struct net to xprt.h] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02	pNFS/flexfiles: Add support for FF_FLAGS_NO_IO_THRU_MDS	Trond Myklebust
	For loosely coupled pNFS/flexfiles systems, there is often no advantage at all in going through the MDS for I/O, since the MDS is subject to the same limitations as all other clients when talking to DSes. If a DS is unresponsive, I/O through the MDS will fail. For such systems, the only scalable solution is to have the pNFS clients retry doing pNFS, and so the protocol now provides a flag that allows the pNFS server to signal this. If LAYOUTGET returns FF_FLAGS_NO_IO_THRU_MDS, then we should assume that the MDS wants the client to retry using these devices, even if they were previously marked as being unavailable. To do so, we add a helper, ff_layout_mark_devices_valid() that will be called from layoutget. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-02	pNFS/flexfiles: When mirrored, retry failed reads by switching mirrors	Trond Myklebust
	If the pNFS/flexfiles file is mirrored, and a read to one mirror fails, then we should bump the mirror index, so that we retry to a different mirror. Once we've iterated through all mirrors and all failed, we can return the layout and issue a new LAYOUTGET. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-01	mm: get rid of 'vmalloc_info' from /proc/meminfo	Linus Torvalds
	It turns out that at least some versions of glibc end up reading /proc/meminfo at every single startup, because glibc wants to know the amount of memory the machine has. And while that's arguably insane, it's just how things are. And it turns out that it's not all that expensive most of the time, but the vmalloc information statistics (amount of virtual memory used in the vmalloc space, and the biggest remaining chunk) can be rather expensive to compute. The 'get_vmalloc_info()' function actually showed up on my profiles as 4% of the CPU usage of "make test" in the git source repository, because the git tests are lots of very short-lived shell-scripts etc. It turns out that apparently this same silly vmalloc info gathering shows up on the facebook servers too, according to Dave Jones. So it's not just "make test" for git. We had two patches to just cache the information (one by me, one by Ingo) to mitigate this issue, but the whole vmalloc information of of rather dubious value to begin with, and people who actually want to know what the situation is wrt the vmalloc area should just look at the much more complete /proc/vmallocinfo instead. In fact, according to my testing - and perhaps more importantly, according to that big search engine in the sky: Google - there is nothing out there that actually cares about those two expensive fields: VmallocUsed and VmallocChunk. So let's try to just remove them entirely. Actually, this just removes the computation and reports the numbers as zero for now, just to try to be minimally intrusive. If this breaks anything, we'll obviously have to re-introduce the code to compute this all and add the caching patches on top. But if given the option, I'd really prefer to just remove this bad idea entirely rather than add even more code to work around our historical mistake that likely nobody really cares about. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-01	Merge branch 'fs-file-descriptor-optimization'	Linus Torvalds
	Merge file descriptor allocation speedup. Eric Dumazet has a test-case for a fairly common network deamon load pattern: openign and closing a lot of sockets that each have very little work done on them. It turns out that in that case, the cost of just finding the correct file descriptor number can be a dominating factor. We've long had a trivial optimization for allocating file descriptors sequentially, but that optimization ends up being not very effective when other file descriptors are being closed concurrently, and the fd patterns are not some simple FIFO pattern. In such cases we ended up spending a lot of time just scanning the bitmap of open file descriptors in order to find the next file descriptor number to open. This trivial patch-series mitigates that by simply introducing a second-level bitmap of which words in the first bitmap are already fully allocated. That cuts down the cost of scanning by an order of magnitude in some pathological (but realistic) cases. The second patch is an even more trivial patch to avoid unnecessarily dirtying the cacheline for the close-on-exec bit array that normally ends up being all empty. * fs-file-descriptor-optimization: vfs: conditionally clear close-on-exec flag vfs: Fix pathological performance case for __alloc_fd()
2015-10-31	vfs: conditionally clear close-on-exec flag	Linus Torvalds
	We clear the close-on-exec flag when opening and closing files, and the bit was almost always already clear before. Avoid dirtying the cacheline if the clearning isn't necessary. That avoids unnecessary cacheline dirtying and bouncing in multi-socket environments. Eric Dumazet has a file descriptor benchmark that goes 4% faster from this on his two-socket machine. It's probably partly superlinear improvement due to getting slightly less spinlock contention on the file_lock spinlock due to less work in the critical section. Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31	vfs: Fix pathological performance case for __alloc_fd()	Linus Torvalds
	Al Viro points out that: > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > with some use patterns. The cacheline pingpong in the bitmap is probably > > inevitable, unless we accept considerably heavier memory footprint, > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > to trigger - close(3);open(...); will have the next open() after that > > scanning the entire in-use bitmap. And Eric Dumazet has a somewhat realistic multithreaded microbenchmark that opens and closes a lot of sockets with minimal work per socket. This patch largely fixes it. We keep a 2nd-level bitmap of the open file bitmaps, showing which words are already full. So then we can traverse that second-level bitmap to efficiently skip already allocated file descriptors. On his benchmark, this improves performance by up to an order of magnitude, by avoiding the excessive open file bitmap scanning. Tested-and-acked-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31	Merge branch 'overlayfs-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs bug fixes from Miklos Szeredi: "This contains fixes for bugs that appeared in earlier kernels (all are marked for -stable)" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: free lower_mnt array in ovl_put_super ovl: free stack of paths in ovl_fill_super ovl: fix open in stacked overlay ovl: fix dentry reference leak ovl: use O_LARGEFILE in ovl_copy_up()
2015-10-29	fs/ext4: remove unnecessary new_valid_dev check	Yaowei Bai
	As new_valid_dev always returns 1, so !new_valid_dev check is not needed, remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-29	gfs2: Remove gl_spin define	Andreas Gruenbacher
	Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in gl_lockref). Remove that define to make the references to gl_lockref.lock more obvious. Signed-off-by: Andreas Gruenbacher <andreas.gruenbacher@gmail.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-10-28	fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in ↵	Tejun Heo
	bdi_split_work_to_wbs() bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue() to walk @bdi->wb_list. To set up the initial iteration condition, it uses list_entry_rcu() to calculate the entry pointer corresponding to the list head; however, this isn't an actual RCU dereference and using list_entry_rcu() for it ended up breaking a proposed list_entry_rcu() change because it was feeding an non-lvalue pointer into the macro. Don't use the RCU variant for simple pointer offsetting. Use list_entry() instead. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Darren Hart <dvhart@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Patrick Marlier <patrick.marlier@gmail.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pranith kumar <bobby.prani@gmail.com> Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-27	namei: permit linking with CAP_FOWNER in userns	Dirk Steinmetz
	Attempting to hardlink to an unsafe file (e.g. a setuid binary) from within an unprivileged user namespace fails, even if CAP_FOWNER is held within the namespace. This may cause various failures, such as a gentoo installation within a lxc container failing to build and install specific packages. This change permits hardlinking of files owned by mapped uids, if CAP_FOWNER is held for that namespace. Furthermore, it improves consistency by using the existing inode_owner_or_capable(), which is aware of namespaced capabilities as of 23adbe12ef7d3 ("fs,userns: Change inode_capable to capable_wrt_inode_uidgid"). Signed-off-by: Dirk Steinmetz <public@rsjtdrjgfuzkfg.com> This is hitting us in Ubuntu during some dpkg upgrades in containers. When upgrading a file dpkg creates a hard link to the old file to back it up before overwriting it. When packages upgrade suid files owned by a non-root user the link isn't permitted, and the package upgrade fails. This patch fixes our problem. Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2015-10-26	btrfs: qgroup: Fix a rebase bug which will cause qgroup double free	Qu Wenruo
	When rebasing my patchset, I forgot to pick up a cleanup patch to remove old hotfix in 4.2 release. Witouth the cleanup, it will screw up new qgroup reserve framework and always cause minus reserved number. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans	Qu Wenruo
	Between btrfs_allocerved_file_extent() and btrfs_add_delayed_qgroup_reserve(), there is a window that delayed_refs are run and delayed ref head maybe freed before btrfs_add_delayed_qgroup_reserve(). This will cause btrfs_dad_delayed_qgroup_reserve() to return -ENOENT, and cause transaction to be aborted. This patch will record qgroup reserve space info into delayed_ref_head at btrfs_add_delayed_ref(), to eliminate the race window. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: clear PF_NOFREEZE in cleaner_kthread()	Jiri Kosina
	cleaner_kthread() kthread calls try_to_freeze() at the beginning of every cleanup attempt. This operation can't ever succeed though, as the kthread hasn't marked itself as freezable. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark cleaner_kthread() freezable (as my understanding is that it can generate filesystem I/O during suspend). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: qgroup: Don't copy extent buffer to do qgroup rescan	Qu Wenruo
	Ancient qgroup code call memcpy() on a extent buffer and use it for leaf iteration. As extent buffer contains lock, pointers to pages, it's never sane to do such copy. The following bug may be caused by this insane operation: [92098.841309] general protection fault: 0000 [#1] SMP [92098.841338] Modules linked in: ... [92098.841814] CPU: 1 PID: 24655 Comm: kworker/u4:12 Not tainted 4.3.0-rc1 #1 [92098.841868] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs] [92098.842261] Call Trace: [92098.842277] [<ffffffffc035a5d8>] ? read_extent_buffer+0xb8/0x110 [btrfs] [92098.842304] [<ffffffffc0396d00>] ? btrfs_find_all_roots+0x60/0x70 [btrfs] [92098.842329] [<ffffffffc039af3d>] btrfs_qgroup_rescan_worker+0x28d/0x5a0 [btrfs] Where btrfs_qgroup_rescan_worker+0x28d is btrfs_disk_key_to_cpu(), called in reading key from the copied extent_buffer. This patch will use btrfs_clone_extent_buffer() to a better copy of extent buffer to deal such case. Reported-by: Stephane Lesimple <stephane_btrfs@lesimple.fr> Suggested-by: Filipe Manana <fdmanana@kernel.org> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: add balance filters limits, stripes and usage to supported mask	David Sterba
	Enable the extended 'limit' syntax (a range), the new 'stripes' and extended 'usage' syntax (a range) filters in the filters mask. The patch comes separate and not within the series that introduced the new filters because the patch adding the mask was merged in a late rc. The integration branch was based on an older rc and could not merge the patch due to the missing changes. Prerequisities: * btrfs: check unsupported filters in balance arguments * btrfs: extend balance filter limit to take minimum and maximum * btrfs: add balance filter for stripes * btrfs: extend balance filter usage to take minimum and maximum Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: extend balance filter usage to take minimum and maximum	David Sterba
	Similar to the 'limit' filter, we can enhance the 'usage' filter to accept a range. The change is backward compatible, the range is applied only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag. We don't have a usecase yet, the current syntax has been sufficient. The enhancement should provide parity with other range-like filters. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: add balance filter for stripes	Gabríel Arthúr Pétursson
	Balance block groups which have the given number of stripes, defined by a range min..max. This is useful to selectively rebalance only chunks that do not span enough devices, applies to RAID0/10/5/6. Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is> [ renamed bargs members, added to the UAPI, wrote the changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: extend balance filter limit to take minimum and maximum	David Sterba
	The 'limit' filter is underdesigned, it should have been a range for [min,max], with some relaxed semantics when one of the bounds is missing. Besides that, using a full u64 for a single value is a waste of bytes. Let's fix both by extending the use of the u64 bytes for the [min,max] range. This can be done in a backward compatible way, the range will be interpreted only if the appropriate flag is set (BTRFS_BALANCE_ARGS_LIMIT_RANGE). Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: fix use after free iterating extrefs	Chris Mason
	The code for btrfs inode-resolve has never worked properly for files with enough hard links to trigger extrefs. It was trying to get the leaf out of a path after freeing the path: btrfs_release_path(path); leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, slot); The fix here is to use the extent buffer we cloned just a little higher up to avoid deadlocks caused by using the leaf in the path. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v3.7+ cc: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	btrfs: check unsupported filters in balance arguments	David Sterba
	We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26	fs/binfmt_elf_fdpic.c: fix brk area overlap with stack on NOMMU	Rich Felker
	On NOMMU archs, the FDPIC ELF loader sets up the usable brk range to overlap with all but the last PAGE_SIZE bytes of the stack. This leads to catastrophic memory reuse/corruption if brk is used. Fix by setting the brk area to zero size to disable its use. Signed-off-by: Rich Felker <dalias@libc.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Ungerer <gerg@uclinux.org>
2015-10-25	Btrfs: fix regression running delayed references when using qgroups	Filipe Manana
	In the kernel 4.2 merge window we had a big changes to the implementation of delayed references and qgroups which made the no_quota field of delayed references not used anymore. More specifically the no_quota field is not used anymore as of: commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.") Leaving the no_quota field actually prevents delayed references from getting merged, which in turn cause the following BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are enabled: static int run_delayed_tree_ref(...) { (...) BUG_ON(node->ref_mod != 1); (...) } This happens on a scenario like the following: 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota. 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref2 is incompatible due to Ref2->no_quota != Ref3->no_quota. 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref3 is incompatible due to Ref3->no_quota != Ref4->no_quota. 5) We run delayed references, trigger merging of delayed references, through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs(). 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and all other conditions are satisfied too. So Ref1 gets a ref_mod value of 2. 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and all other conditions are satisfied too. So Ref2 gets a ref_mod value of 2. 8) Ref1 and Ref2 aren't merged, because they have different values for their no_quota field. 9) Delayed reference Ref1 is picked for running (select_delayed_ref() always prefers references with an action == BTRFS_ADD_DELAYED_REF). So run_delayed_tree_ref() is called for Ref1 which triggers the BUG_ON because Ref1->red_mod != 1 (equals 2). So fix this by removing the no_quota field, as it's not used anymore as of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism."). The use of no_quota was also buggy in at least two places: 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting no_quota to 0 instead of 1 when the following condition was true: is_fstree(ref_root) \|\| !fs_info->quota_enabled 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to reset a node's no_quota when the condition "!is_fstree(root_objectid) \|\| !root->fs_info->quota_enabled" was true but we did it only in an unused local stack variable, that is, we never reset the no_quota value in the node itself. This fixes the remainder of problems several people have been having when running delayed references, mostly while a balance is running in parallel, on a 4.2+ kernel. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Also, this fixes deadlock issue when using the clone ioctl with qgroups enabled, as reported by Elias Probst in the mailing list. The deadlock happens because after calling btrfs_insert_empty_item we have our path holding a write lock on a leaf of the fs/subvol tree and then before releasing the path we called check_ref() which did backref walking, when qgroups are enabled, and tried to read lock the same leaf. The trace for this case is the following: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds. (...) Call Trace: [<ffffffff86999201>] schedule+0x74/0x83 [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea [<ffffffff86137ed7>] ? wait_woken+0x74/0x74 [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810 [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127 [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667 [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6 [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0 [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65 [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88 [<ffffffff863e852e>] check_ref+0x64/0xc4 [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77 [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb (...) The problem goes away by eleminating check_ref(), which no longer is needed as its purpose was to get a value for the no_quota field of a delayed reference (this patch removes the no_quota field as mentioned earlier). Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-10-25	Btrfs: fix regression when running delayed references	Filipe Manana
	In the kernel 4.2 merge window we had a refactoring/rework of the delayed references implementation in order to fix certain problems with qgroups. However that rework introduced one more regression that leads to the following trace when running delayed references for metadata: [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832! [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2 [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000 [35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs] [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293 [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000 [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000 [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8 [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000 [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000 [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000 [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0 [35908.065201] Stack: [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000 [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8 [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000 [35908.065201] Call Trace: [35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs] [35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs] [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs] [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs] [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs] [35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs] [35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs] [35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs] [35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs] [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7 [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31 [35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs] [35908.065201] RSP <ffff88010c4cbb08> [35908.310885] ---[ end trace fe4299baf0666457 ]--- This happens because the new delayed references code no longer merges delayed references that have different sequence values. The following steps are an example sequence leading to this issue: 1) Transaction N starts, fs_info->tree_mod_seq has value 0; 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for bytenr A is created, with a value of 1 and a seq value of 0; 3) fs_info->tree_mod_seq is incremented to 1; 4) Extent buffer A is deleted through btrfs_del_items(), which calls btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The later returns the metadata extent associated to extent buffer A to the free space cache (the range is not pinned), because the extent buffer was created in the current transaction (N) and writeback never happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set in the extent buffer). This creates the delayed reference Ref2 for bytenr A, with a value of -1 and a seq value of 1; 5) Delayed reference Ref2 is not merged with Ref1 when we create it, because they have different sequence numbers (decided at add_delayed_ref_tail_merge()); 6) fs_info->tree_mod_seq is incremented to 2; 7) Some task attempts to allocate a new extent buffer (done at extent-tree.c:find_free_extent()), but due to heavy fragmentation and running low on metadata space the clustered allocation fails and we fall back to unclustered allocation, which finds the extent at offset A, so a new extent buffer at offset A is allocated. This creates delayed reference Ref3 for bytenr A, with a value of 1 and a seq value of 2; 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they all have different seq values; 9) We start running the delayed references (__btrfs_run_delayed_refs()); 10) The delayed Ref1 is the first one being applied, which ends up creating an inline extent backref in the extent tree; 10) Next the delayed reference Ref3 is selected for execution, and not Ref2, because select_delayed_ref() always gives a preference for positive references (that have an action of BTRFS_ADD_DELAYED_REF); 11) When running Ref3 we encounter alreay the inline extent backref in the extent tree at insert_inline_extent_backref(), which makes us hit the following BUG_ON: BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID); This is always true because owner corresponds to the level of the extent buffer/btree node in the btree. For the scenario described above we hit the BUG_ON because we never merge references that have different seq values. We used to do the merging before the 4.2 kernel, more specifically, before the commmits: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.") This issue became more exposed after the following change that was added to 4.2 as well: cffc3374e567 ("Btrfs: fix order by which delayed references are run") Which in turn fixed another regression by the two commits previously mentioned. So fix this by bringing back the delayed reference merge code, with the proper adaptations so that it operates against the new data structure (linked list vs old red black tree implementation). This issue was hit running fstest btrfs/063 in a loop. Several people have reported this issue in the mailing list when running on kernels 4.2+. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-10-24	Merge branch 'for-linus' of git://git.kernel.dk/linux-block	Linus Torvalds
	Pull block layer fixes from Jens Axboe: "A final set of fixes for 4.3. It is (again) bigger than I would have liked, but it's all been through the testing mill and has been carefully reviewed by multiple parties. Each fix is either a regression fix for this cycle, or is marked stable. You can scold me at KS. The pull request contains: - Three simple fixes for NVMe, fixing regressions since 4.3. From Arnd, Christoph, and Keith. - A single xen-blkfront fix from Cathy, fixing a NULL dereference if an error is returned through the staste change callback. - Fixup for some bad/sloppy code in nbd that got introduced earlier in this cycle. From Markus Pargmann. - A blk-mq tagset use-after-free fix from Junichi. - A backing device lifetime fix from Tejun, fixing a crash. - And finally, a set of regression/stable fixes for cgroup writeback from Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() NVMe: Fix memory leak on retried commands block: don't release bdi while request_queue has live references nvme: use an integer value to Linux errno values blk-mq: fix use-after-free in blk_mq_free_tag_set() nvme: fix 32-bit build warning writeback: fix incorrect calculation of available memory for memcg domains writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions writeback: bdi_writeback iteration must not skip dying ones writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration nbd: Add locking for tasks xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
2015-10-24	Merge branch 'for-linus-4.3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have two more small fixes this week: Qu's fix avoids unneeded COW during fallocate, and Christian found a memory leak in the error handling of an earlier fix" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix possible leak in btrfs_ioctl_balance() btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
2015-10-23	i2c-dev: Fix typo in ioctl name reference	Jean Delvare
	The ioctl is named I2C_RDWR for "I2C read/write". But references to it were misspelled "rdrw". Fix them. Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2015-10-23	ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put	Joseph Qi
	dlm_lockres_put will call dlm_lockres_release if it is the last reference, and then it may call dlm_print_one_lock_resource and take lockres spinlock. So unlock lockres spinlock before dlm_lockres_put to avoid deadlock. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-22	locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait	Benjamin Coddington
	All callers use locks_lock_inode_wait() instead. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22	Move locks API users to locks_lock_inode_wait()	Benjamin Coddington
	Instead of having users check for FL_POSIX or FL_FLOCK to call the correct locks API function, use the check within locks_lock_inode_wait(). This allows for some later cleanup. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22	locks: introduce locks_lock_inode_wait()	Benjamin Coddington
	Users of the locks API commonly call either posix_lock_file_wait() or flock_lock_file_wait() depending upon the lock type. Add a new function locks_lock_inode_wait() which will check and call the correct function for the type of lock passed in. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22	pstore: Fix return type of pstore_is_mounted()	Geliang Tang
	This patch changes return type of pstore_is_mounted from int to bool. Signed-off-by: Geliang Tang <geliangtang@163.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-22	f2fs: fix to skip shrinking extent nodes	Chao Yu
	In f2fs_shrink_extent_tree we should stop shrink flow if we have already shrunk enough nodes in extent cache. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22	f2fs: fix error path of ->symlink	Chao Yu
	Now, in ->symlink of f2fs, we kept the fixed invoking order between f2fs_add_link and page_symlink since we should init node info firstly in f2fs_add_link, then such node info can be used in page_symlink. But we didn't fix to release meta info which was done before page_symlink in our error path, so this will leave us corrupt symlink entry in its parent's dentry page. Fix this issue by adding f2fs_unlink in the error path for removing such linking. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22	f2fs: fix to clear GCed flag for atomic written page	Chao Yu
	Atomic write page can be GCed, after committing this kind of page, we should clear the GCed flag for it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22	pstore: add pstore unregister	Geliang Tang
	pstore doesn't support unregistering yet. It was marked as TODO. This patch adds some code to fix it: 1) Add functions to unregister kmsg/console/ftrace/pmsg. 2) Add a function to free compression buffer. 3) Unmap the memory and free it. 4) Add a function to unregister pstore filesystem. Signed-off-by: Geliang Tang <geliangtang@163.com> Acked-by: Kees Cook <keescook@chromium.org> [Removed __exit annotation from ramoops_remove(). Reported by Arnd Bergmann] Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-21	f2fs: don't need to submit bio on error case	Jaegeuk Kim
	If commit_atomic_write is failed, we don't need to submit any bio. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>