summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-08-25btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()Wang Xiaoguang
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses wrong file offset for reloc_inode, it uses cluster->start and cluster->end, which indeed are extent's bytenr. The correct value should be cluster->[start|end] minus block group's start bytenr. start bytenr cluster->start | | extent | extent | ...| extent | |----------------------------------------------------------------| | block group reloc_inode | Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: qgroup: Fix qgroup incorrectness caused by log replayQu Wenruo
When doing log replay at mount time(after power loss), qgroup will leak numbers of replayed data extents. The cause is almost the same of balance. So fix it by manually informing qgroup for owner changed extents. The bug can be detected by btrfs/119 test case. Cc: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: relocation: Fix leaking qgroups numbers on data extentsQu Wenruo
This patch fixes a REGRESSION introduced in 4.2, caused by the big quota rework. When balancing data extents, qgroup will leak all its numbers for relocated data extents. The relocation is done in the following steps for data extents: 1) Create data reloc tree and inode 2) Copy all data extents to data reloc tree And commit transaction 3) Create tree reloc tree(special snapshot) for any related subvolumes 4) Replace file extent in tree reloc tree with new extents in data reloc tree And commit transaction 5) Merge tree reloc tree with original fs, by swapping tree blocks For 1)~4), since tree reloc tree and data reloc tree doesn't count to qgroup, everything is OK. But for 5), the swapping of tree blocks will only info qgroup to track metadata extents. If metadata extents contain file extents, qgroup number for file extents will get lost, leading to corrupted qgroup accounting. The fix is, before commit transaction of step 5), manually info qgroup to track all file extents in data reloc tree. Since at commit transaction time, the tree swapping is done, and qgroup will account these data extents correctly. Cc: Mark Fasheh <mfasheh@suse.de> Reported-by: Mark Fasheh <mfasheh@suse.de> Reported-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()Qu Wenruo
Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions: 1. btrfs_qgroup_insert_dirty_extent_nolock() Almost the same with original code. For delayed_ref usage, which has delayed refs locked. Change the return value type to int, since caller never needs the pointer, but only needs to know if they need to free the allocated memory. 2. btrfs_qgroup_insert_dirty_extent() The more encapsulated version. Will do the delayed_refs lock, memory allocation, quota enabled check and other things. The original design is to keep exported functions to minimal, but since more btrfs hacks exposed, like replacing path in balance, we need to record dirty extents manually, so we have to add such functions. Also, add comment for both functions, to info developers how to keep qgroup correct when doing hacks. Cc: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: waiting on qgroup rescan should not always be interruptibleJeff Mahoney
We wait on qgroup rescan completion in three places: file system shutdown, the quota disable ioctl, and the rescan wait ioctl. If the user sends a signal while we're waiting, we continue happily along. This is expected behavior for the rescan wait ioctl. It's racy in the shutdown path but mostly works due to other unrelated synchronization points. In the quota disable path, it Oopses the kernel pretty much immediately. Cc: <stable@vger.kernel.org> # v4.4+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: properly track when rescan worker is runningJeff Mahoney
The qgroup_flags field is overloaded such that it reflects the on-disk status of qgroups and the runtime state. The BTRFS_QGROUP_STATUS_FLAG_RESCAN flag is used to indicate that a rescan operation is in progress, but if the file system is unmounted while a rescan is running, the rescan operation is paused. If the file system is then mounted read-only, the flag will still be present but the rescan operation will not have been resumed. When we go to umount, btrfs_qgroup_wait_for_completion will see the flag and interpret it to mean that the rescan worker is still running and will wait for a completion that will never come. This patch uses a separate flag to indicate when the worker is running. The locking and state surrounding the qgroup rescan worker needs a lot of attention beyond this patch but this is enough to avoid a hung umount. Cc: <stable@vger.kernel.org> # v4.4+ Signed-off-by; Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: flush_space: treat return value of do_chunk_alloc properlyAlex Lyakas
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk. But flush_space will not convert this to 0, and will also return 1. As a result, reserve_metadata_bytes will think that flush_space failed, and may potentially return this value "1" to the caller (depends how reserve_metadata_bytes was called). The caller will also treat this as an error. For example, btrfs_block_rsv_refill does: int ret = -ENOSPC; ... ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, 0); return 0; } return ret; So it will return -ENOSPC. Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25Btrfs: add ASSERT for block group's memory leakLiu Bo
This adds several ASSERT()' s to report memory leak of block group cache. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25btrfs: backref: Fix soft lockup in __merge_refs functionQu Wenruo
When over 1000 file extents refers to one extent, find_parent_nodes() will be obviously slow, due to the O(n^2)~O(n^3) loops inside __merge_refs(). The following ftrace shows the cubic growth of execution time: 256 refs 5) + 91.768 us | __add_keyed_refs.isra.12 [btrfs](); 5) 1.447 us | __add_missing_keys.isra.13 [btrfs](); 5) ! 114.544 us | __merge_refs [btrfs](); 5) ! 136.399 us | __merge_refs [btrfs](); 512 refs 6) ! 279.859 us | __add_keyed_refs.isra.12 [btrfs](); 6) 3.164 us | __add_missing_keys.isra.13 [btrfs](); 6) ! 442.498 us | __merge_refs [btrfs](); 6) # 2091.073 us | __merge_refs [btrfs](); and 1024 refs 7) ! 368.683 us | __add_keyed_refs.isra.12 [btrfs](); 7) 4.810 us | __add_missing_keys.isra.13 [btrfs](); 7) # 2043.428 us | __merge_refs [btrfs](); 7) * 18964.23 us | __merge_refs [btrfs](); And sort them into the following char: (Unit: us) ------------------------------------------------------------------------ Trace function | 256 ref | 512 refs | 1024 refs | ------------------------------------------------------------------------ __add_keyed_refs | 91 | 249 | 368 | __add_missing_keys | 1 | 3 | 4 | __merge_refs 1st call | 114 | 442 | 2043 | __merge_refs 2nd call | 136 | 2091 | 18964 | ------------------------------------------------------------------------ We can see the that __add_keyed_refs() grows almost in linear behavior. And __add_missing_keys() in this case doesn't change much or takes much time. While for the 1st __merge_refs() it's square growth for the 2nd __merge_refs() call it's cubic growth. It's no doubt that merge_refs() will take a long long time to execute if the number of refs continues its grows. So add a cond_resced() into the loop of __merge_refs(). Although this will solve the problem of soft lockup, we need to use the new rb_tree based structure introduced by Lu Fengqi to really solve the long execution time. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25Btrfs: fix memory leak of reloc_rootLiu Bo
When some critical errors occur and FS would be flipped into RO, if we have an on-going balance, we can end up with a memory leak of root->reloc_root since btrfs_drop_snapshots() bails out without freeing reloc_root at the very early start. However, we're not able to free reloc_root in btrfs_drop_snapshots() because its caller, merge_reloc_roots(), still needs to access it to cleanup reloc_root's rbtree. This makes us free reloc_root when we're going to free fs/file roots. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-08-24Merge tag 'upstream-4.8-rc4' of git://git.infradead.org/linux-ubifsLinus Torvalds
Pull UBIFS fixes from Richard Weinberger: "This pull requests contains fixes for two issues in UBI and UBIFS: - wrong UBIFS assertion. - a UBIFS xattr regression" * tag 'upstream-4.8-rc4' of git://git.infradead.org/linux-ubifs: ubifs: Fix xattr generic handler usage ubifs: Fix assertion in layout_in_gaps()
2016-08-24fuse: direct-io: don't dirty ITER_BVEC pagesMiklos Szeredi
When reading from a loop device backed by a fuse file it deadlocks on lock_page(). This is because the page is already locked by the read() operation done on the loop device. In this case we don't want to either lock the page or dirty it. So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors. Reported-by: Sheng Yang <sheng@yasker.org> Fixes: aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.1+ Reviewed-by: Sheng Yang <sheng@yasker.org> Reviewed-by: Ashish Samant <ashish.samant@oracle.com> Tested-by: Sheng Yang <sheng@yasker.org> Tested-by: Ashish Samant <ashish.samant@oracle.com>
2016-08-23Merge tag 'for-f2fs-v4.8-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: - fsmark regression - i_size race condition - wrong conditions in f2fs_move_file_range * tag 'for-f2fs-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: avoid potential deadlock in f2fs_move_file_range f2fs: allow copying file range only in between regular files Revert "f2fs: move i_size_write in f2fs_write_end" Revert "f2fs: use percpu_rw_semaphore"
2016-08-23ubifs: Fix xattr generic handler usageRichard Weinberger
UBIFS uses full names to work with xattrs, therefore we have to use xattr_full_name() to obtain the xattr prefix as string. Cc: <stable@vger.kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Fixes: 2b88fc21ca ("ubifs: Switch to generic xattr handlers") Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Tested-by: Dongsheng Yang <dongsheng081251@gmail.com>
2016-08-23ubifs: Fix assertion in layout_in_gaps()Vincent Stehlé
An assertion in layout_in_gaps() verifies that the gap_lebs pointer is below the maximum bound. When computing this maximum bound the idx_lebs count is multiplied by sizeof(int), while C pointers arithmetic does take into account the size of the pointed elements implicitly already. Remove the multiplication to fix the assertion. Fixes: 1e51764a3c2ac05a ("UBIFS: add new flash file system") Cc: <stable@vger.kernel.org> Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2016-08-23pnfs/blocklayout: update last_write_offset atomically with extentsBenjamin Coddington
Block/SCSI layout write completion may add committable extents to the extent tree before updating the layout's last-written byte under the inode lock. If a sync happens before this value is updated, then prepare_layoutcommit may find and encode these extents which would produce a LAYOUTCOMMIT request whose encoded extents are larger than the request's loca_length. Fix this by using a last-written byte value that is updated atomically with the extent tree so that commitable extents always match. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-23pNFS: The client must not do I/O to the DS if it's lease has expiredTrond Myklebust
Ensure that the client conforms to the normative behaviour described in RFC5661 Section 12.7.2: "If a client believes its lease has expired, it MUST NOT send I/O to the storage device until it has validated its lease." So ensure that we wait for the lease to be validated before using the layout. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.20+
2016-08-22bdev: fix NULL pointer dereferenceVegard Nossum
I got this: kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) CPU: 0 PID: 5505 Comm: syz-executor Not tainted 4.8.0-rc2+ #161 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 task: ffff880113415940 task.stack: ffff880118350000 RIP: 0010:[<ffffffff8172cb32>] [<ffffffff8172cb32>] bd_mount+0x52/0xa0 RSP: 0018:ffff880118357ca0 EFLAGS: 00010207 RAX: dffffc0000000000 RBX: ffffffffffffffff RCX: ffffc90000bb6000 RDX: 0000000000000018 RSI: ffffffff846d6b20 RDI: 00000000000000c7 RBP: ffff880118357cb0 R08: ffff880115967c68 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801188211e8 R13: ffffffff847baa20 R14: ffff8801139cb000 R15: 0000000000000080 FS: 00007fa3ff6c0700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc1d8cc7e78 CR3: 0000000109f20000 CR4: 00000000000006f0 DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffff880112cfd6c0 ffff8801188211e8 ffff880118357cf0 ffffffff8167f207 ffffffff816d7a1e ffff880112a413c0 ffffffff847baa20 ffff8801188211e8 0000000000000080 ffff880112cfd6c0 ffff880118357d38 ffffffff816dce0a Call Trace: [<ffffffff8167f207>] mount_fs+0x97/0x2e0 [<ffffffff816d7a1e>] ? alloc_vfsmnt+0x55e/0x760 [<ffffffff816dce0a>] vfs_kern_mount+0x7a/0x300 [<ffffffff83c3247c>] ? _raw_read_unlock+0x2c/0x50 [<ffffffff816dfc87>] do_mount+0x3d7/0x2730 [<ffffffff81235fd4>] ? trace_do_page_fault+0x1f4/0x3a0 [<ffffffff816df8b0>] ? copy_mount_string+0x40/0x40 [<ffffffff8161ea81>] ? memset+0x31/0x40 [<ffffffff816df73e>] ? copy_mount_options+0x1ee/0x320 [<ffffffff816e2a02>] SyS_mount+0xb2/0x120 [<ffffffff816e2950>] ? copy_mnt_ns+0x970/0x970 [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0 [<ffffffff83c3282a>] entry_SYSCALL64_slow_path+0x25/0x25 Code: 83 e8 63 1b fc ff 48 85 c0 48 89 c3 74 4c e8 56 35 d1 ff 48 8d bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc RIP [<ffffffff8172cb32>] bd_mount+0x52/0xa0 RSP <ffff880118357ca0> ---[ end trace 13690ad962168b98 ]--- mount_pseudo() returns ERR_PTR(), not NULL, on error. Fixes: 3684aa7099e0 ("block-dev: enable writeback cgroup support") Cc: Shaohua Li <shli@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: stable@vger.kernel.org Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-19pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT callsTrond Myklebust
We normally want to update the stateid and then retry, Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-19Merge tag 'xfs-iomap-for-linus-4.8-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs and iomap fixes from Dave Chinner: "Changes in this update: Regression fixes for XFS changes introduce in 4.8-rc1: - buffer IO accounting assert failure - ENOSPC block accounting reservation issue - DAX IO path page cache invalidation fix - rmapbt on-disk block count in agf - correct classification of rmap block type when updating AGFL. - iomap support for attribute fork mapping Regression fixes for iomap infrastructure in 4.8-rc1: - fiemap: honor FIEMAP_FLAG_SYNC - fiemap: implement FIEMAP_FLAG_XATTR support to fix XFS regression - make mark_page_accessed and pagefault_disable usage consistent with other IO paths" * tag 'xfs-iomap-for-linus-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: remove OWN_AG rmap when allocating a block from the AGFL xfs: (re-)implement FIEMAP_FLAG_XATTR xfs: simplify xfs_file_iomap_begin iomap: mark ->iomap_end as optional iomap: prepare iomap_fiemap for attribute mappings iomap: fiemap should honor the FIEMAP_FLAG_SYNC flag iomap: remove superflous pagefault_disable from iomap_write_actor iomap: remove superflous mark_page_accessed from iomap_write_actor xfs: store rmapbt block count in the AGF xfs: don't invalidate whole file on DAX read/write xfs: fix bogus space reservation in xfs_iomap_write_allocate xfs: don't assert fail on non-async buffers on ioacct decrement
2016-08-19f2fs: avoid potential deadlock in f2fs_move_file_rangeChao Yu
Thread A Thread B - inode_lock fileA - inode_lock fileB - inode_lock fileA - inode_lock fileB We may encounter above potential deadlock during moving file range in concurrent scenario. This patch fixes the issue by using inode_trylock instead. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-08-19f2fs: allow copying file range only in between regular filesChao Yu
Only if two input files are regular files, we allow copying data in range of them, otherwise, deny it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-08-19Revert "f2fs: move i_size_write in f2fs_write_end"Chao Yu
This reverts commit a2ee0a300344a6da76186129b078113354fe13d2. When testing with generic/032 of xfstest suit, failure message will be reported as below: generic/032 8s ... [failed, exit status 1] - output mismatch (see results/generic/032.out.bad) --- tests/generic/032.out 2015-01-11 16:52:27.643681072 +0800 +++ results/generic/032.out.bad 2016-08-06 13:44:43.861330500 +0800 @@ -1,5 +1,5 @@ QA output created by 032 -100 iterations -0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd -* -0100000 +1: [768..775]: unwritten +Unwritten extents found! ... (Run 'diff -u tests/generic/032.out results/generic/032.out.bad' to see the entire diff) Ran: generic/032 Failures: generic/032 Failed 1 of 1 tests In write_end(), we should update i_size of inode before unlock page, otherwise, we will lose newly updated data in following race condition. Thread A Thread B - write_end - unlock page - writepages - lock_page - writepage if page is out-of-range of file size, we will skip writting the page. - update i_size Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-08-19Revert "f2fs: use percpu_rw_semaphore"Jaegeuk Kim
LKP reported -36.3% regression of fsmark.files_per_sec due to this patch. I've confirmed that fxmark [1] has also slight regression for DWAL. [1] https://github.com/sslab-gatech/fxmark This reverts commit ec795418c41850056feb956534edf059dc1155d4.
2016-08-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Buffers powersave frame test is reversed in cfg80211, fix from Felix Fietkau. 2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme. 3) Fix some tg3 ethtool logic bugs, and one that would cause no interrupts to be generated when rx-coalescing is set to 0. From Satish Baddipadige and Siva Reddy Kallam. 4) QLCNIC mailbox corruption and napi budget handling fix from Manish Chopra. 5) Fix fib_trie logic when walking the trie during /proc/net/route output than can access a stale node pointer. From David Forster. 6) Several sctp_diag fixes from Phil Sutter. 7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel. 8) Checksum fixup fixes in bpf from Daniel Borkmann. 9) Memork leaks in nfnetlink, from Liping Zhang. 10) Use after free in rxrpc, from David Howells. 11) Use after free in new skb_array code of macvtap driver, from Jason Wang. 12) Calipso resource leak, from Colin Ian King. 13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang. 14) Fix bpf non-linear packet write helpers, from Daniel Borkmann. 15) Fix lockdep splats in macsec, from Sabrina Dubroca. 16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF handling. 17) Various tc-action bug fixes, from CONG Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits) net_sched: allow flushing tc police actions net_sched: unify the init logic for act_police net_sched: convert tcf_exts from list to pointer array net_sched: move tc offload macros to pkt_cls.h net_sched: fix a typo in tc_for_each_action() net_sched: remove an unnecessary list_del() net_sched: remove the leftover cleanup_a() mlxsw: spectrum: Allow packets to be trapped from any PG mlxsw: spectrum: Unmap 802.1Q FID before destroying it mlxsw: spectrum: Add missing rollbacks in error path mlxsw: reg: Fix missing op field fill-up mlxsw: spectrum: Trap loop-backed packets mlxsw: spectrum: Add missing packet traps mlxsw: spectrum: Mark port as active before registering it mlxsw: spectrum: Create PVID vPort before registering netdevice mlxsw: spectrum: Remove redundant errors from the code mlxsw: spectrum: Don't return upon error in removal path i40e: check for and deal with non-contiguous TCs ixgbe: Re-enable ability to toggle VLAN filtering ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths ...
2016-08-17Merge branch 'iomap-fixes-4.8-rc3' into for-nextDave Chinner
2016-08-17xfs: remove OWN_AG rmap when allocating a block from the AGFLDarrick J. Wong
When we're really tight on space, xfs_alloc_ag_vextent_small() can allocate a block from the AGFL and give it to the caller. Since the caller is never the AGFL-fixing method, we must remove the OWN_AG reverse mapping because it will clash with whatever rmap the caller wants to set up. This bug was discovered by running generic/299 repeatedly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: (re-)implement FIEMAP_FLAG_XATTRChristoph Hellwig
Use a special read-only iomap_ops implementation to support fiemap on the attr fork. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: simplify xfs_file_iomap_beginChristoph Hellwig
We'll never get nimap == 0 for a successful return from xfs_bmapi_read, so don't try to handle it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17iomap: mark ->iomap_end as optionalChristoph Hellwig
No need to implement it for read-only mappings. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17iomap: prepare iomap_fiemap for attribute mappingsDave Chinner
By bassing through an -ENOENT, similar to the old XFS implementation of FIEMAP_FLAG_XATTR. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17iomap: fiemap should honor the FIEMAP_FLAG_SYNC flagDave Chinner
The flag is checked as supported, but then we do an unconditional sync of the file, regardless of whether the flag is set or not. Make the sync conditional on having the FIEMAP_FLAG_SYNC flag set. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17iomap: remove superflous pagefault_disable from iomap_write_actorChristoph Hellwig
iov_iter_copy_from_user_atomic disables page faults internally, no need to do it around the call. This also brings the iomap code in line with the original filemap version. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17iomap: remove superflous mark_page_accessed from iomap_write_actorChristoph Hellwig
This catches up with commit 2457ae ("mm: non-atomically mark page accessed during page cache allocation where possible"), which moved the initial access marking into the pagecache allocator. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: store rmapbt block count in the AGFDarrick J. Wong
Track the number of blocks used for the rmapbt in the AGF. When we get to the AG reservation code we need this counter to quickly make our reservation during mount. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: don't invalidate whole file on DAX read/writeDave Chinner
When we do DAX IO, we try to invalidate the entire page cache held on the file. This is incorrect as it will trash the entire mapping tree that now tracks dirty state in exceptional entries in the radix tree slots. What we are trying to do is remove cached pages (e.g from reads into holes) that sit in the radix tree over the range we are about to write to. Hence we should just limit the invalidation to the range we are about to overwrite. Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: fix bogus space reservation in xfs_iomap_write_allocateChristoph Hellwig
The space reservations was without an explaination in commit "Add error reporting calls in error paths that return EFSCORRUPTED" back in 2003. There is no reason to reserve disk blocks in the transaction when allocating blocks for delalloc space as we already reserved the space when creating the delalloc extent. With this fix we stop running out of the reserved pool in generic/229, which has happened for long time with small blocksize file systems, and has increased in severity with the new buffered write path. [ dchinner: we still need to pass the block reservation into xfs_bmapi_write() to ensure we don't deadlock during AG selection. See commit dbd5c8c ("xfs: pass total block res. as total xfs_bmapi_write() parameter") for more details on why this is necessary. ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: don't assert fail on non-async buffers on ioacct decrementBrian Foster
The buffer I/O accounting mechanism tracks async buffers under I/O. As an optimization, the buffer I/O count is incremented only once on the first async I/O for a given hold cycle of a buffer and decremented once the buffer is released to the LRU (or freed). xfs_buf_ioacct_dec() has an ASSERT() check for an XBF_ASYNC buffer, but we have one or two corner cases where a buffer can be submitted for I/O multiple times via different methods in a single hold cycle. If an async I/O occurs first, the I/O count is incremented. If a sync I/O occurs before the hold count drops, XBF_ASYNC is cleared by the time the I/O count is decremented. Remove the async assert check from xfs_buf_ioacct_dec() as this is a perfectly valid scenario. For the purposes of I/O accounting, we really only care about the buffer async state at I/O submission time. Discovered-and-analyzed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-16pNFS/flexfiles: Set reasonable default retrans values for the data channelTrond Myklebust
Prior to this patch, the retrans value was set at 5, meaning that we could see a maximum retransmission timeout value of more than 6 minutes. That's a tad high for NFSv3 where the protocol does allow the server to drop requests at any time. Since this is a data channel, let's just set retrans to 0, and the default timeout to 60s. The user can continue to adjust these defaults using the dataserver_retrans and dataserver_timeo module parameters. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-16NFS: Allow the mount option retrans=0Trond Myklebust
We should allow retrans=0 as just meaning that every timeout is a major timeout, and that there is no increment in the timeout value. For instance, this means that we would allow TCP users to specify a flat timeout value of 60s, by specifying "timeo=600,retrans=0" in their mount option string. Siged-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-14pNFS/flexfiles: Fix layoutstat periodic reportingTrond Myklebust
Putting the periodicity timer in the mirror instances is causing non-scalable reporting behaviour and missed reporting intervals. When you recall layouts and/or implement client side mirroring, it leads to consecutive reports with only a few ms between RPC calls. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Fixes: d0379a5d066a9 ("pNFS/flexfiles: Support server-supplied...")
2016-08-13Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - an NVMe fix from Gabriel, fixing a suspend/resume issue on some setups - addition of a few missing entries in the block queue sysfs documentation, from Joe - a fix for a sparse shadow warning for the bvec iterator, from Johannes - a writeback deadlock involving raid issuing barriers, and not flushing the plug when we wakeup the flusher threads. From Konstantin - a set of patches for the NVMe target/loop/rdma code, from Roland and Sagi * 'for-linus' of git://git.kernel.dk/linux-block: bvec: avoid variable shadowing warning doc: update block/queue-sysfs.txt entries nvme: Suspend all queues before deletion mm, writeback: flush plugged IO in wakeup_flusher_threads() nvme-rdma: Remove unused includes nvme-rdma: start async event handler after reconnecting to a controller nvmet: Fix controller serial number inconsistency nvmet-rdma: Don't use the inline buffer in order to avoid allocation for small reads nvmet-rdma: Correctly handle RDMA device hot removal nvme-rdma: Make sure to shutdown the controller if we can nvme-loop: Remove duplicate call to nvme_remove_namespaces nvme-rdma: Free the I/O tags when we delete the controller nvme-rdma: Remove duplicate call to nvme_remove_namespaces nvme-rdma: Fix device removal handling nvme-rdma: Queue ns scanning after a sucessful reconnection nvme-rdma: Don't leak uninitialized memory in connect request private data
2016-08-12Merge tag 'nfsd-4.8-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields: "Fixes for the dentry refcounting leak I introduced in 4.8-rc1, and for races in the LOCK code which appear to go back to the big nfsd state lock removal from 3.17" * tag 'nfsd-4.8-1' of git://linux-nfs.org/~bfields/linux: nfsd: don't return an unhashed lock stateid after taking mutex nfsd: Fix race between FREE_STATEID and LOCK nfsd: fix dentry refcounting on create
2016-08-12nfsd: don't return an unhashed lock stateid after taking mutexJeff Layton
nfsd4_lock will take the st_mutex before working with the stateid it gets, but between the time when we drop the cl_lock and take the mutex, the stateid could become unhashed (a'la FREE_STATEID). If that happens the lock stateid returned to the client will be forgotten. Fix this by first moving the st_mutex acquisition into lookup_or_create_lock_state. Then, have it check to see if the lock stateid is still hashed after taking the mutex. If it's not, then put the stateid and try the find/create again. Signed-off-by: Jeff Layton <jlayton@redhat.com> Tested-by: Alexey Kodanev <alexey.kodanev@oracle.com> Cc: stable@vger.kernel.org # feb9dad5 nfsd: Always lock state exclusively. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-12Merge tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user needs multiple different security services (e.g. krb5i and krb5p). - Stable patch to fix a regression introduced by the use of SO_REUSEPORT, and that prevented the use of multiple different NFS versions to the same server. - TCP socket reconnection timer fixes. - Patch from Neil to disable the use of IPv6 temporary addresses" * tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Cap the transport reconnection timer at 1/2 lease period NFSv4: Cleanup the setting of the nfs4 lease period SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout SUNRPC: Fix reconnection timeouts NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED SUNRPC: disable the use of IPv6 temporary addresses. SUNRPC: allow for upcalls for same uid but different gss service SUNRPC: Fix up socket autodisconnect SUNRPC: Handle EADDRNOTAVAIL on connection failures
2016-08-11Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "7 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats mm, oom: fix uninitialized ret in task_will_free_mem() kasan: remove the unnecessary WARN_ONCE from quarantine.c mm: memcontrol: fix memcg id ref counter on swap charge move mm: memcontrol: fix swap counter leak on swapout from offline cgroup proc, meminfo: use correct helpers for calculating LRU sizes in meminfo mm/hugetlb: fix incorrect hugepages count during mem hotplug
2016-08-11proc, meminfo: use correct helpers for calculating LRU sizes in meminfoMel Gorman
meminfo_proc_show() and si_mem_available() are using the wrong helpers for calculating the size of the LRUs. The user-visible impact is that there appears to be an abnormally high number of unevictable pages. Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11Merge tag 'ceph-for-4.8-rc2' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A patch for a NULL dereference bug introduced in 4.8-rc1 and a handful of static checker fixes" * tag 'ceph-for-4.8-rc2' of https://github.com/ceph/ceph-client: ceph: initialize pathbase in the !dentry case in encode_caps_cb() rbd: nuke the 32-bit pool id check rbd: destroy header_oloc in rbd_dev_release() ceph: fix null pointer dereference in ceph_flush_snaps() libceph: using kfree_rcu() to simplify the code libceph: make cancel_generic_request() static libceph: fix return value check in alloc_msg_with_page_vector()
2016-08-11nfsd: Fix race between FREE_STATEID and LOCKChuck Lever
When running LTP's nfslock01 test, the Linux client can send a LOCK and a FREE_STATEID request at the same time. The outcome is: Frame 324 R OPEN stateid [2,O] Frame 115004 C LOCK lockowner_is_new stateid [2,O] offset 672000 len 64 Frame 115008 R LOCK stateid [1,L] Frame 115012 C WRITE stateid [0,L] offset 672000 len 64 Frame 115016 R WRITE NFS4_OK Frame 115019 C LOCKU stateid [1,L] offset 672000 len 64 Frame 115022 R LOCKU NFS4_OK Frame 115025 C FREE_STATEID stateid [2,L] Frame 115026 C LOCK lockowner_is_new stateid [2,O] offset 672128 len 64 Frame 115029 R FREE_STATEID NFS4_OK Frame 115030 R LOCK stateid [3,L] Frame 115034 C WRITE stateid [0,L] offset 672128 len 64 Frame 115038 R WRITE NFS4ERR_BAD_STATEID In other words, the server returns stateid L in a successful LOCK reply, but it has already released it. Subsequent uses of stateid L fail. To address this, protect the generation check in nfsd4_free_stateid with the st_mutex. This should guarantee that only one of two outcomes occurs: either LOCK returns a fresh valid stateid, or FREE_STATEID returns NFS4ERR_LOCKS_HELD. Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Fix-suggested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Alexey Kodanev <alexey.kodanev@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-11ext4: avoid deadlock when expanding inode sizeJan Kara
When we need to move xattrs into external xattr block, we call ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end up calling ext4_mark_inode_dirty() again which will recurse back into the inode expansion code leading to deadlocks. Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move its management into ext4_expand_extra_isize_ea() since its manipulation is safe there (due to xattr_sem) from possible races with ext4_xattr_set_handle() which plays with it as well. CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>