summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-02-22userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED requestPavel Emelyanov
If the page is punched out of the address space the uffd reader should know this and zeromap the respective area in case of the #PF event. Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: optimize mremap_userfaultfd_complete()Andrea Arcangeli
Optimize the mremap_userfaultfd_complete() interface to pass only the vm_userfaultfd_ctx pointer through the stack as a microoptimization. Link: http://lkml.kernel.org/r/20161216144821.5183-13-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: add mremap() eventPavel Emelyanov
The event denotes that an area [start:end] moves to different location. Length change isn't reported as "new" addresses, if they appear on the uffd reader side they will not contain any data and the latter can just zeromap them. Waiting for the event ACK is also done outside of mmap sem, as for fork event. Link: http://lkml.kernel.org/r/20161216144821.5183-12-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_usersMike Rapoport
Since commit d2005e3f41d4 ("userfaultfd: don't pin the user memory in userfaultfd_file_create()") userfaultfd uses mm_count rather than mm_users to pin mm_struct. Make dup_userfaultfd consistent with this behaviour Link: http://lkml.kernel.org/r/20161216144821.5183-11-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: Add fork() eventPavel Emelyanov
When the mm with uffd-ed vmas fork()-s the respective vmas notify their uffds with the event which contains a descriptor with new uffd. This new descriptor can then be used to get events from the child and populate its mm with data. Note, that there can be different uffd-s controlling different vmas within one mm, so first we should collect all those uffds (and ctx-s) in a list and then notify them all one by one but only once per fork(). The context is created at fork() time but the descriptor, file struct and anon inode object is created at event read time. So some trickery is added to the userfaultfd_ctx_read() to handle the ctx queues' locking vs file creation. Another thing worth noticing is that the task that fork()-s waits for the uffd event to get processed WITHOUT the mmap sem. [aarcange@redhat.com: build warning fix] Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: report all available features to userlandAndrea Arcangeli
This will allow userland to probe all features available in the kernel. It will however only enable the requested features in the open userfaultfd context. Link: http://lkml.kernel.org/r/20161216144821.5183-8-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: add ability to report non-PF events from uffd ↵Pavel Emelyanov
descriptor The custom events are queued in ctx->event_wqh not to disturb the fast-path-ed PF queue-wait-wakeup functions. The events to be generated (other than PF-s) are requested in UFFD_API ioctl with the uffd_api.features bits. Those, known by the kernel, are then turned on and reported back to the user-space. Link: http://lkml.kernel.org/r/20161216144821.5183-7-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: Split the find_userfault() routinePavel Emelyanov
I will need one to lookup for userfaultfd_wait_queue-s in different wait queue Link: http://lkml.kernel.org/r/20161216144821.5183-6-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: use vma_is_anonymousAndrea Arcangeli
Cleanup the vma->vm_ops usage. Side note: it would be more robust if vma_is_anonymous() would also check that vm_flags hasn't VM_PFNMAP set. Link: http://lkml.kernel.org/r/20161216144821.5183-5-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: convert BUG() to WARN_ON_ONCE()Andrea Arcangeli
Avoid BUG_ON()s and only WARN instead. This is just a cleanup, it can't make any runtime difference. This BUG_ON has never triggered and cannot trigger. Link: http://lkml.kernel.org/r/20161216144821.5183-4-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WPAndrea Arcangeli
Minor comment correction. Link: http://lkml.kernel.org/r/20161216144821.5183-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-229p: fix a potential acl leakCong Wang
posix_acl_update_mode() could possibly clear 'acl', if so we leak the memory pointed by 'acl'. Save this pointer before calling posix_acl_update_mode() and release the memory if 'acl' really gets cleared. Link: http://lkml.kernel.org/r/1486678332-2430-1-git-send-email-xiyou.wangcong@gmail.com Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reported-by: Mark Salyzyn <salyzyn@android.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kurz <groug@kaod.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22ocfs2: fix deadlock issue when taking inode lock at vfs entry pointsEric Ren
Commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()") results in a deadlock, as the author "Tariq Saeed" realized shortly after the patch was merged. The discussion happened here https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html The reason why taking cluster inode lock at vfs entry points opens up a self deadlock window, is explained in the previous patch of this series. So far, we have seen two different code paths that have this issue. 1. do_sys_open may_open inode_permission ocfs2_permission ocfs2_inode_lock() <=== take PR generic_permission get_acl ocfs2_iop_get_acl ocfs2_inode_lock() <=== take PR 2. fchmod|fchmodat chmod_common notify_change ocfs2_setattr <=== take EX posix_acl_chmod get_acl ocfs2_iop_get_acl <=== take PR ocfs2_iop_set_acl <=== take EX Fixes them by adding the tracking logic (in the previous patch) for these funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(), ocfs2_setattr(). Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lockEric Ren
We are in the situation that we have to avoid recursive cluster locking, but there is no way to check if a cluster lock has been taken by a precess already. Mostly, we can avoid recursive locking by writing code carefully. However, we found that it's very hard to handle the routines that are invoked directly by vfs code. For instance: const struct inode_operations ocfs2_file_iops = { .permission = ocfs2_permission, .get_acl = ocfs2_iop_get_acl, .set_acl = ocfs2_iop_set_acl, }; Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR): do_sys_open may_open inode_permission ocfs2_permission ocfs2_inode_lock() <=== first time generic_permission get_acl ocfs2_iop_get_acl ocfs2_inode_lock() <=== recursive one A deadlock will occur if a remote EX request comes in between two of ocfs2_inode_lock(). Briefly describe how the deadlock is formed: On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of the remote EX lock request. Another hand, the recursive cluster lock (the second one) will be blocked in in __ocfs2_cluster_lock() because of OCFS2_LOCK_BLOCKED. But, the downconvert never complete, why? because there is no chance for the first cluster lock on this node to be unlocked - we block ourselves in the code path. The idea to fix this issue is mostly taken from gfs2 code. 1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track of the processes' pid who has taken the cluster lock of this lock resource; 2. introduce a new flag for ocfs2_inode_lock_full: OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for us if we've got cluster lock. 3. export a helper: ocfs2_is_locked_by_me() is used to check if we have got the cluster lock in the upper code path. The tracking logic should be used by some of the ocfs2 vfs's callbacks, to solve the recursive locking issue cuased by the fact that vfs routines can call into each other. The performance penalty of processing the holder list should only be seen at a few cases where the tracking logic is used, such as get/set acl. You may ask what if the first time we got a PR lock, and the second time we want a EX lock? fortunately, this case never happens in the real world, as far as I can see, including permission check, (get|set)_(acl|attr), and the gfs2 code also do so. [sfr@canb.auug.org.au remove some inlines] Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, dax: change pmd_fault() to take only vmf parameterDave Jiang
pmd_fault() and related functions really only need the vmf parameter since the additional parameters are all included in the vmf struct. Remove the additional parameter and simplify pmd_fault() and friends. Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, dax: make pmd_fault() and friends be the same as fault()Dave Jiang
Instead of passing in multiple parameters in the pmd_fault() handler, a vmf can be passed in just like a fault() handler. This will simplify code and remove the need for the actual pmd fault handlers to allocate a vmf. Related functions are also modified to do the same. [dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off] Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22dax: add tracepoints to dax_pmd_insert_mapping()Ross Zwisler
Add tracepoints to dax_pmd_insert_mapping(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault(). Here is an example PMD fault showing the new tracepoints: big-1504 [001] .... 326.960743: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 big-1504 [001] .... 326.960753: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 big-1504 [001] .... 326.960981: dax_pmd_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10505000 length 0x200000 pfn 0x100600 DEV|MAP radix_entry 0xc000e big-1504 [001] .... 326.960986: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-6-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22dax: add tracepoints to dax_pmd_load_hole()Ross Zwisler
Add tracepoints to dax_pmd_load_hole(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault(). Here is an example PMD fault showing the new tracepoints: read_big-1478 [004] .... 238.242188: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 read_big-1478 [004] .... 238.242191: dax_pmd_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 read_big-1478 [004] .... 238.242390: dax_pmd_load_hole: dev 259:0 ino 0x1003 shared address 0x10400000 zero_page ffffea0002c20000 radix_entry 0x1e read_big-1478 [004] .... 238.242392: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-5-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22dax: add tracepoint infrastructure, PMD tracingRoss Zwisler
Tracepoints are the standard way to capture debugging and tracing information in many parts of the kernel, including the XFS and ext4 filesystems. Create a tracepoint header for FS DAX and add the first DAX tracepoints to the PMD fault handler. This allows the tracing for DAX to be done in the same way as the filesystem tracing so that developers can look at them together and get a coherent idea of what the system is doing. I added both an entry and exit tracepoint because future patches will add tracepoints to child functions of dax_iomap_pmd_fault() like dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to be wrapped by the parent function tracepoints so the code flow is more easily understood. Having entry and exit tracepoints for faults also allows us to easily see what filesystems functions were called during the fault. These filesystem functions get executed via iomap_begin() and iomap_end() calls, for example, and will have their own tracepoints. For PMD faults we primarily want to understand the type of mapping, the fault flags, the faulting address and whether it fell back to 4k faults. If it fell back to 4k faults the tracepoints should let us understand why. I named the new tracepoint header file "fs_dax.h" to allow for device DAX to have its own separate tracing header in the same directory at some point. Here is an example output for these events from a successful PMD fault: big-1441 [005] .... 32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 big-1441 [005] .... 32.582776: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 big-1441 [005] .... 32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-3-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22Btrfs: use the correct type when creating cow dio extentLiu Bo
'BTRFS_ORDERED_REGULAR' was introduced for the cow case in patch 'Btrfs: specify a new ordered extent type for create_io_em', but it missed the directIO cow case. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-02-22Btrfs: fix deadlock between dedup on same file and starting writebackFilipe Manana
If we are deduping two ranges of the same file we need to make sure that we lock all pages in ascending order, that is, lock first the pages from the range with lower offset and then the pages from the other range, as otherwise we can deadlock with a concurrent task that is starting delalloc (writeback). Example trace: [74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds. [74073.053889] Tainted: G W 4.9.0-rc7-btrfs-next-36+ #1 [74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [74073.056696] kworker/u32:10 D 0 17997 2 0x00000000 [74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176) [74073.061370] ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240 [74073.064784] ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000 [74073.068386] ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240 [74073.071712] Call Trace: [74073.072884] [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4 [74073.075395] [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f [74073.077511] [<ffffffff814c18d2>] schedule+0x8c/0xa0 [74073.079440] [<ffffffff814c4b36>] schedule_timeout+0x43/0xff [74073.081637] [<ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14 [74073.083809] [<ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197 [74073.086314] [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32 [74073.100654] [<ffffffff810be048>] ? ktime_get+0x41/0x52 [74073.102619] [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102 [74073.104771] [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102 [74073.106969] [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39 [74073.108954] [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99 [74073.110981] [<ffffffff8112b692>] __lock_page+0x6b/0x6d [74073.112833] [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a [74073.115010] [<ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs] [74073.116999] [<ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs] [74073.119243] [<ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs] [74073.121636] [<ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs] [74073.124229] [<ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs] [74073.126372] [<ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs] [74073.129371] [<ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs] [74073.131440] [<ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs] [74073.134303] [<ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1 [74073.136298] [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs] [74073.138248] [<ffffffff81138200>] do_writepages+0x23/0x2c [74073.139910] [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2 [74073.142003] [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1 [74073.136298] [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs] [74073.138248] [<ffffffff81138200>] do_writepages+0x23/0x2c [74073.139910] [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2 [74073.142003] [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1 [74073.143911] [<ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae [74073.145787] [<ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7 [74073.147452] [<ffffffff811b60cd>] wb_workfn+0x194/0x37d [74073.149084] [<ffffffff811b60cd>] ? wb_workfn+0x194/0x37d [74073.150726] [<ffffffff8106ce77>] ? process_one_work+0x154/0x4e4 [74073.152694] [<ffffffff8106cf96>] process_one_work+0x273/0x4e4 [74073.154452] [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca [74073.156138] [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6 [74073.157837] [<ffffffff81072a81>] kthread+0xd5/0xdd [74073.159339] [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a [74073.161088] [<ffffffff814c6257>] ret_from_fork+0x27/0x40 [74073.162680] INFO: lockdep is turned off. [74073.163855] INFO: task do-dedup:30264 blocked for more than 120 seconds. [74073.181180] Tainted: G W 4.9.0-rc7-btrfs-next-36+ #1 [74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [74073.185296] fdm-stress D 0 30264 29974 0x00000000 [74073.186810] ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00 [74073.188998] ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000 [74073.191070] ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00 [74073.193286] Call Trace: [74073.193990] [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4 [74073.195418] [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f [74073.196796] [<ffffffff814c18d2>] schedule+0x8c/0xa0 [74073.198163] [<ffffffff814c4b36>] schedule_timeout+0x43/0xff [74073.199621] [<ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf [74073.201100] [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32 [74073.202686] [<ffffffff810be048>] ? ktime_get+0x41/0x52 [74073.204051] [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102 [74073.205585] [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102 [74073.207123] [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39 [74073.208238] [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99 [74073.208871] [<ffffffff8112b692>] __lock_page+0x6b/0x6d [74073.209430] [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a [74073.210101] [<ffffffff8112b800>] lock_page+0x2f/0x32 [74073.210636] [<ffffffff8112c502>] pagecache_get_page+0x5e/0x153 [74073.211270] [<ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs] [74073.212166] [<ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs] [74073.213257] [<ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221 [74073.214086] [<ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600 [74073.214767] [<ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d [74073.215619] [<ffffffff811a7953>] ? __fget+0x6b/0x77 [74073.216338] [<ffffffff8119e2d9>] SyS_ioctl+0x57/0x79 [74073.217149] [<ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad [74073.218102] [<ffffffff81109552>] ? time_hardirqs_off+0x9/0x14 [74073.218968] [<ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa [74073.219938] INFO: lockdep is turned off. What happened was the following: CPU 1 CPU 2 btrfs_dedupe_file_range() --> using same inode as source and target --> src range is [768K, 1Mb[ --> dst range is [0, 256K[ btrfs_cmp_data_prepare() --> calls gather_extent_pages() for range [768K, 1Mb[ and locks all pages in that range do_writepages() btrfs_writepages() extent_writepages() extent_write_cache_pages() __extent_writepage() writepage_delalloc() find_lock_delalloc_range() --> finds range [0, 1Mb[ lock_delalloc_pages() --> locks all pages in the range [0, 768K[ --> tries to lock page at offset 768K --> deadlock --> calls gather_extent_pages() to lock pages in the range [0, 256K[ --> deadlock, task at CPU 1 already locked that range and it's trying to lock the range we locked previously So fix this by making sure that during a dedup we always lock first the pages from the range with lower offset. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2017-02-22f2fs: avoid needless checkpoint in f2fs_trim_fsJaegeuk Kim
The f2fs_trim_fs() doesn't need to do checkpoint if there are newly allocated data blocks only which didn't change the critical checkpoint data such as nat and sit entries. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-02-22Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to ↵Trond Myklebust
OP_SEQUENCE" This reverts commit 2cf10cdd486c362f983abdce00dc1127e8ab8c59. The patch has been seen to cause excessive looping. Reported-by: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # 4.10+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-22Merge tag 'driver-core-4.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "small" driver core patches for 4.11-rc1. Not much here, some firmware documentation and self-test updates, a debugfs code formatting issue, and a new feature for call_usermodehelper to make it more robust on systems that want to lock it down in a more secure way. All of these have been linux-next for a while now with no reported issues" * tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: handle null pointers while printing node name and path Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper() Make static usermode helper binaries constant kmod: make usermodehelper path a const string firmware: revamp firmware documentation selftests: firmware: send expected errors to /dev/null selftests: firmware: only modprobe if driver is missing platform: Print the resource range if device failed to claim kref: prefer atomic_inc_not_zero to atomic_add_unless debugfs: improve formatting of debugfs_real_fops()
2017-02-22fuse: release: private_data cannot be NULLMiklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-22fuse: cleanup fuse_file refcountingMiklos Szeredi
struct fuse_file is stored in file->private_data. Make this always be a counting reference for consistency. This also allows fuse_sync_release() to call fuse_file_put() instead of partially duplicating its functionality. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-22fuse: add missing FR_FORCEMiklos Szeredi
fuse_file_put() was missing the "force" flag for the RELEASE request when sending synchronously (fuseblk). If this flag is not set, then a sync request may be interrupted before it is dequeued by the userspace filesystem. In this case the OPEN won't be balanced with a RELEASE. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 5a18ec176c93 ("fuse: fix hang of single threaded fuseblk filesystem") Cc: <stable@vger.kernel.org> # v2.6.38+
2017-02-22NFSv4: Fix reboot recovery in copy offloadTrond Myklebust
Copy offload code needs to be hooked into the code for handling NFS4ERR_BAD_STATEID by ensuring that we set the "stateid" field in struct nfs4_exception. Reported-by: Olga Kornievskaia <aglo@umich.edu> Fixes: 2e72448b07dc3 ("NFS: Add COPY nfs operation") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Highlights: 1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini Varadhan. 2) Simplify classifier state on sk_buff in order to shrink it a bit. From Willem de Bruijn. 3) Introduce SIPHASH and it's usage for secure sequence numbers and syncookies. From Jason A. Donenfeld. 4) Reduce CPU usage for ICMP replies we are going to limit or suppress, from Jesper Dangaard Brouer. 5) Introduce Shared Memory Communications socket layer, from Ursula Braun. 6) Add RACK loss detection and allow it to actually trigger fast recovery instead of just assisting after other algorithms have triggered it. From Yuchung Cheng. 7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot. 8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert. 9) Export MPLS packet stats via netlink, from Robert Shearman. 10) Significantly improve inet port bind conflict handling, especially when an application is restarted and changes it's setting of reuseport. From Josef Bacik. 11) Implement TX batching in vhost_net, from Jason Wang. 12) Extend the dummy device so that VF (virtual function) features, such as configuration, can be more easily tested. From Phil Sutter. 13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric Dumazet. 14) Add new bpf MAP, implementing a longest prefix match trie. From Daniel Mack. 15) Packet sample offloading support in mlxsw driver, from Yotam Gigi. 16) Add new aquantia driver, from David VomLehn. 17) Add bpf tracepoints, from Daniel Borkmann. 18) Add support for port mirroring to b53 and bcm_sf2 drivers, from Florian Fainelli. 19) Remove custom busy polling in many drivers, it is done in the core networking since 4.5 times. From Eric Dumazet. 20) Support XDP adjust_head in virtio_net, from John Fastabend. 21) Fix several major holes in neighbour entry confirmation, from Julian Anastasov. 22) Add XDP support to bnxt_en driver, from Michael Chan. 23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan. 24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi. 25) Support GRO in IPSEC protocols, from Steffen Klassert" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits) Revert "ath10k: Search SMBIOS for OEM board file extension" net: socket: fix recvmmsg not returning error from sock_error bnxt_en: use eth_hw_addr_random() bpf: fix unlocking of jited image when module ronx not set arch: add ARCH_HAS_SET_MEMORY config net: napi_watchdog() can use napi_schedule_irqoff() tcp: Revert "tcp: tcp_probe: use spin_lock_bh()" net/hsr: use eth_hw_addr_random() net: mvpp2: enable building on 64-bit platforms net: mvpp2: switch to build_skb() in the RX path net: mvpp2: simplify MVPP2_PRS_RI_* definitions net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT net: mvpp2: remove unused register definitions net: mvpp2: simplify mvpp2_bm_bufs_add() net: mvpp2: drop useless fields in mvpp2_bm_pool and related code net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue' net: mvpp2: release reference to txq_cpu[] entry after unmapping net: mvpp2: handle too large value in mvpp2_rx_time_coal_set() net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set() net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set ...
2017-02-22pNFS/flexfiles: If the layout is invalid, it must be updated before retryingTrond Myklebust
If we see that our pNFS READ/WRITE/COMMIT operation failed, but we also see that our layout segment is no longer valid, then we need to get a new layout segment before retrying. Fixes: 90816d1ddacf ("NFSv4.1/flexfiles: Don't mark the entire deviceid...") Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21Merge tag 'pstore-v4.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "Minor changes to pstore tree: - update MAINTAINERS with current git repo, add more files. - move prz allocation checks into the walker - initialize flags correctly (by accident spinlock was technically ok)" * tag 'pstore-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: MAINTAINERS: Adjust pstore git repo URI, add files pstore: Check for prz allocation in walker pstore: Correctly initialize spinlock and flags
2017-02-21NFSv4: Clean up owner/group attribute decodeTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21NFSv4: Remove bogus "struct nfs_client" argument from decode_ace()Trond Myklebust
We shouldn't need to force callers to carry an unused argument. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21NFSv4: Fix the underestimation of delegation XDR space reservationTrond Myklebust
Account for the "space_limit" field in struct open_write_delegation4. Fixes: 2cebf82883f4 ("NFSv4: Fix the underestimate of NFSv4 open request size") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21NFSv4: Replace callback string decode function with a genericTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21NFSv4: Replace the open coded decode_opaque_inline() with the new genericTrond Myklebust
Also ensure that we always check that the size of the decoded object matches the expectation that it must be smaller than NFS4_OPAQUE_LIMIT. This should be true for all the current users of decode_opaque_inline(), including decode_ace(), decode_pathname(), decode_attr_fs_locations() and decode_exchange_id(). Note that this allows us to get rid of a number of existing checks in decode_exchange_id(), Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21NFSv4: Replace ad-hoc xdr encode/decode helpers with xdr_stream_* genericsTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "Highlights: - major AppArmor update: policy namespaces & lots of fixes - add /sys/kernel/security/lsm node for easy detection of loaded LSMs - SELinux cgroupfs labeling support - SELinux context mounts on tmpfs, ramfs, devpts within user namespaces - improved TPM 2.0 support" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (117 commits) tpm: declare tpm2_get_pcr_allocation() as static tpm: Fix expected number of response bytes of TPM1.2 PCR Extend tpm xen: drop unneeded chip variable tpm: fix misspelled "facilitate" in module parameter description tpm_tis: fix the error handling of init_tis() KEYS: Use memzero_explicit() for secret data KEYS: Fix an error code in request_master_key() sign-file: fix build error in sign-file.c with libressl selinux: allow changing labels for cgroupfs selinux: fix off-by-one in setprocattr tpm: silence an array overflow warning tpm: fix the type of owned field in cap_t tpm: add securityfs support for TPM 2.0 firmware event log tpm: enhance read_log_of() to support Physical TPM event log tpm: enhance TPM 2.0 PCR extend to support multiple banks tpm: implement TPM 2.0 capability to get active PCR banks tpm: fix RC value check in tpm2_seal_trusted tpm_tis: fix iTPM probe via probe_itpm() function tpm: Begin the process to deprecate user_read_timer tpm: remove tpm_read_index and tpm_write_index from tpm.h ...
2017-02-21kernfs: fix locking around kernfs_ops->release() callbackTejun Heo
The release callback may be called from two places - file release operation and kernfs open file draining. kernfs_open_file->mutex is used to synchronize the two callsites. This unfortunately leads to possible circular locking because of->mutex is used to protect the usual kernfs operations which may use locking constructs which are held while removing and thus draining kernfs files. @of->mutex is for synchronizing concurrent kernfs access operations and all we need here is synchronization between the releaes and drain paths. As the drain path has to grab kernfs_open_file_mutex anyway, let's use the mutex to synchronize the release operation instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Tony Lindgren <tony@atomide.com> Fixes: 0e67db2f9fe9 ("kernfs: add kernfs_ops->open/release() callbacks") Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-21block: Revalidate i_bdev reference in bd_aquire()Jan Kara
When a device gets removed, block device inode unhashed so that it is not used anymore (bdget() will not find it anymore). Later when a new device gets created with the same device number, we create new block device inode. However there may be file system device inodes whose i_bdev still points to the original block device inode and thus we get two active block device inodes for the same device. They will share the same gendisk so the only visible differences will be that page caches will not be coherent and BDIs will be different (the old block device inode still points to unregistered BDI). Fix the problem by checking in bd_acquire() whether i_bdev still points to active block device inode and re-lookup the block device if not. That way any open of a block device happening after the old device has been removed will get correct block device inode. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22proc/sysctl: Don't grab i_lock under sysctl_lock.Eric W. Biederman
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > This patch has locking problem. I've got lockdep splat under LTP. > > [ 6633.115456] ====================================================== > [ 6633.115502] [ INFO: possible circular locking dependency detected ] > [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L > [ 6633.115584] ------------------------------------------------------- > [ 6633.115627] ksm02/284980 is trying to acquire lock: > [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80 > [ 6633.115834] but task is already holding lock: > [ 6633.115882] (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110 > [ 6633.116026] which lock already depends on the new lock. > [ 6633.116026] > [ 6633.116080] > [ 6633.116080] the existing dependency chain (in reverse order) is: > [ 6633.116117] > -> #2 (sysctl_lock){+.+...}: > -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}: > -> #0 (&sb->s_type->i_lock_key#4){+.+...}: > > d_lock nests inside i_lock > sysctl_lock nests inside d_lock in d_compare > > This patch adds i_lock nesting inside sysctl_lock. Al Viro <viro@ZenIV.linux.org.uk> replied: > Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd > try something like this - use rcu_read_lock() in proc_sys_prune_dcache(), > drop sysctl_lock() before it and regain after. Make sure that no inodes > are added to the list ones ->unregistering has been set and use RCU list > primitives for modifying the inode list, with sysctl_lock still used to > serialize its modifications. > > Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing > igrab() is safe there. Since we don't drop inode reference until after we'd > passed beyond it in the list, list_for_each_entry_rcu() should be fine. I agree with Al Viro's analsysis of the situtation. Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-02-21Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: - blk-mq scheduling framework from me and Omar, with a port of the deadline scheduler for this framework. A port of BFQ from Paolo is in the works, and should be ready for 4.12. - Various fixups and improvements to the above scheduling framework from Omar, Paolo, Bart, me, others. - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar. This allows us to export more information that helps debug hangs or performance issues, without cluttering or abusing the sysfs API. - Fixes for the sbitmap code, the scalable bitmap code that was migrated from blk-mq, from Omar. - Removal of the BLOCK_PC support in struct request, and refactoring of carrying SCSI payloads in the block layer. This cleans up the code nicely, and enables us to kill the SCSI specific parts of struct request, shrinking it down nicely. From Christoph mainly, with help from Hannes. - Support for ranged discard requests and discard merging, also from Christoph. - Support for OPAL in the block layer, and for NVMe as well. Mainly from Scott Bauer, with fixes/updates from various others folks. - Error code fixup for gdrom from Christophe. - cciss pci irq allocation cleanup from Christoph. - Making the cdrom device operations read only, from Kees Cook. - Fixes for duplicate bdi registrations and bdi/queue life time problems from Jan and Dan. - Set of fixes and updates for lightnvm, from Matias and Javier. - A few fixes for nbd from Josef, using idr to name devices and a workqueue deadlock fix on receive. Also marks Josef as the current maintainer of nbd. - Fix from Josef, overwriting queue settings when the number of hardware queues is updated for a blk-mq device. - NVMe fix from Keith, ensuring that we don't repeatedly mark and IO aborted, if we didn't end up aborting it. - SG gap merging fix from Ming Lei for block. - Loop fix also from Ming, fixing a race and crash between setting loop status and IO. - Two block race fixes from Tahsin, fixing request list iteration and fixing a race between device registration and udev device add notifiations. - Double free fix from cgroup writeback, from Tejun. - Another double free fix in blkcg, from Hou Tao. - Partition overflow fix for EFI from Alden Tondettar. * tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits) nvme: Check for Security send/recv support before issuing commands. block/sed-opal: allocate struct opal_dev dynamically block/sed-opal: tone down not supported warnings block: don't defer flushes on blk-mq + scheduling blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers blk-mq: don't special case flush inserts for blk-mq-sched blk-mq-sched: don't add flushes to the head of requeue queue blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not block: do not allow updates through sysfs until registration completes lightnvm: set default lun range when no luns are specified lightnvm: fix off-by-one error on target initialization Maintainers: Modify SED list from nvme to block Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN uapi: sed-opal fix IOW for activate lsp to use correct struct cdrom: Make device operations read-only elevator: fix loading wrong elevator type for blk-mq devices cciss: switch to pci_irq_alloc_vectors block/loop: fix race between I/O and set_status blk-mq-sched: don't hold queue_lock when calling exit_icq block: set make_request_fn manually in blk_mq_update_nr_hw_queues ...
2017-02-21Merge tag 'gfs2-4.11.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Robert Peterson: "We've got eight GFS2 patches for this merge window: - Andy Price submitted a patch to make gfs2_write_full_page a static function. - Dan Carpenter submitted a patch to fix a ERR_PTR thinko. Three patches fix bugs related to deleting very large files, which cause GFS2 to run out of journal space: - The first one prevents GFS2 delete operation from requesting too much journal space. - The second one fixes a problem whereby GFS2 can hang because it wasn't taking journal space demand into its calculations. - The third one wakes up IO waiters when a flush is done to restart processes stuck waiting for journal space to become available. The final three patches are a performance improvement related to spin_lock contention between multiple writers: - The "tr_touched" variable was switched to a flag to be more atomic and eliminate the possibility of some races. - Function meta_lo_add was moved inline with its only caller to make the code more readable and efficient. - Contention on the gfs2_log_lock spinlock was greatly reduced by avoiding the lock altogether in cases where we don't really need it: buffers that already appear in the appropriate metadata list for the journal. Many thanks to Steve Whitehouse for the ideas and principles behind these patches" * tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Make gfs2_write_full_page static GFS2: Reduce contention on gfs2_log_lock GFS2: Inline function meta_lo_add GFS2: Switch tr_touched to flag in transaction GFS2: Wake up io waiters whenever a flush is done GFS2: Made logd daemon take into account log demand GFS2: Limit number of transaction blocks requested for truncates GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next
2017-02-21Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF fixes and cleanups from Jan Kara: "Several small UDF fixes and cleanups and a small cleanup of fanotify code" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: simplify the code of fanotify_merge udf: simplify udf_ioctl() udf: fix ioctl errors udf: allow implicit blocksize specification during mount udf: check partition reference in udf_read_inode() udf: atomically read inode size udf: merge module informations in super.c udf: remove next_epos from udf_update_extent_cache() udf: Factor out trimming of crtime udf: remove empty condition udf: remove unneeded line break udf: merge bh free udf: use pointer for kernel_long_ad argument udf: use __packed instead of __attribute__ ((packed)) udf: Make stat on symlink report symlink length as st_size fs/udf: make #ifdef UDF_PREALLOCATE unconditional fs: udf: Replace CURRENT_TIME with current_time()
2017-02-21nfsd: special case truncates some moreChristoph Hellwig
Both the NFS protocols and the Linux VFS use a setattr operation with a bitmap of attributes to set to set various file attributes including the file size and the uid/gid. The Linux syscalls never mix size updates with unrelated updates like the uid/gid, and some file systems like XFS and GFS2 rely on the fact that truncates don't update random other attributes, and many other file systems handle the case but do not update the other attributes in the same transaction. NFSD on the other hand passes the attributes it gets on the wire more or less directly through to the VFS, leading to updates the file systems don't expect. XFS at least has an assert on the allowed attributes, which caught an unusual NFS client setting the size and group at the same time. To handle this issue properly this splits the notify_change call in nfsd_setattr into two separate ones. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-20Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS/SMB3 updates from Steve French: "Includes support for a critical SMB3 security feature: per-share encryption from Pavel, and a cleanup from Jean Delvare. Will have another cifs/smb3 merge next week" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Allow to switch on encryption with seal mount option CIFS: Add capability to decrypt big read responses CIFS: Decrypt and process small encrypted packets CIFS: Add copy into pages callback for a read operation CIFS: Add mid handle callback CIFS: Add transform header handling callbacks CIFS: Encrypt SMB3 requests before sending CIFS: Enable encryption during session setup phase CIFS: Add capability to transform requests before sending CIFS: Separate RFC1001 length processing for SMB2 read CIFS: Separate SMB2 sync header processing CIFS: Send RFC1001 length in a separate iov CIFS: Make send_cancel take rqst as argument CIFS: Make SendReceive2() takes resp iov CIFS: Separate SMB2 header structure CIFS: Fix splice read for non-cached files cifs: Add soft dependencies cifs: Only select the required crypto modules cifs: Simplify SMB2 and SMB311 dependencies
2017-02-20Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "For this cycle we add support for the shutdown ioctl, which is primarily used for testing, but which can be useful on production systems when a scratch volume is being destroyed and the data on it doesn't need to be saved. This found (and we fixed) a number of bugs with ext4's recovery to corrupted file system --- the bugs increased the amount of data that could be potentially lost, and in the case of the inline data feature, could cause the kernel to BUG. Also included are a number of other bug fixes, including in ext4's fscrypt, DAX, inline data support" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits) ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN ext4: fix fencepost in s_first_meta_bg validation ext4: don't BUG when truncating encrypted inodes on the orphan list ext4: do not use stripe_width if it is not set ext4: fix stripe-unaligned allocations dax: assert that i_rwsem is held exclusive for writes ext4: fix DAX write locking ext4: add EXT4_IOC_GOINGDOWN ioctl ext4: add shutdown bit and check for it ext4: rename s_resize_flags to s_ext4_flags ext4: return EROFS if device is r/o and journal replay is needed ext4: preserve the needs_recovery flag when the journal is aborted jbd2: don't leak modified metadata buffers on an aborted journal ext4: fix inline data error paths ext4: move halfmd4 into hash.c directly ext4: fix use-after-iput when fscrypt contexts are inconsistent jbd2: fix use after free in kjournald2() ext4: fix data corruption in data=journal mode ext4: trim allocation requests to group size ext4: replace BUG_ON with WARN_ON in mb_find_extent() ...
2017-02-20Merge tag 'fscrypt-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Various cleanups for the file system encryption feature" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: constify struct fscrypt_operations fscrypt: properly declare on-stack completion fscrypt: split supp and notsupp declarations into their own headers fscrypt: remove redundant assignment of res fscrypt: make fscrypt_operations.key_prefix a string fscrypt: remove unused 'mode' member of fscrypt_ctx ext4: don't allow encrypted operations without keys fscrypt: make test_dummy_encryption require a keyring key fscrypt: factor out bio specific functions fscrypt: pass up error codes from ->get_context() fscrypt: remove user-triggerable warning messages fscrypt: use EEXIST when file already uses different policy fscrypt: use ENOTDIR when setting encryption policy on nondirectory fscrypt: use ENOKEY when file cannot be created w/o key
2017-02-20nfsd: minor nfsd_setattr cleanupChristoph Hellwig
Simplify exit paths, size_change use. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-20nfsd: merge stable fix into main nfsd branchJ. Bruce Fields