summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2015-09-04userfaultfd: require UFFDIO_API before other ioctlsAndrea Arcangeli
UFFDIO_API was already forced before read/poll could work. This makes the code more strict to force it also for all other ioctls. All users would already have been required to call UFFDIO_API before invoking other ioctls but this makes it more explicit. This will ensure we can change all ioctls (all but UFFDIO_API/struct uffdio_api) with a bump of uffdio_api.api. There's no actual plan or need to change the API or the ioctl, the current API already should cover fine even the non cooperative usage, but this is just for the longer term future just in case. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGEAndrea Arcangeli
These two ioctl allows to either atomically copy or to map zeropages into the virtual address space. This is used by the thread that opened the userfaultfd to resolve the userfaults. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: avoid mmap_sem read recursion in mcopy_atomicAndrea Arcangeli
If the rwsem starves writers it wasn't strictly a bug but lockdep doesn't like it and this avoids depending on lowlevel implementation details of the lock. [akpm@linux-foundation.org: delete weird BUILD_BUG_ON()] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE ↵Andrea Arcangeli
preparation This implements mcopy_atomic and mfill_zeropage that are the lowlevel VM methods that are invoked respectively by the UFFDIO_COPY and UFFDIO_ZEROPAGE userfaultfd commands. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPIAndrea Arcangeli
This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: activate syscallAndrea Arcangeli
This activates the userfaultfd syscall. [sfr@canb.auug.org.au: activate syscall fix] [akpm@linux-foundation.org: don't enable userfaultfd on powerpc] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: buildsystem activationAndrea Arcangeli
This allows to select the userfaultfd during configuration to build it. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and readAndrea Arcangeli
Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and userfaultfd_read if they are run on different threads simultaneously. Until now qemu solved the race in userland: the race was explicitly and intentionally left for userland to solve. However we can also solve it in kernel. Requiring all users to solve this race if they use two threads (one for the background transfer and one for the userfault reads) isn't very attractive from an API prospective, furthermore this allows to remove a whole bunch of mutex and bitmap code from qemu, making it faster. The cost of __get_user_pages_fast should be insignificant considering it scales perfectly and the pagetables are already hot in the CPU cache, compared to the overhead in userland to maintain those structures. Applying this patch is backwards compatible with respect to the userfaultfd userland API, however reverting this change wouldn't be backwards compatible anymore. Without this patch qemu in the background transfer thread, has to read the old state, and do UFFDIO_WAKE if old_state is missing but it become REQUESTED by the time it tries to set it to RECEIVED (signaling the other side received an userfault). vcpu background_thr userfault_thr ----- ----- ----- vcpu0 handle_mm_fault() postcopy_place_page read old_state -> MISSING UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending) vcpu0 fault at 0x7fb76a139000 enters handle_userfault poll() is kicked poll() -> POLLIN read() -> 0x7fb76a139000 postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED /* check that no userfault raced with UFFDIO_COPY */ if (old_state == MISSING && tmp_state == REQUESTED) UFFDIO_WAKE from background thread And a second case where a UFFDIO_WAKE would be needed is in the userfault thread: vcpu background_thr userfault_thr ----- ----- ----- vcpu0 handle_mm_fault() postcopy_place_page read old_state -> MISSING UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending) tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED vcpu0 fault at 0x7fb76a139000 enters handle_userfault poll() is kicked poll() -> POLLIN read() -> 0x7fb76a139000 if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED) UFFDIO_WAKE from userfault thread This patch removes the need of both UFFDIO_WAKE and of the associated per-page tristate as well. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: allocate the userfaultfd_ctx cacheline alignedAndrea Arcangeli
Use proper slab to guarantee alignment. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: optimize read() and poll() to be O(1)Andrea Arcangeli
This makes read O(1) and poll that was already O(1) becomes lockless. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: wake pending userfaultsAndrea Arcangeli
This is an optimization but it's a userland visible one and it affects the API. The downside of this optimization is that if you call poll() and you get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault may be waken by a different thread, before read(ufd) comes around. This in short means that poll() isn't really usable if the userfaultfd is opened in blocking mode. userfaults won't wait in "pending" state to be read anymore and any UFFDIO_WAKE or similar operations that has the objective of waking userfaults after their resolution, will wake all blocked userfaults for the resolved range, including those that haven't been read() by userland yet. The behavior of poll() becomes not standard, but this obviates the need of "spurious" UFFDIO_WAKE and it lets the userland threads to restart immediately without requiring an UFFDIO_WAKE. This is even more significant in case of repeated faults on the same address from multiple threads. This optimization is justified by the measurement that the number of spurious UFFDIO_WAKE accounts for 5% and 10% of the total userfaults for heavy workloads, so it's worth optimizing those away. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: change the read API to return a uffd_msgAndrea Arcangeli
I had requests to return the full address (not the page aligned one) to userland. It's not entirely clear how the page offset could be relevant because userfaults aren't like SIGBUS that can sigjump to a different place and it actually skip resolving the fault depending on a page offset. There's currently no real way to skip the fault especially because after a UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the kernel without having to return to userland first (not even self modifying code replacing the .text that touched the faulting address would prevent the fault to be repeated). Userland cannot skip repeating the fault even more so if the fault was triggered by a KVM secondary page fault or any get_user_pages or any copy-user inside some syscall which will return to kernel code. The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading to a SIGBUS being raised because the userfault can't wait if it cannot release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for that). Still returning userland a proper structure during the read() on the uffd, can allow to use the current UFFD_API for the future non-cooperative extensions too and it looks cleaner as well. Once we get additional fields there's no point to return the fault address page aligned anymore to reuse the bits below PAGE_SHIFT. The only downside is that the read() syscall will read 32bytes instead of 8bytes but that's not going to be measurable overhead. The total number of new events that can be extended or of new future bits for already shipped events, is limited to 64 by the features field of the uffdio_api structure. If more will be needed a bump of UFFD_API will be required. [akpm@linux-foundation.org: use __packed] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: Rename uffd_api.bits into .featuresPavel Emelyanov
This is (seems to be) the minimal thing that is required to unblock standard uffd usage from the non-cooperative one. Now more bits can be added to the features field indicating e.g. UFFD_FEATURE_FORK and others needed for the latter use-case. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: add new syscall to provide memory externalizationAndrea Arcangeli
Once an userfaultfd has been created and certain region of the process virtual address space have been registered into it, the thread responsible for doing the memory externalization can manage the page faults in userland by talking to the kernel using the userfaultfd protocol. poll() can be used to know when there are new pending userfaults to be read (POLLIN). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: prevent khugepaged to merge if userfaultfd is armedAndrea Arcangeli
If userfaultfd is armed on a certain vma we can't "fill" the holes with zeroes or we'll break the userland on demand paging. The holes if the userfault is armed, are really missing information (not zeroes) that the userland has to load from network or elsewhere. The same issue happens for wrprotected ptes that we can't just convert into a single writable pmd_trans_huge. We could however in theory still merge across zeropages if only VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be slightly improved but it'd be much more complex code for a tiny corner case. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctxAndrea Arcangeli
vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge must be aware about so that we can merge vmas back like they were originally before arming the userfaultfd on some memory range. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: call handle_userfault() for userfaultfd_missing() faultsAndrea Arcangeli
This is where the page faults must be modified to call handle_userfault() if userfaultfd_missing() is true (so if the vma->vm_flags had VM_UFFD_MISSING set). handle_userfault() then takes care of blocking the page fault and delivering it to userland. The fault flags must also be passed as parameter so the "read|write" kind of fault can be passed to userland. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WPAndrea Arcangeli
These two flags gets set in vma->vm_flags to tell the VM common code if the userfaultfd is armed and in which mode (only tracking missing faults, only tracking wrprotect faults or both). If neither flags is set it means the userfaultfd is not armed on the vma. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: add vm_userfaultfd_ctx to the vm_area_structAndrea Arcangeli
This adds the vm_userfaultfd_ctx to the vm_area_struct. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: linux/userfaultfd_k.hAndrea Arcangeli
Kernel header defining the methods needed by the VM common code to interact with the userfaultfd. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: uAPIAndrea Arcangeli
Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_keyAndrea Arcangeli
userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead of the current hardcoded 1 (that would wake just the first waitqueue in the head list). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: linux/Documentation/vm/userfaultfd.txtAndrea Arcangeli
This is the latest userfaultfd patchset. The postcopy live migration feature on the qemu side is mostly ready to be merged and it entirely depends on the userfaultfd syscall to be merged as well. So it'd be great if this patchset could be reviewed for merging in -mm. Userfaults allow to implement on demand paging from userland and more generally they allow userland to more efficiently take control of the behavior of page faults than what was available before (PROT_NONE + SIGSEGV trap). The use cases are: 1) KVM postcopy live migration (one form of cloud memory externalization). KVM postcopy live migration is the primary driver of this work: http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/ http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html 2) postcopy live migration of binaries inside linux containers: http://thread.gmane.org/gmane.linux.kernel.mm/132662 3) KVM postcopy live snapshotting (allowing to limit/throttle the memory usage, unlike fork would, plus the avoidance of fork overhead in the first place). While the wrprotect tracking is not implemented yet, the syscall API is already contemplating the wrprotect fault tracking and it's generic enough to allow its later implementation in a backwards compatible fashion. 4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method should be extended to work also on tmpfs and then the uffdio_register.ioctls will notify userland that UFFDIO_COPY is available even when the registered virtual memory range is tmpfs backed. 5) alternate mechanism to notify web browsers or apps on embedded devices that volatile pages have been reclaimed. This basically avoids the need to run a syscall before the app can access with the CPU the virtual regions marked volatile. This depends on point 4) to be fulfilled first, as volatile pages happily apply to tmpfs. Even though there wasn't a real use case requesting it yet, it also allows to implement distributed shared memory in a way that readonly shared mappings can exist simultaneously in different hosts and they can be become exclusive at the first wrprotect fault. This patch (of 22): Add documentation. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04mm/slab.h: fix argument order in cache_from_obj's error messageDaniel Borkmann
While debugging a networking issue, I hit a condition that triggered an object to be freed into the wrong kmem cache, and thus triggered the warning in cache_from_obj(). The arguments in the error message are in wrong order: the location of the object's kmem cache is in cachep, not s. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04mm/slub: don't wait for high-order page allocationJoonsoo Kim
Description is almost copied from commit fb05e7a89f50 ("net: don't wait for order-3 page allocation"). I saw excessive direct memory reclaim/compaction triggered by slub. This causes performance issues and add latency. Slub uses high-order allocation to reduce internal fragmentation and management overhead. But, direct memory reclaim/compaction has high overhead and the benefit of high-order allocation can't compensate the overhead of both work. This patch makes auxiliary high-order allocation atomic. If there is no memory pressure and memory isn't fragmented, the alloction will still success, so we don't sacrifice high-order allocation's benefit here. If the atomic allocation fails, direct memory reclaim/compaction will not be triggered, allocation fallback to low-order immediately, hence the direct memory reclaim/compaction overhead is avoided. In the allocation failure case, kswapd is waken up and trying to make high-order freepages, so allocation could success next time. Following is the test to measure effect of this patch. System: QEMU, CPU 8, 512 MB Mem: 25% memory is allocated at random position to make fragmentation. Memory-hogger occupies 150 MB memory. Workload: hackbench -g 20 -l 1000 Average result by 10 runs (Base va Patched) elapsed_time(s): 4.3468 vs 2.9838 compact_stall: 461.7 vs 73.6 pgmigrate_success: 28315.9 vs 7256.1 Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04mm/slub: fix slab double-free in case of duplicate sysfs filenameKonstantin Khlebnikov
sysfs_slab_add() shouldn't call kobject_put at error path: this puts last reference of kmem-cache kobject and frees it. Kmem cache will be freed second time at error path in kmem_cache_create(). For example this happens when slub debug was enabled in runtime and somebody creates new kmem cache: # echo 1 | tee /sys/kernel/slab/*/sanity_checks # modprobe configfs "configfs_dir_cache" cannot be merged because existing slab have debug and cannot create new slab because unique name ":t-0000096" already taken. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04mm/slub: move slab initialization into irq enabled regionThomas Gleixner
Initializing a new slab can introduce rather large latencies because most of the initialization runs always with interrupts disabled. There is no point in doing so. The newly allocated slab is not visible yet, so there is no reason to protect it against concurrent alloc/free. Move the expensive parts of the initialization into allocate_slab(), so for all allocations with GFP_WAIT set, interrupts are enabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04slub: add support for kmem_cache_debug in bulk callsJesper Dangaard Brouer
Per request of Joonsoo Kim adding kmem debug support. I've tested that when debugging is disabled, then there is almost no performance impact as this code basically gets removed by the compiler. Need some guidance in enabling and testing this. bulk- PREVIOUS - THIS-PATCH 1 - 43 cycles(tsc) 10.811 ns - 44 cycles(tsc) 11.236 ns improved -2.3% 2 - 27 cycles(tsc) 6.867 ns - 28 cycles(tsc) 7.019 ns improved -3.7% 3 - 21 cycles(tsc) 5.496 ns - 22 cycles(tsc) 5.526 ns improved -4.8% 4 - 24 cycles(tsc) 6.038 ns - 19 cycles(tsc) 4.786 ns improved 20.8% 8 - 17 cycles(tsc) 4.280 ns - 18 cycles(tsc) 4.572 ns improved -5.9% 16 - 17 cycles(tsc) 4.483 ns - 18 cycles(tsc) 4.658 ns improved -5.9% 30 - 18 cycles(tsc) 4.531 ns - 18 cycles(tsc) 4.568 ns improved 0.0% 32 - 58 cycles(tsc) 14.586 ns - 65 cycles(tsc) 16.454 ns improved -12.1% 34 - 53 cycles(tsc) 13.391 ns - 63 cycles(tsc) 15.932 ns improved -18.9% 48 - 65 cycles(tsc) 16.268 ns - 50 cycles(tsc) 12.506 ns improved 23.1% 64 - 53 cycles(tsc) 13.440 ns - 63 cycles(tsc) 15.929 ns improved -18.9% 128 - 79 cycles(tsc) 19.899 ns - 86 cycles(tsc) 21.583 ns improved -8.9% 158 - 90 cycles(tsc) 22.732 ns - 90 cycles(tsc) 22.552 ns improved 0.0% 250 - 95 cycles(tsc) 23.916 ns - 98 cycles(tsc) 24.589 ns improved -3.2% Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04slub: initial bulk free implementationJesper Dangaard Brouer
This implements SLUB specific kmem_cache_free_bulk(). SLUB allocator now both have bulk alloc and free implemented. Choose to reenable local IRQs while calling slowpath __slab_free(). In worst case, where all objects hit slowpath call, the performance should still be faster than fallback function __kmem_cache_free_bulk(), because local_irq_{disable+enable} is very fast (7-cycles), while the fallback invokes this_cpu_cmpxchg() which is slightly slower (9-cycles). Nitpicking, this should be faster for N>=4, due to the entry cost of local_irq_{disable+enable}. Do notice that the save+restore variant is very expensive, this is key to why this optimization works. CPU: i7-4790K CPU @ 4.00GHz * local_irq_{disable,enable}: 7 cycles(tsc) - 1.821 ns * local_irq_{save,restore} : 37 cycles(tsc) - 9.443 ns Measurements on CPU CPU i7-4790K @ 4.00GHz Baseline normal fastpath (alloc+free cost): 43 cycles(tsc) 10.834 ns Bulk- fallback - this-patch 1 - 58 cycles(tsc) 14.542 ns - 43 cycles(tsc) 10.811 ns improved 25.9% 2 - 50 cycles(tsc) 12.659 ns - 27 cycles(tsc) 6.867 ns improved 46.0% 3 - 48 cycles(tsc) 12.168 ns - 21 cycles(tsc) 5.496 ns improved 56.2% 4 - 47 cycles(tsc) 11.987 ns - 24 cycles(tsc) 6.038 ns improved 48.9% 8 - 46 cycles(tsc) 11.518 ns - 17 cycles(tsc) 4.280 ns improved 63.0% 16 - 45 cycles(tsc) 11.366 ns - 17 cycles(tsc) 4.483 ns improved 62.2% 30 - 45 cycles(tsc) 11.433 ns - 18 cycles(tsc) 4.531 ns improved 60.0% 32 - 75 cycles(tsc) 18.983 ns - 58 cycles(tsc) 14.586 ns improved 22.7% 34 - 71 cycles(tsc) 17.940 ns - 53 cycles(tsc) 13.391 ns improved 25.4% 48 - 80 cycles(tsc) 20.077 ns - 65 cycles(tsc) 16.268 ns improved 18.8% 64 - 71 cycles(tsc) 17.799 ns - 53 cycles(tsc) 13.440 ns improved 25.4% 128 - 91 cycles(tsc) 22.980 ns - 79 cycles(tsc) 19.899 ns improved 13.2% 158 - 100 cycles(tsc) 25.241 ns - 90 cycles(tsc) 22.732 ns improved 10.0% 250 - 102 cycles(tsc) 25.583 ns - 95 cycles(tsc) 23.916 ns improved 6.9% Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04slub: improve bulk alloc strategyJesper Dangaard Brouer
Call slowpath __slab_alloc() from within the bulk loop, as the side-effect of this call likely repopulates c->freelist. Choose to reenable local IRQs while calling slowpath. Saving some optimizations for later. E.g. it is possible to extract parts of __slab_alloc() and avoid the unnecessary and expensive (37 cycles) local_irq_{save,restore}. For now, be happy calling __slab_alloc() this lower icache impact of this func and I don't have to worry about correctness. Measurements on CPU CPU i7-4790K @ 4.00GHz Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns Bulk- fallback - this-patch 1 - 58 cycles(tsc) 14.516 ns - 49 cycles(tsc) 12.459 ns improved 15.5% 2 - 51 cycles(tsc) 12.930 ns - 38 cycles(tsc) 9.605 ns improved 25.5% 3 - 49 cycles(tsc) 12.274 ns - 34 cycles(tsc) 8.525 ns improved 30.6% 4 - 48 cycles(tsc) 12.058 ns - 32 cycles(tsc) 8.036 ns improved 33.3% 8 - 46 cycles(tsc) 11.609 ns - 31 cycles(tsc) 7.756 ns improved 32.6% 16 - 45 cycles(tsc) 11.451 ns - 32 cycles(tsc) 8.148 ns improved 28.9% 30 - 79 cycles(tsc) 19.865 ns - 68 cycles(tsc) 17.164 ns improved 13.9% 32 - 76 cycles(tsc) 19.212 ns - 66 cycles(tsc) 16.584 ns improved 13.2% 34 - 74 cycles(tsc) 18.600 ns - 63 cycles(tsc) 15.954 ns improved 14.9% 48 - 88 cycles(tsc) 22.092 ns - 77 cycles(tsc) 19.373 ns improved 12.5% 64 - 80 cycles(tsc) 20.043 ns - 68 cycles(tsc) 17.188 ns improved 15.0% 128 - 99 cycles(tsc) 24.818 ns - 89 cycles(tsc) 22.404 ns improved 10.1% 158 - 99 cycles(tsc) 24.977 ns - 92 cycles(tsc) 23.089 ns improved 7.1% 250 - 106 cycles(tsc) 26.552 ns - 99 cycles(tsc) 24.785 ns improved 6.6% Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04slub bulk alloc: extract objects from the per cpu slabJesper Dangaard Brouer
First piece: acceleration of retrieval of per cpu objects If we are allocating lots of objects then it is advantageous to disable interrupts and avoid the this_cpu_cmpxchg() operation to get these objects faster. Note that we cannot do the fast operation if debugging is enabled, because we would have to add extra code to do all the debugging checks. And it would not be fast anyway. Note also that the requirement of having interrupts disabled avoids having to do processor flag operations. Allocate as many objects as possible in the fast way and then fall back to the generic implementation for the rest of the objects. Measurements on CPU CPU i7-4790K @ 4.00GHz Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.554 ns Bulk- fallback - this-patch 1 - 57 cycles(tsc) 14.432 ns - 48 cycles(tsc) 12.155 ns improved 15.8% 2 - 50 cycles(tsc) 12.746 ns - 37 cycles(tsc) 9.390 ns improved 26.0% 3 - 48 cycles(tsc) 12.180 ns - 33 cycles(tsc) 8.417 ns improved 31.2% 4 - 48 cycles(tsc) 12.015 ns - 32 cycles(tsc) 8.045 ns improved 33.3% 8 - 46 cycles(tsc) 11.526 ns - 30 cycles(tsc) 7.699 ns improved 34.8% 16 - 45 cycles(tsc) 11.418 ns - 32 cycles(tsc) 8.205 ns improved 28.9% 30 - 80 cycles(tsc) 20.246 ns - 73 cycles(tsc) 18.328 ns improved 8.8% 32 - 79 cycles(tsc) 19.946 ns - 72 cycles(tsc) 18.208 ns improved 8.9% 34 - 78 cycles(tsc) 19.659 ns - 71 cycles(tsc) 17.987 ns improved 9.0% 48 - 86 cycles(tsc) 21.516 ns - 82 cycles(tsc) 20.566 ns improved 4.7% 64 - 93 cycles(tsc) 23.423 ns - 89 cycles(tsc) 22.480 ns improved 4.3% 128 - 100 cycles(tsc) 25.170 ns - 99 cycles(tsc) 24.871 ns improved 1.0% 158 - 102 cycles(tsc) 25.549 ns - 101 cycles(tsc) 25.375 ns improved 1.0% 250 - 101 cycles(tsc) 25.344 ns - 100 cycles(tsc) 25.182 ns improved 1.0% Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04slab: infrastructure for bulk object allocation and freeingChristoph Lameter
Add the basic infrastructure for alloc/free operations on pointer arrays. It includes a generic function in the common slab code that is used in this infrastructure patch to create the unoptimized functionality for slab bulk operations. Allocators can then provide optimized allocation functions for situations in which large numbers of objects are needed. These optimization may avoid taking locks repeatedly and bypass metadata creation if all objects in slab pages can be used to provide the objects required. Allocators can extend the skeletons provided and add their own code to the bulk alloc and free functions. They can keep the generic allocation and freeing and just fall back to those if optimizations would not work (like for example when debugging is on). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04slub: fix spelling succedd to succeedJesper Dangaard Brouer
With this patchset the SLUB allocator now has both bulk alloc and free implemented. This patchset mostly optimizes the "fastpath" where objects are available on the per CPU fastpath page. This mostly amortize the less-heavy none-locked cmpxchg_double used on fastpath. The "fallback" bulking (e.g __kmem_cache_free_bulk) provides a good basis for comparison. Measurements[1] of the fallback functions __kmem_cache_{free,alloc}_bulk have been copied from slab_common.c and forced "noinline" to force a function call like slab_common.c. Measurements on CPU CPU i7-4790K @ 4.00GHz Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns Measurements last-patch with disabled debugging: Bulk- fallback - this-patch 1 - 57 cycles(tsc) 14.448 ns - 44 cycles(tsc) 11.236 ns improved 22.8% 2 - 51 cycles(tsc) 12.768 ns - 28 cycles(tsc) 7.019 ns improved 45.1% 3 - 48 cycles(tsc) 12.232 ns - 22 cycles(tsc) 5.526 ns improved 54.2% 4 - 48 cycles(tsc) 12.025 ns - 19 cycles(tsc) 4.786 ns improved 60.4% 8 - 46 cycles(tsc) 11.558 ns - 18 cycles(tsc) 4.572 ns improved 60.9% 16 - 45 cycles(tsc) 11.458 ns - 18 cycles(tsc) 4.658 ns improved 60.0% 30 - 45 cycles(tsc) 11.499 ns - 18 cycles(tsc) 4.568 ns improved 60.0% 32 - 79 cycles(tsc) 19.917 ns - 65 cycles(tsc) 16.454 ns improved 17.7% 34 - 78 cycles(tsc) 19.655 ns - 63 cycles(tsc) 15.932 ns improved 19.2% 48 - 68 cycles(tsc) 17.049 ns - 50 cycles(tsc) 12.506 ns improved 26.5% 64 - 80 cycles(tsc) 20.009 ns - 63 cycles(tsc) 15.929 ns improved 21.3% 128 - 94 cycles(tsc) 23.749 ns - 86 cycles(tsc) 21.583 ns improved 8.5% 158 - 97 cycles(tsc) 24.299 ns - 90 cycles(tsc) 22.552 ns improved 7.2% 250 - 102 cycles(tsc) 25.681 ns - 98 cycles(tsc) 24.589 ns improved 3.9% Benchmarking shows impressive improvements in the "fastpath" with a small number of objects in the working set. Once the working set increases, resulting in activating the "slowpath" (that contains the heavier locked cmpxchg_double) the improvement decreases. I'm currently working on also optimizing the "slowpath" (as network stack use-case hits this), but this patchset should provide a good foundation for further improvements. Rest of my patch queue in this area needs some more work, but preliminary results are good. I'm attending Netfilter Workshop[2] next week, and I'll hopefully return working on further improvements in this area. This patch (of 6): s/succedd/succeed/ Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04watchdog: rename watchdog_suspend() and watchdog_resume()Ulrich Obergfell
Rename watchdog_suspend() to lockup_detector_suspend() and watchdog_resume() to lockup_detector_resume() to avoid confusion with the watchdog subsystem and to be consistent with the existing name lockup_detector_init(). Also provide comment blocks to explain the watchdog_running and watchdog_suspended variables and their relationship. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Stephane Eranian <eranian@google.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04watchdog: use suspend/resume interface in fixup_ht_bug()Ulrich Obergfell
Remove watchdog_nmi_disable_all() and watchdog_nmi_enable_all() since these functions are no longer needed. If a subsystem has a need to deactivate the watchdog temporarily, it should utilize the watchdog_suspend() and watchdog_resume() functions. [akpm@linux-foundation.org: fix build with CONFIG_LOCKUP_DETECTOR=m] Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Stephane Eranian <eranian@google.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04watchdog: use park/unpark functions in update_watchdog_all_cpus()Ulrich Obergfell
Remove update_watchdog() and restart_watchdog_hrtimer() since these functions are no longer needed. Changes of parameters such as the sample period are honored at the time when the watchdog threads are being unparked. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Stephane Eranian <eranian@google.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04watchdog: introduce watchdog_suspend() and watchdog_resume()Ulrich Obergfell
This interface can be utilized to deactivate the hard and soft lockup detector temporarily. Callers are expected to minimize the duration of deactivation. Multiple deactivations are allowed to occur in parallel but should be rare in practice. [akpm@linux-foundation.org: remove unneeded static initialization] Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Stephane Eranian <eranian@google.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04watchdog: introduce watchdog_park_threads() and watchdog_unpark_threads()Ulrich Obergfell
Originally watchdog_nmi_enable(cpu) and watchdog_nmi_disable(cpu) were only called in watchdog thread context. However, the following commits utilize these functions outside of watchdog thread context too. commit 9809b18fcf6b8d8ec4d3643677345907e6b50eca Author: Michal Hocko <mhocko@suse.cz> Date: Tue Sep 24 15:27:30 2013 -0700 watchdog: update watchdog_thresh properly commit b3738d29323344da3017a91010530cf3a58590fc Author: Stephane Eranian <eranian@google.com> Date: Mon Nov 17 20:07:03 2014 +0100 watchdog: Add watchdog enable/disable all functions Hence, it is now possible that these functions execute concurrently with the same 'cpu' argument. This concurrency is problematic because per-cpu 'watchdog_ev' can be accessed/modified without adequate synchronization. The patch series aims to address the above problem. However, instead of introducing locks to protect per-cpu 'watchdog_ev' a different approach is taken: Invoke these functions by parking and unparking the watchdog threads (to ensure they are always called in watchdog thread context). static struct smp_hotplug_thread watchdog_threads = { ... .park = watchdog_disable, // calls watchdog_nmi_disable() .unpark = watchdog_enable, // calls watchdog_nmi_enable() }; Both previously mentioned commits call these functions in a similar way and thus in principle contain some duplicate code. The patch series also avoids this duplication by providing a commonly usable mechanism. - Patch 1/4 introduces the watchdog_{park|unpark}_threads functions that park/unpark all watchdog threads specified in 'watchdog_cpumask'. They are intended to be called inside of kernel/watchdog.c only. - Patch 2/4 introduces the watchdog_{suspend|resume} functions which can be utilized by external callers to deactivate the hard and soft lockup detector temporarily. - Patch 3/4 utilizes watchdog_{park|unpark}_threads to replace some code that was introduced by commit 9809b18fcf6b8d8ec4d3643677345907e6b50eca. - Patch 4/4 utilizes watchdog_{suspend|resume} to replace some code that was introduced by commit b3738d29323344da3017a91010530cf3a58590fc. A few corner cases should be mentioned here for completeness. - kthread_park() of watchdog/N could hang if cpu N is already locked up. However, if watchdog is enabled the lockup will be detected anyway. - kthread_unpark() of watchdog/N could hang if cpu N got locked up after kthread_park(). The occurrence of this scenario should be _very_ rare in practice, in particular because it is not expected that temporary deactivation will happen frequently, and if it happens at all it is expected that the duration of deactivation will be short. This patch (of 4): introduce watchdog_park_threads() and watchdog_unpark_threads() These functions are intended to be used only from inside kernel/watchdog.c to park/unpark all watchdog threads that are specified in watchdog_cpumask. Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Stephane Eranian <eranian@google.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04kernel/watchdog: move NMI function header declarations from watchdog.h to nmi.hGuenter Roeck
The kernel's NMI watchdog has nothing to do with the watchdog subsystem. Its header declarations should be in linux/nmi.h, not linux/watchdog.h. The code provided two sets of dummy functions if HARDLOCKUP_DETECTOR is not configured, one in the include file and one in kernel/watchdog.c. Remove the dummy functions from kernel/watchdog.c and use those from the include file. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Stephane Eranian <eranian@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Don Zickus <dzickus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04watchdog: simplify housekeeping affinity with the appropriate maskFrederic Weisbecker
housekeeping_mask gathers all the CPUs that aren't part of the nohz_full set. This is exactly what we want the watchdog to be affine to without the need to use complicated cpumask operations. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04smpboot: allow passing the cpumask on per-cpu thread registrationFrederic Weisbecker
It makes the registration cheaper and simpler for the smpboot per-cpu kthread users that don't need to always update the cpumask after threads creation. [sfr@canb.auug.org.au: fix for allow passing the cpumask on per-cpu thread registration] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04smpboot: make cleanup to mirror setupFrederic Weisbecker
The per-cpu kthread cleanup() callback is the mirror of the setup() callback. When the per-cpu kthread is started, it first calls setup() to initialize the resources which are then released by cleanup() when the kthread exits. Now since the introduction of a per-cpu kthread cpumask, the kthreads excluded by the cpumask on boot may happen to be parked immediately after their creation without taking the setup() stage, waiting to be asked to unpark to do so. Then when smpboot_unregister_percpu_thread() is later called, the kthread is stopped without having ever called setup(). But this triggers a bug as the kthread unconditionally calls cleanup() on exit but this doesn't mirror any setup(). Thus the kernel crashes because we try to free resources that haven't been initialized, as in the watchdog case: WATCHDOG disable 0 WATCHDOG disable 1 WATCHDOG disable 2 BUG: unable to handle kernel NULL pointer dereference at (null) IP: hrtimer_active+0x26/0x60 [...] Call Trace: hrtimer_try_to_cancel+0x1c/0x280 hrtimer_cancel+0x1d/0x30 watchdog_disable+0x56/0x70 watchdog_cleanup+0xe/0x10 smpboot_thread_fn+0x23c/0x2c0 kthread+0xf8/0x110 ret_from_fork+0x3f/0x70 This bug is currently masked with explicit kthread unparking before kthread_stop() on smpboot_destroy_threads(). This forces a call to setup() and then unpark(). We could fix this by unconditionally calling setup() on kthread entry. But setup() isn't always cheap. In the case of watchdog it launches hrtimer, perf events, etc... So we may as well like to skip it if there are chances the kthread will never be used, as in a reduced cpumask value. So let's simply do a state machine check before calling cleanup() that makes sure setup() has been called before mirroring it. And remove the nasty hack workaround. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04smpboot: fix memory leak on error handlingFrederic Weisbecker
The cpumask is allocated before threads get created. If the latter step fails, we need to free the cpumask. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04fs: create and use seq_show_option for escapingKees Cook
Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: clean up redundant NULL checks before kfreeJoseph Qi
NULL check before kfree is redundant and so clean them up. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: neaten do_error, ocfs2_error and ocfs2_abortJoe Perches
These uses sometimes do and sometimes don't have '\n' terminations. Make the uses consistently use '\n' terminations and remove the newline from the functions. Miscellanea: o Coalesce formats o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: do not set fs read-only if rec[0] is empty while committing truncateXue jiufei
While appending an extent to a file, it will call these functions: ocfs2_insert_extent -> call ocfs2_grow_tree() if there's no free rec -> ocfs2_add_branch add a new branch to extent tree, now rec[0] in the leaf of rightmost path is empty -> ocfs2_do_insert_extent -> ocfs2_rotate_tree_right -> ocfs2_extend_rotate_transaction -> jbd2_journal_restart if jbd2_journal_extend fail -> ocfs2_insert_path -> ocfs2_extend_trans -> jbd2_journal_restart if jbd2_journal_extend fail -> ocfs2_insert_at_leaf -> ocfs2_et_update_clusters Function jbd2_journal_restart() may be called and it may happened that buffers dirtied in ocfs2_add_branch() are committed while buffers dirtied in ocfs2_insert_at_leaf() and ocfs2_et_update_clusters() are not. So an empty rec[0] is left in rightmost path which will cause read-only filesystem when call ocfs2_commit_truncate() with the error message: "Inode %lu has an empty extent record". This is not a serious problem, so remove the rightmost path when call ocfs2_commit_truncate(). Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ↵yangwenfang
ocfs2_write_end_nolock() 1: After we call ocfs2_journal_access_di() in ocfs2_write_begin(), jbd2_journal_restart() may also be called, in this function transaction A's t_updates-- and obtains a new transaction B. If jbd2_journal_commit_transaction() is happened to commit transaction A, when t_updates==0, it will continue to complete commit and unfile buffer. So when jbd2_journal_dirty_metadata(), the handle is pointed a new transaction B, and the buffer head's journal head is already freed, jh->b_transaction == NULL, jh->b_next_transaction == NULL, it returns EINVAL, So it triggers the BUG_ON(status). thread 1 jbd2 ocfs2_write_begin jbd2_journal_commit_transaction ocfs2_write_begin_nolock ocfs2_start_trans jbd2__journal_start(t_updates+1, transaction A) ocfs2_journal_access_di ocfs2_write_cluster_by_desc ocfs2_mark_extent_written ocfs2_change_extent_flag ocfs2_split_extent ocfs2_extend_rotate_transaction jbd2_journal_restart (t_updates-1,transaction B) t_updates==0 __jbd2_journal_refile_buffer (jh->b_transaction = NULL) ocfs2_write_end ocfs2_write_end_nolock ocfs2_journal_dirty jbd2_journal_dirty_metadata(bug) ocfs2_commit_trans 2. In ext4, I found that: jbd2_journal_get_write_access() called by ext4_write_end. ext4_write_begin ext4_journal_start __ext4_journal_start_sb ext4_journal_check_start jbd2__journal_start ext4_write_end ext4_mark_inode_dirty ext4_reserve_inode_write ext4_journal_get_write_access jbd2_journal_get_write_access ext4_mark_iloc_dirty ext4_do_update_inode ext4_handle_dirty_metadata jbd2_journal_dirty_metadata 3. So I think we should put ocfs2_journal_access_di before ocfs2_journal_dirty in the ocfs2_write_end. and it works well after my modification. Signed-off-by: vicky <vicky.yangwenfang@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Zhangguanghui <zhang.guanghui@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: use 64bit variables to track heartbeat timeTina Ruchandani
o2hb_elapsed_msecs computes the time taken for a disk heartbeat. 'struct timeval' variables are used to store start and end times. On 32-bit systems, the 'tv_sec' component of 'struct timeval' will overflow in year 2038 and beyond. This patch solves the overflow with the following: 1. Replace o2hb_elapsed_msecs using 'ktime_t' values to measure start and end time, and built-in function 'ktime_ms_delta' to compute the elapsed time. ktime_get_real() is used since the code prints out the wallclock time. 2. Changes format string to print time as a single 64-bit nanoseconds value ("%lld") instead of seconds and microseconds. This simplifies the code since converting ktime_t to that format would need expensive computation. However, the debug log string is less readable than the previous format. Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com> Suggested by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04ocfs2: fix race between crashed dio and rmJoseph Qi
There is a race case between crashed dio and rm, which will lead to OCFS2_VALID_FL not set read-only. N1 N2 ------------------------------------------------------------------------ dd with direct flag rm file crashed with an dio entry left in orphan dir clear OCFS2_VALID_FL in ocfs2_remove_inode recover N1 and read the corrupted inode, and set filesystem read-only So we skip the inode deletion this time and wait for dio entry recovered first. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>