summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2017-11-15mm: remove cold parameter for release_pagesMel Gorman
All callers of release_pages claim the pages being released are cache hot. As no one cares about the hotness of pages being released to the allocator, just ditch the parameter. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, pagevec: remove cold parameter for pagevecsMel Gorman
Every pagevec_init user claims the pages being released are hot even in cases where it is unlikely the pages are hot. As no one cares about the hotness of pages being released to the allocator, just ditch the parameter. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: only drain per-cpu pagevecs once per pagevec usageMel Gorman
When a pagevec is initialised on the stack, it is generally used multiple times over a range of pages, looking up entries and then releasing them. On each pagevec_release, the per-cpu deferred LRU pagevecs are drained on the grounds the page being released may be on those queues and the pages may be cache hot. In many cases only the first drain is necessary as it's unlikely that the range of pages being walked is racing against LRU addition. Even if there is such a race, the impact is marginal where as constantly redraining the lru pagevecs costs. This patch ensures that pagevec is only drained once in a given lifecycle without increasing the cache footprint of the pagevec structure. Only sparsetruncate tiny is shown here as large files have many exceptional entries and calls pagecache_release less frequently. sparsetruncate (tiny) 4.14.0-rc4 4.14.0-rc4 batchshadow-v1r1 onedrain-v1r1 Min Time 141.00 ( 0.00%) 141.00 ( 0.00%) 1st-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%) 2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%) 3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%) Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%) Max-95% Time 146.00 ( 0.00%) 145.00 ( 0.68%) Max-99% Time 198.00 ( 0.00%) 194.00 ( 2.02%) Max Time 254.00 ( 0.00%) 208.00 ( 18.11%) Amean Time 145.12 ( 0.00%) 144.30 ( 0.56%) Stddev Time 12.74 ( 0.00%) 9.62 ( 24.49%) Coeff Time 8.78 ( 0.00%) 6.67 ( 24.06%) Best99%Amean Time 144.29 ( 0.00%) 143.82 ( 0.32%) Best95%Amean Time 142.68 ( 0.00%) 142.31 ( 0.26%) Best90%Amean Time 142.52 ( 0.00%) 142.19 ( 0.24%) Best75%Amean Time 142.26 ( 0.00%) 141.98 ( 0.20%) Best50%Amean Time 141.90 ( 0.00%) 141.71 ( 0.13%) Best25%Amean Time 141.80 ( 0.00%) 141.43 ( 0.26%) The impact on bonnie is marginal and within the noise because a significant percentage of the file being truncated has been reclaimed and consists of shadow entries which reduce the hotness of the pagevec_release path. Link: http://lkml.kernel.org/r/20171018075952.10627-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, truncate: do not check mapping for every page being truncatedMel Gorman
During truncation, the mapping has already been checked for shmem and dax so it's known that workingset_update_node is required. This patch avoids the checks on mapping for each page being truncated. In all other cases, a lookup helper is used to determine if workingset_update_node() needs to be called. The one danger is that the API is slightly harder to use as calling workingset_update_node directly without checking for dax or shmem mappings could lead to surprises. However, the API rarely needs to be used and hopefully the comment is enough to give people the hint. sparsetruncate (tiny) 4.14.0-rc4 4.14.0-rc4 oneirq-v1r1 pickhelper-v1r1 Min Time 141.00 ( 0.00%) 140.00 ( 0.71%) 1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%) 2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%) 3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%) Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%) Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%) Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%) Max Time 230.00 ( 0.00%) 205.00 ( 10.87%) Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%) Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%) Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%) Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%) Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%) Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%) Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%) Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%) Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%) As you'd expect, the gain is marginal but it can be detected. The differences in bonnie are all within the noise which is not surprising given the impact on the microbenchmark. radix_tree_update_node_t is a callback for some radix operations that optionally passes in a private field. The only user of the callback is workingset_update_node and as it no longer requires a mapping, the private field is removed. Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: batch radix tree operations when truncating pagesJan Kara
Currently we remove pages from the radix tree one by one. To speed up page cache truncation, lock several pages at once and free them in one go. This allows us to batch radix tree operations in a more efficient way and also save round-trips on mapping->tree_lock. As a result we gain about 20% speed improvement in page cache truncation. Data from a simple benchmark timing 10000 truncates of 1024 pages (on ext4 on ramdisk but the filesystem is barely visible in the profiles). The range shows 1% and 95% percentiles of the measured times: 4.14-rc2 4.14-rc2 + batched truncation 248-256 209-219 249-258 209-217 248-255 211-239 248-255 209-217 247-256 210-218 [jack@suse.cz: convert delete_from_page_cache_batch() to pagevec] Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz [akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file] Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: speed up cancel_dirty_page() for clean pagesJan Kara
Patch series "Speed up page cache truncation", v1. When rebasing our enterprise distro to a newer kernel (from 4.4 to 4.12) we have noticed a regression in bonnie++ benchmark when deleting files. Eventually we have tracked this down to a fact that page cache truncation got slower by about 10%. There were both gains and losses in the above interval of kernels but we have been able to identify that commit 83929372f629 ("filemap: prepare find and delete operations for huge pages") caused about 10% regression on its own. After some investigation it didn't seem easily possible to fix the regression while maintaining the THP in page cache functionality so we've decided to optimize the page cache truncation path instead to make up for the change. This series is a result of that effort. Patch 1 is an easy speedup of cancel_dirty_page(). Patches 2-6 refactor page cache truncation code so that it is easier to batch radix tree operations. Patch 7 implements batching of deletes from the radix tree which more than makes up for the original regression. This patch (of 7): cancel_dirty_page() does quite some work even for clean pages (fetching of mapping, locking of memcg, atomic bit op on page flags) so it accounts for ~2.5% of cost of truncation of a clean page. That is not much but still dumb for something we don't need at all. Check whether a page is actually dirty and avoid any work if not. Link: http://lkml.kernel.org/r/20171010151937.26984-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: zero reserved and unavailable struct pagesPavel Tatashin
Some memory is reserved but unavailable: not present in memblock.memory (because not backed by physical pages), but present in memblock.reserved. Such memory has backing struct pages, but they are not initialized by going through __init_single_page(). In some cases these struct pages are accessed even if they do not contain any data. One example is page_to_pfn() might access page->flags if this is where section information is stored (CONFIG_SPARSEMEM, SECTION_IN_PAGE_FLAGS). One example of such memory: trim_low_memory_range() unconditionally reserves from pfn 0, but e820__memblock_setup() might provide the exiting memory from pfn 1 (i.e. KVM). Since struct pages are zeroed in __init_single_page(), and not during allocation time, we must zero such struct pages explicitly. The patch involves adding a new memblock iterator: for_each_resv_unavail_range(i, p_start, p_end) Which iterates through reserved && !memory lists, and we zero struct pages explicitly by calling mm_zero_struct_page(). === Here is more detailed example of problem that this patch is addressing: Run tested on qemu with the following arguments: -enable-kvm -cpu kvm64 -m 512 -smp 2 This patch reports that there are 98 unavailable pages. They are: pfn 0 and pfns in range [159, 255]. Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does not reserve [159, 255] ones. e820__memblock_setup() reports linux that the following physical ranges are available: [1 , 158] [256, 130783] Notice, that exactly unavailable pfns are missing! Now, lets check what we have in zone 0: [1, 131039] pfn 0, is not part of the zone, but pfns [1, 158], are. However, the bigger problem we have if we do not initialize these struct pages is with memory hotplug. Because, that path operates at 2M boundaries (section_nr). And checks if 2M range of pages is hot removable. It starts with first pfn from zone, rounds it down to 2M boundary (sturct pages are allocated at 2M boundaries when vmemmap is created), and checks if that section is hot removable. In this case start with pfn 1 and convert it down to pfn 0. Later pfn is converted to struct page, and some fields are checked. Now, if we do not zero struct pages, we get unpredictable results. In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all vmemmap memory to ones, the following panic is observed with kernel test without this patch applied: BUG: unable to handle kernel NULL pointer dereference at (null) IP: is_pageblock_removable_nolock+0x35/0x90 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT ... task: ffff88001f4e2900 task.stack: ffffc90000314000 RIP: 0010:is_pageblock_removable_nolock+0x35/0x90 Call Trace: ? is_mem_section_removable+0x5a/0xd0 show_mem_removable+0x6b/0xa0 dev_attr_show+0x1b/0x50 sysfs_kf_seq_show+0xa1/0x100 kernfs_seq_show+0x22/0x30 seq_read+0x1ac/0x3a0 kernfs_fop_read+0x36/0x190 ? security_file_permission+0x90/0xb0 __vfs_read+0x16/0x30 vfs_read+0x81/0x130 SyS_read+0x44/0xa0 entry_SYSCALL_64_fastpath+0x1f/0xbd Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: define memblock_virt_alloc_try_nid_rawPavel Tatashin
* A new variant of memblock_virt_alloc_* allocations: memblock_virt_alloc_try_nid_raw() - Does not zero the allocated memory - Does not panic if request cannot be satisfied * optimize early system hash allocations Clients can call alloc_large_system_hash() with flag: HASH_ZERO to specify that memory that was allocated for system hash needs to be zeroed, otherwise the memory does not need to be zeroed, and client will initialize it. If memory does not need to be zero'd, call the new memblock_virt_alloc_raw() interface, and thus improve the boot performance. * debug for raw alloctor When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no places excpect zeroed memory. Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15kmemcheck: rip it outLevin, Alexander (Sasha Levin)
Fix up makefiles, remove references, and git rm kmemcheck. Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vegard Nossum <vegardno@ifi.uio.no> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexander Potapenko <glider@google.com> Cc: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15kmemcheck: remove whats left of NOTRACK flagsLevin, Alexander (Sasha Levin)
Now that kmemcheck is gone, we don't need the NOTRACK flags. Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACKLevin, Alexander (Sasha Levin)
Convert all allocations that used a NOTRACK flag to stop using it. Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15kmemcheck: remove annotationsLevin, Alexander (Sasha Levin)
Patch series "kmemcheck: kill kmemcheck", v2. As discussed at LSF/MM, kill kmemcheck. KASan is a replacement that is able to work without the limitation of kmemcheck (single CPU, slow). KASan is already upstream. We are also not aware of any users of kmemcheck (or users who don't consider KASan as a suitable replacement). The only objection was that since KASAN wasn't supported by all GCC versions provided by distros at that time we should hold off for 2 years, and try again. Now that 2 years have passed, and all distros provide gcc that supports KASAN, kill kmemcheck again for the very same reasons. This patch (of 4): Remove kmemcheck annotations, and calls to kmemcheck from the kernel. [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs] Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: consolidate page table accountingKirill A. Shutemov
Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of total memory allocated to page tables for oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes, because page table size for different levels of page table tree may be different. The change has user-visible effect: we don't have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As alternative, we can always report 0 kB for them.) OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different size of page tables depending on level or where page tables are less than a page in size. The only downside can be debuggability because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: introduce wrappers to access mm->nr_ptesKirill A. Shutemov
Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud. The patch also makes nr_ptes accounting dependent onto CONFIG_MMU. Page table accounting doesn't make sense if you don't have page tables. It's preparation for consolidation of page-table counters in mm_struct. Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: account pud page tablesKirill A. Shutemov
On a machine with 5-level paging support a process can allocate significant amount of memory and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35b66b ("mm: account pmd page tables to the process"). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()Jan Kara
All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as a desired number of pages. Just drop the argument. Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: add variant of pagevec_lookup_range_tag() taking number of pagesJan Kara
Currently pagevec_lookup_range_tag() takes number of pages to look up but most users don't need this. Create a new function pagevec_lookup_range_nr_tag() that takes maximum number of pages to lookup for Ceph which wants this functionality so that we can drop nr_pages argument from pagevec_lookup_range_tag(). Link: http://lkml.kernel.org/r/20171009151359.31984-13-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: implement find_get_pages_range_tag()Jan Kara
Patch series "Ranged pagevec tagged lookup", v3. In this series I provide a ranged variant of pagevec_lookup_tag() and use it in places where it makes sense. This series removes some common code and it also has a potential for speeding up some operations similarly as for pagevec_lookup_range() (but for now I can think of only artificial cases where this happens). This patch (of 16): Implement a variant of find_get_pages_tag() that stops iterating at given index. Lots of users of this function (through pagevec_lookup()) actually want a range lookup and all of them are currently open-coding this. Also create corresponding pagevec_lookup_range_tag() function. Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Steve French <sfrench@samba.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, arch: remove empty_bad_page*Michal Hocko
empty_bad_page() and empty_bad_pte_table() seem to be relics from old days which is not used by any code for a long time. I have tried to find when exactly but this is not really all that straightforward due to many code movements - traces disappear around 2.4 times. Anyway no code really references neither empty_bad_page nor empty_bad_pte_table. We only allocate the storage which is not used by anybody so remove them. Link: http://lkml.kernel.org/r/20171004150045.30755-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Ralf Baechle <ralf@linus-mips.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: David Howells <dhowells@redhat.com> Cc: Rich Felker <dalias@libc.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: remove unused pgdat->inactive_ratioAndrey Ryabinin
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") 'pgdat->inactive_ratio' is not used, except for printing "node_inactive_ratio: 0" in /proc/zoneinfo output. Remove it. Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/mmu_notifier: avoid call to invalidate_range() in range_end()Jérôme Glisse
This is an optimization patch that only affect mmu_notifier users which rely on the invalidate_range() callback. This patch avoids calling that callback twice in a row from inside __mmu_notifier_invalidate_range_end Existing pattern (before this patch): mmu_notifier_invalidate_range_start() pte/pmd/pud_clear_flush_notify() mmu_notifier_invalidate_range() mmu_notifier_invalidate_range_end() mmu_notifier_invalidate_range() New pattern (after this patch): mmu_notifier_invalidate_range_start() pte/pmd/pud_clear_flush_notify() mmu_notifier_invalidate_range() mmu_notifier_invalidate_range_only_end() We call the invalidate_range callback after clearing the page table under the page table lock and we skip the call to invalidate_range inside the __mmu_notifier_invalidate_range_end() function. Idea from Andrea Arcangeli Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alistair Popple <alistair@popple.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/mmu_notifier: avoid double notification when it is uselessJérôme Glisse
This patch only affects users of mmu_notifier->invalidate_range callback which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ... and it is an optimization for those users. Everyone else is unaffected by it. When clearing a pte/pmd we are given a choice to notify the event under the page table lock (notify version of *_clear_flush helpers do call the mmu_notifier_invalidate_range). But that notification is not necessary in all cases. This patch removes almost all cases where it is useless to have a call to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end. It also adds documentation in all those cases explaining why. Below is a more in depth analysis of why this is fine to do this: For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use thing like ATS/PASID to get the IOMMU to walk the CPU page table to access a process virtual address space). There is only 2 cases when you need to notify those secondary TLB while holding page table lock when clearing a pte/pmd: A) page backing address is free before mmu_notifier_invalidate_range_end B) a page table entry is updated to point to a new page (COW, write fault on zero page, __replace_page(), ...) Case A is obvious you do not want to take the risk for the device to write to a page that might now be used by something completely different. Case B is more subtle. For correctness it requires the following sequence to happen: - take page table lock - clear page table entry and notify (pmd/pte_huge_clear_flush_notify()) - set page table entry to point to new page If clearing the page table entry is not followed by a notify before setting the new pte/pmd value then you can break memory model like C11 or C++11 for the device. Consider the following scenario (device use a feature similar to ATS/ PASID): Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume they are write protected for COW (other case of B apply too). [Time N] ----------------------------------------------------------------- CPU-thread-0 {try to write to addrA} CPU-thread-1 {try to write to addrB} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {read addrA and populate device TLB} DEV-thread-2 {read addrB and populate device TLB} [Time N+1] --------------------------------------------------------------- CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+2] --------------------------------------------------------------- CPU-thread-0 {COW_step1: {update page table point to new page for addrA}} CPU-thread-1 {COW_step1: {update page table point to new page for addrB}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+3] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {preempted} CPU-thread-2 {write to addrA which is a write to new page} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+3] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {preempted} CPU-thread-2 {} CPU-thread-3 {write to addrB which is a write to new page} DEV-thread-0 {} DEV-thread-2 {} [Time N+4] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {} DEV-thread-2 {} [Time N+5] --------------------------------------------------------------- CPU-thread-0 {preempted} CPU-thread-1 {} CPU-thread-2 {} CPU-thread-3 {} DEV-thread-0 {read addrA from old page} DEV-thread-2 {read addrB from new page} So here because at time N+2 the clear page table entry was not pair with a notification to invalidate the secondary TLB, the device see the new value for addrB before seing the new value for addrA. This break total memory ordering for the device. When changing a pte to write protect or to point to a new write protected page with same content (KSM) it is ok to delay invalidate_range callback to mmu_notifier_invalidate_range_end() outside the page table lock. This is true even if the thread doing page table update is preempted right after releasing page table lock before calling mmu_notifier_invalidate_range_end Thanks to Andrea for thinking of a problematic scenario for COW. [jglisse@redhat.com: v2] Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alistair Popple <alistair@popple.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/hmm: constify hmm_devmem_page_get_drvdata() parameterRalph Campbell
Constify pointer parameter to avoid issue when use from code that only has const struct page pointer to use in the first place. Link: http://lkml.kernel.org/r/1506972774-10191-1-git-send-email-jglisse@redhat.com Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/memblock.c: make the index explicit argument of for_each_memblock_typeGioh Kim
for_each_memblock_type macro function relies on idx variable defined in the caller context. Silent macro arguments are almost always wrong thing to do. They make code harder to read and easier to get wrong. Let's use an explicit iterator parameter for for_each_memblock_type and make the code more obious. This patch is a mere cleanup and it shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170913133029.28911-1-gi-oh.kim@profitbricks.com Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages()Michal Hocko
Joonsoo has noticed that "mm: drop migrate type checks from has_unmovable_pages" would break CMA allocator because it relies on has_unmovable_pages returning false even for CMA pageblocks which in fact don't have to be movable: alloc_contig_range start_isolate_page_range set_migratetype_isolate has_unmovable_pages This is a result of the code sharing between CMA and memory hotplug while each one has a different idea of what has_unmovable_pages should return. This is unfortunate but fixing it properly would require a lot of code duplication. Fix the issue by introducing the requested migrate type argument and special case MIGRATE_CMA case where CMA page blocks are handled properly. This will work for memory hotplug because it requires MIGRATE_MOVABLE. Link: http://lkml.kernel.org/r/20171019122118.y6cndierwl2vnguj@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ran Wang <ran.wang_1@nxp.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Igor Mammedov <imammedo@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no ↵Minchan Kim
other reference When SWP_SYNCHRONOUS_IO swapped-in pages are shared by several processes, it can cause unnecessary memory wastage by skipping swap cache. Because, with swapin fault by read, they could share a page if the page were in swap cache. Thus, it avoids allocating same content new pages. This patch makes the swapcache skipping work only if the swap pte is non-sharable. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1507620825-5537-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, swap: skip swapcache for swapin of synchronous deviceMinchan Kim
With fast swap storage, the platforms want to use swap more aggressively and swap-in is crucial to application latency. The rw_page() based synchronous devices like zram, pmem and btt are such fast storage. When I profile swapin performance with zram lz4 decompress test, S/W overhead is more than 70%. Maybe, it would be bigger in nvdimm. This patch aims to reduce swap-in latency by skipping swapcache if the swap device is synchronous device like rw_page based device. It enhances 45% my swapin test(5G sequential swapin, no readahead, from 2.41sec to 1.64sec). Link: http://lkml.kernel.org/r/1505886205-9671-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, swap: introduce SWP_SYNCHRONOUS_IOMinchan Kim
If rw-page based fast storage is used for swap devices, we need to detect it to enhance swap IO operations. This patch is preparation for optimizing of swap-in operation with next patch. Link: http://lkml.kernel.org/r/1505886205-9671-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15bdi: introduce BDI_CAP_SYNCHRONOUS_IOMinchan Kim
As discussed at https://lkml.kernel.org/r/<20170728165604.10455-1-ross.zwisler@linux.intel.com> someday we will remove rw_page(). If so, we need something to detect such super-fast storage on which synchronous IO operations like the current rw_page are always a win. Introduces BDI_CAP_SYNCHRONOUS_IO to indicate such devices. With it, we could use various optimization techniques. Link: http://lkml.kernel.org/r/1505886205-9671-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: update comments for struct page.mappingChangbin Du
struct page.mapping can be NULL or points to one object of type address_space, anon_vma or KSM private structure. Link: http://lkml.kernel.org/r/1506485067-15954-1-git-send-email-changbin.du@intel.com Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15include/linux/slab.h: add kmalloc_array_node() and kcalloc_node()Johannes Thumshirn
Patch series "Add kmalloc_array_node() and kcalloc_node()". Our current memeory allocation routines suffer form an API imbalance, for one we have kmalloc_array() and kcalloc() which check for overflows in size multiplication and we have kmalloc_node() and kzalloc_node() which allow for memory allocation on a certain NUMA node but don't check for eventual overflows. This patch (of 6): We have kmalloc_array() and kcalloc() wrappers on top of kmalloc() which ensure us overflow free multiplication for the size of a memory allocation but these implementations are not NUMA-aware. Likewise we have kmalloc_node() which is a NUMA-aware version of kmalloc() but the implementation is not aware of any possible overflows in eventual size calculations. Introduce a combination of the two above cases to have a NUMA-node aware version of kmalloc_array() and kcalloc(). Link: http://lkml.kernel.org/r/20170927082038.3782-2-jthumshirn@suse.de Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: David Rientjes <rientjes@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Marciniszyn <infinipath@intel.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15slab, slub, slob: convert slab_flags_t to 32-bitAlexey Dobriyan
struct kmem_cache::flags is "unsigned long" which is unnecessary on 64-bit as no flags are defined in the higher bits. Switch the field to 32-bit and save some space on x86_64 until such flags appear: add/remove: 0/0 grow/shrink: 0/107 up/down: 0/-657 (-657) function old new delta sysfs_slab_add 720 719 -1 ... check_object 699 676 -23 [akpm@linux-foundation.org: fix printk warning] Link: http://lkml.kernel.org/r/20171021100635.GA8287@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15slab, slub, slob: add slab_flags_tAlexey Dobriyan
Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON, etc). SLAB is bloated temporarily by switching to "unsigned long", but only temporarily. Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-16ipv6: sr: update the struct ipv6_sr_hdrAhmed Abdelsalam
The IPv6 Segment Routing Header (SRH) format has been updated (revision 6 of the SRH ietf draft). The update includes the following SRH fields: (1) The "First Segment" field changed to be "Last Entry" which contains the index, in the Segment List, of the last element of the Segment List. (2) The 16 bit "reserved" field now is used as a "tag" which tags a packet as part of a class or group of packets, e.g.,packets sharing the same set of properties. This patch updates the struct ipv6_sr_hdr, so it complies with the updated SRH draft. The 16 bit "reserved" field is changed to be "tag", In addition a comment is added to the "first_segment" field, showing that it represents the "Last Entry" field of the SRH. Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-16genetlink: fix genlmsg_nlhdr()Michal Kubecek
According to the description, first argument of genlmsg_nlhdr() points to what genlmsg_put() returns, i.e. beginning of user header. Therefore we should only subtract size of genetlink header and netlink message header, not user header. This also means we don't need to pass the pointer to genetlink family and the same is true for genl_dump_check_consistent() which is the only caller of genlmsg_nlhdr(). (Note that at the moment, these functions are only used for families which do not have user header so that they are not affected.) Fixes: 670dc2833d14 ("netlink: advertise incomplete dumps") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-15Merge branch 'for-4.15/dax' into libnvdimm-for-nextDan Williams
2017-11-15Merge tag 'ipmi-for-4.15' of git://github.com/cminyard/linux-ipmiLinus Torvalds
Pull IPMI updates from Corey Minyard: "This is a fairly large rework of the IPMI code, along with a bunch of smaller fixes. The major changes have been in the next tree for a couple of months, so they should be good to do in. - Some users had IPMI systems where the GUID of the IPMI controller could change. So rescanning of the GUID was added. The naming of some sysfs things was dependent on the GUID, however, so this resulted in the sysfs interface code in IPMI changing to remove that dependency and name the IPMI BMCs like other sysfs devices. - The ipmi_si_intf.c code was fairly bloated with all the different discovery methods (PCI, ACPI, SMBIOS, OF, platform, module parameters, hot add). The structure of how the interfaces were added was redone to make them more modular, then the individual methods were pulled out into their own files" * tag 'ipmi-for-4.15' of git://github.com/cminyard/linux-ipmi: (48 commits) ipmi_si: Delete an error message for a failed memory allocation in try_smi_init() ipmi_si: fix memory leak on new_smi ipmi: remove redundant initialization of bmc ipmi: pr_err() strings should end with newlines ipmi: Clean up some print operations ipmi: Make the DMI probe into a generic platform probe ipmi: Make the IPMI proc interface configurable ipmi_ssif: Add device attrs for the things in proc ipmi_si: Add device attrs for the things in proc ipmi_si: remove ipmi_smi_alloc() function ipmi_si: Move port and mem I/O handling to their own files ipmi_si: Get rid of unused spacing and port fields ipmi_si: Move PARISC handling to another file ipmi_si: Move PCI setup to another file ipmi_si: Move platform device handling to another file ipmi_si: Move hardcode handling to a separate file. ipmi_si: Move the hotmod handling to another file. ipmi_si: Change ipmi_si_add_smi() to take just I/O info ipmi_si: Move io setup into io structure ipmi_si: Move irq setup handling into the io struct ...
2017-11-15Merge tag 'pci-v4.15-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - detach driver before tearing down procfs/sysfs (Alex Williamson) - disable PCIe services during shutdown (Sinan Kaya) - fix ASPM oops on systems with no Root Ports (Ard Biesheuvel) - fix ASPM LTR_L1.2_THRESHOLD programming (Bjorn Helgaas) - fix ASPM Common_Mode_Restore_Time computation (Bjorn Helgaas) - fix portdrv MSI/MSI-X vector allocation (Dongdong Liu, Bjorn Helgaas) - report non-fatal AER errors only to the affected endpoint (Gabriele Paoloni) - distribute bus numbers, MMIO, and I/O space among hotplug bridges to allow more devices to be hot-added (Mika Westerberg) - fix pciehp races during initialization and surprise link down (Mika Westerberg) - handle surprise-removed devices in PME handling (Qiang) - support resizable BARs for large graphics devices (Christian König) - expose SR-IOV offset, stride, and VF device ID via sysfs (Filippo Sironi) - create SR-IOV virtfn/physfn sysfs links before attaching driver (Stuart Hayes) - fix SR-IOV "ARI Capable Hierarchy" restore issue (Tony Nguyen) - enforce Kconfig IOV/REALLOC dependency (Sascha El-Sharkawy) - avoid slot reset if bridge itself is broken (Jan Glauber) - clean up pci_reset_function() path (Jan H. Schönherr) - make pci_map_rom() fail if the option ROM is invalid (Changbin Du) - convert timers to timer_setup() (Kees Cook) - move PCI_QUIRKS to PCI bus Kconfig menu (Randy Dunlap) - constify pci_dev_type and intel_mid_pci_ops (Bhumika Goyal) - remove unnecessary pci_dev, pci_bus, resource, pcibios_set_master() declarations (Bjorn Helgaas) - fix endpoint framework overflows and BUG()s (Dan Carpenter) - fix endpoint framework issues (Kishon Vijay Abraham I) - avoid broken Cavium CN8xxx bus reset behavior (David Daney) - extend Cavium ACS capability quirks (Vadim Lomovtsev) - support Synopsys DesignWare RC in ECAM mode (Ard Biesheuvel) - turn off dra7xx clocks cleanly on shutdown (Keerthy) - fix Faraday probe error path (Wei Yongjun) - support HiSilicon STB SoC PCIe host controller (Jianguo Sun) - fix Hyper-V interrupt affinity issue (Dexuan Cui) - remove useless ACPI warning for Hyper-V pass-through devices (Vitaly Kuznetsov) - support multiple MSI on iProc (Sandor Bodo-Merle) - support Layerscape LS1012a and LS1046a PCIe host controllers (Hou Zhiqiang) - fix Layerscape default error response (Minghuan Lian) - support MSI on Tango host controller (Marc Gonzalez) - support Tegra186 PCIe host controller (Manikanta Maddireddy) - use generic accessors on Tegra when possible (Thierry Reding) - support V3 Semiconductor PCI host controller (Linus Walleij) * tag 'pci-v4.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (85 commits) PCI/ASPM: Add L1 Substates definitions PCI/ASPM: Reformat ASPM register definitions PCI/ASPM: Use correct capability pointer to program LTR_L1.2_THRESHOLD PCI/ASPM: Account for downstream device's Port Common_Mode_Restore_Time PCI: xgene: Rename xgene_pcie_probe_bridge() to xgene_pcie_probe() PCI: xilinx: Rename xilinx_pcie_link_is_up() to xilinx_pcie_link_up() PCI: altera: Rename altera_pcie_link_is_up() to altera_pcie_link_up() PCI: Fix kernel-doc build warning PCI: Fail pci_map_rom() if the option ROM is invalid PCI: Move pci_map_rom() error path PCI: Move PCI_QUIRKS to the PCI bus menu alpha/PCI: Make pdev_save_srm_config() static PCI: Remove unused declarations PCI: Remove redundant pci_dev, pci_bus, resource declarations PCI: Remove redundant pcibios_set_master() declarations PCI/PME: Handle invalid data when reading Root Status PCI: hv: Use effective affinity mask PCI: pciehp: Do not clear Presence Detect Changed during initialization PCI: pciehp: Fix race condition handling surprise link down PCI: Distribute available resources to hotplug-capable bridges ...
2017-11-15Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "This is a fairly plain pull request. Lots of driver updates across the stack, a huge number of static analysis cleanups including a close to 50 patch series from Bart Van Assche, and a number of new features inside the stack such as general CQ moderation support. Nothing really stands out, but there might be a few conflicts as you take things in. In particular, the cleanups touched some of the same lines as the new timer_setup changes. Everything in this pull request has been through 0day and at least two days of linux-next (since Stephen doesn't necessarily flag new errors/warnings until day2). A few more items (about 30 patches) from Intel and Mellanox showed up on the list on Tuesday. I've excluded those from this pull request, and I'm sure some of them qualify as fixes suitable to send any time, but I still have to review them fully. If they contain mostly fixes and little or no new development, then I will probably send them through by the end of the week just to get them out of the way. There was a break in my acceptance of patches which coincides with the computer problems I had, and then when I got things mostly back under control I had a backlog of patches to process, which I did mostly last Friday and Monday. So there is a larger number of patches processed in that timeframe than I was striving for. Summary: - Add iWARP support to qedr driver - Lots of misc fixes across subsystem - Multiple update series to hns roce driver - Multiple update series to hfi1 driver - Updates to vnic driver - Add kref to wait struct in cxgb4 driver - Updates to i40iw driver - Mellanox shared pull request - timer_setup changes - massive cleanup series from Bart Van Assche - Two series of SRP/SRPT changes from Bart Van Assche - Core updates from Mellanox - i40iw updates - IPoIB updates - mlx5 updates - mlx4 updates - hns updates - bnxt_re fixes - PCI write padding support - Sparse/Smatch/warning cleanups/fixes - CQ moderation support - SRQ support in vmw_pvrdma" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (296 commits) RDMA/core: Rename kernel modify_cq to better describe its usage IB/mlx5: Add CQ moderation capability to query_device IB/mlx4: Add CQ moderation capability to query_device IB/uverbs: Add CQ moderation capability to query_device IB/mlx5: Exposing modify CQ callback to uverbs layer IB/mlx4: Exposing modify CQ callback to uverbs layer IB/uverbs: Allow CQ moderation with modify CQ iw_cxgb4: atomically flush the qp iw_cxgb4: only call the cq comp_handler when the cq is armed iw_cxgb4: Fix possible circular dependency locking warning RDMA/bnxt_re: report vlan_id and sl in qp1 recv completion IB/core: Only maintain real QPs in the security lists IB/ocrdma_hw: remove unnecessary code in ocrdma_mbx_dealloc_lkey RDMA/core: Make function rdma_copy_addr return void RDMA/vmw_pvrdma: Add shared receive queue support RDMA/core: avoid uninitialized variable warning in create_udata RDMA/bnxt_re: synchronize poll_cq and req_notify_cq verbs RDMA/bnxt_re: Flush CQ notification Work Queue before destroying QP RDMA/bnxt_re: Set QP state in case of response completion errors RDMA/bnxt_re: Add memory barriers when processing CQ/EQ entries ...
2017-11-15Merge branch 'for-4.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Cgroup2 cpu controller support is finally merged. - Basic cpu statistics support to allow monitoring by default without the CPU controller enabled. - cgroup2 cpu controller support. - /sys/kernel/cgroup files to help dealing with new / optional features" * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: export list of cgroups v2 features using sysfs cgroup: export list of delegatable control files using sysfs cgroup: mark @cgrp __maybe_unused in cpu_stat_show() MAINTAINERS: relocate cpuset.c cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat sched: Implement interface for cgroup unified hierarchy sched: Misc preps for cgroup unified hierarchy interface sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE cgroup: statically initialize init_css_set->dfl_cgrp cgroup: Implement cgroup2 basic CPU usage accounting cpuacct: Introduce cgroup_account_cputime[_field]() sched/cputime: Expose cputime_adjust()
2017-11-15Merge branch 'for-4.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata updates from Tejun Heo: "Nothing too interesting or alarming. Other than a new power saving mode addition to ahci and crash fix on a tracepoint, all changes are trivial or device-specific" * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (22 commits) ahci: imx: Handle increased read failures for IMX53 temperature sensor in low frequency mode. ata: sata_dwc_460ex: Propagate platform device ID to DMA driver ata: fixes kernel crash while tracing ata_eh_link_autopsy event ata: pata_pdc2027x: Fix space before '[' error. libata: fix spelling mistake: 'ambigious' -> 'ambiguous' ata: ceva: Add SMMU support for SATA IP ata: ceva: Correct the suspend and resume logic for SATA ata: ceva: Correct the AXI bus configuration for SATA ports ata: ceva: Add CCI support for SATA if CCI is enabled ata: ceva: Make RxWaterMark value as module parameter ata: ceva: Disable Device Sleep capability ata: ceva: Add gen 3 mode support in driver ata: ceva: Move sata port phy oob settings to device-tree devicetree: bindings: Add sata port phy config parameters in ahci-ceva ata: mark expected switch fall-throughs ata: sata_mv: remove a redundant assignment to pointer ehi ahci: Add support for Cavium's fifth generation SATA controller ata: sata_rcar: Use of_device_get_match_data() helper libata: make ata_port_type const libata: make static arrays const, reduces object code size ...
2017-11-15Merge tag 'modules-for-v4.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull module updates from Jessica Yu: "Summary of modules changes for the 4.15 merge window: - treewide module_param_call() cleanup, fix up set/get function prototype mismatches, from Kees Cook - minor code cleanups" * tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: Do not paper over type mismatches in module_param_call() treewide: Fix function prototypes for module_param_call() module: Prepare to convert all module_param_call() prototypes kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
2017-11-15Merge tag 'audit-pr-20171113' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Another relatively small pull request for audit, nine patches total. The only real new bit of functionality is the patch from Richard which adds the ability to filter records based on the filesystem type. The remainder are bug fixes and cleanups; the bug fix highlights include: - ensuring that we properly audit init/PID-1 (me) - allowing the audit daemon to shutdown the kernel/auditd connection cleanly by setting the audit PID to zero (Steve)" * tag 'audit-pr-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: filter PATH records keyed on filesystem magic Audit: remove unused audit_log_secctx function audit: Allow auditd to set pid to 0 to end auditing audit: Add new syscalls to the perm=w filter audit: use audit_set_enabled() in audit_enable() audit: convert audit_ever_enabled to a boolean audit: don't use simple_strtol() anymore audit: initialize the audit subsystem as early as possible audit: ensure that 'audit=1' actually enables audit for PID 1
2017-11-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Highlights: 1) Maintain the TCP retransmit queue using an rbtree, with 1GB windows at 100Gb this really has become necessary. From Eric Dumazet. 2) Multi-program support for cgroup+bpf, from Alexei Starovoitov. 3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew Lunn. 4) Add meter action support to openvswitch, from Andy Zhou. 5) Add a data meta pointer for BPF accessible packets, from Daniel Borkmann. 6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet. 7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli. 8) More work to move the RTNL mutex down, from Florian Westphal. 9) Add 'bpftool' utility, to help with bpf program introspection. From Jakub Kicinski. 10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper Dangaard Brouer. 11) Support 'blocks' of transformations in the packet scheduler which can span multiple network devices, from Jiri Pirko. 12) TC flower offload support in cxgb4, from Kumar Sanghvi. 13) Priority based stream scheduler for SCTP, from Marcelo Ricardo Leitner. 14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg. 15) Add RED qdisc offloadability, and use it in mlxsw driver. From Nogah Frankel. 16) eBPF based device controller for cgroup v2, from Roman Gushchin. 17) Add some fundamental tracepoints for TCP, from Song Liu. 18) Remove garbage collection from ipv6 route layer, this is a significant accomplishment. From Wei Wang. 19) Add multicast route offload support to mlxsw, from Yotam Gigi" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits) tcp: highest_sack fix geneve: fix fill_info when link down bpf: fix lockdep splat net: cdc_ncm: GetNtbFormat endian fix openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start netem: remove unnecessary 64 bit modulus netem: use 64 bit divide by rate tcp: Namespace-ify sysctl_tcp_default_congestion_control net: Protect iterations over net::fib_notifier_ops in fib_seq_sum() ipv6: set all.accept_dad to 0 by default uapi: fix linux/tls.h userspace compilation error usbnet: ipheth: prevent TX queue timeouts when device not ready vhost_net: conditionally enable tx polling uapi: fix linux/rxrpc.h userspace compilation errors net: stmmac: fix LPI transitioning for dwmac4 atm: horizon: Fix irq release error net-sysfs: trigger netlink notification on ifalias change via sysfs openvswitch: Using kfree_rcu() to simplify the code openvswitch: Make local function ovs_nsh_key_attr_size() static openvswitch: Fix return value check in ovs_meter_cmd_features() ...
2017-11-15Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The big highlight is support for the Scalable Vector Extension (SVE) which required extensive ABI work to ensure we don't break existing applications by blowing away their signal stack with the rather large new vector context (<= 2 kbit per vector register). There's further work to be done optimising things like exception return, but the ABI is solid now. Much of the line count comes from some new PMU drivers we have, but they're pretty self-contained and I suspect we'll have more of them in future. Plenty of acronym soup here: - initial support for the Scalable Vector Extension (SVE) - improved handling for SError interrupts (required to handle RAS events) - enable GCC support for 128-bit integer types - remove kernel text addresses from backtraces and register dumps - use of WFE to implement long delay()s - ACPI IORT updates from Lorenzo Pieralisi - perf PMU driver for the Statistical Profiling Extension (SPE) - perf PMU driver for Hisilicon's system PMUs - misc cleanups and non-critical fixes" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (97 commits) arm64: Make ARMV8_DEPRECATED depend on SYSCTL arm64: Implement __lshrti3 library function arm64: support __int128 on gcc 5+ arm64/sve: Add documentation arm64/sve: Detect SVE and activate runtime support arm64/sve: KVM: Hide SVE from CPU features exposed to guests arm64/sve: KVM: Treat guest SVE use as undefined instruction execution arm64/sve: KVM: Prevent guests from using SVE arm64/sve: Add sysctl to set the default vector length for new processes arm64/sve: Add prctl controls for userspace vector length management arm64/sve: ptrace and ELF coredump support arm64/sve: Preserve SVE registers around EFI runtime service calls arm64/sve: Preserve SVE registers around kernel-mode NEON use arm64/sve: Probe SVE capabilities and usable vector lengths arm64: cpufeature: Move sys_caps_initialised declarations arm64/sve: Backend logic for setting the vector length arm64/sve: Signal handling support arm64/sve: Support vector length resetting for new processes arm64/sve: Core task context handling arm64/sve: Low-level CPU setup ...
2017-11-15Merge tag 'riscv-for-linus-4.15-arch-v9-premerge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux Pull RISC-V architecture support from Palmer Dabbelt: "This contains the core RISC-V Linux port, which has been through nine rounds of review on various mailing lists. The port is not complete: there's some cleanup patches moving through the review process, a whole bunch of drivers that need some work, and a lot of feature additions that will be needed. The patches contained in this tag have been through nine rounds of review on the various mailing lists. I have some outstanding cleanup patches, but since there's been so much review on these patches I thought it would be best to submit them as-is and then submit explicit cleanup patches so everyone can review them. This first patch set is big enough that it's a bit of a pain to constantly rewrite, and it's caused a few headaches with various contributors. The port is definately a work in progress. While what's there builds and boots with 4.14, it's a bit hard to actually see anything happen because there are no device drivers yet. I maintain a staging branch that contains all the device drivers and cleanup that actually works, but those patches won't all be ready for a while. I'd like to get what we currently have into your tree so everyone can start working from a single base -- of particular importance is allowing the glibc upstreaming process to proceed so we can sort out any possibly lingering user-visible ABI problems we might have. Copied below is the ChangeLog that contains the history of this patch set: (v9) As per suggestions on our v8 patch set, I've split the core architecture code out from our drivers and would like to submit this patch set to be included into linux-next, with the goal being to be merged in during the next merge window. This patch set is based on 4.14-rc2, but if it's better to have it based on something else then I can change it around. This patch set contains just the core arch code for RISC-V, so while it builds an nominally boots, you can't print or take an interrupt so it's not that useful. If you're looking to actually boot a system it would probably be better to use the full patch set listed below. We've collected a handful of tags from reviewers, and the remainder of the patch set only got minimal feedback last time. Here's what changed: - We now use the device tree to initialize the timer driver so it's less tighly coupled with the arch port. - I cleaned up the defconfigs -- there's actually now just one, and it's empty. For now I think we're OK with what the kernel sets as defaults, but I anticipate we'll begin to expand this as people start to use the port more. - The VDSO symbols version is sane. - We WFI while spinning in the boot loop. - A handful of comments have been added. While there are still a handful of FIXMEs in this patch set, we've started to get enough interest from various users and contributors that maintaining an out of tree patch set is starting to become a big burden. Hopefully the patches are good enough to merge now, which will at least get everyone working in a more reasonable manner as we clean up the remaining issues. (v8) I know it may not be the ideal time to submit a patch set right now, as it's the middle of the merge window, but things have calmed down quite a bit in the last month so I thought it would be good to get everyone on the same page. There's been a handful of changes since the last patch set, but most of them are fairly minor: - We changed PAGE_OFFSET to allowing mapping more physical memory on 64-bit systems. This is user configurable, as it triggers a different code model that generates slightly less efficient code. - The device tree binding documentation is back, I'd managed to lose it at some point. - We now pass the atomic64 test suite - The SBI timer driver has been refactored. (v7) It's been a while since my last patch set, but the changes han been fairly minimal: - The PCI cleanup patches have been dropped, we'll do them as a separate patch set later. - We've the Kconfig entries from CONFIG_ISA_* to CONFIG_RISCV_ISA_*, to make grep easier. - There have been a handful of memory model related tweaks in I/O land, particularly relating the PCI and the upcoming platform specification. There are significant comments in the relevant files. This is still a WIP, but I think we're close to getting as good as we're going to get until we end up with some more specifications. (v6) As it's been only a day since the v5 patch set, the changes are pretty minimal: - The patch set is now based on linux-next/master, which I believe is a better base now that we're getting closer to upstream. - EARLY_PRINTK is no longer an option. Since the SBI console is reasonable, there's no penalty to enabling it (and thus no benefit to disabling it). - The mmap syscalls were refactored a bit. (v5) Things have really started to calm down, so this is fairly similar to the v4 patch set. The most interesting changes include: - We've moved back to a single patch set. - SMP support has been fixed, I was accidentally running on a non-SMP configuration. There were various mistakes all over the tree as a result of this. - The cmpxchg syscalls have been removed, as they were deemed a bad idea. As a result, RISC-V Linux systems mandate the A extension. The corresponding Kconfig entry to enable builds on non-A systems has been removed. - A few more atomic fixes: mostly fence changes, but those resulted in a handful of additional macros that were no longer necessary. - riscv_early_sie has been removed. (v4) There have only been a few changes since the v3 patch set: - The cmpxchg64 syscall is no longer enabled on 32-bit systems. It's not possible to provide this on SMP systems, and it's not necessary as glibc knows not to call it. - We provide a ELF_HWCAP so users can determine the ISA of the machine the kernel is running on. - The multi-line comments are in a better form. - There were a handful of headers that could be replaced with the asm-generic versions, and a few unnecessary definitions. - We no longer use printk, but instead use pr_*. - A few Kconfig and defconfig entries have been cleaned up. (v3) A highlight of the changes since the v2 patch set includes: - We've split out all our drivers into separate patch sets, which I've already sent out to the relevant maintainers. I haven't included those patches in this patch set, but some of them are necessary to build our port. - The patch set is now split up differently: rather than being split per directory it is split per topic. Hopefully this will make it easier to review the port on the mailing list. The split is a bit rough, so you probably still want to look at the patch set as a whole. - atomic.h has been completely rewritten and is hopefully now correct. I've attempted to sanitize the various other memory model related code as well, and I think it should all be sane now aside from a handful of FIXMEs commented in the code. - We've changed the cmpexchg syscall to always exist and to not be multiplexed. There is also a VDSO entry for compare and exchange, which allows kernels with the A extension to execute user code without the A extension reasonably fast. - Our user-visible register state now contains enough space for the Q extension for 128-bit floating point, as well as a few words to allow extensibility to future ISA extensions like the eventual V extension for vectors. - A handful of driver cleanups, but these have been split into separate patch sets now so I won't duplicate them here. (v2) A highlight of the changes since the v1 patch set includes: - We've split out our drivers into the right places, which means now there's a lot more patches. I'll be submitting these patches to various subsystem maintainers and including them in any future RISC-V patch sets until they've been merged. - The SBI console driver has been completely rewritten to use the HVC helpers and is now significantly smaller. - We've begun to use weaker barriers as opposed to just the big "fence". There's still some work to do here, specifically: - We need fences in the relaxed MMIO functions. - The non-relaxed MMIO functions are missing R/W bits on their fences. - Many AMOs need the aq and rl bits set. - We now have thread_info in task_struct. As a result, sscratch now contains TP instead of SP. This was necessary because thread_info is no longer on the stack. - A few shared routines have been added that we use instead of creating another arch copy" Reviewed-by: Arnd Bergmann <arnd@arndb.de> * tag 'riscv-for-linus-4.15-arch-v9-premerge' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux: RISC-V: Build Infrastructure RISC-V: User-facing API RISC-V: Paging and MMU RISC-V: Device, timer, IRQs, and the SBI RISC-V: Task implementation RISC-V: ELF and module implementation RISC-V: Generic library routines and assembly RISC-V: Atomic and Locking Code RISC-V: Init and Halt Code dt-bindings: RISC-V CPU Bindings lib: Add shared copies of some GCC library routines MAINTAINERS: Add RISC-V
2017-11-15x86 / CPU: Always show current CPU frequency in /proc/cpuinfoRafael J. Wysocki
After commit 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo on x86 can be either the nominal CPU frequency (which is constant) or the frequency most recently requested by a scaling governor in cpufreq, depending on the cpufreq configuration. That is somewhat inconsistent and is different from what it was before 4.13, so in order to restore the previous behavior, make it report the current CPU frequency like the scaling_cur_freq sysfs file in cpufreq. To that end, modify the /proc/cpuinfo implementation on x86 to use aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback registers, if available, and use their values to compute the CPU frequency to be reported as "cpu MHz". However, do that carefully enough to avoid accumulating delays that lead to unacceptable access times for /proc/cpuinfo on systems with many CPUs. Run aperfmperf_snapshot_khz() once on all CPUs asynchronously at the /proc/cpuinfo open time, add a single delay upfront (if necessary) at that point and simply compute the current frequency while running show_cpuinfo() for each individual CPU. Also, to avoid slowing down /proc/cpuinfo accesses too much, reduce the default delay between consecutive APERF and MPERF reads to 10 ms, which should be sufficient to get large enough numbers for the frequency computation in all cases. Fixes: 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org>
2017-11-15Merge branch 'for-linus' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatching updates from Jiri Kosina: - shadow variables support, allowing livepatches to associate new "shadow" fields to existing data structures, from Joe Lawrence - pre/post patch callbacks API, allowing livepatch writers to register callbacks to be called before and after patch application, from Joe Lawrence * 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: __klp_disable_patch() should never be called for disabled patches livepatch: Correctly call klp_post_unpatch_callback() in error paths livepatch: add transition notices livepatch: move transition "complete" notice into klp_complete_transition() livepatch: add (un)patch callbacks livepatch: Small shadow variable documentation fixes livepatch: __klp_shadow_get_or_alloc() is local to shadow.c livepatch: introduce shadow variable API
2017-11-15Merge branch 'for-linus' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual rocket-science from trivial tree for 4.15" * 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: MAINTAINERS: relinquish kconfig MAINTAINERS: Update my email address treewide: Fix typos in Kconfig kfifo: Fix comments init/Kconfig: Fix module signing document location misc: ibmasm: Return error on error path HID: logitech-hidpp: fix mistake in printk, "feeback" -> "feedback" MAINTAINERS: Correct path to uDraw PS3 driver tracing: Fix doc mistakes in trace sample tracing: Kconfig text fixes for CONFIG_HWLAT_TRACER MIPS: Alchemy: Remove reverted CONFIG_NETLINK_MMAP from db1xxx_defconfig mm/huge_memory.c: fixup grammar in comment lib/xz: Add fall-through comments to a switch statement
2017-11-15Merge branch 'for-linus' of ↵Linus Torvalds
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID updates from Jiri Kosina: - high resolution mode for Dell canvas support, from Benjamin Tissoires - pen handling fixes for the Wacom driver, from Jason Gerecke - i2c-hid: Apollo-Lake based laptops improvements, from Hans de Goede - Input/Core: eraser tool support, from Ping Cheng - new ALPS touchpad (T4, found currently on HP EliteBook 1000, Zbook Stduio and HP Elite book x360) supportm from Masaki Ota - other smaller assorted fixes * 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (33 commits) HID: cp2112: fix broken gpio_direction_input callback HID: cp2112: fix interface specification URL HID: Wacom: switch Dell canvas into highres mode HID: wacom: generic: Send BTN_STYLUS3 when both barrel switches are set HID: sony: Fix SHANWAN pad rumbling on USB HID: i2c-hid: Add no-irq-after-reset quirk for 0911:5288 device HID: add backlight level quirk for Asus ROG laptops HID: cp2112: add HIDRAW dependency HID: Add ID 044f:b605 ThrustMaster, Inc. force feedback Racing Wheel HID: hid-logitech: remove redundant assignment to pointer value HID: wacom: generic: Recognize WACOM_HID_WD_PEN as a type of pen collection HID: rmi: Check that a device is a RMI device before calling RMI functions HID: add multi-input quirk for GamepadBlock HID: alps: add new U1 device ID HID: alps: add support for Alps T4 Touchpad device HID: alps: remove variables local to u1_init() from the device struct HID: alps: properly handle max_fingers and minimum on X and Y axis HID: alps: Separate U1 device code HID: alps: delete unnecessary struct u1_dev devInfo HID: usbhid: Convert timers to use timer_setup() ...