summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2017-07-26percpu: update alloc path to only scan if contig hints are brokenDennis Zhou (Facebook)
Metadata is kept per block to keep track of where the contig hints are. Scanning can be avoided when the contig hints are not broken. In that case, left and right contigs have to be managed manually. This patch changes the allocation path hint updating to only scan when contig hints are broken. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: keep track of the best offset for contig hintsDennis Zhou (Facebook)
This patch makes the contig hint starting offset optimization from the previous patch as honest as it can be. For both chunk and block starting offsets, make sure it keeps the starting offset with the best alignment. The block skip optimization is added in a later patch when the pcpu_find_block_fit iterator is swapped in. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: skip chunks if the alloc does not fit in the contig hintDennis Zhou (Facebook)
This patch adds chunk->contig_bits_start to keep track of the contig hint's offset and the check to skip the chunk if it does not fit. If the chunk's contig hint starting offset cannot satisfy an allocation, the allocator assumes there is enough memory pressure in this chunk to either use a different chunk or create a new one. This accepts a less tight packing for a smoother latency curve. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: add first_bit to keep track of the first free in the bitmapDennis Zhou (Facebook)
This patch adds first_bit to keep track of the first free bit in the bitmap. This hint helps prevent scanning of fully allocated blocks. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: introduce bitmap metadata blocksDennis Zhou (Facebook)
This patch introduces the bitmap metadata blocks and adds the skeleton of the code that will be used to maintain these blocks. Each chunk's bitmap is made up of full metadata blocks. These blocks maintain basic metadata to help prevent scanning unnecssarily to update hints. Full scanning methods are used for the skeleton and will be replaced in the coming patches. A number of helper functions are added as well to do conversion of pages to blocks and manage offsets. Comments will be updated as the final version of each function is added. There exists a relationship between PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE, the region size, and unit_size. Every chunk's region (including offsets) is page aligned at the beginning to preserve alignment. The end is aligned to LCM(PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE) to ensure that the end can fit with the populated page map which is by page and every metadata block is fully accounted for. The unit_size is already page aligned, but must also be aligned with PCPU_BITMAP_BLOCK_SIZE to ensure full metadata blocks. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: replace area map allocator with bitmapDennis Zhou (Facebook)
The percpu memory allocator is experiencing scalability issues when allocating and freeing large numbers of counters as in BPF. Additionally, there is a corner case where iteration is triggered over all chunks if the contig_hint is the right size, but wrong alignment. This patch replaces the area map allocator with a basic bitmap allocator implementation. Each subsequent patch will introduce new features and replace full scanning functions with faster non-scanning options when possible. Implementation: This patchset removes the area map allocator in favor of a bitmap allocator backed by metadata blocks. The primary goal is to provide consistency in performance and memory footprint with a focus on small allocations (< 64 bytes). The bitmap removes the heavy memmove from the freeing critical path and provides a consistent memory footprint. The metadata blocks provide a bound on the amount of scanning required by maintaining a set of hints. In an effort to make freeing fast, the metadata is updated on the free path if the new free area makes a page free, a block free, or spans across blocks. This causes the chunk's contig hint to potentially be smaller than what it could allocate by up to the smaller of a page or a block. If the chunk's contig hint is contained within a block, a check occurs and the hint is kept accurate. Metadata is always kept accurate on allocation, so there will not be a situation where a chunk has a later contig hint than available. Evaluation: I have primarily done testing against a simple workload of allocation of 1 million objects (2^20) of varying size. Deallocation was done by in order, alternating, and in reverse. These numbers were collected after rebasing ontop of a80099a152. I present the worst-case numbers here: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 310 | 4770 16B | 557 | 1325 64B | 436 | 273 256B | 776 | 131 1024B | 3280 | 122 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 490 | 70 16B | 515 | 75 64B | 610 | 80 256B | 950 | 100 1024B | 3520 | 200 This data demonstrates the inability for the area map allocator to handle less than ideal situations. In the best case of reverse deallocation, the area map allocator was able to perform within range of the bitmap allocator. In the worst case situation, freeing took nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator dramatically improves the consistency of the free path. The small allocations performed nearly identical regardless of the freeing pattern. While it does add to the allocation latency, the allocation scenario here is optimal for the area map allocator. The area map allocator runs into trouble when it is allocating in chunks where the latter half is full. It is difficult to replicate this, so I present a variant where the pages are second half filled. Freeing was done sequentially. Below are the numbers for this scenario: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 4118 | 4892 16B | 1651 | 1163 64B | 598 | 285 256B | 771 | 158 1024B | 3034 | 160 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 481 | 67 16B | 506 | 69 64B | 636 | 75 256B | 892 | 90 1024B | 3262 | 147 The data shows a parabolic curve of performance for the area map allocator. This is due to the memmove operation being the dominant cost with the lower object sizes as more objects are packed in a chunk and at higher object sizes, the traversal of the chunk slots is the dominating cost. The bitmap allocator suffers this problem as well. The above data shows the inability to scale for the allocation path with the area map allocator and that the bitmap allocator demonstrates consistent performance in general. The second problem of additional scanning can result in the area map allocator completing in 52 minutes when trying to allocate 1 million 4-byte objects with 8-byte alignment. The same workload takes approximately 16 seconds to complete for the bitmap allocator. V2: Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap using bytes instead of bits. Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26errseq: rename __errseq_set to errseq_setJeff Layton
Nothing calls this wrapper anymore, so just remove it and rename the old function to get rid of the double underscore prefix. Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-26percpu: generalize bitmap (un)populated iteratorsDennis Zhou (Facebook)
The area map allocator only used a bitmap for the backing page state. The new bitmap allocator will use bitmaps to manage the allocation region in addition to this. This patch generalizes the bitmap iterators so they can be reused with the bitmap allocator. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: increase minimum percpu allocation size and align first regionsDennis Zhou (Facebook)
This patch increases the minimum allocation size of percpu memory to 4-bytes. This change will help minimize the metadata overhead associated with the bitmap allocator. The assumption is that most allocations will be of objects or structs greater than 2 bytes with integers or longs being used rather than shorts. The first chunk regions are now aligned with the minimum allocation size. The reserved region is expected to be set as a multiple of the minimum allocation size. The static region is aligned up and the delta is removed from the dynamic size. This works because the dynamic size is increased to be page aligned. If the static size is not minimum allocation size aligned, then there must be a gap that is added to the dynamic size. The dynamic size will never be smaller than the set value. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: introduce nr_empty_pop_pages to help empty page accountingDennis Zhou (Facebook)
pcpu_nr_empty_pop_pages is used to ensure there are a handful of free pages around to serve atomic allocations. A new field, nr_empty_pop_pages, is added to the pcpu_chunk struct to keep track of the number of empty pages. This field is needed as the number of empty populated pages is globally tracked and deltas are used to update in the bitmap allocator. Pages that contain a hidden area are not considered to be empty. This new field is exposed in percpu_stats. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: change the number of pages marked in the first_chunk pop bitmapDennis Zhou (Facebook)
The populated bitmap represents the state of the pages the chunk serves. Prior, the bitmap was marked completely used as the first chunk was allocated and immutable. This is misleading because the first chunk may not be completely filled. Additionally, with moving the base_addr up in the previous patch, the population check no longer corresponds to what was being checked. This patch modifies the population map to be only the number of pages the region serves and to make what it was checking correspond correctly again. The change is to remove any misunderstanding between the size of the populated bitmap and the actual size of it. The work function page iterators now use nr_pages for the check rather than pcpu_unit_pages because nr_populated is now chunk specific. Without this, the work function would try to populate the remainder of these chunks despite it not serving any more than nr_pages when nr_pages is set less than pcpu_unit_pages. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: combine percpu address checksDennis Zhou (Facebook)
The percpu address checks for the reserved and dynamic region chunks are now specific to each region. The address checking logic can be combined taking advantage of the global references to the dynamic and static region chunks. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: modify base_addr to be region specificDennis Zhou (Facebook)
Originally, the first chunk was served by one or two chunks, each given a region they are responsible for. Despite this, the arithmetic was based off of the true base_addr of the chunk making it be overly inclusive. This patch moves the base_addr of chunks that are responsible for the first chunk. The base_addr must remain page aligned to keep the address alignment correct, so it is the beginning of the region served page aligned down. start_offset holds where the region served begins from this new base_addr. The corresponding percpu address checks are modified to be more specific as a result. The first chunk considers only the dynamic region and both first chunk and reserved chunk checks ignore the static region. The static region addresses should never be passed into the allocator. There is no impact here besides distinguishing the first chunk and making the checks specific. The percpu pointer to physical address is left intact as addresses are not given out in the non-allocated portion of percpu memory. nr_pages is added to pcpu_chunk to keep track of the size of the entire region served containing both start_offset and end_offset. This variable will be used to manage the bitmap allocator. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: setup_first_chunk rename schunk/dchunk to chunkDennis Zhou (Facebook)
There is no need to have the static chunk and dynamic chunk be named separately as the allocations are sequential. This preemptively solves the misnomer problem with the base_addrs being moved up in the following patch. It also removes a ternary operation deciding the first chunk. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: end chunk area maps page aligned for the populated bitmapDennis Zhou (Facebook)
The area map allocator manages the first chunk area by hiding all but the region it is responsible for serving in the area map. To align this with the populated page bitmap, end_offset is introduced to keep track of the delta to end page aligned. The area map is appended with the page aligned end when necessary to be in line with how the bitmap allocator requires the ending to be aligned with the LCM of PAGE_SIZE and the size of each bitmap block. percpu_stats is updated to ignore this region when present. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: unify allocation of schunk and dchunkDennis Zhou (Facebook)
Create a common allocator for first chunk initialization, pcpu_alloc_first_chunk. Comments for this function will be added in a later patch once the bitmap allocator is added. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: setup_first_chunk remove dyn_size and consolidate logicDennis Zhou (Facebook)
There is logic for setting variables in the static chunk init code that could be consolidated with the dynamic chunk init code. This combines this logic to setup for combining the allocation paths. reserved_size is used as the conditional as a dynamic region will always exist. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: remove has_reserved from pcpu_chunkDennis Zhou (Facebook)
Prior this variable was used to manage statistics when the first chunk had a reserved region. The previous patch introduced start_offset to keep track of the offset by value rather than boolean. Therefore, has_reserved can be removed. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: introduce start_offset to pcpu_chunkDennis Zhou (Facebook)
The reserved chunk arithmetic uses a global variable pcpu_reserved_chunk_limit that is set in the first chunk init code to hide a portion of the area map. The bitmap allocator to come will eventually move the base_addr up and require both the reserved chunk and static chunk to maintain this offset. pcpu_reserved_chunk_limit is removed and start_offset is added. The first chunk that is circulated and is pcpu_first_chunk serves the dynamic region, the region following the reserved region. The reserved chunk address check will temporarily use the first chunk to identify its address range. A following patch will increase the base_addr and remove this. If there is no reserved chunk, this will check the static region and return false because those values should never be passed into the allocator. Lastly, when linking in the first chunk, make sure to count the right free region for the number of empty populated pages. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26percpu: setup_first_chunk enforce dynamic region must existDennis Zhou (Facebook)
The first chunk is handled as a special case as it is composed of the static, reserved, and dynamic regions. The code handles each case individually. The next several patches will merge these code paths and lay the foundation for the bitmap allocator. This patch modifies logic to enforce that a dynamic region exists and changes the area map to account for that. This brings the logic closer to the dynamic chunk's init logic. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26kasan: Allow kasan_check_read/write() to accept pointers to volatilesDmitry Vyukov
Currently kasan_check_read/write() accept 'const void*', make them accept 'const volatile void*'. This is required for instrumentation of atomic operations and there is just no reason to not allow that. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/33e5ec275c1ee89299245b2ebbccd63709c6021f.1498140838.git.dvyukov@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-21cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCSTejun Heo
css_task_iter currently always walks all tasks. With the scheduled cgroup v2 thread support, the iterator would need to handle multiple types of iteration. As a preparation, add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag is not specified, it walks all tasks as before. When asserted, the iterator only walks the group leaders. For now, the only user of the flag is cgroup v2 "cgroup.procs" file which no longer needs to skip non-leader tasks in cgroup_procs_next(). Note that cgroup v1 "cgroup.procs" can't use the group leader walk as v1 "cgroup.procs" doesn't mean "list all thread group leaders in the cgroup" but "list all thread group id's with any threads in the cgroup". While at it, update cgroup_procs_show() to use task_pid_vnr() instead of task_tgid_vnr(). As the iteration guarantees that the function only sees group leaders, this doesn't change the output and will allow sharing the function for thread iteration. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-18x86/mm: Add support to access boot related data in the clearTom Lendacky
Boot data (such as EFI related data) is not encrypted when the system is booted because UEFI/BIOS does not run with SME active. In order to access this data properly it needs to be mapped decrypted. Update early_memremap() to provide an arch specific routine to modify the pagetable protection attributes before they are applied to the new mapping. This is used to remove the encryption mask for boot related data. Update memremap() to provide an arch specific routine to determine if RAM remapping is allowed. RAM remapping will cause an encrypted mapping to be generated. By preventing RAM remapping, ioremap_cache() will be used instead, which will provide a decrypted mapping of the boot related data. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: kasan-dev@googlegroups.com Cc: kvm@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/81fb6b4117a5df6b9f2eda342f81bbef4b23d2e5.1500319216.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-18x86/mm: Extend early_memremap() support with additional attrsTom Lendacky
Add early_memremap() support to be able to specify encrypted and decrypted mappings with and without write-protection. The use of write-protection is necessary when encrypting data "in place". The write-protect attribute is considered cacheable for loads, but not stores. This implies that the hardware will never give the core a dirty line with this memtype. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: kasan-dev@googlegroups.com Cc: kvm@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/479b5832c30fae3efa7932e48f81794e86397229.1500319216.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-17percpu: update the header comment and pcpu_build_alloc_info commentsDennis Zhou (Facebook)
The header comment for percpu memory is a little hard to parse and is not super clear about how the first chunk is managed. This adds a little more clarity to the situation. There is also quite a bit of tricky logic in the pcpu_build_alloc_info. This adds a restructure of a comment to add a little more information. Unfortunately, you will still have to piece together a handful of other comments too, but should help direct you to the meaningful comments. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17percpu: expose pcpu_nr_empty_pop_pages in pcpu_statsDennis Zhou (Facebook)
Percpu memory holds a minimum threshold of pages that are populated in order to serve atomic percpu memory requests. This change makes it easier to verify that there are a minimum number of populated pages lying around. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17percpu: change the format for percpu_stats outputDennis Zhou (Facebook)
This makes the debugfs output for percpu_stats a little easier to read by changing the spacing of the output to be consistent. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17percpu: pcpu-stats change void buffer to int bufferDennis Zhou (Facebook)
Changes the use of a void buffer to an int buffer for clarity. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-15Merge branch 'work.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ->s_options removal from Al Viro: "Preparations for fsmount/fsopen stuff (coming next cycle). Everything gets moved to explicit ->show_options(), killing ->s_options off + some cosmetic bits around fs/namespace.c and friends. Basically, the stuff needed to work with fsmount series with minimum of conflicts with other work. It's not strictly required for this merge window, but it would reduce the PITA during the coming cycle, so it would be nice to have those bits and pieces out of the way" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: isofs: Fix isofs_show_options() VFS: Kill off s_options and helpers orangefs: Implement show_options 9p: Implement show_options isofs: Implement show_options afs: Implement show_options affs: Implement show_options befs: Implement show_options spufs: Implement show_options bpf: Implement show_options ramfs: Implement show_options pstore: Implement show_options omfs: Implement show_options hugetlbfs: Implement show_options VFS: Don't use save/replace_mount_options if not using generic_show_options VFS: Provide empty name qstr VFS: Make get_filesystem() return the affected filesystem VFS: Clean up whitespace in fs/namespace.c and fs/super.c Provide a function to create a NUL-terminated string from unterminated data
2017-07-14mm: fix overflow check in expand_upwards()Helge Deller
Jörn Engel noticed that the expand_upwards() function might not return -ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE. Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa which all define TASK_SIZE as 0xffffffff, but since none of those have an upwards-growing stack we currently have no actual issue. Nevertheless let's fix this just in case any of the architectures with an upward-growing stack (currently parisc, metag and partly ia64) define TASK_SIZE similar. Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit") Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Jörn Engel <joern@purestorage.com> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12writeback: rework wb_[dec|inc]_stat family of functionsNikolay Borisov
Currently the writeback statistics code uses a percpu counters to hold various statistics. Furthermore we have 2 families of functions - those which disable local irq and those which doesn't and whose names begin with double underscore. However, they both end up calling __add_wb_stats which in turn calls percpu_counter_add_batch which is already irq-safe. Exploiting this fact allows to eliminated the __wb_* functions since they don't add any further protection than we already have. Furthermore, refactor the wb_* function to call __add_wb_stat directly without the irq-disabling dance. This will likely result in better runtime of code which deals with modifying the stat counters. While at it also document why percpu_counter_add_batch is in fact preempt and irq-safe since at least 3 people got confused. Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Josef Bacik <jbacik@fb.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jeff Layton <jlayton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12mm, migration: do not trigger OOM killer when migrating memoryMichal Hocko
Page migration (for memory hotplug, soft_offline_page or mbind) needs to allocate a new memory. This can trigger an oom killer if the target memory is depleated. Although quite unlikely, still possible, especially for the memory hotplug (offlining of memoery). Up to now we didn't really have reasonable means to back off. __GFP_NORETRY can fail just too easily and __GFP_THISNODE sticks to a single node and that is not suitable for all callers. But now that we have __GFP_RETRY_MAYFAIL we should use it. It is preferable to fail the migration than disrupt the system by killing some processes. Link: http://lkml.kernel.org/r/20170623085345.11304-7-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizesMichal Hocko
Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the request size we can drop the hackish implementation for !costly orders. __GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward progress and backs of when we are out of memory for the requested size. Therefore we do not need to enforce__GFP_NORETRY for !costly orders just to silent the oom killer anymore. Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko
semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12mm/memory.c: mark create_huge_pmd() inline to prevent build failureGeert Uytterhoeven
With gcc 4.1.2: mm/memory.o: In function `create_huge_pmd': memory.c:(.text+0x93e): undefined reference to `do_huge_pmd_anonymous_page' Interestingly, create_huge_pmd() is emitted in the assembler output, but never called. Converting transparent_hugepage_enabled() from a macro to a static inline function reduced the ability of the compiler to remove unused code. Fix this by marking create_huge_pmd() inline. Fixes: 16981d763501c0e0 ("mm: improve readability of transparent_hugepage_enabled()") Link: http://lkml.kernel.org/r/1499842660-10665-1-git-send-email-geert@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10kasan: make get_wild_bug_type() staticColin Ian King
The helper function get_wild_bug_type() does not need to be in global scope, so make it static. Cleans up sparse warning: "symbol 'get_wild_bug_type' was not declared. Should it be static?" Link: http://lkml.kernel.org/r/20170622090049.10658-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/kasan/kasan.c: rename XXX_is_zero to XXX_is_nonzeroJoonsoo Kim
They return positive value, that is, true, if non-zero value is found. Rename them to reduce confusion. Link: http://lkml.kernel.org/r/20170516012350.GA16015@js1304-desktop Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/kasan: add support for memory hotplugAndrey Ryabinin
KASAN doesn't happen work with memory hotplug because hotplugged memory doesn't have any shadow memory. So any access to hotplugged memory would cause a crash on shadow check. Use memory hotplug notifier to allocate and map shadow memory when the hotplugged memory is going online and free shadow after the memory offlined. Link: http://lkml.kernel.org/r/20170601162338.23540-4-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Potapenko <glider@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/kasan: get rid of speculative shadow checksAndrey Ryabinin
For some unaligned memory accesses we have to check additional byte of the shadow memory. Currently we load that byte speculatively to have only single load + branch on the optimistic fast path. However, this approach has some downsides: - It's unaligned access, so this prevents porting KASAN on architectures which doesn't support unaligned accesses. - We have to map additional shadow page to prevent crash if speculative load happens near the end of the mapped memory. This would significantly complicate upcoming memory hotplug support. I wasn't able to notice any performance degradation with this patch. So these speculative loads is just a pain with no gain, let's remove them. Link: http://lkml.kernel.org/r/20170601162338.23540-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/kasan/kasan_init.c: use kasan_zero_pud for p4d tableJoonsoo Kim
There is missing optimization in zero_p4d_populate() that can save some memory when mapping zero shadow. Implement it like as others. Link: http://lkml.kernel.org/r/1494829255-23946-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/zsmalloc: simplify zs_max_alloc_size handlingJerome Marchand
Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented zsmalloc to allocate an object of the maximal size (ZS_MAX_ALLOC_SIZE). I think however the fix is unneededly complicated. This patch replaces the dynamic calculation of zs_size_classes at init time by a compile time calculation that uses the DIV_ROUND_UP() macro already used in get_size_class_index(). [akpm@linux-foundation.org: use min_t] Link: http://lkml.kernel.org/r/20170630114859.1979-1-jmarchan@redhat.com Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Mahendran Ganesh <opensource.ganesh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/memory-hotplug: switch locking to a percpu rwsemThomas Gleixner
Andrey reported a potential deadlock with the memory hotplug lock and the cpu hotplug lock. The reason is that memory hotplug takes the memory hotplug lock and then calls stop_machine() which calls get_online_cpus(). That's the reverse lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.c The problem has been there forever. The reason why this was never reported is that the cpu hotplug locking had this homebrewn recursive reader writer semaphore construct which due to the recursion evaded the full lock dep coverage. The memory hotplug code copied that construct verbatim and therefor has similar issues. Three steps to fix this: 1) Convert the memory hotplug locking to a per cpu rwsem so the potential issues get reported proper by lockdep. 2) Lock the online cpus in mem_hotplug_begin() before taking the memory hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc code to avoid recursive locking. 3) The cpu hotpluck locking in #2 causes a recursive locking of the cpu hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this by invoking lru_add_drain_all_cpuslocked() instead. Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: swap: provide lru_add_drain_all_cpuslocked()Thomas Gleixner
The rework of the cpu hotplug locking unearthed potential deadlocks with the memory hotplug locking code. The solution for these is to rework the memory hotplug locking code as well and take the cpu hotplug lock before the memory hotplug lock in mem_hotplug_begin(), but this will cause a recursive locking of the cpu hotplug lock when the memory hotplug code calls lru_add_drain_all(). Split out the inner workings of lru_add_drain_all() into lru_add_drain_all_cpuslocked() so this function can be invoked from the memory hotplug code with the cpu hotplug lock held. Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm: use dedicated helper to access rlimit valueKrzysztof Opasiak
Use rlimit() helper instead of manually writing whole chain from current task to rlim_cur. Link: http://lkml.kernel.org/r/20170705172811.8027-1-k.opasiak@samsung.com Signed-off-by: Krzysztof Opasiak <k.opasiak@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/list_lru.c: fix list_lru_count_node() to be race freeSahitya Tummala
list_lru_count_node() iterates over all memcgs to get the total number of entries on the node but it can race with memcg_drain_all_list_lrus(), which migrates the entries from a dead cgroup to another. This can return incorrect number of entries from list_lru_count_node(). Fix this by keeping track of entries per node and simply return it in list_lru_count_node(). Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/mmap.c: expand_downwards: don't require the gap if !vm_prevOleg Nesterov
expand_stack(vma) fails if address < stack_guard_gap even if there is no vma->vm_prev. I don't think this makes sense, and we didn't do this before the recent commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas"). We do not need a gap in this case, any address is fine as long as security_mmap_addr() doesn't object. This also simplifies the code, we know that address >= prev->vm_end and thus underflow is not possible. Link: http://lkml.kernel.org/r/20170628175258.GA24881@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stackMichal Hocko
Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has introduced a regression in some rust and Java environments which are trying to implement their own stack guard page. They are punching a new MAP_FIXED mapping inside the existing stack Vma. This will confuse expand_{downwards,upwards} into thinking that the stack expansion would in fact get us too close to an existing non-stack vma which is a correct behavior wrt safety. It is a real regression on the other hand. Let's work around the problem by considering PROT_NONE mapping as a part of the stack. This is a gros hack but overflowing to such a mapping would trap anyway an we only can hope that usespace knows what it is doing and handle it propely. Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas") Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Willy Tarreau <w@1wt.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/balloon_compaction.c: enqueue zero page to balloon devicezhenwei.pi
presently pages in the balloon device have random value, and these pages will be scanned by ksmd on the host. They usually cannot be merged. Enqueue zero pages will resolve this problem. Link: http://lkml.kernel.org/r/1498698637-26389-1-git-send-email-zhenwei.pi@youruncloud.com Signed-off-by: zhenwei.pi <zhenwei.pi@youruncloud.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10cma: fix calculation of aligned offsetDoug Berger
The align_offset parameter is used by bitmap_find_next_zero_area_off() to represent the offset of map's base from the previous alignment boundary; the function ensures that the returned index, plus the align_offset, honors the specified align_mask. The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical address, not CMA region position") has the cma driver calculate the offset to the *next* alignment boundary. In most cases, the base alignment is greater than that specified when making allocations, resulting in a zero offset whether we align up or down. In the example given with the commit, the base alignment (8MB) was half the requested alignment (16MB) so the math also happened to work since the offset is 8MB in both directions. However, when requesting allocations with an alignment greater than twice that of the base, the returned index would not be correctly aligned. Also, the align_order arguments of cma_bitmap_aligned_mask() and cma_bitmap_aligned_offset() should not be negative so the argument type was made unsigned. Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position") Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com Signed-off-by: Angus Clark <angus@angusclark.org> Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Gregory Fong <gregory.0xf0@gmail.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Angus Clark <angus@angusclark.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Shiraz Hashim <shashim@codeaurora.org> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10mm/memory_hotplug.c: remove unused local zone_type from __remove_zone()John Hubbard
__remove_zone() sets up up zone_type, but never uses it for anything. This does not cause a warning, due to the (necessary) use of -Wno-unused-but-set-variable. However, it's noise, so just delete it. Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>