path: root/mm
Age  Commit message  Author
2020-04-02  mm/filemap.c: don't bother dropping mmap_sem for zero size readahead  Jan Kara
When handling a page fault, we drop mmap_sem to start async readahead so that we don't block on IO submission with mmap_sem held. However there's no point to drop mmap_sem in case readahead is disabled. Handle that case to avoid pointless dropping of mmap_sem and retrying the fault. This was actually reported to block mlockall(MCL_CURRENT) indefinitely. Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations") Reported-by: Minchan Kim <minchan@kernel.org> Reported-by: Robert Stupp <snazy@gmx.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
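The decision described above can be sketched in user-space C. The struct, its fields and the function name (`fault_ctx`, `ra_pages`, `start_async_readahead`) are hypothetical stand-ins rather than the kernel's own code; the only point illustrated is that the lock is dropped solely when readahead would actually submit I/O.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state a page fault handler tracks. */
struct fault_ctx {
    unsigned long ra_pages;  /* readahead window; 0 means readahead is disabled */
    bool mmap_sem_held;      /* models whether mmap_sem is still held */
};

/*
 * Returns true if the lock was dropped and the fault must be retried.
 * With a zero-sized readahead window there is no I/O to submit, so the
 * lock is kept and the fault proceeds without a retry.
 */
static bool start_async_readahead(struct fault_ctx *ctx)
{
    if (ctx->ra_pages == 0)
        return false;            /* nothing to do: keep mmap_sem */

    ctx->mmap_sem_held = false;  /* drop the lock before submitting I/O */
    /* ... I/O submission would happen here ... */
    return true;                 /* caller retries the fault */
}
```

With readahead disabled the function returns without touching the lock, which is the case the commit says used to cause a pointless drop and retry.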
2020-04-02  mm/Makefile: disable KCSAN for kmemleak  Qian Cai
Kmemleak can scan task stacks while plain writes happen to those stack variables, which can result in data races. For example, sys_rt_sigaction() and do_sigaction() can perform plain writes of 32 bytes. Since kmemleak does not care about the actual values of non-pointers and all do_sigaction() call sites only copy to stack variables, just disable KCSAN for kmemleak to avoid annotating anything outside kmemleak just because kmemleak scans everything. Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Marco Elver <elver@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: http://lkml.kernel.org/r/1583263716-25150-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02  mm/kmemleak.c: use address-of operator on section symbols  Nathan Chancellor
Clang warns: mm/kmemleak.c:1955:28: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) ^ mm/kmemleak.c:1955:60: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr). Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://github.com/ClangBuiltLinux/linux/issues/895 Link: http://lkml.kernel.org/r/20200220051551.44000-1-natechancellor@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02  revert "topology: add support for node_to_mem_node() to determine the fallback node"  Vlastimil Babka
This reverts commit ad2c8144418c6a81cefe65379fd47bbe8344cef2. The function node_to_mem_node() was introduced by that commit for use in SLUB on systems with memoryless nodes, but it turned out to be unreliable on some architectures/configurations and a simpler solution exists than fixing it up. Thus commit 0715e6c516f1 ("mm, slub: prevent kmalloc_node crashes and memory leaks") removed the only user of node_to_mem_node() and we can revert the commit that introduced the function. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Christopher Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com> Cc: Sachin Sant <sachinp@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20200320115533.9604-2-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02  slub: relocate freelist pointer to middle of object  Kees Cook
In a recent discussion[1] with Vitaly Nikolenko and Silvio Cesare, it became clear that moving the freelist pointer away from the edge of allocations would likely improve the overall defensive posture of the inline freelist pointer. My benchmarks show no meaningful change to performance (they seem to show it being faster), so this looks like a reasonable change to make. Instead of having the freelist pointer at the very beginning of an allocation (offset 0) or at the very end of an allocation (effectively offset -sizeof(void *) from the next allocation), move it away from the edges of the allocation and into the middle. This provides some protection against small-sized neighboring overflows (or underflows), for which the freelist pointer is commonly the target. (Large or well controlled overwrites are much more likely to attack live object contents, instead of attempting freelist corruption.) The vaunted kernel build benchmark, across 5 runs. Before: Mean: 250.05 Std Dev: 1.85 and after, which appears mysteriously faster: Mean: 247.13 Std Dev: 0.76 Attempts at running "sysbench --test=memory" show the change to be well in the noise (sysbench seems to be pretty unstable here -- it's not really measuring allocation). 
Hackbench is more allocation-heavy, and while the std dev is above the difference, it looks like it may manifest as an improvement as well: 20 runs of "hackbench -g 20 -l 1000", before: Mean: 36.322 Std Dev: 0.577 and after: Mean: 36.056 Std Dev: 0.598 [1] https://twitter.com/vnik5287/status/1235113523098685440 Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Vitaly Nikolenko <vnik@duasynt.com> Cc: Silvio Cesare <silvio.cesare@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/202003051624.AAAC9AECC@keescook Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
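A minimal sketch of what "into the middle" could mean in practice, assuming the offset is taken as half the object size rounded up to pointer alignment. The helper names `align_up` and `freelist_offset` are hypothetical, and the sketch assumes objects at least two pointers wide; it is not the patch itself.

```c
#include <stddef.h>

/* Round x up to the next multiple of a (a must be a power of two). */
static size_t align_up(size_t x, size_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical helper: pick a pointer-aligned offset near the middle of an
 * object, so that neither a small overflow from the previous object nor a
 * small underflow from the next one lands on the freelist pointer.
 * Assumes object_size >= 2 * ptr_size.
 */
static size_t freelist_offset(size_t object_size, size_t ptr_size)
{
    return align_up(object_size / 2, ptr_size);
}
```

For a 32-byte object with 8-byte pointers this places the pointer at offset 16, leaving 16 bytes of object data on one side and 8 on the other between the pointer and each neighbouring allocation.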
2020-04-02  slub: improve bit diffusion for freelist ptr obfuscation  Kees Cook
Under CONFIG_SLAB_FREELIST_HARDENED=y, the obfuscation was relatively weak in that the ptr and ptr address were usually so close that the first XOR would result in an almost entirely 0-byte value[1], leaving most of the "secret" number ultimately being stored after the third XOR. A single blind memory content exposure of the freelist was generally sufficient to learn the secret. Add a swab() call to mix bits a little more. This is a cheap way (1 cycle) to make attacks need more than a single exposure to learn the secret (or to know _where_ the exposure is in memory). kmalloc-32 freelist walk, before: ptr ptr_addr stored value secret ffff90c22e019020@ffff90c22e019000 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019040@ffff90c22e019020 is 86528eb656b3b5fd (86528eb656b3b59d) ffff90c22e019060@ffff90c22e019040 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019080@ffff90c22e019060 is 86528eb656b3b57d (86528eb656b3b59d) ffff90c22e0190a0@ffff90c22e019080 is 86528eb656b3b5bd (86528eb656b3b59d) ... 
after: ptr ptr_addr stored value secret ffff9eed6e019020@ffff9eed6e019000 is 793d1135d52cda42 (86528eb656b3b59d) ffff9eed6e019040@ffff9eed6e019020 is 593d1135d52cda22 (86528eb656b3b59d) ffff9eed6e019060@ffff9eed6e019040 is 393d1135d52cda02 (86528eb656b3b59d) ffff9eed6e019080@ffff9eed6e019060 is 193d1135d52cdae2 (86528eb656b3b59d) ffff9eed6e0190a0@ffff9eed6e019080 is f93d1135d52cdac2 (86528eb656b3b59d) [1] https://blog.infosectcbr.com.au/2020/03/weaknesses-in-linux-kernel-heap.html Fixes: 2482ddec670f ("mm: add SLUB free list pointer obfuscation") Reported-by: Silvio Cesare <silvio.cesare@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/202003051623.AF4F8CB@keescook Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
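The two freelist walks above can be reproduced with a few lines of user-space C. This is a sketch under the assumption that the stored value is `ptr ^ secret ^ ptr_addr` before the change and `ptr ^ secret ^ swab(ptr_addr)` after it; the function names are hypothetical, and `swab64` here is a portable stand-in for the kernel's swab().

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value (stand-in for swab()). */
static uint64_t swab64(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (x & 0xff);
        x >>= 8;
    }
    return r;
}

/* Before: ptr and ptr_addr are nearly equal, so their XOR is almost zero. */
static uint64_t obfuscate_old(uint64_t ptr, uint64_t ptr_addr, uint64_t secret)
{
    return ptr ^ secret ^ ptr_addr;
}

/* After: byte-swapping the address spreads its high bits over the low ones. */
static uint64_t obfuscate_new(uint64_t ptr, uint64_t ptr_addr, uint64_t secret)
{
    return ptr ^ secret ^ swab64(ptr_addr);
}
```

Feeding the first row of each table into these functions yields the stored values listed there: the old form differs from the secret only in its lowest byte, while the new form shares no byte with it.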
2020-04-02  mm/slub.c: replace kmem_cache->cpu_partial with wrapped APIs  chenqiwu
There are slub_cpu_partial() and slub_set_cpu_partial() APIs to wrap kmem_cache->cpu_partial. This patch will use the two APIs to replace kmem_cache->cpu_partial in slub code. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/1582079562-17980-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
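As an illustration of the accessor pattern, here is a simplified sketch. The struct is reduced to the single wrapped field and the macro bodies are assumptions about how such wrappers are typically written, not a copy of the kernel header.

```c
/* Simplified stand-in for struct kmem_cache; only the wrapped field is shown. */
struct kmem_cache {
    unsigned int cpu_partial;
};

/* Accessors in the style the commit describes; the bodies are an assumption. */
#define slub_cpu_partial(s)        ((s)->cpu_partial)
#define slub_set_cpu_partial(s, n) ((s)->cpu_partial = (n))
```

Routing every access through the wrappers means a build that compiles the field out only has to redefine the two macros, rather than guard each call site.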
2020-04-02  mm/slub.c: replace cpu_slab->partial with wrapped APIs  chenqiwu
There are slub_percpu_partial() and slub_set_percpu_partial() APIs to wrap cpu_slab->partial. This patch uses the two to replace cpu_slab->partial in slub code. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/1581951895-3038-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-01  Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma  Linus Torvalds
Pull hmm updates from Jason Gunthorpe: "This series focuses on corner case bug fixes and general clarity improvements to hmm_range_fault(). It arose from a review of hmm_range_fault() by Christoph, Ralph and myself. hmm_range_fault() is being used by these 'SVM' style drivers to non-destructively read the page tables. It is very similar to get_user_pages() except that the output is an array of PFNs and per-pfn flags, and it has various modes of reading. This is necessary before RDMA ODP can be converted, as we don't want to have weird corner case regressions, which is still a looking forward item. Ralph has a nice tester for this routine, but it is waiting for feedback from the selftests maintainers. Summary: - 9 bug fixes - Allow pgmap to track the 'owner' of a DEVICE_PRIVATE - in this case the owner tells the driver if it can understand the DEVICE_PRIVATE page or not. Use this to resolve a bug in nouveau where it could touch DEVICE_PRIVATE pages from other drivers. 
- Remove a bunch of dead, redundant or unused code and flags - Clarity improvements to hmm_range_fault()" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (25 commits) mm/hmm: return error for non-vma snapshots mm/hmm: do not set pfns when returning an error code mm/hmm: do not unconditionally set pfns when returning EBUSY mm/hmm: use device_private_entry_to_pfn() mm/hmm: remove HMM_FAULT_SNAPSHOT mm/hmm: remove unused code and tidy comments mm/hmm: return the fault type from hmm_pte_need_fault() mm/hmm: remove pgmap checking for devmap pages mm/hmm: check the device private page owner in hmm_range_fault() mm: simplify device private page handling in hmm_range_fault mm: handle multiple owners of device private pages in migrate_vma memremap: add an owner field to struct dev_pagemap mm: merge hmm_vma_do_fault into into hmm_vma_walk_hole_ mm/hmm: don't handle the non-fault case in hmm_vma_walk_hole_() mm/hmm: simplify hmm_vma_walk_hugetlb_entry() mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flag mm/hmm: don't provide a stub for hmm_range_fault() mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd() mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages ...
2020-04-01  blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it  Tejun Heo
blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs don't get offlined while there are active cgwbs on them. However, it ends up making offlining unordered, sometimes causing parents to be offlined before children. To fix it, we want child blkcgs to pin the parents' online states, turning the refcnt into a more generic online pinning mechanism. In preparation, * blkcg->cgwb_refcnt -> blkcg->online_pin * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online() * Take them out of CONFIG_CGROUP_WRITEBACK Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-01  percpu: update copyright emails to dennis@kernel.org  Dennis Zhou
Currently there are 3 emails tied to me in the kernel tree, I'd rather dennis@kernel.org be the only one. Signed-off-by: Dennis Zhou <dennis@kernel.org>
2020-03-31  Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux  Linus Torvalds
Pull arm64 updates from Catalin Marinas: "The bulk is in-kernel pointer authentication, activity monitors and lots of asm symbol annotations. I also queued the sys_mremap() patch commenting the asymmetry in the address untagging. Summary: - In-kernel Pointer Authentication support (previously only offered to user space). - ARM Activity Monitors (AMU) extension support allowing better CPU utilisation numbers for the scheduler (frequency invariance). - Memory hot-remove support for arm64. - Lots of asm annotations (SYM_*) in preparation for the in-kernel Branch Target Identification (BTI) support. - arm64 perf updates: ARMv8.5-PMU 64-bit counters, refactoring the PMU init callbacks, support for new DT compatibles. - IPv6 header checksum optimisation. - Fixes: SDEI (software delegated exception interface) double-lock on hibernate with shared events. - Minor clean-ups and refactoring: cpu_ops accessor, cpu_do_switch_mm() converted to C, cpufeature finalisation helper. 
- sys_mremap() comment explaining the asymmetric address untagging behaviour" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (81 commits) mm/mremap: Add comment explaining the untagging behaviour of mremap() arm64: head: Convert install_el2_stub to SYM_INNER_LABEL arm64: Introduce get_cpu_ops() helper function arm64: Rename cpu_read_ops() to init_cpu_ops() arm64: Declare ACPI parking protocol CPU operation if needed arm64: move kimage_vaddr to .rodata arm64: use mov_q instead of literal ldr arm64: Kconfig: verify binutils support for ARM64_PTR_AUTH lkdtm: arm64: test kernel pointer authentication arm64: compile the kernel with ptrauth return address signing kconfig: Add support for 'as-option' arm64: suspend: restore the kernel ptrauth keys arm64: __show_regs: strip PAC from lr in printk arm64: unwind: strip PAC from kernel addresses arm64: mask PAC bits of __builtin_return_address arm64: initialize ptrauth keys for kernel booting task arm64: initialize and switch ptrauth kernel keys arm64: enable ptrauth earlier arm64: cpufeature: handle conflicts based on capability arm64: cpufeature: Move cpu capability helpers inside C file ...
2020-03-30  mm/hmm: return error for non-vma snapshots  Jason Gunthorpe
The pagewalker does not call most ops with NULL vma, those are all routed to hmm_vma_walk_hole() via ops->pte_hole instead. Thus hmm_vma_fault() is only called with a NULL vma from hmm_vma_walk_hole(), so hoist the NULL vma check to there. Now it is clear that snapshotting with no vma is a HMM_PFN_ERROR as without a vma we have no path to call hmm_vma_fault(). Link: https://lore.kernel.org/r/20200327200021.29372-10-jgg@ziepe.ca Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-30  mm/hmm: do not set pfns when returning an error code  Jason Gunthorpe
Most places that return an error code, like -EFAULT, do not set HMM_PFN_ERROR; only two places do this. Resolve this inconsistency by never setting the pfns on an error exit. This doesn't seem like a worthwhile thing to do anyhow. If for some reason it becomes important, it makes more sense to directly return the address of the failing page rather than have the caller scan for the HMM_PFN_ERROR. No caller inspects the pfns output array if hmm_range_fault() fails. Link: https://lore.kernel.org/r/20200327200021.29372-9-jgg@ziepe.ca Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-30  mm/hmm: do not unconditionally set pfns when returning EBUSY  Jason Gunthorpe
In hmm_vma_handle_pte() and hmm_vma_walk_hugetlb_entry() if fault happens then -EBUSY will be returned and the pfns input flags will have been destroyed. For hmm_vma_handle_pte() set HMM_PFN_NONE only on the success returns that don't otherwise store to pfns. For hmm_vma_walk_hugetlb_entry() all exit paths already set pfns, so remove the redundant store. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Link: https://lore.kernel.org/r/20200327200021.29372-8-jgg@ziepe.ca Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-30  mm/hmm: use device_private_entry_to_pfn()  Jason Gunthorpe
swp_offset() should not be called directly, the wrappers are supposed to abstract away the encoding of the device_private specific information in the swap entry. Link: https://lore.kernel.org/r/20200327200021.29372-7-jgg@ziepe.ca Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-29  mm/sparse: fix kernel crash with pfn_section_valid check  Aneesh Kumar K.V
Fix a crash like this: BUG: Kernel NULL pointer dereference on read at 0x00000000 Faulting instruction address: 0xc000000000c3447c Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1 ... NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0 LR [c000000000088354] vmemmap_free+0x144/0x320 Call Trace: section_deactivate+0x220/0x240 __remove_pages+0x118/0x170 arch_remove_memory+0x3c/0x150 memunmap_pages+0x1cc/0x2f0 devm_action_release+0x30/0x50 release_nodes+0x2f8/0x3e0 device_release_driver_internal+0x168/0x270 unbind_store+0x130/0x170 drv_attr_store+0x44/0x60 sysfs_kf_write+0x68/0x80 kernfs_fop_write+0x100/0x290 __vfs_write+0x3c/0x70 vfs_write+0xcc/0x240 ksys_write+0x7c/0x140 system_call+0x5c/0x68 The crash is due to a NULL dereference at test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL in pfn_section_valid(). With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after depopulate_section_memmap(). This was done so that pfn_to_page() can work correctly with a kernel config that disables SPARSEMEM_VMEMMAP. With that config pfn_to_page does __section_mem_map_addr(__sec) + __pfn; where static inline struct page *__section_mem_map_addr(struct mem_section *section) { unsigned long map = section->section_mem_map; map &= SECTION_MAP_MASK; return (struct page *)map; } Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to check the pfn validity (pfn_valid()). Since section_deactivate() releases mem_section->usage if a section is fully deactivated, a pfn_valid() check after a subsection deactivate causes a kernel crash. static inline int pfn_valid(unsigned long pfn) { ... 
return early_section(ms) || pfn_section_valid(ms, pfn); } where static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) { int idx = subsection_map_index(pfn); return test_bit(idx, ms->usage->subsection_map); } Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed. For architectures like ppc64 where large pages are used for vmemmap mapping (16MB), a specific vmemmap mapping can cover multiple sections. Hence before a vmemmap mapping page can be freed, the kernel needs to make sure there are no valid sections within that mapping. Clearing the section valid bit before depopulate_section_memmap enables this. [aneesh.kumar@linux.ibm.com: add comment] Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
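The ordering argument in this message can be modelled in a few lines of user-space C. Everything below is a simplified sketch: the structures are cut down to two fields, the flag value is chosen for the example, and the `_sketch` functions are hypothetical. It only shows why clearing the flag before the usage pointer goes away keeps the validity check from dereferencing NULL.

```c
#include <stdbool.h>
#include <stddef.h>

#define SECTION_HAS_MEM_MAP 0x2UL  /* flag value chosen for this sketch */

struct mem_section_usage {
    unsigned long subsection_map;  /* one bit per subsection */
};

struct mem_section {
    unsigned long section_mem_map;    /* flag bits live in the low bits */
    struct mem_section_usage *usage;  /* gone once the section is deactivated */
};

static bool valid_section(const struct mem_section *ms)
{
    return ms && (ms->section_mem_map & SECTION_HAS_MEM_MAP);
}

static bool pfn_section_valid(const struct mem_section *ms, unsigned int idx)
{
    return (ms->usage->subsection_map >> idx) & 1UL;
}

/* Model of the lookup: the flag test guards the usage dereference. */
static bool pfn_valid_sketch(const struct mem_section *ms, unsigned int idx)
{
    if (!valid_section(ms))
        return false;
    return pfn_section_valid(ms, idx);
}

/* Model of full deactivation: clear the flag before dropping usage. */
static void section_deactivate_sketch(struct mem_section *ms)
{
    ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
    ms->usage = NULL;
}
```

After `section_deactivate_sketch()` runs, `pfn_valid_sketch()` returns false at the flag test and never reaches `ms->usage`, which is the behaviour the fix aims for.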
2020-03-29  mm: fork: fix kernel_stack memcg stats for various stack implementations  Roman Gushchin
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the space for task stacks can be allocated using __vmalloc_node_range(), alloc_pages_node() and kmem_cache_alloc_node(). In the first and the second cases page->mem_cgroup pointer is set, but in the third it's not: memcg membership of a slab page should be determined using the memcg_from_slab_page() function, which looks at page->slab_cache->memcg_params.memcg . In this case, using mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: page->mem_cgroup pointer is NULL even for pages charged to a non-root memory cgroup. It can lead to kernel_stack per-memcg counters permanently showing 0 on some architectures (depending on the configuration). In order to fix it, let's introduce a mod_memcg_obj_state() helper, which takes a pointer to a kernel object as a first argument, uses mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls mod_memcg_state(). It allows to handle all possible configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without spilling any memcg/kmem specifics into fork.c . Note: This is a special version of the patch created for stable backports. 
It contains code from the following two patches: - mm: memcg/slab: introduce mem_cgroup_from_obj() - mm: fork: fix kernel_stack memcg stats for various stack implementations [guro@fb.com: introduce mem_cgroup_from_obj()] Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29  hugetlb_cgroup: fix illegal access to memory  Mina Almasry
This appears to be a mistake in commit faced7e0806cf ("mm: hugetlb controller for cgroups v2"). Essentially that commit does a hugetlb_cgroup_from_counter assuming that page_counter_try_charge has initialized counter. But if that has failed, then it seems it will not initialize counter, so hugetlb_cgroup_from_counter(counter) ends up pointing to random memory, causing kasan to complain. The solution is to simply use 'h_cg', instead of hugetlb_cgroup_from_counter(counter), since that is a reference to the hugetlb_cgroup anyway. After this change kasan ceases to complain. Fixes: faced7e0806cf ("mm: hugetlb controller for cgroups v2") Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Giuseppe Scrivano <gscrivan@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29  mm/swapfile.c: move inode_lock out of claim_swapfile  Naohiro Aota
claim_swapfile() currently keeps the inode locked when it is successful, or the file is already swapfile (with -EBUSY). And, on the other error cases, it does not lock the inode. This inconsistency of the lock state and return value is quite confusing and actually causing a bad unlock balance as below in the "bad_swap" section of __do_sys_swapon(). This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE check out of claim_swapfile(). The inode is unlocked in "bad_swap_unlock_inode" section, so that the inode is ensured to be unlocked at "bad_swap". Thus, error handling codes after the locking now jumps to "bad_swap_unlock_inode" instead of "bad_swap". ===================================== WARNING: bad unlock balance detected! 5.5.0-rc7+ #176 Not tainted ------------------------------------- swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550 but there are no more locks to release! other info that might help us debug this: no locks held by swapon/4294. stack backtrace: CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176 Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014 Call Trace: dump_stack+0xa1/0xea print_unlock_imbalance_bug.cold+0x114/0x123 lock_release+0x562/0xed0 up_write+0x2d/0x490 __do_sys_swapon+0x94b/0x3550 __x64_sys_swapon+0x54/0x80 do_syscall_64+0xa4/0x4b0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f15da0a0dc7 Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Qais Youef <qais.yousef@arm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
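The label layout described in this message can be sketched as follows. This is a user-space model, not the kernel function: the lock is a plain counter, the `_sketch` names and the boolean parameters are hypothetical, and the error codes are picked only to tell the paths apart.

```c
#include <errno.h>
#include <stdbool.h>

static int lock_depth;  /* models the inode lock: 1 when held, 0 when free */

static void inode_lock_sketch(void)   { lock_depth++; }
static void inode_unlock_sketch(void) { lock_depth--; }

/* The claim step no longer touches the lock; it only reports an error code. */
static int claim_swapfile_sketch(bool ok)
{
    return ok ? 0 : -EINVAL;
}

static int swapon_sketch(bool claim_ok, bool already_swapfile, bool later_ok)
{
    int error = claim_swapfile_sketch(claim_ok);

    if (error)
        goto bad_swap;              /* lock was never taken */

    inode_lock_sketch();
    if (already_swapfile) {
        error = -EBUSY;
        goto bad_swap_unlock_inode; /* lock is held: must unlock */
    }
    if (!later_ok) {
        error = -ENOMEM;
        goto bad_swap_unlock_inode;
    }

    inode_unlock_sketch();
    return 0;

bad_swap_unlock_inode:
    inode_unlock_sketch();
bad_swap:
    return error;
}
```

Because the lock is taken in exactly one place and every later failure jumps to the unlocking label, each path leaves the function with the lock released exactly once.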
2020-03-27  mm/hmm: remove HMM_FAULT_SNAPSHOT  Jason Gunthorpe
Now that flags are handled on a fine-grained per-page basis this global flag is redundant and has a confusing overlap with the pfn_flags_mask and default_flags. Normalize the HMM_FAULT_SNAPSHOT behavior into one place. Callers needing the SNAPSHOT behavior should set a pfn_flags_mask and default_flags that always results in a cleared HMM_PFN_VALID. Then no pages will be faulted, and HMM_FAULT_SNAPSHOT is not a special flow that overrides the masking mechanism. As this is the last flag, also remove the flags argument. If future flags are needed they can be part of the struct hmm_range function arguments. Link: https://lore.kernel.org/r/20200327200021.29372-5-jgg@ziepe.ca Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-27  mm/hmm: remove unused code and tidy comments  Jason Gunthorpe
Delete several functions that are never called, fix some desync between comments and structure content, toss the now out of date top of file header, and move one function only used by hmm.c into hmm.c Link: https://lore.kernel.org/r/20200327200021.29372-4-jgg@ziepe.ca Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-27  mm/hmm: return the fault type from hmm_pte_need_fault()  Jason Gunthorpe
Using two bools instead of flags return is not necessary and leads to bugs. Returning a value is easier for the compiler to check and easier to pass around the code flow. Convert the two bools into flags and push the change to all callers. Link: https://lore.kernel.org/r/20200327200021.29372-3-jgg@ziepe.ca Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
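A small sketch of the "flags instead of two out-parameters" shape, assuming a simplified decision rule. The enum names, the function name and its boolean inputs are hypothetical; the commit text does not spell out the real identifiers.

```c
#include <stdbool.h>

/* Hypothetical flag names for the two kinds of fault a caller may need. */
enum need_fault {
    NEED_FAULT       = 1 << 0,
    NEED_WRITE_FAULT = 1 << 1,
};

/*
 * Returns a bitmask instead of filling two bool out-parameters.
 * want_valid / want_write describe what the caller requested;
 * present / writable describe what the page table entry offers.
 */
static unsigned int pte_need_fault(bool want_valid, bool want_write,
                                   bool present, bool writable)
{
    if (!want_valid)
        return 0;                               /* caller asked for nothing */
    if (want_write && !writable)
        return NEED_FAULT | NEED_WRITE_FAULT;   /* write access is missing */
    if (!present)
        return NEED_FAULT;                      /* read fault is enough */
    return 0;
}
```

A single return value can be tested, OR-ed together across a range of entries and passed down the call chain, and it cannot be left half-initialised the way a pair of out-parameters can.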
2020-03-27  mm/hmm: remove pgmap checking for devmap pages  Jason Gunthorpe
The checking boils down to some racy check if the pagemap is still available or not. Instead of checking this, rely entirely on the notifiers, if a pagemap is destroyed then all pages that belong to it must be removed from the tables and the notifiers triggered. Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.ca Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26  mm/hmm: check the device private page owner in hmm_range_fault()  Christoph Hellwig
hmm_range_fault() will succeed for any kind of device private memory, even if it doesn't belong to the calling entity. While nouveau has some crude checks for that, they are broken because they assume nouveau is the only user of device private memory. Fix this by passing in an expected pgmap owner in the hmm_range_fault structure. If a device_private page is found and doesn't match the owner then it is treated as a non-present and non-faultable page. This prevents a bug in amdgpu, where it doesn't know how to handle device_private pages, but hmm_range_fault would return them anyhow. Fixes: 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE") Link: https://lore.kernel.org/r/20200316193216.920734-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
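The owner comparison can be sketched with two cut-down structures. The `_sketch` types and `page_visible_to()` are hypothetical; the sketch only illustrates comparing an opaque owner pointer and treating a mismatch as "not present".

```c
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins; only the fields needed for the check are modelled. */
struct dev_pagemap_sketch {
    void *owner;                 /* opaque cookie set by the owning driver */
};

struct page_sketch {
    bool device_private;
    struct dev_pagemap_sketch *pgmap;
};

/* True if the walker may report this page to a caller with this owner. */
static bool page_visible_to(const struct page_sketch *page,
                            const void *expected_owner)
{
    if (!page->device_private)
        return true;             /* ordinary pages are visible to everyone */
    return page->pgmap->owner == expected_owner;
}
```

Since the owner is an opaque pointer, a driver can use the address of any object it controls as its cookie; a driver that passes no owner never matches a device private page.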
2020-03-26  mm: simplify device private page handling in hmm_range_fault  Christoph Hellwig
Remove the HMM_PFN_DEVICE_PRIVATE flag, no driver has ever set this flag on input, and the only place that uses it on output can be trivially changed to use is_device_private_page(). This removes the ability to request that device_private pages are faulted back into system memory. Link: https://lore.kernel.org/r/20200316193216.920734-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26  mm: handle multiple owners of device private pages in migrate_vma  Christoph Hellwig
Add a new src_owner field to struct migrate_vma. If the field is set, only device private pages with page->pgmap->owner equal to that field are migrated. If the field is not set only "normal" pages are migrated. Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU") Link: https://lore.kernel.org/r/20200316193216.920734-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26  memremap: add an owner field to struct dev_pagemap  Christoph Hellwig
Add a new opaque owner field to struct dev_pagemap, which will allow the hmm and migrate_vma code to identify who owns ZONE_DEVICE memory, and refuse to work on mappings not owned by the calling entity. Link: https://lore.kernel.org/r/20200316193216.920734-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm: merge hmm_vma_do_fault into hmm_vma_walk_hole_Christoph Hellwig
There is no good reason for this split, as it just obfuscates the flow. Link: https://lore.kernel.org/r/20200316135310.899364-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: don't handle the non-fault case in hmm_vma_walk_hole_()Christoph Hellwig
Setting a pfns entry to NONE before returning -EBUSY is a bug that will cause corruption of the input flags on the next loop. There is just a single caller using hmm_vma_walk_hole_() for the non-fault case. Use hmm_pfns_fill() to fill the whole pfn array with zeroes in the only caller for the non-fault case and remove the non-fault path from hmm_vma_walk_hole_(). This avoids setting NONE before returning -EBUSY. Also rename the function to hmm_vma_fault() to better describe what it does. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Link: https://lore.kernel.org/r/20200316135310.899364-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: simplify hmm_vma_walk_hugetlb_entry()Christoph Hellwig
Remove the rather confusing goto label and just handle the fault case directly in the branch checking for it. Link: https://lore.kernel.org/r/20200316135310.899364-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flagChristoph Hellwig
The HMM_FAULT_ALLOW_RETRY isn't used anywhere in the tree. Remove it and the weird -EAGAIN handling where handle_mm_fault() drops the mmap_sem. Link: https://lore.kernel.org/r/20200316135310.899364-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()Jason Gunthorpe
pmd_to_hmm_pfn_flags() already checks it and makes the cpu flags 0. If no fault is requested then the pfns should be returned with the not valid flags. It should not unconditionally fault if faulting is not requested. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handlingJason Gunthorpe
Currently if a special PTE is encountered hmm_range_fault() immediately returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing uses). EFAULT should only be returned after testing with hmm_pte_need_fault(). Also pte_devmap() and pte_special() are exclusive, and there is no need to check IS_ENABLED, pte_special() is stubbed out to return false on unsupported architectures. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pagesJason Gunthorpe
hmm_range_fault() should never return 0 if the caller requested a valid page, but the pfns output for that page would be HMM_PFN_ERROR. hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to detect if the page is in faulting mode or not. Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated code. Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Fixes: da4c3c735ea4 ("mm/hmm/mirror: helper to snapshot CPU page table") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: reorganize how !pte_present is handled in hmm_vma_handle_pte()Jason Gunthorpe
The intention with this code is to determine if the caller required the pages to be valid, and if so, then take some action to make them valid. The action varies depending on the page type. In all cases, if the caller doesn't ask for the page, then hmm_range_fault() should not return an error. Revise the implementation to be clearer, and fix some bugs:
- hmm_pte_need_fault() must always be called before testing fault or write_fault otherwise the defaults of false apply and the if()'s don't work. This was missed on the is_migration_entry() branch
- -EFAULT should not be returned unless hmm_pte_need_fault() indicates fault is required - ie snapshotting should not fail.
- For !pte_present() the cpu_flags are always 0, except in the special case of is_device_private_entry(), calling pte_to_hmm_pfn_flags() is confusing.
Reorganize the flow so that it always follows the pattern of calling hmm_pte_need_fault() and then checking fault || write_fault. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
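The pattern named above (first compute fault and write_fault, then test them before returning an error) might look like this in a simplified user-space model. All identifiers here (`MODEL_VALID`, `MODEL_WRITE`, `MODEL_ERROR_ENTRY`, `model_need_fault()`, `model_handle_unfixable()`) are hypothetical and the flag encoding is an assumption. This is a sketch of the control flow, not the kernel's hmm_pte_need_fault().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag bits shared by the request and the CPU page table view. */
#define MODEL_VALID       0x1u
#define MODEL_WRITE       0x2u
#define MODEL_ERROR_ENTRY 0x4u   /* output marker for an unusable entry */

/* Decide whether the caller's request requires faulting the page in. */
static void model_need_fault(unsigned int req, unsigned int cpu,
                             bool *fault, bool *write_fault)
{
    *fault = false;
    *write_fault = false;
    if (!(req & MODEL_VALID))
        return;                          /* caller did not ask for this page */
    if (!(cpu & MODEL_VALID))
        *fault = true;                   /* requested but not present */
    if ((req & MODEL_WRITE) && !(cpu & MODEL_WRITE))
        *write_fault = true;             /* requested writable, mapped read-only */
}

/* Non-present entry that cannot be made valid: error only if it was requested. */
static int model_handle_unfixable(unsigned int req, unsigned int *out)
{
    bool fault, write_fault;

    model_need_fault(req, 0, &fault, &write_fault);  /* cpu flags are 0 here */
    if (fault || write_fault)
        return -EFAULT;
    *out = MODEL_ERROR_ENTRY;            /* pure snapshot: record and succeed */
    return 0;
}
```

Because the decision helper always runs before the test, a pure snapshot request (no MODEL_VALID bit) cannot fail in this model, matching the second point in the list.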
2020-03-26mm/hmm: add missing call to hmm_range_need_fault() before returning EFAULTJason Gunthorpe
All return paths that do EFAULT must call hmm_range_need_fault() to determine if the user requires this page to be valid. If the page cannot be made valid if the user later requires it, due to vma flags in this case, then the return should be HMM_PFN_ERROR. Fixes: a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: add missing pfns set to hmm_vma_walk_pmd()Jason Gunthorpe
All success exit paths from the walker functions must set the pfns array. A migration entry with no required fault is a HMM_PFN_NONE return, just like the pte case. Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: do not call hmm_vma_walk_hole() while holding a spinlockJason Gunthorpe
This eventually calls into handle_mm_fault() which is a sleeping function. Release the lock first. hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not need the lock. Fixes: 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()") Cc: Steven Price <steven.price@arm.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: add missing unmaps of the ptep during hmm_vma_handle_pte()Jason Gunthorpe
Many of the direct returns of error skipped doing the pte_unmap(). All non zero exit paths must unmap the pte. The pte_unmap() is split unnaturally like this because some of the error exit paths trigger a sleep and must release the lock before sleeping. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Fixes: 53f5c3f489ec ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/mremap: Add comment explaining the untagging behaviour of mremap()Will Deacon
Commit dcde237319e6 ("mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()") changed mremap() so that only the 'old' address is untagged, leaving the 'new' address in the form it was passed from userspace. This prevents the unexpected creation of aliasing virtual mappings in userspace, but looks a bit odd when you read the code. Add a comment justifying the untagging behaviour in mremap(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-25mm: docs: Fix a comment in process_vm_rw_coreBernd Edlinger
This removes a duplicate "a" in the comment in process_vm_rw_core. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-24mm: Add vmf_insert_pfn_xxx_prot() for huge page-table entriesThomas Hellstrom (VMware)
For graphics drivers needing to modify the page-protection, add huge page-table entries counterparts to vmf_insert_pfn_prot(). Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-03-24mm: Split huge pages on write-notify or COWThomas Hellstrom (VMware)
The functions wp_huge_pmd() and wp_huge_pud() currently rely on the huge_fault() callback to split huge page table entries if needed. However for module users that requires export of the split_huge_xxx() functionality which may be undesired. Instead split pre-existing huge page-table entries on VM_FAULT_FALLBACK return. We currently only do COW and write-notify on the PTE level, so if the huge_fault() handler returns VM_FAULT_FALLBACK on wp faults, split the huge pages and page-table entries. Also do this for huge PUDs if there is no huge_fault() handler and the vma is not anonymous, similar to how it's done for PMDs. Note that fs/dax.c still does the splitting in the huge_fault() handler, but as huge_fault() A follow-up patch can remove the dax.c split_huge_pmd() if needed. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-03-24mm: Introduce vma_is_special_hugeThomas Hellstrom (VMware)
For VM_PFNMAP and VM_MIXEDMAP vmas that want to support transhuge pages and -page table entries, introduce vma_is_special_huge() that takes the same codepaths as vma_is_dax(). The use of "special" follows the definition in memory.c, vm_normal_page(): "Special" mappings do not wish to be associated with a "struct page" (either it doesn't exist, or it exists but they don't want to touch it) For PAGE_SIZE pages, "special" is determined per page table entry to be able to deal with COW pages. But since we don't have huge COW pages, we can classify a vma as either "special huge" or "normal huge". Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-03-21x86/mm: split vmalloc_sync_all()Joerg Roedel
Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm, slub: prevent kmalloc_node crashes and memory leaksVlastimil Babka
Sachin reports [1] a crash in SLUB __slab_alloc(): BUG: Kernel NULL pointer dereference on read at 0x000073b0 Faulting instruction address: 0xc0000000003d55f4 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1 NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000 REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000 CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1 GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500 GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620 GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000 GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000 GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002 GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122 GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8 GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180 NIP ___slab_alloc+0x1f4/0x760 LR __slab_alloc+0x34/0x60 Call Trace: ___slab_alloc+0x334/0x760 (unreliable) __slab_alloc+0x34/0x60 __kmalloc_node+0x110/0x490 kvmalloc_node+0x58/0x110 mem_cgroup_css_online+0x108/0x270 online_css+0x48/0xd0 cgroup_apply_control_enable+0x2ec/0x4d0 cgroup_mkdir+0x228/0x5f0 kernfs_iop_mkdir+0x90/0xf0 vfs_mkdir+0x110/0x230 do_mkdirat+0xb0/0x1a0 system_call+0x5c/0x68 This is a PowerPC platform with following NUMA topology: available: 2 nodes (0-1) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 1 size: 35247 MB node 1 free: 30907 MB node distances: node 0 1 0: 10 40 1: 40 10 possible numa nodes: 0-31 This only 
happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node" [2] which effectively calls kmalloc_node for each possible node. SLUB however only allocates kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid node for other nodes since commit a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node"). This is however not true in this configuration where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up accessing non-allocated kmem_cache_node. A related issue was reported by Bharata (originally by Ramachandran) [3] where a similar PowerPC configuration, but with mainline kernel without patch [2] ends up allocating large amounts of pages by kmalloc-1k kmalloc-512. This seems to have the same underlying issue with node_to_mem_node() not behaving as expected, and might probably also lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4]. This patch should fix both issues by not relying on node_to_mem_node() anymore and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is attempted for a node that's not online, or has no usable memory. The "usable memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly the condition that SLUB uses to allocate kmem_cache_node structures. The check in get_partial() is removed completely, as the checks in ___slab_alloc() are now sufficient to prevent get_partial() being reached with an invalid node. 
[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/ [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/ [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/ Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christopher Lameter <cl@linux.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
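The fallback rule this commit describes can be sketched as a node-selection helper. This is a user-space illustration under assumptions: `struct model_topology`, `MODEL_NO_NODE` (playing the role of NUMA_NO_NODE), `MODEL_MAX_NODES`, and `model_pick_node()` are hypothetical names, and the two boolean arrays merely stand in for the kernel's online and N_NORMAL_MEMORY node states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_NODE   (-1)   /* plays the role of NUMA_NO_NODE */
#define MODEL_MAX_NODES 32

/* Hypothetical snapshot of per-node state. */
struct model_topology {
    bool online[MODEL_MAX_NODES];
    bool has_normal_memory[MODEL_MAX_NODES];  /* models N_NORMAL_MEMORY */
};

/*
 * Return the node to allocate from. A node that is out of range, offline,
 * or without normal memory is replaced by MODEL_NO_NODE, meaning "any node".
 */
static int model_pick_node(const struct model_topology *topo, int node)
{
    assert(topo != NULL);
    if (node < 0 || node >= MODEL_MAX_NODES)
        return MODEL_NO_NODE;
    if (!topo->online[node] || !topo->has_normal_memory[node])
        return MODEL_NO_NODE;
    return node;
}
```

With the topology from the report (node 0 online but memoryless, node 1 holding all memory), a request for node 0 falls back to "any node" instead of reaching a per-node structure that was never allocated.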
2020-03-21mm/mmu_notifier: silence PROVE_RCU_LIST warningsQian Cai
It is safe to traverse mm->notifier_subscriptions->list either under SRCU read lock or mm->notifier_subscriptions->lock using hlist_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false positives, for example, WARNING: suspicious RCU usage ----------------------------- mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by libvirtd/802: #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0 #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800 #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460 stack backtrace: CPU: 7 PID: 802 Comm: libvirtd Tainted: G I 5.6.0-rc6-next-20200317+ #2 Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014 Call Trace: dump_stack+0xa4/0xfe lockdep_rcu_suspicious+0xeb/0xf5 __mmu_notifier_invalidate_range_start+0x3ff/0x460 change_p4d_range+0x746/0x800 change_protection+0x1df/0x300 mprotect_fixup+0x245/0x3e0 do_mprotect_pkey+0x23b/0x3e0 __x64_sys_mprotect+0x51/0x70 do_syscall_64+0x91/0xae8 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm: do not allow MADV_PAGEOUT for CoW pagesMichal Hocko
Jann has brought up a very interesting point [1]. While shared pages are excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed that way. This can lead to all sorts of hard to debug problems. E.g. performance problems outlined by Daniel [2]. There are runtime environments where there is a substantial memory shared among security domains via CoW memory and an easy to reclaim way of that memory, which MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation for the parent process which might be more privileged or even open side channel attacks. The feasibility of the latter is not really clear to me TBH but there is no real reason for exposure at this stage. It seems there is no real use case to depend on reclaiming CoW memory via madvise at this stage so it is much easier to simply disallow it and this is what this patch does. Put simply, MADV_{PAGEOUT,COLD} can operate only on the exclusively owned memory which is a straightforward semantic. [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
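The "exclusively owned" rule can be illustrated with a tiny predicate. This is a hedged sketch: the message does not say how exclusivity is tested, so using a map count of exactly one is an assumption, and `struct model_page` and `model_can_pageout()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page model: how many address spaces currently map the page. */
struct model_page {
    int mapcount;
};

/*
 * Assumption: a page counts as exclusively owned when exactly one mapping
 * refers to it. CoW pages still shared after fork() have more than one.
 */
static bool model_can_pageout(const struct model_page *page)
{
    assert(page != NULL);
    return page->mapcount == 1;
}
```

In this model a page still shared between a parent and its forked child is skipped, so a less privileged child cannot force the parent's copy out of memory.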
2020-03-21mm, memcg: throttle allocators based on ancestral memory.highChris Down
Prior to this commit, we only directly check the affected cgroup's memory.high against its usage. However, it's possible that we are being reclaimed as a result of hitting an ancestor memory.high and should be penalised based on that, instead. This patch changes memory.high overage throttling to use the largest overage in its ancestors when considering how many penalty jiffies to charge. This makes sure that we penalise poorly behaving cgroups in the same way regardless of at what level of the hierarchy memory.high was breached. Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high") Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> [5.4.x+] Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
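The ancestor walk described above can be sketched as follows. This is a simplified user-space model: `struct model_cgroup` and `model_max_overage()` are hypothetical names, and measuring overage as a plain page difference is an assumption made for brevity rather than the kernel's actual calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cgroup model; not the kernel's struct mem_cgroup. */
struct model_cgroup {
    unsigned long usage;           /* pages currently charged */
    unsigned long high;            /* memory.high limit, in pages */
    struct model_cgroup *parent;   /* NULL at the root */
};

/*
 * Walk from the given cgroup up to the root and return the largest
 * overage (usage above memory.high) found at any level.
 */
static unsigned long model_max_overage(const struct model_cgroup *cg)
{
    unsigned long max_overage = 0;

    for (; cg != NULL; cg = cg->parent) {
        unsigned long overage;

        if (cg->usage <= cg->high)
            continue;              /* this level is within its limit */
        overage = cg->usage - cg->high;
        if (overage > max_overage)
            max_overage = overage;
    }
    return max_overage;
}
```

A child that is under its own limit but sits below a parent 200 pages over memory.high yields 200 here, so a penalty derived from this value is the same regardless of the level at which the limit was breached.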