summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2022-11-30mm/hugetlb: convert move_hugetlb_state() to foliosSidhartha Kumar
Clean up unmap_and_move_huge_page() by converting move_hugetlb_state() to take in folios. [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build] Link: https://lkml.kernel.org/r/20221101223059.460937-10-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Bui Quang Minh <minhquangbui99@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to foliosSidhartha Kumar
Continue to use a folio inside free_huge_page() by converting hugetlb_cgroup_uncharge_page*() to folios. Link: https://lkml.kernel.org/r/20221101223059.460937-8-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Bui Quang Minh <minhquangbui99@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/hugetlb_cgroup: convert hugetlb_cgroup_migrate to foliosSidhartha Kumar
Cleans up intermediate page to folio conversion code in hugetlb_cgroup_migrate() by changing its arguments from pages to folios. Link: https://lkml.kernel.org/r/20221101223059.460937-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Bui Quang Minh <minhquangbui99@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/hugetlb_cgroup: convert set_hugetlb_cgroup*() to foliosSidhartha Kumar
Allows __prep_new_huge_page() to operate on a folio by converting set_hugetlb_cgroup*() to take in a folio. Link: https://lkml.kernel.org/r/20221101223059.460937-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Bui Quang Minh <minhquangbui99@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/hugetlb_cgroup: convert hugetlb_cgroup_from_page() to foliosSidhartha Kumar
Introduce folios in __remove_hugetlb_page() by converting hugetlb_cgroup_from_page() to use folios. Also gets rid of unsed hugetlb_cgroup_from_page_resv() function. Link: https://lkml.kernel.org/r/20221101223059.460937-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Bui Quang Minh <minhquangbui99@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/hugetlb_cgroup: convert __set_hugetlb_cgroup() to foliosSidhartha Kumar
Patch series "convert hugetlb_cgroup helper functions to folios", v2. This patch series continues the conversion of hugetlb code from being managed in pages to folios by converting many of the hugetlb_cgroup helper functions to use folios. This allows the core hugetlb functions to pass in a folio to these helper functions. This patch (of 9); Change __set_hugetlb_cgroup() to use folios so it is explicit that the function operates on a head page. Link: https://lkml.kernel.org/r/20221101223059.460937-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20221101223059.460937-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Bui Quang Minh <minhquangbui99@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: vmscan: split khugepaged stats from direct reclaim statsJohannes Weiner
Direct reclaim stats are useful for identifying a potential source for application latency, as well as spotting issues with kswapd. However, khugepaged currently distorts the picture: as a kernel thread it doesn't impose allocation latencies on userspace, and it explicitly opts out of kswapd reclaim. Its activity showing up in the direct reclaim stats is misleading. Counting it as kswapd reclaim could also cause confusion when trying to understand actual kswapd behavior. Break out khugepaged from the direct reclaim counters into new pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters. Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS): pgsteal_kswapd 1342185 pgsteal_direct 0 pgsteal_khugepaged 3623 pgscan_kswapd 1345025 pgscan_direct 0 pgscan_khugepaged 3623 Link: https://lkml.kernel.org/r/20221026180133.377671-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Eric Bergen <ebergen@meta.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm, hwpoison: when copy-on-write hits poison, take page offlineTony Luck
Cannot call memory_failure() directly from the fault handler because mmap_lock (and others) are held. It is important, but not urgent, to mark the source page as h/w poisoned and unmap it from other tasks. Use memory_failure_queue() to request a call to memory_failure() for the page with the error. Also provide a stub version for CONFIG_MEMORY_FAILURE=n Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm, hwpoison: try to recover from copy-on write faultsTony Luck
Patch series "Copy-on-write poison recovery", v3. Part 1 deals with the process that triggered the copy on write fault with a store to a shared read-only page. That process is send a SIGBUS with the usual machine check decoration to specify the virtual address of the lost page, together with the scope. Part 2 sets up to asynchronously take the page with the uncorrected error offline to prevent additional machine check faults. H/t to Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com> for pointing me to the existing function to queue a call to memory_failure(). On x86 there is some duplicate reporting (because the error is also signalled by the memory controller as well as by the core that triggered the machine check). Console logs look like this: This patch (of 2): If the kernel is copying a page as the result of a copy-on-write fault and runs into an uncorrectable error, Linux will crash because it does not have recovery code for this case where poison is consumed by the kernel. It is easy to set up a test case. Just inject an error into a private page, fork(2), and have the child process write to the page. I wrapped that neatly into a test at: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git just enable ACPI error injection and run: # ./einj_mem-uc -f copy-on-write Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() on architectures where that is available (currently x86 and powerpc). When an error is detected during the page copy, return VM_FAULT_HWPOISON to caller of wp_page_copy(). This propagates up the call stack. Both x86 and powerpc have code in their fault handler to deal with this code by sending a SIGBUS to the application. Note that this patch avoids a system crash and signals the process that triggered the copy-on-write action. It does not take any action for the memory error that is still in the shared page. To handle that a call to memory_failure() is needed. But this cannot be done from wp_page_copy() because it holds mmap_lock(). Perhaps the architecture fault handlers can deal with this loose end in a subsequent patch? On Intel/x86 this loose end will often be handled automatically because the memory controller provides an additional notification of the h/w poison in memory, the handler for this will call memory_failure(). This isn't a 100% solution. If there are multiple errors, not all may be logged in this way. [tony.luck@intel.com: add call to kmsan_unpoison_memory(), per Miaohe Lin] Link: https://lkml.kernel.org/r/20221031201029.102123-2-tony.luck@intel.com Link: https://lkml.kernel.org/r/20221021200120.175753-1-tony.luck@intel.com Link: https://lkml.kernel.org/r/20221021200120.175753-2-tony.luck@intel.com Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30percpu_counter: add percpu_counter_sum_all interfaceShakeel Butt
The percpu_counter is used for scenarios where performance is more important than the accuracy. For percpu_counter users, who want more accurate information in their slowpath, percpu_counter_sum is provided which traverses all the online CPUs to accumulate the data. The reason it only needs to traverse online CPUs is because percpu_counter does implement CPU offline callback which syncs the local data of the offlined CPU. However there is a small race window between the online CPUs traversal of percpu_counter_sum and the CPU offline callback. The offline callback has to traverse all the percpu_counters on the system to flush the CPU local data which can be a lot. During that time, the CPU which is going offline has already been published as offline to all the readers. So, as the offline callback is running, percpu_counter_sum can be called for one counter which has some state on the CPU going offline. Since percpu_counter_sum only traverses online CPUs, it will skip that specific CPU and the offline callback might not have flushed the state for that specific percpu_counter on that offlined CPU. Normally this is not an issue because percpu_counter users can deal with some inaccuracy for small time window. However a new user i.e. mm_struct on the cleanup path wants to check the exact state of the percpu_counter through check_mm(). For such users, this patch introduces percpu_counter_sum_all() which traverses all possible CPUs and it is used in fork.c:check_mm() to avoid the potential race. This issue is exposed by the later patch "mm: convert mm's rss stats into percpu_counter". Link: https://lkml.kernel.org/r/20221109012011.881058-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: convert mm's rss stats into percpu_counterShakeel Butt
Currently mm_struct maintains rss_stats which are updated on page fault and the unmapping codepaths. For page fault codepath the updates are cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. The reason for caching is performance for multithreaded applications otherwise the rss_stats updates may become hotspot for such applications. However this optimization comes with the cost of error margin in the rss stats. The rss_stats for applications with large number of threads can be very skewed. At worst the error margin is (nr_threads * 64) and we have a lot of applications with 100s of threads, so the error margin can be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. Recently we started seeing the unbounded errors for rss_stats for specific applications which use TCP rx0cp. It seems like vm_insert_pages() codepath does not sync rss_stats at all. This patch converts the rss_stats into percpu_counter to convert the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). However this conversion enable us to get the accurate stats for situations where accuracy is more important than the cpu cost. This patch does not make such tradeoffs - we can just use percpu_counter_add_local() for the updates and percpu_counter_sum() (or percpu_counter_sync() + percpu_counter_read) for the readers. At the moment the readers are either procfs interface, oom_killer and memory reclaim which I think are not performance critical and should be ok with slow read. However I think we can make that change in a separate patch. Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30bpf: Tighten ptr_to_btf_id checks.Alexei Starovoitov
The networking programs typically don't require CAP_PERFMON, but through kfuncs like bpf_cast_to_kern_ctx() they can access memory through PTR_TO_BTF_ID. In such case enforce CAP_PERFMON. Also make sure that only GPL programs can access kernel data structures. All kfuncs require GPL already. Also remove allow_ptr_to_map_access. It's the same as allow_ptr_leaks and different name for the same check only causes confusion. Fixes: fd264ca02094 ("bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx") Fixes: 50c6b8a9aea2 ("selftests/bpf: Add a test for btf_type_tag "percpu"") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221125220617.26846-1-alexei.starovoitov@gmail.com
2022-12-01Merge branch 'slub-tiny-v1r6' into slab/for-nextVlastimil Babka
Merge my series [1] to deprecate the SLOB allocator. - Renames CONFIG_SLOB to CONFIG_SLOB_DEPRECATED with deprecation notice. - The recommended replacement is CONFIG_SLUB, optionally with the new CONFIG_SLUB_TINY tweaks for systems with 16MB or less RAM. - Use cases that stopped working with CONFIG_SLUB_TINY instead of SLOB should be reported to linux-mm@kvack.org and slab maintainers, otherwise SLOB will be removed in few cycles. [1] https://lore.kernel.org/all/20221121171202.22080-1-vbabka@suse.cz/
2022-12-01mm, slub: remove percpu slabs with CONFIG_SLUB_TINYVlastimil Babka
SLUB gets most of its scalability by percpu slabs. However for CONFIG_SLUB_TINY the goal is minimal memory overhead, not scalability. Thus, #ifdef out the whole kmem_cache_cpu percpu structure and associated code. Additionally to the slab page savings, this reduces percpu allocator usage, and code size. This change builds on recent commit c7323a5ad078 ("mm/slub: restrict sysfs validation to debug caches and make it safe"), as caches with enabled debugging also avoid percpu slabs and all allocations and freeing ends up working with the partial list. With a bit more refactoring by the preceding patches, use the same code paths with CONFIG_SLUB_TINY. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com>
2022-11-30Merge branch 'mm-hotfixes-stable' into mm-stableAndrew Morton
2022-11-30revert "kbuild: fix -Wimplicit-function-declaration in ↵Andrew Morton
license_is_gpl_compatible" It causes build failures with unusual CC/HOSTCC combinations. Quoting https://lkml.kernel.org/r/A222B1E6-69B8-4085-AD1B-27BDB72CA971@goldelico.com: HOSTCC scripts/mod/modpost.o - due to target missing In file included from include/linux/string.h:5, from scripts/mod/../../include/linux/license.h:5, from scripts/mod/modpost.c:24: include/linux/compiler.h:246:10: fatal error: asm/rwonce.h: No such file or directory 246 | #include <asm/rwonce.h> | ^~~~~~~~~~~~~~ compilation terminated. ... The problem is that HOSTCC is not necessarily the same compiler or even architecture as CC and pulling in <linux/compiler.h> or <asm/rwonce.h> files indirectly isn't a good idea then. My toolchain is providing HOSTCC = gcc (MacPorts) and CC = arm-linux-gnueabihf (built from gcc source) and all running on Darwin. If I change the include to <string.h> I can then "HOSTCC scripts/mod/modpost.c" but then it fails for "CC kernel/module/main.c" not finding <string.h>: CC kernel/module/main.o - due to target missing In file included from kernel/module/main.c:43:0: ./include/linux/license.h:5:20: fatal error: string.h: No such file or directory #include <string.h> ^ compilation terminated. Reported-by: "H. Nikolaus Schaller" <hns@goldelico.com> Cc: Sam James <sam@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/khugepaged: fix GUP-fast interaction by sending IPIJann Horn
Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to ensure that the page table was not removed by khugepaged in between. However, lockless_pages_from_mm() still requires that the page table is not concurrently freed. Fix it by sending IPIs (if the architecture uses semi-RCU-style page table freeing) before freeing/reusing page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com Fixes: ba76149f47d8 ("thp: khugepaged") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: introduce arch_has_hw_nonleaf_pmd_young()Juergen Gross
When running as a Xen PV guests commit eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in pmdp_test_and_clear_young(): BUG: unable to handle page fault for address: ffff8880083374d0 #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065 Oops: 0003 [#1] PREEMPT SMP NOPTI CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1 RIP: e030:pmdp_test_and_clear_young+0x25/0x40 This happens because the Xen hypervisor can't emulate direct writes to page table entries other than PTEs. This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young() similar to arch_has_hw_pte_young() and test that instead of CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG. Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com Fixes: eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") Signed-off-by: Juergen Gross <jgross@suse.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: David Hildenbrand <david@redhat.com> [core changes] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: add dummy pmd_young() for architectures not having itJuergen Gross
In order to avoid #ifdeffery add a dummy pmd_young() implementation as a fallback. This is required for the later patch "mm: introduce arch_has_hw_nonleaf_pmd_young()". Link: https://lkml.kernel.org/r/fd3ac3cd-7349-6bbd-890a-71a9454ca0b3@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sander Eikelenboom <linux@eikelenboom.it> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processingMike Kravetz
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page tables associated with the address range. For hugetlb vmas, zap_page_range will call __unmap_hugepage_range_final. However, __unmap_hugepage_range_final assumes the passed vma is about to be removed and deletes the vma_lock to prevent pmd sharing as the vma is on the way out. In the case of madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock prevents pmd sharing and could potentially lead to issues with truncation/fault races. This issue was originally reported here [1] as a BUG triggered in page_try_dup_anon_rmap. Prior to the introduction of the hugetlb vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to prevent pmd sharing. Subsequent faults on this vma were confused as VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was not set in new pages added to the page table. This resulted in pages that appeared anonymous in a VM_SHARED vma and triggered the BUG. Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap call from unmap_vmas(). This is used to indicate the 'final' unmapping of a hugetlb vma. When called via MADV_DONTNEED, this flag is not set and the vm_lock is not deleted. [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/ Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Wei Chen <harperchen1110@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30madvise: use zap_page_range_single for madvise dontneedMike Kravetz
This series addresses the issue first reported in [1], and fully described in patch 2. Patches 1 and 2 address the user visible issue and are tagged for stable backports. While exploring solutions to this issue, related problems with mmu notification calls were discovered. This is addressed in the patch "hugetlb: remove duplicate mmu notifications:". Since there are no user visible effects, this third is not tagged for stable backports. Previous discussions suggested further cleanup by removing the routine zap_page_range. This is possible because zap_page_range_single is now exported, and all callers of zap_page_range pass ranges entirely within a single vma. This work will be done in a later patch so as not to distract from this bug fix. [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/ This patch (of 2): Expose the routine zap_page_range_single to zap a range within a single vma. The madvise routine madvise_dontneed_single_vma can use this routine as it explicitly operates on a single vma. Also, update the mmu notification range in zap_page_range_single to take hugetlb pmd sharing into account. This is required as MADV_DONTNEED supports hugetlb vmas. Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Wei Chen <harperchen1110@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODEYang Shi
Syzbot reported the below splat: WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline] WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 Modules linked in: CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline] RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001 RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715 hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156 madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611 madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066 madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240 do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419 do_madvise mm/madvise.c:1432 [inline] __do_sys_madvise mm/madvise.c:1432 [inline] __se_sys_madvise mm/madvise.c:1430 [inline] __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f6b48a4eef9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9 RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000 RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4 R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000 </TASK> It is because khugepaged allocates pages with __GFP_THISNODE, but the preferred node is bogus. The previous patch fixed the khugepaged code to avoid allocating page from non-existing node. But it is still racy against memory hotremove. There is no synchronization with the memory hotplug so it is possible that memory gets offline during a longer taking scanning. So this warning still seems not quite helpful because: * There is no guarantee the node is online for __GFP_THISNODE context for all the callsites. * Kernel just fails the allocation regardless the warning, and it looks all callsites handle the allocation failure gracefully. Although while the warning has helped to identify a buggy code, it is not safe in general and this warning could panic the system with panic-on-warn configuration which tends to be used surprisingly often. So replace VM_WARN_ON to pr_warn(). And the warning will be triggered if __GFP_NOWARN is set since the allocator would print out warning for such case if __GFP_NOWARN is not set. [shy828301@gmail.com: rename nid to this_node and gfp to warn_gfp] Link: https://lkml.kernel.org/r/20221123193014.153983-1-shy828301@gmail.com [akpm@linux-foundation.org: fix whitespace] [akpm@linux-foundation.org: print gfp_mask instead of warn_gfp, per Michel] Link: https://lkml.kernel.org/r/20221108184357.55614-3-shy828301@gmail.com Fixes: 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse") Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30Merge branches 'doc.2022.10.20a', 'fixes.2022.10.21a', 'lazy.2022.11.30a', ↵Paul E. McKenney
'srcunmisafe.2022.11.09a', 'torture.2022.10.18c' and 'torturescript.2022.10.20a' into HEAD doc.2022.10.20a: Documentation updates. fixes.2022.10.21a: Miscellaneous fixes. lazy.2022.11.30a: Lazy call_rcu() and NOCB updates. srcunmisafe.2022.11.09a: NMI-safe SRCU readers. torture.2022.10.18c: Torture-test updates. torturescript.2022.10.20a: Torture-test scripting updates.
2022-11-30block: untangle request_queue refcounting from sysfsChristoph Hellwig
The kobject embedded into the request_queue is used for the queue directory in sysfs, but that is a child of the gendisks directory and is intimately tied to it. Move this kobject to the gendisk and use a refcount_t in the request_queue for the actual request_queue refcounting that is completely unrelated to the device model. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221114042637.1009333-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30netfilter: ipset: Add support for new bitmask parameterVishwanath Pai
Add a new parameter to complement the existing 'netmask' option. The main difference between netmask and bitmask is that bitmask takes any arbitrary ip address as input, it does not have to be a valid netmask. The name of the new parameter is 'bitmask'. This lets us mask out arbitrary bits in the ip address, for example: ipset create set1 hash:ip bitmask 255.128.255.0 ipset create set2 hash:ip,port family inet6 bitmask ffff::ff80 Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Joshua Hunt <johunt@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-30netfilter: conntrack: merge ipv4+ipv6 confirm functionsFlorian Westphal
No need to have distinct functions. After merge, ipv6 can avoid protooff computation if the connection neither needs sequence adjustment nor helper invocation -- this is the normal case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-30io_uring: reshuffle issue_flagsPavel Begunkov
Reshuffle issue flags to keep normal flags separate from the uring_cmd ctx-setup like flags. Shift the second type to the second byte so it's easier to add new ones in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d6e4696c883943082d248716f4cd568f37b17a74.1669821213.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30netfilter: conntrack: add sctp DATA_SENT stateSriram Yagnaraman
SCTP conntrack currently assumes that the SCTP endpoints will probe secondary paths using HEARTBEAT before sending traffic. But, according to RFC 9260, SCTP endpoints can send any traffic on any of the confirmed paths after SCTP association is up. SCTP endpoints that sends INIT will confirm all peer addresses that upper layer configures, and the SCTP endpoint that receives COOKIE_ECHO will only confirm the address it sent the INIT_ACK to. So, we can have a situation where the INIT sender can start to use secondary paths without the need to send HEARTBEAT. This patch allows DATA/SACK packets to create new connection tracking entry. A new state has been added to indicate that a DATA/SACK chunk has been seen in the original direction - SCTP_CONNTRACK_DATA_SENT. State transitions mostly follows the HEARTBEAT_SENT, except on receiving HEARTBEAT/HEARTBEAT_ACK/DATA/SACK in the reply direction. State transitions in original direction: - DATA_SENT behaves similar to HEARTBEAT_SENT for all chunks, except that it remains in DATA_SENT on receving HEARTBEAT, HEARTBEAT_ACK/DATA/SACK chunks State transitions in reply direction: - DATA_SENT behaves similar to HEARTBEAT_SENT for all chunks, except that it moves to HEARTBEAT_ACKED on receiving HEARTBEAT/HEARTBEAT_ACK/DATA/SACK chunks Note: This patch still doesn't solve the problem when the SCTP endpoint decides to use primary paths for association establishment but uses a secondary path for association shutdown. We still have to depend on timeout for connections to expire in such a case. Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-30Merge tag 'arm-soc/for-6.2/drivers' of https://github.com/Broadcom/stblinux ↵Arnd Bergmann
into soc/drivers This pull request contains Broadcom SoCs driver changes for 6.2, please pull the following: - Yuan uses dev_err_probe() in the Raspberry Pi firmware provider to simplify the error handling code - Rafal adds support for initialiazing the BCM47xx NVMEM/NVRAM firmware provider out of memory-mapped flash devices. * tag 'arm-soc/for-6.2/drivers' of https://github.com/Broadcom/stblinux: firmware/nvram: bcm47xx: support init from IO memory firmware: raspberrypi: Use dev_err_probe() to simplify code Link: https://lore.kernel.org/r/20221129191755.542584-3-f.fainelli@gmail.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-11-30Revert "blk-cgroup: Flush stats at blkgs destruction path"Jens Axboe
This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60. We've had a few reports on this causing a crash at boot time, because of a reference issue. While this problem seemginly did exist before the patch and needs solving separately, this patch makes it a lot easier to trigger. Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/ Link: https://lore.kernel.org/linux-block/69af7ccb-6901-c84c-0e95-5682ccfb750c@acm.org/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30Merge tag 'qcom-arm64-for-6.2' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/dt Qualcomm ARM64 DTS updates for 6.2 This introduces support for SM4250, SM6115, SM6375 and SDM670 platforms and Sony Xperia 10 IV, Google Pixel 3a, OnePlus 3, OnePlus 3T, Google Pazquel and OnePlus Nord N100. A wide variety of updates to align with DeviceTree bindings across many/most platforms is introduced, and incorrectly styled comments are adjusted across the tree. Apps RSC is added to the cluster-idle power-domain across SM8150, SM8250, SM8350 and SM8450, to ensure sleep and wake votes are flushed as the last core is being powered down. Remoteproc firmware patches are aligned with agreed upon structure used in linux-firmware across Inforce 6560, Lenovo Miix 630, various Sony Xperia devices and Samsung Galaxy Book2 (although these are not available in linux-firmware today). On IPQ8074 CPU clocks are added, thermal zones are introduced and vqmmc supply is specified for the HK01 board. Alcatel OneTouch Idol 3 gains LED nodes and Samsung Galaxy A3U gained vibrator support. The application subsystem's IOMMU and the display subsystem is enabled for MSM8953. A new CPU frequency table is introduced for MSM8996Pro, to properly describe it separate of MSM8996. The GPU opp-table is extended as well. On SC7180 USB is marked as a wakeup source, USB gains required-opps to ensure that the core voltage rail is voted for as needed. The description of the fingerprint sensor in Trogdor is corrected. On SC7280 Wake-on-WLAN is introduced, and PHY parameters for the SNPS USB PHY is defined across SC7280. The memory map across Google Herobrine is adjusted, to regain unused memory on the WiFi SKUs. A LTE SKU of the Evoker board is introduced and the bard gains touchscreen. NVME support is disabled on Villager boards, as it's not used. PCIe support is introduced on SC8280XP, with NVMe, SDX55 (5G) and WiFi enabled on the Lenovo Thinkpad X13s and Compute Reference Device. ADCs and thermal zones are intrduced for the same. Lenovo Thinkpad X13s gains LID switch support. Fairphone FP3 gains touchscreen support. Support for Xiaomi Poco F1 variant with EBBG panel. The round-robin ADC is enabled across DB845c, OnePlus devices and Pocophone F1 devices. The displayport controller on SDM845 is introduced. SM6350 gains SDHCI support and on Sony Xperia 10 III sd-card, touchscreen and GPI DMA is enabled. Fairphone FP4 got SD-card support. UFS PHY register ranges are corrected across SM8150, SM8250, SM8350 and SM8450. Sony Xperia 1 II got NFC support and Sony Xperia 5 III got PMIC regulators defined and USB definition corrected, to enable USB3. The SDHCI controller is described for SM8450 and microSD support is enabled for the HDK and QRD devices. SM8450 also gains camera CCI interface and display clock controller. * tag 'qcom-arm64-for-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (261 commits) arm64: dts: qcom: sdm845-polaris: Don't duplicate DMA assignment arm64: dts: qcom: sm8350-sagami: Wire up USB regulators and fix USB3 arm64: dts: qcom: sm8350-sagami: Add most RPMh regulators arm64: dts: qcom: sc7280: Make herobrine-audio-rt5682 mic dtsi's match more arm64: dts: qcom: trim addresses to 8 digits arm64: dts: msm8998: unify PCIe clock order withMSM8996 arm64: dts: msm8998: add MSM8998 specific compatible arm64: dts: qcom: sc8280xp-x13s: enable WiFi controller arm64: dts: qcom: sc8280xp-x13s: enable modem arm64: dts: qcom: sc8280xp-x13s: enable NVMe SSD arm64: dts: qcom: sc8280xp-crd: enable WiFi controller arm64: dts: qcom: sc8280xp-crd: enable SDX55 modem arm64: dts: qcom: sc8280xp-crd: enable NVMe SSD arm64: dts: qcom: sc8280xp-crd: rename backlight and misc regulators arm64: dts: qcom: sa8295p-adp: enable PCIe arm64: dts: qcom: sc8280xp/sa8540p: add PCIe2-4 nodes arm64: dts: qcom: add sdm670 and pixel 3a device trees arm64: dts: qcom: sc7280: Add Google Herobrine WIFI SKU dts fragment arm64: dts: qcom: sc7280: Mark all Qualcomm reference boards as LTE arm64: dts: qcom: sm7225-fairphone-fp4: Enable SD card ... Link: https://lore.kernel.org/r/20221124100650.1982448-1-andersson@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-11-30filelock: add a new locks_inode_context accessor functionJeff Layton
There are a number of places in the kernel that are accessing the inode->i_flctx field without smp_load_acquire. This is required to ensure that the caller doesn't see a partially-initialized structure. Add a new accessor function for it to make this clear and convert all of the relevant accesses in locks.c to use it. Also, convert locks_free_lock_context to use the helper as well instead of just doing a "bare" assignment. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2022-11-30filelock: new helper: vfs_inode_has_locksJeff Layton
Ceph has a need to know whether a particular inode has any locks set on it. It's currently tracking that by a num_locks field in its filp->private_data, but that's problematic as it tries to decrement this field when releasing locks and that can race with the file being torn down. Add a new vfs_inode_has_locks helper that just returns whether any locks are currently held on the inode. Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2022-11-30Merge tag 'drm-msm-display-for-6.2' of ↵Dave Airlie
https://gitlab.freedesktop.org/lumag/msm into drm-next drm/msm updates for 6.2 Core: - MSM_INFO_GET_FLAGS support - Cleaned up MSM IOMMU wrapper code DPU: - Added support for XR30 and P010 image formats - Reworked MDSS/DPU schema, added SM8250 MDSS bindings - Added Qualcomm SM6115 support DP: - Dropped unsane sanity checks DSI: - Fix calculation of DSC pps payload DSI PHY: - DSI PHY support for QCM2290 HDMI: - Reworked dev init path Signed-off-by: Dave Airlie <airlied@redhat.com> From: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20221126102141.721353-1-dmitry.baryshkov@linaro.org
2022-11-29net/mlx5: Use generic definition for UMR KLM alignmentTariq Toukan
MLX5_UMR_KLM_ALIGNMENT is in units of number of entries, while MLX5_UMR_MTT_ALIGNMENT (generalized and renamed to MLX5_UMR_FLEX_ALIGNMENT) is in byte units. This is misleading and confusing. Replace this KLM definition with one based on the generic definition. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-29net/mlx5: Generalize name of UMR alignment definitionTariq Toukan
Per the device spec, MLX5_UMR_MTT_ALIGNMENT is good not only for UMR MTT entries, but for all other entries as well, like KLMs and KSMs. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-29net/mlx5: Remove unused UMR MTT definitionsTariq Toukan
Defines MLX5_UMR_MTT_MASK and MLX5_UMR_MTT_MIN_CHUNK_SIZE are not in use. Remove them. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-29net/mlx5e: Add padding when needed in UMR WQEsTariq Toukan
Per the device spec, MTTs/KLMs list in a UMR WQE must be aligned to 64B. Per our SW design, the MTT/KLMs list would need alignment only if it's too small, for example on PPC when PAGE_SIZE is 64KB, and only 4 pages are needed to cover a MPWQE of size 256KB. Padding, if needed, is taken into account when calculating the UMR WQE fields (ds_cnt and xlt_octowords), however no entries are provided, instead garbage is passed. No real harm though, as these parts act as gaps between the RX MPWQEs and not used by any of them. Hence, in practice, device does not try to write any incoming packet to them. Still, prefer providing clean padding marking the end of the list, and do not map garbage into the RQ memory region. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-29net/mlx5: Remove unused ctx variablesPetr Pavlu
Remove mlx5_priv.ctx_list and ctx_lock which are no longer used after commit 601c10c89cbb ("net/mlx5: Delete custom device management logic"). Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-29Merge branch 'master' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next 2022-11-26 1) Remove redundant variable in esp6. From Colin Ian King. 2) Update x->lastused for every packet. It was used only for outgoing mobile IPv6 packets, but showed to be usefull to check if the a SA is still in use in general. From Antony Antony. 3) Remove unused variable in xfrm_byidx_resize. From Leon Romanovsky. 4) Finalize extack support for xfrm. From Sabrina Dubroca. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: add extack to xfrm_set_spdinfo xfrm: add extack to xfrm_alloc_userspi xfrm: add extack to xfrm_do_migrate xfrm: add extack to xfrm_new_ae and xfrm_replay_verify_len xfrm: add extack to xfrm_del_sa xfrm: add extack to xfrm_add_sa_expire xfrm: a few coding style clean ups xfrm: Remove not-used total variable xfrm: update x->lastused for every packet esp6: remove redundant variable err ==================== Link: https://lore.kernel.org/r/20221126110303.1859238-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30Merge tag 'drm/tegra/for-6.2-rc1' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/tegra into drm-next drm/tegra: Changes for v6.2-rc1 This contains a bunch of cleanups across the board as well as support for the NVDEC hardware found on the Tegra234 SoC. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thierry Reding <thierry.reding@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20221125155219.3352952-1-thierry.reding@gmail.com
2022-11-30Merge tag 'drm-accel-2022-11-22' of ↵Dave Airlie
https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/accel into drm-next This tag contains the patches that add the new compute acceleration subsystem, which is part of the DRM subsystem. The patches: - Add a new directory at drivers/accel. - Add a new major (261) for compute accelerators. - Add a new DRM minor type for compute accelerators. - Integrate the accel core code with DRM core code. - Add documentation for the accel subsystem. Signed-off-by: Dave Airlie <airlied@redhat.com> some acks from the list (some are in the patch series): Acked-by: Daniel Stone <daniels@collabora.com> Acked-by: Sonal Santan <sonal.santan@amd.com> Acked-by: Maxime Ripard <maxime@cerno.tech> Acked-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Tested-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> From: Oded Gabbay <ogabbay@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20221122112222.GA352082@ogabbay-vm-u20.habana-labs.com
2022-11-29rcu: Make SRCU mandatoryPaul E. McKenney
Kernels configured with CONFIG_PRINTK=n and CONFIG_SRCU=n get build failures. This causes trouble for deep embedded systems. But given that there are more than 25 instances of "select SRCU" in the kernel, it is hard to believe that there are many kernels running in production without SRCU. This commit therefore makes SRCU mandatory. The SRCU Kconfig option remains for backwards compatibility, and will be removed when it is no longer used. [ paulmck: Update per kernel test robot feedback. ] Reported-by: John Ogness <john.ogness@linutronix.de> Reported-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: <linux-arch@vger.kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Reviewed-by: John Ogness <john.ogness@linutronix.de>
2022-11-29rcu: Make call_rcu() lazy to save powerJoel Fernandes (Google)
Implement timer-based RCU callback batching (also known as lazy callbacks). With this we save about 5-10% of power consumed due to RCU requests that happen when system is lightly loaded or idle. By default, all async callbacks (queued via call_rcu) are marked lazy. An alternate API call_rcu_hurry() is provided for the few users, for example synchronize_rcu(), that need the old behavior. The batch is flushed whenever a certain amount of time has passed, or the batch on a particular CPU grows too big. Also memory pressure will flush it in a future patch. To handle several corner cases automagically (such as rcu_barrier() and hotplug), we re-use bypass lists which were originally introduced to address lock contention, to handle lazy CBs as well. The bypass list length has the lazy CB length included in it. A separate lazy CB length counter is also introduced to keep track of the number of lazy CBs. [ paulmck: Fix formatting of inline call_rcu_lazy() definition. ] [ paulmck: Apply Zqiang feedback. ] [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Suggested-by: Paul McKenney <paulmck@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
tools/lib/bpf/ringbuf.c 927cbb478adf ("libbpf: Handle size overflow for ringbuf mmap") b486d19a0ab0 ("libbpf: checkpatch: Fixed code alignments in ringbuf.c") https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29Merge tag 'v6.2-rockchip-dts64-1' of ↵Arnd Bergmann
git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into asahi-wip New boards: - Model A and blade baseboards for the SOQuartz (rk3568) SoM, - Anberic RG351M, RG353V, RG353VS; Odroid Go Super, Advance gaming devices - Odroid M1 - Theobroma px30 SoM with baseboard - Rockchip's own rk3566 demo board Some core support for per SoC specifics: - crypto support for rk3399 and rk3328 - second I2S controller for rk3568 - Cache properties for follow the binding for rk3308 and rk3328 Bigger device support updates for: - SOQuartz: PCIe2, video output, gpu, HDMI sound - Rock 3A: eth regulator, eth clock input, Wifi+Bt, I2S, PCIe3 As well as some minor extensions for Rock960 (hdmi supplies), rk3566-roc-pc (PCIe2), Rock 4C+ (thermal support), Pinephone Pro (Wifi+Bt) * tag 'v6.2-rockchip-dts64-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: (51 commits) arm64: dts: rockchip: update cache properties for rk3308 and rk3328 arm64: dts: rockchip: Add SOQuartz Model A baseboard dt-bindings: arm: rockchip: Add SOQuartz Model A arm64: dts: rockchip: Add SOQuartz blade board dt-bindings: arm: rockchip: Add SOQuartz Blade arm64: dts: rockchip: Add Anbernic RG351M arm64: dts: rockchip: Add Odroid Go Super arm64: dts: rockchip: Add Odroid Go Advance Black Edition dt-bindings: arm: rockchip: Add more RK3326 devices arm64: dts: rockchip: Move most of Odroid Go Advance DTS into a DTSI arm64: dts: rockchip: Add support of regulator for ethernet node on Rock 3A SBC arm64: dts: rockchip: Add support of external clock to ethernet node on Rock 3A SBC arm64: dts: rockchip: Add HDMI supplies on Rock960 arm64: dts: rockchip: Add dts for rockchip rk3566 box demo board dt-bindings: rockchip: Add Rockchip rk3566 box demo board arm64: dts: rockchip: Enable PCIe 2 on SOQuartz CM4IO arm64: dts: rockchip: Enable HDMI sound on SOQuartz arm64: dts: rockchip: Enable video output and HDMI on SOQuartz arm64: dts: rockchip: Enable GPU on SOQuartz CM4 arm64: dts: rockchip: enable pcie2 on rk3566-roc-pc ... Link: https://lore.kernel.org/r/4716610.aeNJFYEL58@phil Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-11-29Merge branch 'for-6.2/io_uring' into for-6.2/io_uring-nextJens Axboe
* for-6.2/io_uring: (41 commits) io_uring: keep unlock_post inlined in hot path io_uring: don't use complete_post in kbuf io_uring: spelling fix io_uring: remove io_req_complete_post_tw io_uring: allow multishot polled reqs to defer completion io_uring: remove overflow param from io_post_aux_cqe io_uring: add lockdep assertion in io_fill_cqe_aux io_uring: make io_fill_cqe_aux static io_uring: add io_aux_cqe which allows deferred completion io_uring: allow defer completion for aux posted cqes io_uring: defer all io_req_complete_failed io_uring: always lock in io_apoll_task_func io_uring: remove iopoll spinlock io_uring: iopoll protect complete_post io_uring: inline __io_req_complete_put() io_uring: remove io_req_tw_post_queue io_uring: use io_req_task_complete() in timeout io_uring: hold locks for io_req_complete_failed io_uring: add completion locking for iopoll io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post() ...
2022-11-29Merge tag 'net-6.1-rc8-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, can and wifi. Current release - new code bugs: - eth: mlx5e: - use kvfree() in mlx5e_accel_fs_tcp_create() - MACsec, fix RX data path 16 RX security channel limit - MACsec, fix memory leak when MACsec device is deleted - MACsec, fix update Rx secure channel active field - MACsec, fix add Rx security association (SA) rule memory leak Previous releases - regressions: - wifi: cfg80211: don't allow multi-BSSID in S1G - stmmac: set MAC's flow control register to reflect current settings - eth: mlx5: - E-switch, fix duplicate lag creation - fix use-after-free when reverting termination table Previous releases - always broken: - ipv4: fix route deletion when nexthop info is not specified - bpf: fix a local storage BPF map bug where the value's spin lock field can get initialized incorrectly - tipc: re-fetch skb cb after tipc_msg_validate - wifi: wilc1000: fix Information Element parsing - packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE - sctp: fix memory leak in sctp_stream_outq_migrate() - can: can327: fix potential skb leak when netdev is down - can: add number of missing netdev freeing on error paths - aquantia: do not purge addresses when setting the number of rings - wwan: iosm: - fix incorrect skb length leading to truncated packet - fix crash in peek throughput test due to skb UAF" * tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits) net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed MAINTAINERS: Update maintainer list for chelsio drivers ionic: update MAINTAINERS entry sctp: fix memory leak in sctp_stream_outq_migrate() packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE net/mlx5: Lag, Fix for loop when checking lag Revert "net/mlx5e: MACsec, remove replay window size limitation in offload path" net: marvell: prestera: Fix a NULL vs IS_ERR() check in some functions net: tun: Fix use-after-free in tun_detach() net: mdiobus: fix unbalanced node reference count net: hsr: Fix potential use-after-free tipc: re-fetch skb cb after tipc_msg_validate mptcp: fix sleep in atomic at close time mptcp: don't orphan ssk in mptcp_close() dsa: lan9303: Correct stat name ipv4: Fix route deletion when nexthop info is not specified net: wwan: iosm: fix incorrect skb length net: wwan: iosm: fix crash in peek throughput test net: wwan: iosm: fix dma_alloc_coherent incompatible pointer type net: wwan: iosm: fix kernel test robot reported error ...
2022-11-29sctp: fix memory leak in sctp_stream_outq_migrate()Zhengchao Shao
When sctp_stream_outq_migrate() is called to release stream out resources, the memory pointed to by prio_head in stream out is not released. The memory leak information is as follows: unreferenced object 0xffff88801fe79f80 (size 64): comm "sctp_repo", pid 7957, jiffies 4294951704 (age 36.480s) hex dump (first 32 bytes): 80 9f e7 1f 80 88 ff ff 80 9f e7 1f 80 88 ff ff ................ 90 9f e7 1f 80 88 ff ff 90 9f e7 1f 80 88 ff ff ................ backtrace: [<ffffffff81b215c6>] kmalloc_trace+0x26/0x60 [<ffffffff88ae517c>] sctp_sched_prio_set+0x4cc/0x770 [<ffffffff88ad64f2>] sctp_stream_init_ext+0xd2/0x1b0 [<ffffffff88aa2604>] sctp_sendmsg_to_asoc+0x1614/0x1a30 [<ffffffff88ab7ff1>] sctp_sendmsg+0xda1/0x1ef0 [<ffffffff87f765ed>] inet_sendmsg+0x9d/0xe0 [<ffffffff8754b5b3>] sock_sendmsg+0xd3/0x120 [<ffffffff8755446a>] __sys_sendto+0x23a/0x340 [<ffffffff87554651>] __x64_sys_sendto+0xe1/0x1b0 [<ffffffff89978b49>] do_syscall_64+0x39/0xb0 [<ffffffff89a0008b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Link: https://syzkaller.appspot.com/bug?exrid=29c402e56c4760763cc0 Fixes: 637784ade221 ("sctp: introduce priority based stream scheduler") Reported-by: syzbot+29c402e56c4760763cc0@syzkaller.appspotmail.com Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/20221126031720.378562-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29Revert "net/mlx5e: MACsec, remove replay window size limitation in offload path"Saeed Mahameed
This reverts commit c0071be0e16c461680d87b763ba1ee5e46548fde. The cited commit removed the validity checks which initialized the window_sz and never removed the use of the now uninitialized variable, so now we are left with wrong value in the window size and the following clang warning: [-Wuninitialized] drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c:232:45: warning: variable 'window_sz' is uninitialized when used here MLX5_SET(macsec_aso, aso_ctx, window_size, window_sz); Revet at this time to address the clang issue due to lack of time to test the proper solution. Fixes: c0071be0e16c ("net/mlx5e: MACsec, remove replay window size limitation in offload path") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reported-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20221129093006.378840-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>