summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2023-10-18shmem,percpu_counter: add _limited_add(fbc, limit, amount)Hugh Dickins
Percpu counter's compare and add are separate functions: without locking around them (which would defeat their purpose), it has been possible to overflow the intended limit. Imagine all the other CPUs fallocating tmpfs huge pages to the limit, in between this CPU's compare and its add. I have not seen reports of that happening; but tmpfs's recent addition of dquot_alloc_block_nodirty() in between the compare and the add makes it even more likely, and I'd be uncomfortable to leave it unfixed. Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it. I believe this implementation is correct, and slightly more efficient than the combination of compare and add (taking the lock once rather than twice when nearing full - the last 128MiB of a tmpfs volume on a machine with 128 CPUs and 4KiB pages); but it does beg for a better design - when nearing full, there is no new batching, but the costly percpu counter sum across CPUs still has to be done, while locked. Follow __percpu_counter_sum()'s example, including cpu_dying_mask as well as cpu_online_mask: but shouldn't __percpu_counter_compare() and __percpu_counter_limited_add() then be adding a num_dying_cpus() to num_online_cpus(), when they calculate the maximum which could be held across CPUs? But the times when it matters would be vanishingly rare. Link: https://lkml.kernel.org/r/bb817848-2d19-bcc8-39ca-ea179af0f0b4@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18shmem: shrink shmem_inode_info: dir_offsets in a unionHugh Dickins
Patch series "shmem,tmpfs: general maintenance". Mostly just cosmetic mods in mm/shmem.c, but the last two enforcing the "size=" limit better. 8/8 goes into percpu counter territory, and could stand alone. This patch (of 8): Shave 32 bytes off (the 64-bit) shmem_inode_info. There was a 4-byte pahole after stop_eviction, better filled by fsflags. And the 24-byte dir_offsets can only be used by directories, whereas shrinklist and swaplist only by shmem_mapping() inodes (regular files or long symlinks): so put those into a union. No change in mm/shmem.c is required for this. Link: https://lkml.kernel.org/r/c7441dc6-f3bb-dd60-c670-9f5cbd9f266@google.com Link: https://lkml.kernel.org/r/86ebb4b-c571-b9e8-27f5-cb82ec50357e@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEsMuhammad Usama Anjum
The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get or optionally clear the info about page table entries. The following operations are supported in this IOCTL: - Scan the address range and get the memory ranges matching the provided criteria. This is performed when the output buffer is specified. - Write-protect the pages. The PM_SCAN_WP_MATCHING is used to write-protect the pages of interest. The PM_SCAN_CHECK_WPASYNC aborts the operation if non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING`` can be used with or without PM_SCAN_CHECK_WPASYNC. - Both of those operations can be combined into one atomic operation where we can get and write protect the pages as well. Following flags about pages are currently supported: - PAGE_IS_WPALLOWED - Page has async-write-protection enabled - PAGE_IS_WRITTEN - Page has been written to from the time it was write protected - PAGE_IS_FILE - Page is file backed - PAGE_IS_PRESENT - Page is present in the memory - PAGE_IS_SWAPPED - Page is in swapped - PAGE_IS_PFNZERO - Page has zero PFN - PAGE_IS_HUGE - Page is THP or Hugetlb backed This IOCTL can be extended to get information about more PTE bits. The entire address range passed by user [start, end) is scanned until either the user provided buffer is full or max_pages have been found. [akpm@linux-foundation.org: update it for "mm: hugetlb: add huge page size param to set_huge_pte_at()"] [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n warning] [arnd@arndb.de: hide unused pagemap_scan_backout_range() function] Link: https://lkml.kernel.org/r/20230927060257.2975412-1-arnd@kernel.org [sfr@canb.auug.org.au: fix "fs/proc/task_mmu: hide unused pagemap_scan_backout_range() function"] Link: https://lkml.kernel.org/r/20230928092223.0625c6bf@canb.auug.org.au Link: https://lkml.kernel.org/r/20230821141518.870589-3-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Miroslaw <emmir@google.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Paul Gofman <pgofman@codeweavers.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yun Zhou <yun.zhou@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18userfaultfd: UFFD_FEATURE_WP_ASYNCPeter Xu
Patch series "Implement IOCTL to get and optionally clear info about PTEs", v33. *Motivation* The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows GetWriteWatch() and ResetWriteWatch() syscalls [1]. The GetWriteWatch() retrieves the addresses of the pages that are written to in a region of virtual memory. This syscall is used in Windows applications and games etc. This syscall is being emulated in pretty slow manner in userspace. Our purpose is to enhance the kernel such that we translate it efficiently in a better way. Currently some out of tree hack patches are being used to efficiently emulate it in some kernels. We intend to replace those with these patches. So the whole gaming on Linux can effectively get benefit from this. It means there would be tons of users of this code. CRIU use case [2] was mentioned by Andrei and Danylo: > Use cases for migrating sparse VMAs are binaries sanitized with ASAN, > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of > shadow memory [4]. Being able to migrate such binaries allows to highly > reduce the amount of work needed to identify and fix post-migration > crashes, which happen constantly. Andrei defines the following uses of this code: * it is more granular and allows us to track changed pages more effectively. The current interface can clear dirty bits for the entire process only. In addition, reading info about pages is a separate operation. It means we must freeze the process to read information about all its pages, reset dirty bits, only then we can start dumping pages. The information about pages becomes more and more outdated, while we are processing pages. The new interface solves both these downsides. First, it allows us to read pte bits and clear the soft-dirty bit atomically. It means that CRIU will not need to freeze processes to pre-dump their memory. Second, it clears soft-dirty bits for a specified region of memory. It means CRIU will have actual info about pages to the moment of dumping them. * The new interface has to be much faster because basic page filtering is happening in the kernel. With the old interface, we have to read pagemap for each page. *Implementation Evolution (Short Summary)* From the definition of GetWriteWatch(), we feel like kernel's soft-dirty feature can be used under the hood with some additions like: * reset soft-dirty flag for only a specific region of memory instead of clearing the flag for the entire process * get and clear soft-dirty flag for a specific region atomically So we decided to use ioctl on pagemap file to read or/and reset soft-dirty flag. But using soft-dirty flag, sometimes we get extra pages which weren't even written. They had become soft-dirty because of VMA merging and VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were able to by-pass this short coming by ignoring VM_SOFTDIRTY until David reported that mprotect etc messes up the soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We discussed if we can revert these patches. But we could not reach to any conclusion. So at this point, I made couple of tries to solve this whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation: * [7] Correct the bug fixed wrongly back in 2014. It had potential to cause regression. We left it behind. * [8] Keep a list of soft-dirty part of a VMA across splits and merges. I got the reply don't increase the size of the VMA by 8 bytes. At this point, we left soft-dirty considering it is too much delicate and userfaultfd [9] seemed like the only way forward. From there onward, we have been basing soft-dirty emulation on userfaultfd wp feature where kernel resolves the faults itself when WP_ASYNC feature is used. It was straight forward to add WP_ASYNC feature in userfautlfd. Now we get only those pages dirty or written-to which are really written in reality. (PS There is another WP_UNPOPULATED userfautfd feature is required which is needed to avoid pre-faulting memory before write-protecting [9].) All the different masks were added on the request of CRIU devs to create interface more generic and better. [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com [3] https://github.com/google/sanitizers [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/ [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com This patch (of 6): Add a new userfaultfd-wp feature UFFD_FEATURE_WP_ASYNC, that allows userfaultfd wr-protect faults to be resolved by the kernel directly. It can be used like a high accuracy version of soft-dirty, without vma modifications during tracking, and also with ranged support by default rather than for a whole mm when reset the protections due to existence of ioctl(UFFDIO_WRITEPROTECT). Several goals of such a dirty tracking interface: 1. All types of memory should be supported and tracable. This is nature for soft-dirty but should mention when the context is userfaultfd, because it used to only support anon/shmem/hugetlb. The problem is for a dirty tracking purpose these three types may not be enough, and it's legal to track anything e.g. any page cache writes from mmap. 2. Protections can be applied to partial of a memory range, without vma split/merge fuss. The hope is that the tracking itself should not affect any vma layout change. It also helps when reset happens because the reset will not need mmap write lock which can block the tracee. 3. Accuracy needs to be maintained. This means we need pte markers to work on any type of VMA. One could question that, the whole concept of async dirty tracking is not really close to fundamentally what userfaultfd used to be: it's not "a fault to be serviced by userspace" anymore. However, using userfaultfd-wp here as a framework is convenient for us in at least: 1. VM_UFFD_WP vma flag, which has a very good name to suite something like this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new feature bit to identify from a sync version of uffd-wp registration. 2. PTE markers logic can be leveraged across the whole kernel to maintain the uffd-wp bit as long as an arch supports, this also applies to this case where uffd-wp bit will be a hint to dirty information and it will not go lost easily (e.g. when some page cache ptes got zapped). 3. Reuse ioctl(UFFDIO_WRITEPROTECT) interface for either starting or resetting a range of memory, while there's no counterpart in the old soft-dirty world, hence if this is wanted in a new design we'll need a new interface otherwise. We can somehow understand that commonality because uffd-wp was fundamentally a similar idea of write-protecting pages just like soft-dirty. This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so far WP_ASYNC seems to not usable if without WP_UNPOPULATE. This also gives us chance to modify impl of WP_ASYNC just in case it could be not depending on WP_UNPOPULATED anymore in the future kernels. It's also fine to imply that because both features will rely on PTE_MARKER_UFFD_WP config option, so they'll show up together (or both missing) in an UFFDIO_API probe. vma_can_userfault() now allows any VMA if the userfaultfd registration is only about async uffd-wp. So we can track dirty for all kinds of memory including generic file systems (like XFS, EXT4 or BTRFS). One trick worth mention in do_wp_page() is that we need to manually update vmf->orig_pte here because it can be used later with a pte_same() check - this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags. The major defect of this approach of dirty tracking is we need to populate the pgtables when tracking starts. Soft-dirty doesn't do it like that. It's unwanted in the case where the range of memory to track is huge and unpopulated (e.g., tracking updates on a 10G file with mmap() on top, without having any page cache installed yet). One way to improve this is to allow pte markers exist for larger than PTE level for PMD+. That will not change the interface if to implemented, so we can leave that for later. Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com Signed-off-by: Peter Xu <peterx@redhat.com> Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrei Vagin <avagin@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Miroslaw <emmir@google.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Paul Gofman <pgofman@codeweavers.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yun Zhou <yun.zhou@windriver.com> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm/memcg: annotate struct mem_cgroup_threshold_ary with __counted_byKees Cook
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct mem_cgroup_threshold_ary. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Link: https://lkml.kernel.org/r/20230922175327.work.985-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes.Andrew Morton
2023-10-18kasan: disable kasan_non_canonical_hook() for HW tagsArnd Bergmann
On arm64, building with CONFIG_KASAN_HW_TAGS now causes a compile-time error: mm/kasan/report.c: In function 'kasan_non_canonical_hook': mm/kasan/report.c:637:20: error: 'KASAN_SHADOW_OFFSET' undeclared (first use in this function) 637 | if (addr < KASAN_SHADOW_OFFSET) | ^~~~~~~~~~~~~~~~~~~ mm/kasan/report.c:637:20: note: each undeclared identifier is reported only once for each function it appears in mm/kasan/report.c:640:77: error: expected expression before ';' token 640 | orig_addr = (addr - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT; This was caused by removing the dependency on CONFIG_KASAN_INLINE that used to prevent this from happening. Use the more specific dependency on KASAN_SW_TAGS || KASAN_GENERIC to only ignore the function for hwasan mode. Link: https://lkml.kernel.org/r/20231016200925.984439-1-arnd@kernel.org Fixes: 12ec6a919b0f ("kasan: print the original fault addr when access invalid shadow") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Haibo Li <haibo.li@mediatek.com> Cc: Kees Cook <keescook@chromium.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18kasan: print the original fault addr when access invalid shadowHaibo Li
when the checked address is illegal,the corresponding shadow address from kasan_mem_to_shadow may have no mapping in mmu table. Access such shadow address causes kernel oops. Here is a sample about oops on arm64(VA 39bit) with KASAN_SW_TAGS and KASAN_OUTLINE on: [ffffffb80aaaaaaa] pgd=000000005d3ce003, p4d=000000005d3ce003, pud=000000005d3ce003, pmd=0000000000000000 Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP Modules linked in: CPU: 3 PID: 100 Comm: sh Not tainted 6.6.0-rc1-dirty #43 Hardware name: linux,dummy-virt (DT) pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __hwasan_load8_noabort+0x5c/0x90 lr : do_ib_ob+0xf4/0x110 ffffffb80aaaaaaa is the shadow address for efffff80aaaaaaaa. The problem is reading invalid shadow in kasan_check_range. The generic kasan also has similar oops. It only reports the shadow address which causes oops but not the original address. Commit 2f004eea0fc8("x86/kasan: Print original address on #GP") introduce to kasan_non_canonical_hook but limit it to KASAN_INLINE. This patch extends it to KASAN_OUTLINE mode. Link: https://lkml.kernel.org/r/20231009073748.159228-1-haibo.li@mediatek.com Fixes: 2f004eea0fc8("x86/kasan: Print original address on #GP") Signed-off-by: Haibo Li <haibo.li@mediatek.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Haibo Li <haibo.li@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18hugetlbfs: close race between MADV_DONTNEED and page faultRik van Riel
Malloc libraries, like jemalloc and tcalloc, take decisions on when to call madvise independently from the code in the main application. This sometimes results in the application page faulting on an address, right after the malloc library has shot down the backing memory with MADV_DONTNEED. Usually this is harmless, because we always have some 4kB pages sitting around to satisfy a page fault. However, with hugetlbfs systems often allocate only the exact number of huge pages that the application wants. Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of any lock taken on the page fault path, which can open up the following race condition: CPU 1 CPU 2 MADV_DONTNEED unmap page shoot down TLB entry page fault fail to allocate a huge page killed with SIGBUS free page Fix that race by pulling the locking from __unmap_hugepage_final_range into helper functions called from zap_page_range_single. This ensures page faults stay locked out of the MADV_DONTNEED VMA until the huge pages have actually been freed. Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing") Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18hugetlbfs: extend hugetlb_vma_lock to private VMAsRik van Riel
Extend the locking scheme used to protect shared hugetlb mappings from truncate vs page fault races, in order to protect private hugetlb mappings (with resv_map) against MADV_DONTNEED. Add a read-write semaphore to the resv_map data structure, and use that from the hugetlb_vma_(un)lock_* functions, in preparation for closing the race between MADV_DONTNEED and page faults. Link: https://lkml.kernel.org/r/20231006040020.3677377-3-riel@surriel.com Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing") Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18Merge tag 'qcom-drivers-for-6.7' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/drivers Qualcomm driver updates for v6.7 This introduces partial support for the Qualcomm Secure Execution Environment SCM interface, and uses this to implement EFI variable access on the Windows On Snapdragon devices (for now). The 32/64-bit calling convention detector of the SCM interface is updated to not choose 64-bit convention when Linux is 32-bit. The "extern" specifier is dropped from the interface include file. The LLCC driver gains support for carrying configuration for multiple different system/DDR configurations for a given platform, and selecting between them. Support for Q[DR]U1000 is added to the driver. All exported symbols are transitioned to EXPORT_SYMBOL_GPL(). The platform_drivers in the Qualcomm SoC are transitioned to the void-returning remove_new implementation. The rmtfs memory driver gains support for leaving guard pages around the used area, to avoid issues if the allocation happens to be placed adjacent to another protected memory region. The socinfo driver gains knowledge about IPQ8174, QCM6490, SM7150P and various PMICs used together with SM8550. * tag 'qcom-drivers-for-6.7' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (44 commits) soc: qcom: socinfo: Convert to platform remove callback returning void soc: qcom: smsm: Convert to platform remove callback returning void soc: qcom: smp2p: Convert to platform remove callback returning void soc: qcom: smem: Convert to platform remove callback returning void soc: qcom: rmtfs_mem: Convert to platform remove callback returning void soc: qcom: qcom_stats: Convert to platform remove callback returning void soc: qcom: qcom_gsbi: Convert to platform remove callback returning void soc: qcom: qcom_aoss: Convert to platform remove callback returning void soc: qcom: pmic_glink: Convert to platform remove callback returning void soc: qcom: ocmem: Convert to platform remove callback returning void soc: qcom: llcc-qcom: Convert to platform remove callback returning void soc: qcom: icc-bwmon: Convert to platform remove callback returning void firmware: qcom_scm: use 64-bit calling convention only when client is 64-bit soc: qcom: llcc: Handle a second device without data corruption soc: qcom: Switch to EXPORT_SYMBOL_GPL() soc: qcom: smem: Annotate struct qcom_smem with __counted_by soc: qcom: rmtfs: Support discarding guard pages dt-bindings: reserved-memory: rmtfs: Allow guard pages dt-bindings: firmware: qcom,scm: document IPQ5018 compatible firmware: qcom_scm: disable SDI if required ... Link: https://lore.kernel.org/r/20231015204014.855672-1-andersson@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-19kprobes: freelist.h removedwuqiang.matt
This patch will remove freelist.h from kernel source tree, since the only use cases (kretprobe and rethook) are converted to objpool. Link: https://lore.kernel.org/all/20231017135654.82270-5-wuqiang.matt@bytedance.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-18kprobes: kretprobe scalability improvementwuqiang.matt
kretprobe is using freelist to manage return-instances, but freelist, as LIFO queue based on singly linked list, scales badly and reduces the overall throughput of kretprobed routines, especially for high contention scenarios. Here's a typical throughput test of sys_prctl (counts in 10 seconds, measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl): OS: Debian 10 X86_64, Linux 6.5rc7 with freelist HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s 1T 2T 4T 8T 16T 24T 24150045 29317964 15446741 12494489 18287272 17708768 32T 48T 64T 72T 96T 128T 16200682 13737658 11645677 11269858 10470118 9931051 This patch introduces objpool to replace freelist. objpool is a high performance queue, which can bring near-linear scalability to kretprobed routines. Tests of kretprobe throughput show the biggest ratio as 159x of original freelist. Here's the result: 1T 2T 4T 8T 16T native: 41186213 82336866 164250978 328662645 658810299 freelist: 24150045 29317964 15446741 12494489 18287272 objpool: 23926730 48010314 96125218 191782984 385091769 32T 48T 64T 96T 128T native: 1330338351 1969957941 2512291791 2615754135 2671040914 freelist: 16200682 13737658 11645677 10470118 9931051 objpool: 764481096 1147149781 1456220214 1502109662 1579015050 Testings on 96-core ARM64 output similarly, but with the biggest ratio up to 448x: OS: Debian 10 AARCH64, Linux 6.5rc7 HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s 1T 2T 4T 8T 16T native: . 30066096 63569843 126194076 257447289 505800181 freelist: 16152090 11064397 11124068 7215768 5663013 objpool: 13997541 28032100 55726624 110099926 221498787 24T 32T 48T 64T 96T native: 763305277 1015925192 1521075123 2033009392 3021013752 freelist: 5015810 4602893 3766792 3382478 2945292 objpool: 328192025 439439564 668534502 887401381 1319972072 Link: https://lore.kernel.org/all/20231017135654.82270-4-wuqiang.matt@bytedance.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-18lib: objpool added: ring-array based lockless MPMCwuqiang.matt
objpool is a scalable implementation of high performance queue for object allocation and reclamation, such as kretprobe instances. With leveraging percpu ring-array to mitigate hot spots of memory contention, it delivers near-linear scalability for high parallel scenarios. The objpool is best suited for the following cases: 1) Memory allocation or reclamation are prohibited or too expensive 2) Consumers are of different priorities, such as irqs and threads Limitations: 1) Maximum objects (capacity) is fixed after objpool creation 2) All pre-allocated objects are managed in percpu ring array, which consumes more memory than linked lists Link: https://lore.kernel.org/all/20231017135654.82270-2-wuqiang.matt@bytedance.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-18fs: rename inode i_atime and i_mtime fieldsJeff Layton
Rename these two fields to discourage direct access (and to help ensure that we mop up any leftover direct accesses). Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18linux: convert to new timestamp accessorsJeff Layton
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-77-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18fs: new accessor methods for atime and mtimeJeff Layton
Recently, we converted the ctime accesses in the kernel to use new accessor functions. Linus recently pointed out though that if we add accessors for the atime and mtime, then that would allow us to seamlessly change how these timestamps are stored in the inode. Add new accessor functions for the atime and mtime that mirror the accessors for the ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185239.80830-1-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18neighbor: tracing: Move pin6 inside CONFIG_IPV6=y sectionGeert Uytterhoeven
When CONFIG_IPV6=n, and building with W=1: In file included from include/trace/define_trace.h:102, from include/trace/events/neigh.h:255, from net/core/net-traces.c:51: include/trace/events/neigh.h: In function ‘trace_event_raw_event_neigh_create’: include/trace/events/neigh.h:42:34: error: variable ‘pin6’ set but not used [-Werror=unused-but-set-variable] 42 | struct in6_addr *pin6; | ^~~~ include/trace/trace_events.h:402:11: note: in definition of macro ‘DECLARE_EVENT_CLASS’ 402 | { assign; } \ | ^~~~~~ include/trace/trace_events.h:44:30: note: in expansion of macro ‘PARAMS’ 44 | PARAMS(assign), \ | ^~~~~~ include/trace/events/neigh.h:23:1: note: in expansion of macro ‘TRACE_EVENT’ 23 | TRACE_EVENT(neigh_create, | ^~~~~~~~~~~ include/trace/events/neigh.h:41:9: note: in expansion of macro ‘TP_fast_assign’ 41 | TP_fast_assign( | ^~~~~~~~~~~~~~ In file included from include/trace/define_trace.h:103, from include/trace/events/neigh.h:255, from net/core/net-traces.c:51: include/trace/events/neigh.h: In function ‘perf_trace_neigh_create’: include/trace/events/neigh.h:42:34: error: variable ‘pin6’ set but not used [-Werror=unused-but-set-variable] 42 | struct in6_addr *pin6; | ^~~~ include/trace/perf.h:51:11: note: in definition of macro ‘DECLARE_EVENT_CLASS’ 51 | { assign; } \ | ^~~~~~ include/trace/trace_events.h:44:30: note: in expansion of macro ‘PARAMS’ 44 | PARAMS(assign), \ | ^~~~~~ include/trace/events/neigh.h:23:1: note: in expansion of macro ‘TRACE_EVENT’ 23 | TRACE_EVENT(neigh_create, | ^~~~~~~~~~~ include/trace/events/neigh.h:41:9: note: in expansion of macro ‘TP_fast_assign’ 41 | TP_fast_assign( | ^~~~~~~~~~~~~~ Indeed, the variable pin6 is declared and initialized unconditionally, while it is only used and needlessly re-initialized when support for IPv6 is enabled. Fix this by dropping the unused variable initialization, and moving the variable declaration inside the existing section protected by a check for CONFIG_IPV6. Fixes: fc651001d2c5ca4f ("neighbor: Add tracepoint to __neigh_create") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Simon Horman <horms@kernel.org> # build-tested Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18Merge tag 'nf-next-23-10-18' of ↵David S. Miller
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter next pull request 2023-10-18 This series contains initial netfilter skb drop_reason support, from myself. First few patches fix up a few spots to make sure we won't trip when followup patches embed error numbers in the upper bits (we already do this in some places). Then, nftables and bridge netfilter get converted to call kfree_skb_reason directly to let tooling pinpoint exact location of packet drops, rather than the existing NF_DROP catchall in nf_hook_slow(). I would like to eventually convert all netfilter modules, but as some callers cannot deal with NF_STOLEN (notably act_ct), more preparation work is needed for this. Last patch gets rid of an ugly 'de-const' cast in nftables. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-17workqueue: Provide one lock class key per work_on_cpu() callsiteFrederic Weisbecker
All callers of work_on_cpu() share the same lock class key for all the functions queued. As a result the workqueue related locking scenario for a function A may be spuriously accounted as an inversion against the locking scenario of function B such as in the following model: long A(void *arg) { mutex_lock(&mutex); mutex_unlock(&mutex); } long B(void *arg) { } void launchA(void) { work_on_cpu(0, A, NULL); } void launchB(void) { mutex_lock(&mutex); work_on_cpu(1, B, NULL); mutex_unlock(&mutex); } launchA and launchB running concurrently have no chance to deadlock. However the above can be reported by lockdep as a possible locking inversion because the works containing A() and B() are treated as belonging to the same locking class. The following shows an existing example of such a spurious lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted ------------------------------------------------------ kworker/0:1/9 is trying to acquire lock: ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0 but task is already holding lock: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __flush_work+0x83/0x4e0 work_on_cpu+0x97/0xc0 rcu_nocb_cpu_offload+0x62/0xb0 rcu_nocb_toggle+0xd0/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 -> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}: __mutex_lock+0x81/0xc80 rcu_nocb_cpu_deoffload+0x38/0xb0 rcu_nocb_toggle+0x144/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 -> #0 (cpu_hotplug_lock){++++}-{0:0}: __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 percpu_down_write+0x31/0x200 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 other info that might help us debug this: Chain exists of: cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work) Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&wfc.work)); lock(rcu_state.barrier_mutex); lock((work_completion)(&wfc.work)); lock(cpu_hotplug_lock); *** DEADLOCK *** 2 locks held by kworker/0:1/9: #0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500 #1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500 stack backtrace: CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: events work_for_cpu_fn Call Trace: rcu-torture: rcu_torture_read_exit: Start of episode <TASK> dump_stack_lvl+0x4a/0x80 check_noncircular+0x132/0x150 __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 ? _cpu_down+0x57/0x2b0 percpu_down_write+0x31/0x200 ? _cpu_down+0x57/0x2b0 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 ? __pfx_worker_thread+0x10/0x10 kthread+0xe6/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK Fix this with providing one lock class key per work_on_cpu() caller. Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-10-18ethtool: Add forced speed to supported link modes mapsPaul Greenwalt
The need to map Ethtool forced speeds to Ethtool supported link modes is common among drivers. To support this, add a common structure for forced speed maps and a function to init them. This is solution was originally introduced in commit 1d4e4ecccb11 ("qede: populate supported link modes maps on module init") for qede driver. ethtool_forced_speed_maps_init() should be called during driver init with an array of struct ethtool_forced_speed_map to populate the mapping. Definitions for maps themselves are left in the driver code, as the sets of supported link modes may vary between the devices. Suggested-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18netfilter: nf_tables: de-constify set commit ops function argumentFlorian Westphal
The set backend using this already has to work around this via ugly cast, don't spread this pattern. Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18netfilter: make nftables drops visible in net dropmonitorFlorian Westphal
net_dropmonitor blames core.c:nf_hook_slow. Add NF_DROP_REASON() helper and use it in nft_do_chain(). The helper releases the skb, so exact drop location becomes available. Calling code will observe the NF_STOLEN verdict instead. Adjust nf_hook_slow so we can embed an erro value wih NF_STOLEN verdicts, just like we do for NF_DROP. After this, drop in nftables can be pinpointed to a drop due to a rule or the chain policy. Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18net: treat possible_net_t net pointer as an RCU one and add read_pnet_rcu()Jiri Pirko
Make the net pointer stored in possible_net_t structure annotated as an RCU pointer. Change the access helpers to treat it as such. Introduce read_pnet_rcu() helper to allow caller to dereference the net pointer under RCU read lock. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18parport: Use kasprintf() instead of fixed buffer formattingAndy Shevchenko
Improve readability and maintainability by replacing a hardcoded string allocation and formatting by the use of the kasprintf() helper. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20231016133135.1203643-2-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-18mei: docs: fix spelling errorsTomas Winkler
Fix spelling errors in the mei code base. Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20231011074301.223879-4-tomas.winkler@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-18mei: update mei-pxp's component interface with timeoutsAlan Previn
In debugging platform or firmware related MEI-PXP connection issues, having a timeout when clients (such as i915) calling into mei-pxp's send/receive functions have proven useful as opposed to blocking forever until the kernel triggers a watchdog panic (when platform issues are experienced). Update the mei-pxp component interface send and receive functions to take in timeouts. Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Link: https://lore.kernel.org/r/20231011110157.247552-5-tomas.winkler@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-18mei: bus: add send and recv api with timeoutAlexander Usyskin
Add variation of the send and recv functions on bus that define timeout. Caller can use such functions in flow that can stuck to bail out and not to put down the whole system. Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Link: https://lore.kernel.org/r/20231011110157.247552-2-tomas.winkler@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-18Merge tag 'amd-drm-next-6.7-2023-10-13' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.7-2023-10-13: amdgpu: - DC replay fixes - Misc code cleanups and spelling fixes - Documentation updates - RAS EEPROM Updates - FRU EEPROM Updates - IP discovery updates - SR-IOV fixes - RAS updates - DC PQ fixes - SMU 13.0.6 updates - GC 11.5 Support - NBIO 7.11 Support - GMC 11 Updates - Reset fixes - SMU 11.5 Updates - SMU 13.0 OD support - Use flexible arrays for bo list handling - W=1 Fixes - SubVP fixes - DPIA fixes - DCN 3.5 Support - Devcoredump fixes - VPE 6.1 support - VCN 4.0 Updates - S/G display fixes - DML fixes - DML2 Support - MST fixes - VRR fixes - Enable seamless boot in more cases - Enable content type property for HDMI - OLED fixes - Rework and clean up GPUVM TLB flushing - DC ODM fixes - DP 2.x fixes - AGP aperture fixes - SDMA firmware loading cleanups - Cyan Skillfish GPU clock counter fix - GC 11 GART fix - Cache GPU fault info for userspace queries - DC cursor check fixes - eDP fixes - DC FP handling fixes - Variable sized array fixes - SMU 13.0.x fixes - IB start and size alignment fixes for VCN - SMU 14 Support - Suspend and resume sequence rework - vkms fix amdkfd: - GC 11 fixes - GC 10 fixes - Doorbell fixes - CWSR fixes - SVM fixes - Clean up GC info enumeration - Rework memory limit handling - Coherent memory handling fixes - Use partial migrations in GPU faults - TLB flush fixes - DMA unmap fixes - GC 9.4.3 fixes - SQ interrupt fix - GTT mapping fix - GC 11.5 Support radeon: - Misc code cleanups - W=1 Fixes - Fix possible buffer overflow - Fix possible NULL pointer dereference UAPI: - Add EXT_COHERENT memory allocation flags. These allow for system scope atomics. Proposed userspace: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/pull/88 - Add support for new VPE engine. This is a memory to memory copy engine with advanced scaling, CSC, and color management features Proposed mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25713 - Add INFO IOCTL interface to query GPU faults Proposed Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23238 Proposed libdrm MR: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/298 Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231013175758.1735031-1-alexander.deucher@amd.com
2023-10-17Merge tag 'mlx5-updates-2023-10-10' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-10-10 1) Adham Faris, Increase max supported channels number to 256 2) Leon Romanovsky, Allow IPsec soft/hard limits in bytes 3) Shay Drory, Replace global mlx5_intf_lock with HCA devcom component lock 4) Wei Zhang, Optimize SF creation flow During SF creation, HCA state gets changed from INVALID to IN_USE step by step. Accordingly, FW sends vhca event to driver to inform about this state change asynchronously. Each vhca event is critical because all related SW/FW operations are triggered by it. Currently there is only a single mlx5 general event handler which not only handles vhca event but many other events. This incurs huge bottleneck because all events are forced to be handled in serial manner. Moreover, all SFs share same table_lock which inevitably impacts each other when they are created in parallel. This series will solve this issue by: 1. A dedicated vhca event handler is introduced to eliminate the mutual impact with other mlx5 events. 2. Max FW threads work queues are employed in the vhca event handler to fully utilize FW capability. 3. Redesign SF active work logic to completely remove table_lock. With above optimization, SF creation time is reduced by 25%, i.e. from 80s to 60s when creating 100 SFs. Patches summary: Patch 1 - implement dedicated vhca event handler with max FW cmd threads of work queues. Patch 2 - remove table_lock by redesigning SF active work logic. * tag 'mlx5-updates-2023-10-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: Allow IPsec soft/hard limits in bytes net/mlx5e: Increase max supported channels number to 256 net/mlx5e: Preparations for supporting larger number of channels net/mlx5e: Refactor mlx5e_rss_init() and mlx5e_rss_free() API's net/mlx5e: Refactor mlx5e_rss_set_rxfh() and mlx5e_rss_get_rxfh() net/mlx5e: Refactor rx_res_init() and rx_res_free() APIs net/mlx5e: Use PTR_ERR_OR_ZERO() to simplify code net/mlx5: Use PTR_ERR_OR_ZERO() to simplify code net/mlx5: fix config name in Kconfig parameter documentation net/mlx5: Remove unused declaration net/mlx5: Replace global mlx5_intf_lock with HCA devcom component lock net/mlx5: Refactor LAG peer device lookout bus logic to mlx5 devcom net/mlx5: Avoid false positive lockdep warning by adding lock_class_key net/mlx5: Redesign SF active work to remove table_lock net/mlx5: Parallelize vhca event handling ==================== Link: https://lore.kernel.org/r/20231014171908.290428-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17Merge tag 'ipsec-2023-10-17' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2023-10-17 1) Fix a slab-use-after-free in xfrm_policy_inexact_list_reinsert. From Dong Chenchen. 2) Fix data-races in the xfrm interfaces dev->stats fields. From Eric Dumazet. 3) Fix a data-race in xfrm_gen_index. From Eric Dumazet. 4) Fix an inet6_dev refcount underflow. From Zhang Changzhong. 5) Check the return value of pskb_trim in esp_remove_trailer for esp4 and esp6. From Ma Ke. 6) Fix a data-race in xfrm_lookup_with_ifid. From Eric Dumazet. * tag 'ipsec-2023-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: fix a data-race in xfrm_lookup_with_ifid() net: ipv4: fix return value check in esp_remove_trailer net: ipv6: fix return value check in esp_remove_trailer xfrm6: fix inet6_dev refcount underflow problem xfrm: fix a data-race in xfrm_gen_index() xfrm: interface: use DEV_STATS_INC() net: xfrm: skip policies marked as dead while reinserting policies ==================== Link: https://lore.kernel.org/r/20231017083723.1364940-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17net: phylink: remove a bunch of unused validation methodsRussell King (Oracle)
Remove exports for phylink_caps_to_linkmodes(), phylink_get_capabilities(), phylink_validate_mask_caps() and phylink_generic_validate(). Also, as phylink_generic_validate() is no longer called, we can remove its implementation as well. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qsPkK-009wip-W9@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17net: phylink: remove .validate() methodRussell King (Oracle)
The MAC .validate() method is no longer used, so remove it from the phylink_mac_ops structure, and remove the callsite in phylink_validate_mac_and_pcs(). Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qsPkF-009wij-QM@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17net: phylink: provide mac_get_caps() methodRussell King (Oracle)
Provide a new method, mac_get_caps() to get the MAC capabilities for the specified interface mode. This is for MACs which have special requirements, such as not supporting half-duplex in certain interface modes, and will replace the validate() method. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qsPk5-009wiX-G5@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17net: bridge: Add netlink knobs for number / max learned FDB entriesJohannes Nixdorf
The previous patch added accounting and a limit for the number of dynamically learned FDB entries per bridge. However it did not provide means to actually configure those bounds or read back the count. This patch does that. Two new netlink attributes are added for the accounting and limit of dynamically learned FDB entries: - IFLA_BR_FDB_N_LEARNED (RO) for the number of entries accounted for a single bridge. - IFLA_BR_FDB_MAX_LEARNED (RW) for the configured limit of entries for the bridge. The new attributes are used like this: # ip link add name br up type bridge fdb_max_learned 256 # ip link add name v1 up master br type veth peer v2 # ip link set up dev v2 # mausezahn -a rand -c 1024 v2 0.01 seconds (90877 packets per second # bridge fdb | grep -v permanent | wc -l 256 # ip -d link show dev br 13: br: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [...] [...] fdb_n_learned 256 fdb_max_learned 256 Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-3-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17gpu/drm: Eliminate DRM_SCHED_PRIORITY_UNSETLuben Tuikov
Eliminate DRM_SCHED_PRIORITY_UNSET, value of -2, whose only user was amdgpu. Furthermore, eliminate an index bug, in that when amdgpu boots, it calls drm_sched_entity_init() with DRM_SCHED_PRIORITY_UNSET, which uses it to index sched->sched_rq[]. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Alex Deucher <Alexander.Deucher@amd.com> Link: https://lore.kernel.org/r/20231017035656.8211-2-luben.tuikov@amd.com
2023-10-17tcp: fix excessive TLP and RACK timeouts from HZ roundingNeal Cardwell
We discovered from packet traces of slow loss recovery on kernels with the default HZ=250 setting (and min_rtt < 1ms) that after reordering, when receiving a SACKed sequence range, the RACK reordering timer was firing after about 16ms rather than the desired value of roughly min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the exact same issue. This commit fixes the TLP transmit timer and RACK reordering timer floor calculation to more closely match the intended 2ms floor even on kernels with HZ=250. It does this by adding in a new TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies, instead of the current approach of converting to jiffies and then adding th TCP_TIMEOUT_MIN value of 2 jiffies. Our testing has verified that on kernels with HZ=1000, as expected, this does not produce significant changes in behavior, but on kernels with the default HZ=250 the latency improvement can be large. For example, our tests show that for HZ=250 kernels at low RTTs this fix roughly halves the latency for the RACK reorder timer: instead of mostly firing at 16ms it mostly fires at 8ms. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: bb4d991a28cc ("tcp: adjust tail loss probe timeout") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17Merge tag 'fbdev-for-6.6-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev Pull fbdev fixes and cleanups from Helge Deller: "Various minor fixes, cleanups and annotations for atyfb, sa1100fb, omapfb, uvesafb and mmp" * tag 'fbdev-for-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: fbdev: core: syscopyarea: fix sloppy typing fbdev: core: cfbcopyarea: fix sloppy typing fbdev: uvesafb: Call cn_del_callback() at the end of uvesafb_exit() fbdev: uvesafb: Remove uvesafb_exec() prototype from include/video/uvesafb.h fbdev: sa1100fb: mark sa1100fb_init() static fbdev: omapfb: fix some error codes fbdev: atyfb: only use ioremap_uc() on i386 and ia64 fbdev: mmp: Annotate struct mmp_path with __counted_by fbdev: mmp: Annotate struct mmphw_ctrl with __counted_by
2023-10-17Merge tag 'wireless-next-2023-10-16' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.7 The second pull request for v6.7, with only driver changes this time. We have now support for mt7925 PCIe and USB variants, few new features and of course some fixes. Major changes: mt76 - mt7925 support ath12k - read board data variant name from SMBIOS wfx - Remain-On-Channel (ROC) support * tag 'wireless-next-2023-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (109 commits) wifi: rtw89: mac: do bf_monitor only if WiFi 6 chips wifi: rtw89: mac: set bf_assoc capabilities according to chip gen wifi: rtw89: mac: set bfee_ctrl() according to chip gen wifi: rtw89: mac: add registers of MU-EDCA parameters for WiFi 7 chips wifi: rtw89: mac: generalize register of MU-EDCA switch according to chip gen wifi: rtw89: mac: update RTS threshold according to chip gen wifi: rtlwifi: simplify TX command fill callbacks wifi: hostap: remove unused ioctl function wifi: atmel: remove unused ioctl function wifi: rtw89: coex: add annotation __counted_by() to struct rtw89_btc_btf_set_mon_reg wifi: rtw89: coex: add annotation __counted_by() for struct rtw89_btc_btf_set_slot_table wifi: rtw89: add EHT radiotap in monitor mode wifi: rtw89: show EHT rate in debugfs wifi: rtw89: parse TX EHT rate selected by firmware from RA C2H report wifi: rtw89: Add EHT rate mask as parameters of RA H2C command wifi: rtw89: parse EHT information from RX descriptor and PPDU status packet wifi: radiotap: add bandwidth definition of EHT U-SIG wifi: rtlwifi: use convenient list_count_nodes() wifi: p54: Annotate struct p54_cal_database with __counted_by wifi: brcmfmac: fweh: Add __counted_by for struct brcmf_fweh_queue_item and use struct_size() ... ==================== Link: https://lore.kernel.org/r/20231016143822.880D8C433C8@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17Merge tag 'nvme-6.7-2023-10-17' of git://git.infradead.org/nvme into ↵Jens Axboe
for-6.7/block Pull NVMe updates from Keith: "nvme updates for Linux 6.7 - nvme-auth updates (Mark) - nvme-tcp tls (Hannes) - nvme-fc annotaions (Kees)" * tag 'nvme-6.7-2023-10-17' of git://git.infradead.org/nvme: (24 commits) nvme-auth: allow mixing of secret and hash lengths nvme-auth: use transformed key size to create resp nvme-auth: alloc nvme_dhchap_key as single buffer nvmet-tcp: use 'spin_lock_bh' for state_lock() nvme: rework NVME_AUTH Kconfig selection nvmet-tcp: peek icreq before starting TLS nvmet-tcp: control messages for recvmsg() nvmet-tcp: enable TLS handshake upcall nvmet: Set 'TREQ' to 'required' when TLS is enabled nvmet-tcp: allocate socket file nvmet-tcp: make nvmet_tcp_alloc_queue() a void function nvmet: make TCP sectype settable via configfs nvme-fabrics: parse options 'keyring' and 'tls_key' nvme-tcp: improve icreq/icresp logging nvme-tcp: control message handling for recvmsg() nvme-tcp: enable TLS handshake upcall nvme-tcp: allocate socket file security/keys: export key_lookup() nvme-keyring: implement nvme_tls_psk_default() nvme-tcp: add definitions for TLS cipher suites ...
2023-10-17nvme-auth: use transformed key size to create respMark O'Donovan
This does not change current behaviour as the driver currently verifies that the secret size is the same size as the length of the transformation hash. Co-developed-by: Akash Appaiah <Akash.Appaiah@dell.com> Signed-off-by: Akash Appaiah <Akash.Appaiah@dell.com> Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-17nvme-auth: alloc nvme_dhchap_key as single bufferMark O'Donovan
Co-developed-by: Akash Appaiah <Akash.Appaiah@dell.com> Signed-off-by: Akash Appaiah <Akash.Appaiah@dell.com> Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-17PCI/MSI: Provide stubs for IMS functionsReinette Chatre
The IMS related functions (pci_create_ims_domain(), pci_ims_alloc_irq(), and pci_ims_free_irq()) are not declared when CONFIG_PCI_MSI is disabled. Provide definitions of these functions for use when callers are compiled with CONFIG_PCI_MSI disabled. Fixes: 0194425af0c8 ("PCI/MSI: Provide IMS (Interrupt Message Store) support") Fixes: c9e5bea27383 ("PCI/MSI: Provide pci_ims_alloc/free_irq()") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/14ff656899a3757453f8584c1109d7a9b98fa258.1697564731.git.reinette.chatre@intel.com
2023-10-17Merge branch 'linus' into smp/coreThomas Gleixner
Pull in upstream to get the fixes so depending changes can be applied.
2023-10-17block:sed-opal: SED Opal keystoreGreg Joyce
Add read and write functions that allow SED Opal keys to stored in a permanent keystore. Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev> Link: https://lore.kernel.org/r/20231004201957.1451669-2-gjoyce@linux.vnet.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17Merge branch 'for-6.7/io_uring' into for-6.7/blockJens Axboe
Merge in io_uring fixes, as the ublk simplifying cancelations and aborts depend on the two patches from Ming adding cancelation support for uring_cmd. * for-6.7/io_uring: io_uring/kbuf: Use slab for struct io_buffer objects io_uring/kbuf: Allow the full buffer id space for provided buffers io_uring/kbuf: Fix check of BID wrapping in provided buffers io_uring/rsrc: cleanup io_pin_pages() io_uring: cancelable uring_cmd io_uring: retain top 8bits of uring_cmd flags for kernel internal use io_uring: add IORING_OP_WAITID support exit: add internal include file with helpers exit: add kernel_waitid_prepare() helper exit: move core of do_wait() into helper exit: abstract out should_wake helper for child_wait_callback() io_uring/rw: add support for IORING_OP_READ_MULTISHOT io_uring/rw: mark readv/writev as vectored in the opcode definition io_uring/rw: split io_read() into a helper
2023-10-17net, bpf: Add a warning if NAPI cb missed xdp_do_flush().Sebastian Andrzej Siewior
A few drivers were missing a xdp_do_flush() invocation after XDP_REDIRECT. Add three helper functions each for one of the per-CPU lists. Return true if the per-CPU list is non-empty and flush the list. Add xdp_do_check_flushed() which invokes each helper functions and creates a warning if one of the functions had a non-empty list. Hide everything behind CONFIG_DEBUG_NET. Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
2023-10-17locking/seqlock: Fix grammar in commentCuda-Chen
The "neither writes before and after ..." for the description of do_write_seqcount_end() should be "neither writes before nor after". Signed-off-by: Cuda-Chen <clh960524@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231017053703.11312-1-clh960524@gmail.com
2023-10-17i3c: Add support for bus enumeration & notificationJeremy Kerr
This allows other drivers to be notified when new i3c busses are attached, referring to a whole i3c bus as opposed to individual devices. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-17pmdomain: mediatek: Add support for MT8365Fabien Parent
Add the needed board data to support MT8365 SoC. Signed-off-by: Fabien Parent <fparent@baylibre.com> Signed-off-by: Markus Schneider-Pargmann <msp@baylibre.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com> Tested-by: Alexandre Mergnat <amergnat@baylibre.com> Link: https://lore.kernel.org/r/20230918093751.1188668-9-msp@baylibre.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>