2025-07-09samples/damon: fix damon sample mtier for start failureHonggyu Kim
damon_sample_mtier_start() can fail, so the "enable" parameter must be reset to "false" for proper rollback. Otherwise, setting "enable" to Y and then N triggers a crash with mtier similar to the prcl one shown elsewhere in this series, because the sample's start failed but "enable" stays Y. Link: https://lkml.kernel.org/r/20250702000205.1921-4-honggyu.kim@sk.com Fixes: 82a08bde3cf7 ("samples/damon: implement a DAMON module for memory tiering") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09samples/damon: fix damon sample wsse for start failureHonggyu Kim
damon_sample_wsse_start() can fail, so the "enable" parameter must be reset to "false" for proper rollback. Otherwise, setting "enable" to Y and then N triggers a crash with wsse similar to the prcl one shown elsewhere in this series, because the sample's start failed but "enable" stays Y. Link: https://lkml.kernel.org/r/20250702000205.1921-3-honggyu.kim@sk.com Fixes: b757c6cfc696 ("samples/damon/wsse: start and stop DAMON as the user requests") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09samples/damon: fix damon sample prcl for start failureHonggyu Kim
Patch series "mm/damon: fix divide by zero and its samples", v3. This series includes fixes against damon and its samples to make it safer when damon sample starting fails. It includes the following changes. - fix unexpected divide by zero crash for zero size regions - fix bugs for damon samples in case of start failures This patch (of 4): The damon_sample_prcl_start() can fail so we must reset the "enable" parameter to "false" again for proper rollback. In such cases, setting Y to "enable" then N triggers the following crash because damon sample start failed but the "enable" stays as Y. [ 2441.419649] damon_sample_prcl: start [ 2454.146817] damon_sample_prcl: stop [ 2454.146862] ------------[ cut here ]------------ [ 2454.146865] kernel BUG at mm/slub.c:546! [ 2454.148183] Oops: invalid opcode: 0000 [#1] SMP NOPTI ... [ 2454.167555] Call Trace: [ 2454.167822] <TASK> [ 2454.168061] damon_destroy_ctx+0x78/0x140 [ 2454.168454] damon_sample_prcl_enable_store+0x8d/0xd0 [ 2454.168932] param_attr_store+0xa1/0x120 [ 2454.169315] module_attr_store+0x20/0x50 [ 2454.169695] sysfs_kf_write+0x72/0x90 [ 2454.170065] kernfs_fop_write_iter+0x150/0x1e0 [ 2454.170491] vfs_write+0x315/0x440 [ 2454.170833] ksys_write+0x69/0xf0 [ 2454.171162] __x64_sys_write+0x19/0x30 [ 2454.171525] x64_sys_call+0x18b2/0x2700 [ 2454.171900] do_syscall_64+0x7f/0x680 [ 2454.172258] ? exit_to_user_mode_loop+0xf6/0x180 [ 2454.172694] ? clear_bhb_loop+0x30/0x80 [ 2454.173067] ? clear_bhb_loop+0x30/0x80 [ 2454.173439] entry_SYSCALL_64_after_hwframe+0x76/0x7e Link: https://lkml.kernel.org/r/20250702000205.1921-1-honggyu.kim@sk.com Link: https://lkml.kernel.org/r/20250702000205.1921-2-honggyu.kim@sk.com Fixes: 2aca254620a8 ("samples/damon: introduce a skeleton of a smaple DAMON module for proactive reclamation") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09kasan: remove kasan_find_vm_area() to prevent possible deadlockYeoreum Yun
find_vm_area() cannot be called in atomic context. If find_vm_area() is called to report vm area information, KASAN can trigger a deadlock like:

CPU0                                    CPU1
vmalloc();
 alloc_vmap_area();
  spin_lock(&vn->busy.lock)
                                        spin_lock_bh(&some_lock);
<interrupt occurs>
<in softirq>
spin_lock(&some_lock);
                                        <access invalid address>
                                        kasan_report();
                                         print_report();
                                          print_address_description();
                                           kasan_find_vm_area();
                                            find_vm_area();
                                             spin_lock(&vn->busy.lock)  // deadlock!

To prevent a possible deadlock while KASAN reports, remove kasan_find_vm_area(). Link: https://lkml.kernel.org/r/20250703181018.580833-1-yeoreum.yun@arm.com Fixes: c056a364e954 ("kasan: print virtual mapping info in reports") Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reported-by: Yunseong Kim <ysk@kzalloc.com> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
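For illustration, a hedged sketch of how a report could still describe a vmalloc address without taking any vmap locks (hypothetical helper; not necessarily the exact output the patched report path produces):

    static void describe_vmalloc_addr(const void *addr)
    {
        struct page *page;

        if (!is_vmalloc_addr(addr))
            return;

        /*
         * No find_vm_area() here: it takes vn->busy.lock, which may
         * already be held by the context that triggered this report.
         */
        pr_err("The buggy address belongs to a vmalloc virtual mapping\n");

        page = vmalloc_to_page(addr);
        if (page)
            pr_err("The buggy address maps to pfn %lu\n", page_to_pfn(page));
    }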
2025-07-09scripts: gdb: vfs: support external dentry namesIllia Ostapyshyn
d_shortname of struct dentry only reserves D_NAME_INLINE_LEN characters and contains garbage for longer names. Use d_name instead, which always references the valid name. Link: https://lore.kernel.org/all/20250525213709.878287-2-illia@yshyn.com/ Link: https://lkml.kernel.org/r/20250629003811.2420418-1-illia@yshyn.com Fixes: 79300ac805b6 ("scripts/gdb: fix dentry_name() lookup") Signed-off-by: Illia Ostapyshyn <illia@yshyn.com> Tested-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/migrate: fix do_pages_stat in compat modeChristoph Berg
For arrays with more than 16 entries, the old code would incorrectly advance the pages pointer by 16 words instead of 16 compat_uptr_t. Fix by doing the pointer arithmetic inside get_compat_pages_array where pages32 is already a correctly-typed pointer. Discovered while working on PostgreSQL 18's new NUMA introspection code. Link: https://lkml.kernel.org/r/aGREU0XTB48w9CwN@msg.df7cb.de Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages") Signed-off-by: Christoph Berg <myon@debian.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reported-by: Tomas Vondra <tomas@vondra.me> Closes: https://www.postgresql.org/message-id/flat/6342f601-77de-4ee0-8c2a-3deb50ceac5b%40vondra.me#86402e3d80c031788f5f55b42c459471 Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
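A sketch of the corrected chunking (the chunk_offset parameter is the assumed addition; the point is that per-chunk advancement must happen on the compat_uptr_t-typed pointer, where each step is 4 bytes, not on the void-pointer-typed one, where each step is a full word):

    static int get_compat_pages_array(const void __user *chunk_pages[],
                                      const void __user * __user *pages,
                                      unsigned long chunk_offset,
                                      unsigned long chunk_nr)
    {
        compat_uptr_t __user *pages32 = (compat_uptr_t __user *)pages;
        compat_uptr_t p;
        int i;

        for (i = 0; i < chunk_nr; i++) {
            /* index arithmetic in units of compat_uptr_t, not void * */
            if (get_user(p, pages32 + chunk_offset + i))
                return -EFAULT;
            chunk_pages[i] = compat_ptr(p);
        }

        return 0;
    }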
2025-07-09mm/damon/core: handle damon_call_control as normal under kdmond deactivationSeongJae Park
DAMON sysfs interface internally uses damon_call() to update DAMON parameters as users requested, online. However, DAMON core cancels any damon_call() requests when it is deactivated by DAMOS watermarks. As a result, users cannot change DAMON parameters online while DAMON is deactivated. Note that users can work around this by turning DAMON off and on with different watermarks. Since deactivated DAMON is nearly the same as stopped DAMON, the workaround should cause no big problem. Anyway, a bug is a bug. There is no real good reason to cancel the damon_call() request under DAMOS deactivation. Fix it by simply handling the request as normal, rather than cancelling it in that situation. Link: https://lkml.kernel.org/r/20250629204914.54114-1-sj@kernel.org Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/rmap: fix potential out-of-bounds page table access during batched unmapLance Yang
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table. While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch(). The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios. Link: https://lkml.kernel.org/r/20250701143100.6970-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250630011305.23754-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250627062319.84936-1-lance.yang@linux.dev Link: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com [1] Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation") Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com Suggested-by: Barry Song <baohua@kernel.org> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
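A simplified sketch of the boundary capping described above (hypothetical helper name, not the full folio_unmap_pte_batch(); pmd_addr_end() clamps a range to the end of the current PMD, i.e. the end of the PTE table being walked):

    static inline unsigned int pte_batch_max(unsigned long addr,
                                             unsigned long vma_end)
    {
        /* Never scan past the VMA, nor past the PTE table containing @addr. */
        unsigned long end = pmd_addr_end(addr, vma_end);

        return (end - addr) >> PAGE_SHIFT;
    }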
2025-07-09mm/hugetlb: don't crash when allocating a folio if there are no resvVivek Kasireddy
There are cases when we try to pin a folio but discover that it has not been faulted-in. So, we try to allocate it in memfd_alloc_folio() but there is a chance that we might encounter a fatal crash/failure (VM_BUG_ON(!h->resv_huge_pages) in alloc_hugetlb_folio_reserve()) if there are no active reservations at that instant. This issue was reported by syzbot:

kernel BUG at mm/hugetlb.c:2403!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403
Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f
RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087
RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000
RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed
RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005
R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000
R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8
FS:  00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88
 memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750
 udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline]
 udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443
 udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline]
 udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:906 [inline]
 __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Therefore, prevent the above crash by removing the VM_BUG_ON() as there is no need to crash the system in this situation and instead we could just fail the allocation request. Furthermore, as described above, the specific situation where this happens is when we try to pin memfd folios before they are faulted-in. Although this is a valid thing to do, it is not the regular or the common use-case. Let us consider the following scenarios:

1) hugetlbfs_file_mmap()
   memfd_alloc_folio()
   hugetlb_fault()

2) memfd_alloc_folio()
   hugetlbfs_file_mmap()
   hugetlb_fault()

3) hugetlbfs_file_mmap()
   hugetlb_fault()
     alloc_hugetlb_folio()

3) is the most common use-case where first a memfd is allocated followed by mmap(), user writes/updates and then the relevant folios are pinned (memfd_pin_folios()). The BUG this patch is fixing occurs in 2) because we try to pin the folios before hugetlbfs_file_mmap() is called. So, in this situation we try to allocate the folios before pinning them but since we did not make any reservations, resv_huge_pages would be 0, leading to this issue.
Link: https://lkml.kernel.org/r/20250626191116.1377761-1-vivek.kasireddy@intel.com Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak") Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Closes: https://syzkaller.appspot.com/bug?extid=a504cb5bae4fe117ba94 Closes: https://lore.kernel.org/all/677928b5.050a0220.3b53b0.004d.GAE@google.com/T/ Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
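A hedged sketch of the resulting behaviour in alloc_hugetlb_folio_reserve(): fail the request instead of BUGging when no reservation is available (simplified; the locking and dequeue helper follow the surrounding hugetlb code and may differ in detail):

    struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
                                              nodemask_t *nmask, gfp_t gfp_mask)
    {
        struct folio *folio;

        spin_lock_irq(&hugetlb_lock);
        if (!h->resv_huge_pages) {      /* was: VM_BUG_ON(!h->resv_huge_pages) */
            spin_unlock_irq(&hugetlb_lock);
            return NULL;                /* let the caller fail the pin gracefully */
        }

        folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid, nmask);
        if (folio)
            h->resv_huge_pages--;
        spin_unlock_irq(&hugetlb_lock);

        return folio;
    }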
2025-07-09scripts/gdb: de-reference per-CPU MCE interruptsFlorian Fainelli
The per-CPU MCE interrupts are looked up by reference and need to be de-referenced before printing, otherwise we print the addresses of the variables instead of their contents:

MCE: 18379471554386948492 Machine check exceptions
MCP: 18379471554386948488 Machine check polls

The corrected output looks like this instead now:

MCE: 0 Machine check exceptions
MCP: 1 Machine check polls

Link: https://lkml.kernel.org/r/20250625021109.1057046-1-florian.fainelli@broadcom.com Link: https://lkml.kernel.org/r/20250624030020.882472-1-florian.fainelli@broadcom.com Fixes: b0969d7687a7 ("scripts/gdb: print interrupts") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09scripts/gdb: fix interrupts.py after maple tree conversionFlorian Fainelli
In commit 721255b9826b ("genirq: Use a maple tree for interrupt descriptor management"), the irq_desc_tree was replaced with a sparse_irqs tree using a maple tree structure. Since the script looked for the irq_desc_tree symbol which is no longer available, no interrupts would be printed and the script output would not be useful anymore. In addition to looking up the correct symbol (sparse_irqs), a new module (mapletree.py) is added whose mtree_load() implementation is largely copied after the C version and uses the same variable and intermediate function names wherever possible to ensure that both the C and Python version be updated in the future. This restores the scripts' output to match that of /proc/interrupts. Link: https://lkml.kernel.org/r/20250625021020.1056930-1-florian.fainelli@broadcom.com Fixes: 721255b9826b ("genirq: Use a maple tree for interrupt descriptor management") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Shanker Donthineni <sdonthineni@nvidia.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09maple_tree: fix mt_destroy_walk() on root leaf nodeWei Yang
On destroy, we should set each node dead. But the current code misses this when the maple tree has only the root node. The reason is that mt_destroy_walk() leverages mte_destroy_descend() to set nodes dead, but this is skipped since the only root node is a leaf. Fix this by setting the node dead if it is a leaf. Link: https://lore.kernel.org/all/20250407231354.11771-1-richard.weiyang@gmail.com/ Link: https://lkml.kernel.org/r/20250624191841.64682-1-Liam.Howlett@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/vmalloc: leave lazy MMU mode on PTE mapping errorAlexander Gordeev
vmap_pages_pte_range() enters the lazy MMU mode, but fails to leave it in case an error is encountered. Link: https://lkml.kernel.org/r/20250623075721.2817094-1-agordeev@linux.ibm.com Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202506132017.T1l1l6ME-lkp@intel.com/ Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
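A generic sketch of the pattern the fix restores (hypothetical function; the key point is that arch_leave_lazy_mmu_mode() must also run on the error path, so the loop breaks out instead of returning directly):

    static int map_pages_pte_range_sketch(pte_t *pte, unsigned long addr,
                                          unsigned long end, struct page ***pages)
    {
        int err = 0;

        arch_enter_lazy_mmu_mode();
        do {
            struct page *page = **pages;

            if (!page || !pte_none(ptep_get(pte))) {
                err = -EBUSY;
                break;  /* previously: an early return skipped the leave below */
            }
            /* ... set_pte_at() for @page ... */
            (*pages)++;
        } while (pte++, addr += PAGE_SIZE, addr != end);
        arch_leave_lazy_mmu_mode();

        return err;
    }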
2025-07-09scripts/gdb: fix interrupts display after MCP on x86Florian Fainelli
The text line would not be appended to as it should have been: the operator should have been a '+=' but ended up being a '==', fix that. Link: https://lkml.kernel.org/r/20250623164153.746359-1-florian.fainelli@broadcom.com Fixes: b0969d7687a7 ("scripts/gdb: print interrupts") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()Harry Yoo
alloc_tag_top_users() attempts to lock alloc_tag_cttype->mod_lock even when the alloc_tag_cttype is not allocated because:

 1) alloc tagging is disabled because mem profiling is disabled (!alloc_tag_cttype)
 2) alloc tagging is enabled, but not yet initialized (!alloc_tag_cttype)
 3) alloc tagging is enabled, but failed initialization (!alloc_tag_cttype or IS_ERR(alloc_tag_cttype))

In all cases, alloc_tag_cttype is not allocated, and therefore alloc_tag_top_users() should not attempt to acquire the semaphore. This leads to a crash on memory allocation failure by attempting to acquire a non-existent semaphore:

Oops: general protection fault, probably for non-canonical address 0xdffffc000000001b: 0000 [#3] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x00000000000000d8-0x00000000000000df]
CPU: 2 UID: 0 PID: 1 Comm: systemd Tainted: G D 6.16.0-rc2 #1 VOLUNTARY
Tainted: [D]=DIE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:down_read_trylock+0xaa/0x3b0
Code: d0 7c 08 84 d2 0f 85 a0 02 00 00 8b 0d df 31 dd 04 85 c9 75 29 48 b8 00 00 00 00 00 fc ff df 48 8d 6b 68 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 88 02 00 00 48 3b 5b 68 0f 85 53 01 00 00 65 ff
RSP: 0000:ffff8881002ce9b8 EFLAGS: 00010016
RAX: dffffc0000000000 RBX: 0000000000000070 RCX: 0000000000000000
RDX: 000000000000001b RSI: 000000000000000a RDI: 0000000000000070
RBP: 00000000000000d8 R08: 0000000000000001 R09: ffffed107dde49d1
R10: ffff8883eef24e8b R11: ffff8881002cec20 R12: 1ffff11020059d37
R13: 00000000003fff7b R14: ffff8881002cec20 R15: dffffc0000000000
FS:  00007f963f21d940(0000) GS:ffff888458ca6000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f963f5edf71 CR3: 000000010672c000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 codetag_trylock_module_list+0xd/0x20
 alloc_tag_top_users+0x369/0x4b0
 __show_mem+0x1cd/0x6e0
 warn_alloc+0x2b1/0x390
 __alloc_frozen_pages_noprof+0x12b9/0x21a0
 alloc_pages_mpol+0x135/0x3e0
 alloc_slab_page+0x82/0xe0
 new_slab+0x212/0x240
 ___slab_alloc+0x82a/0xe00
 </TASK>

As David Wang points out, this issue became easier to trigger after commit 780138b12381 ("alloc_tag: check mem_profiling_support in alloc_tag_init"). Before the commit, the issue occurred only when it failed to allocate and initialize alloc_tag_cttype or if a memory allocation fails before alloc_tag_init() is called. After the commit, it can be easily triggered when memory profiling is compiled but disabled at boot. To properly determine whether alloc_tag_init() has been called and its data structures initialized, verify that alloc_tag_cttype is a valid pointer before acquiring the semaphore. If the variable is NULL or an error value, it has not been properly initialized. In such a case, just skip and do not attempt to acquire the semaphore.
[harry.yoo@oracle.com: v3] Link: https://lkml.kernel.org/r/20250624072513.84219-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250620195305.1115151-1-harry.yoo@oracle.com Fixes: 780138b12381 ("alloc_tag: check mem_profiling_support in alloc_tag_init") Fixes: 1438d349d16b ("lib: add memory allocations report in show_mem()") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202506181351.bba867dd-lkp@intel.com Acked-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Cc: Casey Chen <cachen@purestorage.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Yuanyuan Zhong <yzhong@purestorage.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
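A sketch of the guard (abbreviated; IS_ERR_OR_NULL() covers all three cases above, since alloc_tag_cttype is either NULL or an ERR_PTR whenever initialization never completed):

    size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
    {
        size_t nr = 0;

        /*
         * Profiling disabled, not yet initialized, or failed initialization:
         * there is no mod_lock to take, so report nothing.
         */
        if (IS_ERR_OR_NULL(alloc_tag_cttype))
            return 0;

        /* ... take mod_lock via codetag_trylock_module_list() and walk the tags ... */
        return nr;
    }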
2025-07-09kallsyms: fix build without execinfoAchill Gilgenast
Some libcs, like musl, don't provide execinfo.h since it's not part of POSIX. In order to fix compilation on musl, only include execinfo.h if available (HAVE_BACKTRACE_SUPPORT). This was discovered with c104c16073b7 ("Kunit to check the longest symbol length") which starts to include linux/kallsyms.h with Alpine Linux' configs. Link: https://lkml.kernel.org/r/20250622014608.448718-1-fossdd@pwned.life Fixes: c104c16073b7 ("Kunit to check the longest symbol length") Signed-off-by: Achill Gilgenast <fossdd@pwned.life> Cc: Luis Henriques <luis@igalia.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
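A hedged illustration of the conditional-include pattern in userspace tooling (HAVE_BACKTRACE_SUPPORT is assumed to be defined by the build system when the libc ships <execinfo.h>; musl does not):

    #include <stdio.h>
    #include <stdlib.h>
    #ifdef HAVE_BACKTRACE_SUPPORT
    #include <execinfo.h>   /* backtrace_symbols(); not shipped by musl */
    #endif

    static inline void print_ip_sym(unsigned long ip)
    {
    #ifdef HAVE_BACKTRACE_SUPPORT
        char **name = backtrace_symbols((void **)&ip, 1);

        if (name) {
            printf("%s\n", *name);
            free(name);
            return;
        }
    #endif
        printf("0x%lx\n", ip);  /* fallback: raw address only */
    }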
2025-07-09Merge branch 'net-mlx5-misc-changes-2025-07-09'Jakub Kicinski
Tariq Toukan says: ==================== net/mlx5: misc changes 2025-07-09 This series contains misc enhancements to the mlx5 driver. ==================== Link: https://patch.msgid.link/1752009387-13300-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: RX, Remove unnecessary RQT redirectsTariq Toukan
RQTs (Receive Queue Table) should redirect traffic to the channels' RQs when they're active. Otherwise, redirect to the designated "drop RQ". RQTs are created in "inactive" state, pointing to the "drop RQ". In activate and de-activate flows, do not "deactivate" the rest of RQTs (beyond the num of channels), as they are already inactive. This cuts down unnecessary execution of FW commands (MODIFY_RQT), and improves the latency of open/close channels or configuration change.

Perf:
NIC: Connect-X7. Configuration: 1 combined channel, max num channels 248.
Measure time for "interface up + interface down".
Before: 0.313 sec
After: 0.057 sec (5.5x faster)
247 MODIFY_RQT commands saved in interface up.
247 MODIFY_RQT commands saved in interface down.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5: Warn when write combining is not supportedMaor Gottlieb
Warn if write combining is not supported, as it can impact latency. Print the warning message only when the driver actually runs the test and detects an unsupported state, rather than when inheriting the parent's result for SFs. Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: Replace recursive VLAN push handling with an iterative loopGal Pressman
mlx5e_tc_act_vlan_add_push_action() uses tail recursion to walk through a stack of VLAN devices. There is no need for a complicated recursion with unnecessary stack consumption and less obvious code flow; rewrite the function so that it uses a do-while loop instead. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: CT: extract a memcmp from a spinlock sectionCosmin Ratiu
This reduces the time the lock is held and reduces contention. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: Remove unused VLAN insertion logic in TX pathCarolina Jubran
The VLAN insertion capability (`wqe_vlan_insert`) was never enabled on all mlx5 devices. When VLAN TX offload is advertised but this capability is not supported, the driver uses inline headers to insert the VLAN tag. To support this, the driver used to set the `MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE` bit to enforce L2 inline mode when `wqe_vlan_insert` was not supported. Since the capability is disabled on all devices, this logic was always active, and the SQ flag has become redundant. L2 inline is enforced unconditionally for VLAN-tagged packets. The `skb_vlan_tag_present()` check in the else-if block of `mlx5e_sq_xmit_wqe()` is never true by this point in the TX flow, as the VLAN tag has already been inserted by the driver using inline headers. As a result, this code is never executed. Remove the redundant SQ state, dead VLAN insertion code block, and related logic. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'rxrpc-miscellaneous-fixes'Jakub Kicinski
David Howells says: ==================== rxrpc: Miscellaneous fixes Here are some miscellaneous fixes for rxrpc: (1) Fix assertion failure due to preallocation collision. (2) Fix oops due to prealloc backlog struct not yet having been allocated if no service calls have yet been preallocated. ==================== Link: https://patch.msgid.link/20250708211506.2699012-1-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09rxrpc: Fix oops due to non-existence of prealloc backlog structDavid Howells
If an AF_RXRPC service socket is opened and bound, but no calls are preallocated, then rxrpc_alloc_incoming_call() will oops because the rxrpc_backlog struct doesn't get allocated until the first preallocation is made. Fix this by returning NULL from rxrpc_alloc_incoming_call() if there is no backlog struct. This will cause the incoming call to be aborted. Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Suggested-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: LePremierHomme <kwqcheii@proton.me> cc: Marc Dionne <marc.dionne@auristor.com> cc: Willy Tarreau <w@1wt.eu> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250708211506.2699012-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09rxrpc: Fix bug due to prealloc collisionDavid Howells
When userspace is using AF_RXRPC to provide a server, it has to preallocate incoming calls and assign to them call IDs that will be used to thread related recvmsg() and sendmsg() together. The preallocated call IDs will automatically be attached to calls as they come in until the pool is empty. To the kernel, the call IDs are just arbitrary numbers, but userspace can use the call ID to hold a pointer to prepared structs. In any case, the user isn't permitted to create two calls with the same call ID (call IDs become available again when the call ends) and EBADSLT should result from sendmsg() if an attempt is made to preallocate a call with an in-use call ID. However, the cleanup in the error handling will trigger both assertions in rxrpc_cleanup_call() because the call isn't marked complete and isn't marked as having been released. Fix this by setting the call state in rxrpc_service_prealloc_one() and then marking it as being released before calling the cleanup function. Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests") Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: LePremierHomme <kwqcheii@proton.me> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250708211506.2699012-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09vsock/test: fix test for null ptr deref when transport changesStefano Garzarella
In test_stream_transport_change_client(), the client sends CONTROL_CONTINUE on each iteration, even when connect() is unsuccessful. This causes a flood of control messages in the server that hangs around for more than 10 seconds after the test finishes, triggering several timeouts and causing subsequent tests to fail. This was discovered in testing a newly proposed test that failed in this way on the client side: ... 33 - SOCK_STREAM transport change null-ptr-deref...ok 34 - SOCK_STREAM ioctl(SIOCINQ) functionality...recv timed out The CONTROL_CONTINUE message is used only to tell to the server to call accept() to consume successful connections, so that subsequent connect() will not fail for finding the queue full. Send CONTROL_CONTINUE message only when the connect() has succeeded, or found the queue full. Note that the second connect() can also succeed if the first one was interrupted after sending the request. Fixes: 3a764d93385c ("vsock/test: Add test for null ptr deref when transport changes") Cc: leonardi@redhat.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708111701.129585-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'net-phy-bcm54811-phy-initialization'Jakub Kicinski
Kamil Horák - 2N says: ==================== net: phy: bcm54811: PHY initialization Proper bcm54811 PHY driver initialization for MII-Lite. The bcm54811 PHY in the MLP package must be set up for the MII-Lite interface mode by software. Normally, the PHY-to-MAC interface is selected in hardware by setting the bootstrap pins of the PHY. However, MII and MII-Lite share the same hardware setup and must be distinguished by software, setting the appropriate bit in a configuration register. The MII-Lite interface mode is a non-standard one, defined by Broadcom for some of their PHYs. The MII-Lite "lightness" consists of omitting the RXER, TXER, CRS and COL signals of the standard MII interface. The absence of COL makes half-duplex link modes impossible but does not interfere with Broadcom's BroadR-Reach link modes, because they are full-duplex only. To do it in a clean way, MII-Lite must be introduced first, including its limitation on link modes (no half-duplex), because it is a prerequisite for patch #3 of this series. Patch #4 does not depend on MII-Lite directly, but both #3 and #4 are necessary for bcm54811 to work properly without additional configuration steps - for example in the bootloader, before the kernel starts.

PATCH 1 - Add the MII-Lite PHY interface mode as defined by Broadcom for their two-wire PHYs. It can be used with most Ethernet controllers under certain limitations (no half-duplex link modes etc.).
PATCH 2 - Add the MII-Lite PHY interface type.
PATCH 3 - Activate the MII-Lite interface mode on Broadcom bcm5481x PHYs.
PATCH 4 - Initialize the BCM54811 PHY properly so that it conforms to the datasheet regarding a reserved bit in the LRE Control register, which must be written to zero after every device reset. Ignore the LDS capability bit in the LRE Status register on bcm54811.

==================== Link: https://patch.msgid.link/20250708090140.61355-1-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: phy: bcm54811: PHY initializationKamil Horák - 2N
Reset bit 12 in the PHY's LRE Control register upon initialization. According to the datasheet, this bit must be written to zero after every device reset. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20250708090140.61355-5-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: phy: bcm5481x: MII-Lite activationKamil Horák - 2N
Broadcom PHYs featuring the BroadR-Reach two-wire link mode are usually capable of operating in a simplified MII mode, without the TXER, RXER, CRS and COL signals defined for MII. The absence of the COL signal makes half-duplex link modes impossible; however, the BroadR-Reach modes are all full-duplex only. Depending on the IC encapsulation, there exist MII-Lite-only PHYs such as the bcm54811 in the MLP package. The PHY itself is hardware-strapped to select among multiple RGMII and MII-Lite modes, but the MII-Lite mode must also be activated by software. Add MII-Lite activation for bcm5481x PHYs. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250708090140.61355-4-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dt-bindings: ethernet-phy: add MII-Lite phy interface typeKamil Horák - 2N
Some Broadcom PHYs are capable of operating in a simplified MII mode, without the TXER, RXER, CRS and COL signals defined for MII. The MII-Lite mode can be used on most Ethernet controllers with a full MII interface by just leaving the input signals (RXER, CRS, COL) inactive. The absence of the COL signal makes half-duplex link modes impossible but does not interfere with BroadR-Reach link modes on Broadcom PHYs, because they are all full-duplex only. Add the new interface type "mii-lite" to the phy-connection-type enum. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Acked-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20250708090140.61355-3-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: phy: MII-Lite PHY interface modeKamil Horák - 2N
Some Broadcom PHYs are capable of operating in a simplified MII mode, without the TXER, RXER, CRS and COL signals defined for MII. The MII-Lite mode can be used on most Ethernet controllers with a full MII interface by just leaving the input signals (RXER, CRS, COL) inactive. The absence of the COL signal makes half-duplex link modes impossible but does not interfere with BroadR-Reach link modes on Broadcom PHYs, because they are all full-duplex only. Add the MII-Lite interface mode, especially for Broadcom two-wire PHYs. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250708090140.61355-2-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09MAINTAINERS: remove myself as netronome maintainerLouis Peens
I am moving on from Corigine to different things, for the moment slightly removed from kernel development. Right now there is nobody I can in good conscience recommend to take over the maintainer role, but there are still people available for review, so put the driver state to 'Odd Fixes'. Additionally add Simon Horman as reviewer - thanks Simon. Signed-off-by: Louis Peens <louis.peens@corigine.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: usb: enable the work after stop usbnet by ip down/upZqiang
Oleksij reported that the smsc95xx driver fails after one down/up cycle, like this:

$ nmcli device set enu1u1 managed no
$ ip a a 10.10.10.1/24 dev enu1u1
$ ping -c 4 10.10.10.3
$ ip l s dev enu1u1 down
$ ip l s dev enu1u1 up
$ ping -c 4 10.10.10.3

The second ping does not reach the host. Networking also fails on other interfaces. Keep the work usable after the interface is brought back up by replacing disable_work_sync() with cancel_work_sync(). [Jun Miao: completely rewrite the commit changelog] Fixes: 2c04d279e857 ("net: usb: Convert tasklet API to new bottom half workqueue mechanism") Reported-by: Oleksij Rempel <o.rempel@pengutronix.de> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Jun Miao <jun.miao@intel.com> Link: https://patch.msgid.link/20250708081653.307815-1-jun.miao@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'vsock-introduce-siocinq-ioctl-support'Jakub Kicinski
Xuewei Niu says: ==================== vsock: Introduce SIOCINQ ioctl support Introduce SIOCINQ ioctl support for vsock, indicating the length of unread bytes. Similar with SIOCOUTQ ioctl, the information is transport-dependent. The first patch adds SIOCINQ ioctl support in AF_VSOCK. Thanks to @dexuan, the second patch is to fix the issue where hyper-v `hvs_stream_has_data()` doesn't return the readable bytes. The third patch wraps the ioctl into `ioctl_int()`, which implements a retry mechanism to prevent immediate failure. The last one adds two test cases to check the functionality. The changes have been tested, and the results are as expected. ==================== Link: https://patch.msgid.link/20250708-siocinq-v6-0-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09test/vsock: Add ioctl SIOCINQ testsXuewei Niu
Add SIOCINQ ioctl tests for both SOCK_STREAM and SOCK_SEQPACKET. The client waits for the server to send data, and checks if the SIOCINQ ioctl value matches the data size. After consuming the data, the client checks if the SIOCINQ value is 0. Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708-siocinq-v6-4-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09test/vsock: Add retry mechanism to ioctl wrapperXuewei Niu
Wrap the ioctl in `ioctl_int()`, which takes a pointer to the actual int value and an expected int value. The function will not return until either the ioctl returns the expected value or a timeout occurs, thus avoiding immediate failure. Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708-siocinq-v6-3-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09vsock: Add support for SIOCINQ ioctlXuewei Niu
Add support for SIOCINQ ioctl, indicating the length of bytes unread in the socket. The value is obtained from `vsock_stream_has_data()`. Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708-siocinq-v6-2-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
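A hedged sketch of such a handler (not the exact patch; vsock_stream_has_data() returns the number of readable bytes, or a negative error):

    static int vsock_ioctl_siocinq_sketch(struct sock *sk, int __user *arg)
    {
        struct vsock_sock *vsk = vsock_sk(sk);
        s64 n_bytes;

        lock_sock(sk);
        n_bytes = vsock_stream_has_data(vsk);
        release_sock(sk);

        if (n_bytes < 0)
            return (int)n_bytes;

        return put_user((int)n_bytes, arg);
    }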
2025-07-09hv_sock: Return the readable bytes in hvs_stream_has_data()Dexuan Cui
When hv_sock was originally added, __vsock_stream_recvmsg() and vsock_stream_has_data() actually only needed to know whether there is any readable data or not, so hvs_stream_has_data() was written to return 1 or 0 for simplicity. However, now hvs_stream_has_data() should return the readable bytes because vsock_data_ready() -> vsock_stream_has_data() needs to know the actual bytes rather than a boolean value of 1 or 0. The SIOCINQ ioctl support also needs hvs_stream_has_data() to return the readable bytes. Let hvs_stream_has_data() return the readable bytes of the payload in the next host-to-guest VMBus hv_sock packet. Note: there may be multiple incoming hv_sock packets pending in the VMBus channel's ringbuffer, but so far there is not a VMBus API that allows us to know all the readable bytes in total without reading and caching the payload of the multiple packets, so let's just return the readable bytes of the next single packet. In the future, we'll either add a VMBus API that allows us to know the total readable bytes without touching the data in the ringbuffer, or the hv_sock driver needs to understand the VMBus packet format and parse the packets directly. Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://patch.msgid.link/20250708-siocinq-v6-1-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Documentation: xsk: correct the obsolete references and examplesJason Xing
The modified lines are mainly related to the following commits[1][2] which remove those tests and examples. Since samples/bpf has been deprecated, we can refer to more examples that are easily searched in the various xdp-projects, like the following link: https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example [1] commit f36600634282 ("libbpf: move xsk.{c,h} into selftests/bpf") [2] commit cfb5a2dbf141 ("bpf, samples: Remove AF_XDP samples") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250708062907.11557-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09skbuff: Add MSG_MORE flag to optimize tcp large packet transmissionFeng Yang
When using sockmap for forwarding, the average latency for different packet sizes after sending 10,000 packets is as follows:

size	old(us)	new(us)
512	56	55
1472	58	58
1600	106	81
3000	145	105
5000	182	125

Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'converge-on-using-secs_to_jiffies-part-two'Jakub Kicinski
Easwar Hariharan says: ==================== Converge on using secs_to_jiffies() part two This is the second series (part 1*) that converts users of msecs_to_jiffies() that either use the multiply pattern of either of: - msecs_to_jiffies(N*1000) or - msecs_to_jiffies(N*MSEC_PER_SEC) where N is a constant or an expression, to avoid the multiplication. The conversion is made with Coccinelle with the secs_to_jiffies() script in scripts/coccinelle/misc. Attention is paid to what the best change can be rather than restricting to what the tool provides. v1: https://lore.kernel.org/20250219-netdev-secs-to-jiffies-part-2-v1-0-c484cc63611b@linux.microsoft.com ==================== Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-0-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: ipconfig: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies() to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)

-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)

While here, manually convert a couple of timeouts denominated in seconds. Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-2-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/smc: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies() to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)

-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)

Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-1-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'tcp-better-memory-control-for-not-yet-accepted-sockets'Jakub Kicinski
Eric Dumazet says: ==================== tcp: better memory control for not-yet-accepted sockets Address a possible OOM condition caused by a recent change. Add a new packetdrill test checking the expected behavior. ==================== Link: https://patch.msgid.link/20250707213900.1543248-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09selftests/net: packetdrill: add tcp_ooo-before-and-after-accept.pktEric Dumazet
Test how new passive flows react to ooo incoming packets. Their sk_rcvbuf can increase only after accept(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250707213900.1543248-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09tcp: refine sk_rcvbuf increase for ooo packetsEric Dumazet
When a passive flow has not been accepted yet, it is not wise to increase sk_rcvbuf when receiving ooo packets. A very busy server might tune down tcp_rmem[1] to better control how much memory can be used by sockets waiting in its listeners accept queues. Fixes: 63ad7dfedfae ("tcp: adjust rcvbuf in presence of reorders") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250707213900.1543248-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/sched: Abort __tc_modify_qdisc if parent class does not existVictor Nogueira
Lion's patch [1] revealed an ancient bug in the qdisc API. Whenever a user creates/modifies a qdisc specifying as a parent another qdisc, the qdisc API will, during grafting, detect that the user is not trying to attach to a class and reject. However grafting is performed after qdisc_create (and thus the qdiscs' init callback) is executed. In qdiscs that eventually call qdisc_tree_reduce_backlog during init or change (such as fq, hhf, choke, etc), an issue arises. For example, executing the following commands: sudo tc qdisc add dev lo root handle a: htb default 2 sudo tc qdisc add dev lo parent a: handle beef fq Qdiscs such as fq, hhf, choke, etc unconditionally invoke qdisc_tree_reduce_backlog() in their control path init() or change() which then causes a failure to find the child class; however, that does not stop the unconditional invocation of the assumed child qdisc's qlen_notify with a null class. All these qdiscs make the assumption that class is non-null. The solution is ensure that qdisc_leaf() which looks up the parent class, and is invoked prior to qdisc_create(), should return failure on not finding the class. In this patch, we leverage qdisc_leaf to return ERR_PTRs whenever the parentid doesn't correspond to a class, so that we can detect it earlier on and abort before qdisc_create is called. [1] https://lore.kernel.org/netdev/d912cbd7-193b-4269-9857-525bee8bbb6a@gmail.com/ Fixes: 5e50da01d0ce ("[NET_SCHED]: Fix endless loops (part 2): "simple" qdiscs") Reported-by: syzbot+d8b58d7b0ad89a678a16@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68663c93.a70a0220.5d25f.0857.GAE@google.com/ Reported-by: syzbot+5eccb463fa89309d8bdc@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68663c94.a70a0220.5d25f.0858.GAE@google.com/ Reported-by: syzbot+1261670bbdefc5485a06@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/686764a5.a00a0220.c7b3.0013.GAE@google.com/ Reported-by: syzbot+15b96fc3aac35468fe77@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/686764a5.a00a0220.c7b3.0014.GAE@google.com/ Reported-by: syzbot+4dadc5aecf80324d5a51@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68679e81.a70a0220.29cf51.0016.GAE@google.com/ Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20250707210801.372995-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
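A sketch of the qdisc_leaf() change described above (simplified; callers are assumed to be updated to check IS_ERR() and abort before qdisc_create() runs):

    static struct Qdisc *qdisc_leaf(struct Qdisc *p, u32 classid)
    {
        const struct Qdisc_class_ops *cops = p->ops->cl_ops;
        unsigned long cl;

        if (!cops)
            return ERR_PTR(-EOPNOTSUPP);    /* parent has no classes at all */

        cl = cops->find(p, classid);
        if (!cl)
            return ERR_PTR(-ENOENT);        /* parent class does not exist */

        return cops->leaf(p, cl);
    }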
2025-07-09gve: make IRQ handlers and page allocation NUMA awareBailey Forrest
All memory in GVE is currently allocated without regard for the NUMA node of the device. Because access to NUMA-local memory is significantly cheaper than access to a remote node, this change attempts to ensure that page frags used in the RX path, including page pool frags, are allocated on the NUMA node local to the gVNIC device. Note that this attempt is best-effort. If necessary, the driver will still allocate non-local memory, as __GFP_THISNODE is not passed. Descriptor ring allocations are not updated, as dma_alloc_coherent handles that. This change also modifies the IRQ affinity setting to only select CPUs from the node local to the device, preserving the behavior that TX and RX queues of the same index share CPU affinity. Signed-off-by: Bailey Forrest <bcf@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250707210107.2742029-1-jeroendb@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
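A generic sketch of the best-effort NUMA-local allocation described above (hypothetical helper; note that __GFP_THISNODE is deliberately not passed, so remote nodes remain a fallback):

    static struct page *rx_alloc_page_local(struct device *dev, gfp_t gfp)
    {
        int nid = dev_to_node(dev);     /* NUMA node the NIC is attached to */

        return alloc_pages_node(nid, gfp, 0);
    }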
2025-07-09net: ethernet: ti: am65-cpsw-nuss: Fix skb size by accounting for skb_shared_infoChintan Vankar
While transitioning from netdev_alloc_ip_align() to build_skb(), memory for the "skb_shared_info" member of an "skb" was not allocated. Fix this by allocating "PAGE_SIZE" as the skb length, accounting for the packet length, headroom and tailroom, thereby including the required memory space for skb_shared_info. Fixes: 8acacc40f733 ("net: ethernet: ti: am65-cpsw: Add minimal XDP support") Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com> Signed-off-by: Chintan Vankar <c-vankar@ti.com> Link: https://patch.msgid.link/20250707085201.1898818-1-c-vankar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
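A hedged sketch of the build_skb() sizing rule the fix applies (hypothetical helper; the buffer handed to build_skb() must cover headroom, packet data and the tail room that becomes skb_shared_info):

    static struct sk_buff *rx_build_skb_sketch(void *page_addr,
                                               unsigned int headroom,
                                               unsigned int pkt_len)
    {
        struct sk_buff *skb;
        unsigned int max_len = PAGE_SIZE - headroom -
                               SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

        if (pkt_len > max_len)
            return NULL;

        /* Hand build_skb() the whole page so skb_shared_info fits at the tail. */
        skb = build_skb(page_addr, PAGE_SIZE);
        if (!skb)
            return NULL;

        skb_reserve(skb, headroom);
        skb_put(skb, pkt_len);

        return skb;
    }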
2025-07-09net: thunderx: avoid direct MTU assignment after WRITE_ONCE()Alok Tiwari
The current logic in nicvf_change_mtu() writes the new MTU to netdev->mtu using WRITE_ONCE() before verifying whether the hardware update succeeds. However, on hardware update failure, it attempts to revert to the original MTU using a direct assignment (netdev->mtu = orig_mtu), which violates the intent of the WRITE_ONCE() protection introduced in commit 1eb2cded45b3 ("net: annotate writes on dev->mtu from ndo_change_mtu()"). Additionally, WRITE_ONCE(netdev->mtu, new_mtu) is unnecessarily performed even when the device is not running. Fix this by: only writing netdev->mtu after successfully updating the hardware; skipping the hardware update when the device is down and setting the MTU directly; and removing the unused variable orig_mtu. This ensures that all writes to netdev->mtu are consistent with WRITE_ONCE() expectations and avoids unintended state corruption on failure paths. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250706194327.1369390-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
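A hedged sketch of the resulting control flow (nicvf_update_hw_max_frs() is the assumed hardware-update helper; the ordering of the WRITE_ONCE() is what matters):

    static int nicvf_change_mtu_sketch(struct net_device *netdev, int new_mtu)
    {
        struct nicvf *nic = netdev_priv(netdev);

        if (!netif_running(netdev)) {
            /* Device is down: nothing to program yet, just record the MTU. */
            WRITE_ONCE(netdev->mtu, new_mtu);
            return 0;
        }

        if (nicvf_update_hw_max_frs(nic, new_mtu))
            return -EINVAL;     /* hardware rejected it; netdev->mtu untouched */

        /* Publish the new MTU only after the hardware accepted it. */
        WRITE_ONCE(netdev->mtu, new_mtu);
        return 0;
    }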