summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-09-09selftests/net: integrate packetdrill with ksftWillem de Bruijn
Lay the groundwork to import into kselftests the over 150 packetdrill TCP/IP conformance tests on github.com/google/packetdrill. Florian recently added support for packetdrill tests in nf_conntrack, in commit a8a388c2aae49 ("selftests: netfilter: add packetdrill based conntrack tests"). This patch takes a slightly different approach. It relies on ksft_runner.sh to run every *.pkt file in the directory. Any future imports of packetdrill tests should require no additional coding. Just add the *.pkt files. Initially import only two features/directories from github. One with a single script, and one with two. This was the only reason to pick tcp/inq and tcp/md5. The path replaces the directory hierarchy in github with a flat space of files: $(subst /,_,$(wildcard tcp/**/*.pkt)). This is the most straightforward option to integrate with kselftests. The Linked thread reviewed two ways to maintain the hierarchy: TEST_PROGS_RECURSE and PRESERVE_TEST_DIRS. But both introduce significant changes to kselftest infra and with that risk to existing tests. Implementation notes: - restore alphabetical order when adding the new directory to tools/testing/selftests/Makefile - imported *.pkt files and support verbatim from the github project, except for - update `source ./defaults.sh` path (to adjust for flat dir) - add SPDX headers - remove one author statement - Acknowledgment: drop an e (checkpatch) Tested: make -C tools/testing/selftests \ TARGETS=net/packetdrill \ run_tests make -C tools/testing/selftests \ TARGETS=net/packetdrill \ install INSTALL_PATH=$KSFT_INSTALL_PATH # in virtme-ng ./run_kselftest.sh -c net/packetdrill ./run_kselftest.sh -t net/packetdrill:tcp_inq_client.pkt Link: https://lore.kernel.org/netdev/20240827193417.2792223-1-willemdebruijn.kernel@gmail.com/ Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240905231653.2427327-3-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: support interpreted scripts with ksft_runner.shWillem de Bruijn
Support testcases that are themselves not executable, but need an interpreter to run them. If a test file is not executable, but an executable file ksft_runner.sh exists in the TARGET dir, kselftest will run ./ksft_runner.sh ./$BASENAME_TEST Packetdrill may add hundreds of packetdrill scripts for testing. These scripts must be passed to the packetdrill process. Have kselftest run each test directly, as it already solves common runner requirements like parallel execution and isolation (netns). A previous RFC added a wrapper in between, which would have to reimplement such functionality. Link: https://lore.kernel.org/netdev/66d4d97a4cac_3df182941a@willemb.c.googlers.com.notmuch/T/ Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240905231653.2427327-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09fou: fix initialization of grcMuhammad Usama Anjum
The grc must be initialize first. There can be a condition where if fou is NULL, goto out will be executed and grc would be used uninitialized. Fixes: 7e4196935069 ("fou: Fix null-ptr-deref in GRO.") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240906102839.202798-1-usama.anjum@collabora.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09Merge branch 'various-cleanups'Jakub Kicinski
Rosen Penev says: ==================== various cleanups Allow CI to build. Also a bugfix for dual GMAC devices. ==================== Link: https://patch.msgid.link/20240905194938.8453-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: ag71xx: disable napi interrupts during probeSven Eckelmann
ag71xx_probe is registering ag71xx_interrupt as handler for gmac0/gmac1 interrupts. The handler is trying to use napi_schedule to handle the processing of packets. But the netif_napi_add for this device is called a lot later in ag71xx_probe. It can therefore happen that a still running gmac0/gmac1 is triggering the interrupt handler with a bit from AG71XX_INT_POLL set in AG71XX_REG_INT_STATUS. The handler will then call napi_schedule and the napi code will crash the system because the ag->napi is not yet initialized. The gmcc0/gmac1 must be brought in a state in which it doesn't signal a AG71XX_INT_POLL related status bits as interrupt before registering the interrupt handler. ag71xx_hw_start will take care of re-initializing the AG71XX_REG_INT_ENABLE. This will become relevant when dual GMAC devices get added here. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Rosen Penev <rosenp@gmail.com> Link: https://patch.msgid.link/20240905194938.8453-8-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: ag71xx: remove always true branchRosen Penev
The opposite of this condition is checked above and if true, function returns. Which means this can never be false. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20240905194938.8453-7-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: ag71xx: get reset control using devm apiRosen Penev
Currently, the of variant is missing reset_control_put in error paths. The devm variant does not require it. Allows removing mdio_reset from the struct as it is not used outside the function. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20240905194938.8453-6-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: ag71xx: use ethtool_putsRosen Penev
Allows simplifying get_strings and avoids manual pointer manipulation. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20240905194938.8453-5-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: ag71xx: update FIFO bits and descriptionsRosen Penev
Taken from QCA SDK. No functional difference as same bits get applied. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20240905194938.8453-4-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: ag71xx: add MODULE_DESCRIPTIONRosen Penev
Now that COMPILE_TEST is enabled, it gets flagged when building with allmodconfig W=1 builds. Text taken from the beginning of the file. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240905194938.8453-3-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: ag71xx: add COMPILE_TEST to test compilationRosen Penev
While this driver is meant for MIPS only, it can be compiled on x86 just fine. Remove pointless parentheses while at it. Enables CI building of this driver. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240905194938.8453-2-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09Merge branch 'af_unix-correct-manage_oob-when-oob-follows-a-consumed-oob'Jakub Kicinski
Kuniyuki Iwashima says: ==================== af_unix: Correct manage_oob() when OOB follows a consumed OOB. Recently syzkaller reported UAF of OOB skb. The bug was introduced by commit 93c99f21db36 ("af_unix: Don't stop recv(MSG_DONTWAIT) if consumed OOB skb is at the head.") but uncovered by another recent commit 8594d9b85c07 ("af_unix: Don't call skb_get() for OOB skb."). [0]: https://lore.kernel.org/netdev/00000000000083b05a06214c9ddc@google.com/ ==================== Link: https://patch.msgid.link/20240905193240.17565-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09af_unix: Don't return OOB skb in manage_oob().Kuniyuki Iwashima
syzbot reported use-after-free in unix_stream_recv_urg(). [0] The scenario is 1. send(MSG_OOB) 2. recv(MSG_OOB) -> The consumed OOB remains in recv queue 3. send(MSG_OOB) 4. recv() -> manage_oob() returns the next skb of the consumed OOB -> This is also OOB, but unix_sk(sk)->oob_skb is not cleared 5. recv(MSG_OOB) -> unix_sk(sk)->oob_skb is used but already freed The recent commit 8594d9b85c07 ("af_unix: Don't call skb_get() for OOB skb.") uncovered the issue. If the OOB skb is consumed and the next skb is peeked in manage_oob(), we still need to check if the skb is OOB. Let's do so by falling back to the following checks in manage_oob() and add the test case in selftest. Note that we need to add a similar check for SIOCATMARK. [0]: BUG: KASAN: slab-use-after-free in unix_stream_read_actor+0xa6/0xb0 net/unix/af_unix.c:2959 Read of size 4 at addr ffff8880326abcc4 by task syz-executor178/5235 CPU: 0 UID: 0 PID: 5235 Comm: syz-executor178 Not tainted 6.11.0-rc5-syzkaller-00742-gfbdaffe41adc #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 unix_stream_read_actor+0xa6/0xb0 net/unix/af_unix.c:2959 unix_stream_recv_urg+0x1df/0x320 net/unix/af_unix.c:2640 unix_stream_read_generic+0x2456/0x2520 net/unix/af_unix.c:2778 unix_stream_recvmsg+0x22b/0x2c0 net/unix/af_unix.c:2996 sock_recvmsg_nosec net/socket.c:1046 [inline] sock_recvmsg+0x22f/0x280 net/socket.c:1068 ____sys_recvmsg+0x1db/0x470 net/socket.c:2816 ___sys_recvmsg net/socket.c:2858 [inline] __sys_recvmsg+0x2f0/0x3e0 net/socket.c:2888 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5360d6b4e9 Code: 48 83 c4 28 c3 e8 37 17 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff29b3a458 EFLAGS: 00000246 ORIG_RAX: 000000000000002f RAX: ffffffffffffffda RBX: 00007fff29b3a638 RCX: 00007f5360d6b4e9 RDX: 0000000000002001 RSI: 0000000020000640 RDI: 0000000000000003 RBP: 00007f5360dde610 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007fff29b3a628 R14: 0000000000000001 R15: 0000000000000001 </TASK> Allocated by task 5235: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 unpoison_slab_object mm/kasan/common.c:312 [inline] __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:338 kasan_slab_alloc include/linux/kasan.h:201 [inline] slab_post_alloc_hook mm/slub.c:3988 [inline] slab_alloc_node mm/slub.c:4037 [inline] kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4080 __alloc_skb+0x1c3/0x440 net/core/skbuff.c:667 alloc_skb include/linux/skbuff.h:1320 [inline] alloc_skb_with_frags+0xc3/0x770 net/core/skbuff.c:6528 sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2815 sock_alloc_send_skb include/net/sock.h:1778 [inline] queue_oob+0x108/0x680 net/unix/af_unix.c:2198 unix_stream_sendmsg+0xd24/0xf80 net/unix/af_unix.c:2351 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597 ___sys_sendmsg net/socket.c:2651 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2680 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 5235: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256 kasan_slab_free include/linux/kasan.h:184 [inline] slab_free_hook mm/slub.c:2252 [inline] slab_free mm/slub.c:4473 [inline] kmem_cache_free+0x145/0x350 mm/slub.c:4548 unix_stream_read_generic+0x1ef6/0x2520 net/unix/af_unix.c:2917 unix_stream_recvmsg+0x22b/0x2c0 net/unix/af_unix.c:2996 sock_recvmsg_nosec net/socket.c:1046 [inline] sock_recvmsg+0x22f/0x280 net/socket.c:1068 __sys_recvfrom+0x256/0x3e0 net/socket.c:2255 __do_sys_recvfrom net/socket.c:2273 [inline] __se_sys_recvfrom net/socket.c:2269 [inline] __x64_sys_recvfrom+0xde/0x100 net/socket.c:2269 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff8880326abc80 which belongs to the cache skbuff_head_cache of size 240 The buggy address is located 68 bytes inside of freed 240-byte region [ffff8880326abc80, ffff8880326abd70) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x326ab ksm flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff) page_type: 0xfdffffff(slab) raw: 00fff00000000000 ffff88801eaee780 ffffea0000b7dc80 dead000000000003 raw: 0000000000000000 00000000800c000c 00000001fdffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 4686, tgid 4686 (udevadm), ts 32357469485, free_ts 28829011109 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1493 prep_new_page mm/page_alloc.c:1501 [inline] get_page_from_freelist+0x2e4c/0x2f10 mm/page_alloc.c:3439 __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4695 __alloc_pages_node_noprof include/linux/gfp.h:269 [inline] alloc_pages_node_noprof include/linux/gfp.h:296 [inline] alloc_slab_page+0x5f/0x120 mm/slub.c:2321 allocate_slab+0x5a/0x2f0 mm/slub.c:2484 new_slab mm/slub.c:2537 [inline] ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3723 __slab_alloc+0x58/0xa0 mm/slub.c:3813 __slab_alloc_node mm/slub.c:3866 [inline] slab_alloc_node mm/slub.c:4025 [inline] kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4080 __alloc_skb+0x1c3/0x440 net/core/skbuff.c:667 alloc_skb include/linux/skbuff.h:1320 [inline] alloc_uevent_skb+0x74/0x230 lib/kobject_uevent.c:289 uevent_net_broadcast_untagged lib/kobject_uevent.c:326 [inline] kobject_uevent_net_broadcast+0x2fd/0x580 lib/kobject_uevent.c:410 kobject_uevent_env+0x57d/0x8e0 lib/kobject_uevent.c:608 kobject_synth_uevent+0x4ef/0xae0 lib/kobject_uevent.c:207 uevent_store+0x4b/0x70 drivers/base/bus.c:633 kernfs_fop_write_iter+0x3a1/0x500 fs/kernfs/file.c:334 new_sync_write fs/read_write.c:497 [inline] vfs_write+0xa72/0xc90 fs/read_write.c:590 page last free pid 1 tgid 1 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1094 [inline] free_unref_page+0xd22/0xea0 mm/page_alloc.c:2612 kasan_depopulate_vmalloc_pte+0x74/0x90 mm/kasan/shadow.c:408 apply_to_pte_range mm/memory.c:2797 [inline] apply_to_pmd_range mm/memory.c:2841 [inline] apply_to_pud_range mm/memory.c:2877 [inline] apply_to_p4d_range mm/memory.c:2913 [inline] __apply_to_page_range+0x8a8/0xe50 mm/memory.c:2947 kasan_release_vmalloc+0x9a/0xb0 mm/kasan/shadow.c:525 purge_vmap_node+0x3e3/0x770 mm/vmalloc.c:2208 __purge_vmap_area_lazy+0x708/0xae0 mm/vmalloc.c:2290 _vm_unmap_aliases+0x79d/0x840 mm/vmalloc.c:2885 change_page_attr_set_clr+0x2fe/0xdb0 arch/x86/mm/pat/set_memory.c:1881 change_page_attr_set arch/x86/mm/pat/set_memory.c:1922 [inline] set_memory_nx+0xf2/0x130 arch/x86/mm/pat/set_memory.c:2110 free_init_pages arch/x86/mm/init.c:924 [inline] free_kernel_image_pages arch/x86/mm/init.c:943 [inline] free_initmem+0x79/0x110 arch/x86/mm/init.c:970 kernel_init+0x31/0x2b0 init/main.c:1476 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Memory state around the buggy address: ffff8880326abb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880326abc00: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc >ffff8880326abc80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880326abd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc ffff8880326abd80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb Fixes: 93c99f21db36 ("af_unix: Don't stop recv(MSG_DONTWAIT) if consumed OOB skb is at the head.") Reported-by: syzbot+8811381d455e3e9ec788@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8811381d455e3e9ec788 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240905193240.17565-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09af_unix: Move spin_lock() in manage_oob().Kuniyuki Iwashima
When OOB skb has been already consumed, manage_oob() returns the next skb if exists. In such a case, we need to fall back to the else branch below. Then, we want to keep holding spin_lock(&sk->sk_receive_queue.lock). Let's move it out of if-else branch and add lightweight check before spin_lock() for major use cases without OOB skb. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240905193240.17565-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09af_unix: Rename unlinked_skb in manage_oob().Kuniyuki Iwashima
When OOB skb has been already consumed, manage_oob() returns the next skb if exists. In such a case, we need to fall back to the else branch below. Then, we need to keep two skbs and free them later with consume_skb() and kfree_skb(). Let's rename unlinked_skb accordingly. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240905193240.17565-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09af_unix: Remove single nest in manage_oob().Kuniyuki Iwashima
This is a prep for the later fix. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240905193240.17565-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09Merge tag 'linux-can-next-for-6.12-20240909' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2024-09-09 The first patch is by Rob Herring and simplifies the DT parsing in the cc770 driver. The next 2 patches both target the rockchip_canfd driver added in the last pull request to net-next. The first one is by Nathan Chancellor and fixes the return type of rkcanfd_start_xmit(), the second one is by me and fixes a 64 bit division on 32 bit platforms in rkcanfd_timestamp_init(). * tag 'linux-can-next-for-6.12-20240909' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: rockchip_canfd: rkcanfd_timestamp_init(): fix 64 bit division on 32 bit platforms can: rockchip_canfd: fix return type of rkcanfd_start_xmit() net: can: cc770: Simplify parsing DT properties ==================== Link: https://patch.msgid.link/20240909063657.2287493-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: remove dev_pick_tx_cpu_id()Jakub Kicinski
dev_pick_tx_cpu_id() has been introduced with two users by commit a4ea8a3dacc3 ("net: Add generic ndo_select_queue functions"). The use in AF_PACKET has been removed in 2019 by commit b71b5837f871 ("packet: rework packet_pick_tx_queue() to use common code selection") The other user was a Netlogic XLP driver, removed in 2021 by commit 47ac6f567c28 ("staging: Remove Netlogic XLP network driver"). It's relatively unlikely that any modern driver will need an .ndo_select_queue implementation which picks purely based on CPU ID and skips XPS, delete dev_pick_tx_cpu_id() Found by code inspection. Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240906161059.715546-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09Merge branch 'selftests-mptcp-add-time-per-subtests-in-tap-output'Jakub Kicinski
Matthieu Baerts says: ==================== selftests: mptcp: add time per subtests in TAP output Patches here add 'time=<N>ms' in the diagnostic data of the TAP output, e.g. ok 1 - pm_netlink: defaults addr list # time=9ms This addition is useful to quickly identify which subtests are taking a longer time than the others, or more than expected. Note that there are no specific formats to follow to show this time according to the TAP 13, TAP 14 and KTAP specifications, but we follow the format being parsed by NIPA [1]. Patch 1 modifies mptcp_lib.sh to add this support to all MPTCP selftests. Patch 2 removes the now duplicated info in mptcp_connect.sh Patch 3 slightly improves the precision of the first subtests in all MPTCP subtests. Patches 4 and 5 remove duplicated spaces in TAP output, for the TAP parsers that cannot handle them properly. v1: https://lore.kernel.org/20240902-net-next-mptcp-ksft-subtest-time-v1-0-f1ed499a11b1@kernel.org Link: https://github.com/linux-netdev/nipa/pull/36 ==================== Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-0-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: connect: remove duplicated spaces in TAP outputMatthieu Baerts (NGI0)
It is nice to have a visual alignment in the test output to present the different results, but it makes less sense in the TAP output that is there for computers. It sounds then better to remove the duplicated whitespaces in the TAP output, also because it can cause some issues with TAP parsers expecting only one space around the directive delimiter (#). While at it, change the variable name (result_msg) to something more explicit. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-5-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: diag: remove trailing whitespaceMatthieu Baerts (NGI0)
It doesn't need to be there, and it can cause some issues with TAP parsers expecting only one space around the directive delimiter (#). Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-4-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: reset the last TS before the first testMatthieu Baerts (NGI0)
Just to slightly improve the precision of the duration of the first test. In mptcp_join.sh, the last append_prev_results is now done as soon as the last test is over: this will add the last result in the list, and get a more precise time for this last test. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-3-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: connect: remote time in TAP outputMatthieu Baerts (NGI0)
It is now added by the MPTCP lib automatically, see the parent commit. The time in the TAP output might be slightly different from the one displayed before, but that's OK. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-2-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: lib: add time per subtests in TAP outputMatthieu Baerts (NGI0)
It adds 'time=<N>ms' in the diagnostic data of the TAP output, e.g. ok 1 - pm_netlink: defaults addr list # time=9ms This addition is useful to quickly identify which subtests are taking a longer time than the others, or more than expected. Note that there are no specific formats to follow to show this time according to the TAP 13 [1], TAP 14 [2] and KTAP [3] specifications. Let's then define this one here. Link: https://testanything.org/tap-version-13-specification.html [1] Link: https://testanything.org/tap-version-14-specification.html [2] Link: https://docs.kernel.org/dev-tools/ktap.html [3] Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-1-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09treewide: correct the typo 'retun'WangYuli
There are some spelling mistakes of 'retun' in comments which should be instead of 'return'. Link: https://lkml.kernel.org/r/63D0F870EE8E87A0+20240906054008.390188-1-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09ocfs2: cleanup return value and mlog in ocfs2_global_read_info()Joseph Qi
Return 0 instead of sizeof(ocfs2_global_disk_dqinfo) that .quota_read returns in normal case. Also cleanup mlog to make code more readable. Link: https://lkml.kernel.org/r/20240904071004.2067695-2-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09nilfs2: remove duplicate 'unlikely()' usageKunwu Chan
Nested unlikely() calls, IS_ERR already uses unlikely() internally Link: https://lkml.kernel.org/r/20240904101618.17716-1-konishi.ryusuke@gmail.com Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09nilfs2: fix potential oob read in nilfs_btree_check_delete()Ryusuke Konishi
The function nilfs_btree_check_delete(), which checks whether degeneration to direct mapping occurs before deleting a b-tree entry, causes memory access outside the block buffer when retrieving the maximum key if the root node has no entries. This does not usually happen because b-tree mappings with 0 child nodes are never created by mkfs.nilfs2 or nilfs2 itself. However, it can happen if the b-tree root node read from a device is configured that way, so fix this potential issue by adding a check for that case. Link: https://lkml.kernel.org/r/20240904081401.16682-4-konishi.ryusuke@gmail.com Fixes: 17c76b0104e4 ("nilfs2: B-tree based block mapping") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09nilfs2: determine empty node blocks as corruptedRyusuke Konishi
Due to the nature of b-trees, nilfs2 itself and admin tools such as mkfs.nilfs2 will never create an intermediate b-tree node block with 0 child nodes, nor will they delete (key, pointer)-entries that would result in such a state. However, it is possible that a b-tree node block is corrupted on the backing device and is read with 0 child nodes. Because operation is not guaranteed if the number of child nodes is 0 for intermediate node blocks other than the root node, modify nilfs_btree_node_broken(), which performs sanity checks when reading a b-tree node block, so that such cases will be judged as metadata corruption. Link: https://lkml.kernel.org/r/20240904081401.16682-3-konishi.ryusuke@gmail.com Fixes: 17c76b0104e4 ("nilfs2: B-tree based block mapping") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09nilfs2: fix potential null-ptr-deref in nilfs_btree_insert()Ryusuke Konishi
Patch series "nilfs2: fix potential issues with empty b-tree nodes". This series addresses three potential issues with empty b-tree nodes that can occur with corrupted filesystem images, including one recently discovered by syzbot. This patch (of 3): If a b-tree is broken on the device, and the b-tree height is greater than 2 (the level of the root node is greater than 1) even if the number of child nodes of the b-tree root is 0, a NULL pointer dereference occurs in nilfs_btree_prepare_insert(), which is called from nilfs_btree_insert(). This is because, when the number of child nodes of the b-tree root is 0, nilfs_btree_do_lookup() does not set the block buffer head in any of path[x].bp_bh, leaving it as the initial value of NULL, but if the level of the b-tree root node is greater than 1, nilfs_btree_get_nonroot_node(), which accesses the buffer memory of path[x].bp_bh, is called. Fix this issue by adding a check to nilfs_btree_root_broken(), which performs sanity checks when reading the root node from the device, to detect this inconsistency. Thanks to Lizhi Xu for trying to solve the bug and clarifying the cause early on. Link: https://lkml.kernel.org/r/20240904081401.16682-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240902084101.138971-1-lizhi.xu@windriver.com Link: https://lkml.kernel.org/r/20240904081401.16682-2-konishi.ryusuke@gmail.com Fixes: 17c76b0104e4 ("nilfs2: B-tree based block mapping") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+9bff4c7b992038a7409f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9bff4c7b992038a7409f Cc: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocationJinjie Ruan
Let the kmemdup_array() take care about multiplication and possible overflows. Link: https://lkml.kernel.org/r/20240828072340.1249310-1-ruanjinjie@huawei.com Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Li zeming <zeming@nfschina.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09tools/mm: rm thp_swap_allocator_test when make cleanzhangjiao
rm thp_swap_allocator_test when make clean Link: https://lkml.kernel.org/r/20240829042008.6937-1-zhangjiao2@cmss.chinamobile.com Signed-off-by: zhangjiao <zhangjiao2@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09squashfs: fix percpu address space issues in decompressor_multi_percpu.cUros Bizjak
When strict percpu address space checks are enabled, then current direct casts between the percpu address space and the generic address space fail the compilation on x86_64 with: decompressor_multi_percpu.c: In function `squashfs_decompressor_create': decompressor_multi_percpu.c:49:16: error: cast to generic address space pointer from disjoint `__seg_gs' address space pointer decompressor_multi_percpu.c: In function `squashfs_decompressor_destroy': decompressor_multi_percpu.c:64:25: error: cast to `__seg_gs' address space pointer from disjoint generic address space pointer decompressor_multi_percpu.c: In function `squashfs_decompress': decompressor_multi_percpu.c:82:25: error: cast to `__seg_gs' address space pointer from disjoint generic address space pointer Add intermediate casts to unsigned long, as advised in [1] and [2]. Side note: sparse still requires __force when casting from the percpu address space, although the documentation [2] allows casts to unsigned long without __force attribute. Found by GCC's named address space checks. There were no changes in the resulting object file. [1] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html#x86-Named-Address-Spaces [2] https://sparse.docs.kernel.org/en/latest/annotations.html#address-space-name Link: https://lkml.kernel.org/r/20240830091104.13049-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09lib: glob.c: added null check for character classAlok Swaminathan
Add null check for character class. Previously, an inverted character class could result in a nul byte being matched and lead to the function reading past the end of the inputted string. Link: https://lkml.kernel.org/r/20240826155709.12383-1-swaminathanalok@gmail.com Signed-off-by: Alok Swaminathan <swaminathanalok@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09selftests: return failure when timestamps can't be reportedJason Xing
When I was trying to modify the tx timestamping feature, I found that running "./txtimestamp -4 -C -L 127.0.0.1" didn't reflect the error: I succeeded to generate timestamp stored in the skb but later failed to report it to the userspace (which means failed to put css into cmsg). It can happen when someone writes buggy codes in __sock_recv_timestamp(), for example. After adding the check so that running ./txtimestamp will reflect the result correctly like this if there is a bug in the reporting phase: protocol: TCP payload: 10 server port: 9000 family: INET test SND USR: 1725458477 s 667997 us (seq=0, len=0) Failed to report timestamps USR: 1725458477 s 718128 us (seq=0, len=0) Failed to report timestamps USR: 1725458477 s 768273 us (seq=0, len=0) Failed to report timestamps USR: 1725458477 s 818416 us (seq=0, len=0) Failed to report timestamps ... In the future, it will help us detect whether the new coming patch has bugs or not. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240905160035.62407-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09mm/codetag: add pgalloc_tag_copy()Yu Zhao
Add pgalloc_tag_copy() to transfer the codetag from the old folio to the new one during migration. This makes original allocation sites persist cross migration rather than lump into the get_new_folio callbacks passed into migrate_pages(), e.g., compaction_alloc(): # echo 1 >/proc/sys/vm/compact_memory # grep compaction_alloc /proc/allocinfo Before this patch: 132968448 32463 mm/compaction.c:1880 func:compaction_alloc After this patch: 0 0 mm/compaction.c:1880 func:compaction_alloc Link: https://lkml.kernel.org/r/20240906042108.1150526-3-yuzhao@google.com Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging") Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm/codetag: fix pgalloc_tag_split()Yu Zhao
The current assumption is that a large folio can only be split into order-0 folios. That is not the case for hugeTLB demotion, nor for THP split: see commit c010d47f107f ("mm: thp: split huge page to any lower order pages"). When a large folio is split into ones of a lower non-zero order, only the new head pages should be tagged. Tagging tail pages can cause imbalanced "calls" counters, since only head pages are untagged by pgalloc_tag_sub() and the "calls" counts on tail pages are leaked, e.g., # echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size # echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote # echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # grep alloc_gigantic_folio /proc/allocinfo Before this patch: 0 549427200 mm/hugetlb.c:1549 func:alloc_gigantic_folio real 0m2.057s user 0m0.000s sys 0m2.051s After this patch: 0 0 mm/hugetlb.c:1549 func:alloc_gigantic_folio real 0m1.711s user 0m0.000s sys 0m1.704s Not tagging tail pages also improves the splitting time, e.g., by about 15% when demoting 1GB hugeTLB folios to 2MB ones, as shown above. Link: https://lkml.kernel.org/r/20240906042108.1150526-2-yuzhao@google.com Fixes: be25d1d4e822 ("mm: create new codetag references during page splitting") Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm/codetag: fix a typoYu Zhao
Link: https://lkml.kernel.org/r/20240906042108.1150526-1-yuzhao@google.com Fixes: 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling") Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm/vmalloc.c: use "high-order" in description non 0-order pagesUladzislau Rezki (Sony)
In many places, in the comments, we use both "higher-order" and "high-order" to describe the non 0-order pages. That is confusing, because a "higher-order" statement does not reflect what it is compared with. Link: https://lkml.kernel.org/r/20240906095049.3486-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Suggested-by: Baoquan He <bhe@redhat.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm/vmalloc.c: use helper function va_size()ZhangPeng
Use helper function va_size() to improve code readability. No functional modification involved. Link: https://lkml.kernel.org/r/20240906102539.3537207-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: replace xa_get_order with xas_get_order where appropriateShakeel Butt
The tracing of invalidation and truncation operations on large files showed that xa_get_order() is among the top functions where kernel spends a lot of CPUs. xa_get_order() needs to traverse the tree to reach the right node for a given index and then extract the order of the entry. However it seems like at many places it is being called within an already happening tree traversal where there is no need to do another traversal. Just use xas_get_order() at those places. Link: https://lkml.kernel.org/r/20240906230512.124643-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09maple_tree: mark three functions as __maybe_unusedLiam R. Howlett
People keep trying to remove three functions that are going to be used in a feature that is being developed. Dropping the functions entirely may end up with people trying to use the bit for other uses, as people have tried in the past. Adding __maybe_unused stops compilers complaining about the unused functions so they can be silently optimised out of the compiled code and people won't try to claim the bit for another use. Link: https://lore.kernel.org/all/20230726080916.17454-2-zhangpeng.00@bytedance.com/ Link: https://lore.kernel.org/all/202408310728.S7EE59BN-lkp@intel.com/ Link: https://lkml.kernel.org/r/20240907021506.4018676-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: clean up mem_cgroup_iter()Kinsey Ho
A clean up to make variable names more clear and to improve code readability. No functional change. Link: https://lkml.kernel.org/r/20240905003058.1859929-6-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: restart if multiple traversals racedKinsey Ho
Currently, if multiple reclaimers raced on the same position, the reclaimers which detect the race will still reclaim from the same memcg. Instead, the reclaimers which detect the race should move on to the next memcg in the hierarchy. So, in the case where multiple traversals race, jump back to the start of the mem_cgroup_iter() function to find the next memcg in the hierarchy to reclaim from. Link: https://lkml.kernel.org/r/20240905003058.1859929-5-kinseyho@google.com Reported-by: syzbot+e099d407346c45275ce9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/000000000000817cf10620e20d33@google.com/ Signed-off-by: Kinsey Ho <kinseyho@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: increment gen # before restarting traversalKinsey Ho
The generation number in struct mem_cgroup_reclaim_iter should be incremented on every round-trip. Currently, it is possible for a concurrent reclaimer to jump in at the end of the hierarchy, causing a traversal restart (resetting the iteration position) without incrementing the generation number. By resetting the position without incrementing the generation, it's possible for another ongoing mem_cgroup_iter() thread to walk the tree twice. Move the traversal restart such that the generation number is incremented before the restart. Link: https://lkml.kernel.org/r/20240905003058.1859929-4-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: don't hold css->refcnt during traversalKinsey Ho
To obtain the pointer to the next memcg position, mem_cgroup_iter() currently holds css->refcnt during memcg traversal only to put css->refcnt at the end of the routine. This isn't necessary as an rcu_read_lock is already held throughout the function. The use of the RCU read lock with css_next_descendant_pre() guarantees that sibling linkage is safe without holding a ref on the passed-in @css. Remove css->refcnt usage during traversal by leveraging RCU. Link: https://lkml.kernel.org/r/20240905003058.1859929-3-kinseyho@google.com Signed-off-by: Kinsey Ho <kinseyho@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCUKinsey Ho
Patch series "Improve mem_cgroup_iter()", v4. Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability. [1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/ This patch (of 5): Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos. The following changes in this patchset rely on this clarification for simplification in memcg iteration code. Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com Suggested-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Kinsey Ho <kinseyho@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm/page_alloc: fix build with CONFIG_UNACCEPTED_MEMORY=nAndrew Morton
When has_unaccepted_memory() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: mm/page_alloc.c:7036:20: error: unused function 'has_unaccepted_memory' [-Werror,-Wunused-function] 7036 | static inline bool has_unaccepted_memory(void) | ^~~~~~~~~~~~~~~~~~~~~ Fix it by removeing the CONFIG_UNACCEPTED_MEMORY=n stub. Link: https://lkml.kernel.org/r/20240905142220.49d93337a0abce5690e515d9@linux-foundation.org Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Closes: https://lkml.kernel.org/r/20240905171553.275054-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: migrate: remove unused includesKefeng Wang
random.h is not needed since commit 6c542ab75714 ("mm/demotion: build demotion targets based on explicit memory tiers"), all functions moved into memory-tiers. nsproxy.h is not needed since commit 228ebcbe634a ("Uninline find_task_by_xxx set of functions"), no nsproxy, we only call find_task_by_vpid() now. hugetlb_cgroup.h is not needed since commit ab5ac90aecf5 ("mm, hugetlb: do not rely on overcommit limit during migration"), move_hugetlb_state() is called and it belongs to hugetlb.h, which is already included. balloon_compaction.h, we have more general movable_operations for non-lru movable page migration, so it could be dropped. memremap.h, userfaultfd_k.h and oom.h are introduced for zone device page migration, but all functions are moved into migrate_device.c, so no needed anymore too. Link: https://lkml.kernel.org/r/20240905152432.626877-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: thp: simplify split_huge_pages_pid()Nanyong Sun
The helper find_get_task_by_vpid() can totally replace the task_struct find logic in split_huge_pages_pid(), so use it to simplify the code. Also delete the needless comments for the helper function name already explains what it's doing here. Link: https://lkml.kernel.org/r/20240905153028.1205128-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>