summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-09-09af_unix: Don't return OOB skb in manage_oob().Kuniyuki Iwashima
syzbot reported use-after-free in unix_stream_recv_urg(). [0] The scenario is 1. send(MSG_OOB) 2. recv(MSG_OOB) -> The consumed OOB remains in recv queue 3. send(MSG_OOB) 4. recv() -> manage_oob() returns the next skb of the consumed OOB -> This is also OOB, but unix_sk(sk)->oob_skb is not cleared 5. recv(MSG_OOB) -> unix_sk(sk)->oob_skb is used but already freed The recent commit 8594d9b85c07 ("af_unix: Don't call skb_get() for OOB skb.") uncovered the issue. If the OOB skb is consumed and the next skb is peeked in manage_oob(), we still need to check if the skb is OOB. Let's do so by falling back to the following checks in manage_oob() and add the test case in selftest. Note that we need to add a similar check for SIOCATMARK. [0]: BUG: KASAN: slab-use-after-free in unix_stream_read_actor+0xa6/0xb0 net/unix/af_unix.c:2959 Read of size 4 at addr ffff8880326abcc4 by task syz-executor178/5235 CPU: 0 UID: 0 PID: 5235 Comm: syz-executor178 Not tainted 6.11.0-rc5-syzkaller-00742-gfbdaffe41adc #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 unix_stream_read_actor+0xa6/0xb0 net/unix/af_unix.c:2959 unix_stream_recv_urg+0x1df/0x320 net/unix/af_unix.c:2640 unix_stream_read_generic+0x2456/0x2520 net/unix/af_unix.c:2778 unix_stream_recvmsg+0x22b/0x2c0 net/unix/af_unix.c:2996 sock_recvmsg_nosec net/socket.c:1046 [inline] sock_recvmsg+0x22f/0x280 net/socket.c:1068 ____sys_recvmsg+0x1db/0x470 net/socket.c:2816 ___sys_recvmsg net/socket.c:2858 [inline] __sys_recvmsg+0x2f0/0x3e0 net/socket.c:2888 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5360d6b4e9 Code: 48 83 c4 28 c3 e8 37 17 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff29b3a458 EFLAGS: 00000246 ORIG_RAX: 000000000000002f RAX: ffffffffffffffda RBX: 00007fff29b3a638 RCX: 00007f5360d6b4e9 RDX: 0000000000002001 RSI: 0000000020000640 RDI: 0000000000000003 RBP: 00007f5360dde610 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007fff29b3a628 R14: 0000000000000001 R15: 0000000000000001 </TASK> Allocated by task 5235: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 unpoison_slab_object mm/kasan/common.c:312 [inline] __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:338 kasan_slab_alloc include/linux/kasan.h:201 [inline] slab_post_alloc_hook mm/slub.c:3988 [inline] slab_alloc_node mm/slub.c:4037 [inline] kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4080 __alloc_skb+0x1c3/0x440 net/core/skbuff.c:667 alloc_skb include/linux/skbuff.h:1320 [inline] alloc_skb_with_frags+0xc3/0x770 net/core/skbuff.c:6528 sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2815 sock_alloc_send_skb include/net/sock.h:1778 [inline] queue_oob+0x108/0x680 net/unix/af_unix.c:2198 unix_stream_sendmsg+0xd24/0xf80 net/unix/af_unix.c:2351 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597 ___sys_sendmsg net/socket.c:2651 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2680 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 5235: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256 kasan_slab_free include/linux/kasan.h:184 [inline] slab_free_hook mm/slub.c:2252 [inline] slab_free mm/slub.c:4473 [inline] kmem_cache_free+0x145/0x350 mm/slub.c:4548 unix_stream_read_generic+0x1ef6/0x2520 net/unix/af_unix.c:2917 unix_stream_recvmsg+0x22b/0x2c0 net/unix/af_unix.c:2996 sock_recvmsg_nosec net/socket.c:1046 [inline] sock_recvmsg+0x22f/0x280 net/socket.c:1068 __sys_recvfrom+0x256/0x3e0 net/socket.c:2255 __do_sys_recvfrom net/socket.c:2273 [inline] __se_sys_recvfrom net/socket.c:2269 [inline] __x64_sys_recvfrom+0xde/0x100 net/socket.c:2269 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff8880326abc80 which belongs to the cache skbuff_head_cache of size 240 The buggy address is located 68 bytes inside of freed 240-byte region [ffff8880326abc80, ffff8880326abd70) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x326ab ksm flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff) page_type: 0xfdffffff(slab) raw: 00fff00000000000 ffff88801eaee780 ffffea0000b7dc80 dead000000000003 raw: 0000000000000000 00000000800c000c 00000001fdffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 4686, tgid 4686 (udevadm), ts 32357469485, free_ts 28829011109 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1493 prep_new_page mm/page_alloc.c:1501 [inline] get_page_from_freelist+0x2e4c/0x2f10 mm/page_alloc.c:3439 __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4695 __alloc_pages_node_noprof include/linux/gfp.h:269 [inline] alloc_pages_node_noprof include/linux/gfp.h:296 [inline] alloc_slab_page+0x5f/0x120 mm/slub.c:2321 allocate_slab+0x5a/0x2f0 mm/slub.c:2484 new_slab mm/slub.c:2537 [inline] ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3723 __slab_alloc+0x58/0xa0 mm/slub.c:3813 __slab_alloc_node mm/slub.c:3866 [inline] slab_alloc_node mm/slub.c:4025 [inline] kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4080 __alloc_skb+0x1c3/0x440 net/core/skbuff.c:667 alloc_skb include/linux/skbuff.h:1320 [inline] alloc_uevent_skb+0x74/0x230 lib/kobject_uevent.c:289 uevent_net_broadcast_untagged lib/kobject_uevent.c:326 [inline] kobject_uevent_net_broadcast+0x2fd/0x580 lib/kobject_uevent.c:410 kobject_uevent_env+0x57d/0x8e0 lib/kobject_uevent.c:608 kobject_synth_uevent+0x4ef/0xae0 lib/kobject_uevent.c:207 uevent_store+0x4b/0x70 drivers/base/bus.c:633 kernfs_fop_write_iter+0x3a1/0x500 fs/kernfs/file.c:334 new_sync_write fs/read_write.c:497 [inline] vfs_write+0xa72/0xc90 fs/read_write.c:590 page last free pid 1 tgid 1 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1094 [inline] free_unref_page+0xd22/0xea0 mm/page_alloc.c:2612 kasan_depopulate_vmalloc_pte+0x74/0x90 mm/kasan/shadow.c:408 apply_to_pte_range mm/memory.c:2797 [inline] apply_to_pmd_range mm/memory.c:2841 [inline] apply_to_pud_range mm/memory.c:2877 [inline] apply_to_p4d_range mm/memory.c:2913 [inline] __apply_to_page_range+0x8a8/0xe50 mm/memory.c:2947 kasan_release_vmalloc+0x9a/0xb0 mm/kasan/shadow.c:525 purge_vmap_node+0x3e3/0x770 mm/vmalloc.c:2208 __purge_vmap_area_lazy+0x708/0xae0 mm/vmalloc.c:2290 _vm_unmap_aliases+0x79d/0x840 mm/vmalloc.c:2885 change_page_attr_set_clr+0x2fe/0xdb0 arch/x86/mm/pat/set_memory.c:1881 change_page_attr_set arch/x86/mm/pat/set_memory.c:1922 [inline] set_memory_nx+0xf2/0x130 arch/x86/mm/pat/set_memory.c:2110 free_init_pages arch/x86/mm/init.c:924 [inline] free_kernel_image_pages arch/x86/mm/init.c:943 [inline] free_initmem+0x79/0x110 arch/x86/mm/init.c:970 kernel_init+0x31/0x2b0 init/main.c:1476 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Memory state around the buggy address: ffff8880326abb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880326abc00: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc >ffff8880326abc80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880326abd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc ffff8880326abd80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb Fixes: 93c99f21db36 ("af_unix: Don't stop recv(MSG_DONTWAIT) if consumed OOB skb is at the head.") Reported-by: syzbot+8811381d455e3e9ec788@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8811381d455e3e9ec788 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240905193240.17565-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09af_unix: Move spin_lock() in manage_oob().Kuniyuki Iwashima
When OOB skb has been already consumed, manage_oob() returns the next skb if exists. In such a case, we need to fall back to the else branch below. Then, we want to keep holding spin_lock(&sk->sk_receive_queue.lock). Let's move it out of if-else branch and add lightweight check before spin_lock() for major use cases without OOB skb. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240905193240.17565-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09af_unix: Rename unlinked_skb in manage_oob().Kuniyuki Iwashima
When OOB skb has been already consumed, manage_oob() returns the next skb if exists. In such a case, we need to fall back to the else branch below. Then, we need to keep two skbs and free them later with consume_skb() and kfree_skb(). Let's rename unlinked_skb accordingly. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240905193240.17565-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09af_unix: Remove single nest in manage_oob().Kuniyuki Iwashima
This is a prep for the later fix. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240905193240.17565-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09Merge tag 'linux-can-next-for-6.12-20240909' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2024-09-09 The first patch is by Rob Herring and simplifies the DT parsing in the cc770 driver. The next 2 patches both target the rockchip_canfd driver added in the last pull request to net-next. The first one is by Nathan Chancellor and fixes the return type of rkcanfd_start_xmit(), the second one is by me and fixes a 64 bit division on 32 bit platforms in rkcanfd_timestamp_init(). * tag 'linux-can-next-for-6.12-20240909' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: rockchip_canfd: rkcanfd_timestamp_init(): fix 64 bit division on 32 bit platforms can: rockchip_canfd: fix return type of rkcanfd_start_xmit() net: can: cc770: Simplify parsing DT properties ==================== Link: https://patch.msgid.link/20240909063657.2287493-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09net: remove dev_pick_tx_cpu_id()Jakub Kicinski
dev_pick_tx_cpu_id() has been introduced with two users by commit a4ea8a3dacc3 ("net: Add generic ndo_select_queue functions"). The use in AF_PACKET has been removed in 2019 by commit b71b5837f871 ("packet: rework packet_pick_tx_queue() to use common code selection") The other user was a Netlogic XLP driver, removed in 2021 by commit 47ac6f567c28 ("staging: Remove Netlogic XLP network driver"). It's relatively unlikely that any modern driver will need an .ndo_select_queue implementation which picks purely based on CPU ID and skips XPS, delete dev_pick_tx_cpu_id() Found by code inspection. Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240906161059.715546-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09Merge branch 'selftests-mptcp-add-time-per-subtests-in-tap-output'Jakub Kicinski
Matthieu Baerts says: ==================== selftests: mptcp: add time per subtests in TAP output Patches here add 'time=<N>ms' in the diagnostic data of the TAP output, e.g. ok 1 - pm_netlink: defaults addr list # time=9ms This addition is useful to quickly identify which subtests are taking a longer time than the others, or more than expected. Note that there are no specific formats to follow to show this time according to the TAP 13, TAP 14 and KTAP specifications, but we follow the format being parsed by NIPA [1]. Patch 1 modifies mptcp_lib.sh to add this support to all MPTCP selftests. Patch 2 removes the now duplicated info in mptcp_connect.sh Patch 3 slightly improves the precision of the first subtests in all MPTCP subtests. Patches 4 and 5 remove duplicated spaces in TAP output, for the TAP parsers that cannot handle them properly. v1: https://lore.kernel.org/20240902-net-next-mptcp-ksft-subtest-time-v1-0-f1ed499a11b1@kernel.org Link: https://github.com/linux-netdev/nipa/pull/36 ==================== Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-0-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: connect: remove duplicated spaces in TAP outputMatthieu Baerts (NGI0)
It is nice to have a visual alignment in the test output to present the different results, but it makes less sense in the TAP output that is there for computers. It sounds then better to remove the duplicated whitespaces in the TAP output, also because it can cause some issues with TAP parsers expecting only one space around the directive delimiter (#). While at it, change the variable name (result_msg) to something more explicit. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-5-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: diag: remove trailing whitespaceMatthieu Baerts (NGI0)
It doesn't need to be there, and it can cause some issues with TAP parsers expecting only one space around the directive delimiter (#). Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-4-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: reset the last TS before the first testMatthieu Baerts (NGI0)
Just to slightly improve the precision of the duration of the first test. In mptcp_join.sh, the last append_prev_results is now done as soon as the last test is over: this will add the last result in the list, and get a more precise time for this last test. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-3-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: connect: remote time in TAP outputMatthieu Baerts (NGI0)
It is now added by the MPTCP lib automatically, see the parent commit. The time in the TAP output might be slightly different from the one displayed before, but that's OK. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-2-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: mptcp: lib: add time per subtests in TAP outputMatthieu Baerts (NGI0)
It adds 'time=<N>ms' in the diagnostic data of the TAP output, e.g. ok 1 - pm_netlink: defaults addr list # time=9ms This addition is useful to quickly identify which subtests are taking a longer time than the others, or more than expected. Note that there are no specific formats to follow to show this time according to the TAP 13 [1], TAP 14 [2] and KTAP [3] specifications. Let's then define this one here. Link: https://testanything.org/tap-version-13-specification.html [1] Link: https://testanything.org/tap-version-14-specification.html [2] Link: https://docs.kernel.org/dev-tools/ktap.html [3] Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-1-31d5ee4f3bdf@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09selftests: return failure when timestamps can't be reportedJason Xing
When I was trying to modify the tx timestamping feature, I found that running "./txtimestamp -4 -C -L 127.0.0.1" didn't reflect the error: I succeeded to generate timestamp stored in the skb but later failed to report it to the userspace (which means failed to put css into cmsg). It can happen when someone writes buggy codes in __sock_recv_timestamp(), for example. After adding the check so that running ./txtimestamp will reflect the result correctly like this if there is a bug in the reporting phase: protocol: TCP payload: 10 server port: 9000 family: INET test SND USR: 1725458477 s 667997 us (seq=0, len=0) Failed to report timestamps USR: 1725458477 s 718128 us (seq=0, len=0) Failed to report timestamps USR: 1725458477 s 768273 us (seq=0, len=0) Failed to report timestamps USR: 1725458477 s 818416 us (seq=0, len=0) Failed to report timestamps ... In the future, it will help us detect whether the new coming patch has bugs or not. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240905160035.62407-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-09clk: qcom: clk-alpha-pll: Simplify the zonda_pll_adjust_l_val()Satya Priya Kakitapalli
In zonda_pll_adjust_l_val() replace the divide operator with comparison operator to fix below build error and smatch warning. drivers/clk/qcom/clk-alpha-pll.o: In function `clk_zonda_pll_set_rate': clk-alpha-pll.c:(.text+0x45dc): undefined reference to `__aeabi_uldivmod' smatch warnings: drivers/clk/qcom/clk-alpha-pll.c:2129 zonda_pll_adjust_l_val() warn: replace divide condition '(remainder * 2) / prate' with '(remainder * 2) >= prate' Fixes: f4973130d255 ("clk: qcom: clk-alpha-pll: Update set_rate for Zonda PLL") Reported-by: Jon Hunter <jonathanh@nvidia.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202408110724.8pqbpDiD-lkp@intel.com/ Signed-off-by: Satya Priya Kakitapalli <quic_skakitap@quicinc.com> Link: https://lore.kernel.org/r/20240906113905.641336-1-quic_skakitap@quicinc.com Reviewed-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org> Tested-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-09-09ASoC: mt8365: Fix -Werror buildsMark Brown
Merge series from Mark Brown <broonie@kernel.org>: Nathan reported that the newly added mt8365 drivers were causing a number of warnings which break -Werror builds, these were only visible on arm64 since the drivers did not have COMPILE_TEST enabled. Fix this and some other minor stuff I noticed while doing so.
2024-09-09ASoC: loongson: Simplify code formattingMark Brown
Merge series from Binbin Zhou <zhoubinbin@loongson.cn>: This patchset attempts to improve code readability by simplifying code formatting. No functional changes.
2024-09-09idpf: enable WB_ON_ITRJoshua Hay
Tell hardware to write back completed descriptors even when interrupts are disabled. Otherwise, descriptors might not be written back until the hardware can flush a full cacheline of descriptors. This can cause unnecessary delays when traffic is light (or even trigger Tx queue timeout). The example scenario to reproduce the Tx timeout if the fix is not applied: - configure at least 2 Tx queues to be assigned to the same q_vector, - generate a huge Tx traffic on the first Tx queue - try to send a few packets using the second Tx queue. In such a case Tx timeout will appear on the second Tx queue because no completion descriptors are written back for that queue while interrupts are disabled due to NAPI polling. Fixes: c2d548cad150 ("idpf: add TX splitq napi poll support") Fixes: a5ab9ee0df0b ("idpf: add singleq start_xmit and napi poll") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09idpf: fix netdev Tx queue stop/wakeMichal Kubiak
netif_txq_maybe_stop() returns -1, 0, or 1, while idpf_tx_maybe_stop_common() says it returns 0 or -EBUSY. As a result, there sometimes are Tx queue timeout warnings despite that the queue is empty or there is at least enough space to restart it. Make idpf_tx_maybe_stop_common() inline and returning true or false, handling the return of netif_txq_maybe_stop() properly. Use a correct goto in idpf_tx_maybe_stop_splitq() to avoid stopping the queue or incrementing the stops counter twice. Fixes: 6818c4d5b3c2 ("idpf: add splitq start_xmit") Fixes: a5ab9ee0df0b ("idpf: add singleq start_xmit and napi poll") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09idpf: refactor Tx completion routinesJoshua Hay
Add a mechanism to guard against stashing partial packets into the hash table to make the driver more robust, with more efficient decision making when cleaning. Don't stash partial packets. This can happen when an RE (Report Event) completion is received in flow scheduling mode, or when an out of order RS (Report Status) completion is received. The first buffer with the skb is stashed, but some or all of its frags are not because the stack is out of reserve buffers. This leaves the ring in a weird state since the frags are still on the ring. Use the field libeth_sqe::nr_frags to track the number of fragments/tx_bufs representing the packet. The clean routines check to make sure there are enough reserve buffers on the stack before stashing any part of the packet. If there are not, next_to_clean is left pointing to the first buffer of the packet that failed to be stashed. This leaves the whole packet on the ring, and the next time around, cleaning will start from this packet. An RS completion is still expected for this packet in either case. So instead of being cleaned from the hash table, it will be cleaned from the ring directly. This should all still be fine since the DESC_UNUSED and BUFS_UNUSED will reflect the state of the ring. If we ever fall below the thresholds, the TxQ will still be stopped, giving the completion queue time to catch up. This may lead to stopping the queue more frequently, but it guarantees the Tx ring will always be in a good state. Also, always use the idpf_tx_splitq_clean function to clean descriptors, i.e. use it from clean_buf_ring as well. This way we avoid duplicating the logic and make sure we're using the same reserve buffers guard rail. This does require a switch from the s16 next_to_clean overflow descriptor ring wrap calculation to u16 and the normal ring size check. Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09netdevice: add netdev_tx_reset_subqueue() shorthandAlexander Lobakin
Add a shorthand similar to other net*_subqueue() helpers for resetting the queue by its index w/o obtaining &netdev_tx_queue beforehand manually. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09idpf: convert to libeth Tx buffer completionAlexander Lobakin
&idpf_tx_buffer is almost identical to the previous generations, as well as the way it's handled. Moreover, relying on dma_unmap_addr() and !!buf->skb instead of explicit defining of buffer's type was never good. Use the newly added libeth helpers to do it properly and reduce the copy-paste around the Tx code. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09libeth: add Tx buffer completion helpersAlexander Lobakin
Software-side Tx buffers for storing DMA, frame size, skb pointers etc. are pretty much generic and every driver defines them the same way. The same can be said for software Tx completions -- same napi_consume_skb()s and all that... Add a couple simple wrappers for doing that to stop repeating the old tale at least within the Intel code. Drivers are free to use 'priv' member at the end of the structure. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09regulator: tps6287x: Constify struct regulator_descChristophe JAILLET
'struct regulator_desc' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 4974 736 16 5726 165e drivers/regulator/tps6287x-regulator.o After: ===== text data bss dec hex filename 5294 416 16 5726 165e drivers/regulator/tps6287x-regulator.o -- Compile tested only Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://patch.msgid.link/7727e493490d37775a653905dfe0cc1d8478f8e0.1725908163.git.christophe.jaillet@wanadoo.fr Signed-off-by: Mark Brown <broonie@kernel.org>
2024-09-09regulator: wm8400: Constify struct regulator_descChristophe JAILLET
'struct regulator_desc' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 4419 2512 0 6931 1b13 drivers/regulator/wm8400-regulator.o After: ===== text data bss dec hex filename 6307 624 0 6931 1b13 drivers/regulator/wm8400-regulator.o -- Compile tested only Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://patch.msgid.link/fde33ecfd9bbdbdc1da1620c9f3b1b7a72f9d805.1725906876.git.christophe.jaillet@wanadoo.fr Signed-off-by: Mark Brown <broonie@kernel.org>
2024-09-09tracing: Drop unused helper function to fix the buildAndy Shevchenko
A helper function defined but not used. This, in particular, prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function] 2229 | static inline int run_tracer_selftest(struct tracer *type) | ^~~~~~~~~~~~~~~~~~~ Fix this by dropping unused functions. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/20240909105314.928302-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09tracing/osnoise: Fix build when timerlat is not enabledSteven Rostedt
To fix some critical section races, the interface_lock was added to a few locations. One of those locations was above where the interface_lock was declared, so the declaration was moved up before that usage. Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER ifdef block. As the interface_lock is used outside that config, this broke the build when CONFIG_OSNOISE_TRACER was enabled but CONFIG_TIMERLAT_TRACER was not. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Helena Anna" <helena.anna.dubel@intel.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Link: https://lore.kernel.org/20240909103231.23a289e2@gandalf.local.home Fixes: e6a53481da29 ("tracing/timerlat: Only clear timer if a kthread exists") Reported-by: "Bityutskiy, Artem" <artem.bityutskiy@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09net/mlx5: Fix bridge mode operations when there are no VFsBenjamin Poirier
Currently, trying to set the bridge mode attribute when numvfs=0 leads to a crash: bridge link set dev eth2 hwmode vepa [ 168.967392] BUG: kernel NULL pointer dereference, address: 0000000000000030 [...] [ 168.969989] RIP: 0010:mlx5_add_flow_rules+0x1f/0x300 [mlx5_core] [...] [ 168.976037] Call Trace: [ 168.976188] <TASK> [ 168.978620] _mlx5_eswitch_set_vepa_locked+0x113/0x230 [mlx5_core] [ 168.979074] mlx5_eswitch_set_vepa+0x7f/0xa0 [mlx5_core] [ 168.979471] rtnl_bridge_setlink+0xe9/0x1f0 [ 168.979714] rtnetlink_rcv_msg+0x159/0x400 [ 168.980451] netlink_rcv_skb+0x54/0x100 [ 168.980675] netlink_unicast+0x241/0x360 [ 168.980918] netlink_sendmsg+0x1f6/0x430 [ 168.981162] ____sys_sendmsg+0x3bb/0x3f0 [ 168.982155] ___sys_sendmsg+0x88/0xd0 [ 168.985036] __sys_sendmsg+0x59/0xa0 [ 168.985477] do_syscall_64+0x79/0x150 [ 168.987273] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 168.987773] RIP: 0033:0x7f8f7950f917 (esw->fdb_table.legacy.vepa_fdb is null) The bridge mode is only relevant when there are multiple functions per port. Therefore, prevent setting and getting this setting when there are no VFs. Note that after this change, there are no settings to change on the PF interface using `bridge link` when there are no VFs, so the interface no longer appears in the `bridge link` output. Fixes: 4b89251de024 ("net/mlx5: Support ndo bridge_setlink and getlink") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Verify support for scheduling element and TSAR typeCarolina Jubran
Before creating a scheduling element in a NIC or E-Switch scheduler, ensure that the requested element type is supported. If the element is of type Transmit Scheduling Arbiter (TSAR), also verify that the specific TSAR type is supported. Fixes: 214baf22870c ("net/mlx5e: Support HTB offload") Fixes: 85c5f7c9200e ("net/mlx5: E-switch, Create QoS on demand") Fixes: 0fe132eac38c ("net/mlx5: E-switch, Allow to add vports to rate groups") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Add missing masks and QoS bit masks for scheduling elementsCarolina Jubran
Add the missing masks for supported element types and Transmit Scheduling Arbiter (TSAR) types in scheduling elements. Also, add the corresponding bit masks for these types in the QoS capabilities of a NIC scheduler. Fixes: 214baf22870c ("net/mlx5e: Support HTB offload") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Explicitly set scheduling element and TSAR typeCarolina Jubran
Ensure the scheduling element type and TSAR type are explicitly initialized in the QoS rate group creation. This prevents potential issues due to default values. Fixes: 1ae258f8b343 ("net/mlx5: E-switch, Introduce rate limiting groups API") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5e: Add missing link mode to ptys2ext_ethtool_mapShahar Shitrit
Add MLX5E_400GAUI_8_400GBASE_CR8 to the extended modes in ptys2ext_ethtool_table, since it was missing. Fixes: 6a897372417e ("net/mlx5: ethtool, Add ethtool support for 50Gbps per lane link modes") Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5e: Add missing link modes to ptys2ethtool_mapShahar Shitrit
Add MLX5E_1000BASE_T and MLX5E_100BASE_TX to the legacy modes in ptys2legacy_ethtool_table, since they were missing. Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API") Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Update the list of the PCI supported devicesMaher Sanalla
Add the upcoming ConnectX-9 device ID to the table of supported PCI device IDs. Fixes: f908a35b2218 ("net/mlx5: Update the list of the PCI supported devices") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added API and enabled HWS supportYevgeny Kliteynik
Enabling HWS support in the mlx5 driver: - added HWS API header - added HWS files in the mlx5 driver makefile - added kconfig flag that enables HWS compilation Reviewed-by: Erez Shitrit <erezsh@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added send engine and context handlingYevgeny Kliteynik
Added implementation of send engine and handling of HWS context. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added debug dump and internal headersYevgeny Kliteynik
Added debug dump of the existing HWS state, and all the required internal definitions. To dump the HWS state, cat the following debugfs node: cat /sys/kernel/debug/mlx5/<PCI>/steering/fdb/ctx_<ctx_id> Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added backward-compatible API handlingYevgeny Kliteynik
Added implementation of backward-compatible (BWC) steering API. Native HWS API is very different from SWS API: - SWS is synchronous (rule creation/deletion API call returns when the rule is created/deleted), while HWS is asynchronous (it requires polling for completion in order to know when the rule creation/deletion happened) - SWS manages its own memory (it allocates/frees all the needed memory for steering rules, while HWS requires the rules memory to be allocated/freed outside the API In order to make HWS fit the existing fs-core steering API paradigm, this patch adds implementation of backward-compatible (BWC) steering API that has the bahaviour similar to SWS: among others, it encompasses all the rules' memory management and completion polling, presenting the usual synchronous API for the upper layer. A user that wishes to utilize the full speed potential of HWS can call the HWS async API and have rule insertion/deletion batching, lower memory management overhead, and lower CPU utilization. Such approach will be taken by the future Connection Tracking. Note that BWC steering doesn't support yet rules that require more than one match STE - complex rules. This support will be added later on. Reviewed-by: Erez Shitrit <erezsh@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added memory management handlingYevgeny Kliteynik
Added object pools and buddy allocator functionality. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added vport handlingYevgeny Kliteynik
Vport is a virtual eswitch port that is associated with its virtual function (VF), physical function (PF) or sub-function (SF). This patch adds handling of vports in HWS. Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added modify header pattern and args handlingYevgeny Kliteynik
Packet headers/metadta manipulations are split into two parts: - Header Modify Pattern: an object that describes which fields will be modified and in which way - Header Modify Argument: an object that provides the values to be used for header modification Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added FW commands handlingYevgeny Kliteynik
This patch adds implementation of FW object handling, such as creation/destruction, modification, and querying. Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added matchers functionalityYevgeny Kliteynik
Matcher object encompasses all the building blocks that are needed in order to perform flow steering of a given flow: - flow table that serves as entering point of this matcher - Rule Table Context (RTC) objects to hold ll the Steering Table Entries (STEs), both for matching the flow and for performing actions - rules that describe the set of matching parameters for a flow and actions to perform in case of a hit. This patch adds implementation of matchers handling in HWS. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added definers handlingYevgeny Kliteynik
The Match Definer combines packet fields and a mask, creating a key which can be used for packet matching during steering flow processing. This patch adds handling of definer objects in HWS. Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added rules handlingYevgeny Kliteynik
Steering rule is a concept that includes match parameters for a flow, and actions to perform on the flows that match these parameters. This patch adds rules handling part of HW Steering. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added tables handlingYevgeny Kliteynik
Flow tables are SW objects that are comprised of list of matchers, that in turn define the properties of a flow to match on and set of actions to perform on the flows in case of match hit or miss. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added actions handlingYevgeny Kliteynik
When a packet matches a flow, the actions specified for the flow are applied. The supported actions include (but not limited to) the following: - drop: packet processing is stopped - go to vport: packet is forwarded to a specified vport - go to flow table: packet is forwarded to a specified table and processing continues there - push/pop vlan: add/remove vlan header respectively to/from the packet - insert/remove header: add/remove a user-defined header to/from the packet - counter: count the packet bytes in the specified counter - tag: tag the matching flow with a provided tag value - reformat: change the packet format by adding or removing some of its headers - modify header: modify the value of the packet headers with set/add/copy ops - range: match packet on range of values Reviewed-by: Erez Shitrit <erezsh@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Added missing definitions in preparation for HW SteeringYevgeny Kliteynik
As part of preparation for HWS, added missing definitions in qp.h and fs_core.h: - FS_FT_FDB_RX/TX table types that are used by HWS in addition to an existing FS_FT_FDB - MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE that is used by HWS to require fence in WQE Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Added missing mlx5_ifc definition for HW SteeringYevgeny Kliteynik
Add mlx5_ifc definitions that are required for HWS support. Note that due to change in the mlx5_ifc_flow_table_context_bits structure that now includes both SWS and HWS bits in a union, this patch also includes small change in one of SWS files that was required for compilation. Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09igb: Always call igb_xdp_ring_update_tail() under Tx lockSriram Yagnaraman
Always call igb_xdp_ring_update_tail() under __netif_tx_lock, add a comment and lockdep assert to indicate that. This is needed to share the same TX ring between XDP, XSK and slow paths. Furthermore, the current XDP implementation is racy on tail updates. Fixes: 9cbc948b5a20 ("igb: add XDP support") Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech> [Kurt: Add lockdep assert and fixes tag] Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09ice: fix VSI lists confusion when adding VLANsMichal Schmidt
The description of function ice_find_vsi_list_entry says: Search VSI list map with VSI count 1 However, since the blamed commit (see Fixes below), the function no longer checks vsi_count. This causes a problem in ice_add_vlan_internal, where the decision to share VSI lists between filter rules relies on the vsi_count of the found existing VSI list being 1. The reproducing steps: 1. Have a PF and two VFs. There will be a filter rule for VLAN 0, referring to a VSI list containing VSIs: 0 (PF), 2 (VF#0), 3 (VF#1). 2. Add VLAN 1234 to VF#0. ice will make the wrong decision to share the VSI list with the new rule. The wrong behavior may not be immediately apparent, but it can be observed with debug prints. 3. Add VLAN 1234 to VF#1. ice will unshare the VSI list for the VLAN 1234 rule. Due to the earlier bad decision, the newly created VSI list will contain VSIs 0 (PF) and 3 (VF#1), instead of expected 2 (VF#0) and 3 (VF#1). 4. Try pinging a network peer over the VLAN interface on VF#0. This fails. Reproducer script at: https://gitlab.com/mschmidt2/repro/-/blob/master/RHEL-46814/test-vlan-vsi-list-confusion.sh Commented debug trace: https://gitlab.com/mschmidt2/repro/-/blob/master/RHEL-46814/ice-vlan-vsi-lists-debug.txt Patch adding the debug prints: https://gitlab.com/mschmidt2/linux/-/commit/f8a8814623944a45091a77c6094c40bfe726bfdb (Unsafe, by the way. Lacks rule_lock when dumping in ice_remove_vlan.) Michal Swiatkowski added to the explanation that the bug is caused by reusing a VSI list created for VLAN 0. All created VFs' VSIs are added to VLAN 0 filter. When a non-zero VLAN is created on a VF which is already in VLAN 0 (normal case), the VSI list from VLAN 0 is reused. It leads to a problem because all VFs (VSIs to be specific) that are subscribed to VLAN 0 will now receive a new VLAN tag traffic. This is one bug, another is the bug described above. Removing filters from one VF will remove VLAN filter from the previous VF. It happens a VF is reset. Example: - creation of 3 VFs - we have VSI list (used for VLAN 0) [0 (pf), 2 (vf1), 3 (vf2), 4 (vf3)] - we are adding VLAN 100 on VF1, we are reusing the previous list because 2 is there - VLAN traffic works fine, but VLAN 100 tagged traffic can be received on all VSIs from the list (for example broadcast or unicast) - trust is turning on VF2, VF2 is resetting, all filters from VF2 are removed; the VLAN 100 filter is also removed because 3 is on the list - VLAN traffic to VF1 isn't working anymore, there is a need to recreate VLAN interface to readd VLAN filter One thing I'm not certain about is the implications for the LAG feature, which is another caller of ice_find_vsi_list_entry. I don't have a LAG-capable card at hand to test. Fixes: 23ccae5ce15f ("ice: changes to the interface with the HW and FW for SRIOV_VF+LAG") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Dave Ertman <David.m.ertman@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>