summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-02-02net: dsa: tag_qca: move define to include linux/dsaAnsuel Smith
Move tag_qca define to include dir linux/dsa as the qca8k require access to the tagger define to support in-band mdio read/write using ethernet packet. Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02net: dsa: tag_qca: convert to FIELD macroAnsuel Smith
Convert driver to FIELD macro to drop redundant define. Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02net: dsa: replay master state events in dsa_tree_{setup,teardown}_masterVladimir Oltean
In order for switch driver to be able to make simple and reliable use of the master tracking operations, they must also be notified of the initial state of the DSA master, not just of the changes. This is because they might enable certain features only during the time when they know that the DSA master is up and running. Therefore, this change explicitly checks the state of the DSA master under the same rtnl_mutex as we were holding during the dsa_master_setup() and dsa_master_teardown() call. The idea being that if the DSA master became operational in between the moment in which it became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment when we checked for the master being up, there is a chance that we would emit a ->master_state_change() call with no actual state change. We need to avoid that by serializing the concurrent netdevice event with us. If the netdevice event started before, we force it to finish before we begin, because we take rtnl_lock before making netdev_uses_dsa() return true. So we also handle that early event and do nothing on it. Similarly, if the dev_open() attempt is concurrent with us, it will attempt to take the rtnl_mutex, but we're holding it. We'll see that the master flag IFF_UP isn't set, then when we release the rtnl_mutex we'll process the NETDEV_UP notifier. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02net: dsa: provide switch operations for tracking the master stateVladimir Oltean
Certain drivers may need to send management traffic to the switch for things like register access, FDB dump, etc, to accelerate what their slow bus (SPI, I2C, MDIO) can already do. Ethernet is faster (especially in bulk transactions) but is also more unreliable, since the user may decide to bring the DSA master down (or not bring it up), therefore severing the link between the host and the attached switch. Drivers needing Ethernet-based register access already should have fallback logic to the slow bus if the Ethernet method fails, but that fallback may be based on a timeout, and the I/O to the switch may slow down to a halt if the master is down, because every Ethernet packet will have to time out. The driver also doesn't have the option to turn off Ethernet-based I/O momentarily, because it wouldn't know when to turn it back on. Which is where this change comes in. By tracking NETDEV_CHANGE, NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know the exact interval of time during which this interface is reliably available for traffic. Provide this information to switches so they can use it as they wish. An helper is added dsa_port_master_is_operational() to check if a master port is operational. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-01tcp: fix mem under-charging with zerocopy sendmsg()Eric Dumazet
We got reports of following warning in inet_sock_destruct() WARN_ON(sk_forward_alloc_get(sk)); Whenever we add a non zero-copy fragment to a pure zerocopy skb, we have to anticipate that whole skb->truesize will be uncharged when skb is finally freed. skb->data_len is the payload length. But the memory truesize estimated by __zerocopy_sg_from_iter() is page aligned. Fixes: 9b65b17db723 ("net: avoid double accounting for pure zerocopy skbs") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Talal Ahmad <talalahmad@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20220201065254.680532-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-01af_packet: fix data-race in packet_setsockopt / packet_setsockoptEric Dumazet
When packet_setsockopt( PACKET_FANOUT_DATA ) reads po->fanout, no lock is held, meaning that another thread can change po->fanout. Given that po->fanout can only be set once during the socket lifetime (it is only cleared from fanout_release()), we can use READ_ONCE()/WRITE_ONCE() to document the race. BUG: KCSAN: data-race in packet_setsockopt / packet_setsockopt write to 0xffff88813ae8e300 of 8 bytes by task 14653 on cpu 0: fanout_add net/packet/af_packet.c:1791 [inline] packet_setsockopt+0x22fe/0x24a0 net/packet/af_packet.c:3931 __sys_setsockopt+0x209/0x2a0 net/socket.c:2180 __do_sys_setsockopt net/socket.c:2191 [inline] __se_sys_setsockopt net/socket.c:2188 [inline] __x64_sys_setsockopt+0x62/0x70 net/socket.c:2188 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88813ae8e300 of 8 bytes by task 14654 on cpu 1: packet_setsockopt+0x691/0x24a0 net/packet/af_packet.c:3935 __sys_setsockopt+0x209/0x2a0 net/socket.c:2180 __do_sys_setsockopt net/socket.c:2191 [inline] __se_sys_setsockopt net/socket.c:2188 [inline] __x64_sys_setsockopt+0x62/0x70 net/socket.c:2188 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000000000000000 -> 0xffff888106f8c000 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 14654 Comm: syz-executor.3 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 47dceb8ecdc1 ("packet: add classic BPF fanout mode") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220201022358.330621-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-01rtnetlink: make sure to refresh master_dev/m_ops in __rtnl_newlink()Eric Dumazet
While looking at one unrelated syzbot bug, I found the replay logic in __rtnl_newlink() to potentially trigger use-after-free. It is better to clear master_dev and m_ops inside the loop, in case we have to replay it. Fixes: ba7d49b1f0f8 ("rtnetlink: provide api for getting and setting slave info") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20220201012106.216495-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-01net: sched: fix use-after-free in tc_new_tfilter()Eric Dumazet
Whenever tc_new_tfilter() jumps back to replay: label, we need to make sure @q and @chain local variables are cleared again, or risk use-after-free as in [1] For consistency, apply the same fix in tc_ctl_chain() BUG: KASAN: use-after-free in mini_qdisc_pair_swap+0x1b9/0x1f0 net/sched/sch_generic.c:1581 Write of size 8 at addr ffff8880985c4b08 by task syz-executor.4/1945 CPU: 0 PID: 1945 Comm: syz-executor.4 Not tainted 5.17.0-rc1-syzkaller-00495-gff58831fa02d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 mini_qdisc_pair_swap+0x1b9/0x1f0 net/sched/sch_generic.c:1581 tcf_chain_head_change_item net/sched/cls_api.c:372 [inline] tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:386 tcf_chain_tp_insert net/sched/cls_api.c:1657 [inline] tcf_chain_tp_insert_unique net/sched/cls_api.c:1707 [inline] tc_new_tfilter+0x1e67/0x2350 net/sched/cls_api.c:2086 rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:5583 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x331/0x810 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmmsg+0x195/0x470 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2647172059 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f2645aa5168 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00007f2647285100 RCX: 00007f2647172059 RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000006 RBP: 00007f26471cc08d R08: 0000000000000000 R09: 0000000000000000 R10: 9e00000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fffb3f7f02f R14: 00007f2645aa5300 R15: 0000000000022000 </TASK> Allocated by task 1944: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:45 [inline] set_alloc_info mm/kasan/common.c:436 [inline] ____kasan_kmalloc mm/kasan/common.c:515 [inline] ____kasan_kmalloc mm/kasan/common.c:474 [inline] __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:524 kmalloc_node include/linux/slab.h:604 [inline] kzalloc_node include/linux/slab.h:726 [inline] qdisc_alloc+0xac/0xa10 net/sched/sch_generic.c:941 qdisc_create.constprop.0+0xce/0x10f0 net/sched/sch_api.c:1211 tc_modify_qdisc+0x4c5/0x1980 net/sched/sch_api.c:1660 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5592 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x331/0x810 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmmsg+0x195/0x470 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 3609: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:45 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free+0x130/0x160 mm/kasan/common.c:328 kasan_slab_free include/linux/kasan.h:236 [inline] slab_free_hook mm/slub.c:1728 [inline] slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1754 slab_free mm/slub.c:3509 [inline] kfree+0xcb/0x280 mm/slub.c:4562 rcu_do_batch kernel/rcu/tree.c:2527 [inline] rcu_core+0x7b8/0x1540 kernel/rcu/tree.c:2778 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 Last potentially related work creation: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348 __call_rcu kernel/rcu/tree.c:3026 [inline] call_rcu+0xb1/0x740 kernel/rcu/tree.c:3106 qdisc_put_unlocked+0x6f/0x90 net/sched/sch_generic.c:1109 tcf_block_release+0x86/0x90 net/sched/cls_api.c:1238 tc_new_tfilter+0xc0d/0x2350 net/sched/cls_api.c:2148 rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:5583 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x331/0x810 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmmsg+0x195/0x470 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff8880985c4800 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 776 bytes inside of 1024-byte region [ffff8880985c4800, ffff8880985c4c00) The buggy address belongs to the page: page:ffffea0002617000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x985c0 head:ffffea0002617000 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000010200 0000000000000000 dead000000000122 ffff888010c41dc0 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 1941, ts 1038999441284, free_ts 1033444432829 prep_new_page mm/page_alloc.c:2434 [inline] get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389 alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271 alloc_slab_page mm/slub.c:1799 [inline] allocate_slab mm/slub.c:1944 [inline] new_slab+0x28a/0x3b0 mm/slub.c:2004 ___slab_alloc+0x87c/0xe90 mm/slub.c:3018 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105 slab_alloc_node mm/slub.c:3196 [inline] slab_alloc mm/slub.c:3238 [inline] __kmalloc+0x2fb/0x340 mm/slub.c:4420 kmalloc include/linux/slab.h:586 [inline] kzalloc include/linux/slab.h:715 [inline] __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1335 neigh_sysctl_register+0x2c8/0x5e0 net/core/neighbour.c:3787 devinet_sysctl_register+0xb1/0x230 net/ipv4/devinet.c:2618 inetdev_init+0x286/0x580 net/ipv4/devinet.c:278 inetdev_event+0xa8a/0x15d0 net/ipv4/devinet.c:1532 notifier_call_chain+0xb5/0x200 kernel/notifier.c:84 call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:1919 call_netdevice_notifiers_extack net/core/dev.c:1931 [inline] call_netdevice_notifiers net/core/dev.c:1945 [inline] register_netdevice+0x1073/0x1500 net/core/dev.c:9698 veth_newlink+0x59c/0xa90 drivers/net/veth.c:1722 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1352 [inline] free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404 free_unref_page_prepare mm/page_alloc.c:3325 [inline] free_unref_page+0x19/0x690 mm/page_alloc.c:3404 release_pages+0x748/0x1220 mm/swap.c:956 tlb_batch_pages_flush mm/mmu_gather.c:50 [inline] tlb_flush_mmu_free mm/mmu_gather.c:243 [inline] tlb_flush_mmu+0xe9/0x6b0 mm/mmu_gather.c:250 zap_pte_range mm/memory.c:1441 [inline] zap_pmd_range mm/memory.c:1490 [inline] zap_pud_range mm/memory.c:1519 [inline] zap_p4d_range mm/memory.c:1540 [inline] unmap_page_range+0x1d1d/0x2a30 mm/memory.c:1561 unmap_single_vma+0x198/0x310 mm/memory.c:1606 unmap_vmas+0x16b/0x2f0 mm/memory.c:1638 exit_mmap+0x201/0x670 mm/mmap.c:3178 __mmput+0x122/0x4b0 kernel/fork.c:1114 mmput+0x56/0x60 kernel/fork.c:1135 exit_mm kernel/exit.c:507 [inline] do_exit+0xa3c/0x2a30 kernel/exit.c:793 do_group_exit+0xd2/0x2f0 kernel/exit.c:935 __do_sys_exit_group kernel/exit.c:946 [inline] __se_sys_exit_group kernel/exit.c:944 [inline] __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:944 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Memory state around the buggy address: ffff8880985c4a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880985c4a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880985c4b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880985c4b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880985c4c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 470502de5bdb ("net: sched: unlock rules update API") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220131172018.3704490-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-01net: allow SO_MARK with CAP_NET_RAW via cmsgJakub Kicinski
There's not reason SO_MARK would be allowed via setsockopt() and not via cmsg, let's keep the two consistent. See commit 079925cce1d0 ("net: allow SO_MARK with CAP_NET_RAW") for justification why NET_RAW -> SO_MARK is safe. Reviewed-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220131233357.52964-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-31bpf: Make dst_port field in struct bpf_sock 16-bit wideJakub Sitnicki
Menglong Dong reports that the documentation for the dst_port field in struct bpf_sock is inaccurate and confusing. From the BPF program PoV, the field is a zero-padded 16-bit integer in network byte order. The value appears to the BPF user as if laid out in memory as so: offsetof(struct bpf_sock, dst_port) + 0 <port MSB> + 8 <port LSB> +16 0x00 +24 0x00 32-, 16-, and 8-bit wide loads from the field are all allowed, but only if the offset into the field is 0. 32-bit wide loads from dst_port are especially confusing. The loaded value, after converting to host byte order with bpf_ntohl(dst_port), contains the port number in the upper 16-bits. Remove the confusion by splitting the field into two 16-bit fields. For backward compatibility, allow 32-bit wide loads from offsetof(struct bpf_sock, dst_port). While at it, allow loads 8-bit loads at offset [0] and [1] from dst_port. Reported-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20220130115518.213259-2-jakub@cloudflare.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31net/smc: Cork when sendpage with MSG_SENDPAGE_NOTLAST flagTony Lu
This introduces a new corked flag, MSG_SENDPAGE_NOTLAST, which is involved in syscall sendfile() [1], it indicates this is not the last page. So we can cork the data until the page is not specify this flag. It has the same effect as MSG_MORE, but existed in sendfile() only. This patch adds a option MSG_SENDPAGE_NOTLAST for corking data, try to cork more data before sending when using sendfile(), which acts like TCP's behaviour. Also, this reimplements the default sendpage to inform that it is supported to some extent. [1] https://man7.org/linux/man-pages/man2/sendfile.2.html Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31net/smc: Remove corked dealyed workTony Lu
Based on the manual of TCP_CORK [1] and MSG_MORE [2], these two options have the same effect. Applications can set these options and informs the kernel to pend the data, and send them out only when the socket or syscall does not specify this flag. In other words, there's no need to send data out by a delayed work, which will queue a lot of work. This removes corked delayed work with SMC_TX_CORK_DELAY (250ms), and the applications control how/when to send them out. It improves the performance for sendfile and throughput, and remove unnecessary race of lock_sock(). This also unlocks the limitation of sndbuf, and try to fill it up before sending. [1] https://linux.die.net/man/7/tcp [2] https://man7.org/linux/man-pages/man2/send.2.html Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31net/smc: Send directly when TCP_CORK is clearedTony Lu
According to the man page of TCP_CORK [1], if set, don't send out partial frames. All queued partial frames are sent when option is cleared again. When applications call setsockopt to disable TCP_CORK, this call is protected by lock_sock(), and tries to mod_delayed_work() to 0, in order to send pending data right now. However, the delayed work smc_tx_work is also protected by lock_sock(). There introduces lock contention for sending data. To fix it, send pending data directly which acts like TCP, without lock_sock() protected in the context of setsockopt (already lock_sock()ed), and cancel unnecessary dealyed work, which is protected by lock. [1] https://linux.die.net/man/7/tcp Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31tcp: Change SYN ACK retransmit behaviour to account for rehashAkhmat Karakotov
Disabling rehash behavior did not affect SYN ACK retransmits because hash was forcefully changed bypassing the sk_rethink_hash function. This patch adds a condition which checks for rehash mode before resetting hash. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31bpf: Add SO_TXREHASH setsockoptAkhmat Karakotov
Add bpf socket option to override rehash behaviour from userspace or from bpf. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31txhash: Add socket option to control TX hash rethink behaviorAkhmat Karakotov
Add the SO_TXREHASH socket option to control hash rethink behavior per socket. When default mode is set, sockets disable rehash at initialization and use sysctl option when entering listen state. setsockopt() overrides default behavior. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31txhash: Make rethinking txhash behavior configurable via sysctlAkhmat Karakotov
Add a per ns sysctl that controls the txhash rethink behavior: net.core.txrehash. When enabled, the same behavior is retained, when disabled, rethink is not performed. Sysctl is enabled by default. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31ipv4: Make ip_idents_reserve staticDavid Ahern
ip_idents_reserve is only used in net/ipv4/route.c. Make it static and remove the export. Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31net/smc: Forward wakeup to smc socket waitqueue after fallbackWen Gu
When we replace TCP with SMC and a fallback occurs, there may be some socket waitqueue entries remaining in smc socket->wq, such as eppoll_entries inserted by userspace applications. After the fallback, data flows over TCP/IP and only clcsocket->wq will be woken up. Applications can't be notified by the entries which were inserted in smc socket->wq before fallback. So we need a mechanism to wake up smc socket->wq at the same time if some entries remaining in it. The current workaround is to transfer the entries from smc socket->wq to clcsock->wq during the fallback. But this may cause a crash like this: general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G E 5.16.0+ #107 RIP: 0010:__wake_up_common+0x65/0x170 Call Trace: <IRQ> __wake_up_common_lock+0x7a/0xc0 sock_def_readable+0x3c/0x70 tcp_data_queue+0x4a7/0xc40 tcp_rcv_established+0x32f/0x660 ? sk_filter_trim_cap+0xcb/0x2e0 tcp_v4_do_rcv+0x10b/0x260 tcp_v4_rcv+0xd2a/0xde0 ip_protocol_deliver_rcu+0x3b/0x1d0 ip_local_deliver_finish+0x54/0x60 ip_local_deliver+0x6a/0x110 ? tcp_v4_early_demux+0xa2/0x140 ? tcp_v4_early_demux+0x10d/0x140 ip_sublist_rcv_finish+0x49/0x60 ip_sublist_rcv+0x19d/0x230 ip_list_rcv+0x13e/0x170 __netif_receive_skb_list_core+0x1c2/0x240 netif_receive_skb_list_internal+0x1e6/0x320 napi_complete_done+0x11d/0x190 mlx5e_napi_poll+0x163/0x6b0 [mlx5_core] __napi_poll+0x3c/0x1b0 net_rx_action+0x27c/0x300 __do_softirq+0x114/0x2d2 irq_exit_rcu+0xb4/0xe0 common_interrupt+0xba/0xe0 </IRQ> <TASK> The crash is caused by privately transferring waitqueue entries from smc socket->wq to clcsock->wq. The owners of these entries, such as epoll, have no idea that the entries have been transferred to a different socket wait queue and still use original waitqueue spinlock (smc socket->wq.wait.lock) to make the entries operation exclusive, but it doesn't work. The operations to the entries, such as removing from the waitqueue (now is clcsock->wq after fallback), may cause a crash when clcsock waitqueue is being iterated over at the moment. This patch tries to fix this by no longer transferring wait queue entries privately, but introducing own implementations of clcsock's callback functions in fallback situation. The callback functions will forward the wakeup to smc socket->wq if clcsock->wq is actually woken up and smc socket->wq has remaining entries. Fixes: 2153bd1e3d3d ("net/smc: Transfer remaining wait queue entries during fallback") Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-29ipv4: drop fragmentation code from ip_options_build()Jakub Kicinski
Since v2.5.44 and addition of ip_options_fragment() ip_options_build() does not render headers for fragments directly. @is_frag is always 0. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28Merge tag 'for-net-next-2022-01-28' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for RTL8822C hci_ver 0x08 - Add support for RTL8852AE part 0bda:2852 - Fix WBS setting for Intel legacy ROM products - Enable SCO over I2S ib mt7921s - Increment management interface revision * tag 'for-net-next-2022-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (30 commits) Bluetooth: Increment management interface revision Bluetooth: hci_sync: Fix queuing commands when HCI_UNREGISTER is set Bluetooth: hci_h5: Add power reset via gpio in h5_btrtl_open Bluetooth: btrtl: Add support for RTL8822C hci_ver 0x08 Bluetooth: hci_event: Fix HCI_EV_VENDOR max_len Bluetooth: hci_core: Rate limit the logging of invalid SCO handle Bluetooth: hci_event: Ignore multiple conn complete events Bluetooth: msft: fix null pointer deref on msft_monitor_device_evt Bluetooth: btmtksdio: mask out interrupt status Bluetooth: btmtksdio: run sleep mode by default Bluetooth: btmtksdio: lower log level in btmtksdio_runtime_[resume|suspend]() Bluetooth: mt7921s: fix btmtksdio_[drv|fw]_pmctrl() Bluetooth: mt7921s: fix bus hang with wrong privilege Bluetooth: btmtksdio: refactor btmtksdio_runtime_[suspend|resume]() Bluetooth: mt7921s: fix firmware coredump retrieve Bluetooth: hci_serdev: call init_rwsem() before p->open() Bluetooth: Remove kernel-doc style comment block Bluetooth: btusb: Whitespace fixes for btusb_setup_csr() Bluetooth: btusb: Add one more Bluetooth part for the Realtek RTL8852AE Bluetooth: btintel: Fix WBS setting for Intel legacy ROM products ... ==================== Link: https://lore.kernel.org/r/20220128205915.3995760-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-28Merge tag 'fsnotify_for_v5.17-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fixes from Jan Kara: "Fixes for userspace breakage caused by fsnotify changes ~3 years ago and one fanotify cleanup" * tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix fsnotify hooks in pseudo filesystems fsnotify: invalidate dcache before IN_DELETE event fanotify: remove variable set but not used
2022-01-28Merge tag 'ieee802154-for-net-2022-01-28' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan Stefan Schmidt says: ==================== pull-request: ieee802154 for net 2022-01-28 An update from ieee802154 for your *net* tree. A bunch of fixes in drivers, all from Miquel Raynal. Clarifying the default channel in hwsim, leak fixes in at86rf230 and ca8210 as well as a symbol duration fix for mcr20a. Topping up the driver fixes with better error codes in nl802154 and a cleanup in MAINTAINERS for an orphaned driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28ax25: add refcount in ax25_dev to avoid UAF bugsDuoming Zhou
If we dereference ax25_dev after we call kfree(ax25_dev) in ax25_dev_device_down(), it will lead to concurrency UAF bugs. There are eight syscall functions suffer from UAF bugs, include ax25_bind(), ax25_release(), ax25_connect(), ax25_ioctl(), ax25_getname(), ax25_sendmsg(), ax25_getsockopt() and ax25_info_show(). One of the concurrency UAF can be shown as below: (USE) | (FREE) | ax25_device_event | ax25_dev_device_down ax25_bind | ... ... | kfree(ax25_dev) ax25_fillin_cb() | ... ax25_fillin_cb_from_dev() | ... | The root cause of UAF bugs is that kfree(ax25_dev) in ax25_dev_device_down() is not protected by any locks. When ax25_dev, which there are still pointers point to, is released, the concurrency UAF bug will happen. This patch introduces refcount into ax25_dev in order to guarantee that there are no pointers point to it when ax25_dev is released. Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28ax25: improve the incomplete fix to avoid UAF and NPD bugsDuoming Zhou
The previous commit 1ade48d0c27d ("ax25: NPD bug when detaching AX25 device") introduce lock_sock() into ax25_kill_by_device to prevent NPD bug. But the concurrency NPD or UAF bug will occur, when lock_sock() or release_sock() dereferences the ax25_cb->sock. The NULL pointer dereference bug can be shown as below: ax25_kill_by_device() | ax25_release() | ax25_destroy_socket() | ax25_cb_del() ... | ... | ax25->sk=NULL; lock_sock(s->sk); //(1) | s->ax25_dev = NULL; | ... release_sock(s->sk); //(2) | ... | The root cause is that the sock is set to null before dereference site (1) or (2). Therefore, this patch extracts the ax25_cb->sock in advance, and uses ax25_list_lock to protect it, which can synchronize with ax25_cb_del() and ensure the value of sock is not null before dereference sites. The concurrency UAF bug can be shown as below: ax25_kill_by_device() | ax25_release() | ax25_destroy_socket() ... | ... | sock_put(sk); //FREE lock_sock(s->sk); //(1) | s->ax25_dev = NULL; | ... release_sock(s->sk); //(2) | ... | The root cause is that the sock is released before dereference site (1) or (2). Therefore, this patch uses sock_hold() to increase the refcount of sock and uses ax25_list_lock to protect it, which can synchronize with ax25_cb_del() in ax25_destroy_socket() and ensure the sock wil not be released before dereference sites. Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28SUNRPC: add netns refcount tracker to struct rpc_xprtEric Dumazet
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28SUNRPC: add netns refcount tracker to struct gss_authEric Dumazet
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28SUNRPC: add netns refcount tracker to struct svc_xprtEric Dumazet
struct svc_xprt holds a long lived reference to a netns, it is worth tracking it. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28ethtool: add header/data split indicationJakub Kicinski
For applications running on a mix of platforms it's useful to have a clear indication whether host's NIC supports the geometry requirements of TCP zero-copy. TCP zero-copy Rx requires data to be neatly placed into memory pages. Most NICs can't do that. This patch is adding GET support only, since the NICs I work with either always have the feature enabled or enable it whenever MTU is set to jumbo. In other words I don't need SET. But adding set should be trivial. (The only note on SET is that we will likely want the setting to be "sticky" and use 0 / `unknown` to reset it back to driver default.) Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28net: ieee802154: Use the IEEE802154_MAX_PAGE define when relevantMiquel Raynal
This define already exist but is hardcoded in nl-phy.c. Use the definition when relevant. While at it, also convert the type from uint32_t to u32. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220125122540.855604-3-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-01-27ipv6: partially inline ipv6_fixup_optionsPavel Begunkov
Inline a part of ipv6_fixup_options() to avoid extra overhead on function call if opt is NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv6: optimise dst refcounting on cork initPavel Begunkov
udpv6_sendmsg() doesn't need dst after calling ip6_make_skb(), so instead of taking an additional reference inside ip6_setup_cork() and releasing the initial one afterwards, we can hand over a reference into ip6_make_skb() saving two atomics. The only other user of ip6_setup_cork() is ip6_append_data() and it requires an extra dst_hold(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27udp6: don't make extra copies of iflowPavel Begunkov
udpv6_sendmsg() first initialises an on-stack 88B struct flowi6 and then copies it into cork, which is expensive. Avoid the copy in corkless case by initialising on-stack cork->fl directly. The main part is a couple of lines under !corkreq check. The rest converts fl6 variable to be a pointer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27udp6: pass flow in ip6_make_skb together with corkPavel Begunkov
Another preparation patch. inet_cork_full already contains a field for iflow, so we can avoid passing a separate struct iflow6 into __ip6_append_data() and ip6_make_skb(), and use the flow stored in inet_cork_full. Make sure callers set cork->fl, i.e. we init it in ip6_append_data() and before calling ip6_make_skb(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv6: pass full cork into __ip6_append_data()Pavel Begunkov
Convert a struct inet_cork argument in __ip6_append_data() to struct inet_cork_full. As one struct contains another inet_cork is still can be accessed via ->base field. It's a preparation patch making further changes a bit cleaner. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv6: don't zero inet_cork_full::fl after usePavel Begunkov
It doesn't appear there is any reason for ip6_cork_release() to zero cork->fl, it'll be fully filled on next initialisation. This 88 bytes memset accounts to 0.3-0.5% of total CPU cycles. It's also needed in following patches and allows to remove an extar flow copy in udp_v6_push_pending_frames(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv6: clean up cork setup/releasePavel Begunkov
Clean up ip6_setup_cork() and ip6_cork_release() adding a local variable for v6_cork->opt. It's a preparation patch for further changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv6: remove daddr temp buffer in __ip6_make_skbPavel Begunkov
ipv6_push_nfrag_opts() doesn't change passed daddr, and so __ip6_make_skb() doesn't actually need to keep an on-stack copy of fl6->daddr. Set initially final_dst to fl6->daddr, ipv6_push_nfrag_opts() will override it if needed, and get rid of extra copies. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27udp6: shuffle up->pending AF_INET bitsPavel Begunkov
Corked AF_INET for ipv6 socket doesn't appear to be the hottest case, so move it out of the common path under up->pending check to remove overhead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv6: optimise dst refcounting on skb initPavel Begunkov
__ip6_make_skb() gets a cork->dst ref, hands it over to skb and shortly after puts cork->dst. Save two atomics by stealing it without extra referencing, ip6_cork_release() handles NULL cork->dst. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Remove leftovers from flowtable modules, from Geert Uytterhoeven. 2) Missing refcount increment of conntrack template in nft_ct, from Florian Westphal. 3) Reduce nft_zone selftest time, also from Florian. 4) Add selftest to cover stateless NAT on fragments, from Florian Westphal. 5) Do not set net_device when for reject packets from the bridge path, from Phil Sutter. 6) Cancel register tracking info on nft_byteorder operations. 7) Extend nft_concat_range selftest to cover set reload with no elements, from Florian Westphal. 8) Remove useless update of pointer in chain blob builder, reported by kbuild test robot. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: nf_tables: remove assignment with no effect in chain blob builder selftests: nft_concat_range: add test for reload with no element add/del netfilter: nft_byteorder: track register operations netfilter: nft_reject_bridge: Fix for missing reply from prerouting selftests: netfilter: check stateless nat udp checksum fixup selftests: netfilter: reduce zone stress test running time netfilter: nft_ct: fix use after free when attaching zone template netfilter: Remove flowtable relics ==================== Link: https://lore.kernel.org/r/20220127235235.656931-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27Bluetooth: Increment management interface revisionMarcel Holtmann
Increment the mgmt revision due to recent changes. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-01-27bpf: reject program if a __user tagged memory accessed in kernel wayYonghong Song
BPF verifier supports direct memory access for BPF_PROG_TYPE_TRACING type of bpf programs, e.g., a->b. If "a" is a pointer pointing to kernel memory, bpf verifier will allow user to write code in C like a->b and the verifier will translate it to a kernel load properly. If "a" is a pointer to user memory, it is expected that bpf developer should be bpf_probe_read_user() helper to get the value a->b. Without utilizing BTF __user tagging information, current verifier will assume that a->b is a kernel memory access and this may generate incorrect result. Now BTF contains __user information, it can check whether the pointer points to a user memory or not. If it is, the verifier can reject the program and force users to use bpf_probe_read_user() helper explicitly. In the future, we can easily extend btf_add_space for other address space tagging, for example, rcu/percpu etc. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220127154606.654961-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-27Merge tag 'net-5.17-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter and can. Current release - new code bugs: - tcp: add a missing sk_defer_free_flush() in tcp_splice_read() - tcp: add a stub for sk_defer_free_flush(), fix CONFIG_INET=n - nf_tables: set last expression in register tracking area - nft_connlimit: fix memleak if nf_ct_netns_get() fails - mptcp: fix removing ids bitmap setting - bonding: use rcu_dereference_rtnl when getting active slave - fix three cases of sleep in atomic context in drivers: lan966x, gve - handful of build fixes for esoteric drivers after netdev->dev_addr was made const Previous releases - regressions: - revert "ipv6: Honor all IPv6 PIO Valid Lifetime values", it broke Linux compatibility with USGv6 tests - procfs: show net device bound packet types - ipv4: fix ip option filtering for locally generated fragments - phy: broadcom: hook up soft_reset for BCM54616S Previous releases - always broken: - ipv4: raw: lock the socket in raw_bind() - ipv4: decrease the use of shared IPID generator to decrease the chance of attackers guessing the values - procfs: fix cross-netns information leakage in /proc/net/ptype - ethtool: fix link extended state for big endian - bridge: vlan: fix single net device option dumping - ping: fix the sk_bound_dev_if match in ping_lookup" * tag 'net-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (86 commits) net: bridge: vlan: fix memory leak in __allowed_ingress net: socket: rename SKB_DROP_REASON_SOCKET_FILTER ipv4: remove sparse error in ip_neigh_gw4() ipv4: avoid using shared IP generator for connected sockets ipv4: tcp: send zero IPID in SYNACK messages ipv4: raw: lock the socket in raw_bind() MAINTAINERS: add missing IPv4/IPv6 header paths MAINTAINERS: add more files to eth PHY net: stmmac: dwmac-sun8i: use return val of readl_poll_timeout() net: bridge: vlan: fix single net device option dumping net: stmmac: skip only stmmac_ptp_register when resume from suspend net: stmmac: configure PTP clock source prior to PTP initialization Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values" connector/cn_proc: Use task_is_in_init_pid_ns() pid: Introduce helper task_is_in_init_pid_ns() gve: Fix GFP flags when allocing pages net: lan966x: Fix sleep in atomic context when updating MAC table net: lan966x: Fix sleep in atomic context when injecting frames ethernet: seeq/ether3: don't write directly to netdev->dev_addr ethernet: 8390/etherh: don't write directly to netdev->dev_addr ...
2022-01-27net: bridge: vlan: fix memory leak in __allowed_ingressTim Yi
When using per-vlan state, if vlan snooping and stats are disabled, untagged or priority-tagged ingress frame will go to check pvid state. If the port state is forwarding and the pvid state is not learning/forwarding, untagged or priority-tagged frame will be dropped but skb memory is not freed. Should free skb when __allowed_ingress returns false. Fixes: a580c76d534c ("net: bridge: vlan: add per-vlan state") Signed-off-by: Tim Yi <tim.yi@pica8.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20220127074953.12632-1-tim.yi@pica8.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27netfilter: nf_tables: remove assignment with no effect in chain blob builderPablo Neira Ayuso
cppcheck possible warnings: >> net/netfilter/nf_tables_api.c:2014:2: warning: Assignment of function parameter has no effect outside the function. Did you forget dereferencing it? [uselessAssignmentPtrArg] ptr += offsetof(struct nft_rule_dp, data); ^ Reported-by: kernel test robot <yujie.liu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-27net: socket: rename SKB_DROP_REASON_SOCKET_FILTERMenglong Dong
Rename SKB_DROP_REASON_SOCKET_FILTER, which is used as the reason of skb drop out of socket filter before it's part of a released kernel. It will be used for more protocols than just TCP in future series. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/all/20220127091308.91401-2-imagedong@tencent.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27Bluetooth: hci_sync: Fix queuing commands when HCI_UNREGISTER is setLuiz Augusto von Dentz
hci_cmd_sync_queue shall return an error if HCI_UNREGISTER flag has been set as that means hci_unregister_dev has been called so it will likely cause a uaf after the timeout as the hdev will be freed. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-01-27ipv4: tcp: send zero IPID in SYNACK messagesEric Dumazet
In commit 431280eebed9 ("ipv4: tcp: send zero IPID for RST and ACK sent in SYN-RECV and TIME-WAIT state") we took care of some ctl packets sent by TCP. It turns out we need to use a similar strategy for SYNACK packets. By default, they carry IP_DF and IPID==0, but there are ways to ask them to use the hashed IP ident generator and thus be used to build off-path attacks. (Ref: Off-Path TCP Exploits of the Mixed IPID Assignment) One of this way is to force (before listener is started) echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc Another way is using forged ICMP ICMP_FRAG_NEEDED with a very small MTU (like 68) to force a false return from ip_dont_fragment() In this patch, ip_build_and_send_pkt() uses the following heuristics. 1) Most SYNACK packets are smaller than IPV4_MIN_MTU and therefore can use IP_DF regardless of the listener or route pmtu setting. 2) In case the SYNACK packet is bigger than IPV4_MIN_MTU, we use prandom_u32() generator instead of the IPv4 hashed ident one. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ray Che <xijiache@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Cc: Geoff Alexander <alexandg@cs.unm.edu> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>