summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-06-17bpf: Fix bpf_skc_lookup comment wrt. return typeTobias Klauser
The function no longer returns 'unsigned long' as of commit edbf8c01de5a ("bpf: add skc_lookup_tcp helper"). Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617152121.29617-1-tklauser@distanz.ch
2022-06-17tipc: cleanup unused functionHoang Le
tipc_dest_list_len() is not being called anywhere. Clean it up. Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17ipv4: ping: fix bind address validity checkRiccardo Paolo Bestetti
Commit 8ff978b8b222 ("ipv4/raw: support binding to nonlocal addresses") introduced a helper function to fold duplicated validity checks of bind addresses into inet_addr_valid_or_nonlocal(). However, this caused an unintended regression in ping_check_bind_addr(), which previously would reject binding to multicast and broadcast addresses, but now these are both incorrectly allowed as reported in [1]. This patch restores the original check. A simple reordering is done to improve readability and make it evident that multicast and broadcast addresses should not be allowed. Also, add an early exit for INADDR_ANY which replaces lost behavior added by commit 0ce779a9f501 ("net: Avoid unnecessary inet_addr_type() call when addr is INADDR_ANY"). Furthermore, this patch introduces regression selftests to catch these specific cases. [1] https://lore.kernel.org/netdev/CANP3RGdkAcDyAZoT1h8Gtuu0saq+eOrrTiWbxnOs+5zn+cpyKg@mail.gmail.com/ Fixes: 8ff978b8b222 ("ipv4/raw: support binding to nonlocal addresses") Cc: Miaohe Lin <linmiaohe@huawei.com> Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Signed-off-by: Riccardo Paolo Bestetti <pbl@bestov.io> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17tipc: fix use-after-free Read in tipc_named_reinitHoang Le
syzbot found the following issue on: ================================================================== BUG: KASAN: use-after-free in tipc_named_reinit+0x94f/0x9b0 net/tipc/name_distr.c:413 Read of size 8 at addr ffff88805299a000 by task kworker/1:9/23764 CPU: 1 PID: 23764 Comm: kworker/1:9 Not tainted 5.18.0-rc4-syzkaller-00878-g17d49e6e8012 #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events tipc_net_finalize_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0xeb/0x495 mm/kasan/report.c:313 print_report mm/kasan/report.c:429 [inline] kasan_report.cold+0xf4/0x1c6 mm/kasan/report.c:491 tipc_named_reinit+0x94f/0x9b0 net/tipc/name_distr.c:413 tipc_net_finalize+0x234/0x3d0 net/tipc/net.c:138 process_one_work+0x996/0x1610 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e9/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:298 </TASK> [...] ================================================================== In the commit d966ddcc3821 ("tipc: fix a deadlock when flushing scheduled work"), the cancel_work_sync() function just to make sure ONLY the work tipc_net_finalize_work() is executing/pending on any CPU completed before tipc namespace is destroyed through tipc_exit_net(). But this function is not guaranteed the work is the last queued. So, the destroyed instance may be accessed in the work which will try to enqueue later. In order to completely fix, we re-order the calling of cancel_work_sync() to make sure the work tipc_net_finalize_work() was last queued and it must be completed by calling cancel_work_sync(). Reported-by: syzbot+47af19f3307fc9c5c82e@syzkaller.appspotmail.com Fixes: d966ddcc3821 ("tipc: fix a deadlock when flushing scheduled work") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17net: fix data-race in dev_isalive()Eric Dumazet
dev_isalive() is called under RTNL or dev_base_lock protection. This means that changes to dev->reg_state should be done with both locks held. syzbot reported: BUG: KCSAN: data-race in register_netdevice / type_show write to 0xffff888144ecf518 of 1 bytes by task 20886 on cpu 0: register_netdevice+0xb9f/0xdf0 net/core/dev.c:10050 lapbeth_new_device drivers/net/wan/lapbether.c:414 [inline] lapbeth_device_event+0x4a0/0x6c0 drivers/net/wan/lapbether.c:456 notifier_call_chain kernel/notifier.c:87 [inline] raw_notifier_call_chain+0x53/0xb0 kernel/notifier.c:455 __dev_notify_flags+0x1d6/0x3a0 dev_change_flags+0xa2/0xc0 net/core/dev.c:8607 do_setlink+0x778/0x2230 net/core/rtnetlink.c:2780 __rtnl_newlink net/core/rtnetlink.c:3546 [inline] rtnl_newlink+0x114c/0x16a0 net/core/rtnetlink.c:3593 rtnetlink_rcv_msg+0x811/0x8c0 net/core/rtnetlink.c:6089 netlink_rcv_skb+0x13e/0x240 net/netlink/af_netlink.c:2501 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:6107 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x58a/0x660 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x661/0x750 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] __sys_sendto+0x21e/0x2c0 net/socket.c:2119 __do_sys_sendto net/socket.c:2131 [inline] __se_sys_sendto net/socket.c:2127 [inline] __x64_sys_sendto+0x74/0x90 net/socket.c:2127 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 read to 0xffff888144ecf518 of 1 bytes by task 20423 on cpu 1: dev_isalive net/core/net-sysfs.c:38 [inline] netdev_show net/core/net-sysfs.c:50 [inline] type_show+0x24/0x90 net/core/net-sysfs.c:112 dev_attr_show+0x35/0x90 drivers/base/core.c:2095 sysfs_kf_seq_show+0x175/0x240 fs/sysfs/file.c:59 kernfs_seq_show+0x75/0x80 fs/kernfs/file.c:162 seq_read_iter+0x2c3/0x8e0 fs/seq_file.c:230 kernfs_fop_read_iter+0xd1/0x2f0 fs/kernfs/file.c:235 call_read_iter include/linux/fs.h:2052 [inline] new_sync_read fs/read_write.c:401 [inline] vfs_read+0x5a5/0x6a0 fs/read_write.c:482 ksys_read+0xe8/0x1a0 fs/read_write.c:620 __do_sys_read fs/read_write.c:630 [inline] __se_sys_read fs/read_write.c:628 [inline] __x64_sys_read+0x3e/0x50 fs/read_write.c:628 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 value changed: 0x00 -> 0x01 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 20423 Comm: udevd Tainted: G W 5.19.0-rc2-syzkaller-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17tcp: fix build...David S. Miller
Remove accidental dup of tcp_wmem_schedule. Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17tcp: fix possible freeze in tx path under memory pressureEric Dumazet
Blamed commit only dealt with applications issuing small writes. Issue here is that we allow to force memory schedule for the sk_buff allocation, but we have no guarantee that sendmsg() is able to copy some payload in it. In this patch, I make sure the socket can use up to tcp_wmem[0] bytes. For example, if we consider tcp_wmem[0] = 4096 (default on x86), and initial skb->truesize being 1280, tcp_sendmsg() is able to copy up to 2816 bytes under memory pressure. Before this patch a sendmsg() sending more than 2816 bytes would either block forever (if persistent memory pressure), or return -EAGAIN. For bigger MTU networks, it is advised to increase tcp_wmem[0] to avoid sending too small packets. v2: deal with zero copy paths. Fixes: 8e4d980ac215 ("tcp: fix behavior for epoll edge trigger") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17tcp: fix possible freeze in tx path under memory pressureEric Dumazet
Blamed commit only dealt with applications issuing small writes. Issue here is that we allow to force memory schedule for the sk_buff allocation, but we have no guarantee that sendmsg() is able to copy some payload in it. In this patch, I make sure the socket can use up to tcp_wmem[0] bytes. For example, if we consider tcp_wmem[0] = 4096 (default on x86), and initial skb->truesize being 1280, tcp_sendmsg() is able to copy up to 2816 bytes under memory pressure. Before this patch a sendmsg() sending more than 2816 bytes would either block forever (if persistent memory pressure), or return -EAGAIN. For bigger MTU networks, it is advised to increase tcp_wmem[0] to avoid sending too small packets. v2: deal with zero copy paths. Fixes: 8e4d980ac215 ("tcp: fix behavior for epoll edge trigger") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-17tcp: fix over estimation in sk_forced_mem_schedule()Eric Dumazet
sk_forced_mem_schedule() has a bug similar to ones fixed in commit 7c80b038d23e ("net: fix sk_wmem_schedule() and sk_rmem_schedule() errors") While this bug has little chance to trigger in old kernels, we need to fix it before the following patch. Fixes: d83769a580f1 ("tcp: fix possible deadlock in tcp_send_fin()") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-16bpf: Allow the new syncookie helpers to work with SKBsMaxim Mikityanskiy
This commit allows the new BPF helpers to work in SKB context (in TC BPF programs): bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6}. Using these helpers in TC BPF programs is not recommended, because it's unlikely that the BPF program will provide any substantional speedup compared to regular SYN cookies or synproxy, after the SKB is already created. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220615134847.3753567-6-maximmi@nvidia.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16bpf: Add helpers to issue and check SYN cookies in XDPMaxim Mikityanskiy
The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP program to generate SYN cookies in response to TCP SYN packets and to check those cookies upon receiving the first ACK packet (the final packet of the TCP handshake). Unlike bpf_tcp_{gen,check}_syncookie these new helpers don't need a listening socket on the local machine, which allows to use them together with synproxy to accelerate SYN cookie generation. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-16net: set proper memcg for net_init hooks allocationsVasily Averin
__register_pernet_operations() executes init hook of registered pernet_operation structure in all existing net namespaces. Typically, these hooks are called by a process associated with the specified net namespace, and all __GFP_ACCOUNT marked allocation are accounted for corresponding container/memcg. However __register_pernet_operations() calls the hooks in the same context, and as a result all marked allocations are accounted to one memcg for all processed net namespaces. This patch adjusts active memcg for each net namespace and helps to account memory allocated inside ops_init() into the proper memcg. Link: https://lkml.kernel.org/r/f9394752-e272-9bf9-645f-a18c56d1c4ec@openvz.org Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Florian Westphal <fw@strlen.de> Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Linux Kernel Functional Testing <lkft@linaro.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16Merge tag 'net-5.19-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Mostly driver fixes. Current release - regressions: - Revert "net: Add a second bind table hashed by port and address", needs more work - amd-xgbe: use platform_irq_count(), static setup of IRQ resources had been removed from DT core - dts: at91: ksz9477_evb: add phy-mode to fix port/phy validation Current release - new code bugs: - hns3: modify the ring param print info Previous releases - always broken: - axienet: make the 64b addressable DMA depends on 64b architectures - iavf: fix issue with MAC address of VF shown as zero - ice: fix PTP TX timestamp offset calculation - usb: ax88179_178a needs FLAG_SEND_ZLP Misc: - document some net.sctp.* sysctls" * tag 'net-5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits) net: axienet: add missing error return code in axienet_probe() Revert "net: Add a second bind table hashed by port and address" net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsg net: usb: ax88179_178a needs FLAG_SEND_ZLP MAINTAINERS: add include/dt-bindings/net to NETWORKING DRIVERS ARM: dts: at91: ksz9477_evb: fix port/phy validation net: bgmac: Fix an erroneous kfree() in bgmac_remove() ice: Fix memory corruption in VF driver ice: Fix queue config fail handling ice: Sync VLAN filtering features for DVM ice: Fix PTP TX timestamp offset calculation mlxsw: spectrum_cnt: Reorder counter pools docs: networking: phy: Fix a typo amd-xgbe: Use platform_irq_count() octeontx2-vf: Add support for adaptive interrupt coalescing xilinx: Fix build on x86. net: axienet: Use iowrite64 to write all 64b descriptor pointers net: axienet: make the 64b addresable DMA depends on 64b archectures net: hns3: fix tm port shapping of fibre port is incorrect after driver initialization net: hns3: fix PF rss size initialization bug ...
2022-06-16Revert "net: Add a second bind table hashed by port and address"Joanne Koong
This reverts: commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address") commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry") Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/ There are a few things that need to be fixed here: * Updating bhash2 in cases where the socket's rcv saddr changes * Adding bhash2 hashbucket locks Links to syzbot reports: https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/ https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/ Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address") Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-15bpf: Fix request_sock leak in sk lookup helpersJon Maxwell
A customer reported a request_socket leak in a Calico cloud environment. We found that a BPF program was doing a socket lookup with takes a refcnt on the socket and that it was finding the request_socket but returning the parent LISTEN socket via sk_to_full_sk() without decrementing the child request socket 1st, resulting in request_sock slab object leak. This patch retains the existing behaviour of returning full socks to the caller but it also decrements the child request_socket if one is present before doing so to prevent the leak. Thanks to Curtis Taylor for all the help in diagnosing and testing this. And thanks to Antoine Tenart for the reproducer and patch input. v2 of this patch contains, refactor as per Daniel Borkmann's suggestions to validate RCU flags on the listen socket so that it balances with bpf_sk_release() and update comments as per Martin KaFai Lau's suggestion. One small change to Daniels suggestion, put "sk = sk2" under "if (sk2 != sk)" to avoid an extra instruction. Fixes: f7355a6c0497 ("bpf: Check sk_fullsock() before returning from bpf_sk_lookup()") Fixes: edbf8c01de5a ("bpf: add skc_lookup_tcp helper") Co-developed-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com> Cc: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/56d6f898-bde0-bb25-3427-12a330b29fb8@iogearbox.net Link: https://lore.kernel.org/bpf/20220615011540.813025-1-jmaxwell37@gmail.com
2022-06-15net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsgDuoming Zhou
The skb_recv_datagram() in ax25_recvmsg() will hold lock_sock and block until it receives a packet from the remote. If the client doesn`t connect to server and calls read() directly, it will not receive any packets forever. As a result, the deadlock will happen. The fail log caused by deadlock is shown below: [ 369.606973] INFO: task ax25_deadlock:157 blocked for more than 245 seconds. [ 369.608919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 369.613058] Call Trace: [ 369.613315] <TASK> [ 369.614072] __schedule+0x2f9/0xb20 [ 369.615029] schedule+0x49/0xb0 [ 369.615734] __lock_sock+0x92/0x100 [ 369.616763] ? destroy_sched_domains_rcu+0x20/0x20 [ 369.617941] lock_sock_nested+0x6e/0x70 [ 369.618809] ax25_bind+0xaa/0x210 [ 369.619736] __sys_bind+0xca/0xf0 [ 369.620039] ? do_futex+0xae/0x1b0 [ 369.620387] ? __x64_sys_futex+0x7c/0x1c0 [ 369.620601] ? fpregs_assert_state_consistent+0x19/0x40 [ 369.620613] __x64_sys_bind+0x11/0x20 [ 369.621791] do_syscall_64+0x3b/0x90 [ 369.622423] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 369.623319] RIP: 0033:0x7f43c8aa8af7 [ 369.624301] RSP: 002b:00007f43c8197ef8 EFLAGS: 00000246 ORIG_RAX: 0000000000000031 [ 369.625756] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f43c8aa8af7 [ 369.626724] RDX: 0000000000000010 RSI: 000055768e2021d0 RDI: 0000000000000005 [ 369.628569] RBP: 00007f43c8197f00 R08: 0000000000000011 R09: 00007f43c8198700 [ 369.630208] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff845e6afe [ 369.632240] R13: 00007fff845e6aff R14: 00007f43c8197fc0 R15: 00007f43c8198700 This patch replaces skb_recv_datagram() with an open-coded variant of it releasing the socket lock before the __skb_wait_for_more_packets() call and re-acquiring it after such call in order that other functions that need socket lock could be executed. what's more, the socket lock will be released only when recvmsg() will block and that should produce nicer overall behavior. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Thomas Osterried <thomas@osterried.de> Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reported-by: Thomas Habets <thomas@@habets.se> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-15net: don't check skb_count twiceSieng Piaw Liew
NAPI cache skb_count is being checked twice without condition. Change to checking the second time only if the first check is run. Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-15net: bridge: allow add/remove permanent mdb entries on disabled portsCasper Andersson
Adding mdb entries on disabled ports allows you to do setup before accepting any traffic, avoiding any time where the port is not in the multicast group. Signed-off-by: Casper Andersson <casper.casan@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-14xsk: Fix generic transmit when completion queue reservation failsCiara Loftus
Two points of potential failure in the generic transmit function are: 1. completion queue (cq) reservation failure. 2. skb allocation failure Originally the cq reservation was performed first, followed by the skb allocation. Commit 675716400da6 ("xdp: fix possible cq entry leak") reversed the order because at the time there was no mechanism available to undo the cq reservation which could have led to possible cq entry leaks in the event of skb allocation failure. However if the skb allocation is performed first and the cq reservation then fails, the xsk skb destructor is called which blindly adds the skb address to the already full cq leading to undefined behavior. This commit restores the original order (cq reservation followed by skb allocation) and uses the xskq_prod_cancel helper to undo the cq reserve in event of skb allocation failure. Fixes: 675716400da6 ("xdp: fix possible cq entry leak") Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220614070746.8871-1-ciara.loftus@intel.com
2022-06-13ethtool: Fix and simplify ethtool_convert_link_mode_to_legacy_u32()Marco Bonelli
Fix the implementation of ethtool_convert_link_mode_to_legacy_u32(), which is supposed to return false if src has bits higher than 31 set. The current implementation uses the complement of bitmap_fill(ext, 32) to test high bits of src, which is wrong as bitmap_fill() fills _with long granularity_, and sizeof(long) can be > 4. No users of this function currently check the return value, so the bug was dormant. Also remove the check for __ETHTOOL_LINK_MODE_MASK_NBITS > 32, as the enum ethtool_link_mode_bit_indices contains far beyond 32 values. Using find_next_bit() to test the src bitmask works regardless of this anyway. Signed-off-by: Marco Bonelli <marco@mebeim.net> Link: https://lore.kernel.org/r/20220609134900.11201-1-marco@mebeim.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-13net: make __sys_accept4_file() staticYajun Deng
__sys_accept4_file() isn't used outside of the file, make it static. As the same time, move file_flags and nofile parameters into __sys_accept4_file(). Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-13tcp: sk_forced_mem_schedule() optimizationEric Dumazet
sk_memory_allocated_add() has three callers, and returns to them @memory_allocated. sk_forced_mem_schedule() is one of them, and ignores the returned value. Change sk_memory_allocated_add() to return void. Change sock_reserve_memory() and __sk_mem_raise_allocated() to call sk_memory_allocated(). This removes one cache line miss [1] for RPC workloads, as first skbs in TCP write queue and receive queue go through sk_forced_mem_schedule(). [1] Cache line holding tcp_memory_allocated. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-11net: Kconfig: move the CAN device menu to the "Device Drivers" sectionVincent Mailhol
The devices are meant to be under the "Device Drivers" category of the menuconfig. The CAN subsystem is currently one of the rare exception with all of its devices under the "Networking support" category. The CAN_DEV menuentry gets moved to fix this discrepancy. The CAN menu is added just before MCTP in order to be consistent with the current order under the "Networking support" menu. A dependency on CAN is added to CAN_DEV so that the CAN devices only show up if the CAN subsystem is enabled. Link: https://lore.kernel.org/all/20220610143009.323579-6-mailhol.vincent@wanadoo.fr Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Acked-by: Max Staudt <max@enpas.org> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-06-10Merge tag 'nfsd-5.19-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Notable changes: - There is now a backup maintainer for NFSD Notable fixes: - Prevent array overruns in svc_rdma_build_writes() - Prevent buffer overruns when encoding NFSv3 READDIR results - Fix a potential UAF in nfsd_file_put()" * tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer() SUNRPC: Clean up xdr_get_next_encode_buffer() SUNRPC: Clean up xdr_commit_encode() SUNRPC: Optimize xdr_reserve_space() SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer() SUNRPC: Trap RDMA segment overflows NFSD: Fix potential use-after-free in nfsd_file_put() MAINTAINERS: reciprocal co-maintainership for file locking and nfsd
2022-06-10net: unexport __sk_mem_{raise|reduce}_allocatedEric Dumazet
These two helpers are only used from core networking. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10net: keep sk->sk_forward_alloc as small as possibleEric Dumazet
Currently, tcp_memory_allocated can hit tcp_mem[] limits quite fast. Each TCP socket can forward allocate up to 2 MB of memory, even after flow became less active. 10,000 sockets can have reserved 20 GB of memory, and we have no shrinker in place to reclaim that. Instead of trying to reclaim the extra allocations in some places, just keep sk->sk_forward_alloc values as small as possible. This should not impact performance too much now we have per-cpu reserves: Changes to tcp_memory_allocated should not be too frequent. For sockets not using SO_RESERVE_MEM: - idle sockets (no packets in tx/rx queues) have zero forward alloc. - non idle sockets have a forward alloc smaller than one page. Note: - Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD is left to MPTCP maintainers as a follow up. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10net: add per_cpu_fw_alloc field to struct protoEric Dumazet
Each protocol having a ->memory_allocated pointer gets a corresponding per-cpu reserve, that following patches will use. Instead of having reserved bytes per socket, we want to have per-cpu reserves. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10net: remove SK_MEM_QUANTUM and SK_MEM_QUANTUM_SHIFTEric Dumazet
Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE. This might change in the future, but it seems better to avoid the confusion. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10Merge tag 'wireless-next-2022-06-10' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== wireless-next patches for v5.20 Here's a first set of patches for v5.20. This is just a queue flush, before we get things back from net-next that are causing conflicts, and then can start merging a lot of MLO (multi-link operation, part of 802.11be) code. Lots of cleanups all over. The only notable change is perhaps wilc1000 being the first driver to disable WEP (while enabling WPA3). * tag 'wireless-next-2022-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (29 commits) wifi: mac80211_hwsim: Directly use ida_alloc()/free() wifi: mac80211: refactor some key code wifi: mac80211: remove cipher scheme support wifi: nl80211: fix typo in comment wifi: virt_wifi: fix typo in comment rtw89: add new state to CFO state machine for UL-OFDMA rtw89: 8852c: add trigger frame counter ieee80211: add trigger frame definition wifi: wfx: Remove redundant NULL check before release_firmware() call wifi: rtw89: support MULTI_BSSID and correct BSSID mask of H2C wifi: ray_cs: Drop useless status variable in parse_addr() wifi: ray_cs: Utilize strnlen() in parse_addr() wifi: rtw88: use %*ph to print small buffer wifi: wilc1000: add IGTK support wifi: wilc1000: add WPA3 SAE support wifi: wilc1000: remove WEP security support wifi: wilc1000: use correct sequence of RESET for chip Power-UP/Down wifi: rtlwifi: fix error codes in rtl_debugfs_set_write_h2c() wifi: rtw88: Fix Sparse warning for rtw8821c_hw_spec wifi: rtw88: Fix Sparse warning for rtw8723d_hw_spec ... ==================== Link: https://lore.kernel.org/r/20220610142838.330862-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-10wifi: mac80211: refactor some key codeJohannes Berg
There's some pretty close code here, with the exception of RCU dereference vs. protected dereference. Refactor this to just return a pointer and then do the deref in the caller later. Change-Id: Ide5315e2792da6ac66eaf852293306a3ac71ced9 Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-06-10net: ipconfig: Relax fw_devlink if we need to mount a network rootfsSaravana Kannan
If there are network devices that could probe without some of their suppliers probing and those network devices are needed to mount a network rootfs, then fw_devlink=on might break that usecase by blocking the network devices from probing by the time IP auto config starts. So, if no network devices are available when IP auto config is enabled and we have a network rootfs, make sure fw_devlink doesn't block the probing of any device that has a driver and then retry finding a network device. Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20220601070707.3946847-6-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-10wifi: mac80211: remove cipher scheme supportJohannes Berg
The only driver using this was iwlwifi, where we just removed the support because it was never really used. Remove the code from mac80211 as well. Change-Id: I1667417a5932315ee9d81f5c233c56a354923f09 Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-06-10treewide: Replace GPLv2 boilerplate/reference with SPDX - gpl-2.0_149.RULEThomas Gleixner
Based on the normalized pattern: netapp provides this source code under the gpl v2 license the gpl v2 license is available at https://opensource org/licenses/gpl-license php this software is provided by the copyright holders and contributors as is and any express or implied warranties including but not limited to the implied warranties of merchantability and fitness for a particular purpose are disclaimed in no event shall the copyright owner or contributors be liable for any direct indirect incidental special exemplary or consequential damages (including but not limited to procurement of substitute goods or services loss of use data or profits or business interruption) however caused and on any theory of liability whether in contract strict liability or tort (including negligence or otherwise) arising in any way out of the use of this software even if advised of the possibility of such damage extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference. Reviewed-by: Allison Randal <allison@lohutok.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-10treewide: Replace GPLv2 boilerplate/reference with SPDX - gpl-2.0_30.RULE ↵Thomas Gleixner
(part 2) Based on the normalized pattern: this program is free software you can redistribute it and/or modify it under the terms of the gnu general public license as published by the free software foundation version 2 this program is distributed as is without any warranty of any kind whether express or implied without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference. Reviewed-by: Allison Randal <allison@lohutok.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-10xfrm: no need to set DST_NOPOLICY in IPv4Eyal Birger
This is a cleanup patch following commit e6175a2ed1f1 ("xfrm: fix "disable_policy" flag use when arriving from different devices") which made DST_NOPOLICY no longer be used for inbound policy checks. On outbound the flag was set, but never used. As such, avoid setting it altogether and remove the nopolicy argument from rt_dst_alloc(). Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-06-09Merge tag 'ieee802154-for-net-next-2022-06-09' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next Stefan Schmidt says: ==================== pull-request: ieee802154-next 2022-06-09 This is a separate pull request for 6lowpan changes. We agreed with the bluetooth maintainers to switch the trees these changing are going into from bluetooth to ieee802154. Jukka Rissanen stepped down as a co-maintainer of 6lowpan (Thanks for the work!). Alexander is staying as maintainer. Alexander reworked the nhc_id lookup in 6lowpan to be way simpler. Moved the data structure from rb to an array, which is all we need in this case. * tag 'ieee802154-for-net-next-2022-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next: MAINTAINERS: Remove Jukka Rissanen as 6lowpan maintainer net: 6lowpan: constify lowpan_nhc structures net: 6lowpan: use array for find nhc id net: 6lowpan: remove const from scalars ==================== Link: https://lore.kernel.org/r/20220609202956.1512156-1-stefan@datenfreihafen.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: seg6: fix seg6_lookup_any_nexthop() to handle VRFs using flowi_l3mdevAndrea Mayer
Commit 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") adds a new entry (flowi_l3mdev) in the common flow struct used for indicating the l3mdev index for later rule and table matching. The l3mdev_update_flow() has been adapted to properly set the flowi_l3mdev based on the flowi_oif/flowi_iif. In fact, when a valid flowi_iif is supplied to the l3mdev_update_flow(), this function can update the flowi_l3mdev entry only if it has not yet been set (i.e., the flowi_l3mdev entry is equal to 0). The SRv6 End.DT6 behavior in VRF mode leverages a VRF device in order to force the routing lookup into the associated routing table. This routing operation is performed by seg6_lookup_any_nextop() preparing a flowi6 data structure used by ip6_route_input_lookup() which, in turn, (indirectly) invokes l3mdev_update_flow(). However, seg6_lookup_any_nexthop() does not initialize the new flowi_l3mdev entry which is filled with random garbage data. This prevents l3mdev_update_flow() from properly updating the flowi_l3mdev with the VRF index, and thus SRv6 End.DT6 (VRF mode)/DT46 behaviors are broken. This patch correctly initializes the flowi6 instance allocated and used by seg6_lookup_any_nexhtop(). Specifically, the entire flowi6 instance is wiped out: in case new entries are added to flowi/flowi6 (as happened with the flowi_l3mdev entry), we should no longer have incorrectly initialized values. As a result of this operation, the value of flowi_l3mdev is also set to 0. The proposed fix can be tested easily. Starting from the commit referenced in the Fixes, selftests [1],[2] indicate that the SRv6 End.DT6 (VRF mode)/DT46 behaviors no longer work correctly. By applying this patch, those behaviors are back to work properly again. [1] - tools/testing/selftests/net/srv6_end_dt46_l3vpn_test.sh [2] - tools/testing/selftests/net/srv6_end_dt6_l3vpn_test.sh Fixes: 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") Reported-by: Anton Makarov <am@3a-alliance.com> Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220608091917.20345-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: add napi_get_frags_check() helperEric Dumazet
This is a follow up of commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") When/if we increase MAX_SKB_FRAGS, we better make sure the old bug will not come back. Adding a check in napi_get_frags() would be costly, even if using DEBUG_NET_WARN_ON_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: add debug checks in napi_consume_skb and __napi_alloc_skb()Eric Dumazet
Commit 6454eca81eae ("net: Use lockdep_assert_in_softirq() in napi_consume_skb()") added a check in napi_consume_skb() which is a bit weak. napi_consume_skb() and __napi_alloc_skb() should only be used from BH context, not from hard irq or nmi context, otherwise we could have races. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: use DEBUG_NET_WARN_ON_ONCE() in skb_release_head_state()Eric Dumazet
Remove this check from fast path unless CONFIG_DEBUG_NET=y Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09af_unix: use DEBUG_NET_WARN_ON_ONCE()Eric Dumazet
Replace four WARN_ON() that have not triggered recently with DEBUG_NET_WARN_ON_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: use WARN_ON_ONCE() in sk_stream_kill_queues()Eric Dumazet
sk_stream_kill_queues() has three checks which have been useful to detect kernel bugs in the past. However they are potentially a problem because they could flood the syslog. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: use WARN_ON_ONCE() in inet_sock_destruct()Eric Dumazet
inet_sock_destruct() has four warnings which have been useful to point to kernel bugs in the past. However they are potentially a problem because they could flood the syslog. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: use DEBUG_NET_WARN_ON_ONCE() in dev_loopback_xmit()Eric Dumazet
One check in dev_loopback_xmit() has not caught issues in the past. Keep it for CONFIG_DEBUG_NET=y builds only. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: use DEBUG_NET_WARN_ON_ONCE() in __release_sock()Eric Dumazet
Check against skb dst in socket backlog has never triggered in past years. Keep the check omly for CONFIG_DEBUG_NET=y builds. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09drop_monitor: adopt u64_stats_tEric Dumazet
As explained in commit 316580b69d0a ("u64_stats: provide u64_stats_t type") we should use u64_stats_t and related accessors to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09devlink: adopt u64_stats_tEric Dumazet
As explained in commit 316580b69d0a ("u64_stats: provide u64_stats_t type") we should use u64_stats_t and related accessors to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-09net: adopt u64_stats_t in struct pcpu_sw_netstatsEric Dumazet
As explained in commit 316580b69d0a ("u64_stats: provide u64_stats_t type") we should use u64_stats_t and related accessors to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>