summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-05-08net: bridge: fix corrupted ethernet header on multicast-to-unicastFelix Fietkau
The change from skb_copy to pskb_copy unfortunately changed the data copying to omit the ethernet header, since it was pulled before reaching this point. Fix this by calling __skb_push/pull around pskb_copy. Fixes: 59c878cbcdd8 ("net: bridge: fix multicast-to-unicast with fraglist GSO") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: dsa: add support switches global DSCP priority mappingOleksij Rempel
Some switches like Microchip KSZ variants do not support per port DSCP priority configuration. Instead there is a global DSCP mapping table. To handle it, we will accept set/del request to any of user ports to make global configuration and update dcb app entries for all other ports. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: add IEEE 802.1q specific helpersOleksij Rempel
IEEE 802.1q specification provides recommendation and examples which can be used as good default values for different drivers. This patch implements mapping examples documented in IEEE 802.1Q-2022 in Annex I "I.3 Traffic type to traffic class mapping" and IETF DSCP naming and mapping DSCP to Traffic Type inspired by RFC8325. This helpers will be used in followup patches for dsa/microchip DCB implementation. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: dsa: add support for DCB get/set apptrust configurationOleksij Rempel
Add DCB support to get/set trust configuration for different packet priority information sources. Some switch allow to chose different source of packet priority classification. For example on KSZ switches it is possible to configure VLAN PCP and/or DSCP sources. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08xsk: use generic DMA sync shortcut instead of a custom oneAlexander Lobakin
XSk infra's been using its own DMA sync shortcut to try avoiding redundant function calls. Now that there is a generic one, remove the custom implementation and rely on the generic helpers. xsk_buff_dma_sync_for_cpu() doesn't need the second argument anymore, remove it. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-08page_pool: check for DMA sync shortcut earlierAlexander Lobakin
We can save a couple more function calls in the Page Pool code if we check for dma_need_sync() earlier, just when we test pp->p.dma_sync. Move both these checks into an inline wrapper and call the PP wrapper over the generic DMA sync function only when both are true. You can't cache the result of dma_need_sync() in &page_pool, as it may change anytime if an SWIOTLB buffer is allocated or mapped. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07mptcp: only allow set existing scheduler for net.mptcp.schedulerGregory Detal
The current behavior is to accept any strings as inputs, this results in an inconsistent result where an unexisting scheduler can be set: # sysctl -w net.mptcp.scheduler=notdefault net.mptcp.scheduler = notdefault This patch changes this behavior by checking for existing scheduler before accepting the input. Fixes: e3b2870b6d22 ("mptcp: add a new sysctl scheduler") Cc: stable@vger.kernel.org Signed-off-by: Gregory Detal <gregory.detal@gmail.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240506-upstream-net-20240506-mptcp-sched-exist-v1-1-2ed1529e521e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07nfc: nci: Fix kcov check in nci_rx_work()Tetsuo Handa
Commit 7e8cdc97148c ("nfc: Add KCOV annotations") added kcov_remote_start_common()/kcov_remote_stop() pair into nci_rx_work(), with an assumption that kcov_remote_stop() is called upon continue of the for loop. But commit d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") forgot to call kcov_remote_stop() before break of the for loop. Reported-by: syzbot <syzbot+0438378d6f157baae1a2@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=0438378d6f157baae1a2 Fixes: d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/6d10f829-5a0c-405a-b39a-d7266f3a1a0b@I-love.SAKURA.ne.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07mptcp: fix possible NULL dereferencesEric Dumazet
subflow_add_reset_reason(skb, ...) can fail. We can not assume mptcp_get_ext(skb) always return a non NULL pointer. syzbot reported: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 0 PID: 5098 Comm: syz-executor132 Not tainted 6.9.0-rc6-syzkaller-01478-gcdc74c9d06e7 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:subflow_v6_route_req+0x2c7/0x490 net/mptcp/subflow.c:388 Code: 8d 7b 07 48 89 f8 48 c1 e8 03 42 0f b6 04 20 84 c0 0f 85 c0 01 00 00 0f b6 43 07 48 8d 1c c3 48 83 c3 18 48 89 d8 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 84 01 00 00 0f b6 5b 01 83 e3 0f 48 89 RSP: 0018:ffffc9000362eb68 EFLAGS: 00010206 RAX: 0000000000000003 RBX: 0000000000000018 RCX: ffff888022039e00 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88807d961140 R08: ffffffff8b6cb76b R09: 1ffff1100fb2c230 R10: dffffc0000000000 R11: ffffed100fb2c231 R12: dffffc0000000000 R13: ffff888022bfe273 R14: ffff88802cf9cc80 R15: ffff88802ad5a700 FS: 0000555587ad2380(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f420c3f9720 CR3: 0000000022bfc000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tcp_conn_request+0xf07/0x32c0 net/ipv4/tcp_input.c:7180 tcp_rcv_state_process+0x183c/0x4500 net/ipv4/tcp_input.c:6663 tcp_v6_do_rcv+0x8b2/0x1310 net/ipv6/tcp_ipv6.c:1673 tcp_v6_rcv+0x22b4/0x30b0 net/ipv6/tcp_ipv6.c:1910 ip6_protocol_deliver_rcu+0xc76/0x1570 net/ipv6/ip6_input.c:438 ip6_input_finish+0x186/0x2d0 net/ipv6/ip6_input.c:483 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 __netif_receive_skb_one_core net/core/dev.c:5625 [inline] __netif_receive_skb+0x1ea/0x650 net/core/dev.c:5739 netif_receive_skb_internal net/core/dev.c:5825 [inline] netif_receive_skb+0x1e8/0x890 net/core/dev.c:5885 tun_rx_batched+0x1b7/0x8f0 drivers/net/tun.c:1549 tun_get_user+0x2f35/0x4560 drivers/net/tun.c:2002 tun_chr_write_iter+0x113/0x1f0 drivers/net/tun.c:2048 call_write_iter include/linux/fs.h:2110 [inline] new_sync_write fs/read_write.c:497 [inline] vfs_write+0xa84/0xcb0 fs/read_write.c:590 ksys_write+0x1a0/0x2c0 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 3e140491dd80 ("mptcp: support rstreason for passive reset") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240506123032.3351895-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07net: annotate writes on dev->mtu from ndo_change_mtu()Eric Dumazet
Simon reported that ndo_change_mtu() methods were never updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted in commit 501a90c94510 ("inet: protect against too small mtu values.") We read dev->mtu without holding RTNL in many places, with READ_ONCE() annotations. It is time to take care of ndo_change_mtu() methods to use corresponding WRITE_ONCE() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/ Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07net: dccp: Fix ccid2_rtt_estimator() kernel-docJeff Johnson
make C=1 reports: warning: Function parameter or struct member 'mrtt' not described in 'ccid2_rtt_estimator' So document the 'mrtt' parameter. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240505-ccid2_rtt_estimator-kdoc-v1-1-09231fcb9145@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07page_pool: don't use driver-set flags field directlyAlexander Lobakin
page_pool::p is driver-defined params, copied directly from the structure passed to page_pool_create(). The structure isn't meant to be modified by the Page Pool core code and this even might look confusing[0][1]. In order to be able to alter some flags, let's define our own, internal fields the same way as the already existing one (::has_init_callback). They are defined as bits in the driver-set params, leave them so here as well, to not waste byte-per-bit or so. Almost 30 bits are still free for future extensions. We could've defined only new flags here or only the ones we may need to alter, but checking some flags in one place while others in another doesn't sound convenient or intuitive. ::flags passed by the driver can now go to the "slow" PP params. Suggested-by: Jakub Kicinski <kuba@kernel.org> Link[0]: https://lore.kernel.org/netdev/20230703133207.4f0c54ce@kernel.org Suggested-by: Alexander Duyck <alexanderduyck@fb.com> Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07page_pool: make sure frag API fields don't span between cachelinesAlexander Lobakin
After commit 5027ec19f104 ("net: page_pool: split the page_pool_params into fast and slow") that made &page_pool contain only "hot" params at the start, cacheline boundary chops frag API fields group in the middle again. To not bother with this each time fast params get expanded or shrunk, let's just align them to `4 * sizeof(long)`, the closest upper pow-2 to their actual size (2 longs + 1 int). This ensures 16-byte alignment for the 32-bit architectures and 32-byte alignment for the 64-bit ones, excluding unnecessary false-sharing. ::page_state_hold_cnt is used quite intensively on hotpath no matter if frag API is used, so move it to the newly created hole in the first cacheline. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07rtnetlink: allow rtnl_fill_link_netnsid() to run under RCU protectionEric Dumazet
We want to be able to run rtnl_fill_ifinfo() under RCU protection instead of RTNL in the future. All rtnl_link_ops->get_link_net() methods already using dev_net() are ready. I added READ_ONCE() annotations on others. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL in rtnl_xdp_prog_skb()Eric Dumazet
dev->xdp_prog is protected by RCU, we can lift RTNL requirement from rtnl_xdp_prog_skb(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL in rtnl_fill_proto_down()Eric Dumazet
Change dev_change_proto_down() and dev_change_proto_down_reason() to write once on dev->proto_down and dev->proto_down_reason. Then rtnl_fill_proto_down() can use READ_ONCE() annotations and run locklessly. rtnl_proto_down_size() should assume worst case, because readng dev->proto_down_reason multiple times would be racy without RTNL in the future. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL for many attributesEric Dumazet
Following device fields can be read locklessly in rtnl_fill_ifinfo() : type, ifindex, operstate, link_mode, mtu, min_mtu, max_mtu, group, promiscuity, allmulti, num_tx_queues, gso_max_segs, gso_max_size, gro_max_size, gso_ipv4_max_size, gro_ipv4_max_size, tso_max_size, tso_max_segs, num_rx_queues. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07net: write once on dev->allmulti and dev->promiscuityEric Dumazet
In the following patch we want to read dev->allmulti and dev->promiscuity locklessly from rtnl_fill_ifinfo() In this patch I change __dev_set_promiscuity() and __dev_set_allmulti() to write these fields (and dev->flags) only if they succeed, with WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL for IFLA_TXQLEN outputEric Dumazet
rtnl_fill_ifinfo() can read dev->tx_queue_len locklessly, granted we add corresponding READ_ONCE()/WRITE_ONCE() annotations. Add missing READ_ONCE(dev->tx_queue_len) in teql_enqueue() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL for IFLA_IFNAME outputEric Dumazet
We can use netdev_copy_name() to no longer rely on RTNL to fetch dev->name. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL for IFLA_QDISC outputEric Dumazet
dev->qdisc can be read using RCU protection. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06Merge tag 'ipsec-next-2024-05-03' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2024-05-03 1) Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support. This was defined by an early version of an IETF draft that did not make it to a standard. 2) Introduce direction attribute for xfrm states. xfrm states have a direction, a stsate can be used either for input or output packet processing. Add a direction to xfrm states to make it clear for what a xfrm state is used. * tag 'ipsec-next-2024-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: Restrict SA direction attribute to specific netlink message types xfrm: Add dir validation to "in" data path lookup xfrm: Add dir validation to "out" data path lookup xfrm: Add Direction to the SA in or out udpencap: Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support ==================== Link: https://lore.kernel.org/r/20240503082732.2835810-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06mptcp: fix typos in commentsShi-Sheng Yang
This patch fixes the spelling mistakes in comments. The changes were generated using codespell and reviewed manually. eariler -> earlier greceful -> graceful Signed-off-by: Shi-Sheng Yang <fourcolor4c@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240502154740.249839-1-fourcolor4c@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06phonet: fix rtm_phonet_notify() skb allocationEric Dumazet
fill_route() stores three components in the skb: - struct rtmsg - RTA_DST (u8) - RTA_OIF (u32) Therefore, rtm_phonet_notify() should use NLMSG_ALIGN(sizeof(struct rtmsg)) + nla_total_size(1) + nla_total_size(4) Fixes: f062f41d0657 ("Phonet: routing table Netlink interface") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Rémi Denis-Courmont <courmisch@gmail.com> Link: https://lore.kernel.org/r/20240502161700.1804476-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06Merge wireless into wireless-nextJohannes Berg
Given how late we are in the cycle, merge the two fixes from wireless into wireless-next as they don't see that urgent. This way, the wireless tree won't need rebasing later. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-06netfilter: nft_set_pipapo: merge deactivate helper into callerFlorian Westphal
Its the only remaining call site so there is no need for this to be separated anymore. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: nft_set_pipapo: prepare walk function for on-demand cloneFlorian Westphal
The existing code uses iter->type to figure out what data is needed, the live copy (READ) or clone (UPDATE). Without pending updates, priv->clone and priv->match will point to different memory locations, but they have identical content. Future patch will make priv->clone == NULL if there are no pending changes, in this case we must copy the live data for the UPDATE case. Currently this would require GFP_ATOMIC allocation. Split the walk function in two parts: one that does the walk and one that decides which data is needed. In the UPDATE case, callers hold the transaction mutex so we do not need the rcu read lock. This allows to use GFP_KERNEL allocation while cloning. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: nft_set_pipapo: prepare destroy function for on-demand cloneFlorian Westphal
Once priv->clone can be NULL in case no insertions/removals occurred in the last transaction we need to drop set elements from priv->match if priv->clone is NULL. While at it, condense this function by reusing the pipapo_free_match helper instead of open-coded version. The rcu_barrier() is removed, its not needed: old call_rcu instances for pipapo_reclaim_match do not access struct nft_set. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: nft_set_pipapo: make pipapo_clone helper return NULLFlorian Westphal
Currently it returns an error pointer, but the only possible failure is ENOMEM. After a followup patch, we'd need to discard the errno code, i.e. x = pipapo_clone() if (IS_ERR(x)) return NULL or make more changes to fix up callers to expect IS_ERR() code from set->ops->deactivate(). So simplify this and make it return ptr-or-null. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: nft_set_pipapo: move prove_locking helper aroundFlorian Westphal
Preparation patch, the helper will soon get called from insert function too. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: conntrack: remove flowtable early-drop testFlorian Westphal
Not sure why this special case exists. Early drop logic (which kicks in when conntrack table is full) should be independent of flowtable offload and only consider assured bit (i.e., two-way traffic was seen). flowtable entries hold a reference to the conntrack entry (struct nf_conn) that has been offloaded. The conntrack use count is not decremented until after the entry is free'd. This change therefore will not result in exceeding the conntrack table limit. It does allow early-drop of tcp flows even when they've been offloaded, but only if they have been offloaded before syn-ack was received or after at least one peer has sent a fin. Currently 'fin' packet reception already stops offloading, so this should not impact offloading either. Cc: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: use NF_DROP instead of -NF_DROPJason Xing
At the beginning in 2009 one patch [1] introduced collecting drop counter in nf_conntrack_in() by returning -NF_DROP. Later, another patch [2] changed the return value of tcp_packet() which now is renamed to nf_conntrack_tcp_packet() from -NF_DROP to NF_DROP. As we can see, that -NF_DROP should be corrected. Similarly, there are other two points where the -NF_DROP is used. Well, as NF_DROP is equal to 0, inverting NF_DROP makes no sense as patch [2] said many years ago. [1] commit 7d1e04598e5e ("netfilter: nf_conntrack: account packets drop by tcp_packet()") [2] commit ec8d540969da ("netfilter: conntrack: fix dropping packet after l4proto->packet()") Signed-off-by: Jason Xing <kernelxing@tencent.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06SUNRPC: Remove comment for sp_lockGuoqing Jiang
It is obsolete since sp_lock was discarded in commit 580a25756a9f ("SUNRPC: discard sp_lock"). Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06SUNRPC: add a new svc_find_listener helperJeff Layton
svc_find_listener will return the transport instance pointer for the endpoint accepting connections/peer traffic from the specified transport class and matching sockaddr. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06SUNRPC: introduce svc_xprt_create_from_sa utility routineLorenzo Bianconi
Add svc_xprt_create_from_sa utility routine and refactor svc_xprt_create() codebase in order to introduce the capability to create a svc port from socket address. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06sunrpc: removed redundant procp checkAleksandr Aprelkov
since vs_proc pointer is dereferenced before getting it's address there's no need to check for NULL. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 8e5b67731d08 ("SUNRPC: Add a callback to initialise server requests") Signed-off-by: Aleksandr Aprelkov <aaprelkov@usergate.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06net: fix out-of-bounds access in ops_initThadeu Lima de Souza Cascardo
net_alloc_generic is called by net_alloc, which is called without any locking. It reads max_gen_ptrs, which is changed under pernet_ops_rwsem. It is read twice, first to allocate an array, then to set s.len, which is later used to limit the bounds of the array access. It is possible that the array is allocated and another thread is registering a new pernet ops, increments max_gen_ptrs, which is then used to set s.len with a larger than allocated length for the variable array. Fix it by reading max_gen_ptrs only once in net_alloc_generic. If max_gen_ptrs is later incremented, it will be caught in net_assign_generic. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Fixes: 073862ba5d24 ("netns: fix net_alloc_generic()") Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240502132006.3430840-1-cascardo@igalia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: add heuristic for enabling TCP fraglist GROFelix Fietkau
When forwarding TCP after GRO, software segmentation is very expensive, especially when the checksum needs to be recalculated. One case where that's currently unavoidable is when routing packets over PPPoE. Performance improves significantly when using fraglist GRO implemented in the same way as for UDP. When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established socket in the same netns as the receiving device. While this may not cover all relevant use cases in multi-netns configurations, it should be good enough for most configurations that need this. Here's a measurement of running 2 TCP streams through a MediaTek MT7622 device (2-core Cortex-A53), which runs NAT with flow offload enabled from one ethernet port to PPPoE on another ethernet port + cake qdisc set to 1Gbps. rx-gro-list off: 630 Mbit/s, CPU 35% idle rx-gro-list on: 770 Mbit/s, CPU 40% idle Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: create tcp_gro_header_pull helper functionFelix Fietkau
Pull the code out of tcp_gro_receive in order to access the tcp header from tcp4/6_gro_receive. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: create tcp_gro_lookup helper functionFelix Fietkau
This pulls the flow port matching out of tcp_gro_receive, so that it can be reused for the next change, which adds the TCP fraglist GRO heuristic. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: add code for TCP fraglist GROFelix Fietkau
This implements fraglist GRO similar to how it's handled in UDP, however no functional changes are added yet. The next change adds a heuristic for using fraglist GRO instead of regular GRO. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: add support for segmenting TCP fraglist GSO packetsFelix Fietkau
Preparation for adding TCP fraglist GRO support. It expects packets to be combined in a similar way as UDP fraglist GSO packets. For IPv4 packets, NAT is handled in the same way as UDP fraglist GSO. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06net: move skb_gro_receive_list from udp to coreFelix Fietkau
This helper function will be used for TCP fraglist GRO support Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06netfilter: conntrack: dccp: try not to drop skb in conntrackJason Xing
It would be better not to drop skb in conntrack unless we have good alternatives. So we can treat the result of testing skb's header pointer as nf_conntrack_tcp_packet() does. Signed-off-by: Jason Xing <kernelxing@tencent.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router DiscoveryLinus Lüssing
So far Multicast Router Advertisements and Multicast Router Solicitations from the Multicast Router Discovery protocol (RFC4286) would be marked as INVALID for IPv6, even if they are in fact intact and adhering to RFC4286. This broke MRA reception and by that multicast reception on IPv6 multicast routers in a Proxmox managed setup, where Proxmox would install a rule like "-m conntrack --ctstate INVALID -j DROP" at the top of the FORWARD chain with br-nf-call-ip6tables enabled by default. Similar to as it's done for MLDv1, MLDv2 and IPv6 Neighbor Discovery already, fix this issue by excluding MRD from connection tracking handling as MRD always uses predefined multicast destinations for its messages, too. This changes the ct-state for ICMPv6 MRD messages from INVALID to UNTRACKED. This issue was found and fixed with the help of the mrdisc tool (https://github.com/troglobit/mrdisc). Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handlerPablo Neira Ayuso
Originally, device name used to be stored in the basechain, but it is not the case anymore. Remove check for NETDEV_CHANGENAME. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: nf_tables: skip transaction if update object is not implementedPablo Neira Ayuso
Turn update into noop as a follow up for: 9fedd894b4e1 ("netfilter: nf_tables: fix unexpected EOPNOTSUPP error") instead of adding a transaction object which is simply discarded at a later stage of the commit protocol. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-03Revert "net: mirror skb frag ref/unref helpers"Mina Almasry
This reverts commit a580ea994fd37f4105028f5a85c38ff6508a2b25. This revert is to resolve Dragos's report of page_pool leak here: https://lore.kernel.org/lkml/20240424165646.1625690-2-dtatulea@nvidia.com/ The reverted patch interacts very badly with commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets"). The reverted commit hopes that the pp_recycle + is_pp_page variables do not change between the skb_frag_ref and skb_frag_unref operation. If such a change occurs, the skb_frag_ref/unref will not operate on the same reference type. In the case of Dragos's report, the grabbed ref was a pp ref, but the unref was a page ref, because the pp_recycle setting on the skb was changed. Attempting to fix this issue on the fly is risky. Lets revert and I hope to reland this with better understanding and testing to ensure we don't regress some edge case while streamlining skb reffing. Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers") Reported-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20240502175423.2456544-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03rtnetlink: Correct nested IFLA_VF_VLAN_LIST attribute validationRoded Zats
Each attribute inside a nested IFLA_VF_VLAN_LIST is assumed to be a struct ifla_vf_vlan_info so the size of such attribute needs to be at least of sizeof(struct ifla_vf_vlan_info) which is 14 bytes. The current size validation in do_setvfinfo is against NLA_HDRLEN (4 bytes) which is less than sizeof(struct ifla_vf_vlan_info) so this validation is not enough and a too small attribute might be cast to a struct ifla_vf_vlan_info, this might result in an out of bands read access when accessing the saved (casted) entry in ivvl. Fixes: 79aab093a0b5 ("net: Update API for VF vlan protocol 802.1ad support") Signed-off-by: Roded Zats <rzats@paloaltonetworks.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240502155751.75705-1-rzats@paloaltonetworks.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-03Merge tag 'ipsec-2024-05-02' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2024-05-02 1) Fix an error pointer dereference in xfrm_in_fwd_icmp. From Antony Antony. 2) Preserve vlan tags for ESP transport mode software GRO. From Paul Davey. 3) Fix a spelling mistake in an uapi xfrm.h comment. From Anotny Antony. * tag 'ipsec-2024-05-02' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Correct spelling mistake in xfrm.h comment xfrm: Preserve vlan tags for transport mode software GRO xfrm: fix possible derferencing in error path ==================== Link: https://lore.kernel.org/r/20240502084838.2269355-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>