summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-08-22ipv4: Unmask upper DSCP bits when using hintsIdo Schimmel
Unmask the upper DSCP bits when performing source validation and routing a packet using the same route from a previously processed packet (hint). In the future, this will allow us to perform the FIB lookup that is performed as part of source validation according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-13-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: udp: Unmask upper DSCP bits during early demuxIdo Schimmel
Unmask the upper DSCP bits when performing source validation for multicast packets during early demux. In the future, this will allow us to perform the FIB lookup which is performed as part of source validation according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-12-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: icmp: Pass full DS field to ip_route_input()Ido Schimmel
Align the ICMP code to other callers of ip_route_input() and pass the full DS field. In the future this will allow us to perform a route lookup according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-11-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits in RTM_GETROUTE input route lookupIdo Schimmel
Unmask the upper DSCP bits when looking up an input route via the RTM_GETROUTE netlink message so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-10-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits in input route lookupIdo Schimmel
Unmask the upper DSCP bits in input route lookup so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-9-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits in fib_compute_spec_dst()Ido Schimmel
As explained in commit 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper."), the function is used - for example - to determine the source address for an ICMP reply. If we are responding to a multicast or broadcast packet, the source address is set to the source address that we would use if we were to send a packet to the unicast source of the original packet. This address is determined by performing a FIB lookup and using the preferred source address of the resulting route. Unmask the upper DSCP bits of the DS field of the packet that triggered the reply so that in the future the FIB lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-8-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: ipmr: Unmask upper DSCP bits in ipmr_rt_fib_lookup()Ido Schimmel
Unmask the upper DSCP bits when calling ipmr_fib_lookup() so that in the future it could perform the FIB lookup according to the full DSCP value. Note that ipmr_fib_lookup() performs a FIB rule lookup (returning the relevant routing table) and that IPv4 multicast FIB rules do not support matching on TOS / DSCP. However, it is still worth unmasking the upper DSCP bits in case support for DSCP matching is ever added. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-7-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22netfilter: nft_fib: Unmask upper DSCP bitsIdo Schimmel
In a similar fashion to the iptables rpfilter match, unmask the upper DSCP bits of the DS field of the currently tested packet so that in the future the FIB lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22netfilter: rpfilter: Unmask upper DSCP bitsIdo Schimmel
The rpfilter match performs a reverse path filter test on a packet by performing a FIB lookup with the source and destination addresses swapped. Unmask the upper DSCP bits of the DS field of the tested packet so that in the future the FIB lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits when constructing the Record Route optionIdo Schimmel
The Record Route IP option records the addresses of the routers that routed the packet. In the case of forwarded packets, the kernel performs a route lookup via fib_lookup() and fills in the preferred source address of the matched route. Unmask the upper DSCP bits when performing the lookup so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits in NETLINK_FIB_LOOKUP familyIdo Schimmel
The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB lookup according to user provided parameters and communicate the result back to user space. Unmask the upper DSCP bits of the user-provided DS field before invoking the IPv4 FIB lookup API so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22bpf: Unmask upper DSCP bits in bpf_fib_lookup() helperIdo Schimmel
The helper performs a FIB lookup according to the parameters in the 'params' argument, one of which is 'tos'. According to the test in test_tc_neigh_fib.c, it seems that BPF programs are expected to initialize the 'tos' field to the full 8 bit DS field from the IPv4 header. Unmask the upper DSCP bits before invoking the IPv4 FIB lookup APIs so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22net: ipv6: ioam6: new feature tunsrcJustin Iurman
This patch provides a new feature (i.e., "tunsrc") for the tunnel (i.e., "encap") mode of ioam6. Just like seg6 already does, except it is attached to a route. The "tunsrc" is optional: when not provided (by default), the automatic resolution is applied. Using "tunsrc" when possible has a benefit: performance. See the comparison: - before (= "encap" mode): https://ibb.co/bNCzvf7 - after (= "encap" mode with "tunsrc"): https://ibb.co/PT8L6yq Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-22net: ipv6: ioam6: code alignmentJustin Iurman
This patch prepares the next one by correcting the alignment of some lines. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-21ipv6: remove redundant checkXi Huang
err varibale will be set everytime,like -ENOBUFS and in if (err < 0), when code gets into this path. This check will just slowdown the execution and that's all. Signed-off-by: Xi Huang <xuiagnh@gmail.com> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20240820115442.49366-1-xuiagnh@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20l2tp: use skb_queue_purge in l2tp_ip_destroy_sockJames Chapman
Recent commit ed8ebee6def7 ("l2tp: have l2tp_ip_destroy_sock use ip_flush_pending_frames") was incorrect in that l2tp_ip does not use socket cork and ip_flush_pending_frames is for sockets that do. Use __skb_queue_purge instead and remove the unnecessary lock. Also unexport ip_flush_pending_frames since it was originally exported in commit 4ff8863419cd ("ipv4: export ip_flush_pending_frames") for l2tp and is not used by other modules. Suggested-by: xiyou.wangcong@gmail.com Signed-off-by: James Chapman <jchapman@katalix.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240819143333.3204957-1-jchapman@katalix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20af_unix: Don't call skb_get() for OOB skb.Kuniyuki Iwashima
Since introduced, OOB skb holds an additional reference count with no special reason and caused many issues. Also, kfree_skb() and consume_skb() are used to decrement the count, which is confusing. Let's drop the unnecessary skb_get() in queue_oob() and corresponding kfree_skb(), consume_skb(), and skb_unref(). Now unix_sk(sk)->oob_skb is just a pointer to skb in the receive queue, so special handing is no longer needed in GC. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240816233921.57800-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20ipv4: Centralize TOS matchingIdo Schimmel
The TOS field in the IPv4 flow information structure ('flowi4_tos') is matched by the kernel against the TOS selector in IPv4 rules and routes. The field is initialized differently by different call sites. Some treat it as DSCP (RFC 2474) and initialize all six DSCP bits, some treat it as RFC 1349 TOS and initialize it using RT_TOS() and some treat it as RFC 791 TOS and initialize it using IPTOS_RT_MASK. What is common to all these call sites is that they all initialize the lower three DSCP bits, which fits the TOS definition in the initial IPv4 specification (RFC 791). Therefore, the kernel only allows configuring IPv4 FIB rules that match on the lower three DSCP bits which are always guaranteed to be initialized by all call sites: # ip -4 rule add tos 0x1c table 100 # ip -4 rule add tos 0x3c table 100 Error: Invalid tos. While this works, it is unlikely to be very useful. RFC 791 that initially defined the TOS and IP precedence fields was updated by RFC 2474 over twenty five years ago where these fields were replaced by a single six bits DSCP field. Extending FIB rules to match on DSCP can be done by adding a new DSCP selector while maintaining the existing semantics of the TOS selector for applications that rely on that. A prerequisite for allowing FIB rules to match on DSCP is to adjust all the call sites to initialize the high order DSCP bits and remove their masking along the path to the core where the field is matched on. However, making this change alone will result in a behavior change. For example, a forwarded IPv4 packet with a DS field of 0xfc will no longer match a FIB rule that was configured with 'tos 0x1c'. This behavior change can be avoided by masking the upper three DSCP bits in 'flowi4_tos' before comparing it against the TOS selectors in FIB rules and routes. Implement the above by adding a new function that checks whether a given DSCP value matches the one specified in the IPv4 flow information structure and invoke it from the three places that currently match on 'flowi4_tos'. Use RT_TOS() for the masking of 'flowi4_tos' instead of IPTOS_RT_MASK since the latter is not uAPI and we should be able to remove it at some point. Include <linux/ip.h> in <linux/in_route.h> since the former defines IPTOS_TOS_MASK which is used in the definition of RT_TOS() in <linux/in_route.h>. No regressions in FIB tests: # ./fib_tests.sh [...] Tests passed: 218 Tests failed: 0 And FIB rule tests: # ./fib_rule_tests.sh [...] Tests passed: 116 Tests failed: 0 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-20netfilter: nft_fib: Mask upper DSCP bits before FIB lookupIdo Schimmel
As part of its functionality, the nftables FIB expression module performs a FIB lookup, but unlike other users of the FIB lookup API, it does so without masking the upper DSCP bits. In particular, this differs from the equivalent iptables match ("rpfilter") that does mask the upper DSCP bits before the FIB lookup. Align the module to other users of the FIB lookup API and mask the upper DSCP bits using IPTOS_RT_MASK before the lookup. No regressions in nft_fib.sh: # ./nft_fib.sh PASS: fib expression did not cause unwanted packet drops PASS: fib expression did drop packets for 1.1.1.1 PASS: fib expression did drop packets for 1c3::c01d PASS: fib expression forward check with policy based routing Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-20ipv4: Mask upper DSCP bits and ECN bits in NETLINK_FIB_LOOKUP familyIdo Schimmel
The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB lookup according to user provided parameters and communicate the result back to user space. However, unlike other users of the FIB lookup API, the upper DSCP bits and the ECN bits of the DS field are not masked, which can result in the wrong result being returned. Solve this by masking the upper DSCP bits and the ECN bits using IPTOS_RT_MASK. The structure that communicates the request and the response is not exported to user space, so it is unlikely that this netlink family is actually in use [1]. [1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-20net/smc: introduce statistics for ringbufs usage of net namespaceWen Gu
The buffer size histograms in smc_stats, namely rx/tx_rmbsize, record the sizes of ringbufs for all connections that have ever appeared in the net namespace. They are incremental and we cannot know the actual ringbufs usage from these. So here introduces statistics for current ringbufs usage of existing smc connections in the net namespace into smc_stats, it will be incremented when new connection uses a ringbuf and decremented when the ringbuf is unused. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-20net/smc: introduce statistics for allocated ringbufs of link groupWen Gu
Currently we have the statistics on sndbuf/RMB sizes of all connections that have ever been on the link group, namely smc_stats_memsize. However these statistics are incremental and since the ringbufs of link group are allowed to be reused, we cannot know the actual allocated buffers through these. So here introduces the statistic on actual allocated ringbufs of the link group, it will be incremented when a new ringbuf is added into buf_list and decremented when it is deleted from buf_list. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-19net: remove redundant check in skb_shift()Zhang Changzhong
The check for '!to' is redundant here, since skb_can_coalesce() already contains this check. Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1723730983-22912-1-git-send-email-zhangchangzhong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-19mptcp: Remove unused declaration mptcp_sockopt_sync()Yue Haibing
Commit a1ab24e5fc4a ("mptcp: consolidate sockopt synchronization") removed the implementation but leave declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240816100404.879598-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-19tcp_metrics: use netlink policy for IPv6 addr len validationJakub Kicinski
Use the netlink policy to validate IPv6 address length. Destination address currently has policy for max len set, and source has no policy validation. In both cases the code does the real check. With correct policy check the code can be removed. Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240816212245.467745-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-16mpls: Reduce skb re-allocations due to skb_cow()Christoph Paasch
mpls_xmit() needs to prepend the MPLS-labels to the packet. That implies one needs to make sure there is enough space for it in the headers. Calling skb_cow() implies however that one wants to change even the playload part of the packet (which is not true for MPLS). Thus, call skb_cow_head() instead, which is what other tunnelling protocols do. Running a server with this comm it entirely removed the calls to pskb_expand_head() from the callstack in mpls_xmit() thus having significant CPU-reduction, especially at peak times. Cc: Roopa Prabhu <roopa@nvidia.com> Reported-by: Craig Taylor <cmtaylor@apple.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240815161201.22021-1-cpaasch@apple.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-16net: dsa: microchip: fix tag_ksz egress mask for KSZ8795 familyPieter Van Trappen
Fix the tag_ksz egress mask for DSA_TAG_PROTO_KSZ8795, the port is encoded in the two and not three LSB. This fix is for completeness, for example the bug doesn't manifest itself on the KSZ8794 because bit 2 seems to be always zero. Signed-off-by: Pieter Van Trappen <pieter.van.trappen@cern.ch> Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com> Link: https://patch.msgid.link/20240813142750.772781-7-vtpieter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15netdev: Add missing __percpu qualifier to a castUros Bizjak
Add missing __percpu qualifier to a (void *) cast to fix dev.c:10863:45: warning: cast removes address space '__percpu' of expression sparse warning. Also remove now unneeded __force sparse directives. Found by GCC's named address space checks. There were no changes in the resulting object file. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://patch.msgid.link/20240814070748.943671-1-ubizjak@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15openvswitch: switch to per-action label counting in conntrackXin Long
Similar to commit 70f06c115bcc ("sched: act_ct: switch to per-action label counting"), we should also switch to per-action label counting in openvswitch conntrack, as Florian suggested. The difference is that nf_connlabels_get() is called unconditionally when creating an ct action in ovs_ct_copy_action(). As with these flows: table=0,ip,actions=ct(commit,table=1) table=1,ip,actions=ct(commit,exec(set_field:0xac->ct_label),table=2) it needs to make sure the label ext is created in the 1st flow before the ct is committed in ovs_ct_commit(). Otherwise, the warning in nf_ct_ext_add() when creating the label ext in the 2nd flow will be triggered: WARN_ON(nf_ct_is_confirmed(ct)); Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/6b9347d5c1a0b364e88d900b29a616c3f8e5b1ca.1723483073.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ip: Move INFINITY_LIFE_TIME to addrconf.h.Kuniyuki Iwashima
INFINITY_LIFE_TIME is the common value used in IPv4 and IPv6 but defined in both .c files. Also, 0xffffffff used in addrconf_timeout_fixup() is INFINITY_LIFE_TIME. Let's move INFINITY_LIFE_TIME's definition to addrconf.h Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ipv4: Initialise ifa->hash in inet_alloc_ifa().Kuniyuki Iwashima
Whenever ifa is allocated, we call INIT_HLIST_NODE(&ifa->hash). Let's move it to inet_alloc_ifa(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ipv4: Remove redundant !ifa->ifa_dev check.Kuniyuki Iwashima
Now, ifa_dev is only set in inet_alloc_ifa() and never NULL after ifa gets visible. Let's remove the unneeded NULL check for ifa->ifa_dev. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ipv4: Set ifa->ifa_dev in inet_alloc_ifa().Kuniyuki Iwashima
When a new IPv4 address is assigned via ioctl(SIOCSIFADDR), inet_set_ifa() sets ifa->ifa_dev if it's different from in_dev passed as an argument. In this case, ifa is always a newly allocated object, and ifa->ifa_dev is NULL. inet_set_ifa() can be called for an existing reused ifa, then, this check is always false. Let's set ifa_dev in inet_alloc_ifa() and remove the check in inet_set_ifa(). Now, inet_alloc_ifa() is symmetric with inet_rcu_free_ifa(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ipv4: Check !in_dev earlier for ioctl(SIOCSIFADDR).Kuniyuki Iwashima
dev->ip_ptr could be NULL if we set an invalid MTU. Even then, if we issue ioctl(SIOCSIFADDR) for a new IPv4 address, devinet_ioctl() allocates struct in_ifaddr and fails later in inet_set_ifa() because in_dev is NULL. Let's move the check earlier. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml c25504a0ba36 ("dt-bindings: net: fsl,qoriq-mc-dpmac: add missed property phys") be034ee6c33d ("dt-bindings: net: fsl,qoriq-mc-dpmac: using unevaluatedProperties") https://lore.kernel.org/20240815110934.56ae623a@canb.auug.org.au drivers/net/dsa/vitesse-vsc73xx-core.c 5b9eebc2c7a5 ("net: dsa: vsc73xx: pass value in phy_write operation") fa63c6434b6f ("net: dsa: vsc73xx: check busy flag in MDIO operations") 2524d6c28bdc ("net: dsa: vsc73xx: use defined values in phy operations") https://lore.kernel.org/20240813104039.429b9fe6@canb.auug.org.au Resolve by using FIELD_PREP(), Stephen's resolution is simpler. Adjacent changes: net/vmw_vsock/af_vsock.c 69139d2919dd ("vsock: fix recursive ->recvmsg calls") 744500d81f81 ("vsock: add support for SIOCOUTQ ioctl") Link: https://patch.msgid.link/20240815141149.33862-1-pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15Merge tag 'net-6.11-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from wireless and netfilter Current release - regressions: - udp: fall back to software USO if IPv6 extension headers are present - wifi: iwlwifi: correctly lookup DMA address in SG table Current release - new code bugs: - eth: mlx5e: fix queue stats access to non-existing channels splat Previous releases - regressions: - eth: mlx5e: take state lock during tx timeout reporter - eth: mlxbf_gige: disable RX filters until RX path initialized - eth: igc: fix reset adapter logics when tx mode change Previous releases - always broken: - tcp: update window clamping condition - netfilter: - nf_queue: drop packets with cloned unconfirmed conntracks - nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests - vsock: fix recursive ->recvmsg calls - dsa: vsc73xx: fix MDIO bus access and PHY opera - eth: gtp: pull network headers in gtp_dev_xmit() - eth: igc: fix packet still tx after gate close by reducing i226 MAC retry buffer - eth: mana: fix RX buf alloc_size alignment and atomic op panic - eth: hns3: fix a deadlock problem when config TC during resetting" * tag 'net-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits) net: hns3: use correct release function during uninitialization net: hns3: void array out of bound when loop tnl_num net: hns3: fix a deadlock problem when config TC during resetting net: hns3: use the user's cfg after reset net: hns3: fix wrong use of semaphore up selftests: net: lib: kill PIDs before del netns pse-core: Conditionally set current limit during PI regulator registration net: thunder_bgx: Fix netdev structure allocation net: ethtool: Allow write mechanism of LPL and both LPL and EPL vsock: fix recursive ->recvmsg calls selftest: af_unix: Fix kselftest compilation warnings netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests netfilter: nf_tables: Introduce nf_tables_getobj_single netfilter: nf_tables: Audit log dump reset after the fact selftests: netfilter: add test for br_netfilter+conntrack+queue combination netfilter: nf_queue: drop packets with cloned unconfirmed conntracks netfilter: flowtable: initialise extack before use netfilter: nfnetlink: Initialise extack before use in ACKs netfilter: allow ipv6 fragments to arrive on different devices tcp: Update window clamping condition ...
2024-08-15Merge tag 'nf-24-08-15' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Ignores ifindex for types other than mcast/linklocal in ipv6 frag reasm, from Tom Hughes. 2) Initialize extack for begin/end netlink message marker in batch, from Donald Hunter. 3) Initialize extack for flowtable offload support, also from Donald. 4) Dropped packets with cloned unconfirmed conntracks in nfqueue, later it should be possible to explore lookup after reinject but Florian prefers this approach at this stage. From Florian Westphal. 5) Add selftest for cloned unconfirmed conntracks in nfqueue for previous update. 6) Audit after filling netlink header successfully in object dump, from Phil Sutter. 7-8) Fix concurrent dump and reset which could result in underflow counter / quota objects. netfilter pull request 24-08-15 * tag 'nf-24-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests netfilter: nf_tables: Introduce nf_tables_getobj_single netfilter: nf_tables: Audit log dump reset after the fact selftests: netfilter: add test for br_netfilter+conntrack+queue combination netfilter: nf_queue: drop packets with cloned unconfirmed conntracks netfilter: flowtable: initialise extack before use netfilter: nfnetlink: Initialise extack before use in ACKs netfilter: allow ipv6 fragments to arrive on different devices ==================== Link: https://patch.msgid.link/20240814222042.150590-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15net: ethtool: Allow write mechanism of LPL and both LPL and EPLDanielle Ratson
CMIS 5.2 standard section 9.4.2 defines four types of firmware update supported mechanism: None, only LPL, only EPL, both LPL and EPL. Currently, only LPL (Local Payload) type of write firmware block is supported. However, if the module supports both LPL and EPL the flashing process wrongly fails for no supporting LPL. Fix that, by allowing the write mechanism to be LPL or both LPL and EPL. Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB") Reported-by: Vladyslav Mykhaliuk <vmykhaliuk@nvidia.com> Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20240812140824.3718826-1-danieller@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15vsock: fix recursive ->recvmsg callsCong Wang
After a vsock socket has been added to a BPF sockmap, its prot->recvmsg has been replaced with vsock_bpf_recvmsg(). Thus the following recursiion could happen: vsock_bpf_recvmsg() -> __vsock_recvmsg() -> vsock_connectible_recvmsg() -> prot->recvmsg() -> vsock_bpf_recvmsg() again We need to fix it by calling the original ->recvmsg() without any BPF sockmap logic in __vsock_recvmsg(). Fixes: 634f1a7110b4 ("vsock: support sockmap") Reported-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com Tested-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com Cc: Bobby Eshleman <bobby.eshleman@bytedance.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20240812022153.86512-1-xiyou.wangcong@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-14netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requestsPhil Sutter
Objects' dump callbacks are not concurrency-safe per-se with reset bit set. If two CPUs perform a reset at the same time, at least counter and quota objects suffer from value underrun. Prevent this by introducing dedicated locking callbacks for nfnetlink and the asynchronous dump handling to serialize access. Fixes: 43da04a593d8 ("netfilter: nf_tables: atomic dump and reset for stateful objects") Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: nf_tables: Introduce nf_tables_getobj_singlePhil Sutter
Outsource the reply skb preparation for non-dump getrule requests into a distinct function. Prep work for object reset locking. Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: nf_tables: Audit log dump reset after the factPhil Sutter
In theory, dumpreset may fail and invalidate the preceeding log message. Fix this and use the occasion to prepare for object reset locking, which benefits from a few unrelated changes: * Add an early call to nfnetlink_unicast if not resetting which effectively skips the audit logging but also unindents it. * Extract the table's name from the netlink attribute (which is verified via earlier table lookup) to not rely upon validity of the looked up table pointer. * Do not use local variable family, it will vanish. Fixes: 8e6cf365e1d5 ("audit: log nftables configuration change events") Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: nf_queue: drop packets with cloned unconfirmed conntracksFlorian Westphal
Conntrack assumes an unconfirmed entry (not yet committed to global hash table) has a refcount of 1 and is not visible to other cores. With multicast forwarding this assumption breaks down because such skbs get cloned after being picked up, i.e. ct->use refcount is > 1. Likewise, bridge netfilter will clone broad/mutlicast frames and all frames in case they need to be flood-forwarded during learning phase. For ip multicast forwarding or plain bridge flood-forward this will "work" because packets don't leave softirq and are implicitly serialized. With nfqueue this no longer holds true, the packets get queued and can be reinjected in arbitrary ways. Disable this feature, I see no other solution. After this patch, nfqueue cannot queue packets except the last multicast/broadcast packet. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: flowtable: initialise extack before useDonald Hunter
Fix missing initialisation of extack in flow offload. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: nfnetlink: Initialise extack before use in ACKsDonald Hunter
Add missing extack initialisation when ACKing BATCH_BEGIN and BATCH_END. Fixes: bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages") Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14netfilter: allow ipv6 fragments to arrive on different devicesTom Hughes
Commit 264640fc2c5f4 ("ipv6: distinguish frag queues by device for multicast and link-local packets") modified the ipv6 fragment reassembly logic to distinguish frag queues by device for multicast and link-local packets but in fact only the main reassembly code limits the use of the device to those address types and the netfilter reassembly code uses the device for all packets. This means that if fragments of a packet arrive on different interfaces then netfilter will fail to reassemble them and the fragments will be expired without going any further through the filters. Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units") Signed-off-by: Tom Hughes <tom@compton.nu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14tcp: Update window clamping conditionSubash Abhinov Kasiviswanathan
This patch is based on the discussions between Neal Cardwell and Eric Dumazet in the link https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/ It was correctly pointed out that tp->window_clamp would not be updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if (copied <= tp->rcvq_space.space). While it is expected for most setups to leave the sysctl enabled, the latter condition may not end up hitting depending on the TCP receive queue size and the pattern of arriving data. The updated check should be hit only on initial MSS update from TCP_MIN_MSS to measured MSS value and subsequently if there was an update to a larger value. Fixes: 05f76b2d634e ("tcp: Adjust clamping window for applications specifying SO_RCVBUF") Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com> Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-13mptcp: correct MPTCP_SUBFLOW_ATTR_SSN_OFFSET reserved sizeEugene Syromiatnikov
ssn_offset field is u32 and is placed into the netlink response with nla_put_u32(), but only 2 bytes are reserved for the attribute payload in subflow_get_info_size() (even though it makes no difference in the end, as it is aligned up to 4 bytes). Supply the correct argument to the relevant nla_total_size() call to make it less confusing. Fixes: 5147dfb50832 ("mptcp: allow dumping subflow context to userspace") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20240812065024.GA19719@asgard.redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-13net: netpoll: extract core of netpoll_cleanupBreno Leitao
Extract the core part of netpoll_cleanup(), so, it could be called from a caller that has the rtnl lock already. Netconsole uses this in a weird way right now: __netpoll_cleanup(&nt->np); spin_lock_irqsave(&target_list_lock, flags); netdev_put(nt->np.dev, &nt->np.dev_tracker); nt->np.dev = NULL; nt->enabled = false; This will be replaced by do_netpoll_cleanup() as the locking situation is overhauled. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-12net/smc: Use static_assert() to check struct sizesGustavo A. R. Silva
Commit 9748dbc9f265 ("net/smc: Avoid -Wflex-array-member-not-at-end warnings") introduced tagged `struct smc_clc_v2_extension_fixed` and `struct smc_clc_smcd_v2_extension_fixed`. We want to ensure that when new members need to be added to the flexible structures, they are always included within these tagged structs. So, we use `static_assert()` to ensure that the memory layout for both the flexible structure and the tagged struct is the same after any changes. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Jan Karcher <jaka@linux.ibm.com> Link: https://patch.msgid.link/ZrVBuiqFHAORpFxE@cute Signed-off-by: Jakub Kicinski <kuba@kernel.org>