summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-03-09tcp: adjust TSO packet sizes based on min_rttEric Dumazet
Back when tcp_tso_autosize() and TCP pacing were introduced, our focus was really to reduce burst sizes for long distance flows. The simple heuristic of using sk_pacing_rate/1024 has worked well, but can lead to too small packets for hosts in the same rack/cluster, when thousands of flows compete for the bottleneck. Neal Cardwell had the idea of making the TSO burst size a function of both sk_pacing_rate and tcp_min_rtt() Indeed, for local flows, sending bigger bursts is better to reduce cpu costs, as occasional losses can be repaired quite fast. This patch is based on Neal Cardwell implementation done more than two years ago. bbr is adjusting max_pacing_rate based on measured bandwidth, while cubic would over estimate max_pacing_rate. /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable this new feature, in logarithmic steps. Tested: 100Gbit NIC, two hosts in the same rack, 4K MTU. 600 flows rate-limited to 20000000 bytes per second. Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO) ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered" 96005 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000': 65,945.29 msec task-clock # 2.845 CPUs utilized 1,314,632 context-switches # 19935.279 M/sec 5,292 cpu-migrations # 80.249 M/sec 940,641 page-faults # 14264.023 M/sec 201,117,030,926 cycles # 3049769.216 GHz (83.45%) 17,699,435,405 stalled-cycles-frontend # 8.80% frontend cycles idle (83.48%) 136,584,015,071 stalled-cycles-backend # 67.91% backend cycles idle (83.44%) 53,809,530,436 instructions # 0.27 insn per cycle # 2.54 stalled cycles per insn (83.36%) 9,062,315,523 branches # 137422329.563 M/sec (83.22%) 153,008,621 branch-misses # 1.69% of all branches (83.32%) 23.182970846 seconds time elapsed TcpInSegs 15648792 0.0 TcpOutSegs 58659110 0.0 # Average of 3.7 4K segments per TSO packet TcpExtTCPDelivered 58654791 0.0 TcpExtTCPDeliveredCE 19 0.0 After patch: ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered" 96046 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000': 48,982.58 msec task-clock # 2.104 CPUs utilized 186,014 context-switches # 3797.599 M/sec 3,109 cpu-migrations # 63.472 M/sec 941,180 page-faults # 19214.814 M/sec 153,459,763,868 cycles # 3132982.807 GHz (83.56%) 12,069,861,356 stalled-cycles-frontend # 7.87% frontend cycles idle (83.32%) 120,485,917,953 stalled-cycles-backend # 78.51% backend cycles idle (83.24%) 36,803,672,106 instructions # 0.24 insn per cycle # 3.27 stalled cycles per insn (83.18%) 5,947,266,275 branches # 121417383.427 M/sec (83.64%) 87,984,616 branch-misses # 1.48% of all branches (83.43%) 23.281200256 seconds time elapsed TcpInSegs 1434706 0.0 TcpOutSegs 58883378 0.0 # Average of 41 4K segments per TSO packet TcpExtTCPDelivered 58878971 0.0 TcpExtTCPDeliveredCE 9664 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-09tcp: autocork: take MSG_EOR hint into considerationEric Dumazet
tcp_should_autocork() is evaluating if it makes senses to not immediately send current skb, hoping that user space will add more payload on it by the time TCP stack reacts to upcoming TX completions. If current skb got MSG_EOR mark, then we know that no further data will be added, it is therefore futile to wait. SOF_TIMESTAMPING_TX_ACK will become a bit more accurate, if prior packets are still in qdisc/device queues. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20220309054706.2857266-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-09net/smc: fix -Wmissing-prototypes warning when CONFIG_SYSCTL not setDust Li
when CONFIG_SYSCTL not set, smc_sysctl_net_init/exit need to be static inline to avoid missing-prototypes if compile with W=1. Since __net_exit has noinline annotation when CONFIG_NET_NS not set, it should not be used with static inline. So remove the __net_init/exit when CONFIG_SYSCTL not set. Fixes: 7de8eb0d9039 ("net/smc: fix compile warning for smc_sysctl") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Link: https://lore.kernel.org/r/20220309033051.41893-1-dust.li@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-09bpf: Add "live packet" mode for XDP in BPF_PROG_RUNToke Høiland-Jørgensen
This adds support for running XDP programs through BPF_PROG_RUN in a mode that enables live packet processing of the resulting frames. Previous uses of BPF_PROG_RUN for XDP returned the XDP program return code and the modified packet data to userspace, which is useful for unit testing of XDP programs. The existing BPF_PROG_RUN for XDP allows userspace to set the ingress ifindex and RXQ number as part of the context object being passed to the kernel. This patch reuses that code, but adds a new mode with different semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES flag. When running BPF_PROG_RUN in this mode, the XDP program return codes will be honoured: returning XDP_PASS will result in the frame being injected into the networking stack as if it came from the selected networking interface, while returning XDP_TX and XDP_REDIRECT will result in the frame being transmitted out that interface. XDP_TX is translated into an XDP_REDIRECT operation to the same interface, since the real XDP_TX action is only possible from within the network drivers themselves, not from the process context where BPF_PROG_RUN is executed. Internally, this new mode of operation creates a page pool instance while setting up the test run, and feeds pages from that into the XDP program. The setup cost of this is amortised over the number of repetitions specified by userspace. To support the performance testing use case, we further optimise the setup step so that all pages in the pool are pre-initialised with the packet data, and pre-computed context and xdp_frame objects stored at the start of each page. This makes it possible to entirely avoid touching the page content on each XDP program invocation, and enables sending up to 9 Mpps/core on my test box. Because the data pages are recycled by the page pool, and the test runner doesn't re-initialise them for each run, subsequent invocations of the XDP program will see the packet data in the state it was after the last time it ran on that particular page. This means that an XDP program that modifies the packet before redirecting it has to be careful about which assumptions it makes about the packet content, but that is only an issue for the most naively written programs. Enabling the new flag is only allowed when not setting ctx_out and data_out in the test specification, since using it means frames will be redirected somewhere else, so they can't be returned. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220309105346.100053-2-toke@redhat.com
2022-03-09Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-03-09 1) Fix IPv6 PMTU discovery for xfrm interfaces. From Lina Wang. 2) Revert failing for policies and states that are configured with XFRMA_IF_ID 0. It broke a user configuration. From Kai Lueke. 3) Fix a possible buffer overflow in the ESP output path. 4) Fix ESP GSO for tunnel and BEET mode on inter address family tunnels. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-09ax25: Fix NULL pointer dereference in ax25_kill_by_deviceDuoming Zhou
When two ax25 devices attempted to establish connection, the requester use ax25_create(), ax25_bind() and ax25_connect() to initiate connection. The receiver use ax25_rcv() to accept connection and use ax25_create_cb() in ax25_rcv() to create ax25_cb, but the ax25_cb->sk is NULL. When the receiver is detaching, a NULL pointer dereference bug caused by sock_hold(sk) in ax25_kill_by_device() will happen. The corresponding fail log is shown below: =============================================================== BUG: KASAN: null-ptr-deref in ax25_device_event+0xfd/0x290 Call Trace: ... ax25_device_event+0xfd/0x290 raw_notifier_call_chain+0x5e/0x70 dev_close_many+0x174/0x220 unregister_netdevice_many+0x1f7/0xa60 unregister_netdevice_queue+0x12f/0x170 unregister_netdev+0x13/0x20 mkiss_close+0xcd/0x140 tty_ldisc_release+0xc0/0x220 tty_release_struct+0x17/0xa0 tty_release+0x62d/0x670 ... This patch add condition check in ax25_kill_by_device(). If s->sk is NULL, it will goto if branch to kill device. Fixes: 4e0f718daf97 ("ax25: improve the incomplete fix to avoid UAF and NPD bugs") Reported-by: Thomas Osterried <thomas@osterried.de> Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-09skb: make drop reason booleanableJakub Kicinski
We have a number of cases where function returns drop/no drop decision as a boolean. Now that we want to report the reason code as well we have to pass extra output arguments. We can make the reason code evaluate correctly as bool. I believe we're good to reorder the reasons as they are reported to user space as strings. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-09net: dsa: felix: avoid early deletion of host FDB entriesVladimir Oltean
The Felix driver declares FDB isolation but puts all standalone ports in VID 0. This is mostly problem-free as discussed with Alvin here: https://patchwork.kernel.org/project/netdevbpf/cover/20220302191417.1288145-1-vladimir.oltean@nxp.com/#24763870 however there is one catch. DSA still thinks that FDB entries are installed on the CPU port as many times as there are user ports, and this is problematic when multiple user ports share the same MAC address. Consider the default case where all user ports inherit their MAC address from the DSA master, and then the user runs: ip link set swp0 address 00:01:02:03:04:05 The above will make dsa_slave_set_mac_address() call dsa_port_standalone_host_fdb_add() for 00:01:02:03:04:05 in port 0's standalone database, and dsa_port_standalone_host_fdb_del() for the old address of swp0, again in swp0's standalone database. Both the ->port_fdb_add() and ->port_fdb_del() will be propagated down to the felix driver, which will end up deleting the old MAC address from the CPU port. But this is still in use by other user ports, so we end up breaking unicast termination for them. There isn't a problem in the fact that DSA keeps track of host standalone addresses in the individual database of each user port: some drivers like sja1105 need this. There also isn't a problem in the fact that some drivers choose the same VID/FID for all standalone ports. It is just that the deletion of these host addresses must be delayed until they are known to not be in use any longer, and only the driver has this knowledge. Since DSA keeps these addresses in &cpu_dp->fdbs and &cpu_db->mdbs, it is just a matter of walking over those lists and see whether the same MAC address is present on the CPU port in the port db of another user port. I have considered reusing the generic dsa_port_walk_fdbs() and dsa_port_walk_mdbs() schemes for this, but locking makes it difficult. In the ->port_fdb_add() method and co, &dp->addr_lists_lock is held, but dsa_port_walk_fdbs() also acquires that lock. Also, even assuming that we introduce an unlocked variant of the address iterator, we'd still need some relatively complex data structures, and a void *ctx in the dsa_fdb_walk_cb_t which we don't currently pass, such that drivers are able to figure out, after iterating, whether the same MAC address is or isn't present in the port db of another port. All the above, plus the fact that I expect other drivers to follow the same model as felix where all standalone ports use the same FID, made me conclude that a generic method provided by DSA is necessary: dsa_fdb_present_in_other_db() and the mdb equivalent. Felix calls this from the ->port_fdb_del() handler for the CPU port, when the database was classified to either a port db, or a LAG db. For symmetry, we also call this from ->port_fdb_add(), because if the address was installed once, then installing it a second time serves no purpose: it's already in hardware in VID 0 and it affects all standalone ports. This change moves dsa_db_equal() from switch.c to dsa.c, since it now has one more caller. Fixes: 54c319846086 ("net: mscc: ocelot: enforce FDB isolation when VLAN-unaware") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-09net: dsa: be mostly no-op in dsa_slave_set_mac_address when downVladimir Oltean
Since the slave unicast address is synced to hardware and to the DSA master during dsa_slave_open(), this means that a call to dsa_slave_set_mac_address() while the slave interface is down will result to a call to dsa_port_standalone_host_fdb_del() and to dev_uc_del() for the MAC address while there was no previous dsa_port_standalone_host_fdb_add() or dev_uc_add(). This is a partial revert of the blamed commit below, which was too aggressive. Fixes: 35aae5ab9121 ("net: dsa: remove workarounds for changing master promisc/allmulti only while up") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-09net: dsa: move port lists initialization to dsa_port_touchVladimir Oltean
&cpu_db->fdbs and &cpu_db->mdbs may be uninitialized lists during some call paths of felix_set_tag_protocol(). There was an attempt to avoid calling dsa_port_walk_fdbs() during setup by using a "bool change" in the felix driver, but this doesn't work when the tagging protocol is defined in the device tree, and a change is triggered by DSA at pseudo-runtime: dsa_tree_setup_switches -> dsa_switch_setup -> dsa_switch_setup_tag_protocol -> ds->ops->change_tag_protocol dsa_tree_setup_ports -> dsa_port_setup -> &dp->fdbs and &db->mdbs only get initialized here So it seems like the only way to fix this is to move the initialization of these lists earlier. dsa_port_touch() is called from dsa_switch_touch_ports() which is called from dsa_switch_parse_of(), and this runs completely before dsa_tree_setup(). Similarly, dsa_switch_release_ports() runs after dsa_tree_teardown(). Fixes: f9cef64fa23f ("net: dsa: felix: migrate host FDB and MDB entries when changing tag proto") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-09net: dsa: warn if port lists aren't empty in dsa_port_teardownVladimir Oltean
There has been recent work towards matching each switchdev object addition with a corresponding deletion. Therefore, having elements in the fdbs, mdbs, vlans lists at the time of a shared (DSA, CPU) port's teardown is indicative of a bug somewhere else, and not something that is to be expected. We shouldn't try to silently paper over that. Instead, print a warning and a stack trace. This change is a prerequisite for moving the initialization/teardown of these lists. Make it clear that clearing the lists isn't needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-08tipc: fix incorrect order of state message data sanity checkTung Nguyen
When receiving a state message, function tipc_link_validate_msg() is called to validate its header portion. Then, its data portion is validated before it can be accessed correctly. However, current data sanity check is done after the message header is accessed to update some link variables. This commit fixes this issue by moving the data sanity check to the beginning of state message handling and right after the header sanity check. Fixes: 9aa422ad3266 ("tipc: improve size validations for received domain records") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Link: https://lore.kernel.org/r/20220308021200.9245-1-tung.q.nguyen@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08SO_ZEROCOPY should return -EOPNOTSUPP rather than -ENOTSUPPSamuel Thibault
ENOTSUPP is documented as "should never be seen by user programs", and thus not exposed in <errno.h>, and thus applications cannot safely check against it (they get "Unknown error 524" as strerror). We should rather return the well-known -EOPNOTSUPP. This is similar to 2230a7ef5198 ("drop_monitor: Use correct error code") and 4a5cdc604b9c ("net/tls: Fix return values to avoid ENOTSUPP"), which did not seem to cause problems. Signed-off-by: Samuel Thibault <samuel.thibault@labri.fr> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20220307223126.djzvg44v2o2jkjsx@begin Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08mptcp: add fullmesh flag check for adding addressGeliang Tang
The fullmesh flag mustn't be used with the signal flag when adding an address. This patch added the necessary flags check for this case. Fixes: 73c762c1f07d ("mptcp: set fullmesh flag in pm_netlink") Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08mptcp: strict local address ID selectionPaolo Abeni
The address ID selection for MPJ subflows created in response to incoming ADD_ADDR option is currently unreliable: it happens at MPJ socket creation time, when the local address could be unknown. Additionally, if the no local endpoint is available for the local address, a new dummy endpoint is created, confusing the user-land. This change refactor the code to move the address ID selection inside the rebuild_header() helper, when the local address eventually selected by the route lookup is finally known. If the address used is not mapped by any endpoint - and thus can't be advertised/removed pick the id 0 instead of allocate a new endpoint. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08mptcp: introduce implicit endpointsPaolo Abeni
In some edge scenarios, an MPTCP subflows can use a local address mapped by a "implicit" endpoint created by the in-kernel path manager. Such endpoints presence can be confusing, as it's creation is hard to track and will prevent the later endpoint creation from the user-space using the same address. Define a new endpoint flag to mark implicit endpoints and allow the user-space to replace implicit them with user-provided data at endpoint creation time. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08mptcp: more careful RM_ADDR generationPaolo Abeni
The in-kernel MPTCP path manager, when processing the MPTCP_PM_CMD_FLUSH_ADDR command, generates RM_ADDR events for each known local address. While that is allowed by the RFC, it makes unpredictable the exact number of RM_ADDR generated when both ends flush the PM addresses. This change restricts the RM_ADDR generation to previously explicitly announced addresses, and adjust the expected results in a bunch of related self-tests. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08mptcp: use MPTCP_SUBFLOW_NODATAGeliang Tang
Set subflow->data_avail with the enum value MPTCP_SUBFLOW_NODATA, instead of using 0 directly. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08mptcp: add tracepoint in mptcp_sendmsg_fragGeliang Tang
The tracepoint in get_mapping_status() only dumped the incoming mpext fields. This patch added a new tracepoint in mptcp_sendmsg_frag() to dump the outgoing mpext too. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-08Revert "netfilter: conntrack: tag conntracks picked up in local out hook"Florian Westphal
This was a prerequisite for the ill-fated "netfilter: nat: force port remap to prevent shadowing well-known ports". As this has been reverted, this change can be backed out too. Signed-off-by: Florian Westphal <fw@strlen.de>
2022-03-08Revert "netfilter: nat: force port remap to prevent shadowing well-known ports"Florian Westphal
This reverts commit 878aed8db324bec64f3c3f956e64d5ae7375a5de. This change breaks existing setups where conntrack is used with asymmetric paths. In these cases, the NAT transformation occurs on the syn-ack instead of the syn: 1. SYN x:12345 -> y -> 443 // sent by initiator, receiverd by responder 2. SYNACK y:443 -> x:12345 // First packet seen by conntrack, as sent by responder 3. tuple_force_port_remap() gets called, sees: 'tcp from 443 to port 12345 NAT' -> pick a new source port, inititor receives 4. SYNACK y:$RANDOM -> x:12345 // connection is never established While its possible to avoid the breakage with NOTRACK rules, a kernel update should not break working setups. An alternative to the revert is to augment conntrack to tag mid-stream connections plus more code in the nat core to skip NAT for such connections, however, this leads to more interaction/integration between conntrack and NAT. Therefore, revert, users will need to add explicit nat rules to avoid port shadowing. Link: https://lore.kernel.org/netfilter-devel/20220302105908.GA5852@breakpoint.cc/#R Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2051413 Signed-off-by: Florian Westphal <fw@strlen.de>
2022-03-08net: dsa: tag_dsa: Fix tx from VLAN uppers on non-filtering bridgesTobias Waldekranz
In this situation (VLAN filtering disabled on br0): br0.10 / br0 / \ swp0 swp1 When a frame is transmitted from the VLAN upper, the bridge will send it down to one of the switch ports with forward offloading enabled. This will cause tag_dsa to generate a FORWARD tag. Before this change, that tag would have it's VID set to 10, even though VID 10 is not loaded in the VTU. Before the blamed commit, the frame would trigger a VTU miss and be forwarded according to the PVT configuration. Now that all fabric ports are in 802.1Q secure mode, the frame is dropped instead. Therefore, restrict the condition under which we rewrite an 802.1Q tag to a DSA tag. On standalone port's, reuse is always safe since we will always generate FROM_CPU tags in that case. For bridged ports though, we must ensure that VLAN filtering is enabled, which in turn guarantees that the VID in question is loaded into the VTU. Fixes: d352b20f4174 ("net: dsa: mv88e6xxx: Improve multichip isolation of standalone ports") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20220307110548.812455-1-tobias@waldekranz.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-03-07net: rtnetlink: fix error handling in rtnl_fill_statsinfo()Tom Rix
The clang static analyzer reports this issue rtnetlink.c:5481:2: warning: Undefined or garbage value returned to caller return err; ^~~~~~~~~~ There is a function level err variable, in the list_for_each_entry_rcu block there is a shadow err. Remove the shadow. In the same block, the call to nla_nest_start_noflag() can fail without setting an err. Set the err to -EMSGSIZE. Fixes: 216e690631f5 ("net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07net: dsa: return success if there was nothing to doTom Rix
Clang static analysis reports this representative issue dsa.c:486:2: warning: Undefined or garbage value returned to caller return err; ^~~~~~~~~~ err is only set in the loop. If the loop is empty, garbage will be returned. So initialize err to 0 to handle this noop case. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07net: Fix esp GSO on inter address family tunnels.Steffen Klassert
The esp tunnel GSO handlers use skb_mac_gso_segment to push the inner packet to the segmentation handlers. However, skb_mac_gso_segment takes the Ethernet Protocol ID from 'skb->protocol' which is wrong for inter address family tunnels. We fix this by introducing a new skb_eth_gso_segment function. This function can be used if it is necessary to pass the Ethernet Protocol ID directly to the segmentation handler. First users of this function will be the esp4 and esp6 tunnel segmentation handlers. Fixes: c35fe4106b92 ("xfrm: Add mode handlers for IPsec on layer 2") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-03-07esp: Fix BEET mode inter address family tunneling on GSOSteffen Klassert
The xfrm{4,6}_beet_gso_segment() functions did not correctly set the SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6 gso types for the address family tunneling case. Fix this by setting these gso types. Fixes: 384a46ea7bdc7 ("esp4: add gso_segment for esp4 beet mode") Fixes: 7f9e40eb18a99 ("esp6: add gso_segment for esp6 beet mode") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-03-07esp: Fix possible buffer overflow in ESP transformationSteffen Klassert
The maximum message size that can be send is bigger than the maximum site that skb_page_frag_refill can allocate. So it is possible to write beyond the allocated buffer. Fix this by doing a fallback to COW in that case. v2: Avoid get get_order() costs as suggested by Linus Torvalds. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Reported-by: valis <sec@valis.email> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-03-07net/smc: fix compile warning for smc_sysctlDust Li
kernel test robot reports multiple warning for smc_sysctl: In file included from net/smc/smc_sysctl.c:17: >> net/smc/smc_sysctl.h:23:5: warning: no previous prototype \ for function 'smc_sysctl_init' [-Wmissing-prototypes] int smc_sysctl_init(void) ^ and >> WARNING: modpost: vmlinux.o(.text+0x12ced2d): Section mismatch \ in reference from the function smc_sysctl_exit() to the variable .init.data:smc_sysctl_ops The function smc_sysctl_exit() references the variable __initdata smc_sysctl_ops. This is often because smc_sysctl_exit lacks a __initdata annotation or the annotation of smc_sysctl_ops is wrong. and net/smc/smc_sysctl.c: In function 'smc_sysctl_init_net': net/smc/smc_sysctl.c:47:17: error: 'struct netns_smc' has no member named 'smc_hdr' 47 | net->smc.smc_hdr = register_net_sysctl(net, "net/smc", table); Since we don't need global sysctl initialization. To make things clean and simple, remove the global pernet_operations and smc_sysctl_{init|exit}. Call smc_sysctl_net_{init|exit} directly from smc_net_{init|exit}. Also initialized sysctl_autocorking_size if CONFIG_SYSCTL it not set, this make sure SMC autocorking is enabled by default if CONFIG_SYSCTL is not set. Fixes: 462791bbfa35 ("net/smc: add sysctl interface for SMC") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07netfilter: bridge: clean up some inconsistent indentingJiapeng Chong
Eliminate the follow smatch warning: net/bridge/netfilter/nf_conntrack_bridge.c:385 nf_ct_bridge_confirm() warn: inconsistent indenting. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-03-07phonet: Use netif_rx().Sebastian Andrzej Siewior
Since commit baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.") the function netif_rx() can be used in preemptible/thread context as well as in interrupt context. Use netif_rx(). Cc: Remi Denis-Courmont <courmisch@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07bluetooth: Use netif_rx().Sebastian Andrzej Siewior
Since commit baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.") the function netif_rx() can be used in preemptible/thread context as well as in interrupt context. Use netif_rx(). Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: linux-bluetooth@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07batman-adv: Use netif_rx().Sebastian Andrzej Siewior
Since commit baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.") the function netif_rx() can be used in preemptible/thread context as well as in interrupt context. Use netif_rx(). Cc: Antonio Quartulli <a@unstable.cc> Cc: Marek Lindner <mareklindner@neomailbox.ch> Cc: Simon Wunderlich <sw@simonwunderlich.de> Cc: Sven Eckelmann <sven@narfation.org> Cc: b.a.t.m.a.n@lists.open-mesh.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07tipc: Use netif_rx().Sebastian Andrzej Siewior
Since commit baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.") the function netif_rx() can be used in preemptible/thread context as well as in interrupt context. Use netif_rx(). Cc: Jon Maloy <jmaloy@redhat.com> Cc: Ying Xue <ying.xue@windriver.com> Cc: tipc-discussion@lists.sourceforge.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07ptp: Add generic PTP is_sync() functionKurt Kanzenbach
PHY drivers such as micrel or dp83640 need to analyze whether a given skb is a PTP sync message for one step functionality. In order to avoid code duplication introduce a generic function and move it to ptp classify. Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-07xen/9p: use alloc/free_pages_exact()Juergen Gross
Instead of __get_free_pages() and free_pages() use alloc_pages_exact() and free_pages_exact(). This is in preparation of a change of gnttab_end_foreign_access() which will prohibit use of high-order pages. By using the local variable "order" instead of ring->intf->ring_order in the error path of xen_9pfs_front_alloc_dataring() another bug is fixed, as the error path can be entered before ring->intf->ring_order is being set. By using alloc_pages_exact() the size in bytes is specified for the allocation, which fixes another bug for the case of order < (PAGE_SHIFT - XEN_PAGE_SHIFT). This is part of CVE-2022-23041 / XSA-396. Reported-by: Simon Gaiser <simon@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> --- V4: - new patch
2022-03-06wireless: Use netif_rx().Sebastian Andrzej Siewior
Since commit baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.") the function netif_rx() can be used in preemptible/thread context as well as in interrupt context. Use netif_rx(). Cc: Johannes Berg <johannes@sipsolutions.net> Cc: linux-wireless@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-06can: Use netif_rx().Sebastian Andrzej Siewior
Since commit baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.") the function netif_rx() can be used in preemptible/thread context as well as in interrupt context. Use netif_rx(). Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Cc: linux-can@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-06Revert "net/smc: don't req_notify until all CQEs drained"Dust Li
This reverts commit a505cce6f7cfaf2aa2385aab7286063c96444526. Leon says: We already discussed that. SMC should be changed to use RDMA CQ pool API drivers/infiniband/core/cq.c. ib_poll_handler() has much better implementation (tracing, IRQ rescheduling, proper error handling) than this SMC variant. Since we will switch to ib_poll_handler() in the future, revert this patch. Link: https://lore.kernel.org/netdev/20220301105332.GA9417@linux.alibaba.com/ Suggested-by: Leon Romanovsky <leon@kernel.org> Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-06net: dsa: unlock the rtnl_mutex when dsa_master_setup() failsVladimir Oltean
After the blamed commit, dsa_tree_setup_master() may exit without calling rtnl_unlock(), fix that. Fixes: c146f9bc195a ("net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-06Revert "xfrm: state and policy should fail if XFRMA_IF_ID 0"Kai Lueke
This reverts commit 68ac0f3810e76a853b5f7b90601a05c3048b8b54 because ID 0 was meant to be used for configuring the policy/state without matching for a specific interface (e.g., Cilium is affected, see https://github.com/cilium/cilium/pull/18789 and https://github.com/cilium/cilium/pull/19019). Signed-off-by: Kai Lueke <kailueke@linux.microsoft.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-03-05selftests/bpf: Add tests for kfunc register offset checksKumar Kartikeya Dwivedi
Include a few verifier selftests that test against the problems being fixed by previous commits, i.e. release kfunc always require PTR_TO_BTF_ID fixed and var_off to be 0, and negative offset is not permitted and returns a helpful error message. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220304224645.3677453-9-memxor@gmail.com
2022-03-05bpf: Replace __diag_ignore with unified __diag_ignore_allKumar Kartikeya Dwivedi
Currently, -Wmissing-prototypes warning is ignored for GCC, but not clang. This leads to clang build warning in W=1 mode. Since the flag used by both compilers is same, we can use the unified __diag_ignore_all macro that works for all supported versions and compilers which have __diag macro support (currently GCC >= 8.0, and Clang >= 11.0). Also add nf_conntrack_bpf.h include to prevent missing prototype warning for register_nf_conntrack_bpf. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220304224645.3677453-8-memxor@gmail.com
2022-03-05net: dsa: tag_rtl8_4: add rtl8_4t trailing variantLuiz Angelo Daros de Luca
Realtek switches supports the same tag both before ethertype or between payload and the CRC. Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-04mptcp: add the mibs for MP_RSTGeliang Tang
This patch added two more mibs for MP_RST, MPTCP_MIB_MPRSTTX for the MP_RST sending and MPTCP_MIB_MPRSTRX for the MP_RST receiving. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-04mptcp: add the mibs for MP_FASTCLOSEGeliang Tang
This patch added two more mibs for MP_FASTCLOSE, MPTCP_MIB_MPFASTCLOSETX for the MP_FASTCLOSE sending and MPTCP_MIB_MPFASTCLOSERX for receiving. Also added a debug log for MP_FASTCLOSE receiving, printed out the recv_key of MP_FASTCLOSE in mptcp_parse_option to show that MP_RST is received. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-04Merge tag 'for-net-next-2022-03-04' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add new PID/VID (0x13d3/0x3567) for MT7921 - Add new PID/VID (0x2550/0x8761) for Realtek 8761BU - Add support for LG LGSBWAC02 (MT7663BUN) - Add support for BCM43430A0 and BCM43430A1 - Add support for Intel Madison Peak (MsP2) * tag 'for-net-next-2022-03-04' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (21 commits) Bluetooth: btusb: Add another Realtek 8761BU Bluetooth: hci_bcm: add BCM43430A0 & BCM43430A1 Bluetooth: use memset avoid memory leaks Bluetooth: btmtksdio: Fix kernel oops when sdio suspend. Bluetooth: btusb: Add a new PID/VID 13d3/3567 for MT7921 Bluetooth: move adv_instance_cnt read within the device lock Bluetooth: hci_event: Add missing locking on hdev in hci_le_ext_adv_term_evt Bluetooth: btusb: Make use of of BIT macro to declare flags Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg} Bluetooth: mediatek: fix the conflict between mtk and msft vendor event Bluetooth: mt7921s: support bluetooth reset mechanism Bluetooth: make array bt_uuid_any static const Bluetooth: 6lowpan: No need to clear memory twice Bluetooth: btusb: Improve stability for QCA devices Bluetooth: btusb: add support for LG LGSBWAC02 (MT7663BUN) Bluetooth: btusb: Add support for Intel Madison Peak (MsP2) device Bluetooth: Improve skb handling in mgmt_device_connected() Bluetooth: Fix skb allocation in mgmt_remote_name() & mgmt_device_connected() Bluetooth: mgmt: Remove unneeded variable Bluetooth: hci_sync: fix undefined return of hci_disconnect_all_sync() ... ==================== Link: https://lore.kernel.org/r/20220304193919.649815-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-04Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-03-04 We've added 32 non-merge commits during the last 14 day(s) which contain a total of 59 files changed, 1038 insertions(+), 473 deletions(-). The main changes are: 1) Optimize BPF stackmap's build_id retrieval by caching last valid build_id, as consecutive stack frames are likely to be in the same VMA and therefore have the same build id, from Hao Luo. 2) Several improvements to arm64 BPF JIT, that is, support for JITing the atomic[64]_fetch_add, atomic[64]_[fetch_]{and,or,xor} and lastly atomic[64]_{xchg|cmpxchg}. Also fix the BTF line info dump for JITed programs, from Hou Tao. 3) Optimize generic BPF map batch deletion by only enforcing synchronize_rcu() barrier once upon return to user space, from Eric Dumazet. 4) For kernel build parse DWARF and generate BTF through pahole with enabled multithreading, from Kui-Feng Lee. 5) BPF verifier usability improvements by making log info more concise and replacing inv with scalar type name, from Mykola Lysenko. 6) Two follow-up fixes for BPF prog JIT pack allocator, from Song Liu. 7) Add a new Kconfig to allow for loading kernel modules with non-matching BTF type info; their BTF info is then removed on load, from Connor O'Brien. 8) Remove reallocarray() usage from bpftool and switch to libbpf_reallocarray() in order to fix compilation errors for older glibc, from Mauricio Vásquez. 9) Fix libbpf to error on conflicting name in BTF when type declaration appears before the definition, from Xu Kuohai. 10) Fix issue in BPF preload for in-kernel light skeleton where loaded BPF program fds prevent init process from setting up fd 0-2, from Yucong Sun. 11) Fix libbpf reuse of pinned perf RB map when max_entries is auto-determined by libbpf, from Stijn Tintel. 12) Several cleanups for libbpf and a fix to enforce perf RB map #pages to be non-zero, from Yuntao Wang. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (32 commits) bpf: Small BPF verifier log improvements libbpf: Add a check to ensure that page_cnt is non-zero bpf, x86: Set header->size properly before freeing it x86: Disable HAVE_ARCH_HUGE_VMALLOC on 32-bit x86 bpf, test_run: Fix overflow in XDP frags bpf_test_finish selftests/bpf: Update btf_dump case for conflicting names libbpf: Skip forward declaration when counting duplicated type names bpf: Add some description about BPF_JIT_ALWAYS_ON in Kconfig bpf, docs: Add a missing colon in verifier.rst bpf: Cache the last valid build_id libbpf: Fix BPF_MAP_TYPE_PERF_EVENT_ARRAY auto-pinning bpf, selftests: Use raw_tp program for atomic test bpf, arm64: Support more atomic operations bpftool: Remove redundant slashes bpf: Add config to allow loading modules with BTF mismatches bpf, arm64: Feed byte-offset into bpf line info bpf, arm64: Call build_prologue() first in first JIT pass bpf: Fix issue with bpf preload module taking over stdout/stdin of kernel. bpftool: Bpf skeletons assert type sizes bpf: Cleanup comments ... ==================== Link: https://lore.kernel.org/r/20220304164313.31675-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-04Bluetooth: use memset avoid memory leaksMinghao Chi (CGEL ZTE)
Use memset to initialize structs to prevent memory leaks in l2cap_ecred_connect Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi (CGEL ZTE) <chi.minghao@zte.com.cn> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-03-04Bluetooth: move adv_instance_cnt read within the device lockNiels Dossche
The field adv_instance_cnt is always accessed within a device lock, except in the function add_advertising. A concurrent remove of an advertisement with adding another one could result in the if check "if a new instance was actually added" to not trigger, resulting in not triggering the "advertising added event". Signed-off-by: Niels Dossche <niels.dossche@ugent.be> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-03-04Bluetooth: hci_event: Add missing locking on hdev in hci_le_ext_adv_term_evtNiels Dossche
Both hci_find_adv_instance and hci_remove_adv_instance have a comment above their function definition saying that these two functions require the caller to hold the hdev->lock lock. However, hci_le_ext_adv_term_evt does not acquire that lock and neither does its caller hci_le_meta_evt (hci_le_meta_evt calls hci_le_ext_adv_term_evt via an indirect function call because of the lookup in hci_le_ev_table). The other event handlers all acquire and release the hdev->lock and they follow the rule that hci_find_adv_instance and hci_remove_adv_instance must be called while holding the hdev->lock lock. The solution is to make sure hci_le_ext_adv_term_evt also acquires and releases the hdev->lock lock. The check on ev->status which logs a warning and does an early return is not covered by the lock because other functions also access ev->status without holding the lock. Signed-off-by: Niels Dossche <niels.dossche@ugent.be> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>