summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-05-09net/sched: adjust device watchdog timer to detect stopped queue at right timePraveen Kumar Kannoju
Applications are sensitive to long network latency, particularly heartbeat monitoring ones. Longer the tx timeout recovery higher the risk with such applications on a production machines. This patch remedies, yet honoring device set tx timeout. Modify watchdog next timeout to be shorter than the device specified. Compute the next timeout be equal to device watchdog timeout less the how long ago queue stop had been done. At next watchdog timeout tx timeout handler is called into if still in stopped state. Either called or not called, restore the watchdog timeout back to device specified. Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@oracle.com> Link: https://lore.kernel.org/r/20240508133617.4424-1-praveen.kannoju@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-10kbuild: use $(src) instead of $(srctree)/$(src) for source directoryMasahiro Yamada
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for checked-in source files. It is merely a convention without any functional difference. In fact, $(obj) and $(src) are exactly the same, as defined in scripts/Makefile.build: src := $(obj) When the kernel is built in a separate output directory, $(src) does not accurately reflect the source directory location. While Kbuild resolves this discrepancy by specifying VPATH=$(srctree) to search for source files, it does not cover all cases. For example, when adding a header search path for local headers, -I$(srctree)/$(src) is typically passed to the compiler. This introduces inconsistency between upstream and downstream Makefiles because $(src) is used instead of $(srctree)/$(src) for the latter. To address this inconsistency, this commit changes the semantics of $(src) so that it always points to the directory in the source tree. Going forward, the variables used in Makefiles will have the following meanings: $(obj) - directory in the object tree $(src) - directory in the source tree (changed by this commit) $(objtree) - the top of the kernel object tree $(srctree) - the top of the kernel source tree Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced with $(src). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 35d92abfbad8 ("net: hns3: fix kernel crash when devlink reload during initialization") 2a1a1a7b5fd7 ("net: hns3: add command queue trace for hns3") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-09l2tp: Support several sockets with same IP/port quadrupleSamuel Thibault
Some l2tp providers will use 1701 as origin port and open several tunnels for the same origin and target. On the Linux side, this may mean opening several sockets, but then trafic will go to only one of them, losing the trafic for the tunnel of the other socket (or leaving it up to userland, consuming a lot of cpu%). This can also happen when the l2tp provider uses a cluster, and load-balancing happens to migrate from one origin IP to another one, for which a socket was already established. Managing reassigning tunnels from one socket to another would be very hairy for userland. Lastly, as documented in l2tpconfig(1), as client it may be necessary to use 1701 as origin port for odd firewalls reasons, which could prevent from establishing several tunnels to a l2tp server, for the same reason: trafic would get only on one of the two sockets. With the V2 protocol it is however easy to route trafic to the proper tunnel, by looking up the tunnel number in the network namespace. This fixes the three cases altogether. Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Link: https://lore.kernel.org/r/20240506215336.1470009-1-samuel.thibault@ens-lyon.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-09SUNRPC: Fix gss_free_in_token_pages()Chuck Lever
Dan Carpenter says: > Commit 5866efa8cbfb ("SUNRPC: Fix svcauth_gss_proxy_init()") from Oct > 24, 2019 (linux-next), leads to the following Smatch static checker > warning: > > net/sunrpc/auth_gss/svcauth_gss.c:1039 gss_free_in_token_pages() > warn: iterator 'i' not incremented > > net/sunrpc/auth_gss/svcauth_gss.c > 1034 static void gss_free_in_token_pages(struct gssp_in_token *in_token) > 1035 { > 1036 u32 inlen; > 1037 int i; > 1038 > --> 1039 i = 0; > 1040 inlen = in_token->page_len; > 1041 while (inlen) { > 1042 if (in_token->pages[i]) > 1043 put_page(in_token->pages[i]); > ^ > This puts page zero over and over. > > 1044 inlen -= inlen > PAGE_SIZE ? PAGE_SIZE : inlen; > 1045 } > 1046 > 1047 kfree(in_token->pages); > 1048 in_token->pages = NULL; > 1049 } Based on the way that the ->pages[] array is constructed in gss_read_proxy_verf(), we know that once the loop encounters a NULL page pointer, the remaining array elements must also be NULL. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Fixes: 5866efa8cbfb ("SUNRPC: Fix svcauth_gss_proxy_init()") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-09net/smc: fix neighbour and rtable leak in smc_ib_find_route()Wen Gu
In smc_ib_find_route(), the neighbour found by neigh_lookup() and rtable resolved by ip_route_output_flow() are not released or put before return. It may cause the refcount leak, so fix it. Link: https://lore.kernel.org/r/20240506015439.108739-1-guwen@linux.alibaba.com Fixes: e5c4744cfb59 ("net/smc: add SMC-Rv2 connection establishment") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Link: https://lore.kernel.org/r/20240507125331.2808-1-guwen@linux.alibaba.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-08Merge tag 'wireless-next-2024-05-08' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.10 The third, and most likely the last, "new features" pull request for v6.10 with changes both in stack and in drivers. In ath12k and rtw89 we disabled Wireless Extensions just like with iwlwifi earlier. Wi-Fi 7 devices will not support Wireless Extensions (WEXT) anymore so if someone is still using the legacy WEXT interface it's time to switch to nl80211 now! We merged wireless into wireless-next as we decided not to send a wireless pull request to v6.9 this late in the cycle. Also an immutable branch with MHI subsystem was merged to get ath11k and ath12k hibernation working. Major changes: mac80211/cfg80211 * handle color change per link mt76 * mt7921 LED control * mt7925 EHT radiotap support * mt7920e PCI support ath12k * debugfs support * dfs_simulate_radar debugfs file * disable Wireless Extensions * suspend and hibernation support * ACPI support * refactoring in preparation of multi-link support ath11k * support hibernation (required changes in qrtr and MHI subsystems) * ieee80211-freq-limit Device Tree property support ath10k * firmware-name Device Tree property support rtw89 * complete features of new WiFi 7 chip 8922AE including BT-coexistence and WoWLAN * use BIOS ACPI settings to set TX power and channels * disable Wireless Extensios on Wi-Fi 7 devices iwlwifi * block_esr debugfs file * support again firmware API 90 (was reverted earlier) * provide channel survey information for Automatic Channel Selection (ACS) * tag 'wireless-next-2024-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (214 commits) wifi: mwl8k: initialize cmd->addr[] properly wifi: iwlwifi: Ensure prph_mac dump includes all addresses wifi: iwlwifi: mvm: don't request statistics in restart wifi: iwlwifi: mvm: exit EMLSR if secondary link is not used wifi: iwlwifi: mvm: add beacon template version 14 wifi: iwlwifi: mvm: align UATS naming with firmware wifi: iwlwifi: Force SCU_ACTIVE for specific platforms wifi: iwlwifi: mvm: record and return channel survey information wifi: iwlwifi: mvm: add the firmware API for channel survey wifi: iwlwifi: mvm: Fix race in scan completion wifi: iwlwifi: mvm: Add a print for invalid link pair due to bandwidth wifi: iwlwifi: mvm: add a debugfs for reading EMLSR blocking reasons wifi: iwlwifi: mvm: Add active EMLSR blocking reasons prints wifi: iwlwifi: bump FW API to 90 for BZ/SC devices wifi: iwlwifi: mvm: fix primary link setting wifi: iwlwifi: mvm: use already determined cmd_id wifi: iwlwifi: mvm: don't reset link selection during restart wifi: iwlwifi: Print EMLSR states name wifi: iwlwifi: mvm: Block EMLSR when a p2p/softAP vif is active wifi: iwlwifi: mvm: fix typo in debug print ... ==================== Link: https://lore.kernel.org/r/20240508120726.85A10C113CC@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08ipv6: prevent NULL dereference in ip6_output()Eric Dumazet
According to syzbot, there is a chance that ip6_dst_idev() returns NULL in ip6_output(). Most places in IPv6 stack deal with a NULL idev just fine, but not here. syzbot reported: general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7] CPU: 0 PID: 9775 Comm: syz-executor.4 Not tainted 6.9.0-rc5-syzkaller-00157-g6a30653b604a #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:ip6_output+0x231/0x3f0 net/ipv6/ip6_output.c:237 Code: 3c 1e 00 49 89 df 74 08 4c 89 ef e8 19 58 db f7 48 8b 44 24 20 49 89 45 00 49 89 c5 48 8d 9d e0 05 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 38 84 c0 4c 8b 74 24 28 0f 85 61 01 00 00 8b 1b 31 ff RSP: 0018:ffffc9000927f0d8 EFLAGS: 00010202 RAX: 00000000000000bc RBX: 00000000000005e0 RCX: 0000000000040000 RDX: ffffc900131f9000 RSI: 0000000000004f47 RDI: 0000000000004f48 RBP: 0000000000000000 R08: ffffffff8a1f0b9a R09: 1ffffffff1f51fad R10: dffffc0000000000 R11: fffffbfff1f51fae R12: ffff8880293ec8c0 R13: ffff88805d7fc000 R14: 1ffff1100527d91a R15: dffffc0000000000 FS: 00007f135c6856c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000080 CR3: 0000000064096000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> NF_HOOK include/linux/netfilter.h:314 [inline] ip6_xmit+0xefe/0x17f0 net/ipv6/ip6_output.c:358 sctp_v6_xmit+0x9f2/0x13f0 net/sctp/ipv6.c:248 sctp_packet_transmit+0x26ad/0x2ca0 net/sctp/output.c:653 sctp_packet_singleton+0x22c/0x320 net/sctp/outqueue.c:783 sctp_outq_flush_ctrl net/sctp/outqueue.c:914 [inline] sctp_outq_flush+0x6d5/0x3e20 net/sctp/outqueue.c:1212 sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x59cc/0x60c0 net/sctp/sm_sideeffect.c:1169 sctp_primitive_ASSOCIATE+0x95/0xc0 net/sctp/primitive.c:73 __sctp_connect+0x9cd/0xe30 net/sctp/socket.c:1234 sctp_connect net/sctp/socket.c:4819 [inline] sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834 __sys_connect_file net/socket.c:2048 [inline] __sys_connect+0x2df/0x310 net/socket.c:2065 __do_sys_connect net/socket.c:2075 [inline] __se_sys_connect net/socket.c:2072 [inline] __x64_sys_connect+0x7a/0x90 net/socket.c:2072 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 778d80be5269 ("ipv6: Add disable_ipv6 sysctl to disable IPv6 operaion on specific interface.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20240507161842.773961-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08hsr: Simplify code for announcing HSR nodes timer setupLukasz Majewski
Up till now the code to start HSR announce timer, which triggers sending supervisory frames, was assuming that hsr_netdev_notify() would be called at least twice for hsrX interface. This was required to have different values for old and current values of network device's operstate. This is problematic for a case where hsrX interface is already in the operational state when hsr_netdev_notify() is called, so timer is not configured to trigger and as a result the hsrX is not sending supervisory frames to HSR ring. This error has been discovered when hsr_ping.sh script was run. To be more specific - for the hsr1 and hsr2 the hsr_netdev_notify() was called at least twice with different IF_OPER_{LOWERDOWN|DOWN|UP} states assigned in hsr_check_carrier_and_operstate(hsr). As a result there was no issue with sending supervisory frames. However, with hsr3, the notify function was called only once with operstate set to IF_OPER_UP and timer responsible for triggering supervisory frames was not fired. The solution is to use netif_oper_up() and netif_running() helper functions to assess if network hsrX device is up. Only then, when the timer is not already pending, it is started. Otherwise it is deactivated. Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Lukasz Majewski <lukma@denx.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240507111214.3519800-1-lukma@denx.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08phonet: no longer hold RTNL in route_dumpit()Eric Dumazet
route_dumpit() already relies on RCU, RTNL is not needed. Also change return value at the end of a dump. This allows NLMSG_DONE to be appended to the current skb at the end of a dump, saving a couple of recvmsg() system calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Remi Denis-Courmont <courmisch@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240507121748.416287-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08ipv6: fib6_rules: avoid possible NULL dereference in fib6_rule_action()Eric Dumazet
syzbot is able to trigger the following crash [1], caused by unsafe ip6_dst_idev() use. Indeed ip6_dst_idev() can return NULL, and must always be checked. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 31648 Comm: syz-executor.0 Not tainted 6.9.0-rc4-next-20240417-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:__fib6_rule_action net/ipv6/fib6_rules.c:237 [inline] RIP: 0010:fib6_rule_action+0x241/0x7b0 net/ipv6/fib6_rules.c:267 Code: 02 00 00 49 8d 9f d8 00 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 f9 32 bf f7 48 8b 1b 48 89 d8 48 c1 e8 03 <42> 80 3c 20 00 74 08 48 89 df e8 e0 32 bf f7 4c 8b 03 48 89 ef 4c RSP: 0018:ffffc9000fc1f2f0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 1a772f98c8186700 RDX: 0000000000000003 RSI: ffffffff8bcac4e0 RDI: ffffffff8c1f9760 RBP: ffff8880673fb980 R08: ffffffff8fac15ef R09: 1ffffffff1f582bd R10: dffffc0000000000 R11: fffffbfff1f582be R12: dffffc0000000000 R13: 0000000000000080 R14: ffff888076509000 R15: ffff88807a029a00 FS: 00007f55e82ca6c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b31d23000 CR3: 0000000022b66000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> fib_rules_lookup+0x62c/0xdb0 net/core/fib_rules.c:317 fib6_rule_lookup+0x1fd/0x790 net/ipv6/fib6_rules.c:108 ip6_route_output_flags_noref net/ipv6/route.c:2637 [inline] ip6_route_output_flags+0x38e/0x610 net/ipv6/route.c:2649 ip6_route_output include/net/ip6_route.h:93 [inline] ip6_dst_lookup_tail+0x189/0x11a0 net/ipv6/ip6_output.c:1120 ip6_dst_lookup_flow+0xb9/0x180 net/ipv6/ip6_output.c:1250 sctp_v6_get_dst+0x792/0x1e20 net/sctp/ipv6.c:326 sctp_transport_route+0x12c/0x2e0 net/sctp/transport.c:455 sctp_assoc_add_peer+0x614/0x15c0 net/sctp/associola.c:662 sctp_connect_new_asoc+0x31d/0x6c0 net/sctp/socket.c:1099 __sctp_connect+0x66d/0xe30 net/sctp/socket.c:1197 sctp_connect net/sctp/socket.c:4819 [inline] sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834 __sys_connect_file net/socket.c:2048 [inline] __sys_connect+0x2df/0x310 net/socket.c:2065 __do_sys_connect net/socket.c:2075 [inline] __se_sys_connect net/socket.c:2072 [inline] __x64_sys_connect+0x7a/0x90 net/socket.c:2072 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 5e5f3f0f8013 ("[IPV6] ADDRCONF: Convert ipv6_get_saddr() to ipv6_dev_get_saddr().") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240507163145.835254-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08net: dst_cache: minor optimization in dst_cache_set_ip6()Eric Dumazet
There is no need to use this_cpu_ptr(dst_cache->cache) twice. Compiler is unable to optimize the second call, because of per-cpu constraints. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240507132717.627518-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08net: dst_cache: annotate data-races around dst_cache->reset_tsEric Dumazet
dst_cache->reset_ts is read or written locklessly, add READ_ONCE() and WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240507132000.614591-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08rxrpc: Only transmit one ACK per jumbo packet receivedDavid Howells
Only generate one ACK packet for all the subpackets in a jumbo packet. If we would like to generate more than one ACK, we prioritise them base on their reason code, in the order, highest first: OutOfSeq > NoSpace > ExceedsWin > Duplicate > Requested > Delay > Idle For the first four, we reference the lowest offending subpacket; for the last three, the highest. This reduces the number of ACKs we end up transmitting to one per UDP packet transmitted to reduce network loading and packet parsing. Fixes: 5d7edbc9231e ("rxrpc: Get rid of the Rx ring") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Reviewed-by: Jeffrey Altman <jaltman@auristor.com <mailto:jaltman@auristor.com>> Link: https://lore.kernel.org/r/20240503150749.1001323-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08rxrpc: Fix congestion control algorithmDavid Howells
Make the following fixes to the congestion control algorithm: (1) Don't vary the cwnd starting value by the size of RXRPC_TX_SMSS since that's currently held constant - set to the size of a jumbo subpacket payload so that we can create jumbo packets on the fly. The current code invariably picks 3 as the starting value. Further, the starting cwnd needs to be an even number because we ack every other packet, so set it to 4. (2) Don't cut ssthresh when we see an ACK come from the peer with a receive window (rwind) less than ssthresh. ssthresh keeps track of characteristics of the connection whereas rwind may be reduced by the peer for any reason - and may be reduced to 0. Fixes: 1fc4fa2ac93d ("rxrpc: Fix congestion management") Fixes: 0851115090a3 ("rxrpc: Reduce ssthresh to peer's receive window") Signed-off-by: David Howells <dhowells@redhat.com> Suggested-by: Simon Wilkinson <sxw@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Reviewed-by: Jeffrey Altman <jaltman@auristor.com <mailto:jaltman@auristor.com>> Link: https://lore.kernel.org/r/20240503150749.1001323-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-08ax25: Remove superfuous "return" from ax25_ds_set_timerJoel Granados
Remove the explicit call to "return" in the void ax25_ds_set_timer function that was introduced in 78a7b5dbc060 ("ax.25: x.25: Remove the now superfluous sentinel elements from ctl_table array"). Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08ipvs: allow some sysctls in non-init user namespacesAlexander Mikhalitsyn
Let's make all IPVS sysctls writtable even when network namespace is owned by non-initial user namespace. Let's make a few sysctls to be read-only for non-privileged users: - sync_qlen_max - sync_sock_size - run_estimation - est_cpulist - est_nice I'm trying to be conservative with this to prevent introducing any security issues in there. Maybe, we can allow more sysctls to be writable, but let's do this on-demand and when we see real use-case. This patch is motivated by user request in the LXC project [1]. Having this can help with running some Kubernetes [2] or Docker Swarm [3] workloads inside the system containers. Link: https://github.com/lxc/lxc/issues/4278 [1] Link: https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103 [2] Link: https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682 [3] Cc: Julian Anastasov <ja@ssi.bg> Cc: Simon Horman <horms@verge.net.au> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthreshAlexander Mikhalitsyn
Cc: Julian Anastasov <ja@ssi.bg> Cc: Simon Horman <horms@verge.net.au> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08ipv6: Fix potential uninit-value access in __ip6_make_skb()Shigeru Yoshida
As it was done in commit fc1092f51567 ("ipv4: Fix uninit-value access in __ip_make_skb()") for IPv4, check FLOWI_FLAG_KNOWN_NH on fl6->flowi6_flags instead of testing HDRINCL on the socket to avoid a race condition which causes uninit-value access. Fixes: ea30388baebc ("ipv6: Fix an uninit variable access bug in __ip6_make_skb()") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: bridge: switchdev: Improve error message for port_obj_add/del functionsOleksij Rempel
Enhance the error reporting mechanism in the switchdev framework to provide more informative and user-friendly error messages. Following feedback from users struggling to understand the implications of error messages like "failed (err=-28) to add object (id=2)", this update aims to clarify what operation failed and how this might impact the system or network. With this change, error messages now include a description of the failed operation, the specific object involved, and a brief explanation of the potential impact on the system. This approach helps administrators and developers better understand the context and severity of errors, facilitating quicker and more effective troubleshooting. Example of the improved logging: [ 70.516446] ksz-switch spi0.0 uplink: Failed to add Port Multicast Database entry (object id=2) with error: -ENOSPC (-28). [ 70.516446] Failure in updating the port's Multicast Database could lead to multicast forwarding issues. [ 70.516446] Current HW/SW setup lacks sufficient resources. This comprehensive update includes handling for a range of switchdev object IDs, ensuring that most operations within the switchdev framework benefit from clearer error reporting. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08appletalk: Improve handling of broadcast packetsVincent Duvert
When a broadcast AppleTalk packet is received, prefer queuing it on the socket whose address matches the address of the interface that received the packet (and is listening on the correct port). Userspace applications that handle such packets will usually send a response on the same socket that received the packet; this fix allows the response to be sent on the correct interface. If a socket matching the interface's address is not found, an arbitrary socket listening on the correct port will be used, if any. This matches the implementation's previous behavior. Fixes atalkd's responses to network information requests when multiple network interfaces are configured to use AppleTalk. Link: https://lore.kernel.org/netdev/20200722113752.1218-2-vincent.ldev@duvert.net/ Link: https://gist.github.com/VinDuv/4db433b6dce39d51a5b7847ee749b2a4 Signed-off-by: Vincent Duvert <vincent.ldev@duvert.net> Signed-off-by: Doug Brown <doug@schmorgal.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net/ipv4: add tracepoint for icmp_sendPeilin He
Introduce a tracepoint for icmp_send, which can help users to get more detail information conveniently when icmp abnormal events happen. 1. Giving an usecase example: ============================= When an application experiences packet loss due to an unreachable UDP destination port, the kernel will send an exception message through the icmp_send function. By adding a trace point for icmp_send, developers or system administrators can obtain detailed information about the UDP packet loss, including the type, code, source address, destination address, source port, and destination port. This facilitates the trouble-shooting of UDP packet loss issues especially for those network-service applications. 2. Operation Instructions: ========================== Switch to the tracing directory. cd /sys/kernel/tracing Filter for destination port unreachable. echo "type==3 && code==3" > events/icmp/icmp_send/filter Enable trace event. echo 1 > events/icmp/icmp_send/enable 3. Result View: ================ udp_client_erro-11370 [002] ...s.12 124.728002: icmp_send: icmp_send: type=3, code=3. From 127.0.0.1:41895 to 127.0.0.1:6666 ulen=23 skbaddr=00000000589b167a Signed-off-by: Peilin He <he.peilin@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Liu Chun <liu.chun2@zte.com.cn> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: bridge: fix corrupted ethernet header on multicast-to-unicastFelix Fietkau
The change from skb_copy to pskb_copy unfortunately changed the data copying to omit the ethernet header, since it was pulled before reaching this point. Fix this by calling __skb_push/pull around pskb_copy. Fixes: 59c878cbcdd8 ("net: bridge: fix multicast-to-unicast with fraglist GSO") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: dsa: add support switches global DSCP priority mappingOleksij Rempel
Some switches like Microchip KSZ variants do not support per port DSCP priority configuration. Instead there is a global DSCP mapping table. To handle it, we will accept set/del request to any of user ports to make global configuration and update dcb app entries for all other ports. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: add IEEE 802.1q specific helpersOleksij Rempel
IEEE 802.1q specification provides recommendation and examples which can be used as good default values for different drivers. This patch implements mapping examples documented in IEEE 802.1Q-2022 in Annex I "I.3 Traffic type to traffic class mapping" and IETF DSCP naming and mapping DSCP to Traffic Type inspired by RFC8325. This helpers will be used in followup patches for dsa/microchip DCB implementation. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08net: dsa: add support for DCB get/set apptrust configurationOleksij Rempel
Add DCB support to get/set trust configuration for different packet priority information sources. Some switch allow to chose different source of packet priority classification. For example on KSZ switches it is possible to configure VLAN PCP and/or DSCP sources. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-08xsk: use generic DMA sync shortcut instead of a custom oneAlexander Lobakin
XSk infra's been using its own DMA sync shortcut to try avoiding redundant function calls. Now that there is a generic one, remove the custom implementation and rely on the generic helpers. xsk_buff_dma_sync_for_cpu() doesn't need the second argument anymore, remove it. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-08page_pool: check for DMA sync shortcut earlierAlexander Lobakin
We can save a couple more function calls in the Page Pool code if we check for dma_need_sync() earlier, just when we test pp->p.dma_sync. Move both these checks into an inline wrapper and call the PP wrapper over the generic DMA sync function only when both are true. You can't cache the result of dma_need_sync() in &page_pool, as it may change anytime if an SWIOTLB buffer is allocated or mapped. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07mptcp: only allow set existing scheduler for net.mptcp.schedulerGregory Detal
The current behavior is to accept any strings as inputs, this results in an inconsistent result where an unexisting scheduler can be set: # sysctl -w net.mptcp.scheduler=notdefault net.mptcp.scheduler = notdefault This patch changes this behavior by checking for existing scheduler before accepting the input. Fixes: e3b2870b6d22 ("mptcp: add a new sysctl scheduler") Cc: stable@vger.kernel.org Signed-off-by: Gregory Detal <gregory.detal@gmail.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240506-upstream-net-20240506-mptcp-sched-exist-v1-1-2ed1529e521e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07nfc: nci: Fix kcov check in nci_rx_work()Tetsuo Handa
Commit 7e8cdc97148c ("nfc: Add KCOV annotations") added kcov_remote_start_common()/kcov_remote_stop() pair into nci_rx_work(), with an assumption that kcov_remote_stop() is called upon continue of the for loop. But commit d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") forgot to call kcov_remote_stop() before break of the for loop. Reported-by: syzbot <syzbot+0438378d6f157baae1a2@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=0438378d6f157baae1a2 Fixes: d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/6d10f829-5a0c-405a-b39a-d7266f3a1a0b@I-love.SAKURA.ne.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07mptcp: fix possible NULL dereferencesEric Dumazet
subflow_add_reset_reason(skb, ...) can fail. We can not assume mptcp_get_ext(skb) always return a non NULL pointer. syzbot reported: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 0 PID: 5098 Comm: syz-executor132 Not tainted 6.9.0-rc6-syzkaller-01478-gcdc74c9d06e7 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:subflow_v6_route_req+0x2c7/0x490 net/mptcp/subflow.c:388 Code: 8d 7b 07 48 89 f8 48 c1 e8 03 42 0f b6 04 20 84 c0 0f 85 c0 01 00 00 0f b6 43 07 48 8d 1c c3 48 83 c3 18 48 89 d8 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 84 01 00 00 0f b6 5b 01 83 e3 0f 48 89 RSP: 0018:ffffc9000362eb68 EFLAGS: 00010206 RAX: 0000000000000003 RBX: 0000000000000018 RCX: ffff888022039e00 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88807d961140 R08: ffffffff8b6cb76b R09: 1ffff1100fb2c230 R10: dffffc0000000000 R11: ffffed100fb2c231 R12: dffffc0000000000 R13: ffff888022bfe273 R14: ffff88802cf9cc80 R15: ffff88802ad5a700 FS: 0000555587ad2380(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f420c3f9720 CR3: 0000000022bfc000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tcp_conn_request+0xf07/0x32c0 net/ipv4/tcp_input.c:7180 tcp_rcv_state_process+0x183c/0x4500 net/ipv4/tcp_input.c:6663 tcp_v6_do_rcv+0x8b2/0x1310 net/ipv6/tcp_ipv6.c:1673 tcp_v6_rcv+0x22b4/0x30b0 net/ipv6/tcp_ipv6.c:1910 ip6_protocol_deliver_rcu+0xc76/0x1570 net/ipv6/ip6_input.c:438 ip6_input_finish+0x186/0x2d0 net/ipv6/ip6_input.c:483 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 __netif_receive_skb_one_core net/core/dev.c:5625 [inline] __netif_receive_skb+0x1ea/0x650 net/core/dev.c:5739 netif_receive_skb_internal net/core/dev.c:5825 [inline] netif_receive_skb+0x1e8/0x890 net/core/dev.c:5885 tun_rx_batched+0x1b7/0x8f0 drivers/net/tun.c:1549 tun_get_user+0x2f35/0x4560 drivers/net/tun.c:2002 tun_chr_write_iter+0x113/0x1f0 drivers/net/tun.c:2048 call_write_iter include/linux/fs.h:2110 [inline] new_sync_write fs/read_write.c:497 [inline] vfs_write+0xa84/0xcb0 fs/read_write.c:590 ksys_write+0x1a0/0x2c0 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 3e140491dd80 ("mptcp: support rstreason for passive reset") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240506123032.3351895-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07net: annotate writes on dev->mtu from ndo_change_mtu()Eric Dumazet
Simon reported that ndo_change_mtu() methods were never updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted in commit 501a90c94510 ("inet: protect against too small mtu values.") We read dev->mtu without holding RTNL in many places, with READ_ONCE() annotations. It is time to take care of ndo_change_mtu() methods to use corresponding WRITE_ONCE() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/ Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07net: dccp: Fix ccid2_rtt_estimator() kernel-docJeff Johnson
make C=1 reports: warning: Function parameter or struct member 'mrtt' not described in 'ccid2_rtt_estimator' So document the 'mrtt' parameter. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240505-ccid2_rtt_estimator-kdoc-v1-1-09231fcb9145@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07page_pool: don't use driver-set flags field directlyAlexander Lobakin
page_pool::p is driver-defined params, copied directly from the structure passed to page_pool_create(). The structure isn't meant to be modified by the Page Pool core code and this even might look confusing[0][1]. In order to be able to alter some flags, let's define our own, internal fields the same way as the already existing one (::has_init_callback). They are defined as bits in the driver-set params, leave them so here as well, to not waste byte-per-bit or so. Almost 30 bits are still free for future extensions. We could've defined only new flags here or only the ones we may need to alter, but checking some flags in one place while others in another doesn't sound convenient or intuitive. ::flags passed by the driver can now go to the "slow" PP params. Suggested-by: Jakub Kicinski <kuba@kernel.org> Link[0]: https://lore.kernel.org/netdev/20230703133207.4f0c54ce@kernel.org Suggested-by: Alexander Duyck <alexanderduyck@fb.com> Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07page_pool: make sure frag API fields don't span between cachelinesAlexander Lobakin
After commit 5027ec19f104 ("net: page_pool: split the page_pool_params into fast and slow") that made &page_pool contain only "hot" params at the start, cacheline boundary chops frag API fields group in the middle again. To not bother with this each time fast params get expanded or shrunk, let's just align them to `4 * sizeof(long)`, the closest upper pow-2 to their actual size (2 longs + 1 int). This ensures 16-byte alignment for the 32-bit architectures and 32-byte alignment for the 64-bit ones, excluding unnecessary false-sharing. ::page_state_hold_cnt is used quite intensively on hotpath no matter if frag API is used, so move it to the newly created hole in the first cacheline. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-05-07rtnetlink: allow rtnl_fill_link_netnsid() to run under RCU protectionEric Dumazet
We want to be able to run rtnl_fill_ifinfo() under RCU protection instead of RTNL in the future. All rtnl_link_ops->get_link_net() methods already using dev_net() are ready. I added READ_ONCE() annotations on others. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL in rtnl_xdp_prog_skb()Eric Dumazet
dev->xdp_prog is protected by RCU, we can lift RTNL requirement from rtnl_xdp_prog_skb(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL in rtnl_fill_proto_down()Eric Dumazet
Change dev_change_proto_down() and dev_change_proto_down_reason() to write once on dev->proto_down and dev->proto_down_reason. Then rtnl_fill_proto_down() can use READ_ONCE() annotations and run locklessly. rtnl_proto_down_size() should assume worst case, because readng dev->proto_down_reason multiple times would be racy without RTNL in the future. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL for many attributesEric Dumazet
Following device fields can be read locklessly in rtnl_fill_ifinfo() : type, ifindex, operstate, link_mode, mtu, min_mtu, max_mtu, group, promiscuity, allmulti, num_tx_queues, gso_max_segs, gso_max_size, gro_max_size, gso_ipv4_max_size, gro_ipv4_max_size, tso_max_size, tso_max_segs, num_rx_queues. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07net: write once on dev->allmulti and dev->promiscuityEric Dumazet
In the following patch we want to read dev->allmulti and dev->promiscuity locklessly from rtnl_fill_ifinfo() In this patch I change __dev_set_promiscuity() and __dev_set_allmulti() to write these fields (and dev->flags) only if they succeed, with WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL for IFLA_TXQLEN outputEric Dumazet
rtnl_fill_ifinfo() can read dev->tx_queue_len locklessly, granted we add corresponding READ_ONCE()/WRITE_ONCE() annotations. Add missing READ_ONCE(dev->tx_queue_len) in teql_enqueue() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL for IFLA_IFNAME outputEric Dumazet
We can use netdev_copy_name() to no longer rely on RTNL to fetch dev->name. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-07rtnetlink: do not depend on RTNL for IFLA_QDISC outputEric Dumazet
dev->qdisc can be read using RCU protection. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-06Merge tag 'ipsec-next-2024-05-03' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2024-05-03 1) Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support. This was defined by an early version of an IETF draft that did not make it to a standard. 2) Introduce direction attribute for xfrm states. xfrm states have a direction, a stsate can be used either for input or output packet processing. Add a direction to xfrm states to make it clear for what a xfrm state is used. * tag 'ipsec-next-2024-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: Restrict SA direction attribute to specific netlink message types xfrm: Add dir validation to "in" data path lookup xfrm: Add dir validation to "out" data path lookup xfrm: Add Direction to the SA in or out udpencap: Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support ==================== Link: https://lore.kernel.org/r/20240503082732.2835810-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06mptcp: fix typos in commentsShi-Sheng Yang
This patch fixes the spelling mistakes in comments. The changes were generated using codespell and reviewed manually. eariler -> earlier greceful -> graceful Signed-off-by: Shi-Sheng Yang <fourcolor4c@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240502154740.249839-1-fourcolor4c@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06phonet: fix rtm_phonet_notify() skb allocationEric Dumazet
fill_route() stores three components in the skb: - struct rtmsg - RTA_DST (u8) - RTA_OIF (u32) Therefore, rtm_phonet_notify() should use NLMSG_ALIGN(sizeof(struct rtmsg)) + nla_total_size(1) + nla_total_size(4) Fixes: f062f41d0657 ("Phonet: routing table Netlink interface") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Rémi Denis-Courmont <courmisch@gmail.com> Link: https://lore.kernel.org/r/20240502161700.1804476-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-06Merge wireless into wireless-nextJohannes Berg
Given how late we are in the cycle, merge the two fixes from wireless into wireless-next as they don't see that urgent. This way, the wireless tree won't need rebasing later. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-06netfilter: nft_set_pipapo: merge deactivate helper into callerFlorian Westphal
Its the only remaining call site so there is no need for this to be separated anymore. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: nft_set_pipapo: prepare walk function for on-demand cloneFlorian Westphal
The existing code uses iter->type to figure out what data is needed, the live copy (READ) or clone (UPDATE). Without pending updates, priv->clone and priv->match will point to different memory locations, but they have identical content. Future patch will make priv->clone == NULL if there are no pending changes, in this case we must copy the live data for the UPDATE case. Currently this would require GFP_ATOMIC allocation. Split the walk function in two parts: one that does the walk and one that decides which data is needed. In the UPDATE case, callers hold the transaction mutex so we do not need the rcu read lock. This allows to use GFP_KERNEL allocation while cloning. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-06netfilter: nft_set_pipapo: prepare destroy function for on-demand cloneFlorian Westphal
Once priv->clone can be NULL in case no insertions/removals occurred in the last transaction we need to drop set elements from priv->match if priv->clone is NULL. While at it, condense this function by reusing the pipapo_free_match helper instead of open-coded version. The rcu_barrier() is removed, its not needed: old call_rcu instances for pipapo_reclaim_match do not access struct nft_set. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>