summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-03-10mptcp: pm: move Netlink PM helpers to pm_netlink.cMatthieu Baerts (NGI0)
Before this patch, the PM code was dispersed in different places: - pm.c had common code for all PMs, but also Netlink specific code that will not be needed with the future BPF path-managers. - pm_netlink.c had common Netlink code. To clarify the code, a reorganisation is suggested here, only by moving code around, and small helper renaming to avoid confusions: - pm_netlink.c now only contains common PM Netlink code: - PM events: this code was already there - shared helpers around Netlink code that were already there as well - shared Netlink commands code from pm.c - pm.c now no longer contain Netlink specific code. - protocol.h has been updated accordingly: - mptcp_nl_fill_addr() no longer need to be exported. The code around the PM is now less confusing, which should help for the maintenance in the long term. This will certainly impact future backports, but because other cleanups have already done recently, and more are coming to ease the addition of a new path-manager controlled with BPF (struct_ops), doing that now seems to be a good time. Also, many issues around the PM have been fixed a few months ago while increasing the code coverage in the selftests, so such big reorganisation can be done with more confidence now. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-15-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: split in-kernel PM specific codeMatthieu Baerts (NGI0)
Before this patch, the PM code was dispersed in different places: - pm.c had common code for all PMs - pm_netlink.c was supposed to be about the in-kernel PM, but also had exported common Netlink helpers, NL events for PM userspace daemons, etc. quite confusing. To clarify the code, a reorganisation is suggested here, only by moving code around to avoid confusions: - pm_netlink.c now only contains common PM Netlink code: - PM events: this code was already there - shared helpers around Netlink code that were already there as well - more shared Netlink commands code from pm.c will come after - pm_kernel.c now contains only code that is specific to the in-kernel PM. Now all functions are either called from: - pm.c: events coming from the core, when this PM is being used - pm_netlink.c: for shared Netlink commands - mptcp_pm_gen.c: for Netlink commands specific to the in-kernel PM - sockopt.c: for the exported counters per netns - (while at it, a useless 'return;' spot by checkpatch at the end of mptcp_pm_nl_set_flags_all, has been removed) The code around the PM is now less confusing, which should help for the maintenance in the long term. This will certainly impact future backports, but because other cleanups have already done recently, and more are coming to ease the addition of a new path-manager controlled with BPF (struct_ops), doing that now seems to be a good time. Also, many issues around the PM have been fixed a few months ago while increasing the code coverage in the selftests, so such big reorganisation can be done with more confidence now. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-14-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: move generic PM helpers to pm.cMatthieu Baerts (NGI0)
Before this patch, the PM code was dispersed in different places: - pm.c had common code for all PMs - pm_netlink.c was supposed to be about the in-kernel PM, but also had exported common helpers, callbacks used by the different PMs, NL events for PM userspace daemon, etc. quite confusing. - pm_userspace.c had userspace PM only code, but using specific in-kernel PM helpers To clarify the code, a reorganisation is suggested here, only by moving code around, and (un)exporting functions: - helpers used from both PMs and not linked to Netlink - callbacks used by different PMs, e.g. ADD_ADDR management - some helpers have been marked as 'static' - protocol.h has been updated accordingly - (while at it, a needless if before a kfree(), spot by checkpatch in mptcp_remove_anno_list_by_saddr(), has been removed) The code around the PM is now less confusing, which should help for the maintenance in the long term. This will certainly impact future backports, but because other cleanups have already done recently, and more are coming to ease the addition of a new path-manager controlled with BPF (struct_ops), doing that now seems to be a good time. Also, many issues around the PM have been fixed a few months ago while increasing the code coverage in the selftests, so such big reorganisation can be done with more confidence now. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-13-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: move generic helper at the topMatthieu Baerts (NGI0)
In prevision to another change importing all generic PM helpers from pm_netlink.c to there. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-12-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: export mptcp_remote_addressMatthieu Baerts (NGI0)
In a following commit, the 'remote_address' helper will need to be used from different files. It is then exported, and prefixed with 'mptcp_', similar to 'mptcp_local_address'. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-11-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: worker: split in-kernel and common tasksMatthieu Baerts (NGI0)
To make it clear what actions are in-kernel PM specific and which ones are not and done for all PMs, e.g. sending ADD_ADDR and close associated subflows when a RM_ADDR is received. The behavioural is changed a bit: MPTCP_PM_ADD_ADDR_RECEIVED is now treated after MPTCP_PM_ADD_ADDR_SEND_ACK and MPTCP_PM_RM_ADDR_RECEIVED, but that should not change anything in practice. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-10-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: avoid calling PM specific code from coreMatthieu Baerts (NGI0)
When destroying an MPTCP socket, some userspace PM specific code was called from mptcp_destroy_common() in protocol.c. That feels wrong, and it is the only case. Instead, the core now calls mptcp_pm_destroy() from pm.c which is now in charge of cleaning the announced addresses list, and ask the different PMs to do extra cleaning if needed, e.g. the userspace PM, if used, will clean the local addresses list. While at it, the userspace PM specific helper has been prefixed with 'mptcp_userspace_pm_' like the other ones. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-9-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: kernel: add '_pm' to mptcp_nl_set_flagsMatthieu Baerts (NGI0)
Currently, in-kernel PM specific helpers are prefixed with 'mptcp_pm_nl_'. Here, '_pm' was missing from 'mptcp_nl_set_flags'. Add '_pm' to be similar to others, and add '_all' to avoid confusions witih the global 'mptcp_pm_nl_set_flags'. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-8-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: remove '_nl' from mptcp_pm_nl_is_init_remote_addrMatthieu Baerts (NGI0)
Currently, in-kernel PM specific helpers are prefixed with 'mptcp_pm_nl_'. But here 'mptcp_pm_nl_is_init_remote_addr' is not specific to this PM: it is called from pm.c for both the in-kernel and userspace PMs. To avoid confusions, the '_nl' bit has been removed from the name. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-7-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: remove '_nl' from mptcp_pm_nl_subflow_chk_stale()Matthieu Baerts (NGI0)
Currently, in-kernel PM specific helpers are prefixed with 'mptcp_pm_nl_'. But here 'mptcp_pm_nl_subflow_chk_stale' is not specific to this PM: it is called from pm.c for both the in-kernel and userspace PMs. To avoid confusions, the '_nl' bit has been removed from the name. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-6-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: remove '_nl' from mptcp_pm_nl_rm_addr_receivedMatthieu Baerts (NGI0)
Currently, in-kernel PM specific helpers are prefixed with 'mptcp_pm_nl_'. But here 'mptcp_pm_nl_rm_addr_received' is not specific to this PM: it is called from the PM worker, and used by both the in-kernel and userspace PMs. The helper has been renamed to 'mptcp_pm_rm_addr_recv' instead of '_received' to avoid confusions with the one from pm.c. mptcp_pm_nl_rm_addr_or_subflow', and 'mptcp_pm_nl_rm_subflow_received' have been updated too for the same reason. To avoid confusions, the '_nl' bit has been removed from the name. While at it, the in-kernel PM specific code has been move from mptcp_pm_rm_addr_or_subflow to a new dedicated helper, clearer. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-5-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: remove '_nl' from mptcp_pm_nl_workMatthieu Baerts (NGI0)
Currently, in-kernel PM specific helpers are prefixed with 'mptcp_pm_nl_'. But here 'mptcp_pm_nl_work' is not specific to this PM: it is called from the core to call helpers, some of them needed by both the in-kernel and userspace PMs. To avoid confusions, the '_nl' bit has been removed from the name. Also used 'worker' instead of 'work', similar to protocol.c's worker. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-4-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: remove '_nl' from mptcp_pm_nl_mp_prio_send_ackMatthieu Baerts (NGI0)
Currently, in-kernel PM specific helpers are prefixed with 'mptcp_pm_nl_'. But here 'mptcp_pm_nl_mp_prio_send_ack()' is not specific to this PM: it is used by both the in-kernel and userspace PMs. To avoid confusions, the '_nl' bit has been removed from the name. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-3-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: remove '_nl' from mptcp_pm_nl_addr_send_ackMatthieu Baerts (NGI0)
Currently, in-kernel PM specific helpers are prefixed with 'mptcp_pm_nl_'. But here 'mptcp_pm_nl_addr_send_ack()' is not specific to this PM: it is used by both the in-kernel and userspace PMs. To avoid confusions, the '_nl' bit has been removed from the name. No behavioural changes intended. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-2-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10mptcp: pm: use addr entry for get_local_idGeliang Tang
The following code in mptcp_userspace_pm_get_local_id() that assigns "skc" to "new_entry" is not allowed in BPF if we use the same code to implement the get_local_id() interface of a BFP path manager: memset(&new_entry, 0, sizeof(struct mptcp_pm_addr_entry)); new_entry.addr = *skc; new_entry.addr.id = 0; new_entry.flags = MPTCP_PM_ADDR_FLAG_IMPLICIT; To solve the issue, this patch moves this assignment to "new_entry" forward to mptcp_pm_get_local_id(), and then passing "new_entry" as a parameter to both mptcp_pm_nl_get_local_id() and mptcp_userspace_pm_get_local_id(). No behavioural changes intended. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250307-net-next-mptcp-pm-reorg-v1-1-abef20ada03b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10net: devmem: do not WARN conditionally after netdev_rx_queue_restart()Taehee Yoo
When devmem socket is closed, netdev_rx_queue_restart() is called to reset queue by the net_devmem_unbind_dmabuf(). But callback may return -ENETDOWN if the interface is down because queues are already freed when the interface is down so queue reset is not needed. So, it should not warn if the return value is -ENETDOWN. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250309134219.91670-8-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10net: ethtool: tsinfo: Fix dump commandKory Maincent
Fix missing initialization of ts_info->phc_index in the dump command, which could cause a netdev interface to incorrectly display a PTP provider at index 0 instead of "none". Fix it by initializing the phc_index to -1. In the same time, restore missing initialization of ts_info.cmd for the IOCTL case, as it was before the transition from ethnl_default_dumpit to custom ethnl_tsinfo_dumpit. Also, remove unnecessary zeroing of ts_info, as it is embedded within reply_data, which is fully zeroed two lines earlier. Fixes: b9e3f7dc9ed95 ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20250307091255.463559-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10ipv6: save dontfrag in corkWillem de Bruijn
When spanning datagram construction over multiple send calls using MSG_MORE, per datagram settings are configured on the first send. That is when ip(6)_setup_cork stores these settings for subsequent use in __ip(6)_append_data and others. The only flag that escaped this was dontfrag. As a result, a datagram could be constructed with df=0 on the first sendmsg, but df=1 on a next. Which is what cmsg_ip.sh does in an upcoming MSG_MORE test in the "diff" scenario. Changing datagram conditions in the middle of constructing an skb makes this already complex code path even more convoluted. It is here unintentional. Bring this flag in line with expected sockopt/cmsg behavior. And stop passing ipc6 to __ip6_append_data, to avoid such issues in the future. This is already the case for __ip_append_data. inet6_cork had a 6 byte hole, so the 1B flag has no impact. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250307033620.411611-3-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10ipv6: remove leftover ip6 cookie initializerWillem de Bruijn
As of the blamed commit ipc6.dontfrag is always initialized at the start of udpv6_sendmsg, by ipcm6_init_sk, to either 0 or 1. Later checks against -1 are no longer needed and the branches are now dead code. The blamed commit had removed those branches. But I had overlooked this one case. UDP has both a lockless fast path and a slower path for corked requests. This branch remained in the fast path. Fixes: 096208592b09 ("ipv6: replace ipcm6_init calls with ipcm6_init_sk") Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250307033620.411611-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-10svcrdma: do not unregister device for listenersOlga Kornievskaia
On an rdma-capable machine, a start/stop/start and then on a stop of a knfsd server would lead kref underflow warning because svc_rdma_free would indiscriminately unregister the rdma device but a listening transport never calls the rdma_rn_register() thus leading to kref going down to 0 on the 1st stop of the server and on the 2nd stop it leads to a problem. Suggested-by: Chuck Lever <chuck.lever@oracle.com> Fixes: c4de97f7c454 ("svcrdma: Handle device removal outside of the CM event handler") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10SUNRPC: Remove unused make_checksumDr. David Alan Gilbert
Commit ec596aaf9b48 ("SUNRPC: Remove code behind CONFIG_RPCSEC_GSS_KRB5_SIMPLIFIED") was the last user of the make_checksum() function. Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10SUNRPC: Remove unused krb5_decryptDr. David Alan Gilbert
The last use of krb5_decrypt() was removed in 2023 by commit 2a9893f796a3 ("SUNRPC: Remove net/sunrpc/auth_gss/gss_krb5_seqnum.c") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10sunrpc: clean cache_detail immediately when flush is written frequentlyLi Lingfeng
We will write /proc/net/rpc/xxx/flush if we want to clean cache_detail. This updates nextcheck to the current time and calls cache_flush --> cache_clean to clean cache_detail. If we write this interface again within one second, it will only increase flush_time and nextcheck without actually cleaning cache_detail. Therefore, if we keep writing this interface repeatedly within one second, flush_time and nextcheck will keep increasing, even far exceeding the current time, making it impossible to clear cache_detail through the flush interface or cache_cleaner. If someone frequently calls the flush interface, we should immediately clean the corresponding cache_detail instead of continuously accumulating nextcheck. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10afs: Use the per-peer app data provided by rxrpcDavid Howells
Make use of the per-peer application data that rxrpc now allows the application to store on the rxrpc_peer struct to hold a back pointer to the afs_server record that peer represents an endpoint for. Then, when a call comes in to the AFS cache manager, this can be used to map it to the correct server record rather than having to use a UUID-to-server mapping table and having to do an additional lookup. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-14-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-10-dhowells@redhat.com/ # v4
2025-03-10rxrpc: Allow the app to store private data on peer structsDavid Howells
Provide a way for the application (e.g. the afs filesystem) to store private data on the rxrpc_peer structs for later retrieval via the call object. This will allow afs to store a pointer to the afs_server object on the rxrpc_peer struct, thereby obviating the need for afs to keep lookup tables by which it can associate an incoming call with server that transmitted it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Jakub Kicinski <kuba@kernel.org> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Paolo Abeni <pabeni@redhat.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org cc: netdev@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-13-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-9-dhowells@redhat.com/ # v4
2025-03-08net: move misc netdev_lock flavors to a separate headerJakub Kicinski
Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-08udp: expand SKB_DROP_REASON_UDP_CSUM useEric Dumazet
SKB_DROP_REASON_UDP_CSUM can be used in four locations when dropping a packet because of a wrong UDP checksum. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250307102002.2095238-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07netpoll: hold rcu read lock in __netpoll_send_skb()Breno Leitao
The function __netpoll_send_skb() is being invoked without holding the RCU read lock. This oversight triggers a warning message when CONFIG_PROVE_RCU_LIST is enabled: net/core/netpoll.c:330 suspicious rcu_dereference_check() usage! netpoll_send_skb netpoll_send_udp write_ext_msg console_flush_all console_unlock vprintk_emit To prevent npinfo from disappearing unexpectedly, ensure that __netpoll_send_skb() is protected with the RCU read lock. Fixes: 2899656b494dcd1 ("netpoll: take rcu_read_lock_bh() in netpoll_send_skb_on_dev()") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250306-netpoll_rcu_v2-v2-1-bc4f5c51742a@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07netpoll: Optimize skb refilling on critical pathBreno Leitao
netpoll tries to refill the skb queue on every packet send, independently if packets are being consumed from the pool or not. This was particularly problematic while being called from printk(), where the operation would be done while holding the console lock. Introduce a more intelligent approach to skb queue management. Instead of constantly attempting to refill the queue, the system now defers refilling to a work queue and only triggers the workqueue when a buffer is actually dequeued. This change significantly reduces operations with the lock held. Add a work_struct to the netpoll structure for asynchronous refilling, updating find_skb() to schedule refill work only when necessary (skb is dequeued). These changes have demonstrated a 15% reduction in time spent during netpoll_send_msg operations, especially when no SKBs are not consumed from consumed from pool. When SKBs are being dequeued, the improvement is even better, around 70%, mainly because refilling the SKB pool is now happening outside of the critical patch (with console_owner lock held). Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250304-netpoll_refill_v2-v1-1-06e2916a4642@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07tcp: ulp: diag: more info without CAP_NET_ADMINMatthieu Baerts (NGI0)
When introduced in commit 61723b393292 ("tcp: ulp: add functions to dump ulp-specific information"), the whole ULP diag info has been exported only if the requester had CAP_NET_ADMIN. It looks like not everything is sensitive, and some info can be exported to all users in order to ease the debugging from the userspace side without requiring additional capabilities. Each layer should then decide what can be exposed to everybody. The 'net_admin' boolean is then passed to the different layers. On kTLS side, it looks like there is nothing sensitive there: version, cipher type, tx/rx user config type, plus some flags. So, only some metadata about the configuration, no cryptographic info like keys, etc. Then, everything can be exported to all users. On MPTCP side, that's different. The MPTCP-related sequence numbers per subflow should certainly not be exposed to everybody. For example, the DSS mapping and ssn_offset would give all users on the system access to narrow ranges of values for the subflow TCP sequence numbers and MPTCP-level DSNs, and then ease packet injection. The TCP diag interface doesn't expose the TCP sequence numbers for TCP sockets, so best to do the same here. The rest -- token, IDs, flags -- can be exported to everybody. Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250306-net-next-tcp-ulp-diag-net-admin-v1-2-06afdd860fc9@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07tcp: ulp: diag: always print the name if anyMatthieu Baerts (NGI0)
Since its introduction in commit 61723b393292 ("tcp: ulp: add functions to dump ulp-specific information"), the ULP diag info have been exported only if the requester had CAP_NET_ADMIN. At least the ULP name can be exported without CAP_NET_ADMIN. This will already help identifying which layer is being used, e.g. which TCP connections are in fact MPTCP subflow. Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250306-net-next-tcp-ulp-diag-net-admin-v1-1-06afdd860fc9@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07netmem: prevent TX of unreadable skbsMina Almasry
Currently on stable trees we have support for netmem/devmem RX but not TX. It is not safe to forward/redirect an RX unreadable netmem packet into the device's TX path, as the device may call dma-mapping APIs on dma addrs that should not be passed to it. Fix this by preventing the xmit of unreadable skbs. Tested by configuring tc redirect: sudo tc qdisc add dev eth1 ingress sudo tc filter add dev eth1 ingress protocol ip prio 1 flower ip_proto \ tcp src_ip 192.168.1.12 action mirred egress redirect dev eth1 Before, I see unreadable skbs in the driver's TX path passed to dma mapping APIs. After, I don't see unreadable skbs in the driver's TX path passed to dma mapping APIs. Fixes: 65249feb6b3d ("net: add support for skbs with unreadable frags") Suggested-by: Jakub Kicinski <kuba@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250306215520.1415465-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07net: ethtool: use correct device pointer in ethnl_default_dump_one()Eric Dumazet
ethnl_default_dump_one() operates on the device provided in its @dev parameter, not from ctx->req_info->dev. syzbot reported: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000197: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000cb8-0x0000000000000cbf] RIP: 0010:netdev_need_ops_lock include/linux/netdevice.h:2792 [inline] RIP: 0010:netdev_lock_ops include/linux/netdevice.h:2803 [inline] RIP: 0010:ethnl_default_dump_one net/ethtool/netlink.c:557 [inline] RIP: 0010:ethnl_default_dumpit+0x447/0xd40 net/ethtool/netlink.c:593 Call Trace: <TASK> genl_dumpit+0x10d/0x1b0 net/netlink/genetlink.c:1027 netlink_dump+0x64d/0xe10 net/netlink/af_netlink.c:2309 __netlink_dump_start+0x5a2/0x790 net/netlink/af_netlink.c:2424 genl_family_rcv_msg_dumpit net/netlink/genetlink.c:1076 [inline] genl_family_rcv_msg net/netlink/genetlink.c:1192 [inline] genl_rcv_msg+0x894/0xec0 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x206/0x480 net/netlink/af_netlink.c:2534 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:709 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:724 ____sys_sendmsg+0x53a/0x860 net/socket.c:2564 ___sys_sendmsg net/socket.c:2618 [inline] __sys_sendmsg+0x269/0x350 net/socket.c:2650 Fixes: 2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock") Reported-by: syzbot+3da2442641f0c6a705a2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/67caaf5e.050a0220.15b4b9.007a.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307083544.1659135-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07Revert "Bluetooth: hci_core: Fix sleeping function called from invalid context"Luiz Augusto von Dentz
This reverts commit 4d94f05558271654670d18c26c912da0c1c15549 which has problems (see [1]) and is no longer needed since 581dd2dc168f ("Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating") has reworked the code where the original bug has been found. [1] Link: https://lore.kernel.org/linux-bluetooth/877c55ci1r.wl-tiwai@suse.de/T/#t Fixes: 4d94f0555827 ("Bluetooth: hci_core: Fix sleeping function called from invalid context") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-07Bluetooth: hci_event: Fix enabling passive scanningLuiz Augusto von Dentz
Passive scanning shall only be enabled when disconnecting LE links, otherwise it may start result in triggering scanning when e.g. an ISO link disconnects: > HCI Event: LE Meta Event (0x3e) plen 29 LE Connected Isochronous Stream Established (0x19) Status: Success (0x00) Connection Handle: 257 CIG Synchronization Delay: 0 us (0x000000) CIS Synchronization Delay: 0 us (0x000000) Central to Peripheral Latency: 10000 us (0x002710) Peripheral to Central Latency: 10000 us (0x002710) Central to Peripheral PHY: LE 2M (0x02) Peripheral to Central PHY: LE 2M (0x02) Number of Subevents: 1 Central to Peripheral Burst Number: 1 Peripheral to Central Burst Number: 1 Central to Peripheral Flush Timeout: 2 Peripheral to Central Flush Timeout: 2 Central to Peripheral MTU: 320 Peripheral to Central MTU: 160 ISO Interval: 10.00 msec (0x0008) ... > HCI Event: Disconnect Complete (0x05) plen 4 Status: Success (0x00) Handle: 257 Reason: Remote User Terminated Connection (0x13) < HCI Command: LE Set Extended Scan Enable (0x08|0x0042) plen 6 Extended scan: Enabled (0x01) Filter duplicates: Enabled (0x01) Duration: 0 msec (0x0000) Period: 0.00 sec (0x0000) Fixes: 9fcb18ef3acb ("Bluetooth: Introduce LE auto connect options") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-07Bluetooth: SCO: fix sco_conn refcounting on sco_conn_readyPauli Virtanen
sco_conn refcount shall not be incremented a second time if the sk already owns the refcount, so hold only when adding new chan. Add sco_conn_hold() for clarity, as refcnt is never zero here due to the sco_conn_add(). Fixes SCO socket shutdown not actually closing the SCO connection. Fixes: ed9588554943 ("Bluetooth: SCO: remove the redundant sco_conn_put") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-07wifi: cfg80211: cancel wiphy_work before freeing wiphyMiri Korenblit
A wiphy_work can be queued from the moment the wiphy is allocated and initialized (i.e. wiphy_new_nm). When a wiphy_work is queued, the rdev::wiphy_work is getting queued. If wiphy_free is called before the rdev::wiphy_work had a chance to run, the wiphy memory will be freed, and then when it eventally gets to run it'll use invalid memory. Fix this by canceling the work before freeing the wiphy. Fixes: a3ee4dc84c4e ("wifi: cfg80211: add a work abstraction with special semantics") Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20250306123626.efd1d19f6e07.I48229f96f4067ef73f5b87302335e2fd750136c9@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-07wifi: mac80211: fix SA Query processing in MLOJohannes Berg
When MLO is used and SA Query processing isn't done by userspace (e.g. wpa_supplicant w/o CONFIG_OCV), then the mac80211 code kicks in but uses the wrong addresses. Fix them. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250306123626.bab48bb49061.I9391b22f1360d20ac8c4e92604de23f27696ba8f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-07wifi: nl80211: fix assoc link handlingJohannes Berg
The refactoring of the assoc link handling in order to support multi-link reconfiguration broke the setting of the assoc link ID, and thus resulted in the wrong BSS "use_for" value being selected. Fix that for both association and ML reconfiguration. Fixes: 720fa448f5a7 ("wifi: nl80211: Split the links handling of an association request") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250306123626.7b233d769c32.I62fd04a8667dd55cedb9a1c0414cc92dd098da75@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-07wifi: mac80211: don't queue sdata::work for a non-running sdataMiri Korenblit
The worker really shouldn't be queued for a non-running interface. Also, if ieee80211_setup_sdata is called between queueing and executing the wk, it will be initialized, which will corrupt wiphy_work_list. Fixes: f8891461a277 ("mac80211: do not start any work during reconfigure flow") Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20250306123626.1e02caf82640.I4949e71ed56e7186ed4968fa9ddff477473fa2f4@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-07wifi: mac80211: flush the station before moving it to UN-AUTHORIZED stateEmmanuel Grumbach
We first want to flush the station to make sure we no longer have any frames being Tx by the station before the station is moved to un-authorized state. Failing to do that will lead to races: a frame may be sent after the station's state has been changed. Since the API clearly states that the driver can't fail the sta_state() transition down the list of state, we can easily flush the station first, and only then call the driver's sta_state(). Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250306123626.450bc40e8b04.I636ba96843c77f13309c15c9fd6eb0c5a52a7976@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-03-06Merge tag 'nf-25-03-06' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix racy non-atomic read-then-increment operation with PREEMPT_RT in nft_ct, from Sebastian Andrzej Siewior. 2) GC is not skipped when jiffies wrap around in nf_conncount, from Nicklas Bo Jensen. 3) flush_work() on nf_tables_destroy_work waits for the last queued instance, this could be an instance that is different from the one that we must wait for, then make destruction work queue. * tag 'nf-25-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: make destruction work queue pernet netfilter: nf_conncount: garbage collection is not skipped when jiffies wrap around netfilter: nft_ct: Use __refcount_inc() for per-CPU nft_ct_pcpu_template. ==================== Link: https://patch.msgid.link/20250306153446.46712-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06sched: address a potential NULL pointer dereference in the GRED scheduler.Jun Yang
If kzalloc in gred_init returns a NULL pointer, the code follows the error handling path, invoking gred_destroy. This, in turn, calls gred_offload, where memset could receive a NULL pointer as input, potentially leading to a kernel crash. When table->opt is NULL in gred_init(), gred_change_table_def() is not called yet, so it is not necessary to call ->ndo_setup_tc() in gred_offload(). Signed-off-by: Jun Yang <juny24602@gmail.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Fixes: f25c0515c521 ("net: sched: gred: dynamically allocate tc_gred_qopt_offload") Link: https://patch.msgid.link/20250305154410.3505642-1-juny24602@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06tcp: clamp window like before the cleanupMatthieu Baerts (NGI0)
A recent cleanup changed the behaviour of tcp_set_window_clamp(). This looks unintentional, and affects MPTCP selftests, e.g. some tests re-establishing a connection after a disconnect are now unstable. Before the cleanup, this operation was done: new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp); tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh); The cleanup used the 'clamp' macro which takes 3 arguments -- value, lowest, and highest -- and returns a value between the lowest and the highest allowable values. This then assumes ... lowest (rcv_ssthresh) <= highest (rcv_wnd) ... which doesn't seem to be always the case here according to the MPTCP selftests, even when running them without MPTCP, but only TCP. For example, when we have ... rcv_wnd < rcv_ssthresh < new_rcv_ssthresh ... before the cleanup, the rcv_ssthresh was not changed, while after the cleanup, it is lowered down to rcv_wnd (highest). During a simple test with TCP, here are the values I observed: new_window_clamp (val) rcv_ssthresh (lo) rcv_wnd (hi) 117760 (out) 65495 < 65536 128512 (out) 109595 > 80256 => lo > hi 1184975 (out) 328987 < 329088 113664 (out) 65483 < 65536 117760 (out) 110968 < 110976 129024 (out) 116527 > 109696 => lo > hi Here, we can see that it is not that rare to have rcv_ssthresh (lo) higher than rcv_wnd (hi), so having a different behaviour when the clamp() macro is used, even without MPTCP. Note: new_window_clamp is always out of range (rcv_ssthresh < rcv_wnd) here, which seems to be generally the case in my tests with small connections. I then suggests reverting this part, not to change the behaviour. Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551 Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250305-net-next-fix-tcp-win-clamp-v1-1-12afb705d34e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06inet: call inet6_ehashfn() once from inet6_hash_connect()Eric Dumazet
inet6_ehashfn() being called from __inet6_check_established() has a big impact on performance, as shown in the Tested section. After prior patch, we can compute the hash for port 0 from inet6_hash_connect(), and derive each hash in __inet_hash_connect() from this initial hash: hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport Apply the same principle for __inet_check_established(), although inet_ehashfn() has a smaller cost. Tested: Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server Before this patch: utime_start=0.286131 utime_end=4.378886 stime_start=11.952556 stime_end=1991.655533 num_transactions=1446830 latency_min=0.001061085 latency_max=12.075275028 latency_mean=0.376375302 latency_stddev=1.361969596 num_samples=306383 throughput=151866.56 perf top: 50.01% [kernel] [k] __inet6_check_established 20.65% [kernel] [k] __inet_hash_connect 15.81% [kernel] [k] inet6_ehashfn 2.92% [kernel] [k] rcu_all_qs 2.34% [kernel] [k] __cond_resched 0.50% [kernel] [k] _raw_spin_lock 0.34% [kernel] [k] sched_balance_trigger 0.24% [kernel] [k] queued_spin_lock_slowpath After this patch: utime_start=0.315047 utime_end=9.257617 stime_start=7.041489 stime_end=1923.688387 num_transactions=3057968 latency_min=0.003041375 latency_max=7.056589232 latency_mean=0.141075048 # Better latency metrics latency_stddev=0.526900516 num_samples=312996 throughput=320677.21 # 111 % increase, and 229 % for the series perf top: inet6_ehashfn is no longer seen. 39.67% [kernel] [k] __inet_hash_connect 37.06% [kernel] [k] __inet6_check_established 4.79% [kernel] [k] rcu_all_qs 3.82% [kernel] [k] __cond_resched 1.76% [kernel] [k] sched_balance_domains 0.82% [kernel] [k] _raw_spin_lock 0.81% [kernel] [k] sched_balance_rq 0.81% [kernel] [k] sched_balance_trigger 0.76% [kernel] [k] queued_spin_lock_slowpath Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250305034550.879255-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()Eric Dumazet
In order to speedup __inet_hash_connect(), we want to ensure hash values for <source address, port X, destination address, destination port> are not randomly spread, but monotonically increasing. Goal is to allow __inet_hash_connect() to derive the hash value of a candidate 4-tuple with a single addition in the following patch in the series. Given : hash_0 = inet_ehashfn(saddr, 0, daddr, dport) hash_sport = inet_ehashfn(saddr, sport, daddr, dport) Then (hash_sport == hash_0 + sport) for all sport values. As far as I know, there is no security implication with this change. After this patch, when __inet_hash_connect() has to try XXXX candidates, the hash table buckets are contiguous and packed, allowing a better use of cpu caches and hardware prefetchers. Tested: Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server Before this patch: utime_start=0.271607 utime_end=3.847111 stime_start=18.407684 stime_end=1997.485557 num_transactions=1350742 latency_min=0.014131929 latency_max=17.895073144 latency_mean=0.505675853 latency_stddev=2.125164772 num_samples=307884 throughput=139866.80 perf top on client: 56.86% [kernel] [k] __inet6_check_established 17.96% [kernel] [k] __inet_hash_connect 13.88% [kernel] [k] inet6_ehashfn 2.52% [kernel] [k] rcu_all_qs 2.01% [kernel] [k] __cond_resched 0.41% [kernel] [k] _raw_spin_lock After this patch: utime_start=0.286131 utime_end=4.378886 stime_start=11.952556 stime_end=1991.655533 num_transactions=1446830 latency_min=0.001061085 latency_max=12.075275028 latency_mean=0.376375302 latency_stddev=1.361969596 num_samples=306383 throughput=151866.56 perf top: 50.01% [kernel] [k] __inet6_check_established 20.65% [kernel] [k] __inet_hash_connect 15.81% [kernel] [k] inet6_ehashfn 2.92% [kernel] [k] rcu_all_qs 2.34% [kernel] [k] __cond_resched 0.50% [kernel] [k] _raw_spin_lock 0.34% [kernel] [k] sched_balance_trigger 0.24% [kernel] [k] queued_spin_lock_slowpath There is indeed an increase of throughput and reduction of latency. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250305034550.879255-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06tcp: bring back NUMA dispersion in inet_ehash_locks_alloc()Eric Dumazet
We have platforms with 6 NUMA nodes and 480 cpus. inet_ehash_locks_alloc() currently allocates a single 64KB page to hold all ehash spinlocks. This adds more pressure on a single node. Change inet_ehash_locks_alloc() to use vmalloc() to spread the spinlocks on all online nodes, driven by NUMA policies. At boot time, NUMA policy is interleave=all, meaning that tcp_hashinfo.ehash_locks gets hash dispersion on all nodes. Tested: lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2 lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo" 0x000000004e99d30c-0x00000000763f3279 36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2 lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo" 0x00000000fd73a33e-0x0000000004b9a177 36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2 lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo" 0x00000000db07d7a2-0x00000000ad697d29 8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2 Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250305130550.1865988-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: net/ethtool/cabletest.c 2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock") 637399bf7e77 ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device") No Adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06net: replace dev_addr_sem with netdev instance lockStanislav Fomichev
Lockdep reports possible circular dependency in [0]. Instead of fixing the ordering, replace global dev_addr_sem with netdev instance lock. Most of the paths that set/get mac are RTNL protected. Two places where it's not, convert to explicit locking: - sysfs address_show - dev_get_mac_address via dev_ioctl 0: https://netdev-3.bots.linux.dev/vmksft-forwarding-dbg/results/993321/24-router-bridge-1d-lag-sh/stderr Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-12-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06net: ethtool: try to protect all callback with netdev instance lockJakub Kicinski
Protect all ethtool callbacks and PHY related state with the netdev instance lock, for drivers which want / need to have their ops instance-locked. Basically take the lock everywhere we take rtnl_lock. It was tempting to take the lock in ethnl_ops_begin(), but turns out we actually nest those calls (when generating notifications). Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-11-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>