summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-06-05Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-06-05 We've added 8 non-merge commits during the last 6 day(s) which contain a total of 9 files changed, 34 insertions(+), 35 deletions(-). The main changes are: 1) Fix a potential use-after-free in bpf_link_free when the link uses dealloc_deferred to free the link object but later still tests for presence of link->ops->dealloc, from Cong Wang. 2) Fix BPF test infra to set the run context for rawtp test_run callback where syzbot reported a crash, from Jiri Olsa. 3) Fix bpf_session_cookie BTF_ID in the special_kfunc_set list to exclude it for the case of !CONFIG_FPROBE, also from Jiri Olsa. 4) Fix a Coverity static analysis report to not close() a link_fd of -1 in the multi-uprobe feature detector, from Andrii Nakryiko. 5) Revert support for redirect to any xsk socket bound to the same umem as it can result in corrupted ring state which can lead to a crash when flushing rings. A different approach will be pursued for bpf-next to address it safely, from Magnus Karlsson. 6) Fix inet_csk_accept prototype in test_sk_storage_tracing.c which caused BPF CI failure after the last tree fast forwarding, from Andrii Nakryiko. 7) Fix a coccicheck warning in BPF devmap that iterator variable cannot be NULL, from Thorsten Blum. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: Revert "xsk: Document ability to redirect to any socket bound to the same umem" Revert "xsk: Support redirect to any socket bound to the same umem" bpf: Set run context for rawtp test_run callback bpf: Fix a potential use-after-free in bpf_link_free() bpf, devmap: Remove unnecessary if check in for loop libbpf: don't close(-1) in multi-uprobe feature detector bpf: Fix bpf_session_cookie BTF_ID in special_kfunc_set list selftests/bpf: fix inet_csk_accept prototype in test_sk_storage_tracing.c ==================== Link: https://lore.kernel.org/r/20240605091525.22628-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/sched: taprio: always validate TCA_TAPRIO_ATTR_PRIOMAPEric Dumazet
If one TCA_TAPRIO_ATTR_PRIOMAP attribute has been provided, taprio_parse_mqprio_opt() must validate it, or userspace can inject arbitrary data to the kernel, the second time taprio_change() is called. First call (with valid attributes) sets dev->num_tc to a non zero value. Second call (with arbitrary mqprio attributes) returns early from taprio_parse_mqprio_opt() and bad things can happen. Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule") Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20240604181511.769870-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05tcp: add sysctl_tcp_rto_min_usKevin Yang
Adding a sysctl knob to allow user to specify a default rto_min at socket init time, other than using the hard coded 200ms default rto_min. Note that the rto_min route option has the highest precedence for configuring this setting, followed by the TCP_BPF_RTO_MIN socket option, followed by the tcp_rto_min_us sysctl. Signed-off-by: Kevin Yang <yyd@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: derive delack_max with tcp_rto_min helperKevin Yang
Rto_min now has multiple sources, ordered by preprecedence high to low: ip route option rto_min, icsk->icsk_rto_min. When derive delack_max from rto_min, we should not only use ip route option, but should use tcp_rto_min helper to get the correct rto_min. Signed-off-by: Kevin Yang <yyd@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05rtnetlink: make the "split" NLM_DONE handling genericJakub Kicinski
Jaroslav reports Dell's OMSA Systems Management Data Engine expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo() and inet_dump_ifaddr(). We already added a similar fix previously in commit 460b0d33cf10 ("inet: bring NLM_DONE out to a separate recv() again") Instead of modifying all the dump handlers, and making them look different than modern for_each_netdev_dump()-based dump handlers - put the workaround in rtnetlink code. This will also help us move the custom rtnl-locking from af_netlink in the future (in net-next). Note that this change is not touching rtnl_dump_all(). rtnl_dump_all() is different kettle of fish and a potential problem. We now mix families in a single recvmsg(), but NLM_DONE is not coalesced. Tested: ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \ --dump getaddr --json '{"ifa-family": 2}' ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \ --dump getroute --json '{"rtm-family": 2}' ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \ --dump getlink Fixes: 3e41af90767d ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()") Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()") Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05mptcp: count CLOSE-WAIT sockets for MPTCP_MIB_CURRESTABJason Xing
Like previous patch does in TCP, we need to adhere to RFC 1213: "tcpCurrEstab OBJECT-TYPE ... The number of TCP connections for which the current state is either ESTABLISHED or CLOSE- WAIT." So let's consider CLOSE-WAIT sockets. The logic of counting When we increment the counter? a) Only if we change the state to ESTABLISHED. When we decrement the counter? a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT, say, on the client side, changing from ESTABLISHED to FIN-WAIT-1. b) if the socket leaves CLOSE-WAIT, say, on the server side, changing from CLOSE-WAIT to LAST-ACK. Fixes: d9cd27b8cd19 ("mptcp: add CurrEstab MIB counter support") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTABJason Xing
According to RFC 1213, we should also take CLOSE-WAIT sockets into consideration: "tcpCurrEstab OBJECT-TYPE ... The number of TCP connections for which the current state is either ESTABLISHED or CLOSE- WAIT." After this, CurrEstab counter will display the total number of ESTABLISHED and CLOSE-WAIT sockets. The logic of counting When we increment the counter? a) if we change the state to ESTABLISHED. b) if we change the state from SYN-RECEIVED to CLOSE-WAIT. When we decrement the counter? a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT, say, on the client side, changing from ESTABLISHED to FIN-WAIT-1. b) if the socket leaves CLOSE-WAIT, say, on the server side, changing from CLOSE-WAIT to LAST-ACK. Please note: there are two chances that old state of socket can be changed to CLOSE-WAIT in tcp_fin(). One is SYN-RECV, the other is ESTABLISHED. So we have to take care of the former case. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: annotate data-races around tw->tw_ts_recent and tw->tw_ts_recent_stampEric Dumazet
These fields can be read and written locklessly, add annotations around these minor races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: sched: sch_multiq: fix possible OOB write in multiq_tune()Hangyu Hua
q->bands will be assigned to qopt->bands to execute subsequent code logic after kmalloc. So the old q->bands should not be used in kmalloc. Otherwise, an out-of-bounds write will occur. Fixes: c2999f7fb05b ("net: sched: multiq: don't call qdisc_put() while holding tree lock") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05devlink: Constify the 'table_ops' parameter of devl_dpipe_table_register()Christophe JAILLET
"struct devlink_dpipe_table_ops" only contains some function pointers. Update "struct devlink_dpipe_table" and the 'table_ops' parameter of devl_dpipe_table_register() so that structures in drivers can be constified. Constifying these structures will move some data to a read-only section, so increase overall security. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: ethtool: remove unused struct 'cable_test_tdr_req_info'Dr. David Alan Gilbert
'cable_test_tdr_req_info' is unused since the original commit f2bc8ad31a7f ("net: ethtool: Allow PHY cable test TDR data to configured"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: caif: remove unused structsDr. David Alan Gilbert
'cfpktq' has been unused since commit 73d6ac633c6c ("caif: code cleanup"). 'caif_packet_funcs' is declared but never defined. Remove both of them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: remove NULL-pointer net parameter in ip_metrics_convertJason Xing
When I was doing some experiments, I found that when using the first parameter, namely, struct net, in ip_metrics_convert() always triggers NULL pointer crash. Then I digged into this part, realizing that we can remove this one due to its uselessness. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: bridge: fix an inconsistent indentationChen Hanxiao
Smatch complains: net/bridge/br_netlink_tunnel.c: 318 br_process_vlan_tunnel_info() warn: inconsistent indenting Fix it with a proper indenting Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net/smc: avoid overwriting when adjusting sock bufsizesWen Gu
When copying smc settings to clcsock, avoid setting clcsock's sk_sndbuf to sysctl_tcp_wmem[1], since this may overwrite the value set by tcp_sndbuf_expand() in TCP connection establishment. And the other setting sk_{snd|rcv}buf to sysctl value in smc_adjust_sock_bufsizes() can also be omitted since the initialization of smc sock and clcsock has set sk_{snd|rcv}buf to smc.sysctl_{w|r}mem or ipv4_sysctl_tcp_{w|r}mem[1]. Fixes: 30c3c4a4497c ("net/smc: Use correct buffer sizes when switching between TCP and SMC") Link: https://lore.kernel.org/r/5eaf3858-e7fd-4db8-83e8-3d7a3e0e9ae2@linux.alibaba.com Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>, too. Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05Revert "xsk: Support redirect to any socket bound to the same umem"Magnus Karlsson
This reverts commit 2863d665ea41282379f108e4da6c8a2366ba66db. This patch introduced a potential kernel crash when multiple napi instances redirect to the same AF_XDP socket. By removing the queue_index check, it is possible for multiple napi instances to access the Rx ring at the same time, which will result in a corrupted ring state which can lead to a crash when flushing the rings in __xsk_flush(). This can happen when the linked list of sockets to flush gets corrupted by concurrent accesses. A quick and small fix is not possible, so let us revert this for now. Reported-by: Yuval El-Hanany <YuvalE@radware.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/xdp-newbies/8100DBDC-0B7C-49DB-9995-6027F6E63147@radware.com Link: https://lore.kernel.org/bpf/20240604122927.29080-2-magnus.karlsson@gmail.com
2024-06-05bpf: Set run context for rawtp test_run callbackJiri Olsa
syzbot reported crash when rawtp program executed through the test_run interface calls bpf_get_attach_cookie helper or any other helper that touches task->bpf_ctx pointer. Setting the run context (task->bpf_ctx pointer) for test_run callback. Fixes: 7adfc6c9b315 ("bpf: Add bpf_get_attach_cookie() BPF helper to access bpf_cookie value") Reported-by: syzbot+3ab78ff125b7979e45f9@syzkaller.appspotmail.com Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Closes: https://syzkaller.appspot.com/bug?extid=3ab78ff125b7979e45f9 Link: https://lore.kernel.org/bpf/20240604150024.359247-1-jolsa@kernel.org
2024-06-04openvswitch: Remove generic .ndo_get_stats64Breno Leitao
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is configured") moved the callback to dev_get_tstats64() to net core, so, unless the driver is doing some custom stats collection, it does not need to set .ndo_get_stats64. Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64 function pointer. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com> Link: https://lore.kernel.org/r/20240531111552.3209198-2-leitao@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04openvswitch: Move stats allocation to coreBreno Leitao
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done on net core instead of this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now. Move openvswitch driver to leverage the core allocation. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240531111552.3209198-1-leitao@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04net: skb: add compatibility warnings to skb_shift()Jakub Kicinski
According to current semantics we should never try to shift data between skbs which differ on decrypted or pp_recycle status. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04tcp: add a helper for setting EOR on tail skbJakub Kicinski
TLS (and hopefully soon PSP will) use EOR to prevent skbs with different decrypted state from getting merged, without adding new tests to the skb handling. In both cases once the connection switches to an "encrypted" state, all subsequent skbs will be encrypted, so a single "EOR fence" is sufficient to prevent mixing. Add a helper for setting the EOR bit, to make this arrangement more explicit. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04tcp: wrap mptcp and decrypted checks into tcp_skb_can_collapse_rx()Jakub Kicinski
tcp_skb_can_collapse() checks for conditions which don't make sense on input. Because of this we ended up sprinkling a few pairs of mptcp_skb_can_collapse() and skb_cmp_decrypted() calls on the input path. Group them in a new helper. This should make it less likely that someone will check mptcp and not decrypted or vice versa when adding new code. This implicitly adds a decrypted check early in tcp_collapse(). AFAIU this will very slightly increase our ability to collapse packets under memory pressure, not a real bug. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04net: tls: fix marking packets as decryptedJakub Kicinski
For TLS offload we mark packets with skb->decrypted to make sure they don't escape the host without getting encrypted first. The crypto state lives in the socket, so it may get detached by a call to skb_orphan(). As a safety check - the egress path drops all packets with skb->decrypted and no "crypto-safe" socket. The skb marking was added to sendpage only (and not sendmsg), because tls_device injected data into the TCP stack using sendpage. This special case was missed when sendpage got folded into sendmsg. Fixes: c5c37af6ecad ("tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240530232607.82686-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04net/sched: cls_flower: add support for matching tunnel control flagsDavide Caratti
extend cls_flower to match TUNNEL_FLAGS_PRESENT bits in tunnel metadata. Suggested-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04flow_dissector: add support for tunnel control flagsDavide Caratti
Dissect [no]csum, [no]dontfrag, [no]oam, [no]crit flags from skb metadata. This is a prerequisite for matching these control flags using TC flower. Suggested-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-04net: count drops due to missing qdisc as dev->tx_dropsJakub Kicinski
Catching and debugging missing qdiscs is pretty tricky. When qdisc is deleted we replace it with a noop qdisc, which silently drops all the packets. Since the noop qdisc has a single static instance we can't count drops at the qdisc level. Count them as dev->tx_drops. ip netns add red ip link add type veth peer netns red ip link set dev veth0 up ip -netns red link set dev veth0 up ip a a dev veth0 10.0.0.1/24 ip -netns red a a dev veth0 10.0.0.2/24 ping -c 2 10.0.0.2 # 2 packets transmitted, 2 received, 0% packet loss, time 1031ms ip -s link show dev veth0 # TX: bytes packets errors dropped carrier collsns # 1314 17 0 0 0 0 tc qdisc replace dev veth0 root handle 1234: mq tc qdisc replace dev veth0 parent 1234:1 pfifo tc qdisc del dev veth0 parent 1234:1 ping -c 2 10.0.0.2 # 2 packets transmitted, 0 received, 100% packet loss, time 1034ms ip -s link show dev veth0 # TX: bytes packets errors dropped carrier collsns # 1314 17 0 3 0 0 Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20240529162527.3688979-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-03Merge tag 'wireless-2024-06-03' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Kalle Valo says: ==================== wireless fixes for v6.10-rc3 The first fixes for v6.10. And we have a big one, I suspect the biggest wireless pull request we ever had. There are fixes all over, both in stack and drivers. Likely the most important here are mt76 not working on mt7615 devices, ath11k not being able to connect to 6 GHz networks and rtlwifi suffering from packet loss. But of course there's much more. * tag 'wireless-2024-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (37 commits) wifi: rtlwifi: Ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS wifi: mt76: mt7615: add missing chanctx ops wifi: wilc1000: document SRCU usage instead of SRCU Revert "wifi: wilc1000: set atomic flag on kmemdup in srcu critical section" Revert "wifi: wilc1000: convert list management to RCU" wifi: mac80211: fix UBSAN noise in ieee80211_prep_hw_scan() wifi: mac80211: correctly parse Spatial Reuse Parameter Set element wifi: mac80211: fix Spatial Reuse element size check wifi: iwlwifi: mvm: don't read past the mfuart notifcation wifi: iwlwifi: mvm: Fix scan abort handling with HW rfkill wifi: iwlwifi: mvm: check n_ssids before accessing the ssids wifi: iwlwifi: mvm: properly set 6 GHz channel direct probe option wifi: iwlwifi: mvm: handle BA session teardown in RF-kill wifi: iwlwifi: mvm: Handle BIGTK cipher in kek_kck cmd wifi: iwlwifi: mvm: remove stale STA link data during restart wifi: iwlwifi: dbg_ini: move iwl_dbg_tlv_free outside of debugfs ifdef wifi: iwlwifi: mvm: set properly mac header wifi: iwlwifi: mvm: revert gen2 TX A-MPDU size to 64 wifi: iwlwifi: mvm: d3: fix WoWLAN command version lookup wifi: iwlwifi: mvm: fix a crash on 7265 ... ==================== Link: https://lore.kernel.org/r/20240603115129.9494CC2BD10@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03net: dst_cache: add two DEBUG_NET warningsEric Dumazet
After fixing four different bugs involving dst_cache users, it might be worth adding a check about BH being blocked by dst_cache callers. DEBUG_NET_WARN_ON_ONCE(!in_softirq()); It is not fatal, if we missed valid case where no BH deadlock is to be feared, we might change this. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03ila: block BH in ila_output()Eric Dumazet
As explained in commit 1378817486d6 ("tipc: block BH before using dst_cache"), net/core/dst_cache.c helpers need to be called with BH disabled. ila_output() is called from lwtunnel_output() possibly from process context, and under rcu_read_lock(). We might be interrupted by a softirq, re-enter ila_output() and corrupt dst_cache data structures. Fix the race by using local_bh_disable(). Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03ipv6: sr: block BH in seg6_output_core() and seg6_input_core()Eric Dumazet
As explained in commit 1378817486d6 ("tipc: block BH before using dst_cache"), net/core/dst_cache.c helpers need to be called with BH disabled. Disabling preemption in seg6_output_core() is not good enough, because seg6_output_core() is called from process context, lwtunnel_output() only uses rcu_read_lock(). We might be interrupted by a softirq, re-enter seg6_output_core() and corrupt dst_cache data structures. Fix the race by using local_bh_disable() instead of preempt_disable(). Apply a similar change in seg6_input_core(). Fixes: fa79581ea66c ("ipv6: sr: fix several BUGs when preemption is enabled") Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Lebrun <dlebrun@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03net: ipv6: rpl_iptunnel: block BH in rpl_output() and rpl_input()Eric Dumazet
As explained in commit 1378817486d6 ("tipc: block BH before using dst_cache"), net/core/dst_cache.c helpers need to be called with BH disabled. Disabling preemption in rpl_output() is not good enough, because rpl_output() is called from process context, lwtunnel_output() only uses rcu_read_lock(). We might be interrupted by a softirq, re-enter rpl_output() and corrupt dst_cache data structures. Fix the race by using local_bh_disable() instead of preempt_disable(). Apply a similar change in rpl_input(). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Aring <aahringo@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03ipv6: ioam: block BH from ioam6_output()Eric Dumazet
As explained in commit 1378817486d6 ("tipc: block BH before using dst_cache"), net/core/dst_cache.c helpers need to be called with BH disabled. Disabling preemption in ioam6_output() is not good enough, because ioam6_output() is called from process context, lwtunnel_output() only uses rcu_read_lock(). We might be interrupted by a softirq, re-enter ioam6_output() and corrupt dst_cache data structures. Fix the race by using local_bh_disable() instead of preempt_disable(). Fixes: 8cb3bf8bff3c ("ipv6: ioam: Add support for the ip6ip6 encapsulation") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Justin Iurman <justin.iurman@uliege.be> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240531132636.2637995-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-03SUNRPC: Fix loop termination condition in gss_free_in_token_pages()Chuck Lever
The in_token->pages[] array is not NULL terminated. This results in the following KASAN splat: KASAN: maybe wild-memory-access in range [0x04a2013400000008-0x04a201340000000f] Fixes: bafa6b4d95d9 ("SUNRPC: Fix gss_free_in_token_pages()") Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-06-03net/smc: change SMCR_RMBE_SIZES from 5 to 15Guangguan Wang
SMCR_RMBE_SIZES is the upper boundary of SMC-R's snd_buf and rcv_buf. The maximum bytes of snd_buf and rcv_buf can be calculated by 2^SMCR_ RMBE_SIZES * 16KB. SMCR_RMBE_SIZES = 5 means the upper boundary is 512KB. TCP's snd_buf and rcv_buf max size is configured by net.ipv4.tcp_w/rmem[2] whose default value is 4MB or 6MB, is much larger than SMC-R's upper boundary. In some scenarios, such as Recommendation System, the communication pattern is mainly large size send/recv, where the size of snd_buf and rcv_buf greatly affects performance. Due to the upper boundary disadvantage, SMC-R performs poor than TCP in those scenarios. So it is time to enlarge the upper boundary size of SMC-R's snd_buf and rcv_buf, so that the SMC-R's snd_buf and rcv_buf can be configured to larger size for performance gain in such scenarios. The SMC-R rcv_buf's size will be transferred to peer by the field rmbe_size in clc accept and confirm message. The length of the field rmbe_size is four bits, which means the maximum value of SMCR_RMBE_SIZES is 15. In case of frequently adjusting the value of SMCR_RMBE_SIZES in different scenarios, set the value of SMCR_RMBE_SIZES to the maximum value 15, which means the upper boundary of SMC-R's snd_buf and rcv_buf is 512MB. As the real memory usage is determined by the value of net.smc.w/rmem, not by the upper boundary, set the value of SMCR_RMBE_SIZES to the maximum value has no side affects. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Co-developed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-03net/smc: set rmb's SG_MAX_SINGLE_ALLOC limitation only when ↵Guangguan Wang
CONFIG_ARCH_NO_SG_CHAIN is defined SG_MAX_SINGLE_ALLOC is used to limit maximum number of entries that will be allocated in one piece of scatterlist. When the entries of scatterlist exceeds SG_MAX_SINGLE_ALLOC, sg chain will be used. From commit 7c703e54cc71 ("arch: switch the default on ARCH_HAS_SG_CHAIN"), we can know that the macro CONFIG_ARCH_NO_SG_CHAIN is used to identify whether sg chain is supported. So, SMC-R's rmb buffer should be limited by SG_MAX_SINGLE_ALLOC only when the macro CONFIG_ARCH_NO_SG_CHAIN is defined. Fixes: a3fe3d01bd0d ("net/smc: introduce sg-logic for RMBs") Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Co-developed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-03net: mac802154: Fix racy device stats updates by DEV_STATS_INC() and ↵Yunshui Jiang
DEV_STATS_ADD() mac802154 devices update their dev->stats fields locklessly. Therefore these counters should be updated atomically. Adopt SMP safe DEV_STATS_INC() and DEV_STATS_ADD() to achieve this. Signed-off-by: Yunshui Jiang <jiangyunshui@kylinos.cn> Message-ID: <20240531080739.2608969-1-jiangyunshui@kylinos.cn> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2024-06-03batman-adv: Don't accept TT entries for out-of-spec VIDsSven Eckelmann
The internal handling of VLAN IDs in batman-adv is only specified for following encodings: * VLAN is used - bit 15 is 1 - bit 11 - bit 0 is the VLAN ID (0-4095) - remaining bits are 0 * No VLAN is used - bit 15 is 0 - remaining bits are 0 batman-adv was only preparing new translation table entries (based on its soft interface information) using this encoding format. But the receive path was never checking if entries in the roam or TT TVLVs were also following this encoding. It was therefore possible to create more than the expected maximum of 4096 + 1 entries in the originator VLAN list. Simply by setting the "remaining bits" to "random" values in corresponding TVLV. Cc: stable@vger.kernel.org Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific") Reported-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2024-06-01af_unix: Remove dead code in unix_stream_read_generic().Kuniyuki Iwashima
When splice() support was added in commit 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets"), we had to release unix_sk(sk)->readlock (current iolock) before calling splice_to_pipe(). Due to the unlock, commit 73ed5d25dce0 ("af-unix: fix use-after-free with concurrent readers while splicing") added a safeguard in unix_stream_read_generic(); we had to bump the skb refcount before calling ->recv_actor() and then check if the skb was consumed by a concurrent reader. However, the pipe side locking was refactored, and since commit 25869262ef7a ("skb_splice_bits(): get rid of callback"), we can call splice_to_pipe() without releasing unix_sk(sk)->iolock. Now, the skb is always alive after the ->recv_actor() callback, so let's remove the unnecessary drop_skb logic. This is mostly the revert of commit 73ed5d25dce0 ("af-unix: fix use-after-free with concurrent readers while splicing"). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240529144648.68591-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01net/tcp: Don't consider TCP_CLOSE in TCP_AO_ESTABLISHEDDmitry Safonov
TCP_CLOSE may or may not have current/rnext keys and should not be considered "established". The fast-path for TCP_CLOSE is SKB_DROP_REASON_TCP_CLOSE. This is what tcp_rcv_state_process() does anyways. Add an early drop path to not spend any time verifying segment signatures for sockets in TCP_CLOSE state. Cc: stable@vger.kernel.org # v6.7 Fixes: 0a3a809089eb ("net/tcp: Verify inbound TCP-AO signed segments") Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://lore.kernel.org/r/20240529-tcp_ao-sk_state-v1-1-d69b5d323c52@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01net/ncsi: Fix the multi thread manner of NCSI driverDelphineCCChiu
Currently NCSI driver will send several NCSI commands back to back without waiting the response of previous NCSI command or timeout in some state when NIC have multi channel. This operation against the single thread manner defined by NCSI SPEC(section 6.3.2.3 in DSP0222_1.1.1) According to NCSI SPEC(section 6.2.13.1 in DSP0222_1.1.1), we should probe one channel at a time by sending NCSI commands (Clear initial state, Get version ID, Get capabilities...), than repeat this steps until the max number of channels which we got from NCSI command (Get capabilities) has been probed. Fixes: e6f44ed6d04d ("net/ncsi: Package and channel management") Signed-off-by: DelphineCCChiu <delphine_cc_chiu@wiwynn.com> Link: https://lore.kernel.org/r/20240529065856.825241-1-delphine_cc_chiu@wiwynn.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01net: make net.core.{r,w}mem_{default,max} namespacedMatteo Croce
The following sysctl are global and can't be read from a netns: net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max Make the following sysctl parameters available readonly from within a network namespace, allowing a container to read them. Signed-off-by: Matteo Croce <teknoraver@meta.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20240530232722.45255-2-technoboy85@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01net: rps: fix error when CONFIG_RFS_ACCEL is offJason Xing
John Sperbeck reported that if we turn off CONFIG_RFS_ACCEL, the 'head' is not defined, which will trigger compile error. So I move the 'head' out of the CONFIG_RFS_ACCEL scope. Fixes: 84b6823cd96b ("net: rps: protect last_qtail with rps_input_queue_tail_save() helper") Reported-by: John Sperbeck <jsperbeck@google.com> Closes: https://lore.kernel.org/all/20240529203421.2432481-1-jsperbeck@google.com/ Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240530032717.57787-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01ax25: Replace kfree() in ax25_dev_free() with ax25_dev_put()Duoming Zhou
The object "ax25_dev" is managed by reference counting. Thus it should not be directly released by kfree(), replace with ax25_dev_put(). Fixes: d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/20240530051733.11416-1-duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01ax25: Fix refcount imbalance on inbound connectionsLars Kellogg-Stedman
When releasing a socket in ax25_release(), we call netdev_put() to decrease the refcount on the associated ax.25 device. However, the execution path for accepting an incoming connection never calls netdev_hold(). This imbalance leads to refcount errors, and ultimately to kernel crashes. A typical call trace for the above situation will start with one of the following errors: refcount_t: decrement hit 0; leaking memory. refcount_t: underflow; use-after-free. And will then have a trace like: Call Trace: <TASK> ? show_regs+0x64/0x70 ? __warn+0x83/0x120 ? refcount_warn_saturate+0xb2/0x100 ? report_bug+0x158/0x190 ? prb_read_valid+0x20/0x30 ? handle_bug+0x3e/0x70 ? exc_invalid_op+0x1c/0x70 ? asm_exc_invalid_op+0x1f/0x30 ? refcount_warn_saturate+0xb2/0x100 ? refcount_warn_saturate+0xb2/0x100 ax25_release+0x2ad/0x360 __sock_release+0x35/0xa0 sock_close+0x19/0x20 [...] On reboot (or any attempt to remove the interface), the kernel gets stuck in an infinite loop: unregister_netdevice: waiting for ax0 to become free. Usage count = 0 This patch corrects these issues by ensuring that we call netdev_hold() and ax25_dev_hold() for new connections in ax25_accept(). This makes the logic leading to ax25_accept() match the logic for ax25_bind(): in both cases we increment the refcount, which is ultimately decremented in ax25_release(). Fixes: 9fd75b66b8f6 ("ax25: Fix refcount leaks caused by ax25_cb_del()") Signed-off-by: Lars Kellogg-Stedman <lars@oddbit.com> Tested-by: Duoming Zhou <duoming@zju.edu.cn> Tested-by: Dan Cross <crossd@gmail.com> Tested-by: Chris Maness <christopher.maness@gmail.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/20240529210242.3346844-2-lars@oddbit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01net: validate SO_TXTIME clockid coming from userspaceAbhishek Chauhan
Currently there are no strict checks while setting SO_TXTIME from userspace. With the recent development in skb->tstamp_type clockid with unsupported clocks results in warn_on_once, which causes unnecessary aborts in some systems which enables panic on warns. Add validation in setsockopt to support only CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_TAI to be set from userspace. Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/ Link: https://lore.kernel.org/lkml/6bdba7b6-fd22-4ea5-a356-12268674def1@quicinc.com/ Fixes: 1693c5db6ab8 ("net: Add additional bit to support clockid_t timestamp type") Reported-by: syzbot+d7b227731ec589e7f4f0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d7b227731ec589e7f4f0 Reported-by: syzbot+30a35a2e9c5067cc43fa@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=30a35a2e9c5067cc43fa Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240529183130.1717083-1-quic_abchauha@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-01ethtool: init tsinfo stats if requestedVadim Fedorenko
Statistic values should be set to ETHTOOL_STAT_NOT_SET even if the device doesn't support statistics. Otherwise zeros will be returned as if they are proper values: host# ethtool -I -T lo Time stamping parameters for lo: Capabilities: software-transmit software-receive software-system-clock PTP Hardware Clock: none Hardware Transmit Timestamp Modes: none Hardware Receive Filter Modes: none Statistics: tx_pkts: 0 tx_lost: 0 tx_err: 0 Fixes: 0e9c127729be ("ethtool: add interface to read Tx hardware timestamping statistics") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Link: https://lore.kernel.org/r/20240530040814.1014446-1-vadfed@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_classifier.c abd5576b9c57 ("net: ti: icssg-prueth: Add support for ICSSG switch firmware") 56a5cf538c3f ("net: ti: icssg-prueth: Fix start counter for ft1 filter") https://lore.kernel.org/all/20240531123822.3bb7eadf@canb.auug.org.au/ No other adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-30ipv6: sr: restruct ifdefinesHangbin Liu
There are too many ifdef in IPv6 segment routing code that may cause logic problems. like commit 160e9d275218 ("ipv6: sr: fix invalid unregister error path"). To avoid this, the init functions are redefined for both cases. The code could be more clear after all fidefs are removed. Suggested-by: Simon Horman <horms@kernel.org> Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240529040908.3472952-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-30bpf: pass bpf_struct_ops_link to callbacks in bpf_struct_ops.Kui-Feng Lee
Pass an additional pointer of bpf_struct_ops_link to callback function reg, unreg, and update provided by subsystems defined in bpf_struct_ops. A bpf_struct_ops_map can be registered for multiple links. Passing a pointer of bpf_struct_ops_link helps subsystems to distinguish them. This pointer will be used in the later patches to let the subsystem initiate a detachment on a link that was registered to it previously. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240530065946.979330-2-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-30SUNRPC: return proper error from gss_wrap_req_privChen Hanxiao
don't return 0 if snd_buf->len really greater than snd_buf->buflen Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Fixes: 0c77668ddb4e ("SUNRPC: Introduce trace points in rpc_auth_gss.ko") Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>