summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-03-21neighbour: switch to standard rcu, instead of rcu_bhEric Dumazet
rcu_bh is no longer a win, especially for objects freed with standard call_rcu(). Switch neighbour code to no longer disable BH when not necessary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-21ipv6: flowlabel: do not disable BH where not neededEric Dumazet
struct ip6_flowlabel are rcu managed, and call_rcu() is used to delay fl_free_rcu() after RCU grace period. There is no point disabling BH for pure RCU lookups. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-21erspan: do not use skb_mac_header() in ndo_start_xmit()Eric Dumazet
Drivers should not assume skb_mac_header(skb) == skb->data in their ndo_start_xmit(). Use skb_network_offset() and skb_transport_offset() which better describe what is needed in erspan_fb_xmit() and ip6erspan_tunnel_xmit() syzbot reported: WARNING: CPU: 0 PID: 5083 at include/linux/skbuff.h:2873 skb_mac_header include/linux/skbuff.h:2873 [inline] WARNING: CPU: 0 PID: 5083 at include/linux/skbuff.h:2873 ip6erspan_tunnel_xmit+0x1d9c/0x2d90 net/ipv6/ip6_gre.c:962 Modules linked in: CPU: 0 PID: 5083 Comm: syz-executor406 Not tainted 6.3.0-rc2-syzkaller-00866-gd4671cb96fa3 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023 RIP: 0010:skb_mac_header include/linux/skbuff.h:2873 [inline] RIP: 0010:ip6erspan_tunnel_xmit+0x1d9c/0x2d90 net/ipv6/ip6_gre.c:962 Code: 04 02 41 01 de 84 c0 74 08 3c 03 0f 8e 1c 0a 00 00 45 89 b4 24 c8 00 00 00 c6 85 77 fe ff ff 01 e9 33 e7 ff ff e8 b4 27 a1 f8 <0f> 0b e9 b6 e7 ff ff e8 a8 27 a1 f8 49 8d bf f0 0c 00 00 48 b8 00 RSP: 0018:ffffc90003b2f830 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 000000000000ffff RCX: 0000000000000000 RDX: ffff888021273a80 RSI: ffffffff88e1bd4c RDI: 0000000000000003 RBP: ffffc90003b2f9d8 R08: 0000000000000003 R09: 000000000000ffff R10: 000000000000ffff R11: 0000000000000000 R12: ffff88802b28da00 R13: 00000000000000d0 R14: ffff88807e25b6d0 R15: ffff888023408000 FS: 0000555556a61300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055e5b11eb6e8 CR3: 0000000027c1b000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __netdev_start_xmit include/linux/netdevice.h:4900 [inline] netdev_start_xmit include/linux/netdevice.h:4914 [inline] __dev_direct_xmit+0x504/0x730 net/core/dev.c:4300 dev_direct_xmit include/linux/netdevice.h:3088 [inline] packet_xmit+0x20a/0x390 net/packet/af_packet.c:285 packet_snd net/packet/af_packet.c:3075 [inline] packet_sendmsg+0x31a0/0x5150 net/packet/af_packet.c:3107 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg+0xde/0x190 net/socket.c:747 __sys_sendto+0x23a/0x340 net/socket.c:2142 __do_sys_sendto net/socket.c:2154 [inline] __se_sys_sendto net/socket.c:2150 [inline] __x64_sys_sendto+0xe1/0x1b0 net/socket.c:2150 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f123aaa1039 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc15d12058 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f123aaa1039 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000020000040 R09: 0000000000000014 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f123aa648c0 R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000 Fixes: 1baf5ebf8954 ("erspan: auto detect truncated packets.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230320163427.8096-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-21net: dsa: tag_brcm: legacy: fix daisy-chained switchesÁlvaro Fernández Rojas
When BCM63xx internal switches are connected to switches with a 4-byte Broadcom tag, it does not identify the packet as VLAN tagged, so it adds one based on its PVID (which is likely 0). Right now, the packet is received by the BCM63xx internal switch and the 6-byte tag is properly processed. The next step would to decode the corresponding 4-byte tag. However, the internal switch adds an invalid VLAN tag after the 6-byte tag and the 4-byte tag handling fails. In order to fix this we need to remove the invalid VLAN tag after the 6-byte tag before passing it to the 4-byte tag decoding. Fixes: 964dbf186eaa ("net: dsa: tag_brcm: add support for legacy tags") Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20230319095540.239064-1-noltari@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-20net: skbuff: rename __pkt_vlan_present_offset to __mono_tc_offsetJakub Kicinski
vlan_present is gone since commit 354259fa73e2 ("net: remove skb->vlan_present") rename the offset field to what BPF is currently looking for in this byte - mono_delivery_time and tc_at_ingress. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230321014115.997841-2-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-20xfrm: copy_to_user_state fetch offloaded SA packets/bytes statisticsRaed Salem
Both in RX and TX, the traffic that performs IPsec packet offload transformation is accounted by HW only. Consequently, the HW should be queried for packets/bytes statistics when user asks for such transformation data. Signed-off-by: Raed Salem <raeds@nvidia.com> Link: https://lore.kernel.org/r/d90ec74186452b1509ee94875d942cb777b7181e.1678714336.git.leon@kernel.org Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-20xfrm: add new device offload acquire flagRaed Salem
During XFRM acquire flow, a default SA is created to be updated later, once acquire netlink message is handled in user space. When the relevant policy is offloaded this default SA is also offloaded to IPsec offload supporting driver, however this SA does not have context suitable for offloading in HW, nor is interesting to offload to HW, consequently needs a special driver handling apart from other offloaded SA(s). Add a special flag that marks such SA so driver can handle it correctly. Signed-off-by: Raed Salem <raeds@nvidia.com> Link: https://lore.kernel.org/r/f5da0834d8c6b82ab9ba38bd4a0c55e71f0e3dab.1678714336.git.leon@kernel.org Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-20net: dsa: report rx_bytes unadjusted for ETH_HLENVladimir Oltean
We collect the software statistics counters for RX bytes (reported to /proc/net/dev and to ethtool -S $dev | grep 'rx_bytes: ") at a time when skb->len has already been adjusted by the eth_type_trans() -> skb_pull_inline(skb, ETH_HLEN) call to exclude the L2 header. This means that when connecting 2 DSA interfaces back to back and sending 1 packet with length 100, the sending interface will report tx_bytes as incrementing by 100, and the receiving interface will report rx_bytes as incrementing by 86. Since accounting for that in scripts is quirky and is something that would be DSA-specific behavior (requiring users to know that they are running on a DSA interface in the first place), the proposal is that we treat it as a bug and fix it. This design bug has always existed in DSA, according to my analysis: commit 91da11f870f0 ("net: Distributed Switch Architecture protocol support") also updates skb->dev->stats.rx_bytes += skb->len after the eth_type_trans() call. Technically, prior to Florian's commit a86d8becc3f0 ("net: dsa: Factor bottom tag receive functions"), each and every vendor-specific tagging protocol driver open-coded the same bug, until the buggy code was consolidated into something resembling what can be seen now. So each and every driver should have its own Fixes: tag, because of their different histories until the convergence point. I'm not going to do that, for the sake of simplicity, but just blame the oldest appearance of buggy code. There are 2 ways to fix the problem. One is the obvious way, and the other is how I ended up doing it. Obvious would have been to move dev_sw_netstats_rx_add() one line above eth_type_trans(), and below skb_push(skb, ETH_HLEN). But DSA processing is not as simple as that. We count the bytes after removing everything DSA-related from the packet, to emulate what the packet's length was, on the wire, when the user port received it. When eth_type_trans() executes, dsa_untag_bridge_pvid() has not run yet, so in case the switch driver requests this behavior - commit 412a1526d067 ("net: dsa: untag the bridge pvid from rx skbs") has the details - the obvious variant of the fix wouldn't have worked, because the positioning there would have also counted the not-yet-stripped VLAN header length, something which is absent from the packet as seen on the wire (there it may be untagged, whereas software will see it as PVID-tagged). Fixes: f613ed665bb3 ("net: dsa: Add support for 64-bit statistics") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-19net/packet: remove po->xmitEric Dumazet
Use PACKET_SOCK_QDISC_BYPASS atomic bit instead of a pointer. This removes one indirect call in fast path, and READ_ONCE()/WRITE_ONCE() annotations as well. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Willem de Bruijn <willemb@google.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-18tcp: preserve const qualifier in tcp_sk()Eric Dumazet
We can change tcp_sk() to propagate its argument const qualifier, thanks to container_of_const(). We have two places where a const sock pointer has to be upgraded to a write one. We have been using const qualifier for lockless listeners to clearly identify points where writes could happen. Add tcp_sk_rw() helper to better document these. tcp_inbound_md5_hash(), __tcp_grow_window(), tcp_reset_check() and tcp_rack_reo_wnd() get an additional const qualififer for their @tp local variables. smc_check_reset_syn_req() also needs a similar change. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-18mptcp: preserve const qualifier in mptcp_sk()Eric Dumazet
We can change mptcp_sk() to propagate its argument const qualifier, thanks to container_of_const(). We need to change few things to avoid build errors: mptcp_set_datafin_timeout() and mptcp_rtx_head() have to accept non-const sk pointers. @msk local variable in mptcp_pending_tail() must be const. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-18smc: preserve const qualifier in smc_sk()Eric Dumazet
We can change smc_sk() to propagate its argument const qualifier, thanks to container_of_const(). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Karsten Graul <kgraul@linux.ibm.com> Cc: Wenjia Zhang <wenjia@linux.ibm.com> Cc: Jan Karcher <jaka@linux.ibm.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-18af_packet: preserve const qualifier in pkt_sk()Eric Dumazet
We can change pkt_sk() to propagate const qualifier of its argument, thanks to container_of_const() This should avoid some potential errors caused by accidental (const -> not_const) promotion. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
net/wireless/nl80211.c b27f07c50a73 ("wifi: nl80211: fix puncturing bitmap policy") cbbaf2bb829b ("wifi: nl80211: add a command to enable/disable HW timestamping") https://lore.kernel.org/all/20230314105421.3608efae@canb.auug.org.au tools/testing/selftests/net/Makefile 62199e3f1658 ("selftests: net: Add VXLAN MDB test") 13715acf8ab5 ("selftest: Add test for bind() conflicts.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17Merge tag 'net-6.3-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, wifi and ipsec. A little more changes than usual, but it's pretty normal for us that the rc3/rc4 PRs are oversized as people start testing in earnest. Possibly an extra boost from people deploying the 6.1 LTS but that's more of an unscientific hunch. Current release - regressions: - phy: mscc: fix deadlock in phy_ethtool_{get,set}_wol() - virtio: vsock: don't use skbuff state to account credit - virtio: vsock: don't drop skbuff on copy failure - virtio_net: fix page_to_skb() miscalculating the memory size Current release - new code bugs: - eth: correct xdp_features after device reconfig - wifi: nl80211: fix the puncturing bitmap policy - net/mlx5e: flower: - fix raw counter initialization - fix missing error code - fix cloned flow attribute - ipa: - fix some register validity checks - fix a surprising number of bad offsets - kill FILT_ROUT_CACHE_CFG IPA register Previous releases - regressions: - tcp: fix bind() conflict check for dual-stack wildcard address - veth: fix use after free in XDP_REDIRECT when skb headroom is small - ipv4: fix incorrect table ID in IOCTL path - ipvlan: make skb->skb_iif track skb->dev for l3s mode - mptcp: - fix possible deadlock in subflow_error_report - fix UaFs when destroying unaccepted and listening sockets - dsa: mv88e6xxx: fix max_mtu of 1492 on 6165, 6191, 6220, 6250, 6290 Previous releases - always broken: - tcp: tcp_make_synack() can be called from process context, don't assume preemption is disabled when updating stats - netfilter: correct length for loading protocol registers - virtio_net: add checking sq is full inside xdp xmit - bonding: restore IFF_MASTER/SLAVE flags on bond enslave Ethertype change - phy: nxp-c45-tja11xx: fix MII_BASIC_CONFIG_REV bit number - eth: i40e: fix crash during reboot when adapter is in recovery mode - eth: ice: avoid deadlock on rtnl lock when auxiliary device plug/unplug meets bonding - dsa: mt7530: - remove now incorrect comment regarding port 5 - set PLL frequency and trgmii only when trgmii is used - eth: mtk_eth_soc: reset PCS state when changing interface types Misc: - ynl: another license adjustment - move the TCA_EXT_WARN_MSG attribute for tc action" * tag 'net-6.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (108 commits) selftests: bonding: add tests for ether type changes bonding: restore bond's IFF_SLAVE flag if a non-eth dev enslave fails bonding: restore IFF_MASTER/SLAVE flags on bond enslave ether type change net: renesas: rswitch: Fix GWTSDIE register handling net: renesas: rswitch: Fix the output value of quote from rswitch_rx() ethernet: sun: add check for the mdesc_grab() net: ipa: fix some register validity checks net: ipa: kill FILT_ROUT_CACHE_CFG IPA register net: ipa: add two missing declarations net: ipa: reg: include <linux/bug.h> net: xdp: don't call notifiers during driver init net/sched: act_api: add specific EXT_WARN_MSG for tc action Revert "net/sched: act_api: move TCA_EXT_WARN_MSG to the correct hierarchy" net: dsa: microchip: fix RGMII delay configuration on KSZ8765/KSZ8794/KSZ8795 ynl: make the tooling check the license ynl: broaden the license even more tools: ynl: make definitions optional again hsr: ratelimit only when errors are printed qed/qed_mng_tlv: correctly zero out ->min instead of ->hour selftests: net: devlink_port_split.py: skip test if no suitable device available ...
2023-03-17driver core: class: remove module * from class_create()Greg Kroah-Hartman
The module pointer in class_create() never actually did anything, and it shouldn't have been requred to be set as a parameter even if it did something. So just remove it and fix up all callers of the function in the kernel tree at the same time. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17drivers: remove struct module * setting from struct classGreg Kroah-Hartman
There is no need to manually set the owner of a struct class, as the registering function does it automatically, so remove all of the explicit settings from various drivers that did so as it is unneeded. This allows us to remove this pointer entirely from this structure going forward. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Link: https://lore.kernel.org/r/20230313181843.1207845-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17net/smc: Use percpu ref for wr tx referenceKai Shen
The refcount wr_tx_refcnt may cause cache thrashing problems among cores and we can use percpu ref to mitigate this issue here. We gain some performance improvement with percpu ref here on our customized smc-r verion. Applying cache alignment may also mitigate this problem but it seem more reasonable to use percpu ref here. We can also replace wr_reg_refcnt with one percpu reference like wr_tx_refcnt. redis-benchmark on smc-r with atomic wr_tx_refcnt: SET: 525707.06 requests per second, p50=0.087 msec GET: 554877.38 requests per second, p50=0.087 msec redis-benchmark on the percpu_ref version: SET: 540482.06 requests per second, p50=0.087 msec GET: 570711.12 requests per second, p50=0.079 msec Cases are like "redis-benchmark -h x.x.x.x -q -t set,get -P 1 -n 5000000 -c 50 -d 10 --threads 4". Signed-off-by: Kai Shen <KaiShen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17inet_diag: constify raw_lookup() socket argumentEric Dumazet
Now both raw_v4_match() and raw_v6_match() accept a const socket, raw_lookup() can do the same to clarify its role. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17ipv4: raw: constify raw_v4_match() socket argumentEric Dumazet
This clarifies raw_v4_match() intent. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17ipv6: raw: constify raw_v6_match() socket argumentEric Dumazet
This clarifies raw_v6_match() intent. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17udp6: constify __udp_v6_is_mcast_sock() socket argumentEric Dumazet
This clarifies __udp_v6_is_mcast_sock() intent. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17ipv6: constify inet6_mc_check()Eric Dumazet
inet6_mc_check() is essentially a read-only function. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17udp: constify __udp_is_mcast_sock() socket argumentEric Dumazet
This clarifies __udp_is_mcast_sock() intent. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17ipv4: constify ip_mc_sf_allow() socket argumentEric Dumazet
This clarifies ip_mc_sf_allow() intent. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17inet: preserve const qualifier in inet_sk()Eric Dumazet
We can change inet_sk() to propagate const qualifier of its argument. This should avoid some potential errors caused by accidental (const -> not_const) promotion. Other helpers like tcp_sk(), udp_sk(), raw_sk() will be handled in separate patch series. v2: use container_of_const() as advised by Jakub and Linus Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/20230315142841.3a2ac99a@kernel.org/ Link: https://lore.kernel.org/netdev/CAHk-=wiOf12nrYEF2vJMcucKjWPN-Ns_SW9fA7LwST_2Dzp7rw@mail.gmail.com/ Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->pressure to an atomic flagEric Dumazet
Not only this removes some READ_ONCE()/WRITE_ONCE(), this also removes one integer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->running to an atomic flagEric Dumazet
Instead of consuming 32 bits for po->running, use one available bit in po->flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->has_vnet_hdr to an atomic flagEric Dumazet
po->has_vnet_hdr can be read locklessly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->tp_loss to an atomic flagEric Dumazet
tp_loss can be read locklessly. Convert it to an atomic flag to avoid races. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->tp_tx_has_off to an atomic flagEric Dumazet
This is to use existing space in po->flags, and reclaim the storage used by the non atomic bit fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: annotate accesses to po->tp_tstampEric Dumazet
tp_tstamp is read locklessly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->auxdata to an atomic flagEric Dumazet
po->auxdata can be read while another thread is changing its value, potentially raising KCSAN splat. Convert it to PACKET_SOCK_AUXDATA flag. Fixes: 8dc419447415 ("[PACKET]: Add optional checksum computation for recvmsg") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: convert po->origdev to an atomic flagEric Dumazet
syzbot/KCAN reported that po->origdev can be read while another thread is changing its value. We can avoid this splat by converting this field to an actual bit. Following patches will convert remaining 1bit fields. Fixes: 80feaacb8a64 ("[AF_PACKET]: Add option to return orig_dev to userspace.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net/packet: annotate accesses to po->xmitEric Dumazet
po->xmit can be set from setsockopt(PACKET_QDISC_BYPASS), while read locklessly. Use READ_ONCE()/WRITE_ONCE() to avoid potential load/store tearing issues. Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17af_unix: annotate lockless accesses to sk->sk_errEric Dumazet
unix_poll() and unix_dgram_poll() read sk->sk_err without any lock held. Add relevant READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17mptcp: annotate lockless accesses to sk->sk_errEric Dumazet
mptcp_poll() reads sk->sk_err without socket lock held/owned. Add READ_ONCE() and WRITE_ONCE() to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17tcp: annotate lockless access to sk->sk_errEric Dumazet
tcp_poll() reads sk->sk_err without socket lock held/owned. We should used READ_ONCE() here, and update writers to use WRITE_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net: annotate lockless accesses to sk->sk_err_softEric Dumazet
This field can be read/written without lock synchronization. tcp and dccp have been handled in different patches. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17dccp: annotate lockless accesses to sk->sk_err_softEric Dumazet
This field can be read/written without lock synchronization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17tcp: annotate lockless accesses to sk->sk_err_softEric Dumazet
This field can be read/written without lock synchronization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17vlan: partially enable SIOCSHWTSTAMP in containerVadim Fedorenko
Setting timestamp filter was explicitly disabled on vlan devices in containers because it might affect other processes on the host. But it's absolutely legit in case when real device is in the same namespace. Fixes: 873017af7784 ("vlan: disable SIOCSHWTSTAMP in container") Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net: ieee802154: remove an unnecessary null pointer checkDongliang Mu
llsec_parse_seclevel has the null pointer check at its begining. Compared with nl802154_add_llsec_seclevel, nl802154_del_llsec_seclevel has a redundant null pointer check of info->attrs[NL802154_ATTR_SEC_LEVEL] before llsec_parse_seclevel. Fix this issue by removing the null pointer check in nl802154_del_llsec_seclevel. Signed-off-by: Dongliang Mu <dzm91@hust.edu.cn> Link: https://lore.kernel.org/r/20230308083231.460015-1-dzm91@hust.edu.cn Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2023-03-17rtnetlink: bridge: mcast: Relax group address validation in common codeIdo Schimmel
In the upcoming VXLAN MDB implementation, the 0.0.0.0 and :: MDB entries will act as catchall entries for unregistered IP multicast traffic in a similar fashion to the 00:00:00:00:00:00 VXLAN FDB entry that is used to transmit BUM traffic. In deployments where inter-subnet multicast forwarding is used, not all the VTEPs in a tenant domain are members in all the broadcast domains. It is therefore advantageous to transmit BULL (broadcast, unknown unicast and link-local multicast) and unregistered IP multicast traffic on different tunnels. If the same tunnel was used, a VTEP only interested in IP multicast traffic would also pull all the BULL traffic and drop it as it is not a member in the originating broadcast domain [1]. Prepare for this change by allowing the 0.0.0.0 group address in the common rtnetlink MDB code and forbid it in the bridge driver. A similar change is not needed for IPv6 because the common code only validates that the group address is not the all-nodes address. [1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-2.6 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17rtnetlink: bridge: mcast: Move MDB handlers out of bridge driverIdo Schimmel
Currently, the bridge driver registers handlers for MDB netlink messages, making it impossible for other drivers to implement MDB support. As a preparation for VXLAN MDB support, move the MDB handlers out of the bridge driver to the core rtnetlink code. The rtnetlink code will call into individual drivers by invoking their previously added MDB net device operations. Note that while the diffstat is large, the change is mechanical. It moves code out of the bridge driver to rtnetlink code. Also note that a similar change was made in 2012 with commit 77162022ab26 ("net: add generic PF_BRIDGE:RTM_ FDB hooks") that moved FDB handlers out of the bridge driver to the core rtnetlink code. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17bridge: mcast: Implement MDB net device operationsIdo Schimmel
Implement the previously added MDB net device operations in the bridge driver so that they could be invoked by core rtnetlink code in the next patch. The operations are identical to the existing br_mdb_{dump,add,del} functions. The '_new' suffix will be removed in the next patch. The functions are re-implemented in this patch to make the conversion in the next patch easier to review. Add dummy implementations when 'CONFIG_BRIDGE_IGMP_SNOOPING' is disabled, so that an error will be returned to user space when it is trying to add or delete an MDB entry. This is consistent with existing behavior where the bridge driver does not even register rtnetlink handlers for RTM_{NEW,DEL,GET}MDB messages when this Kconfig option is disabled. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-16bpf, test_run: fix crashes due to XDP frame overwriting/corruptionAlexander Lobakin
syzbot and Ilya faced the splats when %XDP_PASS happens for bpf_test_run after skb PP recycling was enabled for {__,}xdp_build_skb_from_frame(): BUG: kernel NULL pointer dereference, address: 0000000000000d28 RIP: 0010:memset_erms+0xd/0x20 arch/x86/lib/memset_64.S:66 [...] Call Trace: <TASK> __finalize_skb_around net/core/skbuff.c:321 [inline] __build_skb_around+0x232/0x3a0 net/core/skbuff.c:379 build_skb_around+0x32/0x290 net/core/skbuff.c:444 __xdp_build_skb_from_frame+0x121/0x760 net/core/xdp.c:622 xdp_recv_frames net/bpf/test_run.c:248 [inline] xdp_test_run_batch net/bpf/test_run.c:334 [inline] bpf_test_run_xdp_live+0x1289/0x1930 net/bpf/test_run.c:362 bpf_prog_test_run_xdp+0xa05/0x14e0 net/bpf/test_run.c:1418 [...] This happens due to that it calls xdp_scrub_frame(), which nullifies xdpf->data. bpf_test_run code doesn't reinit the frame when the XDP program doesn't adjust head or tail. Previously, %XDP_PASS meant the page will be released from the pool and returned to the MM layer, but now it does return to the Pool with the nullified xdpf->data, which doesn't get reinitialized then. So, in addition to checking whether the head and/or tail have been adjusted, check also for a potential XDP frame corruption. xdpf->data is 100% affected and also xdpf->flags is the field closest to the metadata / frame start. Checking for these two should be enough for non-extreme cases. Fixes: 9c94bbf9a87b ("xdp: recycle Page Pool backed skbs built from XDP frames") Reported-by: syzbot+e1d1b65f7c32f2a86a9f@syzkaller.appspotmail.com Link: https://lore.kernel.org/bpf/000000000000f1985705f6ef2243@google.com Reported-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/bpf/e07dd94022ad5731705891b9487cc9ed66328b94.camel@linux.ibm.com Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230316175051.922550-2-aleksander.lobakin@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-16net: xdp: don't call notifiers during driver initJakub Kicinski
Drivers will commonly perform feature setting during init, if they use the xdp_set_features_flag() helper they'll likely run into an ASSERT_RTNL() inside call_netdevice_notifiers_info(). Don't call the notifier until the device is actually registered. Nothing should be tracking the device until its registered and after its unregistration has started. Fixes: 4d5ab0ad964d ("net/mlx5e: take into account device reconfiguration for xdp_features flag") Link: https://lore.kernel.org/r/20230316220234.598091-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-16net/sched: act_api: add specific EXT_WARN_MSG for tc actionHangbin Liu
In my previous commit 0349b8779cc9 ("sched: add new attr TCA_EXT_WARN_MSG to report tc extact message") I didn't notice the tc action use different enum with filter. So we can't use TCA_EXT_WARN_MSG directly for tc action. Let's add a TCA_ROOT_EXT_WARN_MSG for tc action specifically and put this param before going to the TCA_ACT_TAB nest. Fixes: 0349b8779cc9 ("sched: add new attr TCA_EXT_WARN_MSG to report tc extact message") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-16Revert "net/sched: act_api: move TCA_EXT_WARN_MSG to the correct hierarchy"Hangbin Liu
This reverts commit 923b2e30dc9cd05931da0f64e2e23d040865c035. This is not a correct fix as TCA_EXT_WARN_MSG is not a hierarchy to TCA_ACT_TAB. I didn't notice the TC actions use different enum when adding TCA_EXT_WARN_MSG. To fix the difference I will add a new WARN enum in TCA_ROOT_MAX as Jamal suggested. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>