summaryrefslogtreecommitdiff
path: root/net/bridge
AgeCommit message (Collapse)Author
45 hoursMerge tag 'net-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x times faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is a prep work to remove the current per-CPU lock in local_bh_disable() on PREMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum thoughput on 200Gbs link by over 60%. - Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the sources address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - ReakTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow streeing error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission premption - igb: adds persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalabilty - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (kzZ88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. - WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
3 daysMerge tag 'timers-cleanups-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "Another set of timer API cleanups: - Convert init_timer*(), try_to_del_timer_sync() and destroy_timer_on_stack() over to the canonical timer_*() namespace convention. There is another large conversion pending, which has not been included because it would have caused a gazillion of merge conflicts in next. The conversion scripts will be run towards the end of the merge window and a pull request sent once all conflict dependencies have been merged" * tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack() treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try() timers: Rename init_timers() as timers_init() timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA timers: Rename __init_timer_on_stack() as __timer_init_on_stack() timers: Rename __init_timer() as __timer_init() timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack() timers: Rename init_timer_key() as timer_init_key()
8 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
14 daysbridge: netfilter: Fix forwarding of fragmented packetsIdo Schimmel
When netfilter defrag hooks are loaded (due to the presence of conntrack rules, for example), fragmented packets entering the bridge will be defragged by the bridge's pre-routing hook (br_nf_pre_routing() -> ipv4_conntrack_defrag()). Later on, in the bridge's post-routing hook, the defragged packet will be fragmented again. If the size of the largest fragment is larger than what the kernel has determined as the destination MTU (using ip_skb_dst_mtu()), the defragged packet will be dropped. Before commit ac6627a28dbf ("net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward"), ip_skb_dst_mtu() would return dst_mtu() as the destination MTU. Assuming the dst entry attached to the packet is the bridge's fake rtable one, this would simply be the bridge's MTU (see fake_mtu()). However, after above mentioned commit, ip_skb_dst_mtu() ends up returning the route's MTU stored in the dst entry's metrics. Ideally, in case the dst entry is the bridge's fake rtable one, this should be the bridge's MTU as the bridge takes care of updating this metric when its MTU changes (see br_change_mtu()). Unfortunately, the last operation is a no-op given the metrics attached to the fake rtable entry are marked as read-only. Therefore, ip_skb_dst_mtu() ends up returning 1500 (the initial MTU value) and defragged packets are dropped during fragmentation when dealing with large fragments and high MTU (e.g., 9k). Fix by moving the fake rtable entry's metrics to be per-bridge (in a similar fashion to the fake rtable entry itself) and marking them as writable, thereby allowing MTU changes to be reflected. Fixes: 62fa8a846d7d ("net: Implement read-only protection and COW'ing of metrics.") Fixes: 33eb9873a283 ("bridge: initialize fake_rtable metrics") Reported-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Closes: https://lore.kernel.org/netdev/PH0PR10MB4504888284FF4CBA648197D0ACB82@PH0PR10MB4504.namprd10.prod.outlook.com/ Tested-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250515084848.727706-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()Ingo Molnar
Move this API to the canonical timer_*() namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250507175338.672442-9-mingo@kernel.org
2025-05-05netfilter: bridge: Move specific fragmented packet to slow_path instead of ↵Huajian Yang
dropping it The config NF_CONNTRACK_BRIDGE will change the bridge forwarding for fragmented packets. The original bridge does not know that it is a fragmented packet and forwards it directly, after NF_CONNTRACK_BRIDGE is enabled, function nf_br_ip_fragment and br_ip6_fragment will check the headroom. In original br_forward, insufficient headroom of skb may indeed exist, but there's still a way to save the skb in the device driver after dev_queue_xmit.So droping the skb will change the original bridge forwarding in some cases. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: Huajian Yang <huajianyang@asrmicro.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-23net: bridge: mcast: update multicast contex when vlan state is changedYong Wang
When the vlan STP state is changed, which could be manipulated by "bridge vlan" commands, similar to port STP state, this also impacts multicast behaviors such as igmp query. In the scenario of per-VLAN snooping, there's a need to update the corresponding multicast context to re-arm the port query timer when vlan state becomes "forwarding" etc. Update br_vlan_set_state() function to enable vlan multicast context in such scenario. Before the patch, the IGMP query does not happen in the last step of the following test sequence, i.e. no growth for tx counter: # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0 # ip link add name swp1 up master br1 type dummy # sleep 1 # bridge vlan set vid 1 dev swp1 state 4 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # sleep 1 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # bridge vlan set vid 1 dev swp1 state 3 # sleep 2 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 After the patch, the IGMP query happens in the last step of the test: # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0 # ip link add name swp1 up master br1 type dummy # sleep 1 # bridge vlan set vid 1 dev swp1 state 4 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # sleep 1 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # bridge vlan set vid 1 dev swp1 state 3 # sleep 2 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 3 Signed-off-by: Yong Wang <yongwang@nvidia.com> Reviewed-by: Andy Roulin <aroulin@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-04-23net: bridge: mcast: re-implement br_multicast_{enable, disable}_port functionsYong Wang
When a bridge port STP state is changed from BLOCKING/DISABLED to FORWARDING, the port's igmp query timer will NOT re-arm itself if the bridge has been configured as per-VLAN multicast snooping. Solve this by choosing the correct multicast context(s) to enable/disable port multicast based on whether per-VLAN multicast snooping is enabled or not, i.e. using per-{port, VLAN} context in case of per-VLAN multicast snooping by re-implementing br_multicast_enable_port() and br_multicast_disable_port() functions. Before the patch, the IGMP query does not happen in the last step of the following test sequence, i.e. no growth for tx counter: # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0 # ip link add name swp1 up master br1 type dummy # bridge link set dev swp1 state 0 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # sleep 1 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # bridge link set dev swp1 state 3 # sleep 2 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 After the patch, the IGMP query happens in the last step of the test: # ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1 # bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0 # ip link add name swp1 up master br1 type dummy # bridge link set dev swp1 state 0 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # sleep 1 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 1 # bridge link set dev swp1 state 3 # sleep 2 # ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]' 3 Signed-off-by: Yong Wang <yongwang@nvidia.com> Reviewed-by: Andy Roulin <aroulin@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py 4d07bbf2d456 ("tools: ynl-gen: don't declare loop iterator in place") 7e8ba0c7de2b ("tools: ynl: don't use genlmsghdr in classic netlink") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16net: bridge: switchdev: do not notify new brentries as changedJonas Gorski
When adding a bridge vlan that is pvid or untagged after the vlan has already been added to any other switchdev backed port, the vlan change will be propagated as changed, since the flags change. This causes the vlan to not be added to the hardware for DSA switches, since the DSA handler ignores any vlans for the CPU or DSA ports that are changed. E.g. the following order of operations would work: $ ip link add swbridge type bridge vlan_filtering 1 vlan_default_pvid 0 $ ip link set lan1 master swbridge $ bridge vlan add dev swbridge vid 1 pvid untagged self $ bridge vlan add dev lan1 vid 1 pvid untagged but this order would break: $ ip link add swbridge type bridge vlan_filtering 1 vlan_default_pvid 0 $ ip link set lan1 master swbridge $ bridge vlan add dev lan1 vid 1 pvid untagged $ bridge vlan add dev swbridge vid 1 pvid untagged self Additionally, the vlan on the bridge itself would become undeletable: $ bridge vlan port vlan-id lan1 1 PVID Egress Untagged swbridge 1 PVID Egress Untagged $ bridge vlan del dev swbridge vid 1 self $ bridge vlan port vlan-id lan1 1 PVID Egress Untagged swbridge 1 Egress Untagged since the vlan was never added to DSA's vlan list, so deleting it will cause an error, causing the bridge code to not remove it. Fix this by checking if flags changed only for vlans that are already brentry and pass changed as false for those that become brentries, as these are a new vlan (member) from the switchdev point of view. Since *changed is set to true for becomes_brentry = true regardless of would_change's value, this will not change any rtnetlink notification delivery, just the value passed on to switchdev in vlan->changed. Fixes: 8d23a54f5bee ("net: bridge: switchdev: differentiate new VLANs from changed ones") Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250414200020.192715-1-jonas.gorski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15net: bridge: locally receive all multicast packets if IFF_ALLMULTI is setShengyu Qu
If multicast snooping is enabled, multicast packets may not always end up on the local bridge interface, if the host is not a member of the multicast group. Similar to how IFF_PROMISC allows all packets to be received locally, let IFF_ALLMULTI allow all multicast packets to be received. OpenWrt uses a user space daemon for DHCPv6/RA/NDP handling, and in relay mode it sets the ALLMULTI flag in order to receive all relevant queries on the network. This works for normal network interfaces and non-snooping bridges, but not snooping bridges (unless multicast routing is enabled). Reported-by: Felix Fietkau <nbd@nbd.name> Closes: https://github.com/openwrt/openwrt/issues/15857#issuecomment-2662851243 Signed-off-by: Shengyu Qu <wiagn233@outlook.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/OSZPR01MB8434308370ACAFA90A22980798B32@OSZPR01MB8434.jpnprd01.prod.outlook.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14bridge: Convert br_net_exit_batch_rtnl() to ->exit_rtnl().Kuniyuki Iwashima
br_net_exit_batch_rtnl() iterates the dying netns list and performs the same operation for each. Let's use ->exit_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net: bridge: mcast: Notify on mdb offload failureJoseph Huang
Notify user space on mdb offload failure if mdb_offload_fail_notification is enabled. Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250411150323.1117797-4-Joseph.Huang@garmin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net: bridge: Add offload_fail_notification boptJoseph Huang
Add BR_BOOLOPT_MDB_OFFLOAD_FAIL_NOTIFICATION bool option. Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250411150323.1117797-3-Joseph.Huang@garmin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net: bridge: mcast: Add offload failed mdb flagJoseph Huang
Add MDB_FLAGS_OFFLOAD_FAILED and MDB_PG_FLAGS_OFFLOAD_FAILED to indicate that an attempt to offload the MDB entry to switchdev has failed. Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250411150323.1117797-2-Joseph.Huang@garmin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc2). Conflict: Documentation/networking/netdevices.rst net/core/lock_debug.c 04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE") 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: bridge: Prevent unicast ARP/NS packets from being suppressed by bridgeAmit Cohen
When Proxy ARP or ARP/ND suppression are enabled, ARP/NS packets can be handled by bridge in br_do_proxy_suppress_arp()/br_do_suppress_nd(). For broadcast packets, they are replied by bridge, but later they are not flooded. Currently, unicast packets are replied by bridge when suppression is enabled, and they are also forwarded, which results two replicas of ARP reply/NA - one from the bridge and second from the target. RFC 1122 describes use case for unicat ARP packets - "unicast poll" - actively poll the remote host by periodically sending a point-to-point ARP request to it, and delete the entry if no ARP reply is received from N successive polls. The purpose of ARP/ND suppression is to reduce flooding in the broadcast domain. If a host is sending a unicast ARP/NS, then it means it already knows the address and the switches probably know it as well and there will not be any flooding. In addition, the use case of unicast ARP/NS is to poll a specific host, so it does not make sense to have the switch answer on behalf of the host. According to RFC 9161: "A PE SHOULD reply to broadcast/multicast address resolution messages, i.e., ARP Requests, ARP probes, NS messages, as well as DAD NS messages. An ARP probe is an ARP Request constructed with an all-zero sender IP address that may be used by hosts for IPv4 Address Conflict Detection as specified in [RFC5227]. A PE SHOULD NOT reply to unicast address resolution requests (for instance, NUD NS messages)." Forward such requests and prevent the bridge from replying to them. Reported-by: Denis Yulevych <denisyu@nvidia.com> Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/6bf745a149ddfe5e6be8da684a63aa574a326f8d.1744123493.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in late fixes to prepare for the 6.15 net-next PR. No conflicts, adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 919f9f497dbc ("eth: bnxt: fix out-of-range access of vnic_info array") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-21net: Remove RTNL dance for SIOCBRADDIF and SIOCBRDELIF.Kuniyuki Iwashima
SIOCBRDELIF is passed to dev_ioctl() first and later forwarded to br_ioctl_call(), which causes unnecessary RTNL dance and the splat below [0] under RTNL pressure. Let's say Thread A is trying to detach a device from a bridge and Thread B is trying to remove the bridge. In dev_ioctl(), Thread A bumps the bridge device's refcnt by netdev_hold() and releases RTNL because the following br_ioctl_call() also re-acquires RTNL. In the race window, Thread B could acquire RTNL and try to remove the bridge device. Then, rtnl_unlock() by Thread B will release RTNL and wait for netdev_put() by Thread A. Thread A, however, must hold RTNL after the unlock in dev_ifsioc(), which may take long under RTNL pressure, resulting in the splat by Thread B. Thread A (SIOCBRDELIF) Thread B (SIOCBRDELBR) ---------------------- ---------------------- sock_ioctl sock_ioctl `- sock_do_ioctl `- br_ioctl_call `- dev_ioctl `- br_ioctl_stub |- rtnl_lock | |- dev_ifsioc ' ' |- dev = __dev_get_by_name(...) |- netdev_hold(dev, ...) . / |- rtnl_unlock ------. | | |- br_ioctl_call `---> |- rtnl_lock Race | | `- br_ioctl_stub |- br_del_bridge Window | | | |- dev = __dev_get_by_name(...) | | | May take long | `- br_dev_delete(dev, ...) | | | under RTNL pressure | `- unregister_netdevice_queue(dev, ...) | | | | `- rtnl_unlock \ | |- rtnl_lock <-' `- netdev_run_todo | |- ... `- netdev_run_todo | `- rtnl_unlock |- __rtnl_unlock | |- netdev_wait_allrefs_any |- netdev_put(dev, ...) <----------------' Wait refcnt decrement and log splat below To avoid blocking SIOCBRDELBR unnecessarily, let's not call dev_ioctl() for SIOCBRADDIF and SIOCBRDELIF. In the dev_ioctl() path, we do the following: 1. Copy struct ifreq by get_user_ifreq in sock_do_ioctl() 2. Check CAP_NET_ADMIN in dev_ioctl() 3. Call dev_load() in dev_ioctl() 4. Fetch the master dev from ifr.ifr_name in dev_ifsioc() 3. can be done by request_module() in br_ioctl_call(), so we move 1., 2., and 4. to br_ioctl_stub(). Note that 2. is also checked later in add_del_if(), but it's better performed before RTNL. SIOCBRADDIF and SIOCBRDELIF have been processed in dev_ioctl() since the pre-git era, and there seems to be no specific reason to process them there. [0]: unregister_netdevice: waiting for wpan3 to become free. Usage count = 2 ref_tracker: wpan3@ffff8880662d8608 has 1/1 users at __netdev_tracker_alloc include/linux/netdevice.h:4282 [inline] netdev_hold include/linux/netdevice.h:4311 [inline] dev_ifsioc+0xc6a/0x1160 net/core/dev_ioctl.c:624 dev_ioctl+0x255/0x10c0 net/core/dev_ioctl.c:826 sock_do_ioctl+0x1ca/0x260 net/socket.c:1213 sock_ioctl+0x23a/0x6c0 net/socket.c:1318 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl fs/ioctl.c:892 [inline] __x64_sys_ioctl+0x1a4/0x210 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 893b19587534 ("net: bridge: fix ioctl locking") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: yan kang <kangyan91@outlook.com> Reported-by: yue sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/netdev/SY8P300MB0421225D54EB92762AE8F0F2A1D32@SY8P300MB0421.AUSP300.PROD.OUTLOOK.COM/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250316192851.19781-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-08net: move misc netdev_lock flavors to a separate headerJakub Kicinski
Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04net: rename netns_local to netns_immutableNicolas Dichtel
The name 'netns_local' is confusing. A following commit will export it via netlink, so let's use a more explicit name. Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-21rtnetlink: Pack newlink() params into structXiao Liang
There are 4 net namespaces involved when creating links: - source netns - where the netlink socket resides, - target netns - where to put the device being created, - link netns - netns associated with the device (backend), - peer netns - netns of peer device. Currently, two nets are passed to newlink() callback - "src_net" parameter and "dev_net" (implicitly in net_device). They are set as follows, depending on netlink attributes in the request. +------------+-------------------+---------+---------+ | peer netns | IFLA_LINK_NETNSID | src_net | dev_net | +------------+-------------------+---------+---------+ | | absent | source | target | | absent +-------------------+---------+---------+ | | present | link | link | +------------+-------------------+---------+---------+ | | absent | peer | target | | present +-------------------+---------+---------+ | | present | peer | link | +------------+-------------------+---------+---------+ When IFLA_LINK_NETNSID is present, the device is created in link netns first and then moved to target netns. This has some side effects, including extra ifindex allocation, ifname validation and link events. These could be avoided if we create it in target netns from the beginning. On the other hand, the meaning of src_net parameter is ambiguous. It varies depending on how parameters are passed. It is the effective link (or peer netns) by design, but some drivers ignore it and use dev_net instead. To provide more netns context for drivers, this patch packs existing newlink() parameters, along with the source netns, link netns and peer netns, into a struct. The old "src_net" is renamed to "net" to avoid confusion with real source netns, and will be deprecated later. The use of src_net are converted to params->net trivially. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05bridge: mdb: Allow replace of a host-joined groupPetr Machata
Attempts to replace an MDB group membership of the host itself are currently bounced: # ip link add name br up type bridge vlan_filtering 1 # bridge mdb replace dev br port br grp 239.0.0.1 vid 2 # bridge mdb replace dev br port br grp 239.0.0.1 vid 2 Error: bridge: Group is already joined by host. A similar operation done on a member port would succeed. Ignore the check for replacement of host group memberships as well. The bit of code that this enables is br_multicast_host_join(), which, for already-joined groups only refreshes the MC group expiration timer, which is desirable; and a userspace notification, also desirable. Change a selftest that exercises this code path from expecting a rejection to expecting a pass. The rest of MDB selftests pass without modification. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/e5c5188b9787ae806609e7ca3aa2a0a501b9b5c4.1738685648.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-19netfilter: br_netfilter: remove unused conditional and dead codeAntoine Tenart
The SKB_DROP_REASON_IP_INADDRERRORS drop reason is never returned from any function, as such it cannot be returned from the ip_route_input call tree. The 'reason != SKB_DROP_REASON_IP_INADDRERRORS' conditional is thus always true. Looking back at history, commit 50038bf38e65 ("net: ip: make ip_route_input() return drop reasons") changed the ip_route_input returned value check in br_nf_pre_routing_finish from -EHOSTUNREACH to SKB_DROP_REASON_IP_INADDRERRORS. It turns out -EHOSTUNREACH could not be returned either from the ip_route_input call tree and this since commit 251da4130115 ("ipv4: Cache ip_error() routes even when not forwarding."). Not a fix as this won't change the behavior. While at it use kfree_skb_reason. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-07bridge: Make br_is_nd_neigh_msg() accept pointer to "const struct sk_buff"Ted Chen
The skb_buff struct in br_is_nd_neigh_msg() is never modified. Mark it as const. Signed-off-by: Ted Chen <znscnchen@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250104083846.71612-1-znscnchen@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-23net: bridge: add skb drop reasons to the most common drop pointsRadu Rendec
The bridge input code may drop frames for various reasons and at various points in the ingress handling logic. Currently kfree_skb() is used everywhere, and therefore no drop reason is specified. Add drop reasons to the most common drop points. Drop reasons are not added exhaustively to the entire bridge code. The intention is to incrementally add drop reasons to the rest of the bridge code in follow up patches. Signed-off-by: Radu Rendec <rrendec@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241219163606.717758-3-rrendec@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20net: bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDINGPetr Machata
When bridge binding is enabled on a VLAN netdevice, its link state should track bridge ports that are members of the corresponding VLAN. This works for newly-added netdevices. However toggling the option does not have the effect of enabling or disabling the behavior as appropriate. In this patch, react to bridge_binding toggles on VLAN uppers. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/90a8ca8aea4d81378b29d75d9e562433e0d5c7ff.1734540770.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-20net: bridge: Extract a helper to handle bridge_binding togglesPetr Machata
Currently, the BROPT_VLAN_BRIDGE_BINDING bridge option is only toggled when VLAN devices are added on top of a bridge or removed from it. Extract the toggling of the option to a function so that it could be invoked by a subsequent patch when the state of an upper VLAN device changes. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/a7455f6fe1dfa7b13126ed8a7fb33d3b611eecb8.1734540770.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17net: bridge: constify 'struct bin_attribute'Thomas Weißschuh
The sysfs core now allows instances of 'struct bin_attribute' to be moved into read-only memory. Make use of that to protect them against accidental or malicious modifications. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241216-sysfs-const-bin_attr-net-v1-1-ec460b91f274@weissschuh.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-10rtnetlink: add ndo_fdb_dump_contextEric Dumazet
rtnl_fdb_dump() and various ndo_fdb_dump() helpers share a hidden layout of cb->ctx. Before switching rtnl_fdb_dump() to for_each_netdev_dump() in the following patch, make this more explicit. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241209100747.2269613-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15ndo_fdb_del: Add a parameter to report whether notification was sentPetr Machata
In a similar fashion to ndo_fdb_add, which was covered in the previous patch, add the bool *notified argument to ndo_fdb_del. Callees that send a notification on their own set the flag to true. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/06b1acf4953ef0a5ed153ef1f32d7292044f2be6.1731589511.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15ndo_fdb_add: Add a parameter to report whether notification was sentPetr Machata
Currently when FDB entries are added to or deleted from a VXLAN netdevice, the VXLAN driver emits one notification, including the VXLAN-specific attributes. The core however always sends a notification as well, a generic one. Thus two notifications are unnecessarily sent for these operations. A similar situation comes up with bridge driver, which also emits notifications on its own: # ip link add name vx type vxlan id 1000 dstport 4789 # bridge monitor fdb & [1] 1981693 # bridge fdb add de:ad:be:ef:13:37 dev vx self dst 192.0.2.1 de:ad:be:ef:13:37 dev vx dst 192.0.2.1 self permanent de:ad:be:ef:13:37 dev vx self permanent In order to prevent this duplicity, add a paremeter to ndo_fdb_add, bool *notified. The flag is primed to false, and if the callee sends a notification on its own, it sets it to true, thus informing the core that it should not generate another notification. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/cbf6ae8195e85cbf922f8058ce4eba770f3b71ed.1731589511.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12net: ip: make ip_route_input() return drop reasonsMenglong Dong
In this commit, we make ip_route_input() return skb drop reasons that come from ip_route_input_noref(). Meanwhile, adjust all the call to it. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-09bridge: Allow deleting FDB entries with non-existent VLANIdo Schimmel
It is currently impossible to delete individual FDB entries (as opposed to flushing) that were added with a VLAN that no longer exists: # ip link add name dummy1 up type dummy # ip link add name br1 up type bridge vlan_filtering 1 # ip link set dev dummy1 master br1 # bridge fdb add 00:11:22:33:44:55 dev dummy1 master static vlan 1 # bridge vlan del vid 1 dev dummy1 # bridge fdb get 00:11:22:33:44:55 br br1 vlan 1 00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static # bridge fdb del 00:11:22:33:44:55 dev dummy1 master vlan 1 RTNETLINK answers: Invalid argument # bridge fdb get 00:11:22:33:44:55 br br1 vlan 1 00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static This is in contrast to MDB entries that can be deleted after the VLAN was deleted: # bridge vlan add vid 10 dev dummy1 # bridge mdb add dev br1 port dummy1 grp 239.1.1.1 permanent vid 10 # bridge vlan del vid 10 dev dummy1 # bridge mdb get dev br1 grp 239.1.1.1 vid 10 dev br1 port dummy1 grp 239.1.1.1 permanent vid 10 # bridge mdb del dev br1 port dummy1 grp 239.1.1.1 permanent vid 10 # bridge mdb get dev br1 grp 239.1.1.1 vid 10 Error: bridge: MDB entry not found. Align the two interfaces and allow user space to delete FDB entries that were added with a VLAN that no longer exists: # ip link add name dummy1 up type dummy # ip link add name br1 up type bridge vlan_filtering 1 # ip link set dev dummy1 master br1 # bridge fdb add 00:11:22:33:44:55 dev dummy1 master static vlan 1 # bridge vlan del vid 1 dev dummy1 # bridge fdb get 00:11:22:33:44:55 br br1 vlan 1 00:11:22:33:44:55 dev dummy1 vlan 1 master br1 static # bridge fdb del 00:11:22:33:44:55 dev dummy1 master vlan 1 # bridge fdb get 00:11:22:33:44:55 br br1 vlan 1 Error: Fdb entry not found. Add a selftest to make sure this behavior does not regress: # ./rtnetlink.sh -t kci_test_fdb_del PASS: bridge fdb del Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Andy Roulin <aroulin@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241105133954.350479-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07ipv4: Prepare ip_route_output() to future .flowi4_tos conversion.Guillaume Nault
Convert the "tos" parameter of ip_route_output() to dscp_t. This way we'll have a dscp_t value directly available when .flowi4_tos will eventually be converted to dscp_t. All ip_route_output() callers but one set this "tos" parameter to 0 and therefore don't need to be adapted to the new prototype. Only br_nf_pre_routing_finish() needs conversion. It can just use ip4h_dscp() to get the DSCP field from the IPv4 header. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/0f10d031dd44c70aae9bc6e19391cb30d5c2fe71.1730928699.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-07Merge tag 'nf-next-24-11-07' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following series contains Netfilter updates for net-next: 1) Make legacy xtables configs user selectable, from Breno Leitao. 2) Fix a few sparse warnings related to percpu, from Uros Bizjak. 3) Use strscpy_pad, from Justin Stitt. 4) Use nft_trans_elem_alloc() in catchall flush, from Florian Westphal. 5) A series of 7 patches to fix false positive with CONFIG_RCU_LIST=y. Florian also sees possible issue with 10 while module load/removal when requesting an expression that is available via module. As for patch 11, object is being updated so reference on the module already exists so I don't see any real issue. Florian says: "Unfortunately there are many more errors, and not all are false positives. First patches pass lockdep_commit_lock_is_held() to the rcu list traversal macro so that those splats are avoided. The last two patches are real code change as opposed to 'pass the transaction mutex to relax rcu check': Those two lists are not protected by transaction mutex so could be altered in parallel. This targets nf-next because these are long-standing issues." netfilter pull request 24-11-07 * tag 'nf-next-24-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: must hold rcu read lock while iterating object type list netfilter: nf_tables: must hold rcu read lock while iterating expression type list netfilter: nf_tables: avoid false-positive lockdep splats with basechain hook netfilter: nf_tables: avoid false-positive lockdep splats in set walker netfilter: nf_tables: avoid false-positive lockdep splats with flowtables netfilter: nf_tables: avoid false-positive lockdep splats with sets netfilter: nf_tables: avoid false-positive lockdep splat on rule deletion netfilter: nf_tables: prefer nft_trans_elem_alloc helper netfilter: nf_tables: replace deprecated strncpy with strscpy_pad netfilter: nf_tables: Fix percpu address space issues in nf_tables_api.c netfilter: Make legacy configs user selectable ==================== Link: https://patch.msgid.link/20241106234625.168468-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23netpoll: remove ndo_netpoll_setup() second argumentEric Dumazet
npinfo is not used in any of the ndo_netpoll_setup() methods. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241018052108.2610827-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-22rtnetlink: Return int from rtnl_af_register().Kuniyuki Iwashima
The next patch will add init_srcu_struct() in rtnl_af_register(), then we need to handle its error. Let's add the error handling in advance to make the following patch cleaner. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: bridge: replace call_rcu by kfree_rcu for simple kmem_cache_free callbackJulia Lawall
Since SLOB was removed and since commit 6c6c47b063b5 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"), it is not necessary to use call_rcu when the callback only performs kmem_cache_free. Use kfree_rcu() directly. The changes were made using Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Link: https://patch.msgid.link/20241013201704.49576-9-Julia.Lawall@inria.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-15netfilter: nf_tables: replace deprecated strncpy with strscpy_padJustin Stitt
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. In this particular instance, the usage of strncpy() is fine and works as expected. However, towards the goal of [2], we should consider replacing it with an alternative as many instances of strncpy() are bug-prone. Its removal from the kernel promotes better long term health for the codebase. The current usage of strncpy() likely just wants the NUL-padding behavior offered by strncpy() and doesn't care about the NUL-termination. Since the compiler doesn't know the size of @dest, we can't use strtomem_pad(). Instead, use strscpy_pad() which behaves functionally the same as strncpy() in this context -- as we expect br_dev->name to be NUL-terminated itself. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 [2] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html Cc: Kees Cook <keescook@chromium.org> Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-15netfilter: Make legacy configs user selectableBreno Leitao
This option makes legacy Netfilter Kconfig user selectable, giving users the option to configure iptables without enabling any other config. Make the following KConfig entries user selectable: * BRIDGE_NF_EBTABLES_LEGACY * IP_NF_ARPTABLES * IP_NF_IPTABLES_LEGACY * IP6_NF_IPTABLES_LEGACY Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-10-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc3). No conflicts and no adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10bridge: Handle error of rtnl_register_module().Kuniyuki Iwashima
Since introduced, br_vlan_rtnl_init() has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: 8dcea187088b ("net: bridge: vlan: add rtm definitions and dump support") Fixes: f26b296585dc ("net: bridge: vlan: add new rtm message support") Fixes: adb3ce9bcb0f ("net: bridge: vlan: add del rtm message support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-04netfilter: br_netfilter: fix panic with metadata_dst skbAndy Roulin
Fix a kernel panic in the br_netfilter module when sending untagged traffic via a VxLAN device. This happens during the check for fragmentation in br_nf_dev_queue_xmit. It is dependent on: 1) the br_netfilter module being loaded; 2) net.bridge.bridge-nf-call-iptables set to 1; 3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port; 4) untagged frames with size higher than the VxLAN MTU forwarded/flooded When forwarding the untagged packet to the VxLAN bridge port, before the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL. Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check for frames that needs to be fragmented: frames with higher MTU than the VxLAN device end up calling br_nf_ip_fragment, which in turns call ip_skb_dst_mtu. The ip_dst_mtu tries to use the skb_dst(skb) as if it was a valid dst with valid dst->dev, thus the crash. This case was never supported in the first place, so drop the packet instead. PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data. [ 176.291791] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000110 [ 176.292101] Mem abort info: [ 176.292184] ESR = 0x0000000096000004 [ 176.292322] EC = 0x25: DABT (current EL), IL = 32 bits [ 176.292530] SET = 0, FnV = 0 [ 176.292709] EA = 0, S1PTW = 0 [ 176.292862] FSC = 0x04: level 0 translation fault [ 176.293013] Data abort info: [ 176.293104] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 176.293488] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 176.293787] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000 [ 176.294166] [0000000000000110] pgd=0000000000000000, p4d=0000000000000000 [ 176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth br_netfilter bridge stp llc ipv6 crct10dif_ce [ 176.295923] CPU: 0 PID: 188 Comm: ping Not tainted 6.8.0-rc3-g5b3fbd61b9d1 #2 [ 176.296314] Hardware name: linux,dummy-virt (DT) [ 176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter] [ 176.297636] sp : ffff800080003630 [ 176.297743] x29: ffff800080003630 x28: 0000000000000008 x27: ffff6828c49ad9f8 [ 176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24: 00000000000003e8 [ 176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21: ffff6828c3b16d28 [ 176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18: 0000000000000014 [ 176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15: 0000000095744632 [ 176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12: ffffb7e137926a70 [ 176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 : 0000000000000000 [ 176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 : f20e0100bebafeca [ 176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 : 0000000000000000 [ 176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 : ffff6828c7f918f0 [ 176.300889] Call trace: [ 176.301123] br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.301411] br_nf_post_routing+0x2a8/0x3e4 [br_netfilter] [ 176.301703] nf_hook_slow+0x48/0x124 [ 176.302060] br_forward_finish+0xc8/0xe8 [bridge] [ 176.302371] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.302605] br_nf_forward_finish+0x118/0x22c [br_netfilter] [ 176.302824] br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter] [ 176.303136] br_nf_forward+0x2b8/0x4e0 [br_netfilter] [ 176.303359] nf_hook_slow+0x48/0x124 [ 176.303803] __br_forward+0xc4/0x194 [bridge] [ 176.304013] br_flood+0xd4/0x168 [bridge] [ 176.304300] br_handle_frame_finish+0x1d4/0x5c4 [bridge] [ 176.304536] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.304978] br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter] [ 176.305188] br_nf_pre_routing+0x250/0x524 [br_netfilter] [ 176.305428] br_handle_frame+0x244/0x3cc [bridge] [ 176.305695] __netif_receive_skb_core.constprop.0+0x33c/0xecc [ 176.306080] __netif_receive_skb_one_core+0x40/0x8c [ 176.306197] __netif_receive_skb+0x18/0x64 [ 176.306369] process_backlog+0x80/0x124 [ 176.306540] __napi_poll+0x38/0x17c [ 176.306636] net_rx_action+0x124/0x26c [ 176.306758] __do_softirq+0x100/0x26c [ 176.307051] ____do_softirq+0x10/0x1c [ 176.307162] call_on_irq_stack+0x24/0x4c [ 176.307289] do_softirq_own_stack+0x1c/0x2c [ 176.307396] do_softirq+0x54/0x6c [ 176.307485] __local_bh_enable_ip+0x8c/0x98 [ 176.307637] __dev_queue_xmit+0x22c/0xd28 [ 176.307775] neigh_resolve_output+0xf4/0x1a0 [ 176.308018] ip_finish_output2+0x1c8/0x628 [ 176.308137] ip_do_fragment+0x5b4/0x658 [ 176.308279] ip_fragment.constprop.0+0x48/0xec [ 176.308420] __ip_finish_output+0xa4/0x254 [ 176.308593] ip_finish_output+0x34/0x130 [ 176.308814] ip_output+0x6c/0x108 [ 176.308929] ip_send_skb+0x50/0xf0 [ 176.309095] ip_push_pending_frames+0x30/0x54 [ 176.309254] raw_sendmsg+0x758/0xaec [ 176.309568] inet_sendmsg+0x44/0x70 [ 176.309667] __sys_sendto+0x110/0x178 [ 176.309758] __arm64_sys_sendto+0x28/0x38 [ 176.309918] invoke_syscall+0x48/0x110 [ 176.310211] el0_svc_common.constprop.0+0x40/0xe0 [ 176.310353] do_el0_svc+0x1c/0x28 [ 176.310434] el0_svc+0x34/0xb4 [ 176.310551] el0t_64_sync_handler+0x120/0x12c [ 176.310690] el0t_64_sync+0x190/0x194 [ 176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860) [ 176.315743] ---[ end trace 0000000000000000 ]--- [ 176.316060] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000 [ 176.316564] PHYS_OFFSET: 0xffff97d780000000 [ 176.316782] CPU features: 0x0,88000203,3c020000,0100421b [ 176.317210] Memory Limit: none [ 176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal Exception in interrupt ]---\ Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Andy Roulin <aroulin@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241001154400.22787-2-aroulin@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03ipv4: Convert ip_route_input() to dscp_t.Guillaume Nault
Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. Callers of ip_route_input() to consider are: * input_action_end_dx4_finish() and input_action_end_dt4() in net/ipv6/seg6_local.c. These functions set the tos parameter to 0, which is already a valid dscp_t value, so they don't need to be adjusted for the new prototype. * icmp_route_lookup(), which already has a dscp_t variable to pass as parameter. We just need to remove the inet_dscp_to_dsfield() conversion. * br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(), which get the DSCP directly from IPv4 headers. Define a helper to read the .tos field of struct iphdr as dscp_t, so that these function don't have to do the conversion manually. While there, declare *iph as const in br_nf_pre_routing_finish(), declare its local variables in reverse-christmas-tree order and move the "err = ip_route_input()" assignment out of the conditional to avoid checkpatch warning. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03Merge tag 'net-6.12-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from ieee802154, bluetooth and netfilter. Current release - regressions: - eth: mlx5: fix wrong reserved field in hca_cap_2 in mlx5_ifc - eth: am65-cpsw: fix forever loop in cleanup code Current release - new code bugs: - eth: mlx5: HWS, fixed double-free in error flow of creating SQ Previous releases - regressions: - core: avoid potential underflow in qdisc_pkt_len_init() with UFO - core: test for not too small csum_start in virtio_net_hdr_to_skb() - vrf: revert "vrf: remove unnecessary RCU-bh critical section" - bluetooth: - fix uaf in l2cap_connect - fix possible crash on mgmt_index_removed - dsa: improve shutdown sequence - eth: mlx5e: SHAMPO, fix overflow of hd_per_wq - eth: ip_gre: fix drops of small packets in ipgre_xmit Previous releases - always broken: - core: fix gso_features_check to check for both dev->gso_{ipv4_,}max_size - core: fix tcp fraglist segmentation after pull from frag_list - netfilter: nf_tables: prevent nf_skb_duplicated corruption - sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start - mac802154: fix potential RCU dereference issue in mac802154_scan_worker - eth: fec: restart PPS after link state change" * tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits) sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems doc: net: napi: Update documentation for napi_schedule_irqoff net/ncsi: Disable the ncsi work before freeing the associated structure net: phy: qt2025: Fix warning: unused import DeviceId gso: fix udp gso fraglist segmentation after pull from frag_list bridge: mcast: Fail MDB get request on empty entry vrf: revert "vrf: Remove unnecessary RCU-bh critical section" net: ethernet: ti: am65-cpsw: Fix forever loop in cleanup code net: phy: realtek: Check the index value in led_hw_control_get ppp: do not assume bh is held in ppp_channel_bridge_input() selftests: rds: move include.sh to TEST_FILES net: test for not too small csum_start in virtio_net_hdr_to_skb() net: gso: fix tcp fraglist segmentation after pull from frag_list ipv4: ip_gre: Fix drops of small packets in ipgre_xmit net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check net: add more sanity checks to qdisc_pkt_len_init() net: avoid potential underflow in qdisc_pkt_len_init() with UFO net: ethernet: ti: cpsw_ale: Fix warning on some platforms net: microchip: Make FDMA config symbol invisible ...
2024-10-02bridge: mcast: Fail MDB get request on empty entryIdo Schimmel
When user space deletes a port from an MDB entry, the port is removed synchronously. If this was the last port in the entry and the entry is not joined by the host itself, then the entry is scheduled for deletion via a timer. The above means that it is possible for the MDB get netlink request to retrieve an empty entry which is scheduled for deletion. This is problematic as after deleting the last port in an entry, user space cannot rely on a non-zero return code from the MDB get request as an indication that the port was successfully removed. Fix by returning an error when the entry's port list is empty and the entry is not joined by the host. Fixes: 68b380a395a7 ("bridge: mcast: Add MDB get support") Reported-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Closes: https://lore.kernel.org/netdev/c92569919307749f879b9482b0f3e125b7d9d2e3.1726480066.git.jamie.bainbridge@gmail.com/ Tested-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20240929123640.558525-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-02move asm/unaligned.h to linux/unaligned.hAl Viro
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-09-09netfilter: br_netfilter: Unmask upper DSCP bits in br_nf_pre_routing_finish()Ido Schimmel
Unmask upper DSCP bits when calling ip_route_output() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>