summaryrefslogtreecommitdiff
path: root/net/bridge
AgeCommit message (Collapse)Author
2021-08-14net: bridge: mcast: consolidate querier selection for ipv4 and ipv6Nikolay Aleksandrov
We can consolidate both functions as they share almost the same logic. This is easier to maintain and we have a single querier update function. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14net: bridge: mcast: make sure querier port/address updates are consistentNikolay Aleksandrov
Use a sequence counter to make sure port/address updates can be read consistently without requiring the bridge multicast_lock. We need to zero out the port and address when the other querier has expired and we're about to select ourselves as querier. br_multicast_read_querier will be used later when dumping querier state. Updates are done only with the multicast spinlock and softirqs disabled, while reads are done from process context and from softirqs (due to notifications). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14net: bridge: mcast: record querier port device ifindex instead of pointerNikolay Aleksandrov
Currently when a querier port is detected its net_bridge_port pointer is recorded, but it's used only for comparisons so it's fine to have stale pointer, in order to dereference and use the port pointer a proper accounting of its usage must be implemented adding unnecessary complexity. To solve the problem we can just store the netdevice ifindex instead of the port pointer and retrieve the bridge port. It is a best effort and the device needs to be validated that is still part of that bridge before use, but that is small price to pay for avoiding querier reference counting for each port/vlan. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h 9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp") 9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins") 099fdeda659d ("bnxt_en: Event handler for PPS events") kernel/bpf/helpers.c include/linux/bpf-cgroup.h a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()") c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current") drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray") 2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer") MAINTAINERS 7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo") 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11net: bridge: vlan: fix global vlan option range dumpingNikolay Aleksandrov
When global vlan options are equal sequentially we compress them in a range to save space and reduce processing time. In order to have the proper range end id we need to update range_end if the options are equal otherwise we get ranges with the same end vlan id as the start. Fixes: 743a53d9636a ("net: bridge: vlan: add support for dumping global vlan options") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210810092139.11700-1-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11net: bridge: vlan: use br_rports_fill_info() to export mcast router portsNikolay Aleksandrov
Embed the standard multicast router port export by br_rports_fill_info() into a new global vlan attribute BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS. In order to have the same format for the global bridge mcast context and the per-vlan mcast context we need a double-nesting: - BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS - MDBA_ROUTER Currently we don't compare router lists, if any router port exists in the bridge mcast contexts we consider their option sets as different and export them separately. In addition we export the router port vlan id when dumping similar to the router port notification format. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: mcast: use the proper multicast context when dumping router portsNikolay Aleksandrov
When we are dumping the router ports of a vlan mcast context we need to use the bridge/vlan and port/vlan's multicast contexts to check if IPv4/IPv6 router port is present and later to dump the vlan id. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast router global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast router state which is used for the bridge itself. We just need to pass multicast context to br_multicast_set_router instead of bridge device and the rest of the logic remains the same. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast querier global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast querier state. We just need to pass multicast context to br_multicast_set_querier instead of bridge device and the rest of the logic remains the same. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: mcast: querier and query state affect only current context typeNikolay Aleksandrov
It is a minor optimization and better behaviour to make sure querier and query sending routines affect only the matching multicast context depending if vlan snooping is enabled (vlan ctx vs bridge ctx). It also avoids sending unnecessary extra query packets. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: mcast: move querier state to the multicast contextNikolay Aleksandrov
We need to have the querier state per multicast context in order to have per-vlan control, so remove the internal option bit and move it to the multicast context. Also annotate the lockless reads of the new variable. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast startup query interval global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast startup query interval option. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast query response interval global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast query response interval option. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast query interval global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast query interval option. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast querier interval global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast querier interval option. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast membership interval global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast membership interval option. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast last member interval global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast last member interval option. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast startup query count global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast startup query count option. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast last member count global optionNikolay Aleksandrov
Add support to change and retrieve global vlan multicast last member count option. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: bridge: vlan: add support for mcast igmp/mld version global optionsNikolay Aleksandrov
Add support to change and retrieve global vlan IGMP/MLD versions. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use nfnetlink_unicast() instead of netlink_unicast() in nft_compat. 2) Remove call to nf_ct_l4proto_find() in flowtable offload timeout fixup. 3) CLUSTERIP registers ARP hook on demand, from Florian. 4) Use clusterip_net to store pernet warning, also from Florian. 5) Remove struct netns_xt, from Florian Westphal. 6) Enable ebtables hooks in initns on demand, from Florian. 7) Allow to filter conntrack netlink dump per status bits, from Florian Westphal. 8) Register x_tables hooks in initns on demand, from Florian. 9) Remove queue_handler from per-netns structure, again from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-10net: bridge: fix memleak in br_add_if()Yang Yingliang
I got a memleak report: BUG: memory leak unreferenced object 0x607ee521a658 (size 240): comm "syz-executor.0", pid 955, jiffies 4294780569 (age 16.449s) hex dump (first 32 bytes, cpu 1): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000d830ea5a>] br_multicast_add_port+0x1c2/0x300 net/bridge/br_multicast.c:1693 [<00000000274d9a71>] new_nbp net/bridge/br_if.c:435 [inline] [<00000000274d9a71>] br_add_if+0x670/0x1740 net/bridge/br_if.c:611 [<0000000012ce888e>] do_set_master net/core/rtnetlink.c:2513 [inline] [<0000000012ce888e>] do_set_master+0x1aa/0x210 net/core/rtnetlink.c:2487 [<0000000099d1cafc>] __rtnl_newlink+0x1095/0x13e0 net/core/rtnetlink.c:3457 [<00000000a01facc0>] rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3488 [<00000000acc9186c>] rtnetlink_rcv_msg+0x369/0xa10 net/core/rtnetlink.c:5550 [<00000000d4aabb9c>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504 [<00000000bc2e12a3>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] [<00000000bc2e12a3>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340 [<00000000e4dc2d0e>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929 [<000000000d22c8b3>] sock_sendmsg_nosec net/socket.c:654 [inline] [<000000000d22c8b3>] sock_sendmsg+0x139/0x170 net/socket.c:674 [<00000000e281417a>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350 [<00000000237aa2ab>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404 [<000000004f2dc381>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433 [<0000000005feca6c>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47 [<000000007304477d>] entry_SYSCALL_64_after_hwframe+0x44/0xae On error path of br_add_if(), p->mcast_stats allocated in new_nbp() need be freed, or it will be leaked. Fixes: 1080ab95e3c7 ("net: bridge: add support for IGMP/MLD stats and export them via netlink") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210809132023.978546-1-yangyingliang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-10net: bridge: fix flags interpretation for extern learn fdb entriesNikolay Aleksandrov
Ignore fdb flags when adding port extern learn entries and always set BR_FDB_LOCAL flag when adding bridge extern learn entries. This is closest to the behaviour we had before and avoids breaking any use cases which were allowed. This patch fixes iproute2 calls which assume NUD_PERMANENT and were allowed before, example: $ bridge fdb add 00:11:22:33:44:55 dev swp1 extern_learn Extern learn entries are allowed to roam, but do not expire, so static or dynamic flags make no sense for them. Also add a comment for future reference. Fixes: eb100e0e24a2 ("net: bridge: allow to add externally learned entries from user-space") Fixes: 0541a6293298 ("net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20210810110010.43859-1-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Restrict range element expansion in ipset to avoid soft lockup, from Jozsef Kadlecsik. 2) Memleak in error path for nf_conntrack_bridge for IPv4 packets, from Yajun Deng. 3) Simplify conntrack garbage collection strategy to avoid frequent wake-ups, from Florian Westphal. 4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name. 5) Missing chain family netlink attribute in chain description in nfnetlink_hook. 6) Incorrect sequence number on nfnetlink_hook dumps. 7) Use netlink request family in reply message for consistency. 8) Remove offload_pickup sysctl, use conntrack for established state instead, from Florian Westphal. 9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since NFPROTO_INET is not exposed through nfnetlink_hook. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: nfnetlink_hook: translate inet ingress to netdev netfilter: conntrack: remove offload_pickup sysctl again netfilter: nfnetlink_hook: Use same family as request message netfilter: nfnetlink_hook: use the sequence number of the request message netfilter: nfnetlink_hook: missing chain family netfilter: nfnetlink_hook: strip off module name from hookfn netfilter: conntrack: collect all entries in one cycle netfilter: nf_conntrack_bridge: Fix memory leak when error netfilter: ipset: Limit the maximal range of consecutive elements to add/delete ==================== Link: https://lore.kernel.org/r/20210806151149.6356-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Build failure in drivers/net/wwan/mhi_wwan_mbim.c: add missing parameter (0, assuming we don't want buffer pre-alloc). Conflict in drivers/net/dsa/sja1105/sja1105_main.c between: 589918df9322 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too") 0fac6aa098ed ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode") Follow the instructions from the commit message of the former commit - removed the if conditions. When looking at commit 589918df9322 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too") note that the mask_iotag fields get removed by the following patch. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-05net: bridge: fix ioctl old_deviceless bridge argumentNikolay Aleksandrov
Commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") changed the source of the argument copy in bridge's old_deviceless() from args[1] (user ptr to device name) to uarg (ptr to ioctl arguments) causing wrong device name to be used. Example (broken, bridge exists but is up): $ brctl delbr bridge bridge bridge doesn't exist; can't delete it Example (working): $ brctl delbr bridge bridge bridge is still up; can't delete it Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: bridge: fix ioctl lockingNikolay Aleksandrov
Before commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") the bridge ioctl calls were divided in two parts: one was deviceless called by sock_ioctl and didn't expect rtnl to be held, the other was with a device called by dev_ifsioc() and expected rtnl to be held. After the commit above they were united in a single ioctl stub, but it didn't take care of the locking expectations. For sock_ioctl now we acquire (1) br_ioctl_mutex, (2) rtnl and for dev_ifsioc we acquire (1) rtnl, (2) br_ioctl_mutex The fix is to get a refcnt on the netdev for dev_ifsioc calls and drop rtnl then to reacquire it in the bridge ioctl stub after br_ioctl_mutex has been acquired. That will avoid playing locking games and make the rules straight-forward: we always take br_ioctl_mutex first, and then rtnl. Reported-by: syzbot+34fe5894623c4ab1b379@syzkaller.appspotmail.com Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridgeVladimir Oltean
With the introduction of explicit offloading API in switchdev in commit 2f5dc00f7a3e ("net: bridge: switchdev: let drivers inform which bridge ports are offloaded"), we started having Ethernet switch drivers calling directly into a function exported by net/bridge/br_switchdev.c, which is a function exported by the bridge driver. This means that drivers that did not have an explicit dependency on the bridge before, like cpsw and am65-cpsw, now do - otherwise it is not possible to call a symbol exported by a driver that can be built as module unless you are a module too. There was an attempt to solve the dependency issue in the form of commit b0e81817629a ("net: build all switchdev drivers as modules when the bridge is a module"). Grygorii Strashko, however, says about it: | In my opinion, the problem is a bit bigger here than just fixing the | build :( | | In case, of ^cpsw the switchdev mode is kinda optional and in many | cases (especially for testing purposes, NFS) the multi-mac mode is | still preferable mode. | | There were no such tight dependency between switchdev drivers and | bridge core before and switchdev serviced as independent, notification | based layer between them, so ^cpsw still can be "Y" and bridge can be | "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE | will need to be set as "Y", or we will have to update drivers to | support build with BRIDGE=n and maintain separate builds for | networking vs non-networking testing. But is this enough? Wouldn't | it cause 'chain reaction' required to add more and more "Y" options | (like CONFIG_VLAN_8021Q)? | | PS. Just to be sure we on the same page - ARM builds will be forced | (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our | automation testing will just fail with omap2plus_defconfig. In the light of this, it would be desirable for some configurations to avoid dependencies between switchdev drivers and the bridge, and have the switchdev mode as completely optional within the driver. Arnd Bergmann also tried to write a patch which better expressed the build time dependency for Ethernet switch drivers where the switchdev support is optional, like cpsw/am65-cpsw, and this made the drivers follow the bridge (compile as module if the bridge is a module) only if the optional switchdev support in the driver was enabled in the first place: https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/ but this still did not solve the fact that cpsw and am65-cpsw now must be built as modules when the bridge is a module - it just expressed correctly that optional dependency. But the new behavior is an apparent regression from Grygorii's perspective. So to support the use case where the Ethernet driver is built-in, NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we need a framework that can handle the possible absence of the bridge from the running system, i.e. runtime bloatware as opposed to build-time bloatware. Luckily we already have this framework, since switchdev has been using it extensively. Events from the bridge side are transmitted to the driver side using notifier chains - this was originally done so that unrelated drivers could snoop for events emitted by the bridge towards ports that are implemented by other drivers (think of a switch driver with LAG offload that listens for switchdev events on a bonding/team interface that it offloads). There are also events which are transmitted from the driver side to the bridge side, which again are modeled using notifiers. SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with notifying the bridge that a MAC address has been dynamically learned. So there is a precedent we can use for modeling the new framework. The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work that the bridge needs to do when a port becomes offloaded is blocking in its nature: replay VLANs, MDBs etc. The calling context is indeed blocking (we are under rtnl_mutex), but the existing switchdev notification chain that the bridge is subscribed to is only the atomic one. So we need to subscribe the bridge to the blocking switchdev notification chain too. This patch: - keeps the driver-side perception of the switchdev_bridge_port_{,un}offload unchanged - moves the implementation of switchdev_bridge_port_{,un}offload from the bridge module into the switchdev module. - makes everybody that is subscribed to the switchdev blocking notifier chain "hear" offload & unoffload events - makes the bridge driver subscribe and handle those events - moves the bridge driver's handling of those events into 2 new functions called br_switchdev_port_{,un}offload. These functions contain in fact the core of the logic that was previously in switchdev_bridge_port_{,un}offload, just that now we go through an extra indirection layer to reach them. Unlike all the other switchdev notification structures, the structure used to carry the bridge port information, struct switchdev_notifier_brport_info, does not contain a "bool handled". This is because in the current usage pattern, we always know that a switchdev bridge port offloading event will be handled by the bridge, because the switchdev_bridge_port_offload() call was initiated by a NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event couldn't have happened. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04netfilter: nf_conntrack_bridge: Fix memory leak when errorYajun Deng
It should be added kfree_skb_list() when err is not equal to zero in nf_br_ip_fragment(). v2: keep this aligned with IPv6. v3: modify iter.frag_list to iter.frag. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-03net: bridge: switchdev: fix incorrect use of FDB flags when picking the dst ↵Vladimir Oltean
device Nikolay points out that it is incorrect to assume that it is impossible to have an fdb entry with fdb->dst == NULL and the BR_FDB_LOCAL bit in fdb->flags not set. This is because there are reader-side places that test_bit(BR_FDB_LOCAL, &fdb->flags) without the br->hash_lock, and if the updating of the FDB entry happens on another CPU, there are no memory barriers at writer or reader side which would ensure that the reader sees the updates to both fdb->flags and fdb->dst in the same order, i.e. the reader will not see an inconsistent FDB entry. So we must be prepared to deal with FDB entries where fdb->dst and fdb->flags are in a potentially inconsistent state, and that means that fdb->dst == NULL should remain a condition to pick the net_device that we report to switchdev as being the bridge device, which is what the code did prior to the blamed patch. Fixes: 52e4bec15546 ("net: bridge: switchdev: treat local FDBs the same as entries towards the bridge") Suggested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210802113633.189831-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-02net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB ↵Vladimir Oltean
entry Currently it is possible to add broken extern_learn FDB entries to the bridge in two ways: 1. Entries pointing towards the bridge device that are not local/permanent: ip link add br0 type bridge bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static 2. Entries pointing towards the bridge device or towards a port that are marked as local/permanent, however the bridge does not process the 'permanent' bit in any way, therefore they are recorded as though they aren't permanent: ip link add br0 type bridge bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent Since commit 52e4bec15546 ("net: bridge: switchdev: treat local FDBs the same as entries towards the bridge"), these incorrect FDB entries can even trigger NULL pointer dereferences inside the kernel. This is because that commit made the assumption that all FDB entries that are not local/permanent have a valid destination port. For context, local / permanent FDB entries either have fdb->dst == NULL, and these point towards the bridge device and are therefore local and not to be used for forwarding, or have fdb->dst == a net_bridge_port structure (but are to be treated in the same way, i.e. not for forwarding). That assumption _is_ correct as long as things are working correctly in the bridge driver, i.e. we cannot logically have fdb->dst == NULL under any circumstance for FDB entries that are not local. However, the extern_learn code path where FDB entries are managed by a user space controller show that it is possible for the bridge kernel driver to misinterpret the NUD flags of an entry transmitted by user space, and end up having fdb->dst == NULL while not being a local entry. This is invalid and should be rejected. Before, the two commands listed above both crashed the kernel in this check from br_switchdev_fdb_notify: struct net_device *dev = info.is_local ? br->dev : dst->dev; info.is_local == false, dst == NULL. After this patch, the invalid entry added by the first command is rejected: ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static; ip link del br0 Error: bridge: FDB entry towards bridge must be permanent. and the valid entry added by the second command is properly treated as a local address and does not crash br_switchdev_fdb_notify anymore: ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent; ip link del br0 Fixes: eb100e0e24a2 ("net: bridge: allow to add externally learned entries from user-space") Reported-by: syzbot+9ba1174359adba5a5b7c@syzkaller.appspotmail.com Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210801231730.7493-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-02netfilter: ebtables: do not hook tables by defaultFlorian Westphal
If any of these modules is loaded, hooks get registered in all netns: Before: 'unshare -n nft list hooks' shows: family bridge hook prerouting { -2147483648 ebt_broute -0000000300 ebt_nat_hook } family bridge hook input { -0000000200 ebt_filter_hook } family bridge hook forward { -0000000200 ebt_filter_hook } family bridge hook output { +0000000100 ebt_nat_hook +0000000200 ebt_filter_hook } family bridge hook postrouting { +0000000300 ebt_nat_hook } This adds 'template 'tables' for ebtables. Each ebtable_foo registers the table as a template, with an init function that gets called once the first get/setsockopt call is made. ebtables core then searches the (per netns) list of tables. If no table is found, it searches the list of templates instead. If a template entry exists, the init function is called which will enable the table and register the hooks (so packets are diverted to the table). If no entry is found in the template list, request_module is called. After this, hook registration is delayed until the 'ebtables' (set/getsockopt) request is made for a given table and will only happen in the specific namespace. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-28net: bridge: switchdev: treat local FDBs the same as entries towards the bridgeVladimir Oltean
Currently the following script: 1. ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up 2. ip link set swp2 up && ip link set swp2 master br0 3. ip link set swp3 up && ip link set swp3 master br0 4. ip link set swp4 up && ip link set swp4 master br0 5. bridge vlan del dev swp2 vid 1 6. bridge vlan del dev swp3 vid 1 7. ip link set swp4 nomaster 8. ip link set swp3 nomaster produces the following output: [ 641.010738] sja1105 spi0.1: port 2 failed to delete 00:1f:7b:63:02:48 vid 1 from fdb: -2 [ swp2, swp3 and br0 all have the same MAC address, the one listed above ] In short, this happens because the number of FDB entry additions notified to switchdev is unbalanced with the number of deletions. At step 1, the bridge has a random MAC address. At step 2, the br_fdb_replay of swp2 receives this initial MAC address. Then the bridge inherits the MAC address of swp2 via br_fdb_change_mac_address(), and it notifies switchdev (only swp2 at this point) of the deletion of the random MAC address and the addition of 00:1f:7b:63:02:48 as a local FDB entry with fdb->dst == swp2, in VLANs 0 and the default_pvid (1). During step 7: del_nbp -> br_fdb_delete_by_port(br, p, vid=0, do_all=1); -> fdb_delete_local(br, p, f); br_fdb_delete_by_port() deletes all entries towards the ports, regardless of vid, because do_all is 1. fdb_delete_local() has logic to migrate local FDB entries deleted from one port to another port which shares the same MAC address and is in the same VLAN, or to the bridge device itself. This migration happens without notifying switchdev of the deletion on the old port and the addition on the new one, just fdb->dst is changed and the added_by_user flag is cleared. In the example above, the del_nbp(swp4) causes the "addr 00:1f:7b:63:02:48 vid 1" local FDB entry with fdb->dst == swp4 that existed up until then to be migrated directly towards the bridge (fdb->dst == NULL). This is because it cannot be migrated to any of the other ports (swp2 and swp3 are not in VLAN 1). After the migration to br0 takes place, swp4 requests a deletion replay of all FDB entries. Since the "addr 00:1f:7b:63:02:48 vid 1" entry now point towards the bridge, a deletion of it is replayed. There was just a prior addition of this address, so the switchdev driver deletes this entry. Then, the del_nbp(swp3) at step 8 triggers another br_fdb_replay, and switchdev is notified again to delete "addr 00:1f:7b:63:02:48 vid 1". But it can't because it no longer has it, so it returns -ENOENT. There are other possibilities to trigger this issue, but this is by far the simplest to explain. To fix this, we must avoid the situation where the addition of an FDB entry is notified to switchdev as a local entry on a port, and the deletion is notified on the bridge itself. Considering that the 2 types of FDB entries are completely equivalent and we cannot have the same MAC address as a local entry on 2 bridge ports, or on a bridge port and pointing towards the bridge at the same time, it makes sense to hide away from switchdev completely the fact that a local FDB entry is associated with a given bridge port at all. Just say that it points towards the bridge, it should make no difference whatsoever to the switchdev driver and should even lead to a simpler overall implementation, will less cases to handle. This also avoids any modification at all to the core bridge driver, just what is reported to switchdev changes. With the local/permanent entries on bridge ports being already reported to user space, it is hard to believe that the bridge behavior can change in any backwards-incompatible way such as making all local FDB entries point towards the bridge. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-28net: bridge: switchdev: replay the entire FDB for each portVladimir Oltean
Currently when a switchdev port joins a bridge, we replay all FDB entries pointing towards that port or towards the bridge. However, this is insufficient in certain situations: (a) DSA, through its assisted_learning_on_cpu_port logic, snoops dynamically learned FDB entries on foreign interfaces. These are FDB entries that are pointing neither towards the newly joined switchdev port, nor towards the bridge. So these addresses would be missed when joining a bridge where a foreign interface has already learned some addresses, and they would also linger on if the DSA port leaves the bridge before the foreign interface forgets them. None of this happens if we replay the entire FDB when the port joins. (b) There is a desire to treat local FDB entries on a port (i.e. the port's termination MAC address) identically to FDB entries pointing towards the bridge itself. More details on the reason behind this in the next patch. The point is that this cannot be done given the current structure of br_fdb_replay() in this situation: ip link set swp0 master br0 # br0 inherits its MAC address from swp0 ip link set swp1 master br0 What is desirable is that when swp1 joins the bridge, br_fdb_replay() also notifies swp1 of br0's MAC address, but this won't in fact happen because the MAC address of br0 does not have fdb->dst == NULL (it doesn't point towards the bridge), but it has fdb->dst == swp0. So our current logic makes it impossible for that address to be replayed. But if we dump the entire FDB instead of just the entries with fdb->dst == swp1 and fdb->dst == NULL, then the inherited MAC address of br0 will be replayed too, which is what we need. A natural question arises: say there is an FDB entry to be replayed, like a MAC address dynamically learned on a foreign interface that belongs to a bridge where no switchdev port has joined yet. If 10 switchdev ports belonging to the same driver join this bridge, one by one, won't every port get notified 10 times of the foreign FDB entry, amounting to a total of 100 notifications for this FDB entry in the switchdev driver? Well, yes, but this is where the "void *ctx" argument for br_fdb_replay is useful: every port of the switchdev driver is notified whenever any other port requests an FDB replay, but because the replay was initiated by a different port, its context is different from the initiating port's context, so it ignores those replays. So the foreign FDB entry will be installed only 10 times, once per port. This is done so that the following 4 code paths are always well balanced: (a) addition of foreign FDB entry is replayed when port joins bridge (b) deletion of foreign FDB entry is replayed when port leaves bridge (c) addition of foreign FDB entry is notified to all ports currently in bridge (c) deletion of foreign FDB entry is notified to all ports currently in bridge Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-27net: bridge: move bridge ioctls out of .ndo_do_ioctlArnd Bergmann
Working towards obsoleting the .ndo_do_ioctl operation entirely, stop passing the SIOCBRADDIF/SIOCBRDELIF device ioctl commands into this callback. My first attempt was to add another ndo_siocbr() callback, but as there is only a single driver that takes these commands and there is already a hook mechanism to call directly into this driver, extend this hook instead, and use it for both the deviceless and the device specific ioctl commands. Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: bridge@lists.linux-foundation.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-27bridge: use ndo_siocdevprivateArnd Bergmann
The bridge driver has an old set of ioctls using the SIOCDEVPRIVATE namespace that have never worked in compat mode and are explicitly forbidden already. Move them over to ndo_siocdevprivate and fix compat mode for these, because we can. Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: bridge@lists.linux-foundation.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-26net: bridge: add a helper for retrieving port VLANs from the data pathVladimir Oltean
Introduce a brother of br_vlan_get_info() which is protected by the RCU mechanism, as opposed to br_vlan_get_info() which relies on taking the write-side rtnl_mutex. This is needed for drivers which need to find out whether a bridge port has a VLAN configured or not. For example, certain DSA switches might not offer complete source port identification to the CPU on RX, just the VLAN in which the packet was received. Based on this VLAN, we cannot set an accurate skb->dev ingress port, but at least we can configure one that behaves the same as the correct one would (this is possible because DSA sets skb->offload_fwd_mark = 1). When we look at the bridge RX handler (br_handle_frame), we see that what matters regarding skb->dev is the VLAN ID and the port STP state. So we need to select an skb->dev that has the same bridge VLAN as the packet we're receiving, and is in the LEARNING or FORWARDING STP state. The latter is easy, but for the former, we should somehow keep a shadow list of the bridge VLANs on each port, and a lookup table between VLAN ID and the 'designated port for imprecise RX'. That is rather complicated to keep in sync properly (the designated port per VLAN needs to be updated on the addition and removal of a VLAN, as well as on the join/leave events of the bridge on that port). So, to avoid all that complexity, let's just iterate through our finite number of ports and ask the bridge, for each packet: "do you have this VLAN configured on this port?". Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Cc: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-26net: bridge: update BROPT_VLAN_ENABLED before notifying switchdev in ↵Vladimir Oltean
br_vlan_filter_toggle SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING is notified by the bridge from two places: - nbp_vlan_init(), during bridge port creation - br_vlan_filter_toggle(), during a netlink/sysfs/ioctl change requested by user space If a switchdev driver uses br_vlan_enabled(br_dev) inside its handler for the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING attribute notifier, different things will be seen depending on whether the bridge calls from the first path or the second: - in nbp_vlan_init(), br_vlan_enabled() reflects the current state of the bridge - in br_vlan_filter_toggle(), br_vlan_enabled() reflects the past state of the bridge This can lead in some cases to complications in driver implementation, which can be avoided if these could reliably use br_vlan_enabled(). Nothing seems to depend on this behavior, and it seems overall more straightforward for br_vlan_enabled() to return the proper value even during the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING notifier, so temporarily enable the bridge option, then revert it if the switchdev notifier failed. Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Cc: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-24net: bridge: fix build when setting skb->offload_fwd_mark with ↵Vladimir Oltean
CONFIG_NET_SWITCHDEV=n Switchdev support can be disabled at compile time, and in that case, struct sk_buff will not contain the offload_fwd_mark field. To make the code in br_forward.c work in both cases, we do what is done in other places and we create a helper function, with an empty shim definition, that is implemented by the br_switchdev.o translation module. This is always compiled if and only if CONFIG_NET_SWITCHDEV is y or m. Reported-by: kernel test robot <lkp@intel.com> Fixes: 472111920f1c ("net: bridge: switchdev: allow the TX data plane forwarding to be offloaded") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23net: bridge: switchdev: allow the TX data plane forwarding to be offloadedTobias Waldekranz
Allow switchdevs to forward frames from the CPU in accordance with the bridge configuration in the same way as is done between bridge ports. This means that the bridge will only send a single skb towards one of the ports under the switchdev's control, and expects the driver to deliver the packet to all eligible ports in its domain. Primarily this improves the performance of multicast flows with multiple subscribers, as it allows the hardware to perform the frame replication. The basic flow between the driver and the bridge is as follows: - When joining a bridge port, the switchdev driver calls switchdev_bridge_port_offload() with tx_fwd_offload = true. - The bridge sends offloadable skbs to one of the ports under the switchdev's control using skb->offload_fwd_mark = true. - The switchdev driver checks the skb->offload_fwd_mark field and lets its FDB lookup select the destination port mask for this packet. v1->v2: - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long - introduce a static key "br_switchdev_fwd_offload_used" to minimize the impact of the newly introduced feature on all the setups which don't have hardware that can make use of it - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache line access - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in __br_forward() - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge is being used - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP v2->v3: - replace the solution based on .ndo_dfwd_add_station with a solution based on switchdev_bridge_port_offload - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD v3->v4: rebase v4->v5: - make sure the static key is decremented on bridge port unoffload - more function and variable renaming and comments for them: br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload fwd_accel to tx_fwd_offload Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Conflicts are simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22net: bridge: move the switchdev object replay helpers to "push" modeVladimir Oltean
Starting with commit 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries"), DSA has introduced some bridge helpers that replay switchdev events (FDB/MDB/VLAN additions and deletions) that can be lost by the switchdev drivers in a variety of circumstances: - an IP multicast group was host-joined on the bridge itself before any switchdev port joined the bridge, leading to the host MDB entries missing in the hardware database. - during the bridge creation process, the MAC address of the bridge was added to the FDB as an entry pointing towards the bridge device itself, but with no switchdev ports being part of the bridge yet, this local FDB entry would remain unknown to the switchdev hardware database. - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface, before any switchdev port joined that LAG, leading to the hardware database missing those entries. - a switchdev port left a LAG that is a bridge port, while the LAG remained part of the bridge, and all FDB/MDB/VLAN entries remained installed in the hardware database of the switchdev port. Also, since commit 0d2cfbd41c4a ("net: bridge: ignore switchdev events for LAG ports which didn't request replay"), DSA introduced a method, based on a const void *ctx, to ensure that two switchdev ports under the same LAG that is a bridge port do not see the same MDB/VLAN entry being replayed twice by the bridge, once for every bridge port that joins the LAG. With so many ordering corner cases being possible, it seems unreasonable to expect a switchdev driver writer to get it right from the first try. Therefore, now that DSA has experimented with the bridge replay helpers for a little bit, we can move the code to the bridge driver where it is more readily available to all switchdev drivers. To convert the switchdev object replay helpers from "pull mode" (where the driver asks for them) to a "push mode" (where the bridge offers them automatically), the biggest problem is that the bridge needs to be aware when a switchdev port joins and leaves, even when the switchdev is only indirectly a bridge port (for example when the bridge port is a LAG upper of the switchdev). Luckily, we already have a hook for that, in the form of the newly introduced switchdev_bridge_port_offload() and switchdev_bridge_port_unoffload() calls. These offer a natural place for hooking the object addition and deletion replays. Extend the above 2 functions with: - pointers to the switchdev atomic notifier (for FDB replays) and the blocking notifier (for MDB and VLAN replays). - the "const void *ctx" argument required for drivers to be able to disambiguate between which port is targeted, when multiple ports are lowers of the same LAG that is a bridge port. Most of the drivers pass NULL to this argument, except the ones that support LAG offload and have the proper context check already in place in the switchdev blocking notifier handler. Also unexport the replay helpers, since nobody except the bridge calls them directly now. Note that: (a) we abuse the terminology slightly, because FDB entries are not "switchdev objects", but we count them as objects nonetheless. With no direct way to prove it, I think they are not modeled as switchdev objects because those can only be installed by the bridge to the hardware (as opposed to FDB entries which can be propagated in the other direction too). This is merely an abuse of terms, FDB entries are replayed too, despite not being objects. (b) the bridge does not attempt to sync port attributes to newly joined ports, just the countable stuff (the objects). The reason for this is simple: no universal and symmetric way to sync and unsync them is known. For example, VLAN filtering: what to do on unsync, disable or leave it enabled? Similarly, STP state, ageing timer, etc etc. What a switchdev port does when it becomes standalone again is not really up to the bridge's competence, and the driver should deal with it. On the other hand, replaying deletions of switchdev objects can be seen a matter of cleanup and therefore be treated by the bridge, hence this patch. We make the replay helpers opt-in for drivers, because they might not bring immediate benefits for them: - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(), so br_vlan_replay() should not do anything for the new drivers on which we call it. The existing drivers where there was even a slight possibility for there to exist a VLAN on a bridge port before they join it are already guarded against this: mlxsw and prestera deny joining LAG interfaces that are members of a bridge. - br_fdb_replay() should now notify of local FDB entries, but I patched all drivers except DSA to ignore these new entries in commit 2c4eca3ef716 ("net: bridge: switchdev: include local flag in FDB notifications"). Driver authors can lift this restriction as they wish, and when they do, they can also opt into the FDB replay functionality. - br_mdb_replay() should fix a real issue which is described in commit 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries"). However most drivers do not offload the SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw offload this switchdev object, and I don't completely understand the way in which they offload this switchdev object anyway. So I'll leave it up to these drivers' respective maintainers to opt into br_mdb_replay(). So most of the drivers pass NULL notifier blocks for the replay helpers, except: - dpaa2-switch which was already acked/regression-tested with the helpers enabled (and there isn't much of a downside in having them) - ocelot which already had replay logic in "pull" mode - DSA which already had replay logic in "pull" mode An important observation is that the drivers which don't currently request bridge event replays don't even have the switchdev_bridge_port_{offload,unoffload} calls placed in proper places right now. This was done to avoid unnecessary rework for drivers which might never even add support for this. For driver writers who wish to add replay support, this can be used as a tentative placement guide: https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/ Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22net: bridge: guard the switchdev replay helpers against a NULL notifier blockVladimir Oltean
There is a desire to make the object and FDB replay helpers optional when moving them inside the bridge driver. For example a certain driver might not offload host MDBs and there is no case where the replay helpers would be of immediate use to it. So it would be nice if we could allow drivers to pass NULL pointers for the atomic and blocking notifier blocks, and the replay helpers to do nothing in that case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22net: bridge: switchdev: let drivers inform which bridge ports are offloadedVladimir Oltean
On reception of an skb, the bridge checks if it was marked as 'already forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it is, it assigns the source hardware domain of that skb based on the hardware domain of the ingress port. Then during forwarding, it enforces that the egress port must have a different hardware domain than the ingress one (this is done in nbp_switchdev_allowed_egress). Non-switchdev drivers don't report any physical switch id (neither through devlink nor .ndo_get_port_parent_id), therefore the bridge assigns them a hardware domain of 0, and packets coming from them will always have skb->offload_fwd_mark = 0. So there aren't any restrictions. Problems appear due to the fact that DSA would like to perform software fallback for bonding and team interfaces that the physical switch cannot offload. +-- br0 ---+ / / | \ / / | \ / | | bond0 / | | / \ swp0 swp1 swp2 swp3 swp4 There, it is desirable that the presence of swp3 and swp4 under a non-offloaded LAG does not preclude us from doing hardware bridging beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high enough that software bridging between {swp0,swp1,swp2} and bond0 is not impractical. But this creates an impossible paradox given the current way in which port hardware domains are assigned. When the driver receives a packet from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to something. - If we set it to 0, then the bridge will forward it towards swp1, swp2 and bond0. But the switch has already forwarded it towards swp1 and swp2 (not to bond0, remember, that isn't offloaded, so as far as the switch is concerned, ports swp3 and swp4 are not looking up the FDB, and the entire bond0 is a destination that is strictly behind the CPU). But we don't want duplicated traffic towards swp1 and swp2, so it's not ok to set skb->offload_fwd_mark = 0. - If we set it to 1, then the bridge will not forward the skb towards the ports with the same switchdev mark, i.e. not to swp1, swp2 and bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should have forwarded the skb there. So the real issue is that bond0 will be assigned the same hardware domain as {swp0,swp1,swp2}, because the function that assigns hardware domains to bridge ports, nbp_switchdev_add(), recurses through bond0's lower interfaces until it finds something that implements devlink (calls dev_get_port_parent_id with bool recurse = true). This is a problem because the fact that bond0 can be offloaded by swp3 and swp4 in our example is merely an assumption. A solution is to give the bridge explicit hints as to what hardware domain it should use for each port. Currently, the bridging offload is very 'silent': a driver registers a netdevice notifier, which is put on the netns's notifier chain, and which sniffs around for NETDEV_CHANGEUPPER events where the upper is a bridge, and the lower is an interface it knows about (one registered by this driver, normally). Then, from within that notifier, it does a bunch of stuff behind the bridge's back, without the bridge necessarily knowing that there's somebody offloading that port. It looks like this: ip link set swp0 master br0 | v br_add_if() calls netdev_master_upper_dev_link() | v call_netdevice_notifiers | v dsa_slave_netdevice_event | v oh, hey! it's for me! | v .port_bridge_join What we do to solve the conundrum is to be less silent, and change the switchdev drivers to present themselves to the bridge. Something like this: ip link set swp0 master br0 | v br_add_if() calls netdev_master_upper_dev_link() | v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the | | hardware domain for v | this port, and zero dsa_slave_netdevice_event | if I got nothing. | | v | oh, hey! it's for me! | | | v | .port_bridge_join | | | +------------------------+ switchdev_bridge_port_offload(swp0, swp0) Then stacked interfaces (like bond0 on top of swp3/swp4) would be treated differently in DSA, depending on whether we can or cannot offload them. The offload case: ip link set bond0 master br0 | v br_add_if() calls netdev_master_upper_dev_link() | v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the | | switchdev mark for v | bond0. dsa_slave_netdevice_event | Coincidentally (or not), | | bond0 and swp0, swp1, swp2 v | all have the same switchdev hmm, it's not quite for me, | mark now, since the ASIC but my driver has already | is able to forward towards called .port_lag_join | all these ports in hw. for it, because I have | a port with dp->lag_dev == bond0. | | | v | .port_bridge_join | for swp3 and swp4 | | | +------------------------+ switchdev_bridge_port_offload(bond0, swp3) switchdev_bridge_port_offload(bond0, swp4) And the non-offload case: ip link set bond0 master br0 | v br_add_if() calls netdev_master_upper_dev_link() | v bridge waiting: call_netdevice_notifiers ^ huh, switchdev_bridge_port_offload | | wasn't called, okay, I'll use a v | hwdom of zero for this one. dsa_slave_netdevice_event : Then packets received on swp0 will | : not be software-forwarded towards v : swp1, but they will towards bond0. it's not for me, but bond0 is an upper of swp3 and swp4, but their dp->lag_dev is NULL because they couldn't offload it. Basically we can draw the conclusion that the lowers of a bridge port can come and go, so depending on the configuration of lowers for a bridge port, it can dynamically toggle between offloaded and unoffloaded. Therefore, we need an equivalent switchdev_bridge_port_unoffload too. This patch changes the way any switchdev driver interacts with the bridge. From now on, everybody needs to call switchdev_bridge_port_offload and switchdev_bridge_port_unoffload, otherwise the bridge will treat the port as non-offloaded and allow software flooding to other ports from the same ASIC. Note that these functions lay the ground for a more complex handshake between switchdev drivers and the bridge in the future. For drivers that will request a replay of the switchdev objects when they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we place the call to switchdev_bridge_port_unoffload() strategically inside the NETDEV_PRECHANGEUPPER notifier's code path, and not inside NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers need the netdev adjacency lists to be valid, and that is only true in NETDEV_PRECHANGEUPPER. Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22net: bridge: switchdev: recycle unused hwdomsTobias Waldekranz
Since hwdoms have only been used thus far for equality comparisons, the bridge has used the simplest possible assignment policy; using a counter to keep track of the last value handed out. With the upcoming transmit offloading, we need to perform set operations efficiently based on hwdoms, e.g. we want to answer questions like "has this skb been forwarded to any port within this hwdom?" Move to a bitmap-based allocation scheme that recycles hwdoms once all members leaves the bridge. This means that we can use a single unsigned long to keep track of the hwdoms that have received an skb. v1->v2: convert the typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX) into a plain unsigned long. v2->v6: none Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22net: bridge: disambiguate offload_fwd_markTobias Waldekranz
Before this change, four related - but distinct - concepts where named offload_fwd_mark: - skb->offload_fwd_mark: Set by the switchdev driver if the underlying hardware has already forwarded this frame to the other ports in the same hardware domain. - nbp->offload_fwd_mark: An idetifier used to group ports that share the same hardware forwarding domain. - br->offload_fwd_mark: Counter used to make sure that unique IDs are used in cases where a bridge contains ports from multiple hardware domains. - skb->cb->offload_fwd_mark: The hardware domain on which the frame ingressed and was forwarded. Introduce the term "hardware forwarding domain" ("hwdom") in the bridge to denote a set of ports with the following property: If an skb with skb->offload_fwd_mark set, is received on a port belonging to hwdom N, that frame has already been forwarded to all other ports in hwdom N. By decoupling the name from "offload_fwd_mark", we can extend the term's definition in the future - e.g. to add constraints that describe expected egress behavior - without overloading the meaning of "offload_fwd_mark". - nbp->offload_fwd_mark thus becomes nbp->hwdom. - br->offload_fwd_mark becomes br->last_hwdom. - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight change in naming here mandates a slight change in behavior of the nbp_switchdev_frame_mark() function. Previously, it only set this value in skb->cb for packets with skb->offload_fwd_mark true (ones which were forwarded in hardware). Whereas now we always track the incoming hwdom for all packets coming from a switchdev (even for the packets which weren't forwarded in hardware, such as STP BPDUs, IGMP reports etc). As all uses of skb->cb->offload_fwd_mark were already gated behind checks of skb->offload_fwd_mark, this will not introduce any functional change, but it paves the way for future changes where the ingressing hwdom must be known for frames coming from a switchdev regardless of whether they were forwarded in hardware or not (basically, if the skb comes from a switchdev, skb->cb->src_hwdom now always tracks which one). A typical example where this is relevant: the switchdev has a fixed configuration to trap STP BPDUs, but STP is not running on the bridge and the group_fwd_mask allows them to be forwarded. Say we have this setup: br0 / | \ / | \ swp0 swp1 swp2 A BPDU comes in on swp0 and is trapped to the CPU; the driver does not set skb->offload_fwd_mark. The bridge determines that the frame should be forwarded to swp{1,2}. It is imperative that forward offloading is _not_ allowed in this case, as the source hwdom is already "poisoned". Recording the source hwdom allows this case to be handled properly. v2->v3: added code comments v3->v6: none Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21net: bridge: multicast: add context support for host-joined groupsNikolay Aleksandrov
Adding bridge multicast context support for host-joined groups is easy because we only need the proper timer value. We pass the already chosen context and use its timer value. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21net: bridge: multicast: add mdb context supportNikolay Aleksandrov
Choose the proper bridge multicast context when user-spaces is adding mdb entries. Currently we require the vlan to be configured on at least one device (port or bridge) in order to add an mdb entry if vlan mcast snooping is enabled (vlan snooping implies vlan filtering). Note that we always allow deleting an entry, regardless of the vlan state. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21net: bridge: multicast: fix igmp/mld port context null pointer dereferencesNikolay Aleksandrov
With the recent change to use bridge/port multicast context pointers instead of bridge/port I missed to convert two locations which pass the port pointer as-is, but with the new model we need to verify the port context is non-NULL first and retrieve the port from it. The first location is when doing querier selection when a query is received, the second location is when leaving a group. The port context will be null if the packets originated from the bridge device (i.e. from the host). The fix is simple just check if the port context exists and retrieve the port pointer from it. Fixes: adc47037a7d5 ("net: bridge: multicast: use multicast contexts instead of bridge or port") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: bridge: vlan: add mcast snooping controlNikolay Aleksandrov
Add a new global vlan option which controls whether multicast snooping is enabled or disabled for a single vlan. It controls the vlan private flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>