summaryrefslogtreecommitdiff
path: root/net/bridge
AgeCommit message (Collapse)Author
2019-07-05netfilter: nft_meta_bridge: Add NFT_META_BRI_IIFVPROTO supportwenxu
This patch allows you to match on bridge vlan protocol, eg. nft add rule bridge firewall zones counter meta ibrvproto 0x8100 Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05bridge: add br_vlan_get_proto()wenxu
This new function allows you to fetch the bridge port vlan protocol. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05netfilter: nft_meta_bridge: add NFT_META_BRI_IIFPVID supportwenxu
This patch allows you to match on the bridge port pvid, eg. nft add rule bridge firewall zones counter meta ibrpvid 10 Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05bridge: add br_vlan_get_pvid_rcu()Pablo Neira Ayuso
This new function allows you to fetch bridge pvid from packet path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
2019-07-05netfilter: nft_meta_bridge: Remove the br_private.h headerwenxu
nft_bridge_meta should not access the bridge internal API. Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05netfilter: nft_meta: move bridge meta keys into nft_meta_bridgewenxu
Separate bridge meta key from nft_meta to meta_bridge to avoid a dependency between the bridge module and nft_meta when using the bridge API available through include/linux/if_bridge.h Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-04netfilter: nf_queue: remove unused hook entries pointerFlorian Westphal
Its not used anywhere, so remove this. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-02net: bridge: stp: don't cache eth dest pointer before skb pullNikolay Aleksandrov
Don't cache eth dest pointer before calling pskb_may_pull. Fixes: cf0f02d04a83 ("[BRIDGE]: use llc for receiving STP packets") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-02net: bridge: don't cache ether dest pointer on inputNikolay Aleksandrov
We would cache ether dst pointer on input in br_handle_frame_finish but after the neigh suppress code that could lead to a stale pointer since both ipv4 and ipv6 suppress code do pskb_may_pull. This means we have to always reload it after the suppress code so there's no point in having it cached just retrieve it directly. Fixes: 057658cb33fbf ("bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports") Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-02net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 queryNikolay Aleksandrov
We get a pointer to the ipv6 hdr in br_ip6_multicast_query but we may call pskb_may_pull afterwards and end up using a stale pointer. So use the header directly, it's just 1 place where it's needed. Fixes: 08b202b67264 ("bridge br_multicast: IPv6 MLD support.") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Tested-by: Martin Weinelt <martin@linuxlounge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-02net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handlingNikolay Aleksandrov
We take a pointer to grec prior to calling pskb_may_pull and use it afterwards to get nsrcs so record nsrcs before the pull when handling igmp3 and we get a pointer to nsrcs and call pskb_may_pull when handling mld2 which again could lead to reading 2 bytes out-of-bounds. ================================================================== BUG: KASAN: use-after-free in br_multicast_rcv+0x480c/0x4ad0 [bridge] Read of size 2 at addr ffff8880421302b4 by task ksoftirqd/1/16 CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: G OE 5.2.0-rc6+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x71/0xab print_address_description+0x6a/0x280 ? br_multicast_rcv+0x480c/0x4ad0 [bridge] __kasan_report+0x152/0x1aa ? br_multicast_rcv+0x480c/0x4ad0 [bridge] ? br_multicast_rcv+0x480c/0x4ad0 [bridge] kasan_report+0xe/0x20 br_multicast_rcv+0x480c/0x4ad0 [bridge] ? br_multicast_disable_port+0x150/0x150 [bridge] ? ktime_get_with_offset+0xb4/0x150 ? __kasan_kmalloc.constprop.6+0xa6/0xf0 ? __netif_receive_skb+0x1b0/0x1b0 ? br_fdb_update+0x10e/0x6e0 [bridge] ? br_handle_frame_finish+0x3c6/0x11d0 [bridge] br_handle_frame_finish+0x3c6/0x11d0 [bridge] ? br_pass_frame_up+0x3a0/0x3a0 [bridge] ? virtnet_probe+0x1c80/0x1c80 [virtio_net] br_handle_frame+0x731/0xd90 [bridge] ? select_idle_sibling+0x25/0x7d0 ? br_handle_frame_finish+0x11d0/0x11d0 [bridge] __netif_receive_skb_core+0xced/0x2d70 ? virtqueue_get_buf_ctx+0x230/0x1130 [virtio_ring] ? do_xdp_generic+0x20/0x20 ? virtqueue_napi_complete+0x39/0x70 [virtio_net] ? virtnet_poll+0x94d/0xc78 [virtio_net] ? receive_buf+0x5120/0x5120 [virtio_net] ? __netif_receive_skb_one_core+0x97/0x1d0 __netif_receive_skb_one_core+0x97/0x1d0 ? __netif_receive_skb_core+0x2d70/0x2d70 ? _raw_write_trylock+0x100/0x100 ? __queue_work+0x41e/0xbe0 process_backlog+0x19c/0x650 ? _raw_read_lock_irq+0x40/0x40 net_rx_action+0x71e/0xbc0 ? __switch_to_asm+0x40/0x70 ? napi_complete_done+0x360/0x360 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 ? __schedule+0x85e/0x14d0 __do_softirq+0x1db/0x5f9 ? takeover_tasklets+0x5f0/0x5f0 run_ksoftirqd+0x26/0x40 smpboot_thread_fn+0x443/0x680 ? sort_range+0x20/0x20 ? schedule+0x94/0x210 ? __kthread_parkme+0x78/0xf0 ? sort_range+0x20/0x20 kthread+0x2ae/0x3a0 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x35/0x40 The buggy address belongs to the page: page:ffffea0001084c00 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 flags: 0xffffc000000000() raw: 00ffffc000000000 ffffea0000cfca08 ffffea0001098608 0000000000000000 raw: 0000000000000000 0000000000000003 00000000ffffff7f 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888042130180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888042130200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ffff888042130280: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff888042130300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888042130380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Disabling lock debugging due to kernel taint Fixes: bc8c20acaea1 ("bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave") Reported-by: Martin Weinelt <martin@linuxlounge.net> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Tested-by: Martin Weinelt <martin@linuxlounge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextPablo Neira Ayuso
Resolve conflict between d2912cb15bdd ("treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500") removing the GPL disclaimer and fe03d4745675 ("Update my email address") which updates Jozsef Kadlecsik's email. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor SPDX change conflict. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-21netfilter: bridge: Fix non-untagged fragment packetwenxu
ip netns exec ns1 ip a a dev eth0 10.0.0.7/24 ip netns exec ns2 ip link a link eth0 name vlan type vlan id 200 ip netns exec ns2 ip a a dev vlan 10.0.0.8/24 ip l add dev br0 type bridge vlan_filtering 1 brctl addif br0 veth1 brctl addif br0 veth2 bridge vlan add dev veth1 vid 200 pvid untagged bridge vlan add dev veth2 vid 200 A two fragment packet sent from ns2 contains the vlan tag 200. In the bridge conntrack, this packet will defrag to one skb with fraglist. When the packet is forwarded to ns1 through veth1, the first skb vlan tag will be cleared by the "untagged" flags. But the vlan tag in the second skb is still tagged, so the second fragment ends up with tag 200 to ns1. So if the first fragment packet doesn't contain the vlan tag, all of the remain should not contain vlan tag. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20netfilter: bridge: prevent UAF in brnf_exit_net()Christian Brauner
Prevent a UAF in brnf_exit_net(). When unregister_net_sysctl_table() is called the ctl_hdr pointer will obviously be freed and so accessing it righter after is invalid. Fix this by stashing a pointer to the table we want to free before we unregister the sysctl header. Note that syzkaller falsely chased this down to the drm tree so the Fixes tag that syzkaller requested would be wrong. This commit uses a different but the correct Fixes tag. /* Splat */ BUG: KASAN: use-after-free in br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline] BUG: KASAN: use-after-free in brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141 Read of size 8 at addr ffff8880a4078d60 by task kworker/u4:4/8749 CPU: 0 PID: 8749 Comm: kworker/u4:4 Not tainted 5.2.0-rc5-next-20190618 #17 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351 __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline] brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141 ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154 cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553 process_one_work+0x989/0x1790 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x354/0x420 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Allocated by task 11374: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503 __do_kmalloc mm/slab.c:3645 [inline] __kmalloc+0x15c/0x740 mm/slab.c:3654 kmalloc include/linux/slab.h:552 [inline] kzalloc include/linux/slab.h:743 [inline] __register_sysctl_table+0xc7/0xef0 fs/proc/proc_sysctl.c:1327 register_net_sysctl+0x29/0x30 net/sysctl_net.c:121 br_netfilter_sysctl_init_net net/bridge/br_netfilter_hooks.c:1105 [inline] brnf_init_net+0x379/0x6a0 net/bridge/br_netfilter_hooks.c:1126 ops_init+0xb3/0x410 net/core/net_namespace.c:130 setup_net+0x2d3/0x740 net/core/net_namespace.c:316 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:103 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:202 ksys_unshare+0x444/0x980 kernel/fork.c:2822 __do_sys_unshare kernel/fork.c:2890 [inline] __se_sys_unshare kernel/fork.c:2888 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:2888 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3417 [inline] kfree+0x10a/0x2c0 mm/slab.c:3746 __rcu_reclaim kernel/rcu/rcu.h:215 [inline] rcu_do_batch kernel/rcu/tree.c:2092 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline] rcu_core+0xcc7/0x1500 kernel/rcu/tree.c:2291 __do_softirq+0x25c/0x94c kernel/softirq.c:292 The buggy address belongs to the object at ffff8880a4078d40 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 32 bytes inside of 512-byte region [ffff8880a4078d40, ffff8880a4078f40) The buggy address belongs to the page: page:ffffea0002901e00 refcount:1 mapcount:0 mapping:ffff8880aa400a80 index:0xffff8880a40785c0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea0001d636c8 ffffea0001b07308 ffff8880aa400a80 raw: ffff8880a40785c0 ffff8880a40780c0 0000000100000004 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880a4078c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a4078c80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc > ffff8880a4078d00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff8880a4078d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a4078e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-by: syzbot+43a3fa52c0d9c5c94f41@syzkaller.appspotmail.com Fixes: 22567590b2e6 ("netfilter: bridge: namespace bridge netfilter sysctls") Signed-off-by: Christian Brauner <christian@brauner.io> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500Thomas Gleixner
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17netfilter: bridge: namespace bridge netfilter sysctlsChristian Brauner
Currently, the /proc/sys/net/bridge folder is only created in the initial network namespace. This patch ensures that the /proc/sys/net/bridge folder is available in each network namespace if the module is loaded and disappears from all network namespaces when the module is unloaded. In doing so the patch makes the sysctls: bridge-nf-call-arptables bridge-nf-call-ip6tables bridge-nf-call-iptables bridge-nf-filter-pppoe-tagged bridge-nf-filter-vlan-tagged bridge-nf-pass-vlan-input-dev apply per network namespace. This unblocks some use-cases where users would like to e.g. not do bridge filtering for bridges in a specific network namespace while doing so for bridges located in another network namespace. The netfilter rules are afaict already per network namespace so it should be safe for users to specify whether bridge devices inside a network namespace are supposed to go through iptables et al. or not. Also, this can already be done per-bridge by setting an option for each individual bridge via Netlink. It should also be possible to do this for all bridges in a network namespace via sysctls. Cc: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17netfilter: bridge: port sysctls to use brnf_netChristian Brauner
This ports the sysctls to use struct brnf_net. With this patch we make it possible to namespace the br_netfilter module in the following patch. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-14docs: kbuild: convert docs to ReST and rename to *.rstMauro Carvalho Chehab
The kbuild documentation clearly shows that the documents there are written at different times: some use markdown, some use their own peculiar logic to split sections. Convert everything to ReST without affecting too much the author's style and avoiding adding uneeded markups. The conversion is actually: - add blank lines and identation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31netfilter: bridge: convert skb_make_writable to skb_ensure_writableFlorian Westphal
Back in the day, skb_ensure_writable did not exist. By now, both functions have the same precondition: I. skb_make_writable will test in this order: 1. wlen > skb->len -> error 2. if not cloned and wlen <= headlen -> OK 3. If cloned and wlen bytes of clone writeable -> OK After those checks, skb is either not cloned but needs to pull from nonlinear area, or writing to head would also alter data of another clone. In both cases skb_make_writable will then call __pskb_pull_tail, which will kmalloc a new memory area to use for skb->head. IOW, after successful skb_make_writable call, the requested length is in linear area and can be modified, even if skb was cloned. II. skb_ensure_writable will do this instead: 1. call pskb_may_pull. This handles case 1 above. After this, wlen is in linear area, but skb might be cloned. 2. return if skb is not cloned 3. return if wlen byte of clone are writeable. 4. fully copy the skb. So post-conditions are the same: *len bytes are writeable in linear area without altering any payload data of a clone, all header pointers might have been changed. Only differences are that skb_ensure_writable is in the core, whereas skb_make_writable lives in netfilter core and the inverted return value. skb_make_writable returns 0 on error, whereas skb_ensure_writable returns negative value. For the normal cases performance is similar: A. skb is not cloned and in linear area: pskb_may_pull is inline helper, so neither function copies. B. skb is cloned, write is in linear area and clone is writeable: both funcions return with step 3. This series removes skb_make_writable from the kernel. While at it, pass the needed value instead, its less confusing that way: There is no special-handling of "0-length" argument in either skb_make_writable or skb_ensure_writable. bridge already makes sure ethernet header is in linear area, only purpose of the make_writable() is is to copy skb->head in case of cloned skbs. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-30netfilter: nf_conntrack_bridge: add support for IPv6Pablo Neira Ayuso
br_defrag() and br_fragment() indirections are added in case that IPv6 support comes as a module, to avoid pulling innecessary dependencies in. The new fraglist iterator and fragment transformer APIs are used to implement the refragmentation code. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30netfilter: bridge: add connection tracking systemPablo Neira Ayuso
This patch adds basic connection tracking support for the bridge, including initial IPv4 support. This patch register two hooks to deal with the bridge forwarding path, one from the bridge prerouting hook to call nf_conntrack_in(); and another from the bridge postrouting hook to confirm the entry. The conntrack bridge prerouting hook defragments packets before passing them to nf_conntrack_in() to look up for an existing entry, otherwise a new entry is allocated and it is attached to the skbuff. The conntrack bridge postrouting hook confirms new conntrack entries, ie. if this is the first packet seen, then it adds the entry to the hashtable and (if needed) it refragments the skbuff into the original fragments, leaving the geometry as is if possible. Exceptions are linearized skbuffs, eg. skbuffs that are passed up to nfqueue and conntrack helpers, as well as cloned skbuff for the local delivery (eg. tcpdump), also in case of bridge port flooding (cloned skbuff too). The packet defragmentation is done through the ip_defrag() call. This forces us to save the bridge control buffer, reset the IP control buffer area and then restore it after call. This function also bumps the IP fragmentation statistics, it would be probably desiderable to have independent statistics for the bridge defragmentation/refragmentation. The maximum fragment length is stored in the control buffer and it is used to refragment the skbuff from the postrouting path. The new fraglist splitter and fragment transformer APIs are used to implement the bridge refragmentation code. The br_ip_fragment() function drops the packet in case the maximum fragment size seen is larger than the output port MTU. This patchset follows the principle that conntrack should not drop packets, so users can do it through policy via invalid state matching. Like br_netfilter, there is no refragmentation for packets that are passed up for local delivery, ie. prerouting -> input path. There are calls to nf_reset() already in several spots in the stack since time ago already, eg. af_packet, that show that skbuff fraglist handling from the netif_rx path is supported already. The helpers are called from the postrouting hook, before confirmation, from there we may see packet floods to bridge ports. Then, although unlikely, this may result in exercising the helpers many times for each clone. It would be good to explore how to pass all the packets in a list to the conntrack hook to do this handle only once for this case. Thanks to Florian Westphal for handing me over an initial patchset version to add support for conntrack bridge. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13Thomas Gleixner
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details [based] [from] [clk] [highbank] [c] you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 355 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier for more missed filesThomas Gleixner
Add SPDX license identifiers to all files which: - Have no license information of any form - Have MODULE_LICENCE("GPL*") inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21treewide: Add SPDX license identifier for missed filesThomas Gleixner
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Postpone chain policy update to drop after transaction is complete, from Florian Westphal. 2) Add entry to flowtable after confirmation to fix UDP flows with packets going in one single direction. 3) Reference count leak in dst object, from Taehee Yoo. 4) Check for TTL field in flowtable datapath, from Taehee Yoo. 5) Fix h323 conntrack helper due to incorrect boundary check, from Jakub Jankowski. 6) Fix incorrect rcu dereference when fetching basechain stats, from Florian Westphal. 7) Missing error check when adding new entries to flowtable, from Taehee Yoo. 8) Use version field in nfnetlink message to honor the nfgen_family field, from Kristian Evensen. 9) Remove incorrect configuration check for CONFIG_NF_CONNTRACK_IPV6, from Subash Abhinov Kasiviswanathan. 10) Prevent dying entries from being added to the flowtable, from Taehee Yoo. 11) Don't hit WARN_ON() with malformed blob in ebtables with trailing data after last rule, reported by syzbot, patch from Florian Westphal. 12) Remove NFT_CT_TIMEOUT enumeration, never used in the kernel code. 13) Fix incorrect definition for NFT_LOGLEVEL_MAX, from Florian Westphal. This batch comes with a conflict that can be fixed with this patch: diff --cc include/uapi/linux/netfilter/nf_tables.h index 7bdb234f3d8c,f0cf7b0f4f35..505393c6e959 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@@ -966,6 -966,8 +966,7 @@@ enum nft_socket_keys * @NFT_CT_DST_IP: conntrack layer 3 protocol destination (IPv4 address) * @NFT_CT_SRC_IP6: conntrack layer 3 protocol source (IPv6 address) * @NFT_CT_DST_IP6: conntrack layer 3 protocol destination (IPv6 address) - * @NFT_CT_TIMEOUT: connection tracking timeout policy assigned to conntrack + * @NFT_CT_ID: conntrack id */ enum nft_ct_keys { NFT_CT_STATE, @@@ -991,6 -993,8 +992,7 @@@ NFT_CT_DST_IP, NFT_CT_SRC_IP6, NFT_CT_DST_IP6, - NFT_CT_TIMEOUT, + NFT_CT_ID, __NFT_CT_MAX }; #define NFT_CT_MAX (__NFT_CT_MAX - 1) That replaces the unused NFT_CT_TIMEOUT definition by NFT_CT_ID. If you prefer, I can also solve this conflict here, just let me know. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-10bridge: Fix error path for kobject_init_and_add()Tobin C. Harding
Currently error return from kobject_init_and_add() is not followed by a call to kobject_put(). This means there is a memory leak. We currently set p to NULL so that kfree() may be called on it as a noop, the code is arguably clearer if we move the kfree() up closer to where it is called (instead of after goto jump). Remove a goto label 'err1' and jump to call to kobject_put() in error return from kobject_init_and_add() fixing the memory leak. Re-name goto label 'put_back' to 'err1' now that we don't use err1, following current nomenclature (err1, err2 ...). Move call to kfree out of the error code at bottom of function up to closer to where memory was allocated. Add comment to clarify call to kfree(). Signed-off-by: Tobin C. Harding <tobin@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09netfilter: ebtables: CONFIG_COMPAT: reject trailing data after last ruleFlorian Westphal
If userspace provides a rule blob with trailing data after last target, we trigger a splat, then convert ruleset to 64bit format (with trailing data), then pass that to do_replace_finish() which then returns -EINVAL. Erroring out right away avoids the splat plus unneeded translation and error unwind. Fixes: 81e675c227ec ("netfilter: ebtables: add CONFIG_COMPAT support") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-27netlink: make validation more configurable for future strictnessJohannes Berg
We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27net: fix two coding style issuesMichal Kubecek
This is a simple cleanup addressing two coding style issues found by checkpatch.pl in an earlier patch. It's submitted as a separate patch to keep the original patch as it was generated by spatch. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27netlink: make nla_nest_start() add NLA_F_NESTED flagMichal Kubecek
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most netlink based interfaces (including recently added ones) are still not setting it in kernel generated messages. Without the flag, message parsers not aware of attribute semantics (e.g. wireshark dissector or libmnl's mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display the structure of their contents. Unfortunately we cannot just add the flag everywhere as there may be userspace applications which check nlattr::nla_type directly rather than through a helper masking out the flags. Therefore the patch renames nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start() as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually are rewritten to use nla_nest_start(). Except for changes in include/net/netlink.h, the patch was generated using this semantic patch: @@ expression E1, E2; @@ -nla_nest_start(E1, E2) +nla_nest_start_noflag(E1, E2) @@ expression E1, E2; @@ -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED) +nla_nest_start(E1, E2) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two easy cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22bridge: Fix possible use-after-free when deleting bridge portIdo Schimmel
When a bridge port is being deleted, do not dereference it later in br_vlan_port_event() as it can result in a use-after-free [1] if the RCU callback was executed before invoking the function. [1] [ 129.638551] ================================================================== [ 129.646904] BUG: KASAN: use-after-free in br_vlan_port_event+0x53c/0x5fd [ 129.654406] Read of size 8 at addr ffff8881e4aa1ae8 by task ip/483 [ 129.663008] CPU: 0 PID: 483 Comm: ip Not tainted 5.1.0-rc5-custom-02265-ga946bd73daac #1383 [ 129.672359] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016 [ 129.682484] Call Trace: [ 129.685242] dump_stack+0xa9/0x10e [ 129.689068] print_address_description.cold.2+0x9/0x25e [ 129.694930] kasan_report.cold.3+0x78/0x9d [ 129.704420] br_vlan_port_event+0x53c/0x5fd [ 129.728300] br_device_event+0x2c7/0x7a0 [ 129.741505] notifier_call_chain+0xb5/0x1c0 [ 129.746202] rollback_registered_many+0x895/0xe90 [ 129.793119] unregister_netdevice_many+0x48/0x210 [ 129.803384] rtnl_delete_link+0xe1/0x140 [ 129.815906] rtnl_dellink+0x2a3/0x820 [ 129.844166] rtnetlink_rcv_msg+0x397/0x910 [ 129.868517] netlink_rcv_skb+0x137/0x3a0 [ 129.882013] netlink_unicast+0x49b/0x660 [ 129.900019] netlink_sendmsg+0x755/0xc90 [ 129.915758] ___sys_sendmsg+0x761/0x8e0 [ 129.966315] __sys_sendmsg+0xf0/0x1c0 [ 129.988918] do_syscall_64+0xa4/0x470 [ 129.993032] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 129.998696] RIP: 0033:0x7ff578104b58 ... [ 130.073811] Allocated by task 479: [ 130.077633] __kasan_kmalloc.constprop.5+0xc1/0xd0 [ 130.083008] kmem_cache_alloc_trace+0x152/0x320 [ 130.088090] br_add_if+0x39c/0x1580 [ 130.092005] do_set_master+0x1aa/0x210 [ 130.096211] do_setlink+0x985/0x3100 [ 130.100224] __rtnl_newlink+0xc52/0x1380 [ 130.104625] rtnl_newlink+0x6b/0xa0 [ 130.108541] rtnetlink_rcv_msg+0x397/0x910 [ 130.113136] netlink_rcv_skb+0x137/0x3a0 [ 130.117538] netlink_unicast+0x49b/0x660 [ 130.121939] netlink_sendmsg+0x755/0xc90 [ 130.126340] ___sys_sendmsg+0x761/0x8e0 [ 130.130645] __sys_sendmsg+0xf0/0x1c0 [ 130.134753] do_syscall_64+0xa4/0x470 [ 130.138864] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 130.146195] Freed by task 0: [ 130.149421] __kasan_slab_free+0x125/0x170 [ 130.154016] kfree+0xf3/0x310 [ 130.157349] kobject_put+0x1a8/0x4c0 [ 130.161363] rcu_core+0x859/0x19b0 [ 130.165175] __do_softirq+0x250/0xa26 [ 130.170956] The buggy address belongs to the object at ffff8881e4aa1ae8 which belongs to the cache kmalloc-1k of size 1024 [ 130.184972] The buggy address is located 0 bytes inside of 1024-byte region [ffff8881e4aa1ae8, ffff8881e4aa1ee8) Fixes: 9c0ec2e7182a ("bridge: support binding vlan dev link state to vlan member bridge ports") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Cc: Mike Manning <mmanning@vyatta.att-mail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Mike Manning <mmanning@vyatta.att-mail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree: 1) Add a selftest for icmp packet too big errors with conntrack, from Florian Westphal. 2) Validate inner header in ICMP error message does not lie to us in conntrack, also from Florian. 3) Initialize ct->timeout to calm down KASAN, from Alexander Potapenko. 4) Skip ICMP error messages from tunnels in IPVS, from Julian Anastasov. 5) Use a hash to expose conntrack and expectation ID, from Florian Westphal. 6) Prevent shift wrap in nft_chain_parse_hook(), from Dan Carpenter. 7) Fix broken ICMP ID randomization with NAT, also from Florian. 8) Remove WARN_ON in ebtables compat that is reached via syzkaller, from Florian Westphal. 9) Fix broken timestamps since fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC"), from Florian. 10) Fix logging of invalid packets in conntrack, from Andrei Vagin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22netfilter: ebtables: CONFIG_COMPAT: drop a bogus WARN_ONFlorian Westphal
It means userspace gave us a ruleset where there is some other data after the ebtables target but before the beginning of the next rule. Fixes: 81e675c227ec ("netfilter: ebtables: add CONFIG_COMPAT support") Reported-by: syzbot+659574e7bcc7f7eb4df7@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-19bridge: update vlan dev link state for bridge netdev changesMike Manning
If vlan bridge binding is enabled, then the link state of a vlan device that is an upper device of the bridge tracks the state of bridge ports that are members of that vlan. But this can only be done when the link state of the bridge is up. If it is down, then the link state of the vlan devices must also be down. This is to maintain existing behavior for when STP is enabled and there are no live ports, in which case the link state for the bridge and any vlan devices is down. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19bridge: update vlan dev state when port added to or deleted from vlanMike Manning
If vlan bridge binding is enabled, then the link state of a vlan device that is an upper device of the bridge should track the state of bridge ports that are members of that vlan. So if a bridge port becomes or stops being a member of a vlan, then update the link state of the vlan device if necessary. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19bridge: support binding vlan dev link state to vlan member bridge portsMike Manning
In the case of vlan filtering on bridges, the bridge may also have the corresponding vlan devices as upper devices. A vlan bridge binding mode is added to allow the link state of the vlan device to track only the state of the subset of bridge ports that are also members of the vlan, rather than that of all bridge ports. This mode is set with a vlan flag rather than a bridge sysfs so that the 8021q module is aware that it should not set the link state for the vlan device. If bridge vlan is configured, the bridge device event handling results in the link state for an upper device being set, if it is a vlan device with the vlan bridge binding mode enabled. This also sets a vlan_bridge_binding flag so that subsequent UP/DOWN/CHANGE events for the ports in that bridge result in a link state update of the vlan device if required. The link state of the vlan device is up if there is at least one bridge port that is a vlan member that is admin & oper up, otherwise its oper state is IF_OPER_LOWERLAYERDOWN. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflict resolution of af_smc.c from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net: bridge: fix netlink export of vlan_stats_per_port optionNikolay Aleksandrov
Since the introduction of the vlan_stats_per_port option the netlink export of it has been broken since I made a typo and used the ifla attribute instead of the bridge option to retrieve its state. Sysfs export is fine, only netlink export has been affected. Fixes: 9163a0fc1f0c0 ("net: bridge: add support for per-port vlan stats") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net: bridge: fix per-port af_packet socketsNikolay Aleksandrov
When the commit below was introduced it changed two visible things: - the skb was no longer passed through the protocol handlers with the original device - the skb was passed up the stack with skb->dev = bridge The first change broke af_packet sockets on bridge ports. For example we use them for hostapd which listens for ETH_P_PAE packets on the ports. We discussed two possible fixes: - create a clone and pass it through NF_HOOK(), act on the original skb based on the result - somehow signal to the caller from the okfn() that it was called, meaning the skb is ok to be passed, which this patch is trying to implement via returning 1 from the bridge link-local okfn() Note that we rely on the fact that NF_QUEUE/STOLEN would return 0 and drop/error would return < 0 thus the okfn() is called only when the return was 1, so we signal to the caller that it was called by preserving the return value from nf_hook(). Fixes: 8626c56c8279 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15bridge: only include nf_queue.h if neededStephen Rothwell
After merging the netfilter-next tree, today's linux-next build (powerpc ppc44x_defconfig) failed like this: In file included from net/bridge/br_input.c:19: include/net/netfilter/nf_queue.h:16:23: error: field 'state' has incomplete type struct nf_hook_state state; ^~~~~ Fixes: 971502d77faa ("bridge: netfilter: unroll NF_HOOK helper in bridge input path") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12bridge: broute: make broute a real ebtables tableFlorian Westphal
This makes broute a normal ebtables table, hooking at PREROUTING. The broute hook is removed. It uses skb->cb to signal to bridge rx handler that the skb should be routed instead of being bridged. This change is backwards compatible with ebtables as no userspace visible parts are changed. This means we can also remove the !ops test in ebt_register_table, it was only there for broute table sake. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12bridge: netfilter: unroll NF_HOOK helper in bridge input pathFlorian Westphal
Replace NF_HOOK() based invocation of the netfilter hooks with a private copy of nf_hook_slow(). This copy has one difference: it can return the rx handler value expected by the stack, i.e. RX_HANDLER_CONSUMED or RX_HANDLER_PASS. This is needed by the next patch to invoke the ebtables "broute" table via the standard netfilter hooks rather than the custom "br_should_route_hook" indirection that is used now. When the skb is to be "brouted", we must return RX_HANDLER_PASS from the bridge rx input handler, but there is no way to indicate this via NF_HOOK(), unless perhaps by some hack such as exposing bridge_cb in the netfilter core or a percpu flag. text data bss dec filename 3369 56 0 3425 net/bridge/br_input.o.before 3458 40 0 3498 net/bridge/br_input.o.after This allows removal of the "br_should_route_hook" in the next patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12bridge: reduce size of input cb to 16 bytesFlorian Westphal
Reduce size of br_input_skb_cb from 24 to 16 bytes by using bitfield for those values that can only be 0 or 1. igmp is the igmp type value, so it needs to be at least u8. Furthermore, the bridge currently relies on step-by-step initialization of br_input_skb_cb fields as the skb passes through the stack. Explicitly zero out the bridge input cb instead, this avoids having to review/validate that no BR_INPUT_SKB_CB(skb)->foo test can see a 'random' value from previous protocol cb. AFAICS all current fields are always set up before they are read again, so this is not a bug fix. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11net: bridge: multicast: use rcu to access port list from ↵Nikolay Aleksandrov
br_multicast_start_querier br_multicast_start_querier() walks over the port list but it can be called from a timer with only multicast_lock held which doesn't protect the port list, so use RCU to walk over it. Fixes: c83b8fab06fc ("bridge: Restart queries when last querier expires") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07rhashtable: use bit_spin_locks to protect hash bucket.NeilBrown
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the bucket pointer to lock the hash chain for that bucket. The benefits of a bit spin_lock are: - no need to allocate a separate array of locks. - no need to have a configuration option to guide the choice of the size of this array - locking cost is often a single test-and-set in a cache line that will have to be loaded anyway. When inserting at, or removing from, the head of the chain, the unlock is free - writing the new address in the bucket head implicitly clears the lock bit. For __rhashtable_insert_fast() we ensure this always happens when adding a new key. - even when lockings costs 2 updates (lock and unlock), they are in a cacheline that needs to be read anyway. The cost of using a bit spin_lock is a little bit of code complexity, which I think is quite manageable. Bit spin_locks are sometimes inappropriate because they are not fair - if multiple CPUs repeatedly contend of the same lock, one CPU can easily be starved. This is not a credible situation with rhashtable. Multiple CPUs may want to repeatedly add or remove objects, but they will typically do so at different buckets, so they will attempt to acquire different locks. As we have more bit-locks than we previously had spinlocks (by at least a factor of two) we can expect slightly less contention to go with the slightly better cache behavior and reduced memory consumption. To enhance type checking, a new struct is introduced to represent the pointer plus lock-bit that is stored in the bucket-table. This is "struct rhash_lock_head" and is empty. A pointer to this needs to be cast to either an unsigned lock, or a "struct rhash_head *" to be useful. Variables of this type are most often called "bkt". Previously "pprev" would sometimes point to a bucket, and sometimes a ->next pointer in an rhash_head. As these are now different types, pprev is NULL when it would have pointed to the bucket. In that case, 'blk' is used, together with correct locking protocol. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>