summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)Author
2020-04-29netfilter: nf_osf: avoid passing pointer to local varArnd Bergmann
gcc-10 points out that a code path exists where a pointer to a stack variable may be passed back to the caller: net/netfilter/nfnetlink_osf.c: In function 'nf_osf_hdr_ctx_init': cc1: warning: function may return address of local variable [-Wreturn-local-addr] net/netfilter/nfnetlink_osf.c:171:16: note: declared here 171 | struct tcphdr _tcph; | ^~~~~ I am not sure whether this can happen in practice, but moving the variable declaration into the callers avoids the problem. Fixes: 31a9c29210e2 ("netfilter: nf_osf: add struct nf_osf_hdr_ctx") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-26netfilter: nat: never update the UDP checksum when it's 0Guillaume Nault
If the UDP header of a local VXLAN endpoint is NAT-ed, and the VXLAN device has disabled UDP checksums and enabled Tx checksum offloading, then the skb passed to udp_manip_pkt() has hdr->check == 0 (outer checksum disabled) and skb->ip_summed == CHECKSUM_PARTIAL (inner packet checksum offloaded). Because of the ->ip_summed value, udp_manip_pkt() tries to update the outer checksum with the new address and port, leading to an invalid checksum sent on the wire, as the original null checksum obviously didn't take the old address and port into account. So, we can't take ->ip_summed into account in udp_manip_pkt(), as it might not refer to the checksum we're acting on. Instead, we can base the decision to update the UDP checksum entirely on the value of hdr->check, because it's null if and only if checksum is disabled: * A fully computed checksum can't be 0, since a 0 checksum is represented by the CSUM_MANGLED_0 value instead. * A partial checksum can't be 0, since the pseudo-header always adds at least one non-zero value (the UDP protocol type 0x11) and adding more values to the sum can't make it wrap to 0 as the carry is then added to the wrapped number. * A disabled checksum uses the special value 0. The problem seems to be there from day one, although it was probably not visible before UDP tunnels were implemented. Fixes: 5b1158e909ec ("[NETFILTER]: Add NAT support for nf_conntrack") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) flow_block_cb memleak in nf_flow_table_offload_del_cb(), from Roi Dayan. 2) Fix error path handling in nf_nat_inet_register_fn(), from Hillf Danton. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-19netfilter: nat: fix error handling upon registering inet hookHillf Danton
A case of warning was reported by syzbot. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 19934 at net/netfilter/nf_nat_core.c:1106 nf_nat_unregister_fn+0x532/0x5c0 net/netfilter/nf_nat_core.c:1106 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 19934 Comm: syz-executor.5 Not tainted 5.6.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 panic+0x2e3/0x75c kernel/panic.c:221 __warn.cold+0x2f/0x35 kernel/panic.c:582 report_bug+0x27b/0x2f0 lib/bug.c:195 fixup_bug arch/x86/kernel/traps.c:175 [inline] fixup_bug arch/x86/kernel/traps.c:170 [inline] do_error_trap+0x12b/0x220 arch/x86/kernel/traps.c:267 do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:286 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027 RIP: 0010:nf_nat_unregister_fn+0x532/0x5c0 net/netfilter/nf_nat_core.c:1106 Code: ff df 48 c1 ea 03 80 3c 02 00 75 75 48 8b 44 24 10 4c 89 ef 48 c7 00 00 00 00 00 e8 e8 f8 53 fb e9 4d fe ff ff e8 ee 9c 16 fb <0f> 0b e9 41 fe ff ff e8 e2 45 54 fb e9 b5 fd ff ff 48 8b 7c 24 20 RSP: 0018:ffffc90005487208 EFLAGS: 00010246 RAX: 0000000000040000 RBX: 0000000000000004 RCX: ffffc9001444a000 RDX: 0000000000040000 RSI: ffffffff865c94a2 RDI: 0000000000000005 RBP: ffff88808b5cf000 R08: ffff8880a2620140 R09: fffffbfff14bcd79 R10: ffffc90005487208 R11: fffffbfff14bcd78 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000 nf_nat_ipv6_unregister_fn net/netfilter/nf_nat_proto.c:1017 [inline] nf_nat_inet_register_fn net/netfilter/nf_nat_proto.c:1038 [inline] nf_nat_inet_register_fn+0xfc/0x140 net/netfilter/nf_nat_proto.c:1023 nf_tables_register_hook net/netfilter/nf_tables_api.c:224 [inline] nf_tables_addchain.constprop.0+0x82e/0x13c0 net/netfilter/nf_tables_api.c:1981 nf_tables_newchain+0xf68/0x16a0 net/netfilter/nf_tables_api.c:2235 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362 ___sys_sendmsg+0x100/0x170 net/socket.c:2416 __sys_sendmsg+0xec/0x1b0 net/socket.c:2449 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 and to quiesce it, unregister NFPROTO_IPV6 hook instead of NFPROTO_INET in case of failing to register NFPROTO_IPV4 hook. Reported-by: syzbot <syzbot+33e06702fd6cffc24c40@syzkaller.appspotmail.com> Fixes: d164385ec572 ("netfilter: nat: add inet family nat support") Cc: Florian Westphal <fw@strlen.de> Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel. 2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet. 3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing. 4) Adjust icmp6 message source address selection when routes have a preferred source address set, from Tim Stallard. 5) Be sure to validate HSR protocol version when creating new links, from Taehee Yoo. 6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in non-initial namespaces, from Michael Weiß. 7) Missing release firmware call in mlx5, from Eran Ben Elisha. 8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from Taehee Yoo. 9) Fix pause frame negotiation in marvell phy driver, from Clemens Gruber. 10) Record RX queue early enough in tun packet paths such that XDP programs will see the correct RX queue index, from Gilberto Bertin. 11) Fix double unlock in mptcp, from Florian Westphal. 12) Fix offset overflow in ARM bpf JIT, from Luke Nelson. 13) marvell10g needs to soft reset PHY when coming out of low power mode, from Russell King. 14) Fix MTU setting regression in stmmac for some chip types, from Florian Fainelli. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits) amd-xgbe: Use __napi_schedule() in BH context mISDN: make dmril and dmrim static net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode tipc: fix incorrect increasing of link window Documentation: Fix tcp_challenge_ack_limit default value net: tulip: make early_486_chipsets static dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400 ipv6: remove redundant assignment to variable err net/rds: Use ERR_PTR for rds_message_alloc_sgs() net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge selftests/bpf: Check for correct program attach/detach in xdp_attach test libbpf: Fix type of old_fd in bpf_xdp_set_link_opts libbpf: Always specify expected_attach_type on program load if supported xsk: Add missing check on user supplied headroom size mac80211: fix channel switch trigger from unknown mesh peer mac80211: fix race in ieee80211_register_hw() net: marvell10g: soft-reset the PHY when coming out of low power net: marvell10g: report firmware version net/cxgb4: Check the return from t4_query_params properly ...
2020-04-15netfilter: flowtable: Free block_cb when being deletedRoi Dayan
Free block_cb memory when asked to be deleted. Fixes: 978703f42549 ("netfilter: flowtable: Add API for registering to flow table events") Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-09x86: update AS_* macros to binutils >=2.23, supporting ADX and AVX2Jason A. Donenfeld
Now that the kernel specifies binutils 2.23 as the minimum version, we can remove ifdefs for AVX2 and ADX throughout. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-04-09x86: probe assembler capabilities via kconfig instead of makefileJason A. Donenfeld
Doing this probing inside of the Makefiles means we have a maze of ifdefs inside the source code and child Makefiles that need to make proper decisions on this too. Instead, we do it at Kconfig time, like many other compiler and assembler options, which allows us to set up the dependencies normally for full compilation units. In the process, the ADX test changes to use %eax instead of %r10 so that it's valid in both 32-bit and 64-bit mode. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-04-07netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flagPablo Neira Ayuso
Stefano originally proposed to introduce this flag, users hit EOPNOTSUPP in new binaries with old kernels when defining a set with ranges in a concatenation. Fixes: f3a2181e16f1 ("netfilter: nf_tables: Support for sets with multiple ranged fields") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-07netfilter: nf_tables: report EOPNOTSUPP on unsupported flags/object typePablo Neira Ayuso
EINVAL should be used for malformed netlink messages. New userspace utility and old kernels might easily result in EINVAL when exercising new set features, which is misleading. Fixes: 8aeff920dcc9 ("netfilter: nf_tables: add stateful object reference to set elements") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-06netfilter: ipset: Pass lockdep expression to RCU listsAmol Grover
ip_set_type_list is traversed using list_for_each_entry_rcu outside an RCU read-side critical section but under the protection of ip_set_type_mutex. Hence, add corresponding lockdep expression to silence false-positive warnings, and harden RCU lists. Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05netfilter: nf_tables: do not leave dangling pointer in nf_tables_set_alloc_nameEric Dumazet
If nf_tables_set_alloc_name() frees set->name, we better clear set->name to avoid a future use-after-free or invalid-free. BUG: KASAN: double-free or invalid-free in nf_tables_newset+0x1ed6/0x2560 net/netfilter/nf_tables_api.c:4148 CPU: 0 PID: 28233 Comm: syz-executor.0 Not tainted 5.6.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:374 kasan_report_invalid_free+0x61/0xa0 mm/kasan/report.c:468 __kasan_slab_free+0x129/0x140 mm/kasan/common.c:455 __cache_free mm/slab.c:3426 [inline] kfree+0x109/0x2b0 mm/slab.c:3757 nf_tables_newset+0x1ed6/0x2560 net/netfilter/nf_tables_api.c:4148 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345 ___sys_sendmsg+0x100/0x170 net/socket.c:2399 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x45c849 Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fe5ca21dc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fe5ca21e6d4 RCX: 000000000045c849 RDX: 0000000000000000 RSI: 0000000020000c40 RDI: 0000000000000003 RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 000000000000095b R14: 00000000004cc0e9 R15: 000000000076bf0c Allocated by task 28233: save_stack+0x1b/0x80 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] __kasan_kmalloc mm/kasan/common.c:515 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:488 __do_kmalloc mm/slab.c:3656 [inline] __kmalloc_track_caller+0x159/0x790 mm/slab.c:3671 kvasprintf+0xb5/0x150 lib/kasprintf.c:25 kasprintf+0xbb/0xf0 lib/kasprintf.c:59 nf_tables_set_alloc_name net/netfilter/nf_tables_api.c:3536 [inline] nf_tables_newset+0x1543/0x2560 net/netfilter/nf_tables_api.c:4088 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345 ___sys_sendmsg+0x100/0x170 net/socket.c:2399 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 28233: save_stack+0x1b/0x80 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] kasan_set_free_info mm/kasan/common.c:337 [inline] __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:476 __cache_free mm/slab.c:3426 [inline] kfree+0x109/0x2b0 mm/slab.c:3757 nf_tables_set_alloc_name net/netfilter/nf_tables_api.c:3544 [inline] nf_tables_newset+0x1f73/0x2560 net/netfilter/nf_tables_api.c:4088 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345 ___sys_sendmsg+0x100/0x170 net/socket.c:2399 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8880a6032d00 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 0 bytes inside of 32-byte region [ffff8880a6032d00, ffff8880a6032d20) The buggy address belongs to the page: page:ffffea0002980c80 refcount:1 mapcount:0 mapping:ffff8880aa0001c0 index:0xffff8880a6032fc1 flags: 0xfffe0000000200(slab) raw: 00fffe0000000200 ffffea0002a3be88 ffffea00029b1908 ffff8880aa0001c0 raw: ffff8880a6032fc1 ffff8880a6032000 000000010000003e 0000000000000000 page dumped because: kasan: bad access detected Fixes: 65038428b2c6 ("netfilter: nf_tables: allow to specify stateful expression in set definition") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05netfilter: xt_IDLETIMER: target v1 - match Android layoutMaciej Żenczykowski
Android has long had an extension to IDLETIMER to send netlink messages to userspace, see: https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/include/uapi/linux/netfilter/xt_IDLETIMER.h#42 Note: this is idletimer target rev 1, there is no rev 0 in the Android common kernel sources, see registration at: https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/net/netfilter/xt_IDLETIMER.c#483 When we compare that to upstream's new idletimer target rev 1: https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git/tree/include/uapi/linux/netfilter/xt_IDLETIMER.h#n46 We immediately notice that these two rev 1 structs are the same size and layout, and that while timer_type and send_nl_msg are differently named and serve a different purpose, they're at the same offset. This makes them impossible to tell apart - and thus one cannot know in a mixed Android/vanilla environment whether one means timer_type or send_nl_msg. Since this is iptables/netfilter uapi it introduces a problem between iptables (vanilla vs Android) userspace and kernel (vanilla vs Android) if the two don't match each other. Additionally when at some point in the future Android picks up 5.7+ it's not at all clear how to resolve the resulting merge conflict. Furthermore, since upgrading the kernel on old Android phones is pretty much impossible there does not seem to be an easy way out of this predicament. The only thing I've been able to come up with is some super disgusting kernel version >= 5.7 check in the iptables binary to flip between different struct layouts. By adding a dummy field to the vanilla Linux kernel header file we can force the two structs to be compatible with each other. Long term I think I would like to deprecate send_nl_msg out of Android entirely, but I haven't quite been able to figure out exactly how we depend on it. It seems to be very similar to sysfs notifications but with some extra info. Currently it's actually always enabled whenever Android uses the IDLETIMER target, so we could also probably entirely remove it from the uapi in favour of just always enabling it, but again we can't upgrade old kernels already in the field. (Also note that this doesn't change the structure's size, as it is simply fitting into the pre-existing padding, and that since 5.7 hasn't been released yet, there's still time to make this uapi visible change) Cc: Manoj Basapathi <manojbm@codeaurora.org> Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05netfilter: nf_tables: do not update stateful expressions if lookup is invertedPablo Neira Ayuso
Initialize set lookup matching element to NULL. Otherwise, the NFT_LOOKUP_F_INV flag reverses the matching logic and it leads to deference an uninitialized pointer to the matching element. Make sure element data area and stateful expression are accessed if there is a matching set element. This patch undoes 24791b9aa1ab ("netfilter: nft_set_bitmap: initialize set element extension in lookups") which is not required anymore. Fixes: 339706bc21c1 ("netfilter: nft_lookup: update element stateful expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05netfilter: nft_set_rbtree: Drop spurious condition for overlap detection on ↵Stefano Brivio
insertion Case a1. for overlap detection in __nft_rbtree_insert() is not a valid one: start-after-start is not needed to detect any type of interval overlap and it actually results in a false positive if, while descending the tree, this is the only step we hit after starting from the root. This introduced a regression, as reported by Pablo, in Python tests cases ip/ip.t and ip/numgen.t: ip/ip.t: ERROR: line 124: add rule ip test-ip4 input ip hdrlength vmap { 0-4 : drop, 5 : accept, 6 : continue } counter: This rule should not have failed. ip/numgen.t: ERROR: line 7: add rule ip test-ip4 pre dnat to numgen inc mod 10 map { 0-5 : 192.168.10.100, 6-9 : 192.168.20.200}: This rule should not have failed. Drop case a1. and renumber others, so that they are a bit clearer. In order for these diagrams to be readily understandable, a bigger rework is probably needed, such as an ASCII art of the actual rbtree (instead of a flattened version). Shell script test sets/0044interval_overlap_0 should cover all possible cases for false negatives, so I consider that test case still sufficient after this change. v2: Fix comments for cases a3. and b3. Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Add support to specify a stateful expression in set definitions, this allows users to specify e.g. counters per set elements. 2) Flowtable software counter support. 3) Flowtable hardware offload counter support, from wenxu. 3) Parallelize flowtable hardware offload requests, from Paul Blakey. This includes a patch to add one work entry per offload command. 4) Several patches to rework nf_queue refcount handling, from Florian Westphal. 4) A few fixes for the flowtable tunnel offload: Fix crash if tunneling information is missing and set up indirect flow block as TC_SETUP_FT, patch from wenxu. 5) Stricter netlink attribute sanity check on filters, from Romain Bellan and Florent Fourcot. 5) Annotations to make sparse happy, from Jules Irenge. 6) Improve icmp errors in debugging information, from Haishuang Yan. 7) Fix warning in IPVS icmp error debugging, from Haishuang Yan. 8) Fix endianess issue in tcp extension header, from Sergey Marinkevich. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30ipvs: fix uninitialized variable warningHaishuang Yan
If outer_proto is not set, GCC warning as following: In file included from net/netfilter/ipvs/ip_vs_core.c:52: net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_in_icmp': include/net/ip_vs.h:233:4: warning: 'outer_proto' may be used uninitialized in this function [-Wmaybe-uninitialized] 233 | printk(KERN_DEBUG pr_fmt(msg), ##__VA_ARGS__); \ | ^~~~~~ net/netfilter/ipvs/ip_vs_core.c:1666:8: note: 'outer_proto' was declared here 1666 | char *outer_proto; | ^~~~~~~~~~~ Fixes: 73348fed35d0 ("ipvs: optimize tunnel dumps for icmp errors") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: nft_exthdr: fix endianness of tcp option castSergey Marinkevich
I got a problem on MIPS with Big-Endian is turned on: every time when NF trying to change TCP MSS it returns because of new.v16 was greater than old.v16. But real MSS was 1460 and my rule was like this: add rule table chain tcp option maxseg size set 1400 And 1400 is lesser that 1460, not greater. Later I founded that main causer is cast from u32 to __be16. Debugging: In example MSS = 1400(HEX: 0x578). Here is representation of each byte like it is in memory by addresses from left to right(e.g. [0x0 0x1 0x2 0x3]). LE — Little-Endian system, BE — Big-Endian, left column is type. LE BE u32: [78 05 00 00] [00 00 05 78] As you can see, u32 representation will be casted to u16 from different half of 4-byte address range. But actually nf_tables uses registers and store data of various size. Actually TCP MSS stored in 2 bytes. But registers are still u32 in definition: struct nft_regs { union { u32 data[20]; struct nft_verdict verdict; }; }; So, access like regs->data[priv->sreg] exactly u32. So, according to table presents above, per-byte representation of stored TCP MSS in register will be: LE BE (u32)regs->data[]: [78 05 00 00] [05 78 00 00] ^^ ^^ We see that register uses just half of u32 and other 2 bytes may be used for some another data. But in nft_exthdr_tcp_set_eval() it casted just like u32 -> __be16: new.v16 = src But u32 overfill __be16, so it get 2 low bytes. For clarity draw one more table(<xx xx> means that bytes will be used for cast). LE BE u32: [<78 05> 00 00] [00 00 <05 78>] (u32)regs->data[]: [<78 05> 00 00] [05 78 <00 00>] As you can see, for Little-Endian nothing changes, but for Big-endian we take the wrong half. In my case there is some other data instead of zeros, so new MSS was wrongly greater. For shooting this bug I used solution for ports ranges. Applying of this patch does not affect Little-Endian systems. Signed-off-by: Sergey Marinkevich <sergey.marinkevich@eltex-co.ru> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: flowtable: add counter support in HW offloadwenxu
Store the conntrack counters to the conntrack entry in the HW flowtable offload. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: conntrack: add nf_ct_acct_add()wenxu
Add nf_ct_acct_add function to update the conntrack counter with packets and bytes. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: nf_tables: skip set types that do not support for expressionsPablo Neira Ayuso
The bitmap set does not support for expressions, skip it from the estimation step. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: nft_dynset: validate set expression definitionPablo Neira Ayuso
If the global set expression definition mismatches the dynset expression, then bail out. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: nft_set_bitmap: initialize set element extension in lookupsPablo Neira Ayuso
Otherwise, nft_lookup might dereference an uninitialized pointer to the element extension. Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30netfilter: ctnetlink: be more strict when NF_CONNTRACK_MARK is not setRomain Bellan
When CONFIG_NF_CONNTRACK_MARK is not set, any CTA_MARK or CTA_MARK_MASK in netlink message are not supported. We should return an error when one of them is set, not both Fixes: 9306425b70bf ("netfilter: ctnetlink: must check mark attributes vs NULL") Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr> Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29netfilter: nf_queue: prefer nf_queue_entry_freeFlorian Westphal
Instead of dropping refs+kfree, use the helper added in previous patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29netfilter: nf_queue: do not release refcouts until nf_reinject is doneFlorian Westphal
nf_queue is problematic when another NF_QUEUE invocation happens from nf_reinject(). 1. nf_queue is invoked, increments state->sk refcount. 2. skb is queued, waiting for verdict. 3. sk is closed/released. 3. verdict comes back, nf_reinject is called. 4. nf_reinject drops the reference -- refcount can now drop to 0 Instead of get_ref/release_ref pattern, we need to nest the get_ref calls: get_ref get_ref release_ref release_ref So that when we invoke the next processing stage (another netfilter or the okfn()), we hold at least one reference count on the devices/socket. After previous patch, it is now safe to put the entry even after okfn() has potentially free'd the skb. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29netfilter: nf_queue: place bridge physports into queue_entry structFlorian Westphal
The refcount is done via entry->skb, which does work fine. Major problem: When putting the refcount of the bridge ports, we must always put the references while the skb is still around. However, we will need to put the references after okfn() to avoid a possible 1 -> 0 -> 1 refcount transition, so we cannot use the skb pointer anymore. Place the physports in the queue entry structure instead to allow for refcounting changes in the next patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29netfilter: nf_queue: make nf_queue_entry_release_refs staticFlorian Westphal
This is a preparation patch, no logical changes. Move free_entry into core and rename it to something more sensible. Will ease followup patches which will complicate the refcount handling. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: flowtable: Use work entry per offload commandPaul Blakey
To allow offload commands to execute in parallel, create workqueue for flow table offload, and use a work entry per offload command. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: flowtable: Use rw sem as flow block lockPaul Blakey
Currently flow offload threads are synchronized by the flow block mutex. Use rw lock instead to increase flow insertion (read) concurrency. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: nf_tables: silence a RCU-list warning in nft_table_lookup()Qian Cai
It is safe to traverse &net->nft.tables with &net->nft.commit_mutex held using list_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false positive, WARNING: suspicious RCU usage net/netfilter/nf_tables_api.c:523 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by iptables/1384: #0: ffffffff9745c4a8 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x25/0x60 [nf_tables] Call Trace: dump_stack+0xa1/0xea lockdep_rcu_suspicious+0x103/0x10d nft_table_lookup.part.0+0x116/0x120 [nf_tables] nf_tables_newtable+0x12c/0x7d0 [nf_tables] nfnetlink_rcv_batch+0x559/0x1190 [nfnetlink] nfnetlink_rcv+0x1da/0x210 [nfnetlink] netlink_unicast+0x306/0x460 netlink_sendmsg+0x44b/0x770 ____sys_sendmsg+0x46b/0x4a0 ___sys_sendmsg+0x138/0x1a0 __sys_sendmsg+0xb6/0x130 __x64_sys_sendmsg+0x48/0x50 do_syscall_64+0x69/0xf4 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: flowtable: Fix incorrect tc_setup_type typewenxu
The indirect block setup should use TC_SETUP_FT as the type instead of TC_SETUP_BLOCK. Adjust existing users of the indirect flow block infrastructure. Fixes: b5140a36da78 ("netfilter: flowtable: add indr block setup support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: flowtable: add counter supportPablo Neira Ayuso
Add a new flag to turn on flowtable counters which are stored in the conntrack entry. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: nf_tables: add enum nft_flowtable_flags to uapiPablo Neira Ayuso
Expose the NFT_FLOWTABLE_HW_OFFLOAD flag through uapi. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: conntrack: export nf_ct_acct_update()Pablo Neira Ayuso
This function allows you to update the conntrack counters. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27ipvs: optimize tunnel dumps for icmp errorsHaishuang Yan
After strip GRE/UDP tunnel header for icmp errors, it's better to show "GRE/UDP" instead of "IPIP" in debug message. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: conntrack: Add missing annotations for nf_conntrack_all_lock() ↵Jules Irenge
and nf_conntrack_all_unlock() Sparse reports warnings at nf_conntrack_all_lock() and nf_conntrack_all_unlock() warning: context imbalance in nf_conntrack_all_lock() - wrong count at exit warning: context imbalance in nf_conntrack_all_unlock() - unexpected unlock Add the missing __acquires(&nf_conntrack_locks_all_lock) Add missing __releases(&nf_conntrack_locks_all_lock) Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27netfilter: ctnetlink: Add missing annotation for ctnetlink_parse_nat_setup()Jules Irenge
Sparse reports a warning at ctnetlink_parse_nat_setup() warning: context imbalance in ctnetlink_parse_nat_setup() - unexpected unlock The root cause is the missing annotation at ctnetlink_parse_nat_setup() Add the missing __must_hold(RCU) annotation Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Overlapping header include additions in macsec.c A bug fix in 'net' overlapping with the removal of 'version' string in ena_netdev.c Overlapping test additions in selftests Makefile Overlapping PCI ID table adjustments in iwlwifi driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} buildPablo Neira Ayuso
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’: net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’ pkt->skb->tc_redirected = 1; ^~ net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’ pkt->skb->tc_from_ingress = 1; ^~ To avoid a direct dependency with tc actions from netfilter, wrap the redirect bits around CONFIG_NET_REDIRECT and move helpers to include/linux/skbuff.h. Turn on this toggle from the ifb driver, the only existing client of these bits in the tree. This patch adds skb_set_redirected() that sets on the redirected bit on the skbuff, it specifies if the packet was redirect from ingress and resets the timestamp (timestamp reset was originally missing in the netfilter bugfix). Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress") Reported-by: noreply@ellerman.id.au Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24netfilter: nft_fwd_netdev: allow to redirect to ifb via ingressPablo Neira Ayuso
Set skb->tc_redirected to 1, otherwise the ifb driver drops the packet. Set skb->tc_from_ingress to 1 to reinject the packet back to the ingress path after leaving the ifb egress path. This patch inconditionally sets on these two skb fields that are meaningful to the ifb driver. The existing forward action is guaranteed to run from ingress path. Fixes: 39e6dea28adc ("netfilter: nf_tables: add forward expression to the netdev family") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nft_fwd_netdev: validate family and chain typePablo Neira Ayuso
Make sure the forward action is only used from ingress. Fixes: 39e6dea28adc ("netfilter: nf_tables: add forward expression to the netdev family") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nft_set_rbtree: Detect partial overlaps on insertionStefano Brivio
...and return -ENOTEMPTY to the front-end in this case, instead of proceeding. Currently, nft takes care of checking for these cases and not sending them to the kernel, but if we drop the set_overlap() call in nft we can end up in situations like: # nft add table t # nft add set t s '{ type inet_service ; flags interval ; }' # nft add element t s '{ 1 - 5 }' # nft add element t s '{ 6 - 10 }' # nft add element t s '{ 4 - 7 }' # nft list set t s table ip t { set s { type inet_service flags interval elements = { 1-3, 4-5, 6-7 } } } This change has the primary purpose of making the behaviour consistent with nft_set_pipapo, but is also functional to avoid inconsistent behaviour if userspace sends overlapping elements for any reason. v2: When we meet the same key data in the tree, as start element while inserting an end element, or as end element while inserting a start element, actually check that the existing element is active, before resetting the overlap flag (Pablo Neira Ayuso) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start()Stefano Brivio
Replace negations of nft_rbtree_interval_end() with a new helper, nft_rbtree_interval_start(), wherever this helps to visualise the problem at hand, that is, for all the occurrences except for the comparison against given flags in __nft_rbtree_get(). This gets especially useful in the next patch. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nft_set_pipapo: Separate partial and complete overlap cases on ↵Stefano Brivio
insertion ...and return -ENOTEMPTY to the front-end on collision, -EEXIST if an identical element already exists. Together with the previous patch, element collision will now be returned to the user as -EEXIST. Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nf_tables: Allow set back-ends to report partial overlaps on ↵Pablo Neira Ayuso
insertion Currently, the -EEXIST return code of ->insert() callbacks is ambiguous: it might indicate that a given element (including intervals) already exists as such, or that the new element would clash with existing ones. If identical elements already exist, the front-end is ignoring this without returning error, in case NLM_F_EXCL is not set. However, if the new element can't be inserted due an overlap, we should report this to the user. To this purpose, allow set back-ends to return -ENOTEMPTY on collision with existing elements, translate that to -EEXIST, and return that to userspace, no matter if NLM_F_EXCL was set. Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19netfilter: flowtable: populate addr_type maskEdward Cree
nf_flow_rule_match() sets control.addr_type in key, so needs to also set the corresponding mask. An exact match is wanted, so mask is all ones. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19netfilter: flowtable: fix NULL pointer dereference in tunnel offload supportwenxu
The tc ct action does not cache the route in the flowtable entry. Fixes: 88bf6e4114d5 ("netfilter: flowtable: add tunnel encap/decap action offload support") Fixes: cfab6dbd0ecf ("netfilter: flowtable: add tunnel match offload support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19netfilter: flowtable: Fix flushing of offloaded flows on freePaul Blakey
Freeing a flowtable with offloaded flows, the flow are deleted from hardware but are not deleted from the flow table, leaking them, and leaving their offload bit on. Add a second pass of the disabled gc to delete the these flows from the flow table before freeing it. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19netfilter: flowtable: reload ip{v6}h in nf_flow_tuple_ip{v6}Haishuang Yan
Since pskb_may_pull may change skb->data, so we need to reload ip{v6}h at the right place. Fixes: a908fdec3dda ("netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table") Fixes: 7d2086871762 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>