summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)Author
2021-09-21netfilter: nf_tables: unlink table before deleting itFlorian Westphal
syzbot reports following UAF: BUG: KASAN: use-after-free in memcmp+0x18f/0x1c0 lib/string.c:955 nla_strcmp+0xf2/0x130 lib/nlattr.c:836 nft_table_lookup.part.0+0x1a2/0x460 net/netfilter/nf_tables_api.c:570 nft_table_lookup net/netfilter/nf_tables_api.c:4064 [inline] nf_tables_getset+0x1b3/0x860 net/netfilter/nf_tables_api.c:4064 nfnetlink_rcv_msg+0x659/0x13f0 net/netfilter/nfnetlink.c:285 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 Problem is that all get operations are lockless, so the commit_mutex held by nft_rcv_nl_event() isn't enough to stop a parallel GET request from doing read-accesses to the table object even after synchronize_rcu(). To avoid this, unlink the table first and store the table objects in on-stack scratch space. Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership") Reported-and-tested-by: syzbot+f31660cf279b0557160c@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nat: include zone id in nat table hash againFlorian Westphal
Similar to the conntrack change, also use the zone id for the nat source lists if the zone id is valid in both directions. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: conntrack: include zone id in tuple hash againFlorian Westphal
commit deedb59039f111 ("netfilter: nf_conntrack: add direction support for zones") removed the zone id from the hash value. This has implications on hash chain lengths with overlapping tuples, which can hit 64k entries on released kernels, before upper droplimit was added in d7e7747ac5c ("netfilter: refuse insertion if chain has grown too large"). With that change reverted, test script coming with this series shows linear insertion time growth: 10000 entries in 3737 ms (now 10000 total, loop 1) 10000 entries in 16994 ms (now 20000 total, loop 2) 10000 entries in 47787 ms (now 30000 total, loop 3) 10000 entries in 72731 ms (now 40000 total, loop 4) 10000 entries in 95761 ms (now 50000 total, loop 5) 10000 entries in 96809 ms (now 60000 total, loop 6) inserted 60000 entries from packet path in 333825 ms With d7e7747ac5c in place, the test fails. There are three supported zone use cases: 1. Connection is in the default zone (zone 0). This means to special config (the default). 2. Connection is in a different zone (1 to 2**16). This means rules are in place to put packets in the desired zone, e.g. derived from vlan id or interface. 3. Original direction is in zone X and Reply is in zone 0. 3) allows to use of the existing NAT port collision avoidance to provide connectivity to internet/wan even when the various zones have overlapping source networks separated via policy routing. In case the original zone is 0 all three cases are identical. There is no way to place original direction in zone x and reply in zone y (with y != 0). Zones need to be assigned manually via the iptables/nftables ruleset, before conntrack lookup occurs (raw table in iptables) using the "CT" target conntrack template support (-j CT --{zone,zone-orig,zone-reply} X). Normally zone assignment happens based on incoming interface, but could also be derived from packet mark, vlan id and so on. This means that when case 3 is used, the ruleset will typically not even assign a connection tracking template to the "reply" packets, so lookup happens in zone 0. However, it is possible that reply packets also match a ct zone assignment rule which sets up a template for zone X (X > 0) in original direction only. Therefore, after making the zone id part of the hash, we need to do a second lookup using the reply zone id if we did not find an entry on the first lookup. In practice, most deployments will either not use zones at all or the origin and reply zones are the same, no second lookup is required in either case. After this change, packet path insertion test passes with constant insertion times: 10000 entries in 1064 ms (now 10000 total, loop 1) 10000 entries in 1074 ms (now 20000 total, loop 2) 10000 entries in 1066 ms (now 30000 total, loop 3) 10000 entries in 1079 ms (now 40000 total, loop 4) 10000 entries in 1081 ms (now 50000 total, loop 5) 10000 entries in 1082 ms (now 60000 total, loop 6) inserted 60000 entries from packet path in 6452 ms Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: conntrack: make max chain length randomFlorian Westphal
Similar to commit 67d6d681e15b ("ipv4: make exception cache less predictible"): Use a random drop length to make it harder to detect when entries were hashed to same bucket list. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-14ipvs: check that ip_vs_conn_tab_bits is between 8 and 20Andrea Claudi
ip_vs_conn_tab_bits may be provided by the user through the conn_tab_bits module parameter. If this value is greater than 31, or less than 0, the shift operator used to derive tab_size causes undefined behaviour. Fix this checking ip_vs_conn_tab_bits value to be in the range specified in ipvs Kconfig. If not, simply use default value. Fixes: 6f7edb4881bf ("IPVS: Allow boot time change of hash size") Reported-by: Yi Chen <yiche@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-14netfilter: ipset: Fix oversized kvmalloc() callsJozsef Kadlecsik
The commit commit 7661809d493b426e979f39ab512e3adf41fbcc69 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed Jul 14 09:45:49 2021 -0700 mm: don't allow oversized kvmalloc() calls limits the max allocatable memory via kvmalloc() to MAX_INT. Apply the same limit in ipset. Reported-by: syzbot+3493b1873fb3ea827986@syzkaller.appspotmail.com Reported-by: syzbot+2b8443c35458a617c904@syzkaller.appspotmail.com Reported-by: syzbot+ee5cb15f4a0e85e0d54e@syzkaller.appspotmail.com Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Protect nft_ct template with global mutex, from Pavel Skripkin. 2) Two recent commits switched inet rt and nexthop exception hashes from jhash to siphash. If those two spots are problematic then conntrack is affected as well, so switch voer to siphash too. While at it, add a hard upper limit on chain lengths and reject insertion if this is hit. Patches from Florian Westphal. 3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN, from Benjamin Hesmans. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: socket: icmp6: fix use-after-scope netfilter: refuse insertion if chain has grown too large netfilter: conntrack: switch to siphash netfilter: conntrack: sanitize table size default settings netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex ==================== Link: https://lore.kernel.org/r/20210903163020.13741-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-30Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2021-08-31 We've added 116 non-merge commits during the last 17 day(s) which contain a total of 126 files changed, 6813 insertions(+), 4027 deletions(-). The main changes are: 1) Add opaque bpf_cookie to perf link which the program can read out again, to be used in libbpf-based USDT library, from Andrii Nakryiko. 2) Add bpf_task_pt_regs() helper to access userspace pt_regs, from Daniel Xu. 3) Add support for UNIX stream type sockets for BPF sockmap, from Jiang Wang. 4) Allow BPF TCP congestion control progs to call bpf_setsockopt() e.g. to switch to another congestion control algorithm during init, from Martin KaFai Lau. 5) Extend BPF iterator support for UNIX domain sockets, from Kuniyuki Iwashima. 6) Allow bpf_{set,get}sockopt() calls from setsockopt progs, from Prankur Gupta. 7) Add bpf_get_netns_cookie() helper for BPF_PROG_TYPE_{SOCK_OPS,CGROUP_SOCKOPT} progs, from Xu Liu and Stanislav Fomichev. 8) Support for __weak typed ksyms in libbpf, from Hao Luo. 9) Shrink struct cgroup_bpf by 504 bytes through refactoring, from Dave Marchevsky. 10) Fix a smatch complaint in verifier's narrow load handling, from Andrey Ignatov. 11) Fix BPF interpreter's tail call count limit, from Daniel Borkmann. 12) Big batch of improvements to BPF selftests, from Magnus Karlsson, Li Zhijian, Yucong Sun, Yonghong Song, Ilya Leoshkevich, Jussi Maki, Ilya Leoshkevich, others. 13) Another big batch to revamp XDP samples in order to give them consistent look and feel, from Kumar Kartikeya Dwivedi. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (116 commits) MAINTAINERS: Remove self from powerpc BPF JIT selftests/bpf: Fix potential unreleased lock samples: bpf: Fix uninitialized variable in xdp_redirect_cpu selftests/bpf: Reduce more flakyness in sockmap_listen bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS bpf: selftests: Add dctcp fallback test bpf: selftests: Add connect_to_fd_opts to network_helpers bpf: selftests: Add sk_state to bpf_tcp_helpers.h bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt selftests: xsk: Preface options with opt selftests: xsk: Make enums lower case selftests: xsk: Generate packets from specification selftests: xsk: Generate packet directly in umem selftests: xsk: Simplify cleanup of ifobjects selftests: xsk: Decrease sending speed selftests: xsk: Validate tx stats on tx thread selftests: xsk: Simplify packet validation in xsk tests selftests: xsk: Rename worker_* functions that are not thread entry points selftests: xsk: Disassociate umem size with packets sent selftests: xsk: Remove end-of-test packet ... ==================== Link: https://lore.kernel.org/r/20210830225618.11634-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-30netfilter: refuse insertion if chain has grown too largeFlorian Westphal
Also add a stat counter for this that gets exported both via old /proc interface and ctnetlink. Assuming the old default size of 16536 buckets and max hash occupancy of 64k, this results in 128k insertions (origin+reply), so ~8 entries per chain on average. The revised settings in this series will result in about two entries per bucket on average. This allows a hard-limit ceiling of 64. This is not tunable at the moment, but its possible to either increase nf_conntrack_buckets or decrease nf_conntrack_max to reduce average lengths. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-30netfilter: conntrack: switch to siphashFlorian Westphal
Replace jhash in conntrack and nat core with siphash. While at it, use the netns mix value as part of the input key rather than abuse the seed value. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-30netfilter: conntrack: sanitize table size default settingsFlorian Westphal
conntrack has two distinct table size settings: nf_conntrack_max and nf_conntrack_buckets. The former limits how many conntrack objects are allowed to exist in each namespace. The second sets the size of the hashtable. As all entries are inserted twice (once for original direction, once for reply), there should be at least twice as many buckets in the table than the maximum number of conntrack objects that can exist at the same time. Change the default multiplier to 1 and increase the chosen bucket sizes. This results in the same nf_conntrack_max settings as before but reduces the average bucket list length. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-30netfilter: add netfilter hooks to SRv6 data planeRyoga Saito
This patch introduces netfilter hooks for solving the problem that conntrack couldn't record both inner flows and outer flows. This patch also introduces a new sysctl toggle for enabling lightweight tunnel netfilter hooks. Signed-off-by: Ryoga Saito <contact@proelbtn.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ctnetlink: missing counters and timestamp in nfnetlink_{log,queue}Pablo Neira Ayuso
Add counters and timestamps (if available) to the conntrack object that is represented in nfnetlink_log and _queue messages. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: remove nf_exp_event_notifier structureFlorian Westphal
Reuse the conntrack event notofier struct, this allows to remove the extra register/unregister functions and avoids a pointer in struct net. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: prepare for event notifier mergeFlorian Westphal
This prepares for merge for ct and exp notifier structs. The 'fcn' member is renamed to something unique. Second, the register/unregister api is simplified. There is only one implementation so there is no need to do any error checking. Replace the EBUSY logic with WARN_ON_ONCE. This allows to remove error unwinding. The exp notifier register/unregister function is removed in a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: add common helper for nf_conntrack_eventmask_reportFlorian Westphal
nf_ct_deliver_cached_events and nf_conntrack_eventmask_report are very similar. Split nf_conntrack_eventmask_report into a common helper function that can be used for both cases. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: remove another indent levelFlorian Westphal
... by changing: if (unlikely(ret < 0 || missed)) { if (ret < 0) { to if (likely(ret >= 0 && !missed)) goto out; if (ret < 0) { After this nf_conntrack_eventmask_report and nf_ct_deliver_cached_events look pretty much the same, next patch moves common code to a helper. This patch has no effect on generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: remove one indent levelFlorian Westphal
nf_conntrack_eventmask_report and nf_ct_deliver_cached_events shared most of their code. This unifies the layout by changing if (nf_ct_is_confirmed(ct)) { foo } to if (!nf_ct_is_confirmed(ct))) return foo This removes one level of indentation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-19net: Fix offloading indirect devices dependency on qdisc order creationEli Cohen
Currently, when creating an ingress qdisc on an indirect device before the driver registered for callbacks, the driver will not have a chance to register its filter configuration callbacks. To fix that, modify the code such that it keeps track of all the ingress qdiscs that call flow_indr_dev_setup_offload(). When a driver calls flow_indr_dev_register(), go through the list of tracked ingress qdiscs and call the driver callback entry point so as to give it a chance to register its callback. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-17bpf: Refactor BPF_PROG_RUN into a functionAndrii Nakryiko
Turn BPF_PROG_RUN into a proper always inlined function. No functional and performance changes are intended, but it makes it much easier to understand what's going on with how BPF programs are actually get executed. It's more obvious what types and callbacks are expected. Also extra () around input parameters can be dropped, as well as `__` variable prefixes intended to avoid naming collisions, which makes the code simpler to read and write. This refactoring also highlighted one extra issue. BPF_PROG_RUN is both a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning BPF_PROG_RUN into a function causes naming conflict compilation error. So rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org
2021-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h 9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp") 9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins") 099fdeda659d ("bnxt_en: Event handler for PPS events") kernel/bpf/helpers.c include/linux/bpf-cgroup.h a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()") c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current") drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray") 2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer") MAINTAINERS 7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo") 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use nfnetlink_unicast() instead of netlink_unicast() in nft_compat. 2) Remove call to nf_ct_l4proto_find() in flowtable offload timeout fixup. 3) CLUSTERIP registers ARP hook on demand, from Florian. 4) Use clusterip_net to store pernet warning, also from Florian. 5) Remove struct netns_xt, from Florian Westphal. 6) Enable ebtables hooks in initns on demand, from Florian. 7) Allow to filter conntrack netlink dump per status bits, from Florian Westphal. 8) Register x_tables hooks in initns on demand, from Florian. 9) Remove queue_handler from per-netns structure, again from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutexPavel Skripkin
Syzbot hit use-after-free in nf_tables_dump_sets. The problem was in missing lock protection for nft_ct_pcpu_template_refcnt. Before commit f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") all transactions were serialized by global mutex, but then global mutex was changed to local per netnamespace commit_mutex. This change causes use-after-free bug, when 2 netnamespaces concurently changing nft_ct_pcpu_template_refcnt without proper locking. Fix it by adding nft_ct_pcpu_mutex and protect all nft_ct_pcpu_template_refcnt changes with it. Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") Reported-and-tested-by: syzbot+649e339fa6658ee623d3@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-10netfilter: nf_queue: move hookfn registration out of struct netFlorian Westphal
This was done to detect when the pernet->init() function was not called yet, by checking if net->nf.queue_handler is NULL. Once the nfnetlink_queue module is active, all struct net pointers contain the same address. So place this back in nf_queue.c. Handle the 'netns error unwind' test by checking nfnl_queue_net for a NULL pointer and add a comment for this. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-09netfilter: x_tables: never register tables by defaultFlorian Westphal
For historical reasons x_tables still register tables by default in the initial namespace. Only newly created net namespaces add the hook on demand. This means that the init_net always pays hook cost, even if no filtering rules are added (e.g. only used inside a single netns). Note that the hooks are added even when 'iptables -L' is called. This is because there is no way to tell 'iptables -A' and 'iptables -L' apart at kernel level. The only solution would be to register the table, but delay hook registration until the first rule gets added (or policy gets changed). That however means that counters are not hooked either, so 'iptables -L' would always show 0-counters even when traffic is flowing which might be unexpected. This keeps table and hook registration consistent with what is already done in non-init netns: first iptables(-save) invocation registers both table and hooks. This applies the same solution adopted for ebtables. All tables register a template that contains the l3 family, the name and a constructor function that is called when the initial table has to be added. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: translate inet ingress to netdevPablo Neira Ayuso
The NFPROTO_INET pseudofamily is not exposed through this new netlink interface. The netlink dump either shows NFPROTO_IPV4 or NFPROTO_IPV6 for NFPROTO_INET prerouting/input/forward/output/postrouting hooks. The NFNLA_CHAIN_FAMILY attribute provides the family chain, which specifies if this hook applies to inet traffic only (either IPv4 or IPv6). Translate the inet/ingress hook to netdev/ingress to fully hide the NFPROTO_INET implementation details. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: conntrack: remove offload_pickup sysctl againFlorian Westphal
These two sysctls were added because the hardcoded defaults (2 minutes, tcp, 30 seconds, udp) turned out to be too low for some setups. They appeared in 5.14-rc1 so it should be fine to remove it again. Marcelo convinced me that there should be no difference between a flow that was offloaded vs. a flow that was not wrt. timeout handling. Thus the default is changed to those for TCP established and UDP stream, 5 days and 120 seconds, respectively. Marcelo also suggested to account for the timeout value used for the offloading, this avoids increase beyond the value in the conntrack-sysctl and will also instantly expire the conntrack entry with altered sysctls. Example: nf_conntrack_udp_timeout_stream=60 nf_flowtable_udp_timeout=60 This will remove offloaded udp flows after one minute, rather than two. An earlier version of this patch also cleared the ASSURED bit to allow nf_conntrack to evict the entry via early_drop (i.e., table full). However, it looks like we can safely assume that connection timed out via HW is still in established state, so this isn't needed. Quoting Oz: [..] the hardware sends all packets with a set FIN flags to sw. [..] Connections that are aged in hardware are expected to be in the established state. In case it turns out that back-to-sw-path transition can occur for 'dodgy' connections too (e.g., one side disappeared while software-path would have been in RETRANS timeout), we can adjust this later. Cc: Oz Shlomo <ozsh@nvidia.com> Cc: Paul Blakey <paulb@nvidia.com> Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: Use same family as request messagePablo Neira Ayuso
Use the same family as the request message, for consistency. The netlink payload provides sufficient information to describe the hook object, including the family. This makes it easier to userspace to correlate the hooks are that visited by the packets for a certain family. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: use the sequence number of the request messagePablo Neira Ayuso
The sequence number allows to correlate the netlink reply message (as part of the dump) with the original request message. The cb->seq field is internally used to detect an interference (update) of the hook list during the netlink dump, do not use it as sequence number in the netlink dump header. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: missing chain familyPablo Neira Ayuso
The family is relevant for pseudo-families like NFPROTO_INET otherwise the user needs to rely on the hook function name to differentiate it from NFPROTO_IPV4 and NFPROTO_IPV6 names. Add nfnl_hook_chain_desc_attributes instead of using the existing NFTA_CHAIN_* attributes, since these do not provide a family number. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: strip off module name from hookfnPablo Neira Ayuso
NFNLA_HOOK_FUNCTION_NAME should include the hook function name only, the module name is already provided by NFNLA_HOOK_MODULE_NAME. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: conntrack: collect all entries in one cycleFlorian Westphal
Michal Kubecek reports that conntrack gc is responsible for frequent wakeups (every 125ms) on idle systems. On busy systems, timed out entries are evicted during lookup. The gc worker is only needed to remove entries after system becomes idle after a busy period. To resolve this, always scan the entire table. If the scan is taking too long, reschedule so other work_structs can run and resume from next bucket. After a completed scan, wait for 2 minutes before the next cycle. Heuristics for faster re-schedule are removed. GC_SCAN_INTERVAL could be exposed as a sysctl in the future to allow tuning this as-needed or even turn the gc worker off. Reported-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-05net: Remove redundant if statementsYajun Deng
The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05netfilter: ctnetlink: allow to filter dump by status bitsFlorian Westphal
If CTA_STATUS is present, but CTA_STATUS_MASK is not, then the mask is automatically set to 'status', so that kernel returns those entries that have all of the requested bits set. This makes more sense than using a all-one mask since we'd hardly ever find a match. There are no other checks for status bits, so if e.g. userspace sets impossible combinations it will get an empty dump. If kernel would reject unknown status bits, then a program that works on a future kernel that has IPS_FOO bit fails on old kernels. Same for 'impossible' combinations: Kernel never sets ASSURED without first having set SEEN_REPLY, but its possible that a future kernel could do so. Therefore no sanity tests other than a 0-mask. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-05netfilter: ctnetlink: add and use a helper for mark parsingFlorian Westphal
ctnetlink dumps can be filtered based on the connmark. Prepare for status bit filtering by using a named structure and by moving the mark parsing code to a helper. Else ctnetlink_alloc_filter size grows a bit too big for my taste when status handling is added. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-04netfilter: ipset: Limit the maximal range of consecutive elements to add/deleteJozsef Kadlecsik
The range size of consecutive elements were not limited. Thus one could define a huge range which may result soft lockup errors due to the long execution time. Now the range size is limited to 2^20 entries. Reported-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-01netfilter: remove xt pernet dataFlorian Westphal
clusterip is now handled via net_generic. NOTRACK is tiny compared to rest of xt_CT feature set, even the existing deprecation warning is bigger than the actual functionality. Just remove the warning, its not worth keeping/adding a net_generic one. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-01netfilter: flowtable: remove nf_ct_l4proto_find() callPablo Neira Ayuso
TCP and UDP are built-in conntrack protocol trackers and the flowtable only supports for TCP and UDP, remove this call. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-01netfilter: nft_compat: use nfnetlink_unicast()Pablo Neira Ayuso
Use nfnetlink_unicast() which already translates EAGAIN to ENOBUFS, since EAGAIN is reserved to report missing module dependencies to the nfnetlink core. e0241ae6ac59 ("netfilter: use nfnetlink_unicast() forgot to update this spot. Reported-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicting commits, all resolutions pretty trivial: drivers/bus/mhi/pci_generic.c 5c2c85315948 ("bus: mhi: pci-generic: configurable network interface MRU") 56f6f4c4eb2a ("bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean") drivers/nfc/s3fwrn5/firmware.c a0302ff5906a ("nfc: s3fwrn5: remove unnecessary label") 46573e3ab08f ("nfc: s3fwrn5: fix undefined parameter values in dev_err()") 801e541c79bb ("nfc: s3fwrn5: fix undefined parameter values in dev_err()") MAINTAINERS 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") 8a7b46fa7902 ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-23netfilter: nfnl_hook: fix unused variable warningArnd Bergmann
The only user of this variable is in an #ifdef: net/netfilter/nfnetlink_hook.c: In function 'nfnl_hook_entries_head': net/netfilter/nfnetlink_hook.c:177:28: error: unused variable 'netdev' [-Werror=unused-variable] Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-23netfilter: nft_nat: allow to specify layer 4 protocol NAT onlyPablo Neira Ayuso
nft_nat reports a bogus EAFNOSUPPORT if no layer 3 information is specified. Fixes: d07db9884a5f ("netfilter: nf_tables: introduce nft_validate_register_load()") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-23netfilter: conntrack: adjust stop timestamp to real expiry valueFlorian Westphal
In case the entry is evicted via garbage collection there is delay between the timeout value and the eviction event. This adjusts the stop value based on how much time has passed. Fixes: b87a2f9199ea82 ("netfilter: conntrack: add gc worker to remove timed-out entries") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-23netfilter: nft_last: avoid possible false sharingPablo Neira Ayuso
Use the idiom described in: https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE#it-may-improve-performance Moreover, prevent a compiler optimization. Fixes: 836382dc2471 ("netfilter: nf_tables: add last expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-23netfilter: flowtable: avoid possible false sharingPablo Neira Ayuso
The flowtable follows the same timeout approach as conntrack, use the same idiom as in cc16921351d8 ("netfilter: conntrack: avoid same-timeout update") but also include the fix provided by e37542ba111f ("netfilter: conntrack: avoid possible false sharing"). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-21net: ipv6: introduce ip6_dst_mtu_maybe_forwardVadim Fedorenko
Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and reuse this code in ip6_mtu. Actually these two functions were almost duplicates, this change will simplify the maintaince of mtu calculation code. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-17netfilter: nf_tables: fix audit memory leak in nf_tables_commitDongliang Mu
In nf_tables_commit, if nf_tables_commit_audit_alloc fails, it does not free the adp variable. Fix this by adding nf_tables_commit_audit_free which frees the linked list with the head node adl. backtrace: kmalloc include/linux/slab.h:591 [inline] kzalloc include/linux/slab.h:721 [inline] nf_tables_commit_audit_alloc net/netfilter/nf_tables_api.c:8439 [inline] nf_tables_commit+0x16e/0x1760 net/netfilter/nf_tables_api.c:8508 nfnetlink_rcv_batch+0x512/0xa80 net/netfilter/nfnetlink.c:562 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:634 [inline] nfnetlink_rcv+0x1fa/0x220 net/netfilter/nfnetlink.c:652 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] netlink_unicast+0x2c7/0x3e0 net/netlink/af_netlink.c:1340 netlink_sendmsg+0x36b/0x6b0 net/netlink/af_netlink.c:1929 sock_sendmsg_nosec net/socket.c:702 [inline] sock_sendmsg+0x56/0x80 net/socket.c:722 Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: kernel test robot <lkp@intel.com> Fixes: c520292f29b8 ("audit: log nftables configuration change events once per table") Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06netfilter: nft_last: incorrect arithmetics when restoring last usedPablo Neira Ayuso
Subtract the jiffies that have passed by to current jiffies to fix last used restoration. Fixes: 836382dc2471 ("netfilter: nf_tables: add last expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06netfilter: nft_last: honor NFTA_LAST_SET on restorationPablo Neira Ayuso
NFTA_LAST_SET tells us if this expression has ever seen a packet, do not ignore this attribute when restoring the ruleset. Fixes: 836382dc2471 ("netfilter: nf_tables: add last expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-06netfilter: conntrack: Mark access for KCSANManfred Spraul
KCSAN detected an data race with ipc/sem.c that is intentional. As nf_conntrack_lock() uses the same algorithm: Update nf_conntrack_core as well: nf_conntrack_lock() contains a1) spin_lock() a2) smp_load_acquire(nf_conntrack_locks_all). a1) actually accesses one lock from an array of locks. nf_conntrack_locks_all() contains b1) nf_conntrack_locks_all=true (normal write) b2) spin_lock() b3) spin_unlock() b2 and b3 are done for every lock. This guarantees that nf_conntrack_locks_all() prevents any concurrent nf_conntrack_lock() owners: If a thread past a1), then b2) will block until that thread releases the lock. If the threat is before a1, then b3)+a1) ensure the write b1) is visible, thus a2) is guaranteed to see the updated value. But: This is only the latest time when b1) becomes visible. It may also happen that b1) is visible an undefined amount of time before the b3). And thus KCSAN will notice a data race. In addition, the compiler might be too clever. Solution: Use WRITE_ONCE(). Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>