summaryrefslogtreecommitdiff
path: root/include/net/netfilter
AgeCommit message (Collapse)Author
2023-04-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-04-21 We've added 71 non-merge commits during the last 8 day(s) which contain a total of 116 files changed, 13397 insertions(+), 8896 deletions(-). The main changes are: 1) Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward, from Florian Westphal. 2) Fix race between btf_put and btf_idr walk which caused a deadlock, from Alexei Starovoitov. 3) Second big batch to migrate test_verifier unit tests into test_progs for ease of readability and debugging, from Eduard Zingerman. 4) Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree, from Dave Marchevsky. 5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF selftests into libbpf-provided bpf_helpers.h header and improve kfunc handling, from Andrii Nakryiko. 6) Support 64-bit pointers to kfuncs needed for archs like s390x, from Ilya Leoshkevich. 7) Support BPF progs under getsockopt with a NULL optval, from Stanislav Fomichev. 8) Improve verifier u32 scalar equality checking in order to enable LLVM transformations which earlier had to be disabled specifically for BPF backend, from Yonghong Song. 9) Extend bpftool's struct_ops object loading to support links, from Kui-Feng Lee. 10) Add xsk selftest follow-up fixes for hugepage allocated umem, from Magnus Karlsson. 11) Support BPF redirects from tc BPF to ifb devices, from Daniel Borkmann. 12) Add BPF support for integer type when accessing variable length arrays, from Feng Zhou. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits) selftests/bpf: verifier/value_ptr_arith converted to inline assembly selftests/bpf: verifier/value_illegal_alu converted to inline assembly selftests/bpf: verifier/unpriv converted to inline assembly selftests/bpf: verifier/subreg converted to inline assembly selftests/bpf: verifier/spin_lock converted to inline assembly selftests/bpf: verifier/sock converted to inline assembly selftests/bpf: verifier/search_pruning converted to inline assembly selftests/bpf: verifier/runtime_jit converted to inline assembly selftests/bpf: verifier/regalloc converted to inline assembly selftests/bpf: verifier/ref_tracking converted to inline assembly selftests/bpf: verifier/map_ptr_mixing converted to inline assembly selftests/bpf: verifier/map_in_map converted to inline assembly selftests/bpf: verifier/lwt converted to inline assembly selftests/bpf: verifier/loops1 converted to inline assembly selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly selftests/bpf: verifier/direct_packet_access converted to inline assembly selftests/bpf: verifier/d_path converted to inline assembly selftests/bpf: verifier/ctx converted to inline assembly selftests/bpf: verifier/btf_ctx_access converted to inline assembly selftests/bpf: verifier/bpf_get_stack converted to inline assembly ... ==================== Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-22netfilter: nf_tables: support for adding new devices to an existing netdev chainPablo Neira Ayuso
This patch allows users to add devices to an existing netdev chain. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22netfilter: nf_tables: do not store rule in traceinfo structureFlorian Westphal
pass it as argument instead. This reduces size of traceinfo to 16 bytes. Total stack usage: nf_tables_core.c:252 nft_do_chain 304 static While its possible to also pass basechain as argument, doing so increases nft_do_chaininfo function size. Unlike pktinfo/verdict/rule the basechain info isn't used in the expression evaluation path. gcc places it on the stack, which results in extra push/pop when it gets passed to the trace helpers as argument rather than as part of the traceinfo structure. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22netfilter: nf_tables: do not store verdict in traceinfo structureFlorian Westphal
Just pass it as argument to nft_trace_notify. Stack is reduced by 8 bytes: nf_tables_core.c:256 nft_do_chain 312 static Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22netfilter: nf_tables: do not store pktinfo in traceinfo structureFlorian Westphal
pass it as argument. No change in object size. stack usage decreases by 8 byte: nf_tables_core.c:254 nft_do_chain 320 static Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22netfilter: nf_tables: make validation state per tableFlorian Westphal
We only need to validate tables that saw changes in the current transaction. The existing code revalidates all tables, but this isn't needed as cross-table jumps are not allowed (chains have table scope). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-22netfilter: nf_tables: don't store chain address on jumpFlorian Westphal
Now that the rule trailer/end marker and the rcu head reside in the same structure, we no longer need to save/restore the chain pointer when performing/returning from a jump. We can simply let the trace infra walk the evaluated rule until it hits the end marker and then fetch the chain pointer from there. When the rule is NULL (policy tracing), then chain and basechain pointers were already identical, so just use the basechain. This cuts size of jumpstack in half, from 256 to 128 bytes in 64bit, scripts/stackusage says: nf_tables_core.c:251 nft_do_chain 328 static Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-21bpf: minimal support for programs hooked into netfilter frameworkFlorian Westphal
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs that will be invoked via the NF_HOOK() points in the ip stack. Invocation incurs an indirect call. This is not a necessity: Its possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the program invocation with the same method already done for xdp progs. This isn't done here to keep the size of this chunk down. Verifier restricts verdicts to either DROP or ACCEPT. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21bpf: add bpf_link support for BPF_NETFILTER programsFlorian Westphal
Add bpf_link support skeleton. To keep this reviewable, no bpf program can be invoked yet, if a program is attached only a c-stub is called and not the actual bpf program. Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig. Uapi example usage: union bpf_attr attr = { }; attr.link_create.prog_fd = progfd; attr.link_create.attach_type = 0; /* unused */ attr.link_create.netfilter.pf = PF_INET; attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN; attr.link_create.netfilter.priority = -128; err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr)); ... this would attach progfd to ipv4:input hook. Such hook gets removed automatically if the calling program exits. BPF_NETFILTER program invocation is added in followup change. NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it allows to tell userspace which program is attached at the given hook when user runs 'nft hook list' command rather than just the priority and not-very-helpful 'this hook runs a bpf prog but I can't tell which one'. Will also be used to disallow registration of two bpf programs with same priority in a followup patch. v4: arm32 cmpxchg only supports 32bit operand s/prio/priority/ v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if more use cases pop up (arptables, ebtables, netdev ingress/egress etc). Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Adjacent changes: net/mptcp/protocol.h 63740448a32e ("mptcp: fix accept vs worker race") 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19netfilter: conntrack: fix wrong ct->timeout valueTzung-Bi Shih
(struct nf_conn)->timeout is an interval before the conntrack confirmed. After confirmed, it becomes a timestamp. It is observed that timeout of an unconfirmed conntrack: - Set by calling ctnetlink_change_timeout(). As a result, `nfct_time_stamp` was wrongly added to `ct->timeout` twice. - Get by calling ctnetlink_dump_timeout(). As a result, `nfct_time_stamp` was wrongly subtracted. Call Trace: <TASK> dump_stack_lvl ctnetlink_dump_timeout __ctnetlink_glue_build ctnetlink_glue_build __nfqnl_enqueue_packet nf_queue nf_hook_slow ip_mc_output ? __pfx_ip_finish_output ip_send_skb ? __pfx_dst_output udp_send_skb udp_sendmsg ? __pfx_ip_generic_getfrag sock_sendmsg Separate the 2 cases in: - Setting `ct->timeout` in __nf_ct_set_timeout(). - Getting `ct->timeout` in ctnetlink_dump_timeout(). Pablo appends: Update ctnetlink to set up the timeout _after_ the IPS_CONFIRMED flag is set on, otherwise conntrack creation via ctnetlink breaks. Note that the problem described in this patch occurs since the introduction of the nfnetlink_queue conntrack support, select a sufficiently old Fixes: tag for -stable kernel to pick up this fix. Fixes: a4b4766c3ceb ("netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack info") Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-18netfilter: nf_tables: validate catch-all set elementsPablo Neira Ayuso
catch-all set element might jump/goto to chain that uses expressions that require validation. Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-22netfilter: nft_redir: use `struct nf_nat_range2` throughout and deduplicate ↵Jeremy Sowden
eval call-backs `nf_nat_redirect_ipv4` takes a `struct nf_nat_ipv4_multi_range_compat`, but converts it internally to a `struct nf_nat_range2`. Change the function to take the latter, factor out the code now shared with `nf_nat_redirect_ipv6`, move the conversion to the xt_REDIRECT module, and update the ipv4 range initialization in the nft_redir module. Replace a bare hex constant for 127.0.0.1 with a macro. Remove `WARN_ON`. `nf_nat_setup_info` calls `nf_ct_is_confirmed`: /* Can't setup nat info for confirmed ct. */ if (nf_ct_is_confirmed(ct)) return NF_ACCEPT; This means that `ct` cannot be null or the kernel will crash, and implies that `ctinfo` is `IP_CT_NEW` or `IP_CT_RELATED`. nft_redir has separate ipv4 and ipv6 call-backs which share much of their code, and an inet one switch containing a switch that calls one of the others based on the family of the packet. Merge the ipv4 and ipv6 ones into the inet one in order to get rid of the duplicate code. Const-qualify the `priv` pointer since we don't need to write through it. Assign `priv->flags` to the range instead of OR-ing it in. Set the `NF_NAT_RANGE_PROTO_SPECIFIED` flag once during init, rather than on every eval. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-06netfilter: tproxy: fix deadlock due to missing BH disableFlorian Westphal
The xtables packet traverser performs an unconditional local_bh_disable(), but the nf_tables evaluation loop does not. Functions that are called from either xtables or nftables must assume that they can be called in process context. inet_twsk_deschedule_put() assumes that no softirq interrupt can occur. If tproxy is used from nf_tables its possible that we'll deadlock trying to aquire a lock already held in process context. Add a small helper that takes care of this and use it. Link: https://lore.kernel.org/netfilter-devel/401bd6ed-314a-a196-1cdc-e13c720cc8f2@balasys.hu/ Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support") Reported-and-tested-by: Major Dávid <major.david@balasys.hu> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Add safeguard to check for NULL tupe in objects updates via NFT_MSG_NEWOBJ, this should not ever happen. From Alok Tiwari. 2) Incorrect pointer check in the new destroy rule command, from Yang Yingliang. 3) Incorrect status bitcheck in nf_conntrack_udp_packet(), from Florian Westphal. 4) Simplify seq_print_acct(), from Ilia Gavrilov. 5) Use 2-arg optimal variant of kfree_rcu() in IPVS, from Julian Anastasov. 6) TCP connection enters CLOSE state in conntrack for locally originated TCP reset packet from the reject target, from Florian Westphal. The fixes #2 and #3 in this series address issues from the previous pull nf-next request in this net-next cycle. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-17netfilter: let reset rules clean out conntrack entriesFlorian Westphal
iptables/nftables support responding to tcp packets with tcp resets. The generated tcp reset packet passes through both output and postrouting netfilter hooks, but conntrack will never see them because the generated skb has its ->nfct pointer copied over from the packet that triggered the reset rule. If the reset rule is used for established connections, this may result in the conntrack entry to be around for a very long time (default timeout is 5 days). One way to avoid this would be to not copy the nf_conn pointer so that the rest packet passes through conntrack too. Problem is that output rules might not have the same conntrack zone setup as the prerouting ones, so its possible that the reset skb won't find the correct entry. Generating a template entry for the skb seems error prone as well. Add an explicit "closing" function that switches a confirmed conntrack entry to closed state and wire this up for tcp. If the entry isn't confirmed, no action is needed because the conntrack entry will never be committed to the table. Reported-by: Russel King <linux@armlinux.org.uk> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-02-10net: extract nf_ct_handle_fragments to nf_conntrack_ovsXin Long
Now handle_fragments() in OVS and TC have the similar code, and this patch removes the duplicate code by moving the function to nf_conntrack_ovs. Note that skb_clear_hash(skb) or skb->ignore_df = 1 should be done only when defrag returns 0, as it does in other places in kernel. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10net: extract nf_ct_skb_network_trim function to nf_conntrack_ovsXin Long
There are almost the same code in ovs_skb_network_trim() and tcf_ct_skb_network_trim(), this patch extracts them into a function nf_ct_skb_network_trim() and moves the function to nf_conntrack_ovs. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-03netfilter: flowtable: cache info of last offloadVlad Buslov
Modify flow table offload to cache the last ct info status that was passed to the driver offload callbacks by extending enum nf_flow_flags with new "NF_FLOW_HW_ESTABLISHED" flag. Set the flag if ctinfo was 'established' during last act_ct meta actions fill call. This infrastructure change is necessary to optimize promoting of UDP connections from 'new' to 'established' in following patches in this series. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03netfilter: flowtable: allow unidirectional rulesVlad Buslov
Modify flow table offload to support unidirectional connections by extending enum nf_flow_flags with new "NF_FLOW_HW_BIDIRECTIONAL" flag. Only offload reply direction when the flag is set. This infrastructure change is necessary to support offloading UDP NEW connections in original direction in following patches in series. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-01netfilter: use skb_ip_totlen and iph_totlenXin Long
There are also quite some places in netfilter that may process IPv4 TCP GSO packets, we need to replace them too. In length_mt(), we have to use u_int32_t/int to accept skb_ip_totlen() return value, otherwise it may overflow and mismatch. This change will also help us add selftest for IPv4 BIG TCP in the following patch. Note that we don't need to replace the one in tcpmss_tg4(), as it will return if there is data after tcphdr in tcpmss_mangle_packet(). The same in mangle_contents() in nf_nat_helper.c, it returns false when skb->len + extra > 65535 in enlarge_skb(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-18netfilter: nf_tables: avoid retpoline overhead for some ct expression callsFlorian Westphal
nft_ct expression cannot be made builtin to nf_tables without also forcing the conntrack itself to be builtin. However, this can be avoided by splitting retrieval of a few selector keys that only need to access the nf_conn structure, i.e. no function calls to nf_conntrack code. Many rulesets start with something like "ct status established,related accept" With this change, this no longer requires an indirect call, which gives about 1.8% more throughput with a simple conntrack-enabled forwarding test (retpoline thunk used). Signed-off-by: Florian Westphal <fw@strlen.de>
2023-01-18netfilter: nf_tables: avoid retpoline overhead for objref callsFlorian Westphal
objref expression is builtin, so avoid calls to it for RETOLINE=y builds. Signed-off-by: Florian Westphal <fw@strlen.de>
2022-12-22netfilter: nf_tables: honor set timeout and garbage collection updatesPablo Neira Ayuso
Set timeout and garbage collection interval updates are ignored on updates. Add transaction to update global set element timeout and garbage collection interval. Fixes: 96518518cc41 ("netfilter: add nftables") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-21netfilter: nf_tables: consolidate set descriptionPablo Neira Ayuso
Add the following fields to the set description: - key type - data type - object type - policy - gc_int: garbage collection interval) - timeout: element timeout This prepares for stricter set type checks on updates in a follow up patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-12-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter. 2) Add DATA_SENT state to SCTP connection tracking helper, from Sriram Yagnaraman. 3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal. 4) Add bitmask support for ipset, from Vishwanath Pai. 5) Handle icmpv6 redirects as RELATED, from Florian Westphal. 6) Add WARN_ON_ONCE() to impossible case in flowtable datapath, from Li Qiong. 7) A large batch of IPVS updates to replace timer-based estimators by kthreads to scale up wrt. CPUs and workload (millions of estimators). Julian Anastasov says: This patchset implements stats estimation in kthread context. It replaces the code that runs on single CPU in timer context every 2 seconds and causing latency splats as shown in reports [1], [2], [3]. The solution targets setups with thousands of IPVS services, destinations and multi-CPU boxes. Spread the estimation on multiple (configured) CPUs and multiple time slots (timer ticks) by using multiple chains organized under RCU rules. When stats are not needed, it is recommended to use run_estimation=0 as already implemented before this change. RCU Locking: - As stats are now RCU-locked, tot_stats, svc and dest which hold estimator structures are now always freed from RCU callback. This ensures RCU grace period after the ip_vs_stop_estimator() call. Kthread data: - every kthread works over its own data structure and all such structures are attached to array. For now we limit kthreads depending on the number of CPUs. - even while there can be a kthread structure, its task may not be running, eg. before first service is added or while the sysctl var is set to an empty cpulist or when run_estimation is set to 0 to disable the estimation. - the allocated kthread context may grow from 1 to 50 allocated structures for timer ticks which saves memory for setups with small number of estimators - a task and its structure may be released if all estimators are unlinked from its chains, leaving the slot in the array empty - every kthread data structure allows limited number of estimators. Kthread 0 is also used to initially calculate the max number of estimators to allow in every chain considering a sub-100 microsecond cond_resched rate. This number can be from 1 to hundreds. - kthread 0 has an additional job of optimizing the adding of estimators: they are first added in temp list (est_temp_list) and later kthread 0 distributes them to other kthreads. The optimization is based on the fact that newly added estimator should be estimated after 2 seconds, so we have the time to offload the adding to chain from controlling process to kthread 0. - to add new estimators we use the last added kthread context (est_add_ktid). The new estimators are linked to the chains just before the estimated one, based on add_row. This ensures their estimation will start after 2 seconds. If estimators are added in bursts, common case if all services and dests are initially configured, we may spread the estimators to more chains and as result, reducing the initial delay below 2 seconds. Many thanks to Jiri Wiesner for his valuable comments and for spending a lot of time reviewing and testing the changes on different platforms with 48-256 CPUs and 1-8 NUMA nodes under different cpufreq governors. The new IPVS estimators do not use workqueue infrastructure because: - The estimation can take long time when using multiple IPVS rules (eg. millions estimator structures) and especially when box has multiple CPUs due to the for_each_possible_cpu usage that expects packets from any CPU. With est_nice sysctl we have more control how to prioritize the estimation kthreads compared to other processes/kthreads that have latency requirements (such as servers). As a benefit, we can see these kthreads in top and decide if we will need some further control to limit their CPU usage (max number of structure to estimate per kthread). - with kthreads we run code that is read-mostly, no write/lock operations to process the estimators in 2-second intervals. - work items are one-shot: as estimators are processed every 2 seconds, they need to be re-added every time. This again loads the timers (add_timer) if we use delayed works, as there are no kthreads to do the timings. [1] Report from Yunhong Jiang: https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/ [2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2 [3] Report from Dust: https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: ipvs: run_estimation should control the kthread tasks ipvs: add est_cpulist and est_nice sysctl vars ipvs: use kthreads for stats estimation ipvs: use u64_stats_t for the per-cpu counters ipvs: use common functions for stats allocation ipvs: add rcu protection to stats netfilter: flowtable: add a 'default' case to flowtable datapath netfilter: conntrack: set icmpv6 redirects as RELATED netfilter: ipset: Add support for new bitmask parameter netfilter: conntrack: merge ipv4+ipv6 confirm functions netfilter: conntrack: add sctp DATA_SENT state netfilter: nft_inner: fix IS_ERR() vs NULL check ==================== Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12net: move the nat function to nf_nat_ovs for ovs and tcXin Long
There are two nat functions are nearly the same in both OVS and TC code, (ovs_)ct_nat_execute() and ovs_ct_nat/tcf_ct_act_nat(). This patch creates nf_nat_ovs.c under netfilter and moves them there then exports nf_ct_nat() so that it can be shared by both OVS and TC, and keeps the nat (type) check and nat flag update in OVS and TC's own place, as these parts are different between OVS and TC. Note that in OVS nat function it was using skb->protocol to get the proto as it already skips vlans in key_extract(), while it doesn't in TC, and TC has to call skb_protocol() to get proto. So in nf_ct_nat_execute(), we keep using skb_protocol() which works for both OVS and TC contrack. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-30netfilter: conntrack: merge ipv4+ipv6 confirm functionsFlorian Westphal
No need to have distinct functions. After merge, ipv6 can avoid protooff computation if the connection neither needs sequence adjustment nor helper invocation -- this is the normal case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next 1) Fix sparse warning in the new nft_inner expression, reported by Jakub Kicinski. 2) Incorrect vlan header check in nft_inner, from Peng Wu. 3) Two patches to pass reset boolean to expression dump operation, in preparation for allowing to reset stateful expressions in rules. This adds a new NFT_MSG_GETRULE_RESET command. From Phil Sutter. 4) Inconsistent indentation in nft_fib, from Jiapeng Chong. 5) Speed up siphash calculation in conntrack, from Florian Westphal. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: conntrack: use siphash_4u64 netfilter: rpfilter/fib: clean up some inconsistent indenting netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters netfilter: nft_inner: fix return value check in nft_inner_parse_l2l3() netfilter: nft_payload: use __be16 to store gre version ==================== Link: https://lore.kernel.org/r/20221115095922.139954-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESETPhil Sutter
Analogous to NFT_MSG_GETOBJ_RESET, but for rules: Reset stateful expressions like counters or quotas. The latter two are the only consumers, adjust their 'dump' callbacks to respect the parameter introduced earlier. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-15netfilter: nf_tables: Extend nft_expr_ops::dump callback parametersPhil Sutter
Add a 'reset' flag just like with nft_object_ops::dump. This will be useful to reset "anonymous stateful objects", e.g. simple rule counters. No functional change intended. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-08net: move add ct helper function to nf_conntrack_helper for ovs and tcXin Long
Move ovs_ct_add_helper from openvswitch to nf_conntrack_helper and rename as nf_ct_add_helper, so that it can be used in TC act_ct in the next patch. Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-08net: move the ct helper function to nf_conntrack_helper for ovs and tcXin Long
Move ovs_ct_helper from openvswitch to nf_conntrack_helper and rename as nf_ct_helper so that it can be used in TC act_ct in the next patch. Note that it also adds the checks for the family and proto, as in TC act_ct, the packets with correct family and proto are not guaranteed. Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-25netfilter: nft_meta: add inner match supportPablo Neira Ayuso
Add support for inner meta matching on: - NFT_META_PROTOCOL: to match on the ethertype, this can be used regardless tunnel protocol provides no link layer header, in that case nft_inner sets on the ethertype based on the IP header version field. - NFT_META_L4PROTO: to match on the layer 4 protocol. These meta expression are usually autogenerated as dependencies by userspace nftables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_inner: add percpu inner contextPablo Neira Ayuso
Add NFT_PKTINFO_INNER_FULL flag to annotate that inner offsets are available. Store nft_inner_tun_ctx object in percpu area to cache existing inner offsets for this skbuff. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_inner: support for inner tunnel header matchingPablo Neira Ayuso
This new expression allows you to match on the inner headers that are encapsulated by any of the existing tunneling protocols. This expression parses the inner packet to set the link, network and transport offsets, so the existing expressions (with a few updates) can be reused to match on the inner headers. The inner expression supports for different tunnel combinations such as: - ethernet frame over IPv4/IPv6 packet, eg. VxLAN. - IPv4/IPv6 packet over IPv4/IPv6 packet, eg. IPIP. - IPv4/IPv6 packet over IPv4/IPv6 + transport header, eg. GRE. - transport header (ESP or SCTP) over transport header (usually UDP) The following fields are used to describe the tunnel protocol: - flags, which describe how to parse the inner headers: NFT_PAYLOAD_CTX_INNER_TUN, the tunnel provides its own header. NFT_PAYLOAD_CTX_INNER_ETHER, the ethernet frame is available as inner header. NFT_PAYLOAD_CTX_INNER_NH, the network header is available as inner header. NFT_PAYLOAD_CTX_INNER_TH, the transport header is available as inner header. For example, VxLAN sets on all of these flags. While GRE only sets on NFT_PAYLOAD_CTX_INNER_NH and NFT_PAYLOAD_CTX_INNER_TH. Then, ESP over UDP only sets on NFT_PAYLOAD_CTX_INNER_TH. The tunnel description is composed of the following attributes: - header size: in case the tunnel comes with its own header, eg. VxLAN. - type: this provides a hint to userspace on how to delinearize the rule. This is useful for VxLAN and Geneve since they run over UDP, since transport does not provide a hint. This is also useful in case hardware offload is ever supported. The type is not currently interpreted by the kernel. - expression: currently only payload supported. Follow up patch adds also inner meta support which is required by autogenerated dependencies. The exthdr expression should be supported too at some point. There is a new inner_ops operation that needs to be set on to allow to use an existing expression from the inner expression. This patch adds a new NFT_PAYLOAD_TUN_HEADER base which allows to match on the tunnel header fields, eg. vxlan vni. The payload expression is embedded into nft_inner private area and this private data area is passed to the payload inner eval function via direct call. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_objref: make it builtinFlorian Westphal
nft_objref is needed to reference named objects, it makes no sense to disable it. Before: text data bss dec filename 4014 424 0 4438 nft_objref.o 4174 1128 0 5302 nft_objref.ko 359351 15276 864 375491 nf_tables.ko After: text data bss dec filename 3815 408 0 4223 nft_objref.o 363161 15692 864 379717 nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nf_tables: reduce nft_pktinfo by 8 bytesFlorian Westphal
structure is reduced from 32 to 24 bytes. While at it, also check that iphdrlen is sane, this is guaranteed for NFPROTO_IPV4 but not for ingress or bridge, so add checks for this. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_payload: move struct nft_payload_set definition where it belongsPablo Neira Ayuso
Not required to expose this header in nf_tables_core.h, move it to where it is used, ie. nft_payload. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-11treewide: use get_random_u32() when possibleJason A. Donenfeld
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-03Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-03 We've added 143 non-merge commits during the last 27 day(s) which contain a total of 151 files changed, 8321 insertions(+), 1402 deletions(-). The main changes are: 1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu. 2) Add support for struct-based arguments for trampoline based BPF programs, from Yonghong Song. 3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa. 4) Batch of improvements to veristat selftest tool in particular to add CSV output, a comparison mode for CSV outputs and filtering, from Andrii Nakryiko. 5) Add preparatory changes needed for the BPF core for upcoming BPF HID support, from Benjamin Tissoires. 6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program types, from Daniel Xu. 7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler. 8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer / single-kernel-consumer semantics for BPF ring buffer, from David Vernet. 9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF hashtab's bucket lock, from Hou Tao. 10) Allow creating an iterator that loops through only the resources of one task/thread instead of all, from Kui-Feng Lee. 11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi. 12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT entry which is not yet inserted, from Lorenzo Bianconi. 13) Remove invalid recursion check for struct_ops for TCP congestion control BPF programs, from Martin KaFai Lau. 14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu. 15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa. 16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen. 17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu. 18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta. 19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits) net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Documentation: bpf: Add implementation notes documentations to table of contents bpf, docs: Delete misformatted table. selftests/xsk: Fix double free bpftool: Fix error message of strerror libbpf: Fix overrun in netlink attribute iteration selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged" samples/bpf: Fix typo in xdp_router_ipv4 sample bpftool: Remove unused struct event_ring_info bpftool: Remove unused struct btf_attach_point bpf, docs: Add TOC and fix formatting. bpf, docs: Add Clang note about BPF_ALU bpf, docs: Move Clang notes to a separate file bpf, docs: Linux byteswap note bpf, docs: Move legacy packet instructions to a separate file selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline ... ==================== Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.cLorenzo Bianconi
Remove circular dependency between nf_nat module and nf_conntrack one moving bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Fixes: 0fabd2aa199f ("net: netfilter: add bpf_ct_set_nat_info kfunc helper") Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Yauheni Kaliuta <ykaliuta@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/51a65513d2cda3eeb0754842e8025ab3966068d8.1664490511.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-20bpf: Move nf_conn extern declarations to filter.hDaniel Xu
We're seeing the following new warnings on netdev/build_32bit and netdev/build_allmodconfig_warn CI jobs: ../net/core/filter.c:8608:1: warning: symbol 'nf_conn_btf_access_lock' was not declared. Should it be static? ../net/core/filter.c:8611:5: warning: symbol 'nfct_bsa' was not declared. Should it be static? Fix by ensuring extern declaration is present while compiling filter.o. Fixes: 864b656f82cc ("bpf: Add support for writing to nf_conn:mark") Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/2bd2e0283df36d8a4119605878edb1838d144174.1663683114.git.dxu@dxuuu.xyz Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-20bpf: Rename nfct_bsa to nfct_btf_struct_accessDaniel Xu
The former name was a little hard to guess. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/73adc72385c8b162391fbfb404f0b6d4c5cc55d7.1663683114.git.dxu@dxuuu.xyz Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-20bpf: Remove unused btf_struct_access stubDaniel Xu
This stub was not being used anywhere. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/590e7bd6172ffe0f3d7b51cd40e8ded941aaf7e8.1663683114.git.dxu@dxuuu.xyz Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-10bpf: Add support for writing to nf_conn:markDaniel Xu
Support direct writes to nf_conn:mark from TC and XDP prog types. This is useful when applications want to store per-connection metadata. This is also particularly useful for applications that run both bpf and iptables/nftables because the latter can trivially access this metadata. One example use case would be if a bpf prog is responsible for advanced packet classification and iptables/nftables is later used for routing due to pre-existing/legacy code. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/ebca06dea366e3e7e861c12f375a548cc4c61108.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller
Florian Westphal says: ==================== The following set contains changes for your *net-next* tree: - make conntrack ignore packets that are delayed (containing data already acked). The current behaviour to flag them as INVALID causes more harm than good, let them pass so peer can send an immediate ACK for the most recent sequence number. - make conntrack recognize when both peers have sent 'invalid' FINs: This helps cleaning out stale connections faster for those cases where conntrack is no longer in sync with the actual connection state. - Now that DECNET is gone, we don't need to reserve space for DECNET related information. - compact common 'find a free port number for the new inbound connection' code and move it to a helper, then cap number of tries the new helper will make until it gives up. - replace various instances of strlcpy with strscpy, from Wolfram Sang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
drivers/net/ethernet/freescale/fec.h 7d650df99d52 ("net: fec: add pm_qos support on imx6q platform") 40c79ce13b03 ("net: fec: add stop mode support for imx8 platform") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-07netfilter: nat: move repetitive nat port reserve loop to a helperFlorian Westphal
Almost all nat helpers reserve an expecation port the same way: Try the port inidcated by the peer, then move to next port if that port is already in use. We can squash this into a helper. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-31netfilter: remove nf_conntrack_helper sysctl and modparam togglesPablo Neira Ayuso
__nf_ct_try_assign_helper() remains in place but it now requires a template to configure the helper. A toggle to disable automatic helper assignment was added by: a9006892643a ("netfilter: nf_ct_helper: allow to disable automatic helper assignment") in 2012 to address the issues described in "Secure use of iptables and connection tracking helpers". Automatic conntrack helper assignment was disabled by: 3bb398d925ec ("netfilter: nf_ct_helper: disable automatic helper assignment") back in 2016. This patch removes the sysctl and modparam toggles, users now have to rely on explicit conntrack helper configuration via ruleset. Update tools/testing/selftests/netfilter/nft_conntrack_helper.sh to check that auto-assignment does not happen anymore. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>