path: root/net
Age    Commit message    Author
2023-10-25  ipv6: refactor ip6_finish_output for GSO handling  (Yan Zhai)
Separate GSO and non-GSO packet handling to make the logic cleaner. For GSO packets, the frag_max_size check can be omitted because it is only useful for packets defragmented by netfilter hooks. Neither the local output path nor the GRO logic produces GSO packets when defragmentation is needed. This also mirrors what the IPv4 side code does. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/0e1d4599f858e2becff5c4fe0b5f843236bc3fe8.1698156966.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25  ipv6: drop feature RTAX_FEATURE_ALLFRAG  (Yan Zhai)
RTAX_FEATURE_ALLFRAG was added before the first git commit: https://www.mail-archive.com/bk-commits-head@vger.kernel.org/msg03399.html The feature would send packets to the fragmentation path if a box receives a PMTU value of less than 1280 bytes. However, since commit 9d289715eb5c ("ipv6: stop sending PTB packets for MTU < 1280"), such messages are simply discarded. The feature flag is not supported by the iproute2 utility either. In theory one can still manipulate it with a direct netlink message, but that is not ideal because it was based on the obsolete guidance of RFC 2460 (replaced by RFC 8200). The feature always tests false at the moment, so remove the related code or mark it as unused. Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/d78e44dcd9968a252143ffe78460446476a472a1.1698156966.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25  Merge tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf  (Jakub Kicinski)
Pablo Neira Ayuso says: ==================== Netfilter fixes for net This pull request contains two late Netfilter flowtable fixes for net: 1) Flowtable GC pushes back packets to the classic path in every GC run, i.e. every second. This is because NF_FLOW_HW_ESTABLISHED is only used by sched/act_ct (never set) and IPS_SEEN_REPLY might be unset by the time the flow is offloaded (this status bit is only reliable in the sched/act_ct datapath). 2) The sched/act_ct logic that pushes back packets to the classic path to reevaluate whether a UDP flow is unidirectional only applies if IPS_HW_OFFLOAD_BIT is set and no hardware offload request is pending to be handled. From Vlad Buslov. These two patches fix two problems that were introduced in the previous 6.5 development cycle. * tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: net/sched: act_ct: additional checks for outdated flows netfilter: flowtable: GC pushes back packets to classic path ==================== Link: https://lore.kernel.org/r/20231025100819.2664-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25  vsock/virtio: initialize the_virtio_vsock before using VQs  (Alexandru Matei)
Once the VQs are filled with empty buffers and we kick the host, it can send connection requests. If the_virtio_vsock has not been initialized by then, replies are silently dropped and never reach the host. virtio_transport_send_pkt() can queue packets once the_virtio_vsock is set, but they won't be processed until vsock->tx_run is set to true. We queue vsock->send_pkt_work when initialization finishes to send those packets queued earlier. Fixes: 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock") Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20231024191742.14259-1-alexandru.matei@uipath.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-26  9p: v9fs_listxattr: fix %s null argument warning  (Dominique Martinet)
W=1 warns about a null argument to a printf-style debug call: In file included from fs/9p/xattr.c:12: In function ‘v9fs_xattr_get’, inlined from ‘v9fs_listxattr’ at fs/9p/xattr.c:142:9: include/net/9p/9p.h:55:2: error: ‘%s’ directive argument is null [-Werror=format-overflow=] 55 | _p9_debug(level, __func__, fmt, ##__VA_ARGS__) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use an empty string instead: - this is ok 9p-wise because p9pdu_vwritef serializes a null string and an empty string the same way (one '0' word for length) - since this degrades the print statements, add new single quotes as the xattr name delimiter (old: "file = (null)", new: "file = ''") Link: https://lore.kernel.org/r/20231008060138.517057-1-suhui@nfschina.com Suggested-by: Su Hui <suhui@nfschina.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Acked-by: Christian Schoenebeck <linux_oss@crudebyte.com> Message-ID: <20231025103445.1248103-2-asmadeus@codewreck.org>
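A minimal sketch of the guard pattern described above; p9_debug() and P9_DEBUG_VFS are the real 9p macros, while the wrapper function and its arguments are placeholders, not the exact hunk in fs/9p/xattr.c:

    #include <net/9p/9p.h>

    /* Sketch: never hand a possibly-NULL string to a %s directive; fall
     * back to "" as the fix does. The function below is hypothetical. */
    static void v9fs_debug_name_sketch(const char *name, size_t value_len)
    {
            p9_debug(P9_DEBUG_VFS, "name = '%s', value_len = %zu\n",
                     name ? name : "", value_len);
    }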
2023-10-26  9p/trans_fd: Annotate data-racy writes to file::f_flags  (Marco Elver)
syzbot reported: | BUG: KCSAN: data-race in p9_fd_create / p9_fd_create | | read-write to 0xffff888130fb3d48 of 4 bytes by task 15599 on cpu 0: | p9_fd_open net/9p/trans_fd.c:842 [inline] | p9_fd_create+0x210/0x250 net/9p/trans_fd.c:1092 | p9_client_create+0x595/0xa70 net/9p/client.c:1010 | v9fs_session_init+0xf9/0xd90 fs/9p/v9fs.c:410 | v9fs_mount+0x69/0x630 fs/9p/vfs_super.c:123 | legacy_get_tree+0x74/0xd0 fs/fs_context.c:611 | vfs_get_tree+0x51/0x190 fs/super.c:1519 | do_new_mount+0x203/0x660 fs/namespace.c:3335 | path_mount+0x496/0xb30 fs/namespace.c:3662 | do_mount fs/namespace.c:3675 [inline] | __do_sys_mount fs/namespace.c:3884 [inline] | [...] | | read-write to 0xffff888130fb3d48 of 4 bytes by task 15563 on cpu 1: | p9_fd_open net/9p/trans_fd.c:842 [inline] | p9_fd_create+0x210/0x250 net/9p/trans_fd.c:1092 | p9_client_create+0x595/0xa70 net/9p/client.c:1010 | v9fs_session_init+0xf9/0xd90 fs/9p/v9fs.c:410 | v9fs_mount+0x69/0x630 fs/9p/vfs_super.c:123 | legacy_get_tree+0x74/0xd0 fs/fs_context.c:611 | vfs_get_tree+0x51/0x190 fs/super.c:1519 | do_new_mount+0x203/0x660 fs/namespace.c:3335 | path_mount+0x496/0xb30 fs/namespace.c:3662 | do_mount fs/namespace.c:3675 [inline] | __do_sys_mount fs/namespace.c:3884 [inline] | [...] | | value changed: 0x00008002 -> 0x00008802 Within p9_fd_open(), O_NONBLOCK is added to f_flags of the read and write files. This may happen concurrently if e.g. mounting process modifies the fd in another thread. Mark the plain read-modify-writes as intentional data-races, with the assumption that the result of executing the accesses concurrently will always result in the same result despite the accesses themselves not being atomic. Reported-by: syzbot+e441aeeb422763cc5511@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/ZO38mqkS0TYUlpFp@elver.google.com Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Message-ID: <20231025103445.1248103-1-asmadeus@codewreck.org>
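A minimal sketch of the annotation, using the KCSAN data_race() macro from <linux/compiler.h>; the helper function is hypothetical and the actual p9_fd_open() hunk may differ:

    #include <linux/compiler.h>
    #include <linux/fcntl.h>
    #include <linux/fs.h>

    /* Sketch: mark the racy read-modify-write of f_flags as an intentional
     * data race so KCSAN stops reporting it; the wrapper is illustrative. */
    static void p9_fd_set_nonblock_sketch(struct file *file)
    {
            data_race(file->f_flags |= O_NONBLOCK);
    }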
2023-10-25  mptcp: refactor sndbuf auto-tuning  (Paolo Abeni)
The MPTCP protocol accounts the data enqueued on all the subflows to the main socket send buffer, while the send-buffer auto-tuning algorithm sets the main socket send buffer size to the maximum size among the subflows. That causes poor performance when at least one subflow is sndbuf-limited, e.g. due to very high latency, as the MPTCP scheduler can't even fill such a buffer. Change the send-buffer auto-tuning algorithm to compute the main socket send buffer size as the sum of all the subflows' buffer sizes. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-9-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
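A hedged sketch of the new policy, summing the subflows' send buffers instead of taking their maximum; mptcp_for_each_subflow() and mptcp_subflow_tcp_sock() follow the MPTCP code, but the helper itself and its locking are illustrative:

    #include <net/sock.h>
    #include "protocol.h"   /* net/mptcp internal definitions (assumed context) */

    /* Sketch: size the msk send buffer as the sum of all subflow send
     * buffers; the real code also deals with locking and lower bounds. */
    static void mptcp_sync_sndbuf_sketch(struct mptcp_sock *msk)
    {
            struct mptcp_subflow_context *subflow;
            int new_sndbuf = 0;

            mptcp_for_each_subflow(msk, subflow)
                    new_sndbuf += mptcp_subflow_tcp_sock(subflow)->sk_sndbuf;

            WRITE_ONCE(((struct sock *)msk)->sk_sndbuf, new_sndbuf);
    }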
2023-10-25  mptcp: ignore notsent_lowat setting at the subflow level  (Paolo Abeni)
Any latency-related tuning acting at the subflow level does not really affect user space, as only the main MPTCP socket is relevant. In any case, such a limiting setting may foul the MPTCP scheduler, preventing it from fully using the subflow-level cwnd and leading to very poor bandwidth usage. Enforce notsent_lowat to be a no-op on every subflow. Note that TCP_NOTSENT_LOWAT is currently not supported, and properly dealing with that will require more invasive changes. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-8-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25  mptcp: consolidate sockopt synchronization  (Paolo Abeni)
Move the socket option synchronization for active subflows to subflow creation time. This allows removing the now unused unlocked variant of that helper. While at it, clean up the mptcp_subflow_create_socket() error path a bit. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-7-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25  mptcp: use copy_from_iter helpers on transmit  (Paolo Abeni)
The perf traces show a high cost for the MPTCP transmit path memcpy. It turns out that the helper currently in use carries quite a bit of unneeded overhead, e.g. to map/unmap the memory pages. Moving to the 'copy_from_iter' variant removes such overhead and additionally gains no-cache support. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-6-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25  mptcp: give rcvlowat some love  (Paolo Abeni)
The MPTCP protocol allows setting sk_rcvlowat, but the value there is currently ignored. Additionally, the default subflow sk_rcvlowat basically disables per-subflow delayed acks: the MPTCP protocol moves the incoming data from the subflows into the msk socket as soon as the TCP stack invokes the subflow data_ready callback. Later, when __tcp_ack_snd_check() takes action, the subflow-level copied_seq matches rcv_nxt, and that mandates an immediate ack. Make the MPTCP receive path aware of that threshold, explicitly tracking the amount of data available to be read and checking it against sk_rcvlowat in mptcp_poll() and before waking up readers. Additionally, implement the set_rcvlowat() callback to properly handle rcvbuf auto-tuning on sk_rcvlowat changes. Finally, to properly handle delayed acks, force the subflow-level threshold to 0 and instead explicitly ask for an immediate ack when the msk-level threshold is not reached. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-5-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
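A hedged sketch of the msk-level check: readers are only woken (and EPOLLIN only signalled) once the data ready at the MPTCP level reaches sk_rcvlowat; the byte counter below stands in for the new tracking:

    #include <net/sock.h>

    /* Sketch: gate reader wake-up on the msk-level threshold; 'bytes_ready'
     * is illustrative of the newly tracked amount of readable data. */
    static bool mptcp_epollin_ready_sketch(const struct sock *msk_sk, int bytes_ready)
    {
            return bytes_ready >= READ_ONCE(msk_sk->sk_rcvlowat);
    }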
2023-10-25  mptcp: use plain bool instead of custom binary enum  (Paolo Abeni)
The 'data_avail' subflow field is already used as a plain boolean; drop the custom binary enum type and switch to bool. No functional change intended. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-3-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25  mptcp: properly account fastopen data  (Paolo Abeni)
Currently the socket-level counter aggregating the received data does not take into account the data received via fastopen. Address the issue by updating the counter as required. Fixes: 38967f424b5b ("mptcp: track some aggregate data counters") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-2-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25  mptcp: add a new sysctl for make after break timeout  (Paolo Abeni)
The MPTCP protocol allows sockets with no alive subflows to stay in ESTABLISHED status for a user-defined timeout, to allow for later subflow creation. Currently such timeout is a constant, TCP_TIMEWAIT_LEN. Let user space configure it via a newly added sysctl, to better cope with busy servers and to simplify (and speed up) the relevant packetdrill tests. Note that the new knob does not apply to orphaned MPTCP sockets waiting for the data_fin handshake completion: they always wait TCP_TIMEWAIT_LEN. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-2-v1-1-9dc60939d371@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
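A hedged sketch of what such a per-netns knob can look like; the sysctl name, handler choice and the (omitted) per-netns .data wiring are placeholders, not necessarily what the patch adds:

    #include <linux/sysctl.h>

    /* Sketch: a netns-scoped sysctl holding the no-alive-subflows timeout in
     * seconds; .data is filled in per netns by setup code not shown here. */
    static struct ctl_table mptcp_timeout_sysctl_sketch[] = {
            {
                    .procname     = "close_timeout",
                    .maxlen       = sizeof(unsigned int),
                    .mode         = 0644,
                    .proc_handler = proc_douintvec_minmax,
            },
            {}
    };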
2023-10-25  net: ipv6: fix typo in comments  (Deming Wang)
The word "advertize" should be replaced by "advertise". Signed-off-by: Deming Wang <wangdeming@inspur.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-25  net: ipv4: fix typo in comments  (Deming Wang)
The word "advertize" should be replaced by "advertise". Signed-off-by: Deming Wang <wangdeming@inspur.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-25  net/sched: act_ct: additional checks for outdated flows  (Vlad Buslov)
The current nf_flow_is_outdated() implementation considers for teardown any flow table flow whose state has diverged from its underlying CT connection status, which can be problematic in the following cases: - The flow has never been offloaded to hardware in the first place, either because the flow table has hardware offload disabled (flag NF_FLOWTABLE_HW_OFFLOAD is not set) or because it is still pending on the 'add' workqueue to be offloaded for the first time. The former is incorrect, the latter generates excessive deletions and additions of flows. - The flow is already pending to be updated on the workqueue. Tearing down such flows will also generate excessive removals from the flow table, especially on a highly loaded system where the latency to re-offload a flow via the 'add' workqueue can be quite high. When considering a flow for teardown as outdated, verify that it is both offloaded to hardware and doesn't have any pending updates. Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple") Reviewed-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
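A hedged sketch of the combined test described above; NF_FLOW_HW and NF_FLOW_HW_PENDING are existing flow_offload flag bits, while the helper name is illustrative:

    #include <net/netfilter/nf_flow_table.h>

    /* Sketch: only treat a flow as a candidate for the 'outdated' teardown
     * once it is actually offloaded and has no update still queued. */
    static bool tcf_ct_flow_checkable_sketch(const struct flow_offload *flow)
    {
            return test_bit(NF_FLOW_HW, &flow->flags) &&
                   !test_bit(NF_FLOW_HW_PENDING, &flow->flags);
    }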
2023-10-25  netfilter: flowtable: GC pushes back packets to classic path  (Pablo Neira Ayuso)
Since 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple"), flowtable GC pushes flows with IPS_SEEN_REPLY back to the classic path in every run, i.e. every second. This is because of a new check for NF_FLOW_HW_ESTABLISHED, which is specific to sched/act_ct. In Netfilter's flowtable case, NF_FLOW_HW_ESTABLISHED never gets set, and IPS_SEEN_REPLY is unreliable since users decide when to offload the flow; such a bit might only be set at a later stage. Fix it by adding a custom .gc handler that sched/act_ct can use to deal with its NF_FLOW_HW_ESTABLISHED bit. Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple") Reported-by: Vladimir Smelhaus <vl.sm@email.cz> Reviewed-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-25  sched: act_ct: switch to per-action label counting  (Florian Westphal)
net->ct.labels_used was meant to convey 'number of ip/nftables rules that need the label extension allocated'. act_ct enables this for each net namespace, which defeats all attempts to avoid ct->ext allocation when possible. Move this increment to the control plane to request label extension space allocation only when it is needed. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-24  Merge tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless  (Jakub Kicinski)
Johannes Berg says: ==================== Three more fixes: - don't drop all unprotected public action frames since some don't have a protected dual - fix pointer confusion in scanning code - fix warning in some connections with multiple links * tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: don't drop all unprotected public action frames wifi: cfg80211: fix assoc response warning on failed links wifi: cfg80211: pass correct pointer to rdev_inform_bss() ==================== Link: https://lore.kernel.org/r/20231024103540.19198-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: dsa: Rename IFLA_DSA_MASTER to IFLA_DSA_CONDUIT  (Florian Fainelli)
This preserves the existing IFLA_DSA_MASTER which is part of the uAPI and creates an alias named IFLA_DSA_CONDUIT. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20231023181729.1191071-3-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: dsa: Use conduit and user terms  (Florian Fainelli)
Use more inclusive terms throughout the DSA subsystem by moving away from "master" which is replaced by "conduit" and "slave" which is replaced by "user". No functional changes. Acked-by: Rob Herring <robh@kernel.org> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20231023181729.1191071-2-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: remove else after return in dev_prep_valid_name()  (Jakub Kicinski)
Remove unnecessary else clauses after return. I copied this if / else construct from somewhere, it makes the code harder to read. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: remove dev_valid_name() check from __dev_alloc_name()  (Jakub Kicinski)
__dev_alloc_name() is only called by dev_prep_valid_name(), which already checks that name is valid. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: trust the bitmap in __dev_alloc_name()  (Jakub Kicinski)
Prior to the restructuring, __dev_alloc_name() handled both printf and non-printf names. In a clever attempt at code reuse it always prints the name into a buffer and checks if it's a duplicate. Trust the bitmap, and return an error if it's full. This shrinks the possible ID space by one from 32K to 32K - 1, as previously the max value would have been tried as a valid ID. It seems very unlikely that anyone would care, as we heard no requests to increase the max beyond 32k. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
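A hedged sketch of the resulting allocation step: take the first free bit or fail when the bitmap is full, instead of probing one extra candidate name. The function, bitmap and size macro are placeholders:

    #include <linux/bitmap.h>
    #include <linux/errno.h>

    #define NAME_UNITS_SKETCH (32 * 1024)   /* matches the 32K ID space */

    /* Sketch: trust the bitmap and fail outright when it is full. */
    static int dev_alloc_unit_sketch(unsigned long *inuse)
    {
            int i = find_first_zero_bit(inuse, NAME_UNITS_SKETCH);

            if (i == NAME_UNITS_SKETCH)
                    return -ENFILE;
            __set_bit(i, inuse);
            return i;
    }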
2023-10-24  net: reduce indentation of __dev_alloc_name()  (Jakub Kicinski)
All callers of __dev_valid_name() go thru dev_prep_valid_name() which handles the non-printf case. Focus __dev_alloc_name() on the sprintf case, remove the indentation level. Minor functional change of returning -EINVAL if % is not found, which should now never happen. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: make dev_alloc_name() call dev_prep_valid_name()  (Jakub Kicinski)
__dev_alloc_name() handles both the sprintf and non-sprintf target names. This complicates the code. dev_prep_valid_name() already handles the non-sprintf case before calling __dev_alloc_name(); make the only other caller also go thru dev_prep_valid_name(). This way we can drop the non-sprintf handling in __dev_alloc_name() in one of the next changes. commit 55a5ec9b7710 ("Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"") and commit 029b6d140550 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"") tell us that we can't start returning -EEXIST from dev_alloc_name() on name duplicates. Bite the bullet and pass the expected errno to dev_prep_valid_name(). dev_prep_valid_name() must now propagate out the allocated id for printf names. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: don't use input buffer of __dev_alloc_name() as a scratch space  (Jakub Kicinski)
Callers of __dev_alloc_name() want to pass dev->name as the output buffer. Make __dev_alloc_name() not clobber that buffer on failure, and remove the workarounds in callers. dev_alloc_name_ns() is now completely unnecessary. The extra strscpy() added here will be gone by the end of the patch series. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
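A hedged sketch of the resulting contract: format the candidate name into a local scratch buffer and only copy it into the caller's buffer (typically dev->name) on success, so a failed attempt never clobbers it. The helper and its id argument are illustrative:

    #include <linux/kernel.h>
    #include <linux/netdevice.h>
    #include <linux/string.h>

    /* Sketch: 'out' is written only on success; on failure it is left
     * exactly as the caller passed it in. */
    static int dev_alloc_name_sketch(char *out, const char *fmt, int id)
    {
            char buf[IFNAMSIZ];

            if (id < 0)
                    return id;              /* failure: 'out' untouched */
            snprintf(buf, sizeof(buf), fmt, id);
            strscpy(out, buf, IFNAMSIZ);
            return id;
    }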
2023-10-24  net: mptcp: use policy generated by YAML spec  (Davide Caratti)
generated with: $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \ > --spec Documentation/netlink/specs/mptcp.yaml --source \ > -o net/mptcp/mptcp_pm_gen.c $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \ > --spec Documentation/netlink/specs/mptcp.yaml --header \ > -o net/mptcp/mptcp_pm_gen.h Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-7-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: mptcp: rename netlink handlers to mptcp_pm_nl_<blah>_{doit,dumpit}  (Davide Caratti)
so that they will match names generated from YAML spec. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-6-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24  net: mptcp: convert netlink from small_ops to ops  (Davide Caratti)
In the current MPTCP control plane, all operations use a netlink attribute of the same type, "MPTCP_PM_ATTR". However, add/del/get/flush operations only parse the first element in the message, the one that describes MPTCP endpoints (that was named MPTCP_PM_ATTR_ADDR and mostly used in ADD_ADDR operations; the similarity of "attr", "addr" and "add" might cause some confusion to human readers). Converting MPTCP from 'small_ops' to 'ops', thus allowing different attributes for each single operation, hopefully makes all this clearer to human readers. - use a separate attribute set for the add/del/get/flush address operations, binary compatible with the existing one, to store the endpoint address. MPTCP_PM_ENDPOINT_ADDR is added to the uAPI (with the same value as MPTCP_PM_ATTR_ADDR) for these operations. - convert mptcp_pm_ops[] and add policy files accordingly. This prepares the MPTCP control plane to be described as a YAML spec. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-3-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
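A hedged sketch of what the conversion enables: with full 'ops', each command can carry its own policy and maxattr, which 'small_ops' cannot. The command and attribute names come from the MPTCP uAPI, but the policy and handler below are placeholders, not the generated ones:

    #include <net/genetlink.h>
    #include <linux/mptcp.h>

    /* Placeholder per-op policy: only the endpoint address attribute is
     * accepted for this command. */
    static const struct nla_policy endpoint_policy_sketch[MPTCP_PM_ATTR_ADDR + 1] = {
            [MPTCP_PM_ATTR_ADDR] = { .type = NLA_NESTED },
    };

    int mptcp_pm_add_addr_doit_sketch(struct sk_buff *skb, struct genl_info *info);

    /* Sketch: one entry of the converted ops array, with its own policy. */
    static const struct genl_ops mptcp_pm_ops_sketch[] = {
            {
                    .cmd     = MPTCP_PM_CMD_ADD_ADDR,
                    .doit    = mptcp_pm_add_addr_doit_sketch,
                    .policy  = endpoint_policy_sketch,
                    .maxattr = MPTCP_PM_ATTR_ADDR,
                    .flags   = GENL_UNS_ADMIN_PERM,
            },
    };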
2023-10-24  netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx  (Phil Sutter)
Relieve the dump callback from having to check nlmsg_type upon each call. Prep work for set element reset locking. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST  (Pablo Neira Ayuso)
Return struct nft_elem_priv instead of struct nft_set_ext for consistency with ("netfilter: nf_tables: expose opaque set element as struct nft_elem_priv") and to prepare the introduction of element timeout updates from control path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: shrink memory consumption of set elements  (Pablo Neira Ayuso)
Instead of copying struct nft_set_elem into struct nft_trans_elem, store the pointer to the opaque set element object in the transaction. Adapt the set backend API (and the set backend implementations) to take the pointer to the opaque set element representation whenever required. This patch deconstifies the .remove() and .activate() set backend API since these modify the set element opaque object. It also constifies nft_set_elem_ext(); this provides access to the nft_set_ext struct without updating the object. According to pahole on x86_64, this patch shrinks struct nft_trans_elem size from 216 to 24 bytes. This patch also reduces stack memory consumption by removing the template struct nft_set_elem object, using the opaque set element object instead, e.g. from the set iterator API, catchall elements and the get element command. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: expose opaque set element as struct nft_elem_priv  (Pablo Neira Ayuso)
Add a placeholder structure and place it at the beginning of each struct nft_*_elem for each existing set backend, instead of exposing elements as a void type to the frontend, which defeats compiler type checks. Use a pointer to this new type to replace void *. This patch updates the following set backend API to use this new struct nft_elem_priv placeholder structure: - update - deactivate - flush - get as well as the following helper functions: - nft_set_elem_ext() - nft_set_elem_init() - nft_set_elem_destroy() - nf_tables_set_elem_destroy() This patch adds nft_elem_priv_cast() to cast struct nft_elem_priv to the native element representation of the corresponding set backend. BUILD_BUG_ON() makes sure this .priv placeholder is always at the top of the opaque set element representation. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
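A hedged sketch of the pattern: the empty placeholder struct sits as the first member of each backend's element type, a cast helper recovers the backend type via container_of(), and BUILD_BUG_ON() pins the member at offset zero. The backend element layout here is illustrative:

    #include <linux/build_bug.h>
    #include <linux/container_of.h>
    #include <linux/stddef.h>

    /* Mirrors the placeholder type this patch adds to nf_tables.h. */
    struct nft_elem_priv { };

    /* Illustrative backend element: .priv must stay the first member. */
    struct nft_backend_elem_sketch {
            struct nft_elem_priv    priv;
            unsigned long           cookie;         /* backend-specific data */
    };

    static inline struct nft_backend_elem_sketch *
    nft_backend_elem_cast(struct nft_elem_priv *priv)
    {
            BUILD_BUG_ON(offsetof(struct nft_backend_elem_sketch, priv) != 0);
            return container_of(priv, struct nft_backend_elem_sketch, priv);
    }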
2023-10-24  netfilter: nf_tables: set backend .flush always succeeds  (Pablo Neira Ayuso)
.flush is always successful since it results from iterating over the set elements to mark each element as inactive in the next generation. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush  (Pablo Neira Ayuso)
Use the element object that is already offered instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx  (Phil Sutter)
Relieve the dump callback from having to inspect nlmsg_type upon each call, just do it once at start of the dump. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: nft_obj_filter fits into cb->ctx  (Phil Sutter)
No need to allocate it if one may just use struct netlink_callback's scratch area for it. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx  (Phil Sutter)
Prep work for moving the context into struct netlink_callback scratch area. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: A better name for nft_obj_filter  (Phil Sutter)
Name it for what it is supposed to become, a real nft_obj_dump_ctx. No functional change intended. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: Unconditionally allocate nft_obj_filter  (Phil Sutter)
Prep work for moving the filter into struct netlink_callback's scratch area. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj  (Phil Sutter)
The code does not make use of cb->args fields past the first one, no need to zero them. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: conntrack: switch connlabels to atomic_t  (Florian Westphal)
The spinlock dates back to the days when connlabels did not have a fixed size and reallocation had to be supported. Remove it. This change also allows calling the helpers from softirq or timer context without deadlocks. Also add WARN()s to catch refcounting imbalances. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
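A hedged sketch of the resulting shape: a bare atomic_t counter with a WARN on imbalance, safe to use from softirq or timer context since no lock is taken; the names are illustrative:

    #include <linux/atomic.h>
    #include <linux/bug.h>

    static atomic_t connlabels_used_sketch = ATOMIC_INIT(0);

    /* Sketch: get/put pair with no spinlock, hence callable from softirq. */
    static void connlabels_get_sketch(void)
    {
            atomic_inc(&connlabels_used_sketch);
    }

    static void connlabels_put_sketch(void)
    {
            /* catch refcount imbalances, as the patch adds WARN()s for */
            WARN_ON_ONCE(atomic_dec_return(&connlabels_used_sketch) < 0);
    }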
2023-10-24  br_netfilter: use single forward hook for ip and arp  (Florian Westphal)
br_netfilter registers two forward hooks, one for ip and one for arp. Just use a common function for both and then call the arp/ip helper as needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
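A hedged sketch of the dispatch: a single registered forward hook that branches on the packet's protocol and calls the ip or arp handling; the two stubs stand in for the existing br_netfilter helpers:

    #include <linux/if_ether.h>
    #include <linux/netfilter.h>
    #include <linux/skbuff.h>

    /* Stubs standing in for the existing ip/arp forward helpers. */
    static unsigned int br_nf_forward_ip_sketch(struct sk_buff *skb,
                                                const struct nf_hook_state *state)
    {
            return NF_ACCEPT;
    }

    static unsigned int br_nf_forward_arp_sketch(struct sk_buff *skb,
                                                 const struct nf_hook_state *state)
    {
            return NF_ACCEPT;
    }

    /* Sketch: one NF_BR_FORWARD hook body dispatching on skb->protocol. */
    static unsigned int br_nf_forward_sketch(void *priv, struct sk_buff *skb,
                                             const struct nf_hook_state *state)
    {
            if (skb->protocol == htons(ETH_P_ARP))
                    return br_nf_forward_arp_sketch(skb, state);

            return br_nf_forward_ip_sketch(skb, state);
    }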
2023-10-24  netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests  (Phil Sutter)
Rule reset is not concurrency-safe per-se, so multiple CPUs may reset the same rule at the same time. At least counter and quota expressions will suffer from value underruns in this case. Prevent this by introducing dedicated locking callbacks for nfnetlink and the asynchronous dump handling to serialize access. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: Introduce nf_tables_getrule_single()  (Phil Sutter)
Outsource the reply skb preparation for non-dump getrule requests into a distinct function. Prep work for rule reset locking. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nf_tables: Open-code audit log call in nf_tables_getrule()  (Phil Sutter)
The table lookup will be dropped from that function, so remove that dependency from the audit logging code. Using whatever is in nla[NFTA_RULE_TABLE] is sufficient as long as the previous rule info filling succeeded. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nft_set_rbtree: prefer sync gc to async worker  (Florian Westphal)
There is no need for asynchronous garbage collection; rbtree inserts can only happen from the netlink control plane. We already perform on-demand gc on insertion, in the area of the tree where the insertion takes place, but we don't do a full tree walk there for performance reasons. Do a full gc walk at the end of the transaction instead and remove the async worker. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24  netfilter: nft_set_rbtree: rename gc deactivate+erase function  (Florian Westphal)
The next patch adds a caller that doesn't hold the priv->write lock and will need a similar function. Rename the existing function to make it clear that it can only be used for opportunistic gc during insertion. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>