summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-08-16Merge tag 'ipsec-2023-08-15' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Fix a slab-out-of-bounds read in xfrm_address_filter. From Lin Ma. 2) Fix the pfkey sadb_x_filter validation. From Lin Ma. 3) Use the correct nla_policy structure for XFRMA_SEC_CTX. From Lin Ma. 4) Fix warnings triggerable by bad packets in the encap functions. From Herbert Xu. 5) Fix some slab-use-after-free in decode_session6. From Zhengchao Shao. 6) Fix a possible NULL piointer dereference in xfrm_update_ae_params. Lin Ma. 7) Add a forgotten nla_policy for XFRMA_MTIMER_THRESH. From Lin Ma. 8) Don't leak offloaded policies. From Leon Romanovsky. 9) Delete also the offloading part of an acquire state. From Leon Romanovsky. Please pull or let me know if there are problems.
2023-08-15net: warn about attempts to register negative ifindexJakub Kicinski
Since the xarray changes we mix returning valid ifindex and negative errno in a single int returned from dev_index_reserve(). This depends on the fact that ifindexes can't be negative. Otherwise we may insert into the xarray and return a very large negative value. This in turn may break ERR_PTR(). OvS is susceptible to this problem and lacking validation (fix posted separately for net). Reject negative ifindex explicitly. Add a warning because the input validation is better handled by the caller. Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230814205627.2914583-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15net: openvswitch: reject negative ifindexJakub Kicinski
Recent changes in net-next (commit 759ab1edb56c ("net: store netdevs in an xarray")) refactored the handling of pre-assigned ifindexes and let syzbot surface a latent problem in ovs. ovs does not validate ifindex, making it possible to create netdev ports with negative ifindex values. It's easy to repro with YNL: $ ./cli.py --spec netlink/specs/ovs_datapath.yaml \ --do new \ --json '{"upcall-pid": 1, "name":"my-dp"}' $ ./cli.py --spec netlink/specs/ovs_vport.yaml \ --do new \ --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}' $ ip link show -65536: some-port0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 7a:48:21:ad:0b:fb brd ff:ff:ff:ff:ff:ff ... Validate the inputs. Now the second command correctly returns: $ ./cli.py --spec netlink/specs/ovs_vport.yaml \ --do new \ --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}' lib.ynl.NlError: Netlink error: Numerical result out of range nl_len = 108 (92) nl_flags = 0x300 nl_type = 2 error: -34 extack: {'msg': 'integer out of range', 'unknown': [[type:4 len:36] b'\x0c\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x03\x00\xff\xff\xff\x7f\x00\x00\x00\x00\x08\x00\x01\x00\x08\x00\x00\x00'], 'bad-attr': '.ifindex'} Accept 0 since it used to be silently ignored. Fixes: 54c4ef34c4b6 ("openvswitch: allow specifying ifindex of new interfaces") Reported-by: syzbot+7456b5dcf65111553320@syzkaller.appspotmail.com Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814203840.2908710-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15nexthop: Do not increment dump sentinel at the end of the dumpIdo Schimmel
The nexthop and nexthop bucket dump callbacks previously returned a positive return code even when the dump was complete, prompting the core netlink code to invoke the callback again, until returning zero. Zero was only returned by these callbacks when no information was filled in the provided skb, which was achieved by incrementing the dump sentinel at the end of the dump beyond the ID of the last nexthop. This is no longer necessary as when the dump is complete these callbacks return zero. Remove the unnecessary increment. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230813164856.2379822-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15nexthop: Simplify nexthop bucket dumpIdo Schimmel
Before commit f10d3d9df49d ("nexthop: Make nexthop bucket dump more efficient"), rtm_dump_nexthop_bucket_nh() returned a non-zero return code for each resilient nexthop group whose buckets it dumped, regardless if it encountered an error or not. This meant that the sentinel ('dd->ctx->nh.idx') used by the function that walked the different nexthops could not be used as a sentinel for the bucket dump, as otherwise buckets from the same group would be dumped over and over again. This was dealt with by adding another sentinel ('dd->ctx->done_nh_idx') that was incremented by rtm_dump_nexthop_bucket_nh() after successfully dumping all the buckets from a given group. After the previously mentioned commit this sentinel is no longer necessary since the function no longer returns a non-zero return code when successfully dumping all the buckets from a given group. Remove this sentinel and simplify the code. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230813164856.2379822-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15seg6: add NEXT-C-SID support for SRv6 End.X behaviorAndrea Mayer
The NEXT-C-SID mechanism described in [1] offers the possibility of encoding several SRv6 segments within a single 128 bit SID address. Such a SID address is called a Compressed SID (C-SID) container. In this way, the length of the SID List can be drastically reduced. A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address logically structured in three main blocks: i) Locator-Block; ii) Locator-Node Function; iii) Argument. C-SID container +------------------------------------------------------------------+ | Locator-Block |Loc-Node| Argument | | |Function| | +------------------------------------------------------------------+ <--------- B -----------> <- NF -> <------------- A ---------------> (i) The Locator-Block can be any IPv6 prefix available to the provider; (ii) The Locator-Node Function represents the node and the function to be triggered when a packet is received on the node; (iii) The Argument carries the remaining C-SIDs in the current C-SID container. This patch leverages the NEXT-C-SID mechanism previously introduced in the Linux SRv6 subsystem [2] to support SID compression capabilities in the SRv6 End.X behavior [3]. An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior but it is capable of processing the compressed SID List encoded in C-SID containers. An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support user-provided Locator-Block and Locator-Node Function lengths. In this implementation, such lengths must be evenly divisible by 8 (i.e. must be byte-aligned), otherwise the kernel informs the user about invalid values with a meaningful error code and message through netlink_ext_ack. If Locator-Block and/or Locator-Node Function lengths are not provided by the user during configuration of an SRv6 End.X behavior instance with NEXT-C-SID flavor, the kernel will choose their default values i.e., 32-bit Locator-Block and 16-bit Locator-Node Function. [1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression [2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/ [3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15networking: Update to register_net_sysctl_szJoel Granados
Move from register_net_sysctl to register_net_sysctl_sz for all the networking related files. Do this while making sure to mirror the NULL assignments with a table_size of zero for the unprivileged users. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. An additional size function was added to the following files in order to calculate the size of an array that is defined in another file: include/net/ipv6.h net/ipv6/icmp.c net/ipv6/route.c net/ipv6/sysctl_net_ipv6.c Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15netfilter: Update to register_net_sysctl_szJoel Granados
Move from register_net_sysctl to register_net_sysctl_sz for all the netfilter related files. Do this while making sure to mirror the NULL assignments with a table_size of zero for the unprivileged users. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15ax.25: Update to register_net_sysctl_szJoel Granados
Move from register_net_sysctl to register_net_sysctl_sz and pass the ARRAY_SIZE of the ctl_table array that was used to create the table variable. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Add size to register_net_sysctl functionJoel Granados
This commit adds size to the register_net_sysctl indirection function to facilitate the removal of the sentinel elements (last empty markers) from the ctl_table arrays. Though we don't actually remove any sentinels in this commit, register_net_sysctl* now has the capability of forwarding table_size for when that happens. We create a new function register_net_sysctl_sz with an extra size argument. A macro replaces the existing register_net_sysctl. The size in the macro is SIZE_MAX instead of ARRAY_SIZE to avoid compilation errors while we systematically migrate to register_net_sysctl_sz. Will change to ARRAY_SIZE in subsequent commits. Care is taken to add table_size to the stopping criteria in such a way that when we remove the empty sentinel element, it will continue stopping in the last element of the ctl_table array. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Add size to register_sysctlJoel Granados
This commit adds table_size to register_sysctl in preparation for the removal of the sentinel elements in the ctl_table arrays (last empty markers). And though we do *not* remove any sentinels in this commit, we set things up by either passing the table_size explicitly or using ARRAY_SIZE on the ctl_table arrays. We replace the register_syctl function with a macro that will add the ARRAY_SIZE to the new register_sysctl_sz function. In this way the callers that are already using an array of ctl_table structs do not change. For the callers that pass a ctl_table array pointer, we pass the table_size to register_sysctl_sz instead of the macro. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-15sysctl: Add a size arg to __register_sysctl_tableJoel Granados
We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-16netfilter: nft_dynset: disallow object mapsPablo Neira Ayuso
Do not allow to insert elements from datapath to objects maps. Fixes: 8aeff920dcc9 ("netfilter: nf_tables: add stateful object reference to set elements") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: GC transaction race with netns dismantlePablo Neira Ayuso
Use maybe_get_net() since GC workqueue might race with netns exit path. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix GC transaction races with netns and netlink event ↵Pablo Neira Ayuso
exit path Netlink event path is missing a synchronization point with GC transactions. Add GC sequence number update to netns release path and netlink event path, any GC transaction losing race will be discarded. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16ipvs: fix racy memcpy in proc_do_sync_thresholdSishuai Gong
When two threads run proc_do_sync_threshold() in parallel, data races could happen between the two memcpy(): Thread-1 Thread-2 memcpy(val, valp, sizeof(val)); memcpy(valp, val, sizeof(val)); This race might mess up the (struct ctl_table *) table->data, so we add a mutex lock to serialize them. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/B6988E90-0A1E-4B85-BF26-2DAF6D482433@gmail.com/ Signed-off-by: Sishuai Gong <sishuai.system@gmail.com> Acked-by: Simon Horman <horms@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: set default timeout to 3 secs for sctp shutdown send and recv stateXin Long
In SCTP protocol, it is using the same timer (T2 timer) for SHUTDOWN and SHUTDOWN_ACK retransmission. However in sctp conntrack the default timeout value for SCTP_CONNTRACK_SHUTDOWN_ACK_SENT state is 3 secs while it's 300 msecs for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV state. As Paolo Valerio noticed, this might cause unwanted expiration of the ct entry. In my test, with 1s tc netem delay set on the NAT path, after the SHUTDOWN is sent, the sctp ct entry enters SCTP_CONNTRACK_SHUTDOWN_SEND state. However, due to 300ms (too short) delay, when the SHUTDOWN_ACK is sent back from the peer, the sctp ct entry has expired and been deleted, and then the SHUTDOWN_ACK has to be dropped. Also, it is confusing these two sysctl options always show 0 due to all timeout values using sec as unit: net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0 net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0 This patch fixes it by also using 3 secs for sctp shutdown send and recv state in sctp conntrack, which is also RTO.initial value in SCTP protocol. Note that the very short time value for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV was probably used for a rare scenario where SHUTDOWN is sent on 1st path but SHUTDOWN_ACK is replied on 2nd path, then a new connection started immediately on 1st path. So this patch also moves from SHUTDOWN_SEND/RECV to CLOSE when receiving INIT in the ORIGINAL direction. Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Reported-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: don't fail inserts if duplicate has expiredFlorian Westphal
nftables selftests fail: run-tests.sh testcases/sets/0044interval_overlap_0 Expected: 0-2 . 0-3, got: W: [FAILED] ./testcases/sets/0044interval_overlap_0: got 1 Insertion must ignore duplicate but expired entries. Moreover, there is a strange asymmetry in nft_pipapo_activate: It refetches the current element, whereas the other ->activate callbacks (bitmap, hash, rhash, rbtree) use elem->priv. Same for .remove: other set implementations take elem->priv, nft_pipapo_remove fetches elem->priv, then does a relookup, remove this. I suspect this was the reason for the change that prompted the removal of the expired check in pipapo_get() in the first place, but skipping exired elements there makes no sense to me, this helper is used for normal get requests, insertions (duplicate check) and deactivate callback. In first two cases expired elements must be skipped. For ->deactivate(), this gets called for DELSETELEM, so it seems to me that expired elements should be skipped as well, i.e. delete request should fail with -ENOENT error. Fixes: 24138933b97b ("netfilter: nf_tables: don't skip expired elements during walk") Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: deactivate catchall elements in next generationFlorian Westphal
When flushing, individual set elements are disabled in the next generation via the ->flush callback. Catchall elements are not disabled. This is incorrect and may lead to double-deactivations of catchall elements which then results in memory leaks: WARNING: CPU: 1 PID: 3300 at include/net/netfilter/nf_tables.h:1172 nft_map_deactivate+0x549/0x730 CPU: 1 PID: 3300 Comm: nft Not tainted 6.5.0-rc5+ #60 RIP: 0010:nft_map_deactivate+0x549/0x730 [..] ? nft_map_deactivate+0x549/0x730 nf_tables_delset+0xb66/0xeb0 (the warn is due to nft_use_dec() detecting underflow). Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Reported-by: lonial con <kongln9170@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix kdoc warnings after gc reworkFlorian Westphal
Jakub Kicinski says: We've got some new kdoc warnings here: net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc' net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc' include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set' Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix false-positive lockdep splatFlorian Westphal
->abort invocation may cause splat on debug kernels: WARNING: suspicious RCU usage net/netfilter/nft_set_pipapo.c:1697 suspicious rcu_dereference_check() usage! [..] rcu_scheduler_active = 2, debug_locks = 1 1 lock held by nft/133554: [..] (nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid [..] lockdep_rcu_suspicious+0x1ad/0x260 nft_pipapo_abort+0x145/0x180 __nf_tables_abort+0x5359/0x63d0 nf_tables_abort+0x24/0x40 nfnetlink_rcv+0x1a0a/0x22c0 netlink_unicast+0x73c/0x900 netlink_sendmsg+0x7f0/0xc20 ____sys_sendmsg+0x48d/0x760 Transaction mutex is held, so parallel updates are not possible. Switch to _protected and check mutex is held for lockdep enabled builds. Fixes: 212ed75dc5fb ("netfilter: nf_tables: integrate pipapo into commit protocol") Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-15ethtool: netlink: always pass genl_info to .prepare_dataJakub Kicinski
We had a number of bugs in the past because developers forgot to fully test dumps, which pass NULL as info to .prepare_data. .prepare_data implementations would try to access info->extack leading to a null-deref. Now that dumps and notifications can access struct genl_info we can pass it in, and remove the info null checks. Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15ethtool: netlink: simplify arguments to ethnl_default_parse()Jakub Kicinski
Pass struct genl_info directly instead of its members. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15netdev-genl: use struct genl_info for reply constructionJakub Kicinski
Use the just added APIs to make the code simpler. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: add a family pointer to struct genl_infoJakub Kicinski
Having family in struct genl_info is quite useful. It cuts down the number of arguments which need to be passed to helpers which already take struct genl_info. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: use attrs from struct genl_infoJakub Kicinski
Since dumps carry struct genl_info now, use the attrs pointer from genl_info and remove the one in struct genl_dumpit_info. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: add struct genl_info to struct genl_dumpit_infoJakub Kicinski
Netlink GET implementations must currently juggle struct genl_info and struct netlink_callback, depending on whether they were called from doit or dumpit. Add genl_info to the dump state and populate the fields. This way implementations can simply pass struct genl_info around. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: remove userhdr from struct genl_infoJakub Kicinski
Only three families use info->userhdr today and going forward we discourage using fixed headers in new families. So having the pointer to user header in struct genl_info is an overkill. Compute the header pointer at runtime. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: make genl_info->nlhdr constJakub Kicinski
struct netlink_callback has a const nlh pointer, make the pointer in struct genl_info const as well, to make copying between the two easier. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: push conditional locking into dumpit/doneJakub Kicinski
Add helpers which take/release the genl mutex based on family->parallel_ops. Remove the separation between handling of ops in locked and parallel families. Future patches would make the duplicated code grow even more. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15net: fix the RTO timer retransmitting skb every 1ms if linear option is enabledJason Xing
In the real workload, I encountered an issue which could cause the RTO timer to retransmit the skb per 1ms with linear option enabled. The amount of lost-retransmitted skbs can go up to 1000+ instantly. The root cause is that if the icsk_rto happens to be zero in the 6th round (which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero due to the changed calculation method in tcp_retransmit_timer() as follows: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); Above line could be converted to icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0 Therefore, the timer expires so quickly without any doubt. I read through the RFC 6298 and found that the RTO value can be rounded up to a certain value, in Linux, say TCP_RTO_MIN as default, which is regarded as the lower bound in this patch as suggested by Eric. Fixes: 36e31b0af587 ("net: TCP thin linear timeouts") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsgJeff Layton
svc_tcp_sendmsg used to factor in the xdr->page_base when sending pages, but commit 5df5dd03a8f7 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") dropped that part of the handling. Fix it by setting the bv_offset of the first bvec. Fixes: 5df5dd03a8f7 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-14netlink: specs: devlink: extend health reporter dump attributes by port indexJiri Pirko
Allow user to pass port index for health reporter dump request. Re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: extend health reporter dump selector by port indexJiri Pirko
Introduce a possibility for devlink object to expose attributes it supports for selection of dumped objects. Use this by health reporter to indicate it supports port index based selection of dump objects. Implement this selection mechanism in devlink_nl_cmd_health_reporter_get_dump_one() Example: $ devlink health pci/0000:08:00.0: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32769: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32770: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98304: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98305: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.1/98306: reporter vnic state healthy error 0 recover 0 $ devlink health show pci/0000:08:00.0 pci/0000:08:00.0: reporter fw state healthy error 0 recover 0 auto_dump true reporter fw_fatal state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32769: reporter vnic state healthy error 0 recover 0 pci/0000:08:00.0/32770: reporter vnic state healthy error 0 recover 0 $ devlink health show pci/0000:08:00.0/32768 pci/0000:08:00.0/32768: reporter vnic state healthy error 0 recover 0 The last command is possible because of this patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-13-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14netlink: specs: devlink: extend per-instance dump commands to accept ↵Jiri Pirko
instance attributes Extend per-instance dump command definitions to accept instance attributes. Allow parsing of devlink handle attributes so they could be used for instance selection. Re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: allow user to narrow per-instance dumps by passing handle attrsJiri Pirko
For SFs, one devlink instance per SF is created. There might be thousands of these on a single host. When a user needs to know port handle for specific SF, he needs to dump all devlink ports on the host which does not scale good. Allow user to pass devlink handle attributes alongside the dump command and dump only objects which are under selected devlink instance. Example: $ devlink port show auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false $ devlink port show auxiliary/mlx5_core.eth.0 auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false $ devlink port show auxiliary/mlx5_core.eth.1 auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: remove converted commands from small opsJiri Pirko
As the commands are already defined in split ops, remove them from small ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: remove duplicate temporary netlink callback prototypesJiri Pirko
Remove the duplicate temporary netlink callback prototype as the generated ones are already in place. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14netlink: specs: devlink: add commands that do per-instance dumpJiri Pirko
Add the definitions for the commands that do per-instance dump and re-generate the related code. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: pass flags as an arg of dump_one() callbackJiri Pirko
In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the flags as an arg of dump_one() callback. Currently, it is always NLM_F_MULTI. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: introduce dumpit callbacks for split opsJiri Pirko
Introduce dumpit callbacks for generated split ops. Have them as a thin wrapper around iteration function and allow to pass dump_one() function pointer directly without need to store in devlink_cmd structs. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: rename doit callbacks for per-instance dump commandsJiri Pirko
Rename netlink doit callback functions for the commands that do implement per-instance dump to match the generated names that are going to be introduce in the follow-up patch. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: introduce devlink_nl_pre_doit_port*() helper functionsJiri Pirko
Define port handling helpers what don't rely on internal_flags. Have __devlink_nl_pre_doit() to accept the flags as a function arg and make devlink_nl_pre_doit() a wrapper helper function calling it. Introduce new helpers devlink_nl_pre_doit_port() and devlink_nl_pre_doit_port_optional() to be used by split ops in follow-up patch. Note that the function prototypes are temporary until the generated ones will replace them in a follow-up patch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: parse rate attrs in doit() callbacksJiri Pirko
No need to give the rate any special treatment in netlink attributes parsing, as unlike for ports, there is only a couple of commands benefiting from that. Remove DEVLINK_NL_FLAG_NEED_RATE*, make pre_doit() callback simpler by moving the rate attributes parsing to rate_*_doit() ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14devlink: parse linecard attr in doit() callbacksJiri Pirko
No need to give the linecards any special treatment in netlink attribute parsing, as unlike for ports, there is only a couple of commands benefiting from that. Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler by moving the linecard attribute parsing to linecard_[gs]et_doit() ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14batman-adv: Drop per algo GW section class codeSven Eckelmann
This code was only used in the past for the sysfs interface. But since this was replace with netlink, it was never executed. The function pointer was only checked to figure out whether the limit 255 (B.A.T.M.A.N. IV) or 2**32-1 (B.A.T.M.A.N. V) should be used as limit. So instead of keeping the function pointer, just store the limits directly in struct batadv_algo_gw_ops. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-14batman-adv: Keep batadv_netlink_notify_* staticSven Eckelmann
The batadv_netlink_notify_*() functions are not used by any other source file. Just keep them local to netlink.c to get informed by the compiler when they are not used anymore. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-14batman-adv: Drop unused function batadv_gw_bandwidth_setSven Eckelmann
This function is no longer used since the sysfs support was removed from batman-adv. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2023-08-14Revert "vlan: Fix VLAN 0 memory leak"Vlad Buslov
This reverts commit 718cb09aaa6fa78cc8124e9517efbc6c92665384. The commit triggers multiple syzbot issues, probably due to possibility of manually creating VLAN 0 on netdevice which will cause the code to delete it since it can't distinguish such VLAN from implicit VLAN 0 automatically created for devices with NETIF_F_HW_VLAN_CTAG_FILTER feature. Reported-by: syzbot+662f783a5cdf3add2719@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000090196d0602a6167d@google.com/ Reported-by: syzbot+4b4f06495414e92701d5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000096ae870602a61602@google.com/ Reported-by: syzbot+d810d3cd45ed1848c3f7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000009f0f9c0602a616ce@google.com/ Fixes: 718cb09aaa6f ("vlan: Fix VLAN 0 memory leak") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14net: openvswitch: add misc error drop reasonsAdrian Moreno
Use drop reasons from include/net/dropreason-core.h when a reasonable candidate exists. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>