summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-10-24net: remove else after return in dev_prep_valid_name()Jakub Kicinski
Remove unnecessary else clauses after return. I copied this if / else construct from somewhere, it makes the code harder to read. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: remove dev_valid_name() check from __dev_alloc_name()Jakub Kicinski
__dev_alloc_name() is only called by dev_prep_valid_name(), which already checks that name is valid. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: trust the bitmap in __dev_alloc_name()Jakub Kicinski
Prior to restructuring __dev_alloc_name() handled both printf and non-printf names. In a clever attempt at code reuse it always prints the name into a buffer and checks if it's a duplicate. Trust the bitmap, and return an error if its full. This shrinks the possible ID space by one from 32K to 32K - 1, as previously the max value would have been tried as a valid ID. It seems very unlikely that anyone would care as we heard no requests to increase the max beyond 32k. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: reduce indentation of __dev_alloc_name()Jakub Kicinski
All callers of __dev_valid_name() go thru dev_prep_valid_name() which handles the non-printf case. Focus __dev_alloc_name() on the sprintf case, remove the indentation level. Minor functional change of returning -EINVAL if % is not found, which should now never happen. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: make dev_alloc_name() call dev_prep_valid_name()Jakub Kicinski
__dev_alloc_name() handles both the sprintf and non-sprintf target names. This complicates the code. dev_prep_valid_name() already handles the non-sprintf case, before calling __dev_alloc_name(), make the only other caller also go thru dev_prep_valid_name(). This way we can drop the non-sprintf handling in __dev_alloc_name() in one of the next changes. commit 55a5ec9b7710 ("Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"") and commit 029b6d140550 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"") tell us that we can't start returning -EEXIST from dev_alloc_name() on name duplicates. Bite the bullet and pass the expected errno to dev_prep_valid_name(). dev_prep_valid_name() must now propagate out the allocated id for printf names. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: don't use input buffer of __dev_alloc_name() as a scratch spaceJakub Kicinski
Callers of __dev_alloc_name() want to pass dev->name as the output buffer. Make __dev_alloc_name() not clobber that buffer on failure, and remove the workarounds in callers. dev_alloc_name_ns() is now completely unnecessary. The extra strscpy() added here will be gone by the end of the patch series. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231023152346.3639749-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24Merge branch 'mptcp-convert-netlink-code-to-use-yaml-spec'Jakub Kicinski
Mat Martineau says: ==================== mptcp: convert Netlink code to use YAML spec This series from Davide converts most of the MPTCP Netlink interface (plus uAPI bits) to use sources generated by YNL using a YAML spec file. This new YAML file is useful to validate the API and to generate a good documentation page. Patch 1 modifies YNL spec to support "uns-admin-perm" for genetlink legacy. Patch 2 adds support for validating exact length of netlink attrs. Patch 3 converts Netlink structures from small_ops to ops to prepare the switch to YAML. Patch 4 adds the Netlink YAML spec for MPTCP. Patch 5 adds and uses a new header file generated from the new YAML spec. Patch 6 renames some handlers to match the ones generated from the YAML spec. Patch 7 adds and uses Netlink policies automatically generated from the YAML spec. ==================== Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-0-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: mptcp: use policy generated by YAML specDavide Caratti
generated with: $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \ > --spec Documentation/netlink/specs/mptcp.yaml --source \ > -o net/mptcp/mptcp_pm_gen.c $ ./tools/net/ynl/ynl-gen-c.py --mode kernel \ > --spec Documentation/netlink/specs/mptcp.yaml --header \ > -o net/mptcp/mptcp_pm_gen.h Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-7-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: mptcp: rename netlink handlers to mptcp_pm_nl_<blah>_{doit,dumpit}Davide Caratti
so that they will match names generated from YAML spec. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-6-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24uapi: mptcp: use header file generated from YAML specDavide Caratti
generated with: $ ./tools/net/ynl/ynl-gen-c.py --mode uapi \ > --spec Documentation/netlink/specs/mptcp.yaml \ > --header -o include/uapi/linux/mptcp_pm.h Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-5-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24Documentation: netlink: add a YAML spec for mptcpDavide Caratti
it describes most of the current netlink interface (uAPI definitions, doit/dumpit operations and attributes) Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-4-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24net: mptcp: convert netlink from small_ops to opsDavide Caratti
in the current MPTCP control plane, all operations use a netlink attribute of the same type "MPTCP_PM_ATTR". However, add/del/get/flush operations only parse the first element in the message _ the one that describes MPTCP endpoints (that was named MPTCP_PM_ATTR_ADDR and mostly used in ADD_ADDR operations _ probably the similarity of "attr", "addr" and "add" might cause some confusion to human readers). Convert MPTCP from 'small_ops' to 'ops', thus allowing different attributes for each single operation, hopefully makes all this clearer to human readers. - use a separate attribute set for add/del/get/flush address operation, binary compatible with the existing one, to store the endpoint address. MPTCP_PM_ENDPOINT_ADDR is added to the uAPI (with the same value as MPTCP_PM_ATTR_ADDR) for these operations. - convert mptcp_pm_ops[] and add policy files accordingly. this prepares MPTCP control plane to be described as YAML spec. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-3-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24tools: ynl-gen: add support for exact-len validationDavide Caratti
add support for 'exact-len' validation on netlink attributes. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/340 Acked-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-2-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24tools: ynl: add uns-admin-perm to genetlink legacyDavide Caratti
this flag maps to GENL_UNS_ADMIN_PERM and will be used by future specs. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231023-send-net-next-20231023-1-v2-1-16b1f701f900@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctxPhil Sutter
Relieve the dump callback from having to check nlmsg_type upon each call. Prep work for set element reset locking. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: set->ops->insert returns opaque set element in case of ↵Pablo Neira Ayuso
EEXIST Return struct nft_elem_priv instead of struct nft_set_ext for consistency with ("netfilter: nf_tables: expose opaque set element as struct nft_elem_priv") and to prepare the introduction of element timeout updates from control path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: shrink memory consumption of set elementsPablo Neira Ayuso
Instead of copying struct nft_set_elem into struct nft_trans_elem, store the pointer to the opaque set element object in the transaction. Adapt set backend API (and set backend implementations) to take the pointer to opaque set element representation whenever required. This patch deconstifies .remove() and .activate() set backend API since these modify the set element opaque object. And it also constify nft_set_elem_ext() this provides access to the nft_set_ext struct without updating the object. According to pahole on x86_64, this patch shrinks struct nft_trans_elem size from 216 to 24 bytes. This patch also reduces stack memory consumption by removing the template struct nft_set_elem object, using the opaque set element object instead such as from the set iterator API, catchall elements and the get element command. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: expose opaque set element as struct nft_elem_privPablo Neira Ayuso
Add placeholder structure and place it at the beginning of each struct nft_*_elem for each existing set backend, instead of exposing elements as void type to the frontend which defeats compiler type checks. Use this pointer to this new type to replace void *. This patch updates the following set backend API to use this new struct nft_elem_priv placeholder structure: - update - deactivate - flush - get as well as the following helper functions: - nft_set_elem_ext() - nft_set_elem_init() - nft_set_elem_destroy() - nf_tables_set_elem_destroy() This patch adds nft_elem_priv_cast() to cast struct nft_elem_priv to native element representation from the corresponding set backend. BUILD_BUG_ON() makes sure this .priv placeholder is always at the top of the opaque set element representation. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: set backend .flush always succeedsPablo Neira Ayuso
.flush is always successful since this results from iterating over the set elements to toggle mark the element as inactive in the next generation. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flushPablo Neira Ayuso
Use the element object that is already offered instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctxPhil Sutter
Relieve the dump callback from having to inspect nlmsg_type upon each call, just do it once at start of the dump. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: nft_obj_filter fits into cb->ctxPhil Sutter
No need to allocate it if one may just use struct netlink_callback's scratch area for it. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctxPhil Sutter
Prep work for moving the context into struct netlink_callback scratch area. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: A better name for nft_obj_filterPhil Sutter
Name it for what it is supposed to become, a real nft_obj_dump_ctx. No functional change intended. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: Unconditionally allocate nft_obj_filterPhil Sutter
Prep work for moving the filter into struct netlink_callback's scratch area. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: Drop pointless memset in nf_tables_dump_objPhil Sutter
The code does not make use of cb->args fields past the first one, no need to zero them. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: conntrack: switch connlabels to atomic_tFlorian Westphal
The spinlock is back from the day when connabels did not have a fixed size and reallocation had to be supported. Remove it. This change also allows to call the helpers from softirq or timers without deadlocks. Also add WARN()s to catch refcounting imbalances. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24br_netfilter: use single forward hook for ip and arpFlorian Westphal
br_netfilter registers two forward hooks, one for ip and one for arp. Just use a common function for both and then call the arp/ip helper as needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requestsPhil Sutter
Rule reset is not concurrency-safe per-se, so multiple CPUs may reset the same rule at the same time. At least counter and quota expressions will suffer from value underruns in this case. Prevent this by introducing dedicated locking callbacks for nfnetlink and the asynchronous dump handling to serialize access. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: Introduce nf_tables_getrule_single()Phil Sutter
Outsource the reply skb preparation for non-dump getrule requests into a distinct function. Prep work for rule reset locking. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nf_tables: Open-code audit log call in nf_tables_getrule()Phil Sutter
The table lookup will be dropped from that function, so remove that dependency from audit logging code. Using whatever is in nla[NFTA_RULE_TABLE] is sufficient as long as the previous rule info filling succeded. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nft_set_rbtree: prefer sync gc to async workerFlorian Westphal
There is no need for asynchronous garbage collection, rbtree inserts can only happen from the netlink control plane. We already perform on-demand gc on insertion, in the area of the tree where the insertion takes place, but we don't do a full tree walk there for performance reasons. Do a full gc walk at the end of the transaction instead and remove the async worker. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24netfilter: nft_set_rbtree: rename gc deactivate+erase functionFlorian Westphal
Next patch adds a cllaer that doesn't hold the priv->write lock and will need a similar function. Rename the existing function to make it clear that it can only be used for opportunistic gc during insertion. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24net: sched: sch_qfq: Use non-work-conserving warning handlerLiu Jian
A helper function for printing non-work-conserving alarms is added in commit b00355db3f88 ("pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler."). In this commit, use qdisc_warn_nonwc() instead of WARN_ONCE() to handle the non-work-conserving warning in qfq Qdisc. Signed-off-by: Liu Jian <liujian56@huawei.com> Link: https://lore.kernel.org/r/20231023064729.370649-1-liujian56@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24net: ethernet: davinci_emac: Use MAC Address from Device TreeAdam Ford
Currently there is a device tree entry called "local-mac-address" which can be filled by the bootloader or manually set.This is useful when the user does not want to use the MAC address programmed into the SoC. Currently, the davinci_emac reads the MAC from the DT, copies it from pdata->mac_addr to priv->mac_addr, then blindly overwrites it by reading from registers in the SoC, and falls back to a random MAC if it's still not valid. This completely ignores any MAC address in the device tree. In order to use the local-mac-address, check to see if the contents of priv->mac_addr are valid before falling back to reading from the SoC when the MAC address is not valid. Signed-off-by: Adam Ford <aford173@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231022151911.4279-1-aford173@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24sock: Ignore memcg pressure heuristics when raising allocatedAbel Wu
Before sockets became aware of net-memcg's memory pressure since commit e1aab161e013 ("socket: initial cgroup code."), the memory usage would be granted to raise if below average even when under protocol's pressure. This provides fairness among the sockets of same protocol. That commit changes this because the heuristic will also be effective when only memcg is under pressure which makes no sense. So revert that behavior. After reverting, __sk_mem_raise_allocated() no longer considers memcg's pressure. As memcgs are isolated from each other w.r.t. memory accounting, consuming one's budget won't affect others. So except the places where buffer sizes are needed to be tuned, allow workloads to use the memory they are provisioned. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231019120026.42215-3-wuyun.abel@bytedance.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24sock: Doc behaviors for pressure heurisiticsAbel Wu
There are now two accounting infrastructures for skmem, while the heuristics in __sk_mem_raise_allocated() were actually introduced before memcg was born. Add some comments to clarify whether they can be applied to both infrastructures or not. Suggested-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231019120026.42215-2-wuyun.abel@bytedance.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24sock: Code cleanup on __sk_mem_raise_allocated()Abel Wu
Code cleanup for both better simplicity and readability. No functional change intended. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231019120026.42215-1-wuyun.abel@bytedance.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24net: ti: icssg-prueth: Add phys_port_name supportJan Kiszka
Helps identifying the ports in udev rules e.g. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/895ae9c1-b6dd-4a97-be14-6f2b73c7b2b5@siemens.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-24net: microchip: lan743x: improve throughput with rx timestamp configVishvambar Panth S
Currently all RX frames are timestamped which results in a performance penalty when timestamping is not needed. The default is now being changed to not timestamp any Rx frames (HWTSTAMP_FILTER_NONE), but support has been added to allow changing the desired RX timestamping mode (HWTSTAMP_FILTER_ALL - which was the previous setting and HWTSTAMP_FILTER_PTP_V2_EVENT are now supported) using SIOCSHWTSTAMP. All settings were tested using the hwstamp_ctl application. It is also noted that ptp4l, when started, preconfigures the device to timestamp using HWTSTAMP_FILTER_PTP_V2_EVENT, so this driver continues to work properly "out of the box". Test setup: x64 PC with LAN7430 ---> x64 PC as partner iperf3 with - Timestamp all incoming packets: - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-5.05 sec 517 MBytes 859 Mbits/sec 0 sender [ 5] 0.00-5.00 sec 515 MBytes 864 Mbits/sec receiver iperf Done. iperf3 with - Timestamp only PTP packets: - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-5.04 sec 563 MBytes 937 Mbits/sec 0 sender [ 5] 0.00-5.00 sec 561 MBytes 941 Mbits/sec receiver Signed-off-by: Vishvambar Panth S <vishvambarpanth.s@microchip.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231020185801.25649-1-vishvambarpanth.s@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-23Merge branch 'introduce-page_pool_alloc-related-api'Jakub Kicinski
Yunsheng Lin says: ==================== introduce page_pool_alloc() related API In [1] & [2] & [3], there are usecases for veth and virtio_net to use frag support in page pool to reduce memory usage, and it may request different frag size depending on the head/tail room space for xdp_frame/shinfo and mtu/packet size. When the requested frag size is large enough that a single page can not be split into more than one frag, using frag support only have performance penalty because of the extra frag count handling for frag support. So this patchset provides a page pool API for the driver to allocate memory with least memory utilization and performance penalty when it doesn't know the size of memory it need beforehand. 1. https://patchwork.kernel.org/project/netdevbpf/patch/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/ 2. https://patchwork.kernel.org/project/netdevbpf/patch/20230526054621.18371-3-liangchen.linux@gmail.com/ 3. https://github.com/alobakin/linux/tree/iavf-pp-frag ==================== Link: https://lore.kernel.org/r/20231020095952.11055-1-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23net: veth: use newly added page pool API for veth with xdpYunsheng Lin
Use page_pool_alloc() API to allocate memory with least memory utilization and performance penalty. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20231020095952.11055-6-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23page_pool: update document about fragment APIYunsheng Lin
As more drivers begin to use the fragment API, update the document about how to decide which API to use for the driver author. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Alexander Lobakin <aleksander.lobakin@intel.com> CC: Dima Tisnek <dimaqq@gmail.com> Link: https://lore.kernel.org/r/20231020095952.11055-5-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23page_pool: introduce page_pool_alloc() APIYunsheng Lin
Currently page pool supports the below use cases: use case 1: allocate page without page splitting using page_pool_alloc_pages() API if the driver knows that the memory it need is always bigger than half of the page allocated from page pool. use case 2: allocate page frag with page splitting using page_pool_alloc_frag() API if the driver knows that the memory it need is always smaller than or equal to the half of the page allocated from page pool. There is emerging use case [1] & [2] that is a mix of the above two case: the driver doesn't know the size of memory it need beforehand, so the driver may use something like below to allocate memory with least memory utilization and performance penalty: if (size << 1 > max_size) page = page_pool_alloc_pages(); else page = page_pool_alloc_frag(); To avoid the driver doing something like above, add the page_pool_alloc() API to support the above use case, and update the true size of memory that is acctually allocated by updating '*size' back to the driver in order to avoid exacerbating truesize underestimate problem. Rename page_pool_free() which is used in the destroy process to __page_pool_destroy() to avoid confusion with the newly added API. 1. https://lore.kernel.org/all/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/ 2. https://lore.kernel.org/all/20230526054621.18371-3-liangchen.linux@gmail.com/ Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20231020095952.11055-4-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23page_pool: remove PP_FLAG_PAGE_FRAGYunsheng Lin
PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count handling is unified and page_pool_alloc_frag() is supported in 32-bit arch with 64-bit DMA, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20231020095952.11055-3-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23page_pool: unify frag_count handling in page_pool_is_last_frag()Yunsheng Lin
Currently when page_pool_create() is called with PP_FLAG_PAGE_FRAG flag, page_pool_alloc_pages() is only allowed to be called under the below constraints: 1. page_pool_fragment_page() need to be called to setup page->pp_frag_count immediately. 2. page_pool_defrag_page() often need to be called to drain the page->pp_frag_count when there is no more user will be holding on to that page. Those constraints exist in order to support a page to be split into multi fragments. And those constraints have some overhead because of the cache line dirtying/bouncing and atomic update. Those constraints are unavoidable for case when we need a page to be split into more than one fragment, but there is also case that we want to avoid the above constraints and their overhead when a page can't be split as it can only hold a fragment as requested by user, depending on different use cases: use case 1: allocate page without page splitting. use case 2: allocate page with page splitting. use case 3: allocate page with or without page splitting depending on the fragment size. Currently page pool only provide page_pool_alloc_pages() and page_pool_alloc_frag() API to enable the 1 & 2 separately, so we can not use a combination of 1 & 2 to enable 3, it is not possible yet because of the per page_pool flag PP_FLAG_PAGE_FRAG. So in order to allow allocating unsplit page without the overhead of split page while still allow allocating split page we need to remove the per page_pool flag in page_pool_is_last_frag(), as best as I can think of, it seems there are two methods as below: 1. Add per page flag/bit to indicate a page is split or not, which means we might need to update that flag/bit everytime the page is recycled, dirtying the cache line of 'struct page' for use case 1. 2. Unify the page->pp_frag_count handling for both split and unsplit page by assuming all pages in the page pool is split into a big fragment initially. As page pool already supports use case 1 without dirtying the cache line of 'struct page' whenever a page is recyclable, we need to support the above use case 3 with minimal overhead, especially not adding any noticeable overhead for use case 1, and we are already doing an optimization by not updating pp_frag_count in page_pool_defrag_page() for the last fragment user, this patch chooses to unify the pp_frag_count handling to support the above use case 3. There is no noticeable performance degradation and some justification for unifying the frag_count handling with this patch applied using a micro-benchmark testing in [1]. 1. https://lore.kernel.org/all/bf2591f8-7b3c-4480-bb2c-31dc9da1d6ac@huawei.com/ Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20231020095952.11055-2-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23Merge tag 'for-net-next-2023-10-23' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add 0bda:b85b for Fn-Link RTL8852BE - ISO: Many fixes for broadcast support - Mark bcm4378/bcm4387 as BROKEN_LE_CODED - Add support ITTIM PE50-M75C - Add RTW8852BE device 13d3:3570 - Add support for QCA2066 - Add support for Intel Misty Peak - 8087:0038 * tag 'for-net-next-2023-10-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: Bluetooth: hci_sync: Fix Opcode prints in bt_dev_dbg/err Bluetooth: Fix double free in hci_conn_cleanup Bluetooth: btmtksdio: enable bluetooth wakeup in system suspend Bluetooth: btusb: Add 0bda:b85b for Fn-Link RTL8852BE Bluetooth: hci_bcm4377: Mark bcm4378/bcm4387 as BROKEN_LE_CODED Bluetooth: ISO: Copy BASE if service data matches EIR_BAA_SERVICE_UUID Bluetooth: Make handle of hci_conn be unique Bluetooth: btusb: Add date->evt_skb is NULL check Bluetooth: ISO: Fix bcast listener cleanup Bluetooth: msft: __hci_cmd_sync() doesn't return NULL Bluetooth: ISO: Match QoS adv handle with BIG handle Bluetooth: ISO: Allow binding a bcast listener to 0 bises Bluetooth: btusb: Add RTW8852BE device 13d3:3570 to device tables Bluetooth: qca: add support for QCA2066 Bluetooth: ISO: Set CIS bit only for devices with CIS support Bluetooth: Add support for Intel Misty Peak - 8087:0038 Bluetooth: Add support ITTIM PE50-M75C Bluetooth: ISO: Pass BIG encryption info through QoS Bluetooth: ISO: Fix BIS cleanup ==================== Link: https://lore.kernel.org/r/20231023182119.3629194-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23Merge branch 'devlink-finish-conversion-to-generated-split_ops'Jakub Kicinski
Jiri Pirko says: ==================== devlink: finish conversion to generated split_ops This patchset converts the remaining genetlink commands to generated split_ops and removes the existing small_ops arrays entirely alongside with shared netlink attribute policy. Patches #1-#6 are just small preparations and small fixes on multiple places. Note that couple of patches contain the "Fixes" tag but no need to put them into -net tree. Patch #7 is a simple rename preparation Patch #8 is the main one in this set and adds actual definitions of cmds in to yaml file. Patches #9-#10 finalize the change removing bits that are no longer in use. ==================== Link: https://lore.kernel.org/r/20231021112711.660606-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23devlink: remove netlink small_opsJiri Pirko
All commands are now covered by generated split_ops. Remove the small_ops entirely alongside with unified devlink netlink policy array. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231021112711.660606-11-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23devlink: remove duplicated netlink callback prototypesJiri Pirko
The prototypes are now generated, remove the old ones. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231021112711.660606-10-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>