Age  Commit message  Author
2022-11-17  ice: Add documentation for devlink-rate implementation  Michal Wilczynski
Add documentation to a newly added devlink-rate feature. Provide some examples on how to use the commands, which netlink attributes are supported and descriptions of the attributes. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  ice: Prevent ADQ, DCB coexistence with Custom Tx scheduler  Michal Wilczynski
ADQ and DCB might interfere with Custom Tx Scheduler changes that the user introduces through the devlink-rate API. Check whether ADQ or DCB is active when the user tries to change any setting in the exported Tx scheduler tree. If either of them is active, block the user from doing so and log an appropriate message. Remove the exported hierarchy if the user enables ADQ or DCB. Prevent ADQ or DCB from being configured if the user has already made changes using the devlink-rate API. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
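The mutual exclusion described above can be pictured with a small model. This is only a sketch under stated assumptions: `PortState`, `apply_rate_change` and `enable_feature` are hypothetical names invented for illustration and do not correspond to functions in the ice driver.

```python
from dataclasses import dataclass


@dataclass
class PortState:
    """Hypothetical stand-in for per-port state; not the ice driver's real structure."""
    adq_active: bool = False
    dcb_active: bool = False
    tree_exported: bool = True          # default hierarchy is exported via devlink-rate
    custom_tree_modified: bool = False  # user changed something through devlink-rate


def apply_rate_change(state):
    """Refuse a devlink-rate change while ADQ or DCB is active."""
    if state.adq_active or state.dcb_active:
        return False  # the driver would also log a message here
    state.custom_tree_modified = True
    return True


def enable_feature(state, feature):
    """Enable 'adq' or 'dcb' unless the user already customized the tree."""
    if feature not in ("adq", "dcb"):
        raise ValueError("unknown feature: " + feature)
    if state.custom_tree_modified:
        return False
    setattr(state, feature + "_active", True)
    state.tree_exported = False  # the exported hierarchy is removed
    return True
```

Whichever side is configured first wins: a modified tree blocks ADQ/DCB, and an active ADQ/DCB blocks tree changes.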
2022-11-17  ice: Implement devlink-rate API  Michal Wilczynski
There is a need to support modification of the Tx scheduler tree in the ice driver. This will allow the user to control the Tx settings of each node in the internal hierarchy of nodes. As a result, the user will be able to use Hierarchy QoS implemented entirely in the hardware.

This patch implements the devlink-rate API. It also exports the initial default hierarchy. This is mostly dictated by the fact that the tree can't be removed entirely; all we can do is enable the user to modify it. For example, the root node shouldn't ever be removed, and nodes that have children are also off-limits.

Example initial tree with 2 VFs:

[root@fedora ~]# devlink port function rate show
pci/0000:4b:00.0/node_27: type node parent node_26
pci/0000:4b:00.0/node_26: type node parent node_0
pci/0000:4b:00.0/node_34: type node parent node_33
pci/0000:4b:00.0/node_33: type node parent node_32
pci/0000:4b:00.0/node_32: type node parent node_16
pci/0000:4b:00.0/node_19: type node parent node_18
pci/0000:4b:00.0/node_18: type node parent node_17
pci/0000:4b:00.0/node_17: type node parent node_16
pci/0000:4b:00.0/node_21: type node parent node_20
pci/0000:4b:00.0/node_20: type node parent node_3
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_27
pci/0000:4b:00.0/2: type leaf parent node_27

Let me visualize part of the tree:

            +---------+
            |  node_0 |
            +---------+
                 |
            +----v----+
            | node_26 |
            +----+----+
                 |
            +----v----+
            | node_27 |
            +----+----+
                 |
        |-----------------|
   +----v----+       +----v----+
   |   VF 1  |       |   VF 2  |
   +----+----+       +----+----+

So at this point there are a couple of things that can be done. For example, we could assign parameters only to the VFs:

[root@fedora ~]# devlink port function rate set pci/0000:4b:00.0/1 \
        tx_max 5Gbps

This would cap the VF 1 bandwidth at 5Gbps.

But let's say you would like to create a completely new branch. This can be done like this:

[root@fedora ~]# devlink port function rate add \
        pci/0000:4b:00.0/node_custom parent node_0
[root@fedora ~]# devlink port function rate add \
        pci/0000:4b:00.0/node_custom_1 parent node_custom
[root@fedora ~]# devlink port function rate set \
        pci/0000:4b:00.0/1 parent node_custom_1

This creates a completely new branch and reassigns VF 1 to it.

A number of parameters are supported for each node: tx_max, tx_share, tx_priority and tx_weight.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
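To make the exported hierarchy easier to reason about, the listing can be turned into a parent map. The sketch below parses a few lines in the format printed by `devlink port function rate show` (copied from the example above) and walks from a leaf to the root. The helper names `parse_rate_show` and `path_to_root` are made up for this illustration and are not part of devlink or iproute2.

```python
# A few lines taken from the example output in the commit message.
SAMPLE = """\
pci/0000:4b:00.0/node_27: type node parent node_26
pci/0000:4b:00.0/node_26: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_27
pci/0000:4b:00.0/2: type leaf parent node_27
"""


def parse_rate_show(text):
    """Return {node_name: parent_name_or_None} from 'rate show' style lines."""
    parents = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The handle ends at the first ": " (the PCI address has no ": ").
        handle, _, attrs = line.partition(": ")
        name = handle.rsplit("/", 1)[1]
        words = attrs.split()
        parent = None
        if "parent" in words:
            parent = words[words.index("parent") + 1]
        parents[name] = parent
    return parents


def path_to_root(parents, name):
    """Follow parent links from a node up to the root."""
    path = [name]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path


tree = parse_rate_show(SAMPLE)
print(path_to_root(tree, "1"))
```

For VF 1 the walk visits node_27, node_26 and node_0, which matches the part of the tree drawn in the diagram.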
2022-11-17  ice: Add an option to pre-allocate memory for ice_sched_node  Michal Wilczynski
devlink-rate API requires a priv object to be allocated when node still doesn't have a parent. This is problematic, because ice_sched_node can't be currently created without a parent. Add an option to pre-allocate memory for ice_sched_node struct. Add new arguments to ice_sched_add() and ice_sched_add_elems() that allow for pre-allocation of memory for ice_sched_node struct. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  ice: Introduce new parameters in ice_sched_node  Michal Wilczynski
To support new devlink-rate API ice_sched_node struct needs to store a number of additional parameters. This includes tx_max, tx_share, tx_weight, and tx_priority. Add new fields to ice_sched_node struct. Add new functions to configure the hardware with new parameters. Introduce new xarray to identify nodes uniquely. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  devlink: Allow to set up parent in devl_rate_leaf_create()  Michal Wilczynski
Currently the driver is able to create leaf nodes for devlink-rate, but is unable to set a parent for them. This wasn't an issue before it became possible to export the hierarchy from the driver. After adding the export feature, in order for the driver to supply the correct hierarchy, it must be able to supply a parent name to devl_rate_leaf_create(). Introduce a new parameter 'parent_name' in devl_rate_leaf_create(). Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  devlink: Allow for devlink-rate nodes parent reassignment  Michal Wilczynski
Currently it's not possible to reassign the parent of a node using one command. As the previous commit introduced a way to export the entire hierarchy from the driver, being able to modify and reassign parents becomes important. This way the user can easily change QoS settings without interrupting traffic.

Example command:
devlink port function rate set pci/0000:4b:00.0/1 parent node_custom_1

This reassigns the leaf node's parent to node_custom_1.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  devlink: Enable creation of the devlink-rate nodes from the driver  Michal Wilczynski
The internal firmware hierarchy for Hierarchical QoS on the Intel 100G card is very rigid and can't be easily removed. This requires the ability to export the default hierarchy so that the user can modify it. Currently the driver is only able to create the 'leaf' nodes, which usually represent the vport. This is not enough for HQoS implemented in Intel hardware. Introduce a new function devl_rate_node_create() that allows for creation of the devlink-rate nodes from the driver. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  devlink: Introduce new attribute 'tx_weight' to devlink-rate  Michal Wilczynski
To fully utilize the QoS offload capabilities of the Intel 100G card, a new attribute 'tx_weight' needs to be introduced. This attribute allows the Weighted Fair Queuing arbitration scheme to be used among siblings. This arbitration scheme can be used simultaneously with strict priority. Introduce a new attribute in devlink-rate that allows for configuration of Weighted Fair Queuing. The new attribute is optional. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
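As a rough intuition for what a weight means among siblings, the idealized Weighted Fair Queuing model gives each backlogged sibling a share of the parent's bandwidth proportional to its weight. The sketch below computes that idealized split. It is an assumption-laden illustration of the textbook scheme, not a description of how the hardware arbiter actually behaves, and `wfq_shares` is a hypothetical helper.

```python
def wfq_shares(weights, parent_bw):
    """Idealized WFQ: split parent_bw among siblings in proportion to weight.

    weights   -- {sibling_name: tx_weight}, all siblings assumed backlogged
    parent_bw -- bandwidth available at the parent, in any consistent unit
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    return {name: parent_bw * w / total for name, w in weights.items()}


# Two siblings with weights 1 and 3 sharing 100 units of bandwidth.
print(wfq_shares({"vf1": 1, "vf2": 3}, 100))
```

With weights 1 and 3, the second sibling is entitled to three times the bandwidth of the first whenever both have traffic to send.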
2022-11-17  devlink: Introduce new attribute 'tx_priority' to devlink-rate  Michal Wilczynski
To fully utilize the QoS offload capabilities of the Intel 100G card, a new attribute 'tx_priority' needs to be introduced. This attribute allows the strict priority arbiter to be used among siblings. This arbitration scheme attempts to schedule nodes based on their priority as long as the nodes remain within their bandwidth limit. Introduce a new attribute in devlink-rate that allows for configuration of strict priority. The new attribute is optional. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
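The phrase "based on their priority as long as the nodes remain within their bandwidth limit" can be illustrated with a simple allocation model: siblings are served in priority order, and each one is capped by its own limit. This is a sketch only. It assumes that a larger tx_priority value means higher priority, and `strict_priority_allocate` is a hypothetical helper rather than anything taken from the driver or firmware.

```python
def strict_priority_allocate(nodes, parent_bw):
    """Serve siblings in priority order, each capped by its own tx_max.

    nodes     -- list of (name, tx_priority, demand, tx_max) tuples
    parent_bw -- bandwidth available at the parent
    Assumption: a larger tx_priority value is served first.
    """
    remaining = parent_bw
    granted = {}
    for name, _prio, demand, tx_max in sorted(nodes, key=lambda n: -n[1]):
        share = min(demand, tx_max, remaining)
        granted[name] = share
        remaining -= share
    return granted


siblings = [
    ("vf1", 1, 60, 100),  # low priority, wants 60, limit 100
    ("vf2", 7, 80, 50),   # high priority, wants 80, limit 50
]
print(strict_priority_allocate(siblings, 100))
```

The high-priority sibling is served first but cannot exceed its own limit, so the remainder is still available to the lower-priority one.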
2022-11-17  Merge branch 'autoload-dsa-tagging-driver-when-dynamically-changing-protocol'  Jakub Kicinski
Vladimir Oltean says:

====================
Autoload DSA tagging driver when dynamically changing protocol

This patch set solves the issue reported by Michael and Heiko here:
https://lore.kernel.org/lkml/20221027113248.420216-1-michael@walle.cc/
making full use of Michael's suggestion of having two modaliases: one gets used for loading the tagging protocol when it's the default one reported by the switch driver, the other gets loaded at user's request, by name.

# modinfo tag_ocelot
filename:       /lib/modules/6.1.0-rc4+/kernel/net/dsa/tag_ocelot.ko
license:        GPL v2
alias:          dsa_tag:seville
alias:          dsa_tag:id-21
alias:          dsa_tag:ocelot
alias:          dsa_tag:id-15
depends:        dsa_core
intree:         Y
name:           tag_ocelot
vermagic:       6.1.0-rc4+ SMP preempt mod_unload modversions aarch64

Tested on NXP LS1028A-RDB with the following device tree addition:

&mscc_felix_port4 {
	dsa-tag-protocol = "ocelot-8021q";
};

&mscc_felix_port5 {
	dsa-tag-protocol = "ocelot-8021q";
};

CONFIG_NET_DSA and everything that depends on it is built as module. Everything auto-loads, and "cat /sys/class/net/eno2/dsa/tagging" shows "ocelot-8021q". Traffic works as well. Furthermore, "echo ocelot-8021q" into the aforementioned sysfs file now auto-loads the driver for it.
====================

Link: https://lore.kernel.org/r/20221115011847.2843127-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  net: dsa: autoload tag driver module on tagging protocol change  Vladimir Oltean
Issue a request_module() call when an attempt to change the tagging protocol is made, either by sysfs or by device tree. In the case of ocelot (the only driver for which the default and the alternative tagging protocol are compiled as different modules), the user is now no longer required to insert tag_ocelot_8021q.ko manually.

In the particular case of ocelot, this solves a problem where tag_ocelot_8021q.ko is built as module, and this is present in the device tree:

&mscc_felix_port4 {
	dsa-tag-protocol = "ocelot-8021q";
};

&mscc_felix_port5 {
	dsa-tag-protocol = "ocelot-8021q";
};

Because no one attempts to load the module into the kernel at boot time, the switch driver will fail to probe (actually forever defer) until someone manually inserts tag_ocelot_8021q.ko. This is now no longer necessary and happens automatically.

Rename dsa_find_tagger_by_name() to denote the change in functionality: there is now feature parity with dsa_tag_driver_get_by_id(), i.o.w. we also load the module if it's missing.

Link: https://lore.kernel.org/lkml/20221027113248.420216-1-michael@walle.cc/
Suggested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc> # on kontron-sl28 w/ ocelot_8021q
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
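The "look up, and load the module if it's missing" behaviour can be modelled in a few lines. The sketch below is a user-space toy, not kernel code: `TagRegistry`, its `request_module` method and the `loadable` table are hypothetical stand-ins for the list of registered tagging drivers, the kernel's module loader, and the modules available on disk.

```python
class TagRegistry:
    """Toy model of tagging-driver lookup with autoload on miss."""

    def __init__(self, loadable):
        self.loaded = {}                 # protocol name -> registered driver
        self.loadable = dict(loadable)   # protocol name -> module "on disk"
        self.requests = []               # modaliases passed to request_module

    def request_module(self, alias):
        """Stand-in for the module loader: register the driver if it exists."""
        self.requests.append(alias)
        name = alias.split(":", 1)[1]
        if name in self.loadable:
            self.loaded[name] = self.loadable[name]

    def get_by_name(self, name):
        """Return the driver for 'name', trying to autoload it first if absent."""
        if name not in self.loaded:
            self.request_module("dsa_tag:" + name)
        return self.loaded.get(name)


registry = TagRegistry({"ocelot-8021q": "tag_ocelot_8021q"})
print(registry.get_by_name("ocelot-8021q"))
```

Before the change, the equivalent of `get_by_name` would simply have failed when the driver was not yet registered, which is what made the switch driver defer its probe.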
2022-11-17  net: dsa: rename dsa_tag_driver_get() to dsa_tag_driver_get_by_id()  Vladimir Oltean
A future patch will introduce one more way of getting a reference on a tagging protocol driver (by name). Rename the current method to "by_id". Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  net: dsa: strip sysfs "tagging" string of trailing newline  Vladimir Oltean
Currently, dsa_find_tagger_by_name() uses sysfs_streq() which works both with strings that contain \n at the end (echo ocelot > .../dsa/tagging) and with strings that don't (printf ocelot > .../dsa/tagging). There will be a problem once we want to construct the modalias string based on which we auto-load the protocol kernel module. If the sysfs buffer ends in a newline, we need to strip it first. This is a preparatory patch specifically for that. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
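The difference between a newline-tolerant comparison and string construction can be seen in a short sketch. The functions below are illustrative Python stand-ins: `sysfs_streq_like` only imitates the newline tolerance the message attributes to sysfs_streq(), and `modalias_for` is a hypothetical helper that builds the by-name alias in the "dsa_tag:<name>" form shown in the modinfo output of this series.

```python
def strip_trailing_newline(buf):
    """Drop one trailing newline, as left behind by 'echo name > file'."""
    return buf[:-1] if buf.endswith("\n") else buf


def sysfs_streq_like(a, b):
    """Compare two strings while ignoring a single trailing newline on either."""
    return strip_trailing_newline(a) == strip_trailing_newline(b)


def modalias_for(name_from_sysfs):
    """Build the by-name modalias; the newline must be gone before this point."""
    return "dsa_tag:" + strip_trailing_newline(name_from_sysfs)


# Comparison tolerates the newline, but naive concatenation would not.
print(sysfs_streq_like("ocelot\n", "ocelot"))
print(repr("dsa_tag:" + "ocelot\n"))
print(repr(modalias_for("ocelot\n")))
```

A comparison helper can hide the newline, but once the buffer is pasted into another string the newline becomes part of the alias and the module lookup would no longer match.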
2022-11-17  net: dsa: provide a second modalias to tag proto drivers based on their name  Vladimir Oltean
Currently, tagging protocol drivers have a modalias of "dsa_tag:id-<number>", where the number is one of DSA_TAG_PROTO_*_VALUE. This modalias makes it possible for the request_module() call in dsa_tag_driver_get() to work, given the input it has - an integer returned by ds->ops->get_tag_protocol(). It is also possible to change tagging protocols at (pseudo-)runtime, via sysfs or via device tree, and this works via the name string of the tagging protocol rather than via its id (DSA_TAG_PROTO_*_VALUE). In the latter case, there is no request_module() call, because there is no association that the DSA core has between the string name and the ID, to construct the modalias. The module is simply assumed to have been inserted. This is actually slightly problematic when the tagging protocol change should take place at probe time, since it's expected that the dependency module should get autoloaded. For this purpose, let's introduce a second modalias, so that the DSA core can call request_module() by name. There is no reason to make the modalias by name optional, so just modify the MODULE_ALIAS_DSA_TAG_DRIVER() macro to take both the ID and the name as arguments, and generate two modaliases behind the scenes. Suggested-by: Michael Walle <michael@walle.cc> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Michael Walle <michael@walle.cc> # on kontron-sl28 w/ ocelot_8021q Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
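The effect of generating two modaliases per tagging protocol can be sketched as follows. The alias formats and the ocelot/seville IDs are taken from the modinfo output quoted in the merge message of this series; `dsa_tag_aliases` itself is a hypothetical Python helper that only mimics what the message says the reworked MODULE_ALIAS_DSA_TAG_DRIVER() macro produces.

```python
def dsa_tag_aliases(proto_id, name):
    """Return the by-ID and by-name modaliases for one tagging protocol."""
    return ["dsa_tag:id-%d" % proto_id, "dsa_tag:" + name]


# tag_ocelot.ko carries two protocols, hence four aliases in modinfo.
aliases = dsa_tag_aliases(15, "ocelot") + dsa_tag_aliases(21, "seville")
for alias in aliases:
    print("alias:", alias)
```

The by-ID alias serves the default protocol reported by the switch driver, while the by-name alias serves requests made through sysfs or device tree, where only the name string is known.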
2022-11-17  net: dsa: rename tagging protocol driver modalias  Vladimir Oltean
It's autumn cleanup time, and today's target are modaliases. Michael says that for users of modinfo, "dsa_tag-20" is not the most suggestive name, and recommends a change to "dsa_tag-id-20". Andrew points out that other modaliases have a prefix delimited by colons, so he recommends "dsa_tag:20" instead of "dsa_tag-20". To satisfy both proposals, Florian recommends "dsa_tag:id-20". The modaliases are not stable ABI, and the essential information (protocol ID) is still conveyed in the new string, which request_module() must be adapted to form. Link: 20221027210830.3577793-1-vladimir.oltean@nxp.com Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Michael Walle <michael@walle.cc> Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  net: dsa: stop exposing tag proto module helpers to the world  Vladimir Oltean
The DSA tagging protocol driver macros are in the public include/net/dsa.h probably because that's also where the DSA_TAG_PROTO_*_VALUE macros are (MODULE_ALIAS_DSA_TAG_DRIVER hinges on those macro definitions). But there is no reason to expose these helpers to <net/dsa.h>. That header is shared between switch drivers (drivers/net/dsa/), tagging protocol drivers (net/dsa/tag_*.c), the DSA core (net/dsa/ sans tag_*.c), and the rest of the world (DSA master drivers, network stack, etc). Too much exposure. On the other hand, net/dsa/dsa_priv.h is included only by the DSA core and by DSA tagging protocol drivers (or IOW, "friend" modules). Also a bit too much exposure - I've contemplated creating a new header which is only included by tagging protocol drivers, but completely separating a new dsa_tag_proto.h from dsa_priv.h is not immediately trivial - for example dsa_slave_to_port() is used both from the fast path and from the control path. So for now, move these definitions to dsa_priv.h which at least hides them from the world. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  dt-bindings: net: ipq4019-mdio: document required clock-names  Robert Marko
IPQ5018, IPQ6018 and IPQ8074 require clock-names to be set as driver is requesting the clock based on it and not index, so document that and make it required for the listed SoC-s. Signed-off-by: Robert Marko <robimarko@gmail.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221114194734.3287854-4-robimarko@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  dt-bindings: net: ipq4019-mdio: require and validate clocks  Robert Marko
Now that we can match the platforms requiring clocks by compatible, start using those to allow clocks per compatible and make them required. Signed-off-by: Robert Marko <robimarko@gmail.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221114194734.3287854-3-robimarko@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  dt-bindings: net: ipq4019-mdio: add IPQ8074 compatible  Robert Marko
Allow using IPQ8074 specific compatible along with the fallback IPQ4019 one in order to be able to specify which compatibles require clocks to be able to validate them via schema. Signed-off-by: Robert Marko <robimarko@gmail.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221114194734.3287854-2-robimarko@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  dt-bindings: net: ipq4019-mdio: document IPQ6018 compatible  Robert Marko
Document the IPQ6018 compatible that is already being used in the DTS along with the fallback IPQ4019 compatible, as the driver itself only gets probed on the IPQ4019 and IPQ5018 compatibles. This is also required in order to specify which platforms require clocks to be defined and to validate that in the schema. Signed-off-by: Robert Marko <robimarko@gmail.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221114194734.3287854-1-robimarko@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  Merge branch 'net-dsa-use-more-appropriate-net_name_-constants-for-user-ports'  Jakub Kicinski
Rasmus Villemoes says:

====================
net: dsa: use more appropriate NET_NAME_* constants for user ports

The intention of commit 685343fc3ba6 ("net: add name_assign_type netdev attribute") was clearly that drivers be switched over one by one to select appropriate NET_NAME_* constants instead of NET_NAME_UNKNOWN. This small series attempts to do that for DSA user ports.

These are obviously and intentionally user-visible changes, so there's a small chance that they could lead to a regression. To make it easy to revert either of the "label in DT" and "fallback to eth%d" changes, this is done as a refactoring which shouldn't introduce any functional change (but by itself adds code which looks a little odd, with the two identical assignments in the two branches), followed by changing the constant used in each case in two different patches.
====================

Link: https://lore.kernel.org/r/20221116105205.1127843-1-linux@rasmusvillemoes.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
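The end result of the three patches in this series can be summarized as a two-branch decision. The sketch below mirrors that decision in Python. The numeric values follow the NET_NAME_* definitions in include/uapi/linux/netdevice.h, while `user_port_name` is a hypothetical helper, not the code in dsa_slave_create().

```python
# Values as defined in include/uapi/linux/netdevice.h.
NET_NAME_UNKNOWN = 0      # unknown origin (not exposed to userspace)
NET_NAME_ENUM = 1         # enumerated by the kernel
NET_NAME_PREDICTABLE = 2  # predictably named by the kernel


def user_port_name(dt_label):
    """Pick the name template and name_assign_type for a DSA user port.

    dt_label -- the 'label' property from device tree, or None if absent
    """
    if dt_label:
        # Name given explicitly by firmware: predictable.
        return dt_label, NET_NAME_PREDICTABLE
    # No label: fall back to the enumerated eth%d scheme.
    return "eth%d", NET_NAME_ENUM


print(user_port_name("lan1"))
print(user_port_name(None))
```

Keeping the two branches separate, even while they temporarily assigned the same constant, is what allows either change to be reverted on its own.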
2022-11-17  net: dsa: set name_assign_type to NET_NAME_ENUM for enumerated user ports  Rasmus Villemoes
When a user port does not have a label in device tree, and we thus fall back to the eth%d scheme, the proper constant to use is NET_NAME_ENUM. See also commit e9f656b7a214 ("net: ethernet: set default assignment identifier to NET_NAME_ENUM"), which in turn quoted commit 685343fc3ba6 ("net: add name_assign_type netdev attribute"):

    ... when the kernel has given the interface a name using global device enumeration based on order of discovery (ethX, wlanY, etc) ... are labelled NET_NAME_ENUM.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  net: dsa: use NET_NAME_PREDICTABLE for user ports with name given in DT  Rasmus Villemoes
When a user port has a label in device tree, the corresponding netdevice is, to quote include/uapi/linux/netdevice.h, "predictably named by the kernel". This is also explicitly one of the intended use cases for NET_NAME_PREDICTABLE, quoting 685343fc3ba6 ("net: add name_assign_type netdev attribute"):

    NET_NAME_PREDICTABLE: The ifname has been assigned by the kernel in a predictable way [...] Examples include [...] and names deduced from hardware properties (including being given explicitly by the firmware).

Expose that information properly for the benefit of userspace tools that make decisions based on the name_assign_type attribute, e.g. a systemd-udev rule with "kernel" in NamePolicy.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  net: dsa: refactor name assignment for user ports  Rasmus Villemoes
The following two patches each have a (small) chance of causing regressions for userspace and will in that case of course need to be reverted. In order to prepare for that and make those two patches independent and individually revertable, refactor the code which sets the names for user ports by moving the "fall back to eth%d if no label is given in device tree" to dsa_slave_create(). No functional change (at least none intended). Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  ethtool: doc: clarify what drivers can implement in their get_drvinfo()  Vincent Mailhol
Many of the drivers which implement ethtool_ops::get_drvinfo() print the .driver, .version or .bus_info of struct ethtool_drvinfo. To get a glance at the current state, do:

  $ git grep -W "get_drvinfo(struct"

Printing in those three fields is useless because:

  - since [1], the driver version should be the kernel version (at least for upstream drivers). Arguably, out of tree drivers might still want to set a custom version, but out of tree is not our focus.

  - since [2], the core is able to provide default values for .driver and .bus_info.

In summary, drivers may provide .fw_version and .erom_version, the rest is expected to be done by the core.

In struct ethtool_ops doc from linux/ethtool.h: rephrase field get_drvinfo() doc to discourage developers from implementing this callback.

In struct ethtool_drvinfo doc from uapi/linux/ethtool.h: remove the paragraph mentioning what drivers should do. Rationale: no need to repeat what is already written in struct ethtool_ops doc. But add a note that .fw_version and .erom_version are driver defined.

Also update the dummy driver and simply remove the callback in order not to confuse the newcomers: most of the drivers will not need this callback function any more.

[1] commit 6a7e25c7fb48 ("net/core: Replace driver version to be kernel version")
Link: https://git.kernel.org/torvalds/linux/c/6a7e25c7fb48

[2] commit edaf5df22cb8 ("ethtool: ethtool_get_drvinfo: populate drvinfo fields even if callback exits")
Link: https://git.kernel.org/netdev/net-next/c/edaf5df22cb8

Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/r/20221116171828.4093-1-mailhol.vincent@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17  Merge branch 'Allocated objects, BPF linked lists'  Alexei Starovoitov
Kumar Kartikeya Dwivedi says:

====================
This series introduces user defined BPF objects of a type in program BTF. This allows BPF programs to allocate their own objects, build their own object hierarchies, and use the basic building blocks provided by BPF runtime to build their own data structures flexibly.

Then, we introduce the support for single ownership BPF linked lists, which can be put inside BPF maps, or allocated objects, and hold such allocated objects as elements. It works as an intrusive collection, which is done to allow making allocated objects part of multiple data structures at the same time in the future.

The eventual goal of this and future patches is to allow one to do some limited form of kernel style programming in BPF C, and allow programmers to build their own complex data structures flexibly out of basic building blocks.

The key difference will be that such programs are verified to be safe, preserve runtime integrity of the system, and are proven to be bug free as far as the invariants of BPF specific APIs are concerned.

One immediate use case that will be using the entire infrastructure this series is introducing will be managing percpu NMI safe linked lists inside BPF programs.

The other use case this will serve in the near future will be linking kernel structures like XDP frame and sk_buff directly into user data structures (rbtree, pifomap, etc.) for packet queueing. This will follow single ownership concept included in this series.

The user has complete control of the internal locking, and hence also the batching of operations for each critical section.

The features are:
- Allocated objects.
- bpf_obj_new, bpf_obj_drop to allocate and free them.
- Single ownership BPF linked lists.
  - Support for them in BPF maps.
  - Support for them in allocated objects.
- Global spin locks.
- Spin locks inside allocated objects.

Some other notable things:
- Completely static verification of locking.
- Kfunc argument handling has been completely reworked.
- Argument rewriting support for kfuncs.
- A new bpf_experimental.h header as a dumping ground for these APIs.

Any functionality exposed in this series is NOT part of UAPI. It is only available through use of kfuncs, and structs that can be added to map value may also change their size or name in the future. Hence, every feature in this series must be considered experimental.

Follow-ups:
-----------
 * Support for kptrs (local and kernel) in local storage and percpu maps + kptr tests
 * Fixes for helper access checks rebasing on top of this series

Next steps:
-----------
 * NMI safe percpu single ownership linked lists (using local_t protection).
 * Lockless linked lists.
 * Allow RCU protected BPF allocated objects. This then allows RCU protected list lookups, since spinlock protection for readers does not scale.
 * Introduce bpf_refcount for local kptrs, shared ownership.
 * Introduce shared ownership linked lists.
 * Documentation.

Changelog:
----------
v9 -> v10
v9: https://lore.kernel.org/bpf/20221117225510.1676785-1-memxor@gmail.com
 * Deduplicate code to find btf_record of reg (Alexei)
 * Add linked_list test to DENYLIST.aarch64 (Alexei)
 * Disable some linked list tests for now so that they compile with clang nightly (Alexei)

v8 -> v9
v8: https://lore.kernel.org/bpf/20221117162430.1213770-1-memxor@gmail.com
 * Fix up commit log of patch 2, Simplify patch 3
 * Explain the implicit requirement of bpf_list_head requiring map BTF to match in btf_record_equal in a separate patch.

v7 -> v8
v7: https://lore.kernel.org/bpf/20221114191547.1694267-1-memxor@gmail.com
 * Fix early return in map_check_btf (Dan Carpenter)
 * Fix two memory leak bugs in local storage maps, outer maps
 * Address comments from Alexei and Dave
   * More local kptr -> allocated object renaming
   * Use krealloc with NULL instead of kmalloc + krealloc
   * Drop WARN_ON_ONCE for field_offs parsing
   * Combine kfunc add + remove patches into one
   * Drop STRONG suffix from KF_ARG_PTR_TO_KPTR
   * Rename is_kfunc_arg_ret_buf_size to is_kfunc_arg_scalar_with_name
   * Remove redundant check for reg->type and arg type in it
   * Drop void * ret type check
   * Remove code duplication in checks for NULL pointer with offset != 0
   * Fix two bpf_list_node typos
   * Improve log message for bpf_list_head operations
   * Improve comments for active_lock struct
   * Improve comments for Implementation details of process_spin_lock
 * Add Dave's acks

v6 -> v7
v6: https://lore.kernel.org/bpf/20221111193224.876706-1-memxor@gmail.com
 * Fix uninitialized variable warning (Dan Carpenter, Kernel Test Robot)
 * One more local_kptr renaming

v5 -> v6
v5: https://lore.kernel.org/bpf/20221107230950.7117-1-memxor@gmail.com
 * Replace (i && !off) check with next_off, include test (Andrii)
 * Drop local kptrs naming (Andrii, Alexei)
 * Drop reg->precise == EXACT patch (Andrii)
 * Add comment about ptr member of struct active_lock (Andrii)
 * Use btf__new_empty + btf__add_xxx APIs (Andrii)
 * Address other misc nits from Andrii

v4 -> v5
v4: https://lore.kernel.org/bpf/20221103191013.1236066-1-memxor@gmail.com
 * Add a lot more selftests (failure, success, runtime, BTF)
 * Make sure series is bisect friendly
 * Move list draining out of spin lock
   * This exposed an issue where bpf_mem_free can now be called in map_free path without migrate_disable, also fixed that.
 * Rename MEM_ALLOC -> MEM_RINGBUF, MEM_TYPE_LOCAL -> MEM_ALLOC (Alexei)
 * Group lock identity into a struct active_lock { ptr, id } (Dave)
 * Split set_release_on_unlock logic into separate patch (Alexei)

v3 -> v4
v3: https://lore.kernel.org/bpf/20221102202658.963008-1-memxor@gmail.com
 * Fix compiler error for !CONFIG_BPF_SYSCALL (Kernel Test Robot)
 * Fix error due to BUILD_BUG_ON on 32-bit platforms (Kernel Test Robot)

v2 -> v3
v2: https://lore.kernel.org/bpf/20221013062303.896469-1-memxor@gmail.com
 * Add ack from Dave for patch 5
 * Rename btf_type_fields -> btf_record, btf_type_fields_off -> btf_field_offs, rename functions similarly (Alexei)
 * Remove 'kind' component from contains declaration tag (Alexei)
 * Move bpf_list_head, bpf_list_node definitions to UAPI bpf.h (Alexei)
 * Add note in commit log about modifying btf_struct_access API (Dave)
 * Downgrade WARN_ON_ONCE to verbose(env, "...") and return -EFAULT (Dave)
 * Add type_is_local_kptr wrapper to avoid noisy checks (Dave)
 * Remove unused flags parameter from bpf_kptr_new (Alexei)
 * Rename bpf_kptr_new -> bpf_obj_new, bpf_kptr_drop -> bpf_obj_drop (Alexei)
 * Reword comment in ref_obj_id_set_release_on_unlock (Dave)
 * Fix return type of ref_obj_id_set_release_on_unlock (Dave)
 * Introduce is_bpf_list_api_kfunc to dedup checks (Dave)
 * Disallow BPF_WRITE to untrusted local kptrs
 * Add details about soundness of check_reg_allocation_locked logic
 * List untrusted local kptrs for PROBE_MEM handling

v1 -> v2
v1: https://lore.kernel.org/bpf/20221011012240.3149-1-memxor@gmail.com
 * Rebase on bpf-next to resolve merge conflict in DENYLIST.s390x
 * Fix a couple of mental lapses in bpf_list_head_free

RFC v1 -> v1
RFC v1: https://lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com
 * Mostly a complete rewrite of BTF parsing, refactor existing code (Kartikeya)
 * Rebase kfunc rewrite for bpf-next, add support for more changes
 * Cache type metadata in BTF to avoid recomputation inside verifier (Kartikeya)
 * Remove __kernel tag, make things similar to map values, reserve bpf_ prefix
 * bpf_kptr_new, bpf_kptr_drop
 * Rename precision state enum values (Alexei)
 * Drop explicit constructor/destructor support (Alexei)
 * Rewrite code for constructing/destructing objects and offload to runtime
 * Minimize duplication in bpf_map_value_off_desc handling (Alexei)
 * Expose global memory allocator (Alexei)
 * Address other nits from Alexei
 * Split out local kptrs in maps, more kptrs in maps support into a follow up

Links:
------
 * Dave's BPF RB-Tree RFC series
   v1 (Discussion thread)
   https://lore.kernel.org/bpf/20220722183438.3319790-1-davemarchevsky@fb.com
   v2 (With support for static locks)
   https://lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
 * BPF Linked Lists Discussion
   https://lore.kernel.org/bpf/CAP01T74U30+yeBHEgmgzTJ-XYxZ0zj71kqCDJtTH9YQNfTK+Xw@mail.gmail.com
 * BPF Memory Allocator from Alexei
   https://lore.kernel.org/bpf/20220902211058.60789-1-alexei.starovoitov@gmail.com
 * BPF Memory Allocator UAPI Discussion
   https://lore.kernel.org/bpf/d3f76b27f4e55ec9e400ae8dcaecbb702a4932e8.camel@fb.com
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
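The single ownership idea behind these lists can be pictured with a small runtime model: an object belongs to at most one list at a time, pushing hands ownership to the list, and popping hands it back to the caller. The sketch below is plain Python with hypothetical names (`Obj`, `OwnedList`, `OwnershipError`). In the series itself this rule is enforced statically by the verifier rather than checked at run time, so the model only conveys the invariant, not the mechanism.

```python
class OwnershipError(Exception):
    """Raised when an object would end up owned by two lists at once."""


class Obj:
    def __init__(self, value):
        self.value = value
        self.owner = None  # the list currently owning this object, if any


class OwnedList:
    """Toy single-ownership list: elements belong to exactly one list."""

    def __init__(self):
        self._items = []

    def push_front(self, obj):
        if obj.owner is not None:
            raise OwnershipError("object is already owned by a list")
        obj.owner = self           # ownership moves to the list
        self._items.insert(0, obj)

    def pop_front(self):
        if not self._items:
            return None
        obj = self._items.pop(0)
        obj.owner = None           # ownership returns to the caller
        return obj


first, second = OwnedList(), OwnedList()
node = Obj(42)
first.push_front(node)
moved = first.pop_front()   # caller owns the object again
second.push_front(moved)    # and may now hand it to another list
```

Moving an element between lists therefore always goes through the caller, which is the point at which the caller's locking for each list applies.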
2022-11-17selftests/bpf: Temporarily disable linked list testsKumar Kartikeya Dwivedi
The latest clang nightly as of writing crashes with the given test case for BPF linked lists wherever global glock, ghead, glock2 are used, hence comment out the parts that cause the crash, and prepare this commit so that it can be reverted when the fix has been made. More context in [0]. [0]: https://lore.kernel.org/bpf/d56223f9-483e-fbc1-4564-44c0858a1e3e@meta.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-25-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17selftests/bpf: Add BTF sanity testsKumar Kartikeya Dwivedi
Preparing the metadata for bpf_list_head involves a complicated parsing step and type resolution for the contained value. Ensure that corner cases are tested against and invalid specifications in source are duly rejected. Also include tests for incorrect ownership relationships in the BTF. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-24-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17selftests/bpf: Add BPF linked list API testsKumar Kartikeya Dwivedi
Include various tests covering the success and failure cases. Also, run the success cases at runtime to verify correctness of linked list manipulation routines, in addition to ensuring successful verification. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-23-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17selftests/bpf: Add failure test cases for spin lock pairingKumar Kartikeya Dwivedi
First, ensure that whenever a bpf_spin_lock is present in an allocation, the reg->id is preserved. This won't be true for global variables however, since they have a single map value per map, hence the verifier hardcodes it to 0 (so that multiple pseudo ldimm64 insns can yield the same lock object per map at a given offset). Next, add test cases for all possible combinations (kptr, global, map value, inner map value). Since we lifted the restriction on locking in inner maps, also add test cases for them. Currently, each lookup into an inner map gets a fresh reg->id, so even if the reg->map_ptr is the same, they will be treated as separate allocations and the incorrect unlock pairing will be rejected. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-22-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17selftests/bpf: Update spinlock selftestKumar Kartikeya Dwivedi
Make updates in preparation for adding more test cases to this selftest: - Convert from CHECK_ to ASSERT macros. - Use BPF skeleton - Fix typo sping -> spin - Rename spinlock.c -> spin_lock.c Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-21-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17selftests/bpf: Add __contains macro to bpf_experimental.hKumar Kartikeya Dwivedi
Add user facing __contains macro which provides a convenient wrapper over the verbose kernel specific BTF declaration tag required to annotate BPF list head structs in user types. Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-20-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Add comments for map BTF matching requirement for bpf_list_headKumar Kartikeya Dwivedi
The old behavior of bpf_map_meta_equal was that it compared timer_off to be equal (but not spin_lock_off, because that was not allowed), and did memcmp of kptr_off_tab. Now, we memcmp the btf_record of two bpf_map structs, which has all fields. We preserve backwards compat as we kzalloc the array, so if only spin lock and timer exist in the map, we only compare offsets while the rest of the unused members in the btf_field struct are zeroed out. In case of kptr, btf and everything else is of vmlinux or module, so as long as the type is the same it will match, since kernel btf, module, dtor pointer will be the same across maps. Now with list_head in the mix, things are a bit complicated. We implicitly add a requirement that both BTFs are the same, because struct btf_field_list_head has btf and value_rec members. We obviously shouldn't force BTFs to be equal by default, as that breaks backwards compatibility. Currently it is only implicitly required due to list_head matching struct btf and value_rec member. value_rec points back into a btf_record stashed in the map BTF (btf member of btf_field_list_head). So that pointer and btf member have to match exactly. Document all these subtle details so that things don't break in the future when touching this code. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-19-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}Kumar Kartikeya Dwivedi
This commit implements the delayed release logic for bpf_list_push_front and bpf_list_push_back. Once a node has been added to the list, its pointer changes to PTR_UNTRUSTED. However, it is only released once the lock protecting the list is unlocked. For such PTR_TO_BTF_ID | MEM_ALLOC with PTR_UNTRUSTED set but an active ref_obj_id, it is still permitted to read them as long as the lock is held. Writing to them is not allowed. This allows having read access to pushed items we no longer own until we release the lock guarding the list, allowing a little more flexibility when working with these APIs. Note that enabling write support has fairly tricky interactions with what happens inside the critical section. Just as an example, currently, bpf_obj_drop is not permitted, but if it were, being able to write to the PTR_UNTRUSTED pointer while the object gets released back to the memory allocator would violate safety properties we wish to guarantee (i.e. not crashing the kernel). The memory could be reused for a different type in the BPF program or even in the kernel as it gets eventually kfree'd. Not enabling bpf_obj_drop inside the critical section would appear to prevent all of the above, but that is more of an artificial limitation right now. Since the write support is tangled with how we handle potential aliasing of nodes inside the critical section that may or may not be part of the list anymore, it has been deferred to a future patch. Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-18-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
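The read/write rules in this commit message can be modelled in plain user-space C. The sketch below is only an illustration of those rules, not verifier code: `struct node_ref`, `model_push`, `model_unlock`, `may_read` and `may_write` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of how the verifier might track one allocated node. */
struct node_ref {
	bool untrusted;         /* stands in for PTR_UNTRUSTED */
	bool release_on_unlock; /* reference is dropped when the lock is released */
	bool released;          /* reference is gone, pointer is unusable */
};

/* Pushing hands ownership to the list but keeps the reference alive for now. */
static void model_push(struct node_ref *ref)
{
	assert(!ref->released);
	ref->untrusted = true;
	ref->release_on_unlock = true;
}

/* Unlocking ends the critical section and drops any deferred reference. */
static void model_unlock(struct node_ref *ref)
{
	if (ref->release_on_unlock)
		ref->released = true;
}

/* Reads stay legal until the reference is actually released. */
static bool may_read(const struct node_ref *ref)
{
	return !ref->released;
}

/* Writes need a live reference that has not been marked untrusted. */
static bool may_write(const struct node_ref *ref)
{
	return !ref->released && !ref->untrusted;
}
```

In this model a node is readable and writable before the push, read-only between the push and the unlock, and inaccessible afterwards.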
2022-11-17bpf: Introduce single ownership BPF linked list APIKumar Kartikeya Dwivedi
Add a linked list API for use in BPF programs, where it expects protection from the bpf_spin_lock in the same allocation as the bpf_list_head. For now, only one bpf_spin_lock can be present hence that is assumed to be the one protecting the bpf_list_head. The following functions are added to kick things off: // Add node to beginning of list void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node); // Add node to end of list void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node); // Remove node at beginning of list and return it struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head); // Remove node at end of list and return it struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head); The lock protecting the bpf_list_head needs to be taken for all operations. The verifier ensures that the lock that needs to be taken is always held, and only the correct lock is taken for these operations. These checks are made statically by relying on the reg->id preserved for registers pointing into regions having both bpf_spin_lock and the objects protected by it. The comment over check_reg_allocation_locked in this change describes the logic in detail. Note that bpf_list_push_front and bpf_list_push_back are meant to consume the object containing the node in the 1st argument, however that specific mechanism is intended to not release the ref_obj_id directly until the bpf_spin_unlock is called. In this commit, nothing is done, but the next commit will be introducing logic to handle this case, so it has been left as is for now. bpf_list_pop_front and bpf_list_pop_back delete the first or last item of the list respectively, and return pointer to the element at the list_node offset. The user can then use container_of style macro to get the actual entry type. The verifier however statically knows the actual type, so the safety properties are still preserved. 
With these additions, programs can now manage their own linked lists and store their objects in them. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-17-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
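The push/pop semantics and the container_of step described above can be tried out with a small user-space model. This is a sketch under stated assumptions: `list_node`, `list_head`, the `list_*` functions and `struct elem` are hypothetical stand-ins, not the kernel's bpf_list_* kfuncs, and no locking or verifier checks are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins for bpf_list_node / bpf_list_head (hypothetical). */
struct list_node { struct list_node *prev, *next; };
struct list_head { struct list_node root; };

/* Recover the enclosing entry from a pointer to its embedded node. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
	head->root.prev = head->root.next = &head->root;
}

static void list_insert(struct list_node *node, struct list_node *prev,
			struct list_node *next)
{
	node->prev = prev;
	node->next = next;
	prev->next = node;
	next->prev = node;
}

/* Add node to beginning of list. */
static void list_push_front(struct list_head *head, struct list_node *node)
{
	list_insert(node, &head->root, head->root.next);
}

/* Add node to end of list. */
static void list_push_back(struct list_head *head, struct list_node *node)
{
	list_insert(node, head->root.prev, &head->root);
}

/* Unlink node; returns NULL when node is the sentinel (list is empty). */
static struct list_node *list_unlink(struct list_head *head,
				     struct list_node *node)
{
	if (node == &head->root)
		return NULL;
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = NULL;
	return node;
}

/* Remove node at beginning of list and return it. */
static struct list_node *list_pop_front(struct list_head *head)
{
	return list_unlink(head, head->root.next);
}

/* Remove node at end of list and return it. */
static struct list_node *list_pop_back(struct list_head *head)
{
	return list_unlink(head, head->root.prev);
}

/* Entry type with the node at a non-zero offset, as in the pop case above. */
struct elem {
	int value;
	struct list_node node;
};
```

As in the commit message, the pop functions return a pointer to the embedded node, so the caller must NULL check it first and only then apply container_of to reach the entry.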
2022-11-17bpf: Permit NULL checking pointer with non-zero fixed offsetKumar Kartikeya Dwivedi
Pointer increment on seeing PTR_MAYBE_NULL is already protected against, hence make an exception for PTR_TO_BTF_ID | MEM_ALLOC while still keeping the warning for other unintended cases that might creep in. bpf_list_pop_{front,back} helpers planned to be introduced in the next commit will return a MEM_ALLOC register with incremented offset pointing to the bpf_list_node field. The user is supposed to then obtain the pointer to the entry using container_of after NULL checking it. The current restrictions trigger a warning when doing the NULL checking. Revisiting the reason, it is meant as an assertion which seems to actually work and catch the bad case. Hence, under no other circumstances can reg->off be non-zero for a register that has the PTR_MAYBE_NULL type flag set. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-16-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Introduce bpf_obj_dropKumar Kartikeya Dwivedi
Introduce bpf_obj_drop, which is the kfunc used to free allocated objects (allocated using bpf_obj_new). Pairing with bpf_obj_new, it implicitly destructs the fields part of object automatically without user intervention. Just like the previous patch, btf_struct_meta that is needed to free up the special fields is passed as a hidden argument to the kfunc. For the user, a convenience macro hides over the kernel side kfunc which is named bpf_obj_drop_impl. Continuing the previous example: void prog(void) { struct foo *f; f = bpf_obj_new(typeof(*f)); if (!f) return; bpf_obj_drop(f); } Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-15-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Introduce bpf_obj_newKumar Kartikeya Dwivedi
Introduce type safe memory allocator bpf_obj_new for BPF programs. The kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments to kfuncs still requires having them in prototype, unlike BPF helpers which always take 5 arguments and have them checked using bpf_func_proto in verifier, ignoring unset argument types. Introduce __ign suffix to ignore a specific kfunc argument during type checks, then use this to introduce support for passing type metadata to the bpf_obj_new_impl kfunc. The user passes the BTF ID of the type it wants to allocate in program BTF, the verifier then rewrites the first argument as the size of this type, after performing some sanity checks (to ensure it exists and it is a struct type). The second argument is also fixed up and passed by the verifier. This is the btf_struct_meta for the type being allocated. It would be needed mostly for the offset array which is required for zero initializing special fields while leaving the rest of storage in uninitialized state. It would also be needed in the next patch to perform proper destruction of the object's special fields. Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free, using the any context BPF memory allocator introduced recently. To this end, a global instance of the BPF memory allocator is initialized on boot to be used for this purpose. This 'bpf_global_ma' serves all allocations for bpf_obj_new. In the future, bpf_obj_new variants will allow specifying a custom allocator. Note that now that bpf_obj_new can be used to allocate objects that can be linked to BPF linked list (when future linked list helpers are available), we need to also free the elements using bpf_mem_free. However, since the draining of elements is done outside the bpf_spin_lock, we need to do migrate_disable around the call since bpf_list_head_free can be called from map free path where migration is enabled. Otherwise, when called from BPF programs migration is already disabled. 
A convenience macro is included in the bpf_experimental.h header to hide over the ugly details of the implementation, leading to user code looking similar to a language level extension which allocates and constructs fields of a user type. struct bar { struct bpf_list_node node; }; struct foo { struct bpf_spin_lock lock; struct bpf_list_head head __contains(bar, node); }; void prog(void) { struct foo *f; f = bpf_obj_new(typeof(*f)); if (!f) return; ... } A key piece of this story is still missing, i.e. the free function, which will come in the next patch. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Support constant scalar arguments for kfuncsKumar Kartikeya Dwivedi
Allow passing known constant scalars as arguments to kfuncs that do not represent a size parameter. We use mark_chain_precision for the constant scalar argument to mark it precise. This makes the search pruning optimization of verifier more conservative for such kfunc calls, and each non-distinct argument is considered unequivalent. We will use this support to then expose a bpf_obj_new function where it takes the local type ID of a type in program BTF, and returns a PTR_TO_BTF_ID | MEM_ALLOC to the local type, and allows programs to allocate their own objects. Each type ID resolves to a distinct type with a possibly distinct size, hence the type ID constant matters in terms of program safety and its precision needs to be checked between old and cur states inside regsafe. The use of mark_chain_precision enables this. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-13-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Rewrite kfunc argument handlingKumar Kartikeya Dwivedi
As we continue to add more features, argument types, kfunc flags, and different extensions to kfuncs, the code to verify the correctness of the kfunc prototype wrt the passed in registers has become ad-hoc and ugly to read. To make life easier, and make a very clear split between different stages of argument processing, move all the code into verifier.c and refactor into easier to read helpers and functions. This also makes sharing code within the verifier easier with kfunc argument processing. This will be more and more useful in later patches as we are now moving to implement very core BPF helpers as kfuncs, to keep them experimental before baking into UAPI. Remove all kfunc related bits now from btf_check_func_arg_match, as users have been converted away to refactored kfunc argument handling. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Allow locking bpf_spin_lock in inner map valuesKumar Kartikeya Dwivedi
There is no need to restrict users from locking bpf_spin_lock in map values of inner maps. Each inner map lookup gets a unique reg->id assigned to the returned PTR_TO_MAP_VALUE which will be preserved after the NULL check. Distinct lookups into different inner map get unique IDs, and distinct lookups into same inner map also get unique IDs. Hence, lift the restriction by removing the check return -ENOTSUPP in map_in_map.c. Later commits will add comprehensive test cases to ensure that invalid cases are rejected. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-11-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Allow locking bpf_spin_lock global variablesKumar Kartikeya Dwivedi
Global variables reside in maps accessible using direct_value_addr callbacks, so giving each load instruction's rewrite a unique reg->id disallows us from holding locks which are global. The reason for preserving reg->id as a unique value for registers that may point to spin lock is that two separate lookups are treated as two separate memory regions, and any possible aliasing is ignored for the purposes of spin lock correctness. This is not great especially for the global variable case, which are served from maps that have max_entries == 1, i.e. they always lead to map values pointing into the same map value. So refactor the active_spin_lock into a 'active_lock' structure which represents the lock identity, and instead of the reg->id, remember two fields, a pointer and the reg->id. The pointer will store reg->map_ptr or reg->btf. It's only necessary to distinguish for the id == 0 case of global variables, but always setting the pointer to a non-NULL value and using the pointer to check whether the lock is held simplifies code in the verifier. This is generic enough to allow it for global variables, map lookups, and allocated objects at the same time. Note that while whether a lock is held can be answered by just comparing active_lock.ptr to NULL, to determine whether the register is pointing to the same held lock requires comparing _both_ ptr and id. Finally, as a result of this refactoring, pseudo load instructions are not given a unique reg->id, as they are doing lookup for the same map value (max_entries is never greater than 1). Essentially, we consider that the tuple of (ptr, id) will always be unique for any kind of argument to bpf_spin_{lock,unlock}. Note that this can be extended in the future to also remember offset used for locking, so that we can introduce multiple bpf_spin_lock fields in the same allocation. 
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-10-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
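The (ptr, id) lock identity described above can be sketched in a few lines of user-space C. The struct name `active_lock` and its two fields come from the commit message; the helper names `lock_is_held`, `lock_acquire` and `lock_matches` are hypothetical and only illustrate the comparison rules, not the actual verifier code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Lock identity: a pointer (reg->map_ptr or reg->btf) plus the reg->id. */
struct active_lock {
	const void *ptr; /* NULL means no lock is currently held */
	uint32_t id;     /* 0 for global variables, unique per lookup otherwise */
};

/* Whether any lock is held can be answered from the pointer alone. */
static bool lock_is_held(const struct active_lock *cur)
{
	return cur->ptr != NULL;
}

/* Remember the lock being taken; the pointer is always set non-NULL. */
static void lock_acquire(struct active_lock *cur, const void *ptr, uint32_t id)
{
	assert(ptr != NULL);
	cur->ptr = ptr;
	cur->id = id;
}

/* Whether a register refers to the held lock needs both ptr and id. */
static bool lock_matches(const struct active_lock *cur, const void *ptr,
			 uint32_t id)
{
	return lock_is_held(cur) && cur->ptr == ptr && cur->id == id;
}
```

With id == 0 for every global variable, the pointer is what tells two global locks in different maps apart, while for map lookups and allocated objects the unique id does most of the work.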
2022-11-17bpf: Allow locking bpf_spin_lock in allocated objectsKumar Kartikeya Dwivedi
Allow locking a bpf_spin_lock in an allocated object, in addition to already supported map value pointers. The handling is similar to that of map values, by just preserving the reg->id of PTR_TO_BTF_ID | MEM_ALLOC as well, and adjusting process_spin_lock to work with them and remember the id in verifier state. Refactor the existing process_spin_lock to work with PTR_TO_BTF_ID | MEM_ALLOC in addition to PTR_TO_MAP_VALUE. We need to update the reg_may_point_to_spin_lock which is used in mark_ptr_or_null_reg to preserve reg->id, that will be used in env->cur_state->active_spin_lock to remember the currently held spin lock. Also update the comment describing bpf_spin_lock implementation details to also talk about PTR_TO_BTF_ID | MEM_ALLOC type. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Verify ownership relationships for user BTF typesKumar Kartikeya Dwivedi
Ensure that there can be no ownership cycles among different types by way of having owning objects that can hold some other type as their element. For instance, a map value can only hold allocated objects, but these are allowed to have another bpf_list_head. To prevent unbounded recursion while freeing resources, elements of bpf_list_head in local kptrs can never have a bpf_list_head which are part of list in a map value. Later patches will verify this by having dedicated BTF selftests. Also, to make runtime destruction easier, once btf_struct_metas is fully populated, we can stash the metadata of the value type directly in the metadata of the list_head fields, as that allows easier access to the value type's layout to destruct it at runtime from the btf_field entry of the list head itself. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Recognize lock and list fields in allocated objectsKumar Kartikeya Dwivedi
Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in an allocated object. Also update btf_struct_access to reject direct access to these special fields. A bpf_list_head allows implementing map-in-map style use cases, where an allocated object with bpf_list_head is linked into a list in a map value. This would require embedding a bpf_list_node, support for which is also included. The bpf_spin_lock is used to protect the bpf_list_head and other data. While we don't strictly need to hold a bpf_spin_lock while touching the bpf_list_head in such objects, as when we have access to it, we have complete ownership of the object, the locking constraint is still kept and may be conditionally lifted in the future. Note that the specification of such types can be done just like map values, e.g.: struct bar { struct bpf_list_node node; }; struct foo { struct bpf_spin_lock lock; struct bpf_list_head head __contains(bar, node); struct bpf_list_node node; }; struct map_value { struct bpf_spin_lock lock; struct bpf_list_head head __contains(foo, node); }; To recognize such types in user BTF, we build a btf_struct_metas array of metadata items corresponding to each BTF ID. This is done once during the btf_parse stage to avoid having to do it each time during the verification process's requirement to inspect the metadata. Moreover, the computed metadata needs to be passed to some helpers in future patches which requires allocating them and storing them in the BTF that is pinned by the program itself, so that valid access can be assumed to such data during program runtime. A key thing to note is that once a btf_struct_meta is available for a type, both the btf_record and btf_field_offs should be available. It is critical that btf_field_offs is available in case special fields are present, as we extensively rely on special fields being zeroed out in map values and allocated objects in later patches. 
The code ensures that by bailing out in case of errors and ensuring both are available together. If the record is not available, the special fields won't be recognized, so not having both is also fine (in terms of being a verification error and not a runtime bug). Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Introduce allocated objects supportKumar Kartikeya Dwivedi
Introduce support for representing pointers to objects allocated by the BPF program, i.e. PTR_TO_BTF_ID that point to a type in program BTF. This is indicated by the presence of MEM_ALLOC type flag in reg->type to avoid having to check btf_is_kernel when trying to match argument types in helpers. Whenever walking such types, any pointers being walked will always yield a SCALAR instead of pointer. In the future we might permit kptr inside such allocated objects (either kernel or program allocated), and it will then form a PTR_TO_BTF_ID of the respective type. For now, such allocated objects will always be referenced in verifier context, hence ref_obj_id == 0 for them is a bug. It is allowed to write to such objects, as long as fields that are special are not touched (support for which will be added in subsequent patches). Note that once such a pointer is marked PTR_UNTRUSTED, it is no longer allowed to write to it. No PROBE_MEM handling is therefore done for loads into this type unless PTR_UNTRUSTED is part of the register type, since they can never be in an undefined state, and their lifetime will always be valid. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Populate field_offs for inner_map_metaKumar Kartikeya Dwivedi
Far too much code simply assumes that both btf_record and btf_field_offs are set to valid pointers together, or both are unset. They go together hand in hand as btf_record describes the special fields and btf_field_offs is compact representation for runtime copying/zeroing. It is very difficult to make this clear in the code when the only exception to this universal invariant is inner_map_meta which is used as reg->map_ptr in the verifier. This is simply a bug waiting to happen, as in verifier context we cannot easily distinguish if PTR_TO_MAP_VALUE is coming from an inner map, and if we ever end up using field_offs for any reason in the future, we will silently ignore the special fields for inner map case (as NULL is not an error but unset field_offs). Hence, simply copy field_offs from inner map together with btf_record. While at it, refactor code to unwind properly on errors with gotos. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Free inner_map_meta when btf_record_dup failsKumar Kartikeya Dwivedi
Whenever btf_record_dup fails, we must free inner_map_meta that was allocated before. This fixes a memory leak (in case of errors) during inner map creation. Fixes: aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
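The shape of this fix, freeing the earlier allocation when a later duplication step fails, can be sketched in user-space C. Everything here is a hypothetical stand-in (`struct record`, `struct map_meta`, `record_dup`, `map_meta_alloc`, the `live_allocs` counter), and the sketch returns NULL on failure where kernel code would typically use error pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct record { int nr_fields; };
struct map_meta { struct record *record; };

/* Counts outstanding allocations so that a leak becomes observable. */
static int live_allocs;

static void *tracked_alloc(size_t size)
{
	void *p = calloc(1, size);

	if (p)
		live_allocs++;
	return p;
}

static void tracked_free(void *p)
{
	if (!p)
		return;
	live_allocs--;
	free(p);
}

/* Stand-in for the duplication step; inject_failure simulates an error. */
static struct record *record_dup(const struct record *src, bool inject_failure)
{
	struct record *dup;

	if (inject_failure)
		return NULL;
	dup = tracked_alloc(sizeof(*dup));
	if (dup)
		*dup = *src;
	return dup;
}

/* Allocate the meta object first, then unwind it if duplication fails. */
static struct map_meta *map_meta_alloc(const struct record *src,
				       bool inject_failure)
{
	struct map_meta *meta;

	assert(src != NULL);
	meta = tracked_alloc(sizeof(*meta));
	if (!meta)
		return NULL;
	meta->record = record_dup(src, inject_failure);
	if (!meta->record)
		goto free_meta;
	return meta;

free_meta:
	tracked_free(meta); /* without this line the meta object would leak */
	return NULL;
}

static void map_meta_free(struct map_meta *meta)
{
	if (!meta)
		return;
	tracked_free(meta->record);
	tracked_free(meta);
}
```

After a failed `map_meta_alloc` call the `live_allocs` counter is back at zero, which is the property the fix restores.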
2022-11-17bpf: Do btf_record_free outside map_free callbackKumar Kartikeya Dwivedi
Since the commit being fixed, we now miss freeing btf_record for local storage maps which will have a btf_record populated in case they have bpf_spin_lock element. This was missed because I made the choice of offloading the job to free kptr_off_tab (now btf_record) to the map_free callback when adding support for kptrs. Revisiting the reason for this decision, there is the possibility that the btf_record gets used inside map_free callback (e.g. in case of maps embedding kptrs) to iterate over them and free them, hence doing it before the map_free callback would be leaking special field memory, and do invalid memory access. The btf_record keeps module references which is critical to ensure the dtor call made for referenced kptr is safe to do. If doing it after map_free callback, the map area is already freed, so we cannot access bpf_map structure anymore. To fix this and prevent such lapses in future, move bpf_map_free_record out of the map_free callback, and do it after map_free by remembering the btf_record pointer. There is no need to access bpf_map structure in that case, and we can avoid missing this case when support for new map types is added for other special fields. Since a btf_record and its btf_field_offs are used together, for consistency delay freeing of field_offs as well. While not a problem right now, a lot of code assumes that either both record and field_offs are set or none at once. Note that in case of map of maps (outer maps), inner_map_meta->record is only used during verification, not to free fields in map value, hence we simply keep the bpf_map_free_record call as is in bpf_map_meta_free and never touch map->inner_map_meta in bpf_map_free_deferred. Add a comment making note of these details. 
Fixes: db559117828d ("bpf: Consolidate spin_lock, timer management into btf_record") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
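The ordering argument above (the callback may still need the record, but the map itself is gone once the callback returns) can be illustrated with a minimal user-space sketch. The types and functions below (`struct record`, `struct map`, `map_free_cb`, `map_free_deferred`, `free_order`) are hypothetical stand-ins and not the kernel's bpf_map code.

```c
#include <assert.h>
#include <stdlib.h>

struct record { int nr_fields; };
struct map { struct record *record; };

/* Log of free events, used only to make the ordering observable. */
static char free_order[8];
static int free_pos;

static void log_free(char what)
{
	if (free_pos < (int)sizeof(free_order) - 1)
		free_order[free_pos++] = what;
}

static void record_free(struct record *rec)
{
	log_free('R');
	free(rec);
}

/* Stand-in for a map_free callback: it may still walk the record. */
static void map_free_cb(struct map *map)
{
	assert(map->record != NULL); /* record must outlive the callback */
	log_free('M');
	free(map);                   /* the map area is gone after this */
}

/* Remember the record pointer, run the callback, then free the record. */
static void map_free_deferred(struct map *map)
{
	struct record *rec = map->record; /* map must not be touched later */

	map_free_cb(map);
	record_free(rec);
}

static struct map *map_create(void)
{
	struct map *map = calloc(1, sizeof(*map));

	if (!map)
		return NULL;
	map->record = calloc(1, sizeof(*map->record));
	if (!map->record) {
		free(map);
		return NULL;
	}
	return map;
}
```

Because the record is freed in one common place after the callback, a new map type in this model cannot forget to free it, which mirrors the motivation given in the commit message.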