summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-11-21net: page_pool: avoid touching slow on the fastpathJakub Kicinski
To fully benefit from previous commit add one byte of state in the first cache line recording if we need to look at the slow part. The packing isn't all that impressive right now, we create a 7B hole. I'm expecting Olek's rework will reshuffle this, anyway. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20231121000048.789613-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21net: page_pool: split the page_pool_params into fast and slowJakub Kicinski
struct page_pool is rather performance critical and we use 16B of the first cache line to store 2 pointers used only by test code. Future patches will add more informational (non-fast path) attributes. It's convenient for the user of the API to not have to worry which fields are fast and which are slow path. Use struct groups to split the params into the two categories internally. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20231121000048.789613-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-11-21 We've added 19 non-merge commits during the last 4 day(s) which contain a total of 18 files changed, 1043 insertions(+), 416 deletions(-). The main changes are: 1) Fix BPF verifier to validate callbacks as if they are called an unknown number of times in order to fix not detecting some unsafe programs, from Eduard Zingerman. 2) Fix bpf_redirect_peer() handling which missed proper stats accounting for veth and netkit and also generally fix missing stats for the latter, from Peilin Ye, Daniel Borkmann et al. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: check if max number of bpf_loop iterations is tracked bpf: keep track of max number of bpf_loop callback iterations selftests/bpf: test widening for iterating callbacks bpf: widening for callback iterators selftests/bpf: tests for iterating callbacks bpf: verify callbacks as if they are called unknown number of times bpf: extract setup_func_entry() utility function bpf: extract __check_reg_arg() utility function selftests/bpf: fix bpf_loop_bench for new callback verification scheme selftests/bpf: track string payload offset as scalar in strobemeta selftests/bpf: track tcp payload offset as scalar in xdp_synproxy selftests/bpf: Add netkit to tc_redirect selftest selftests/bpf: De-veth-ize the tc_redirect test case bpf, netkit: Add indirect call wrapper for fetching peer dev bpf: Fix dev's rx stats for bpf_redirect_peer traffic veth: Use tstats per-CPU traffic counters netkit: Add tstats per-CPU traffic counters net: Move {l,t,d}stats allocation to core and convert veth & vrf net, vrf: Move dstats structure to core ==================== Link: https://lore.kernel.org/r/20231121193113.11796-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21net: do not send a MOVE event when netdev changes netnsJakub Kicinski
Networking supports changing netdevice's netns and name at the same time. This allows avoiding name conflicts and having to rename the interface in multiple steps. E.g. netns1={eth0, eth1}, netns2={eth1} - we want to move netns1:eth1 to netns2 and call it eth0 there. If we can't rename "in flight" we'd need to (1) rename eth1 -> $tmp, (2) change netns, (3) rename $tmp -> eth0. To rename the underlying struct device we have to call device_rename(). The rename()'s MOVE event, however, doesn't "belong" to either the old or the new namespace. If there are conflicts on both sides it's actually impossible to issue a real MOVE (old name -> new name) without confusing user space. And Daniel reports that such confusions do in fact happen for systemd, in real life. Since we already issue explicit REMOVE and ADD events manually - suppress the MOVE event completely. Move the ADD after the rename, so that the REMOVE uses the old name, and the ADD the new one. If there is no rename this changes the picture as follows: Before: old ns | KERNEL[213.399289] remove /devices/virtual/net/eth0 (net) new ns | KERNEL[213.401302] add /devices/virtual/net/eth0 (net) new ns | KERNEL[213.401397] move /devices/virtual/net/eth0 (net) After: old ns | KERNEL[266.774257] remove /devices/virtual/net/eth0 (net) new ns | KERNEL[266.774509] add /devices/virtual/net/eth0 (net) If there is a rename and a conflict (using the exact eth0/eth1 example explained above) we get this: Before: old ns | KERNEL[224.316833] remove /devices/virtual/net/eth1 (net) new ns | KERNEL[224.318551] add /devices/virtual/net/eth1 (net) new ns | KERNEL[224.319662] move /devices/virtual/net/eth0 (net) After: old ns | KERNEL[333.033166] remove /devices/virtual/net/eth1 (net) new ns | KERNEL[333.035098] add /devices/virtual/net/eth0 (net) Note that "in flight" rename is only performed when needed. If there is no conflict for old name in the target netns - the rename will be performed separately by dev_change_name(), as if the rename was a different command, and there will still be a MOVE event for the rename: Before: old ns | KERNEL[194.416429] remove /devices/virtual/net/eth0 (net) new ns | KERNEL[194.418809] add /devices/virtual/net/eth0 (net) new ns | KERNEL[194.418869] move /devices/virtual/net/eth0 (net) new ns | KERNEL[194.420866] move /devices/virtual/net/eth1 (net) After: old ns | KERNEL[71.917520] remove /devices/virtual/net/eth0 (net) new ns | KERNEL[71.919155] add /devices/virtual/net/eth0 (net) new ns | KERNEL[71.920729] move /devices/virtual/net/eth1 (net) If deleting the MOVE event breaks some user space we should insert an explicit kobject_uevent(MOVE) after the ADD, like this: @@ -11192,6 +11192,12 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net, kobject_uevent(&dev->dev.kobj, KOBJ_ADD); netdev_adjacent_add_links(dev); + /* User space wants an explicit MOVE event, issue one unless + * dev_change_name() will get called later and issue one. + */ + if (!pat || new_name[0]) + kobject_uevent(&dev->dev.kobj, KOBJ_MOVE); + /* Adapt owner in case owning user namespace of target network * namespace is different from the original one. */ Reported-by: Daniel Gröber <dxld@darkboxed.org> Link: https://lore.kernel.org/all/20231010121003.x3yi6fihecewjy4e@House.clients.dxld.at/ Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/all/20231120184140.578375-1-kuba@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21ipv4: Correct/silence an endian warning in __ip_do_redirectKunwu Chan
net/ipv4/route.c:783:46: warning: incorrect type in argument 2 (different base types) net/ipv4/route.c:783:46: expected unsigned int [usertype] key net/ipv4/route.c:783:46: got restricted __be32 [usertype] new_gw Fixes: 969447f226b4 ("ipv4: use new_gw for redirect neigh lookup") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20231119141759.420477-1-chentao@kylinos.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-20bpf, netkit: Add indirect call wrapper for fetching peer devDaniel Borkmann
ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of indirect call wrapper and therefore optimize the bpf_redirect_peer() internal handling a bit. Add a small skb_get_peer_dev() wrapper which utilizes the INDIRECT_CALL_1() macro instead of open coding. Future work could potentially add a peer pointer directly into struct net_device in future and convert veth and netkit over to use it so that eventually ndo_get_peer_dev can be removed. Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20bpf: Fix dev's rx stats for bpf_redirect_peer trafficPeilin Ye
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium) is not accounted for in the RX stats of supported devices (that is, veth and netkit), confusing user space metrics collectors such as cAdvisor [0], as reported by Youlun. Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the @tstats per-CPU counters (instead of @lstats, or @dstats). To make this more fool-proof, error out when ndo_get_peer_dev is set but @tstats are not selected. [0] Specifically, the "container_network_receive_{byte,packet}s_total" counters are affected. Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper") Reported-by: Youlun Zhang <zhangyoulun@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20net: Move {l,t,d}stats allocation to core and convert veth & vrfDaniel Borkmann
Move {l,t,d}stats allocation to the core and let netdevs pick the stats type they need. That way the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc) - all happening in the core. Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20ieee802154: Give the user the association listMiquel Raynal
Upon request, we must be able to provide to the user the list of associations currently in place. Let's add a new netlink command and attribute for this purpose. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-12-miquel.raynal@bootlin.com
2023-11-20mac802154: Handle disassociation notifications from peersMiquel Raynal
Peers may decided to disassociate from us, their coordinator, in this case they will send a disassociation notification which we must acknowledge. If we don't, the peer device considers itself disassociated anyway. We also need to drop the reference to this child from our internal structures. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-11-miquel.raynal@bootlin.com
2023-11-20mac802154: Follow the number of associated devicesMiquel Raynal
Track the count of associated devices. Limit the number of associations using the value provided by the user if any. If we reach the maximum number of associations, we tell the device we are at capacity. If the user do not want to accept any more associations, it may specify the value 0 to the maximum number of associations, which will lead to an access denied error status returned to the peers trying to associate. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-10-miquel.raynal@bootlin.com
2023-11-20ieee802154: Add support for limiting the number of associated devicesMiquel Raynal
Coordinators may refuse associations. We need a user input for that. Let's add a new netlink command which can provide a maximum number of devices we accept to associate with as a first step. Later, we could also forward the request to userspace and check whether the association should be accepted or not. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-9-miquel.raynal@bootlin.com
2023-11-20mac802154: Handle association requests from peersMiquel Raynal
Coordinators may have to handle association requests from peers which want to join the PAN. The logic involves: - Acknowledging the request (done by hardware) - If requested, a random short address that is free on this PAN should be chosen for the device. - Sending an association response with the short address allocated for the peer and expecting it to be ack'ed. If anything fails during this procedure, the peer is considered not associated. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-8-miquel.raynal@bootlin.com
2023-11-20mac802154: Handle disassociationsMiquel Raynal
Devices may decide to disassociate from their coordinator for different reasons (device turning off, coordinator signal strength too low, etc), the MAC layer just has to send a disassociation notification. If the ack of the disassociation notification is not received, the device may consider itself disassociated anyway. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-7-miquel.raynal@bootlin.com
2023-11-20ieee802154: Add support for user disassociation requestsMiquel Raynal
A device may decide at some point to disassociate from a PAN, let's introduce a netlink command for this purpose. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-6-miquel.raynal@bootlin.com
2023-11-20mac802154: Handle associatingMiquel Raynal
Joining a PAN officially goes by associating with a coordinator. This coordinator may have been discovered thanks to the beacons it sent in the past. Add support to the MAC layer for these associations, which require: - Sending an association request - Receiving an association response The association response contains the association status, eventually a reason if the association was unsuccessful, and finally a short address that we should use for intra-PAN communication from now on, if we required one (which is the default, and not yet configurable). Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-5-miquel.raynal@bootlin.com
2023-11-20ieee802154: Add support for user association requestsMiquel Raynal
Users may decide to associate with a peer, which becomes our parent coordinator. Let's add the necessary netlink support for this. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-4-miquel.raynal@bootlin.com
2023-11-20ieee802154: Internal PAN managementMiquel Raynal
Introduce structures to describe peer devices in a PAN as well as a few related helpers. We basically care about: - Our unique parent after associating with a coordinator. - Peer devices, children, which successfully associated with us. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-3-miquel.raynal@bootlin.com
2023-11-20ieee802154: Let PAN IDs be resetMiquel Raynal
Soon association and disassociation will be implemented, which will require to be able to either change the PAN ID from 0xFFFF to a real value when association succeeded, or to reset the PAN ID to 0xFFFF upon disassociation. Let's allow to do that manually for now. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-2-miquel.raynal@bootlin.com
2023-11-19net: fill in MODULE_DESCRIPTION()s for SOCK_DIAG modulesJakub Kicinski
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to all the sock diag modules in one fell swoop. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18rtnetlink: introduce nlmsg_new_large and use it in rtnl_getlinkLi RongQing
if a PF has 256 or more VFs, ip link command will allocate an order 3 memory or more, and maybe trigger OOM due to memory fragment, the VFs needed memory size is computed in rtnl_vfinfo_size. so introduce nlmsg_new_large which calls netlink_alloc_large_skb in which vmalloc is used for large memory, to avoid the failure of allocating memory ip invoked oom-killer: gfp_mask=0xc2cc0(GFP_KERNEL|__GFP_NOWARN|\ __GFP_COMP|__GFP_NOMEMALLOC), order=3, oom_score_adj=0 CPU: 74 PID: 204414 Comm: ip Kdump: loaded Tainted: P OE Call Trace: dump_stack+0x57/0x6a dump_header+0x4a/0x210 oom_kill_process+0xe4/0x140 out_of_memory+0x3e8/0x790 __alloc_pages_slowpath.constprop.116+0x953/0xc50 __alloc_pages_nodemask+0x2af/0x310 kmalloc_large_node+0x38/0xf0 __kmalloc_node_track_caller+0x417/0x4d0 __kmalloc_reserve.isra.61+0x2e/0x80 __alloc_skb+0x82/0x1c0 rtnl_getlink+0x24f/0x370 rtnetlink_rcv_msg+0x12c/0x350 netlink_rcv_skb+0x50/0x100 netlink_unicast+0x1b2/0x280 netlink_sendmsg+0x355/0x4a0 sock_sendmsg+0x5b/0x60 ____sys_sendmsg+0x1ea/0x250 ___sys_sendmsg+0x88/0xd0 __sys_sendmsg+0x5e/0xa0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f95a65a5b70 Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20231115120108.3711-1-lirongqing@baidu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-18net/sched: cls_u32: replace int refcounts with proper refcountsPedro Tammela
Proper refcounts will always warn splat when something goes wrong, be it underflow, saturation or object resurrection. As these are always a source of bugs, use it in cls_u32 as a safeguard to prevent/catch issues. Another benefit is that the refcount API self documents the code, making clear when transitions to dead are expected. For such an update we had to make minor adaptations on u32 to fit the refcount API. First we set explicitly to '1' when objects are created, then the objects are alive until a 1 -> 0 happens, which is then released appropriately. The above made clear some redundant operations in the u32 code around the root_ht handling that were removed. The root_ht is created with a refcnt set to 1. Then when it's associated with tcf_proto it increments the refcnt to 2. Throughout the entire code the root_ht is an exceptional case and can never be referenced, therefore the refcnt never incremented/decremented. Its lifetime is always bound to tcf_proto, meaning if you delete tcf_proto the root_ht is deleted as well. The code made up for the fact that root_ht refcnt is 2 and did a double decrement to free it, which is not a fit for the refcount API. Even though refcount_t is implemented using atomics, we should observe a negligible control plane impact. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20231114141856.974326-2-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-18net: partial revert of the "Make timestamping selectable: seriesJakub Kicinski
Revert following commits: commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask") commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer") commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp") commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers") commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers") commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer") commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC") commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp") commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable") commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command") They need more time for reviews. Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-18Merge tag 'batadv-next-pullrequest-20231115' of ↵David S. Miller
git://git.open-mesh.org/linux-merge This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Implement new multicast packet type, including its transmission, forwarding and parsing, by Linus Lüssing (3 patches) - Switch to new headers for sprintf and array size, by Sven Eckelmann (2 patches) Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18devlink: Add device lock assert in reload operationIdo Schimmel
Add an assert to verify that the device lock is always held throughout reload operations. Tested the following flows with netdevsim and mlxsw while lockdep is enabled: netdevsim: # echo "10 1" > /sys/bus/netdevsim/new_device # devlink dev reload netdevsim/netdevsim10 # ip netns add bla # devlink dev reload netdevsim/netdevsim10 netns bla # ip netns del bla # echo 10 > /sys/bus/netdevsim/del_device mlxsw: # devlink dev reload pci/0000:01:00.0 # ip netns add bla # devlink dev reload pci/0000:01:00.0 netns bla # ip netns del bla # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove # echo 1 > /sys/bus/pci/rescan Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18devlink: Acquire device lock during reload commandIdo Schimmel
Device drivers register with devlink from their probe routines (under the device lock) by acquiring the devlink instance lock and calling devl_register(). Drivers that support a devlink reload usually implement the reload_{down, up}() operations in a similar fashion to their remove and probe routines, respectively. However, while the remove and probe routines are invoked with the device lock held, the reload operations are only invoked with the devlink instance lock held. It is therefore impossible for drivers to acquire the device lock from their reload operations, as this would result in lock inversion. The motivating use case for invoking the reload operations with the device lock held is in mlxsw which needs to trigger a PCI reset as part of the reload. The driver cannot call pci_reset_function() as this function acquires the device lock. Instead, it needs to call __pci_reset_function_locked which expects the device lock to be held. To that end, adjust devlink to always acquire the device lock before the devlink instance lock when performing a reload. Do that when reload is explicitly triggered by user space by specifying the 'DEVLINK_NL_FLAG_NEED_DEV_LOCK' flag in the pre_doit and post_doit operations of the reload command. A previous patch already handled the case where reload is invoked as part of netns dismantle. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18devlink: Allow taking device lock in pre_doit operationsIdo Schimmel
Introduce a new private flag ('DEVLINK_NL_FLAG_NEED_DEV_LOCK') to allow netlink commands to specify that they need to acquire the device lock in their pre_doit operation and release it in their post_doit operation. The reload command will use this flag in the subsequent patch. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18devlink: Enable the use of private flags in post_doit operationsIdo Schimmel
Currently, private flags (e.g., 'DEVLINK_NL_FLAG_NEED_PORT') are only used in pre_doit operations, but a subsequent patch will need to conditionally lock and unlock the device lock in pre and post doit operations, respectively. As a preparation, enable the use of private flags in post_doit operations in a similar fashion to how it is done for pre_doit operations. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18devlink: Acquire device lock during netns dismantleIdo Schimmel
Device drivers register with devlink from their probe routines (under the device lock) by acquiring the devlink instance lock and calling devl_register(). Drivers that support a devlink reload usually implement the reload_{down, up}() operations in a similar fashion to their remove and probe routines, respectively. However, while the remove and probe routines are invoked with the device lock held, the reload operations are only invoked with the devlink instance lock held. It is therefore impossible for drivers to acquire the device lock from their reload operations, as this would result in lock inversion. The motivating use case for invoking the reload operations with the device lock held is in mlxsw which needs to trigger a PCI reset as part of the reload. The driver cannot call pci_reset_function() as this function acquires the device lock. Instead, it needs to call __pci_reset_function_locked which expects the device lock to be held. To that end, adjust devlink to always acquire the device lock before the devlink instance lock when performing a reload. For now, only do that when reload is triggered as part of netns dismantle. Subsequent patches will handle the case where reload is explicitly triggered by user space. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18devlink: Move private netlink flags to C fileIdo Schimmel
The flags are not used outside of the C file so move them there. Suggested-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net/ncsi: Add NC-SI 1.2 Get MC MAC Address commandPeter Delevoryas
This change adds support for the NC-SI 1.2 Get MC MAC Address command, specified here: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf It serves the exact same function as the existing OEM Get MAC Address commands, so if a channel reports that it supports NC-SI 1.2, we prefer to use the standard command rather than the OEM command. Verified with an invalid MAC address and 2 valid ones: [ 55.137072] ftgmac100 1e690000.ftgmac eth0: NCSI: Received 3 provisioned MAC addresses [ 55.137614] ftgmac100 1e690000.ftgmac eth0: NCSI: MAC address 0: 00:00:00:00:00:00 [ 55.138026] ftgmac100 1e690000.ftgmac eth0: NCSI: MAC address 1: fa:ce:b0:0c:20:22 [ 55.138528] ftgmac100 1e690000.ftgmac eth0: NCSI: MAC address 2: fa:ce:b0:0c:20:23 [ 55.139241] ftgmac100 1e690000.ftgmac eth0: NCSI: Unable to assign 00:00:00:00:00:00 to device [ 55.140098] ftgmac100 1e690000.ftgmac eth0: NCSI: Set MAC address to fa:ce:b0:0c:20:22 Signed-off-by: Peter Delevoryas <peter@pjd.dev> Signed-off-by: Patrick Williams <patrick@stwcx.xyz> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net/ncsi: Fix netlink major/minor version numbersPeter Delevoryas
The netlink interface for major and minor version numbers doesn't actually return the major and minor version numbers. It reports a u32 that contains the (major, minor, update, alpha1) components as the major version number, and then alpha2 as the minor version number. For whatever reason, the u32 byte order was reversed (ntohl): maybe it was assumed that the encoded value was a single big-endian u32, and alpha2 was the minor version. The correct way to get the supported NC-SI version from the network controller is to parse the Get Version ID response as described in 8.4.44 of the NC-SI spec[1]. Get Version ID Response Packet Format Bits +--------+--------+--------+--------+ Bytes | 31..24 | 23..16 | 15..8 | 7..0 | +-------+--------+--------+--------+--------+ | 0..15 | NC-SI Header | +-------+--------+--------+--------+--------+ | 16..19| Response code | Reason code | +-------+--------+--------+--------+--------+ |20..23 | Major | Minor | Update | Alpha1 | +-------+--------+--------+--------+--------+ |24..27 | reserved | Alpha2 | +-------+--------+--------+--------+--------+ | .... other stuff .... | The major, minor, and update fields are all binary-coded decimal (BCD) encoded [2]. The spec provides examples below the Get Version ID response format in section 8.4.44.1, but for practical purposes, this is an example from a live network card: root@bmc:~# ncsi-util 0x15 NC-SI Command Response: cmd: GET_VERSION_ID(0x15) Response: COMMAND_COMPLETED(0x0000) Reason: NO_ERROR(0x0000) Payload length = 40 20: 0xf1 0xf1 0xf0 0x00 <<<<<<<<< (major, minor, update, alpha1) 24: 0x00 0x00 0x00 0x00 <<<<<<<<< (_, _, _, alpha2) 28: 0x6d 0x6c 0x78 0x30 32: 0x2e 0x31 0x00 0x00 36: 0x00 0x00 0x00 0x00 40: 0x16 0x1d 0x07 0xd2 44: 0x10 0x1d 0x15 0xb3 48: 0x00 0x17 0x15 0xb3 52: 0x00 0x00 0x81 0x19 This should be parsed as "1.1.0". "f" in the upper-nibble means to ignore it, contributing zero. If both nibbles are "f", I think the whole field is supposed to be ignored. Major and minor are "required", meaning they're not supposed to be "ff", but the update field is "optional" so I think it can be ff. I think the simplest thing to do is just set the major and minor to zero instead of juggling some conditional logic or something. bcd2bin() from "include/linux/bcd.h" seems to assume both nibbles are 0-9, so I've provided a custom BCD decoding function. Alpha1 and alpha2 are ISO/IEC 8859-1 encoded, which just means ASCII characters as far as I can tell, although the full encoding table for non-alphabetic characters is slightly different (I think). I imagine the alpha fields are just supposed to be alphabetic characters, but I haven't seen any network cards actually report a non-zero value for either. If people wrote software against this netlink behavior, and were parsing the major and minor versions themselves from the u32, then this would definitely break their code. [1] https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.0.0.pdf [2] https://en.wikipedia.org/wiki/Binary-coded_decimal [2] https://en.wikipedia.org/wiki/ISO/IEC_8859-1 Signed-off-by: Peter Delevoryas <peter@pjd.dev> Fixes: 138635cc27c9 ("net/ncsi: NCSI response packet handler") Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net/ncsi: Simplify Kconfig/dts control flowPeter Delevoryas
Background: 1. CONFIG_NCSI_OEM_CMD_KEEP_PHY If this is enabled, we send an extra OEM Intel command in the probe sequence immediately after discovering a channel (e.g. after "Clear Initial State"). 2. CONFIG_NCSI_OEM_CMD_GET_MAC If this is enabled, we send one of 3 OEM "Get MAC Address" commands from Broadcom, Mellanox (Nvidida), and Intel in the *configuration* sequence for a channel. 3. mellanox,multi-host (or mlx,multi-host) Introduced by this patch: https://lore.kernel.org/all/20200108234341.2590674-1-vijaykhemka@fb.com/ Which was actually originally from cosmo.chou@quantatw.com: https://github.com/facebook/openbmc-linux/commit/9f132a10ec48db84613519258cd8a317fb9c8f1b Cosmo claimed that the Nvidia ConnectX-4 and ConnectX-6 NIC's don't respond to Get Version ID, et. al in the probe sequence unless you send the Set MC Affinity command first. Problem Statement: We've been using a combination of #ifdef code blocks and IS_ENABLED() conditions to conditionally send these OEM commands. It makes adding any new code around these commands hard to understand. Solution: In this patch, I just want to remove the conditionally compiled blocks of code, and always use IS_ENABLED(...) to do dynamic control flow. I don't think the small amount of code this adds to non-users of the OEM Kconfigs is a big deal. Signed-off-by: Peter Delevoryas <peter@pjd.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: ethtool: ts: Let the active time stamping layer be selectableKory Maincent
Now that the current timestamp is saved in a variable lets add the ETHTOOL_MSG_TS_SET ethtool netlink socket to make it selectable. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: ethtool: ts: Update GET_TS to reply the current selected timestampKory Maincent
As the default selected timestamp API change we have to change also the timestamp return by ethtool. This patch return now the current selected timestamp. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: Change the API of PHY default timestamp to MACKory Maincent
Change the API to select MAC default time stamping instead of the PHY. Indeed the PHY is closer to the wire therefore theoretically it has less delay than the MAC timestamping but the reality is different. Due to lower time stamping clock frequency, latency in the MDIO bus and no PHC hardware synchronization between different PHY, the PHY PTP is often less precise than the MAC. The exception is for PHY designed specially for PTP case but these devices are not very widespread. For not breaking the compatibility I introduce a default_timestamp flag in phy_device that is set by the phy driver to know we are using the old API behavior. The phy_set_timestamp function is called at each call of phy_attach_direct. In case of MAC driver using phylink this function is called when the interface is turned up. Then if the interface goes down and up again the last choice of timestamp will be overwritten by the default choice. A solution could be to cache the timestamp status but it can bring other issues. In case of SFP, if we change the module, it doesn't make sense to blindly re-set the timestamp back to PHY, if the new module has a PHY with mediocre timestamping capabilities. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: Replace hwtstamp_source by timestamping layerKory Maincent
Replace hwtstamp_source which is only used by the kernel_hwtstamp_config structure by the more widely use timestamp_layer structure. This is done to prepare the support of selectable timestamping source. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: ethtool: Add a command to list available time stamping layersKory Maincent
Introduce a new netlink message that lists all available time stamping layers on a given interface. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: ethtool: Add a command to expose current time stamping layerKory Maincent
Time stamping on network packets may happen either in the MAC or in the PHY, but not both. In preparation for making the choice selectable, expose both the current layers via ethtool. In accordance with the kernel implementation as it stands, the current layer will always read as "phy" when a PHY time stamping device is present. Future patches will allow changing the current layer administratively. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: Make dev_set_hwtstamp_phylib accessibleKory Maincent
Make the dev_set_hwtstamp_phylib function accessible in prevision to use it from ethtool to reset the tstamp current configuration. Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-18net: ethtool: Refactor identical get_ts_info implementations.Richard Cochran
The vlan, macvlan and the bonding drivers call their "real" device driver in order to report the time stamping capabilities. Provide a core ethtool helper function to avoid copy/paste in the stack. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-17rxrpc: Defer the response to a PING ACK until we've parsed itDavid Howells
Defer the generation of a PING RESPONSE ACK in response to a PING ACK until we've parsed the PING ACK so that we pick up any changes to the packet queue so that we can update ackinfo. This is also applied to an ACK generated in response to an ACK with the REQUEST_ACK flag set. Note that whilst the problem was added in commit 248f219cb8bc, it didn't really matter at that point because the ACK was proposed in softirq mode and generated asynchronously later in process context, taking the latest values at the time. But this fix is only needed since the move to parse incoming packets in an I/O thread rather than in softirq and generate the ACK at point of proposal (b0346843b1076b34a0278ff601f8f287535cb064). Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-17rxrpc: Fix RTT determination to use any ACK as a sourceDavid Howells
Fix RTT determination to be able to use any type of ACK as the response from which RTT can be calculated provided its ack.serial is non-zero and matches the serial number of an outgoing DATA or ACK packet. This shouldn't be limited to REQUESTED-type ACKs as these can have other types substituted for them for things like duplicate or out-of-order packets. Fixes: 4700c4d80b7b ("rxrpc: Fix loss of RTT samples due to interposed ACK") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-17tipc: Remove redundant call to TLV_SPACE()Shigeru Yoshida
The purpose of TLV_SPACE() is to add the TLV descriptor size to the size of the TLV value passed as argument and align the resulting size to TLV_ALIGNTO. tipc_tlv_alloc() calls TLV_SPACE() on its argument. In other words, tipc_tlv_alloc() takes its argument as the size of the TLV value. So the call to TLV_SPACE() in tipc_get_err_tlv() is redundant. Let's remove this redundancy. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-17rxrpc: Fix some minor issues with bundle tracingDavid Howells
Fix some superficial issues with the tracing of rxrpc_bundle structs, including: (1) Set the debug_id when the bundle is allocated rather than when it is set up so that the "NEW" trace line displays the correct bundle ID. (2) Show the refcount when emitting the "FREE" traceline. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-16tcp: no longer abort SYN_SENT when receiving some ICMPEric Dumazet
Currently, non fatal ICMP messages received on behalf of SYN_SENT sockets do call tcp_ld_RTO_revert() to implement RFC 6069, but immediately call tcp_done(), thus aborting the connect() attempt. This violates RFC 1122 following requirement: 4.2.3.9 ICMP Messages ... o Destination Unreachable -- codes 0, 1, 5 Since these Unreachable messages indicate soft error conditions, TCP MUST NOT abort the connection, and it SHOULD make the information available to the application. This patch makes sure non 'fatal' ICMP[v6] messages do not abort the connection attempt. It enables RFC 6069 for SYN_SENT sockets as a result. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Morley <morleyd@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-16tcp: use tp->total_rto to track number of linear timeouts in SYN_SENT stateEric Dumazet
In commit ccce324dabfe ("tcp: make the first N SYN RTO backoffs linear") David used icsk->icsk_backoff field to track the number of linear timeouts. Since then, tp->total_rto has been added. This commit uses tp->total_rto instead of icsk->icsk_backoff so that tcp_ld_RTO_revert() no longer can trigger an overflow in inet_csk_rto_backoff(). Other than the potential UBSAN report, there was no issue because receiving an ICMP message currently aborts the connect(). In the following patch, we want to adhere to RFC 6069 and RFC 1122 4.2.3.9. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Morley <morleyd@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-16Merge tag 'nf-23-11-15' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Remove unused variable causing compilation warning in nft_set_rbtree, from Yang Li. This unused variable is a left over from previous merge window. 2) Possible return of uninitialized in nf_conntrack_bridge, from Linkui Xiao. This is there since nf_conntrack_bridge is available. 3) Fix incorrect pointer math in nft_byteorder, from Dan Carpenter. Problem has been there since 2016. 4) Fix bogus error in destroy set element command. Problem is there since this new destroy command was added. 5) Fix race condition in ipset between swap and destroy commands and add/del/test control plane. This problem is there since ipset was merged. 6) Split async and sync catchall GC in two function to fix unsafe iteration over RCU. This is a fix-for-fix that was included in the previous pull request. * tag 'nf-23-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: split async and sync catchall in two functions netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test netfilter: nf_tables: bogus ENOENT when destroying element which does not exist netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval() netfilter: nf_conntrack_bridge: initialize err to 0 netfilter: nft_set_rbtree: Remove unused variable nft_net ==================== Link: https://lore.kernel.org/r/20231115184514.8965-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-16net: sched: do not offload flows with a helper in act_ctXin Long
There is no hardware supporting ct helper offload. However, prior to this patch, a flower filter with a helper in the ct action can be successfully set into the HW, for example (eth1 is a bnxt NIC): # tc qdisc add dev eth1 ingress_block 22 ingress # tc filter add block 22 proto ip flower skip_sw ip_proto tcp \ dst_port 21 ct_state -trk action ct helper ipv4-tcp-ftp # tc filter show dev eth1 ingress filter block 22 protocol ip pref 49152 flower chain 0 handle 0x1 eth_type ipv4 ip_proto tcp dst_port 21 ct_state -trk skip_sw in_hw in_hw_count 1 <---- action order 1: ct zone 0 helper ipv4-tcp-ftp pipe index 2 ref 1 bind 1 used_hw_stats delayed This might cause the flower filter not to work as expected in the HW. This patch avoids this problem by simply returning -EOPNOTSUPP in tcf_ct_offload_act_setup() to not allow to offload flows with a helper in act_ct. Fixes: a21b06e73191 ("net: sched: add helper support in act_ct") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/f8685ec7702c4a448a1371a8b34b43217b583b9d.1699898008.git.lucien.xin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>