summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-10ice: save RSS hash configuration for migrationJacob Keller
The VF can program the RSS hash configuration over virtchnl. It does this by sending a u64 bitmask which represents the current hash configuration. It is not trivial to reverse the hardware configuration back to this hash set for migration. Instead, save the value to the ice_vf structure when its modified by the VF. The rss_hashcfg value is an 8-byte field. Make room for it in ice_vf by re-arranging some of the existing fields. There is a 4-byte gap after the first_vector_idx, and a 4-byte gap between max_tx_rate and vf_states. Move first_vector_idx into the later 4-byte gap, creating an 8 byte area where rss_hashcfg can be placed. Also move the num_msix field near min_tx_rate, filling 2 bytes of a 3 byte hole. The end result of these changes enables placing the rss_hashcfg field into the structure while also saving 8 bytes in size. It looks like there are a handful of more possible cleanups to reduce the size even further, but those have been left as a future cleanup. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-07-10ice: add functions to get and set Tx queue contextJacob Keller
The live migration driver will need to save and restore the Tx queue context state from the hardware registers. This state contains both static fields which do not change during Tx traffic as well as dynamic fields which may change during Tx traffic. Unlike the Rx context, the Tx queue context is accessed indirectly from GLCOMM_QTX_CNTX_CTL and GLCOMM_QTX_CNTX_DATA registers. These registers are shared by multiple PFs on the same PCIe card. Multiple PFs cannot safely access the registers simultaneously, and there is no hardware semaphore or logic to control access. To handle this, introduce the txq_ctx_lock to the ice_adapter structure. This is similar to the ptp_gltsyn_time_lock. All PFs on the same adapter share this structure, and use it to serialize access to the registers to prevent error. Add a new functions to get and set the Tx queue context through the GLCOMM_QTX_CNTX_CTL interface. The hardware context values are stored in the registers using the same packed format as the Admin Queue buffer. The hardware buffer is 40 bytes wide, as it contains an additional 18 bytes of internal state not sent with the Admin Queue buffer. For this reason, a separate typedef and packing function must be used. We can share the same packed fields definitions because we never need to unpack the internal state. This is preferred, as it ensures the internal state is zero'd when writing into HW, and avoids issues with reading by u32 registers into a buffer of 22 bytes in length. Thanks to the typedefs, misuse of the API with the wrong size buffer can easily be caught at compile time. Note reading this data from hardware is essential because the current Tx queue context may be different from the context as initially programmed by the driver during VF initialization. When migrating a VF we must ensure the target VF has identical context as the source VF did. Co-developed-by: Yahui Cao <yahui.cao@intel.com> Signed-off-by: Yahui Cao <yahui.cao@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-07-10ice: add support for reading and unpacking Rx queue contextJacob Keller
In order to support live migration, the ice driver will need to read certain data from the Rx queue context. This is stored in the hardware in a packed format. Since we use <linux/packing.h> for the mapping between the packed hardware format and the unpacked structure, it is trivial to enable unpacking support via the unpack_fields() function. Add the ice_unpack_rxq_ctx() function based on the unpack_fields() API. Re-use the same field definitions from the packing implementation. Add ice_copy_rxq_ctx_from_hw() to copy the Rx queue context data from the hardware registers. Use these to implement ice_read_rxq_ctx() which will return the Rx queue context to the caller in its unpacked ice_rlan_ctx struct. This will enable the migration logic access to the relevant data about the Rx device queues. It can easily be copied to the target system as part of the migration payload, where it will be used to configure the Rx queues. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-07-10Merge branch 'net-dsa-rzn1_a5psw-add-compile_test'Paolo Abeni
Rosen Penev says: ==================== net: dsa: rzn1_a5psw: add COMPILE_TEST Allows the various bots to test compilation. Also threw in a small devm conversion for enabling clocks. ==================== Link: https://patch.msgid.link/20250708014144.2514-1-rosenp@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-10net: dsa: rzn1_a5psw: use devm to enable clocksRosen Penev
The remove function has these in the wrong order. The switch should be unregistered last. Simpler to use devm so that the right thing is done. Signed-off-by: Rosen Penev <rosenp@gmail.com> Link: https://patch.msgid.link/20250708014144.2514-3-rosenp@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-10net: dsa: rzn1_a5psw: add COMPILE_TESTRosen Penev
There's no architecture specific requirement for it to compile. Allows the bots to test compilation properly. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250708014144.2514-2-rosenp@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-10net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockoptJason Xing
This patch provides a setsockopt method to let applications leverage to adjust how many descs to be handled at most in one send syscall. It mitigates the situation where the default value (32) that is too small leads to higher frequency of triggering send syscall. Considering the prosperity/complexity the applications have, there is no absolutely ideal suggestion fitting all cases. So keep 32 as its default value like before. The patch does the following things: - Add XDP_MAX_TX_SKB_BUDGET socket option. - Set max_tx_budget to 32 by default in the initialization phase as a per-socket granular control. - Set the range of max_tx_budget as [32, xs->tx->nentries]. The idea behind this comes out of real workloads in production. We use a user-level stack with xsk support to accelerate sending packets and minimize triggering syscalls. When the packets are aggregated, it's not hard to hit the upper bound (namely, 32). The moment user-space stack fetches the -EAGAIN error number passed from sendto(), it will loop to try again until all the expected descs from tx ring are sent out to the driver. Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of sendto() and higher throughput/PPS. Here is what I did in production, along with some numbers as follows: For one application I saw lately, I suggested using 128 as max_tx_budget because I saw two limitations without changing any default configuration: 1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind this was I counted how many descs are transmitted to the driver at one time of sendto() based on [1] patch and then I calculated the possibility of hitting the upper bound. Finally I chose 128 as a suitable value because 1) it covers most of the cases, 2) a higher number would not bring evident results. After twisting the parameters, a stable improvement of around 4% for both PPS and throughput and less resources consumption were found to be observed by strace -c -p xxx: 1) %time was decreased by 7.8% 2) error counter was decreased from 18367 to 572 [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/ Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-09Merge branch 'net-mlx5-misc-changes-2025-07-09'Jakub Kicinski
Tariq Toukan says: ==================== net/mlx5: misc changes 2025-07-09 This series contains misc enhancements to the mlx5 driver. ==================== Link: https://patch.msgid.link/1752009387-13300-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: RX, Remove unnecessary RQT redirectsTariq Toukan
RQTs (Receive Queue Table) should redirect traffic to the channels' RQs when they're active. Otherwise, redirect to the designated "drop RQ". RQTs are created in "inactive" state, pointing to the "drop RQ". In activate and de-activate flows, do not "deactivate" the rest of RQTs (beyond the num of channels), as they are already inactive. This cuts down unnecessary execution of FW commands (MODIFY_RQT), and improves the latency of open/close channels or configuration change. Perf: NIC: Connect-X7. Configuration: 1 combined channel, max num channels 248. Measure time for "interface up + interface down". Before: 0.313 sec After: 0.057 sec (5.5x faster) 247 MODIFY_RQT commands saved in interface up. 247 MODIFY_RQT commands saved in interface down. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5: Warn when write combining is not supportedMaor Gottlieb
Warn if write combining is not supported, as it can impact latency. Add the warning message to be printed only when the driver actually run the test and detect unsupported state, rather than when inheriting parent's result for SFs. Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: Replace recursive VLAN push handling with an iterative loopGal Pressman
mlx5e_tc_act_vlan_add_push_action() uses tail-recursion to walk through a stack of VLAN devices. There is no need for a complicated recursion with unnecessary stack consumption and less obvious code flow, rewrite the function so that it uses a do while loop instead. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: CT: extract a memcmp from a spinlock sectionCosmin Ratiu
This reduces the time the lock is held and reduces contention. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: Remove unused VLAN insertion logic in TX pathCarolina Jubran
The VLAN insertion capability (`wqe_vlan_insert`) was never enabled on all mlx5 devices. When VLAN TX offload is advertised but this capability is not supported, the driver uses inline headers to insert the VLAN tag. To support this, the driver used to set the `MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE` bit to enforce L2 inline mode when `wqe_vlan_insert` was not supported. Since the capability is disabled on all devices, this logic was always active, and the SQ flag has become redundant. L2 inline is enforced unconditionally for VLAN-tagged packets. The `skb_vlan_tag_present()` check in the else-if block of `mlx5e_sq_xmit_wqe()` is never true by this point in the TX flow, as the VLAN tag has already been inserted by the driver using inline headers. As a result, this code is never executed. Remove the redundant SQ state, dead VLAN insertion code block, and related logic. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09vsock/test: fix test for null ptr deref when transport changesStefano Garzarella
In test_stream_transport_change_client(), the client sends CONTROL_CONTINUE on each iteration, even when connect() is unsuccessful. This causes a flood of control messages in the server that hangs around for more than 10 seconds after the test finishes, triggering several timeouts and causing subsequent tests to fail. This was discovered in testing a newly proposed test that failed in this way on the client side: ... 33 - SOCK_STREAM transport change null-ptr-deref...ok 34 - SOCK_STREAM ioctl(SIOCINQ) functionality...recv timed out The CONTROL_CONTINUE message is used only to tell to the server to call accept() to consume successful connections, so that subsequent connect() will not fail for finding the queue full. Send CONTROL_CONTINUE message only when the connect() has succeeded, or found the queue full. Note that the second connect() can also succeed if the first one was interrupted after sending the request. Fixes: 3a764d93385c ("vsock/test: Add test for null ptr deref when transport changes") Cc: leonardi@redhat.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708111701.129585-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'net-phy-bcm54811-phy-initialization'Jakub Kicinski
says: ==================== net: phy: bcm54811: PHY initialization Proper bcm54811 PHY driver initialization for MII-Lite. The bcm54811 PHY in MLP package must be setup for MII-Lite interface mode by software. Normally, the PHY to MAC interface is selected in hardware by setting the bootstrap pins of the PHY. However, MII and MII-Lite share the same hardware setup and must be distinguished by software, setting appropriate bit in a configuration register. The MII-Lite interface mode is non-standard one, defined by Broadcom for some of their PHYs. The MII-Lite lightness consist in omitting RXER, TXER, CRS and COL signals of the standard MII interface. Absence of COL them makes half-duplex links modes impossible but does not interfere with Broadcom's BroadR-Reach link modes, because they are full-duplex only. To do it in a clean way, MII-Lite must be introduced first, including its limitation to link modes (no half-duplex), because it is a prerequisite for the patch #3 of this series. The patch #4 does not depend on MII-Lite directly but both #3 and #4 are necessary for bcm54811 to work properly without additional configuration steps to be done - for example in the bootloader, before the kernel starts. PATCH 1 - Add MII-Lite PHY interface mode as defined by Broadcom for their two-wire PHYs. It can be used with most Ethernet controllers under certain limitations (no half-duplex link modes etc.). PATCH 2 - Add MII-Lite PHY interface type PATCH 3 - Activation of MII-Lite interface mode on Broadcom bcm5481x PHYs PATCH 4 - Initialize the BCM54811 PHY properly so that it conforms to the datasheet regarding a reserved bit in the LRE Control register, which must be written to zero after every device reset. Ignore the LDS capability bit in LRE Status register on bcm54811. ==================== Link: https://patch.msgid.link/20250708090140.61355-1-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: phy: bcm54811: PHY initializationKamil Horák - 2N
Reset the bit 12 in PHY's LRE Control register upon initialization. According to the datasheet, this bit must be written to zero after every device reset. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20250708090140.61355-5-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: phy: bcm5481x: MII-Lite activationKamil Horák - 2N
Broadcom PHYs featuring the BroadR-Reach two-wire link mode are usually capable to operate in simplified MII mode, without TXER, RXER, CRS and COL signals as defined for the MII. The absence of COL signal makes half-duplex link modes impossible, however, the BroadR-Reach modes are all full-duplex only. Depending on the IC encapsulation, there exist MII-Lite-only PHYs such as bcm54811 in MLP. The PHY itself is hardware-strapped to select among multiple RGMII and MII-Lite modes, but the MII-Lite mode must be also activated by software. Add MII-Lite activation for bcm5481x PHYs. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250708090140.61355-4-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dt-bindings: ethernet-phy: add MII-Lite phy interface typeKamil Horák - 2N
Some Broadcom PHYs are capable to operate in simplified MII mode, without TXER, RXER, CRS and COL signals as defined for the MII. The MII-Lite mode can be used on most Ethernet controllers with full MII interface by just leaving the input signals (RXER, CRS, COL) inactive. The absence of COL signal makes half-duplex link modes impossible but does not interfere with BroadR-Reach link modes on Broadcom PHYs, because they are all full-duplex only. Add new interface type "mii-lite" to phy-connection-type enum. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Acked-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20250708090140.61355-3-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: phy: MII-Lite PHY interface modeKamil Horák - 2N
Some Broadcom PHYs are capable to operate in simplified MII mode, without TXER, RXER, CRS and COL signals as defined for the MII. The MII-Lite mode can be used on most Ethernet controllers with full MII interface by just leaving the input signals (RXER, CRS, COL) inactive. The absence of COL signal makes half-duplex link modes impossible but does not interfere with BroadR-Reach link modes on Broadcom PHYs, because they are all full-duplex only. Add MII-Lite interface mode, especially for Broadcom two-wire PHYs. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250708090140.61355-2-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: usb: enable the work after stop usbnet by ip down/upZqiang
Oleksij reported that: The smsc95xx driver fails after one down/up cycle, like this: $ nmcli device set enu1u1 managed no $ p a a 10.10.10.1/24 dev enu1u1 $ ping -c 4 10.10.10.3 $ ip l s dev enu1u1 down $ ip l s dev enu1u1 up $ ping -c 4 10.10.10.3 The second ping does not reach the host. Networking also fails on other interfaces. Enable the work by replacing the disable_work_sync() with cancel_work_sync(). [Jun Miao: completely write the commit changelog] Fixes: 2c04d279e857 ("net: usb: Convert tasklet API to new bottom half workqueue mechanism") Reported-by: Oleksij Rempel <o.rempel@pengutronix.de> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Jun Miao <jun.miao@intel.com> Link: https://patch.msgid.link/20250708081653.307815-1-jun.miao@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'vsock-introduce-siocinq-ioctl-support'Jakub Kicinski
Xuewei Niu says: ==================== vsock: Introduce SIOCINQ ioctl support Introduce SIOCINQ ioctl support for vsock, indicating the length of unread bytes. Similar with SIOCOUTQ ioctl, the information is transport-dependent. The first patch adds SIOCINQ ioctl support in AF_VSOCK. Thanks to @dexuan, the second patch is to fix the issue where hyper-v `hvs_stream_has_data()` doesn't return the readable bytes. The third patch wraps the ioctl into `ioctl_int()`, which implements a retry mechanism to prevent immediate failure. The last one adds two test cases to check the functionality. The changes have been tested, and the results are as expected. ==================== Link: https://patch.msgid.link/20250708-siocinq-v6-0-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09test/vsock: Add ioctl SIOCINQ testsXuewei Niu
Add SIOCINQ ioctl tests for both SOCK_STREAM and SOCK_SEQPACKET. The client waits for the server to send data, and checks if the SIOCINQ ioctl value matches the data size. After consuming the data, the client checks if the SIOCINQ value is 0. Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708-siocinq-v6-4-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09test/vsock: Add retry mechanism to ioctl wrapperXuewei Niu
Wrap the ioctl in `ioctl_int()`, which takes a pointer to the actual int value and an expected int value. The function will not return until either the ioctl returns the expected value or a timeout occurs, thus avoiding immediate failure. Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708-siocinq-v6-3-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09vsock: Add support for SIOCINQ ioctlXuewei Niu
Add support for SIOCINQ ioctl, indicating the length of bytes unread in the socket. The value is obtained from `vsock_stream_has_data()`. Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708-siocinq-v6-2-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09hv_sock: Return the readable bytes in hvs_stream_has_data()Dexuan Cui
When hv_sock was originally added, __vsock_stream_recvmsg() and vsock_stream_has_data() actually only needed to know whether there is any readable data or not, so hvs_stream_has_data() was written to return 1 or 0 for simplicity. However, now hvs_stream_has_data() should return the readable bytes because vsock_data_ready() -> vsock_stream_has_data() needs to know the actual bytes rather than a boolean value of 1 or 0. The SIOCINQ ioctl support also needs hvs_stream_has_data() to return the readable bytes. Let hvs_stream_has_data() return the readable bytes of the payload in the next host-to-guest VMBus hv_sock packet. Note: there may be multiple incoming hv_sock packets pending in the VMBus channel's ringbuffer, but so far there is not a VMBus API that allows us to know all the readable bytes in total without reading and caching the payload of the multiple packets, so let's just return the readable bytes of the next single packet. In the future, we'll either add a VMBus API that allows us to know the total readable bytes without touching the data in the ringbuffer, or the hv_sock driver needs to understand the VMBus packet format and parse the packets directly. Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://patch.msgid.link/20250708-siocinq-v6-1-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Documentation: xsk: correct the obsolete references and examplesJason Xing
The modified lines are mainly related to the following commits[1][2] which remove those tests and examples. Since samples/bpf has been deprecated, we can refer to more examples that are easily searched in the various xdp-projects, like the following link: https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example [1] commit f36600634282 ("libbpf: move xsk.{c,h} into selftests/bpf") [2] commit cfb5a2dbf141 ("bpf, samples: Remove AF_XDP samples") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250708062907.11557-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09skbuff: Add MSG_MORE flag to optimize tcp large packet transmissionFeng Yang
When using sockmap for forwarding, the average latency for different packet sizes after sending 10,000 packets is as follows: size old(us) new(us) 512 56 55 1472 58 58 1600 106 81 3000 145 105 5000 182 125 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'converge-on-using-secs_to_jiffies-part-two'Jakub Kicinski
Easwar Hariharan says: ==================== Converge on using secs_to_jiffies() part two This is the second series (part 1*) that converts users of msecs_to_jiffies() that either use the multiply pattern of either of: - msecs_to_jiffies(N*1000) or - msecs_to_jiffies(N*MSEC_PER_SEC) where N is a constant or an expression, to avoid the multiplication. The conversion is made with Coccinelle with the secs_to_jiffies() script in scripts/coccinelle/misc. Attention is paid to what the best change can be rather than restricting to what the tool provides. v1: https://lore.kernel.org/20250219-netdev-secs-to-jiffies-part-2-v1-0-c484cc63611b@linux.microsoft.com ==================== Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-0-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: ipconfig: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) While here, manually convert a couple timeouts denominated in seconds Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-2-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/smc: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-1-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09gve: make IRQ handlers and page allocation NUMA awareBailey Forrest
All memory in GVE is currently allocated without regard for the NUMA node of the device. Because access to NUMA-local memory access is significantly cheaper than access to a remote node, this change attempts to ensure that page frags used in the RX path, including page pool frags, are allocated on the NUMA node local to the gVNIC device. Note that this attempt is best-effort. If necessary, the driver will still allocate non-local memory, as __GFP_THISNODE is not passed. Descriptor ring allocations are not updated, as dma_alloc_coherent handles that. This change also modifies the IRQ affinity setting to only select CPUs from the node local to the device, preserving the behavior that TX and RX queues of the same index share CPU affinity. Signed-off-by: Bailey Forrest <bcf@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250707210107.2742029-1-jeroendb@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'add-microchip-zl3073x-support-part-1'Jakub Kicinski
Ivan Vecera says: ==================== Add Microchip ZL3073x support (part 1) Add support for Microchip Azurite DPLL/PTP/SyncE chip family that provides DPLL and PTP functionality. This series bring first part that adds the core functionality and basic DPLL support. The next part of the series will bring additional DPLL functionality like eSync support, phase offset and frequency offset reporting and phase adjustments. Testing was done by myself and by Prathosh Satish on Microchip EDS2 development board with ZL30732 DPLL chip connected over I2C bus. ==================== Link: https://patch.msgid.link/20250704182202.1641943-1-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dpll: zl3073x: Add support to get/set frequency on pinsIvan Vecera
Add support to get/set frequency on pins. The frequency for input pins (references) is computed in the device according this formula: freq = base_freq * multiplier * (nominator / denominator) where the base_freq comes from the list of supported base frequencies and other parameters are arbitrary numbers. All these parameters are 16-bit unsigned integers. The frequency for output pin is determined by the frequency of synthesizer the output pin is connected to and divisor of the output to which is the given pin belongs. The resulting frequency of the P-pin and the N-pin from this output pair depends on the signal format of this output pair. The device supports so-called N-divided signal formats where for the N-pin there is an additional divisor. The frequencies for both pins from such output pair are computed: P-pin-freq = synth_freq / output_div N-pin-freq = synth_freq / output_div / n_div For other signal-format types both P and N pin have the same frequency based only synth frequency and output divisor. Implement output pin callbacks to get and set frequency. The frequency setting for the output non-N-divided signal format is simple as we have to compute just new output divisor. For N-divided formats it is more complex because by changing of output divisor we change frequency for both P and N pins. In this case if we are changing frequency for P-pin we have to compute also new N-divisor for N-pin to keep its current frequency. From this and the above it follows that the frequency of the N-pin cannot be higher than the frequency of the P-pin and the callback must take this limitation into account. Co-developed-by: Prathosh Satish <Prathosh.Satish@microchip.com> Signed-off-by: Prathosh Satish <Prathosh.Satish@microchip.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-13-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dpll: zl3073x: Implement input pin state setting in automatic modeIvan Vecera
Implement input pin state setting when the DPLL is running in automatic mode. Unlike manual mode, the DPLL mode switching is not used here and the implementation uses special priority value (15) to make the given pin non-selectable. When the user sets state of the pin as disconnected the driver internally sets its priority in HW to 15 that prevents the DPLL to choose this input pin. Conversely, if the pin status is set to selectable, the driver sets the pin priority in HW to the original saved value. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-12-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dpll: zl3073x: Add support to get/set priority on input pinsIvan Vecera
Add support for getting and setting input pin priority. Implement required callbacks and set appropriate capability for input pins. Although the pin priority make sense only if the DPLL is running in automatic mode we have to expose this capability unconditionally because input pins (references) are shared between all DPLLs where one of them can run in automatic mode while the other one not. Co-developed-by: Prathosh Satish <Prathosh.Satish@microchip.com> Signed-off-by: Prathosh Satish <Prathosh.Satish@microchip.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-11-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dpll: zl3073x: Implement input pin selection in manual modeIvan Vecera
Implement input pin state setting if the DPLL is running in manual mode. The driver indicates manual mode if the DPLL mode is one of ref-lock, forced-holdover, freerun. Use these modes to implement input pin state change between connected and disconnected states. When the user set the particular pin as connected the driver marks this input pin as forced reference and switches the DPLL mode to ref-lock. When the use set the pin as disconnected the driver switches the DPLL to freerun or forced holdover mode. The switch to holdover mode is done if the DPLL has holdover capability (e.g is currently locked with holdover acquired). Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-10-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dpll: zl3073x: Register DPLL devices and pinsIvan Vecera
Enumerate all available DPLL channels and registers a DPLL device for each of them. Check all input references and outputs and register DPLL pins for them. Number of registered DPLL pins depends on configuration of references and outputs. If the reference or output is configured as differential one then only one DPLL pin is registered. Both references and outputs can be also disabled from firmware configuration and in this case no DPLL pins are registered. All registrable references are registered to all available DPLL devices with exception of DPLLs that are configured in NCO (numerically controlled oscillator) mode. In this mode DPLL channel acts as PHC and cannot be locked to any reference. Device outputs are connected to one of synthesizers and each synthesizer is driven by some DPLL channel. So output pins belonging to given output are registered to DPLL device that drives associated synthesizer. Finally add kworker task to monitor async changes on all DPLL channels and input pins and to notify about them DPLL core. Output pins are not monitored as their parameters are not changed asynchronously by the device. Co-developed-by: Prathosh Satish <Prathosh.Satish@microchip.com> Signed-off-by: Prathosh Satish <Prathosh.Satish@microchip.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-9-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dpll: zl3073x: Read DPLL types and pin properties from system firmwareIvan Vecera
Add support for reading of DPLL types and optional pin properties from the system firmware (DT, ACPI...). The DPLL types are stored in property 'dpll-types' as string array and possible values 'pps' and 'eec' are mapped to DPLL enums DPLL_TYPE_PPS and DPLL_TYPE_EEC. The pin properties are stored under 'input-pins' and 'output-pins' sub-nodes and the following ones are supported: * reg integer that specifies pin index * label string that is used by driver as board label * connection-type string that indicates pin connection type * supported-frequencies-hz array of u64 values what frequencies are supported / allowed for given pin with respect to hardware wiring Do not blindly trust system firmware and filter out frequencies that cannot be configured/represented in device (input frequencies have to be factorized by one of the base frequencies and output frequencies have to divide configured synthesizer frequency). Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-8-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dpll: zl3073x: Fetch invariants during probeIvan Vecera
Several configuration parameters will remain constant at runtime, so we can load them during probe to avoid excessive reads from the hardware. Read the following parameters from the device during probe and store them for later use: * enablement status and frequencies of the synthesizers and their associated DPLL channels * enablement status and type (single-ended or differential) of input pins * associated synthesizers, signal format, and enablement status of outputs Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-7-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dpll: Add basic Microchip ZL3073x supportIvan Vecera
Microchip Azurite ZL3073x represents chip family providing DPLL and optionally PHC (PTP) functionality. The chips can be connected be connected over I2C or SPI bus. They have the following characteristics: * up to 5 separate DPLL units (channels) * 5 synthesizers * 10 input pins (references) * 10 outputs * 20 output pins (output pin pair shares one output) * Each reference and output can operate in either differential or single-ended mode (differential mode uses 2 pins) * Each output is connected to one of the synthesizers * Each synthesizer is driven by one of the DPLL unit The device uses 7-bit addresses and 8-bits values. It exposes 8-, 16-, 32- and 48-bits registers in address range <0x000,0x77F>. Due to 7bit addressing, the range is organized into pages of 128 bytes, with each page containing a page selector register at address 0x7F. For reading/writing multi-byte registers, the device supports bulk transfers. Add basic functionality to access device registers, probe functionality both I2C and SPI cases and add devlink support to provide info and to set clock ID parameter. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-6-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09devlink: Add new "clock_id" generic device paramIvan Vecera
Add a new device generic parameter to specify clock ID that should be used by the device for registering DPLL devices and pins. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-5-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09devlink: Add support for u64 parametersIvan Vecera
Only 8, 16 and 32-bit integers are supported for numeric devlink parameters. The subsequent patch adds support for DPLL clock ID that is defined as 64-bit number. Add support for u64 parameter type. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-4-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dt-bindings: dpll: Add support for Microchip Azurite chip familyIvan Vecera
Add DT bindings for Microchip Azurite DPLL chip family. These chips provide up to 5 independent DPLL channels, 10 differential or single-ended inputs and 10 differential or 20 single-ended outputs. They can be connected via I2C or SPI busses. Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-3-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09dt-bindings: dpll: Add DPLL device and pinIvan Vecera
Add a common DT schema for DPLL device and its associated pins. The DPLL (device phase-locked loop) is a device used for precise clock synchronization in networking and telecom hardware. The device includes one or more DPLLs (channels) and one or more physical input/output pins. Each DPLL channel is used either to provide a pulse-per-clock signal or to drive an Ethernet equipment clock. The input and output pins have the following properties: * label: specifies board label * connection type: specifies its usage depending on wiring * list of supported or allowed frequencies: depending on how the pin is connected and where) * embedded sync capability: indicates whether the pin supports this Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-2-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09virtio-net: xsk: rx: move the xdp->data adjustment to buf_to_xdp()Bui Quang Minh
This commit does not do any functional changes. It moves xdp->data adjustment for buffer other than first buffer to buf_to_xdp() helper so that the xdp_buff adjustment does not scatter over different functions. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250705075515.34260-1-minhquangbui99@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09Merge branch 'add-vf-drivers-for-wangxun-virtual-functions'Jakub Kicinski
Mengyuan Lou says: ==================== Add vf drivers for wangxun virtual functions Introduces basic support for Wangxun’s virtual function (VF) network drivers, specifically txgbevf and ngbevf. These drivers provide SR-IOV VF functionality for Wangxun 10/25/40G network devices. The first three patches add common APIs for Wangxun VF drivers, including mailbox communication and shared initialization logic.These abstractions are placed in libwx to reduce duplication across VF drivers. Patches 4–8 introduce the txgbevf driver, including: PCI device initialization, Hardware reset, Interrupt setup, Rx/Tx datapath implementation and link status changeing flow. Patches 9–12 implement the ngbevf driver, mirroring the functionality added in txgbevf. v2: https://lore.kernel.org/20250625102058.19898-1-mengyuanlou@net-swift.com v1: https://lore.kernel.org/20250611083559.14175-1-mengyuanlou@net-swift.com ==================== Link: https://patch.msgid.link/20250704094923.652-1-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: ngbevf: add link update flowMengyuan Lou
Add link update flow to wangxun 1G virtual functions. Get link status from pf in mbox, and if it is failed then check the vx_status, because vx_status switching is too slow. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/20250704094923.652-13-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: ngbevf: init interrupts and request irqsMengyuan Lou
Add specific parameters for irq alloc, then use wx_init_interrupt_scheme to initialize interrupt allocation in probe. Add .ndo_start_xmit support and start all queues. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/20250704094923.652-12-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: ngbevf: add sw init pci info and reset hardwareMengyuan Lou
Do sw init and reset hw for ngbevf virtual functions, then register netdev. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/20250704094923.652-11-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: wangxun: add ngbevf buildMengyuan Lou
Add doc build infrastructure for ngbevf driver. Implement the basic PCI driver loading and unloading interface. Initialize the id_table which support 1G virtual functions for Wangxun. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/20250704094923.652-10-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>