summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-03net: ip-sysctl: Add link to SCTP IPv4 scoping draftBagas Sanjaya
addr_scope_policy description contains pointer to SCTP IPv4 scoping draft but not its IETF Datatracker link. Add it. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-6-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03net: ip-sysctl: Format SCTP-related memory parameters description as bullet listBagas Sanjaya
The description for vector elements of SCTP-related memory usage parameters (sctp{r,w,}mem) is formatted as normal paragraphs rather than bullet list. Convert the description to the latter. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-5-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03net: ip-sysctl: Format pf_{enable,expose} boolean lists as bullet listsBagas Sanjaya
These lists' items were separated by newlines but without bullet list marker. Turn the lists into proper bullet list. While at it, also reword values description for pf_expose to not repeat mentioning SCTP_PEER_ADDR_CHANGE and SCTP_GET_PEER_ADDR_INFO sockopt. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-4-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03net: ip-sysctl: Format possible value range of ioam6_id{,_wide} as bullet listBagas Sanjaya
Format possible value range bounds of ioam6_id and ioam6_id_wide as bullet list instead of running paragraph. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-3-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03net: ip-sysctl: Format Private VLAN proxy arp aliases as bullet listBagas Sanjaya
Alias names list for private VLAN proxy arp technology is formatted as indented paragraph instead. Make it bullet list as it is better fit for this purpose. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701031300.19088-2-bagasdotme@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03drm/xe: Do not wedge device on killed exec queuesMatthew Brost
When a user closes an exec queue or interrupts an app with Ctrl-C, this does not warrant wedging the device in mode 2. Avoid this by skipping the wedge check for killed exec queues in the TDR and LR exec queue cleanup worker. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250624174103.2707941-1-matthew.brost@intel.com (cherry picked from commit 5a2f117a80c207372513ca8964eeb178874f4990) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-03drm/xe: Extend WA 14018094691 to BMGDaniele Ceraolo Spurio
This WA is applicable to BMG as well. Note that this is a GSC WA and we don't load the GSC on BMG, so extending the WA to BMG won't do anything right now. However, it helps future-proof the driver so that if we ever turn the GSC on we won't have to remember to extend this WA. v2: don't use VERSION_RANGE from 2001 to 2004 (Matt) Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://lore.kernel.org/r/20250613231128.1261815-2-daniele.ceraolospurio@intel.com (cherry picked from commit 1a5ce0c5b95b0624ebd44f574b98003a466973be) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-07-03Merge branch ↵Paolo Abeni
'ptp-provide-support-for-auxiliary-clocks-for-ptp_sys_offset_extended' Thomas Gleixner says: ==================== ptp: Provide support for auxiliary clocks for PTP_SYS_OFFSET_EXTENDED This is a follow up to the V1 series, which can be found here: https://lore.kernel.org/all/20250626124327.667087805@linutronix.de to address the merge logistics problem, which I created myself. Changes vs. V1: - Make patch 1, which provides the timestamping function temporarily define CLOCK_AUX* if undefined so that it can be merged independently, - Add a missing check for CONFIG_POSIX_AUX_CLOCK in the PTP IOCTL - Picked up tags Merge logistics if agreed on: 1) Patch #1 is applied to the tip tree on top of plain v6.16-rc1 and tagged 2) That tag is merged into tip:timers/ptp and the temporary CLOCK_AUX define is removed in a subsequent commit 3) Network folks merge the tag and apply patches #2 + #3 So the only fallout from this are the extra merges in both trees and the cleanup commit in the tip tree. But that way there are no dependencies and no duplicate commits with different SHAs. Thoughts? Due to the above constraints there is no branch offered to pull from right now. Sorry for the inconveniance. Should have thought about that earlier. ==================== Link: https://patch.msgid.link/20250701130923.579834908@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03ptp: Enable auxiliary clocks for PTP_SYS_OFFSET_EXTENDEDThomas Gleixner
Allow ioctl(PTP_SYS_OFFSET_EXTENDED*) to select CLOCK_AUX clock ids for generating the pre and post hardware readout timestamps. Aside of adding these clocks to the clock ID validation, this also requires to check the timestamp to be valid, i.e. the seconds value being greater than or equal zero. This is necessary because AUX clocks can be asynchronously enabled or disabled, so there is no way to validate the availability upfront. The same could have been achieved by handing the return value of ktime_get_aux_ts64() all the way down to the IOCTL call site, but that'd require to modify all existing ptp::gettimex64() callbacks and their inner call chains. The timestamp check achieves the same with less churn and less complicated code all over the place. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20250701132628.491315452@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03ptp: Use ktime_get_clock_ts64() for timestampingThomas Gleixner
The inlined ptp_read_system_[pre|post]ts() switch cases expand to a copious amount of text in drivers, e.g. ~500 bytes in e1000e. Adding auxiliary clock support to the inlines would increase it further. Replace the inline switch case with a call to ktime_get_clock_ts64(), which reduces the code size in drivers and allows to access auxiliary clocks once they are enabled in the IOCTL parameter filter. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20250701132628.426168092@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03Merge tag 'ktime-get-clock-ts64-for-ptp' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Base implementation for PTP with a temporary CLOCK_AUX* workaround to allow integration of depending changes into the networking tree. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03bonding: don't force LACPDU tx to ~333 ms boundariesSeth Forshee (DigitalOcean)
The timer which ensures that no more than 3 LACPDUs are transmitted in a second rearms itself every 333ms regardless of whether an LACPDU is transmitted when the timer expires. This causes LACPDU tx to be delayed until the next expiration of the timer, which effectively aligns LACPDUs to ~333ms boundaries. This results in a variable amount of jitter in the timing of periodic LACPDUs. Change this to only rearm the timer when an LACPDU is actually sent, allowing tx at any point after the timer has expired. Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Reviewed-by: Carlos Bilbao <carlos.bilbao@kernel.org> Link: https://patch.msgid.link/20250625-fix-lacpdu-jitter-v1-1-4d0ee627e1ba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03timekeeping: Provide ktime_get_clock_ts64()Thomas Gleixner
PTP implements an inline switch case for taking timestamps from various POSIX clock IDs, which already consumes quite some text space. Expanding it for auxiliary clocks really becomes too big for inlining. Provide a out of line version. The function invalidates the timestamp in case the clock is invalid. The invalidation allows to implement a validation check without the need to propagate a return value through deep existing call chains. Due to merge logistics this temporarily defines CLOCK_AUX[_LAST] if undefined, so that the plain branch, which does not contain any of the core timekeeper changes, can be pulled into the networking tree as prerequisite for the PTP side changes. These temporary defines are removed after that branch is merged into the tip::timers/ptp branch. That way the result in -next or upstream in the next merge window has zero dependencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250701132628.357686408@linutronix.de
2025-07-03Merge tag 'ffa-fixes-6.16' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes Arm FF-A fixes for v6.16 Couple of fixes to address: 1. The safety and memory issues in the FF-A notification callback handler: The fixes replaces a mutex with an rwlock to prevent sleeping in atomic context, resolving kernel warnings. Memory allocation is moved outside the lock to support this transition safely. Additionally, a memory leak in the notifier unregistration path is fixed by properly freeing the callback node. 2. The missing entry in struct ffa_indirect_msg_hdr: The fix adds the missing 32 bit reserved entry in the structure as required by the FF-A specification. * tag 'ffa-fixes-6.16' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux: firmware: arm_ffa: Fix the missing entry in struct ffa_indirect_msg_hdr firmware: arm_ffa: Replace mutex with rwlock to avoid sleep in atomic context firmware: arm_ffa: Move memory allocation outside the mutex locking firmware: arm_ffa: Fix memory leak by freeing notifier callback node Link: https://lore.kernel.org/r/20250609105207.1185570-1-sudeep.holla@arm.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-07-03netfilter: conntrack: remove DCCP protocol supportPablo Neira Ayuso
The DCCP socket family has now been removed from this tree, see: 8bb3212be4b4 ("Merge branch 'net-retire-dccp-socket'") Remove connection tracking and NAT support for this protocol, this should not pose a problem because no DCCP traffic is expected to be seen on the wire. As for the code for matching on dccp header for iptables and nftables, mark it as deprecated and keep it in place. Ruleset restoration is an atomic operation. Without dccp matching support, an astray match on dccp could break this operation leaving your computer with no policy in place, so let's follow a more conservative approach for matches. Add CONFIG_NFT_EXTHDR_DCCP which is set to 'n' by default to deprecate dccp extension support. Similarly, label CONFIG_NETFILTER_XT_MATCH_DCCP as deprecated too and also set it to 'n' by default. Code to match on DCCP protocol from ebtables also remains in place, this is just a few checks on IPPROTO_DCCP from _check() path which is exercised when ruleset is loaded. There is another use of IPPROTO_DCCP from the _check() path in the iptables multiport match. Another check for IPPROTO_DCCP from the packet in the reject target is also removed. So let's schedule removal of the dccp matching for a second stage, this should not interfer with the dccp retirement since this is only matching on the dccp header. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-03regulator: gpio: Fix the out-of-bounds access to drvdata::gpiodsManivannan Sadhasivam
drvdata::gpiods is supposed to hold an array of 'gpio_desc' pointers. But the memory is allocated for only one pointer. This will lead to out-of-bounds access later in the code if 'config::ngpios' is > 1. So fix the code to allocate enough memory to hold 'config::ngpios' of GPIO descriptors. While at it, also move the check for memory allocation failure to be below the allocation to make it more readable. Cc: stable@vger.kernel.org # 5.0 Fixes: d6cd33ad7102 ("regulator: gpio: Convert to use descriptors") Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Link: https://patch.msgid.link/20250703103549.16558-1-mani@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2025-07-03ASoC: cs35l56: probe() should fail if the device ID is not recognizedRichard Fitzgerald
Return an error from driver probe if the DEVID read from the chip is not one supported by this driver. In cs35l56_hw_init() there is a check for valid DEVID, but the invalid case was returning the value of ret. At this point in the code ret == 0 so the caller would think that cs35l56_hw_init() was successful. Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Fixes: 84851aa055c8 ("ASoC: cs35l56: Move part of cs35l56_init() to shared library") Link: https://patch.msgid.link/20250703102521.54204-1-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-07-03Revert "ACPI: battery: negate current when discharging"Rafael J. Wysocki
Revert commit 234f71555019 ("ACPI: battery: negate current when discharging") breaks not one but several userspace implementations of battery monitoring: Steam and MangoHud. Perhaps it breaks more, but those are the two that have been tested. Reported-by: Matthew Schwartz <matthew.schwartz@linux.dev> Closes: https://lore.kernel.org/linux-acpi/87C1B2AF-D430-4568-B620-14B941A8ABA4@linux.dev/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-07-03vsock/vmci: Clear the vmci transport packet properly when initializing itHarshaVardhana S A
In vmci_transport_packet_init memset the vmci_transport_packet before populating the fields to avoid any uninitialised data being left in the structure. Cc: Bryan Tan <bryan-bt.tan@broadcom.com> Cc: Vishnu Dasa <vishnu.dasa@broadcom.com> Cc: Broadcom internal kernel review list Cc: Stefano Garzarella <sgarzare@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: virtualization@lists.linux.dev Cc: netdev@vger.kernel.org Cc: stable <stable@kernel.org> Signed-off-by: HarshaVardhana S A <harshavardhana.sa@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250701122254.2397440-1-gregkh@linuxfoundation.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03dt-bindings: net: sophgo,sg2044-dwmac: Drop status from the exampleKrzysztof Kozlowski
Examples should be complete and should not have a 'status' property, especially a disabled one because this disables the dt_binding_check of the example against the schema. Dropping 'status' property shows missing other properties - phy-mode and phy-handle. Fixes: 114508a89ddc ("dt-bindings: net: Add support for Sophgo SG2044 dwmac") Cc: <stable@vger.kernel.org> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Reviewed-by: Chen Wang <unicorn_wang@outlook.com> Link: https://patch.msgid.link/20250701063621.23808-2-krzysztof.kozlowski@linaro.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03Merge branch 'fix-irq-vectors'Paolo Abeni
Jiawen Wu says: ==================== Fix IRQ vectors The interrupt vector order was adjusted by [1]commit 937d46ecc5f9 ("net: wangxun: add ethtool_ops for channel number") in Linux-6.8. Because at that time, the MISC interrupt acts as the parent interrupt in the GPIO IRQ chip. When the number of Rx/Tx ring changes, the last MISC interrupt must be reallocated. Then the GPIO interrupt controller would be corrupted. So the initial plan was to adjust the sequence of the interrupt vectors, let MISC interrupt to be the first one and do not free it. Later, irq_domain was introduced in [2]commit aefd013624a1 ("net: txgbe: use irq_domain for interrupt controller") to avoid this problem. However, the vector sequence adjustment was not reverted. So there is still one problem that has been left unresolved. Due to hardware limitations of NGBE, queue IRQs can only be requested on vector 0 to 7. When the number of queues is set to the maximum 8, the PCI IRQ vectors are allocated from 0 to 8. The vector 0 is used by MISC interrupt, and althrough the vector 8 is used by queue interrupt, it is unable to receive packets. This will cause some packets to be dropped when RSS is enabled and they are assigned to queue 8. This patch set fix the above problems. [1] https://git.kernel.org/netdev/net-next/c/937d46ecc5f9 [2] https://git.kernel.org/netdev/net-next/c/aefd013624a1 ==================== Link: https://patch.msgid.link/20250701063030.59340-1-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03net: ngbe: specify IRQ vector when the number of VFs is 7Jiawen Wu
For NGBE devices, the queue number is limited to be 1 when SRIOV is enabled. In this case, IRQ vector[0] is used for MISC and vector[1] is used for queue, based on the previous patches. But for the hardware design, the IRQ vector[1] must be allocated for use by the VF[6] when the number of VFs is 7. So the IRQ vector[0] should be shared for PF MISC and QUEUE interrupts. +-----------+----------------------+ | Vector | Assigned To | +-----------+----------------------+ | Vector 0 | PF MISC and QUEUE | | Vector 1 | VF 6 | | Vector 2 | VF 5 | | Vector 3 | VF 4 | | Vector 4 | VF 3 | | Vector 5 | VF 2 | | Vector 6 | VF 1 | | Vector 7 | VF 0 | +-----------+----------------------+ Minimize code modifications, only adjust the IRQ vector number for this case. Fixes: 877253d2cbf2 ("net: ngbe: add sriov function support") Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20250701063030.59340-4-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03net: wangxun: revert the adjustment of the IRQ vector sequenceJiawen Wu
Due to hardware limitations of NGBE, queue IRQs can only be requested on vector 0 to 7. When the number of queues is set to the maximum 8, the PCI IRQ vectors are allocated from 0 to 8. The vector 0 is used by MISC interrupt, and althrough the vector 8 is used by queue interrupt, it is unable to receive packets. This will cause some packets to be dropped when RSS is enabled and they are assigned to queue 8. So revert the adjustment of the MISC IRQ location, to make it be the last one in IRQ vectors. Fixes: 937d46ecc5f9 ("net: wangxun: add ethtool_ops for channel number") Cc: stable@vger.kernel.org Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20250701063030.59340-3-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03net: txgbe: request MISC IRQ in ndo_openJiawen Wu
Move the creating of irq_domain for MISC IRQ from .probe to .ndo_open, and free it in .ndo_stop, to maintain consistency with the queue IRQs. This it for subsequent adjustments to the IRQ vectors. Fixes: aefd013624a1 ("net: txgbe: use irq_domain for interrupt controller") Cc: stable@vger.kernel.org Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250701063030.59340-2-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03Merge branch 'virtio-fixes-for-tx-ring-sizing-and-resize-error-reporting'Paolo Abeni
Laurent Vivier says: ==================== virtio: Fixes for TX ring sizing and resize error reporting This patch series contains two fixes and a cleanup for the virtio subsystem. The first patch fixes an error reporting bug in virtio_ring's virtqueue_resize() function. Previously, errors from internal resize helpers could be masked if the subsequent re-enabling of the virtqueue succeeded. This patch restores the correct error propagation, ensuring that callers of virtqueue_resize() are properly informed of underlying resize failures. The second patch does a cleanup of the use of '2+MAX_SKB_FRAGS' The third patch addresses a reliability issue in virtio_net where the TX ring size could be configured too small, potentially leading to persistently stopped queues and degraded performance. It enforces a minimum TX ring size to ensure there's always enough space for at least one maximally-fragmented packet plus an additional slot. ==================== Link: https://patch.msgid.link/20250521092236.661410-1-lvivier@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03virtio_net: Enforce minimum TX ring size for reliabilityLaurent Vivier
The `tx_may_stop()` logic stops TX queues if free descriptors (`sq->vq->num_free`) fall below the threshold of (`MAX_SKB_FRAGS` + 2). If the total ring size (`ring_num`) is not strictly greater than this value, queues can become persistently stopped or stop after minimal use, severely degrading performance. A single sk_buff transmission typically requires descriptors for: - The virtio_net_hdr (1 descriptor) - The sk_buff's linear data (head) (1 descriptor) - Paged fragments (up to MAX_SKB_FRAGS descriptors) This patch enforces that the TX ring size ('ring_num') must be strictly greater than (MAX_SKB_FRAGS + 2). This ensures that the ring is always large enough to hold at least one maximally-fragmented packet plus at least one additional slot. Reported-by: Lei Yang <leiyang@redhat.com> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250521092236.661410-4-lvivier@redhat.com Tested-by: Lei Yang <leiyang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03virtio_net: Cleanup '2+MAX_SKB_FRAGS'Laurent Vivier
Improve consistency by using everywhere it is needed 'MAX_SKB_FRAGS + 2' rather than '2+MAX_SKB_FRAGS' or '2 + MAX_SKB_FRAGS'. No functional change. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250521092236.661410-3-lvivier@redhat.com Tested-by: Lei Yang <leiyang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03virtio_ring: Fix error reporting in virtqueue_resizeLaurent Vivier
The virtqueue_resize() function was not correctly propagating error codes from its internal resize helper functions, specifically virtqueue_resize_packet() and virtqueue_resize_split(). If these helpers returned an error, but the subsequent call to virtqueue_enable_after_reset() succeeded, the original error from the resize operation would be masked. Consequently, virtqueue_resize() could incorrectly report success to its caller despite an underlying resize failure. This change restores the original code behavior: if (vdev->config->enable_vq_after_reset(_vq)) return -EBUSY; return err; Fix: commit ad48d53b5b3f ("virtio_ring: separate the logic of reset/enable from virtqueue_resize") Cc: xuanzhuo@linux.alibaba.com Signed-off-by: Laurent Vivier <lvivier@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250521092236.661410-2-lvivier@redhat.com Tested-by: Lei Yang <leiyang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()Mark Rutland
Historically KVM hyp code saved the host's FPSIMD state into the hosts's fpsimd_state memory, and so it was necessary to map this into the hyp Stage-1 mappings before running a vCPU. This is no longer necessary as of commits: * fbc7e61195e2 ("KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state") * 8eca7f6d5100 ("KVM: arm64: Remove host FPSIMD saving for non-protected KVM") Since those commits, we eagerly save the host's FPSIMD state before calling into hyp to run a vCPU, and hyp code never reads nor writes the host's fpsimd_state memory. There's no longer any need to map the host's fpsimd_state memory into the hyp Stage-1, and kvm_arch_vcpu_run_map_fp() is unnecessary but benign. Remove kvm_arch_vcpu_run_map_fp(). Currently there is no code to perform a corresponding unmap, and we never mapped the host's SVE or SME state into the hyp Stage-1, so no other code needs to be removed. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Cc: kvmarm@lists.linux.dev Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250619134817.4075340-1-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-03KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizesMarc Zyngier
Booting an EL2 guest on a system only supporting a subset of the possible page sizes leads to interesting situations. For example, on a system that only supports 4kB and 64kB, and is booted with a 4kB kernel, we end-up advertising 16kB support at stage-2, which is pretty weird. That's because we consider that any S2 bigger than our base granule is fair game, irrespective of what the HW actually supports. While this is not impossible to support (KVM would happily handle it), it is likely to be confusing for the guest. Add new checks that will verify that this granule size is actually supported before publishing it to the guest. Fixes: e7ef6ed4583ea ("KVM: arm64: Enforce NV limits on a per-idregs basis") Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-07-03HID: appletb-kbd: fix slab use-after-free bug in appletb_kbd_probeQasim Ijaz
In probe appletb_kbd_probe() a "struct appletb_kbd *kbd" is allocated via devm_kzalloc() to store touch bar keyboard related data. Later on if backlight_device_get_by_name() finds a backlight device with name "appletb_backlight" a timer (kbd->inactivity_timer) is setup with appletb_inactivity_timer() and the timer is armed to run after appletb_tb_dim_timeout (60) seconds. A use-after-free is triggered when failure occurs after the timer is armed. This ultimately means probe failure occurs and as a result the "struct appletb_kbd *kbd" which is device managed memory is freed. After 60 seconds the timer will have expired and __run_timers will attempt to access the timer (kbd->inactivity_timer) however the kdb structure has been freed causing a use-after free. [ 71.636938] ================================================================== [ 71.637915] BUG: KASAN: slab-use-after-free in __run_timers+0x7ad/0x890 [ 71.637915] Write of size 8 at addr ffff8881178c5958 by task swapper/1/0 [ 71.637915] [ 71.637915] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.16.0-rc2-00318-g739a6c93cc75-dirty #12 PREEMPT(voluntary) [ 71.637915] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 71.637915] Call Trace: [ 71.637915] <IRQ> [ 71.637915] dump_stack_lvl+0x53/0x70 [ 71.637915] print_report+0xce/0x670 [ 71.637915] ? __run_timers+0x7ad/0x890 [ 71.637915] kasan_report+0xce/0x100 [ 71.637915] ? __run_timers+0x7ad/0x890 [ 71.637915] __run_timers+0x7ad/0x890 [ 71.637915] ? __pfx___run_timers+0x10/0x10 [ 71.637915] ? update_process_times+0xfc/0x190 [ 71.637915] ? __pfx_update_process_times+0x10/0x10 [ 71.637915] ? _raw_spin_lock_irq+0x80/0xe0 [ 71.637915] ? _raw_spin_lock_irq+0x80/0xe0 [ 71.637915] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 71.637915] run_timer_softirq+0x141/0x240 [ 71.637915] ? __pfx_run_timer_softirq+0x10/0x10 [ 71.637915] ? __pfx___hrtimer_run_queues+0x10/0x10 [ 71.637915] ? kvm_clock_get_cycles+0x18/0x30 [ 71.637915] ? ktime_get+0x60/0x140 [ 71.637915] handle_softirqs+0x1b8/0x5c0 [ 71.637915] ? __pfx_handle_softirqs+0x10/0x10 [ 71.637915] irq_exit_rcu+0xaf/0xe0 [ 71.637915] sysvec_apic_timer_interrupt+0x6c/0x80 [ 71.637915] </IRQ> [ 71.637915] [ 71.637915] Allocated by task 39: [ 71.637915] kasan_save_stack+0x33/0x60 [ 71.637915] kasan_save_track+0x14/0x30 [ 71.637915] __kasan_kmalloc+0x8f/0xa0 [ 71.637915] __kmalloc_node_track_caller_noprof+0x195/0x420 [ 71.637915] devm_kmalloc+0x74/0x1e0 [ 71.637915] appletb_kbd_probe+0x37/0x3c0 [ 71.637915] hid_device_probe+0x2d1/0x680 [ 71.637915] really_probe+0x1c3/0x690 [ 71.637915] __driver_probe_device+0x247/0x300 [ 71.637915] driver_probe_device+0x49/0x210 [...] [ 71.637915] [ 71.637915] Freed by task 39: [ 71.637915] kasan_save_stack+0x33/0x60 [ 71.637915] kasan_save_track+0x14/0x30 [ 71.637915] kasan_save_free_info+0x3b/0x60 [ 71.637915] __kasan_slab_free+0x37/0x50 [ 71.637915] kfree+0xcf/0x360 [ 71.637915] devres_release_group+0x1f8/0x3c0 [ 71.637915] hid_device_probe+0x315/0x680 [ 71.637915] really_probe+0x1c3/0x690 [ 71.637915] __driver_probe_device+0x247/0x300 [ 71.637915] driver_probe_device+0x49/0x210 [...] The root cause of the issue is that the timer is not disarmed on failure paths leading to it remaining active and accessing freed memory. To fix this call timer_delete_sync() to deactivate the timer. Another small issue is that timer_delete_sync is called unconditionally in appletb_kbd_remove(), fix this by checking for a valid kbd->backlight_dev before calling timer_delete_sync. Fixes: 93a0fc489481 ("HID: hid-appletb-kbd: add support for automatic brightness control while using the touchbar") Cc: stable@vger.kernel.org Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Reviewed-by: Aditya Garg <gargaditya08@live.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-07-03virtio-net: xsk: rx: fix the frame's length checkBui Quang Minh
When calling buf_to_xdp, the len argument is the frame data's length without virtio header's length (vi->hdr_len). We check that len with xsk_pool_get_rx_frame_size() + vi->hdr_len to ensure the provided len does not larger than the allocated chunk size. The additional vi->hdr_len is because in virtnet_add_recvbuf_xsk, we use part of XDP_PACKET_HEADROOM for virtio header and ask the vhost to start placing data from hard_start + XDP_PACKET_HEADROOM - vi->hdr_len not hard_start + XDP_PACKET_HEADROOM But the first buffer has virtio_header, so the maximum frame's length in the first buffer can only be xsk_pool_get_rx_frame_size() not xsk_pool_get_rx_frame_size() + vi->hdr_len like in the current check. This commit adds an additional argument to buf_to_xdp differentiate between the first buffer and other ones to correctly calculate the maximum frame's length. Cc: stable@vger.kernel.org Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Fixes: a4e7ba702701 ("virtio_net: xsk: rx: support recv small mode") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250630151315.86722-2-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03Merge branch 'virtio-net-fixes-for-mergeable-xdp-receive-path'Paolo Abeni
Bui Quang Minh says: ==================== virtio-net: fixes for mergeable XDP receive path This series contains fixes for XDP receive path in virtio-net - Patch 1: add a missing check for the received data length with our allocated buffer size in mergeable mode. - Patch 2: remove a redundant truesize check with PAGE_SIZE in mergeable mode - Patch 3: make the current repeated code use the check_mergeable_len to check for received data length in mergeable mode ==================== Link: https://patch.msgid.link/20250630144212.48471-1-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03virtio-net: use the check_mergeable_len helperBui Quang Minh
Replace the current repeated code to check received length in mergeable mode with the new check_mergeable_len helper. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250630144212.48471-4-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03virtio-net: remove redundant truesize check with PAGE_SIZEBui Quang Minh
The truesize is guaranteed not to exceed PAGE_SIZE in get_mergeable_buf_len(). It is saved in mergeable context, which is not changeable by the host side, so the check in receive path is quite redundant. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250630144212.48471-3-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03virtio-net: ensure the received length does not exceed allocated sizeBui Quang Minh
In xdp_linearize_page, when reading the following buffers from the ring, we forget to check the received length with the true allocate size. This can lead to an out-of-bound read. This commit adds that missing check. Cc: <stable@vger.kernel.org> Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250630144212.48471-2-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03perf: Revert to requiring CAP_SYS_ADMIN for uprobesPeter Zijlstra
Jann reports that uprobes can be used destructively when used in the middle of an instruction. The kernel only verifies there is a valid instruction at the requested offset, but due to variable instruction length cannot determine if this is an instruction as seen by the intended execution stream. Additionally, Mark Rutland notes that on architectures that mix data in the text segment (like arm64), a similar things can be done if the data word is 'mistaken' for an instruction. As such, require CAP_SYS_ADMIN for uprobes. Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
2025-07-03HID: Fix debug name for BTN_GEAR_DOWN, BTN_GEAR_UP, BTN_WHEELVicki Pfau
The name of BTN_GEAR_DOWN was WheelBtn and BTN_WHEEL was missing. Further, BTN_GEAR_UP had a space in its name and no Btn, which is against convention. This makes the names BtnGearDown, BtnGearUp, and BtnWheel, fixing the errors and matching convention. Signed-off-by: Vicki Pfau <vi@endrift.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-07-03HID: elecom: add support for ELECOM HUGE 019B variantLeonard Dizon
The ELECOM M-HT1DRBK trackball has an additional device ID (056E:019B) not yet recognized by the driver, despite using the same report descriptor as earlier variants. This patch adds the new ID and applies the same fixups, enabling all 8 buttons to function properly. Signed-off-by: Leonard Dizon <leonard@snekbyte.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-07-03HID: appletb-kbd: fix memory corruption of input_handler_listQasim Ijaz
In appletb_kbd_probe an input handler is initialised and then registered with input core through input_register_handler(). When this happens input core will add the input handler (specifically its node) to the global input_handler_list. The input_handler_list is central to the functionality of input core and is traversed in various places in input core. An example of this is when a new input device is plugged in and gets registered with input core. The input_handler in probe is allocated as device managed memory. If a probe failure occurs after input_register_handler() the input_handler memory is freed, yet it will remain in the input_handler_list. This effectively means the input_handler_list contains a dangling pointer to data belonging to a freed input handler. This causes an issue when any other input device is plugged in - in my case I had an old PixArt HP USB optical mouse and I decided to plug it in after a failure occurred after input_register_handler(). This lead to the registration of this input device via input_register_device which involves traversing over every handler in the corrupted input_handler_list and calling input_attach_handler(), giving each handler a chance to bind to newly registered device. The core of this bug is a UAF which causes memory corruption of input_handler_list and to fix it we must ensure the input handler is unregistered from input core, this is done through input_unregister_handler(). [ 63.191597] ================================================================== [ 63.192094] BUG: KASAN: slab-use-after-free in input_attach_handler.isra.0+0x1a9/0x1e0 [ 63.192094] Read of size 8 at addr ffff888105ea7c80 by task kworker/0:2/54 [ 63.192094] [ 63.192094] CPU: 0 UID: 0 PID: 54 Comm: kworker/0:2 Not tainted 6.16.0-rc2-00321-g2aa6621d [ 63.192094] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.164 [ 63.192094] Workqueue: usb_hub_wq hub_event [ 63.192094] Call Trace: [ 63.192094] <TASK> [ 63.192094] dump_stack_lvl+0x53/0x70 [ 63.192094] print_report+0xce/0x670 [ 63.192094] kasan_report+0xce/0x100 [ 63.192094] input_attach_handler.isra.0+0x1a9/0x1e0 [ 63.192094] input_register_device+0x76c/0xd00 [ 63.192094] hidinput_connect+0x686d/0xad60 [ 63.192094] hid_connect+0xf20/0x1b10 [ 63.192094] hid_hw_start+0x83/0x100 [ 63.192094] hid_device_probe+0x2d1/0x680 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [ 63.192094] bus_for_each_drv+0x10f/0x190 [ 63.192094] __device_attach+0x18e/0x370 [ 63.192094] bus_probe_device+0x123/0x170 [ 63.192094] device_add+0xd4d/0x1460 [ 63.192094] hid_add_device+0x30b/0x910 [ 63.192094] usbhid_probe+0x920/0xe00 [ 63.192094] usb_probe_interface+0x363/0x9a0 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [ 63.192094] bus_for_each_drv+0x10f/0x190 [ 63.192094] __device_attach+0x18e/0x370 [ 63.192094] bus_probe_device+0x123/0x170 [ 63.192094] device_add+0xd4d/0x1460 [ 63.192094] usb_set_configuration+0xd14/0x1880 [ 63.192094] usb_generic_driver_probe+0x78/0xb0 [ 63.192094] usb_probe_device+0xaa/0x2e0 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [ 63.192094] bus_for_each_drv+0x10f/0x190 [ 63.192094] __device_attach+0x18e/0x370 [ 63.192094] bus_probe_device+0x123/0x170 [ 63.192094] device_add+0xd4d/0x1460 [ 63.192094] usb_new_device+0x7b4/0x1000 [ 63.192094] hub_event+0x234d/0x3fa0 [ 63.192094] process_one_work+0x5bf/0xfe0 [ 63.192094] worker_thread+0x777/0x13a0 [ 63.192094] </TASK> [ 63.192094] [ 63.192094] Allocated by task 54: [ 63.192094] kasan_save_stack+0x33/0x60 [ 63.192094] kasan_save_track+0x14/0x30 [ 63.192094] __kasan_kmalloc+0x8f/0xa0 [ 63.192094] __kmalloc_node_track_caller_noprof+0x195/0x420 [ 63.192094] devm_kmalloc+0x74/0x1e0 [ 63.192094] appletb_kbd_probe+0x39/0x440 [ 63.192094] hid_device_probe+0x2d1/0x680 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [...] [ 63.192094] [ 63.192094] Freed by task 54: [ 63.192094] kasan_save_stack+0x33/0x60 [ 63.192094] kasan_save_track+0x14/0x30 [ 63.192094] kasan_save_free_info+0x3b/0x60 [ 63.192094] __kasan_slab_free+0x37/0x50 [ 63.192094] kfree+0xcf/0x360 [ 63.192094] devres_release_group+0x1f8/0x3c0 [ 63.192094] hid_device_probe+0x315/0x680 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [...] Fixes: 7d62ba8deacf ("HID: hid-appletb-kbd: add support for fn toggle between media and function mode") Cc: stable@vger.kernel.org Reviewed-by: Aditya Garg <gargaditya08@live.com> Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-07-03wifi: iwlwifi: Fix error code in iwl_op_mode_dvm_start()Dan Carpenter
Preserve the error code if iwl_setup_deferred_work() fails. The current code returns ERR_PTR(0) (which is NULL) on this path. I believe the missing error code potentially leads to a use after free involving debugfs. Fixes: 90a0d9f33996 ("iwlwifi: Add missing check for alloc_ordered_workqueue") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/a7a1cd2c-ce01-461a-9afd-dbe535f8df01@sabinyo.mountain Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2025-07-02net: ipv6: Fix spelling mistakeChenguang Zhao
change 'Maximium' to 'Maximum' Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250702055820.112190-1-zhaochenguang@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02Merge branch 'support-rate-management-on-traffic-classes-in-devlink-and-mlx5'Jakub Kicinski
Mark Bloch says: ==================== Support rate management on traffic classes in devlink and mlx5 This patch series extends the devlink-rate API to support traffic class (TC) bandwidth management, enabling more granular control over traffic shaping and rate limiting across multiple TCs. The API now allows users to specify bandwidth proportions for different traffic classes in a single command. This is particularly useful for managing Enhanced Transmission Selection (ETS) for groups of Virtual Functions (VFs), allowing precise bandwidth allocation across traffic classes. Additionally the series refines the QoS handling in net/mlx5 to support TC arbitration and bandwidth management on vports and rate nodes. Discussions on traffic class shaping in net-shapers began in V5 [1], where we discussed with maintainers whether net-shapers should support traffic classes and how this could be implemented. Later, after further conversations with Paolo Abeni and Simon Horman, Cosmin provided an update [2], confirming that net-shapers' tree-based hierarchy aligns well with traffic classes when treated as distinct subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX queues and traffic classes, this approach seems feasible, though some open questions remain regarding queue reconfiguration and certain mlx5 scheduling behaviors. Building on that discussion, Cosmin has now shared a concrete implementation plan on the netdev mailing list [3]. The plan, developed in collaboration with Paolo and Simon, outlines how net-shapers can be extended to support the same use cases currently covered by devlink-rate, with the eventual goal of aligning both and simplifying the shaping infrastructure in the kernel. This work was presented at Netdev 0x19 in Zagreb [4]. There we presented how TC scheduling is enforced in mlx5 hardware, which led to discussions on the mailing list. A summary of how things work: Classification means labeling a packet with a traffic class based on the packet's DSCP or VLAN PCP field, then treating packets with different traffic classes differently during transmit processing. In a virtualized setup, VFs are untrusted and do not control classification or shaping. Classification is done by the hardware using a prio-to-TC mapping set by the hypervisor. VFs only select which send queue to use and are expected to respect the classification logic by sending each traffic class on its dedicated queue. As stated in the net-shapers plan [3], each transmit queue should carry only a single traffic class. Mixing classes in a single queue can lead to HOL blocking. In the mlx5 implementation, if the queue used does not match the classified traffic class, the hardware moves the queue to the correct TC scheduler. This movement is not a reclassification; it’s a necessary enforcement step to ensure traffic class isolation is maintained. Extend devlink-rate API to support rate management on TCs: - devlink: Extend the devlink rate API to support traffic class bandwidth management Introduce a no-op implementation: - net/mlx5: Add no-op implementation for setting tc-bw on rate objects Add support for enabling and disabling TC QoS on vports and nodes: - net/mlx5: Add support for setting tc-bw on nodes - net/mlx5: Add traffic class scheduling support for vport QoS Support for setting tc-bw on rate objects: - net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw [1] https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/ [2] https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/ [3] https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.camel@nvidia.com/T/ [4] https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-with-ets-and-traffic-classes.html ==================== Link: https://patch.msgid.link/20250629142138.361537-1-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02selftests: drv-net: Add test for devlink-rate traffic class bandwidth ↵Carolina Jubran
distribution This test suite validates the functionality of the devlink-rate API for traffic class (TC) bandwidth allocation. It ensures that bandwidth can be distributed between different traffic classes as configured, and verifies that explicit TC-to-queue mapping is required for the allocation to be effective. The first test (test_no_tc_mapping_bandwidth) is marked as expected failure on mlx5, since the hardware automatically enforces traffic class separation by dynamically moving queues to the correct TC scheduler, even without explicit TC-to-queue mapping configuration. Test output on mlx5: 1..2 # Created VF interface: eth5 # Created VLAN eth5.101 on eth5 with tc 3 and IP 198.51.100.2 # Created VLAN eth5.102 on eth5 with tc 4 and IP 198.51.100.10 # Set representor eth4 up and added to bridge # Bandwidth check results without TC mapping: # TC 3: 0.19 Gbits/sec # TC 4: 0.76 Gbits/sec # Total bandwidth: 0.95 Gbits/sec # TC 3 percentage: 20.0% # TC 4 percentage: 80.0% ok 1 devlink_rate_tc_bw.test_no_tc_mapping_bandwidth # XFAIL Bandwidth matched 80/20 split without TC mapping # Created VF interface: eth5 # Created VLAN eth5.101 on eth5 with tc 3 and IP 198.51.100.2 # Created VLAN eth5.102 on eth5 with tc 4 and IP 198.51.100.10 # Set representor eth4 up and added to bridge # Bandwidth check results with TC mapping: # TC 3: 0.21 Gbits/sec # TC 4: 0.78 Gbits/sec # Total bandwidth: 0.98 Gbits/sec # TC 3 percentage: 21.1% # TC 4 percentage: 78.9% # Bandwidth is distributed as 80/20 with TC mapping ok 2 devlink_rate_tc_bw.test_tc_mapping_bandwidth # Totals: pass:1 fail:0 xfail:1 xpass:0 skip:0 error:0 Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net/mlx5: Manage TC arbiter nodes and implement full support for tc-bwCarolina Jubran
Introduce support for managing Traffic Class (TC) arbiter nodes and associated vports TC nodes within the E-Switch QoS hierarchy. This patch adds support for the new scheduling node type, `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, and implements full support for setting tc-bw on both vports and nodes. Key changes include: - Introduced the new scheduling node type, `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, for managing vports within the TC arbiter node. - New helper functions for creating and destroying vports TC nodes under the TC arbiter. - Updated the minimum rate normalization function to skip nodes of type `SCHED_NODE_TYPE_VPORTS_TC_TSAR`. Vports TC TSARs have bandwidth shares configured on them but not minimum rates, so their `min_rate` cannot be normalized. - Implementation of `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()` for initializing and cleaning up TC arbiter scheduling elements. These functions now fully support tc-bw configuration on TC arbiter nodes. - Introduced a new helper `esw_qos_calculate_tc_bw_divider()` to compute the total TC bandwidth share, which is used as a divider for normalizing each TC's share. - Added `esw_qos_tc_arbiter_get_bw_shares()` and `esw_qos_set_tc_arbiter_bw_shares()` to handle the settings of bandwidth shares for vports traffic class TSARs. - `esw_qos_set_tc_arbiter_bw_shares()` normalizes each TC share based on the total and the firmware's maximum allowed TSAR bandwidth share. - Refactored `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` to fully support configuring tc-bw on devlink rate nodes and vports, respectively. - Refactored `mlx5_esw_qos_node_update_parent()` to ensure that tc-bw configuration remains compatible with setting a parent on a rate node, preserving level hierarchy functionality. - Refactored `esw_qos_calc_bw_share()` to generalize its input so it can be used for both minimum rate and bandwidth share calculations. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net/mlx5: Add traffic class scheduling support for vport QoSCarolina Jubran
Introduce support for traffic class (TC) scheduling on vports by allowing the vport to own multiple TC scheduling nodes. This patch enables more granular control of QoS by defining three distinct QoS states for vports, each providing unique scheduling behavior: 1. Regular QoS: The `sched_node` represents the vport directly, handling QoS as a single scheduling entity. 2. TC QoS on the vport: The `sched_node` acts as a TC arbiter, enabling TC scheduling directly on the vport. 3. TC QoS on the parent node: The `sched_node` functions as a rate limiter, with TC arbitration enabled at the parent level, associating multiple scheduling nodes with each vport. Key changes include: - Added support for new scheduling elements, vport traffic class and rate limiter. - New helper functions for creating, destroying, and restoring vport TC scheduling nodes, handling transitions between regular QoS and TC arbitration states. - Updated `esw_qos_vport_enable()` and `esw_qos_vport_disable()` to support both regular QoS and TC arbitration states, ensuring consistent transitions between scheduling modes. - Introduced a `sched_nodes` array under `vport->qos` to store multiple TC scheduling nodes per vport, enabling finer control over per-TC QoS. - Enhanced `esw_qos_vport_update_parent()` to handle transitions between the three QoS states based on the current and new parent node types. This patch lays the groundwork for future support for configuring tc-bw on vports. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on vports will return `-EOPNOTSUPP`. No functional changes are introduced at this stage. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net/mlx5: Add support for setting tc-bw on nodesCarolina Jubran
Introduce support for enabling and disabling Traffic Class (TC) arbitration for existing devlink rate nodes. This patch adds support for a new scheduling node type, `SCHED_NODE_TYPE_TC_ARBITER_TSAR`. Key changes include: - New helper functions for transitioning existing rate nodes to TC arbiter nodes and vice versa. These functions handle the allocation of TC arbiter nodes, copying of child nodes, and restoring vport QoS settings when TC arbitration is disabled. - Implementation of `mlx5_esw_devlink_rate_node_tc_bw_set()` to manage tc-bw configuration on nodes. - Introduced stubs for `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()`, which will be extended in future patches to provide full support for tc-bw on devlink rate objects. - Validation functions for tc-bw settings, allowing graceful handling of unsupported traffic class bandwidth configurations. - Updated `__esw_qos_alloc_node()` to insert the new node into the parent’s children list only if the parent is not NULL. For the root TSAR, the new node is inserted directly after the allocation call. - Don't allow `tc-bw` configuration for nodes containing non-leaf children. This patch lays the groundwork for future support for configuring tc-bw on devlink rate nodes. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on nodes will return `-EOPNOTSUPP`. No functional changes are introduced at this stage. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net/mlx5: Add no-op implementation for setting tc-bw on rate objectsCarolina Jubran
Introduce `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` with no-op logic. Future patches will add support for setting traffic class bandwidth on rate objects. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02selftest: netdevsim: Add devlink rate tc-bw testCarolina Jubran
Test verifies that netdevsim correctly implements devlink ops callbacks that set tc-bw on leaf or node rate object. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02devlink: Extend devlink rate API with traffic classes bandwidth managementCarolina Jubran
Introduce support for specifying relative bandwidth shares between traffic classes (TC) in the devlink-rate API. This new option allows users to allocate bandwidth across multiple traffic classes in a single command. This feature provides a more granular control over traffic management, especially for scenarios requiring Enhanced Transmission Selection. Users can now define a relative bandwidth share for each traffic class. For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5 (RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the total bandwidth. The actual percentage each class receives depends on the ratio of its share value to the sum of all shares. Example: DEV=pci/0000:08:00.0 $ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \ tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0 $ devlink port function rate set $DEV/vfs_group \ tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0 Example usage with ynl: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"rate-tc-index": 0, "rate-tc-bw": 50}, {"rate-tc-index": 1, "rate-tc-bw": 50}, {"rate-tc-index": 2, "rate-tc-bw": 0}, {"rate-tc-index": 3, "rate-tc-bw": 0}, {"rate-tc-index": 4, "rate-tc-bw": 0}, {"rate-tc-index": 5, "rate-tc-bw": 0}, {"rate-tc-index": 6, "rate-tc-bw": 0}, {"rate-tc-index": 7, "rate-tc-bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0}, {'rate-tc-bw': 50, 'rate-tc-index': 1}, {'rate-tc-bw': 0, 'rate-tc-index': 2}, {'rate-tc-bw': 0, 'rate-tc-index': 3}, {'rate-tc-bw': 0, 'rate-tc-index': 4}, {'rate-tc-bw': 0, 'rate-tc-index': 5}, {'rate-tc-bw': 0, 'rate-tc-index': 6}, {'rate-tc-bw': 0, 'rate-tc-index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>