linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-07-03	HID: appletb-kbd: fix slab use-after-free bug in appletb_kbd_probe	Qasim Ijaz
	In probe appletb_kbd_probe() a "struct appletb_kbd kbd" is allocated via devm_kzalloc() to store touch bar keyboard related data. Later on if backlight_device_get_by_name() finds a backlight device with name "appletb_backlight" a timer (kbd->inactivity_timer) is setup with appletb_inactivity_timer() and the timer is armed to run after appletb_tb_dim_timeout (60) seconds. A use-after-free is triggered when failure occurs after the timer is armed. This ultimately means probe failure occurs and as a result the "struct appletb_kbd kbd" which is device managed memory is freed. After 60 seconds the timer will have expired and __run_timers will attempt to access the timer (kbd->inactivity_timer) however the kdb structure has been freed causing a use-after free. [ 71.636938] ================================================================== [ 71.637915] BUG: KASAN: slab-use-after-free in __run_timers+0x7ad/0x890 [ 71.637915] Write of size 8 at addr ffff8881178c5958 by task swapper/1/0 [ 71.637915] [ 71.637915] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.16.0-rc2-00318-g739a6c93cc75-dirty #12 PREEMPT(voluntary) [ 71.637915] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 71.637915] Call Trace: [ 71.637915] <IRQ> [ 71.637915] dump_stack_lvl+0x53/0x70 [ 71.637915] print_report+0xce/0x670 [ 71.637915] ? __run_timers+0x7ad/0x890 [ 71.637915] kasan_report+0xce/0x100 [ 71.637915] ? __run_timers+0x7ad/0x890 [ 71.637915] __run_timers+0x7ad/0x890 [ 71.637915] ? __pfx___run_timers+0x10/0x10 [ 71.637915] ? update_process_times+0xfc/0x190 [ 71.637915] ? __pfx_update_process_times+0x10/0x10 [ 71.637915] ? _raw_spin_lock_irq+0x80/0xe0 [ 71.637915] ? _raw_spin_lock_irq+0x80/0xe0 [ 71.637915] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 71.637915] run_timer_softirq+0x141/0x240 [ 71.637915] ? __pfx_run_timer_softirq+0x10/0x10 [ 71.637915] ? __pfx___hrtimer_run_queues+0x10/0x10 [ 71.637915] ? kvm_clock_get_cycles+0x18/0x30 [ 71.637915] ? ktime_get+0x60/0x140 [ 71.637915] handle_softirqs+0x1b8/0x5c0 [ 71.637915] ? __pfx_handle_softirqs+0x10/0x10 [ 71.637915] irq_exit_rcu+0xaf/0xe0 [ 71.637915] sysvec_apic_timer_interrupt+0x6c/0x80 [ 71.637915] </IRQ> [ 71.637915] [ 71.637915] Allocated by task 39: [ 71.637915] kasan_save_stack+0x33/0x60 [ 71.637915] kasan_save_track+0x14/0x30 [ 71.637915] __kasan_kmalloc+0x8f/0xa0 [ 71.637915] __kmalloc_node_track_caller_noprof+0x195/0x420 [ 71.637915] devm_kmalloc+0x74/0x1e0 [ 71.637915] appletb_kbd_probe+0x37/0x3c0 [ 71.637915] hid_device_probe+0x2d1/0x680 [ 71.637915] really_probe+0x1c3/0x690 [ 71.637915] __driver_probe_device+0x247/0x300 [ 71.637915] driver_probe_device+0x49/0x210 [...] [ 71.637915] [ 71.637915] Freed by task 39: [ 71.637915] kasan_save_stack+0x33/0x60 [ 71.637915] kasan_save_track+0x14/0x30 [ 71.637915] kasan_save_free_info+0x3b/0x60 [ 71.637915] __kasan_slab_free+0x37/0x50 [ 71.637915] kfree+0xcf/0x360 [ 71.637915] devres_release_group+0x1f8/0x3c0 [ 71.637915] hid_device_probe+0x315/0x680 [ 71.637915] really_probe+0x1c3/0x690 [ 71.637915] __driver_probe_device+0x247/0x300 [ 71.637915] driver_probe_device+0x49/0x210 [...] The root cause of the issue is that the timer is not disarmed on failure paths leading to it remaining active and accessing freed memory. To fix this call timer_delete_sync() to deactivate the timer. Another small issue is that timer_delete_sync is called unconditionally in appletb_kbd_remove(), fix this by checking for a valid kbd->backlight_dev before calling timer_delete_sync. Fixes: 93a0fc489481 ("HID: hid-appletb-kbd: add support for automatic brightness control while using the touchbar") Cc: stable@vger.kernel.org Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Reviewed-by: Aditya Garg <gargaditya08@live.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-07-03	virtio-net: xsk: rx: fix the frame's length check	Bui Quang Minh
	When calling buf_to_xdp, the len argument is the frame data's length without virtio header's length (vi->hdr_len). We check that len with xsk_pool_get_rx_frame_size() + vi->hdr_len to ensure the provided len does not larger than the allocated chunk size. The additional vi->hdr_len is because in virtnet_add_recvbuf_xsk, we use part of XDP_PACKET_HEADROOM for virtio header and ask the vhost to start placing data from hard_start + XDP_PACKET_HEADROOM - vi->hdr_len not hard_start + XDP_PACKET_HEADROOM But the first buffer has virtio_header, so the maximum frame's length in the first buffer can only be xsk_pool_get_rx_frame_size() not xsk_pool_get_rx_frame_size() + vi->hdr_len like in the current check. This commit adds an additional argument to buf_to_xdp differentiate between the first buffer and other ones to correctly calculate the maximum frame's length. Cc: stable@vger.kernel.org Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Fixes: a4e7ba702701 ("virtio_net: xsk: rx: support recv small mode") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250630151315.86722-2-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03	Merge branch 'virtio-net-fixes-for-mergeable-xdp-receive-path'	Paolo Abeni
	Bui Quang Minh says: ==================== virtio-net: fixes for mergeable XDP receive path This series contains fixes for XDP receive path in virtio-net - Patch 1: add a missing check for the received data length with our allocated buffer size in mergeable mode. - Patch 2: remove a redundant truesize check with PAGE_SIZE in mergeable mode - Patch 3: make the current repeated code use the check_mergeable_len to check for received data length in mergeable mode ==================== Link: https://patch.msgid.link/20250630144212.48471-1-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03	virtio-net: use the check_mergeable_len helper	Bui Quang Minh
	Replace the current repeated code to check received length in mergeable mode with the new check_mergeable_len helper. Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250630144212.48471-4-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03	virtio-net: remove redundant truesize check with PAGE_SIZE	Bui Quang Minh
	The truesize is guaranteed not to exceed PAGE_SIZE in get_mergeable_buf_len(). It is saved in mergeable context, which is not changeable by the host side, so the check in receive path is quite redundant. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250630144212.48471-3-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03	virtio-net: ensure the received length does not exceed allocated size	Bui Quang Minh
	In xdp_linearize_page, when reading the following buffers from the ring, we forget to check the received length with the true allocate size. This can lead to an out-of-bound read. This commit adds that missing check. Cc: <stable@vger.kernel.org> Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250630144212.48471-2-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03	perf: Revert to requiring CAP_SYS_ADMIN for uprobes	Peter Zijlstra
	Jann reports that uprobes can be used destructively when used in the middle of an instruction. The kernel only verifies there is a valid instruction at the requested offset, but due to variable instruction length cannot determine if this is an instruction as seen by the intended execution stream. Additionally, Mark Rutland notes that on architectures that mix data in the text segment (like arm64), a similar things can be done if the data word is 'mistaken' for an instruction. As such, require CAP_SYS_ADMIN for uprobes. Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
2025-07-03	HID: Fix debug name for BTN_GEAR_DOWN, BTN_GEAR_UP, BTN_WHEEL	Vicki Pfau
	The name of BTN_GEAR_DOWN was WheelBtn and BTN_WHEEL was missing. Further, BTN_GEAR_UP had a space in its name and no Btn, which is against convention. This makes the names BtnGearDown, BtnGearUp, and BtnWheel, fixing the errors and matching convention. Signed-off-by: Vicki Pfau <vi@endrift.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-07-03	HID: elecom: add support for ELECOM HUGE 019B variant	Leonard Dizon
	The ELECOM M-HT1DRBK trackball has an additional device ID (056E:019B) not yet recognized by the driver, despite using the same report descriptor as earlier variants. This patch adds the new ID and applies the same fixups, enabling all 8 buttons to function properly. Signed-off-by: Leonard Dizon <leonard@snekbyte.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-07-03	HID: appletb-kbd: fix memory corruption of input_handler_list	Qasim Ijaz
	In appletb_kbd_probe an input handler is initialised and then registered with input core through input_register_handler(). When this happens input core will add the input handler (specifically its node) to the global input_handler_list. The input_handler_list is central to the functionality of input core and is traversed in various places in input core. An example of this is when a new input device is plugged in and gets registered with input core. The input_handler in probe is allocated as device managed memory. If a probe failure occurs after input_register_handler() the input_handler memory is freed, yet it will remain in the input_handler_list. This effectively means the input_handler_list contains a dangling pointer to data belonging to a freed input handler. This causes an issue when any other input device is plugged in - in my case I had an old PixArt HP USB optical mouse and I decided to plug it in after a failure occurred after input_register_handler(). This lead to the registration of this input device via input_register_device which involves traversing over every handler in the corrupted input_handler_list and calling input_attach_handler(), giving each handler a chance to bind to newly registered device. The core of this bug is a UAF which causes memory corruption of input_handler_list and to fix it we must ensure the input handler is unregistered from input core, this is done through input_unregister_handler(). [ 63.191597] ================================================================== [ 63.192094] BUG: KASAN: slab-use-after-free in input_attach_handler.isra.0+0x1a9/0x1e0 [ 63.192094] Read of size 8 at addr ffff888105ea7c80 by task kworker/0:2/54 [ 63.192094] [ 63.192094] CPU: 0 UID: 0 PID: 54 Comm: kworker/0:2 Not tainted 6.16.0-rc2-00321-g2aa6621d [ 63.192094] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.164 [ 63.192094] Workqueue: usb_hub_wq hub_event [ 63.192094] Call Trace: [ 63.192094] <TASK> [ 63.192094] dump_stack_lvl+0x53/0x70 [ 63.192094] print_report+0xce/0x670 [ 63.192094] kasan_report+0xce/0x100 [ 63.192094] input_attach_handler.isra.0+0x1a9/0x1e0 [ 63.192094] input_register_device+0x76c/0xd00 [ 63.192094] hidinput_connect+0x686d/0xad60 [ 63.192094] hid_connect+0xf20/0x1b10 [ 63.192094] hid_hw_start+0x83/0x100 [ 63.192094] hid_device_probe+0x2d1/0x680 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [ 63.192094] bus_for_each_drv+0x10f/0x190 [ 63.192094] __device_attach+0x18e/0x370 [ 63.192094] bus_probe_device+0x123/0x170 [ 63.192094] device_add+0xd4d/0x1460 [ 63.192094] hid_add_device+0x30b/0x910 [ 63.192094] usbhid_probe+0x920/0xe00 [ 63.192094] usb_probe_interface+0x363/0x9a0 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [ 63.192094] bus_for_each_drv+0x10f/0x190 [ 63.192094] __device_attach+0x18e/0x370 [ 63.192094] bus_probe_device+0x123/0x170 [ 63.192094] device_add+0xd4d/0x1460 [ 63.192094] usb_set_configuration+0xd14/0x1880 [ 63.192094] usb_generic_driver_probe+0x78/0xb0 [ 63.192094] usb_probe_device+0xaa/0x2e0 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [ 63.192094] bus_for_each_drv+0x10f/0x190 [ 63.192094] __device_attach+0x18e/0x370 [ 63.192094] bus_probe_device+0x123/0x170 [ 63.192094] device_add+0xd4d/0x1460 [ 63.192094] usb_new_device+0x7b4/0x1000 [ 63.192094] hub_event+0x234d/0x3fa0 [ 63.192094] process_one_work+0x5bf/0xfe0 [ 63.192094] worker_thread+0x777/0x13a0 [ 63.192094] </TASK> [ 63.192094] [ 63.192094] Allocated by task 54: [ 63.192094] kasan_save_stack+0x33/0x60 [ 63.192094] kasan_save_track+0x14/0x30 [ 63.192094] __kasan_kmalloc+0x8f/0xa0 [ 63.192094] __kmalloc_node_track_caller_noprof+0x195/0x420 [ 63.192094] devm_kmalloc+0x74/0x1e0 [ 63.192094] appletb_kbd_probe+0x39/0x440 [ 63.192094] hid_device_probe+0x2d1/0x680 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [...] [ 63.192094] [ 63.192094] Freed by task 54: [ 63.192094] kasan_save_stack+0x33/0x60 [ 63.192094] kasan_save_track+0x14/0x30 [ 63.192094] kasan_save_free_info+0x3b/0x60 [ 63.192094] __kasan_slab_free+0x37/0x50 [ 63.192094] kfree+0xcf/0x360 [ 63.192094] devres_release_group+0x1f8/0x3c0 [ 63.192094] hid_device_probe+0x315/0x680 [ 63.192094] really_probe+0x1c3/0x690 [ 63.192094] __driver_probe_device+0x247/0x300 [ 63.192094] driver_probe_device+0x49/0x210 [ 63.192094] __device_attach_driver+0x160/0x320 [...] Fixes: 7d62ba8deacf ("HID: hid-appletb-kbd: add support for fn toggle between media and function mode") Cc: stable@vger.kernel.org Reviewed-by: Aditya Garg <gargaditya08@live.com> Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-07-03	wifi: iwlwifi: Fix error code in iwl_op_mode_dvm_start()	Dan Carpenter
	Preserve the error code if iwl_setup_deferred_work() fails. The current code returns ERR_PTR(0) (which is NULL) on this path. I believe the missing error code potentially leads to a use after free involving debugfs. Fixes: 90a0d9f33996 ("iwlwifi: Add missing check for alloc_ordered_workqueue") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/a7a1cd2c-ce01-461a-9afd-dbe535f8df01@sabinyo.mountain Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2025-07-02	net: ipv6: Fix spelling mistake	Chenguang Zhao
	change 'Maximium' to 'Maximum' Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250702055820.112190-1-zhaochenguang@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	Merge branch 'support-rate-management-on-traffic-classes-in-devlink-and-mlx5'	Jakub Kicinski
	Mark Bloch says: ==================== Support rate management on traffic classes in devlink and mlx5 This patch series extends the devlink-rate API to support traffic class (TC) bandwidth management, enabling more granular control over traffic shaping and rate limiting across multiple TCs. The API now allows users to specify bandwidth proportions for different traffic classes in a single command. This is particularly useful for managing Enhanced Transmission Selection (ETS) for groups of Virtual Functions (VFs), allowing precise bandwidth allocation across traffic classes. Additionally the series refines the QoS handling in net/mlx5 to support TC arbitration and bandwidth management on vports and rate nodes. Discussions on traffic class shaping in net-shapers began in V5 [1], where we discussed with maintainers whether net-shapers should support traffic classes and how this could be implemented. Later, after further conversations with Paolo Abeni and Simon Horman, Cosmin provided an update [2], confirming that net-shapers' tree-based hierarchy aligns well with traffic classes when treated as distinct subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX queues and traffic classes, this approach seems feasible, though some open questions remain regarding queue reconfiguration and certain mlx5 scheduling behaviors. Building on that discussion, Cosmin has now shared a concrete implementation plan on the netdev mailing list [3]. The plan, developed in collaboration with Paolo and Simon, outlines how net-shapers can be extended to support the same use cases currently covered by devlink-rate, with the eventual goal of aligning both and simplifying the shaping infrastructure in the kernel. This work was presented at Netdev 0x19 in Zagreb [4]. There we presented how TC scheduling is enforced in mlx5 hardware, which led to discussions on the mailing list. A summary of how things work: Classification means labeling a packet with a traffic class based on the packet's DSCP or VLAN PCP field, then treating packets with different traffic classes differently during transmit processing. In a virtualized setup, VFs are untrusted and do not control classification or shaping. Classification is done by the hardware using a prio-to-TC mapping set by the hypervisor. VFs only select which send queue to use and are expected to respect the classification logic by sending each traffic class on its dedicated queue. As stated in the net-shapers plan [3], each transmit queue should carry only a single traffic class. Mixing classes in a single queue can lead to HOL blocking. In the mlx5 implementation, if the queue used does not match the classified traffic class, the hardware moves the queue to the correct TC scheduler. This movement is not a reclassification; it’s a necessary enforcement step to ensure traffic class isolation is maintained. Extend devlink-rate API to support rate management on TCs: - devlink: Extend the devlink rate API to support traffic class bandwidth management Introduce a no-op implementation: - net/mlx5: Add no-op implementation for setting tc-bw on rate objects Add support for enabling and disabling TC QoS on vports and nodes: - net/mlx5: Add support for setting tc-bw on nodes - net/mlx5: Add traffic class scheduling support for vport QoS Support for setting tc-bw on rate objects: - net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw [1] https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/ [2] https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/ [3] https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.camel@nvidia.com/T/ [4] https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-with-ets-and-traffic-classes.html ==================== Link: https://patch.msgid.link/20250629142138.361537-1-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	selftests: drv-net: Add test for devlink-rate traffic class bandwidth ↵	Carolina Jubran
	distribution This test suite validates the functionality of the devlink-rate API for traffic class (TC) bandwidth allocation. It ensures that bandwidth can be distributed between different traffic classes as configured, and verifies that explicit TC-to-queue mapping is required for the allocation to be effective. The first test (test_no_tc_mapping_bandwidth) is marked as expected failure on mlx5, since the hardware automatically enforces traffic class separation by dynamically moving queues to the correct TC scheduler, even without explicit TC-to-queue mapping configuration. Test output on mlx5: 1..2 # Created VF interface: eth5 # Created VLAN eth5.101 on eth5 with tc 3 and IP 198.51.100.2 # Created VLAN eth5.102 on eth5 with tc 4 and IP 198.51.100.10 # Set representor eth4 up and added to bridge # Bandwidth check results without TC mapping: # TC 3: 0.19 Gbits/sec # TC 4: 0.76 Gbits/sec # Total bandwidth: 0.95 Gbits/sec # TC 3 percentage: 20.0% # TC 4 percentage: 80.0% ok 1 devlink_rate_tc_bw.test_no_tc_mapping_bandwidth # XFAIL Bandwidth matched 80/20 split without TC mapping # Created VF interface: eth5 # Created VLAN eth5.101 on eth5 with tc 3 and IP 198.51.100.2 # Created VLAN eth5.102 on eth5 with tc 4 and IP 198.51.100.10 # Set representor eth4 up and added to bridge # Bandwidth check results with TC mapping: # TC 3: 0.21 Gbits/sec # TC 4: 0.78 Gbits/sec # Total bandwidth: 0.98 Gbits/sec # TC 3 percentage: 21.1% # TC 4 percentage: 78.9% # Bandwidth is distributed as 80/20 with TC mapping ok 2 devlink_rate_tc_bw.test_tc_mapping_bandwidth # Totals: pass:1 fail:0 xfail:1 xpass:0 skip:0 error:0 Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw	Carolina Jubran
	Introduce support for managing Traffic Class (TC) arbiter nodes and associated vports TC nodes within the E-Switch QoS hierarchy. This patch adds support for the new scheduling node type, `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, and implements full support for setting tc-bw on both vports and nodes. Key changes include: - Introduced the new scheduling node type, `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, for managing vports within the TC arbiter node. - New helper functions for creating and destroying vports TC nodes under the TC arbiter. - Updated the minimum rate normalization function to skip nodes of type `SCHED_NODE_TYPE_VPORTS_TC_TSAR`. Vports TC TSARs have bandwidth shares configured on them but not minimum rates, so their `min_rate` cannot be normalized. - Implementation of `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()` for initializing and cleaning up TC arbiter scheduling elements. These functions now fully support tc-bw configuration on TC arbiter nodes. - Introduced a new helper `esw_qos_calculate_tc_bw_divider()` to compute the total TC bandwidth share, which is used as a divider for normalizing each TC's share. - Added `esw_qos_tc_arbiter_get_bw_shares()` and `esw_qos_set_tc_arbiter_bw_shares()` to handle the settings of bandwidth shares for vports traffic class TSARs. - `esw_qos_set_tc_arbiter_bw_shares()` normalizes each TC share based on the total and the firmware's maximum allowed TSAR bandwidth share. - Refactored `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` to fully support configuring tc-bw on devlink rate nodes and vports, respectively. - Refactored `mlx5_esw_qos_node_update_parent()` to ensure that tc-bw configuration remains compatible with setting a parent on a rate node, preserving level hierarchy functionality. - Refactored `esw_qos_calc_bw_share()` to generalize its input so it can be used for both minimum rate and bandwidth share calculations. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net/mlx5: Add traffic class scheduling support for vport QoS	Carolina Jubran
	Introduce support for traffic class (TC) scheduling on vports by allowing the vport to own multiple TC scheduling nodes. This patch enables more granular control of QoS by defining three distinct QoS states for vports, each providing unique scheduling behavior: 1. Regular QoS: The `sched_node` represents the vport directly, handling QoS as a single scheduling entity. 2. TC QoS on the vport: The `sched_node` acts as a TC arbiter, enabling TC scheduling directly on the vport. 3. TC QoS on the parent node: The `sched_node` functions as a rate limiter, with TC arbitration enabled at the parent level, associating multiple scheduling nodes with each vport. Key changes include: - Added support for new scheduling elements, vport traffic class and rate limiter. - New helper functions for creating, destroying, and restoring vport TC scheduling nodes, handling transitions between regular QoS and TC arbitration states. - Updated `esw_qos_vport_enable()` and `esw_qos_vport_disable()` to support both regular QoS and TC arbitration states, ensuring consistent transitions between scheduling modes. - Introduced a `sched_nodes` array under `vport->qos` to store multiple TC scheduling nodes per vport, enabling finer control over per-TC QoS. - Enhanced `esw_qos_vport_update_parent()` to handle transitions between the three QoS states based on the current and new parent node types. This patch lays the groundwork for future support for configuring tc-bw on vports. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on vports will return `-EOPNOTSUPP`. No functional changes are introduced at this stage. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net/mlx5: Add support for setting tc-bw on nodes	Carolina Jubran
	Introduce support for enabling and disabling Traffic Class (TC) arbitration for existing devlink rate nodes. This patch adds support for a new scheduling node type, `SCHED_NODE_TYPE_TC_ARBITER_TSAR`. Key changes include: - New helper functions for transitioning existing rate nodes to TC arbiter nodes and vice versa. These functions handle the allocation of TC arbiter nodes, copying of child nodes, and restoring vport QoS settings when TC arbitration is disabled. - Implementation of `mlx5_esw_devlink_rate_node_tc_bw_set()` to manage tc-bw configuration on nodes. - Introduced stubs for `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()`, which will be extended in future patches to provide full support for tc-bw on devlink rate objects. - Validation functions for tc-bw settings, allowing graceful handling of unsupported traffic class bandwidth configurations. - Updated `__esw_qos_alloc_node()` to insert the new node into the parent’s children list only if the parent is not NULL. For the root TSAR, the new node is inserted directly after the allocation call. - Don't allow `tc-bw` configuration for nodes containing non-leaf children. This patch lays the groundwork for future support for configuring tc-bw on devlink rate nodes. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on nodes will return `-EOPNOTSUPP`. No functional changes are introduced at this stage. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net/mlx5: Add no-op implementation for setting tc-bw on rate objects	Carolina Jubran
	Introduce `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` with no-op logic. Future patches will add support for setting traffic class bandwidth on rate objects. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	selftest: netdevsim: Add devlink rate tc-bw test	Carolina Jubran
	Test verifies that netdevsim correctly implements devlink ops callbacks that set tc-bw on leaf or node rate object. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	devlink: Extend devlink rate API with traffic classes bandwidth management	Carolina Jubran
	Introduce support for specifying relative bandwidth shares between traffic classes (TC) in the devlink-rate API. This new option allows users to allocate bandwidth across multiple traffic classes in a single command. This feature provides a more granular control over traffic management, especially for scenarios requiring Enhanced Transmission Selection. Users can now define a relative bandwidth share for each traffic class. For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5 (RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the total bandwidth. The actual percentage each class receives depends on the ratio of its share value to the sum of all shares. Example: DEV=pci/0000:08:00.0 $ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \ tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0 $ devlink port function rate set $DEV/vfs_group \ tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0 Example usage with ynl: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"rate-tc-index": 0, "rate-tc-bw": 50}, {"rate-tc-index": 1, "rate-tc-bw": 50}, {"rate-tc-index": 2, "rate-tc-bw": 0}, {"rate-tc-index": 3, "rate-tc-bw": 0}, {"rate-tc-index": 4, "rate-tc-bw": 0}, {"rate-tc-index": 5, "rate-tc-bw": 0}, {"rate-tc-index": 6, "rate-tc-bw": 0}, {"rate-tc-index": 7, "rate-tc-bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0}, {'rate-tc-bw': 50, 'rate-tc-index': 1}, {'rate-tc-bw': 0, 'rate-tc-index': 2}, {'rate-tc-bw': 0, 'rate-tc-index': 3}, {'rate-tc-bw': 0, 'rate-tc-index': 4}, {'rate-tc-bw': 0, 'rate-tc-index': 5}, {'rate-tc-bw': 0, 'rate-tc-index': 6}, {'rate-tc-bw': 0, 'rate-tc-index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	netlink: introduce type-checking attribute iteration for nlmsg	Carolina Jubran
	Add the nlmsg_for_each_attr_type() macro to simplify iteration over attributes of a specific type in a Netlink message. Convert existing users in vxlan and nfsd to use the new macro. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	vhost-net: reduce one userspace copy when building XDP buff	Jason Wang
	We used to do twice copy_from_iter() to copy virtio-net and packet separately. This introduce overheads for userspace access hardening as well as SMAP (for x86 it's stac/clac). So this patch tries to use one copy_from_iter() to copy them once and move the virtio-net header afterwards to reduce overheads. Testpmd + vhost_net shows 10% improvement from 5.45Mpps to 6.0Mpps. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-2-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	tun: remove unnecessary tun_xdp_hdr structure	Jason Wang
	With f95f0f95cfb7("net, xdp: Introduce xdp_init_buff utility routine"), buffer length could be stored as frame size so there's no need to have a dedicated tun_xdp_hdr structure. We can simply store virtio net header instead. Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-1-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	drm/v3d: Disable interrupts before resetting the GPU	Maíra Canal
	Currently, an interrupt can be triggered during a GPU reset, which can lead to GPU hangs and NULL pointer dereference in an interrupt context as shown in the following trace: [ 314.035040] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000c0 [ 314.043822] Mem abort info: [ 314.046606] ESR = 0x0000000096000005 [ 314.050347] EC = 0x25: DABT (current EL), IL = 32 bits [ 314.055651] SET = 0, FnV = 0 [ 314.058695] EA = 0, S1PTW = 0 [ 314.061826] FSC = 0x05: level 1 translation fault [ 314.066694] Data abort info: [ 314.069564] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 [ 314.075039] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 314.080080] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 314.085382] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000102728000 [ 314.091814] [00000000000000c0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 [ 314.100511] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP [ 314.106770] Modules linked in: v3d i2c_brcmstb vc4 snd_soc_hdmi_codec gpu_sched drm_shmem_helper drm_display_helper cec drm_dma_helper drm_kms_helper drm drm_panel_orientation_quirks snd_soc_core snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd backlight [ 314.129654] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.25+rpt-rpi-v8 #1 Debian 1:6.12.25-1+rpt1 [ 314.139388] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT) [ 314.145211] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 314.152165] pc : v3d_irq+0xec/0x2e0 [v3d] [ 314.156187] lr : v3d_irq+0xe0/0x2e0 [v3d] [ 314.160198] sp : ffffffc080003ea0 [ 314.163502] x29: ffffffc080003ea0 x28: ffffffec1f184980 x27: 021202b000000000 [ 314.170633] x26: ffffffec1f17f630 x25: ffffff8101372000 x24: ffffffec1f17d9f0 [ 314.177764] x23: 000000000000002a x22: 000000000000002a x21: ffffff8103252000 [ 314.184895] x20: 0000000000000001 x19: 00000000deadbeef x18: 0000000000000000 [ 314.192026] x17: ffffff94e51d2000 x16: ffffffec1dac3cb0 x15: c306000000000000 [ 314.199156] x14: 0000000000000000 x13: b2fc982e03cc5168 x12: 0000000000000001 [ 314.206286] x11: ffffff8103f8bcc0 x10: ffffffec1f196868 x9 : ffffffec1dac3874 [ 314.213416] x8 : 0000000000000000 x7 : 0000000000042a3a x6 : ffffff810017a180 [ 314.220547] x5 : ffffffec1ebad400 x4 : ffffffec1ebad320 x3 : 00000000000bebeb [ 314.227677] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 314.234807] Call trace: [ 314.237243] v3d_irq+0xec/0x2e0 [v3d] [ 314.240906] __handle_irq_event_percpu+0x58/0x218 [ 314.245609] handle_irq_event+0x54/0xb8 [ 314.249439] handle_fasteoi_irq+0xac/0x240 [ 314.253527] handle_irq_desc+0x48/0x68 [ 314.257269] generic_handle_domain_irq+0x24/0x38 [ 314.261879] gic_handle_irq+0x48/0xd8 [ 314.265533] call_on_irq_stack+0x24/0x58 [ 314.269448] do_interrupt_handler+0x88/0x98 [ 314.273624] el1_interrupt+0x34/0x68 [ 314.277193] el1h_64_irq_handler+0x18/0x28 [ 314.281281] el1h_64_irq+0x64/0x68 [ 314.284673] default_idle_call+0x3c/0x168 [ 314.288675] do_idle+0x1fc/0x230 [ 314.291895] cpu_startup_entry+0x3c/0x50 [ 314.295810] rest_init+0xe4/0xf0 [ 314.299030] start_kernel+0x5e8/0x790 [ 314.302684] __primary_switched+0x80/0x90 [ 314.306691] Code: 940029eb 360ffc13 f9442ea0 52800001 (f9406017) [ 314.312775] ---[ end trace 0000000000000000 ]--- [ 314.317384] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 314.324249] SMP: stopping secondary CPUs [ 314.328167] Kernel Offset: 0x2b9da00000 from 0xffffffc080000000 [ 314.334076] PHYS_OFFSET: 0x0 [ 314.336946] CPU features: 0x08,00002013,c0200000,0200421b [ 314.342337] Memory Limit: none [ 314.345382] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]--- Before resetting the GPU, it's necessary to disable all interrupts and deal with any interrupt handler still in-flight. Otherwise, the GPU might reset with jobs still running, or yet, an interrupt could be handled during the reset. Cc: stable@vger.kernel.org Fixes: 57692c94dcbe ("drm/v3d: Introduce a new DRM driver for Broadcom V3D V3.x+") Reviewed-by: Juan A. Suarez <jasuarez@igalia.com> Reviewed-by: Iago Toral Quiroga <itoral@igalia.com> Link: https://lore.kernel.org/r/20250628224243.47599-1-mcanal@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com>
2025-07-02	Merge branch 'preserve-msg_zerocopy-with-forwarding'	Jakub Kicinski
	Willem de Bruijn says: ==================== preserve MSG_ZEROCOPY with forwarding Avoid false positive copying of zerocopy skb frags when entering the ingress path if the skb is not queued locally but forwarded. Patch 1 for more details and feature. Patch 2 converts the existing selftest to a pass/fail test and adds coverage for this new feature. ==================== Link: https://patch.msgid.link/20250630194312.1571410-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	selftest: net: extend msg_zerocopy test with forwarding	Willem de Bruijn
	Zerocopy skbs are converted to regular copy skbs when data is queued to a local socket. This happens in the existing test with a sender and receiver communicating over a veth device. Zerocopy skbs are sent without copying if egressing a device. Verify that this behavior is maintained even in the common container setup where data is forwarded over a veth to the physical device. Update msg_zerocopy.sh to 1. Have a dummy network device to simulate a physical device. 2. Have forwarding enabled between veth and dummy. 3. Add a tx-only test that sends out dummy via the forwarding path. 4. Verify the exitcode of the sender, which signals zerocopy success. As dummy drops all packets, this cannot be a TCP connection. Test the new case with unconnected UDP only. Update msg_zerocopy.c to - Accept an argument whether send with zerocopy is expected. - Return an exitcode whether behavior matched that expectation. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-3-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: preserve MSG_ZEROCOPY with forwarding	Willem de Bruijn
	MSG_ZEROCOPY data must be copied before data is queued to local sockets, to avoid indefinite timeout until memory release. This test is performed by skb_orphan_frags_rx, which is called when looping an egress skb to packet sockets, error queue or ingress path. To preserve zerocopy for skbs that are looped to ingress but are then forwarded to an egress device rather than delivered locally, defer this last check until an skb enters the local IP receive path. This is analogous to existing behavior of skb_clear_delivery_time. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	Merge branch 'vsock-test-check-for-null-ptr-deref-when-transport-changes'	Jakub Kicinski
	Luigi Leonardi says: ==================== vsock/test: check for null-ptr-deref when transport changes This series introduces a new test that checks for a null pointer dereference that may happen when there is a transport change[1]. This bug was fixed in [2]. Note that this test cannot fail, it hangs if it triggers a kernel oops. The intended use-case is to run it and then check if there is any oops in the dmesg. This test is based on Hyunwoo Kim's[3] and Michal's python reproducers[4]. [1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/ [2]https://lore.kernel.org/netdev/20250110083511.30419-1-sgarzare@redhat.com/ [3]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/#t [4]https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/ v4: https://lore.kernel.org/20250624-test_vsock-v4-1-087c9c8e25a2@redhat.com v3: https://lore.kernel.org/20250611-test_vsock-v3-1-8414a2d4df62@redhat.com v2: https://lore.kernel.org/20250314-test_vsock-v2-1-3c0a1d878a6d@redhat.com v1: https://lore.kernel.org/20250306-test_vsock-v1-0-0320b5accf92@redhat.com ==================== Link: https://patch.msgid.link/20250630-test_vsock-v5-0-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	vsock/test: Add test for null ptr deref when transport changes	Luigi Leonardi
	Add a new test to ensure that when the transport changes a null pointer dereference does not occur. The bug was reported upstream [1] and fixed with commit 2cb7c756f605 ("vsock/virtio: discard packets if the transport changes"). KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067] CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted Workqueue: vsock-loopback vsock_loopback_work RIP: 0010:vsock_stream_has_data+0x44/0x70 Call Trace: virtio_transport_do_close+0x68/0x1a0 virtio_transport_recv_pkt+0x1045/0x2ae4 vsock_loopback_work+0x27d/0x3f0 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x35a/0x700 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 Note that this test may not fail in a kernel without the fix, but it may hang on the client side if it triggers a kernel oops. This works by creating a socket, trying to connect to a server, and then executing a second connect operation on the same socket but to a different CID (0). This triggers a transport change. If the connect operation is interrupted by a signal, this could cause a null-ptr-deref. Since this bug is non-deterministic, we need to try several times. It is reasonable to assume that the bug will show up within the timeout period. If there is a G2H transport loaded in the system, the bug is not triggered and this test will always pass. This is because `vsock_assign_transport`, when using CID 0, like in this case, sets vsk->transport to `transport_g2h` that is not NULL if a G2H transport is available. [1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/ Suggested-by: Hyunwoo Kim <v4bel@theori.io> Suggested-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250630-test_vsock-v5-2-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	vsock/test: Add macros to identify transports	Luigi Leonardi
	Add three new macros: TRANSPORTS_G2H, TRANSPORTS_H2G and TRANSPORTS_LOCAL. They can be used to identify the type of the transport(s) loaded when using the `get_transports()` function. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250630-test_vsock-v5-1-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	dt-bindings: net: Convert socfpga-dwmac bindings to yaml	Matthew Gerlach
	Convert the bindings for socfpga-dwmac to yaml. Since the original text contained descriptions for two separate nodes, two separate yaml files were created. Signed-off-by: Mun Yew Tham <mun.yew.tham@altera.com> Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20250630213748.71919-1-matthew.gerlach@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	Merge branch '200GbE' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-07-01 (idpf, igc) For idpf: Michal returns 0 for key size when RSS is not supported. Ahmed changes control queue to a spinlock due to sleeping calls. For igc: Vitaly disables L1.2 PCI-E link substate on I226 devices to resolve performance issues. * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: igc: disable L1.2 PCI-E link substate to avoid performance issue idpf: convert control queue mutex to a spinlock idpf: return 0 size for RSS key if not supported ==================== Link: https://patch.msgid.link/20250701164317.2983952-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	amd-xgbe: add support for giant packet size	Raju Rangoju
	AMD XGBE hardware supports giant Ethernet frames up to 16K bytes. Add support for configuring and enabling giant packet handling in the driver. - Define new register fields and macros for giant packet support. - Update the jumbo frame configuration logic to enable giant packet mode when MTU exceeds the jumbo threshold. Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701121929.319690-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: ipv4: fix stat increase when udp early demux drops the packet	Antoine Tenart
	udp_v4_early_demux now returns drop reasons as it either returns 0 or ip_mc_validate_source, which returns itself a drop reason. However its use was not converted in ip_rcv_finish_core and the drop reason is ignored, leading to potentially skipping increasing LINUX_MIB_IPRPFILTER if the drop reason is SKB_DROP_REASON_IP_RPFILTER. This is a fix and we're not converting udp_v4_early_demux to explicitly return a drop reason to ease backports; this can be done as a follow-up. Fixes: d46f827016d8 ("net: ip: make ip_mc_validate_source() return drop reason") Cc: Menglong Dong <menglong8.dong@gmail.com> Reported-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250701074935.144134-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: ifb: support BIG TCP packets	Eric Dumazet
	Set the driver limit to GSO_MAX_SIZE (512 KB). This allows the admin/user to set a GSO limit up to this value, to avoid segmenting too large GRO packets in the netem -> ifb path. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250701084540.459261-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: libwx: fix the incorrect display of the queue number	Jiawen Wu
	When setting "ethtool -L eth0 combined 1", the number of RX/TX queue is changed to be 1. RSS is disabled at this moment, and the indices of FDIR have not be changed in wx_set_rss_queues(). So the combined count still shows the previous value. This issue was introduced when supporting FDIR. Fix it for those devices that support FDIR. Fixes: 34744a7749b3 ("net: txgbe: add FDIR info to ethtool ops") Cc: stable@vger.kernel.org Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/A5C8FE56D6C04608+20250701070625.73680-1-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	amd-xgbe: do not double read link status	Raju Rangoju
	The link status is latched low so that momentary link drops can be detected. Always double-reading the status defeats this design feature. Only double read if link was already down This prevents unnecessary duplicate readings of the link status. Fixes: 4f3b20bfbb75 ("amd-xgbe: add support for rx-adaptation") Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701065016.4140707-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net/sched: Always pass notifications when child class becomes empty	Lion Ackermann
	Certain classful qdiscs may invoke their classes' dequeue handler on an enqueue operation. This may unexpectedly empty the child qdisc and thus make an in-flight class passive via qlen_notify(). Most qdiscs do not expect such behaviour at this point in time and may re-activate the class eventually anyways which will lead to a use-after-free. The referenced fix commit attempted to fix this behavior for the HFSC case by moving the backlog accounting around, though this turned out to be incomplete since the parent's parent may run into the issue too. The following reproducer demonstrates this use-after-free: tc qdisc add dev lo root handle 1: drr tc filter add dev lo parent 1: basic classid 1:1 tc class add dev lo parent 1: classid 1:1 drr tc qdisc add dev lo parent 1:1 handle 2: hfsc def 1 tc class add dev lo parent 2: classid 2:1 hfsc rt m1 8 d 1 m2 0 tc qdisc add dev lo parent 2:1 handle 3: netem tc qdisc add dev lo parent 3:1 handle 4: blackhole echo 1 \| socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888 tc class delete dev lo classid 1:1 echo 1 \| socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888 Since backlog accounting issues leading to a use-after-frees on stale class pointers is a recurring pattern at this point, this patch takes a different approach. Instead of trying to fix the accounting, the patch ensures that qdisc_tree_reduce_backlog always calls qlen_notify when the child qdisc is empty. This solves the problem because deletion of qdiscs always involves a call to qdisc_reset() and / or qdisc_purge_queue() which ultimately resets its qlen to 0 thus causing the following qdisc_tree_reduce_backlog() to report to the parent. Note that this may call qlen_notify on passive classes multiple times. This is not a problem after the recent patch series that made all the classful qdiscs qlen_notify() handlers idempotent. Fixes: 3f981138109f ("sch_hfsc: Fix qlen accounting bug when using peek in hfsc_enqueue()") Signed-off-by: Lion Ackermann <nnamrec@gmail.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/d912cbd7-193b-4269-9857-525bee8bbb6a@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	Merge branch 'net-add-data-race-annotations-around-dst-fields'	Jakub Kicinski
	Eric Dumazet says: ==================== net: add data-race annotations around dst fields Add annotations around various dst fields, which can change under us. Add four helpers to prepare better dst->dev protection, and start using them. More to come later. v1: https://lore.kernel.org/20250627112526.3615031-1-edumazet@google.com ==================== Link: https://patch.msgid.link/20250630121934.3399505-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	ipv6: ip6_mc_input() and ip6_mr_input() cleanups	Eric Dumazet
	Both functions are always called under RCU. We remove the extra rcu_read_lock()/rcu_read_unlock(). We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net(). We use dev_net_rcu() instead of dev_net(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	ipv6: adopt skb_dst_dev() and skb_dst_dev_net[_rcu]() helpers	Eric Dumazet
	Use the new helpers as a step to deal with potential dst->dev races. v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	ipv6: adopt dst_dev() helper	Eric Dumazet
	Use the new helper as a step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	ipv4: adopt dst_dev, skb_dst_dev and skb_dst_dev_net[_rcu]	Eric Dumazet
	Use the new helpers as a first step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: dst: add four helpers to annotate data-races around dst->dev	Eric Dumazet
	dst->dev is read locklessly in many contexts, and written in dst_dev_put(). Fixing all the races is going to need many changes. We probably will have to add full RCU protection. Add three helpers to ease this painful process. static inline struct net_device dst_dev(const struct dst_entry dst) { return READ_ONCE(dst->dev); } static inline struct net_device skb_dst_dev(const struct sk_buff skb) { return dst_dev(skb_dst(skb)); } static inline struct net skb_dst_dev_net(const struct sk_buff skb) { return dev_net(skb_dst_dev(skb)); } static inline struct net skb_dst_dev_net_rcu(const struct sk_buff skb) { return dev_net_rcu(skb_dst_dev(skb)); } Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: dst: annotate data-races around dst->output	Eric Dumazet
	dst_dev_put() can overwrite dst->output while other cpus might read this field (for instance from dst_output()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need RCU protection in the future. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: dst: annotate data-races around dst->input	Eric Dumazet
	dst_dev_put() can overwrite dst->input while other cpus might read this field (for instance from dst_input()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need full RCU protection later. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: dst: annotate data-races around dst->lastuse	Eric Dumazet
	(dst_entry)->lastuse is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: dst: annotate data-races around dst->expires	Eric Dumazet
	(dst_entry)->expires is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	net: dst: annotate data-races around dst->obsolete	Eric Dumazet
	(dst_entry)->obsolete is read locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02	Merge branch 'net-introduce-net_aligned_data'	Jakub Kicinski
	Eric Dumazet says: ==================== net: introduce net_aligned_data ____cacheline_aligned_in_smp on small fields like tcp_memory_allocated and udp_memory_allocated is not good enough. It makes sure to put these fields at the start of a cache line, but does not prevent the linker from using the cache line for other fields, with potential performance impact. nm -v vmlinux\|egrep -5 "tcp_memory_allocated\|udp_memory_allocated" ... ffffffff83e35cc0 B tcp_memory_allocated ffffffff83e35cc8 b __key.0 ffffffff83e35cc8 b __tcp_tx_delay_enabled.1 ffffffff83e35ce0 b tcp_orphan_timer ... ffffffff849dddc0 B udp_memory_allocated ffffffff849dddc8 B udp_encap_needed_key ffffffff849dddd8 B udpv6_encap_needed_key ffffffff849dddf0 b inetsw_lock One solution is to move these sensitive fields to a structure, so that the compiler is forced to add empty holes between each member. nm -v vmlinux\|egrep -2 "tcp_memory_allocated\|udp_memory_allocated\|net_aligned_data" ffffffff885af970 b mem_id_init ffffffff885af980 b __key.0 ffffffff885af9c0 B net_aligned_data ffffffff885afa80 B page_pool_mem_providers ffffffff885afa90 b __key.2 v1: https://lore.kernel.org/netdev/20250627200551.348096-1-edumazet@google.com/ ==================== Link: https://patch.msgid.link/20250630093540.3052835-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>