summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-02-06net: Add rx_skb of kfree_skb to raw_tp_null_args[].Kuniyuki Iwashima
Yan Zhai reported a BPF prog could trigger a null-ptr-deref [0] in trace_kfree_skb if the prog does not check if rx_sk is NULL. Commit c53795d48ee8 ("net: add rx_sk to trace_kfree_skb") added rx_sk to trace_kfree_skb, but rx_sk is optional and could be NULL. Let's add kfree_skb to raw_tp_null_args[] to let the BPF verifier validate such a prog and prevent the issue. Now we fail to load such a prog: libbpf: prog 'drop': -- BEGIN PROG LOAD LOG -- 0: R1=ctx() R10=fp0 ; int BPF_PROG(drop, struct sk_buff *skb, void *location, @ kfree_skb_sk_null.bpf.c:21 0: (79) r3 = *(u64 *)(r1 +24) func 'kfree_skb' arg3 has btf_id 5253 type STRUCT 'sock' 1: R1=ctx() R3_w=trusted_ptr_or_null_sock(id=1) ; bpf_printk("sk: %d, %d\n", sk, sk->__sk_common.skc_family); @ kfree_skb_sk_null.bpf.c:24 1: (69) r4 = *(u16 *)(r3 +16) R3 invalid mem access 'trusted_ptr_or_null_' processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 -- END PROG LOAD LOG -- Note this fix requires commit 838a10bd2ebf ("bpf: Augment raw_tp arguments with PTR_MAYBE_NULL"). [0]: BUG: kernel NULL pointer dereference, address: 0000000000000010 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 P4D 0 PREEMPT SMP RIP: 0010:bpf_prog_5e21a6db8fcff1aa_drop+0x10/0x2d Call Trace: <TASK> ? __die+0x1f/0x60 ? page_fault_oops+0x148/0x420 ? search_bpf_extables+0x5b/0x70 ? fixup_exception+0x27/0x2c0 ? exc_page_fault+0x75/0x170 ? asm_exc_page_fault+0x22/0x30 ? bpf_prog_5e21a6db8fcff1aa_drop+0x10/0x2d bpf_trace_run4+0x68/0xd0 ? unix_stream_connect+0x1f4/0x6f0 sk_skb_reason_drop+0x90/0x120 unix_stream_connect+0x1f4/0x6f0 __sys_connect+0x7f/0xb0 __x64_sys_connect+0x14/0x20 do_syscall_64+0x47/0xc30 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: c53795d48ee8 ("net: add rx_sk to trace_kfree_skb") Reported-by: Yan Zhai <yan@cloudflare.com> Closes: https://lore.kernel.org/netdev/Z50zebTRzI962e6X@debian.debian/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Yan Zhai <yan@cloudflare.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250201030142.62703-1-kuniyu@amazon.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-06eth: fbnic: set IFF_UNICAST_FLT to avoid enabling promiscuous mode when ↵Alexander Duyck
adding unicast addrs I realized when we were adding unicast addresses we were enabling promiscuous mode. I did a bit of digging and realized we had overlooked setting the driver private flag to indicate we supported unicast filtering. Example below shows the table with 00deadbeef01 as the main NIC address, and 5 additional addresses in the 00deadbeefX0 format. # cat $dbgfs/mac_addr Idx S TCAM Bitmap Addr/Mask ---------------------------------- 00 0 00000000,00000000 000000000000 000000000000 01 0 00000000,00000000 000000000000 000000000000 02 0 00000000,00000000 000000000000 000000000000 ... 24 0 00000000,00000000 000000000000 000000000000 25 1 00100000,00000000 00deadbeef50 000000000000 26 1 00100000,00000000 00deadbeef40 000000000000 27 1 00100000,00000000 00deadbeef30 000000000000 28 1 00100000,00000000 00deadbeef20 000000000000 29 1 00100000,00000000 00deadbeef10 000000000000 30 1 00100000,00000000 00deadbeef01 000000000000 31 0 00000000,00000000 000000000000 000000000000 Before rule 31 would be active. With this change it correctly sticks to just the unicast filters. Signed-off-by: Alexander Duyck <alexanderduyck@meta.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204010038.1404268-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06eth: fbnic: add MAC address TCAM to debugfsAlexander Duyck
Add read only access to the 32-entry MAC address TCAM via debugfs. BMC filtering shares the same table so this is quite useful to access during debug. See next commit for an example output. Signed-off-by: Alexander Duyck <alexanderduyck@meta.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204010038.1404268-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06kbuild: fix misspelling in scripts/Makefile.libOleh Zadorozhnyi
Signed-off-by: Oleh Zadorozhnyi <lesorubshayan@gmail.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-02-06tools: ynl-gen: support limits using definitionsJakub Kicinski
Support using defines / constants in integer checks. Carolina will need this for rate API extensions. Reported-by: Carolina Jubran <cjubran@nvidia.com> Link: https://lore.kernel.org/1e886aaf-e1eb-4f1a-b7ef-f63b350a3320@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250203215510.1288728-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06tools: ynl-gen: don't output external constantsJakub Kicinski
A definition with a "header" property is an "external" definition for C code, as in it is defined already in another C header file. Other languages will need the exact value but C codegen should not recreate it. So don't output those definitions in the uAPI header. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250203215510.1288728-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06MAINTAINERS: add a sample ethtool section entryJakub Kicinski
I feel like we don't do a good enough keeping authors of driver APIs around. The ethtool code base was very nicely compartmentalized by Michal. Establish a precedent of creating MAINTAINERS entries for "sections" of the ethtool API. Use Andrew and cable test as a sample entry. The entry should ideally cover 3 elements: a core file, test(s), and keywords. The last one is important because we intend the entries to cover core code *and* reviews of drivers implementing given API! Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204215750.169249-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06MAINTAINERS: add entry for ethtoolJakub Kicinski
Michal did an amazing job converting ethtool to Netlink, but never added an entry to MAINTAINERS for himself. Create a formal entry so that we can delegate (portions) of this code to folks. Over the last 3 years majority of the reviews have been done by Andrew and I. I suppose Michal didn't want to be on the receiving end of the flood of patches. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250204215729.168992-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06Merge branch 'support-one-ptp-device-per-hardware-clock'Paolo Abeni
Tariq Toukan says: ==================== Support one PTP device per hardware clock This series contains two features from Jianbo, followed by simple cleanups. Patches 1-9 by Jianbo add support for one PTP device per hardware clock, described below [1]. Patches 10-12 by Jianbo add support for 200Gbps per-lane link modes in kernel and mlx5 driver. Patches 13-15 are simple cleanups by Gal and Carolina. [1] PHC (PTP hardware clock) is normally shared by multiple functions (PF/VF/SF). mlx5 driver currently creates a separate PTP device for each network interface that shares one PHC. PHC can be configured to work as free running mode or real time mode. In this series, only one PTP device is created for the shared PHC when it is running in real time mode. To support this feature, * Firmware needs to support clock identity. When functions share a PHC, the clock identities they query are same. * Driver dynamically allocates mlx5_clock to represent a PHC. * New devcom component is added for hardware clock. Functions are grouped by the identity, and one mlx5_clock is allocated and shared by the functions with the same identity. * When PTP device accesses PHC by its callbacks, the first function in the clock devcom list is selected to send commands to firmware. * PPS IN event is armed on one function. It should be re-armed on the other one when current is unloaded. ==================== Link: https://patch.msgid.link/20250203213516.227902-1-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5e: Avoid WARN_ON when configuring MQPRIO with HTB offload enabledCarolina Jubran
When attempting to enable MQPRIO while HTB offload is already configured, the driver currently returns `-EINVAL` and triggers a `WARN_ON`, leading to an unnecessary call trace. Update the code to handle this case more gracefully by returning `-EOPNOTSUPP` instead, while also providing a helpful user message. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5e: Remove unused mlx5e_tc_flow_action structGal Pressman
Commit 67efaf45930d ("net/mlx5e: TC, Remove CT action reordering") removed the usage of mlx5e_tc_flow_action struct, remove the struct as well. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Remove stray semicolon in LAG port selection table creationGal Pressman
Remove the stray semicolon in the mlx5_ldev_for_each_reverse() loop. Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5e: Support FEC settings for 200G per lane link modesJianbo Liu
Add support to show and config FEC by ethtool for 200G/lane link modes. The RS encoding setting is mapped, and can be overridden to FEC_RS_544_514_INTERLEAVED_QUAD for these modes. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Add support for 200Gbps per lane link modesJianbo Liu
This patch exposes new link modes using 200Gbps per lane, including 200G, 400G and 800G modes. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06ethtool: Add support for 200Gbps per lane link modesJianbo Liu
Define 200G, 400G and 800G link modes using 200Gbps per lane. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Generate PPS IN event on new function for shared clockJianbo Liu
As a specific function (mdev) is chosen to send MTPPSE command to firmware, the event is generated only on that function. When that function is unloaded, the PPS event can't be forward to PTP device, even when there are other functions in the group, and PTP device is not destroyed. To resolve this problem, need to send MTPPSE again from new function, and dis-arm the event on old function after that. PPS events are handled by EQ notifier. The async EQs and notifiers are destroyed in mlx5_eq_table_destroy() which is called before mlx5_cleanup_clock(). During the period between mlx5_eq_table_destroy() and mlx5_cleanup_clock(), the events can't be handled. To avoid event loss, add mlx5_clock_unload() in mlx5_unload() to arm the event on other available function, and mlx5_clock_load in mlx5_load() for symmetry. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Support one PTP device per hardware clockJianbo Liu
Currently, mlx5 driver exposes a PTP device for each network interface, resulting in multiple device nodes representing the same underlying PHC (PTP hardware clock). This causes problem if it is trying to synchronize to itself. For instance, when ptp4l operates on multiple interfaces following different masters, phc2sys attempts to synchronize them in automatic mode. PHC can be configured to work as free running mode or real time mode. All functions can access it directly. In this patch, we create one PTP device for each PHC when it's running in real time mode. All the functions share the same PTP device if the clock identifies they query are same, and they are already grouped by devcom in previous commit. The first mdev in the peer list is chosen when sending MTPPS/MTUTC/MTPPSE/MRTCQ to firmware. Since the function can be unloaded at any time, we need to use a mutex lock to protect the mdev pointer used in PTP and PPS callbacks. Besides, new one should be picked from the peer list when the current is not available. The clock info, which is used by IB, is shared by all the interfaces using the same hardware clock. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Move PPS notifier and out_work to clock_stateJianbo Liu
The PPS notifier is currently in mlx5_clock, and mlx5_clock can be shared in later patch, so the notifier should be registered for each device to avoid any event miss. Besides, the out_work is scheduled by PPS out event which is triggered only when the device is in free running mode. So, both are moved to mlx5_core_dev's clock_state. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Add devcom component for the clock shared by functionsJianbo Liu
Add new devcom component for hardware clock. When it is running in real time mode, the functions are grouped by the identify they query. According to firmware document, the clock identify size is 64 bits, so it's safe to memcpy to component key, as the key size is also 64 bits. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Change clock in mlx5_core_dev to mlx5_clock pointerJianbo Liu
Change clock member in mlx5_core_dev to a pointer, so it can point to a clock shared by multiple functions in later patch. For now, each function has its own clock, so mdev in mlx5_clock_priv is the back pointer to the function. Later it points to one (normally the first one) of the multiple functions sharing the same clock. Change mlx5_init_clock() to return error if mlx5_clock is not allocated. Besides, a null clock is defined and used when hardware clock is not supported. So, the clock pointer is always pointing to something valid. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Add API to get mlx5_core_dev from mlx5_clockJianbo Liu
The mdev is calculated directly from mlx5_clock, as it's one of the fields in mlx5_core_dev. Move to a function so it can be easily changed in next patch. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Add init and destruction functions for a single HW clockJianbo Liu
Move hardware clock initialization and destruction to the functions, which will be used for dynamically allocated clock. Such clock is shared by all the devices if the queried clock identities are same. The out_work is for PPS out event, which can't be triggered when clock is shared, so INIT_WORK is not moved to the initialization function. Besides, we still need to register notifier for each device. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Change parameters for PTP internal functionsJianbo Liu
In later patch, the mlx5_clock will be allocated dynamically, its address can be obtained from mlx5_core_dev struct, but mdev can't be obtained from mlx5_clock because it can be shared by multiple interfaces. So change the parameter for such internal functions, only mdev is passed down from the callers. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Add helper functions for PTP callbacksJianbo Liu
The PTP callback functions should not be used directly by internal callers. Add helpers that can be used internally and externally. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06pinctrl: pinconf-generic: Print unsigned value if a format is registeredClaudiu Beznea
Commit 3ba11e684d16 ("pinctrl: pinconf-generic: print hex value") unconditionally switched to printing hex values in pinconf_generic_dump_one(). However, if a dump format is registered for the dumped pin, the hex value is printed as well. This hex value does not necessarily correspond 1:1 with the hardware register value (as noted by commit 3ba11e684d16 ("pinctrl: pinconf-generic: print hex value")). As a result, user-facing output may include information like: output drive strength (0x100 uA). To address this, check if a dump format is registered for the dumped property, and print the unsigned value instead when applicable. Fixes: 3ba11e684d16 ("pinctrl: pinconf-generic: print hex value") Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Link: https://lore.kernel.org/20250205101058.2034860-1-claudiu.beznea.uj@bp.renesas.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2025-02-06RDMA/mana_ib: Allocate PAGE aligned doorbell indexKonstantin Taranov
Allocate a PAGE aligned doorbell index to ensure each process gets a separate PAGE sized doorbell area space remapped to it in mana_ib_mmap Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter") Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com> Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/1738751405-15041-1-git-send-email-kotaranov@linux.microsoft.com Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-06Merge remote-tracking branch 'drm-misc/drm-misc-next-fixes' into drm-misc-fixesMaxime Ripard
Merge the few remaining patches stuck into drm-misc-next-fixes. Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-02-06RDMA/mlx5: Fix a WARN during dereg_mr for DM typeYishai Hadas
Memory regions (MR) of type DM (device memory) do not have an associated umem. In the __mlx5_ib_dereg_mr() -> mlx5_free_priv_descs() flow, the code incorrectly takes the wrong branch, attempting to call dma_unmap_single() on a DMA address that is not mapped. This results in a WARN [1], as shown below. The issue is resolved by properly accounting for the DM type and ensuring the correct branch is selected in mlx5_free_priv_descs(). [1] WARNING: CPU: 12 PID: 1346 at drivers/iommu/dma-iommu.c:1230 iommu_dma_unmap_page+0x79/0x90 Modules linked in: ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry ovelay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core CPU: 12 UID: 0 PID: 1346 Comm: ibv_rc_pingpong Not tainted 6.12.0-rc7+ #1631 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:iommu_dma_unmap_page+0x79/0x90 Code: 2b 49 3b 29 72 26 49 3b 69 08 73 20 4d 89 f0 44 89 e9 4c 89 e2 48 89 ee 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 07 b8 88 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 66 0f 1f 44 00 RSP: 0018:ffffc90001913a10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88810194b0a8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88810194b0a8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f537abdd740(0000) GS:ffff88885fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f537aeb8000 CR3: 000000010c248001 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn+0x84/0x190 ? iommu_dma_unmap_page+0x79/0x90 ? report_bug+0xf8/0x1c0 ? handle_bug+0x55/0x90 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? iommu_dma_unmap_page+0x79/0x90 dma_unmap_page_attrs+0xe6/0x290 mlx5_free_priv_descs+0xb0/0xe0 [mlx5_ib] __mlx5_ib_dereg_mr+0x37e/0x520 [mlx5_ib] ? _raw_spin_unlock_irq+0x24/0x40 ? wait_for_completion+0xfe/0x130 ? rdma_restrack_put+0x63/0xe0 [ib_core] ib_dereg_mr_user+0x5f/0x120 [ib_core] ? lock_release+0xc6/0x280 destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs] uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs] uobj_destroy+0x3f/0x70 [ib_uverbs] ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs] ? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs] ? lock_acquire+0xc1/0x2f0 ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0x116/0x170 [ib_uverbs] ? lock_release+0xc6/0x280 ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] __x64_sys_ioctl+0x1b0/0xa70 do_syscall_64+0x6b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f537adaf17b Code: 0f 1e fa 48 8b 05 1d ad 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed ac 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffff218f0b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffff218f1d8 RCX: 00007f537adaf17b RDX: 00007ffff218f1c0 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffff218f1a0 R08: 00007f537aa8d010 R09: 0000561ee2e4f270 R10: 00007f537aace3a8 R11: 0000000000000246 R12: 00007ffff218f190 R13: 000000000000001c R14: 0000561ee2e4d7c0 R15: 00007ffff218f450 </TASK> Fixes: f18ec4223117 ("RDMA/mlx5: Use a union inside mlx5_ib_mr") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/2039c22cfc3df02378747ba4d623a558b53fc263.1738587076.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-06RDMA/mlx5: Fix a race for DMABUF MR which can lead to CQE with errorYishai Hadas
This patch addresses a potential race condition for a DMABUF MR that can result in a CQE with an error on the UMR QP. During the __mlx5_ib_dereg_mr() flow, the following sequence of calls occurs: mlx5_revoke_mr() mlx5r_umr_revoke_mr() mlx5r_umr_post_send_wait() At this point, the lkey is freed from the hardware's perspective. However, concurrently, mlx5_ib_dmabuf_invalidate_cb() might be triggered by another task attempting to invalidate the MR having that freed lkey. Since the lkey has already been freed, this can lead to a CQE error, causing the UMR QP to enter an error state. To resolve this race condition, the dma_resv_lock() which was hold as part of the mlx5_ib_dmabuf_invalidate_cb() is now also acquired as part of the mlx5_revoke_mr() scope. Upon a successful revoke, we set umem_dmabuf->private which points to that MR to NULL, preventing any further invalidation attempts on its lkey. Fixes: e6fb246ccafb ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Artemy Kovalyov <artemyko@mnvidia.com> Link: https://patch.msgid.link/70617067abbfaa0c816a2544c922e7f4346def58.1738587016.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-06arm64: defconfig: Enable TISCI Interrupt Router and AggregatorVaishnav Achath
Enable TISCI Interrupt Router and Interrupt Aggregator drivers. These IPs are found in all TI K3 SoCs like J721E, AM62X and is required for core functionality like DMA, GPIO Interrupts which is necessary during boot, thus make them built-in. bloat-o-meter summary on vmlinux: add/remove: 460/1 grow/shrink: 4/0 up/down: 162483/-8 (162475) ... Total: Before=31615984, After=31778459, chg +0.51% These configs were previously selected for ARCH_K3 in respective Kconfigs till commit b8b26ae398c4 ("irqchip/ti-sci-inta : Add module build support") and commit 2d95ffaecbc2 ("irqchip/ti-sci-intr: Add module build support") dropped them and few driver configs (TI_K3_UDMA, TI_K3_RINGACC) dependent on these also got disabled due to this. While re-enabling the TI_SCI_INT_*_IRQCHIP configs, these configs with missing dependencies (which are already part of arm64 defconfig) also get re-enabled which explains the slightly larger size increase from the bloat-o-meter summary. Fixes: 2d95ffaecbc2 ("irqchip/ti-sci-intr: Add module build support") Fixes: b8b26ae398c4 ("irqchip/ti-sci-inta : Add module build support") Signed-off-by: Vaishnav Achath <vaishnav.a@ti.com> Tested-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Dhruva Gole <d-gole@ti.com> Link: https://lore.kernel.org/r/20250205062229.3869081-1-vaishnav.a@ti.com Signed-off-by: Nishanth Menon <nm@ti.com>
2025-02-05smb: client: get rid of kstrdup() in get_ses_refpath()Paulo Alcantara
After commit 36008fe6e3dc ("smb: client: don't try following DFS links in cifs_tree_connect()"), TCP_Server_Info::leaf_fullpath will no longer be changed, so there is no need to kstrdup() it. Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-02-05smb: client: fix noisy when tree connecting to DFS interlink targetsPaulo Alcantara
When the client attempts to tree connect to a domain-based DFS namespace from a DFS interlink target, the server will return STATUS_BAD_NETWORK_NAME and the following will appear on dmesg: CIFS: VFS: BAD_NETWORK_NAME: \\dom\dfs Since a DFS share might contain several DFS interlinks and they expire after 10 minutes, the above message might end up being flooded on dmesg when mounting or accessing them. Print this only once per share. Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-02-05smb: client: don't trust DFSREF_STORAGE_SERVER bitPaulo Alcantara
Some servers don't respect the DFSREF_STORAGE_SERVER bit, so unconditionally tree connect to DFS link target and then decide whether or not continue chasing DFS referrals for DFS interlinks. Otherwise the client would fail to mount such shares. Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-02-05Merge branch 'vxlan-age-fdb-entries-based-on-rx-traffic'Jakub Kicinski
Ido Schimmel says: ==================== vxlan: Age FDB entries based on Rx traffic tl;dr - This patchset prevents VXLAN FDB entries from lingering if traffic is only forwarded to a silent host. The VXLAN driver maintains two timestamps for each FDB entry: 'used' and 'updated'. The first is refreshed by both the Rx and Tx paths and the second is refreshed upon migration. The driver ages out entries according to their 'used' time which means that an entry can linger when traffic is only forwarded to a silent host that might have migrated to a different remote. This patchset solves the problem by adjusting the above semantics and aligning them to those of the bridge driver. That is, 'used' time is refreshed by the Tx path, 'updated' time is refresh by Rx path or user space updates and entries are aged out according to their 'updated' time. Patches #1-#2 perform small changes in how the 'used' and 'updated' fields are accessed. Patches #3-#5 refresh the 'updated' time where needed. Patch #6 flips the driver to age out FDB entries according to their 'updated' time. Patch #7 removes unnecessary updates to the 'used' time. Patch #8 extends a test case to cover aging of FDB entries in the presence of Tx traffic. ==================== Link: https://patch.msgid.link/20250204145549.1216254-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05selftests: forwarding: vxlan_bridge_1d: Check aging while forwardingIdo Schimmel
Extend the VXLAN FDB aging test case to verify that FDB entries are aged out when they only forward traffic and not refreshed by received traffic. The test fails before "vxlan: Age out FDB entries based on 'updated' time": # ./vxlan_bridge_1d.sh [...] TEST: VXLAN: Ageing of learned FDB entry [FAIL] [...] # echo $? 1 And passes after it: # ./vxlan_bridge_1d.sh [...] TEST: VXLAN: Ageing of learned FDB entry [ OK ] [...] # echo $? 0 Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-9-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05vxlan: Avoid unnecessary updates to FDB 'used' timeIdo Schimmel
Now that the VXLAN driver ages out FDB entries based on their 'updated' time we can remove unnecessary updates of the 'used' time from the Rx path and the control path, so that the 'used' time is only updated by the Tx path. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-8-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05vxlan: Age out FDB entries based on 'updated' timeIdo Schimmel
Currently, the VXLAN driver ages out FDB entries based on their 'used' time which is refreshed by both the Tx and Rx paths. This means that an FDB entry will not age out if traffic is only forwarded to the target host: # ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 learning ageing 10 # bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1 # bridge fdb get 00:11:22:33:44:55 br vx1 self 00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self # mausezahn vx1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q & # sleep 20 # bridge fdb get 00:11:22:33:44:55 br vx1 self 00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self This is wrong as an FDB entry will remain present when we no longer have an indication that the host is still behind the current remote. It is also inconsistent with the bridge driver: # ip link add name br1 up type bridge ageing_time $((10 * 100)) # ip link add name swp1 up master br1 type dummy # bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic # bridge fdb get 00:11:22:33:44:55 br br1 00:11:22:33:44:55 dev swp1 master br1 # mausezahn br1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q & # sleep 20 # bridge fdb get 00:11:22:33:44:55 br br1 Error: Fdb entry not found. Solve this by aging out entries based on their 'updated' time, which is not refreshed by the Tx path: # ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 learning ageing 10 # bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1 # bridge fdb get 00:11:22:33:44:55 br vx1 self 00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self # mausezahn vx1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q & # sleep 20 # bridge fdb get 00:11:22:33:44:55 br vx1 self Error: Fdb entry not found. But is refreshed by the Rx path: # ip address add 192.0.2.1/32 dev lo # ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 localbypass # ip link add name vx2 up type vxlan id 20010 local 192.0.2.1 dstport 4789 learning ageing 10 # bridge fdb add 00:11:22:33:44:55 dev vx1 self static dst 127.0.0.1 vni 20010 # mausezahn vx1 -a 00:aa:bb:cc:dd:ee -b 00:11:22:33:44:55 -c 0 -p 100 -q & # sleep 20 # bridge fdb get 00:aa:bb:cc:dd:ee br vx2 self 00:aa:bb:cc:dd:ee dev vx2 dst 127.0.0.1 self # pkill mausezahn # sleep 20 # bridge fdb get 00:aa:bb:cc:dd:ee br vx2 self Error: Fdb entry not found. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-7-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05vxlan: Refresh FDB 'updated' time upon user space updatesIdo Schimmel
When a host migrates to a different remote and a packet is received from the new remote, the corresponding FDB entry is updated and its 'updated' time is refreshed. However, when user space replaces the remote of an FDB entry, its 'updated' time is not refreshed: # ip link add name vx1 up type vxlan id 10010 dstport 4789 # bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1 # sleep 10 # bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]' 10 # bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.2 # bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]' 10 This can lead to the entry being aged out prematurely and it is also inconsistent with the bridge driver: # ip link add name br1 up type bridge # ip link add name swp1 master br1 up type dummy # ip link add name swp2 master br1 up type dummy # bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic vlan 1 # sleep 10 # bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]' 10 # bridge fdb replace 00:11:22:33:44:55 dev swp2 master dynamic vlan 1 # bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]' 0 Adjust the VXLAN driver to refresh the 'updated' time of an FDB entry whenever one of its attributes is changed by user space: # ip link add name vx1 up type vxlan id 10010 dstport 4789 # bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1 # sleep 10 # bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]' 10 # bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.2 # bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]' 0 Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05vxlan: Refresh FDB 'updated' time upon 'NTF_USE'Ido Schimmel
The 'NTF_USE' flag can be used by user space to refresh FDB entries so that they will not age out. Currently, the VXLAN driver implements it by refreshing the 'used' field in the FDB entry as this is the field according to which FDB entries are aged out. Subsequent patches will switch the VXLAN driver to age out entries based on the 'updated' field. Prepare for this change by refreshing the 'updated' field upon 'NTF_USE'. This is consistent with the bridge driver's FDB: # ip link add name br1 up type bridge # ip link add name swp1 master br1 up type dummy # bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic vlan 1 # sleep 10 # bridge fdb replace 00:11:22:33:44:55 dev swp1 master dynamic vlan 1 # bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]' 10 # sleep 10 # bridge fdb replace 00:11:22:33:44:55 dev swp1 master use dynamic vlan 1 # bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]' 0 Before: # ip link add name vx1 up type vxlan id 10010 dstport 4789 # bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1 # sleep 10 # bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1 # bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]' 10 # sleep 10 # bridge fdb replace 00:11:22:33:44:55 dev vx1 self use dynamic dst 198.51.100.1 # bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]' 20 After: # ip link add name vx1 up type vxlan id 10010 dstport 4789 # bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1 # sleep 10 # bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1 # bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]' 10 # sleep 10 # bridge fdb replace 00:11:22:33:44:55 dev vx1 self use dynamic dst 198.51.100.1 # bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]' 0 Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05vxlan: Always refresh FDB 'updated' time when learning is enabledIdo Schimmel
Currently, when learning is enabled and a packet is received from the expected remote, the 'updated' field of the FDB entry is not refreshed. This will become a problem when we switch the VXLAN driver to age out entries based on the 'updated' field. Solve this by always refreshing an FDB entry when we receive a packet with a matching source MAC address, regardless if it was received via the expected remote or not as it indicates the host is alive. This is consistent with the bridge driver's FDB. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05vxlan: Read jiffies once when updating FDB 'used' timeIdo Schimmel
Avoid two volatile reads in the data path. Instead, read jiffies once and only if an FDB entry was found. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05vxlan: Annotate FDB data racesIdo Schimmel
The 'used' and 'updated' fields in the FDB entry structure can be accessed concurrently by multiple threads, leading to reports such as [1]. Can be reproduced using [2]. Suppress these reports by annotating these accesses using READ_ONCE() / WRITE_ONCE(). [1] BUG: KCSAN: data-race in vxlan_xmit / vxlan_xmit write to 0xffff942604d263a8 of 8 bytes by task 286 on cpu 0: vxlan_xmit+0xb29/0x2380 dev_hard_start_xmit+0x84/0x2f0 __dev_queue_xmit+0x45a/0x1650 packet_xmit+0x100/0x150 packet_sendmsg+0x2114/0x2ac0 __sys_sendto+0x318/0x330 __x64_sys_sendto+0x76/0x90 x64_sys_call+0x14e8/0x1c00 do_syscall_64+0x9e/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff942604d263a8 of 8 bytes by task 287 on cpu 2: vxlan_xmit+0xadf/0x2380 dev_hard_start_xmit+0x84/0x2f0 __dev_queue_xmit+0x45a/0x1650 packet_xmit+0x100/0x150 packet_sendmsg+0x2114/0x2ac0 __sys_sendto+0x318/0x330 __x64_sys_sendto+0x76/0x90 x64_sys_call+0x14e8/0x1c00 do_syscall_64+0x9e/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x00000000fffbac6e -> 0x00000000fffbac6f Reported by Kernel Concurrency Sanitizer on: CPU: 2 UID: 0 PID: 287 Comm: mausezahn Not tainted 6.13.0-rc7-01544-gb4b270f11a02 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 [2] #!/bin/bash set +H echo whitelist > /sys/kernel/debug/kcsan echo !vxlan_xmit > /sys/kernel/debug/kcsan ip link add name vx0 up type vxlan id 10010 dstport 4789 local 192.0.2.1 bridge fdb add 00:11:22:33:44:55 dev vx0 self static dst 198.51.100.1 taskset -c 0 mausezahn vx0 -a own -b 00:11:22:33:44:55 -c 0 -q & taskset -c 2 mausezahn vx0 -a own -b 00:11:22:33:44:55 -c 0 -q & Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05ipv4: ip_gre: Fix set but not used warning in ipgre_err() if IPv4-onlyGeert Uytterhoeven
if CONFIG_NET_IPGRE is enabled, but CONFIG_IPV6 is disabled: net/ipv4/ip_gre.c: In function ‘ipgre_err’: net/ipv4/ip_gre.c:144:22: error: variable ‘data_len’ set but not used [-Werror=unused-but-set-variable] 144 | unsigned int data_len = 0; | ^~~~~~~~ Fix this by moving all data_len processing inside the IPV6-only section that uses its result. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501121007.2GofXmh5-lkp@intel.com/ Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/d09113cfe2bfaca02f3dddf832fb5f48dd20958b.1738704881.git.geert@linux-m68k.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05Merge branch 'rxrpc-call-state-fixes'Jakub Kicinski
David Howells says: ==================== rxrpc: Call state fixes Here some call state fixes for AF_RXRPC. (1) Fix the state of a call to not treat the challenge-response cycle as part of an incoming call's state set. The problem is that it makes handling received of the final packet in the receive phase difficult as that wants to change the call state - but security negotiations may not yet be complete. (2) Fix a race between the changing of the call state at the end of the request reception phase of a service call, recvmsg() collecting the last data and sendmsg() trying to send the reply before the I/O thread has advanced the call state. Link: https://lore.kernel.org/20250203110307.7265-2-dhowells@redhat.com ==================== Link: https://patch.msgid.link/20250204230558.712536-1-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05rxrpc: Fix race in call state changing vs recvmsg()David Howells
There's a race in between the rxrpc I/O thread recording the end of the receive phase of a call and recvmsg() examining the state of the call to determine whether it has completed. The problem is that call->_state records the I/O thread's view of the call, not the application's view (which may lag), so that alone is not sufficient. To this end, the application also checks whether there is anything left in call->recvmsg_queue for it to pick up. The call must be in state RXRPC_CALL_COMPLETE and the recvmsg_queue empty for the call to be considered fully complete. In rxrpc_input_queue_data(), the latest skbuff is added to the queue and then, if it was marked as LAST_PACKET, the state is advanced... But this is two separate operations with no locking around them. As a consequence, the lack of locking means that sendmsg() can jump into the gap on a service call and attempt to send the reply - but then get rejected because the I/O thread hasn't advanced the state yet. Simply flipping the order in which things are done isn't an option as that impacts the client side, causing the checks in rxrpc_kernel_check_life() as to whether the call is still alive to race instead. Fix this by moving the update of call->_state inside the skb queue spinlocked section where the packet is queued on the I/O thread side. rxrpc's recvmsg() will then automatically sync against this because it has to take the call->recvmsg_queue spinlock in order to dequeue the last packet. rxrpc's sendmsg() doesn't need amending as the app shouldn't be calling it to send a reply until recvmsg() indicates it has returned all of the request. Fixes: 93368b6bd58a ("rxrpc: Move call state changes from recvmsg to I/O thread") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250204230558.712536-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05rxrpc: Fix call state set to not include the SERVER_SECURING stateDavid Howells
The RXRPC_CALL_SERVER_SECURING state doesn't really belong with the other states in the call's state set as the other states govern the call's Rx/Tx phase transition and govern when packets can and can't be received or transmitted. The "Securing" state doesn't actually govern the reception of packets and would need to be split depending on whether or not we've received the last packet yet (to mirror RECV_REQUEST/ACK_REQUEST). The "Securing" state is more about whether or not we can start forwarding packets to the application as recvmsg will need to decode them and the decoding can't take place until the challenge/response exchange has completed. Fix this by removing the RXRPC_CALL_SERVER_SECURING state from the state set and, instead, using a flag, RXRPC_CALL_CONN_CHALLENGING, to track whether or not we can queue the call for reception by recvmsg() or notify the kernel app that data is ready. In the event that we've already received all the packets, the connection event handler will poke the app layer in the appropriate manner. Also there's a race whereby the app layer sees the last packet before rxrpc has managed to end the rx phase and change the state to one amenable to allowing a reply. Fix this by queuing the packet after calling rxrpc_end_rx_phase(). Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250204230558.712536-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05net: sched: Fix truncation of offloaded action statisticsIdo Schimmel
In case of tc offload, when user space queries the kernel for tc action statistics, tc will query the offloaded statistics from device drivers. Among other statistics, drivers are expected to pass the number of packets that hit the action since the last query as a 64-bit number. Unfortunately, tc treats the number of packets as a 32-bit number, leading to truncation and incorrect statistics when the number of packets since the last query exceeds 0xffffffff: $ tc -s filter show dev swp2 ingress filter protocol all pref 1 flower chain 0 filter protocol all pref 1 flower chain 0 handle 0x1 skip_sw in_hw in_hw_count 1 action order 1: mirred (Egress Redirect to device swp1) stolen index 1 ref 1 bind 1 installed 58 sec used 0 sec Action statistics: Sent 1133877034176 bytes 536959475 pkt (dropped 0, overlimits 0 requeues 0) [...] According to the above, 2111-byte packets were redirected which is impossible as only 64-byte packets were transmitted and the MTU was 1500. Fix by treating packets as a 64-bit number: $ tc -s filter show dev swp2 ingress filter protocol all pref 1 flower chain 0 filter protocol all pref 1 flower chain 0 handle 0x1 skip_sw in_hw in_hw_count 1 action order 1: mirred (Egress Redirect to device swp1) stolen index 1 ref 1 bind 1 installed 61 sec used 0 sec Action statistics: Sent 1370624380864 bytes 21416005951 pkt (dropped 0, overlimits 0 requeues 0) [...] Which shows that only 64-byte packets were redirected (1370624380864 / 21416005951 = 64). Fixes: 380407023526 ("net/sched: Enable netdev drivers to update statistics of offloaded actions") Reported-by: Joe Botha <joe@atomic.ac> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204123839.1151804-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05r8169: don't scan PHY addresses > 0Heiner Kallweit
The PHY address is a dummy, because r8169 PHY access registers don't support a PHY address. Therefore scan address 0 only. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/830637dd-4016-4a68-92b3-618fcac6589d@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05tun: revert fix group permission checkWillem de Bruijn
This reverts commit 3ca459eaba1bf96a8c7878de84fa8872259a01e3. The blamed commit caused a regression when neither tun->owner nor tun->group is set. This is intended to be allowed, but now requires CAP_NET_ADMIN. Discussion in the referenced thread pointed out that the original issue that prompted this patch can be resolved in userspace. The relaxed access control may also make a device accessible when it previously wasn't, while existing users may depend on it to not be. This is a clean pure git revert, except for fixing the indentation on the gid_valid line that checkpatch correctly flagged. Fixes: 3ca459eaba1b ("tun: fix group permission check") Link: https://lore.kernel.org/netdev/CAFqZXNtkCBT4f+PwyVRmQGoT3p1eVa01fCG_aNtpt6dakXncUg@mail.gmail.com/ Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Stas Sergeev <stsp2@yandex.ru> Link: https://patch.msgid.link/20250204161015.739430-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05net: flush_backlog() small changesEric Dumazet
Add READ_ONCE() around reads of skb->dev->reg_state, because this field can be changed from other threads/cpus. Instead of calling dev_kfree_skb_irq() and kfree_skb() while interrupts are masked and locks held, use a temporary list and use __skb_queue_purge_reason() Use SKB_DROP_REASON_DEV_READY drop reason to better describe why these skbs are dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250204144825.316785-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>