Age | Commit message (Collapse) | Author |
|
Commit 67efaf45930d ("net/mlx5e: TC, Remove CT action reordering")
removed the usage of mlx5e_tc_flow_action struct, remove the struct as
well.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Remove the stray semicolon in the mlx5_ldev_for_each_reverse() loop.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add support to show and config FEC by ethtool for 200G/lane link
modes. The RS encoding setting is mapped, and can be overridden to
FEC_RS_544_514_INTERLEAVED_QUAD for these modes.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This patch exposes new link modes using 200Gbps per lane, including
200G, 400G and 800G modes.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Define 200G, 400G and 800G link modes using 200Gbps per lane.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
As a specific function (mdev) is chosen to send MTPPSE command to
firmware, the event is generated only on that function. When that
function is unloaded, the PPS event can't be forward to PTP device,
even when there are other functions in the group, and PTP device is
not destroyed. To resolve this problem, need to send MTPPSE again from
new function, and dis-arm the event on old function after that.
PPS events are handled by EQ notifier. The async EQs and notifiers are
destroyed in mlx5_eq_table_destroy() which is called before
mlx5_cleanup_clock(). During the period between
mlx5_eq_table_destroy() and mlx5_cleanup_clock(), the events can't be
handled. To avoid event loss, add mlx5_clock_unload() in mlx5_unload()
to arm the event on other available function, and mlx5_clock_load in
mlx5_load() for symmetry.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, mlx5 driver exposes a PTP device for each network interface,
resulting in multiple device nodes representing the same underlying
PHC (PTP hardware clock). This causes problem if it is trying to
synchronize to itself. For instance, when ptp4l operates on multiple
interfaces following different masters, phc2sys attempts to
synchronize them in automatic mode.
PHC can be configured to work as free running mode or real time mode.
All functions can access it directly. In this patch, we create one PTP
device for each PHC when it's running in real time mode. All the
functions share the same PTP device if the clock identifies they query
are same, and they are already grouped by devcom in previous commit.
The first mdev in the peer list is chosen when sending
MTPPS/MTUTC/MTPPSE/MRTCQ to firmware. Since the function can be
unloaded at any time, we need to use a mutex lock to protect the mdev
pointer used in PTP and PPS callbacks. Besides, new one should be
picked from the peer list when the current is not available.
The clock info, which is used by IB, is shared by all the interfaces
using the same hardware clock.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The PPS notifier is currently in mlx5_clock, and mlx5_clock can be
shared in later patch, so the notifier should be registered for each
device to avoid any event miss. Besides, the out_work is scheduled by
PPS out event which is triggered only when the device is in free
running mode. So, both are moved to mlx5_core_dev's clock_state.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add new devcom component for hardware clock. When it is running in
real time mode, the functions are grouped by the identify they query.
According to firmware document, the clock identify size is 64 bits, so
it's safe to memcpy to component key, as the key size is also 64 bits.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Change clock member in mlx5_core_dev to a pointer, so it can point to
a clock shared by multiple functions in later patch.
For now, each function has its own clock, so mdev in mlx5_clock_priv
is the back pointer to the function. Later it points to one (normally
the first one) of the multiple functions sharing the same clock.
Change mlx5_init_clock() to return error if mlx5_clock is not
allocated. Besides, a null clock is defined and used when hardware
clock is not supported. So, the clock pointer is always pointing to
something valid.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The mdev is calculated directly from mlx5_clock, as it's one of the
fields in mlx5_core_dev. Move to a function so it can be easily
changed in next patch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Move hardware clock initialization and destruction to the functions,
which will be used for dynamically allocated clock. Such clock is
shared by all the devices if the queried clock identities are same.
The out_work is for PPS out event, which can't be triggered when clock
is shared, so INIT_WORK is not moved to the initialization function.
Besides, we still need to register notifier for each device.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In later patch, the mlx5_clock will be allocated dynamically, its
address can be obtained from mlx5_core_dev struct, but mdev can't be
obtained from mlx5_clock because it can be shared by multiple
interfaces. So change the parameter for such internal functions, only
mdev is passed down from the callers.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The PTP callback functions should not be used directly by internal
callers. Add helpers that can be used internally and externally.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Ido Schimmel says:
====================
vxlan: Age FDB entries based on Rx traffic
tl;dr - This patchset prevents VXLAN FDB entries from lingering if
traffic is only forwarded to a silent host.
The VXLAN driver maintains two timestamps for each FDB entry: 'used' and
'updated'. The first is refreshed by both the Rx and Tx paths and the
second is refreshed upon migration.
The driver ages out entries according to their 'used' time which means
that an entry can linger when traffic is only forwarded to a silent host
that might have migrated to a different remote.
This patchset solves the problem by adjusting the above semantics and
aligning them to those of the bridge driver. That is, 'used' time is
refreshed by the Tx path, 'updated' time is refresh by Rx path or user
space updates and entries are aged out according to their 'updated'
time.
Patches #1-#2 perform small changes in how the 'used' and 'updated'
fields are accessed.
Patches #3-#5 refresh the 'updated' time where needed.
Patch #6 flips the driver to age out FDB entries according to their
'updated' time.
Patch #7 removes unnecessary updates to the 'used' time.
Patch #8 extends a test case to cover aging of FDB entries in the
presence of Tx traffic.
====================
Link: https://patch.msgid.link/20250204145549.1216254-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Extend the VXLAN FDB aging test case to verify that FDB entries are aged
out when they only forward traffic and not refreshed by received
traffic.
The test fails before "vxlan: Age out FDB entries based on 'updated'
time":
# ./vxlan_bridge_1d.sh
[...]
TEST: VXLAN: Ageing of learned FDB entry [FAIL]
[...]
# echo $?
1
And passes after it:
# ./vxlan_bridge_1d.sh
[...]
TEST: VXLAN: Ageing of learned FDB entry [ OK ]
[...]
# echo $?
0
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250204145549.1216254-9-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that the VXLAN driver ages out FDB entries based on their 'updated'
time we can remove unnecessary updates of the 'used' time from the Rx
path and the control path, so that the 'used' time is only updated by
the Tx path.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250204145549.1216254-8-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, the VXLAN driver ages out FDB entries based on their 'used'
time which is refreshed by both the Tx and Rx paths. This means that an
FDB entry will not age out if traffic is only forwarded to the target
host:
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
# mausezahn vx1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
This is wrong as an FDB entry will remain present when we no longer have
an indication that the host is still behind the current remote. It is
also inconsistent with the bridge driver:
# ip link add name br1 up type bridge ageing_time $((10 * 100))
# ip link add name swp1 up master br1 type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic
# bridge fdb get 00:11:22:33:44:55 br br1
00:11:22:33:44:55 dev swp1 master br1
# mausezahn br1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br br1
Error: Fdb entry not found.
Solve this by aging out entries based on their 'updated' time, which is
not refreshed by the Tx path:
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
# mausezahn vx1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br vx1 self
Error: Fdb entry not found.
But is refreshed by the Rx path:
# ip address add 192.0.2.1/32 dev lo
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 localbypass
# ip link add name vx2 up type vxlan id 20010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self static dst 127.0.0.1 vni 20010
# mausezahn vx1 -a 00:aa:bb:cc:dd:ee -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:aa:bb:cc:dd:ee br vx2 self
00:aa:bb:cc:dd:ee dev vx2 dst 127.0.0.1 self
# pkill mausezahn
# sleep 20
# bridge fdb get 00:aa:bb:cc:dd:ee br vx2 self
Error: Fdb entry not found.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250204145549.1216254-7-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a host migrates to a different remote and a packet is received from
the new remote, the corresponding FDB entry is updated and its 'updated'
time is refreshed.
However, when user space replaces the remote of an FDB entry, its
'updated' time is not refreshed:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.2
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
This can lead to the entry being aged out prematurely and it is also
inconsistent with the bridge driver:
# ip link add name br1 up type bridge
# ip link add name swp1 master br1 up type dummy
# ip link add name swp2 master br1 up type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# sleep 10
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev swp2 master dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
0
Adjust the VXLAN driver to refresh the 'updated' time of an FDB entry
whenever one of its attributes is changed by user space:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.2
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
0
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250204145549.1216254-6-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The 'NTF_USE' flag can be used by user space to refresh FDB entries so
that they will not age out. Currently, the VXLAN driver implements it by
refreshing the 'used' field in the FDB entry as this is the field
according to which FDB entries are aged out.
Subsequent patches will switch the VXLAN driver to age out entries based
on the 'updated' field. Prepare for this change by refreshing the
'updated' field upon 'NTF_USE'. This is consistent with the bridge
driver's FDB:
# ip link add name br1 up type bridge
# ip link add name swp1 master br1 up type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev swp1 master use dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
0
Before:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self use dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
20
After:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self use dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
0
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250204145549.1216254-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, when learning is enabled and a packet is received from the
expected remote, the 'updated' field of the FDB entry is not refreshed.
This will become a problem when we switch the VXLAN driver to age out
entries based on the 'updated' field.
Solve this by always refreshing an FDB entry when we receive a packet
with a matching source MAC address, regardless if it was received via
the expected remote or not as it indicates the host is alive. This is
consistent with the bridge driver's FDB.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250204145549.1216254-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Avoid two volatile reads in the data path. Instead, read jiffies once
and only if an FDB entry was found.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250204145549.1216254-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The 'used' and 'updated' fields in the FDB entry structure can be
accessed concurrently by multiple threads, leading to reports such as
[1]. Can be reproduced using [2].
Suppress these reports by annotating these accesses using
READ_ONCE() / WRITE_ONCE().
[1]
BUG: KCSAN: data-race in vxlan_xmit / vxlan_xmit
write to 0xffff942604d263a8 of 8 bytes by task 286 on cpu 0:
vxlan_xmit+0xb29/0x2380
dev_hard_start_xmit+0x84/0x2f0
__dev_queue_xmit+0x45a/0x1650
packet_xmit+0x100/0x150
packet_sendmsg+0x2114/0x2ac0
__sys_sendto+0x318/0x330
__x64_sys_sendto+0x76/0x90
x64_sys_call+0x14e8/0x1c00
do_syscall_64+0x9e/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
read to 0xffff942604d263a8 of 8 bytes by task 287 on cpu 2:
vxlan_xmit+0xadf/0x2380
dev_hard_start_xmit+0x84/0x2f0
__dev_queue_xmit+0x45a/0x1650
packet_xmit+0x100/0x150
packet_sendmsg+0x2114/0x2ac0
__sys_sendto+0x318/0x330
__x64_sys_sendto+0x76/0x90
x64_sys_call+0x14e8/0x1c00
do_syscall_64+0x9e/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0x00000000fffbac6e -> 0x00000000fffbac6f
Reported by Kernel Concurrency Sanitizer on:
CPU: 2 UID: 0 PID: 287 Comm: mausezahn Not tainted 6.13.0-rc7-01544-gb4b270f11a02 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[2]
#!/bin/bash
set +H
echo whitelist > /sys/kernel/debug/kcsan
echo !vxlan_xmit > /sys/kernel/debug/kcsan
ip link add name vx0 up type vxlan id 10010 dstport 4789 local 192.0.2.1
bridge fdb add 00:11:22:33:44:55 dev vx0 self static dst 198.51.100.1
taskset -c 0 mausezahn vx0 -a own -b 00:11:22:33:44:55 -c 0 -q &
taskset -c 2 mausezahn vx0 -a own -b 00:11:22:33:44:55 -c 0 -q &
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250204145549.1216254-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
if CONFIG_NET_IPGRE is enabled, but CONFIG_IPV6 is disabled:
net/ipv4/ip_gre.c: In function ‘ipgre_err’:
net/ipv4/ip_gre.c:144:22: error: variable ‘data_len’ set but not used [-Werror=unused-but-set-variable]
144 | unsigned int data_len = 0;
| ^~~~~~~~
Fix this by moving all data_len processing inside the IPV6-only section
that uses its result.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501121007.2GofXmh5-lkp@intel.com/
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/d09113cfe2bfaca02f3dddf832fb5f48dd20958b.1738704881.git.geert@linux-m68k.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The PHY address is a dummy, because r8169 PHY access registers
don't support a PHY address. Therefore scan address 0 only.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/830637dd-4016-4a68-92b3-618fcac6589d@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add READ_ONCE() around reads of skb->dev->reg_state, because
this field can be changed from other threads/cpus.
Instead of calling dev_kfree_skb_irq() and kfree_skb()
while interrupts are masked and locks held,
use a temporary list and use __skb_queue_purge_reason()
Use SKB_DROP_REASON_DEV_READY drop reason to better
describe why these skbs are dropped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250204144825.316785-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The original Open Systems Adapter (OSA) was introduced by IBM in the
mid-90s. These were then superseded by OSA-Express in 1999 which used
Queued Direct IO to greatly improve throughput. The newer cards
retained the older, slower non-QDIO (OSE) modes for compatibility with
older systems. In Linux, the lcs driver was responsible for cards
operating in the older OSE mode and the qeth driver was introduced to
allow the OSA-Express cards to operate in the newer QDIO (OSD) mode.
For an S390 machine from 1998 or later, there is no reason to use the
OSE mode and lcs driver as all OSA cards since 1999 provide the faster
OSD mode. As a result, it's been years since we have heard of a
customer configuration involving the lcs driver.
This patch removes the lcs driver. The technology it supports has been
obsolete for past 25+ years and is irrelevant for current use cases.
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Aswin Karuvally <aswin@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250204103135.1619097-1-wintera@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Move the conflicting declaration to the end of the structure. Notice
that `struct ethtool_dump` is a flexible structure --a structure that
contains a flexible-array member.
Fix the following warning:
./drivers/net/ethernet/chelsio/cxgb4/cxgb4.h:1215:29: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/Z6GBZ4brXYffLkt_@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Attempts to replace an MDB group membership of the host itself are
currently bounced:
# ip link add name br up type bridge vlan_filtering 1
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
Error: bridge: Group is already joined by host.
A similar operation done on a member port would succeed. Ignore the check
for replacement of host group memberships as well.
The bit of code that this enables is br_multicast_host_join(), which, for
already-joined groups only refreshes the MC group expiration timer, which
is desirable; and a userspace notification, also desirable.
Change a selftest that exercises this code path from expecting a rejection
to expecting a pass. The rest of MDB selftests pass without modification.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/e5c5188b9787ae806609e7ca3aa2a0a501b9b5c4.1738685648.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some selftests need libynl.a. When building it try to skip
generating the ReST documentation, libynl.a does not depend
on them.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250203214850.1282291-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Antoine Tenart says:
====================
net-sysfs: remove the rtnl_trylock/restart_syscall construction
The series initially aimed at improving spins (and thus delays) while
accessing net sysfs under rtnl lock contention[1]. The culprit was the
trylock/restart_syscall constructions. There wasn't much interest at the
time but it got traction recently for other reasons (lowering the rtnl
lock pressure).
Since v1[2]:
- Do not export rtnl_lock_interruptible [Stephen].
- Add netdev_warn_once messages in rx_queue_add_kobject [Jakub].
Since the RFC[1]:
- Limit the breaking of the sysfs protection to sysfs_rtnl_lock() only
as this is not needed in the whole rtnl locking section thanks to the
additional check on dev_isalive(). This simplifies error handling as
well as the unlocking path.
- Used an interruptible version of rtnl_lock, as done by Jakub in
his experiments.
- Removed a WARN_ONCE_ONCE [Greg].
- Removed explicit inline markers [Stephen].
Most of the reasoning is explained in comments added in patch 1. This
was tested by stress-testing net sysfs attributes (read/write ops) while
adding/removing queues and adding/removing veths, all in parallel. I
also used an OCP single node cluster, spawning lots of pods.
[1] https://lore.kernel.org/all/20231018154804.420823-1-atenart@kernel.org/T/
[2] https://lore.kernel.org/all/20250117102612.132644-1-atenart@kernel.org/T/
====================
Link: https://patch.msgid.link/20250204170314.146022-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Similar to the commit removing remove rtnl_trylock from device
attributes we here apply the same technique to networking queues.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250204170314.146022-5-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the (upcoming) removal of the rtnl_trylock/restart_syscall logic
and because of how Tx/Rx queues are implemented (and their
requirements), it might happen that a queue is re-added before having
the chance to be cleared. In such rare case, do not complete the queue
addition operation.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250204170314.146022-4-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rx/tx queues embed their own kobject for registering their per-queue
sysfs files. The issue is they're using the kobject default groups for
this and entirely rely on the kobject refcounting for releasing their
sysfs paths.
In order to remove rtnl_trylock calls we need sysfs files not to rely on
their associated kobject refcounting for their release. Thus we here
move queues sysfs files from the kobject default groups to their own
groups which can be removed separately.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250204170314.146022-3-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There is an ABBA deadlock between net device unregistration and sysfs
files being accessed[1][2]. To prevent this from happening all paths
taking the rtnl lock after the sysfs one (actually kn->active refcount)
use rtnl_trylock and return early (using restart_syscall)[3], which can
make syscalls to spin for a long time when there is contention on the
rtnl lock[4].
There are not many possibilities to improve the above:
- Rework the entire net/ locking logic.
- Invert two locks in one of the paths — not possible.
But here it's actually possible to drop one of the locks safely: the
kernfs_node refcount. More details in the code itself, which comes with
lots of comments.
Note that we check the device is alive in the added sysfs_rtnl_lock
helper to disallow sysfs operations to run after device dismantle has
started. This also help keeping the same behavior as before. Because of
this calls to dev_isalive in sysfs ops were removed.
[1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
[2] https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/
[3] https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/
[4] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250204170314.146022-2-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use string choices helpers to simplify the code.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501190707.qQS8PGHW-lkp@intel.com/
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make config option R8169_LEDS user-visible, so that users can remove
support if not needed.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/d29f0cdb-32bf-435f-b59d-dc96bca1e3ab@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Make config symbol REALTEK_PHY_HWMON user-visible, so that users can
remove support if not needed.
Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/3466ee92-166a-4b0f-9ae7-42b9e046f333@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new selftest to verify netconsole's handling of messages that
exceed the packet size limit and require fragmentation. The test sends
messages with varying sizes and userdata, validating that:
1. Large messages are correctly fragmented and reassembled
2. Userdata fields are properly preserved across fragments
3. Messages work correctly with and without kernel release version
appending
The test creates a networking environment using netdevsim, sends
messages through /dev/kmsg, and verifies the received fragments maintain
message integrity.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250203-netcons_frag_msgs-v1-1-5bc6bedf2ac0@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Remove unused flexible-array member `buf` and, with this, fix the following
warnings:
drivers/net/ethernet/aquantia/atlantic/aq_hw.h:197:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
drivers/net/ethernet/aquantia/atlantic/hw_atl/../aq_hw.h:197:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Suggested-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Igor Russkikh <irusskikh@marvell.com>
Link: https://patch.msgid.link/Z6F3KZVfnAZ2FoJm@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Drivers should always disable a NAPI instance before removing it.
If they don't the instance may be queued for polling.
Since commit 86e25f40aa1e ("net: napi: Add napi_config")
we also remove the NAPI from the busy polling hash table
in napi_disable(), so not disabling would leave a stale
entry there.
Use of busy polling is relatively uncommon so bugs may be lurking
in the drivers. Add an explicit warning.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250203215816.1294081-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
lio_get_device_id() has been unused since 2018's
commit 64fecd3ec512 ("liquidio: remove obsolete functions and data
structures")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250203183343.193691-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mlxsw_sp_ipip_lb_ul_vr_id() has been unused since 2020's
commit acde33bf7319 ("mlxsw: spectrum_router: Reduce
mlxsw_sp_ipip_fib_entry_op_gre4()")
mlxsw_sp_rif_exists() has been unused since 2023's
commit 49c3a615d382 ("mlxsw: spectrum_router: Replay MACVLANs when RIF is
made")
mlxsw_sp_rif_vid() has been unused since 2023's
commit a5b52692e693 ("mlxsw: spectrum_switchdev: Manage RIFs on PVID
change")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250203190141.204951-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mlx5dr_domain_sync() was added in 2019 by
commit 70605ea545e8 ("net/mlx5: DR, Expose APIs for direct rule managing")
but hasn't been used.
Remove it.
mlx5dr_domain_sync() was the only user of
mlx5dr_send_ring_force_drain().
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250203185958.204794-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The last use of mlx4_find_cached_mac() was removed in 2014 by
commit 2f5bb473681b ("mlx4: Add ref counting to port MAC table for RoCE")
mlx4_zone_free_entries() was added in 2014 by
commit 7a89399ffad7 ("net/mlx4: Add mlx4_bitmap zone allocator")
but hasn't been used. (The _unique version is used)
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250203185229.204279-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are some typos in comments/messages:
- Valiate -> Validate
- acceptible -> acceptable
- acces -> access
- relased -> released
Fix them via codespell.
Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250203175419.4146-1-algonell@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Aspeed device supports rgmii, rgmii-id, rgmii-rxid, rgmii-txid so
document them.
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Ninad Palsule <ninad@linux.ibm.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250203151306.276358-2-ninad@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
neigh_parms_destroy() is a simple kfree(), no need for
a forward declaration.
neigh_parms_put() can instead call kfree() directly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250203151152.3163876-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
XFRM API makes sure that xs->xso.dev is valid in all XFRM offload
callbacks. There is no need to check it again.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/0b2f8f5f09701bb43bbd83b94bfe5cb506b57adc.1738587150.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from IPSec, netfilter and Bluetooth.
Nothing really stands out, but as usual there's a slight concentration
of fixes for issues added in the last two weeks before the merge
window, and driver bugs from 6.13 which tend to get discovered upon
wider distribution.
Current release - regressions:
- net: revert RTNL changes in unregister_netdevice_many_notify()
- Bluetooth: fix possible infinite recursion of btusb_reset
- eth: adjust locking in some old drivers which protect their state
with spinlocks to avoid sleeping in atomic; core protects netdev
state with a mutex now
Previous releases - regressions:
- eth:
- mlx5e: make sure we pass node ID, not CPU ID to kvzalloc_node()
- bgmac: reduce max frame size to support just 1500 bytes; the
jumbo frame support would previously cause OOB writes, but now
fails outright
- mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted, avoid
false detection of MPTCP blackholing
Previous releases - always broken:
- mptcp: handle fastopen disconnect correctly
- xfrm:
- make sure skb->sk is a full sock before accessing its fields
- fix taking a lock with preempt disabled for RT kernels
- usb: ipheth: improve safety of packet metadata parsing; prevent
potential OOB accesses
- eth: renesas: fix missing rtnl lock in suspend/resume path"
* tag 'net-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
MAINTAINERS: add Neal to TCP maintainers
net: revert RTNL changes in unregister_netdevice_many_notify()
net: hsr: fix fill_frame_info() regression vs VLAN packets
doc: mptcp: sysctl: blackhole_timeout is per-netns
mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted
netfilter: nf_tables: reject mismatching sum of field_len with set key length
net: sh_eth: Fix missing rtnl lock in suspend/resume path
net: ravb: Fix missing rtnl lock in suspend/resume path
selftests/net: Add test for loading devbound XDP program in generic mode
net: xdp: Disallow attaching device-bound programs in generic mode
tcp: correct handling of extreme memory squeeze
bgmac: reduce max frame size to support just MTU 1500
vsock/test: Add test for connect() retries
vsock/test: Add test for UAF due to socket unbinding
vsock/test: Introduce vsock_connect_fd()
vsock/test: Introduce vsock_bind()
vsock: Allow retrying on connect() failure
vsock: Keep the binding until socket destruction
Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection
Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming
...
|