Age | Commit message (Collapse) | Author |
|
Remove the hwptp abstraction and associated callbacks from
the struct xgbe_hw_if {}.
The callback structure was only ever assigned a single function, without
null checks. This cleanup inlines the logic and moves all the hwtstamp
realted code a separate file, improving readability and maintainance.
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Link: https://patch.msgid.link/20250718185628.4038779-2-Raju.Rangoju@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add test cases to tc_flower.sh to validate generic matching on ERSPAN
options. Both ERSPAN Type II and Type III are covered.
Also add check_tc_erspan_support() to verify whether tc supports
erspan_opts.
Signed-off-by: Li Shuang <shuali@redhat.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/1f354a1afd60f29bbbf02bd60cb52ecfc0b6bd17.1752848172.git.shuali@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This commit remove duplicate assignments for net->pcpu_stat_type
in usbnet_probe().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The TCP header fields seq and ack_seq are 32-bit values in network
byte order as (__be32). these fields were earlier printed using
ntohs(), which converts only 16-bit values and produces incorrect
results for 32-bit fields. This patch is changeing the conversion
to ntohl(), ensuring correct interpretation of these sequence numbers.
Notably, the format specifier is updated from %d to %u to reflect the
unsigned nature of these fields.
improves the accuracy of debug log messages for TCP sequence and
acknowledgment numbers during TX timeouts.
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250717193552.3648791-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Wire-up ethtool_ops::nway_reset to phy_ethtool_nway_reset in order to
support re-starting auto-negotiation.
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250717180915.2611890-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Stanislav Fomichev says:
====================
net: maintain netif vs dev prefix semantics
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate. We care only
about driver-visible APIs, don't touch stack-internal routines.
The rest seem to be ok:
* dev_xdp_prog_count - mostly called by sw drivers (bonding), should not matter
* dev_get_by_xxx - too many to reasonably cleanup, already have different flavors
* dev_fetch_sw_netstats - don't need instance lock
* dev_get_tstats64 - never called directly, only as an ndo callback
* dev_pick_tx_zero - never called directly, only as an ndo callback
* dev_add_pack / dev_remove_pack - called early enough (in module init) to not matter
* dev_get_iflink - mostly called by sw drivers, should not matter
* dev_fill_forward_path - ditto
* dev_getbyhwaddr_rcu - ditto
* dev_getbyhwaddr - ditto
* dev_getfirstbyhwtype - ditto
* dev_valid_name - ditto
* __dev_forward_skb dev_forward_skb dev_queue_xmit_nit - established helpers, no netif vs dev distinction
====================
Link: https://patch.msgid.link/20250717172333.1288349-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
netif_close_many is used only by vlan/dsa and one mtk driver, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-8-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Note that one dev_set_threaded call still remains in mt76 for debugfs file.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
__netif_set_mtu is used only by bond, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
netif_pre_changeaddr_notify is used only by ipvlan/bond, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
netif_get_mac_address is used only by tun/tap, so move it into
NETDEV_INTERNAL namespace.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Virtual devices (e.g., VXLAN) that do not have a notion of a carrier are
created with an "UNKNOWN" operational state which some users find
confusing [1].
It is possible to set the operational state from user space either
during device creation or afterwards and some applications will start
doing that in order to avoid the above problem.
Add a test for this functionality to ensure it does not regress.
[1] https://lore.kernel.org/netdev/20241119153703.71f97b76@hermes.local/
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250717125151.466882-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Detect NICs and drivers that either drop frames with a corrupted TCP
checksum or, worse, pass them up as valid. The test flips one bit in
the checksum, transmits the packet in internal loopback, and fails when
the driver reports CHECKSUM_UNNECESSARY.
Discussed at:
https://lore.kernel.org/all/20250625132117.1b3264e8@kernel.org/
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250717083524.1645069-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets
dropped due to memory pressure. In production environments, we've observed
memory exhaustion reported by memory layer stack traces, but these drops
were not properly tracked in the SKB drop reason infrastructure.
While most network code paths now properly report pfmemalloc drops, some
protocol-specific socket implementations still use sk_filter() without
drop reason tracking:
- Bluetooth L2CAP sockets
- CAIF sockets
- IUCV sockets
- Netlink sockets
- SCTP sockets
- Unix domain sockets
These remaining cases represent less common paths and could be converted
in a follow-up patch if needed. The current implementation provides
significantly improved observability into memory pressure events in the
network stack, especially for key protocols like TCP and UDP, helping to
diagnose problems in production environments.
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a proper description for the sk_stream_write_space() function as
previously marked by a FIXME comment.
No functional changes.
Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Link: https://patch.msgid.link/20250716153404.7385-1-suchitkarunakaran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The DMA map functions can fail and should be tested for errors.
If the mapping fails, unmap and return an error.
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Acked-by: Mark Einon <mark.einon@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250716094733.28734-2-fourier.thomas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The DMA map functions can fail and should be tested for errors.
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250716095733.37452-3-fourier.thomas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for IPv6 environment for napi_id test.
Test Plan:
./run_kselftest.sh -t drivers/net:napi_id.py
TAP version 13
1..1
# timeout set to 45
# selftests: drivers/net: napi_id.py
# TAP version 13
# 1..1
# ok 1 napi_id.test_napi_id
# # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: drivers/net: napi_id.py
Signed-off-by: Tianyi Cui <1997cui@gmail.com>
Link: https://patch.msgid.link/20250717011913.1248816-1-1997cui@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
VNIC testing on multi-core Power systems showed SAR stats drift
and packet rate inconsistencies under load.
Implements ndo_get_stats64 to provide safe aggregation of queue-level
atomic64 counters into rtnl_link_stats64 for use by tools like 'ip -s',
'ifconfig', and 'sar'. Switch to ndo_get_stats64 to align SAR reporting
with the standard kernel interface for retrieving netdev stats.
This removes redundant per-adapter stat updates, reduces overhead,
eliminates cacheline bouncing from hot path updates, and improves
the accuracy of reported packet rates.
Signed-off-by: Mingming Cao <mmc@linux.ibm.com>
Reviewed-by: Brian King <bjking1@linux.ibm.com>
Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
----
Changes since v3:
link to v3: https://www.spinics.net/lists/netdev/msg1107999.html
-- keep per queue counters as u64 (this patch) and drop off patch 1 in v3
Link: https://patch.msgid.link/20250716152115.61143-1-mmc@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Tariq Toukan says:
====================
net/mlx5: misc changes 2025-07-16
This series contains misc enhancements to the mlx5 driver.
v1: https://lore.kernel.org/1752471585-18053-1-git-send-email-tariqt@nvidia.com
====================
Link: https://patch.msgid.link/1752675472-201445-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
qdisc_sleeping variable is declared as "struct Qdisc __rcu" and
as such needs proper annotation while accessing it.
Without rtnl_dereference(), the following error is generated by sparse:
drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: warning:
incorrect type in initializer (different address spaces)
drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: expected
struct Qdisc *qdisc
drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: got struct
Qdisc [noderef] __rcu *qdisc_sleeping
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1752675472-201445-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix the following kdoc warning:
git ls-files *.[ch] | egrep drivers/net/ethernet/mellanox/mlx5/core/ |\
xargs scripts/kernel-doc --none
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h:824: warning: cannot
understand function prototype: 'struct mlx5_esw_event_info '
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1752675472-201445-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
IPSec hardware offload in legacy mode should not be affected by the
steering mode, hence it should also work properly with hmfs mode.
Remove steering mode validation when calculating the cap for packet
offload, this will also enable the missing cap MLX5_IPSEC_CAP_PRIO
needed for crypto offload.
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1752675472-201445-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
readl() returns 32-bit value but Clause 22/45 registers are 16-bit wide.
Masking with 0xFFFF avoids using garbage upper bits.
Signed-off-by: Jack Ping CHNG <jchng@maxlinear.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250716030349.3796806-1-jchng@maxlinear.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The __esw_qos_alloc_node() function returns NULL on error. It doesn't
return error pointers. Update the error checking to match.
Fixes: 96619c485fa6 ("net/mlx5: Add support for setting tc-bw on nodes")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/0ce4ec2a-2b5d-4652-9638-e715a99902a7@sabinyo.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We recently changed this from using devm_ioremap() to using
devm_ioremap_resource() and unfortunately the former returns NULL while
the latter returns error pointers. The check for errors needs to be
updated as well.
Fixes: e27dba1951ce ("net: Use of_reserved_mem_region_to_resource{_byname}() for "memory-region"")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/87c10dbd-df86-4971-b4f5-40ba02c076fb@sabinyo.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The devm_ioremap_resource() function returns error pointers. It never
returns NULL. Update the check to match.
Fixes: e27dba1951ce ("net: Use of_reserved_mem_region_to_resource{_byname}() for "memory-region"")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/fc6d194e-6bf5-49ca-bc77-3fdfda62c434@sabinyo.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Luo Jie says:
====================
Add shared PHY counter support for QCA807x and QCA808x
The implementation of the PHY counter is identical for both QCA808x and
QCA807x series devices. This includes counters for both good and bad CRC
frames in the RX and TX directions, which are active when CRC checking
is enabled.
This patch series introduces PHY counter functions into a shared library,
enabling counter support for the QCA808x and QCA807x families through this
common infrastructure. Additionally, enable CRC checking and configure
automatic clearing of counters after reading within config_init() to ensure
accurate counter recording.
v2: https://lore.kernel.org/20250714-qcom_phy_counter-v2-0-94dde9d9769f@quicinc.com
v1: https://lore.kernel.org/20250709-qcom_phy_counter-v1-0-93a54a029c46@quicinc.com
====================
Link: https://patch.msgid.link/20250715-qcom_phy_counter-v3-0-8b0e460a527b@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Within the QCA807X PHY operation's config_init() function, enable CRC
checking for received and transmitted frames and configure counter to
clear after being read to support counter recording. Additionally, add
support for PHY counter operations.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Link: https://patch.msgid.link/20250715-qcom_phy_counter-v3-3-8b0e460a527b@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Enable CRC checking for received and transmitted frames, and configure
counters to clear after being read within config_init() for accurate
counter recording. Additionally, add PHY counter operations and integrate
shared functions.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Link: https://patch.msgid.link/20250715-qcom_phy_counter-v3-2-8b0e460a527b@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add PHY counter functionality to the shared library. The implementation
is identical for the current QCA807X and QCA808X PHYs.
The PHY counter can be configured to perform CRC checking for both received
and transmitted packets. Additionally, the packet counter can be set to
automatically clear after it is read.
The PHY counter includes 32-bit packet counters for both RX (received) and
TX (transmitted) packets, as well as 16-bit counters for recording CRC
error packets for both RX and TX.
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250715-qcom_phy_counter-v3-1-8b0e460a527b@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
bool notify is referenced nowhere else in the function except to check
whether or not to call rtnl_offload_xstats_notify(). Remove it and move
the call to the previous branch.
Signed-off-by: Dennis Chen <dechen@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250716165750.561175-1-dechen@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-07-17
We've added 13 non-merge commits during the last 20 day(s) which contain
a total of 4 files changed, 712 insertions(+), 84 deletions(-).
The main changes are:
1) Avoid skipping or repeating a sk when using a TCP bpf_iter,
from Jordan Rife.
2) Clarify the driver requirement on using the XDP metadata,
from Song Yoong Siang
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
doc: xdp: Clarify driver implementation for XDP Rx metadata
selftests/bpf: Add tests for bucket resume logic in established sockets
selftests/bpf: Create iter_tcp_destroy test program
selftests/bpf: Create established sockets in socket iterator tests
selftests/bpf: Make ehash buckets configurable in socket iterator tests
selftests/bpf: Allow for iteration over multiple states
selftests/bpf: Allow for iteration over multiple ports
selftests/bpf: Add tests for bucket resume logic in listening sockets
bpf: tcp: Avoid socket skips and repeats during iteration
bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items
bpf: tcp: Get rid of st_bucket_done
bpf: tcp: Make sure iter->batch always contains a full bucket snapshot
bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch
====================
Link: https://patch.msgid.link/20250717191731.4142326-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Make sure Python doesn't buffer the output, otherwise for some
tests we may see false positive timeouts in NIPA. NIPA thinks that
a machine has hung if the test doesn't print anything for 3min.
This is also nice to heave for running the tests manually,
especially in vng.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250716205712.1787325-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Kuniyuki Iwashima says:
====================
neighbour: Convert RTM_GETNEIGH to RCU and make pneigh RTNL-free.
This is kind of v3 of the series below [0] but without NEIGHTBL patches.
Patch 1 ~ 4 and 9 come from the series to convert RTM_GETNEIGH to RCU.
Other patches clean up pneigh_lookup() and convert the pneigh code to
RCU + private mutex so that we can easily remove RTNL from RTM_NEWNEIGH
in the later series.
[0]: https://lore.kernel.org/netdev/20250418012727.57033-1-kuniyu@amazon.com/
v2: https://lore.kernel.org/20250712203515.4099110-1-kuniyu@google.com
v1: https://lore.kernel.org/20250711191007.3591938-1-kuniyu@google.com
====================
Link: https://patch.msgid.link/20250716221221.442239-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
neigh_add() updates pneigh_entry() found or created by pneigh_create().
This update is serialised by RTNL, but we will remove it.
Let's move the update part to pneigh_create() and make it return errno
instead of a pointer of pneigh_entry.
Now, the pneigh code is RTNL free.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tbl->phash_buckets[] is only modified in the slow path by pneigh_create()
and pneigh_delete() under the table lock.
Both of them are called under RTNL, so no extra lock is needed, but we
will remove RTNL from the paths.
pneigh_create() looks up a pneigh_entry, and this part can be lockless,
but it would complicate the logic like
1. lookup
2. allocate pengih_entry for GFP_KERNEL
3. lookup again but under lock
4. if found, return it after freeing the allocated memory
5. else, return the new one
Instead, let's add a per-table mutex and run lookup and allocation
under it.
Note that updating pneigh_entry part in neigh_add() is still protected
by RTNL and will be moved to pneigh_create() in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now, all callers of pneigh_lookup() are under RCU, and the read
lock there is no longer needed.
Let's drop the lock, inline __pneigh_lookup_1() to pneigh_lookup(),
and call it from pneigh_create().
The next patch will remove tbl->lock from pneigh_create().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
__pneigh_lookup() is the lockless version of pneigh_lookup(),
but its only caller pndisc_is_router() holds the table lock and
reads pneigh_netry.flags.
This is because accessing pneigh_entry after pneigh_lookup() was
illegal unless the caller holds RTNL or the table lock.
Now, pneigh_entry is guaranteed to be alive during the RCU critical
section.
Let's call pneigh_lookup() and use READ_ONCE() for n->flags in
pndisc_is_router() and remove __pneigh_lookup().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.
Let's use rcu_dereference() in pneigh_get_{first,next}().
Note that neigh_seq_start() still holds tbl->lock for the
normal neighbour entry.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now pneigh_entry is guaranteed to be alive during the
RCU critical section even without holding tbl->lock.
Let's drop read_lock_bh(&tbl->lock) and use rcu_dereference()
to iterate tbl->phash_buckets[] in pneigh_dump_table()
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Only __dev_get_by_index() is the RTNL dependant in neigh_get().
Let's replace it with dev_get_by_index_rcu() and convert RTM_GETNEIGH
to RCU.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will convert pneigh readers to RCU, and its flags and protocol
will be read locklessly.
Let's annotate the access to the two fields.
Note that all access to pn->permanent is under RTNL (neigh_add()
and pneigh_ifdown_and_unlock()), so WRITE_ONCE() and READ_ONCE()
are not needed.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will convert RTM_GETNEIGH to RCU.
neigh_get() looks up pneigh_entry by pneigh_lookup() and passes
it to pneigh_fill_info().
Then, we must ensure that the entry is alive till pneigh_fill_info()
completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not
guarantee that.
Also, we will convert all readers of tbl->phash_buckets[] to RCU.
Let's use call_rcu() to free pneigh_entry and update phash_buckets[]
and ->next by rcu_assign_pointer().
pneigh_ifdown_and_unlock() uses list_head to avoid overwriting
->next and moving RCU iterators to another list.
pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it
is not delayed to call_rcu(), where we cannot sleep. This is fine
because the mcast code works with RCU and ipv6_dev_mc_dec() frees
mcast objects after RCU grace period.
While at it, we change the return type of pneigh_ifdown_and_unlock()
to void.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The next patch will free pneigh_entry with call_rcu().
Then, we need to annotate neigh_table.phash_buckets[] and
pneigh_entry.next with __rcu.
To make the next patch cleaner, let's annotate the fields in advance.
Currently, all accesses to the fields are under the neigh table lock,
so rcu_dereference_protected() is used with 1 for now, but most of them
(except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be
replaced with rcu_dereference() and rcu_dereference_check().
Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a
local list, which is illegal because the RCU iterator could be moved
to another list. This part will be fixed in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.
When called with the last argument, creat, 0, pneigh_lookup() literally
looks up a proxy neighbour entry. This is the case of the reader path
as the fast path and RTM_GETNEIGH.
pneigh_lookup(), however, creates a pneigh_entry when called with creat 1
from RTM_NEWNEIGH and SIOCSARP, which require RTNL.
Let's split pneigh_lookup() into two functions.
We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
neigh_valid_get_req() calls neigh_find_table() to fetch neigh_tables[].
neigh_find_table() uses rcu_dereference_rtnl(), but RTNL actually does
not protect it at all; neigh_table_clear() can be called without RTNL
and only waits for RCU readers by synchronize_rcu().
Fortunately, there is no bug because IPv4 is built-in, IPv6 cannot be
unloaded, and DECNET was removed.
To fetch neigh_tables[] by rcu_dereference() later, let's move
neigh_find_table() from neigh_valid_get_req() to neigh_get().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will remove RTNL for neigh_get() and run it under RCU instead.
neigh_get_reply() and pneigh_get_reply() allocate skb with GFP_KERNEL.
Let's move the allocation before __dev_get_by_index() in neigh_get().
Now, neigh_get_reply() and pneigh_get_reply() are inlined and
rtnl_unicast() is factorised.
We will convert pneigh_lookup() to __pneigh_lookup() later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|