summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-04-10tools: ynl-gen: use family c-name in notificationsJakub Kicinski
Family names may include dashes. Fix notification handling code gen to the c-compatible name. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-12-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl-gen: consider dump ops without a do "type-consistent"Jakub Kicinski
If the type for the response to do and dump are the same we don't generate it twice. This is called "type_consistent" in the generator. Consider operations which only have dump to also be consistent. This removes unnecessary "_dump" from the names. There's a number of GET ops in classic Netlink which only have dump handlers. Make sure we output the "onesided" types, normally if the type is consistent we only output it when we render the do structures. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250410014658.782120-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl: don't use genlmsghdr in classic netlinkJakub Kicinski
Make sure the codegen calls the right YNL lib helper to start the request based on family type. Classic netlink request must not include the genl header. Conversely don't expect genl headers in the responses. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl-gen: don't consider requests with fixed hdr emptyJakub Kicinski
C codegen skips generating the structs if request/reply has no attrs. In such cases the request op takes no argument and return int (rather than response struct). In case of classic netlink a lot of information gets passed using the fixed struct, however, so adjust the logic to consider a request empty only if it has no attrs _and_ no fixed struct. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250410014658.782120-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tools: ynl: support creating non-genl socketsJakub Kicinski
Classic netlink has static family IDs specified in YAML, there is no family name -> ID lookup. Support providing the ID info to the library via the generated struct and make library use it. Since NETLINK_ROUTE is ID 0 we need an extra boolean to indicate classic_id is to be used. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-route: add C naming infoJakub Kicinski
Add properties needed for C codegen to match names with uAPI headers. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-addr: add C naming infoJakub Kicinski
Add properties needed for C codegen to match names with uAPI headers. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-route: remove the fixed members from attrsJakub Kicinski
The purpose of the attribute list is to list the attributes which will be included in a given message to shrink the objects for families with huge attr spaces. Fixed headers are always present in their entirety (between netlink header and the attrs) so there's no point in listing their members. Current C codegen doesn't expect them and tries to look them up in the attribute space. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-addr: remove the fixed members from attrsJakub Kicinski
The purpose of the attribute list is to list the attributes which will be included in a given message to shrink the objects for families with huge attr spaces. Fixed headers are always present in their entirety (between netlink header and the attrs) so there's no point in listing their members. Current C codegen doesn't expect them and tries to look them up in the attribute space. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rt-route: specify fixed-header at operations levelJakub Kicinski
The C codegen currently stores the fixed-header as part of family info, so it only supports one fixed-header type per spec. Luckily all rtm route message have the same fixed header so just move it up to the higher level. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10netlink: specs: rename rtnetlink specs in accordance with family nameJakub Kicinski
The rtnetlink family names are set to rt-$name within the YAML but the files are called rt_$name. C codegen assumes that the generated file name will match the family. The use of dashes is in line with our general expectation that name properties in the spec use dashes not underscores (even tho, as Donald points out most genl families use underscores in the name). We have 3 un-ideal options to choose from: - accept the slight inconsistency with old families using _, or - accept the slight annoyance with all languages having to do s/-/_/ when looking up family ID, or - accept the inconsistency with all name properties in new YAML spec being separated with - and just the family name always using _. Pick option 1 and rename the rtnl spec files. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250410014658.782120-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10usbnet: asix AX88772: leave the carrier control to phylinkKrzysztof Hałasa
ASIX AX88772B based USB 10/100 Ethernet adapter doesn't come up ("carrier off"), despite the built-in 100BASE-FX PHY positive link indication. The internal PHY is configured (using EEPROM) in fixed 100 Mbps full duplex mode. The primary problem appears to be using carrier_netif_{on,off}() while, at the same time, delegating carrier management to phylink. Use only the latter and remove "manual control" in the asix driver. I don't have any other AX88772 board here, but the problem doesn't seem specific to a particular board or settings - it's probably timing-dependent. Remove unused asix_adjust_link() as well. Signed-off-by: Krzysztof Hałasa <khalasa@piap.pl> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/m3plhmdfte.fsf_-_@t19.piap.pl Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10Merge branch 'trace-add-tracepoint-for-tcp_sendmsg_locked'Jakub Kicinski
Breno Leitao says: ==================== trace: add tracepoint for tcp_sendmsg_locked() Meta has been using BPF programs to monitor tcp_sendmsg() for years, indicating significant interest in observing this important functionality. Adding a proper tracepoint provides a stable API for all users who need visibility into TCP message transmission. David Ahern is using a similar functionality with a custom patch[1]. So, this means we have more than a single use case for this request, and it might be a good idea to have such feature upstream. Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/ [1] v2: https://lore.kernel.org/20250407-tcpsendmsg-v2-0-9f0ea843ef99@debian.org v1: https://lore.kernel.org/20250224-tcpsendmsg-v1-1-bac043c59cc8@debian.org ==================== Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-0-208b87064c28@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10trace: tcp: Add tracepoint for tcp_sendmsg_locked()Breno Leitao
Add a tracepoint to monitor TCP send operations, enabling detailed visibility into TCP message transmission. Create a new tracepoint within the tcp_sendmsg_locked function, capturing traditional fields along with size_goal, which indicates the optimal data size for a single TCP segment. Additionally, a reference to the struct sock sk is passed, allowing direct access for BPF programs. The implementation is largely based on David's patch[1] and suggestions. Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/ [1] Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-2-208b87064c28@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10net: pass const to msg_data_left()Breno Leitao
The msg_data_left() function doesn't modify the struct msghdr parameter, so mark it as const. This allows the function to be used with const references, improving type safety and making the API more flexible. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-1-208b87064c28@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10Merge branch 'net-stmmac-stmmac_pltfr_find_clk'Jakub Kicinski
Russell King says: ==================== net: stmmac: stmmac_pltfr_find_clk() The GBETH glue driver that is being proposed duplicates the clock finding from the bulk clock data in the stmmac platform data structure. iLet's provide a generic implementation that glue drivers can use, and convert dwc-qos-eth to use it. ==================== Link: https://patch.msgid.link/Z_Yn3dJjzcOi32uU@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10net: stmmac: dwc-qos: use stmmac_pltfr_find_clk()Russell King (Oracle)
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/E1u2QO9-001Rp8-Ii@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10net: stmmac: provide stmmac_pltfr_find_clk()Russell King (Oracle)
Provide a generic way to find a clock in the bulk data. Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1u2QO4-001Rp2-Dy@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10Merge branch 'tcp-add-a-new-tw_paws-drop-reason'Jakub Kicinski
Jiayuan Chen says: ==================== tcp: add a new TW_PAWS drop reason Devices in the networking path, such as firewalls, NATs, or routers, which can perform SNAT or DNAT, use addresses from their own limited address pools to masquerade the source address during forwarding, causing PAWS verification to fail more easily under TW status. Currently, packet loss statistics for PAWS can only be viewed through MIB, which is a global metric and cannot be precisely obtained through tracing to get the specific 4-tuple of the dropped packet. In the past, we had to use kprobe ret to retrieve relevant skb information from tcp_timewait_state_process(). We add a drop_reason pointer and a new counter. I didn't provide a packetdrill script. I struggled for a long time to get packetdrill to fix the client port, but ultimately failed to do so... Instead, I wrote my own program to trigger PAWS, which can be found at https://github.com/mrpre/nettrigger/tree/main ''' //assume nginx running on 172.31.75.114:9999, current host is 172.31.75.115 iptables -t filter -I OUTPUT -p tcp --sport 12345 --tcp-flags RST RST -j DROP ./nettrigger -i eth0 -s 172.31.75.115:12345 -d 172.31.75.114:9999 -action paws ''' v2: https://lore.kernel.org/5cdc1bdd9caee92a6ae932638a862fd5c67630e8@linux.dev v3: https://lore.kernel.org/20250407140001.13886-1-jiayuan.chen@linux.dev ==================== Link: https://patch.msgid.link/20250409112614.16153-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tcp: add LINUX_MIB_PAWS_TW_REJECTED counterJiayuan Chen
When TCP is in TIME_WAIT state, PAWS verification uses LINUX_PAWSESTABREJECTED, which is ambiguous and cannot be distinguished from other PAWS verification processes. We added a new counter, like the existing PAWS_OLD_ACK one. Also we update the doc with previously missing PAWS_OLD_ACK. usage: ''' nstat -az | grep PAWSTimewait TcpExtPAWSTimewait 1 0.0 ''' Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250409112614.16153-3-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10tcp: add TCP_RFC7323_TW_PAWS drop reasonJiayuan Chen
Devices in the networking path, such as firewalls, NATs, or routers, which can perform SNAT or DNAT, use addresses from their own limited address pools to masquerade the source address during forwarding, causing PAWS verification to fail more easily. Currently, packet loss statistics for PAWS can only be viewed through MIB, which is a global metric and cannot be precisely obtained through tracing to get the specific 4-tuple of the dropped packet. In the past, we had to use kprobe ret to retrieve relevant skb information from tcp_timewait_state_process(). We add a drop_reason pointer, similar to what previous commit does: commit e34100c2ecbb ("tcp: add a drop_reason pointer to tcp_check_req()") This commit addresses the PAWSESTABREJECTED case and also sets the corresponding drop reason. We use 'pwru' to test. Before this commit: '''' ./pwru 'port 9999' 2025/04/07 13:40:19 Listening for events.. TUPLE FUNC 172.31.75.115:12345->172.31.75.114:9999(tcp) sk_skb_reason_drop(SKB_DROP_REASON_NOT_SPECIFIED) ''' After this commit: ''' ./pwru 'port 9999' 2025/04/07 13:51:34 Listening for events.. TUPLE FUNC 172.31.75.115:12345->172.31.75.114:9999(tcp) sk_skb_reason_drop(SKB_DROP_REASON_TCP_RFC7323_TW_PAWS) ''' Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250409112614.16153-2-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10af_unix: Remove unix_unhash()Michal Luczaj
Dummy unix_unhash() was introduced for sockmap in commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap"), but there's no need to implement it anymore. ->unhash() is only called conditionally: in unix_shutdown() since commit d359902d5c35 ("af_unix: Fix NULL pointer bug in unix_shutdown"), and in BPF proto's sock_map_unhash() since commit 5b4a79ba65a1 ("bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself"). Remove it. Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250409-cleanup-drop-unix-unhash-v1-1-1659e5b8ee84@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc2). Conflict: Documentation/networking/netdevices.rst net/core/lock_debug.c 04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE") 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10Merge tag 'net-6.15-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Current release - regressions: - core: hold instance lock during NETDEV_CHANGE - rtnetlink: fix bad unlock balance in do_setlink() - ipv6: - fix null-ptr-deref in addrconf_add_ifaddr() - align behavior across nexthops during path selection Previous releases - regressions: - sctp: prevent transport UaF in sendmsg - mptcp: only inc MPJoinAckHMacFailure for HMAC failures Previous releases - always broken: - sched: - make ->qlen_notify() idempotent - ensure sufficient space when sending filter netlink notifications - sch_sfq: really don't allow 1 packet limit - netfilter: fix incorrect avx2 match of 5th field octet - tls: explicitly disallow disconnect - eth: octeontx2-pf: fix VF root node parent queue priority" * tag 'net-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits) ethtool: cmis_cdb: Fix incorrect read / write length extension selftests: netfilter: add test case for recent mismatch bug nft_set_pipapo: fix incorrect avx2 match of 5th field octet net: ppp: Add bound checking for skb data on ppp_sync_txmung net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod. ipv6: Align behavior across nexthops during path selection net: phy: allow MDIO bus PM ops to start/stop state machine for phylink-controlled PHY net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend() selftests/tc-testing: sfq: check that a derived limit of 1 is rejected net_sched: sch_sfq: move the limit validation net_sched: sch_sfq: use a temporary work area for validating configuration net: libwx: handle page_pool_dev_alloc_pages error selftests: mptcp: validate MPJoin HMacFailure counters mptcp: only inc MPJoinAckHMacFailure for HMAC failures rtnetlink: Fix bad unlock balance in do_setlink(). net: ethtool: Don't call .cleanup_data when prepare_data fails tc: Ensure we have enough buffer space when sending filter netlink notifications net: libwx: Fix the wrong Rx descriptor field octeontx2-pf: qos: fix VF root node parent queue index selftests: tls: check that disconnect does nothing ...
2025-04-10Merge tag 'for-linus-6.15a-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - A simple fix adding the module description of the Xenbus frontend module - A fix correcting the xen-acpi-processor Kconfig dependency for PVH Dom0 support - A fix for the Xen balloon driver when running as Xen Dom0 in PVH mode - A fix for PVH Dom0 in order to avoid problems with CPU idle and frequency drivers conflicting with Xen * tag 'for-linus-6.15a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: disable CPU idle and frequency drivers for PVH dom0 x86/xen: fix balloon target initialization for PVH dom0 xen: Change xen-acpi-processor dom0 dependency xenbus: add module description
2025-04-10Merge tag 'block-6.15-20250410' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - Add a missing ublk selftest script, from test additions added last week - Two fixes for ublk error recovery and reissue - Cleanup of ublk argument passing * tag 'block-6.15-20250410' of git://git.kernel.dk/linux: ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd * ublk: don't fail request for recovery & reissue in case of ubq->canceling ublk: fix handling recovery & reissue in ublk_abort_queue() selftests: ublk: fix test_stripe_04
2025-04-10Merge tag 'io_uring-6.15-20250410' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Reject zero sized legacy provided buffers upfront. No ill side effects from this one, only really done to shut up a silly syzbot test case. - Fix for a regression in tag posting for registered files or buffers, where the tag would be posted even when the registration failed. - two minor zcrx cleanups for code added this merge window. * tag 'io_uring-6.15-20250410' of git://git.kernel.dk/linux: io_uring/kbuf: reject zero sized provided buffers io_uring/zcrx: separate niov number from pages io_uring/zcrx: put refill data into separate cache line io_uring: don't post tag CQEs on file/buffer registration failure
2025-04-10Merge tag 'gpio-fixes-for-v6.15-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix resource handling in gpio-tegra186 - fix wakeup source leaks in gpio-mpc8xxx and gpio-zynq - fix minor issues with some GPIO OF quirks - deprecate GPIOD_FLAGS_BIT_NONEXCLUSIVE and devm_gpiod_unhinge() symbols and add a TODO task to track replacing them with a better solution * tag 'gpio-fixes-for-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpiolib: of: Move Atmel HSMCI quirk up out of the regulator comment gpiolib: of: Fix the choice for Ingenic NAND quirk gpio: zynq: Fix wakeup source leaks on device unbind gpio: mpc8xxx: Fix wakeup source leaks on device unbind gpio: TODO: track the removal of regulator-related workarounds MAINTAINERS: add more keywords for the GPIO subsystem entry gpio: deprecate devm_gpiod_unhinge() gpio: deprecate the GPIOD_FLAGS_BIT_NONEXCLUSIVE flag gpio: tegra186: fix resource handling in ACPI probe path
2025-04-10Merge tag 'mtd/fixes-for-6.15-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux Pull mtd fixes from Miquel Raynal: "Two important fixes: the build of the SPI NAND layer with old GCC versions as well as the fix of the Qpic Makefile which was wrong in the first place. There are also two smaller fixes about a missing error and status check" * tag 'mtd/fixes-for-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: mtd: spinand: Fix build with gcc < 7.5 mtd: rawnand: Add status chack in r852_ready() mtd: inftlcore: Add error check for inftl_read_oob() mtd: nand: Drop explicit test for built-in CONFIG_SPI_QPIC_SNAND
2025-04-10ethtool: cmis_cdb: Fix incorrect read / write length extensionIdo Schimmel
The 'read_write_len_ext' field in 'struct ethtool_cmis_cdb_cmd_args' stores the maximum number of bytes that can be read from or written to the Local Payload (LPL) page in a single multi-byte access. Cited commit started overwriting this field with the maximum number of bytes that can be read from or written to the Extended Payload (LPL) pages in a single multi-byte access. Transceiver modules that support auto paging can advertise a number larger than 255 which is problematic as 'read_write_len_ext' is a 'u8', resulting in the number getting truncated and firmware flashing failing [1]. Fix by ignoring the maximum EPL access size as the kernel does not currently support auto paging (even if the transceiver module does) and will not try to read / write more than 128 bytes at once. [1] Transceiver module firmware flashing started for device enp177s0np0 Transceiver module firmware flashing in progress for device enp177s0np0 Progress: 0% Transceiver module firmware flashing encountered an error for device enp177s0np0 Status message: Write FW block EPL command failed, LPL length is longer than CDB read write length extension allows. Fixes: 9a3b0d078bd8 ("net: ethtool: Add support for writing firmware blocks using EPL payload") Reported-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Closes: https://lore.kernel.org/netdev/20250402183123.321036-3-michael.chan@broadcom.com/ Tested-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250409112440.365672-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-10Merge tag 'nf-25-04-10' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains a Netfilter fix and improved test coverage: 1) Fix AVX2 matching in nft_pipapo, from Florian Westphal. 2) Extend existing test to improve coverage for the aforementioned bug, also from Florian. netfilter pull request 25-04-10 * tag 'nf-25-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: selftests: netfilter: add test case for recent mismatch bug nft_set_pipapo: fix incorrect avx2 match of 5th field octet ==================== Link: https://patch.msgid.link/20250410103647.1030244-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-10selftests: netfilter: add test case for recent mismatch bugFlorian Westphal
Without 'nft_set_pipapo: fix incorrect avx2 match of 5th field octet" this fails: TEST: reported issues Add two elements, flush, re-add 1s [ OK ] net,mac with reload 0s [ OK ] net,port,proto 3s [ OK ] avx2 false match 0s [FAIL] False match for fe80:dead:01fe:0a02:0b03:6007:8009:a001 Other tests do not detect the kernel bug as they only alter parts in the /64 netmask. Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-10nft_set_pipapo: fix incorrect avx2 match of 5th field octetFlorian Westphal
Given a set element like: icmpv6 . dead:beef:00ff::1 The value of 'ff' is irrelevant, any address will be matched as long as the other octets are the same. This is because of too-early register clobbering: ymm7 is reloaded with new packet data (pkt[9]) but it still holds data of an earlier load that wasn't processed yet. The existing tests in nft_concat_range.sh selftests do exercise this code path, but do not trigger incorrect matching due to the network prefix limitation. Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Reported-by: sontu mazumdar <sontu21@gmail.com> Closes: https://lore.kernel.org/netfilter/CANgxkqwnMH7fXra+VUfODT-8+qFLgskq3set1cAzqqJaV4iEZg@mail.gmail.com/T/#t Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-04-10net: ppp: Add bound checking for skb data on ppp_sync_txmungArnaud Lecomte
Ensure we have enough data in linear buffer from skb before accessing initial bytes. This prevents potential out-of-bounds accesses when processing short packets. When ppp_sync_txmung receives an incoming package with an empty payload: (remote) gef➤ p *(struct pppoe_hdr *) (skb->head + skb->network_header) $18 = { type = 0x1, ver = 0x1, code = 0x0, sid = 0x2, length = 0x0, tag = 0xffff8880371cdb96 } from the skb struct (trimmed) tail = 0x16, end = 0x140, head = 0xffff88803346f400 "4", data = 0xffff88803346f416 ":\377", truesize = 0x380, len = 0x0, data_len = 0x0, mac_len = 0xe, hdr_len = 0x0, it is not safe to access data[2]. Reported-by: syzbot+29fc8991b0ecb186cf40@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=29fc8991b0ecb186cf40 Tested-by: syzbot+29fc8991b0ecb186cf40@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Link: https://patch.msgid.link/20250408-bound-checking-ppp_txmung-v2-1-94bb6e1b92d0@arnaud-lcm.com [pabeni@redhat.com: fixed subj typo] Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-09net: txgbe: add sriov function supportMengyuan Lou
Add sriov_configure for driver ops. Add mailbox handler wx_msg_task for txgbe. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/ECDC57CF4F2316B9+20250408091556.9640-7-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: ngbe: add sriov function supportMengyuan Lou
Add sriov_configure for driver ops. Add mailbox handler wx_msg_task for ngbe in the interrupt handler. Add the notification flow when the vfs exist. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/C9A0A43732966022+20250408091556.9640-6-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: libwx: Add msg task funcMengyuan Lou
Implement wx_msg_task which is used to process mailbox messages sent by vf. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/8257B39B95CDB469+20250408091556.9640-5-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: libwx: Redesign flow when sriov is enabledMengyuan Lou
Reallocate queue and int resources when sriov is enabled. Redefine macro VMDQ to make it work in VT mode. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/64B616774ABE3C5A+20250408091556.9640-4-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: libwx: Add sriov api for wangxun nicsMengyuan Lou
Implement sriov_configure interface for wangxun nics in libwx. Enable VT mode and initialize vf control structure, when sriov is enabled. Do not be allowed to disable sriov when vfs are assigned. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/81EA45C21B0A98B0+20250408091556.9640-3-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: libwx: Add mailbox api for wangxun pf driversMengyuan Lou
Implements the mailbox interfaces for wangxun pf drivers ngbe and txgbe. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://patch.msgid.link/70017BD4D67614A4+20250408091556.9640-2-mengyuanlou@net-swift.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: ethernet: cortina: Use TOE/TSO on all TCPLinus Walleij
It is desireable to push the hardware accelerator to also process non-segmented TCP frames: we pass the skb->len to the "TOE/TSO" offloader and it will handle them. Without this quirk the driver becomes unstable and lock up and and crash. I do not know exactly why, but it is probably due to the TOE (TCP offload engine) feature that is coupled with the segmentation feature - it is not possible to turn one part off and not the other, either both TOE and TSO are active, or neither of them. Not having the TOE part active seems detrimental, as if that hardware feature is not really supposed to be turned off. The datasheet says: "Based on packet parsing and TCP connection/NAT table lookup results, the NetEngine puts the packets belonging to the same TCP connection to the same queue for the software to process. The NetEngine puts incoming packets to the buffer or series of buffers for a jumbo packet. With this hardware acceleration, IP/TCP header parsing, checksum validation and connection lookup are offloaded from the software processing." After numerous tests with the hardware locking up after something between minutes and hours depending on load using iperf3 I have concluded this is necessary to stabilize the hardware. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Link: https://patch.msgid.link/20250408-gemini-ethernet-tso-always-v1-1-e669f932359c@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09Merge branch ↵Jakub Kicinski
'bridge-prevent-unicast-arp-ns-packets-from-being-suppressed-by-bridge' Amit Cohen says: ==================== bridge: Prevent unicast ARP/NS packets from being suppressed by bridge Currently, unicast ARP requests/NS packets are replied by bridge when suppression is enabled, then they are also forwarded, which results two replicas of ARP reply/NA - one from the bridge and second from the target. The purpose of ARP/ND suppression is to reduce flooding in the broadcast domain, which is not relevant for unicast packets. In addition, the use case of unicast ARP/NS is to poll a specific host, so it does not make sense to have the switch answer on behalf of the host. Forward ARP requests/NS packets and prevent the bridge from replying to them. Patch set overview: Patch #1 prevents unicast ARP/NS packets from being suppressed by bridge Patch #2 adds test cases for unicast ARP/NS with suppression enabled ==================== Link: https://patch.msgid.link/cover.1744123493.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09selftests: test_bridge_neigh_suppress: Test unicast ARP/NS with suppressionAmit Cohen
Add test cases to check that unicast ARP/NS packets are replied once, even if ARP/ND suppression is enabled. Without the previous patch: $ ./test_bridge_neigh_suppress.sh ... Unicast ARP, per-port ARP suppression - VLAN 10 ----------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast ARP, suppression on, h1 filter [FAIL] TEST: Unicast ARP, suppression on, h2 filter [ OK ] Unicast ARP, per-port ARP suppression - VLAN 20 ----------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast ARP, suppression on, h1 filter [FAIL] TEST: Unicast ARP, suppression on, h2 filter [ OK ] ... Unicast NS, per-port NS suppression - VLAN 10 --------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast NS, suppression on, h1 filter [FAIL] TEST: Unicast NS, suppression on, h2 filter [ OK ] Unicast NS, per-port NS suppression - VLAN 20 --------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast NS, suppression on, h1 filter [FAIL] TEST: Unicast NS, suppression on, h2 filter [ OK ] ... Tests passed: 156 Tests failed: 4 With the previous patch: $ ./test_bridge_neigh_suppress.sh ... Unicast ARP, per-port ARP suppression - VLAN 10 ----------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast ARP, suppression on, h1 filter [ OK ] TEST: Unicast ARP, suppression on, h2 filter [ OK ] Unicast ARP, per-port ARP suppression - VLAN 20 ----------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast ARP, suppression on, h1 filter [ OK ] TEST: Unicast ARP, suppression on, h2 filter [ OK ] ... Unicast NS, per-port NS suppression - VLAN 10 --------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast NS, suppression on, h1 filter [ OK ] TEST: Unicast NS, suppression on, h2 filter [ OK ] Unicast NS, per-port NS suppression - VLAN 20 --------------------------------------------- TEST: "neigh_suppress" is on [ OK ] TEST: Unicast NS, suppression on, h1 filter [ OK ] TEST: Unicast NS, suppression on, h2 filter [ OK ] ... Tests passed: 160 Tests failed: 0 Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/dc240b9649b31278295189f412223f320432c5f2.1744123493.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: bridge: Prevent unicast ARP/NS packets from being suppressed by bridgeAmit Cohen
When Proxy ARP or ARP/ND suppression are enabled, ARP/NS packets can be handled by bridge in br_do_proxy_suppress_arp()/br_do_suppress_nd(). For broadcast packets, they are replied by bridge, but later they are not flooded. Currently, unicast packets are replied by bridge when suppression is enabled, and they are also forwarded, which results two replicas of ARP reply/NA - one from the bridge and second from the target. RFC 1122 describes use case for unicat ARP packets - "unicast poll" - actively poll the remote host by periodically sending a point-to-point ARP request to it, and delete the entry if no ARP reply is received from N successive polls. The purpose of ARP/ND suppression is to reduce flooding in the broadcast domain. If a host is sending a unicast ARP/NS, then it means it already knows the address and the switches probably know it as well and there will not be any flooding. In addition, the use case of unicast ARP/NS is to poll a specific host, so it does not make sense to have the switch answer on behalf of the host. According to RFC 9161: "A PE SHOULD reply to broadcast/multicast address resolution messages, i.e., ARP Requests, ARP probes, NS messages, as well as DAD NS messages. An ARP probe is an ARP Request constructed with an all-zero sender IP address that may be used by hosts for IPv4 Address Conflict Detection as specified in [RFC5227]. A PE SHOULD NOT reply to unicast address resolution requests (for instance, NUD NS messages)." Forward such requests and prevent the bridge from replying to them. Reported-by: Denis Yulevych <denisyu@nvidia.com> Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/6bf745a149ddfe5e6be8da684a63aa574a326f8d.1744123493.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.Kuniyuki Iwashima
When I ran the repro [0] and waited a few seconds, I observed two LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1] Reproduction Steps: 1) Mount CIFS 2) Add an iptables rule to drop incoming FIN packets for CIFS 3) Unmount CIFS 4) Unload the CIFS module 5) Remove the iptables rule At step 3), the CIFS module calls sock_release() for the underlying TCP socket, and it returns quickly. However, the socket remains in FIN_WAIT_1 because incoming FIN packets are dropped. At this point, the module's refcnt is 0 while the socket is still alive, so the following rmmod command succeeds. # ss -tan State Recv-Q Send-Q Local Address:Port Peer Address:Port FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445 # lsmod | grep cifs cifs 1159168 0 This highlights a discrepancy between the lifetime of the CIFS module and the underlying TCP socket. Even after CIFS calls sock_release() and it returns, the TCP socket does not die immediately in order to close the connection gracefully. While this is generally fine, it causes an issue with LOCKDEP because CIFS assigns a different lock class to the TCP socket's sk->sk_lock using sock_lock_init_class_and_name(). Once an incoming packet is processed for the socket or a timer fires, sk->sk_lock is acquired. Then, LOCKDEP checks the lock context in check_wait_context(), where hlock_class() is called to retrieve the lock class. However, since the module has already been unloaded, hlock_class() logs a warning and returns NULL, triggering the null-ptr-deref. If LOCKDEP is enabled, we must ensure that a module calling sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded while such a socket is still alive to prevent this issue. Let's hold the module reference in sock_lock_init_class_and_name() and release it when the socket is freed in sk_prot_free(). Note that sock_lock_init() clears sk->sk_owner for svc_create_socket() that calls sock_lock_init_class_and_name() for a listening socket, which clones a socket by sk_clone_lock() without GFP_ZERO. [0]: CIFS_SERVER="10.0.0.137" CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST" DEV="enp0s3" CRED="/root/WindowsCredential.txt" MNT=$(mktemp -d /tmp/XXXXXX) mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1 iptables -A INPUT -s ${CIFS_SERVER} -j DROP for i in $(seq 10); do umount ${MNT} rmmod cifs sleep 1 done rm -r ${MNT} iptables -D INPUT -s ${CIFS_SERVER} -j DROP [1]: DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) ... Call Trace: <IRQ> __lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178) lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ... BUG: kernel NULL pointer dereference, address: 00000000000000c4 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36 Tainted: [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178) Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44 RSP: 0018:ffa0000000468a10 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027 RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88 RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88 R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1 FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1)) ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234) ip_sublist_rcv_finish (net/ipv4/ip_input.c:576) ip_list_rcv_finish (net/ipv4/ip_input.c:628) ip_list_rcv (net/ipv4/ip_input.c:670) __netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986) netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129) napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496) e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815) __napi_poll.constprop.0 (net/core/dev.c:7191) net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382) handle_softirqs (kernel/softirq.c:561) __irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662) irq_exit_rcu (kernel/softirq.c:680) common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14)) </IRQ> <TASK> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693) RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744) Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202 RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5 RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118) do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1)) start_secondary (arch/x86/kernel/smpboot.c:315) common_startup_64 (arch/x86/kernel/head_64.S:421) </TASK> Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CR2: 00000000000000c4 Fixes: ed07536ed673 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250407163313.22682-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: remove cpu stall in txq_trans_update()Eric Dumazet
txq_trans_update() currently uses txq->xmit_lock_owner to conditionally update txq->trans_start. For regular devices, txq->xmit_lock_owner is updated from HARD_TX_LOCK() and HARD_TX_UNLOCK(), and this apparently causes cpu stalls. Using dev->lltx, which sits in a read-mostly cache-line, and already used in HARD_TX_LOCK() and HARD_TX_UNLOCK() helps cpu prediction. On an AMD EPYC 7B12 dual socket server, tcp_rr with 128 threads and 30,000 flows gets a 5 % increase in throughput. As explained in commit 95ecba62e2fd ("net: fix races in netdev_tx_sent_queue()/dev_watchdog()") I am planning to no longer update txq->trans_start in the fast path in a followup patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250408202742.2145516-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09ipv6: Align behavior across nexthops during path selectionIdo Schimmel
A nexthop is only chosen when the calculated multipath hash falls in the nexthop's hash region (i.e., the hash is smaller than the nexthop's hash threshold) and when the nexthop is assigned a non-negative score by rt6_score_route(). Commit 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop") introduced an unintentional difference between the first nexthop and the rest when the score is negative. When the first nexthop matches, but has a negative score, the code will currently evaluate subsequent nexthops until one is found with a non-negative score. On the other hand, when a different nexthop matches, but has a negative score, the code will fallback to the nexthop with which the selection started ('match'). Align the behavior across all nexthops and fallback to 'match' when the first nexthop matches, but has a negative score. Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N") Fixes: 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop") Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Closes: https://lore.kernel.org/netdev/67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250408084316.243559-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09octeontx2-pf: Add error log forcn10k_map_unmap_rq_policer()Wentao Liang
The cn10k_free_matchall_ipolicer() calls the cn10k_map_unmap_rq_policer() for each queue in a for loop without checking for any errors. Check the return value of the cn10k_map_unmap_rq_policer() function during each loop, and report a warning if the function fails. Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250408032602.2909-1-vulab@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: phy: allow MDIO bus PM ops to start/stop state machine for ↵Vladimir Oltean
phylink-controlled PHY DSA has 2 kinds of drivers: 1. Those who call dsa_switch_suspend() and dsa_switch_resume() from their device PM ops: qca8k-8xxx, bcm_sf2, microchip ksz 2. Those who don't: all others. The above methods should be optional. For type 1, dsa_switch_suspend() calls dsa_user_suspend() -> phylink_stop(), and dsa_switch_resume() calls dsa_user_resume() -> phylink_start(). These seem good candidates for setting mac_managed_pm = true because that is essentially its definition [1], but that does not seem to be the biggest problem for now, and is not what this change focuses on. Talking strictly about the 2nd category of DSA drivers here (which do not have MAC managed PM, meaning that for their attached PHYs, mdio_bus_phy_suspend() and mdio_bus_phy_resume() should run in full), I have noticed that the following warning from mdio_bus_phy_resume() is triggered: WARN_ON(phydev->state != PHY_HALTED && phydev->state != PHY_READY && phydev->state != PHY_UP); because the PHY state machine is running. It's running as a result of a previous dsa_user_open() -> ... -> phylink_start() -> phy_start() having been initiated by the user. The previous mdio_bus_phy_suspend() was supposed to have called phy_stop_machine(), but it didn't. So this is why the PHY is in state PHY_NOLINK by the time mdio_bus_phy_resume() runs. mdio_bus_phy_suspend() did not call phy_stop_machine() because for phylink, the phydev->adjust_link function pointer is NULL. This seems a technicality introduced by commit fddd91016d16 ("phylib: fix PAL state machine restart on resume"). That commit was written before phylink existed, and was intended to avoid crashing with consumer drivers which don't use the PHY state machine - phylink always does, when using a PHY. But phylink itself has historically not been developed with suspend/resume in mind, and apparently not tested too much in that scenario, allowing this bug to exist unnoticed for so long. Plus, prior to the WARN_ON(), it would have likely been invisible. This issue is not in fact restricted to type 2 DSA drivers (according to the above ad-hoc classification), but can be extrapolated to any MAC driver with phylink and MDIO-bus-managed PHY PM ops. DSA is just where the issue was reported. Assuming mac_managed_pm is set correctly, a quick search indicates the following other drivers might be affected: $ grep -Zlr PHYLINK_NETDEV drivers/ | xargs -0 grep -L mac_managed_pm drivers/net/ethernet/atheros/ag71xx.c drivers/net/ethernet/microchip/sparx5/sparx5_main.c drivers/net/ethernet/microchip/lan966x/lan966x_main.c drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c drivers/net/ethernet/freescale/dpaa/dpaa_eth.c drivers/net/ethernet/freescale/ucc_geth.c drivers/net/ethernet/freescale/enetc/enetc_pf_common.c drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c drivers/net/ethernet/marvell/mvneta.c drivers/net/ethernet/marvell/prestera/prestera_main.c drivers/net/ethernet/mediatek/mtk_eth_soc.c drivers/net/ethernet/altera/altera_tse_main.c drivers/net/ethernet/wangxun/txgbe/txgbe_phy.c drivers/net/ethernet/meta/fbnic/fbnic_phylink.c drivers/net/ethernet/tehuti/tn40_phy.c drivers/net/ethernet/mscc/ocelot_net.c Make the existing conditions dependent on the PHY device having a phydev->phy_link_change() implementation equal to the default phy_link_change() provided by phylib. Otherwise, we implicitly know that the phydev has the phylink-provided phylink_phy_change() callback, and when phylink is used, the PHY state machine always needs to be stopped/ started on the suspend/resume path. The code is structured as such that if phydev->phy_link_change() is absent, it is a matter of time until the kernel will crash - no need to further complicate the test. Thus, for the situation where the PM is not managed by the MAC, we will make the MDIO bus PM ops treat identically the phylink-controlled PHYs with the phylib-controlled PHYs where an adjust_link() callback is supplied. In both cases, the MDIO bus PM ops should stop and restart the PHY state machine. [1] https://lore.kernel.org/netdev/Z-1tiW9zjcoFkhwc@shell.armlinux.org.uk/ Fixes: 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state") Reported-by: Wei Fang <wei.fang@nxp.com> Tested-by: Wei Fang <wei.fang@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250407094042.2155633-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()Vladimir Oltean
In an upcoming change, mdio_bus_phy_may_suspend() will need to distinguish a phylib-based PHY client from a phylink PHY client. For that, it will need to compare the phydev->phy_link_change() function pointer with the eponymous phy_link_change() provided by phylib. To avoid forward function declarations, the default PHY link state change method should be moved upwards. There is no functional change associated with this patch, it is only to reduce the noise from a real bug fix. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250407093900.2155112-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>