Age | Commit message (Collapse) | Author |
|
Up till now the code to start HSR announce timer, which triggers sending
supervisory frames, was assuming that hsr_netdev_notify() would be called
at least twice for hsrX interface. This was required to have different
values for old and current values of network device's operstate.
This is problematic for a case where hsrX interface is already in the
operational state when hsr_netdev_notify() is called, so timer is not
configured to trigger and as a result the hsrX is not sending supervisory
frames to HSR ring.
This error has been discovered when hsr_ping.sh script was run. To be
more specific - for the hsr1 and hsr2 the hsr_netdev_notify() was
called at least twice with different IF_OPER_{LOWERDOWN|DOWN|UP} states
assigned in hsr_check_carrier_and_operstate(hsr). As a result there was
no issue with sending supervisory frames.
However, with hsr3, the notify function was called only once with
operstate set to IF_OPER_UP and timer responsible for triggering
supervisory frames was not fired.
The solution is to use netif_oper_up() and netif_running() helper
functions to assess if network hsrX device is up.
Only then, when the timer is not already pending, it is started.
Otherwise it is deactivated.
Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240507111214.3519800-1-lukma@denx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
route_dumpit() already relies on RCU, RTNL is not needed.
Also change return value at the end of a dump.
This allows NLMSG_DONE to be appended to the current
skb at the end of a dump, saving a couple of recvmsg()
system calls.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Remi Denis-Courmont <courmisch@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240507121748.416287-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot is able to trigger the following crash [1],
caused by unsafe ip6_dst_idev() use.
Indeed ip6_dst_idev() can return NULL, and must always be checked.
[1]
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 31648 Comm: syz-executor.0 Not tainted 6.9.0-rc4-next-20240417-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
RIP: 0010:__fib6_rule_action net/ipv6/fib6_rules.c:237 [inline]
RIP: 0010:fib6_rule_action+0x241/0x7b0 net/ipv6/fib6_rules.c:267
Code: 02 00 00 49 8d 9f d8 00 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 f9 32 bf f7 48 8b 1b 48 89 d8 48 c1 e8 03 <42> 80 3c 20 00 74 08 48 89 df e8 e0 32 bf f7 4c 8b 03 48 89 ef 4c
RSP: 0018:ffffc9000fc1f2f0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 1a772f98c8186700
RDX: 0000000000000003 RSI: ffffffff8bcac4e0 RDI: ffffffff8c1f9760
RBP: ffff8880673fb980 R08: ffffffff8fac15ef R09: 1ffffffff1f582bd
R10: dffffc0000000000 R11: fffffbfff1f582be R12: dffffc0000000000
R13: 0000000000000080 R14: ffff888076509000 R15: ffff88807a029a00
FS: 00007f55e82ca6c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b31d23000 CR3: 0000000022b66000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
fib_rules_lookup+0x62c/0xdb0 net/core/fib_rules.c:317
fib6_rule_lookup+0x1fd/0x790 net/ipv6/fib6_rules.c:108
ip6_route_output_flags_noref net/ipv6/route.c:2637 [inline]
ip6_route_output_flags+0x38e/0x610 net/ipv6/route.c:2649
ip6_route_output include/net/ip6_route.h:93 [inline]
ip6_dst_lookup_tail+0x189/0x11a0 net/ipv6/ip6_output.c:1120
ip6_dst_lookup_flow+0xb9/0x180 net/ipv6/ip6_output.c:1250
sctp_v6_get_dst+0x792/0x1e20 net/sctp/ipv6.c:326
sctp_transport_route+0x12c/0x2e0 net/sctp/transport.c:455
sctp_assoc_add_peer+0x614/0x15c0 net/sctp/associola.c:662
sctp_connect_new_asoc+0x31d/0x6c0 net/sctp/socket.c:1099
__sctp_connect+0x66d/0xe30 net/sctp/socket.c:1197
sctp_connect net/sctp/socket.c:4819 [inline]
sctp_inet_connect+0x149/0x1f0 net/sctp/socket.c:4834
__sys_connect_file net/socket.c:2048 [inline]
__sys_connect+0x2df/0x310 net/socket.c:2065
__do_sys_connect net/socket.c:2075 [inline]
__se_sys_connect net/socket.c:2072 [inline]
__x64_sys_connect+0x7a/0x90 net/socket.c:2072
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: 5e5f3f0f8013 ("[IPV6] ADDRCONF: Convert ipv6_get_saddr() to ipv6_dev_get_saddr().")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240507163145.835254-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There is no need to use this_cpu_ptr(dst_cache->cache) twice.
Compiler is unable to optimize the second call, because of
per-cpu constraints.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240507132717.627518-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dst_cache->reset_ts is read or written locklessly,
add READ_ONCE() and WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240507132000.614591-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Only generate one ACK packet for all the subpackets in a jumbo packet. If
we would like to generate more than one ACK, we prioritise them base on
their reason code, in the order, highest first:
OutOfSeq > NoSpace > ExceedsWin > Duplicate > Requested > Delay > Idle
For the first four, we reference the lowest offending subpacket; for the
last three, the highest.
This reduces the number of ACKs we end up transmitting to one per UDP
packet transmitted to reduce network loading and packet parsing.
Fixes: 5d7edbc9231e ("rxrpc: Get rid of the Rx ring")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Reviewed-by: Jeffrey Altman <jaltman@auristor.com <mailto:jaltman@auristor.com>>
Link: https://lore.kernel.org/r/20240503150749.1001323-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Make the following fixes to the congestion control algorithm:
(1) Don't vary the cwnd starting value by the size of RXRPC_TX_SMSS since
that's currently held constant - set to the size of a jumbo subpacket
payload so that we can create jumbo packets on the fly. The current
code invariably picks 3 as the starting value.
Further, the starting cwnd needs to be an even number because we ack
every other packet, so set it to 4.
(2) Don't cut ssthresh when we see an ACK come from the peer with a
receive window (rwind) less than ssthresh. ssthresh keeps track of
characteristics of the connection whereas rwind may be reduced by the
peer for any reason - and may be reduced to 0.
Fixes: 1fc4fa2ac93d ("rxrpc: Fix congestion management")
Fixes: 0851115090a3 ("rxrpc: Reduce ssthresh to peer's receive window")
Signed-off-by: David Howells <dhowells@redhat.com>
Suggested-by: Simon Wilkinson <sxw@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Reviewed-by: Jeffrey Altman <jaltman@auristor.com <mailto:jaltman@auristor.com>>
Link: https://lore.kernel.org/r/20240503150749.1001323-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the explicit call to "return" in the void ax25_ds_set_timer
function that was introduced in 78a7b5dbc060 ("ax.25: x.25: Remove the
now superfluous sentinel elements from ctl_table array").
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Let's make all IPVS sysctls writtable even when
network namespace is owned by non-initial user namespace.
Let's make a few sysctls to be read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice
I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.
This patch is motivated by user request in the LXC
project [1]. Having this can help with running some
Kubernetes [2] or Docker Swarm [3] workloads inside the system
containers.
Link: https://github.com/lxc/lxc/issues/4278 [1]
Link: https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103 [2]
Link: https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682 [3]
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Simon Horman <horms@verge.net.au>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Simon Horman <horms@verge.net.au>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As it was done in commit fc1092f51567 ("ipv4: Fix uninit-value access in
__ip_make_skb()") for IPv4, check FLOWI_FLAG_KNOWN_NH on fl6->flowi6_flags
instead of testing HDRINCL on the socket to avoid a race condition which
causes uninit-value access.
Fixes: ea30388baebc ("ipv6: Fix an uninit variable access bug in __ip6_make_skb()")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Enhance the error reporting mechanism in the switchdev framework to
provide more informative and user-friendly error messages.
Following feedback from users struggling to understand the implications
of error messages like "failed (err=-28) to add object (id=2)", this
update aims to clarify what operation failed and how this might impact
the system or network.
With this change, error messages now include a description of the failed
operation, the specific object involved, and a brief explanation of the
potential impact on the system. This approach helps administrators and
developers better understand the context and severity of errors,
facilitating quicker and more effective troubleshooting.
Example of the improved logging:
[ 70.516446] ksz-switch spi0.0 uplink: Failed to add Port Multicast
Database entry (object id=2) with error: -ENOSPC (-28).
[ 70.516446] Failure in updating the port's Multicast Database could
lead to multicast forwarding issues.
[ 70.516446] Current HW/SW setup lacks sufficient resources.
This comprehensive update includes handling for a range of switchdev
object IDs, ensuring that most operations within the switchdev framework
benefit from clearer error reporting.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a broadcast AppleTalk packet is received, prefer queuing it on the
socket whose address matches the address of the interface that received
the packet (and is listening on the correct port). Userspace
applications that handle such packets will usually send a response on
the same socket that received the packet; this fix allows the response
to be sent on the correct interface.
If a socket matching the interface's address is not found, an arbitrary
socket listening on the correct port will be used, if any. This matches
the implementation's previous behavior.
Fixes atalkd's responses to network information requests when multiple
network interfaces are configured to use AppleTalk.
Link: https://lore.kernel.org/netdev/20200722113752.1218-2-vincent.ldev@duvert.net/
Link: https://gist.github.com/VinDuv/4db433b6dce39d51a5b7847ee749b2a4
Signed-off-by: Vincent Duvert <vincent.ldev@duvert.net>
Signed-off-by: Doug Brown <doug@schmorgal.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce a tracepoint for icmp_send, which can help users to get more
detail information conveniently when icmp abnormal events happen.
1. Giving an usecase example:
=============================
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the trouble-shooting
of UDP packet loss issues especially for those network-service
applications.
2. Operation Instructions:
==========================
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable
3. Result View:
================
udp_client_erro-11370 [002] ...s.12 124.728002:
icmp_send: icmp_send: type=3, code=3.
From 127.0.0.1:41895 to 127.0.0.1:6666 ulen=23
skbaddr=00000000589b167a
Signed-off-by: Peilin He <he.peilin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Liu Chun <liu.chun2@zte.com.cn>
Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The change from skb_copy to pskb_copy unfortunately changed the data
copying to omit the ethernet header, since it was pulled before reaching
this point. Fix this by calling __skb_push/pull around pskb_copy.
Fixes: 59c878cbcdd8 ("net: bridge: fix multicast-to-unicast with fraglist GSO")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some switches like Microchip KSZ variants do not support per port DSCP
priority configuration. Instead there is a global DSCP mapping table.
To handle it, we will accept set/del request to any of user ports to
make global configuration and update dcb app entries for all other
ports.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
IEEE 802.1q specification provides recommendation and examples which can
be used as good default values for different drivers.
This patch implements mapping examples documented in IEEE 802.1Q-2022 in
Annex I "I.3 Traffic type to traffic class mapping" and IETF DSCP naming
and mapping DSCP to Traffic Type inspired by RFC8325.
This helpers will be used in followup patches for dsa/microchip DCB
implementation.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add DCB support to get/set trust configuration for different packet
priority information sources. Some switch allow to chose different
source of packet priority classification. For example on KSZ switches it
is possible to configure VLAN PCP and/or DSCP sources.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
XSk infra's been using its own DMA sync shortcut to try avoiding
redundant function calls. Now that there is a generic one, remove
the custom implementation and rely on the generic helpers.
xsk_buff_dma_sync_for_cpu() doesn't need the second argument anymore,
remove it.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We can save a couple more function calls in the Page Pool code if we
check for dma_need_sync() earlier, just when we test pp->p.dma_sync.
Move both these checks into an inline wrapper and call the PP wrapper
over the generic DMA sync function only when both are true.
You can't cache the result of dma_need_sync() in &page_pool, as it may
change anytime if an SWIOTLB buffer is allocated or mapped.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The current behavior is to accept any strings as inputs, this results in
an inconsistent result where an unexisting scheduler can be set:
# sysctl -w net.mptcp.scheduler=notdefault
net.mptcp.scheduler = notdefault
This patch changes this behavior by checking for existing scheduler
before accepting the input.
Fixes: e3b2870b6d22 ("mptcp: add a new sysctl scheduler")
Cc: stable@vger.kernel.org
Signed-off-by: Gregory Detal <gregory.detal@gmail.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240506-upstream-net-20240506-mptcp-sched-exist-v1-1-2ed1529e521e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 7e8cdc97148c ("nfc: Add KCOV annotations") added
kcov_remote_start_common()/kcov_remote_stop() pair into nci_rx_work(),
with an assumption that kcov_remote_stop() is called upon continue of
the for loop. But commit d24b03535e5e ("nfc: nci: Fix uninit-value in
nci_dev_up and nci_ntf_packet") forgot to call kcov_remote_stop() before
break of the for loop.
Reported-by: syzbot <syzbot+0438378d6f157baae1a2@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=0438378d6f157baae1a2
Fixes: d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet")
Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/6d10f829-5a0c-405a-b39a-d7266f3a1a0b@I-love.SAKURA.ne.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
subflow_add_reset_reason(skb, ...) can fail.
We can not assume mptcp_get_ext(skb) always return a non NULL pointer.
syzbot reported:
general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 0 PID: 5098 Comm: syz-executor132 Not tainted 6.9.0-rc6-syzkaller-01478-gcdc74c9d06e7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
RIP: 0010:subflow_v6_route_req+0x2c7/0x490 net/mptcp/subflow.c:388
Code: 8d 7b 07 48 89 f8 48 c1 e8 03 42 0f b6 04 20 84 c0 0f 85 c0 01 00 00 0f b6 43 07 48 8d 1c c3 48 83 c3 18 48 89 d8 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 84 01 00 00 0f b6 5b 01 83 e3 0f 48 89
RSP: 0018:ffffc9000362eb68 EFLAGS: 00010206
RAX: 0000000000000003 RBX: 0000000000000018 RCX: ffff888022039e00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88807d961140 R08: ffffffff8b6cb76b R09: 1ffff1100fb2c230
R10: dffffc0000000000 R11: ffffed100fb2c231 R12: dffffc0000000000
R13: ffff888022bfe273 R14: ffff88802cf9cc80 R15: ffff88802ad5a700
FS: 0000555587ad2380(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f420c3f9720 CR3: 0000000022bfc000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
tcp_conn_request+0xf07/0x32c0 net/ipv4/tcp_input.c:7180
tcp_rcv_state_process+0x183c/0x4500 net/ipv4/tcp_input.c:6663
tcp_v6_do_rcv+0x8b2/0x1310 net/ipv6/tcp_ipv6.c:1673
tcp_v6_rcv+0x22b4/0x30b0 net/ipv6/tcp_ipv6.c:1910
ip6_protocol_deliver_rcu+0xc76/0x1570 net/ipv6/ip6_input.c:438
ip6_input_finish+0x186/0x2d0 net/ipv6/ip6_input.c:483
NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
__netif_receive_skb_one_core net/core/dev.c:5625 [inline]
__netif_receive_skb+0x1ea/0x650 net/core/dev.c:5739
netif_receive_skb_internal net/core/dev.c:5825 [inline]
netif_receive_skb+0x1e8/0x890 net/core/dev.c:5885
tun_rx_batched+0x1b7/0x8f0 drivers/net/tun.c:1549
tun_get_user+0x2f35/0x4560 drivers/net/tun.c:2002
tun_chr_write_iter+0x113/0x1f0 drivers/net/tun.c:2048
call_write_iter include/linux/fs.h:2110 [inline]
new_sync_write fs/read_write.c:497 [inline]
vfs_write+0xa84/0xcb0 fs/read_write.c:590
ksys_write+0x1a0/0x2c0 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: 3e140491dd80 ("mptcp: support rstreason for passive reset")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240506123032.3351895-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Simon reported that ndo_change_mtu() methods were never
updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted
in commit 501a90c94510 ("inet: protect against too small
mtu values.")
We read dev->mtu without holding RTNL in many places,
with READ_ONCE() annotations.
It is time to take care of ndo_change_mtu() methods
to use corresponding WRITE_ONCE()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
make C=1 reports:
warning: Function parameter or struct member 'mrtt' not described in 'ccid2_rtt_estimator'
So document the 'mrtt' parameter.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240505-ccid2_rtt_estimator-kdoc-v1-1-09231fcb9145@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
page_pool::p is driver-defined params, copied directly from the
structure passed to page_pool_create(). The structure isn't meant
to be modified by the Page Pool core code and this even might look
confusing[0][1].
In order to be able to alter some flags, let's define our own, internal
fields the same way as the already existing one (::has_init_callback).
They are defined as bits in the driver-set params, leave them so here
as well, to not waste byte-per-bit or so. Almost 30 bits are still free
for future extensions.
We could've defined only new flags here or only the ones we may need
to alter, but checking some flags in one place while others in another
doesn't sound convenient or intuitive. ::flags passed by the driver can
now go to the "slow" PP params.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link[0]: https://lore.kernel.org/netdev/20230703133207.4f0c54ce@kernel.org
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
After commit 5027ec19f104 ("net: page_pool: split the page_pool_params
into fast and slow") that made &page_pool contain only "hot" params at
the start, cacheline boundary chops frag API fields group in the middle
again.
To not bother with this each time fast params get expanded or shrunk,
let's just align them to `4 * sizeof(long)`, the closest upper pow-2 to
their actual size (2 longs + 1 int). This ensures 16-byte alignment for
the 32-bit architectures and 32-byte alignment for the 64-bit ones,
excluding unnecessary false-sharing.
::page_state_hold_cnt is used quite intensively on hotpath no matter if
frag API is used, so move it to the newly created hole in the first
cacheline.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We want to be able to run rtnl_fill_ifinfo() under RCU protection
instead of RTNL in the future.
All rtnl_link_ops->get_link_net() methods already using dev_net()
are ready. I added READ_ONCE() annotations on others.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
dev->xdp_prog is protected by RCU, we can lift RTNL requirement
from rtnl_xdp_prog_skb().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Change dev_change_proto_down() and dev_change_proto_down_reason()
to write once on dev->proto_down and dev->proto_down_reason.
Then rtnl_fill_proto_down() can use READ_ONCE() annotations
and run locklessly.
rtnl_proto_down_size() should assume worst case,
because readng dev->proto_down_reason multiple
times would be racy without RTNL in the future.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Following device fields can be read locklessly
in rtnl_fill_ifinfo() :
type, ifindex, operstate, link_mode, mtu, min_mtu, max_mtu, group,
promiscuity, allmulti, num_tx_queues, gso_max_segs, gso_max_size,
gro_max_size, gso_ipv4_max_size, gro_ipv4_max_size, tso_max_size,
tso_max_segs, num_rx_queues.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In the following patch we want to read dev->allmulti
and dev->promiscuity locklessly from rtnl_fill_ifinfo()
In this patch I change __dev_set_promiscuity() and
__dev_set_allmulti() to write these fields (and dev->flags)
only if they succeed, with WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
rtnl_fill_ifinfo() can read dev->tx_queue_len locklessly,
granted we add corresponding READ_ONCE()/WRITE_ONCE() annotations.
Add missing READ_ONCE(dev->tx_queue_len) in teql_enqueue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We can use netdev_copy_name() to no longer rely on RTNL
to fetch dev->name.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
dev->qdisc can be read using RCU protection.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2024-05-03
1) Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support.
This was defined by an early version of an IETF draft
that did not make it to a standard.
2) Introduce direction attribute for xfrm states.
xfrm states have a direction, a stsate can be used
either for input or output packet processing.
Add a direction to xfrm states to make it clear
for what a xfrm state is used.
* tag 'ipsec-next-2024-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: Restrict SA direction attribute to specific netlink message types
xfrm: Add dir validation to "in" data path lookup
xfrm: Add dir validation to "out" data path lookup
xfrm: Add Direction to the SA in or out
udpencap: Remove Obsolete UDP_ENCAP_ESPINUDP_NON_IKE Support
====================
Link: https://lore.kernel.org/r/20240503082732.2835810-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch fixes the spelling mistakes in comments.
The changes were generated using codespell and reviewed manually.
eariler -> earlier
greceful -> graceful
Signed-off-by: Shi-Sheng Yang <fourcolor4c@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240502154740.249839-1-fourcolor4c@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fill_route() stores three components in the skb:
- struct rtmsg
- RTA_DST (u8)
- RTA_OIF (u32)
Therefore, rtm_phonet_notify() should use
NLMSG_ALIGN(sizeof(struct rtmsg)) +
nla_total_size(1) +
nla_total_size(4)
Fixes: f062f41d0657 ("Phonet: routing table Netlink interface")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Rémi Denis-Courmont <courmisch@gmail.com>
Link: https://lore.kernel.org/r/20240502161700.1804476-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Given how late we are in the cycle, merge the two fixes from
wireless into wireless-next as they don't see that urgent.
This way, the wireless tree won't need rebasing later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Its the only remaining call site so there is no need for this to
be separated anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The existing code uses iter->type to figure out what data is needed, the
live copy (READ) or clone (UPDATE).
Without pending updates, priv->clone and priv->match will point to
different memory locations, but they have identical content.
Future patch will make priv->clone == NULL if there are no pending changes,
in this case we must copy the live data for the UPDATE case.
Currently this would require GFP_ATOMIC allocation. Split the walk
function in two parts: one that does the walk and one that decides which
data is needed.
In the UPDATE case, callers hold the transaction mutex so we do not need
the rcu read lock. This allows to use GFP_KERNEL allocation while
cloning.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Once priv->clone can be NULL in case no insertions/removals occurred
in the last transaction we need to drop set elements from priv->match
if priv->clone is NULL.
While at it, condense this function by reusing the pipapo_free_match
helper instead of open-coded version.
The rcu_barrier() is removed, its not needed: old call_rcu instances
for pipapo_reclaim_match do not access struct nft_set.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Currently it returns an error pointer, but the only possible failure
is ENOMEM.
After a followup patch, we'd need to discard the errno code, i.e.
x = pipapo_clone()
if (IS_ERR(x))
return NULL
or make more changes to fix up callers to expect IS_ERR() code
from set->ops->deactivate().
So simplify this and make it return ptr-or-null.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Preparation patch, the helper will soon get called from insert
function too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Not sure why this special case exists. Early drop logic
(which kicks in when conntrack table is full) should be independent
of flowtable offload and only consider assured bit (i.e., two-way
traffic was seen).
flowtable entries hold a reference to the conntrack entry (struct
nf_conn) that has been offloaded. The conntrack use count is not
decremented until after the entry is free'd.
This change therefore will not result in exceeding the conntrack table
limit. It does allow early-drop of tcp flows even when they've been
offloaded, but only if they have been offloaded before syn-ack was
received or after at least one peer has sent a fin.
Currently 'fin' packet reception already stops offloading, so this
should not impact offloading either.
Cc: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
At the beginning in 2009 one patch [1] introduced collecting drop
counter in nf_conntrack_in() by returning -NF_DROP. Later, another
patch [2] changed the return value of tcp_packet() which now is
renamed to nf_conntrack_tcp_packet() from -NF_DROP to NF_DROP. As
we can see, that -NF_DROP should be corrected.
Similarly, there are other two points where the -NF_DROP is used.
Well, as NF_DROP is equal to 0, inverting NF_DROP makes no sense
as patch [2] said many years ago.
[1]
commit 7d1e04598e5e ("netfilter: nf_conntrack: account packets drop by tcp_packet()")
[2]
commit ec8d540969da ("netfilter: conntrack: fix dropping packet after l4proto->packet()")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
It is obsolete since sp_lock was discarded in commit 580a25756a9f
("SUNRPC: discard sp_lock").
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
svc_find_listener will return the transport instance pointer for the
endpoint accepting connections/peer traffic from the specified transport
class and matching sockaddr.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add svc_xprt_create_from_sa utility routine and refactor
svc_xprt_create() codebase in order to introduce the capability to
create a svc port from socket address.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
since vs_proc pointer is dereferenced before getting it's address there's
no need to check for NULL.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 8e5b67731d08 ("SUNRPC: Add a callback to initialise server requests")
Signed-off-by: Aleksandr Aprelkov <aaprelkov@usergate.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|