Age | Commit message (Collapse) | Author |
|
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- Multiple patches to add support for fcntl() leases over NFSv4.
- A sysfs interface to display more information about the various
transport connections used by the RPC client
- A sysfs interface to allow a suitably privileged user to offline a
transport that may no longer point to a valid server
- A sysfs interface to allow a suitably privileged user to change the
server IP address used by the RPC client
Stable fixes:
- Two sunrpc fixes for deadlocks involving privileged rpc_wait_queues
Bugfixes:
- SUNRPC: Avoid a KASAN slab-out-of-bounds bug in xdr_set_page_base()
- SUNRPC: prevent port reuse on transports which don't request it.
- NFSv3: Fix memory leak in posix_acl_create()
- NFS: Various fixes to attribute revalidation timeouts
- NFSv4: Fix handling of non-atomic change attribute updates
- NFSv4: If a server is down, don't cause mounts to other servers to
hang as well
- pNFS: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECT
- NFS: Fix mount failures due to incorrect setting of the
has_sec_mnt_opts filesystem flag
- NFS: Ensure nfs_readpage returns promptly when an internal error
occurs
- NFS: Fix fscache read from NFS after cache error
- pNFS: Various bugfixes around the LAYOUTGET operation"
* tag 'nfs-for-5.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits)
NFSv4/pNFS: Return an error if _nfs4_pnfs_v3_ds_connect can't load NFSv3
NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
NFSv4/pnfs: Clean up layout get on open
NFSv4/pnfs: Fix layoutget behaviour after invalidation
NFSv4/pnfs: Fix the layout barrier update
NFS: Fix fscache read from NFS after cache error
NFS: Ensure nfs_readpage returns promptly when internal error occurs
sunrpc: remove an offlined xprt using sysfs
sunrpc: provide showing transport's state info in the sysfs directory
sunrpc: display xprt's queuelen of assigned tasks via sysfs
sunrpc: provide multipath info in the sysfs directory
NFSv4.1 identify and mark RPC tasks that can move between transports
sunrpc: provide transport info in the sysfs directory
SUNRPC: take a xprt offline using sysfs
sunrpc: add dst_attr attributes to the sysfs xprt directory
SUNRPC for TCP display xprt's source port in sysfs xprt_info
SUNRPC query transport's source port
SUNRPC display xprt's main value in sysfs's xprt_info
SUNRPC mark the first transport
sunrpc: add add sysfs directory per xprt under each xprt_switch
...
|
|
Add the dummy response handler for Intel boards to prevent incorrect
handling of OEM commands.
Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This allows to keep PHY link up and prevents any channel resets during
the host load.
It is KEEP_PHY_LINK_UP option(Veto bit) in i210 datasheet which
block PHY reset and power state changes.
Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sparse reports:
net/ncsi/ncsi-rsp.c:406:24: warning: cast to restricted __be32
net/ncsi/ncsi-manage.c:732:33: warning: cast to restricted __be32
net/ncsi/ncsi-manage.c:756:25: warning: cast to restricted __be32
net/ncsi/ncsi-manage.c:779:25: warning: cast to restricted __be32
Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While TCP stack scales reasonably well, there is still one part that
can be used to DDOS it.
IPv6 Packet too big messages have to lookup/insert a new route,
and if abused by attackers, can easily put hosts under high stress,
with many cpus contending on a spinlock while one is stuck in fib6_run_gc()
ip6_protocol_deliver_rcu()
icmpv6_rcv()
icmpv6_notify()
tcp_v6_err()
tcp_v6_mtu_reduced()
inet6_csk_update_pmtu()
ip6_rt_update_pmtu()
__ip6_rt_update_pmtu()
ip6_rt_cache_alloc()
ip6_dst_alloc()
dst_alloc()
ip6_dst_gc()
fib6_run_gc()
spin_lock_bh() ...
Some of our servers have been hit by malicious ICMPv6 packets
trying to _increase_ the MTU/MSS of TCP flows.
We believe these ICMPv6 packets are a result of a bug in one ISP stack,
since they were blindly sent back for _every_ (small) packet sent to them.
These packets are for one TCP flow:
09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
TCP stack can filter some silly requests :
1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.
This tests happen before the IPv6 routing stack is entered, thus
removing the potential contention and route exhaustion.
Note that IPv6 stack was performing these checks, but too late
(ie : after the route has been added, and after the potential
garbage collect war)
v2: fix typo caught by Martin, thanks !
v3: exports tcp_mtu_to_mss(), caught by David, thanks !
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Once a transport has been put offline, this transport can be also
removed from the list of transports. Any tasks that have been stuck
on this transport would find the next available active transport
and be re-tried. This transport would be removed from the xprt_switch
list and freed.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
In preparation of being able to change the xprt's state, add a way
to show currect state of the transport.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Once a task grabs a trasnport it's reflected in the queuelen of
the rpc_xprt structure. Add display of that value in the xprt's
info file in sysfs.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Allow to query xrpt_switch attributes. Currently showing the following
fields of the rpc_xprt_switch structure: xps_nxprts, xps_nactive,
xps_queuelen.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Allow to query transport's attributes. Currently showing following
fields of the rpc_xprt structure: state, last_used, cong, cwnd,
max_reqs, min_reqs, num_reqs, sizes of queues binding, sending,
pending, backlog.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Using sysfs's xprt_state attribute, mark a particular transport offline.
It will not be picked during the round-robin selection. It's not allowed
to take the main (1st created transport associated with the rpc_client)
offline. Also bring a transport back online via sysfs by writing "online"
and that would allow for this transport to be picked during the round-
robin selection.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Allow to query and set the destination's address of a transport.
Setting of the destination address is allowed only for TCP or RDMA
based connections.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Using TCP connection's source port it is useful to match connections
seen on the network traces to the xprts used by the linux nfs client.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Provide ability to query transport's source port.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Display in sysfs in the information about the xprt if this is a
main transport or not.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
When an RPC client gets created it's first transport is special
and should be marked a main transport.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Add individual transport directories under each transport switch
group. For instance, for each nconnect=X connections there will be
a transport directory. Naming conventions also identifies transport
type -- xprt-<id>-<type> where type is udp, tcp, rdma, local, bc.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
An rpc client uses a transport switch and one ore more transports
associated with that switch. Since transports are shared among
rpc clients, create a symlink into the xprt_switch directory
instead of duplicating entries under each rpc client.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Add xprt_switch directory to the sysfs and create individual
xprt_swith subdirectories for multipath transport group.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
We need to keep track of the type for a given transport.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
This is used to uniquely identify sunrpc multipath objects in /sys.
Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
This adds a unique identifier for a sunrpc transport in sysfs, which is
similarly managed to the unique IDs of clients.
Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
These will eventually have files placed under them for sysfs operations.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
For network namespace separation.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
This is where we'll put per-rpc_client related files
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
We will fail to build with CONFIG_SKB_EXTENSIONS disabled after
8550ff8d8c75 ("skbuff: Release nfct refcount on napi stolen or re-used
skbs") since there is an unconditionally use of skb_ext_find() without
an appropriate stub. Simply build the code conditionally and properly
guard against both COFNIG_SKB_EXTENSIONS as well as
CONFIG_NET_TC_SKB_EXT being disabled.
Fixes: Fixes: 8550ff8d8c75 ("skbuff: Release nfct refcount on napi stolen or re-used skbs")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fixed indentation by removing extra spaces.
Signed-off-by: Roy, UjjaL <royujjal@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If copy_from_sockptr() then we need to unlock before returning.
Fixes: d463126e23f1 ("net: sock: extend SO_TIMESTAMPING for PHC binding")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This lifts the restriction on running devmap BPF progs in generic
redirect mode. To match native XDP behavior, it is invoked right before
generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/
XDP_DROP actions.
We also return 0 even if devmap program drops the packet, as
semantically redirect has already succeeded and the devmap prog is the
last point before TX of the packet to device where it can deliver a
verdict on the packet.
This also means it must take care of freeing the skb, as
xdp_do_generic_redirect callers only do that in case an error is
returned.
Since devmap entry prog is supported, remove the check in
generic_xdp_install entirely.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com
|
|
This change implements CPUMAP redirect support for generic XDP programs.
The idea is to reuse the cpu map entry's queue that is used to push
native xdp frames for redirecting skb to a different CPU. This will
match native XDP behavior (in that RPS is invoked again for packet
reinjected into networking stack).
To be able to determine whether the incoming skb is from the driver or
cpumap, we reuse skb->redirected bit that skips generic XDP processing
when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on
it has been lifted and it is always available.
>From the redirect side, we add the skb to ptr_ring with its lowest bit
set to 1. This should be safe as skb is not 1-byte aligned. This allows
kthread to discern between xdp_frames and sk_buff. On consumption of the
ptr_ring item, the lowest bit is unset.
In the end, the skb is simply added to the list that kthread is anyway
going to maintain for xdp_frames converted to skb, and then received
again by using netif_receive_skb_list.
Bulking optimization for generic cpumap is left as an exercise for a
future patch for now.
Since cpumap entry progs are now supported, also remove check in
generic_xdp_install for the cpumap.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
|
|
This helper can later be utilized in code that runs cpumap and devmap
programs in generic redirect mode and adjust skb based on changes made
to xdp_buff.
When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a
generic redirect path invokes devmap/cpumap prog if set, it must
__skb_pull again as we expect mac header to be pulled.
It also drops the skb_reset_mac_len call after do_xdp_generic, as the
mac_header and network_header are advanced by the same offset, so the
difference (mac_len) remains constant.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
|
|
Support specifying the ingress_ifindex and rx_queue_index of xdp_md
contexts for BPF_PROG_TEST_RUN.
The intended use case is to allow testing XDP programs that make decisions
based on the ingress interface or RX queue.
If ingress_ifindex is specified, look up the device by the provided index
in the current namespace and use its xdp_rxq for the xdp_buff. If the
rx_queue_index is out of range, or is non-zero when the ingress_ifindex is
0, return -EINVAL.
Co-developed-by: Cody Haas <chaas@riotgames.com>
Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Cody Haas <chaas@riotgames.com>
Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Zvi Effron <zeffron@riotgames.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210707221657.3985075-4-zeffron@riotgames.com
|
|
Support passing a xdp_md via ctx_in/ctx_out in bpf_attr for
BPF_PROG_TEST_RUN.
The intended use case is to pass some XDP meta data to the test runs of
XDP programs that are used as tail calls.
For programs that use bpf_prog_test_run_xdp, support xdp_md input and
output. Unlike with an actual xdp_md during a non-test run, data_meta must
be 0 because it must point to the start of the provided user data. From
the initial xdp_md, use data and data_end to adjust the pointers in the
generated xdp_buff. All other non-zero fields are prohibited (with
EINVAL). If the user has set ctx_out/ctx_size_out, copy the (potentially
different) xdp_md back to the userspace.
We require all fields of input xdp_md except the ones we explicitly
support to be set to zero. The expectation is that in the future we might
add support for more fields and we want to fail explicitly if the user
runs the program on the kernel where we don't yet support them.
Co-developed-by: Cody Haas <chaas@riotgames.com>
Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Cody Haas <chaas@riotgames.com>
Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Zvi Effron <zeffron@riotgames.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210707221657.3985075-3-zeffron@riotgames.com
|
|
This commit prepares to use the XDP meta data length check in multiple
places by making it into a static inline function instead of a literal.
Co-developed-by: Cody Haas <chaas@riotgames.com>
Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Cody Haas <chaas@riotgames.com>
Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com>
Signed-off-by: Zvi Effron <zeffron@riotgames.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210707221657.3985075-2-zeffron@riotgames.com
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Do not refresh timeout in SYN_SENT for syn retransmissions.
Add selftest for unreplied TCP connection, from Florian Westphal.
2) Fix null dereference from error path with hardware offload
in nftables.
3) Remove useless nf_ct_gre_keymap_flush() from netns exit path,
from Vasily Averin.
4) Missing rcu read-lock side in ctnetlink helper info dump,
also from Vasily.
5) Do not mark RST in the reply direction coming after SYN packet
for an out-of-sync entry, from Ali Abdallah and Florian Westphal.
6) Add tcp_ignore_invalid_rst sysctl to allow to disable out of
segment RSTs, from Ali.
7) KCSAN fix for nf_conntrack_all_lock(), from Manfred Spraul.
8) Honor NFTA_LAST_SET in nft_last.
9) Fix incorrect arithmetics when restore last_jiffies in nft_last.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull nfsd updates from Bruce Fields:
- add tracepoints for callbacks and for client creation and destruction
- cache the mounts used for server-to-server copies
- expose callback information in /proc/fs/nfsd/clients/*/info
- don't hold locks unnecessarily while waiting for commits
- update NLM to use xdr_stream, as we have for NFSv2/v3/v4
* tag 'nfsd-5.14' of git://linux-nfs.org/~bfields/linux: (69 commits)
nfsd: fix NULL dereference in nfs3svc_encode_getaclres
NFSD: Prevent a possible oops in the nfs_dirent() tracepoint
nfsd: remove redundant assignment to pointer 'this'
nfsd: Reduce contention for the nfsd_file nf_rwsem
lockd: Update the NLMv4 SHARE results encoder to use struct xdr_stream
lockd: Update the NLMv4 nlm_res results encoder to use struct xdr_stream
lockd: Update the NLMv4 TEST results encoder to use struct xdr_stream
lockd: Update the NLMv4 void results encoder to use struct xdr_stream
lockd: Update the NLMv4 FREE_ALL arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 SHARE arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 SM_NOTIFY arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 nlm_res arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 UNLOCK arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 CANCEL arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 LOCK arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 TEST arguments decoder to use struct xdr_stream
lockd: Update the NLMv4 void arguments decoder to use struct xdr_stream
lockd: Update the NLMv1 SHARE results encoder to use struct xdr_stream
lockd: Update the NLMv1 nlm_res results encoder to use struct xdr_stream
lockd: Update the NLMv1 TEST results encoder to use struct xdr_stream
...
|
|
The variable status is being initialized with a value that is never
read, the assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Fix some spelling mistakes in comments:
succes ==> success
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
The goal of commit df789fe75206 ("ipv6: Provide ipv6 version of
"disable_policy" sysctl") was to have the disable_policy from ipv4
available on ipv6.
However, it's not exactly the same mechanism. On IPv4, all packets coming
from an interface, which has disable_policy set, bypass the policy check.
For ipv6, this is done only for local packets, ie for packets destinated to
an address configured on the incoming interface.
Let's align ipv6 with ipv4 so that the 'disable_policy' sysctl has the same
effect for both protocols.
My first approach was to create a new kind of route cache entries, to be
able to set DST_NOPOLICY without modifying routes. This would have added a
lot of code. Because the local delivery path is already handled, I choose
to focus on the forwarding path to minimize code churn.
Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This commit fixes a bug (found by syzkaller) that could cause spurious
double-initializations for congestion control modules, which could cause
memory leaks or other problems for congestion control modules (like CDG)
that allocate memory in their init functions.
The buggy scenario constructed by syzkaller was something like:
(1) create a TCP socket
(2) initiate a TFO connect via sendto()
(3) while socket is in TCP_SYN_SENT, call setsockopt(TCP_CONGESTION),
which calls:
tcp_set_congestion_control() ->
tcp_reinit_congestion_control() ->
tcp_init_congestion_control()
(4) receive ACK, connection is established, call tcp_init_transfer(),
set icsk_ca_initialized=0 (without first calling cc->release()),
call tcp_init_congestion_control() again.
Note that in this sequence tcp_init_congestion_control() is called
twice without a cc->release() call in between. Thus, for CC modules
that allocate memory in their init() function, e.g, CDG, a memory leak
may occur. The syzkaller tool managed to find a reproducer that
triggered such a leak in CDG.
The bug was introduced when that commit 8919a9b31eb4 ("tcp: Only init
congestion control if not initialized already")
introduced icsk_ca_initialized and set icsk_ca_initialized to 0 in
tcp_init_transfer(), missing the possibility for a sequence like the
one above, where a process could call setsockopt(TCP_CONGESTION) in
state TCP_SYN_SENT (i.e. after the connect() or TFO open sendmsg()),
which would call tcp_init_congestion_control(). It did not intend to
reset any initialization that the user had already explicitly made;
it just missed the possibility of that particular sequence (which
syzkaller managed to find).
Fixes: 8919a9b31eb4 ("tcp: Only init congestion control if not initialized already")
Reported-by: syzbot+f1e24a0594d4e3a895d3@syzkaller.appspotmail.com
Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When multiple SKBs are merged to a new skb under napi GRO,
or SKB is re-used by napi, if nfct was set for them in the
driver, it will not be released while freeing their stolen
head state or on re-use.
Release nfct on napi's stolen or re-used SKBs, and
in gro_list_prepare, check conntrack metadata diff.
Fixes: 5c6b94604744 ("net/mlx5e: CT: Handle misses after executing CT action")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Subtract the jiffies that have passed by to current jiffies to fix last
used restoration.
Fixes: 836382dc2471 ("netfilter: nf_tables: add last expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
NFTA_LAST_SET tells us if this expression has ever seen a packet, do not
ignore this attribute when restoring the ruleset.
Fixes: 836382dc2471 ("netfilter: nf_tables: add last expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
KCSAN detected an data race with ipc/sem.c that is intentional.
As nf_conntrack_lock() uses the same algorithm: Update
nf_conntrack_core as well:
nf_conntrack_lock() contains
a1) spin_lock()
a2) smp_load_acquire(nf_conntrack_locks_all).
a1) actually accesses one lock from an array of locks.
nf_conntrack_locks_all() contains
b1) nf_conntrack_locks_all=true (normal write)
b2) spin_lock()
b3) spin_unlock()
b2 and b3 are done for every lock.
This guarantees that nf_conntrack_locks_all() prevents any
concurrent nf_conntrack_lock() owners:
If a thread past a1), then b2) will block until that thread releases
the lock.
If the threat is before a1, then b3)+a1) ensure the write b1) is
visible, thus a2) is guaranteed to see the updated value.
But: This is only the latest time when b1) becomes visible.
It may also happen that b1) is visible an undefined amount of time
before the b3). And thus KCSAN will notice a data race.
In addition, the compiler might be too clever.
Solution: Use WRITE_ONCE().
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds a new sysctl tcp_ignore_invalid_rst to disable marking
out of segments RSTs as INVALID.
Signed-off-by: Ali Abdallah <aabdallah@suse.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
If we receive a SYN packet in original direction on an existing
connection tracking entry, we let this SYN through because conntrack
might be out-of-sync.
Conntrack gets back in sync when server responds with SYN/ACK and state
gets updated accordingly.
However, if server replies with RST, this packet might be marked as
INVALID because td_maxack value reflects the *old* conntrack state
and not the state of the originator of the RST.
Avoid td_maxack-based checks if previous packet was a SYN.
Unfortunately that is not be enough: an out of order ACK in original
direction updates last_index, so we still end up marking valid RST.
Thus disable the sequence check when we are not in established state and
the received RST has a sequence of 0.
Because marking RSTs as invalid usually leads to unwanted timeouts,
also skip RST sequence checks if a conntrack entry is already closing.
Such entries can already be evicted via GC in case the table is full.
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Ali Abdallah <aabdallah@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial updates from Greg KH:
"Here is the big set of tty and serial driver patches for 5.14-rc1.
A bit more than normal, but nothing major, lots of cleanups.
Highlights are:
- lots of tty api cleanups and mxser driver cleanups from Jiri
- build warning fixes
- various serial driver updates
- coding style cleanups
- various tty driver minor fixes and updates
- removal of broken and disable r3964 line discipline (finally!)
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (227 commits)
serial: mvebu-uart: remove unused member nb from struct mvebu_uart
arm64: dts: marvell: armada-37xx: Fix reg for standard variant of UART
dt-bindings: mvebu-uart: fix documentation
serial: mvebu-uart: correctly calculate minimal possible baudrate
serial: mvebu-uart: do not allow changing baudrate when uartclk is not available
serial: mvebu-uart: fix calculation of clock divisor
tty: make linux/tty_flip.h self-contained
serial: Prefer unsigned int to bare use of unsigned
serial: 8250: 8250_omap: Fix possible interrupt storm on K3 SoCs
serial: qcom_geni_serial: use DT aliases according to DT bindings
Revert "tty: serial: Add UART driver for Cortina-Access platform"
tty: serial: Add UART driver for Cortina-Access platform
MAINTAINERS: add me back as mxser maintainer
mxser: Documentation, fix typos
mxser: Documentation, make the docs up-to-date
mxser: Documentation, remove traces of callout device
mxser: introduce mxser_16550A_or_MUST helper
mxser: rename flags to old_speed in mxser_set_serial_info
mxser: use port variable in mxser_set_serial_info
mxser: access info->MCR under info->slock
...
|
|
If an UDP packet enters the GRO engine but is not eligible
for aggregation and is not targeting an UDP tunnel,
udp_gro_receive() will not set the flush bit, and packet
could delayed till the next napi flush.
Fix the issue ensuring non GROed packets traverse
skb_gro_flush_final().
Reported-and-tested-by: Matthias Treydte <mt@waldheinz.de>
Fixes: 18f25dc39990 ("udp: skip L4 aggregation for UDP tunnel packets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|