Age | Commit message (Collapse) | Author |
|
This reverts commit cc00bcaa589914096edef7fb87ca5cee4a166b5c.
This (and the preceding) patch basically re-implemented the RCU
mechanisms of patch 784544739a25. That patch was replaced because of the
performance problems that it created when replacing tables. Now, we have
the same issue: the call to synchronize_rcu() makes replacing tables
slower by as much as an order of magnitude.
Prior to using RCU a script calling "iptables" approx. 200 times was
taking 1.16s. With RCU this increased to 11.59s.
Revert these patches and fix the issue in a different way.
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This reverts commit 443d6e86f821a165fae3fc3fc13086d27ac140b1.
This (and the following) patch basically re-implemented the RCU
mechanisms of patch 784544739a25. That patch was replaced because of the
performance problems that it created when replacing tables. Now, we have
the same issue: the call to synchronize_rcu() makes replacing tables
slower by as much as an order of magnitude.
Revert these patches and fix the issue in a different way.
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Augment the current set of options that are accessible via
bpf_{g,s}etsockopt to also support SO_REUSEPORT.
Signed-off-by: Manu Bretelle <chantra@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210310182305.1910312-1-chantra@fb.com
|
|
Extend psample to report the following attributes when available:
* Output traffic class as a 16-bit value
* Output traffic class occupancy in bytes as a 64-bit value
* End-to-end latency of the packet in nanoseconds resolution
* Software timestamp in nanoseconds resolution (always available)
* Packet's protocol. Needed for packet dissection in user space (always
available)
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, callers of psample_sample_packet() pass three metadata
attributes: Ingress port, egress port and truncated size. Subsequent
patches are going to add more attributes (e.g., egress queue occupancy),
which also need an indication whether they are valid or not.
Encapsulate packet metadata in a struct in order to keep the number of
arguments reasonable.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It's used only for flow dissection, which now takes constant data
pointers.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Flow Dissector code never modifies the input buffer, neither skb nor
raw data.
Make 'data' argument const for all of the Flow dissector's functions.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
'hash' stores not the flow hash, but the index of the GRO bucket
corresponding to it.
Change its name to 'bucket' to avoid confusion while reading lines
like '__set_bit(hash, &napi->gro_bitmask)'.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
GRO bucket index doesn't change through the entire function.
Store a pointer to the corresponding bucket instead of its member
and use it consistently through the function.
It is performance-safe since &gro_list->list == gro_list.
Misc: remove superfluous braces around single-line branches.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
gro_list_prepare() always returns &napi->gro_hash[bucket].list,
without any variations. Moreover, it uses 'napi' argument only to
have access to this list, and calculates the bucket index for the
second time (firstly it happens at the beginning of
dev_gro_receive()) to do that.
Given that dev_gro_receive() already has an index to the needed
list, just pass it as the first argument to eliminate redundant
calculations, and make gro_list_prepare() return void.
Also, both arguments of gro_list_prepare() can be constified since
this function can only modify the skbs from the bucket list.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
flow_dissector_key_icmp::id is of type u16 (CPU byteorder),
ICMP header has its ID field in network byteorder obviously.
Sparse says:
net/core/flow_dissector.c:178:43: warning: restricted __be16 degrades to integer
Convert ID value to CPU byteorder when storing it into
flow_dissector_key_icmp.
Fixes: 5dec597e5cd0 ("flow_dissector: extract more ICMP information")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
struct sockaddr_qrtr has a 2-byte hole, and qrtr_recvmsg() currently
does not clear it before copying kernel data to user space.
It might be too late to name the hole since sockaddr_qrtr structure is uapi.
BUG: KMSAN: kernel-infoleak in kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
CPU: 0 PID: 29705 Comm: syz-executor.3 Not tainted 5.11.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x21c/0x280 lib/dump_stack.c:120
kmsan_report+0xfb/0x1e0 mm/kmsan/kmsan_report.c:118
kmsan_internal_check_memory+0x202/0x520 mm/kmsan/kmsan.c:402
kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
instrument_copy_to_user include/linux/instrumented.h:121 [inline]
_copy_to_user+0x1ac/0x270 lib/usercopy.c:33
copy_to_user include/linux/uaccess.h:209 [inline]
move_addr_to_user+0x3a2/0x640 net/socket.c:237
____sys_recvmsg+0x696/0xd50 net/socket.c:2575
___sys_recvmsg net/socket.c:2610 [inline]
do_recvmmsg+0xa97/0x22d0 net/socket.c:2710
__sys_recvmmsg net/socket.c:2789 [inline]
__do_sys_recvmmsg net/socket.c:2812 [inline]
__se_sys_recvmmsg+0x24a/0x410 net/socket.c:2805
__x64_sys_recvmmsg+0x62/0x80 net/socket.c:2805
do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x465f69
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f43659d6188 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 000000000056bf60 RCX: 0000000000465f69
RDX: 0000000000000008 RSI: 0000000020003e40 RDI: 0000000000000003
RBP: 00000000004bfa8f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000010060 R11: 0000000000000246 R12: 000000000056bf60
R13: 0000000000a9fb1f R14: 00007f43659d6300 R15: 0000000000022000
Local variable ----addr@____sys_recvmsg created at:
____sys_recvmsg+0x168/0xd50 net/socket.c:2550
____sys_recvmsg+0x168/0xd50 net/socket.c:2550
Bytes 2-3 of 12 are uninitialized
Memory access of size 12 starts at ffff88817c627b40
Data copied to user address 0000000020000140
Fixes: bdabad3e363d ("net: Add Qualcomm IPC router")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Courtney Cavin <courtney.cavin@sonymobile.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
There is only a single patch this time:
- Use netif_rx_any_context(), by Sebastian Andrzej Siewior
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow a policer action to enforce a rate-limit based on packets-per-second,
configurable using a packet-per-second rate and burst parameters.
e.g.
tc filter add dev tap1 parent ffff: u32 match \
u32 0 0 police pkts_rate 3000 pkts_burst 1000
Testing was unable to uncover a performance impact of this change on
existing features.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow flow_offload API to configure packet-per-second policing using rate
and burst parameters.
Dummy implementations of tcf_police_rate_pkt_ps() and
tcf_police_burst_pkt() are supplied which return 0, the unconfigured state.
This is to facilitate splitting the offload, driver, and TC code portion of
this feature into separate patches with the aim of providing a logical flow
for review. And the implementation of these helpers will be filled out by a
follow-up patch.
Signed-off-by: Xingfeng Hu <xingfeng.hu@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch invoked mptcp_nl_remove_addrs_list to remove a list of addresses
when the netlink flushes addresses, instead of using
mptcp_nl_remove_subflow_and_signal_addr to remove them one by one.
And dropped the unused parameter net in __flush_addrs too.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch implemented the function to remove a list of addresses and
subflows, named mptcp_nl_remove_addrs_list, which had a input parameter
rm_list as the removing addresses list.
In mptcp_nl_remove_addrs_list, traverse all the existing msk sockets to
invoke mptcp_pm_remove_addrs_and_subflows to remove a list of addresses
for each msk socket.
In mptcp_pm_remove_addrs_and_subflows, traverse all the addresses in the
removing addresses list, to find whether this address is in the conn_list
or anno_list. If it is, put the address ID into the removing address list
or the removing subflow list, and pass the two lists to
mptcp_pm_remove_addr and mptcp_pm_remove_subflow.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch dealt with removing multi subflows in PM:
In mptcp_pm_remove_subflow, changed the input parameter local_id as an
list of removing address ids, and passed the list to
mptcp_pm_nl_rm_subflow_received.
In mptcp_pm_nl_rm_subflow_received, iterated each address id from the
received ids list. Then shut down and closed each address id's subsocket.
In mptcp_nl_remove_subflow_and_signal_addr, put the single address id into
an ids list, and passed it to mptcp_pm_remove_subflow.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch dropped the member rm_id of struct mptcp_pm_data. Use
rm_list_rx in mptcp_pm_nl_rm_addr_received instead of using rm_id.
In mptcp_pm_nl_rm_addr_received, iterated each address id from
pm.rm_list_rx, then shut down and closed each address id's subsocket.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch added a new member rm_list_rx for struct mptcp_pm_data as an
list of the removing address ids on the incoming direction. Initialized
its nr field to zero in mptcp_pm_data_init.
In mptcp_pm_rm_addr_received, set it as the input rm_list.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch changed the member rm_id in struct mptcp_options_received as a
list of the removing address ids, and renamed it to rm_list.
In mptcp_parse_option, parsed the RM_ADDR suboption and filled them into
the rm_list in struct mptcp_options_received.
In mptcp_incoming_options, passed this rm_list to the function
mptcp_pm_rm_addr_received.
It also changed the parameter type of mptcp_pm_rm_addr_received.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch added a new member rm_list_tx for struct mptcp_pm_data as the
removing address list on the outgoing direction. Initialize its nr field
to zero in mptcp_pm_data_init.
In mptcp_pm_remove_anno_addr, put the single address id into an removing
list, and passed it to mptcp_pm_remove_addr.
In mptcp_pm_remove_addr, save the input rm_list to rm_list_tx in struct
mptcp_pm_data.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch defined a new struct mptcp_rm_list, the ids field was an
array of the removing address ids, the nr field was the valid number of
removing address ids in the array. The array size was definced as a new
macro MPTCP_RM_IDS_MAX. Changed the member rm_id of struct
mptcp_out_options to rm_list.
In mptcp_established_options_rm_addr, invoked mptcp_pm_rm_addr_signal to
get the rm_list. According the number of addresses in it, calculated
the padded RM_ADDR suboption length. And saved the ids array in struct
mptcp_out_options's rm_list member.
In mptcp_write_options, iterated each address id from struct
mptcp_out_options's rm_list member, set the invalid ones as TCPOPT_NOP,
then filled them into the RM_ADDR suboption.
Changed TCPOLEN_MPTCP_RM_ADDR_BASE from 4 to 3.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All comment lines inside the comment block have been aligned.
Every line of comment starts with a * (uniformity in code).
Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The MPTCP_PUSH_PENDING define is 6 and these tests should be testing if
BIT(6) is set.
Fixes: c2e6048fa1cf ("mptcp: fix race in release_cb")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull NFS client bugfixes from Anna Schumaker:
"These are mostly fixes for issues discovered at the recent NFS
bakeathon:
- Fix PNFS_FLEXFILE_LAYOUT kconfig so it is possible to build
into the kernel
- Correct size calculationn for create reply length
- Set memalloc_nofs_save() for sync tasks to prevent deadlocks
- Don't revalidate directory permissions on lookup failure
- Don't clear inode cache when lookup fails
- Change functions to use nfs_set_cache_invalid() for proper
delegation handling
- Fix return value of _nfs4_get_security_label()
- Return an error when attempting to remove system.nfs4_acl"
* tag 'nfs-for-5.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
nfs: we don't support removing system.nfs4_acl
NFSv4.2: fix return value of _nfs4_get_security_label()
NFS: Fix open coded versions of nfs_set_cache_invalid() in NFSv4
NFS: Fix open coded versions of nfs_set_cache_invalid()
NFS: Clean up function nfs_mark_dir_for_revalidate()
NFS: Don't gratuitously clear the inode cache when lookup failed
NFS: Don't revalidate the directory permissions on a lookup failure
SUNRPC: Set memalloc_nofs_save() for sync tasks
NFS: Correct size calculation for create reply length
nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig default
|
|
TSQ provides a nice way to avoid bufferbloat on individual socket,
including retransmit packets. We can get rid of the old
heuristic:
/* Do not sent more than we queued. 1/4 is reserved for possible
* copying overhead: fragmentation, tunneling, mangling etc.
*/
if (refcount_read(&sk->sk_wmem_alloc) >
min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
sk->sk_sndbuf))
return -EAGAIN;
This heuristic was giving false positives according to Jakub,
whenever TX completions are delayed above RTT. (Ack packets
are processed by TCP stack before clones are orphaned/freed)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jakub reported Data included in a Fastopen SYN that had to be
retransmit would have to wait for an RTO if TX completions are slow,
even with prior fix.
This is because tcp_rcv_fastopen_synack() does not use standard
rtx logic, meaning TSQ handler exits early in tcp_tsq_write()
because tp->lost_out == tp->retrans_out
Lets make tcp_rcv_fastopen_synack() use standard rtx logic,
by using tcp_mark_skb_lost() on the skb thats needs to be
sent again.
Not this raised a warning in tcp_fastretrans_alert() during my tests
since we consider the data not being aknowledged
by the receiver does not mean packet was lost on the network.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jakub and Neil reported an increase of RTO timers whenever
TX completions are delayed a bit more (by increasing
NIC TX coalescing parameters)
Main issue is that TCP stack has a logic preventing a packet
being retransmit if the prior clone has not yet been
orphaned or freed.
This logic came with commit 1f3279ae0c13 ("tcp: avoid
retransmits of TCP packets hanging in host queues")
Thankfully, in the case skb_still_in_host_queue() detects
the initial clone is still in flight, it can use TSQ logic
that will eventually retry later, at the moment the clone
is freed or orphaned.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neil Spring <ntspring@fb.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
htb_init may fail to do the offload if it's not supported or if a
runtime error happens when allocating direct qdiscs. In those cases
TC_HTB_CREATE command is not sent to the driver, however, htb_destroy
gets called anyway and attempts to send TC_HTB_DESTROY.
It shouldn't happen, because the driver didn't receive TC_HTB_CREATE,
and also because the driver may not support ndo_setup_tc at all, while
q->offload is true, and htb_destroy mistakenly thinks the offload is
supported. Trying to call ndo_setup_tc in the latter case will lead to a
NULL pointer dereference.
This commit fixes the issues with htb_destroy by deferring assignment of
q->offload until after the TC_HTB_CREATE command. The necessary cleanup
of the offload entities is already done in htb_init.
Reported-by: syzbot+b53a709f04722ca12a3c@syzkaller.appspotmail.com
Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
htb_select_queue assumes it's always the offload mode, and it ends up in
calling ndo_setup_tc without any checks. It may lead to a NULL pointer
dereference if ndo_setup_tc is not implemented, or to an error returned
from the driver, which will prevent attaching qdiscs to HTB classes in
the non-offload mode.
This commit fixes the bug by adding the missing check to
htb_select_queue. In the non-offload mode it will return sch->dev_queue,
mimicking tc_modify_qdisc's behavior for the case where select_queue is
not implemented.
Reported-by: syzbot+b53a709f04722ca12a3c@syzkaller.appspotmail.com
Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch fixes the following warning from sparse:
net/tipc/monitor.c:263:35: warning: incorrect type in assignment (different base types)
net/tipc/monitor.c:263:35: expected unsigned int
net/tipc/monitor.c:263:35: got restricted __be32 [usertype]
[...]
net/tipc/node.c:374:13: warning: context imbalance in 'tipc_node_read_lock' - wrong count at exit
net/tipc/node.c:379:13: warning: context imbalance in 'tipc_node_read_unlock' - unexpected unlock
net/tipc/node.c:384:13: warning: context imbalance in 'tipc_node_write_lock' - wrong count at exit
net/tipc/node.c:389:13: warning: context imbalance in 'tipc_node_write_unlock_fast' - unexpected unlock
net/tipc/node.c:404:17: warning: context imbalance in 'tipc_node_write_unlock' - unexpected unlock
[...]
net/tipc/crypto.c:1201:9: warning: incorrect type in initializer (different address spaces)
net/tipc/crypto.c:1201:9: expected struct tipc_aead [noderef] __rcu *__tmp
net/tipc/crypto.c:1201:9: got struct tipc_aead *
[...]
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
(struct tipc_link_info)->dest is in network order (__be32), so we must
convert the value to network order before assigning. The problem detected
by sparse:
net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types)
net/tipc/netlink_compat.c:699:24: expected restricted __be32 [usertype] dest
net/tipc/netlink_compat.c:699:24: got int
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now that all the code is in place, stop rejecting requests to create
resilient next-hop groups.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Nexthop replacements et.al. are notified through netlink, but if a delayed
work migrates buckets on the background, userspace will stay oblivious.
Notify these as RTM_NEWNEXTHOPBUCKET events.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow getting (but not setting) individual buckets to inspect the next hop
mapped therein, idle time, and flags.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a dump handler for resilient next hop buckets. When next-hop group ID
is given, it walks buckets of that group, otherwise it walks buckets of all
groups. It then dumps the buckets whose next hops match the given filtering
criteria.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Implement the netlink messages that allow creation and dumping of resilient
nexthop groups.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The kernel periodically checks the idle time of nexthop buckets to
determine if they are idle and can be re-populated with a new nexthop.
When the resilient nexthop group is offloaded to hardware, the kernel
will not see activity on nexthop buckets unless it is reported from
hardware.
Add a function that can be periodically called by device drivers to
report activity on nexthop buckets after querying it from the underlying
device.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a function that can be called by device drivers to set "offload" or
"trap" indication on nexthop buckets following nexthop notifications and
other changes such as a neighbour becoming invalid.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Implement the following notifications towards drivers:
- NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.
- NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of
next hops to hash table buckets. That includes replacements, deletions,
and delayed upkeep cycles. Some bucket notifications can be vetoed by the
driver, to make it possible to propagate bucket busy-ness flags from the
HW back to the algorithm. Some are however forced, e.g. if a next hop is
deleted, all buckets that use this next hop simply must be migrated,
whether the HW wishes so or not.
- NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is
replaced. Usually the driver will get the bucket notifications as well,
and could veto those. But in some cases, a bucket may not be migrated
immediately, but during delayed upkeep, and that is too late to roll the
transaction back. This notification allows the driver to take a look and
veto the new proposed group up front, before anything is committed.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
At this moment, there is only one type of next-hop group: an mpath group,
which implements the hash-threshold algorithm.
To select a next hop, hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. While
there will usually be some overlap between the previous and the new
distribution, some traffic flows change the next hop that they resolve to.
That causes problems e.g. as established TCP connections are reset, because
the traffic is forwarded to a server that is not familiar with the
connection.
Resilient hashing is a technique to address the above problem. Resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, and then reads
the next hop that this bucket contains, and forwards traffic there.
This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
continuous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore when a next hop is deleted
the buckets that held it are simply reassigned to other next hops. When
weights of next hops in a group are altered, it may be possible to choose a
subset of buckets that are currently not used for forwarding traffic, and
use those to satisfy the new next-hop distribution demands, keeping the
"busy" buckets intact. This way, established flows are ideally kept being
forwarded to the same endpoints through the same paths as before the
next-hop group change.
In a nutshell, the algorithm works as follows. Each next hop has a number
of buckets that it wants to have, according to its weight and the number of
buckets in the hash table. In case of an event that might cause bucket
allocation change, the numbers for individual next hops are updated,
similarly to how ranges are updated for mpath group next hops. Following
that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
next hop that is currently occupying more buckets than it wants (it is
"overweight"), it migrates the buckets to one of the next hops that has
fewer buckets than it wants (it is "underweight"). If, after this, there
are still underweight next hops, another upkeep run is scheduled to a
future time.
Chances are there are not enough "idle" buckets to satisfy the new demands.
The algorithm has knobs to select both what it means for a bucket to be
idle, and for whether and when to forcefully migrate buckets if there keeps
being an insufficient number of idle buckets.
There are three users of the resilient data structures.
- The forwarding code accesses them under RCU, and does not modify them
except for updating the time a selected bucket was last used.
- Netlink code, running under RTNL, which may modify the data.
- The delayed upkeep code, which may modify the data. This runs unlocked,
and mutual exclusion between the RTNL code and the delayed upkeep is
maintained by canceling the delayed work synchronously before the RTNL
code touches anything. Later it restarts the delayed work if necessary.
The RTNL code has to implement next-hop group replacement, next hop
removal, etc. For removal, the mpath code uses a neat trick of having a
backup next hop group structure, doing the necessary changes offline, and
then RCU-swapping them in. However, the hash tables for resilient hashing
are about an order of magnitude larger than the groups themselves (the size
might be e.g. 4K entries), and it was felt that keeping two of them is an
overkill. Both the primary next-hop group and the spare therefore use the
same resilient table, and writers are careful to keep all references valid
for the forwarding code. The hash table references next-hop group entries
from the next-hop group that is currently in the primary role (i.e. not
spare). During the transition from primary to spare, the table references a
mix of both the primary group and the spare. When a next hop is deleted,
the corresponding buckets are not set to NULL, but instead marked as empty,
so that the pointer is valid and can be used by the forwarding code. The
buckets are then migrated to a new next-hop group entry during upkeep. The
only times that the hash table is invalid is the very beginning and very
end of its lifetime. Between those points, it is always kept valid.
This patch introduces the core support code itself. It does not handle
notifications towards drivers, which are kept as if the group were an mpath
one. It does not handle netlink either. The only bit currently exposed to
user space is the new next-hop group type, and that is currently bounced.
There is therefore no way to actually access this code.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
- RTM_NEWNEXTHOP et.al. that handle resilient groups will have a new nested
attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.
- RTM_NEWNEXTHOPBUCKET et.al. is a suite of new messages that will
currently serve only for dumping of individual buckets of resilient next
hop groups. For nexthop group buckets, these messages will carry a nested
attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*.
There are several reasons why a new suite of messages is created for
nexthop buckets instead of overloading the information on the existing
RTM_{NEW,DEL,GET}NEXTHOP messages.
First, a nexthop group can contain a large number of nexthop buckets (4k
is not unheard of). This imposes limits on the amount of information that
can be encoded for each nexthop bucket given a netlink message is limited
to 64k bytes.
Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at
this point, in the future it can be extended to provide user space with
control over nexthop buckets configuration.
- The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is
adjusted to bounce groups with that type for now.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With the introduction of resilient nexthop groups, there will be two types
of multipath groups: the current hash-threshold "mpath" ones, and resilient
groups. Both are multipath, but to determine the fact, the system needs to
consider two flags. This might prove costly in the datapath. Therefore,
introduce a new flag, that should be set for next-hop groups that have more
than one nexthop, and should be considered multipath.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The cited function currently uses rtnl_dereference() to get nh_info from a
handed-in nexthop. However, under the resilient hashing scheme, this
function will not always be called under RTNL, sometimes the mutual
exclusion will be achieved differently. Therefore move the nh_info
extraction from the function to its callers to make it possible to use a
different synchronization guarantee.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, replace assumes that the new group that is given is a
fully-formed object. But mpath groups really only have one attribute, and
that is the constituent next hop configuration. This may not be universally
true. From the usability perspective, it is desirable to allow the replace
operation to adjust just the constituent next hop configuration and leave
the group attributes as such intact.
But the object that keeps track of whether an attribute was or was not
given is the nh_config object, not the next hop or next-hop group. To allow
(selective) attribute updates during NH group replacement, propagate `cfg'
to replace_nexthop() and further to replace_nexthop_grp().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When there are 2 segments routing header, after an End.B6 action
for example, the second SRH will never be handled by an action, packet will
be dropped when the first SRH has segments left equal to 0.
For actions that doesn't perform decapsulation (currently: End, End.X,
End.T, End.B6, End.B6.Encaps), this patch adds the IP6_FH_F_SKIP_RH flag
in arguments for ipv6_find_hdr().
Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As specified in IETF RFC 8754, section 4.3.1.2, if the upper layer
header is IPv4 or IPv6, perform IPv6 decapsulation and resubmit the
decapsulated packet to the IPv4 or IPv6 module.
Only IPv6 decapsulation was implemented. This patch adds support for IPv4
decapsulation.
Link: https://tools.ietf.org/html/rfc8754#section-4.3.1.2
Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce the new function tw_prot_init (inspired by
req_prot_init) to simplify "proto_register" function.
tw_prot_cleanup will take care of a partially initialized
timewait_sock_ops.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
I tested commit 43042b90cae1 ("svcrdma: Reduce Receive doorbell
rate") with mlx4 (IB) and software iWARP and didn't find any
issues. However, I recently got my hardware iWARP setup back on
line (FastLinQ) and it's crashing hard on this commit (confirmed
via bisect).
The failure mode is complex.
- After a connection is established, the first Receive completes
normally.
- But the second and third Receives have garbage in their Receive
buffers. The server responds with ERR_VERS as a result.
- When the client tears down the connection to retry, a couple
of posted Receives flush twice, and that corrupts the recv_ctxt
free list.
- __svc_rdma_free then faults or loops infinitely while destroying
the xprt's recv_ctxts.
Since 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate") does
not fix a bug but is a scalability enhancement, it's safe and
appropriate to revert it while working on a replacement.
Fixes: 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|