Age | Commit message (Collapse) | Author |
|
This counter tracks number of ACK packets that the host has not sent,
thanks to ACK compression.
Sample output :
$ nstat -n;sleep 1;nstat|egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
IpInReceives 123250 0.0
IpOutRequests 3684 0.0
TcpInSegs 123251 0.0
TcpOutSegs 3684 0.0
TcpExtTCPAckCompressed 119252 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When TCP receives an out-of-order packet, it immediately sends
a SACK packet, generating network load but also forcing the
receiver to send 1-MSS pathological packets, increasing its
RTX queue length/depth, and thus processing time.
Wifi networks suffer from this aggressive behavior, but generally
speaking, all these SACK packets add fuel to the fire when networks
are under congestion.
This patch adds a high resolution timer and tp->compressed_ack counter.
Instead of sending a SACK, we program this timer with a small delay,
based on RTT and capped to 1 ms :
delay = min ( 5 % of RTT, 1 ms)
If subsequent SACKs need to be sent while the timer has not yet
expired, we simply increment tp->compressed_ack.
When timer expires, a SACK is sent with the latest information.
Whenever an ACK is sent (if data is sent, or if in-order
data is received) timer is canceled.
Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
if the sack blocks need to be shuffled, even if the timer has not
expired.
A new SNMP counter is added in the following patch.
Two other patches add sysctls to allow changing the 1,000,000 and 44
values that this commit hard-coded.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
acks in slow start"), TCP stacks have to consider how many packets
are acknowledged in one single ACK, because of GRO, but also
because of ACK compression or losses.
We plan to add SACK compression in the following patch, we
must therefore not call tcp_enter_quickack_mode()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Properly align xsk_proto_ops initialization.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Removed some cases of unnecessary parentheses.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Minor cleanup, remove newline at end of Makefile.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Clean up SPDX-License-Identifier and removing licensing leftovers.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
This fixes memory leaks in cases where we got the station
info but failed sending it out properly.
Fixes: 8689c051a201 ("cfg80211: dynamically allocate per-tid stats for station info")
Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This fixes memory leaks in the case where we just have the
station info on the stack for internal usage without sending
it to cfg80211.
Fixes: 8689c051a201 ("cfg80211: dynamically allocate per-tid stats for station info")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The mac80211 tear down code is not waiting for the driver call back.
This can bring down the the TX path (TID) till the user manually
reconnects. (Observed with iwldvm and enabled TX aggregation.)
The race can be prevented when the ampdu_mlme worker handles the tear
down.
The race:
* ieee80211_sta_tear_down_BA_sessions calls
___ieee80211_stop_tx_ba_session for all TIDs,
* then cancels the ampdu_mlme worker
* and cleanups the TIDs the driver already has called back for.
* ieee80211_stop_tx_ba_cb will never be called for a TID if the callback
came after the the check in ieee80211_sta_tear_down_BA_sessions.
Signed-off-by: Alexander Wetzel <Alexander.Wetzel@web.de>
[johannes: "enabled" -> "blocked" and invert logic, simplify init]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Trivial fix to spelling mistake in pr_debug message text
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Arend's previous patch made the sinfo structure smaller
again by to dynamically allocating the per-tid stats
only when needed. Thus, revert to stack allocation for
the struct to simplify the code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
With the addition of TXQ stats in the per-tid statistics the struct
station_info grew significantly. This resulted in stack size warnings
due to the structure itself being above the limit for the warnings.
Add an allocation function that those who want to provide per-tid
stats should use to allocate the tid array, i.e.
struct station_info::pertid.
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Fixes: 52539ca89f36 ("cfg80211: Expose TXQ stats and parameters to userspace")
Signed-off-by: Arend van Spriel <aspriel@gmail.com>
[johannes: fix missing BIT() and logic by removing]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The mesh_neighbour_update() function, queued via beacon rx, can race with
userspace creating the same station. If the station already exists by the
time mesh_neighbour_update() is called, the function wrongly assumes rate
control has been initialized and calls rate_control_rate_update(), which
in turn calls into the driver.
Updating the rate control before it has been initialized can cause a
crash in some drivers, for example this firmware crash in ath10k due
to sta->rx_nss being 0:
[ 3078.088247] mesh0: Inserted STA 5c:e2:8c:f1:ab:ba
[ 3078.258407] ath10k_pci 0000:0d:00.0: firmware crashed! (uuid d6ed5961-93cc-4d61-803f-5eda55bb8643)
[ 3078.258421] ath10k_pci 0000:0d:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000
[ 3078.258426] ath10k_pci 0000:0d:00.0: kconfig debug 1 debugfs 1 tracing 1 dfs 0 testmode 0
[ 3078.258608] ath10k_pci 0000:0d:00.0: firmware ver 10.2.4.70.59-2 api 5 features no-p2p,raw-mode,mfp crc32 4159f498
[ 3078.258613] ath10k_pci 0000:0d:00.0: board_file api 1 bmi_id N/A crc32 bebc7c08
[ 3078.258617] ath10k_pci 0000:0d:00.0: htt-ver 2.1 wmi-op 5 htt-op 2 cal otp max-sta 128 raw 0 hwcrypto 1
[ 3078.260627] ath10k_pci 0000:0d:00.0: firmware register dump:
[ 3078.260640] ath10k_pci 0000:0d:00.0: [00]: 0x4100016C 0x000015B3 0x009A31BB 0x00955B31
[ 3078.260647] ath10k_pci 0000:0d:00.0: [04]: 0x009A31BB 0x00060130 0x00000008 0x00000007
[ 3078.260652] ath10k_pci 0000:0d:00.0: [08]: 0x00000000 0x00955B31 0x00000000 0x0040F89E
[ 3078.260656] ath10k_pci 0000:0d:00.0: [12]: 0x00000009 0xFFFFFFFF 0x009580F5 0x00958117
[ 3078.260660] ath10k_pci 0000:0d:00.0: [16]: 0x00958080 0x0094085D 0x00000000 0x00000000
[ 3078.260664] ath10k_pci 0000:0d:00.0: [20]: 0x409A31BB 0x0040AA84 0x00000002 0x00000001
[ 3078.260669] ath10k_pci 0000:0d:00.0: [24]: 0x809A2B8D 0x0040AAE4 0x00000088 0xC09A31BB
[ 3078.260673] ath10k_pci 0000:0d:00.0: [28]: 0x809898C8 0x0040AB04 0x0043F91C 0x009C6458
[ 3078.260677] ath10k_pci 0000:0d:00.0: [32]: 0x809B66AC 0x0040AB34 0x009C6458 0x0043F91C
[ 3078.260686] ath10k_pci 0000:0d:00.0: [36]: 0x809B2824 0x0040ADA4 0x00400000 0x00416EB4
[ 3078.260692] ath10k_pci 0000:0d:00.0: [40]: 0x809C07D9 0x0040ADE4 0x0040AE08 0x00412028
[ 3078.260696] ath10k_pci 0000:0d:00.0: [44]: 0x809486FA 0x0040AE04 0x00000001 0x00000000
[ 3078.260700] ath10k_pci 0000:0d:00.0: [48]: 0x80948E2C 0x0040AEA4 0x0041F4F0 0x00412634
[ 3078.260704] ath10k_pci 0000:0d:00.0: [52]: 0x809BFC39 0x0040AEC4 0x0041F4F0 0x00000001
[ 3078.260709] ath10k_pci 0000:0d:00.0: [56]: 0x80940F18 0x0040AF14 0x00000010 0x00403AC0
[ 3078.284130] ath10k_pci 0000:0d:00.0: failed to to request monitor vdev 1 stop: -108
Fix this by checking whether the sta has already initialized rate control
using the flag for that purpose. We can also drop the unnecessary insert
parameter here.
Signed-off-by: Bob Copeland <bobcopeland@fb.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Allocation size of nlmsg in cfg80211_ft_event is based on ric_ies_len
and doesn't take into account ies_len. This leads to
NL80211_CMD_FT_EVENT message construction failure in case ft_event
contains large enough ies buffer.
Add ies_len to the nlmsg allocation size.
Signed-off-by: Dedy Lansky <dlansky@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This function allows to send a HCI command without expecting any
controller event/response in return. This is allowed for vendor-
specific commands only.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
I've seen timeout errors from HCI commands where it looks like
schedule_timeout() has returned immediately; additional logging for the
error case gives:
req_status=1 req_result=0 remaining=10000 jiffies
so the device is still in state HCI_REQ_PEND and the value returned by
schedule_timeout() is the same as the original timeout (HCI_INIT_TIMEOUT
on a system with HZ=1000).
Use wait_event_interruptible_timeout() instead of open-coding similar
behaviour which is subject to the spurious failure described above.
Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
There are some controllers sending out advertising data with illegal
length value which is longer than HCI_MAX_AD_LENGTH, causing the
buffer last_adv_data overflows. To avoid these controllers from
overflowing the buffer, we do not process the advertisement data
if its length is incorrect.
Signed-off-by: Chriz Chow <chriz.chow@aminocom.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2018-05-18
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix two bugs in sockmap, a use after free in sockmap's error path
from sock_map_ctx_update_elem() where we mistakenly drop a reference
we didn't take prior to that, and in the same function fix a race
in bpf_prog_inc_not_zero() where we didn't use the progs from prior
READ_ONCE(), from John.
2) Reject program expansions once we figure out that their jump target
which crosses patchlet boundaries could otherwise get truncated in
insn->off space, from Daniel.
3) Check the return value of fopen() in BPF selftest's test_verifier
where we determine whether unpriv BPF is disabled, and iff we do
fail there then just assume it is disabled. This fixes a segfault
when used with older kernels, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Recently during testing, I ran into the following panic:
[ 207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
[ 207.901637] Modules linked in: binfmt_misc [...]
[ 207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G W 4.17.0-rc3+ #7
[ 207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
[ 207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
[ 207.992603] lr : 0xffff000000bdb754
[ 207.996080] sp : ffff000013703ca0
[ 207.999384] x29: ffff000013703ca0 x28: 0000000000000001
[ 208.004688] x27: 0000000000000001 x26: 0000000000000000
[ 208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
[ 208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
[ 208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
[ 208.025903] x19: ffff000009578000 x18: 0000000000000a03
[ 208.031206] x17: 0000000000000000 x16: 0000000000000000
[ 208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
[ 208.041813] x13: 0000000000000000 x12: 0000000000000000
[ 208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
[ 208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
[ 208.057723] x7 : 000000000000000a x6 : 00280c6160000000
[ 208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
[ 208.068329] x3 : 000000000008647a x2 : 19868179b1484500
[ 208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
[ 208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
[ 208.086235] Call trace:
[ 208.088672] bpf_skb_load_helper_8_no_cache+0x34/0xc0
[ 208.093713] 0xffff000000bdb754
[ 208.096845] bpf_test_run+0x78/0xf8
[ 208.100324] bpf_prog_test_run_skb+0x148/0x230
[ 208.104758] sys_bpf+0x314/0x1198
[ 208.108064] el0_svc_naked+0x30/0x34
[ 208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
[ 208.117717] ---[ end trace 263cb8a59b5bf29f ]---
The program itself which caused this had a long jump over the whole
instruction sequence where all of the inner instructions required
heavy expansions into multiple BPF instructions. Additionally, I also
had BPF hardening enabled which requires once more rewrites of all
constant values in order to blind them. Each time we rewrite insns,
bpf_adj_branches() would need to potentially adjust branch targets
which cross the patchlet boundary to accommodate for the additional
delta. Eventually that lead to the case where the target offset could
not fit into insn->off's upper 0x7fff limit anymore where then offset
wraps around becoming negative (in s16 universe), or vice versa
depending on the jump direction.
Therefore it becomes necessary to detect and reject any such occasions
in a generic way for native eBPF and cBPF to eBPF migrations. For
the latter we can simply check bounds in the bpf_convert_filter()'s
BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
of subsequent hardening) is a bit more complex in that we need to
detect such truncations before hitting the bpf_prog_realloc(). Thus
the latter is split into an extra pass to probe problematic offsets
on the original program in order to fail early. With that in place
and carefully tested I no longer hit the panic and the rewrites are
rejected properly. The above example panic I've seen on bpf-next,
though the issue itself is generic in that a guard against this issue
in bpf seems more appropriate in this case.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add informative messages for error paths related to adding a
VLAN to a device.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Device features may change during transmission. In particular with
corking, a device may toggle scatter-gather in between allocating
and writing to an skb.
Do not unconditionally assume that !NETIF_F_SG at write time implies
that the same held at alloc time and thus the skb has sufficient
tailroom.
This issue predates git history.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Even though ip6erspan_tap_init() sets up hlen and tun_hlen according to
what ERSPAN needs, it goes ahead to call ip6gre_tnl_link_config() which
overwrites these settings with GRE-specific ones.
Similarly for changelink callbacks, which are handled by
ip6gre_changelink() calls ip6gre_tnl_change() calls
ip6gre_tnl_link_config() as well.
The difference ends up being 12 vs. 20 bytes, and this is generally not
a problem, because a 12-byte request likely ends up allocating more and
the extra 8 bytes are thus available. However correct it is not.
So replace the newlink and changelink callbacks with an ERSPAN-specific
ones, reusing the newly-introduced _common() functions.
Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Extract from ip6gre_changelink() a reusable function
ip6gre_changelink_common(). This will allow introduction of
ERSPAN-specific _changelink() function with not a lot of code
duplication.
Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Extract from ip6gre_newlink() a reusable function
ip6gre_newlink_common(). The ip6gre_tnl_link_config() call needs to be
made customizable for ERSPAN, thus reorder it with calls to
ip6_tnl_change_mtu() and dev_hold(), and extract the whole tail to the
caller, ip6gre_newlink(). Thus enable an ERSPAN-specific _newlink()
function without a lot of duplicity.
Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Split a reusable function ip6gre_tnl_copy_tnl_parm() from
ip6gre_tnl_change(). This will allow ERSPAN-specific code to
reuse the common parts while customizing the behavior for ERSPAN.
Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The function ip6gre_tnl_link_config() is used for setting up
configuration of both ip6gretap and ip6erspan tunnels. Split the
function into the common part and the route-lookup part. The latter then
takes the calculated header length as an argument. This split will allow
the patches down the line to sneak in a custom header length computation
for the ERSPAN tunnel.
Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
dev->needed_headroom is not primed until ip6_tnl_xmit(), so it starts
out zero. Thus the call to skb_cow_head() fails to actually make sure
there's enough headroom to push the ERSPAN headers to. That can lead to
the panic cited below. (Reproducer below that).
Fix by requesting either needed_headroom if already primed, or just the
bare minimum needed for the header otherwise.
[ 190.703567] kernel BUG at net/core/skbuff.c:104!
[ 190.708384] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[ 190.714007] Modules linked in: act_mirred cls_matchall ip6_gre ip6_tunnel tunnel6 gre sch_ingress vrf veth x86_pkg_temp_thermal mlx_platform nfsd e1000e leds_mlxcpld
[ 190.728975] CPU: 1 PID: 959 Comm: kworker/1:2 Not tainted 4.17.0-rc4-net_master-custom-139 #10
[ 190.737647] Hardware name: Mellanox Technologies Ltd. "MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016
[ 190.747006] Workqueue: ipv6_addrconf addrconf_dad_work
[ 190.752222] RIP: 0010:skb_panic+0xc3/0x100
[ 190.756358] RSP: 0018:ffff8801d54072f0 EFLAGS: 00010282
[ 190.761629] RAX: 0000000000000085 RBX: ffff8801c1a8ecc0 RCX: 0000000000000000
[ 190.768830] RDX: 0000000000000085 RSI: dffffc0000000000 RDI: ffffed003aa80e54
[ 190.776025] RBP: ffff8801bd1ec5a0 R08: ffffed003aabce19 R09: ffffed003aabce19
[ 190.783226] R10: 0000000000000001 R11: ffffed003aabce18 R12: ffff8801bf695dbe
[ 190.790418] R13: 0000000000000084 R14: 00000000000006c0 R15: ffff8801bf695dc8
[ 190.797621] FS: 0000000000000000(0000) GS:ffff8801d5400000(0000) knlGS:0000000000000000
[ 190.805786] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 190.811582] CR2: 000055fa929aced0 CR3: 0000000003228004 CR4: 00000000001606e0
[ 190.818790] Call Trace:
[ 190.821264] <IRQ>
[ 190.823314] ? ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre]
[ 190.828940] ? ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre]
[ 190.834562] skb_push+0x78/0x90
[ 190.837749] ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre]
[ 190.843219] ? ip6gre_tunnel_ioctl+0xd90/0xd90 [ip6_gre]
[ 190.848577] ? debug_check_no_locks_freed+0x210/0x210
[ 190.853679] ? debug_check_no_locks_freed+0x210/0x210
[ 190.858783] ? print_irqtrace_events+0x120/0x120
[ 190.863451] ? sched_clock_cpu+0x18/0x210
[ 190.867496] ? cyc2ns_read_end+0x10/0x10
[ 190.871474] ? skb_network_protocol+0x76/0x200
[ 190.875977] dev_hard_start_xmit+0x137/0x770
[ 190.880317] ? do_raw_spin_trylock+0x6d/0xa0
[ 190.884624] sch_direct_xmit+0x2ef/0x5d0
[ 190.888589] ? pfifo_fast_dequeue+0x3fa/0x670
[ 190.892994] ? pfifo_fast_change_tx_queue_len+0x810/0x810
[ 190.898455] ? __lock_is_held+0xa0/0x160
[ 190.902422] __qdisc_run+0x39e/0xfc0
[ 190.906041] ? _raw_spin_unlock+0x29/0x40
[ 190.910090] ? pfifo_fast_enqueue+0x24b/0x3e0
[ 190.914501] ? sch_direct_xmit+0x5d0/0x5d0
[ 190.918658] ? pfifo_fast_dequeue+0x670/0x670
[ 190.923047] ? __dev_queue_xmit+0x172/0x1770
[ 190.927365] ? preempt_count_sub+0xf/0xd0
[ 190.931421] __dev_queue_xmit+0x410/0x1770
[ 190.935553] ? ___slab_alloc+0x605/0x930
[ 190.939524] ? print_irqtrace_events+0x120/0x120
[ 190.944186] ? memcpy+0x34/0x50
[ 190.947364] ? netdev_pick_tx+0x1c0/0x1c0
[ 190.951428] ? __skb_clone+0x2fd/0x3d0
[ 190.955218] ? __copy_skb_header+0x270/0x270
[ 190.959537] ? rcu_read_lock_sched_held+0x93/0xa0
[ 190.964282] ? kmem_cache_alloc+0x344/0x4d0
[ 190.968520] ? cyc2ns_read_end+0x10/0x10
[ 190.972495] ? skb_clone+0x123/0x230
[ 190.976112] ? skb_split+0x820/0x820
[ 190.979747] ? tcf_mirred+0x554/0x930 [act_mirred]
[ 190.984582] tcf_mirred+0x554/0x930 [act_mirred]
[ 190.989252] ? tcf_mirred_act_wants_ingress.part.2+0x10/0x10 [act_mirred]
[ 190.996109] ? __lock_acquire+0x706/0x26e0
[ 191.000239] ? sched_clock_cpu+0x18/0x210
[ 191.004294] tcf_action_exec+0xcf/0x2a0
[ 191.008179] tcf_classify+0xfa/0x340
[ 191.011794] __netif_receive_skb_core+0x8e1/0x1c60
[ 191.016630] ? debug_check_no_locks_freed+0x210/0x210
[ 191.021732] ? nf_ingress+0x500/0x500
[ 191.025458] ? process_backlog+0x347/0x4b0
[ 191.029619] ? print_irqtrace_events+0x120/0x120
[ 191.034302] ? lock_acquire+0xd8/0x320
[ 191.038089] ? process_backlog+0x1b6/0x4b0
[ 191.042246] ? process_backlog+0xc2/0x4b0
[ 191.046303] process_backlog+0xc2/0x4b0
[ 191.050189] net_rx_action+0x5cc/0x980
[ 191.053991] ? napi_complete_done+0x2c0/0x2c0
[ 191.058386] ? mark_lock+0x13d/0xb40
[ 191.062001] ? clockevents_program_event+0x6b/0x1d0
[ 191.066922] ? print_irqtrace_events+0x120/0x120
[ 191.071593] ? __lock_is_held+0xa0/0x160
[ 191.075566] __do_softirq+0x1d4/0x9d2
[ 191.079282] ? ip6_finish_output2+0x524/0x1460
[ 191.083771] do_softirq_own_stack+0x2a/0x40
[ 191.087994] </IRQ>
[ 191.090130] do_softirq.part.13+0x38/0x40
[ 191.094178] __local_bh_enable_ip+0x135/0x190
[ 191.098591] ip6_finish_output2+0x54d/0x1460
[ 191.102916] ? ip6_forward_finish+0x2f0/0x2f0
[ 191.107314] ? ip6_mtu+0x3c/0x2c0
[ 191.110674] ? ip6_finish_output+0x2f8/0x650
[ 191.114992] ? ip6_output+0x12a/0x500
[ 191.118696] ip6_output+0x12a/0x500
[ 191.122223] ? ip6_route_dev_notify+0x5b0/0x5b0
[ 191.126807] ? ip6_finish_output+0x650/0x650
[ 191.131120] ? ip6_fragment+0x1a60/0x1a60
[ 191.135182] ? icmp6_dst_alloc+0x26e/0x470
[ 191.139317] mld_sendpack+0x672/0x830
[ 191.143021] ? igmp6_mcf_seq_next+0x2f0/0x2f0
[ 191.147429] ? __local_bh_enable_ip+0x77/0x190
[ 191.151913] ipv6_mc_dad_complete+0x47/0x90
[ 191.156144] addrconf_dad_completed+0x561/0x720
[ 191.160731] ? addrconf_rs_timer+0x3a0/0x3a0
[ 191.165036] ? mark_held_locks+0xc9/0x140
[ 191.169095] ? __local_bh_enable_ip+0x77/0x190
[ 191.173570] ? addrconf_dad_work+0x50d/0xa20
[ 191.177886] ? addrconf_dad_work+0x529/0xa20
[ 191.182194] addrconf_dad_work+0x529/0xa20
[ 191.186342] ? addrconf_dad_completed+0x720/0x720
[ 191.191088] ? __lock_is_held+0xa0/0x160
[ 191.195059] ? process_one_work+0x45d/0xe20
[ 191.199302] ? process_one_work+0x51e/0xe20
[ 191.203531] ? rcu_read_lock_sched_held+0x93/0xa0
[ 191.208279] process_one_work+0x51e/0xe20
[ 191.212340] ? pwq_dec_nr_in_flight+0x200/0x200
[ 191.216912] ? get_lock_stats+0x4b/0xf0
[ 191.220788] ? preempt_count_sub+0xf/0xd0
[ 191.224844] ? worker_thread+0x219/0x860
[ 191.228823] ? do_raw_spin_trylock+0x6d/0xa0
[ 191.233142] worker_thread+0xeb/0x860
[ 191.236848] ? process_one_work+0xe20/0xe20
[ 191.241095] kthread+0x206/0x300
[ 191.244352] ? process_one_work+0xe20/0xe20
[ 191.248587] ? kthread_stop+0x570/0x570
[ 191.252459] ret_from_fork+0x3a/0x50
[ 191.256082] Code: 14 3e ff 8b 4b 78 55 4d 89 f9 41 56 41 55 48 c7 c7 a0 cf db 82 41 54 44 8b 44 24 2c 48 8b 54 24 30 48 8b 74 24 20 e8 16 94 13 ff <0f> 0b 48 c7 c7 60 8e 1f 85 48 83 c4 20 e8 55 ef a6 ff 89 74 24
[ 191.275327] RIP: skb_panic+0xc3/0x100 RSP: ffff8801d54072f0
[ 191.281024] ---[ end trace 7ea51094e099e006 ]---
[ 191.285724] Kernel panic - not syncing: Fatal exception in interrupt
[ 191.292168] Kernel Offset: disabled
[ 191.295697] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Reproducer:
ip link add h1 type veth peer name swp1
ip link add h3 type veth peer name swp3
ip link set dev h1 up
ip address add 192.0.2.1/28 dev h1
ip link add dev vh3 type vrf table 20
ip link set dev h3 master vh3
ip link set dev vh3 up
ip link set dev h3 up
ip link set dev swp3 up
ip address add dev swp3 2001:db8:2::1/64
ip link set dev swp1 up
tc qdisc add dev swp1 clsact
ip link add name gt6 type ip6erspan \
local 2001:db8:2::1 remote 2001:db8:2::2 oseq okey 123
ip link set dev gt6 up
sleep 1
tc filter add dev swp1 ingress pref 1000 matchall skip_hw \
action mirred egress mirror dev gt6
ping -I h1 192.0.2.2
Fixes: e41c7c68ea77 ("ip6erspan: make sure enough headroom at xmit.")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
__gre6_xmit() pushes GRE headers before handing over to ip6_tnl_xmit()
for generic IP-in-IP processing. However it doesn't make sure that there
is enough headroom to push the header to. That can lead to the panic
cited below. (Reproducer below that).
Fix by requesting either needed_headroom if already primed, or just the
bare minimum needed for the header otherwise.
[ 158.576725] kernel BUG at net/core/skbuff.c:104!
[ 158.581510] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[ 158.587174] Modules linked in: act_mirred cls_matchall ip6_gre ip6_tunnel tunnel6 gre sch_ingress vrf veth x86_pkg_temp_thermal mlx_platform nfsd e1000e leds_mlxcpld
[ 158.602268] CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 4.17.0-rc4-net_master-custom-139 #10
[ 158.610938] Hardware name: Mellanox Technologies Ltd. "MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016
[ 158.620426] RIP: 0010:skb_panic+0xc3/0x100
[ 158.624586] RSP: 0018:ffff8801d3f27110 EFLAGS: 00010286
[ 158.629882] RAX: 0000000000000082 RBX: ffff8801c02cc040 RCX: 0000000000000000
[ 158.637127] RDX: 0000000000000082 RSI: dffffc0000000000 RDI: ffffed003a7e4e18
[ 158.644366] RBP: ffff8801bfec8020 R08: ffffed003aabce19 R09: ffffed003aabce19
[ 158.651574] R10: 000000000000000b R11: ffffed003aabce18 R12: ffff8801c364de66
[ 158.658786] R13: 000000000000002c R14: 00000000000000c0 R15: ffff8801c364de68
[ 158.666007] FS: 0000000000000000(0000) GS:ffff8801d5400000(0000) knlGS:0000000000000000
[ 158.674212] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 158.680036] CR2: 00007f4b3702dcd0 CR3: 0000000003228002 CR4: 00000000001606e0
[ 158.687228] Call Trace:
[ 158.689752] ? __gre6_xmit+0x246/0xd80 [ip6_gre]
[ 158.694475] ? __gre6_xmit+0x246/0xd80 [ip6_gre]
[ 158.699141] skb_push+0x78/0x90
[ 158.702344] __gre6_xmit+0x246/0xd80 [ip6_gre]
[ 158.706872] ip6gre_tunnel_xmit+0x3bc/0x610 [ip6_gre]
[ 158.711992] ? __gre6_xmit+0xd80/0xd80 [ip6_gre]
[ 158.716668] ? debug_check_no_locks_freed+0x210/0x210
[ 158.721761] ? print_irqtrace_events+0x120/0x120
[ 158.726461] ? sched_clock_cpu+0x18/0x210
[ 158.730572] ? sched_clock_cpu+0x18/0x210
[ 158.734692] ? cyc2ns_read_end+0x10/0x10
[ 158.738705] ? skb_network_protocol+0x76/0x200
[ 158.743216] ? netif_skb_features+0x1b2/0x550
[ 158.747648] dev_hard_start_xmit+0x137/0x770
[ 158.752010] sch_direct_xmit+0x2ef/0x5d0
[ 158.755992] ? pfifo_fast_dequeue+0x3fa/0x670
[ 158.760460] ? pfifo_fast_change_tx_queue_len+0x810/0x810
[ 158.765975] ? __lock_is_held+0xa0/0x160
[ 158.770002] __qdisc_run+0x39e/0xfc0
[ 158.773673] ? _raw_spin_unlock+0x29/0x40
[ 158.777781] ? pfifo_fast_enqueue+0x24b/0x3e0
[ 158.782191] ? sch_direct_xmit+0x5d0/0x5d0
[ 158.786372] ? pfifo_fast_dequeue+0x670/0x670
[ 158.790818] ? __dev_queue_xmit+0x172/0x1770
[ 158.795195] ? preempt_count_sub+0xf/0xd0
[ 158.799313] __dev_queue_xmit+0x410/0x1770
[ 158.803512] ? ___slab_alloc+0x605/0x930
[ 158.807525] ? ___slab_alloc+0x605/0x930
[ 158.811540] ? memcpy+0x34/0x50
[ 158.814768] ? netdev_pick_tx+0x1c0/0x1c0
[ 158.818895] ? __skb_clone+0x2fd/0x3d0
[ 158.822712] ? __copy_skb_header+0x270/0x270
[ 158.827079] ? rcu_read_lock_sched_held+0x93/0xa0
[ 158.831903] ? kmem_cache_alloc+0x344/0x4d0
[ 158.836199] ? skb_clone+0x123/0x230
[ 158.839869] ? skb_split+0x820/0x820
[ 158.843521] ? tcf_mirred+0x554/0x930 [act_mirred]
[ 158.848407] tcf_mirred+0x554/0x930 [act_mirred]
[ 158.853104] ? tcf_mirred_act_wants_ingress.part.2+0x10/0x10 [act_mirred]
[ 158.860005] ? __lock_acquire+0x706/0x26e0
[ 158.864162] ? mark_lock+0x13d/0xb40
[ 158.867832] tcf_action_exec+0xcf/0x2a0
[ 158.871736] tcf_classify+0xfa/0x340
[ 158.875402] __netif_receive_skb_core+0x8e1/0x1c60
[ 158.880334] ? nf_ingress+0x500/0x500
[ 158.884059] ? process_backlog+0x347/0x4b0
[ 158.888241] ? lock_acquire+0xd8/0x320
[ 158.892050] ? process_backlog+0x1b6/0x4b0
[ 158.896228] ? process_backlog+0xc2/0x4b0
[ 158.900291] process_backlog+0xc2/0x4b0
[ 158.904210] net_rx_action+0x5cc/0x980
[ 158.908047] ? napi_complete_done+0x2c0/0x2c0
[ 158.912525] ? rcu_read_unlock+0x80/0x80
[ 158.916534] ? __lock_is_held+0x34/0x160
[ 158.920541] __do_softirq+0x1d4/0x9d2
[ 158.924308] ? trace_event_raw_event_irq_handler_exit+0x140/0x140
[ 158.930515] run_ksoftirqd+0x1d/0x40
[ 158.934152] smpboot_thread_fn+0x32b/0x690
[ 158.938299] ? sort_range+0x20/0x20
[ 158.941842] ? preempt_count_sub+0xf/0xd0
[ 158.945940] ? schedule+0x5b/0x140
[ 158.949412] kthread+0x206/0x300
[ 158.952689] ? sort_range+0x20/0x20
[ 158.956249] ? kthread_stop+0x570/0x570
[ 158.960164] ret_from_fork+0x3a/0x50
[ 158.963823] Code: 14 3e ff 8b 4b 78 55 4d 89 f9 41 56 41 55 48 c7 c7 a0 cf db 82 41 54 44 8b 44 24 2c 48 8b 54 24 30 48 8b 74 24 20 e8 16 94 13 ff <0f> 0b 48 c7 c7 60 8e 1f 85 48 83 c4 20 e8 55 ef a6 ff 89 74 24
[ 158.983235] RIP: skb_panic+0xc3/0x100 RSP: ffff8801d3f27110
[ 158.988935] ---[ end trace 5af56ee845aa6cc8 ]---
[ 158.993641] Kernel panic - not syncing: Fatal exception in interrupt
[ 159.000176] Kernel Offset: disabled
[ 159.003767] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Reproducer:
ip link add h1 type veth peer name swp1
ip link add h3 type veth peer name swp3
ip link set dev h1 up
ip address add 192.0.2.1/28 dev h1
ip link add dev vh3 type vrf table 20
ip link set dev h3 master vh3
ip link set dev vh3 up
ip link set dev h3 up
ip link set dev swp3 up
ip address add dev swp3 2001:db8:2::1/64
ip link set dev swp1 up
tc qdisc add dev swp1 clsact
ip link add name gt6 type ip6gretap \
local 2001:db8:2::1 remote 2001:db8:2::2
ip link set dev gt6 up
sleep 1
tc filter add dev swp1 ingress pref 1000 matchall skip_hw \
action mirred egress mirror dev gt6
ping -I h1 192.0.2.2
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We recently refactored this code and introduced a static checker
warning. Smatch complains that if cmd->index is zero then we would
underflow the arrays. That's obviously true.
The question is whether we prevent cmd->index from being zero at a
different level. I've looked at the code and I don't immediately see
a check for that.
Fixes: 062b3e1b6d4f ("net/ncsi: Refactor MAC, VLAN filters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzkaller found that following program crashes the host :
{
int fd = socket(AF_SMC, SOCK_STREAM, 0);
int val = 1;
listen(fd, 0);
shutdown(fd, SHUT_RDWR);
setsockopt(fd, 6, TCP_NODELAY, &val, 4);
}
Simply initialize conn.tx_work & conn.send_lock at socket creation,
rather than deeper in the stack.
ODEBUG: assert_init not available (active state 0) object type: timer_list hint: (null)
WARNING: CPU: 1 PID: 13988 at lib/debugobjects.c:329 debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 13988 Comm: syz-executor0 Not tainted 4.17.0-rc4+ #46
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
panic+0x22f/0x4de kernel/panic.c:184
__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:debug_print_object+0x16a/0x210 lib/debugobjects.c:326
RSP: 0018:ffff880197a37880 EFLAGS: 00010086
RAX: 0000000000000061 RBX: 0000000000000005 RCX: ffffc90001ed0000
RDX: 0000000000004aaf RSI: ffffffff8160f6f1 RDI: 0000000000000001
RBP: ffff880197a378c0 R08: ffff8801aa7a0080 R09: ffffed003b5e3eb2
R10: ffffed003b5e3eb2 R11: ffff8801daf1f597 R12: 0000000000000001
R13: ffffffff88d96980 R14: ffffffff87fa19a0 R15: ffffffff81666ec0
debug_object_assert_init+0x309/0x500 lib/debugobjects.c:692
debug_timer_assert_init kernel/time/timer.c:724 [inline]
debug_assert_init kernel/time/timer.c:776 [inline]
del_timer+0x74/0x140 kernel/time/timer.c:1198
try_to_grab_pending+0x439/0x9a0 kernel/workqueue.c:1223
mod_delayed_work_on+0x91/0x250 kernel/workqueue.c:1592
mod_delayed_work include/linux/workqueue.h:541 [inline]
smc_setsockopt+0x387/0x6d0 net/smc/af_smc.c:1367
__sys_setsockopt+0x1bd/0x390 net/socket.c:1903
__do_sys_setsockopt net/socket.c:1914 [inline]
__se_sys_setsockopt net/socket.c:1911 [inline]
__x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 01d2f7e2cdd3 ("net/smc: sockopts TCP_NODELAY and TCP_CORK")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ERSPAN only support version 1 and 2. When packets send to an
erspan device which does not have proper version number set,
drop the packet. In real case, we observe multicast packets
sent to the erspan pernet device, erspan0, which does not have
erspan version configured.
Reported-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
An RTO event indicates the head has not been acked for a long time
after its last (re)transmission. But the other packets are not
necessarily lost if they have been only sent recently (for example
due to application limit). This patch would prohibit marking packets
sent within an RTT to be lost on RTO event, using similar logic in
TCP RACK detection.
Normally the head (SND.UNA) would be marked lost since RTO should
fire strictly after the head was sent. An exception is when the
most recent RACK RTT measurement is larger than the (previous)
RTO. To address this exception the head is always marked lost.
Congestion control interaction: since we may not mark every packet
lost, the congestion window may be more than 1 (inflight plus 1).
But only one packet will be retransmitted after RTO, since
tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The
connection still performs slow start from one packet (with Cubic
congestion control).
This commit was tested in an A/B test with Google web servers,
and showed a reduction of 2% in (spurious) retransmits post
timeout (SlowStartRetrans), and correspondingly reduced DSACKs
(DSACKIgnoredOld) by 7%.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack
to prepare the final RTO change.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Previously when TCP times out, it first updates cwnd and ssthresh,
marks packets lost, and then updates congestion state again. This
was fine because everything not yet delivered is marked lost,
so the inflight is always 0 and cwnd can be safely set to 1 to
retransmit one packet on timeout.
But the inflight may not always be 0 on timeout if TCP changes to
mark packets lost based on packet sent time. Therefore we must
first mark the packet lost, then set the cwnd based on the
(updated) inflight.
This is not a pure refactor. Congestion control may potentially
break if it uses (not yet updated) inflight to compute ssthresh.
Fortunately all existing congestion control modules does not do that.
Also it changes the inflight when CA_LOSS_EVENT is called, and only
westwood processes such an event but does not use inflight.
This change has two other minor side benefits:
1) consistent with Fast Recovery s.t. the inflight is updated
first before tcp_enter_recovery flips state to CA_Recovery.
2) avoid intertwining loss marking with state update, making the
code more readable.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Refactor using a new helper, tcp_timeout_mark_loss(), that marks packets
lost upon RTO.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The previous approach for the lost and retransmit bits was to
wipe the slate clean: zero all the lost and retransmit bits,
correspondingly zero the lost_out and retrans_out counters, and
then add back the lost bits (and correspondingly increment lost_out).
The new approach is to treat this very much like marking packets
lost in fast recovery. We don’t wipe the slate clean. We just say
that for all packets that were not yet marked sacked or lost, we now
mark them as lost in exactly the same way we do for fast recovery.
This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast
Recovery.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is a rewrite of NewReno loss recovery implementation that is
simpler and standalone for readability and better performance by
using less states.
Note that NewReno refers to RFC6582 as a modification to the fast
recovery algorithm. It is used only if the connection does not
support SACK in Linux. It should not to be confused with the Reno
(AIMD) congestion control.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch disables RFC6675 loss detection and make sysctl
net.ipv4.tcp_recovery = 1 controls a binary choice between RACK
(1) or RFC6675 (0).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK.
When the number of packets SACKed is greater or equal to the
threshold, RACK sets the reordering window to zero which would
immediately mark all the unsacked packets below the highest SACKed
sequence lost. Since this approach is known to not work well with
reordering, RACK only uses it if no reordering has been observed.
The DUPACK threshold rule is a particularly useful extension to the
fast recoveries triggered by RACK reordering timer. For example
data-center transfers where the RTT is much smaller than a timer
tick, or high RTT path where the default RTT/4 may take too long.
Note that this patch differs slightly from RFC6675. RFC6675
considers a packet lost when at least #DupThresh higher-sequence
packets are SACKed.
With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering
window to detect losses. But for connections on which we have not
yet seen reordering, this patch considers a packet lost when at
least one higher sequence packet is SACKed and the total number
of SACKed packets is at least DupThresh. For example, suppose a
connection has not seen reordering, and sends 10 packets, and
packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
lost. RACK considers packets 1, 2, 4, 6 lost.
There is some small risk of spurious retransmits here due to
reordering. However, this is mostly limited to the first flight of
a connection on which the sender receives SACKs from reordering.
And RFC 6675 and FACK loss detection have a similar risk on the
first flight with reordering (it's just that the risk of spurious
retransmits from reordering was slightly narrower for those older
algorithms due to the margin of 3*MSS).
Also the minimum reordering window is reduced from 1 msec to 0
to recover quicker on short RTT transfers. Therefore RACK is more
aggressive in marking packets lost during recovery to reduce the
reordering window timeouts.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Updating the FIB tracepoint for the recent change to allow rules using
the protocol and ports exposed a few places where the entries in the flow
struct are not initialized.
For __fib_validate_source add the call to fib4_rules_early_flow_dissect
since it is invoked for the input path. For netfilter, add the memset on
the flow struct to avoid future problems like this. In ip_route_input_slow
need to set the fields if the skb dissection does not happen.
Fixes: bfff4862653b ("net: fib_rules: support for match on ip_proto, sport and dport")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
scatterlist code expects virt_to_page() to work, which fails with
CONFIG_VMAP_STACK=y.
Fixes: c46234ebb4d1e ("tls: RX path for ktls")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After the previous patch, for NOLOCK qdiscs, q->seqlock is
always held when the dequeue() is invoked, we can drop
any additional locking to protect such operation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
So that we can use lockdep on it.
The newly introduced sequence lock has the same scope of busylock,
so it shares the same lockdep annotation, but it's only used for
NOLOCK qdiscs.
With this changeset we acquire such lock in the control path around
flushing operation (qdisc reset), to allow more NOLOCK qdisc perf
improvement in the next patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch creates new attributes to accept a map as argument and
then perform the lookup with the generated hash accordingly.
Both current hash functions are supported: Jenkins and Symmetric Hash.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch uses the map lookup already included to be applied
for random number generation.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nfnetlink tracing is available since nft 0.6 (June 2016).
Remove old nf_log based tracing to avoid rule counter in main loop.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
strlcpy() can't be safely used on a user-space provided string,
as it can try to read beyond the buffer's end, if the latter is
not NULL terminated.
Leveraging the above, syzbot has been able to trigger the following
splat:
BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300
[inline]
BUG: KASAN: stack-out-of-bounds in compat_mtw_from_user
net/bridge/netfilter/ebtables.c:1957 [inline]
BUG: KASAN: stack-out-of-bounds in ebt_size_mwt
net/bridge/netfilter/ebtables.c:2059 [inline]
BUG: KASAN: stack-out-of-bounds in size_entry_mwt
net/bridge/netfilter/ebtables.c:2155 [inline]
BUG: KASAN: stack-out-of-bounds in compat_copy_entries+0x96c/0x14a0
net/bridge/netfilter/ebtables.c:2194
Write of size 33 at addr ffff8801b0abf888 by task syz-executor0/4504
CPU: 0 PID: 4504 Comm: syz-executor0 Not tainted 4.17.0-rc2+ #40
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
print_address_description+0x6c/0x20b mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
check_memory_region_inline mm/kasan/kasan.c:260 [inline]
check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
memcpy+0x37/0x50 mm/kasan/kasan.c:303
strlcpy include/linux/string.h:300 [inline]
compat_mtw_from_user net/bridge/netfilter/ebtables.c:1957 [inline]
ebt_size_mwt net/bridge/netfilter/ebtables.c:2059 [inline]
size_entry_mwt net/bridge/netfilter/ebtables.c:2155 [inline]
compat_copy_entries+0x96c/0x14a0 net/bridge/netfilter/ebtables.c:2194
compat_do_replace+0x483/0x900 net/bridge/netfilter/ebtables.c:2285
compat_do_ebt_set_ctl+0x2ac/0x324 net/bridge/netfilter/ebtables.c:2367
compat_nf_sockopt net/netfilter/nf_sockopt.c:144 [inline]
compat_nf_setsockopt+0x9b/0x140 net/netfilter/nf_sockopt.c:156
compat_ip_setsockopt+0xff/0x140 net/ipv4/ip_sockglue.c:1279
inet_csk_compat_setsockopt+0x97/0x120 net/ipv4/inet_connection_sock.c:1041
compat_tcp_setsockopt+0x49/0x80 net/ipv4/tcp.c:2901
compat_sock_common_setsockopt+0xb4/0x150 net/core/sock.c:3050
__compat_sys_setsockopt+0x1ab/0x7c0 net/compat.c:403
__do_compat_sys_setsockopt net/compat.c:416 [inline]
__se_compat_sys_setsockopt net/compat.c:413 [inline]
__ia32_compat_sys_setsockopt+0xbd/0x150 net/compat.c:413
do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fb3cb9
RSP: 002b:00000000fff0c26c EFLAGS: 00000282 ORIG_RAX: 000000000000016e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000080 RSI: 0000000020000300 RDI: 00000000000005f4
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
The buggy address belongs to the page:
page:ffffea0006c2afc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x2fffc0000000000()
raw: 02fffc0000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: 0000000000000000 ffffea0006c20101 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected
Fix the issue replacing the unsafe function with strscpy() and
taking care of possible errors.
Fixes: 81e675c227ec ("netfilter: ebtables: add CONFIG_COMPAT support")
Reported-and-tested-by: syzbot+4e42a04e0bc33cb6c087@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In the nft_ct_helper_obj_dump(), always priv->helper4 is dereferenced.
But if family is ipv6, priv->helper6 should be dereferenced.
Steps to reproduces:
#test.nft
table ip6 filter {
ct helper ftp {
type "ftp" protocol tcp
}
chain input {
type filter hook input priority 4;
ct helper set "ftp"
}
}
%nft -f test.nft
%nft list ruleset
we can see the below messages:
[ 916.286233] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ 916.294777] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 916.302613] Modules linked in: nft_objref nf_conntrack_sip nf_conntrack_snmp nf_conntrack_broadcast nf_conntrack_ftp nft_ct nf_conntrack nf_tables nfnetlink [last unloaded: nfnetlink]
[ 916.318758] CPU: 1 PID: 2093 Comm: nft Not tainted 4.17.0-rc4+ #181
[ 916.326772] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[ 916.338773] RIP: 0010:strlen+0x1a/0x90
[ 916.342781] RSP: 0018:ffff88010ff0f2f8 EFLAGS: 00010292
[ 916.346773] RAX: dffffc0000000000 RBX: ffff880119b26ee8 RCX: ffff88010c150038
[ 916.354777] RDX: 0000000000000002 RSI: ffff880119b26ee8 RDI: 0000000000000010
[ 916.362773] RBP: 0000000000000010 R08: 0000000000007e88 R09: ffff88010c15003c
[ 916.370773] R10: ffff88010c150037 R11: ffffed002182a007 R12: ffff88010ff04040
[ 916.378779] R13: 0000000000000010 R14: ffff880119b26f30 R15: ffff88010ff04110
[ 916.387265] FS: 00007f57a1997700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
[ 916.394785] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 916.402778] CR2: 00007f57a0ac80f0 CR3: 000000010ff02000 CR4: 00000000001006e0
[ 916.410772] Call Trace:
[ 916.414787] nft_ct_helper_obj_dump+0x94/0x200 [nft_ct]
[ 916.418779] ? nft_ct_set_eval+0x560/0x560 [nft_ct]
[ 916.426771] ? memset+0x1f/0x40
[ 916.426771] ? __nla_reserve+0x92/0xb0
[ 916.434774] ? memcpy+0x34/0x50
[ 916.434774] nf_tables_fill_obj_info+0x484/0x860 [nf_tables]
[ 916.442773] ? __nft_release_basechain+0x600/0x600 [nf_tables]
[ 916.450779] ? lock_acquire+0x193/0x380
[ 916.454771] ? lock_acquire+0x193/0x380
[ 916.458789] ? nf_tables_dump_obj+0x148/0xcb0 [nf_tables]
[ 916.462777] nf_tables_dump_obj+0x5f0/0xcb0 [nf_tables]
[ 916.470769] ? __alloc_skb+0x30b/0x500
[ 916.474779] netlink_dump+0x752/0xb50
[ 916.478775] __netlink_dump_start+0x4d3/0x750
[ 916.482784] nf_tables_getobj+0x27a/0x930 [nf_tables]
[ 916.490774] ? nft_obj_notify+0x100/0x100 [nf_tables]
[ 916.494772] ? nf_tables_getobj+0x930/0x930 [nf_tables]
[ 916.502579] ? nf_tables_dump_flowtable_done+0x70/0x70 [nf_tables]
[ 916.506774] ? nft_obj_notify+0x100/0x100 [nf_tables]
[ 916.514808] nfnetlink_rcv_msg+0x8ab/0xa86 [nfnetlink]
[ 916.518771] ? nfnetlink_rcv_msg+0x550/0xa86 [nfnetlink]
[ 916.526782] netlink_rcv_skb+0x23e/0x360
[ 916.530773] ? nfnetlink_bind+0x200/0x200 [nfnetlink]
[ 916.534778] ? debug_check_no_locks_freed+0x280/0x280
[ 916.542770] ? netlink_ack+0x870/0x870
[ 916.546786] ? ns_capable_common+0xf4/0x130
[ 916.550765] nfnetlink_rcv+0x172/0x16c0 [nfnetlink]
[ 916.554771] ? sched_clock_local+0xe2/0x150
[ 916.558774] ? sched_clock_cpu+0x144/0x180
[ 916.566575] ? lock_acquire+0x380/0x380
[ 916.570775] ? sched_clock_local+0xe2/0x150
[ 916.574765] ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
[ 916.578763] ? sched_clock_cpu+0x144/0x180
[ 916.582770] ? lock_acquire+0x193/0x380
[ 916.590771] ? lock_acquire+0x193/0x380
[ 916.594766] ? lock_acquire+0x380/0x380
[ 916.598760] ? netlink_deliver_tap+0x262/0xa60
[ 916.602766] ? lock_acquire+0x193/0x380
[ 916.606766] netlink_unicast+0x3ef/0x5a0
[ 916.610771] ? netlink_attachskb+0x630/0x630
[ 916.614763] netlink_sendmsg+0x72a/0xb00
[ 916.618769] ? netlink_unicast+0x5a0/0x5a0
[ 916.626766] ? _copy_from_user+0x92/0xc0
[ 916.630773] __sys_sendto+0x202/0x300
[ 916.634772] ? __ia32_sys_getpeername+0xb0/0xb0
[ 916.638759] ? lock_acquire+0x380/0x380
[ 916.642769] ? lock_acquire+0x193/0x380
[ 916.646761] ? finish_task_switch+0xf4/0x560
[ 916.650763] ? __schedule+0x582/0x19a0
[ 916.655301] ? __sched_text_start+0x8/0x8
[ 916.655301] ? up_read+0x1c/0x110
[ 916.655301] ? __do_page_fault+0x48b/0xaa0
[ 916.655301] ? entry_SYSCALL_64_after_hwframe+0x59/0xbe
[ 916.655301] __x64_sys_sendto+0xdd/0x1b0
[ 916.655301] do_syscall_64+0x96/0x3d0
[ 916.655301] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 916.655301] RIP: 0033:0x7f57a0ff5e03
[ 916.655301] RSP: 002b:00007fff6367e0a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 916.655301] RAX: ffffffffffffffda RBX: 00007fff6367f1e0 RCX: 00007f57a0ff5e03
[ 916.655301] RDX: 0000000000000020 RSI: 00007fff6367e110 RDI: 0000000000000003
[ 916.655301] RBP: 00007fff6367e100 R08: 00007f57a0ce9160 R09: 000000000000000c
[ 916.655301] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff6367e110
[ 916.655301] R13: 0000000000000020 R14: 00007f57a153c610 R15: 0000562417258de0
[ 916.655301] Code: ff ff ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 fa 53 48 c1 ea 03 48 b8 00 00 00 00 00 fc ff df 48 89 fd 48 83 ec 08 <0f> b6 04 02 48 89 fa 83 e2 07 38 d0 7f
[ 916.655301] RIP: strlen+0x1a/0x90 RSP: ffff88010ff0f2f8
[ 916.771929] ---[ end trace 1065e048e72479fe ]---
[ 916.777204] Kernel panic - not syncing: Fatal exception
[ 916.778158] Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-05-17
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Provide a new BPF helper for doing a FIB and neighbor lookup
in the kernel tables from an XDP or tc BPF program. The helper
provides a fast-path for forwarding packets. The API supports
IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
implemented in this initial work, from David (Ahern).
2) Just a tiny diff but huge feature enabled for nfp driver by
extending the BPF offload beyond a pure host processing offload.
Offloaded XDP programs are allowed to set the RX queue index and
thus opening the door for defining a fully programmable RSS/n-tuple
filter replacement. Once BPF decided on a queue already, the device
data-path will skip the conventional RSS processing completely,
from Jakub.
3) The original sockmap implementation was array based similar to
devmap. However unlike devmap where an ifindex has a 1:1 mapping
into the map there are use cases with sockets that need to be
referenced using longer keys. Hence, sockhash map is added reusing
as much of the sockmap code as possible, from John.
4) Introduce BTF ID. The ID is allocatd through an IDR similar as
with BPF maps and progs. It also makes BTF accessible to user
space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
through BPF_OBJ_GET_INFO_BY_FD, from Martin.
5) Enable BPF stackmap with build_id also in NMI context. Due to the
up_read() of current->mm->mmap_sem build_id cannot be parsed.
This work defers the up_read() via a per-cpu irq_work so that
at least limited support can be enabled, from Song.
6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
JIT conversion as well as implementation of an optimized 32/64 bit
immediate load in the arm64 JIT that allows to reduce the number of
emitted instructions; in case of tested real-world programs they
were shrinking by three percent, from Daniel.
7) Add ifindex parameter to the libbpf loader in order to enable
BPF offload support. Right now only iproute2 can load offloaded
BPF and this will also enable libbpf for direct integration into
other applications, from David (Beckett).
8) Convert the plain text documentation under Documentation/bpf/ into
RST format since this is the appropriate standard the kernel is
moving to for all documentation. Also add an overview README.rst,
from Jesper.
9) Add __printf verification attribute to the bpf_verifier_vlog()
helper. Though it uses va_list we can still allow gcc to check
the format string, from Mathieu.
10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.
11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.
12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in hashmap then causing overflows
in htab->elem_size. Reject bogus attr->key_size early in the
sock_hash_alloc(), from Yonghong.
13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.
14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.
15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.
16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|