summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-10-11bpf, sockmap: Remove dropped data on errors in redirect caseJohn Fastabend
In the sk_skb redirect case we didn't handle the case where we overrun the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress. Because we didn't have anything implemented we simply dropped the skb. This meant data could be dropped if socket memory accounting was in place. This fixes the above dropped data case by moving the memory checks later in the code where we actually do the send or recv. This pushes those checks into the workqueue and allows us to return an EAGAIN error which in turn allows us to try again later from the workqueue. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226863689.5692.13861422742592309285.stgit@john-Precision-5820-Tower
2020-10-11bpf, sockmap: Remove skb_set_owner_w wmem will be taken later from sendpageJohn Fastabend
The skb_set_owner_w is unnecessary here. The sendpage call will create a fresh skb and set the owner correctly from workqueue. Its also not entirely harmless because it consumes cycles, but also impacts resource accounting by increasing sk_wmem_alloc. This is charging the socket we are going to send to for the skb, but we will put it on the workqueue for some time before this happens so we are artifically inflating sk_wmem_alloc for this period. Further, we don't know how many skbs will be used to send the packet or how it will be broken up when sent over the new socket so charging it with one big sum is also not correct when the workqueue may break it up if facing memory pressure. Seeing we don't know how/when this is going to be sent drop the early accounting. A later patch will do proper accounting charged on receive socket for the case where skbs get enqueued on the workqueue. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226861708.5692.17964237936462425136.stgit@john-Precision-5820-Tower
2020-10-11bpf, sockmap: On receive programs try to fast track SK_PASS ingressJohn Fastabend
When we receive an skb and the ingress skb verdict program returns SK_PASS we currently set the ingress flag and put it on the workqueue so it can be turned into a sk_msg and put on the sk_msg ingress queue. Then finally telling userspace with data_ready hook. Here we observe that if the workqueue is empty then we can try to convert into a sk_msg type and call data_ready directly without bouncing through a workqueue. Its a common pattern to have a recv verdict program for visibility that always returns SK_PASS. In this case unless there is an ENOMEM error or we overrun the socket we can avoid the workqueue completely only using it when we fall back to error cases caused by memory pressure. By doing this we eliminate another case where data may be dropped if errors occur on memory limits in workqueue. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower
2020-10-11bpf, sockmap: Skb verdict SK_PASS to self already checked rmem limitsJohn Fastabend
For sk_skb case where skb_verdict program returns SK_PASS to continue to pass packet up the stack, the memory limits were already checked before enqueuing in skb_queue_tail from TCP side. So, lets remove the extra checks here. The theory is if the TCP stack believes we have memory to receive the packet then lets trust the stack and not double check the limits. In fact the accounting here can cause a drop if sk_rmem_alloc has increased after the stack accepted this packet, but before the duplicate check here. And worse if this happens because TCP stack already believes the data has been received there is no retransmit. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower
2020-10-11bpf: Always return target ifindex in bpf_fib_lookupToke Høiland-Jørgensen
The bpf_fib_lookup() helper performs a neighbour lookup for the destination IP and returns BPF_FIB_LKUP_NO_NEIGH if this fails, with the expectation that the BPF program will pass the packet up the stack in this case. However, with the addition of bpf_redirect_neigh() that can be used instead to perform the neighbour lookup, at the cost of a bit of duplicated work. For that we still need the target ifindex, and since bpf_fib_lookup() already has that at the time it performs the neighbour lookup, there is really no reason why it can't just return it in any case. So let's just always return the ifindex if the FIB lookup itself succeeds. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Link: https://lore.kernel.org/bpf/20201009184234.134214-1-toke@redhat.com
2020-10-11bpf: Allow for map-in-map with dynamic inner array map entriesDaniel Borkmann
Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf ("bpf: Relax max_entries check for most of the inner map types") added support for dynamic inner max elements for most map-in-map types. Exceptions were maps like array or prog array where the map_gen_lookup() callback uses the maps' max_entries field as a constant when emitting instructions. We recently implemented Maglev consistent hashing into Cilium's load balancer which uses map-in-map with an outer map being hash and inner being array holding the Maglev backend table for each service. This has been designed this way in order to reduce overall memory consumption given the outer hash map allows to avoid preallocating a large, flat memory area for all services. Also, the number of service mappings is not always known a-priori. The use case for dynamic inner array map entries is to further reduce memory overhead, for example, some services might just have a small number of back ends while others could have a large number. Right now the Maglev backend table for small and large number of backends would need to have the same inner array map entries which adds a lot of unneeded overhead. Dynamic inner array map entries can be realized by avoiding the inlined code generation for their lookup. The lookup will still be efficient since it will be calling into array_map_lookup_elem() directly and thus avoiding retpoline. The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips inline code generation and relaxes array_map_meta_equal() check to ignore both maps' max_entries. This also still allows to have faster lookups for map-in-map when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed. Example code generation where inner map is dynamic sized array: # bpftool p d x i 125 int handle__sys_enter(void * ctx): ; int handle__sys_enter(void *ctx) 0: (b4) w1 = 0 ; int key = 0; 1: (63) *(u32 *)(r10 -4) = r1 2: (bf) r2 = r10 ; 3: (07) r2 += -4 ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key); 4: (18) r1 = map[id:468] 6: (07) r1 += 272 7: (61) r0 = *(u32 *)(r2 +0) 8: (35) if r0 >= 0x3 goto pc+5 9: (67) r0 <<= 3 10: (0f) r0 += r1 11: (79) r0 = *(u64 *)(r0 +0) 12: (15) if r0 == 0x0 goto pc+1 13: (05) goto pc+1 14: (b7) r0 = 0 15: (b4) w6 = -1 ; if (!inner_map) 16: (15) if r0 == 0x0 goto pc+6 17: (bf) r2 = r10 ; 18: (07) r2 += -4 ; val = bpf_map_lookup_elem(inner_map, &key); 19: (bf) r1 = r0 | No inlining but instead 20: (85) call array_map_lookup_elem#149280 | call to array_map_lookup_elem() ; return val ? *val : -1; | for inner array lookup. 21: (15) if r0 == 0x0 goto pc+1 ; return val ? *val : -1; 22: (61) r6 = *(u32 *)(r0 +0) ; } 23: (bc) w0 = w6 24: (95) exit Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
2020-10-11bpf: Add redirect_peer helperDaniel Borkmann
Add an efficient ingress to ingress netns switch that can be used out of tc BPF programs in order to redirect traffic from host ns ingress into a container veth device ingress without having to go via CPU backlog queue [0]. For local containers this can also be utilized and path via CPU backlog queue only needs to be taken once, not twice. On a high level this borrows from ipvlan which does similar switch in __netif_receive_skb_core() and then iterates via another_round. This helps to reduce latency for mentioned use cases. Pod to remote pod with redirect(), TCP_RR [1]: # percpu_netperf 10.217.1.33 RT_LATENCY: 122.450 (per CPU: 122.666 122.401 122.333 122.401 ) MEAN_LATENCY: 121.210 (per CPU: 121.100 121.260 121.320 121.160 ) STDDEV_LATENCY: 120.040 (per CPU: 119.420 119.910 125.460 115.370 ) MIN_LATENCY: 46.500 (per CPU: 47.000 47.000 47.000 45.000 ) P50_LATENCY: 118.500 (per CPU: 118.000 119.000 118.000 119.000 ) P90_LATENCY: 127.500 (per CPU: 127.000 128.000 127.000 128.000 ) P99_LATENCY: 130.750 (per CPU: 131.000 131.000 129.000 132.000 ) TRANSACTION_RATE: 32666.400 (per CPU: 8152.200 8169.842 8174.439 8169.897 ) Pod to remote pod with redirect_peer(), TCP_RR: # percpu_netperf 10.217.1.33 RT_LATENCY: 44.449 (per CPU: 43.767 43.127 45.279 45.622 ) MEAN_LATENCY: 45.065 (per CPU: 44.030 45.530 45.190 45.510 ) STDDEV_LATENCY: 84.823 (per CPU: 66.770 97.290 84.380 90.850 ) MIN_LATENCY: 33.500 (per CPU: 33.000 33.000 34.000 34.000 ) P50_LATENCY: 43.250 (per CPU: 43.000 43.000 43.000 44.000 ) P90_LATENCY: 46.750 (per CPU: 46.000 47.000 47.000 47.000 ) P99_LATENCY: 52.750 (per CPU: 51.000 54.000 53.000 53.000 ) TRANSACTION_RATE: 90039.500 (per CPU: 22848.186 23187.089 22085.077 21919.130 ) [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
2020-10-09bpf: Add tcp_notsent_lowat bpf setsockoptNikita V. Shirokov
Adding support for TCP_NOTSENT_LOWAT sockoption (https://lwn.net/Articles/560082/) in tcp bpf programs. Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201009070325.226855-1-tehnerd@tehnerd.com
2020-10-09xsk: Introduce padding between ring pointersMagnus Karlsson
Introduce one cache line worth of padding between the producer and consumer pointers in all the lockless rings. This so that the HW adjacency prefetcher will not prefetch the consumer pointer when the producer pointer is used and vice versa. This improves throughput performance for the l2fwd sample app with 2% on my machine with HW prefetching turned on. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/1602166338-21378-1-git-send-email-magnus.karlsson@gmail.com
2020-10-05xsk: Remove internal DMA headersBjörn Töpel
Christoph Hellwig correctly pointed out [1] that the AF_XDP core was pointlessly including internal headers. Let us remove those includes. [1] https://lore.kernel.org/bpf/20201005084341.GA3224@infradead.org/ Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem") Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20201005090525.116689-1-bjorn.topel@gmail.com
2020-10-02bpf, sockmap: Add skb_adjust_room to pop bytes off ingress payloadJohn Fastabend
This implements a new helper skb_adjust_room() so users can push/pop extra bytes from a BPF_SK_SKB_STREAM_VERDICT program. Some protocols may include headers and other information that we may not want to include when doing a redirect from a BPF_SK_SKB_STREAM_VERDICT program. One use case is to redirect TLS packets into a receive socket that doesn't expect TLS data. In TLS case the first 13B or so contain the protocol header. With KTLS the payload is decrypted so we should be able to redirect this to a receiving socket, but the receiving socket may not be expecting to receive a TLS header and discard the data. Using the above helper we can pop the header off and put an appropriate header on the payload. This allows for creating a proxy between protocols without extra hops through the stack or userspace. So in order to fix this case add skb_adjust_room() so users can strip the header. After this the user can strip the header and an unmodified receiver thread will work correctly when data is redirected into the ingress path of a sock. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160160099197.7052.8443193973242831692.stgit@john-Precision-5820-Tower
2020-10-02bpf: tcp: Do not limit cb_flags when creating child sk from listen skMartin KaFai Lau
The commit 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option") unnecessarily introduced bpf_skops_init_child() which limited the child sk from inheriting all bpf_sock_ops_cb_flags of the listen sk. That breaks existing user expectation. This patch removes the bpf_skops_init_child() and just allows sock_copy() to do its job to copy everything from listen sk to the child sk. Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option") Reported-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201002013448.2542025-1-kafai@fb.com
2020-10-01net-sysfs: Fix inconsistent of format with argument type in net-sysfs.cYe Bin
Fix follow warnings: [net/core/net-sysfs.c:1161]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'int'. [net/core/net-sysfs.c:1162]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'int'. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-01pktgen: Fix inconsistent of format with argument type in pktgen.cYe Bin
Fix follow warnings: [net/core/pktgen.c:925]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'signed int'. [net/core/pktgen.c:942]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'signed int'. [net/core/pktgen.c:962]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'signed int'. [net/core/pktgen.c:984]: (warning) %u in format string (no. 1) requires 'unsigned int' but the argument type is 'signed int'. [net/core/pktgen.c:1149]: (warning) %d in format string (no. 1) requires 'int' but the argument type is 'unsigned int'. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-10-01 The following pull-request contains BPF updates for your *net-next* tree. We've added 90 non-merge commits during the last 8 day(s) which contain a total of 103 files changed, 7662 insertions(+), 1894 deletions(-). Note that once bpf(/net) tree gets merged into net-next, there will be a small merge conflict in tools/lib/bpf/btf.c between commit 1245008122d7 ("libbpf: Fix native endian assumption when parsing BTF") from the bpf tree and the commit 3289959b97ca ("libbpf: Support BTF loading and raw data output in both endianness") from the bpf-next tree. Correct resolution would be to stick with bpf-next, it should look like: [...] /* check BTF magic */ if (fread(&magic, 1, sizeof(magic), f) < sizeof(magic)) { err = -EIO; goto err_out; } if (magic != BTF_MAGIC && magic != bswap_16(BTF_MAGIC)) { /* definitely not a raw BTF */ err = -EPROTO; goto err_out; } /* get file size */ [...] The main changes are: 1) Add bpf_snprintf_btf() and bpf_seq_printf_btf() helpers to support displaying BTF-based kernel data structures out of BPF programs, from Alan Maguire. 2) Speed up RCU tasks trace grace periods by a factor of 50 & fix a few race conditions exposed by it. It was discussed to take these via BPF and networking tree to get better testing exposure, from Paul E. McKenney. 3) Support multi-attach for freplace programs, needed for incremental attachment of multiple XDP progs using libxdp dispatcher model, from Toke Høiland-Jørgensen. 4) libbpf support for appending new BTF types at the end of BTF object, allowing intrusive changes of prog's BTF (useful for future linking), from Andrii Nakryiko. 5) Several BPF helper improvements e.g. avoid atomic op in cookie generator and add a redirect helper into neighboring subsys, from Daniel Borkmann. 6) Allow map updates on sockmaps from bpf_iter context in order to migrate sockmaps from one to another, from Lorenz Bauer. 7) Fix 32 bit to 64 bit assignment from latest alu32 bounds tracking which caused a verifier issue due to type downgrade to scalar, from John Fastabend. 8) Follow-up on tail-call support in BPF subprogs which optimizes x64 JIT prologue and epilogue sections, from Maciej Fijalkowski. 9) Add an option to perf RB map to improve sharing of event entries by avoiding remove- on-close behavior. Also, add BPF_PROG_TEST_RUN for raw_tracepoint, from Song Liu. 10) Fix a crash in AF_XDP's socket_release when memory allocation for UMEMs fails, from Magnus Karlsson. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30drop_monitor: Filter control packets in drop monitorIdo Schimmel
Previously, devlink called into drop monitor in order to report hardware originated drops / exceptions. devlink intentionally filtered control packets and did not pass them to drop monitor as they were not dropped by the underlying hardware. Now drop monitor registers its probe on a generic 'devlink_trap_report' tracepoint and should therefore perform this filtering itself instead of having devlink do that. Add the trap type as metadata and have drop monitor ignore control packets. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30drop_monitor: Remove duplicate structIdo Schimmel
'struct net_dm_hw_metadata' is a duplicate of 'struct devlink_trap_metadata'. Remove the former and simplify the code. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30drop_monitor: Remove no longer used functionsIdo Schimmel
The old probe functions that were invoked by drop monitor code are no longer called and can thus be removed. They were replaced by actual probe functions that are registered on the recently introduced 'devlink_trap_report' tracepoint. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30drop_monitor: Convert to using devlink tracepointIdo Schimmel
Convert drop monitor to use the recently introduced 'devlink_trap_report' tracepoint instead of having devlink call into drop monitor. This is both consistent with software originated drops ('kfree_skb' tracepoint) and also allows drop monitor to be built as a module and still report hardware originated drops. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30drop_monitor: Prepare probe functions for devlink tracepointIdo Schimmel
Drop monitor supports two alerting modes: Summary and packet. Prepare a probe function for each, so that they could be later registered on the devlink tracepoint by calling register_trace_devlink_trap_report(), based on the configured alerting mode. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30devlink: Add a tracepoint for trap reportsIdo Schimmel
Add a tracepoint for trap reports so that drop monitor could register its probe on it. Use trace_devlink_trap_report_enabled() to avoid wasting cycles setting the trap metadata if the tracepoint is not enabled. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30tcp: add exponential backoff in __tcp_send_ack()Eric Dumazet
Whenever host is under very high memory pressure, __tcp_send_ack() skb allocation fails, and we setup a 200 ms (TCP_DELACK_MAX) timer before retrying. On hosts with high number of TCP sockets, we can spend considerable amount of cpu cycles in these attempts, add high pressure on various spinlocks in mm-layer, ultimately blocking threads attempting to free space from making any progress. This patch adds standard exponential backoff to avoid adding fuel to the fire. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30inet: remove icsk_ack.blockedEric Dumazet
TCP has been using it to work around the possibility of tcp_delack_timer() finding the socket owned by user. After commit 6f458dfb4092 ("tcp: improve latencies of timer triggered events") we added TCP_DELACK_TIMER_DEFERRED atomic bit for more immediate recovery, so we can get rid of icsk_ack.blocked This frees space that following patch will reuse. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-30bpf: Add redirect_neigh helper as redirect drop-inDaniel Borkmann
Add a redirect_neigh() helper as redirect() drop-in replacement for the xmit side. Main idea for the helper is to be very similar in semantics to the latter just that the skb gets injected into the neighboring subsystem in order to let the stack do the work it knows best anyway to populate the L2 addresses of the packet and then hand over to dev_queue_xmit() as redirect() does. This solves two bigger items: i) skbs don't need to go up to the stack on the host facing veth ingress side for traffic egressing the container to achieve the same for populating L2 which also has the huge advantage that ii) the skb->sk won't get orphaned in ip_rcv_core() when entering the IP routing layer on the host stack. Given that skb->sk neither gets orphaned when crossing the netns as per 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb.") the helper can then push the skbs directly to the phys device where FQ scheduler can do its work and TCP stack gets proper backpressure given we hold on to skb->sk as long as skb is still residing in queues. With the helper used in BPF data path to then push the skb to the phys device, I observed a stable/consistent TCP_STREAM improvement on veth devices for traffic going container -> host -> host -> container from ~10Gbps to ~15Gbps for a single stream in my test environment. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/f207de81629e1724899b73b8112e0013be782d35.1601477936.git.daniel@iogearbox.net
2020-09-30bpf, net: Rework cookie generator as per-cpu oneDaniel Borkmann
With its use in BPF, the cookie generator can be called very frequently in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg) and attached to the root cgroup, for example, when used in v1/v2 mixed environments. In particular, when there's a high churn on sockets in the system there can be many parallel requests to the bpf_get_socket_cookie() and bpf_get_netns_cookie() helpers which then cause contention on the atomic counter. As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino allocator"), add a small helper library that both can use for the 64 bit counters. Given this can be called from different contexts, we also need to deal with potential nested calls even though in practice they are considered extremely rare. One idea as suggested by Eric Dumazet was to use a reverse counter for this situation since we don't expect 64 bit overflows anyways; that way, we can avoid bigger gaps in the 64 bit counter space compared to just batch-wise increase. Even on machines with small number of cores (e.g. 4) the cookie generation shrinks from min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run in parallel from multiple CPUs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net
2020-09-30bpf: Add classid helper only based on skb->skDaniel Borkmann
Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid from v2 hooks"), add a helper to retrieve cgroup v1 classid solely based on the skb->sk, so it can be used as key as part of BPF map lookups out of tc from host ns, in particular given the skb->sk is retained these days when crossing net ns thanks to 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb."). This is similar to bpf_skb_cgroup_id() which implements the same for v2. Kubernetes ecosystem is still operating on v1 however, hence net_cls needs to be used there until this can be dropped in with the v2 helper of bpf_skb_cgroup_id(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/ed633cf27a1c620e901c5aa99ebdefb028dce600.1601477936.git.daniel@iogearbox.net
2020-09-30bpf: fix raw_tp test run in preempt kernelSong Liu
In preempt kernel, BPF_PROG_TEST_RUN on raw_tp triggers: [ 35.874974] BUG: using smp_processor_id() in preemptible [00000000] code: new_name/87 [ 35.893983] caller is bpf_prog_test_run_raw_tp+0xd4/0x1b0 [ 35.900124] CPU: 1 PID: 87 Comm: new_name Not tainted 5.9.0-rc6-g615bd02bf #1 [ 35.907358] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 35.916941] Call Trace: [ 35.919660] dump_stack+0x77/0x9b [ 35.923273] check_preemption_disabled+0xb4/0xc0 [ 35.928376] bpf_prog_test_run_raw_tp+0xd4/0x1b0 [ 35.933872] ? selinux_bpf+0xd/0x70 [ 35.937532] __do_sys_bpf+0x6bb/0x21e0 [ 35.941570] ? find_held_lock+0x2d/0x90 [ 35.945687] ? vfs_write+0x150/0x220 [ 35.949586] do_syscall_64+0x2d/0x40 [ 35.953443] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by calling migrate_disable() before smp_processor_id(). Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-09-29net: Add netif_rx_any_context()Sebastian Andrzej Siewior
Quite some drivers make conditional decisions based on in_interrupt() to invoke either netif_rx() or netif_rx_ni(). Conditionals based on in_interrupt() or other variants of preempt count checks in drivers should not exist for various reasons and Linus clearly requested to either split the code pathes or pass an argument to the common functions which provides the context. This is obviously the correct solution, but for some of the affected drivers this needs a major rewrite due to their convoluted structure. As in_interrupt() usage in drivers needs to be phased out, provide netif_rx_any_context() as a stop gap for these drivers. This confines the in_interrupt() conditional to core code which in turn allows to remove the access to this check for driver code and provides one central place to do further modifications once the driver maze is cleaned up. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29l2tp: report rx cookie discards in netlink getTom Parkin
When an L2TPv3 session receives a data frame with an incorrect cookie l2tp_core logs a warning message and bumps a stats counter to reflect the fact that the packet has been dropped. However, the stats counter in question is missing from the l2tp_netlink get message for tunnel and session instances. Include the statistic in the netlink get response. Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2020-09-29 Here's the main bluetooth-next pull request for 5.10: - Multiple fixes to suspend/resume handling - Added mgmt events for controller suspend/resume state - Improved extended advertising support - btintel: Enhanced support for next generation controllers - Added Qualcomm Bluetooth SoC WCN6855 support - Several other smaller fixes & improvements ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29xsk: Fix a documentation mistake in xsk_queue.hCiara Loftus
After 'peeking' the ring, the consumer, not the producer, reads the data. Fix this mistake in the comments. Fixes: 15d8c9162ced ("xsk: Add function naming comments and reorder functions") Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20200928082344.17110-1-ciara.loftus@intel.com
2020-09-28net/sched: cls_u32: Replace one-element array with flexible-array memberGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. Refactor the code according to the use of a flexible-array member in struct tc_u_hnode and use the struct_size() helper to calculate the size for the allocations. Commit 5778d39d070b ("net_sched: fix struct tc_u_hnode layout in u32") makes it clear that the code is expected to dynamically allocate divisor + 1 entries for ->ht[] in tc_uhnode. Also, based on other observations, as the piece of code below: 1232 for (h = 0; h <= ht->divisor; h++) { 1233 for (n = rtnl_dereference(ht->ht[h]); 1234 n; 1235 n = rtnl_dereference(n->next)) { 1236 if (tc_skip_hw(n->flags)) 1237 continue; 1238 1239 err = u32_reoffload_knode(tp, n, add, cb, 1240 cb_priv, extack); 1241 if (err) 1242 return err; 1243 } 1244 } we can assume that, in general, the code is actually expecting to allocate that extra space for the one-element array in tc_uhnode, everytime it allocates memory for instances of tc_uhnode or tc_u_common structures. That's the reason for passing '1' as the last argument for struct_size() in the allocation for _root_ht_ and _tp_c_, and 'divisor + 1' in the allocation code for _ht_. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays Tested-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/5f7062af.z3T9tn9yIPv6h5Ny%25lkp@intel.com/ Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28bpf: sockmap: Enable map_update_elem from bpf_iterLorenz Bauer
Allow passing a pointer to a BTF struct sock_common* when updating a sockmap or sockhash. Since BTF pointers can fault and therefore be NULL at runtime we need to add an additional !sk check to sock_map_update_elem. Since we may be passed a request or timewait socket we also need to check sk_fullsock. Doing this allows calling map_update_elem on sockmap from bpf_iter context, which uses BTF pointers. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200928090805.23343-2-lmb@cloudflare.com
2020-09-28ip6gre: avoid tx_error when sending MLD/DAD on external tunnelsDavide Caratti
similarly to what has been done with commit 9d149045b3c0 ("geneve: change from tx_error to tx_dropped on missing metadata"), avoid reporting errors to userspace in case the kernel doesn't find any tunnel information for a skb that is going to be transmitted: an increase of tx_dropped is enough. tested with the following script: # for t in ip6gre ip6gretap ip6erspan; do > ip link add dev gre6-test0 type $t external > ip address add dev gre6-test0 2001:db8::1/64 > ip link set dev gre6-test0 up > sleep 30 > ip -s -j link show dev gre6-test0 | jq \ > '.[0].stats64.tx | {"errors": .errors, "dropped": .dropped}' > ip link del dev gre6-test0 > done Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: CLC decline - V2 enhancementsUrsula Braun
This patch covers the small SMCD version 2 changes for CLC decline. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: introduce CLC first contact extensionUrsula Braun
SMC Version 2 defines a first contact extension for CLC accept and CLC confirm. This patch covers sending and receiving of the CLC first contact extension. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: CLC accept / confirm V2Ursula Braun
The new format of SMCD V2 CLC accept and confirm is introduced, and building and checking of SMCD V2 CLC accepts / confirms is adapted accordingly. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: determine accepted ISM devicesUrsula Braun
SMCD Version 2 allows to propose up to 8 additional ISM devices offered to the peer as candidates for SMCD communication. This patch covers the server side, i.e. selection of an ISM device matching one of the proposed ISM devices, that will be used for CLC accept Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: build and send V2 CLC proposalUrsula Braun
The new format of an SMCD V2 CLC proposal is introduced, and building and checking of SMCD V2 CLC proposals is adapted accordingly. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: determine proposed ISM devicesUrsula Braun
SMCD Version 2 allows to propose up to 8 additional ISM devices offered to the peer as candidates for SMCD communication. This patch covers determination of the ISM devices to be proposed. ISM devices without PNETID are preferred, since ISM devices with PNETID are a V1 leftover and will disappear over the time. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: introduce list of pnetids for Ethernet devicesUrsula Braun
SMCD version 2 allows usage of ISM devices with hardware PNETID only, if an Ethernet net_device exists with the same hardware PNETID. This requires to maintain a list of pnetids belonging to Ethernet net_devices, which is covered by this patch. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: introduce CHID callback for ISM devicesUrsula Braun
With SMCD version 2 the CHIDs of ISM devices are needed for the CLC handshake. This patch provides the new callback to retrieve the CHID of an ISM device. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: introduce System Enterprise ID (SEID)Ursula Braun
SMCD version 2 defines a System Enterprise ID (short SEID). This patch contains the SEID creation and adds the callback to retrieve the created SEID. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: prepare for more proposed ISM devicesUrsula Braun
SMCD Version 2 allows proposing of up to 8 ISM devices in addition to the native ISM device of SMCD Version 1. This patch prepares the struct smc_init_info to deal with these additional 8 ISM devices. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: split CLC confirm/accept data to be sentUrsula Braun
When sending CLC confirm and CLC accept, separate the trailing part of the message from the initial part (to be prepared for future first contact extension). Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: separate find device functionsUrsula Braun
This patch provides better separation of device determinations in function smc_listen_work(). No functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: CLC header fields renamingUrsula Braun
SMCD version 2 defines 2 more bits in the CLC header to specify version 2 types. This patch prepares better naming of the CLC header fields. No functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28net/smc: remove constant and introduce helper to check for a pnet idKarsten Graul
Use the existing symbol _S instead of SMC_ASCII_BLANK, and introduce a helper to check if a pnetid is set. No functional change. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28bpf: Enable BPF_PROG_TEST_RUN for raw_tracepointSong Liu
Add .test_run for raw_tracepoint. Also, introduce a new feature that runs the target program on a specific CPU. This is achieved by a new flag in bpf_attr.test, BPF_F_TEST_RUN_ON_CPU. When this flag is set, the program is triggered on cpu with id bpf_attr.test.cpu. This feature is needed for BPF programs that handle perf_event and other percpu resources, as the program can access these resource locally. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200925205432.1777-2-songliubraving@fb.com
2020-09-28udp_tunnel: add the ability to share port tablesJakub Kicinski
Unfortunately recent Intel NIC designs share the UDP port table across netdevs. So far the UDP tunnel port state was maintained per netdev, we need to extend that to cater to Intel NICs. Expect NICs to allocate the info structure dynamically and link to the state from there. All the shared NICs will record port offload information in the one instance of the table so we need to make sure that the use count can accommodate larger numbers. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>