summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-07-03net: adjust socket level ICW to cope with ipv6 variant of {recv, send}msgPaolo Abeni
After the previous patch we have ipv{6,4} variants for {recv,send}msg, we should use the generic _INET ICW variant to call into the proper build-in. This also allows dropping the now unused and rather ugly _INET4 ICW macro v1 -> v2: - use ICW macro to declare inet6_{recv,send}msg - fix a couple of checkpatch offender in the code context Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03ipv6: provide and use ipv6 specific version for {recv, send}msgPaolo Abeni
This will simplify indirect call wrapper invocation in the following patch. No functional change intended, any - out-of-tree - IPv6 user of inet_{recv,send}msg can keep using the existing functions. SCTP code still uses the existing version even for ipv6: as this series will not add ICW for SCTP, moving to the new helper would not give any benefit. The only other in-kernel user of inet_{recv,send}msg is pvcalls_conn_back_read(), but psvcalls explicitly creates only IPv4 socket, so no need to update that code path, too. v1 -> v2: drop inet6_{recv,send}msg declaration from header file, prefer ICW macro instead Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03inet: factor out inet_send_prepare()Paolo Abeni
The same code is replicated verbatim in multiple places, and the next patches will introduce an additional user for it. Factor out a helper and use it where appropriate. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03bonding: validate ip header before check IPPROTO_IGMPCong Wang
bond_xmit_roundrobin() checks for IGMP packets but it parses the IP header even before checking skb->protocol. We should validate the IP header with pskb_may_pull() before using iph->protocol. Reported-and-tested-by: syzbot+e5be16aa39ad6e755391@syzkaller.appspotmail.com Fixes: a2fd940f4cff ("bonding: fix broken multicast with round-robin mode") Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03net/mlx5: Properly name the generic WQE control fieldTariq Toukan
A generic WQE control field is used for different purposes in different cases. Use union to allow using the proper name in each case. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03net/mlx5: Introduce TLS TX offload hardware bits and structuresEran Ben Elisha
Add TLS offload related IFC structs, layouts and enumerations. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03net/mlx5: Refactor mlx5_esw_query_functions for modularityParav Pandit
Functions change event output data size changes when functions other than VFs will be enabled in HCA CAP. With current API, multiple callers needs to align, calculate accurate size of the output data depending on number on non VF functions enabled in the device. Instead of duplicating such math at multiple places, refactor mlx5_esw_query_functions() to return raw output allocated by itself. Caller must free the allocated memory using kvfree() as described in the function comment section. This hides calcuation within mlx5_esw_query_functions() and provides simpler API. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03net/mlx5: E-Switch prepare functions change handler to be modularParav Pandit
Eswitch function change handler will service multiple type of events for VFs and non VF functions update. Hence, introduce and use the helper function esw_vfs_changed_event_handler() for handling change in num VFs to improve the code readability. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03net/mlx5: Introduce and use mlx5_eswitch_get_total_vports()Parav Pandit
Instead MLX5_TOTAL_VPORTS, use mlx5_eswitch_get_total_vports(). mlx5_eswitch_get_total_vports() in subsequent patch accounts for SF vports as well. Expanding MLX5_TOTAL_VPORTS macro would require exposing SF internals to more generic vport.h header file. Such exposure is not desired. Hence a mlx5_eswitch_get_total_vports() is introduced. Given that mlx5_eswitch_get_total_vports() API wants to work on const mlx5_core_dev*, change its helper functions also to accept const *dev. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-07-03 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH on BE architectures, from Jiong. 2) Fix several bugs in the x32 BPF JIT for handling shifts by 0, from Luke and Xi. 3) Fix NULL pointer deref in btf_type_is_resolve_source_only(), from Stanislav. 4) Properly handle the check that forwarding is enabled on the device in bpf_ipv6_fib_lookup() helper code, from Anton. 5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit alignment such as m68k, from Baruch. 6) Fix kernel hanging in unregister_netdevice loop while unregistering device bound to XDP socket, from Ilya. 7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan. 8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri. 9) Fix bpftool to use correct argument in cgroup errors, from Jakub. 10) Fix detaching dummy prog in XDP redirect sample code, from Prashant. 11) Add Jonathan to AF_XDP reviewers, from Björn. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03net: hns: add support for vlan TSOYonglong Liu
The hip07 chip support vlan TSO, this patch adds NETIF_F_TSO and NETIF_F_TSO6 flags to vlan_features to improve the performance after adding vlan to the net ports. Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03sctp: count data bundling sack chunk for outctrlchunksXin Long
Now all ctrl chunks are counted for asoc stats.octrlchunks and net SCTP_MIB_OUTCTRLCHUNKS either after queuing up or bundling, other than the chunk maked and bundled in sctp_packet_bundle_sack, which caused 'outctrlchunks' not consistent with 'inctrlchunks' in peer. This issue exists since very beginning, here to fix it by increasing both net SCTP_MIB_OUTCTRLCHUNKS and asoc stats.octrlchunks when sack chunk is maked and bundled in sctp_packet_bundle_sack. Reported-by: Ja Ram Jeon <jajeon@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03qlcnic: remove redundant assignment to variable errColin Ian King
The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03atl1c: remove redundant assignment to variable tpd_reqColin Ian King
The variable tpd_req is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03r8152: move calling r8153b_rx_agg_chg_indicate()Hayes Wang
r8153b_rx_agg_chg_indicate() needs to be called after enabling TX/RX and before calling rxdy_gated_en(tp, false). Otherwise, the change of the settings of RX aggregation wouldn't work. Besides, adjust rtl8152_set_coalesce() for the same reason. If rx_coalesce_usecs is changed, restart TX/RX to let the setting work. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03qed: Add support for Timestamping the unicast PTP packets.Sudarsana Reddy Kalluru
This patch adds driver changes to detect/timestamp the unicast PTP packets. Changes from previous version: ------------------------------- v2: Defined a macro for unicast ptp param mask. Please consider applying this to "net-next". Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03gve: Fix u64_stats_sync to initialize startCatherine Sullivan
u64_stats_fetch_begin needs to initialize start. Signed-off-by: Catherine Sullivan <csully@google.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03net: don't warn in inet diag when IPV6 is disabledStephen Hemminger
If IPV6 was disabled, then ss command would cause a kernel warning because the command was attempting to dump IPV6 socket information. The fix is to just remove the warning. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202249 Fixes: 432490f9d455 ("net: ip, diag -- Add diag interface for raw sockets") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03loopback: fix lockdep splatMahesh Bandewar
dev_init_scheduler() and dev_activate() expect the caller to hold RTNL. Since we don't want blackhole device to be initialized per ns, we are initializing at init. [ 3.855027] Call Trace: [ 3.855034] dump_stack+0x67/0x95 [ 3.855037] lockdep_rcu_suspicious+0xd5/0x110 [ 3.855044] dev_init_scheduler+0xe3/0x120 [ 3.855048] ? net_olddevs_init+0x60/0x60 [ 3.855050] blackhole_netdev_init+0x45/0x6e [ 3.855052] do_one_initcall+0x6c/0x2fa [ 3.855058] ? rcu_read_lock_sched_held+0x8c/0xa0 [ 3.855066] kernel_init_freeable+0x1e5/0x288 [ 3.855071] ? rest_init+0x260/0x260 [ 3.855074] kernel_init+0xf/0x180 [ 3.855076] ? rest_init+0x260/0x260 [ 3.855078] ret_from_fork+0x24/0x30 Fixes: 4de83b88c66 ("loopback: create blackhole net device similar to loopack.") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03net/mlx5: Expose device definitions for object eventsYishai Hadas
Expose an extra device definitions for objects events. It includes: object_type values for legacy objects and generic data header for any other object. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03net/mlx5: Report EQE data upon CQ completionYishai Hadas
Report EQE data upon CQ completion to let upper layers use this data. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03net/mlx5: Report a CQ error event only when a handler was setYishai Hadas
Report a CQ error event only when a handler was set. This enables mlx5_ib to not set a handler upon CQ creation and use some other mechanism to get this event as of other events by the mlx5_eq_notifier_register API. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03net/mlx5: mlx5_core_create_cq() enhancementsYishai Hadas
Enhance mlx5_core_create_cq() to get the command out buffer from the callers to let them use the output. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03net/mlx5: Expose the API to register for ANY eventYishai Hadas
Expose the API to register for ANY event, mlx5_ib will be able to use this functionality for its needs. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03net/mlx5: Use event mask based on device capabilitiesYishai Hadas
Use the reported device capabilities for the supported user events (i.e. affiliated and un-affiliated) to set the EQ mask. As the event mask can be up to 256 defined by 4 entries of u64 change the applicable code to work accordingly. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03net/mlx5: Fix mlx5_core_destroy_cq() error flowYishai Hadas
The firmware command to destroy a CQ might fail when the object is referenced by other object and the ref count is managed by the firmware. To enable a second successful destruction post the first failure need to change mlx5_eq_del_cq() to be a void function. As an error in mlx5_eq_del_cq() is quite fatal from the option to recover, a debug message inside it should be good enougth and it was changed to be void. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03Merge branch 'bpf-tcp-rtt-hook'Daniel Borkmann
Stanislav Fomichev says: ==================== Congestion control team would like to have a periodic callback to track some TCP statistics. Let's add a sock_ops callback that can be selectively enabled on a socket by socket basis and is executed for every RTT. BPF program frequency can be further controlled by calling bpf_ktime_get_ns and bailing out early. I run neper tcp_stream and tcp_rr tests with the sample program from the last patch and didn't observe any noticeable performance difference. v2: * add a comment about second accept() in selftest (Yonghong Song) * refer to tcp_bpf.readme in sample program (Yonghong Song) ==================== Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03samples/bpf: fix tcp_bpf.readme detach commandStanislav Fomichev
Copy-paste, should be detach, not attach. Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03samples/bpf: add sample program that periodically dumps TCP statsStanislav Fomichev
Uses new RTT callback to dump stats every second. $ mkdir -p /tmp/cgroupv2 $ mount -t cgroup2 none /tmp/cgroupv2 $ mkdir -p /tmp/cgroupv2/foo $ echo $$ >> /tmp/cgroupv2/foo/cgroup.procs $ bpftool prog load ./tcp_dumpstats_kern.o /sys/fs/bpf/tcp_prog $ bpftool cgroup attach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog $ bpftool prog tracelog $ # run neper/netperf/etc Used neper to compare performance with and without this program attached and didn't see any noticeable performance impact. Sample output: <idle>-0 [015] ..s. 2074.128800: 0: dsack_dups=0 delivered=242526 <idle>-0 [015] ..s. 2074.128808: 0: delivered_ce=0 icsk_retransmits=0 <idle>-0 [015] ..s. 2075.130133: 0: dsack_dups=0 delivered=323599 <idle>-0 [015] ..s. 2075.130138: 0: delivered_ce=0 icsk_retransmits=0 <idle>-0 [005] .Ns. 2076.131440: 0: dsack_dups=0 delivered=404648 <idle>-0 [005] .Ns. 2076.131447: 0: delivered_ce=0 icsk_retransmits=0 Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03selftests/bpf: test BPF_SOCK_OPS_RTT_CBStanislav Fomichev
Make sure the callback is invoked for syn-ack and data packet. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf/tools: sync bpf.hStanislav Fomichev
Sync new bpf_tcp_sock fields and new BPF_PROG_TYPE_SOCK_OPS RTT callback. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf: add icsk_retransmits to bpf_tcp_sockStanislav Fomichev
Add some inet_connection_sock fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf: add dsack_dups/delivered{, _ce} to bpf_tcp_sockStanislav Fomichev
Add more fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf: split shared bpf_tcp_sock and bpf_sock_ops implementationStanislav Fomichev
We've added bpf_tcp_sock member to bpf_sock_ops and don't expect any new tcp_sock fields in bpf_sock_ops. Let's remove CONVERT_COMMON_TCP_SOCK_FIELDS so bpf_tcp_sock can be independently extended. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf: add BPF_CGROUP_SOCK_OPS callback that is executed on every RTTStanislav Fomichev
Performance impact should be minimal because it's under a new BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled. Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03xdp: fix hang while unregistering device bound to xdp socketIlya Maximets
Device that bound to XDP socket will not have zero refcount until the userspace application will not close it. This leads to hang inside 'netdev_wait_allrefs()' if device unregistering requested: # ip link del p1 < hang on recvmsg on netlink socket > # ps -x | grep ip 5126 pts/0 D+ 0:00 ip link del p1 # journalctl -b Jun 05 07:19:16 kernel: unregister_netdevice: waiting for p1 to become free. Usage count = 1 Jun 05 07:19:27 kernel: unregister_netdevice: waiting for p1 to become free. Usage count = 1 ... Fix that by implementing NETDEV_UNREGISTER event notification handler to properly clean up all the resources and unref device. This should also allow socket killing via ss(8) utility. Fixes: 965a99098443 ("xsk: add support for bind for Rx") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03xdp: hold device for umem regardless of zero-copy modeIlya Maximets
Device pointer stored in umem regardless of zero-copy mode, so we heed to hold the device in all cases. Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03selftests: bpf: fix inlines in test_lwt_seg6localJiri Benc
Selftests are reporting this failure in test_lwt_seg6local.sh: + ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj test_lwt_seg6local.o sec encap_srh dev veth2 Error fetching program/map! Failed to parse eBPF program: Operation not permitted The problem is __attribute__((always_inline)) alone is not enough to prevent clang from inserting those functions in .text. In that case, .text is not marked as relocateable. See the output of objdump -h test_lwt_seg6local.o: Idx Name Size VMA LMA File off Algn 0 .text 00003530 0000000000000000 0000000000000000 00000040 2**3 CONTENTS, ALLOC, LOAD, READONLY, CODE This causes the iproute bpf loader to fail in bpf_fetch_prog_sec: bpf_has_call_data returns true but bpf_fetch_prog_relo fails as there's no relocateable .text section in the file. To fix this, convert to 'static __always_inline'. v2: Use 'static __always_inline' instead of 'static inline __attribute__((always_inline))' Fixes: c99a84eac026 ("selftests/bpf: test for seg6local End.BPF action") Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03selftests: bpf: standardize to static __always_inlineJiri Benc
The progs for bpf selftests use several different notations to force function inlining. Standardize to what most of them use, static __always_inline. Suggested-by: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf: Add support for fq's EDT to HBMbrakmo
Adds support for fq's Earliest Departure Time to HBM (Host Bandwidth Manager). Includes a new BPF program supporting EDT, and also updates corresponding programs. It will drop packets with an EDT of more than 500us in the future unless the packet belongs to a flow with less than 2 packets in flight. This is done so each flow has at least 2 packets in flight, so they will not starve, and also to help prevent delayed ACK timeouts. It will also work with ECN enabled traffic, where the packets will be CE marked if their EDT is more than 50us in the future. The table below shows some performance numbers. The flows are back to back RPCS. One server sending to another, either 2 or 4 flows. One flow is a 10KB RPC, the rest are 1MB RPCs. When there are more than one flow of a given RPC size, the numbers represent averages. The rate limit applies to all flows (they are in the same cgroup). Tests ending with "-edt" ran with the new BPF program supporting EDT. Tests ending with "-hbt" ran on top HBT qdisc with the specified rate (i.e. no HBM). The other tests ran with the HBM BPF program included in the HBM patch-set. EDT has limited value when using DCTCP, but it helps in many cases when using Cubic. It usually achieves larger link utilization and lower 99% latencies for the 1MB RPCs. HBM ends up queueing a lot of packets with its default parameter values, reducing the goodput of the 10KB RPCs and increasing their latency. Also, the RTTs seen by the flows are quite large. Aggr 10K 10K 10K 1MB 1MB 1MB Limit rate drops RTT rate P90 P99 rate P90 P99 Test rate Flows Mbps % us Mbps us us Mbps ms ms -------- ---- ----- ---- ----- --- ---- ---- ---- ---- ---- ---- cubic 1G 2 904 0.02 108 257 511 539 647 13.4 24.5 cubic-edt 1G 2 982 0.01 156 239 656 967 743 14.0 17.2 dctcp 1G 2 977 0.00 105 324 408 744 653 14.5 15.9 dctcp-edt 1G 2 981 0.01 142 321 417 811 660 15.7 17.0 cubic-htb 1G 2 919 0.00 1825 40 2822 4140 879 9.7 9.9 cubic 200M 2 155 0.30 220 81 532 655 74 283 450 cubic-edt 200M 2 188 0.02 222 87 1035 1095 101 84 85 dctcp 200M 2 188 0.03 111 77 912 939 111 76 325 dctcp-edt 200M 2 188 0.03 217 74 1416 1738 114 76 79 cubic-htb 200M 2 188 0.00 5015 8 14ms 15ms 180 48 50 cubic 1G 4 952 0.03 110 165 516 546 262 38 154 cubic-edt 1G 4 973 0.01 190 111 1034 1314 287 65 79 dctcp 1G 4 951 0.00 103 180 617 905 257 37 38 dctcp-edt 1G 4 967 0.00 163 151 732 1126 272 43 55 cubic-htb 1G 4 914 0.00 3249 13 7ms 8ms 300 29 34 cubic 5G 4 4236 0.00 134 305 490 624 1310 10 17 cubic-edt 5G 4 4865 0.00 156 306 425 759 1520 10 16 dctcp 5G 4 4936 0.00 128 485 221 409 1484 7 9 dctcp-edt 5G 4 4924 0.00 148 390 392 623 1508 11 26 v1 -> v2: Incorporated Andrii's suggestions v2 -> v3: Incorporated Yonghong's suggestions v3 -> v4: Removed credit update that is not needed Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf, libbpf, smatch: Fix potential NULL pointer dereferenceLeo Yan
Based on the following report from Smatch, fix the potential NULL pointer dereference check: tools/lib/bpf/libbpf.c:3493 bpf_prog_load_xattr() warn: variable dereferenced before check 'attr' (see line 3483) 3479 int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr, 3480 struct bpf_object **pobj, int *prog_fd) 3481 { 3482 struct bpf_object_open_attr open_attr = { 3483 .file = attr->file, 3484 .prog_type = attr->prog_type, ^^^^^^ 3485 }; At the head of function, it directly access 'attr' without checking if it's NULL pointer. This patch moves the values assignment after validating 'attr' and 'attr->file'. Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03libbpf: fix GCC8 warning for strncpyAndrii Nakryiko
GCC8 started emitting warning about using strncpy with number of bytes exactly equal destination size, which is generally unsafe, as can lead to non-zero terminated string being copied. Use IFNAMSIZ - 1 as number of bytes to ensure name is always zero-terminated. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03selftests: bpf: add tests for shifts by zeroLuke Nelson
There are currently no tests for ALU64 shift operations when the shift amount is 0. This adds 6 new tests to make sure they are equivalent to a no-op. The x32 JIT had such bugs that could have been caught by these tests. Cc: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf, x32: Fix bug with ALU64 {LSH, RSH, ARSH} BPF_K shift by 0Luke Nelson
The current x32 BPF JIT does not correctly compile shift operations when the immediate shift amount is 0. The expected behavior is for this to be a no-op. The following program demonstrates the bug. The expexceted result is 1, but the current JITed code returns 2. r0 = 1 r1 = 1 r1 <<= 0 if r1 == 1 goto end r0 = 2 end: exit This patch simplifies the code and fixes the bug. Fixes: 03f5781be2c7 ("bpf, x86_32: add eBPF JIT compiler for ia32") Co-developed-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf, x32: Fix bug with ALU64 {LSH, RSH, ARSH} BPF_X shift by 0Luke Nelson
The current x32 BPF JIT for shift operations is not correct when the shift amount in a register is 0. The expected behavior is a no-op, whereas the current implementation changes bits in the destination register. The following example demonstrates the bug. The expected result of this program is 1, but the current JITed code returns 2. r0 = 1 r1 = 1 r2 = 0 r1 <<= r2 if r1 == 1 goto end r0 = 2 end: exit The bug is caused by an incorrect assumption by the JIT that a shift by 32 clear the register. On x32 however, shifts use the lower 5 bits of the source, making a shift by 32 equivalent to a shift by 0. This patch fixes the bug using double-precision shifts, which also simplifies the code. Fixes: 03f5781be2c7 ("bpf, x86_32: add eBPF JIT compiler for ia32") Co-developed-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03bpf: fix precision trackingAlexei Starovoitov
When equivalent state is found the current state needs to propagate precision marks. Otherwise the verifier will prune the search incorrectly. There is a price for correctness: before before broken fixed cnst spill precise precise bpf_lb-DLB_L3.o 1923 8128 1863 1898 bpf_lb-DLB_L4.o 3077 6707 2468 2666 bpf_lb-DUNKNOWN.o 1062 1062 544 544 bpf_lxc-DDROP_ALL.o 166729 380712 22629 36823 bpf_lxc-DUNKNOWN.o 174607 440652 28805 45325 bpf_netdev.o 8407 31904 6801 7002 bpf_overlay.o 5420 23569 4754 4858 bpf_lxc_jit.o 39389 359445 50925 69631 Overall precision tracking is still very effective. Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Reported-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Tested-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03xfrm interface: fix memory leak on creationNicolas Dichtel
The following commands produce a backtrace and return an error but the xfrm interface is created (in the wrong netns): $ ip netns add foo $ ip netns add bar $ ip -n foo netns set bar 0 $ ip -n foo link add xfrmi0 link-netnsid 0 type xfrm dev lo if_id 23 RTNETLINK answers: Invalid argument $ ip -n bar link ls xfrmi0 2: xfrmi0@lo: <NOARP,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/none 00:00:00:00:00:00 brd 00:00:00:00:00:00 Here is the backtrace: [ 79.879174] WARNING: CPU: 0 PID: 1178 at net/core/dev.c:8172 rollback_registered_many+0x86/0x3c1 [ 79.880260] Modules linked in: xfrm_interface nfsv3 nfs_acl auth_rpcgss nfsv4 nfs lockd grace sunrpc fscache button parport_pc parport serio_raw evdev pcspkr loop ext4 crc16 mbcache jbd2 crc32c_generic ide_cd_mod ide_gd_mod cdrom ata_$ eneric ata_piix libata scsi_mod 8139too piix psmouse i2c_piix4 ide_core 8139cp mii i2c_core floppy [ 79.883698] CPU: 0 PID: 1178 Comm: ip Not tainted 5.2.0-rc6+ #106 [ 79.884462] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 79.885447] RIP: 0010:rollback_registered_many+0x86/0x3c1 [ 79.886120] Code: 01 e8 d7 7d c6 ff 0f 0b 48 8b 45 00 4c 8b 20 48 8d 58 90 49 83 ec 70 48 8d 7b 70 48 39 ef 74 44 8a 83 d0 04 00 00 84 c0 75 1f <0f> 0b e8 61 cd ff ff 48 b8 00 01 00 00 00 00 ad de 48 89 43 70 66 [ 79.888667] RSP: 0018:ffffc900015ab740 EFLAGS: 00010246 [ 79.889339] RAX: ffff8882353e5700 RBX: ffff8882353e56a0 RCX: ffff8882353e5710 [ 79.890174] RDX: ffffc900015ab7e0 RSI: ffffc900015ab7e0 RDI: ffff8882353e5710 [ 79.891029] RBP: ffffc900015ab7e0 R08: ffffc900015ab7e0 R09: ffffc900015ab7e0 [ 79.891866] R10: ffffc900015ab7a0 R11: ffffffff82233fec R12: ffffc900015ab770 [ 79.892728] R13: ffffffff81eb7ec0 R14: ffff88822ed6cf00 R15: 00000000ffffffea [ 79.893557] FS: 00007ff350f31740(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000 [ 79.894581] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 79.895317] CR2: 00000000006c8580 CR3: 000000022c272000 CR4: 00000000000006f0 [ 79.896137] Call Trace: [ 79.896464] unregister_netdevice_many+0x12/0x6c [ 79.896998] __rtnl_newlink+0x6e2/0x73b [ 79.897446] ? __kmalloc_node_track_caller+0x15e/0x185 [ 79.898039] ? pskb_expand_head+0x5f/0x1fe [ 79.898556] ? stack_access_ok+0xd/0x2c [ 79.899009] ? deref_stack_reg+0x12/0x20 [ 79.899462] ? stack_access_ok+0xd/0x2c [ 79.899927] ? stack_access_ok+0xd/0x2c [ 79.900404] ? __module_text_address+0x9/0x4f [ 79.900910] ? is_bpf_text_address+0x5/0xc [ 79.901390] ? kernel_text_address+0x67/0x7b [ 79.901884] ? __kernel_text_address+0x1a/0x25 [ 79.902397] ? unwind_get_return_address+0x12/0x23 [ 79.903122] ? __cmpxchg_double_slab.isra.37+0x46/0x77 [ 79.903772] rtnl_newlink+0x43/0x56 [ 79.904217] rtnetlink_rcv_msg+0x200/0x24c In fact, each time a xfrm interface was created, a netdev was allocated by __rtnl_newlink()/rtnl_create_link() and then another one by xfrmi_newlink()/xfrmi_create(). Only the second one was registered, it's why the previous commands produce a backtrace: dev_change_net_namespace() was called on a netdev with reg_state set to NETREG_UNINITIALIZED (the first one). CC: Lorenzo Colitti <lorenzo@google.com> CC: Benedict Wong <benedictwong@google.com> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Shannon Nelson <shannon.nelson@oracle.com> CC: Antony Antony <antony@phenome.org> CC: Eyal Birger <eyal.birger@gmail.com> Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Reported-by: Julien Floret <julien.floret@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-07-03xfrm: policy: fix bydst hlist corruption on hash rebuildFlorian Westphal
syzbot reported following spat: BUG: KASAN: use-after-free in __write_once_size include/linux/compiler.h:221 BUG: KASAN: use-after-free in hlist_del_rcu include/linux/rculist.h:455 BUG: KASAN: use-after-free in xfrm_hash_rebuild+0xa0d/0x1000 net/xfrm/xfrm_policy.c:1318 Write of size 8 at addr ffff888095e79c00 by task kworker/1:3/8066 Workqueue: events xfrm_hash_rebuild Call Trace: __write_once_size include/linux/compiler.h:221 [inline] hlist_del_rcu include/linux/rculist.h:455 [inline] xfrm_hash_rebuild+0xa0d/0x1000 net/xfrm/xfrm_policy.c:1318 process_one_work+0x814/0x1130 kernel/workqueue.c:2269 Allocated by task 8064: __kmalloc+0x23c/0x310 mm/slab.c:3669 kzalloc include/linux/slab.h:742 [inline] xfrm_hash_alloc+0x38/0xe0 net/xfrm/xfrm_hash.c:21 xfrm_policy_init net/xfrm/xfrm_policy.c:4036 [inline] xfrm_net_init+0x269/0xd60 net/xfrm/xfrm_policy.c:4120 ops_init+0x336/0x420 net/core/net_namespace.c:130 setup_net+0x212/0x690 net/core/net_namespace.c:316 The faulting address is the address of the old chain head, free'd by xfrm_hash_resize(). In xfrm_hash_rehash(), chain heads get re-initialized without any hlist_del_rcu: for (i = hmask; i >= 0; i--) INIT_HLIST_HEAD(odst + i); Then, hlist_del_rcu() gets called on the about to-be-reinserted policy when iterating the per-net list of policies. hlist_del_rcu() will then make chain->first be nonzero again: static inline void __hlist_del(struct hlist_node *n) { struct hlist_node *next = n->next; // address of next element in list struct hlist_node **pprev = n->pprev;// location of previous elem, this // can point at chain->first WRITE_ONCE(*pprev, next); // chain->first points to next elem if (next) next->pprev = pprev; Then, when we walk chainlist to find insertion point, we may find a non-empty list even though we're supposedly reinserting the first policy to an empty chain. To fix this first unlink all exact and inexact policies instead of zeroing the list heads. Add the commands equivalent to the syzbot reproducer to xfrm_policy.sh, without fix KASAN catches the corruption as it happens, SLUB poisoning detects it a bit later. Reported-by: syzbot+0165480d4ef07360eeda@syzkaller.appspotmail.com Fixes: 1548bc4e0512 ("xfrm: policy: delete inexact policies from inexact list on hash rebuild") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-07-02selftests/net: skip psock_tpacket test if KALLSYMS was not enabledPo-Hsu Lin
The psock_tpacket test will need to access /proc/kallsyms, this would require the kernel config CONFIG_KALLSYMS to be enabled first. Apart from adding CONFIG_KALLSYMS to the net/config file here, check the file existence to determine if we can run this test will be helpful to avoid a false-positive test result when testing it directly with the following commad against a kernel that have CONFIG_KALLSYMS disabled: make -C tools/testing/selftests TARGETS=net run_tests Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Acked-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-02mlxsw: spectrum_ptp: Fix validation in mlxsw_sp1_ptp_packet_finish()Petr Machata
Before mlxsw_sp1_ptp_packet_finish() sends the packet back, it validates whether the corresponding port is still valid. However the condition is incorrect: when mlxsw_sp_port == NULL, the code dereferences the port to compare it to skb->dev. The condition needs to check whether the port is present and skb->dev still refers to that port (or else is NULL). If that does not hold, bail out. Add a pair of parentheses to fix the condition. Fixes: d92e4e6e33c8 ("mlxsw: spectrum: PTP: Support timestamping on Spectrum-1") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>