Age | Commit message | Author
2019-07-03 | nfsd: use 64-bit seconds fields in nfsd v4 code | J. Bruce Fields
After commit 95582b008388 "vfs: change inode times to use struct timespec64" there are spots in the NFSv4 decoding where we decode the protocol into a struct timespec and then convert that into a timespec64. That's unnecessary in the NFSv4 case since the on-the-wire protocol also uses 64-bit values. So just fix up our code to use timespec64 everywhere. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | nfsd: Spelling s/EACCESS/EACCES/ | Geert Uytterhoeven
The correct spelling is EACCES: include/uapi/asm-generic/errno-base.h:#define EACCES 13 /* Permission denied */ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | lockd: Make two symbols static | YueHaibing
Fix sparse warnings:
    fs/lockd/clntproc.c:57:6: warning: symbol 'nlmclnt_put_lockowner' was not declared. Should it be static?
    fs/lockd/svclock.c:409:35: warning: symbol 'nlmsvc_lock_ops' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | locks: Cleanup lm_compare_owner and lm_owner_key | Benjamin Coddington
After the update to use nlm_lockowners for the NLM server, there are no more users of lm_compare_owner and lm_owner_key. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | lockd: Show pid of lockd for remote locks | Benjamin Coddington
Use the pid of lockd instead of the remote lock's svid for the fl_pid for local POSIX locks. This allows proper enumeration of which local process owns which lock. The svid is meaningless to local lock readers. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | lockd: Remove lm_compare_owner and lm_owner_key | Benjamin Coddington
Now that the NLM server allocates an nlm_lockowner for fl_owner, there's no need for special hashing or comparison. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | lockd: Convert NLM service fl_owner to nlm_lockowner | Benjamin Coddington
Do as the NLM client: allocate and track a struct nlm_lockowner for use as the fl_owner for locks created by the NLM server. This allows us to keep the svid within this structure for matching locks, and will allow us to track the pid of lockd in a future patch. It should also allow easier reference of the nlm_host in conflicting locks, and simplify lock hashing and comparison. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> [bfields@redhat.com: fix type of some error returns] Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | lockd: prepare nlm_lockowner for use by the server | Benjamin Coddington
The nlm_lockowner structure that the client uses to track locks is generally useful to the server as well. Very similar functions to handle allocation and tracking of the nlm_lockowner will follow. Rename the client functions for clarity. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | nfsd: note inadequate stats locking | J. Bruce Fields
After 89a26b3d295d "nfsd: split DRC global spinlock into per-bucket locks", there is no longer a single global spinlock to protect these stats. So, really we need to fix that. For now, at least fix the comment. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | nfsd4: drc containerization | J. Bruce Fields
The nfsd duplicate reply cache should not be shared between network namespaces. The most straightforward way to fix this is just to move every global in the code to per-net-namespace memory, so that's what we do. Still todo: sort out which members of nfsd_stats should be global and which per-net-namespace. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | nfsd: don't call nfsd_reply_cache_shutdown twice | J. Bruce Fields
The caller is cleaning up on ENOMEM, don't try to do it here too. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 | nfsd: Fix overflow causing non-working mounts on 1 TB machines | Paul Menzel
Since commit 10a68cdf10 (nfsd: fix performance-limiting session calculation) (Linux 5.1-rc1 and 4.19.31), shares from NFS servers with 1 TB of memory cannot be mounted anymore. The mount just hangs on the client. The gist of commit 10a68cdf10 is the change below.
    -avail = clamp_t(int, avail, slotsize, avail/3);
    +avail = clamp_t(int, avail, slotsize, total_avail/3);
Here are the macros.
    #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
    #define clamp_t(type, val, lo, hi) min_t(type, max_t(type, val, lo), hi)
`total_avail` is 8,434,659,328 on the 1 TB machine. `clamp_t()` casts the values to `int`, which as a 32-bit integer can only hold values from −2,147,483,648 (−2^31) through 2,147,483,647 (2^31 − 1). `avail` (in the function signature) is just 65536, so no overflow was happening there. Before the commit the assignment would result in 21845, and `num = 4`. When using `total_avail`, the assignment becomes 18446744072226137429 (printed as %lu), and `num` is then 4164608182. My next guess is that `nfsd_drc_mem_used` is then exceeded, and the server thinks there is no memory available any more for this client. Updating the arguments of `clamp_t()` and `min_t()` to `unsigned long` fixes the issue. Now, `avail = 65536` (before commit 10a68cdf10, `avail = 21845`), but `num = 4` remains the same. Fixes: c54f24e338ed (nfsd: fix performance-limiting session calculation) Cc: stable@vger.kernel.org Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
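The overflow described above is easy to reproduce in userspace. A minimal sketch (simplified stand-ins for the kernel's min_t/max_t/clamp_t; the slotsize value is illustrative, not the real one):

    #include <stdio.h>

    /* Simplified stand-ins for the kernel macros quoted above. */
    #define min_t(type, x, y)   ((type)(x) < (type)(y) ? (type)(x) : (type)(y))
    #define max_t(type, x, y)   ((type)(x) > (type)(y) ? (type)(x) : (type)(y))
    #define clamp_t(type, val, lo, hi) min_t(type, max_t(type, val, lo), hi)

    int main(void)
    {
        unsigned long total_avail = 8434659328UL; /* value reported on the 1 TB machine */
        unsigned long slotsize = 2048;            /* illustrative only */
        unsigned long avail = 65536;

        /* total_avail/3 is about 2.8 billion, which does not fit in a 32-bit
         * int, so the int-typed clamp wraps negative and the result stored in
         * an unsigned long becomes a huge bogus number. */
        unsigned long bad  = clamp_t(int, avail, slotsize, total_avail / 3);
        unsigned long good = clamp_t(unsigned long, avail, slotsize, total_avail / 3);

        printf("clamp_t(int, ...)           = %lu\n", bad);  /* 18446744072226137429 */
        printf("clamp_t(unsigned long, ...) = %lu\n", good); /* 65536 */
        return 0;
    }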
2019-07-03 | gve: fix -ENOMEM null check on a page allocation | Colin Ian King
Currently the check to see if a page is allocated is incorrect and is checking if the pointer page is null, not *page as intended. Fix this. Addresses-Coverity: ("Dereference before null check") Fixes: f5cedc84a30d ("gve: Add transmit and receive support") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | Merge branch 'net-ICW-sendmsg-recvmsg' | David S. Miller
Paolo Abeni says: ==================== net: use ICW for sk_proto->{send,recv}msg This series extends ICW usage to one of the few remaining spots in the fast path still hitting per-packet retpoline overhead, namely the sk_proto->{send,recv}msg calls. The first 3 patches in this series refactor the existing code so that applying the ICW macros is straightforward: we demux inet_{recv,send}msg into ipv4 and ipv6 variants so that each of them can easily select the appropriate TCP or UDP direct call. While at it, a new helper is created to avoid excessive code duplication, and the current ICWs for inet_{recv,send}msg are adjusted accordingly. The last 2 patches really introduce the new ICW use-case, respectively for the ipv6 and the ipv4 code path. This gives up to 5% performance improvement under UDP flood, and smaller but measurable gains for TCP RR workloads. v1 -> v2: - drop inet6_{recv,send}msg declaration from header file, prefer ICW macro instead - avoid unneeded redeclaration for udp_sendmsg, as suggested by Willem ==================== Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
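For readers unfamiliar with the technique: the indirect call wrappers (the real macros live in include/linux/indirect_call_wrapper.h) compare the function pointer against its most likely targets and call them directly, falling back to the retpoline-protected indirect call otherwise. A rough sketch of the idea, not the exact kernel macro:

    /* Rough sketch: try the expected targets with direct calls, fall back
     * to the indirect call only when the pointer matches neither. */
    #define INDIRECT_CALL_2_SKETCH(f, f2, f1, ...)          \
        ((f) == (f2) ? (f2)(__VA_ARGS__) :                  \
         (f) == (f1) ? (f1)(__VA_ARGS__) : (f)(__VA_ARGS__))

    /* e.g. sk->sk_prot->sendmsg dispatched directly to tcp_sendmsg or udp_sendmsg */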
2019-07-03 | ipv4: use indirect call wrappers for {tcp, udp}_{recv, send}msg() | Paolo Abeni
This avoids an indirect call per syscall for common ipv4 transports. v1 -> v2: - avoid unneeded redeclaration for udp_sendmsg, as suggested by Willem Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | ipv6: use indirect call wrappers for {tcp, udpv6}_{recv, send}msg() | Paolo Abeni
This avoids an indirect call per syscall for common ipv6 transports. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | net: adjust socket level ICW to cope with ipv6 variant of {recv, send}msg | Paolo Abeni
After the previous patch we have ipv{6,4} variants for {recv,send}msg, so we should use the generic _INET ICW variant to call into the proper built-in. This also allows dropping the now unused and rather ugly _INET4 ICW macro. v1 -> v2: - use ICW macro to declare inet6_{recv,send}msg - fix a couple of checkpatch offenders in the code context Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | ipv6: provide and use ipv6 specific version for {recv, send}msg | Paolo Abeni
This will simplify indirect call wrapper invocation in the following patch. No functional change intended, any - out-of-tree - IPv6 user of inet_{recv,send}msg can keep using the existing functions. SCTP code still uses the existing version even for ipv6: as this series will not add ICW for SCTP, moving to the new helper would not give any benefit. The only other in-kernel user of inet_{recv,send}msg is pvcalls_conn_back_read(), but pvcalls explicitly creates only IPv4 sockets, so no need to update that code path, too. v1 -> v2: drop inet6_{recv,send}msg declaration from header file, prefer ICW macro instead Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | inet: factor out inet_send_prepare() | Paolo Abeni
The same code is replicated verbatim in multiple places, and the next patches will introduce an additional user for it. Factor out a helper and use it where appropriate. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | bonding: validate ip header before check IPPROTO_IGMP | Cong Wang
bond_xmit_roundrobin() checks for IGMP packets but it parses the IP header even before checking skb->protocol. We should validate the IP header with pskb_may_pull() before using iph->protocol. Reported-and-tested-by: syzbot+e5be16aa39ad6e755391@syzkaller.appspotmail.com Fixes: a2fd940f4cff ("bonding: fix broken multicast with round-robin mode") Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
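A minimal sketch of the validation pattern described here (the surrounding bonding context and the exact offsets are illustrative, not the literal patch):

    /* Only trust iph->protocol after making sure the IPv4 header is
     * actually present in the skb's linear data. */
    if (skb->protocol == htons(ETH_P_IP) &&
        pskb_may_pull(skb, skb_network_offset(skb) + sizeof(struct iphdr))) {
            struct iphdr *iph = ip_hdr(skb);

            if (iph->protocol == IPPROTO_IGMP) {
                    /* IGMP-specific slave selection */
            }
    }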
2019-07-03 | net/mlx5: Properly name the generic WQE control field | Tariq Toukan
A generic WQE control field is used for different purposes in different cases. Use union to allow using the proper name in each case. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03 | net/mlx5: Introduce TLS TX offload hardware bits and structures | Eran Ben Elisha
Add TLS offload related IFC structs, layouts and enumerations. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03 | net/mlx5: Refactor mlx5_esw_query_functions for modularity | Parav Pandit
The functions-changed event output data size changes when functions other than VFs are enabled in HCA CAP. With the current API, multiple callers need to align and calculate the accurate size of the output data depending on the number of non-VF functions enabled in the device. Instead of duplicating such math in multiple places, refactor mlx5_esw_query_functions() to return the raw output allocated by itself. The caller must free the allocated memory using kvfree() as described in the function comment section. This hides the calculation within mlx5_esw_query_functions() and provides a simpler API. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
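A hedged usage sketch of the refactored API as described above; the ownership convention comes from the commit text, while the ERR_PTR error handling and the surrounding context are assumptions:

    const u32 *out;

    out = mlx5_esw_query_functions(esw->dev);
    if (IS_ERR(out))
            return PTR_ERR(out);

    /* ... parse the function-changed event fields from 'out' ... */

    kvfree(out);    /* caller owns the buffer, as noted above */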
2019-07-03 | net/mlx5: E-Switch prepare functions change handler to be modular | Parav Pandit
The eswitch function change handler will service multiple types of events for VF and non-VF function updates. Hence, introduce and use the helper function esw_vfs_changed_event_handler() for handling changes in the number of VFs, to improve the code readability. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03 | net/mlx5: Introduce and use mlx5_eswitch_get_total_vports() | Parav Pandit
Instead of MLX5_TOTAL_VPORTS, use mlx5_eswitch_get_total_vports(). In a subsequent patch, mlx5_eswitch_get_total_vports() accounts for SF vports as well. Expanding the MLX5_TOTAL_VPORTS macro would require exposing SF internals to the more generic vport.h header file. Such exposure is not desired. Hence mlx5_eswitch_get_total_vports() is introduced. Given that the mlx5_eswitch_get_total_vports() API wants to work on a const mlx5_core_dev *, change its helper functions also to accept a const *dev. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-03 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | David S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-07-03 The following pull-request contains BPF updates for your *net* tree. The main changes are:
    1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH on BE architectures, from Jiong.
    2) Fix several bugs in the x32 BPF JIT for handling shifts by 0, from Luke and Xi.
    3) Fix NULL pointer deref in btf_type_is_resolve_source_only(), from Stanislav.
    4) Properly handle the check that forwarding is enabled on the device in bpf_ipv6_fib_lookup() helper code, from Anton.
    5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit alignment such as m68k, from Baruch.
    6) Fix kernel hanging in unregister_netdevice loop while unregistering device bound to XDP socket, from Ilya.
    7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan.
    8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri.
    9) Fix bpftool to use correct argument in cgroup errors, from Jakub.
    10) Fix detaching dummy prog in XDP redirect sample code, from Prashant.
    11) Add Jonathan to AF_XDP reviewers, from Björn.
==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | net: hns: add support for vlan TSO | Yonglong Liu
The hip07 chip supports vlan TSO. This patch adds the NETIF_F_TSO and NETIF_F_TSO6 flags to vlan_features to improve the performance after adding a vlan to the net ports. Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | sctp: count data bundling sack chunk for outctrlchunks | Xin Long
Now all ctrl chunks are counted for asoc stats.octrlchunks and net SCTP_MIB_OUTCTRLCHUNKS either after queuing up or bundling, other than the chunk made and bundled in sctp_packet_bundle_sack, which caused 'outctrlchunks' to be inconsistent with 'inctrlchunks' on the peer. This issue has existed since the very beginning; fix it by increasing both net SCTP_MIB_OUTCTRLCHUNKS and asoc stats.octrlchunks when the sack chunk is made and bundled in sctp_packet_bundle_sack. Reported-by: Ja Ram Jeon <jajeon@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
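In other words, where sctp_packet_bundle_sack() creates and bundles the SACK, both counters need a bump so octrlchunks matches the peer's inctrlchunks. Roughly, as a sketch (the exact context variables are assumptions, not the literal patch):

    /* Count the SACK made and bundled here as an outgoing ctrl chunk. */
    SCTP_INC_STATS(sock_net(asoc->base.sk), SCTP_MIB_OUTCTRLCHUNKS);
    asoc->stats.octrlchunks++;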
2019-07-03 | qlcnic: remove redundant assignment to variable err | Colin Ian King
The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | atl1c: remove redundant assignment to variable tpd_req | Colin Ian King
The variable tpd_req is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | r8152: move calling r8153b_rx_agg_chg_indicate() | Hayes Wang
r8153b_rx_agg_chg_indicate() needs to be called after enabling TX/RX and before calling rxdy_gated_en(tp, false). Otherwise, the change of the settings of RX aggregation wouldn't work. Besides, adjust rtl8152_set_coalesce() for the same reason. If rx_coalesce_usecs is changed, restart TX/RX to let the setting work. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | qed: Add support for Timestamping the unicast PTP packets. | Sudarsana Reddy Kalluru
This patch adds driver changes to detect/timestamp the unicast PTP packets. Changes from previous version: v2: Defined a macro for unicast ptp param mask. Please consider applying this to "net-next". Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | gve: Fix u64_stats_sync to initialize start | Catherine Sullivan
u64_stats_fetch_begin needs to initialize start. Signed-off-by: Catherine Sullivan <csully@google.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
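For context, the usual u64_stats reader pattern assigns start from u64_stats_fetch_begin() on every iteration of the retry loop. A generic kernel-side sketch; the struct and field names here are made up for illustration, not gve's:

    unsigned int start;
    u64 packets;

    do {
            start = u64_stats_fetch_begin(&ring->stats_sync);
            packets = ring->rx_packets;
    } while (u64_stats_fetch_retry(&ring->stats_sync, start));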
2019-07-03 | net: don't warn in inet diag when IPV6 is disabled | Stephen Hemminger
If IPV6 was disabled, then the ss command would cause a kernel warning because the command was attempting to dump IPV6 socket information. The fix is to just remove the warning. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202249 Fixes: 432490f9d455 ("net: ip, diag -- Add diag interface for raw sockets") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | loopback: fix lockdep splat | Mahesh Bandewar
dev_init_scheduler() and dev_activate() expect the caller to hold RTNL. Since we don't want blackhole device to be initialized per ns, we are initializing at init.
    [ 3.855027] Call Trace:
    [ 3.855034] dump_stack+0x67/0x95
    [ 3.855037] lockdep_rcu_suspicious+0xd5/0x110
    [ 3.855044] dev_init_scheduler+0xe3/0x120
    [ 3.855048] ? net_olddevs_init+0x60/0x60
    [ 3.855050] blackhole_netdev_init+0x45/0x6e
    [ 3.855052] do_one_initcall+0x6c/0x2fa
    [ 3.855058] ? rcu_read_lock_sched_held+0x8c/0xa0
    [ 3.855066] kernel_init_freeable+0x1e5/0x288
    [ 3.855071] ? rest_init+0x260/0x260
    [ 3.855074] kernel_init+0xf/0x180
    [ 3.855076] ? rest_init+0x260/0x260
    [ 3.855078] ret_from_fork+0x24/0x30
Fixes: 4de83b88c66 ("loopback: create blackhole net device similar to loopack.") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 | net/mlx5: Expose device definitions for object events | Yishai Hadas
Expose extra device definitions for object events. It includes: object_type values for legacy objects and a generic data header for any other object. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03 | net/mlx5: Report EQE data upon CQ completion | Yishai Hadas
Report EQE data upon CQ completion to let upper layers use this data. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03 | net/mlx5: Report a CQ error event only when a handler was set | Yishai Hadas
Report a CQ error event only when a handler was set. This enables mlx5_ib to not set a handler upon CQ creation and to use some other mechanism to get this event, as with other events, via the mlx5_eq_notifier_register API. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03 | net/mlx5: mlx5_core_create_cq() enhancements | Yishai Hadas
Enhance mlx5_core_create_cq() to get the command out buffer from the callers to let them use the output. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03 | net/mlx5: Expose the API to register for ANY event | Yishai Hadas
Expose the API to register for ANY event, mlx5_ib will be able to use this functionality for its needs. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03 | net/mlx5: Use event mask based on device capabilities | Yishai Hadas
Use the reported device capabilities for the supported user events (i.e. affiliated and un-affiliated) to set the EQ mask. As the event mask can be up to 256 bits, defined by 4 entries of u64, change the applicable code to work accordingly. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
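A small sketch of the representation: a 256-bit mask held in four 64-bit words, with per-event bit helpers. The names are illustrative, not the mlx5 code:

    #include <stdint.h>
    #include <stdbool.h>

    #define EVENT_MASK_WORDS 4   /* 4 x 64 = 256 event bits */

    static inline void event_mask_set(uint64_t *mask, unsigned int event)
    {
            mask[event / 64] |= 1ULL << (event % 64);
    }

    static inline bool event_mask_test(const uint64_t *mask, unsigned int event)
    {
            return mask[event / 64] & (1ULL << (event % 64));
    }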
2019-07-03 | net/mlx5: Fix mlx5_core_destroy_cq() error flow | Yishai Hadas
The firmware command to destroy a CQ might fail when the object is referenced by another object and the ref count is managed by the firmware. To enable a second, successful destruction after the first failure, mlx5_eq_del_cq() needs to become a void function. As an error in mlx5_eq_del_cq() is quite fatal with little option to recover, a debug message inside it is good enough, so it was changed to return void. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-07-03 | Merge branch 'bpf-tcp-rtt-hook' | Daniel Borkmann
Stanislav Fomichev says: ==================== The congestion control team would like to have a periodic callback to track some TCP statistics. Let's add a sock_ops callback that can be selectively enabled on a socket-by-socket basis and is executed for every RTT. BPF program frequency can be further controlled by calling bpf_ktime_get_ns and bailing out early. I ran neper tcp_stream and tcp_rr tests with the sample program from the last patch and didn't observe any noticeable performance difference. v2: * add a comment about second accept() in selftest (Yonghong Song) * refer to tcp_bpf.readme in sample program (Yonghong Song) ==================== Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
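A minimal sketch of how a program might opt in to the new per-RTT callback; the flag and op names are the ones this series introduces, while the header paths and the rest of the program are illustrative:

    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    SEC("sockops")
    int rtt_monitor(struct bpf_sock_ops *ctx)
    {
            switch (ctx->op) {
            case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
            case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                    /* opt this socket in to per-RTT callbacks */
                    bpf_sock_ops_cb_flags_set(ctx, BPF_SOCK_OPS_RTT_CB_FLAG);
                    break;
            case BPF_SOCK_OPS_RTT_CB:
                    /* runs once per RTT; rate-limit with bpf_ktime_get_ns()
                     * and an early return if needed */
                    break;
            }
            return 1;
    }

    char _license[] SEC("license") = "GPL";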
2019-07-03 | samples/bpf: fix tcp_bpf.readme detach command | Stanislav Fomichev
Copy-paste, should be detach, not attach. Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 | samples/bpf: add sample program that periodically dumps TCP stats | Stanislav Fomichev
Uses new RTT callback to dump stats every second.
    $ mkdir -p /tmp/cgroupv2
    $ mount -t cgroup2 none /tmp/cgroupv2
    $ mkdir -p /tmp/cgroupv2/foo
    $ echo $$ >> /tmp/cgroupv2/foo/cgroup.procs
    $ bpftool prog load ./tcp_dumpstats_kern.o /sys/fs/bpf/tcp_prog
    $ bpftool cgroup attach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog
    $ bpftool prog tracelog
    $ # run neper/netperf/etc
Used neper to compare performance with and without this program attached and didn't see any noticeable performance impact. Sample output:
    <idle>-0 [015] ..s. 2074.128800: 0: dsack_dups=0 delivered=242526
    <idle>-0 [015] ..s. 2074.128808: 0: delivered_ce=0 icsk_retransmits=0
    <idle>-0 [015] ..s. 2075.130133: 0: dsack_dups=0 delivered=323599
    <idle>-0 [015] ..s. 2075.130138: 0: delivered_ce=0 icsk_retransmits=0
    <idle>-0 [005] .Ns. 2076.131440: 0: dsack_dups=0 delivered=404648
    <idle>-0 [005] .Ns. 2076.131447: 0: delivered_ce=0 icsk_retransmits=0
Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 | selftests/bpf: test BPF_SOCK_OPS_RTT_CB | Stanislav Fomichev
Make sure the callback is invoked for syn-ack and data packet. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 | bpf/tools: sync bpf.h | Stanislav Fomichev
Sync new bpf_tcp_sock fields and new BPF_PROG_TYPE_SOCK_OPS RTT callback. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 | bpf: add icsk_retransmits to bpf_tcp_sock | Stanislav Fomichev
Add some inet_connection_sock fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 | bpf: add dsack_dups/delivered{, _ce} to bpf_tcp_sock | Stanislav Fomichev
Add more fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 | bpf: split shared bpf_tcp_sock and bpf_sock_ops implementation | Stanislav Fomichev
We've added bpf_tcp_sock member to bpf_sock_ops and don't expect any new tcp_sock fields in bpf_sock_ops. Let's remove CONVERT_COMMON_TCP_SOCK_FIELDS so bpf_tcp_sock can be independently extended. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>