summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-09-14tools headers UAPI: Sync kvm.h headers with the kernel sourcesArnaldo Carvalho de Melo
To pick the changes in: 15e9e35cd1dec2bc ("KVM: MIPS: Change the definition of kvm type") 004a01241c5a0d37 ("arm64/x86: KVM: Introduce steal-time cap") That do not result in any change in tooling, as the additions are not being used in any table generator. This silences these perf build warning: Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h' diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andrew Jones <drjones@redhat.com> Cc: Huacai Chen <chenhc@lemote.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-09-14ipv4: Initialize flowi4_multipath_hash in data pathDavid Ahern
flowi4_multipath_hash was added by the commit referenced below for tunnels. Unfortunately, the patch did not initialize the new field for several fast path lookups that do not initialize the entire flow struct to 0. Fix those locations. Currently, flowi4_multipath_hash is random garbage and affects the hash value computed by fib_multipath_hash for multipath selection. Fixes: 24ba14406c5c ("route: Add multipath_hash in flowi_common to make user-define hash") Signed-off-by: David Ahern <dsahern@gmail.com> Cc: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14Merge branch 'net-lantiq-Fix-bugs-in-NAPI-handling'David S. Miller
Hauke Mehrtens says: ==================== net: lantiq: Fix bugs in NAPI handling This fixes multiple bugs in the NAPI handling. Changes since: v1: - removed stable tag from "net: lantiq: use netif_tx_napi_add() for TX NAPI" - Check the NAPI budged in "net: lantiq: Use napi_complete_done()" - Add extra fix "net: lantiq: Disable IRQs only if NAPI gets scheduled" ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: lantiq: Disable IRQs only if NAPI gets scheduledHauke Mehrtens
The napi_schedule() call will only schedule the NAPI if it is not already running. To make sure that we do not deactivate interrupts without scheduling NAPI only deactivate the interrupts in case NAPI also gets scheduled. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: lantiq: Use napi_complete_done()Hauke Mehrtens
Use napi_complete_done() and activate the interrupts when this function returns true. This way the generic NAPI code can take care of activating the interrupts. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: lantiq: use netif_tx_napi_add() for TX NAPIHauke Mehrtens
netif_tx_napi_add() should be used for NAPI in the TX direction instead of the netif_napi_add() function. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: lantiq: Wake TX queue againHauke Mehrtens
The call to netif_wake_queue() when the TX descriptors were freed was missing. When there are no TX buffers available the TX queue will be stopped, but it was not started again when they are available again, this is fixed in this patch. Fixes: fe1a56420cf2 ("net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver") Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14perf record: Prevent override of attr->sample_period for libpfm4 eventsStephane Eranian
Before: $ perf record -c 10000 --pfm-events=cycles:period=77777 Would yield a cycles event with period=10000, instead of 77777. the event string and perf record initializing the event. This was due to an ordering issue between libpfm4 parsing events with attr->sample_period != 0 by the time intent of the author. perf_evsel__config() is invoked. This seems to have been the This patch fixes the problem by preventing override for Signed-off-by: Stephane Eranian <eranian@google.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Link: http://lore.kernel.org/lkml/20200912025655.1337192-3-irogers@google.com Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-09-14perf record: Set PERF_RECORD_PERIOD if attr->freq is set.David Sharp
evsel__config() would only set PERF_RECORD_PERIOD if it set attr->freq from perf record options. When it is set by libpfm events, it would not get set. This changes evsel__config to see if attr->freq is set outside of whether or not it changes attr->freq itself. Signed-off-by: David Sharp <dhsharp@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: Yonghong Song <yhs@fb.com> Cc: david sharp <dhsharp@google.com> Link: http://lore.kernel.org/lkml/20200912025655.1337192-2-irogers@google.com Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-09-14drivers/net/wan/x25_asy: Remove an unnecessary x25_type_trans callXie He
x25_type_trans only needs to be called before we call netif_rx to pass the skb to upper layers. It does not need to be called before lapb_data_received. The LAPB module does not need the fields that are set by calling it. In the other two X.25 drivers - lapbether and hdlc_x25. x25_type_trans is only called before netif_rx and not before lapb_data_received. Cc: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Xie He <xie.he.0141@gmail.com> Acked-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14rndis_host: increase sleep time in the query-response loopOlympia Giannou
Some WinCE devices face connectivity issues via the NDIS interface. They fail to register, resulting in -110 timeout errors and failures during the probe procedure. In this kind of WinCE devices, the Windows-side ndis driver needs quite more time to be loaded and configured, so that the linux rndis host queries to them fail to be responded correctly on time. More specifically, when INIT is called on the WinCE side - no other requests can be served by the Client and this results in a failed QUERY afterwards. The increase of the waiting time on the side of the linux rndis host in the command-response loop leaves the INIT process to complete and respond to a QUERY, which comes afterwards. The WinCE devices with this special "feature" in their ndis driver are satisfied by this fix. Signed-off-by: Olympia Giannou <olympia.giannou@leica-geosystems.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: try to avoid unneeded backlog flushPaolo Abeni
flush_all_backlogs() may cause deadlock on systems running processes with FIFO scheduling policy. The above is critical in -RT scenarios, where user-space specifically ensure no network activity is scheduled on the CPU running the mentioned FIFO process, but still get stuck. This commit tries to address the problem checking the backlog status on the remote CPUs before scheduling the flush operation. If the backlog is empty, we can skip it. v1 -> v2: - explicitly clear flushed cpu mask - Eric Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14Merge branch 'mlxsw-Derive-SBIB-from-maximum-port-speed-and-MTU'David S. Miller
Ido Schimmel says: ==================== mlxsw: Derive SBIB from maximum port speed & MTU Petr says: Internal buffer is a part of port headroom used for packets that are mirrored due to triggers that the Spectrum ASIC considers "egress". Besides ACL mirroring on port egresss this includes also packets mirrored due to ECN marking. This patchset changes the way the internal mirroring buffer is reserved. Currently the buffer reflects port MTU and speed accurately. In the future, mlxsw should support dcbnl_setbuffer hook to allow the users to set buffer sizes by hand. In that case, there might not be enough space for growth of the internal mirroring buffer due to MTU and speed changes. While vetoing MTU changes would be merely confusing, port speed changes cannot be vetoed, and such change would simply lead to issues in packet mirroring. For these reasons, with these patches the internal mirroring buffer is derived from maximum MTU and maximum speed achievable on the port. Patches #1 and #2 introduce a new callback to determine the maximum speed a given port can achieve. With patches #3 and #4, the information about, respectively, maximum MTU and maximum port speed, is kept in struct mlxsw_sp_port. In patch #5, maximum MTU and maximum speed are used to determine the size of the internal buffer. MTU update and speed update hooks are dropped, because they are no longer necessary. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mlxsw: spectrum_span: Derive SBIB from maximum port speed & MTUPetr Machata
The SBIB register configures the size of an internal buffer that the Spectrum ASICs use when mirroring traffic on egress. This size should be taken into account when validating that the port headroom buffers are not larger than the chip can handle. Up until now this was not done, which is incidentally not a problem, because the priority group buffers that mlxsw auto-configures are small enough that the boundary condition could not be violated. However when dcbnl_setbuffer is implemented, the user has control over sizes of PG buffers, and they might overshoot the headroom capacity. However the size of the SBIB buffer depends on port speed, and that cannot be vetoed. Therefore SBIB size should be deduced from maximum port speed. Additionally, once the buffers are configured by hand, the user could get into an uncomfortable situation where their MTU change requests get vetoed, because the SBIB does not fit anymore. Therefore derive SBIB size from maximum permissible MTU as well. Remove all the code that adjusted the SBIB size whenever speed or MTU changed. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mlxsw: spectrum: Keep maximum speed aroundPetr Machata
The maximum port speed depends on link modes supported by the port, and for Ethernet ports is constant. The maximum speed will be handy when setting SBIB, the internal buffer used for traffic mirroring. Therefore, keep it in struct mlxsw_sp_port for easy access. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mlxsw: spectrum: Keep maximum MTU aroundPetr Machata
The maximum port MTU depends on port type. On Spectrum, mlxsw configures all ports as Ethernet ports, and the maximum MTU therefore never changes. Besides checking MTU configuration, maximum MTU will also be handy when setting SBIB, the internal buffer used for traffic mirroring. Therefore, keep it in struct mlxsw_sp_port for easy access. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mlxsw: spectrum_ethtool: Introduce ptys_max_speed callbackPetr Machata
The SBIB register configures the size of an internal buffer that the Spectrum ASICs use when mirroring traffic on egress. This size should be taken into account when validating that the port headroom buffers are not larger than the chip can handle. Up until now this was not done, which is incidentally not a problem, because the priority group buffers that mlxsw auto-configures are small enough that the boundary condition could not be violated. When dcbnl_setbuffer is implemented, the user gets control over sizes of PG buffers, and they might overshoot the headroom capacity. However the size of the SBIB buffer depends on port speed, which cannot be vetoed. There is obviously no way to retroactively push back on requests for overlarge PG buffers, or reject an overlarge MTU, or cancel losslessness of a certain PG. Therefore, instead of taking into account the current speed when calculating SBIB buffer size, take into account the maximum speed that a port with given Ethernet protocol capabilities can have. To that end, add a new ethtool callback, ptys_max_speed, which determines this maximum speed. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mlxsw: spectrum_ethtool: Extract a helper to get Ethernet attributesPetr Machata
In order to allow reusing the logic, extract from mlxsw_sp_port_get_link_ksettings() the code to obtain Ethernet protocol attributes, mlxsw_sp_port_ptys_query(). Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14perf bench: Fix 2 memory sanitizer warningsIan Rogers
Memory sanitizer warns if a write is performed where the memory being read for the write is uninitialized. Avoid this warning by initializing the memory. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20200912053725.1405857-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-09-14perf test: Fix the "signal" test inline assemblyJiri Olsa
When compiling with DEBUG=1 on Fedora 32 I'm getting crash for 'perf test signal': Program received signal SIGSEGV, Segmentation fault. 0x0000000000c68548 in __test_function () (gdb) bt #0 0x0000000000c68548 in __test_function () #1 0x00000000004d62e9 in test_function () at tests/bp_signal.c:61 #2 0x00000000004d689a in test__bp_signal (test=0xa8e280 <generic_ ... #3 0x00000000004b7d49 in run_test (test=0xa8e280 <generic_tests+1 ... #4 0x00000000004b7e7f in test_and_print (t=0xa8e280 <generic_test ... #5 0x00000000004b8927 in __cmd_test (argc=1, argv=0x7fffffffdce0, ... ... It's caused by the symbol __test_function being in the ".bss" section: $ readelf -a ./perf | less [Nr] Name Type Address Offset Size EntSize Flags Link Info Align ... [28] .bss NOBITS 0000000000c356a0 008346a0 00000000000511f8 0000000000000000 WA 0 0 32 $ nm perf | grep __test_function 0000000000c68548 B __test_function I guess most of the time we're just lucky the inline asm ended up in the ".text" section, so making it specific explicit with push and pop section clauses. $ readelf -a ./perf | less [Nr] Name Type Address Offset Size EntSize Flags Link Info Align ... [13] .text PROGBITS 0000000000431240 00031240 0000000000306faa 0000000000000000 AX 0 0 16 $ nm perf | grep __test_function 00000000004d62c8 T __test_function Committer testing: $ readelf -wi ~/bin/perf | grep producer -m1 <c> DW_AT_producer : (indirect string, offset: 0x254a): GNU C99 10.2.1 20200723 (Red Hat 10.2.1-1) -mtune=generic -march=x86-64 -ggdb3 -std=gnu99 -fno-omit-frame-pointer -funwind-tables -fstack-protector-all ^^^^^ ^^^^^ ^^^^^ $ Before: $ perf test signal 20: Breakpoint overflow signal handler : FAILED! $ After: $ perf test signal 20: Breakpoint overflow signal handler : Ok $ Fixes: 8fd34e1cce18 ("perf test: Improve bp_signal") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lore.kernel.org/lkml/20200911130005.1842138-1-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-09-14Merge branch '40GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Tony Nguyen says: ==================== 40GbE Intel Wired LAN Driver Updates 2020-09-14 This series contains updates to i40e driver only. Li RongQing removes binding affinity mask to a fixed CPU and sets prefetch of Rx buffer page to occur conditionally. Björn provides AF_XDP performance improvements by not prefetching HW descriptors, using 16 byte descriptors, and moving buffer allocation out of Rx processing loop. v2: Define prefetch_page_address in a common header for patch 2. Dropped, previous, patch 5 as it is being reworked to be more generalized. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14Merge tag 'rxrpc-next-20200914' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fixes for the connection manager rewrite Here are some fixes for the connection manager rewrite: (1) Fix a goto to the wrong place in error handling. (2) Fix a missing NULL pointer check. (3) The stored allocation error needs to be stored signed. (4) Fix a leak of connection bundle when clearing connections due to net namespace exit. (5) Fix an overget of the bundle when setting up a new client conn. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14hinic: add vxlan segmentation and cs offload supportLuo bin
Add NETIF_F_GSO_UDP_TUNNEL and NETIF_F_GSO_UDP_TUNNEL_CSUM features to support vxlan segmentation and checksum offload. Ipip and ipv6 tunnel packets are regarded as non-tunnel pkt for hw and as for other type of tunnel pkts, checksum offload is disabled. Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14core/entry: Report syscall correctly for trace and auditKees Cook
On v5.8 when doing seccomp syscall rewrites (e.g. getpid into getppid as seen in the seccomp selftests), trace (and audit) correctly see the rewritten syscall on entry and exit: seccomp_bpf-1307 [000] .... 22974.874393: sys_enter: NR 110 (... seccomp_bpf-1307 [000] .N.. 22974.874401: sys_exit: NR 110 = 1304 With mainline we see a mismatched enter and exit (the original syscall is incorrectly visible on entry): seccomp_bpf-1030 [000] .... 21.806766: sys_enter: NR 39 (... seccomp_bpf-1030 [000] .... 21.806767: sys_exit: NR 110 = 1027 When ptrace or seccomp change the syscall, this needs to be visible to trace and audit at that time as well. Update the syscall earlier so they see the correct value. Fixes: d88d59b64ca3 ("core/entry: Respect syscall number rewrites") Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200912005826.586171-1-keescook@chromium.org
2020-09-14net: qlcnic: remove unused variable 'val' in qlcnic_83xx_cam_unlock()Zhang Changzhong
Fixes the following W=1 kernel build warning(s): drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c:661:6: warning: variable 'val' set but not used [-Wunused-but-set-variable] 661 | u32 val; | ^~~ After commit 7f9664525f9c ("qlcnic: 83xx memory map and HW access routines"), variable 'val' is never used in qlcnic_83xx_cam_unlock(), so removing it to avoid build warning. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: pxa168_eth: remove unused variable 'retval' int pxa168_eth_change_mtu()Zhang Changzhong
Fixes the following W=1 kernel build warning(s): drivers/net/ethernet/marvell/pxa168_eth.c:1190:6: warning: variable 'retval' set but not used [-Wunused-but-set-variable] 1190 | int retval; | ^~~~~~ Function pxa168_eth_change_mtu() always return zero, so variable 'retval' is redundant, just remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: fec: ptp: remove unused variable 'ns' in fec_time_keep()Zhang Changzhong
Fixes the following W=1 kernel build warning(s): drivers/net/ethernet/freescale/fec_ptp.c:523:6: warning: variable 'ns' set but not used [-Wunused-but-set-variable] 523 | u64 ns; | ^~ After commit 6605b730c061 ("FEC: Add time stamping code and a PTP hardware clock"), variable 'ns' is never used in fec_time_keep(), so removing it to avoid build warning. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: dnet: remove unused variable 'tx_status 'in dnet_start_xmit()Zhang Changzhong
Fixes the following W=1 kernel build warning(s): drivers/net/ethernet/dnet.c:510:6: warning: variable 'tx_status' set but not used [-Wunused-but-set-variable] u32 tx_status, irq_enable; ^~~~~~~~~ After commit 4796417417a6 ("dnet: Dave DNET ethernet controller driver (updated)"), variable 'tx_status' is never used in dnet_start_xmit(), so removing it to avoid build warning. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14tcp: remove SOCK_QUEUE_SHRUNKEric Dumazet
SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state that remembers if some room has been made in the rtx queue by an incoming ACK packet. This is later used from tcp_check_space() before considering to send EPOLLOUT. Problem is: If we receive SACK packets, and no packet is removed from RTX queue, we can send fresh packets, thus moving them from write queue to rtx queue and eventually empty the write queue. This stall can happen if TCP_NOTSENT_LOWAT is used. With this fix, we no longer risk stalling sends while holes are repaired, and we can fully use socket sndbuf. This also removes a cache line dirtying for typical RPC workloads. Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net/packet: Fix a comment about hard_header_len and headroom allocationXie He
This comment is outdated and no longer reflects the actual implementation of af_packet.c. Reasons for the new comment: 1. In af_packet.c, the function packet_snd first reserves a headroom of length (dev->hard_header_len + dev->needed_headroom). Then if the socket is a SOCK_DGRAM socket, it calls dev_hard_header, which calls dev->header_ops->create, to create the link layer header. If the socket is a SOCK_RAW socket, it "un-reserves" a headroom of length (dev->hard_header_len), and checks if the user has provided a header sized between (dev->min_header_len) and (dev->hard_header_len) (in dev_validate_header). This shows the developers of af_packet.c expect hard_header_len to be consistent with header_ops. 2. In af_packet.c, the function packet_sendmsg_spkt has a FIXME comment. That comment states that prepending an LL header internally in a driver is considered a bug. I believe this bug can be fixed by setting hard_header_len to 0, making the internal header completely invisible to af_packet.c (and requesting the headroom in needed_headroom instead). 3. There is a commit for a WiFi driver: commit 9454f7a895b8 ("mwifiex: set needed_headroom, not hard_header_len") According to the discussion about it at: https://patchwork.kernel.org/patch/11407493/ The author tried to set the WiFi driver's hard_header_len to the Ethernet header length, and request additional header space internally needed by setting needed_headroom. This means this usage is already adopted by driver developers. Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Brian Norris <briannorris@chromium.org> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Xie He <xie.he.0141@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14Merge branch 'mptcp-introduce-support-for-real-multipath-xmit'David S. Miller
Paolo Abeni says: ==================== mptcp: introduce support for real multipath xmit This series enable MPTCP socket to transmit data on multiple subflows concurrently in a load balancing scenario. First the receive code path is refactored to better deal with out-of-order data (patches 1-7). An RB-tree is introduced to queue MPTCP-level out-of-order data, closely resembling the TCP level OoO handling. When data is sent on multiple subflows, the peer can easily see OoO - "future" data at the MPTCP level, especially if speeds, delay, or jitter are not symmetric. The other major change regards the netlink PM, which is extended to allow creating non backup subflows in patches 9-11. There are a few smaller additions, like the introduction of OoO related mibs, send buffer autotuning and better ack handling. Finally a bunch of new self-tests is introduced. The new feature is tested ensuring that the B/W used by an MPTCP socket using multiple subflows matches the link aggregated B/W - we use low B/W virtual links, to ensure the tests are not CPU bounded. v1 -> v2: - fix 32 bit build breakage - fix a bunch of checkpatch issues ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: simult flow self-testsPaolo Abeni
Add a bunch of test-cases for multiple subflow xmit: create multiple subflows simulating different links condition via netem and verify that the msk is able to use completely the aggregated bandwidth. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: call tcp_cleanup_rbuf on subflowsPaolo Abeni
That is needed to let the subflows announce promptly when new space is available in the receive buffer. tcp_cleanup_rbuf() is currently a static function, drop the scope modifier and add a declaration in the TCP header. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: allow picking different xmit subflowsPaolo Abeni
Update the scheduler to less trivial heuristic: cache the last used subflow, and try to send on it a reasonably long burst of data. When the burst or the subflow send space is exhausted, pick the subflow with the lower ratio between write space and send buffer - that is, the subflow with the greater relative amount of free space. v1 -> v2: - fix 32 bit build breakage due to 64bits div - fix checkpath issues (uint64_t -> u64) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: allow creating non-backup subflowsPaolo Abeni
Currently the 'backup' attribute of local endpoint is ignored. Let's use it for the MP_JOIN handshake Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: move address attribute into mptcp_addr_infoPaolo Abeni
So that can be accessed easily from the subflow creation helper. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: add OoO related mibsPaolo Abeni
Add a bunch of MPTCP mibs related to MPTCP OoO data processing. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: cleanup mptcp_subflow_discard_data()Paolo Abeni
There is no need to use the tcp_read_sock(), we can simply drop the skb. Additionally try to look at the next buffer for in order data. This both simplifies the code and avoid unneeded indirect calls. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: move ooo skbs into msk out of order queue.Paolo Abeni
Add an RB-tree to cope with OoO (at MPTCP level) data. __mptcp_move_skb() insert into the RB tree "future" data, eventually coalescing skb as allowed by the MPTCP DSN. To simplify sequence accounting, move the DSN inside the cb. After successfully enqueuing in sequence data, check if we can use any data from the RB tree. Additionally move the data_fin check after spooling data from the OoO tree, otherwise we could miss shutdown events. The RB tree code is copied as verbatim as possible from tcp_data_queue_ofo(), with a few simplifications due to the fact that MPTCP doesn't need to cope with sacks. All bugs here are added by me. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: introduce and use mptcp_try_coalesce()Paolo Abeni
Factor-out existing code, will be re-used by the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: basic sndbuf autotuningPaolo Abeni
Let the msk sendbuf track the size of the larger subflow's send window, so that we ensure mptcp_sendmsg() does not exceed MPTCP-level send window. The update is performed just before try to send any data. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: trigger msk processing even for OoO dataPaolo Abeni
This is a prerequisite to allow receiving data from multiple subflows without re-injection. Instead of dropping the OoO - "future" data in subflow_check_data_avail(), call into __mptcp_move_skbs() and let the msk drop that. To avoid code duplication factor out the mptcp_subflow_discard_data() helper. Note that __mptcp_move_skbs() can now find multiple subflows with data avail (comprising to-be-discarded data), so must update the byte counter incrementally. v1 -> v2: - fix checkpatch issues (unsigned -> unsigned int) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: set data_ready status bit in subflow_check_data_avail()Paolo Abeni
This simplify mptcp_subflow_data_available() and will made follow-up patches simpler. Additionally remove the unneeded checks on subflow copied_seq: we always whole skbs out of subflows. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14mptcp: rethink 'is writable' conditionalPaolo Abeni
Currently, when checking for the 'msk is writable' condition, we look at the individual subflows write space. That works well while we send data via a single subflow, but will not as soon as we will enable concurrent xmit on multiple subflows. With this change msk becomes writable when the following conditions hold: - the socket has some free write space - there is at least a subflow with write free space Additionally we need to set the NOSPACE bit on all subflows before blocking. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14batman-adv: Add missing include for in_interrupt()Sven Eckelmann
The fix for receiving (internally generated) bla packets outside the interrupt context introduced the usage of in_interrupt(). But this functionality is only defined in linux/preempt.h which was not included with the same patch. Fixes: 279e89b2281a ("batman-adv: bla: use netif_rx_ni when not in interrupt context") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-09-14Merge branch 'ethernet-convert-tasklets-to-use-new-tasklet_setup-API'David S. Miller
Allen Pais says: ==================== ethernet: convert tasklets to use new tasklet_setup API Commit 12cc923f1ccc ("tasklet: Introduce new initialization API")' introduced a new tasklet initialization API. This series converts all the crypto modules to use the new tasklet_setup() API This series is based on v5.9-rc5 v3: fix subject prefix use backpointer instead of fragile priv to netdev. v2: fix kdoc reported by Jakub Kicinski. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: smc91x: convert tasklets to use new tasklet_setup() APIAllen Pais
In preparation for unconditionally passing the struct tasklet_struct pointer to all tasklet callbacks, switch to using the new tasklet_setup() and from_tasklet() to pass the tasklet pointer explicitly. Signed-off-by: Romain Perier <romain.perier@gmail.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: silan: convert tasklets to use new tasklet_setup() APIAllen Pais
In preparation for unconditionally passing the struct tasklet_struct pointer to all tasklet callbacks, switch to using the new tasklet_setup() and from_tasklet() to pass the tasklet pointer explicitly. Signed-off-by: Romain Perier <romain.perier@gmail.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14qed: convert tasklets to use new tasklet_setup() APIAllen Pais
In preparation for unconditionally passing the struct tasklet_struct pointer to all tasklet callbacks, switch to using the new tasklet_setup() and from_tasklet() to pass the tasklet pointer explicitly. Signed-off-by: Romain Perier <romain.perier@gmail.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: nixge: convert tasklets to use new tasklet_setup() APIAllen Pais
In preparation for unconditionally passing the struct tasklet_struct pointer to all tasklet callbacks, switch to using the new tasklet_setup() and from_tasklet() to pass the tasklet pointer explicitly. Signed-off-by: Romain Perier <romain.perier@gmail.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>