Age | Commit message (Collapse) | Author |
|
Willy Tarreau says:
====================
macb: support the 2-deep Tx queue on at91
while running some tests on my Breadbee board, I noticed poor network
Tx performance. I had a look at the driver (macb, at91ether variant)
and noticed that at91ether_start_xmit() immediately stops the queue
after sending a frame and waits for the interrupt to restart the queue,
causing a dead time after each packet is sent.
The AT91RM9200 datasheet states that the controller supports two frames,
one being sent and the other one being queued, so I performed minimal
changes to support this. The transmit performance on my board has
increased by 50% on medium-sized packets (HTTP traffic), and with large
packets I can now reach line rate.
Since this driver is shared by various platforms, I tried my best to
isolate and limit the changes as much as possible and I think it's pretty
reasonable as-is. I've run extensive tests and couldn't meet any
unexpected situation (no stall, overflow nor lockup).
There are 3 patches in this series. The first one adds the missing
interrupt flag for RM9200 (TBRE, indicating the tx buffer is willing
to take a new packet). The second one replaces the single skb with a
2-array and uses only index 0. It does no other change, this is just
to prepare the code for the third one. The third one implements the
queue. Packets are added at the tail of the queue, the queue is
stopped at 2 packets and the interrupt releases 0, 1 or 2 depending
on what the transmit status register reports.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The at91rm9200 variant used by a few chips including the MSC313 supports
two Tx descriptors (one frame being serialized and another one queued).
However the driver only implemented a single one, which adds a dead time
after each transfer to receive and process the interrupt and wake the
queue up, preventing from reaching line rate.
This patch implements a very basic 2-deep queue to address this limitation.
The tests run on a Breadbee board equipped with an MSC313E show that at
1 GHz, HTTP traffic on medium-sized objects (45kB) was limited to exactly
50 Mbps before this patch, and jumped to 76 Mbps with this patch. And tests
on a single TCP stream with an MTU of 576 jump from 10kpps to 15kpps. With
1500 byte packets it's now possible to reach line rate versus 75 Mbps
before.
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Claudiu Beznea <claudiu.beznea@microchip.com>
Cc: Daniel Palmer <daniel@0x0f.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20201011090944.10607-4-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The RM9200 supports one frame being sent while another one is waiting in
queue. This avoids the dead time that follows the emission of a frame
and which prevents one from reaching line speed.
Right now the driver supports only a single skb, so we'll first replace
the rm9200-specific skb info with an array of two macb_tx_skb (already
used by other drivers). This patch only moves the skb_length to
txq[0].size and skb_physaddr to skb[0].mapping but doesn't perform any
other change. It already uses [desc] in order to minimize future changes.
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Claudiu Beznea <claudiu.beznea@microchip.com>
Cc: Daniel Palmer <daniel@0x0f.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20201011090944.10607-3-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Transmit Buffer Register Empty replaces TXERR on RM9200 and signals the
sender may try to send again becase the last queued frame is no longer
in queue (being transmitted or already transmitted).
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Claudiu Beznea <claudiu.beznea@microchip.com>
Cc: Daniel Palmer <daniel@0x0f.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20201011090944.10607-2-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Dump vlan tag and proto for the usual vlan offload case if the
NF_LOG_MACDECODE flag is set on. Without this information the logging is
misleading as there is no reference to the VLAN header.
[12716.993704] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0800 SRC=192.168.10.2 DST=172.217.168.163 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=2548 DF PROTO=TCP SPT=55848 DPT=80 WINDOW=501 RES=0x00 ACK FIN URGP=0
[12721.157643] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0806 ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=86:6c:92:ea:d6:73 IPSRC=192.168.10.2 MACDST=0e:3b:eb:86:73:76 IPDST=192.168.10.1
Fixes: 83e96d443b37 ("netfilter: log: split family specific code to nf_log_{ip,ip6,common}.c files")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
With the introduction of netif_set_xps_queue, XPS can be enabled
by the driver at initialization.
Update the documentation to reflect this, as otherwise users
may incorrectly believe that the feature is off by default.
Fixes: 537c00de1c9b ("net: Add functions netif_reset_xps_queue and netif_set_xps_queue")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-10-12
The main changes are:
1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.
2) libbpf relocation support for different size load/store, from Andrii.
3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.
4) BPF support for per-cpu variables, form Hao.
5) sockmap improvements, from John.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the TX data path, spot packets with xfrm stack IPsec offload
indication.
Fill Software-Parser segment in TX descriptor so that the hardware
may parse the ESP protocol, and perform TX checksum offload on the
inner payload.
Support GSO, by providing the trailer data and ICV placeholder
so HW can fill it post encryption operation.
Padding alignment cannot be performed in HW (ConnectX-6Dx) due to
a bug. Software can overcome this limitation by adding NETIF_F_HW_ESP to
the gso_partial_features field in netdev so the packets being
aligned by the stack.
l4_inner_checksum cannot be offloaded by HW for IPsec tunnel type packet.
Note that for GSO SKBs, the stack does not include an ESP trailer,
unlike the non-GSO case.
Below is the iperf3 performance report on two server of 24 cores
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz with ConnectX6-DX.
All the bandwidth test uses iperf3 TCP traffic with packet size 128KB.
Each tunnel uses one iperf3 stream with one thread (option -P1).
TX crypto offload shows improvements on both bandwidth
and CPU utilization.
----------------------------------------------------------------------
Mode | Num tunnel | BW | Send CPU util | Recv CPU util
| | (Gbps) | (Average %) | (Average %)
----------------------------------------------------------------------
Cryto offload | | | |
(RX only) | 1 | 4.7 | 4.2 | 3.5
----------------------------------------------------------------------
Cryto offload | | | |
(RX only) | 24 | 15.6 | 20 | 10
----------------------------------------------------------------------
Non-offload | 1 | 4.6 | 4 | 5
----------------------------------------------------------------------
Non-offload | 24 | 11.9 | 16 | 12
----------------------------------------------------------------------
Cryto offload | | | |
(TX & RX) | 1 | 11.9 | 2.1 | 5.9
----------------------------------------------------------------------
Cryto offload | | | |
(TX & RX) | 24 | 38 | 9.5 | 27.5
----------------------------------------------------------------------
Cryto offload | | | |
(TX only) | 1 | 4.7 | 0.7 | 5
----------------------------------------------------------------------
Cryto offload | | | |
(TX only) | 24 | 14.5 | 6 | 20
Regression tests show no degradation on non-ipsec and
non-offload-ipsec traffics. The packet rate test uses pktgen UDP to
transmit on single CPU, the instructions and cycles are measured on
the transmit CPU.
before:
----------------------------------------------------------------------
Non-offload | 1 | 4.7 | 4.2 | 5.1
----------------------------------------------------------------------
Non-offload | 24 | 11.2 | 14 | 15
----------------------------------------------------------------------
Non-ipsec | 1 | 28 | 4 | 5.7
----------------------------------------------------------------------
Non-ipsec | 24 | 68.3 | 17.8 | 39.7
----------------------------------------------------------------------
Non-ipsec packet rate(BURST=1000 BC=5 NCPUS=1 SIZE=60)
13.56Mpps, 456 instructions/pkt, 191 cycles/pkt
after:
----------------------------------------------------------------------
Non-offload | 1 | 4.69 | 4.2 | 5
----------------------------------------------------------------------
Non-offload | 24 | 11.9 | 13.5 | 15.1
----------------------------------------------------------------------
Non-ipsec | 1 | 29 | 3.2 | 5.5
----------------------------------------------------------------------
Non-ipsec | 24 | 68.2 | 18.5 | 39.8
----------------------------------------------------------------------
Non-ipsec packet rate: 13.56Mpps, 472 instructions/pkt, 191 cycles/pkt
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add new FTE in TX IPsec FT per IPsec state. It has the
same matching criteria as the RX steering rule.
The IPsec FT is created/destroyed when the first/last rule
is added/deleted respectively.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add new namespace that represents the NIC TX domain.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Raed Salem <raeds@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Currently the error exit path err_free kfree's attr. In the case where
flow and parse_attr failed to be allocated this return path will free
the uninitialized pointer attr, which is not correct. In the other
case where attr fails to allocate attr does not need to be freed. So
in both error exits via err_free attr should not be freed, so remove
it.
Addresses-Coverity: ("Uninitialized pointer read")
Fixes: ff7ea04ad579 ("net/mlx5e: Fix potential null pointer dereference")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next:
1) Inspect the reply packets coming from DR/TUN and refresh connection
state and timeout, from longguang yue and Julian Anastasov.
2) Series to add support for the inet ingress chain type in nf_tables.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Michael Chan says:
====================
bnxt_en: Updates for net-next.
This series contains these main changes:
1. Change of default message level to enable more logging.
2. Some cleanups related to processing async events from firmware.
3. Allow online ethtool selftest on multi-function PFs.
4. Return stored firmware version information to devlink.
v2:
Patch 3: Change bnxt_reset_task() to silent mode.
Patch 8 & 9: Ensure we copy NULL terminated fw strings to devlink.
Patch 8 & 9: Return directly after the last bnxt_dl_info_put() call.
Patch 9: If FW call to get stored dev info fails, return success to
devlink without the stored versions.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch adds FW versions stored in the flash to devlink info_get
callback. Return the correct fw.psid running version using the
newly added bp->nvm_cfg_ver.
v2:
Ensure stored pkg_name string is NULL terminated when copied to
devlink.
Return directly from the last call to bnxt_dl_info_put().
If the FW call to get stored version fails for any reason, return
success immediately to devlink without the stored versions.
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-10-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new function bnxt_dl_info_put() to simplify the code, as there
are more stored firmware version fields to be added in the next patch.
Also, rename fw_ver variable name to ncsi_ver for better naming while
copying to devlink info_get cb.
v2:
Ensure active_pkg_name string is NULL terminated when copied to
devlink.
Return directly from the last call to bnxt_dl_info_put().
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-9-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new bnxt_hwrm_nvm_get_dev_info() to query firmware version
information via NVM_GET_DEV_INFO firmware command. Use it to
get the running version of the NVM configuration information.
This new function will also be used in subsequent patches to get the
stored firmware versions.
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-8-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If the VF virtual link is set to always enabled, the speed may be
unknown when the physical link is down. The driver currently logs
the link speed as 4294967295 Mbps which is SPEED_UNKNOWN. Modify
the link up log message as "speed unknown" which makes more sense.
Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-7-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Log these values that contain useful firmware state information.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-6-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
event_data1 and event_data2 are used when processing most events.
Store these in local variables at the beginning of the function to
simplify many of the case statements.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-5-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, bp->msg_enable has default value of 0. It is more useful
to have the commonly used NETIF_MSG_DRV and NETIF_MSG_HW enabled by
default.
v2: Change the fall back bnxt_reset_task() inside bnxt_rx_ring_reset()
to silent mode. With older fw, we would take the fall back path and
it would be very noisy.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-4-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Online self tests are not disruptive and can be run in NPAR mode
and in multi-host NIC as well.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-3-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If NVRAM resources are locked, NVM writes are not permitted. In such
scenarios, firmware returns HWRM_ERR_CODE_RESOURCE_LOCKED error to
firmware commands.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1602493854-29283-2-git-send-email-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The phy_reset_after_clk_enable() is always called with ndev->phydev,
however that pointer may be NULL even though the PHY device instance
already exists and is sufficient to perform the PHY reset.
This condition happens in fec_open(), where the clock must be enabled
first, then the PHY must be reset, and then the PHY IDs can be read
out of the PHY.
If the PHY still is not bound to the MAC, but there is OF PHY node
and a matching PHY device instance already, use the OF PHY node to
obtain the PHY device instance, and then use that PHY device instance
when triggering the PHY reset.
Fixes: 1b0a83ac04e3 ("net: fec: add phy_reset_after_clk_enable() support")
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Christoph Niedermaier <cniedermaier@dh-electronics.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: NXP Linux Team <linux-imx@nxp.com>
Cc: Richard Leitner <richard.leitner@skidata.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
netcons calls napi_poll with a budget of 0 to transmit packets.
Handle this by:
- skipping RX processing
- do not try to recycle TX packets to the RX cache
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
kmalloc returns KSEG0 addresses so convert back from KSEG1
in kfree. Also make sure array is freed when the driver is
unloaded from the kernel.
Fixes: ef11291bcd5f ("Add support the Korina (IDT RC32434) Ethernet MAC")
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Between queuing the delayed work and finishing the setup of the dsa
ports, the process may sleep in request_module() (via
phy_device_create()) and the queued work may be executed prior to the
switch net devices being registered. In ksz_mib_read_work(), a NULL
dereference will happen within netof_carrier_ok(dp->slave).
Not queuing the delayed work in ksz_init_mib_timer() makes things even
worse because the work will now be queued for immediate execution
(instead of 2000 ms) in ksz_mac_link_down() via
dsa_port_link_register_of().
Call tree:
ksz9477_i2c_probe()
\--ksz9477_switch_register()
\--ksz_switch_register()
+--dsa_register_switch()
| \--dsa_switch_probe()
| \--dsa_tree_setup()
| \--dsa_tree_setup_switches()
| +--dsa_switch_setup()
| | +--ksz9477_setup()
| | | \--ksz_init_mib_timer()
| | | |--/* Start the timer 2 seconds later. */
| | | \--schedule_delayed_work(&dev->mib_read, msecs_to_jiffies(2000));
| | \--__mdiobus_register()
| | \--mdiobus_scan()
| | \--get_phy_device()
| | +--get_phy_id()
| | \--phy_device_create()
| | |--/* sleeping, ksz_mib_read_work() can be called meanwhile */
| | \--request_module()
| |
| \--dsa_port_setup()
| +--/* Called for non-CPU ports */
| +--dsa_slave_create()
| | +--/* Too late, ksz_mib_read_work() may be called beforehand */
| | \--port->slave = ...
| ...
| +--Called for CPU port */
| \--dsa_port_link_register_of()
| \--ksz_mac_link_down()
| +--/* mib_read must be initialized here */
| +--/* work is already scheduled, so it will be executed after 2000 ms */
| \--schedule_delayed_work(&dev->mib_read, 0);
\-- /* here port->slave is setup properly, scheduling the delayed work should be safe */
Solution:
1. Do not queue (only initialize) delayed work in ksz_init_mib_timer().
2. Only queue delayed work in ksz_mac_link_down() if init is completed.
3. Queue work once in ksz_switch_register(), after dsa_register_switch()
has completed.
Fixes: 7c6ff470aa86 ("net: dsa: microchip: add MIB counter reading support")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
linux-can-next-for-5.10-20201012
Both patches are by Oliver Hartkopp, the first one addresses Jakub's review
comments of the ISOTP protocol, the other one removes version strings from
various CAN protocols.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use netdev_err for better device identification in syslog.
Signed-off-by: Ondrej Zary <linux@zary.sk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the router is rebooted without a power cycle, the USB device
remains connected but its configuration is reset. This results in
a non-working ethernet connection with messages like this in syslog:
usb 2-2: RX packet too long: 65535 B
Re-enable ethernet mode when receiving a packet with invalid size of
0xffff.
Signed-off-by: Ondrej Zary <linux@zary.sk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The initial support for netlink extended ACK is missing the chain update
path, which results in misleading error reporting in case of EEXIST.
Fixes 36dd1bcc07e5 ("netfilter: nf_tables: initial support for extended ACK reporting")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
As pointed out by Jakub Kicinski here:
http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com
this patch removes the obsolete version information of the different
CAN protocols and the AF_CAN core module.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201012074354.25839-2-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
As pointed out by Jakub Kicinski here:
http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com
this patch addresses the remarked issues:
- remove empty line in comment
- remove default=y for CAN_ISOTP in Kconfig
- make use of pr_notice_once()
- use GFP_ATOMIC instead of gfp_any() in soft hrtimer context
The version strings in the CAN subsystem are removed by a separate patch.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201012074354.25839-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
John Fastabend says:
====================
This allows a sockmap sk_skb verdict programs to run without a parser. For
some use cases, such as verdict program that support streaming data or a
l3/l4 proxy that does not use data in packet, loading the nop parser
'return skb->len' is an extra unnecessary complexity. With this series we
simply call the verdict program directly from data_ready instead of
bouncing through the strparser logic.
Patches 1,2 do the lifting on the sockmap side then patches 3,4 add the
selftests.
This applies on top of the series here,
sockmap/sk_skb program memory acct fixes
https://patchwork.ozlabs.org/project/netdev/list/?series=206975
it will apply without the above series cleanly, but will have an incorrect
memory accounting causing a failure in ./test_sockmap. I could have left
it so the series passed without above series, but it seemed odd to have
it out there and then require yet another patch to fix it up here.
Thanks.
---
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Here we add three new tests for sockmap to test having a verdict program
without setting the parser program.
The first test covers the most simply case,
sender proxy_recv proxy_send recv
| | |
| verdict -----+ |
| | | |
+----------------+ +------------+
We load the verdict program on the proxy_recv socket without a
parser program. It then does a redirect into the send path of the
proxy_send socket using sendpage_locked().
Next we test the drop case to ensure if we kfree_skb as a result of
the verdict program everything behaves as expected.
Next we test the same configuration above, but with ktls and a
redirect into socket ingress queue. Shown here
tls tls
sender proxy_recv proxy_send recv
| | |
| verdict ------------------+
| | redirect_ingress
+----------------+
Also to set up ping/pong test
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239302638.8495.17125996694402793471.stgit@john-Precision-5820-Tower
|
|
Add option to allow running without a parser program in place. To test
with ping/pong program use,
# test_sockmap -t ping --txmsg_omit_skb_parser
this will send packets between two socket bouncing through a proxy
socket that does not use a parser program.
(ping) (pong)
sender proxy_recv proxy_send recv
| | |
| verdict -----+ |
| | | |
+----------------+ +------------+
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239300387.8495.11908295143121563076.stgit@john-Precision-5820-Tower
|
|
Currently, we often run with a nop parser namely one that just does
this, 'return skb->len'. This happens when either our verdict program
can handle streaming data or it is only looking at socket data such
as IP addresses and other metadata associated with the flow. The second
case is common for a L3/L4 proxy for instance.
So lets allow loading programs without the parser then we can skip
the stream parser logic and avoid having to add a BPF program that
is effectively a nop.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239297866.8495.13345662302749219672.stgit@john-Precision-5820-Tower
|
|
We are about to allow skb_verdict to run without skb_parser programs
as a first step change code to check each program type specifically.
This should be a mechanical change without any impact to actual result.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239294756.8495.5796595770890272219.stgit@john-Precision-5820-Tower
|
|
John Fastabend says:
====================
Users of sockmap and skmsg trying to build proxys and other tools
have pointed out to me the error handling can be problematic. If
the proxy is under-provisioned and/or the BPF admin does not have
the ability to update/modify memory provisions on the sockets
its possible data may be dropped. For some things we have retries
so everything works out OK, but for most things this is likely
not great. And things go bad.
The original design dropped memory accounting on the receive
socket as early as possible. We did this early in sk_skb
handling and then charged it to the redirect socket immediately
after running the BPF program.
But, this design caused a fundamental problem. Namely, what should we do
if we redirect to a socket that has already reached its socket memory
limits. For proxy use cases the network admin can tune memory limits.
But, in general we punted on this problem and told folks to simply make
your memory limits high enough to handle your workload. This is not a
really good answer. When deploying into environments where we expect this
to be transparent its no longer the case because we need to tune params.
In fact its really only viable in cases where we have fine grained
control over the application. For example a proxy redirecting from an
ingress socket to an egress socket. The result is I get bug
reports because its surprising for one, but more importantly also breaks
some use cases. So lets fix it.
This series cleans up the different cases so that in many common
modes, such as passing packet up to receive socket, we can simply
use the underlying assumption that the TCP stack already has done
memory accounting.
Next instead of trying to do memory accounting against the socket
we plan to redirect into we keep memory accounting on the receive
socket until the skb can be put on the redirect socket. This means
if we do an egress redirect to a socket and sock_writable() returns
EAGAIN we can requeue the skb on the workqueue and try again. The
same scenario plays out for ingress. If the skb can not be put on
the receive queue of the redirect socket than we simply requeue and
retry. In both cases memory is still accounted for against the
receiving socket.
This also handles head of line blocking. With the above scheme the
skb is on a queue associated with the socket it will be sent/recv'd
on, but the memory accounting is against the received socket. This
means the receive socket can advance to the next skb and avoid head
of line blocking. At least until its receive memory on the socket
runs out. This will put some maximum size on the amount of data any
socket can enqueue giving us bounds on the skb lists so they can't grow
indefinitely.
Overall I think this is a win. Tested with test_sockmap.
These are fixes, but I tagged it for bpf-next considering we are
at -rc8.
v1->v2: Fix uninitialized/unused variables (kernel test robot)
v2->v3: fix typo in patch2 err=0 needs to be <0 so use err=-EIO
---
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move skb->sk assignment out of sk_psock_bpf_run() and into individual
callers. Then we can use proper skb_set_owner_r() call to assign a
sk to a skb. This improves things by also charging the truesize against
the sockets sk_rmem_alloc counter. With this done we get some accounting
in place to ensure the memory associated with skbs on the workqueue are
still being accounted for somewhere. Finally, by using skb_set_owner_r
the destructor is setup so we can just let the normal skb_kfree logic
recover the memory. Combined with previous patch dropping skb_orphan()
we now can recover from memory pressure and maintain accounting.
Note, we will charge the skbs against their originating socket even
if being redirected into another socket. Once the skb completes the
redirect op the kfree_skb will give the memory back. This is important
because if we charged the socket we are redirecting to (like it was
done before this series) the sock_writeable() test could fail because
of the skb trying to be sent is already charged against the socket.
Also TLS case is special. Here we wait until we have decided not to
simply PASS the packet up the stack. In the case where we PASS the
packet up the stack we already have an skb which is accounted for on
the TLS socket context.
For the parser case we continue to just set/clear skb->sk this is
because the skb being used here may be combined with other skbs or
turned into multiple skbs depending on the parser logic. For example
the parser could request a payload length greater than skb->len so
that the strparser needs to collect multiple skbs. At any rate
the final result will be handled in the strparser recv callback.
Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226867513.5692.10579573214635925960.stgit@john-Precision-5820-Tower
|
|
Calling skb_orphan() is unnecessary in the strp rcv handler because the skb
is from a skb_clone() in __strp_recv. So it never has a destructor or a
sk assigned. Plus its confusing to read because it might hint to the reader
that the skb could have an sk assigned which is not true. Even if we did
have an sk assigned it would be cleaner to simply wait for the upcoming
kfree_skb().
Additionally, move the comment about strparser clone up so its closer to
the logic it is describing and add to it so that it is more complete.
Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226865548.5692.9098315689984599579.stgit@john-Precision-5820-Tower
|
|
In the sk_skb redirect case we didn't handle the case where we overrun
the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
Because we didn't have anything implemented we simply dropped the skb.
This meant data could be dropped if socket memory accounting was in
place.
This fixes the above dropped data case by moving the memory checks
later in the code where we actually do the send or recv. This pushes
those checks into the workqueue and allows us to return an EAGAIN error
which in turn allows us to try again later from the workqueue.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226863689.5692.13861422742592309285.stgit@john-Precision-5820-Tower
|
|
The skb_set_owner_w is unnecessary here. The sendpage call will create a
fresh skb and set the owner correctly from workqueue. Its also not entirely
harmless because it consumes cycles, but also impacts resource accounting
by increasing sk_wmem_alloc. This is charging the socket we are going to
send to for the skb, but we will put it on the workqueue for some time
before this happens so we are artifically inflating sk_wmem_alloc for
this period. Further, we don't know how many skbs will be used to send the
packet or how it will be broken up when sent over the new socket so
charging it with one big sum is also not correct when the workqueue may
break it up if facing memory pressure. Seeing we don't know how/when
this is going to be sent drop the early accounting.
A later patch will do proper accounting charged on receive socket for
the case where skbs get enqueued on the workqueue.
Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226861708.5692.17964237936462425136.stgit@john-Precision-5820-Tower
|
|
When we receive an skb and the ingress skb verdict program returns
SK_PASS we currently set the ingress flag and put it on the workqueue
so it can be turned into a sk_msg and put on the sk_msg ingress queue.
Then finally telling userspace with data_ready hook.
Here we observe that if the workqueue is empty then we can try to
convert into a sk_msg type and call data_ready directly without
bouncing through a workqueue. Its a common pattern to have a recv
verdict program for visibility that always returns SK_PASS. In this
case unless there is an ENOMEM error or we overrun the socket we
can avoid the workqueue completely only using it when we fall back
to error cases caused by memory pressure.
By doing this we eliminate another case where data may be dropped
if errors occur on memory limits in workqueue.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower
|
|
For sk_skb case where skb_verdict program returns SK_PASS to continue to
pass packet up the stack, the memory limits were already checked before
enqueuing in skb_queue_tail from TCP side. So, lets remove the extra checks
here. The theory is if the TCP stack believes we have memory to receive
the packet then lets trust the stack and not double check the limits.
In fact the accounting here can cause a drop if sk_rmem_alloc has increased
after the stack accepted this packet, but before the duplicate check here.
And worse if this happens because TCP stack already believes the data has
been received there is no retransmit.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower
|
|
fq qdisc requires tstamp to be cleared in forwarding path
Reported-by: Evgeny B <abt-admin@mail.ru>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209427
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 8203e2d844d3 ("net: clear skb->tstamp in forwarding paths")
Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
add a test with re-queueing: usespace doesn't pass accept verdict,
but tells to re-queue to another nf_queue instance.
Also, make the second nf-queue program use non-gso mode, kernel will
have to perform software segmentation.
Lastly, do not queue every packet, just one per second, and add delay
when re-injecting the packet to the kernel.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Make two unfront calls to pskb_may_pull() to linearize the network and
transport header.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds a new ingress hook for the inet family. The inet ingress
hook emulates the IP receive path code, therefore, unclean packets are
drop before walking over the ruleset in this basechain.
This patch also introduces the nft_base_chain_netdev() helper function
to check if this hook is bound to one or more devices (through the hook
list infrastructure). This check allows to perform the same handling for
the inet ingress as it would be a netdev ingress chain from the control
plane.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds the NF_INET_INGRESS pseudohook for the NFPROTO_INET
family. This is a mapping this new hook to the existing NFPROTO_NETDEV
and NF_NETDEV_INGRESS hook. The hook does not guarantee that packets are
inet only, users must filter out non-ip traffic explicitly.
This infrastructure makes it easier to support this new hook in nf_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add helper function to check if this is an ingress hook.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|