summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-04-25ipvlan: use pernet operations and restrict l3s hooks to master netnsFlorian Westphal
commit 4fbae7d83c98c30efc ("ipvlan: Introduce l3s mode") added registration of netfilter hooks via nf_register_hooks(). This API provides the illusion of 'global' netfilter hooks by placing the hooks in all current and future network namespaces. In case of ipvlan the hook appears to be only needed in the namespace that contains the ipvlan master device (i.e., usually init_net), so placing them in all namespaces is not needed. This switches ipvlan driver to pernet operations, and then only registers hooks in namespaces where a ipvlan master device is set to l3s mode. Extra care has to be taken when the master device is moved to another namespace, as we might have to 'move' the netfilter hooks too. This is done by storing the namespace the ipvlan port was created in. On REGISTER event, do (un)register operations in the old/new namespaces. This will also allow removal of the nf_register_hooks() in a future patch. Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25macvlan: Fix device ref leak when purging bc_queueHerbert Xu
When a parent macvlan device is destroyed we end up purging its broadcast queue without dropping the device reference count on the packet source device. This causes the source device to linger. This patch drops that reference count. Fixes: 260916dfb48c ("macvlan: Fix potential use-after free for...") Reported-by: Joe Ghalam <Joe.Ghalam@dell.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25usb: plusb: Add support for PL-27A1Roman Spychała
This patch adds support for the PL-27A1 by adding the appropriate USB ID's. This chip is used in the goobay Active USB 3.0 Data Link and Unitek Y-3501 cables. Signed-off-by: Roman Spychała <roed@onet.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25iwlwifi: mvm: handle possible BIOS bugSharon Dvir
In iwl_mvm_sar_get_ewrd_table() In case of a BIOS bug, n_profiles might be 0 thus we need to return an error value. Found by Klocwork. Signed-off-by: Sharon Dvir <sharon.dvir@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
2017-04-25iwlwifi: mvm: scan: avoid "big" printsMordechai Goodstein
Delete the scanned channel results. No need in it we get it any way when logging. The print only clogs up the ftrace print buffer. Signed-off-by: Mordechai Goodstein <mordechay.goodstein@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
2017-04-25iwlwifi: mvm: check if returned value is NULLSharon Dvir
While freeing inactive queue, check mvmsta to be valid before dereferencing it. Found by Klocwork. Signed-off-by: Sharon Dvir <sharon.dvir@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
2017-04-25iwlwifi: mvm: make iwl_run_unified_mvm_ucode() staticJohannes Berg
There's no need to have iwl_run_unified_mvm_ucode() be exposed to other parts of the code since the logic to pick it over the normal code in iwl_run_init_mvm_ucode() can just be done in that function itself. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
2017-04-25iwlwifi: mvm: support new rate flagsSara Sharon
Rates were changed to adapt to HE. Change is backward compatible. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
2017-04-25net: can: usb: gs_usb: Fix buffer on stackMaksim Salau
Allocate buffers on HEAP instead of STACK for local structures that are to be sent using usb_control_msg(). Signed-off-by: Maksim Salau <maksim.salau@gmail.com> Cc: linux-stable <stable@vger.kernel.org> # >= v4.8 Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25s390/mm: simplify arch_get_unmapped_area[_topdown]Martin Schwidefsky
With TASK_SIZE now reflecting the maximum size of the address space for a process the code for arch_get_unmapped_area[_topdown] can be simplified. Just let the logic pick a suitable address and deal with the page table upgrade after the address has been selected. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-04-25netfilter: Wrong icmp6 checksum for ICMPV6_TIME_EXCEED in reverse SNATv6 pathDave Johnson
When recalculating the outer ICMPv6 checksum for a reverse path NATv6 such as ICMPV6_TIME_EXCEED nf_nat_icmpv6_reply_translation() was accessing data beyond the headlen of the skb for non-linear skb. This resulted in incorrect ICMPv6 checksum as garbage data was used. Patch replaces csum_partial() with skb_checksum() which supports non-linear skbs similar to nf_nat_icmp_reply_translation() from ipv4. Signed-off-by: Dave Johnson <dave-kernel@centerclick.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-25netfilter: nft_dynset: continue to next expr if _OP_ADD succeededLiping Zhang
Currently, after adding the following nft rules: # nft add set x target1 { type ipv4_addr \; flags timeout \;} # nft add rule x y set add ip daddr timeout 1d @target1 counter the counters will always be zero despite of the elements are added to the dynamic set "target1" or not, as we will break the nft expr traversal unconditionally: # nft list ruleset ... set target1 { ... elements = { 8.8.8.8 expires 23h59m53s} } chain output { ... set add ip daddr timeout 1d @target1 counter packets 0 bytes 0 ^ ^ ... } Since we add the elements to the set successfully, we should continue to the next expression. Additionally, if elements are added to "flow table" successfully, we will _always_ continue to the next expr, even if the operation is _OP_ADD. So it's better to keep them to be consistent. Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates") Reported-by: Robert White <rwhite@pobox.com> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-25bridge: ebtables: fix reception of frames DNAT-ed to bridge device/portLinus Lüssing
When trying to redirect bridged frames to the bridge device itself or a bridge port (brouting) via the dnat target then this currently fails: The ethernet destination of the frame is dnat'ed to the MAC address of the bridge device or port just fine. However, the IP code drops it in the beginning of ip_input.c/ip_rcv() as the dnat target left the skb->pkt_type as PACKET_OTHERHOST. Fixing this by resetting skb->pkt_type to an appropriate type after dnat'ing. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-25can: usb: Kconfig: Add PCAN-USB X6 device in help textStephane Grosjean
This patch adds a text line in the help section of the CAN_PEAK_USB config item describing the support of the PCAN-USB X6 adapter, which is already included in the Kernel since 4.9. Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: usb: Add support of PCAN-Chip USB stamp moduleStephane Grosjean
This patch adds the support of the PCAN-Chip USB, a stamp module for customer hardware designs, which communicates via USB 2.0 with the hardware. The integrated CAN controller supports the protocols CAN 2.0 A/B as well as CAN FD. The physical CAN connection is determined by external wiring. The Stamp module with its single-sided mounting and plated half-holes is suitable for automatic assembly. Note that the chip is equipped with the same logic than the PCAN-USB FD. Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: ti_hecc: fix return value check in ti_hecc_probe()Wei Yongjun
In case of error, the function devm_ioremap_resource() returns ERR_PTR() and never returns NULL. The NULL test in the return value check should be replaced with IS_ERR(). Fixes: dabf54dd1c63 ("can: ti_hecc: Convert TI HECC driver to DT only driver") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25mmc: sdhci-xenon: Remove redundant dev_err call in get_dt_pad_ctrl_data()Wei Yongjun
There is a error message within devm_ioremap_resource already, so remove the dev_err call to avoid redundant error message. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-04-25can: enable module auto loading for virtual CAN interfacesOliver Hartkopp
Autoload the vcan module when a vcan instance is to be created by 'ip link add type vcan' Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: add Virtual CAN Tunnel driver (vxcan)Oliver Hartkopp
Similar to the virtual ethernet driver veth, vxcan implements a local CAN traffic tunnel between two virtual CAN network devices. See Kconfig entry for details. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: network namespace support for CAN gatewayOliver Hartkopp
The CAN gateway was not implemented as per-net in the initial network namespace support by Mario Kicherer (8e8cda6d737d). This patch enables the CAN gateway to be used in different namespaces. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: network namespace support for CAN_BCM protocolOliver Hartkopp
The CAN_BCM protocol and its procfs entries were not implemented as per-net in the initial network namespace support by Mario Kicherer (8e8cda6d737d). This patch adds the missing per-net functionality for the CAN BCM. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: complete initial namespace supportOliver Hartkopp
The statistics and its proc output was not implemented as per-net in the initial network namespace support by Mario Kicherer (8e8cda6d737d). This patch adds the missing per-net statistics for the CAN subsystem. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: remove obsolete definitionsOliver Hartkopp
can_rx_alldev_list is a per-net data structure now. Remove it's definition here and can_rx_dev_list too. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: remove obsolete pernet_operations definitionsOliver Hartkopp
The namespace support for the CAN subsystem does not need any additional memory. So when ".size = 0" there's no extra memory allocated by the system. And therefore ".id" is obsolete too. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: fix memory leak in initial namespace supportOliver Hartkopp
The can_rx_alldev_list is a per-net data structure now and allocated in can_pernet_init(). Make sure the memory is free'd in can_pernet_exit() too. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: mcba_usb: Add support for Microchip CAN BUS AnalyzerRemigiusz Kołłątaj
SocketCAN driver for Microchip CAN BUS Analyzer (http://www.microchip.com/development-tools/) Changes in v4: - possible memory leak fixed in mcba_usb_write_bulk_callback - LED support added - failure handling in mcba_usb_probe improved - C99 initializers for structs on stack Changes in v3: - improved/simplified CAN ID conversion - functions for transmission of skb and cmd separated - fixed/improved netif_stop_queue handling - style/cosmetic corrections Changes in v2: - Termination handling reimplemented to fit new netlink API (IFLA_CAN_TERMINATION) - Bitrate handling reimplemented to fit new netlink API (IFLA_CAN_BITRATE) - CAN ID conversion refactored (changed from macro to inline functions) - CAN DLC handling using get_can_dlc() - Endianness handling for can_speed introduced - Debugging removed - Redundant error prints removed - Style/cosmetic corrections (i.e. macro names, redefs, inits etc.) Signed-off-by: Remigiusz Kołłątaj <remigiusz.kollataj@mobica.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: m_can: Enable TX FIFO Handling for M_CAN IP version >= v3.1.xMario Huettel
* Added defines for TX Event FIFO Element * Adapted ndo_start_xmit function. For versions >= v3.1.x it uses the TX FIFO to optimize the data throughput. It stores the echo skb at the same index as in the M_CAN's TX FIFO. The frame's message marker is set to this index. This message marker is received in the TX Event FIFO after the message was successfully transmitted. It is used to echo the correct echo skb back to the network stack. * Added m_can_echo_tx_event function. It reads all received message markers in the TX Event FIFO and loops back the corresponding echo skbs. * ISR checks for new TX Event Entry interrupt for version >= 3.1.x. Signed-off-by: Mario Huettel <mario.huettel@gmx.net> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: m_can: Configuration for TX and TX event FIFOsMario Huettel
* TX/TX Event FIFO sizes are configured for version >= v3.1.x Signed-off-by: Mario Huettel <mario.huettel@gmx.net> Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: m_can: Enable M_CAN version dependent initializationMario Huettel
This patch adapts the initialization of the M_CAN. So it can be used with all versions >= 3.0.x. Changes: * Added version element to m_can_priv structure to hold M_CAN version. * Renamed bittiming structs for version 3.0.x * Added new bittiming structs for version >= 3.1.x * Function alloc_m_can_dev takes 2 new arguments. The TX FIFO size and the base address of the module. * Chip configuration for CAN_CTRLMODE_LOOPBACK is changed: Enabled CCCR_MON bit. In combination with TEST_LBCK it activates the internal loopback mode. Leaving CCCR_MON '0' results in external loopback mode. * Clocks are temporarily enabled by platform_propbe function in order to allow read access to the Core Release register and the Control Register. Registers are used to detect M_CAN version and optional Non-ISO Feature. Initialization of M_CAN for version >= 3.1.x: * TX FIFO of M_CAN is used to transmit frames. The driver does not need to stop the tx queue after each frame sent. * Initialization of TX Event FIFO is added. * NON-ISO is fixed for all M_CAN versions < 3.2.x. Version 3.2.x _can_ have the NISO (Non-ISO) bit which can switch the mode of the M_CAN to Non-ISO mode. This bit does not have to be writeable. Therefore it is checked. If it is writable Non-ISO support is added to the controllers supported CAN modes. New Functions: * Function to check the Core Release version. The read value determines the behaviour of the driver. * Function to check if the NISO bit for version >= 3.2.x is implemented. Signed-off-by: Mario Huettel <mario.huettel@gmx.net> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: m_can: Updated register defines to newest versionMario Huettel
* Updated register defines to newest M_CAN version (v3.2.1). * Changed defines in the whole code. Signed-off-by: Mario Huettel <mario.huettel@gmx.net> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: m_can: Removed virtual address from printMario Huettel
The virtual address of the device was printed. I removed it because it leaks internal information. Signed-off-by: Mario Huettel <mario.huettel@gmx.net> Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: m_can: Removed initialization of FIFO water marksMario Huettel
FIFO water marks disabled because the driver doesn't handle water mark events. Signed-off-by: Mario Huettel <mario.huettel@gmx.net> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: m_can: Disabled Interrupt Line 1Mario Huettel
* Disabled interrupt line 1. The driver didn't use it. Signed-off-by: Mario Huettel <mario.huettel@gmx.net> Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: peak: add support for PEAK PCAN-PCIe FD CAN-FD boardsStephane Grosjean
This patch adds the support of the PCAN-PCI Express FD boards made by PEAK-System, for computers using the PCI Express slot. The PCAN-PCI Express FD has one or two CAN FD channels, depending on the model. A galvanic isolation of the CAN ports protects the electronics of the card and the respective computer against disturbances of up to 500 Volts. The PCAN-PCI Express FD can be operated with ambient temperatures in a range of -40 to +85 °C. Such boards run an extented version of the CAN-FD IP running into USB CAN-FD interfaces from PEAK-System, so this patch adds several new commands and their corresponding data types to the PEAK CAN-FD common definitions header file too. Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: peak: move header file to new can common subdirStephane Grosjean
The CAN-FD IP from PEAK-System runs into several kinds of PC CAN-FD interfaces. Up to now, only the USB CAN-FD adapters were supported by the Kernel. In order to prepare the adding of some new non-USB CAN-FD interfaces, this patch moves - and rename - the IP definitions file from its private (usb) sub-directory into a - newly created - CAN specific one. Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: peak: fix usage of const qualifier in pointers argsStephane Grosjean
Fixes the usage of the const qualifier in the memory pointer arguments of the declared inline functions. By changing the line containing "const", this patch also changes the name of the arg into a more usual one. Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: peak: fix usage of usb specific data typeStephane Grosjean
This patch fixes the wrong usage of a specific USB data type into a common header file. This common header file is intended to define the common data types and values that define access to the PEAK-System CAN-FD IP, whatever the PC interface is. Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25mmc: cavium: Use module_pci_driver to simplify the codeWei Yongjun
Use the module_pci_driver() macro to make the code simpler by eliminating module_init and module_exit calls. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Jan Glauber <jglauber@cavium.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-04-25s390/mm: make TASK_SIZE independent from the number of page table levelsMartin Schwidefsky
The TASK_SIZE for a process should be maximum possible size of the address space, 2GB for a 31-bit process and 8PB for a 64-bit process. The number of page table levels required for a given memory layout is a consequence of the mapped memory areas and their location. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2017-04-24virtio_blk: Fix English description of VIRTIO_BLK_SCSIJean Delvare
Signed-off-by: Jean Delvare <jdelvare@suse.de> Fixes: 97b50a654d5d ("virtio_blk: make SCSI passthrough support configurable") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-24nvme: Add nvme_core.force_apst to ignore the NO_APST quirkAndy Lutomirski
We're probably going to be stuck quirking APST off on an over-broad range of devices for 4.11. Let's make it easy to override the quirk for testing. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-24nvme: Display raw APST configuration via DYNAMIC_DEBUGAndy Lutomirski
Debugging APST is currently a bit of a pain. This gives optional simple log messages that describe the APST state. The easiest way to use this is probably with the nvme_core.dyndbg=+p module parameter. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-24nvme: Fix APST commentAndy Lutomirski
There was a typo in the description of the timeout heuristic. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-24Merge branch 'master' into for-4.12/post-mergeJens Axboe
2017-04-24Merge branch 'virtio-net-tx-napi'David S. Miller
Willem de Bruijn says: ==================== virtio-net tx napi Add napi for virtio-net transmit completion processing. Changes: v2 -> v3: - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll ensure that the handler always cleans, to avoid deadlock - unconditionally clean in start_xmit avoid adding an unnecessary "if (use_napi)" branch - remove virtqueue_disable_cb in patch 5/5 a noop in the common event_idx based loop - document affinity_hint_set constraint v1 -> v2: - disable by default - disable unless affinity_hint_set because cache misses add up to a third higher cycle cost, e.g., in TCP_RR tests. This is not limited to the patch that enables tx completion cleaning in rx napi. - use trylock to avoid contention between tx and rx napi - keep interrupts masked during xmit_more (new patch 5/5) this improves cycles especially for multi UDP_STREAM, which does not benefit from cleaning tx completions on rx napi. - move free_old_xmit_skbs (new patch 3/5) to avoid forward declaration not changed: - deduplicate virnet_poll_tx and virtnet_poll_txclean they look similar, but have differ too much to make it worthwhile. - delay netif_wake_subqueue for more than 2 + MAX_SKB_FRAGS evaluated, but made no difference - patch 1/5 RFC -> v1: - dropped vhost interrupt moderation patch: not needed and likely expensive at light load - remove tx napi weight - always clean all tx completions - use boolean to toggle tx-napi, instead - only clean tx in rx if tx-napi is enabled - then clean tx before rx - fix: add missing braces in virtnet_freeze_down - testing: add 4KB TCP_RR + UDP test results Based on previous patchsets by Jason Wang: [RFC V7 PATCH 0/7] enable tx interrupts for virtio-net http://lkml.iu.edu/hypermail/linux/kernel/1505.3/00245.html Before commit b0c39dbdc204 ("virtio_net: don't free buffers in xmit ring") the virtio-net driver would free transmitted packets on transmission of new packets in ndo_start_xmit and, to catch the edge case when no new packet is sent, also in a timer at 10HZ. A timer can cause long stalls. VIRTIO_F_NOTIFY_ON_EMPTY avoids stalls due to low free descriptor count. It does not address a stalls due to low socket SO_SNDBUF. Increasing timer frequency decreases that stall time, but increases interrupt rate and, thus, cycle count. Currently, with no timer, packets are freed only at ndo_start_xmit. Latency of consume_skb is now unbounded. To avoid a deadlock if a sock reaches SO_SNDBUF, packets are orphaned on tx. This breaks TCP small queues. Reenable TCP small queues by removing the orphan. Instead of using a timer, convert the driver to regular tx napi. This does not have the unresolved stall issue and does not have any frequency to tune. By keeping interrupts enabled by default, napi increases tx interrupt rate. VIRTIO_F_EVENT_IDX avoids sending an interrupt if one is already unacknowledged, so makes this more feasible today. Combine that with an optimization that brings interrupt rate back in line with the existing version for most workloads: Tx completion cleaning on rx interrupts elides most explicit tx interrupts by relying on the fact that many rx interrupts fire. Tested by running {1, 10, 100} {TCP, UDP} STREAM, RR, 4K_RR benchmarks from a guest to a server on the host, on an x86_64 Haswell. The guest runs 4 vCPUs pinned to 4 cores. vhost and the test server are pinned to a core each. All results are the median of 5 runs, with variance well < 10%. Used neper (github.com/google/neper) as test process. Napi increases single stream throughput, but increases cycle cost. The optimizations bring this down. The previous patchset saw a regression with UDP_STREAM, which does not benefit from cleaning tx interrupts in rx napi. This regression is now gone for 10x, 100x. Remaining difference is higher 1x TCP_STREAM, lower 1x UDP_STREAM. The latest results are with process, rx napi and tx napi affine to the same core. All numbers are lower than the previous patchset. upstream napi TCP_STREAM: 1x: Mbps 27816 39805 Gcycles 274 285 10x: Mbps 42947 42531 Gcycles 300 296 100x: Mbps 31830 28042 Gcycles 279 269 TCP_RR Latency (us): 1x: p50 21 21 p99 27 27 Gcycles 180 167 10x: p50 40 39 p99 52 52 Gcycles 214 211 100x: p50 281 241 p99 411 337 Gcycles 218 226 TCP_RR 4K: 1x: p50 28 29 p99 34 36 Gcycles 177 167 10x: p50 70 71 p99 85 134 Gcycles 213 214 100x: p50 442 611 p99 802 785 Gcycles 237 216 UDP_STREAM: 1x: Mbps 29468 26800 Gcycles 284 293 10x: Mbps 29891 29978 Gcycles 285 312 100x: Mbps 30269 30304 Gcycles 318 316 UDP_RR: 1x: p50 19 19 p99 23 23 Gcycles 180 173 10x: p50 35 40 p99 54 64 Gcycles 245 237 100x: p50 234 286 p99 484 473 Gcycles 224 214 Note that GSO is enabled, so 4K RR still translates to one packet per request. Lower throughput at 100x vs 10x can be (at least in part) explained by looking at bytes per packet sent (nstat). It likely also explains the lower throughput of 1x for some variants. upstream: N=1 bytes/pkt=16581 N=10 bytes/pkt=61513 N=100 bytes/pkt=51558 at_rx: N=1 bytes/pkt=65204 N=10 bytes/pkt=65148 N=100 bytes/pkt=56840 ==================== Acked-by: Michael S. Tsirkin <mst@redhat.com>
2017-04-24virtio-net: keep tx interrupts disabled unless kickWillem de Bruijn
Tx napi mode increases the rate of transmit interrupts. Suppress some by masking interrupts while more packets are expected. The interrupts will be reenabled before the last packet is sent. This optimization reduces the througput drop with tx napi for unidirectional flows such as UDP_STREAM that do not benefit from cleaning tx completions in the the receive napi handler. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24virtio-net: clean tx descriptors from rx napiWillem de Bruijn
Amortize the cost of virtual interrupts by doing both rx and tx work on reception of a receive interrupt if tx napi is enabled. With VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion interrupts for bidirectional workloads. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24virtio-net: move free_old_xmit_skbsWillem de Bruijn
An upcoming patch will call free_old_xmit_skbs indirectly from virtnet_poll. Move the function above this to avoid having to introduce a forward declaration. This is a pure move: no code changes. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24virtio-net: transmit napiWillem de Bruijn
Convert virtio-net to a standard napi tx completion path. This enables better TCP pacing using TCP small queues and increases single stream throughput. The virtio-net driver currently cleans tx descriptors on transmission of new packets in ndo_start_xmit. Latency depends on new traffic, so is unbounded. To avoid deadlock when a socket reaches its snd limit, packets are orphaned on tranmission. This breaks socket backpressure, including TSQ. Napi increases the number of interrupts generated compared to the current model, which keeps interrupts disabled as long as the ring has enough free descriptors. Keep tx napi optional and disabled for now. Follow-on patches will reduce the interrupt cost. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24virtio-net: napi helper functionsWillem de Bruijn
Prepare virtio-net for tx napi by converting existing napi code to use helper functions. This also deduplicates some logic. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>