summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-08-18tipc: clean up skb list lock handling on send pathJon Maloy
The policy for handling the skb list locks on the send and receive paths is simple. - On the send path we never need to grab the lock on the 'xmitq' list when the destination is an exernal node. - On the receive path we always need to grab the lock on the 'inputq' list, irrespective of source node. However, when transmitting node local messages those will eventually end up on the receive path of a local socket, meaning that the argument 'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in the function tipc_sk_rcv(). This has been handled by always initializing the spinlock of the 'xmitq' list at message creation, just in case it may end up on the receive path later, and despite knowing that the lock in most cases never will be used. This approach is inaccurate and confusing, and has also concealed the fact that the stated 'no lock grabbing' policy for the send path is violated in some cases. We now clean up this by never initializing the lock at message creation, instead doing this at the moment we find that the message actually will enter the receive path. At the same time we fix the four locations where we incorrectly access the spinlock on the send/error path. This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock is initialised") which has now become redundant. CC: Eric Dumazet <edumazet@google.com> Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18ibmvnic: Unmap DMA address of TX descriptor buffers after useThomas Falcon
There's no need to wait until a completion is received to unmap TX descriptor buffers that have been passed to the hypervisor. Instead unmap it when the hypervisor call has completed. This patch avoids the possibility that a buffer will not be unmapped because a TX completion is lost or mishandled. Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Tested-by: Devesh K. Singh <devesh_singh@in.ibm.com> Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18Merge branch 'bnxt_en-Bug-fixes'David S. Miller
Michael Chan says: ==================== bnxt_en: Bug fixes. 2 Bug fixes related to 57500 shutdown sequence and doorbell sequence, 2 TC Flower bug fixes related to the setting of the flow direction, 1 NVRAM update bug fix, and a minor fix to suppress an unnecessary error message. Please queue for -stable as well. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18bnxt_en: Fix to include flow direction in L2 keySomnath Kotur
FW expects the driver to provide unique flow reference handles for Tx or Rx flows. When a Tx flow and an Rx flow end up sharing a reference handle, flow offload does not seem to work. This could happen in the case of 2 flows having their L2 fields wildcarded but in different direction. Fix to incorporate the flow direction as part of the L2 key v2: Move the dir field to the end of the bnxt_tc_l2_key struct to fix the warning reported by kbuild test robot <lkp@intel.com>. There is existing code that initializes the structure using nested initializer and will warn with the new u8 field added to the beginning. The structure also packs nicer when this new u8 is added to the end of the structure [MChan]. Fixes: abd43a13525d ("bnxt_en: Support for 64-bit flow handle.") Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18bnxt_en: Use correct src_fid to determine direction of the flowVenkat Duvvuru
Direction of the flow is determined using src_fid. For an RX flow, src_fid is PF's fid and for TX flow, src_fid is VF's fid. Direction of the flow must be specified, when getting statistics for that flow. Currently, for DECAP flow, direction is determined incorrectly, i.e., direction is initialized as TX for DECAP flow, instead of RX. Because of which, stats are not reported for this DECAP flow, though it is offloaded and there is traffic for that flow, resulting in flow age out. This patch fixes the problem by determining the DECAP flow's direction using correct fid. Set the flow direction in all cases for consistency even if 64-bit flow handle is not used. Fixes: abd43a13525d ("bnxt_en: Support for 64-bit flow handle.") Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18bnxt_en: Suppress HWRM errors for HWRM_NVM_GET_VARIABLE commandVasundhara Volam
For newly added NVM parameters, older firmware may not have the support. Suppress the error message to avoid the unncessary error message which is triggered when devlink calls the driver during initialization. Fixes: 782a624d00fa ("bnxt_en: Add bnxt_en initial params table and register it.") Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18bnxt_en: Fix handling FRAG_ERR when NVM_INSTALL_UPDATE cmd failsVasundhara Volam
If FW returns FRAG_ERR in response error code, driver is resending the command only when HWRM command returns success. Fix the code to resend NVM_INSTALL_UPDATE command with DEFRAG install flags, if FW returns FRAG_ERR in its response error code. Fixes: cb4d1d626145 ("bnxt_en: Retry failed NVM_INSTALL_UPDATE with defragmentation flag enabled.") Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18bnxt_en: Improve RX doorbell sequence.Michael Chan
When both RX buffers and RX aggregation buffers have to be replenished at the end of NAPI, post the RX aggregation buffers first before RX buffers. Otherwise, we may run into a situation where there are only RX buffers without RX aggregation buffers for a split second. This will cause the hardware to abort the RX packet and report buffer errors, which will cause unnecessary cleanup by the driver. Ringing the Aggregation ring doorbell first before the RX ring doorbell will prevent some of these buffer errors. Use the same sequence during ring initialization as well. Fixes: 697197e5a173 ("bnxt_en: Re-structure doorbells.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18bnxt_en: Fix VNIC clearing logic for 57500 chips.Michael Chan
During device shutdown, the VNIC clearing sequence needs to be modified to free the VNIC first before freeing the RSS contexts. The current code is doing the reverse and we can get mis-directed RX completions to CP ring ID 0 when the RSS contexts are freed and zeroed. The clearing of RSS contexts is not required with the new sequence. Refactor the VNIC clearing logic into a new function bnxt_clear_vnic() and do the chip specific VNIC clearing sequence. Fixes: 7b3af4f75b81 ("bnxt_en: Add RSS support for 57500 chips.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18net: kalmia: fix memory leaksWenwen Wang
In kalmia_init_and_get_ethernet_addr(), 'usb_buf' is allocated through kmalloc(). In the following execution, if the 'status' returned by kalmia_send_init_packet() is not 0, 'usb_buf' is not deallocated, leading to memory leaks. To fix this issue, add the 'out' label to free 'usb_buf'. Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18cx82310_eth: fix a memory leak bugWenwen Wang
In cx82310_bind(), 'dev->partial_data' is allocated through kmalloc(). Then, the execution waits for the firmware to become ready. If the firmware is not ready in time, the execution is terminated. However, the allocated 'dev->partial_data' is not deallocated on this path, leading to a memory leak bug. To fix this issue, free 'dev->partial_data' before returning the error. Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18Merge branch 'hns3-next'David S. Miller
Huazhong Tan says: ==================== net: hns3: add some cleanups & bugfix This patch-set includes cleanups and bugfix for the HNS3 ethernet controller driver. [patch 01/06 - 03/06] adds some cleanups. [patch 04/06] changes the print level of RAS. [patch 05/06] fixes a bug related to MAC TNL. [patch 06/06] adds phy_attached_info(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18net: hns3: add phy_attached_info() to the hns3 driverYonglong Liu
This patch adds the call to phy_attached_info() to the hns3 driver to identify which exact PHY drivers and models is in use. Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18net: hns3: prevent unnecessary MAC TNL interruptHuazhong Tan
MAC TNL interrupt is used to collect statistic info about link status changing suddenly when netdev is running. But when stopping netdev, the enabled MAC TNL interrupt is unnecessary, and may add some noises to the statistic info. So this patch disables it before stopping MAC. Fixes: a63457878b12 ("net: hns3: Add handling of MAC tunnel interruption") Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18net: hns3: change print level of RAS error log from warning to errorXiaofei Tan
This patch changes print level of RAS error log from warning to error. Because RAS error and its recovery process could cause application failure. Also uses %u instead of %d when the parameter is unsigned. Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18net: hns3: fix error and incorrect formatGuojia Liao
The pointer type parameter should be declare as const for preventing from its pointed value being unexpected modified. The uninitialized variable can not be return directly. The default return value is 0 if no abnormal result. This patch fixes the preceding two errors, deletes redundant declaration of a function and align one parameter. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18net: hns3: modify redundant initialization of variableGuojia Liao
Some temporary variables do not need to be initialized that they will be set before used, so this patch deletes the initialization value of these temporary variables. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huzhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18net: hns3: add or modify commentsGuojia Liao
To explain some code, this patch adds some comments, and modifies or merges some comments to make them more neat. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Zhongzhu Liu <liuzhongzhu@huawei.com> Signed-off-by: Weihang Li <liweihang@hisilicon.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18Merge tag 'fixes-for-5.3-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux Pull MTD fix from Richard Weinberger: "A single fix for MTD to correctly set the spi-nor WP pin" * tag 'fixes-for-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: mtd: spi-nor: Fix the disabling of write protection at init
2019-08-18bnx2x: Fix VF's VLAN reconfiguration in reload.Manish Chopra
Commit 04f05230c5c13 ("bnx2x: Remove configured vlans as part of unload sequence."), introduced a regression in driver that as a part of VF's reload flow, VLANs created on the VF doesn't get re-configured in hardware as vlan metadata/info was not getting cleared for the VFs which causes vlan PING to stop. This patch clears the vlan metadata/info so that VLANs gets re-configured back in the hardware in VF's reload flow and PING/traffic continues for VLANs created over the VFs. Fixes: 04f05230c5c13 ("bnx2x: Remove configured vlans as part of unload sequence.") Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Sudarsana Kalluru <skalluru@marvell.com> Signed-off-by: Shahed Shaikh <shshaikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18Merge tag 'for-5.3-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two fixes that popped up during testing: - fix for sysfs-related code that adds/removes block groups, warnings appear during several fstests in connection with sysfs updates in 5.3, the fix essentially replaces a workaround with scope NOFS and applies to 5.2-based branch too - add sanity check of trim range" * tag 'for-5.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: trim: Check the range passed into to prevent overflow Btrfs: fix sysfs warning and missing raid sysfs directories
2019-08-18Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of fixes for x86: - Fix the inconsistent error handling in the umwait init code - Rework the boot param zeroing so gcc9 stops complaining about out of bound memset. The resulting source code is actually more sane to read than the smart solution we had - Maintainers update so Tony gets involved when Intel models are added - Some more fallthrough fixes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Save fields explicitly, zero out everything else MAINTAINERS, x86/CPU: Tony Luck will maintain asm/intel-family.h x86/fpu/math-emu: Address fallthrough warnings x86/apic/32: Fix yet another implicit fallthrough warning x86/umwait: Fix error handling in umwait_init()
2019-08-18Merge branch 'efi-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fix from Thomas Gleixner: "A single fix for a EFI mixed mode regression caused by recent rework which did not take the firmware bitwidth into account" * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi-stub: Fix get_efi_config_table on mixed-mode setups
2019-08-18Merge tag 'spdx-5.3-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull SPDX fixes from Greg KH: "Here are four small SPDX fixes for 5.3-rc5. A few style fixes for some SPDX comments, added an SPDX tag for one file, and fix up some GPL boilerplate for another file. All of these have been in linux-next for a few weeks with no reported issues (they are comment changes only, so that's to be expected...)" * tag 'spdx-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: i2c: stm32: Use the correct style for SPDX License Identifier intel_th: Use the correct style for SPDX License Identifier coccinelle: api/atomic_as_refcounter: add SPDX License Identifier kernel/configs: Replace GPL boilerplate code with SPDX identifier
2019-08-18Merge tag 'char-misc-5.3-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small char and misc driver fixes for 5.3-rc5. These are two different subsystems needing some fixes, the habanalabs driver which is has some more big endian fixes for problems found. The other are some small soundwire fixes, including some Kconfig dependencies needed to resolve reported build errors. All of these have been in linux-next this week with no reported issues" * tag 'char-misc-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: misc: xilinx-sdfec: fix dependency and build error habanalabs: fix device IRQ unmasking for BE host habanalabs: fix endianness handling for internal QMAN submission habanalabs: fix completion queue handling when host is BE habanalabs: fix endianness handling for packets from user habanalabs: fix DRAM usage accounting on context tear down habanalabs: Avoid double free in error flow soundwire: fix regmap dependencies and align with other serial links soundwire: cadence_master: fix definitions for INTSTAT0/1 soundwire: cadence_master: fix register definition for SLAVE_STATE
2019-08-18Merge tag 'staging-5.3-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging/IIO fixes from Greg KH: "Here are four small staging and iio driver fixes for 5.3-rc5 Two are for the dt3000 comedi driver for some reported problems found in that codebase, and two are some small iio fixes. All of these have been in linux-next this week with no reported issues" * tag 'staging-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: comedi: dt3000: Fix rounding up of timer divisor staging: comedi: dt3000: Fix signed integer overflow 'divider * base' iio: adc: max9611: Fix temperature reading in probe iio: frequency: adf4371: Fix output frequency setting
2019-08-18Merge tag 'usb-5.3-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are number of small USB fixes for 5.3-rc5. Syzbot has been on a tear recently now that it has some good USB debugging hooks integrated, so there's a number of fixes in here found by those tools for some _very_ old bugs. Also a handful of gadget driver fixes for reported issues, some hopefully-final dma fixes for host controller drivers, and some new USB serial gadget driver ids. All of these have been in linux-next this week with no reported issues (the usb-serial ones were in linux-next in its own branch, but merged into mine on Friday)" * tag 'usb-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: add a hcd_uses_dma helper usb: don't create dma pools for HCDs with a localmem_pool usb: chipidea: imx: fix EPROBE_DEFER support during driver probe usb: host: fotg2: restart hcd after port reset USB: CDC: fix sanity checks in CDC union parser usb: cdc-acm: make sure a refcount is taken early enough USB: serial: option: add the BroadMobi BM818 card USB: serial: option: Add Motorola modem UARTs USB: core: Fix races in character device registration and deregistraion usb: gadget: mass_storage: Fix races between fsg_disable and fsg_set_alt usb: gadget: composite: Clear "suspended" on reset/disconnect usb: gadget: udc: renesas_usb3: Fix sysfs interface of "role" USB: serial: option: add D-Link DWM-222 device ID USB: serial: option: Add support for ZTE MF871A
2019-08-17Merge tag 'for-linus-2019-08-17' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A collection of fixes that should go into this series. This contains: - Revert of the REQ_NOWAIT_INLINE and associated dio changes. There were still corner cases there, and even though I had a solution for it, it's too involved for this stage. (me) - Set of NVMe fixes (via Sagi) - io_uring fix for fixed buffers (Anthony) - io_uring defer issue fix (Jackie) - Regression fix for queue sync at exit time (zhengbin) - xen blk-back memory leak fix (Wenwen)" * tag 'for-linus-2019-08-17' of git://git.kernel.dk/linux-block: io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list block: remove REQ_NOWAIT_INLINE io_uring: fix manual setup of iov_iter for fixed buffers xen/blkback: fix memory leaks blk-mq: move cancel of requeue_work to the front of blk_exit_queue nvme-pci: Fix async probe remove race nvme: fix controller removal race with scan work nvme-rdma: fix possible use-after-free in connect error flow nvme: fix a possible deadlock when passthru commands sent to a multipath device nvme-core: Fix extra device_put() call on error path nvmet-file: fix nvmet_file_flush() always returning an error nvmet-loop: Flush nvme_delete_wq when removing the port nvmet: Fix use-after-free bug when a port is removed nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns
2019-08-17Merge tag 'hyperv-fixes-signed' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull Hyper-V fixes from Sasha Levin: - A few fixes for the userspace hyper-v tools from Adrian Vladu. - A fix for the hyper-v MAINTAINERs entry from Lan Tianyu. - Fix for SPDX license identifier in the userspace tools from Nishad Kamdar. * tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: MAINTAINERS: Fix Hyperv vIOMMU driver file name tools: hv: Use the correct style for SPDX License Identifier tools: hv: fix typos in toolchain tools: hv: fix KVP and VSS daemons exit code tools: hv: fixed Python pep8/flake8 warnings for lsvmbus
2019-08-17Merge branch 'bpf-af-xdp-xskmap-improvements'Daniel Borkmann
Björn Töpel says: ==================== This series (v5 and counting) add two improvements for the XSKMAP, used by AF_XDP sockets. 1. Automatic cleanup when an AF_XDP socket goes out of scope/is released. Instead of require that the user manually clears the "released" state socket from the map, this is done automatically. Each socket tracks which maps it resides in, and remove itself from those maps at relase. A notable implementation change, is that the sockets references the map, instead of the map referencing the sockets. Which implies that when the XSKMAP is freed, it is by definition cleared of sockets. 2. The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flag on insert, which this patch addresses. v1->v2: Fixed deadlock and broken cleanup. (Daniel) v2->v3: Rebased onto bpf-next v3->v4: {READ, WRITE}_ONCE consistency. (Daniel) Socket release/map update race. (Daniel) v4->v5: Avoid use-after-free on XSKMAP self-assignment [1]. (Daniel) Removed redundant assignment in xsk_map_update_elem(). Variable name consistency; Use map_entry everywhere. [1] https://lore.kernel.org/bpf/20190802081154.30962-1-bjorn.topel@gmail.com/T/#mc68439e97bc07fa301dad9fc4850ed5aa392f385 ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAPBjörn Töpel
The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating an entry. This patch addresses that. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17xsk: remove AF_XDP socket from map when the socket is releasedBjörn Töpel
When an AF_XDP socket is released/closed the XSKMAP still holds a reference to the socket in a "released" state. The socket will still use the netdev queue resource, and block newly created sockets from attaching to that queue, but no user application can access the fill/complete/rx/tx queues. This results in that all applications need to explicitly clear the map entry from the old "zombie state" socket. This should be done automatically. In this patch, the sockets tracks, and have a reference to, which maps it resides in. When the socket is released, it will remove itself from all maps. Suggested-by: Bruce Richardson <bruce.richardson@intel.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17Merge branch 'bpf-sk-storage-clone'Daniel Borkmann
Stanislav Fomichev says: ==================== Currently there is no way to propagate sk storage from the listener socket to a newly accepted one. Consider the following use case: fd = socket(); setsockopt(fd, SOL_IP, IP_TOS,...); /* ^^^ setsockopt BPF program triggers here and saves something * into sk storage of the listener. */ listen(fd, ...); while (client = accept(fd)) { /* At this point all association between listener * socket and newly accepted one is gone. New * socket will not have any sk storage attached. */ } Let's add new BPF_F_CLONE flag that can be specified when creating a socket storage map. This new flag indicates that map contents should be cloned when the socket is cloned. v4: * drop 'goto err' in bpf_sk_storage_clone (Yonghong Song) * add comment about race with bpf_sk_storage_map_free to the bpf_sk_storage_clone side as well (Daniel Borkmann) v3: * make sure BPF_F_NO_PREALLOC is always present when creating a map (Martin KaFai Lau) * don't call bpf_sk_storage_free explicitly, rely on sk_free_unlock_clone to do the cleanup (Martin KaFai Lau) v2: * remove spinlocks around selem_link_map/sk (Martin KaFai Lau) * BPF_F_CLONE on a map, not selem (Martin KaFai Lau) * hold a map while cloning (Martin KaFai Lau) * use BTF maps in selftests (Yonghong Song) * do proper cleanup selftests; don't call close(-1) (Yonghong Song) * export bpf_map_inc_not_zero ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17selftests/bpf: add sockopt clone/inheritance testStanislav Fomichev
Add a test that calls setsockopt on the listener socket which triggers BPF program. This BPF program writes to the sk storage and sets clone flag. Make sure that sk storage is cloned for a newly accepted connection. We have two cloned maps in the tests to make sure we hit both cases in bpf_sk_storage_clone: first element (sk_storage_alloc) and non-first element(s) (selem_link_map). Cc: Martin KaFai Lau <kafai@fb.com> Cc: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17bpf: sync bpf.h to tools/Stanislav Fomichev
Sync new sk storage clone flag. Cc: Martin KaFai Lau <kafai@fb.com> Cc: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17bpf: support cloning sk storage on accept()Stanislav Fomichev
Add new helper bpf_sk_storage_clone which optionally clones sk storage and call it from sk_clone_lock. Cc: Martin KaFai Lau <kafai@fb.com> Cc: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17bpf: export bpf_map_inc_not_zeroStanislav Fomichev
Rename existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to indicate that it's caller's responsibility to do proper locking. Create and export bpf_map_inc_not_zero wrapper that properly locks map_idr_lock. Will be used in the next commit to hold a map while cloning a socket. Cc: Martin KaFai Lau <kafai@fb.com> Cc: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17selftests/bpf: fix race in test_tcp_rtt testPetar Penkov
There is a race in this test between receiving the ACK for the single-byte packet sent in the test, and reading the values from the map. This patch fixes this by having the client wait until there are no more unacknowledged packets. Before: for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \ done | grep -c PASSED < trimmed error messages > 993 After: for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \ done | grep -c PASSED 10000 Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB") Signed-off-by: Petar Penkov <ppenkov@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17libbpf: relicense bpf_helpers.h and bpf_endian.hAndrii Nakryiko
bpf_helpers.h and bpf_endian.h contain useful macros and BPF helper definitions essential to almost every BPF program. Which makes them useful not just for selftests. To be able to expose them as part of libbpf, though, we need them to be dual-licensed as LGPL-2.1 OR BSD-2-Clause. This patch updates licensing of those two files. Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Hechao Li <hechaol@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Adam Barth <arb@fb.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Josef Bacik <jbacik@fb.com> Acked-by: Joe Stringer <joe@wand.net.nz> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: David Ahern <dsahern@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Lorenz Bauer <lmb@cloudflare.com> Acked-by: Adrian Ratiu <adrian.ratiu@collabora.com> Acked-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Petar Penkov <ppenkov@google.com> Acked-by: Teng Qin <palmtenor@gmail.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Michal Rostecki <mrostecki@opensuse.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Sargun Dhillon <sargun@sargun.me> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17net: Don't call XDP_SETUP_PROG when nothing is changedMaxim Mikityanskiy
Don't uninstall an XDP program when none is installed, and don't install an XDP program that has the same ID as the one already installed. dev_change_xdp_fd doesn't perform any checks in case it uninstalls an XDP program. It means that the driver's ndo_bpf can be called with XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This case happens if the user runs `ip link set eth0 xdp off` when there is no XDP program attached. The symmetrical case is possible when the user tries to set the program that is already set. The drivers typically perform some heavy operations on XDP_SETUP_PROG, so they all have to handle these cases internally to return early if they happen. This patch puts this check into the kernel code, so that all drivers will benefit from it. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17Merge branch 'bpf-af-xdp-wakeup'Daniel Borkmann
Magnus Karlsson says: ==================== This patch set adds support for a new flag called need_wakeup in the AF_XDP Tx and fill rings. When this flag is set by the driver, it means that the application has to explicitly wake up the kernel Rx (for the bit in the fill ring) or kernel Tx (for bit in the Tx ring) processing by issuing a syscall. Poll() can wake up both and sendto() will wake up Tx processing only. The main reason for introducing this new flag is to be able to efficiently support the case when application and driver is executing on the same core. Previously, the driver was just busy-spinning on the fill ring if it ran out of buffers in the HW and there were none to get from the fill ring. This approach works when the application and driver is running on different cores as the application can replenish the fill ring while the driver is busy-spinning. Though, this is a lousy approach if both of them are running on the same core as the probability of the fill ring getting more entries when the driver is busy-spinning is zero. With this new feature the driver now sets the need_wakeup flag and returns to the application. The application can then replenish the fill queue and then explicitly wake up the Rx processing in the kernel using the syscall poll(). For Tx, the flag is only set to one if the driver has no outstanding Tx completion interrupts. If it has some, the flag is zero as it will be woken up by a completion interrupt anyway. This flag can also be used in other situations where the driver needs to be woken up explicitly. As a nice side effect, this new flag also improves the Tx performance of the case where application and driver are running on two different cores as it reduces the number of syscalls to the kernel. The kernel tells user space if it needs to be woken up by a syscall, and this eliminates many of the syscalls. The Rx performance of the 2-core case is on the other hand slightly worse, since there is a need to use a syscall now to wake up the driver, instead of the driver busy-spinning. It does waste less CPU cycles though, which might lead to better overall system performance. This new flag needs some simple driver support. If the driver does not support it, the Rx flag is always zero and the Tx flag is always one. This makes any application relying on this feature default to the old behavior of not requiring any syscalls in the Rx path and always having to call sendto() in the Tx path. For backwards compatibility reasons, this feature has to be explicitly turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend that you always turn it on as it has a large positive performance impact for the one core case and does not degrade 2 core performance and actually improves it for Tx heavy workloads. Here are some performance numbers measured on my local, non-performance optimized development system. That is why you are seeing numbers lower than the ones from Björn and Jesper. 64 byte packets at 40Gbit/s line rate. All results in Mpps. Cores == 1 means that both application and driver is executing on the same core. Cores == 2 that they are on different cores. Applications need_wakeup cores txpush rxdrop l2fwd --------------------------------------------------------------- n 1 0.07 0.06 0.03 y 1 21.6 8.2 6.5 n 2 32.3 11.7 8.7 y 2 33.1 11.7 8.7 Overall, the need_wakeup flag provides the same or better performance in all the micro-benchmarks. The reduction of sendto() calls in txpush is large. Only a few per second is needed. For l2fwd, the drop is 50% for the 1 core case and more than 99.9% for the 2 core case. Do not know why I am not seeing the same drop for the 1 core case yet. The name and inspiration of the flag has been taken from io_uring by Jens Axboe. Details about this feature in io_uring can be found in http://kernel.dk/io_uring.pdf, section 8.3. It also addresses most of the denial of service and sendto() concerns raised by Maxim Mikityanskiy in https://www.spinics.net/lists/netdev/msg554657.html. The typical Tx part of an application will have to change from: ret = sendto(fd,....) to: if (xsk_ring_prod__needs_wakeup(&xsk->tx)) ret = sendto(fd,....) and th Rx part from: rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx); if (!rcvd) return; to: rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx); if (!rcvd) { if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq)) ret = poll(fd,.....); return; } v3 -> v4: * Maxim found a possible race in the Tx part of the driver. The setting of the flag needs to happen before the sending, otherwise it might trigger this race. Fixed in ixgbe and i40e driver. * Mellanox support contributed by Maxim * Removed the XSK_DRV_CAN_SLEEP flag as it was not used anymore. Thanks to Sridhar for discovering this. * For consistency the feature is now always called need_wakeup. There were some places where it was referred to as might_sleep, but they have been removed. Thanks to Sridhar for spotting. * Fixed some typos in the commit messages v2 -> v3: * Converted the Mellanox driver to the new ndo in patch 1 as pointed out by Maxim * Fixed the compatibility code of XDP_MMAP_OFFSETS so it now works. v1 -> v2: * Fixed bisectability problem pointed out by Jakub * Added missing initiliztion of the Tx need_wakeup flag to 1 This patch has been applied against commit b753c5a7f99f ("Merge branch 'r8152-RX-improve'") Structure of the patch set: Patch 1: Replaces the ndo_xsk_async_xmit with ndo_xsk_wakeup to support waking up both Rx and Tx processing Patch 2: Implements the need_wakeup functionality in common code Patch 3-4: Add need_wakeup support to the i40e and ixgbe drivers Patch 5: Add need_wakeup support to libbpf Patch 6: Add need_wakeup support to the xdpsock sample application Patch 7-8: Add need_wakeup support to the Mellanox mlx5 driver ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17net/mlx5e: Add AF_XDP need_wakeup supportMaxim Mikityanskiy
This commit adds support for the new need_wakeup feature of AF_XDP. The applications can opt-in by using the XDP_USE_NEED_WAKEUP bind() flag. When this feature is enabled, some behavior changes: RX side: If the Fill Ring is empty, instead of busy-polling, set the flag to tell the application to kick the driver when it refills the Fill Ring. TX side: If there are pending completions or packets queued for transmission, set the flag to tell the application that it can skip the sendto() syscall and save time. The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled: | without need_wakeup | with need_wakeup | |----------------------|----------------------| | one core | two cores | one core | two cores | -------|----------|-----------|----------|-----------| txonly | 20.1 | 33.5 | 29.0 | 34.2 | rxdrop | 0.065 | 14.1 | 12.0 | 14.1 | l2fwd | 0.032 | 7.3 | 6.6 | 7.2 | "One core" means the application and NAPI run on the same core. "Two cores" means they are pinned to different cores. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17net/mlx5e: Move the SW XSK code from NAPI poll to a separate functionMaxim Mikityanskiy
Two XSK tasks are performed during NAPI polling, that are not bound to hardware interrupts: TXing packets and polling for frames in the Fill Ring. They are special in a way that the hardware doesn't know about these tasks, so it doesn't trigger interrupts if there is still some work to be done, it's our driver's responsibility to ensure NAPI will be rescheduled if needed. Create a new function to handle these tasks and move the corresponding code from mlx5e_napi_poll to the new function to improve modularity and prepare for the changes in the following patch. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17samples/bpf: add use of need_wakeup flag in xdpsockMagnus Karlsson
This commit adds using the need_wakeup flag to the xdpsock sample application. It is turned on by default as we think it is a feature that seems to always produce a performance benefit, if the application has been written taking advantage of it. It can be turned off in the sample app by using the '-m' command line option. The txpush and l2fwd sub applications have also been updated to support poll() with multiple sockets. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17libbpf: add support for need_wakeup flag in AF_XDP partMagnus Karlsson
This commit adds support for the new need_wakeup flag in AF_XDP. The xsk_socket__create function is updated to handle this and a new function is introduced called xsk_ring_prod__needs_wakeup(). This function can be used by the application to check if Rx and/or Tx processing needs to be explicitly woken up. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17ixgbe: add support for AF_XDP need_wakeup featureMagnus Karlsson
This patch adds support for the need_wakeup feature of AF_XDP. If the application has told the kernel that it might sleep using the new bind flag XDP_USE_NEED_WAKEUP, the driver will then set this flag if it has no more buffers on the NIC Rx ring and yield to the application. For Tx, it will set the flag if it has no outstanding Tx completion interrupts and return to the application. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17i40e: add support for AF_XDP need_wakeup featureMagnus Karlsson
This patch adds support for the need_wakeup feature of AF_XDP. If the application has told the kernel that it might sleep using the new bind flag XDP_USE_NEED_WAKEUP, the driver will then set this flag if it has no more buffers on the NIC Rx ring and yield to the application. For Tx, it will set the flag if it has no outstanding Tx completion interrupts and return to the application. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17xsk: add support for need_wakeup flag in AF_XDP ringsMagnus Karlsson
This commit adds support for a new flag called need_wakeup in the AF_XDP Tx and fill rings. When this flag is set, it means that the application has to explicitly wake up the kernel Rx (for the bit in the fill ring) or kernel Tx (for bit in the Tx ring) processing by issuing a syscall. Poll() can wake up both depending on the flags submitted and sendto() will wake up tx processing only. The main reason for introducing this new flag is to be able to efficiently support the case when application and driver is executing on the same core. Previously, the driver was just busy-spinning on the fill ring if it ran out of buffers in the HW and there were none on the fill ring. This approach works when the application is running on another core as it can replenish the fill ring while the driver is busy-spinning. Though, this is a lousy approach if both of them are running on the same core as the probability of the fill ring getting more entries when the driver is busy-spinning is zero. With this new feature the driver now sets the need_wakeup flag and returns to the application. The application can then replenish the fill queue and then explicitly wake up the Rx processing in the kernel using the syscall poll(). For Tx, the flag is only set to one if the driver has no outstanding Tx completion interrupts. If it has some, the flag is zero as it will be woken up by a completion interrupt anyway. As a nice side effect, this new flag also improves the performance of the case where application and driver are running on two different cores as it reduces the number of syscalls to the kernel. The kernel tells user space if it needs to be woken up by a syscall, and this eliminates many of the syscalls. This flag needs some simple driver support. If the driver does not support this, the Rx flag is always zero and the Tx flag is always one. This makes any application relying on this feature default to the old behaviour of not requiring any syscalls in the Rx path and always having to call sendto() in the Tx path. For backwards compatibility reasons, this feature has to be explicitly turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend that you always turn it on as it so far always have had a positive performance impact. The name and inspiration of the flag has been taken from io_uring by Jens Axboe. Details about this feature in io_uring can be found in http://kernel.dk/io_uring.pdf, section 8.3. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17xsk: replace ndo_xsk_async_xmit with ndo_xsk_wakeupMagnus Karlsson
This commit replaces ndo_xsk_async_xmit with ndo_xsk_wakeup. This new ndo provides the same functionality as before but with the addition of a new flags field that is used to specifiy if Rx, Tx or both should be woken up. The previous ndo only woke up Tx, as implied by the name. The i40e and ixgbe drivers (which are all the supported ones) are updated with this new interface. This new ndo will be used by the new need_wakeup functionality of XDP sockets that need to be able to wake up both Rx and Tx driver processing. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17Merge branch 'stmmac-next'David S. Miller
Jose Abreu says: ==================== net: stmmac: Improvements for -next Couple of improvements for -next tree. More info in commit logs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>