Age    Commit message    Author
2019-06-28net/mlx5e: Disallow tc redirect offload cases we don't supportPaul Blakey
After changing the parent_id to be the same for both NICs of the same hardware device, netdev_port_same_parent_id now returns true for more cases (all the lower devices in the hierarchy are on the same hardware device). If merged eswitch isn't enabled, these cases aren't supported, so disallow them. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28net/mlx5e: Expose same physical switch_id for all representorsPaul Blakey
Report system_image_guid as the E-Switch switch_id. This ensures that when a NIC contains multiple PCI functions and has the merged eswitch capability, all representors from multiple PFs publish the same switch_id. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28net/mlx5e: Don't refresh TIRs when updating representor SQsGavi Teitz
Refreshing TIRs is done in order to update the TIRs with the current state of SQs in the transport domain, so that the TIRs can filter out undesired self-loopback packets based on the source SQ of the packet. Representor TIRs will only receive packets that originate from their associated vport, due to dedicated steering, and therefore will never receive self-loopback packets, whose source vport will be the vport of the E-Switch manager, and therefore not the vport associated with the representor. As such, it is not necessary to refresh the representors' TIRs, since self-loopback packets can't reach them. Since representors only exist in switchdev mode, and there is no scenario in which a representor will exist in the transport domain alongside a non-representor, it is not necessary to refresh the transport domain's TIRs upon changing the state of a representor's queues. Therefore, do not refresh TIRs upon such a change. Achieve this by adding an update_rx callback to the mlx5e_profile, which refreshes TIRs for non-representors and does nothing for representors, and replace instances of mlx5e_refresh_tirs() upon changing the state of the queues with update_rx(). Signed-off-by: Gavi Teitz <gavi@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
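The callback approach described above can be illustrated with a small userspace sketch. All type and function names here (nic_priv, nic_profile, queues_state_changed and so on) are hypothetical stand-ins, not the real mlx5e definitions; the sketch only shows the shape of a per-profile update_rx hook that refreshes for a regular NIC and does nothing for a representor.

```c
#include <assert.h>
#include <stddef.h>

struct nic_profile;

/* Hypothetical stand-in for the driver's private state. */
struct nic_priv {
    const struct nic_profile *profile;
    int tir_refresh_count;           /* how many times TIRs were refreshed */
};

/* Hypothetical profile with a single hook, modelled on update_rx. */
struct nic_profile {
    int (*update_rx)(struct nic_priv *priv);
};

/* Regular NIC: the queue state changed, so refresh the TIRs. */
static int nic_update_rx(struct nic_priv *priv)
{
    priv->tir_refresh_count++;
    return 0;
}

/* Representor: self-loopback packets cannot arrive, so do nothing. */
static int rep_update_rx(struct nic_priv *priv)
{
    (void)priv;
    return 0;
}

static const struct nic_profile nic_prof = { .update_rx = nic_update_rx };
static const struct nic_profile rep_prof = { .update_rx = rep_update_rx };

/* Common path: callers no longer refresh TIRs directly. */
static int queues_state_changed(struct nic_priv *priv)
{
    assert(priv != NULL && priv->profile != NULL);
    return priv->profile->update_rx(priv);
}
```

The point of the design is that the common queue-handling path stays identical for both device types, and the decision lives in the profile.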
2019-06-28net/mlx5e: reduce stack usage in mlx5_eswitch_termtbl_createArnd Bergmann
Putting an empty 'mlx5_flow_spec' structure on the stack is a bit wasteful and causes a warning on 32-bit architectures when building with clang -fsanitize-coverage: drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads_termtbl.c: In function 'mlx5_eswitch_termtbl_create': drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads_termtbl.c:90:1: error: the frame size of 1032 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] Since the structure is never written to, we can statically allocate it to avoid the stack usage. To be on the safe side, mark all subsequent function arguments that we pass it into as 'const' as well. Fixes: 10caabdaad5a ("net/mlx5e: Use termination table for VLAN push actions") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
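As a rough illustration of the technique, the sketch below keeps a large, never-written structure in static read-only storage and passes it by const pointer. The structure big_spec and both functions are made up for this example; the real mlx5_flow_spec layout is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical large structure, roughly 1 KiB like the one in the warning. */
struct big_spec {
    unsigned char match_criteria[512];
    unsigned char match_value[512];
};

/* The consumer takes a const pointer, so it cannot modify the shared object. */
static size_t count_nonzero(const struct big_spec *spec)
{
    size_t n = 0;

    assert(spec != NULL);
    for (size_t i = 0; i < sizeof(spec->match_criteria); i++)
        n += spec->match_criteria[i] != 0;
    return n;
}

static size_t create_table(void)
{
    /* Zero-initialised and read-only: lives outside the stack frame. */
    static const struct big_spec empty_spec = {0};

    return count_nonzero(&empty_spec);
}
```

Because the object is const, sharing one instance between all callers is safe, which is why marking the downstream arguments const goes hand in hand with the change.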
2019-06-28net/mlx5e: Set drvinfo in generic mannerParav Pandit
Consider PCI and non PCI device types while setting device name in get_drvinfo() callback using existing generic device. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28net/mlx5e: Correct phys_port_name for PF portParav Pandit
Currently the PF phys_port_name is reported as pfNvf-1, because the vport number of the PF vport is 65535. Correct the PF's phys_port_name to the agreed-upon name pfN. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Vu Pham <vuhuong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28net/mlx5e: Report netdevice MPLS featuresAriel Levkovich
Set supported device features in the netdevice MPLS features mask. This will enable HW checksumming and TSO for MPLS tagged traffic. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28net/mlx5e: Move to HW checksumming advertisingAriel Levkovich
This patch changes the way the driver advertises its checksum offload capabilities within the net device features bit mask. Instead of advertising protocol specific checksumming capabilities which are limited today to IPv4 and IPv6, we move to reporting generic HW checksumming capabilities. This will allow the network stack to let mlx5 device offload checksum for cases where the IP header is encapsulated within another protocol and the skb->protocol doesn't indicate one of the IP protocol versions, specifically in the case of an MPLS label encapsulating the IP header where the skb->protocol indicates the MPLS ethertype rather than IP. Moving the HW_CSUM reporting is required in the basic net device hw features mask and also in the extensions (vlan and encapsulation features) since the extensions are always multiplied by the basic features set during the packet's traversal through the stack's tx flow. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28net/mlx5: MPFS, Allow adding the same MAC more than onceGavi Teitz
Remove the limitation preventing adding a vport's MAC address to the Multi-Physical Function Switch (MPFS) more than once per E-switch, as there is no difference in the MPFS if an address is being used by an E-switch more than once. This allows the E-switch to have multiple vports with the same MAC address, allowing vports to be classified by VLAN id instead of by MAC if desired. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28net/mlx5: MPFS, Cleanup add MAC flowGavi Teitz
Unify and isolate the error handling flow in mlx5_mpfs_add_mac(), removing code duplication. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28Merge branch 'mlx5-next' of ↵Saeed Mahameed
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Misc updates from mlx5-next branch: 1) E-Switch vport metadata support for source vport matching 2) Convert mkey_table to XArray 3) Shared IRQs and to use single IRQ for all async EQs Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28e1000e: PCIm function state supportVitaly Lifshits
Due to commit 5d8682588605 ("[misc] mei: me: allow runtime pm for platform with D0i3"), the NIC enters the DMoff state when the cable is disconnected and then reconnected. This caused a wrong link indication and a duplex mismatch. This bug is described in: https://bugzilla.redhat.com/show_bug.cgi?id=1689436 Checking the PCIm function state and performing a PHY reset after a timeout in the watchdog task solves this issue. Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28e1000e: Make watchdog use delayed workDetlev Casanova
Use delayed work instead of timers to run the watchdog of the e1000e driver. Simplify the code with one less middle function. Signed-off-by: Detlev Casanova <detlev.casanova@gmail.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28i40e: Add macvlan support on i40eHarshitha Ramamurthy
This patch enables macvlan offloads for i40e. The idea is to use channels as macvlan interfaces. The channels are VSIs of type VMDQ. When the first macvlan is created, the maximum number of channels possible are created. From then on, as a macvlan interface is created, a macvlan filter is added to these already created channels (VSIs). This patch utilizes subordinate device traffic classes to make queue groups(channels) available for an upper device like a macvlan. Steps to configure macvlan offloads: 1. ethtool -K ethx l2-fwd-offload on 2. ip link add link ethx name macvlan1 type macvlan 3. ip addr add <address> dev macvlan1 4. ip link set macvlan1 up Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28ixgbevf: Use cached link state instead of re-reading the value for ethtoolAlexander Duyck
Change the ethtool link settings call to just read the cached state out of the adapter structure instead of trying to recheck the value from the PF. Doing this should prevent excessive reading of the mailbox. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: "Guilherme G. Piccoli" <gpiccoli@canonical.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28iavf: fix dereference of null rx_buffer pointerColin Ian King
A recent commit efa14c3985828d ("iavf: allow null RX descriptors") added a null pointer sanity check on rx_buffer, however, rx_buffer is being dereferenced before that check, which implies a null pointer dereference bug can potentially occur. Fix this by dereferencing rx_buffer only after the null pointer check. Addresses-Coverity: ("Dereference before null check") Signed-off-by: Colin Ian King <colin.king@canonical.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
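A minimal sketch of the ordering this fix is about, using a made-up rx_buffer structure and helper rather than the actual iavf code: the pointer is tested first, and its fields are touched only afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer descriptor; the field name is illustrative only. */
struct rx_buffer {
    unsigned int pagecnt_bias;
};

/* Returns 1 if a buffer was processed, 0 if there was none. */
static int consume_rx_buffer(struct rx_buffer *buf)
{
    /* Check first ... */
    if (!buf)
        return 0;

    /* ... and only then dereference. */
    buf->pagecnt_bias++;
    return 1;
}

/* Usage: a NULL descriptor is tolerated, a valid one is updated. */
static void consume_rx_buffer_example(void)
{
    struct rx_buffer buf = { .pagecnt_bias = 0 };

    assert(consume_rx_buffer(NULL) == 0);
    assert(consume_rx_buffer(&buf) == 1);
    assert(buf.pagecnt_bias == 1);
}
```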
2019-06-28igb: add RR2DCDELAY to ethtool registers dumpArtem Bityutskiy
This patch adds the RR2DCDELAY register to the ethtool registers dump. RR2DCDELAY exists on I210 and I211 Intel Gigabit Ethernet chips and it stands for "Read Request To Data Completion Delay". Here is how this register is described in the I210 datasheet: "This field captures the maximum PCIe split time in 16 ns units, which is the maximum delay between the read request to the first data completion. This is giving an estimation of the PCIe round trip time." In other words, whenever I210 reads from the host memory (e.g., fetches a descriptor from the ring), the chip measures every PCI DMA read transaction and captures the maximum value. So it ends up containing the longest DMA transaction time. This register is very useful for troubleshooting and research purposes. If you are dealing with time-sensitive networks, this register can help you get an idea of your "I210-to-ring" latency. This helps answering questions like "should I have PCIe ASPM enabled?" or "should I enable deep C-states?" on my system. It is safe to read this register at any point, reading it has no effect on the I210 chip functionality. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
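Since the datasheet quote gives the value in 16 ns units, turning a dumped value into nanoseconds is a single multiplication. The helper below is a hypothetical convenience for post-processing a register dump, and it assumes the raw argument already holds only the delay field.

```c
#include <stdint.h>

/* Unit taken from the datasheet quote above: 16 ns per count. */
#define RR2DCDELAY_UNIT_NS 16u

/* Convert a raw RR2DCDELAY field value to nanoseconds (hypothetical helper). */
static uint64_t rr2dcdelay_to_ns(uint32_t raw)
{
    /* Widen before multiplying so large values cannot wrap. */
    return (uint64_t)raw * RR2DCDELAY_UNIT_NS;
}
```

For example, a raw value of 100 corresponds to 1600 ns of worst-case delay between the read request and the first data completion.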
2019-06-28igb: minor ethool regdump amendmentArtem Bityutskiy
This patch has no functional impact and it is just a preparation for the following patch. It removes an early return from the 'igb_get_regs()' function by moving the 82576-only registers dump into an "if" block. With this preparation, we can dump more non-82576 registers at the end of this function. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28iavf: Fix up debug print macroJeff Kirsher
This aligns the iavf_debug() macro with the other Intel drivers. Add the bus number, bus_id field to i40e_bus_info so output shows each physical port (i.e. func) in the following format: [[[[<domain>]:]<bus>]:][<slot>][.[<func>]] where domains are numbered (0 to ffff), bus (0-ff), slot (0-1f) and function (0-7). Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
2019-06-28e1000e: Reduce boot time by tightening sleep rangesArjan van de Ven
The e1000e driver is a great user of the usleep_range() API, and has nice ranges that in principle help power management. However the ranges that are used only during system startup are very long (and can easily add 100 msec to the boot time) while the power savings of such long ranges are irrelevant due to the one-off, boot-only nature of these functions. This patch shrinks some of the longest ranges (while still using a power friendly 1 msec range); this saves 100msec+ of boot time on my BDW NUCs. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28iavf: use struct_size() helperGustavo A. R. Silva
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes, in particular in the context in which this code is being used. So, replace code of the following form: sizeof(struct virtchnl_ether_addr_list) + (count * sizeof(struct virtchnl_ether_addr)) with: struct_size(veal, list, count) and so on... This code was detected with the help of Coccinelle. Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28e1000: Use dma_wmb() instead of wmb() before doorbell writesVenkatesh Srinivas
e1000 writes to doorbells to post transmit descriptors and fill the receive ring. After writing descriptors to memory but before writing to doorbells, use dma_wmb() rather than wmb(). wmb() is more heavyweight than necessary for a device to see descriptor writes. On x86, this avoids SFENCEs before doorbell writes in both the Tx and Rx paths. On ARM, this converts DSB ST -> DMB OSHST. Tested: 82576EB / x86; QEMU (qemu emulates an 8257x) Signed-off-by: Venkatesh Srinivas <venkateshs@google.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28ixgbe: fix potential u32 overflow on shiftColin Ian King
The u32 variable rem is being shifted using u32 arithmetic however it is being passed to div_u64 that expects the expression to be a u64. The 32 bit shift may potentially overflow, so cast rem to a u64 before shifting to avoid this. Also remove comment about overflow. Addresses-Coverity: ("Unintentional integer overflow") Fixes: cd4583206990 ("ixgbe: implement support for SDP/PPS output on X550 hardware") Fixes: 68d9676fc04e ("ixgbe: fix PTP SDP pin setup on X540 hardware") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
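The difference between shifting in 32-bit arithmetic and widening first can be shown in a few lines. The function names are invented for this sketch and the shift amount is assumed to be below 32; this is not the ixgbe code itself.

```c
#include <stdint.h>

/* Shift performed in 32-bit arithmetic: high bits are lost before widening.
 * Assumes shift < 32. */
static uint64_t scale_truncating(uint32_t rem, unsigned int shift)
{
    return (uint64_t)(uint32_t)(rem << shift);
}

/* Cast to 64 bits first, then shift: no bits are lost. */
static uint64_t scale_widened(uint32_t rem, unsigned int shift)
{
    return (uint64_t)rem << shift;
}
```

With rem = 0x80000000 and shift = 1, the first variant yields 0 while the second yields 0x100000000, which is the kind of silent loss the cast avoids.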
2019-06-28ixgbe: Avoid NULL pointer dereference with VF on non-IPsec hwDann Frazier
An ipsec structure will not be allocated if the hardware does not support offload. Fixes the following Oops: [ 191.045452] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 191.054232] Mem abort info: [ 191.057014] ESR = 0x96000004 [ 191.060057] Exception class = DABT (current EL), IL = 32 bits [ 191.065963] SET = 0, FnV = 0 [ 191.069004] EA = 0, S1PTW = 0 [ 191.072132] Data abort info: [ 191.074999] ISV = 0, ISS = 0x00000004 [ 191.078822] CM = 0, WnR = 0 [ 191.081780] user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000043d9e467 [ 191.088382] [0000000000000000] pgd=0000000000000000 [ 191.093252] Internal error: Oops: 96000004 [#1] SMP [ 191.098119] Modules linked in: vhost_net vhost tap vfio_pci vfio_virqfd vfio_iommu_type1 vfio xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter devlink ebtables ip6table_filter ip6_tables iptable_filter bpfilter ipmi_ssif nls_iso8859_1 input_leds joydev ipmi_si hns_roce_hw_v2 ipmi_devintf hns_roce ipmi_msghandler cppc_cpufreq sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 ses enclosure btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear ixgbevf hibmc_drm ttm [ 191.168607] drm_kms_helper aes_ce_blk aes_ce_cipher syscopyarea crct10dif_ce sysfillrect ghash_ce qla2xxx sysimgblt sha2_ce sha256_arm64 hisi_sas_v3_hw fb_sys_fops sha1_ce uas nvme_fc mpt3sas ixgbe drm hisi_sas_main nvme_fabrics usb_storage hclge scsi_transport_fc ahci libsas hnae3 raid_class libahci xfrm_algo scsi_transport_sas mdio aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64 [ 191.202952] CPU: 94 PID: 0 Comm: swapper/94 Not tainted 4.19.0-rc1+ #11 [ 191.209553] Hardware name: Huawei D06 /D06, 
BIOS Hisilicon D06 UEFI RC0 - V1.20.01 04/26/2019 [ 191.218064] pstate: 20400089 (nzCv daIf +PAN -UAO) [ 191.222873] pc : ixgbe_ipsec_vf_clear+0x60/0xd0 [ixgbe] [ 191.228093] lr : ixgbe_msg_task+0x2d0/0x1088 [ixgbe] [ 191.233044] sp : ffff000009b3bcd0 [ 191.236346] x29: ffff000009b3bcd0 x28: 0000000000000000 [ 191.241647] x27: ffff000009628000 x26: 0000000000000000 [ 191.246946] x25: ffff803f652d7600 x24: 0000000000000004 [ 191.252246] x23: ffff803f6a718900 x22: 0000000000000000 [ 191.257546] x21: 0000000000000000 x20: 0000000000000000 [ 191.262845] x19: 0000000000000000 x18: 0000000000000000 [ 191.268144] x17: 0000000000000000 x16: 0000000000000000 [ 191.273443] x15: 0000000000000000 x14: 0000000100000026 [ 191.278742] x13: 0000000100000025 x12: ffff8a5f7fbe0df0 [ 191.284042] x11: 000000010000000b x10: 0000000000000040 [ 191.289341] x9 : 0000000000001100 x8 : ffff803f6a824fd8 [ 191.294640] x7 : ffff803f6a825098 x6 : 0000000000000001 [ 191.299939] x5 : ffff000000f0ffc0 x4 : 0000000000000000 [ 191.305238] x3 : ffff000028c00000 x2 : ffff803f652d7600 [ 191.310538] x1 : 0000000000000000 x0 : ffff000000f205f0 [ 191.315838] Process swapper/94 (pid: 0, stack limit = 0x00000000addfed5a) [ 191.322613] Call trace: [ 191.325055] ixgbe_ipsec_vf_clear+0x60/0xd0 [ixgbe] [ 191.329927] ixgbe_msg_task+0x2d0/0x1088 [ixgbe] [ 191.334536] ixgbe_msix_other+0x274/0x330 [ixgbe] [ 191.339233] __handle_irq_event_percpu+0x78/0x270 [ 191.343924] handle_irq_event_percpu+0x40/0x98 [ 191.348355] handle_irq_event+0x50/0xa8 [ 191.352180] handle_fasteoi_irq+0xbc/0x148 [ 191.356263] generic_handle_irq+0x34/0x50 [ 191.360259] __handle_domain_irq+0x68/0xc0 [ 191.364343] gic_handle_irq+0x84/0x180 [ 191.368079] el1_irq+0xe8/0x180 [ 191.371208] arch_cpu_idle+0x30/0x1a8 [ 191.374860] do_idle+0x1dc/0x2a0 [ 191.378077] cpu_startup_entry+0x2c/0x30 [ 191.381988] secondary_start_kernel+0x150/0x1e0 [ 191.386506] Code: 6b15003f 54000320 f1404a9f 54000060 (79400260) Fixes: eda0333ac2930 ("ixgbe: add VF IPsec 
management") Signed-off-by: Dann Frazier <dann.frazier@canonical.com> Acked-by: Shannon Nelson <snelson@pensando.io> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28drm/panfrost: Fix a double-free errorBoris Brezillon
drm_gem_shmem_create_with_handle() returns a GEM object and attaches a handle to it. When the user closes the DRM FD, the core releases all GEM handles along with their backing GEM objs, which can lead to a double-free issue if panfrost_ioctl_create_bo() failed and went through the err_free path where drm_gem_object_put_unlocked() is called without deleting the associated handle. Replace this drm_gem_object_put_unlocked() call by a drm_gem_handle_delete() one to fix that. Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver") Cc: <stable@vger.kernel.org> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20190627172414.27231-1-boris.brezillon@collabora.com
2019-06-28e1000e: Increase pause and refresh timeMiguel Bernal Marin
Suggested-by: Tim Pepper <timothy.c.pepper@linux.intel.com> Signed-off-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com> Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-06-28ice: Use struct_size() helperGustavo A. R. Silva
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; size = sizeof(struct foo) + count * sizeof(struct boo); instance = alloc(size, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: size = struct_size(instance, entry, count); This code was detected with the help of Coccinelle. Signed-off-by: "Gustavo A. R. Silva" <gustavo@embeddedor.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
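For readers outside the kernel, the idea behind struct_size() can be approximated in plain C. The sketch below reuses the struct foo / struct boo shapes from the message; foo_size() and foo_alloc() are hypothetical helpers written for this example, and they mimic the kernel helper only in that an overflowing size computation saturates to SIZE_MAX instead of wrapping.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct boo {
    int value;
};

struct foo {
    int stuff;
    struct boo entry[];              /* flexible array member */
};

/* Size of a struct foo with 'count' trailing entries; SIZE_MAX on overflow. */
static size_t foo_size(size_t count)
{
    size_t elem = sizeof(struct boo);

    if (count > (SIZE_MAX - sizeof(struct foo)) / elem)
        return SIZE_MAX;
    return sizeof(struct foo) + count * elem;
}

/* Allocate a zeroed instance, refusing sizes that overflowed. */
static struct foo *foo_alloc(size_t count)
{
    size_t size = foo_size(count);

    if (size == SIZE_MAX)
        return NULL;
    return calloc(1, size);
}
```

Keeping the size computation in one checked helper is what removes the type and overflow mistakes that the open-coded sizeof arithmetic invites.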
2019-06-28Merge branch 'net-sched-Add-txtime-assist-support-for-taprio'David S. Miller
Vedang Patel says: ==================== net/sched: Add txtime-assist support for taprio. Changes in v6: - Use _BITUL() instead of BIT() in UAPI for etf. (patch #1) - Fix a bug reported by kbuild test bot in length_to_duration(). (patch #6) - Remove an unused function (get_cycle_start()). (Patch #6) Changes in v5: - Commit message improved for the igb patch (patch #1). - Fixed typo in commit message for etf patch (patch #2). Changes in v4: - Remove inline directive from functions in foo.c. - Fix spacing in pkt_sched.h (for etf patch). Changes in v3: - Simplify implementation for taprio flags. - txtime_delay can only be set if txtime-assist mode is enabled. - txtime_delay and flags will only be visible in tc output if set by user. - Minor changes in error reporting. Changes in v2: - Txtime-offload has now been renamed to txtime-assist mode. - Renamed the offload parameter to flags. - Removed the code which introduced the hardware offloading functionality. Original Cover letter (with above changes included) -------------------------------------------------- Currently, we are seeing packets being transmitted outside their timeslices. We can confirm that the packets are being dequeued at the right time. So, the delay is induced after the packet is dequeued, because taprio, without any offloading, has no control of when a packet is actually transmitted. In order to solve this, we are making use of the txtime feature provided by ETF qdisc. Hardware offloading needs to be supported by the ETF qdisc in order to take advantage of this feature. The taprio qdisc will assign txtime (in skb->tstamp) for all the packets which do not have the txtime allocated via the SO_TXTIME socket option. For the packets which already have SO_TXTIME set, taprio will validate whether the packet will be transmitted in the correct interval. 
In order to support this, the following parameters have been added: - flags (taprio): This is added in order to support different offloading modes which will be added in the future. - txtime-delay (taprio): This indicates the minimum time it will take for the packet to hit the wire after it reaches taprio_enqueue(). This is useful in determining whether we can transmit the packet in the remaining time if the gate corresponding to the packet is currently open. - skip_sock_check (ETF): ETF currently drops any packet which does not have the SO_TXTIME socket option set. This check can be skipped by specifying this option. Following is an example configuration: tc qdisc replace dev $IFACE parent root handle 100 taprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@0 1@0 \ base-time $BASE_TIME \ sched-entry S 01 300000 \ sched-entry S 02 300000 \ sched-entry S 04 400000 \ flags 0x1 \ txtime-delay 200000 \ clockid CLOCK_TAI tc qdisc replace dev $IFACE parent 100:1 etf \ offload delta 200000 clockid CLOCK_TAI skip_sock_check Here, the "flags" parameter is indicating that the txtime-assist mode is enabled. Also, all the traffic classes have been assigned the same queue. This is to prevent the traffic classes in the lower priority queues from getting starved. Note that this configuration is specific to the i210 ethernet card. Other network cards where the hardware queues are given the same priority, might be able to utilize more than one queue. Following are some of the other highlights of the series: - Fix a bug where hardware timestamping and SO_TXTIME options cannot be used together. (Patch 1) - Introduces the skip_sock_check option. (Patch 2) - Make TxTime assist mode work with TCP packets (Patch 7). 
The following changes are recommended to be done in order to get the best performance from taprio in this mode: ip link set dev enp1s0 mtu 1514 ethtool -K eth0 gso off ethtool -K eth0 tso off ethtool --set-eee eth0 eee off ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28taprio: Adjust timestamps for TCP packetsVedang Patel
When the taprio qdisc is running in txtime-assist mode, it will set the launchtime value (in skb->tstamp) for all the packets which do not have the SO_TXTIME socket option. But, the TCP packets already have this value set and it indicates the earliest departure time represented in CLOCK_MONOTONIC clock. We need to respect the timestamp set by the TCP subsystem. So, convert this time to the clock which taprio is using and ensure that the packet is not transmitted before the deadline set by TCP. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
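The two steps described here, converting the TCP departure time into the qdisc's clock and then not sending earlier than it, can be sketched as follows. This is a simplified model: the function names are hypothetical, the clock offset is passed in as a plain number, and taking the later of the two times stands in for whatever the qdisc does internally.

```c
#include <stdint.h>

/* Convert a CLOCK_MONOTONIC time to another clock, given that clock's
 * offset from CLOCK_MONOTONIC in nanoseconds (assumed known by the caller). */
static int64_t mono_to_clock(int64_t mono_ns, int64_t offset_ns)
{
    return mono_ns + offset_ns;
}

/* Pick a transmit time that honours both the schedule and TCP's
 * earliest departure time (EDT). Hypothetical helper for illustration. */
static int64_t respect_edt(int64_t sched_txtime_ns, int64_t edt_mono_ns,
                           int64_t offset_ns)
{
    int64_t edt_ns = mono_to_clock(edt_mono_ns, offset_ns);

    /* Never transmit before the time TCP asked for. */
    return sched_txtime_ns > edt_ns ? sched_txtime_ns : edt_ns;
}
```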
2019-06-28taprio: make clock reference conversions easierVedang Patel
Later in this series we will need to transform from CLOCK_MONOTONIC (used in TCP) to the clock reference used in TAPRIO. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28taprio: Add support for txtime-assist modeVedang Patel
Currently, we are seeing non-critical packets being transmitted outside of their timeslice. We can confirm that the packets are being dequeued at the right time. So, the delay is induced in the hardware side. The most likely reason is the hardware queues are starving the lower priority queues. In order to improve the performance of taprio, we will be making use of the txtime feature provided by the ETF qdisc. For all the packets which do not have the SO_TXTIME option set, taprio will set the transmit timestamp (set in skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit time for the packet is set to when the gate is open. If SO_TXTIME is set, the TAPrio qdisc will validate whether the timestamp (in skb->tstamp) occurs when the gate corresponding to skb's traffic class is open. The following two parameters are added to support this mode: - flags: used to enable txtime-assist mode. Will also be used to enable other modes (like hardware offloading) later. - txtime-delay: This indicates the minimum time it will take for the packet to hit the wire. This is useful in determining whether we can transmit the packet in the remaining time if the gate corresponding to the packet is currently open. An example configuration for enabling txtime-assist: tc qdisc replace dev eth0 parent root handle 100 taprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@0 1@0 \ base-time 1558653424279842568 \ sched-entry S 01 300000 \ sched-entry S 02 300000 \ sched-entry S 04 400000 \ flags 0x1 \ txtime-delay 40000 \ clockid CLOCK_TAI tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \ offload delta 200000 clockid CLOCK_TAI Note that all the traffic classes are mapped to the same queue. This is only possible in taprio when txtime-assist is enabled. Also, note that the ETF Qdisc is enabled with offload mode set. 
In this mode, if the packet's traffic class is open and the complete packet can be transmitted, taprio will try to transmit the packet immediately. This will be done by setting skb->tstamp to current_time + the time delta indicated in the txtime-delay parameter. This parameter indicates the time taken (in software) for packet to reach the network adapter. If the packet cannot be transmitted in the current interval or if the packet's traffic is not currently transmitting, the skb->tstamp is set to the next available timestamp value. This is tracked in the next_launchtime parameter in the struct sched_entry. The behaviour w.r.t admin and oper schedules is not changed from what is present in software mode. The transmit time is already known in advance. So, we do not need the HR timers to advance the schedule and wakeup the dequeue side of taprio. So, HR timer won't be run when this mode is enabled. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
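The decision described above can be modelled in a few lines. Everything in this sketch is an assumption made for illustration: the entry_state structure, its fields other than the next_launchtime name taken from the message, and the pick_txtime() helper are not the kernel's data structures, and times are plain nanosecond counts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the schedule entry relevant to one packet. */
struct entry_state {
    bool gate_open;            /* is the packet's traffic class open now? */
    int64_t close_time;        /* end of the current interval, in ns */
    int64_t next_launchtime;   /* next time this class may transmit, in ns */
};

/* Choose the value to store in the packet's transmit timestamp. */
static int64_t pick_txtime(const struct entry_state *e, int64_t now,
                           int64_t txtime_delay, int64_t pkt_duration)
{
    /* Earliest moment the packet can reach the wire. */
    int64_t earliest = now + txtime_delay;

    /* Send right away only if the whole packet fits in the open interval. */
    if (e->gate_open && earliest + pkt_duration <= e->close_time)
        return earliest;

    /* Otherwise defer to the next slot for this traffic class. */
    return e->next_launchtime;
}
```

Because the chosen time is known when the packet is enqueued, no timer is needed to wake the dequeue side, which matches the note about HR timers not running in this mode.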
2019-06-28taprio: Remove inline directiveVedang Patel
Remove inline directive from length_to_duration(). We will let the compiler make the decisions. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28taprio: calculate cycle_time when schedule is installedVedang Patel
cycle time for a particular schedule is calculated only when it is first installed. So, it makes sense to just calculate it once right after the 'cycle_time' parameter has been parsed and store it in cycle_time. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28etf: Add skip_sock_checkVedang Patel
Currently, etf expects a socket with SO_TXTIME option set for each packet it encounters. So, it will drop all other packets. But, in the future commits we are planning to add functionality where tstamp value will be set by another qdisc. Also, some packets which are generated from within the kernel (e.g. ICMP packets) do not have any socket associated with them. So, this commit adds support for skip_sock_check. When this option is set, etf will skip checking for a socket and other associated options for all skbs. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28etf: Don't use BIT() in UAPI headers.Vedang Patel
The BIT() macro isn't exported as part of the UAPI interface. So, the compile-test to ensure they are self contained fails. So, use _BITUL() instead. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28igb: clear out skb->tstamp after reading the txtimeVedang Patel
If a packet which is utilizing the launchtime feature (via SO_TXTIME socket option) also requests the hardware transmit timestamp, the hardware timestamp is not delivered to the userspace. This is because the value in skb->tstamp is mistaken as the software timestamp. Applications, like ptp4l, request a hardware timestamp by setting the SOF_TIMESTAMPING_TX_HARDWARE socket option. Whenever a new timestamp is detected by the driver (this work is done in igb_ptp_tx_work() which calls igb_ptp_tx_hwtstamps() in igb_ptp.c[1]), it will queue the timestamp in the ERR_QUEUE for the userspace to read. When the userspace is ready, it will issue a recvmsg() call to collect this timestamp. The problem is in this recvmsg() call. If the skb->tstamp is not cleared out, it will be interpreted as a software timestamp and the hardware tx timestamp will not be successfully sent to the userspace. Look at skb_is_swtx_tstamp() and the callee function __sock_recv_timestamp() in net/socket.c for more details. Signed-off-by: Vedang Patel <vedang.patel@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28net: mvpp2: prs: Don't override the sign bit in SRAM parser shiftMaxime Chevallier
The Header Parser allows identifying various fields in the packet headers, used for various kinds of filtering and classification steps. This is a re-entrant process, where the offset in the packet header depends on the previous lookup results. This offset is represented in the SRAM results of the TCAM, as a shift to be operated. This shift can be negative in some cases, such as in IPv6 parsing. This commit prevents overriding the sign bit when setting the shift value, which could cause instabilities when parsing IPv6 flows. Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit") Suggested-by: Alan Winkowski <walan@marvell.com> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
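The general technique of storing a signed shift as a magnitude plus a separate sign bit can be sketched as below. The field layout (an 8-bit magnitude in bits 0-7 and a sign flag in bit 8) and all names are assumptions made for illustration; they are not the mvpp2 register layout.

```c
#include <stdint.h>
#include <stdlib.h>

#define EXAMPLE_SHIFT_MASK 0xffu      /* hypothetical: bits 0-7 hold |shift| */
#define EXAMPLE_SHIFT_SIGN (1u << 8)  /* hypothetical: bit 8 holds the sign */

/* Stores `shift` in `word`. The sign and the magnitude are written to
 * separate bit ranges, so writing the magnitude cannot disturb the sign. */
static uint32_t example_set_shift(uint32_t word, int shift)
{
	/* Clear only the shift field; all other bits are left alone. */
	word &= ~(EXAMPLE_SHIFT_MASK | EXAMPLE_SHIFT_SIGN);

	if (shift < 0)
		word |= EXAMPLE_SHIFT_SIGN;
	word |= (uint32_t)abs(shift) & EXAMPLE_SHIFT_MASK;

	return word;
}

/* Reads the signed shift back out of `word`. */
static int example_get_shift(uint32_t word)
{
	int magnitude = (int)(word & EXAMPLE_SHIFT_MASK);

	return (word & EXAMPLE_SHIFT_SIGN) ? -magnitude : magnitude;
}
```

A negative shift written this way reads back as negative, which is the property a parser relying on backward offsets would need.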
2019-06-28net: phylink: further documentation clarificationsRussell King
Clarify the validate() behaviour in a few cases which weren't mentioned in the documentation, but which are necessary for users to get the correct behaviour. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28Merge branch 'mirred-recurse'David S. Miller
John Hurley says: ==================== Track recursive calls in TC act_mirred These patches aim to prevent act_mirred causing stack overflow events from recursively calling packet xmit or receive functions. Such events can occur with poor TC configuration that causes packets to travel in loops within the system. Florian Westphal advises that a recursion crash and packets looping are separate issues and should be treated as such. David Miller further points out that pcpu counters cannot track the precise skb context required to detect loops. Hence these patches are not aimed at detecting packet loops, rather, preventing stack overflows arising from such loops. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28net: sched: protect against stack overflow in TC act_mirredJohn Hurley
TC hooks allow the application of filters and actions to packets at both ingress and egress of the network stack. It is possible, with poor configuration, that this can produce loops whereby an ingress hook calls a mirred egress action that has an egress hook that redirects back to the first ingress etc. The TC core classifier protects against loops when doing reclassifies but there is no protection against a packet looping between multiple hooks and recursively calling act_mirred. This can lead to stack overflow panics. Add a per CPU counter to act_mirred that is incremented for each recursive call of the action function when processing a packet. If a limit is passed then the packet is dropped and CPU counter reset. Note that this patch does not protect against loops in TC datapaths. Its aim is to prevent stack overflow kernel panics that can be a consequence of such loops. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
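The recursion guard described above can be sketched in plain C. This is an illustration under stated assumptions, not the act_mirred code: a thread-local variable stands in for the per CPU counter, the limit value and all names are hypothetical, and the sketch restores the counter by decrementing it on every exit path.

```c
#include <stdbool.h>

#define EXAMPLE_RECURSION_LIMIT 4  /* hypothetical limit */

/* Thread-local stand-in for a per CPU nesting counter. */
static _Thread_local unsigned int example_nest_level;

/* Hypothetical action that re-enters itself `hops` more times, as a
 * looping configuration would. Returns true if the packet was
 * processed, false if it was dropped because the nesting got too deep. */
static bool example_redirect(unsigned int hops)
{
	bool ok = true;

	if (++example_nest_level > EXAMPLE_RECURSION_LIMIT)
		ok = false;                      /* too deep: drop the packet */
	else if (hops > 0)
		ok = example_redirect(hops - 1); /* looped back into the action */

	/* Undo this level's increment so that the counter is back at zero
	 * once the outermost call returns. */
	example_nest_level--;
	return ok;
}
```

The guard bounds stack depth only. As the commit message notes, it does not detect or prevent the loop itself.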
2019-06-28net: sched: refactor reinsert actionJohn Hurley
The TC_ACT_REINSERT return type was added as an in-kernel only option to allow a packet ingress or egress redirect. This is used to avoid unnecessary skb clones in situations where they are not required. If a TC hook returns this code then the packet is 'reinserted' and no skb consume is carried out as no clone took place. This return type is only used in act_mirred. Rather than have the reinsert called from the main datapath, call it directly in act_mirred. Instead of returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED which tells the caller that the packet has been stolen by another process and that no consume call is required. Moving all redirect calls to the act_mirred code is in preparation for tracking recursion created by act_mirred. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28ipv4: enable route flushing in network namespacesChristian Brauner
Tools such as vpnc try to flush routes when run inside network namespaces by writing 1 into /proc/sys/net/ipv4/route/flush. This currently does not work because flush is not enabled in non-initial network namespaces. Since routes are per network namespace it is safe to enable /proc/sys/net/ipv4/route/flush in there. Link: https://github.com/lxc/lxd/issues/4257 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28net: ethernet: ti: cpsw: Assign OF node to slave devicesMarek Vasut
Assign OF node to CPSW slave devices, otherwise it is not possible to bind e.g. DSA switch to them. Without this patch, the DSA code tries to find the ethernet device by OF match, but fails to do so because the slave device has NULL OF node. Signed-off-by: Marek Vasut <marex@denx.de> Cc: David S. Miller <davem@davemloft.net> Cc: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter fixes for net: 1) Fix memleak reported by syzkaller when registering IPVS hooks, patch from Julian Anastasov. 2) Fix memory leak in start_sync_thread, also from Julian. 3) Fix conntrack deletion via ctnetlink, from Felix Kaechele. 4) Fix reject for ICMP due to incorrect checksum handling, from He Zhe. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28net: dsa: b53: Disable all ports on setupBenedikt Spranger
A b53 device may be configured through an external EEPROM like the switch device on the Lamobo R1 router board. The configuration of a port may therefore differ from the reset configuration of the switch. The switch configuration reported by the DSA subsystem is different until the port is configured by DSA i.e. a port can be active, while the DSA subsystem reports the port is inactive. Disable all ports and not only the unused ones to put all ports into a well defined state. Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28ath10k: pci: remove unnecessary castsKalle Valo
Fixes checkpatch warnings: drivers/net/wireless/ath/ath10k/pci.c:926: unnecessary cast may hide bugs, see http://c-faq.com/malloc/mallocnocast.html drivers/net/wireless/ath/ath10k/pci.c:1072: unnecessary cast may hide bugs, see http://c-faq.com/malloc/mallocnocast.html While at it, also remove unnecessary initialisation of data_buf variable in both cases. Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2019-06-28ath10k: remove unnecessary 'out of memory' messageKalle Valo
Fixes checkpatch warning: drivers/net/wireless/ath/ath10k/swap.c:110: Possible unnecessary 'out of memory' message Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2019-06-28ath10k: destroy sdio workqueue while remove sdio moduleWen Gong
The workqueue needs to be flushed and destroyed when the sdio module is removed; otherwise its thread is left behind and is not destroyed after the sdio module has been removed. Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00007-QCARMSWP-1. Signed-off-by: Wen Gong <wgong@codeaurora.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2019-06-28ath10k: Move non-fatal warn logs to dbg level for SDIO chipWen Gong
ath10k will receive some messages with an invalid peer id from the firmware. The reason is that, for incoming frames whose address is not found in the address search table, the MAC hardware sets an invalid peer id. This is the hardware's own logic, so it is more convenient to handle it in ath10k. Log: ath10k_sdio mmc1:0001:1: Got RX ind from invalid peer: 65535 Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00007-QCARMSWP-1. Signed-off-by: Wen Gong <wgong@codeaurora.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2019-06-28ath10k: Fix memory leak in qmiDundi Raviteja
Currently the memory allocated for qmi handle is not being freed during de-init which leads to memory leak. Free the allocated qmi memory in qmi deinit to avoid memory leak. Tested HW: WCN3990 Tested FW: WLAN.HL.3.1-01040-QCAHLSWMTPLZ-1 Fixes: fda6fee0001e ("ath10k: add QMI message handshake for wcn3990 client") Signed-off-by: Dundi Raviteja <dundi@codeaurora.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>