summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/intel/ice/ice_txrx.c
AgeCommit message (Collapse)Author
2021-12-17ice: xsk: allocate separate memory for XDP SW ringMaciej Fijalkowski
Currently, the zero-copy data path is reusing the memory region that was initially allocated for an array of struct ice_rx_buf for its own purposes. This is error prone as it is based on the ice_rx_buf struct always being the same size or bigger than what the zero-copy path needs. There can also be old values present in that array giving rise to errors when the zero-copy path uses it. Fix this by freeing the ice_rx_buf region and allocating a new array for the zero-copy path that has the right length and is initialized to zero. Fixes: 57f7f8b6bc0b ("ice: Use xdp_buf instead of rx_buf for xsk zero-copy") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: use devm_kcalloc() instead of devm_kzalloc()Gustavo A. R. Silva
Use 2-factor multiplication argument form devm_kcalloc() instead of devm_kzalloc(). Link: https://github.com/KSPP/linux/issues/162 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: fix software generating extra interruptsJesse Brandeburg
The driver tried to work around missing completion events that occurred while interrupts are disabled, by triggering a software interrupt whenever we exit polling (but we had to have polled at least once). This was causing a *lot* of extra interrupts for some workloads like NVMe over TCP, which resulted in regressions in performance. It was also visible when polling didn't prevent interrupts when busy_poll was enabled. Fix the extra interrupts by utilizing our previously unused 3rd ITR (interrupt throttle) index and set it to 20K interrupts per second, and then trigger a software interrupt within that rate limit. While here, slightly refactor the code to avoid an overwrite of a local variable in the case of wb_en = true. Fixes: b7306b42beaf ("ice: manage interrupts during poll exit") Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-19ice: update dim usage and moderationJesse Brandeburg
The driver was having trouble with unreliable latency when doing single threaded ping-pong tests. This was root caused to the DIM algorithm landing on a too slow interrupt value, which caused high latency, and it was especially present when queues were being switched frequently by the scheduler as happens on default setups today. In attempting to improve this, we allow the upper rate limit for interrupts to move to rate limit of 4 microseconds as a max, which means that no vector can generate more than 250,000 interrupts per second. The old config was up to 100,000. The driver previously tried to program the rate limit too frequently and if the receive and transmit side were both active on the same vector, the INTRL would be set incorrectly, and this change fixes that issue as a side effect of the redesign. This driver will operate from now on with a slightly changed DIM table with more emphasis towards latency sensitivity by having more table entries with lower latency than with high latency (high being >= 64 microseconds). The driver also resets the DIM algorithm state with a new stats set when there is no work done and the data becomes stale (older than 1 second), for the respective receive or transmit portion of the interrupt. Add a new helper for setting rate limit, which will be used more in a followup patch. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-15ice: introduce XDP_TX fallback pathMaciej Fijalkowski
Under rare circumstances there might be a situation where a requirement of having XDP Tx queue per CPU could not be fulfilled and some of the Tx resources have to be shared between CPUs. This yields a need for placing accesses to xdp_ring inside a critical section protected by spinlock. These accesses happen to be in the hot path, so let's introduce the static branch that will be triggered from the control plane when driver could not provide Tx queue dedicated for XDP on each CPU. Currently, the design that has been picked is to allow any number of XDP Tx queues that is at least half of a count of CPUs that platform has. For lower number driver will bail out with a response to user that there were not enough Tx resources that would allow configuring XDP. The sharing of rings is signalled via static branch enablement which in turn indicates that lock for xdp_ring accesses needs to be taken in hot path. Approach based on static branch has no impact on performance of a non-fallback path. One thing that is needed to be mentioned is a fact that the static branch will act as a global driver switch, meaning that if one PF got out of Tx resources, then other PFs that ice driver is servicing will suffer. However, given the fact that HW that ice driver is handling has 1024 Tx queues per each PF, this is currently an unlikely scenario. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-15ice: optimize XDP_TX workloadsMaciej Fijalkowski
Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-15ice: propagate xdp_ring onto rx_ringMaciej Fijalkowski
With rings being split, it is now convenient to introduce a pointer to XDP ring within the Rx ring. For XDP_TX workloads this means that xdp_rings array access will be skipped, which was executed per each processed frame. Also, read the XDP prog once per NAPI and if prog is present, set up the local xdp_ring pointer. Reading prog a single time was discussed in [1] with some concern raised by Toke around dispatcher handling and having the need for going through the RCU grace period in the ndo_bpf driver callback, but ice currently is torning down NAPI instances regardless of the prog presence on VSI. Although the pointer to XDP ring introduced to Rx ring makes things a lot slimmer/simpler, I still feel that single prog read per NAPI lifetime is beneficial. Further patch that will introduce the fallback path will also get a profit from that as xdp_ring pointer will be set during the XDP rings setup. [1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/ Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-15ice: do not create xdp_frame on XDP_TXMaciej Fijalkowski
xdp_frame is not needed for XDP_TX data path in ice driver case. For this data path cleaning of sent descriptor will not happen anywhere outside of the driver, which means that carrying the information about the underlying memory model via xdp_frame will not be used. Therefore, this conversion can be simply dropped, which would relieve CPU a bit. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-15ice: split ice_ring onto Tx/Rx separate structsMaciej Fijalkowski
While it was convenient to have a generic ring structure that served both Tx and Rx sides, next commits are going to introduce several Tx-specific fields, so in order to avoid hurting the Rx side, let's pull out the Tx ring onto new ice_tx_ring and ice_rx_ring structs. Rx ring could be handled by the old ice_ring which would reduce the code churn within this patch, but this would make things asymmetric. Make the union out of the ring container within ice_q_vector so that it is possible to iterate over newly introduced ice_tx_ring. Remove the @size as it's only accessed from control path and it can be calculated pretty easily. Change definitions of ice_update_ring_stats and ice_fetch_u64_stats_per_ring so that they are ring agnostic and can be used for both Rx and Tx rings. Sizes of Rx and Tx ring structs are 256 and 192 bytes, respectively. In Rx ring xdp_rxq_info occupies its own cacheline, so it's the major difference now. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-07ice: switchdev slow pathGrzegorz Nitka
Slow path means allowing packet to go from uplink to representor and from representor to correct VF on Rx site and from VF to representor and to uplink on Tx site. To accomplish this driver, has to set correct Tx descriptor. When packet is sent from representor to VF, destination should be set to VF VSI. When packet is sent from uplink port destination should be uplink to bypass switch infrastructure and send packet outside. On Rx site driver should check source VSI field from Rx descriptor and based on that forward packed to correct netdev. To allow this there is a target netdevs table in control plane VSI struct. Co-developed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-09-28ice: Add DSCP supportDave Ertman
Implement code to handle submission of APP TLV's containing DSCP to TC mapping. The first such mapping received on an interface will cause that PF to switch to L3 DSCP QoS mode, apply the default config for that mode, and apply the received mapping. Only one such mapping will be allowed per DSCP value, and when the last DSCP mapping is deleted, the PF will switch back into L2 VLAN QoS mode, applying the appropriate default QoS settings. L3 DSCP QoS mode will only be allowed in SW DCBx mode, in other words, when the FW LLDP engine is disabled. Commands that break this mutual exclusivity will be blocked. Co-developed-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-28 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 12 day(s) which contain a total of 56 files changed, 394 insertions(+), 380 deletions(-). The main changes are: 1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney. 2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski. 3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev. 4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria. 5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards. 6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa. 7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi. 8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim. 9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young. 10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer. 11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski. 12) Fix up bpfilter startup log-level to info level, from Gary Lin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-25ice: add tracepointsJesse Brandeburg
This patch is modeled after one by Scott Peterson for i40e. Add tracepoints to the driver, via a new file ice_trace.h and some new trace calls added in interesting places in the driver. Add some tracing for DIMLIB to help debug interrupt moderation problems. Performance should not be affected, and this can be very useful for debugging and adding new trace events to paths in the future. Note eBPF programs can attach to these events, as well as perf can count them since we're attaching to the events subsystem in the kernel. Co-developed-by: Ben Shelton <benjamin.h.shelton@intel.com> Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-24intel: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen
The Intel drivers all have rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> # i40e Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: intel-wired-lan@lists.osuosl.org Link: https://lore.kernel.org/bpf/20210624160609.292325-12-toke@redhat.com
2021-06-18ice: report hash type such as L2/L3/L4Jesse Brandeburg
The hardware is reporting the type of the hash used for RSS as a PTYPE field in the receive descriptor. Use this value to set the skb packet hash type by extending the hash type table to cover all 10-bits of possible values (requiring some variables to be changed from u8 to u16), and then use that table to convert to one of the possible values in enum pkt_hash_types. While we're here, remove the unused ptype struct value, which makes table init easier for the zero entries, and use ranged initializer to remove a bunch of code (works with gcc and clang). Without this change, the kernel will recalculate the hash in software, which can consume extra CPU cycles. Co-developed-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-11ice: enable transmit timestamps for E810 devicesJacob Keller
Add support for enabling Tx timestamp requests for outgoing packets on E810 devices. The ice hardware can support multiple outstanding Tx timestamp requests. When sending a descriptor to hardware, a Tx timestamp request is made by setting a request bit, and assigning an index that represents which Tx timestamp index to store the timestamp in. Hardware makes no effort to synchronize the index use, so it is up to software to ensure that Tx timestamp indexes are not re-used before the timestamp is reported back. To do this, introduce a Tx timestamp tracker which will keep track of currently in-use indexes. In the hot path, if a packet has a timestamp request, an index will be requested from the tracker. Unfortunately, this does require a lock as the indexes are shared across all queues on a PHY. There are not enough indexes to reliably assign only 1 to each queue. For the E810 devices, the timestamp indexes are not shared across PHYs, so each port can have its own tracking. Once hardware captures a timestamp, an interrupt is fired. In this interrupt, trigger a new work item that will figure out which timestamp was completed, and report the timestamp back to the stack. This function loops through the Tx timestamp indexes and checks whether there is now a valid timestamp. If so, it clears the PHY timestamp indication in the PHY memory, locks and removes the SKB and bit in the tracker, then reports the timestamp to the stack. It is possible in some cases that a timestamp request will be initiated but never completed. This might occur if the packet is dropped by software or hardware before it reaches the PHY. Add a task to the periodic work function that will check whether a timestamp request is more than a few seconds old. If so, the timestamp index is cleared in the PHY, and the SKB is released. Just as with Rx timestamps, the Tx timestamps are only 40 bits wide, and use the same overall logic for extending to 64 bits of nanoseconds. With this change, E810 devices should be able to perform basic PTP functionality. Future changes will extend the support to cover the E822-based devices. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-04ice: Allow all LLDP packets from PF to TxDave Ertman
Currently in the ice driver, the check whether to allow a LLDP packet to egress the interface from the PF_VSI is being based on the SKB's priority field. It checks to see if the packets priority is equal to TC_PRIO_CONTROL. Injected LLDP packets do not always meet this condition. SCAPY defaults to a sk_buff->protocol value of ETH_P_ALL (0x0003) and does not set the priority field. There will be other injection methods (even ones used by end users) that will not correctly configure the socket so that SKB fields are correctly populated. Then ethernet header has to have to correct value for the protocol though. Add a check to also allow packets whose ethhdr->h_proto matches ETH_P_LLDP (0x88CC). Fixes: 0c3a6101ff2d ("ice: Allow egress control packets from PF_VSI") Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-03ice: add correct exception tracing for XDPMagnus Karlsson
Add missing exception tracing to XDP when a number of different errors can occur. The support was only partial. Several errors where not logged which would confuse the user quite a lot not knowing where and why the packets disappeared. Fixes: efc2214b6047 ("ice: Add support for XDP") Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-04-14ice: refactor ITR data structuresJesse Brandeburg
Use a dedicated bitfield in order to both increase the amount of checking around the length of ITR writes as well as simplify the checks of dynamic mode. Basically unpack the "high bit means dynamic" logic into bitfields. Also, remove some unused ITR defines. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-04-14ice: manage interrupts during poll exitJesse Brandeburg
The driver would occasionally miss that there were outstanding descriptors to clean when exiting busy/napi poll. This issue has been in the code since the introduction of the ice driver. Attempt to "catch" any remaining work by triggering a software interrupt when exiting napi poll or busy-poll. This will not cause extra interrupts in the case of normal execution. This issue was found when running sfnt-pingpong, with busy poll enabled, and typically with larger I/O sizes like > 8192, the program would occasionally report > 1 second maximums to complete a ping pong. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-04-14ice: replace custom AIM algorithm with kernel's DIM libraryJacob Keller
The ice driver has support for adaptive interrupt moderation, an algorithm for tuning the interrupt rate dynamically. This algorithm is based on various assumptions about ring size, socket buffer size, link speed, SKB overhead, ethernet frame overhead and more. The Linux kernel has support for a dynamic interrupt moderation algorithm known as "dimlib". Replace the custom driver-specific implementation of dynamic interrupt moderation with the kernel's algorithm. The Intel hardware has a different hardware implementation than the originators of the dimlib code had to work with, which requires the driver to use a slightly different set of inputs for the actual moderation values, while getting all the advice from dimlib of better/worse, shift left or right. The change made for this implementation is to use a pair of values for each of the 5 "slots" that the dimlib moderation expects, and the driver will program those pairs when dimlib recommends a slot to use. The currently implementation uses two tables, one for receive and one for transmit, and the pairs of values in each slot set the maximum delay of an interrupt and a maximum number of interrupts per second (both expressed in microseconds). There are two separate kinds of bugs fixed by using DIMLIB, one is UDP single stream send was too slow, and the other is that 8K ping-pong was going to the most aggressive moderation and has much too high latency. The overall result of using DIMLIB is that we meet or exceed our performance expectations set based on the old algorithm. Co-developed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-03-31ice: Consolidate VSI state and flagsAnirudh Venkataramanan
struct ice_vsi has two fields, state and flags which seem to be serving the same purpose. Consolidate them into one field 'state'. enum ice_state is used to represent state information of the PF. While some of these enum values can be use to represent VSI state, it makes more sense to represent VSI state with its own enum. So derive a new enum ice_vsi_state from ice_vsi_flags and ice_state and use it. Also rename enum ice_state to ice_pf_state for clarity. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-03-25Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-03-24 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 15 day(s) which contain a total of 65 files changed, 3200 insertions(+), 738 deletions(-). The main changes are: 1) Static linking of multiple BPF ELF files, from Andrii. 2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo. 3) Spelling fixes from various folks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22ice: Check FDIR program status for AVFQi Zhang
Enable returning FDIR completion status by checking the ctrl_vsi Rx queue descriptor value. To enable returning FDIR completion status from ctrl_vsi Rx queue, COMP_Queue and COMP_Report of FDIR filter programming descriptor needs to be properly configured. After program request sent to ctrl_vsi Tx queue, ctrl_vsi Rx queue interrupt will be triggered and completion status will be returned. Driver will first issue request in ice_vc_fdir_add_fltr(), then pass FDIR context to the background task in interrupt service routine ice_vc_fdir_irq_handler() and finally deal with them in ice_flush_fdir_ctx(). ice_flush_fdir_ctx() will check the descriptor's value, fdir context, and then send back virtual channel message to VF by calling ice_vc_add_fdir_fltr_post(). An additional timer will be setup in case of hardware interrupt timeout. Signed-off-by: Yahui Cao <yahui.cao@intel.com> Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Qi Zhang <qi.z.zhang@intel.com> Tested-by: Chen Bo <BoX.C.Chen@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-03-18bpf, devmap: Move drop error path to devmap for XDP_REDIRECTLorenzo Bianconi
We want to change the current ndo_xdp_xmit drop semantics because it will allow us to implement better queue overflow handling. This is working towards the larger goal of a XDP TX queue-hook. Move XDP_REDIRECT error path handling from each XDP ethernet driver to devmap code. According to the new APIs, the driver running the ndo_xdp_xmit pointer, will break tx loop whenever the hw reports a tx error and it will just return to devmap caller the number of successfully transmitted frames. It will be devmap responsibility to free dropped frames. Move each XDP ndo_xdp_xmit capable driver to the new APIs: - veth - virtio-net - mvneta - mvpp2 - socionext - amazon ena - bnxt - freescale (dpaa2, dpaa) - xen-frontend - qede - ice - igb - ixgbe - i40e - mlx5 - ti (cpsw, cpsw-new) - tun - sfc Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Link: https://lore.kernel.org/bpf/ed670de24f951cfd77590decf0229a0ad7fd12f6.1615201152.git.lorenzo@kernel.org
2021-03-12ice: move headroom initialization to ice_setup_rx_ctxMaciej Fijalkowski
ice_rx_offset(), that is supposed to initialize the Rx buffer headroom, relies on ICE_RX_FLAGS_RING_BUILD_SKB flag as well as XDP prog presence. Currently, the callsite of mentioned function is placed incorrectly within ice_setup_rx_ring() where Rx ring's build skb flag is not set yet. This causes the XDP_REDIRECT to be partially broken due to inability to create xdp_frame in the headroom space, as the headroom is 0. Fix this by moving ice_rx_offset() to ice_setup_rx_ctx() after the flag setting. Fixes: f1b1f409bf79 ("ice: store the result of ice_rx_offset() onto ice_ring") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-02-12ice: store the result of ice_rx_offset() onto ice_ringMaciej Fijalkowski
Output of ice_rx_offset() is based on ethtool's priv flag setting, which when changed, causes PF reset (disables napi, frees irqs, loads different Rx mem model, etc.). This means that within napi its result is constant and there is no reason to call it per each processed frame. Add new 'rx_offset' field to ice_ring that is meant to hold the ice_rx_offset() result and use it within ice_clean_rx_irq(). Furthermore, use it within ice_alloc_mapped_page(). Reviewed-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-02-12ice: move skb pointer from rx_buf to rx_ringMaciej Fijalkowski
Similar thing has been done in i40e, as there is no real need for having the sk_buff pointer in each rx_buf. Non-eop frames can be simply handled on that pointer moved upwards to rx_ring. Reviewed-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-02-12ice: simplify ice_run_xdpMaciej Fijalkowski
There's no need for 'result' variable, we can directly return the internal status based on action returned by xdp prog. Reviewed-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-02-08ice: Refactor DCB related variables out of the ice_port_info structChinh T Cao
Refactor the DCB related variables out of the ice_port_info_struct. The goal is to make the ice_port_info struct cleaner. Signed-off-by: Chinh T Cao <chinh.t.cao@intel.com> Co-developed-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-02-08ice: fix writeback enable logicJesse Brandeburg
The writeback enable logic was incorrectly implemented (due to misunderstanding what the side effects of the implementation would be during polling). Fix this logic issue, while implementing a new feature allowing the user to control the writeback frequency using the knobs for controlling interrupt throttling that we already have. Basically if you leave adaptive interrupts enabled, the writeback frequency will be varied even if busy_polling or if napi-poll is in use. If the interrupt rates are set to a fixed value by ethtool -C and adaptive is off, the driver will allow the user-set interrupt rate to guide how frequently the hardware will complete descriptors to the driver. Effectively the user will get a control over the hardware efficiency, allowing the choice between immediate interrupts or delayed up to a maximum of the interrupt rate, even when interrupts are disabled during polling. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Co-developed-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-02-04net: use the new dev_page_is_reusable() instead of private versionsAlexander Lobakin
Now we can remove a bunch of identical functions from the drivers and make them use common dev_page_is_reusable(). All {,un}likely() checks are omitted since it's already present in this helper. Also update some comments near the call sites. Suggested-by: David Rientjes <rientjes@google.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/can/dev.c b552766c872f ("can: dev: prevent potential information leak in can_fill_info()") 3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir") 0a042c6ec991 ("can: dev: move netlink related code into seperate file") Code move. drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 57ac4a31c483 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down") 214baf22870c ("net/mlx5e: Support HTB offload") Adjacent code changes net/switchdev/switchdev.c 20776b465c0c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP") ffb68fc58e96 ("net: switchdev: remove the transaction structure from port object notifiers") bae33f2b5afe ("net: switchdev: remove the transaction structure from port attributes") Transaction parameter gets dropped otherwise keep the fix. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-26ice: Implement flow for IPv6 next header (extension header)Nick Nunley
This patch is based on a similar change to i40e by Slawomir Laba: "i40e: Implement flow for IPv6 next header (extension header)". When a packet contains an IPv6 header with next header which is an extension header and not a protocol one, the kernel function skb_transport_header called with such sk_buff will return a pointer to the extension header and not to the TCP one. The above explained call caused a problem with packet processing for skb with encapsulation for tunnel with ICE_TX_CTX_EIPT_IPV6. The extension header was not skipped at all. The ipv6_skip_exthdr function does check if next header of the IPV6 header is an extension header and doesn't modify the l4_proto pointer if it points to a protocol header value so its safe to omit the comparison of exthdr and l4.hdr pointers. The ipv6_skip_exthdr can return value -1. This means that the skipping process failed and there is something wrong with the packet so it will be dropped. Fixes: a4e82a81f573 ("ice: Add support for tunnel offloads") Signed-off-by: Nick Nunley <nicholas.d.nunley@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-01-08net, xdp: Introduce xdp_prepare_buff utility routineLorenzo Bianconi
Introduce xdp_prepare_buff utility routine to initialize per-descriptor xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in all XDP capable drivers. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Acked-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Marcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08net, xdp: Introduce xdp_init_buff utility routineLorenzo Bianconi
Introduce xdp_init_buff utility routine to initialize xdp_buff fields const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on xdp_init_buff in all XDP capable drivers. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Acked-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Marcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-12-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09ice: avoid premature Rx buffer reuseBjörn Töpel
The page recycle code, incorrectly, relied on that a page fragment could not be freed inside xdp_do_redirect(). This assumption leads to that page fragments that are used by the stack/XDP redirect can be reused and overwritten. To avoid this, store the page count prior invoking xdp_do_redirect(). Fixes: efc2214b6047 ("ice: Add support for XDP") Reported-and-analyzed-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2020-12-01xsk: Propagate napi_id to XDP socket Rx pathBjörn Töpel
Add napi_id to the xdp_rxq_info structure, and make sure the XDP socket pick up the napi_id in the Rx path. The napi_id is used to find the corresponding NAPI structure for socket busy polling. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
2020-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-09-01 The following pull-request contains BPF updates for your *net-next* tree. There are two small conflicts when pulling, resolve as follows: 1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120282b ("libbpf: Factor out common ELF operations and improve logging") in bpf-next and 1e891e513e16 ("libbpf: Fix map index used in error message") in net-next. Resolve by taking the hunk in bpf-next: [...] scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx); data = elf_sec_data(obj, scn); if (!scn || !data) { pr_warn("elf: failed to get %s map definitions for %s\n", MAPS_ELF_SEC, obj->path); return -EINVAL; } [...] 2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between 9647c57b11e5 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for better performance") in bpf-next and e20f0dbf204f ("net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like: [...] xdp_set_data_meta_invalid(xdp); xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool); net_prefetch(xdp->data); [...] We've added 133 non-merge commits during the last 14 day(s) which contain a total of 246 files changed, 13832 insertions(+), 3105 deletions(-). The main changes are: 1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper for tracing to reliably access user memory, from Alexei Starovoitov. 2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau. 3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa. 4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson. 5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko. 6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh. 7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer. 8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song. 9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant. 10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee. 11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua. 12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski. 13) Various smaller cleanups and improvements all over the place. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31xsk: i40e: ice: ixgbe: mlx5: Pass buffer pool to driver instead of umemMagnus Karlsson
Replace the explicit umem reference passed to the driver in AF_XDP zero-copy mode with the buffer pool instead. This in preparation for extending the functionality of the zero-copy mode so that umems can be shared between queues on the same netdev and also between netdevs. In this commit, only an umem reference has been added to the buffer pool struct. But later commits will add other entities to it. These are going to be entities that are different between different queue ids and netdevs even though the umem is shared between them. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com
2020-08-26net: Take common prefetch code structure into a functionTariq Toukan
Many device drivers use the same prefetch code structure to deal with small L1 cacheline size. Take this code into a function and call it from the drivers. Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-01ice: Misc minor fixesTony Nguyen
This is a collection of minor fixes including typos, white space, and style. No functional changes. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
2020-08-01ice: port fix for chk_linearlizeKiran Patil
This is a port of commit 248de22e638f ("i40e/i40evf: Account for frags split over multiple descriptors in check linearize") As part of testing workloads (read/write) using larger IO size (128K) tx_timeout is observed and whenever it happens, it was due to tx_linearize. Signed-off-by: Kiran Patil <kiran.patil@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2020-08-01ice: remove page_reuse statisticJesse Brandeburg
The page reuse statistic wasn't even being displayed to the user, even though the driver counted it. Don't waste the struct space and hot-path cycles since the driver doesn't display it. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2020-07-29ice: fix unused parameter warningTony Nguyen
Depending on PAGE_SIZE, the following unused parameter warning can be reported: drivers/net/ethernet/intel/ice/ice_txrx.c: In function ‘ice_rx_frame_truesize’: drivers/net/ethernet/intel/ice/ice_txrx.c:513:21: warning: unused parameter ‘size’ [-Wunused-parameter] unsigned int size) The 'size' variable is used only when PAGE_SIZE >= 8192. Add __maybe_unused to remove the warning. Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
2020-05-31ice: Fix transmit for all software offloaded VLANsBrett Creeley
Currently the driver does not recognize when there is an 802.1AD VLAN tag right after the dmac/smac (outermost VLAN tag). If any DCB map is applied and/or DCB is enabled this is causing the hardware to insert a VLAN 0 tag after the 802.1AD VLAN tag that is already in the packet. Fix this by preventing VLAN tag 0 from being added when any VLAN is already present after dmac/smac (software offloaded) or skb (hardware offloaded). Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-05-22ice: Support IPv4 Flow Director filtersHenry Tieman
Support the addition and deletion of IPv4 filters. Supported fields are: src-ip, dst-ip, src-port, and dst-port Supported flow-types are: tcp4, udp4, sctp4, ip4 Example usage: ethtool -N eth0 flow-type tcp4 src-ip 192.168.0.55 dst-ip 172.16.0.55 \ src-port 16 dst-port 12 action 32 Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-05-22ice: Initialize Flow Director resourcesHenry Tieman
Flow Director allows for redirection based on ntuple rules. Rules are programmed using the ethtool set-ntuple interface. Supported actions are redirect to queue and drop. Setup the initial framework to process Flow Director filters. Create and allocate resources to manage and program filters to the hardware. Filters are processed via a sideband interface; a control VSI is created to manage communication and process requests through the sideband. Upon allocation of resources, update the hardware tables to accept perfect filters. Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>