summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/intel
AgeCommit message (Collapse)Author
2023-02-13ice: fix lost multicast packets in promisc modeJesse Brandeburg
There was a problem reported to us where the addition of a VF with an IPv6 address ending with a particular sequence would cause the parent device on the PF to no longer be able to respond to neighbor discovery packets. In this case, we had an ovs-bridge device living on top of a VLAN, which was on top of a PF, and it would not be able to talk anymore (the neighbor entry would expire and couldn't be restored). The root cause of the issue is that if the PF is asked to be in IFF_PROMISC mode (promiscuous mode) and it had an ipv6 address that needed the 33:33:ff:00:00:04 multicast address to work, then when the VF was added with the need for the same multicast address, the VF would steal all the traffic destined for that address. The ice driver didn't auto-subscribe a request of IFF_PROMISC to the "multicast replication from other port's traffic" meaning that it won't get for instance, packets with an exact destination in the VF, as above. The VF's IPv6 address, which adds a "perfect filter" for 33:33:ff:00:00:04, results in no packets for that multicast address making it to the PF (which is in promisc but NOT "multicast replication"). The fix is to enable "multicast promiscuous" whenever the driver is asked to enable IFF_PROMISC, and make sure to disable it when appropriate. Fixes: e94d44786693 ("ice: Implement filter sync, NDO operations and bump version") Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-13ice: Micro-optimize .ndo_xdp_xmit() pathAlexander Lobakin
After the recent mbuf changes, ice_xmit_xdp_ring() became a 3-liner. It makes no sense to keep it global in a different file than its caller. Move it just next to the sole call site and mark static. Also, it doesn't need a full xdp_convert_frame_to_buff(). Save several cycles and fill only the fields used by __ice_xmit_xdp_ring() later on. Finally, since it doesn't modify @xdpf anyhow, mark the argument const to save some more (whole -11 bytes of .text! :D). Thanks to 1 jump less and less calcs as well, this yields as many as 6.7 Mpps per queue. `xdp.data_hard_start = xdpf` is fully intentional again (see xdp_convert_buff_to_frame()) and just works when there are no source device's driver issues. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-7-alexandr.lobakin@intel.com
2023-02-13ice: Fix freeing XDP frames backed by Page PoolAlexander Lobakin
As already mentioned, freeing any &xdp_frame via page_frag_free() is wrong, as it assumes the frame is backed by either an order-0 page or a page with no "patrons" behind them, while in fact frames backed by Page Pool can be redirected to a device, which's driver doesn't use it. Keep storing a pointer to the raw buffer and then freeing it unconditionally via page_frag_free() for %XDP_TX frames, but introduce a separate type in the enum for frames coming through .ndo_xdp_xmit(), and free them via xdp_return_frame_bulk(). Note that saving xdpf as xdp_buff->data_hard_start is intentional and is always true when everything is configured properly. After this change, %XDP_REDIRECT from a Page Pool based driver to ice becomes zero-alloc as it should be and horrendous 3.3 Mpps / queue turn into 6.6, hehe. Let it go with no "Fixes:" tag as it spans across good 5+ commits and can't be trivially backported. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-6-alexandr.lobakin@intel.com
2023-02-13ice: Robustify cleaning/completing XDP Tx buffersAlexander Lobakin
When queueing frames from a Page Pool for redirecting to a device backed by the ice driver, `perf top` shows heavy load on page_alloc() and page_frag_free(), despite that on a properly working system it must be fully or at least almost zero-alloc. The problem is in fact a bit deeper and raises from how ice cleans up completed Tx buffers. The story so far: when cleaning/freeing the resources related to a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some heuristics only without setting any type explicitly (except for dummy Flow Director packets, which are marked via ice_tx_buf::tx_flags). This kinda works, but only up to some point. For example, currently ice assumes that each frame coming to __ice_xmit_xdp_ring(), is backed by either plain order-0 page or plain page frag, while it may also be backed by Page Pool or any other possible memory models introduced in future. This means any &xdp_frame must be freed properly via xdp_return_frame() family with no assumptions. In order to do that, the whole heuristics must be replaced with setting the Tx buffer/frame type explicitly, just how it's always been done via an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't hurt much -- especially given that sometimes there was a check for %ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum member. The rest of the changes is straightforward and most of it is just a conversion to rely now on the type set in &ice_tx_buf rather than to some secondary properties. For now, no functional changes intended, the change only prepares the ground for starting freeing XDP frames properly next step. And it must be done atomically/synchronously to not break stuff. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
2023-02-13ice: Remove two impossible branches on XDP Tx cleaningAlexander Lobakin
The tagged commit started sending %XDP_TX frames from XSk Rx ring directly without converting it to an &xdp_frame. However, when XSk is enabled on a queue pair, it has its separate Tx cleaning functions, so neither ice_clean_xdp_irq() nor ice_unmap_and_free_tx_buf() ever happens there. Remove impossible branches in order to reduce the diffstat of the upcoming change. Fixes: a24b4c6e9aab ("ice: xsk: Do not convert to buff to frame for XDP_TX") Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-4-alexandr.lobakin@intel.com
2023-02-13ice: Fix XDP Tx ring overrunAlexander Lobakin
Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due to the driver not freeing all of the pages passed to it by .ndo_xdp_xmit(). Turned out that during the development of the tagged commit, the check, which ensures that we have a free descriptor to queue a frame, moved into the branch happening only when a buffer has frags. Otherwise, we only run a cleaning cycle, but don't check anything. ATST, there can be situations when the driver gets new frames to send, but there are no buffers that can be cleaned/completed and the ring has no free slots. It's very rare, but still possible (> 6.5 Mpps per ring). The driver then fills the next buffer/descriptor, effectively overwriting the data, which still needs to be freed. Restore the check after the cleaning routine to make sure there is a slot to queue a new frame. When there are frags, there still will be a separate check that we can place all of them, but if the ring is full, there's no point in wasting any more time. (minor: make `!ready_frames` unlikely since it happens ~1-2 times per billion of frames) Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
2023-02-13ice: fix ice_tx_ring:: Xdp_tx_active underflowAlexander Lobakin
xdp_tx_active is used to indicate whether an XDP ring has any %XDP_TX frames queued to shortcut processing Tx cleaning for XSk-enabled queues. When !XSk, it simply indicates whether the ring has any queued frames in general. It gets increased on each frame placed onto the ring and counts the whole frame, not each frag. However, currently it gets decremented in ice_clean_xdp_tx_buf(), which is called per each buffer, i.e. per each frag. Thus, on completing multi-frag frames, an underflow happens. Move the decrement to the outer function and do it once per frame, not buf. Also, do that on the stack and update the ring counter after the loop is done to save several cycles. XSk rings are fine since there are no frags at the moment. Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-2-alexandr.lobakin@intel.com
2023-02-13ice: Fix check for weight and priority of a scheduling nodeMichal Wilczynski
Currently checks for weight and priority ranges don't check incoming value from the devlink. Instead it checks node current weight or priority. This makes those checks useless. Change range checks in ice_set_object_tx_priority() and ice_set_object_tx_weight() to check against incoming priority an weight. Fixes: 42c2eb6b1f43 ("ice: Implement devlink-rate API") Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-10i40e: Add checking for null for nlmsg_find_attr()Natalia Petrova
The result of nlmsg_find_attr() 'br_spec' is dereferenced in nla_for_each_nested(), but it can take NULL value in nla_find() function, which will result in an error. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 51616018dd1b ("i40e: Add support for getlink, setlink ndo ops") Signed-off-by: Natalia Petrova <n.petrova@fintech.ru> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20230209172833.3596034-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10Merge branch '40GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2023-02-09 (i40e) Jan removes i40e_status from the driver; replacing them with standard kernel error codes. Kees Cook replaces 0-length array with flexible array. * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: net/i40e: Replace 0-length array with flexible array i40e: use ERR_PTR error print in i40e messages i40e: use int for i40e_status i40e: Remove string printing for i40e_status i40e: Remove unused i40e status codes ==================== Link: https://lore.kernel.org/r/20230209172536.3595838-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10ice: xsk: Fix cleaning of XDP_TX framesLarysa Zaremba
Incrementation of xsk_frames inside the for-loop produces infinite loop, if we have both normal AF_XDP-TX and XDP_TXed buffers to complete. Split xsk_frames into 2 variables (xsk_frames and completed_frames) to eliminate this bug. Fixes: 29322791bc8b ("ice: xsk: change batched Tx descriptor cleaning") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20230209160130.1779890-1-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10Daniel Borkmann says:Jakub Kicinski
==================== pull-request: bpf-next 2023-02-11 We've added 96 non-merge commits during the last 14 day(s) which contain a total of 152 files changed, 4884 insertions(+), 962 deletions(-). There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c between commit 5b246e533d01 ("ice: split probe into smaller functions") from the net-next tree and commit 66c0e13ad236 ("drivers: net: turn on XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev() is otherwise there a 2nd time, and add XDP features to the existing ice_cfg_netdev() one: [...] ice_set_netdev_features(netdev); netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT | NETDEV_XDP_ACT_XSK_ZEROCOPY; ice_set_ops(netdev); [...] Stephen's merge conflict mail: https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/ The main changes are: 1) Add support for BPF trampoline on s390x which finally allows to remove many test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich. 2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski. 3) Add capability to export the XDP features supported by the NIC. Along with that, add a XDP compliance test tool, from Lorenzo Bianconi & Marek Majtyka. 4) Add __bpf_kfunc tag for marking kernel functions as kfuncs, from David Vernet. 5) Add a deep dive documentation about the verifier's register liveness tracking algorithm, from Eduard Zingerman. 6) Fix and follow-up cleanups for resolve_btfids to be compiled as a host program to avoid cross compile issues, from Jiri Olsa & Ian Rogers. 7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted when testing on different NICs, from Jesper Dangaard Brouer. 8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang. 9) Extend libbpf to add an option for when the perf buffer should wake up, from Jon Doron. 10) Follow-up fix on xdp_metadata selftest to just consume on TX completion, from Stanislav Fomichev. 11) Extend the kfuncs.rst document with description on kfunc lifecycle & stability expectations, from David Vernet. 12) Fix bpftool prog profile to skip attaching to offline CPUs, from Tonghao Zhang. ==================== Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
net/devlink/leftover.c / net/core/devlink.c: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") f05bd8ebeb69 ("devlink: move code to a dedicated directory") 687125b5799c ("devlink: split out core code") https://lore.kernel.org/all/20230208094657.379f2b1a@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09net/i40e: Replace 0-length array with flexible arrayKees Cook
Zero-length arrays are deprecated[1]. Replace struct i40e_lump_tracking's "list" 0-length array with a flexible array. Detected with GCC 13, using -fstrict-flex-arrays=3: In function 'i40e_put_lump', inlined from 'i40e_clear_interrupt_scheme' at drivers/net/ethernet/intel/i40e/i40e_main.c:5145:2: drivers/net/ethernet/intel/i40e/i40e_main.c:278:27: warning: array subscript <unknown> is outside array bounds of 'u16[0]' {aka 'short unsigned int[]'} [-Warray-bounds=] 278 | pile->list[i] = 0; | ~~~~~~~~~~^~~ drivers/net/ethernet/intel/i40e/i40e.h: In function 'i40e_clear_interrupt_scheme': drivers/net/ethernet/intel/i40e/i40e.h:179:13: note: while referencing 'list' 179 | u16 list[0]; | ^~~~ [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: intel-wired-lan@lists.osuosl.org Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-09i40e: use ERR_PTR error print in i40e messagesJan Sokolowski
In i40e_status removal patches, i40e_status conversion to strings was removed in order to easily refactor the code to use standard errornums. This however made it more difficult for read error logs. Use %pe formatter to print error messages in human-readable format. Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-09i40e: use int for i40e_statusJan Sokolowski
To prepare for removal of i40e_status, change the variables from i40e_status to int. This eases the transition when values are changed to return standard int error codes over enum i40e_status. As such changes often also change variable orders, a cleanup is also applied here to make variables conform to RCT and some lines are also reformatted where applicable. Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-09i40e: Remove string printing for i40e_statusJan Sokolowski
Remove the i40e_stat_str() function which prints the string representation of the i40e_status error code. With upcoming changes moving away from i40e_status, there will be no need for this function Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-09i40e: Remove unused i40e status codesJan Sokolowski
In an effort to remove i40e status codes and replace them with standard kernel errornums, unused values of i40e_status_code were removed. Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-08net/sched: taprio: give higher priority to higher TCs in software dequeue modeVladimir Oltean
Current taprio software implementation is haunted by the shadow of the igb/igc hardware model. It iterates over child qdiscs in increasing order of TXQ index, therefore giving higher xmit priority to TXQ 0 and lower to TXQ N. According to discussions with Vinicius, that is the default (perhaps even unchangeable) prioritization scheme used for the NICs that taprio was first written for (igb, igc), and we have a case of two bugs canceling out, resulting in a functional setup on igb/igc, but a less sane one on other NICs. To the best of my understanding, taprio should prioritize based on the traffic class, so it should really dequeue starting with the highest traffic class and going down from there. We get to the TXQ using the tc_to_txq[] netdev property. TXQs within the same TC have the same (strict) priority, so we should pick from them as fairly as we can. We can achieve that by implementing something very similar to q->curband from multiq_dequeue(). Since igb/igc really do have TXQ 0 of higher hardware priority than TXQ 1 etc, we need to preserve the behavior for them as well. We really have no choice, because in txtime-assist mode, taprio is essentially a software scheduler towards offloaded child tc-etf qdiscs, so the TXQ selection really does matter (not all igb TXQs support ETF/SO_TXTIME, says Kurt Kanzenbach). To preserve the behavior, we need a capability bit so that taprio can determine if it's running on igb/igc, or on something else. Because igb doesn't offload taprio at all, we can't piggyback on the qdisc_offload_query_caps() call from taprio_enable_offload(), but instead we need a separate call which is also made for software scheduling. Introduce two static keys to minimize the performance penalty on systems which only have igb/igc NICs, and on systems which only have other NICs. For mixed systems, taprio will have to dynamically check whether to dequeue using one prioritization algorithm or using the other. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-07Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2023-02-06 (ice) This series contains updates to ice driver only. Ani removes WQ_MEM_RECLAIM flag from workqueue to resolve check_flush_dependency warning. Michal fixes KASAN out-of-bounds warning. Brett corrects behaviour for port VLAN Rx filters to prevent receiving of unintended traffic. Dan Carpenter fixes possible off by one issue. Zhang Changzhong adjusts error path for switch recipe to prevent memory leak. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: switch: fix potential memleak in ice_add_adv_recipe() ice: Fix off by one in ice_tc_forward_to_queue() ice: Fix disabling Rx VLAN filtering with port VLAN enabled ice: fix out-of-bounds KASAN warning in virtchnl ice: Do not use WQ_MEM_RECLAIM flag for workqueue ==================== Link: https://lore.kernel.org/r/20230206232934.634298-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-07igc: Add ndo_tx_timeout supportSasha Neftin
On some platforms, 100/1000/2500 speeds seem to have sometimes problems reporting false positive tx unit hang during stressful UDP traffic. Likely other Intel drivers introduce responses to a tx hang. Update the 'tx hang' comparator with the comparison of the head and tail of ring pointers and restore the tx_timeout_factor to the previous value (one). This can be test by using netperf or iperf3 applications. Example: iperf3 -s -p 5001 iperf3 -c 192.168.0.2 --udp -p 5001 --time 600 -b 0 netserver -p 16604 netperf -H 192.168.0.2 -l 600 -p 16604 -t UDP_STREAM -- -m 64000 Fixes: b27b8dc77b5e ("igc: Increase timeout value for Speed 100/1000/2500") Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20230206235818.662384-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-06ice: switch: fix potential memleak in ice_add_adv_recipe()Zhang Changzhong
When ice_add_special_words() fails, the 'rm' is not released, which will lead to a memory leak. Fix this up by going to 'err_unroll' label. Compile tested only. Fixes: 8b032a55c1bd ("ice: low level support for tunnels") Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
2023-02-06ice: Fix off by one in ice_tc_forward_to_queue()Dan Carpenter
The > comparison should be >= to prevent reading one element beyond the end of the array. The "vsi->num_rxq" is not strictly speaking the number of elements in the vsi->rxq_map[] array. The array has "vsi->alloc_rxq" elements and "vsi->num_rxq" is less than or equal to the number of elements in the array. The array is allocated in ice_vsi_alloc_arrays(). It's still an off by one but it might not access outside the end of the array. Fixes: 143b86f346c7 ("ice: Enable RX queue selection using skbedit action") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Amritha Nambiar <amritha.nambiar@intel.com> Tested-by: Bharathi Sreenivas <bharathi.sreenivas@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
2023-02-06ice: Fix disabling Rx VLAN filtering with port VLAN enabledBrett Creeley
If the user turns on the vf-true-promiscuous-support flag, then Rx VLAN filtering will be disabled if the VF requests to enable promiscuous mode. When the VF is in a port VLAN, this is the incorrect behavior because it will allow the VF to receive traffic outside of its port VLAN domain. Fortunately this only resulted in the VF(s) receiving broadcast traffic outside of the VLAN domain because all of the VLAN promiscuous rules are based on the port VLAN ID. Fix this by setting the .disable_rx_filtering VLAN op to a no-op when a port VLAN is enabled on the VF. Also, make sure to make this fix for both Single VLAN Mode and Double VLAN Mode enabled devices. Fixes: c31af68a1b94 ("ice: Add outer_vlan_ops and VSI specific VLAN ops implementations") Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Karen Ostrowska <karen.ostrowska@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: fix out-of-bounds KASAN warning in virtchnlMichal Swiatkowski
KASAN reported: [ 9793.708867] BUG: KASAN: global-out-of-bounds in ice_get_link_speed+0x16/0x30 [ice] [ 9793.709205] Read of size 4 at addr ffffffffc1271b1c by task kworker/6:1/402 [ 9793.709222] CPU: 6 PID: 402 Comm: kworker/6:1 Kdump: loaded Tainted: G B OE 6.1.0+ #3 [ 9793.709235] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.00.01.0014.070920180847 07/09/2018 [ 9793.709245] Workqueue: ice ice_service_task [ice] [ 9793.709575] Call Trace: [ 9793.709582] <TASK> [ 9793.709588] dump_stack_lvl+0x44/0x5c [ 9793.709613] print_report+0x17f/0x47b [ 9793.709632] ? __cpuidle_text_end+0x5/0x5 [ 9793.709653] ? ice_get_link_speed+0x16/0x30 [ice] [ 9793.709986] ? ice_get_link_speed+0x16/0x30 [ice] [ 9793.710317] kasan_report+0xb7/0x140 [ 9793.710335] ? ice_get_link_speed+0x16/0x30 [ice] [ 9793.710673] ice_get_link_speed+0x16/0x30 [ice] [ 9793.711006] ice_vc_notify_vf_link_state+0x14c/0x160 [ice] [ 9793.711351] ? ice_vc_repr_cfg_promiscuous_mode+0x120/0x120 [ice] [ 9793.711698] ice_vc_process_vf_msg+0x7a7/0xc00 [ice] [ 9793.712074] __ice_clean_ctrlq+0x98f/0xd20 [ice] [ 9793.712534] ? ice_bridge_setlink+0x410/0x410 [ice] [ 9793.712979] ? __request_module+0x320/0x520 [ 9793.713014] ? ice_process_vflr_event+0x27/0x130 [ice] [ 9793.713489] ice_service_task+0x11cf/0x1950 [ice] [ 9793.713948] ? io_schedule_timeout+0xb0/0xb0 [ 9793.713972] process_one_work+0x3d0/0x6a0 [ 9793.714003] worker_thread+0x8a/0x610 [ 9793.714031] ? process_one_work+0x6a0/0x6a0 [ 9793.714049] kthread+0x164/0x1a0 [ 9793.714071] ? kthread_complete_and_exit+0x20/0x20 [ 9793.714100] ret_from_fork+0x1f/0x30 [ 9793.714137] </TASK> [ 9793.714151] The buggy address belongs to the variable: [ 9793.714158] ice_aq_to_link_speed+0x3c/0xffffffffffff3520 [ice] [ 9793.714632] Memory state around the buggy address: [ 9793.714642] ffffffffc1271a00: f9 f9 f9 f9 00 00 05 f9 f9 f9 f9 f9 00 00 02 f9 [ 9793.714656] ffffffffc1271a80: f9 f9 f9 f9 00 00 04 f9 f9 f9 f9 f9 00 00 00 00 [ 9793.714670] >ffffffffc1271b00: 00 00 00 04 f9 f9 f9 f9 04 f9 f9 f9 f9 f9 f9 f9 [ 9793.714680] ^ [ 9793.714690] ffffffffc1271b80: 00 00 00 00 00 04 f9 f9 f9 f9 f9 f9 00 00 00 00 [ 9793.714704] ffffffffc1271c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 The ICE_AQ_LINK_SPEED_UNKNOWN define is BIT(15). The value is bigger than both legacy and normal link speed tables. Add one element (0 - unknown) to both tables. There is no need to explicitly set table size, leave it empty. Fixes: 1d0e28a9be1f ("ice: Remove and replace ice speed defines with ethtool.h versions") Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
2023-02-06ice: Do not use WQ_MEM_RECLAIM flag for workqueueAnirudh Venkataramanan
When both ice and the irdma driver are loaded, a warning in check_flush_dependency is being triggered. This is due to ice driver workqueue being allocated with the WQ_MEM_RECLAIM flag and the irdma one is not. According to kernel documentation, this flag should be set if the workqueue will be involved in the kernel's memory reclamation flow. Since it is not, there is no need for the ice driver's WQ to have this flag set so remove it. Example trace: [ +0.000004] workqueue: WQ_MEM_RECLAIM ice:ice_service_task [ice] is flushing !WQ_MEM_RECLAIM infiniband:0x0 [ +0.000139] WARNING: CPU: 0 PID: 728 at kernel/workqueue.c:2632 check_flush_dependency+0x178/0x1a0 [ +0.000011] Modules linked in: bonding tls xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_cha in_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc rfkill vfat fat intel_rapl_msr intel _rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct1 0dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_ core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_cm iw_cm iTCO_wdt iTCO_vendor_support ipmi_ssif irdma mei_me ib_uverbs ib_core intel_uncore joydev pcspkr i2c_i801 acpi_ipmi mei lpc_ich i2c_smbus intel_pch_thermal ioatdma ipmi_si acpi_power_meter acpi_pad xfs libcrc32c sd_mod t10_pi crc64_rocksoft crc64 sg ahci ixgbe libahci ice i40e igb crc32c_intel mdio i2c_algo_bit liba ta dca wmi dm_mirror dm_region_hash dm_log dm_mod ipmi_devintf ipmi_msghandler fuse [ +0.000161] [last unloaded: bonding] [ +0.000006] CPU: 0 PID: 728 Comm: kworker/0:2 Tainted: G S 6.2.0-rc2_next-queue-13jan-00458-gc20aabd57164 #1 [ +0.000006] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020 [ +0.000003] Workqueue: ice ice_service_task [ice] [ +0.000127] RIP: 0010:check_flush_dependency+0x178/0x1a0 [ +0.000005] Code: 89 8e 02 01 e8 49 3d 40 00 49 8b 55 18 48 8d 8d d0 00 00 00 48 8d b3 d0 00 00 00 4d 89 e0 48 c7 c7 e0 3b 08 9f e8 bb d3 07 01 <0f> 0b e9 be fe ff ff 80 3d 24 89 8e 02 00 0f 85 6b ff ff ff e9 06 [ +0.000004] RSP: 0018:ffff88810a39f990 EFLAGS: 00010282 [ +0.000005] RAX: 0000000000000000 RBX: ffff888141bc2400 RCX: 0000000000000000 [ +0.000004] RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffa1213a80 [ +0.000003] RBP: ffff888194bf3400 R08: ffffed117b306112 R09: ffffed117b306112 [ +0.000003] R10: ffff888bd983088b R11: ffffed117b306111 R12: 0000000000000000 [ +0.000003] R13: ffff888111f84d00 R14: ffff88810a3943ac R15: ffff888194bf3400 [ +0.000004] FS: 0000000000000000(0000) GS:ffff888bd9800000(0000) knlGS:0000000000000000 [ +0.000003] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000003] CR2: 000056035b208b60 CR3: 000000017795e005 CR4: 00000000007706f0 [ +0.000003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ +0.000003] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ +0.000002] PKRU: 55555554 [ +0.000003] Call Trace: [ +0.000002] <TASK> [ +0.000003] __flush_workqueue+0x203/0x840 [ +0.000006] ? mutex_unlock+0x84/0xd0 [ +0.000008] ? __pfx_mutex_unlock+0x10/0x10 [ +0.000004] ? __pfx___flush_workqueue+0x10/0x10 [ +0.000006] ? mutex_lock+0xa3/0xf0 [ +0.000005] ib_cache_cleanup_one+0x39/0x190 [ib_core] [ +0.000174] __ib_unregister_device+0x84/0xf0 [ib_core] [ +0.000094] ib_unregister_device+0x25/0x30 [ib_core] [ +0.000093] irdma_ib_unregister_device+0x97/0xc0 [irdma] [ +0.000064] ? __pfx_irdma_ib_unregister_device+0x10/0x10 [irdma] [ +0.000059] ? up_write+0x5c/0x90 [ +0.000005] irdma_remove+0x36/0x90 [irdma] [ +0.000062] auxiliary_bus_remove+0x32/0x50 [ +0.000007] device_release_driver_internal+0xfa/0x1c0 [ +0.000005] bus_remove_device+0x18a/0x260 [ +0.000007] device_del+0x2e5/0x650 [ +0.000005] ? __pfx_device_del+0x10/0x10 [ +0.000003] ? mutex_unlock+0x84/0xd0 [ +0.000004] ? __pfx_mutex_unlock+0x10/0x10 [ +0.000004] ? _raw_spin_unlock+0x18/0x40 [ +0.000005] ice_unplug_aux_dev+0x52/0x70 [ice] [ +0.000160] ice_service_task+0x1309/0x14f0 [ice] [ +0.000134] ? __pfx___schedule+0x10/0x10 [ +0.000006] process_one_work+0x3b1/0x6c0 [ +0.000008] worker_thread+0x69/0x670 [ +0.000005] ? __kthread_parkme+0xec/0x110 [ +0.000007] ? __pfx_worker_thread+0x10/0x10 [ +0.000005] kthread+0x17f/0x1b0 [ +0.000005] ? __pfx_kthread+0x10/0x10 [ +0.000004] ret_from_fork+0x29/0x50 [ +0.000009] </TASK> Fixes: 940b61af02f4 ("ice: Initialize PF and setup miscellaneous interrupt") Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Tested-by: Jakub Andrysiak <jakub.andrysiak@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
2023-02-06ice: remove unnecessary virtchnl_ether_addr struct useJacob Keller
The dev_lan_addr and hw_lan_addr members of ice_vf are used only to store the MAC address for the VF. They are defined using virtchnl_ether_addr, but only the .addr sub-member is actually used. Drop the use of virtchnl_ether_addr and just use a u8 array of length [ETH_ALEN]. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: introduce .irq_close VF operationJacob Keller
The Scalable IOV implementation will require notifying the VDCM driver when an IRQ must be closed. This allows the VDCM to handle releasing stale IRQ context values and properly reconfigure. To handle this, introduce a new optional .irq_close callback to the VF operations structure. This will be implemented by Scalable IOV to handle the shutdown of the IRQ context. Since the SR-IOV implementation does not need this, we must check that its non-NULL before calling it. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: introduce clear_reset_state operationJacob Keller
When hardware is reset, the VF relies on the VFGEN_RSTAT register to detect when the VF is finished resetting. This is a tri-state register where 0 indicates a reset is in progress, 1 indicates the hardware is done resetting, and 2 indicates that the software is done resetting. Currently the PF driver relies on the device hardware resetting VFGEN_RSTAT when a global reset occurs. This works ok, but it does mean that the VF might not immediately notice a reset when the driver first detects that the global reset is occurring. This is also problematic for Scalable IOV, because there is no read/write equivalent VFGEN_RSTAT register for the Scalable VSI type. Instead, the Scalable IOV VFs will need to emulate this register. To support this, introduce a new VF operation, clear_reset_state, which is called when the PF driver first detects a global reset. The Single Root IOV implementation can just write to VFGEN_RSTAT to ensure it's cleared immediately, without waiting for the actual hardware reset to begin. The Scalable IOV implementation will use this as part of its tracking of the reset status to allow properly reporting the emulated VFGEN_RSTAT to the VF driver. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: convert vf_ops .vsi_rebuild to .create_vsiJacob Keller
The .vsi_rebuild function exists for ice_reset_vf. It is used to release and re-create the VSI during a single-VF reset. This function is only called when we need to re-create the VSI, and not when rebuilding an existing VSI. This makes the single-VF reset process different from the process used to restore functionality after a hardware reset such as the PF reset or EMP reset. When we add support for Scalable IOV VFs, the implementation will be very similar. The primary difference will be in the fact that each VF type uses a different underlying VSI type in hardware. Move the common functionality into a new ice_vf_recreate VSI function. This will allow the two IOV paths to share this functionality. Rework the .vsi_rebuild vf_op into .create_vsi, only performing the task of creating a new VSI. This creates a nice dichotomy between the ice_vf_rebuild_vsi and ice_vf_recreate_vsi, and should make it more clear why the two flows atre distinct. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: introduce ice_vf_init_host_cfg functionJacob Keller
Introduce a new generic helper ice_vf_init_host_cfg which performs common host configuration initialization tasks that will need to be done for both Single Root IOV and the new Scalable IOV implementation. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: add a function to initialize vf entryJacob Keller
Some of the initialization code for Single Root IOV VFs will need to be reused when we introduce Scalable IOV. Pull this code out into a new ice_initialize_vf_entry helper function. Co-developed-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com> Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: Pull common tasks into ice_vf_post_vsi_rebuildJacob Keller
The Single Root IOV implementation of .post_vsi_rebuild performs some tasks that will ultimately need to be shared with the Scalable IOV implementation such as rebuilding the host configuration. Refactor by introducing a new wrapper function, ice_vf_post_vsi_rebuild which performs the tasks that will be shared between SR-IOV and Scalable IOV. Move the ice_vf_rebuild_host_cfg and ice_vf_set_initialized calls into this wrapper. Then call the implementation specific post_vsi_rebuild handler afterwards. This ensures that we will properly re-initialize filters and expected settings for both SR-IOV and Scalable IOV. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: move ice_vf_vsi_release into ice_vf_lib.cJacob Keller
The ice_vf_vsi_release function will be used in a future change to refactor the .vsi_rebuild function. Move this over to ice_vf_lib.c so that it can be used there. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfgJacob Keller
The ice_vsi_alloc and ice_vsi_cfg functions are used together to allocate and configure a new VSI, called as part of the ice_vsi_setup function. In the future with the addition of the subfunction code the ice driver will want to be able to allocate a VSI while delaying the configuration to a later point of the port activation. Currently this requires that the port code know what type of VSI should be allocated. This is required because ice_vsi_alloc assigns the VSI type. Refactor the ice_vsi_alloc and ice_vsi_cfg functions so that VSI type assignment isn't done until the configuration stage. This will allow the devlink port addition logic to reserve a VSI as early as possible before the type of the port is known. In this way, the port add can fail in the event that all hardware VSI resources are exhausted. Since the ice_vsi_cfg function already takes the ice_vsi_cfg_params structure, this is relatively straight forward. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: refactor VSI setup to use parameter structureJacob Keller
The ice_vsi_setup function, ice_vsi_alloc, and ice_vsi_cfg functions have grown a large number of parameters. These parameters are used to initialize a new VSI, as well as re-configure an existing VSI Any time we want to add a new parameter to this function chain, even if it will usually be unset, we have to change many call sites due to changing the function signature. A future change is going to refactor ice_vsi_alloc and ice_vsi_cfg to move the VSI configuration and initialization all into ice_vsi_cfg. Before this, refactor the VSI setup flow to use a new ice_vsi_cfg_params structure. This will contain the configuration (mainly pointers) used to initialize a VSI. Pass this from ice_vsi_setup into the related functions such as ice_vsi_alloc, ice_vsi_cfg, and ice_vsi_cfg_def. Introduce a helper, ice_vsi_to_params to convert an existing VSI to the parameters used to initialize it. This will aid in the flows where we rebuild an existing VSI. Since we also pass the ICE_VSI_FLAG_INIT to more functions which do not need (or cannot yet have) the VSI parameters, lets make this clear by renaming the function parameter to vsi_flags and using a u32 instead of a signed integer. The name vsi_flags also makes it clear that we may extend the flags in the future. This change will make it easier to refactor the setup flow in the future, and will reduce the complexity required to add a new parameter for configuration in the future. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: drop unnecessary VF parameter from several VSI functionsJacob Keller
The vsi->vf pointer gets assigned early on during ice_vsi_alloc. Several functions currently take a VF pointer, but they can just use the existing vsi->vf pointer as needed. Modify these functions to drop the unnecessary VF parameter. Note that ice_vsi_cfg is not changed as a following change will refactor so that the VF pointer is assigned during ice_vsi_cfg rather than ice_vsi_alloc. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: fix function comment referring to ice_vsi_allocJacob Keller
Since commit 1d2e32275de7 ("ice: split ice_vsi_setup into smaller functions") ice_vsi_alloc has not been responsible for all of the behavior implied by the comment for ice_vsi_setup_vector_base. Fix the comment to refer to the new function ice_vsi_alloc_def(). Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06ice: Add more usage of existing function ice_get_vf_vsi(vf)Brett Creeley
Extend the usage of function ice_get_vf_vsi(vf) in multiple places instead of VF's VSI by using a long string of dereferences (i.e. vf->pf->vsi[vf->lan_vsi_idx]). Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Kalyan Kodamagula <kalyan.kodamagula@intel.com> Tested-by: Piotr Tyda <piotr.tyda@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-06net/sched: taprio: only pass gate mask per TXQ for igc, stmmac, tsnep, am65_cpswVladimir Oltean
There are 2 classes of in-tree drivers currently: - those who act upon struct tc_taprio_sched_entry :: gate_mask as if it holds a bit mask of TXQs - those who act upon the gate_mask as if it holds a bit mask of TCs When it comes to the standard, IEEE 802.1Q-2018 does say this in the second paragraph of section 8.6.8.4 Enhancements for scheduled traffic: | A gate control list associated with each Port contains an ordered list | of gate operations. Each gate operation changes the transmission gate | state for the gate associated with each of the Port's traffic class | queues and allows associated control operations to be scheduled. In typically obtuse language, it refers to a "traffic class queue" rather than a "traffic class" or a "queue". But careful reading of 802.1Q clarifies that "traffic class" and "queue" are in fact synonymous (see 8.6.6 Queuing frames): | A queue in this context is not necessarily a single FIFO data structure. | A queue is a record of all frames of a given traffic class awaiting | transmission on a given Bridge Port. The structure of this record is not | specified. i.o.w. their definition of "queue" isn't the Linux TX queue. The gate_mask really is input into taprio via its UAPI as a mask of traffic classes, but taprio_sched_to_offload() converts it into a TXQ mask. The breakdown of drivers which handle TC_SETUP_QDISC_TAPRIO is: - hellcreek, felix, sja1105: these are DSA switches, it's not even very clear what TXQs correspond to, other than purely software constructs. Only the mqprio configuration with 8 TCs and 1 TXQ per TC makes sense. So it's fine to convert these to a gate mask per TC. - enetc: I have the hardware and can confirm that the gate mask is per TC, and affects all TXQs (BD rings) configured for that priority. - igc: in igc_save_qbv_schedule(), the gate_mask is clearly interpreted to be per-TXQ. - tsnep: Gerhard Engleder clarifies that even though this hardware supports at most 1 TXQ per TC, the TXQ indices may be different from the TC values themselves, and it is the TXQ indices that matter to this hardware. So keep it per-TXQ as well. - stmmac: I have a GMAC datasheet, and in the EST section it does specify that the gate events are per TXQ rather than per TC. - lan966x: again, this is a switch, and while not a DSA one, the way in which it implements lan966x_mqprio_add() - by only allowing num_tc == NUM_PRIO_QUEUES (8) - makes it clear to me that TXQs are a purely software construct here as well. They seem to map 1:1 with TCs. - am65_cpsw: from looking at am65_cpsw_est_set_sched_cmds(), I get the impression that the fetch_allow variable is treated like a prio_mask. This definitely sounds closer to a per-TC gate mask rather than a per-TXQ one, and TI documentation does seem to recomment an identity mapping between TCs and TXQs. However, Roger Quadros would like to do some testing before making changes, so I'm leaving this driver to operate as it did before, for now. Link with more details at the end. Based on this breakdown, we have 5 drivers with a gate mask per TC and 4 with a gate mask per TXQ. So let's make the gate mask per TXQ the opt-in and the gate mask per TC the default. Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add, and query the device driver before calling the proper ndo_setup_tc(), and figure out if it expects one or the other format. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230202003621.2679603-15-vladimir.oltean@nxp.com/#25193204 Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Cc: Siddharth Vadapalli <s-vadapalli@ti.com> Cc: Roger Quadros <rogerq@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-06net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.hVladimir Oltean
Since mqprio is a scheduler and not a classifier, move its offload structure to pkt_sched.h, where struct tc_taprio_qopt_offload also lies. Also update some header inclusions in drivers that access this structure, to the best of my abilities. Cc: Igor Russkikh <irusskikh@marvell.com> Cc: Yisen Zhuang <yisen.zhuang@huawei.com> Cc: Salil Mehta <salil.mehta@huawei.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Horatiu Vultur <horatiu.vultur@microchip.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: Daniel Machon <daniel.machon@microchip.com> Cc: UNGLinuxDriver@microchip.com Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03ice: implement devlink reinit actionMichal Swiatkowski
Call ice_unload() and ice_load() in driver reinit flow. Block reinit when switchdev, ADQ or SRIOV is active. In reload path we don't want to rebuild all features. Ask user to remove them instead of quitely removing it in reload path. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-03ice: update VSI instead of init in some caseMichal Swiatkowski
ice_vsi_cfg() is called from different contexts: 1) VSI exsist in HW, but it is reconfigured, because of changing queues for example -> update instead of init should be used 2) VSI doesn't exsist, because rest has happened -> init command should be sent To support both cases pass boolean value which will store information what type of command has to be sent to HW. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-03ice: move VSI delete outside deconfigMichal Swiatkowski
In deconfig VSI shouldn't be deleted from hw. Rewrite VSI delete function to reflect that sometimes it is only needed to remove VSI from hw without freeing the memory: ice_vsi_delete() -> delete from HW and free memory ice_vsi_delete_from_hw() -> delete only from HW Value returned from ice_vsi_free() is never used. Change return type to void. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-03ice: sync netdev filters after clearing VSIMichal Swiatkowski
In driver reload path the netdev isn't removed, but VSI is. Remove filters on netdev right after removing them on VSI. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-03ice: split probe into smaller functionsMichal Swiatkowski
Part of code from probe can be reused in reload flow. Move this code to separate function. Create unroll functions for each part of initialization, like: ice_init_dev() and ice_deinit_dev(). It simplifies unrolling and can be used in remove flow. Avoid freeing port info as it could be reused in reload path. Will be freed in remove path since is allocated via devm_kzalloc(). Also clean the remove path to reflect the init steps. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-03ice: stop hard coding the ICE_VSI_CTRL locationJacob Keller
When allocating the ICE_VSI_CTRL, the allocated struct ice_vsi pointer is stored into the PF's pf->vsi array at a fixed location. This was historically done on the basis that it could provide an O(1) lookup for the special control VSI. Since we store the ctrl_vsi_idx, we already have O(1) lookup regardless of where in the array we store this VSI. Simplify the logic in ice_vsi_alloc by using the same method of storing the control VSI as other types of VSIs. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-03ice: split ice_vsi_setup into smaller functionsMichal Swiatkowski
Main goal is to reuse the same functions in VSI config and rebuild paths. To do this split ice_vsi_setup into smaller pieces and reuse it during rebuild. ice_vsi_alloc() should only alloc memory, not set the default values for VSI. Move setting defaults to separate function. This will allow config of already allocated VSI, for example in reload path. The path is mostly moving code around without introducing new functionality. Functions ice_vsi_cfg() and ice_vsi_decfg() were added, but they are using code that already exist. Use flag to pass information about VSI initialization during rebuild instead of using boolean value. Co-developed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-03ice: cleanup in VSI config/deconfig codeMichal Swiatkowski
Do few small cleanups: 1) Rename the function to reflect that it doesn't configure all things related to VSI. ice_vsi_cfg_lan() better fits to what function is doing. ice_vsi_cfg() can be use to name function that will configure whole VSI. 2) Remove unused ethtype field from VSI. There is no need to set ethtype here, because it is never used. 3) Remove unnecessary check for ICE_VSI_CHNL. There is check for ICE_VSI_CHNL in ice_vsi_get_qs, so there is no need to check it before calling the function. 4) Simplify ice_vsi_alloc() call. There is no need to check the type of VSI before calling ice_vsi_alloc(). For ICE_VSI_CHNL vf is always NULL (ice_vsi_setup() is called with vf=NULL). For ICE_VSI_VF or ICE_VSI_CTRL ch is always NULL and for other VSI types ch and vf are always NULL. 5) Remove unnecessary call to ice_vsi_dis_irq(). ice_vsi_dis_irq() will be called in ice_vsi_close() flow (ice_vsi_close() -> ice_vsi_down() -> ice_vsi_dis_irq()). Remove unnecessary call. 6) Don't remove specific filters in release. All hw filters are removed in ice_fltr_remove_alli(), which is always called in VSI release flow. There is no need to remove only ethertype filters before calling ice_fltr_remove_all(). 7) Rename ice_vsi_clear() to ice_vsi_free(). As ice_vsi_clear() only free memory allocated in ice_vsi_alloc() rename it to ice_vsi_free() which better shows what function is doing. 8) Free coalesce param in rebuild. There is potential memory leak if configuration of VSI lan fails. Free coalesce to avoid it. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-02-03ice: alloc id for RDMA using xa_arrayMichal Swiatkowski
Use xa_array instead of deprecated ida to alloc id for RDMA aux driver. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>