path: root/drivers/net/ethernet/ibm
Age | Commit message | Author
2023-04-05 | mm, treewide: redefine MAX_ORDER sanely | Kirill A. Shutemov
MAX_ORDER is currently defined as the number of orders the page allocator supports: a user can ask the buddy allocator for a page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and has led to a number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders a user can ask from the buddy allocator is now 0..MAX_ORDER. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
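The off-by-one hazard described in this commit can be illustrated with a small userspace sketch. The constants `OLD_MAX_ORDER` and `NEW_MAX_ORDER` and all helper names are hypothetical and exist only to model the two conventions; the values 11 and 10 are assumed here as typical defaults, not taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants modelling the two conventions. */
#define OLD_MAX_ORDER 11 /* exclusive: valid orders are 0..OLD_MAX_ORDER-1 */
#define NEW_MAX_ORDER 10 /* inclusive: valid orders are 0..NEW_MAX_ORDER */

/* Old convention: the bound itself is NOT a valid order. */
static inline bool order_valid_old(unsigned int order)
{
	return order < OLD_MAX_ORDER;
}

/* New convention: the bound itself IS a valid order. */
static inline bool order_valid_new(unsigned int order)
{
	return order <= NEW_MAX_ORDER;
}

/* Number of per-order free lists needed under each convention. */
static inline unsigned int nr_orders_old(void)
{
	return OLD_MAX_ORDER;
}

static inline unsigned int nr_orders_new(void)
{
	return NEW_MAX_ORDER + 1;
}

/* Pages in one allocation of the given order (new convention). */
static inline unsigned long pages_for_order(unsigned int order)
{
	assert(order_valid_new(order));
	return 1UL << order;
}
```

Under these assumed values both conventions describe the same set of orders; only the meaning of the constant changes, which is why array sizes become `MAX_ORDER + 1` and loops become `order <= MAX_ORDER`.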
2023-03-16 | net: Use of_property_read_bool() for boolean properties | Rob Herring
It is preferred to use typed property access functions (i.e. of_property_read_<type> functions) rather than low-level of_get_property/of_find_property functions for reading properties. Convert reading boolean properties to of_property_read_bool(). Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can Acked-by: Kalle Valo <kvalo@kernel.org> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Reviewed-by: Wei Fang <wei.fang@nxp.com> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-24 | ibmvnic: Assign XPS map to correct queue index | Nick Child
When setting the XPS map value for TX queues, use the index of the transmit queue. Previously, the function was passing the index of the loop that iterates over all queues (RX and TX). This was causing invalid XPS map values. Fixes: 6831582937bd ("ibmvnic: Toggle between queue types in affinity mapping") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://lore.kernel.org/r/20230223153944.44969-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31 | ibmvnic: Toggle between queue types in affinity mapping | Nick Child
Previously, ibmvnic IRQs were assigned to CPU numbers by assigning all the IRQs for transmit queues then assigning all the IRQs for receive queues. With multi-threaded processors, in a heavy RX or TX environment, physical cores would either be overloaded or underutilized (due to the IRQ assignment algorithm). This approach is sub-optimal because IRQs for the same subprocess (RX or TX) would be bound to adjacent CPU numbers, meaning they were more likely to be contending for the same core.

For example, in a system with 64 CPUs and 32 queues, the IRQs would be bound to CPUs in the following pattern:

IRQ type | CPU number
-----------------------
TX0      | 0-1
TX1      | 2-3
<etc>
RX0      | 32-33
RX1      | 34-35
<etc>

Observe that in SMT-8, the first 4 TX queues would be sharing the same core. A more optimal algorithm would balance the number of RX and TX IRQs across the physical cores. Therefore, to increase performance, distribute RX and TX IRQs across cores by alternating between assigning IRQs for RX and TX queues to CPUs. With a system with 64 CPUs and 32 queues, this results in the following pattern:

IRQ type | CPU number
-----------------------
TX0      | 0-1
RX0      | 2-3
TX1      | 4-5
RX1      | 6-7
<etc>

Observe that in SMT-8, there is equal distribution of RX and TX IRQs per core. In the above case, each core handles 2 TX and 2 RX IRQs.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
Link: https://lore.kernel.org/r/20230127214358.318152-1-nnac123@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
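The alternating assignment can be sketched in plain userspace C. This is a simplified model, not the driver code: `struct irq_slot`, `assign_alternating()` and `count_in_core()` are hypothetical names, and the sketch assumes the CPU count divides evenly among the queues. It keeps a separate index per queue type, so a TX slot always carries its own TX queue number rather than the position in the combined loop.

```c
#include <assert.h>

enum queue_type { QUEUE_TX, QUEUE_RX };

struct irq_slot {
	enum queue_type type; /* which kind of queue owns this IRQ */
	int queue_idx;        /* index within its own type (TX0, RX0, ...) */
	int first_cpu;        /* first CPU of the affinity range */
	int last_cpu;         /* last CPU of the affinity range (inclusive) */
};

/*
 * Fill slots[] in alternating TX/RX order and give each slot an equal,
 * contiguous range of CPUs. Returns the number of slots written.
 * Assumes slots[] has room for num_tx + num_rx entries.
 */
static inline int assign_alternating(struct irq_slot *slots, int num_tx,
				     int num_rx, int num_cpus)
{
	int total = num_tx + num_rx;
	int i_tx = 0, i_rx = 0, cpu = 0, n = 0, turn_tx = 1;
	int stride;

	assert(total > 0 && num_cpus >= total);
	stride = num_cpus / total;

	while (n < total) {
		struct irq_slot *s = &slots[n++];

		/* Take TX on its turn, or whenever RX queues are exhausted. */
		if ((turn_tx && i_tx < num_tx) || i_rx >= num_rx) {
			s->type = QUEUE_TX;
			s->queue_idx = i_tx++;
		} else {
			s->type = QUEUE_RX;
			s->queue_idx = i_rx++;
		}
		s->first_cpu = cpu;
		s->last_cpu = cpu + stride - 1;
		cpu += stride;
		turn_tx = !turn_tx;
	}
	return n;
}

/* Count slots of one type whose CPU range starts on the given core. */
static inline int count_in_core(const struct irq_slot *slots, int n,
				enum queue_type type, int core,
				int threads_per_core)
{
	int i, count = 0;

	for (i = 0; i < n; i++)
		if (slots[i].type == type &&
		    slots[i].first_cpu / threads_per_core == core)
			count++;
	return count;
}
```

With 16 TX queues, 16 RX queues and 64 CPUs, the first four slots come out as TX0, RX0, TX1, RX1 on CPUs 0-1, 2-3, 4-5 and 6-7, matching the second table in the commit message.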
2022-11-14 | ibmvnic: Update XPS assignments during affinity binding | Nick Child
Transmit Packet Steering (XPS) maps CPU numbers to transmit queues. By running the same connection on the same set of CPUs, contention for the queue and cache miss rate can be minimized. When assigning a CPU mask for a transmit queue's IRQ number, assign the same CPU mask as the set of CPUs that XPS should use for that queue. Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 | ibmvnic: Add hotpluggable CPU callbacks to reassign affinity hints | Nick Child
When CPU's are added and removed, ibmvnic devices will reassign hint values. Introduce a new cpu hotplug state CPUHP_IBMVNIC_DEAD to signal to ibmvnic devices that the CPU has been removed and it is time to reset affinity hint assignments. On the other hand, when CPU's are being added, add a state instance to CPUHP_AP_ONLINE_DYN which will trigger a reassignment of affinity hints once the new CPU's are online. This implementation is based on the virtio_net driver. Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 | ibmvnic: Assign IRQ affinity hints to device queues | Nick Child
Assign affinity hints to ibmvnic device queue interrupts. Affinity hints are assigned and removed during sub-crq init and teardown, respectively. This update should improve latency if utilized as interrupt lines and processing are more equally distributed among CPU's. This implementation is based on the virtio_net driver. Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-10 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
drivers/net/can/pch_can.c ae64438be192 ("can: dev: fix skb drop check") 1dd1b521be85 ("can: remove obsolete PCH CAN driver") https://lore.kernel.org/all/20221110102509.1f7d63cc@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-08 | ibmveth: Reduce default tx queues to 8 | Nick Child
Previously, the default number of transmit queues was 16. Due to resource concerns, set to 8 queues instead. Still allow the user to set more queues (max 16) if they like. Since the driver is virtualized away from the physical NIC, the purpose of multiple queues is purely to allow for parallel calls to the hypervisor. Therefore, there is no noticeable effect on performance by reducing queue count to 8. Fixes: d926793c1de9 ("ibmveth: Implement multi queue on xmit") Reported-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://lore.kernel.org/r/20221107203215.58206-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-02 | ibmvnic: Free rwi on reset success | Nick Child
Free the rwi structure in the event that the last rwi in the list processed successfully. The logic in commit 4f408e1fa6e1 ("ibmvnic: retry reset if there are no other resets") introduces an issue that results in a 32 byte memory leak whenever the last rwi in the list gets processed. Fixes: 4f408e1fa6e1 ("ibmvnic: retry reset if there are no other resets") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://lore.kernel.org/r/20221031150642.13356-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c 2871edb32f46 ("can: kvaser_usb: Fix possible completions during init_completion") abb8670938b2 ("can: kvaser_usb_leaf: Ignore stale bus-off after start") 8d21f5927ae6 ("can: kvaser_usb_leaf: Fix improved state not being reported") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27 | net: ehea: fix possible memory leak in ehea_register_port() | Yang Yingliang
If of_device_register() returns an error, the of node and the name allocated in dev_set_name() are leaked. Call put_device() to give up the reference that was set in device_initialize(), so that the of node is put in logical_port_release() and the name is freed in kobject_cleanup(). Fixes: 1acf2318dd13 ("ehea: dynamic add / remove port") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20221025130011.1071357-1-yangyingliang@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-24 | ibmveth: Always stop tx queues during close | Nick Child
netif_stop_all_queues must be called before calling H_FREE_LOGICAL_LAN. As a result, we can remove the pool_config field from the ibmveth adapter structure. Some device configuration changes call ibmveth_close in order to free the current resources held by the device. These functions then make their changes and call ibmveth_open to reallocate and reserve resources for the device. Prior to this commit, the flag pool_config was used to tell ibmveth_close that it should not halt the transmit queue. pool_config was introduced in commit 860f242eb534 ("[PATCH] ibmveth change buffer pools dynamically") to avoid interrupting the tx flow when making rx config changes. Since then, other commits adopted this approach, even if making tx config changes. The issue with this approach was that the hypervisor freed all of the device's control structures after the hcall H_FREE_LOGICAL_LAN was performed but the transmit queues were never stopped. So the higher layers in the network stack would continue transmission but any H_SEND_LOGICAL_LAN hcall would fail with H_PARAMETER until the hypervisor's structures for the device were allocated with the H_REGISTER_LOGICAL_LAN hcall in ibmveth_open. This resulted in no real networking harm but did cause several of these error messages to be logged: "h_send_logical_lan failed with rc=-4" So, instead of trying to keep the transmit queues alive during network configuration changes, just stop the queues, make necessary changes then restart the queues. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30 | ibmveth: Ethtool set queue support | Nick Child
Implement channel management functions to allow dynamic addition and removal of transmit queues. The `ethtool --show-channels` and `ethtool --set-channels` commands can be used to get and set the number of queues, respectively. Allow the ability to add as many transmit queues as available processors but never allow more than the hard maximum of 16. The number of receive queues is one and cannot be modified. Depending on whether the requested number of queues is larger or smaller than the current value, either allocate or free long term buffers. Since long term buffer construction and destruction can occur in two different areas, from either channel set requests or device open/close, define functions for performing this work. If allocation of a new buffer fails, then attempt to revert back to the previous number of queues. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
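The queue-count rule in this commit (as many transmit queues as available processors, never more than 16, exactly one receive queue) can be expressed as a small validation helper. This is a userspace sketch only: the names `SKETCH_MAX_TX_QUEUES`, `sketch_max_tx_queues()` and `sketch_check_channels()` are hypothetical, and returning `-EINVAL` for a rejected request is an assumption rather than something the commit message states.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_MAX_TX_QUEUES 16 /* hard maximum stated in the commit message */

/* Upper bound on tx queues: one per available CPU, capped at 16. */
static inline unsigned int sketch_max_tx_queues(unsigned int online_cpus)
{
	assert(online_cpus > 0);
	return online_cpus < SKETCH_MAX_TX_QUEUES ? online_cpus
						   : SKETCH_MAX_TX_QUEUES;
}

/*
 * Validate a --set-channels style request.
 * Returns 0 if acceptable, -EINVAL otherwise (assumed error code).
 */
static inline int sketch_check_channels(unsigned int tx_req,
					unsigned int rx_req,
					unsigned int online_cpus)
{
	/* The number of receive queues is one and cannot be modified. */
	if (rx_req != 1)
		return -EINVAL;
	if (tx_req == 0 || tx_req > sketch_max_tx_queues(online_cpus))
		return -EINVAL;
	return 0;
}
```

On a 4-CPU partition this model accepts at most 4 transmit queues; on a 64-CPU partition the cap of 16 applies instead.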
2022-09-30 | ibmveth: Implement multi queue on xmit | Nick Child
The `ndo_start_xmit` function is protected by a spinlock on the tx queue being used to transmit the skb. Allow concurrent calls to `ndo_start_xmit` by using more than one tx queue. This allows for greater throughput when several jobs are trying to transmit data. Introduce 16 tx queues (leave single rx queue as is) which each correspond to one DMA mapped long term buffer. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30 | ibmveth: Copy tx skbs into a premapped buffer | Nick Child
Rather than DMA mapping and unmapping every outgoing skb, copy the skb into a buffer that was mapped during the driver's open function. Copying the skb and its frags has proven to be more time efficient than mapping and unmapping. As an effect, performance increases by 3-5 Gbits/s. Allocate and DMA map one continuous 64KB buffer at `ndo_open`. This buffer is maintained until `ibmveth_close` is called. This buffer is large enough to hold the largest possible linear skb. During `ndo_start_xmit`, copy the skb and all of its frags into the continuous buffer. By manually linearizing all the socket buffers, time is saved during memcpy as well as more efficient handling in FW. As a result, we no longer need to worry about the firmware limitation of handling a max of 6 frags. So, we only need to maintain 1 descriptor instead of 6 and can hardcode 0 for the other 5 descriptors during h_send_logical_lan. Since DMA allocation/mapping issues can no longer arise in xmit functions, we can further reduce code size by removing the need for a bounce buffer on DMA errors. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
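The copy-and-linearize step can be modelled in userspace as follows. This is a sketch, not the driver's code: `struct sketch_skb`, `struct sketch_frag` and `sketch_linearize()` are hypothetical stand-ins for the kernel's skb and fragment structures, and the premapped DMA buffer is represented by an ordinary byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_TX_BUF_SIZE (64 * 1024) /* one continuous 64KB buffer */

/* Hypothetical stand-in for one paged fragment of an skb. */
struct sketch_frag {
	const void *data;
	size_t len;
};

/* Hypothetical stand-in for an skb: a linear head plus fragments. */
struct sketch_skb {
	const void *head;
	size_t head_len;
	size_t nr_frags;
	const struct sketch_frag *frags;
};

/*
 * Copy the head and every fragment back to back into buf.
 * Returns the total number of bytes copied, or 0 if the packet does
 * not fit (nothing is copied in that case).
 */
static inline size_t sketch_linearize(unsigned char *buf, size_t buf_size,
				      const struct sketch_skb *skb)
{
	size_t total = skb->head_len;
	size_t off, i;

	assert(buf && skb);

	/* Compute the full length first so a failed copy leaves buf alone. */
	for (i = 0; i < skb->nr_frags; i++)
		total += skb->frags[i].len;
	if (total > buf_size)
		return 0;

	memcpy(buf, skb->head, skb->head_len);
	off = skb->head_len;
	for (i = 0; i < skb->nr_frags; i++) {
		memcpy(buf + off, skb->frags[i].data, skb->frags[i].len);
		off += skb->frags[i].len;
	}
	return total;
}
```

Because the result is one contiguous region, a single descriptor (address plus length) is enough to describe the packet, which is the simplification the commit message relies on.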
2022-09-28 | net: drop the weight argument from netif_napi_add | Jakub Kicinski
We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight argument, those who really need to tweak the weight can use netif_napi_add_weight(). Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-21 | net: ibm: emac: Switch to use dev_err_probe() helper | Yang Yingliang
dev_err() can be replaced with dev_err_probe(), which checks whether the error code is -EPROBE_DEFER. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-31 | net: ethernet: move from strlcpy with unused retval to strscpy | Wolfram Sang
Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Petr Machata <petrm@nvidia.com> # For drivers/net/ethernet/mellanox/mlxsw Acked-by: Geoff Levand <geoff@infradead.org> # For ps3_gelic_net and spider_net_ethtool Acked-by: Tom Lendacky <thomas.lendacky@amd.com> # For drivers/net/ethernet/amd/xgbe/xgbe-ethtool.c Acked-by: Marcin Wojtas <mw@semihalf.com> # For drivers/net/ethernet/marvell/mvpp2 Reviewed-by: Leon Romanovsky <leonro@nvidia.com> # For drivers/net/ethernet/mellanox/mlx{4|5} Reviewed-by: Shay Agroskin <shayagr@amazon.com> # For drivers/net/ethernet/amazon/ena Acked-by: Krzysztof Hałasa <khalasa@piap.pl> # For IXP4xx Ethernet Link: https://lore.kernel.org/r/20220830201457.7984-3-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
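The reason the conversion is 1:1 only when the return value is unused can be seen by comparing the two contracts. The functions below are userspace re-implementations written for illustration, with hypothetical `sketch_` names; they model the documented behaviour (strlcpy returns the length of the source, strscpy returns the number of characters copied or -E2BIG on truncation) and are not the kernel's implementations.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h> /* ssize_t */

/* Models strlcpy: always returns strlen(src), even when truncating. */
static inline size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst && src);
	len = strlen(src); /* reads all of src, however long it is */
	if (size) {
		size_t n = len >= size ? size - 1 : len;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Models strscpy: returns chars copied, or -E2BIG if src did not fit. */
static inline ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	assert(dst && src);
	if (size == 0)
		return -E2BIG;
	/* Never reads more than size bytes of src. */
	for (i = 0; i < size - 1 && src[i]; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] ? -E2BIG : (ssize_t)i;
}
```

Both helpers leave the same bytes in the destination; they differ only in what they return and in how far they read the source, so callers that ignore the return value can be converted mechanically.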
2022-07-07 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-04 | ibmvnic: Properly dispose of all skbs during a failover. | Rick Lindsley
During a reset, there may have been transmits in flight that are no longer valid and cannot be fulfilled. Resetting and clearing the queues is insufficient; each skb also needs to be explicitly freed so that upper levels are not left waiting for confirmation of a transmit that will never happen. If this happens frequently enough, the apparent backlog will cause TCP to begin "congestion control" unnecessarily, culminating in permanently decreased throughput. Fixes: d7c0ef36bde03 ("ibmvnic: Free and re-allocate scrqs when tx/rx scrqs change") Tested-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Rick Lindsley <ricklind@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-02 | net: add skb_[inner_]tcp_all_headers helpers | Eric Dumazet
Most drivers use "skb_transport_offset(skb) + tcp_hdrlen(skb)" to compute headers length for a TCP packet, but others use more convoluted (but equivalent) ways. Add skb_tcp_all_headers() and skb_inner_tcp_all_headers() helpers to harmonize this a bit. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
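The computation these helpers wrap is a single addition, shown here as a userspace sketch. `struct sketch_pkt` and the `sketch_` functions are hypothetical stand-ins for the skb accessors; the only facts assumed beyond the commit message are that the TCP header length is the data-offset field times four bytes and that the field ranges from 5 to 15.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields the real helpers read from an skb. */
struct sketch_pkt {
	unsigned int transport_offset; /* bytes from packet start to TCP header */
	uint8_t tcp_doff;              /* TCP data offset, in 32-bit words */
};

/* TCP header length in bytes, including options. */
static inline unsigned int sketch_tcp_hdrlen(const struct sketch_pkt *p)
{
	assert(p->tcp_doff >= 5 && p->tcp_doff <= 15);
	return p->tcp_doff * 4u;
}

/* Length of all headers up to and including TCP. */
static inline unsigned int sketch_tcp_all_headers(const struct sketch_pkt *p)
{
	return p->transport_offset + sketch_tcp_hdrlen(p);
}
```

For an untagged Ethernet frame carrying IPv4 without options, the transport offset is 14 + 20 = 34 bytes, so a TCP header without options gives 34 + 20 = 54 bytes of headers.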
2022-05-08 | eth: switch to netif_napi_add_weight() | Jakub Kicinski
Switch all Ethernet drivers which use custom napi weights to the new API. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-05 | net: ethernet: Prepare cleanup of powerpc's asm/prom.h | Christophe Leroy
powerpc's asm/prom.h includes some headers that it doesn't need itself. In order to clean powerpc's asm/prom.h up in a further step, first clean all files that include asm/prom.h. Some files don't need asm/prom.h at all. For those ones, just remove inclusion of asm/prom.h. Some files don't need any of the items provided by asm/prom.h, but need some of the headers included by asm/prom.h. For those ones, add the needed headers that are brought by asm/prom.h at the moment and remove asm/prom.h. Some files really need asm/prom.h but also need some of the headers included by asm/prom.h. For those ones, leave asm/prom.h but also add the needed headers so that they can be removed from asm/prom.h in a later step. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/r/09a13d592d628de95d30943e59b2170af5b48110.1651663857.git.christophe.leroy@csgroup.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-28 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
include/linux/netdevice.h net/core/dev.c 6510ea973d8d ("net: Use this_cpu_inc() to increment net->core_stats") 794c24e9921f ("net-core: rx_otherhost_dropped to core_stats") https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/ drivers/net/wan/cosa.c d48fea8401cf ("net: cosa: fix error check return value of register_chrdev()") 89fbca3307d4 ("net: wan: remove support for COSA and SRP synchronous serial boards") https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-28 | Revert "ibmvnic: Add ethtool private flag for driver-defined queue limits" | Dany Madden
This reverts commit 723ad916134784b317b72f3f6cf0f7ba774e5dae. When a client requests a channel or ring size larger than what the server can support, the server will cap the request to the supported max. So, the client would not be able to successfully request resources that exceed the server limit. Fixes: 723ad9161347 ("ibmvnic: Add ethtool private flag for driver-defined queue limits") Signed-off-by: Dany Madden <drt@linux.ibm.com> Link: https://lore.kernel.org/r/20220427235146.23189-1-drt@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15 | ibmvnic: Allow multiple ltbs in txpool ltb_set | Sukadev Bhattiprolu
Allow multiple LTBs in the txpool's ltb_set, i.e., rather than using a single large LTB, use several smaller LTBs. The first n-1 LTBs will all be of the same size. The size of the last LTB in the set depends on the number of buffers and buffer (mtu) size. This strategy hopefully allows more reuse of the initial LTBs and also reduces the chances of an allocation failure (of the large LTB) when the system is low in memory. Suggested-by: Brian King <brking@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15 | ibmvnic: Allow multiple ltbs in rxpool ltb_set | Sukadev Bhattiprolu
Allow multiple LTBs in the rxpool's ltb_set. The first n-1 LTBs will all be of the same size. The size of the last LTB in the set depends on the number of buffers and buffer (mtu) size. Having a set of LTBs per pool provides a couple of benefits. First, with the current value of IBMVNIC_MAX_LTB_SIZE of 16MB, with an MTU of 9000, we need a LTB (DMA buffer) of that size but the allocation can fail in low memory conditions. With a set of LTBs per pool, we can use several smaller (8MB) LTBs and hopefully have fewer allocation failures. (See also comments in ibmvnic.h on the trade-off with smaller LTBs.) Second, since the kernel limits the size of the DMA buffer to 16MB (based on MAX_ORDER), with a single DMA buffer per pool, the pool is also limited to 16MB. This in turn limits the number of buffers per pool to 1763 when MTU is 9000. With a set of LTBs per pool, we can have up to the max of 4096 buffers per pool even when MTU is 9000. Suggested-by: Brian King <brking@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
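The sizing of an LTB set and the mapping from a buffer index to an (LTB, offset) pair can be sketched as below. This is a simplified userspace model: `struct sketch_ltb_set` and both functions are hypothetical, and the sketch assumes that buffers never straddle two LTBs and that the first n-1 LTBs are allocated at the full per-LTB size. The driver's actual structures and rounding rules may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one pool's set of long term buffers. */
struct sketch_ltb_set {
	size_t buf_size;      /* size of one packet buffer */
	size_t num_bufs;      /* buffers in the whole pool */
	size_t bufs_per_ltb;  /* whole buffers that fit in one full LTB */
	size_t num_ltbs;      /* LTBs needed to hold num_bufs buffers */
	size_t ltb_size;      /* size of each of the first num_ltbs-1 LTBs */
	size_t last_ltb_size; /* size of the final LTB */
};

static inline void sketch_ltb_set_init(struct sketch_ltb_set *set,
				       size_t num_bufs, size_t buf_size,
				       size_t ltb_size)
{
	size_t rem;

	assert(num_bufs > 0 && buf_size > 0 && ltb_size >= buf_size);

	set->buf_size = buf_size;
	set->num_bufs = num_bufs;
	set->ltb_size = ltb_size;
	set->bufs_per_ltb = ltb_size / buf_size;
	/* Round up so a partial final LTB is counted. */
	set->num_ltbs = (num_bufs + set->bufs_per_ltb - 1) / set->bufs_per_ltb;
	/* The last LTB only needs room for the leftover buffers. */
	rem = num_bufs % set->bufs_per_ltb;
	set->last_ltb_size = rem ? rem * buf_size : ltb_size;
}

/* Map a pool-wide buffer index to its LTB and the byte offset inside it. */
static inline void sketch_map_buf_to_ltb(const struct sketch_ltb_set *set,
					 size_t bufidx, size_t *ltb_idx,
					 size_t *offset)
{
	assert(bufidx < set->num_bufs);
	*ltb_idx = bufidx / set->bufs_per_ltb;
	*offset = (bufidx % set->bufs_per_ltb) * set->buf_size;
}
```

With a single LTB in the set the mapping degenerates to LTB 0 and `bufidx * buf_size`, which is the trivial case the earlier helper-introducing commits describe.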
2022-04-15 | ibmvnic: convert rxpool ltb to a set of ltbs | Sukadev Bhattiprolu
Define and use interfaces that treat the long term buffer (LTB) of an rxpool as a set of LTBs rather than a single LTB. The set only has one LTB for now. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15 | ibmvnic: define map_txpool_buf_to_ltb() | Sukadev Bhattiprolu
Define a helper to map a given txpool buffer into its corresponding long term buffer (LTB) and offset. Currently there is just one LTB per txpool so the mapping is trivial. When we add support for multiple LTBs per txpool, this helper will be more useful. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15 | ibmvnic: define map_rxpool_buf_to_ltb() | Sukadev Bhattiprolu
Define a helper to map a given rx pool buffer into its corresponding long term buffer (LTB) and offset. Currently there is just one LTB per pool so the mapping is trivial. When we add support for multiple LTBs per pool, this helper will be more useful. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15 | ibmvnic: rename local variable index to bufidx | Sukadev Bhattiprolu
The local variable 'index' is heavily used in some functions and is confusing with the presence of other "index" fields like pool->index, ->consumer_index, etc. Rename it to bufidx to better reflect that it's the index of a buffer in the pool. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-23 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
Merge in overtime fixes, no conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-18 | ibmvnic: fix race between xmit and reset | Sukadev Bhattiprolu
There is a race between reset and the transmit paths that can lead to ibmvnic_xmit() accessing an scrq after it has been freed in the reset path. It can result in a crash like:

	Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
	BUG: Kernel NULL pointer dereference on read at 0x00000000
	Faulting instruction address: 0xc0080000016189f8
	Oops: Kernel access of bad area, sig: 11 [#1]
	...
	NIP [c0080000016189f8] ibmvnic_xmit+0x60/0xb60 [ibmvnic]
	LR [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
	Call Trace:
	[c008000001618f08] ibmvnic_xmit+0x570/0xb60 [ibmvnic] (unreliable)
	[c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
	[c000000000c9cfcc] sch_direct_xmit+0xec/0x330
	[c000000000bfe640] __dev_xmit_skb+0x3a0/0x9d0
	[c000000000c00ad4] __dev_queue_xmit+0x394/0x730
	[c008000002db813c] __bond_start_xmit+0x254/0x450 [bonding]
	[c008000002db8378] bond_start_xmit+0x40/0xc0 [bonding]
	[c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
	[c000000000c00ca4] __dev_queue_xmit+0x564/0x730
	[c000000000cf97e0] neigh_hh_output+0xd0/0x180
	[c000000000cfa69c] ip_finish_output2+0x31c/0x5c0
	[c000000000cfd244] __ip_queue_xmit+0x194/0x4f0
	[c000000000d2a3c4] __tcp_transmit_skb+0x434/0x9b0
	[c000000000d2d1e0] __tcp_retransmit_skb+0x1d0/0x6a0
	[c000000000d2d984] tcp_retransmit_skb+0x34/0x130
	[c000000000d310e8] tcp_retransmit_timer+0x388/0x6d0
	[c000000000d315ec] tcp_write_timer_handler+0x1bc/0x330
	[c000000000d317bc] tcp_write_timer+0x5c/0x200
	[c000000000243270] call_timer_fn+0x50/0x1c0
	[c000000000243704] __run_timers.part.0+0x324/0x460
	[c000000000243894] run_timer_softirq+0x54/0xa0
	[c000000000ea713c] __do_softirq+0x15c/0x3e0
	[c000000000166258] __irq_exit_rcu+0x158/0x190
	[c000000000166420] irq_exit+0x20/0x40
	[c00000000002853c] timer_interrupt+0x14c/0x2b0
	[c000000000009a00] decrementer_common_virt+0x210/0x220
	--- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c

The immediate cause of the crash is the access of tx_scrq in the following snippet during a reset, where the tx_scrq can be either NULL or an address that will soon be invalid:

	ibmvnic_xmit()
	{
		...
		tx_scrq = adapter->tx_scrq[queue_num];
		txq = netdev_get_tx_queue(netdev, queue_num);
		ind_bufp = &tx_scrq->ind_buf;

		if (test_bit(0, &adapter->resetting)) {
		...
	}

But beyond that, the call to ibmvnic_xmit() itself is not safe during a reset and the reset path attempts to avoid this by stopping the queue in ibmvnic_cleanup(). However just after the queue was stopped, an in-flight ibmvnic_complete_tx() could have restarted the queue even as the reset is progressing.

Since the queue was restarted we could get a call to ibmvnic_xmit() which can then access the bad tx_scrq (or other fields).

We cannot however simply have ibmvnic_complete_tx() check the ->resetting bit and skip starting the queue. This can race at the "back-end" of a good reset which just restarted the queue but has not cleared the ->resetting bit yet. If we skip restarting the queue due to ->resetting being true, the queue would remain stopped indefinitely potentially leading to transmit timeouts.

IOW ->resetting is too broad for this purpose. Instead use a new flag that indicates whether or not the queues are active. Only the open/reset paths control when the queues are active. ibmvnic_complete_tx() and others wake up the queue only if the queue is marked active.

So we will have:

	A. reset/open thread in ibmvnic_cleanup() and __ibmvnic_open()

		->resetting = true
		->tx_queues_active = false
		disable tx queues
		...
		->tx_queues_active = true
		start tx queues

	B. Tx interrupt in ibmvnic_complete_tx():

		if (->tx_queues_active)
			netif_wake_subqueue();

To ensure that ->tx_queues_active and state of the queues are consistent, we need a lock which:

	- must also be taken in the interrupt path (ibmvnic_complete_tx())
	- shared across the multiple queues in the adapter (so they don't become serialized)

Use rcu_read_lock() and have the reset thread synchronize_rcu() after updating the ->tx_queues_active state.

While here, consolidate a few boolean fields in ibmvnic_adapter for better alignment.

Based on discussions with Brian King and Dany Madden.

Fixes: 7ed5b31f4a66 ("net/ibmvnic: prevent more than one thread from running in reset")
Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-03 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
net/batman-adv/hard-interface.c commit 690bb6fb64f5 ("batman-adv: Request iflink once in batadv-on-batadv check") commit 6ee3c393eeb7 ("batman-adv: Demote batadv-on-batadv skip error message") https://lore.kernel.org/all/20220302163049.101957-1-sw@simonwunderlich.de/ net/smc/af_smc.c commit 4d08b7b57ece ("net/smc: Fix cleanup when register ULP fails") commit 462791bbfa35 ("net/smc: add sysctl interface for SMC") https://lore.kernel.org/all/20220302112209.355def40@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-25 | ibmvnic: Allow queueing resets during probe | Sukadev Bhattiprolu
We currently don't allow queuing resets when adapter is in VNIC_PROBING state - instead we throw away the reset and return EBUSY. The reasoning is probably that during ibmvnic_probe() the ibmvnic_adapter itself is being initialized so performing a reset during this time can lead us to accessing fields in the ibmvnic_adapter that are not fully initialized. A review of the code shows that all the adapter state needed to process a reset is initialized before registering the CRQ so that should no longer be a concern.

Further the expectation is that if we do get a reset (transport event) during probe, the do..while() loop in ibmvnic_probe() will handle this by reinitializing the CRQ.

While that is true to some extent, it is possible that the reset might occur _after_ the CRQ is registered and CRQ_INIT message was exchanged but _before_ the adapter state is set to VNIC_PROBED. As mentioned above, such a reset will be thrown away. While the client assumes that the adapter is functional, the vnic server will wait for the client to reinit the adapter. This disconnect between the two leaves the adapter down needing manual intervention.

Because ibmvnic_probe() has other work to do after initializing the CRQ (such as registering the netdev at a minimum) and because the reset event can occur at any instant after the CRQ is initialized, there will always be a window between initializing the CRQ and considering the adapter ready for resets (ie state == PROBED).

So rather than discarding resets during this window, allow queueing them - but only process them after the adapter is fully initialized.

To do this, introduce a new completion state ->probe_done and have the reset worker thread wait on this before processing resets.

This change brings up two new situations in or just after ibmvnic_probe().

First after one or more resets were queued, we encounter an error and decide to retry the initialization. At that point the queued resets are no longer relevant since we could be talking to a new vnic server. So we must purge/flush the queued resets before restarting the initialization. As a side note, since we are still in the probing stage and we have not registered the netdev, it will not be CHANGE_PARAM reset.

Second this change opens up a potential race between the worker thread in __ibmvnic_reset(), the tasklet and the ibmvnic_open() due to the following sequence of events:

	1. Register CRQ
	2. Get transport event before CRQ_INIT completes.
	3. Tasklet schedules reset:
		a) add rwi to list
		b) schedule_work() to start worker thread which runs and waits for ->probe_done.
	4. ibmvnic_probe() decides to retry, purges rwi_list
	5. Re-register crq and this time rest of probe succeeds - register netdev and complete(->probe_done).
	6. Worker thread resumes in __ibmvnic_reset() from 3b.
	7. Worker thread sets ->resetting bit
	8. ibmvnic_open() comes in, notices ->resetting bit, sets state to IBMVNIC_OPEN and returns early expecting worker thread to finish the open.
	9. Worker thread finds rwi_list empty and returns without opening the interface.

If this happens, the ->ndo_open() call is effectively lost and the interface remains down. To address this, ensure that ->rwi_list is not empty before setting the ->resetting bit. See also comments in __ibmvnic_reset().

Fixes: 6a2fb0e99f9c ("ibmvnic: driver initialization for kdump/kexec")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25  ibmvnic: clear fop when retrying probe  Sukadev Bhattiprolu
Clear the ->failover_pending flag that may have been set in the previous pass of registering the CRQ. If we don't clear it, a subsequent ibmvnic_open() call would be misled into thinking a failover is pending and assuming that the reset worker thread would open the adapter. If this pass of registering the CRQ succeeds (i.e. there is no transport event), there wouldn't be a reset worker thread. This would leave the adapter unconfigured and require manual intervention to bring it up during boot. Fixes: 5a18e1e0c193 ("ibmvnic: Fix failover case for non-redundant configuration") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25  ibmvnic: init init_done_rc earlier  Sukadev Bhattiprolu
We currently initialize the ->init_done completion/return code fields before issuing a CRQ_INIT command. But if we get a transport event soon after registering the CRQ, the tasklet may already have recorded the completion and error code. If we initialize here, we might overwrite/lose that and end up issuing the CRQ_INIT only to time out later. If that timeout happens during probe, we will leave the adapter in the DOWN state rather than retrying to register/init the CRQ. Initialize the completion before registering the CRQ so we don't lose the notification. Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
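The effect of the two orderings can be shown with a small userspace model. It is a sketch only: `model_init_wait` and the functions below are hypothetical stand-ins for the ->init_done completion, ->init_done_rc and the tasklet, and the "CRQ registration" is reduced to the point after which an event may fire.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ->init_done plus ->init_done_rc. */
struct model_init_wait {
    bool done;
    int  rc;
};

/* Models re-initializing the completion and its return code. */
void model_reinit(struct model_init_wait *w)
{
    w->done = false;
    w->rc = 0;
}

/* Models the tasklet recording a transport event. */
void model_transport_event(struct model_init_wait *w, int rc)
{
    w->rc = rc;
    w->done = true;
}

/*
 * One probe pass. A non-zero event_rc means a transport event arrives
 * right after the CRQ is registered. Returns true if that event is
 * still visible when the prober goes to wait on the completion.
 */
bool model_probe_pass(struct model_init_wait *w, bool init_first, int event_rc)
{
    if (init_first)
        model_reinit(w);          /* ordering after the fix */

    /* "register CRQ": an event may fire from here on */
    if (event_rc != 0)
        model_transport_event(w, event_rc);

    if (!init_first)
        model_reinit(w);          /* old ordering: wipes the recorded event */

    return w->done && w->rc == event_rc;
}
```

In this model, `model_probe_pass(&w, true, -5)` keeps the event while `model_probe_pass(&w, false, -5)` loses it, which corresponds to the case where the prober would go on to send CRQ_INIT and only notice the problem at the timeout.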
2022-02-25  ibmvnic: register netdev after init of adapter  Sukadev Bhattiprolu
Finish initializing the adapter before registering netdev so state is consistent. Fixes: c26eba03e407 ("ibmvnic: Update reset infrastructure to support tunable parameters") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25  ibmvnic: complete init_done on transport events  Sukadev Bhattiprolu
If we get a transport event, set the error and mark the init as complete so that the attempt to send crq-init or login fails sooner rather than waiting for the timeout. Fixes: bbd669a868bb ("ibmvnic: Fix completion structure initialization") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25  ibmvnic: define flush_reset_queue helper  Sukadev Bhattiprolu
Define and use a helper to flush the reset queue. Fixes: 2770a7984db5 ("ibmvnic: Introduce hard reset recovery") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-25  ibmvnic: initialize rc before completing wait  Sukadev Bhattiprolu
We should initialize ->init_done_rc before calling complete(). Otherwise the waiting thread may see ->init_done_rc as 0 before we have updated it and may assume that the CRQ was successful. Fixes: 6b278c0cb378 ("ibmvnic delay complete()") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
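The general pattern here is "publish the data, then raise the flag". A minimal userspace sketch with C11 atomics follows. It is not the driver's code: `model_init_wait` and both functions are hypothetical, and the release/acquire pair stands in for whatever ordering the kernel completion provides between the waker and the waiter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for ->init_done_rc plus the ->init_done completion. */
struct model_init_wait {
    int         rc;    /* plain data, published by the flag below */
    atomic_bool done;  /* models the completion being signalled */
};

/* Writer side: store the return code first, then signal completion. */
void model_complete_with_rc(struct model_init_wait *w, int rc)
{
    w->rc = rc;
    atomic_store_explicit(&w->done, true, memory_order_release);
}

/*
 * Reader side: returns true and fills *rc only once the completion is
 * visible, so it can never pick up a stale rc of 0.
 */
bool model_try_wait(struct model_init_wait *w, int *rc)
{
    if (!atomic_load_explicit(&w->done, memory_order_acquire))
        return false;
    *rc = w->rc;
    return true;
}
```

If the two statements in `model_complete_with_rc()` were swapped, a reader on another thread could observe `done` and still read the old `rc`, which is the situation the commit message describes.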
2022-02-25  ibmvnic: free reset-work-item when flushing  Sukadev Bhattiprolu
Fix a tiny memory leak when flushing the reset work queue. Fixes: 2770a7984db5 ("ibmvnic: Introduce hard reset recovery") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
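A flush that only unlinks items leaks them; each item has to be freed as it is removed. The userspace sketch below illustrates that with a plain singly linked list. `model_rwi`, `model_queue_reset()` and `model_flush_reset_queue()` are hypothetical names, the driver uses the kernel's list helpers and locking instead, and queue ordering is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reset work item. */
struct model_rwi {
    int reset_reason;
    struct model_rwi *next;
};

/* Push a new item at the head. Returns 0 on success, -1 if out of memory. */
int model_queue_reset(struct model_rwi **head, int reason)
{
    struct model_rwi *rwi = malloc(sizeof(*rwi));

    if (!rwi)
        return -1;
    rwi->reset_reason = reason;
    rwi->next = *head;
    *head = rwi;
    return 0;
}

/* Unlink and free every queued item. Returns how many were freed. */
int model_flush_reset_queue(struct model_rwi **head)
{
    int freed = 0;

    while (*head) {
        struct model_rwi *rwi = *head;

        *head = rwi->next;
        free(rwi);               /* unlinking alone would leak the item */
        freed++;
    }
    return freed;
}
```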
2022-02-24  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  Jakub Kicinski
tools/testing/selftests/net/mptcp/mptcp_join.sh
  34aa6e3bccd8 ("selftests: mptcp: add ip mptcp wrappers")
  857898eb4b28 ("selftests: mptcp: add missing join check")
  6ef84b1517e0 ("selftests: mptcp: more robust signal race test")
  https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/

drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h
drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c
  fb7e76ea3f3b6 ("net/mlx5e: TC, Skip redundant ct clear actions")
  c63741b426e11 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information")
  09bf97923224f ("net/mlx5e: TC, Move pedit_headers_action to parse_attr")
  84ba8062e383 ("net/mlx5e: Test CT and SAMPLE on flow attr")
  efe6f961cd2e ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow")
  3b49a7edec1d ("net/mlx5e: TC, Reject rules with multiple CT actions")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-22  ibmvnic: schedule failover only if vioctl fails  Sukadev Bhattiprolu
If the client is unable to initiate a failover reset via the H_VIOCTL hcall, then it should schedule a failover reset as a last resort. Otherwise, there is no need for the last-resort reset. Fixes: 334c42414729 ("ibmvnic: improve failover sysfs entry") Reported-by: Cris Forno <cforno12@outlook.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Link: https://lore.kernel.org/r/20220221210545.115283-1-drt@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
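The control flow amounts to "try the hcall, fall back only on failure". A small userspace sketch, with the hcall replaced by a function pointer so both outcomes can be exercised. All names here (`model_hcall_fn`, `model_failover`, `model_request_failover()`) are hypothetical, and the return value convention is an assumption of the model rather than the driver's sysfs handler behaviour.

```c
#include <assert.h>

/* Hypothetical hook standing in for the H_VIOCTL hcall; 0 means success. */
typedef long (*model_hcall_fn)(void);

struct model_failover {
    int fallback_resets;  /* failover resets the client queued on its own */
};

/*
 * Ask the server to fail over. Only if that request fails does the
 * client schedule its own failover reset as a last resort.
 */
long model_request_failover(struct model_failover *f, model_hcall_fn hcall)
{
    long rc = hcall();

    if (rc == 0)
        return 0;             /* server accepted: no last-resort reset */

    f->fallback_resets++;     /* models scheduling the failover reset */
    return rc;
}

/* Two example hooks for exercising both paths. */
long model_hcall_ok(void)   { return 0; }
long model_hcall_fail(void) { return -1; }
```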
2022-02-18  net/ibmvnic: Cleanup workaround doing an EOI after partition migration  Cédric Le Goater
There was a fair number of changes to work around a firmware bug leaving a pending interrupt after migration of the ibmvnic device:

commit 2df5c60e198c ("net/ibmvnic: Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode")
commit 284f87d2f387 ("Revert "net/ibmvnic: Fix EOI when running in XIVE mode"")
commit 11d49ce9f794 ("net/ibmvnic: Fix EOI when running in XIVE mode.")
commit f23e0643cd0b ("ibmvnic: Clear pending interrupt after device reset")

Here is the final one taking into account the XIVE interrupt mode.

Cc: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Cc: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-08  ibmvnic: don't release napi in __ibmvnic_open()  Sukadev Bhattiprolu
If __ibmvnic_open() encounters an error such as when setting link state, it calls release_resources() which frees the napi structures needlessly. Instead, have __ibmvnic_open() only clean up the work it did so far (i.e. disable napi and irqs) and leave the rest to the callers.

If the caller of __ibmvnic_open() is ibmvnic_open(), it should release the resources immediately. If the caller is do_reset() or do_hard_reset(), they will release the resources on the next reset.

This fixes the following crash that occurred when running the drmgr command several times to add/remove a vnic interface:

[102056] ibmvnic 30000003 env3: Disabling rx_scrq[6] irq
[102056] ibmvnic 30000003 env3: Disabling rx_scrq[7] irq
[102056] ibmvnic 30000003 env3: Replenished 8 pools
Kernel attempted to read user page (10) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000010
Faulting instruction address: 0xc000000000a3c840
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
...
CPU: 9 PID: 102056 Comm: kworker/9:2 Kdump: loaded Not tainted 5.16.0-rc5-autotest-g6441998e2e37 #1
Workqueue: events_long __ibmvnic_reset [ibmvnic]
NIP: c000000000a3c840 LR: c0080000029b5378 CTR: c000000000a3c820
REGS: c0000000548e37e0 TRAP: 0300 Not tainted (5.16.0-rc5-autotest-g6441998e2e37)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28248484 XER: 00000004
CFAR: c0080000029bdd24 DAR: 0000000000000010 DSISR: 40000000 IRQMASK: 0
GPR00: c0080000029b55d0 c0000000548e3a80 c0000000028f0200 0000000000000000
...
NIP [c000000000a3c840] napi_enable+0x20/0xc0
LR [c0080000029b5378] __ibmvnic_open+0xf0/0x430 [ibmvnic]
Call Trace:
[c0000000548e3a80] [0000000000000006] 0x6 (unreliable)
[c0000000548e3ab0] [c0080000029b55d0] __ibmvnic_open+0x348/0x430 [ibmvnic]
[c0000000548e3b40] [c0080000029bcc28] __ibmvnic_reset+0x500/0xdf0 [ibmvnic]
[c0000000548e3c60] [c000000000176228] process_one_work+0x288/0x570
[c0000000548e3d00] [c000000000176588] worker_thread+0x78/0x660
[c0000000548e3da0] [c0000000001822f0] kthread+0x1c0/0x1d0
[c0000000548e3e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
Instruction dump:
7d2948f8 792307e0 4e800020 60000000 3c4c01eb 384239e0 f821ffd1 39430010
38a0fff6 e92d1100 f9210028 39200000 <e9030010> f9010020 60420000 e9210020
---[ end trace 5f8033b08fd27706 ]---

Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Link: https://lore.kernel.org/r/20220208001918.900602-1-sukadev@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-24  ibmvnic: remove unused ->wait_capability  Sukadev Bhattiprolu
With previous bug fix, ->wait_capability flag is no longer needed and can be removed. Fixes: 249168ad07cd ("ibmvnic: Make CRQ interrupt tasklet wait for all capabilities crqs") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-24  ibmvnic: don't spin in tasklet  Sukadev Bhattiprolu
ibmvnic_tasklet() continuously spins waiting for responses to all capability requests. It does this to avoid encountering an error during initialization of the vnic. However, if there is a bug in the VIOS and we do not receive a response to one or more queries, the tasklet ends up spinning continuously, leading to hard lockups. If we fail to receive a message from the VIOS, it is reasonable to time out the login attempt rather than spin indefinitely in the tasklet. Fixes: 249168ad07cd ("ibmvnic: Make CRQ interrupt tasklet wait for all capabilities crqs") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
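The difference between spinning until every response arrives and giving up after a bound can be sketched in userspace as follows. This is a stand-in technique, not the driver's mechanism: the commit message only says the login attempt should time out, so the poll budget, `model_poll_fn`, `model_wait_for_caps()` and `model_vios` are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical poll hook: returns how many capability responses have arrived. */
typedef int (*model_poll_fn)(void *ctx);

/*
 * Wait for `expected` responses, but give up after `max_polls` attempts
 * instead of spinning forever. Returns 0 on success, -ETIMEDOUT otherwise.
 */
int model_wait_for_caps(model_poll_fn poll, void *ctx, int expected, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (poll(ctx) >= expected)
            return 0;
    }
    return -ETIMEDOUT;
}

/* Example server model: delivers one response per poll, up to will_send. */
struct model_vios {
    int seen;
    int will_send;
};

int model_vios_poll(void *ctx)
{
    struct model_vios *v = ctx;

    if (v->seen < v->will_send)
        v->seen++;
    return v->seen;
}
```

A server that never sends the last response makes an unbounded loop spin forever, while the bounded version returns an error the caller can act on, for example by failing the login and retrying later.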