summaryrefslogtreecommitdiff
path: root/drivers/net
AgeCommit message (Collapse)Author
2021-08-12Merge tag 'mlx5-updates-2021-08-11' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 updates 2021-08-11 This series provides misc updates to mlx5. For more information please see tag log below. Please pull and let me know if there is any problem. mlx5-updates-2021-08-11 Misc. cleanup for mlx5. 1) Typos and use of netdev_warn() 2) smatch cleanup 3) Minor fix to inner TTC table creation 4) Dynamic capability cache allocation ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12wwan: core: Unshadow error code returned by ida_alloc_range()Andy Shevchenko
ida_alloc_range() may return other than -ENOMEM error code. Unshadow it in the wwan_create_port(). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12net: dsa: mt7530: fix VLAN traffic leaks againDENG Qingfang
When a port leaves a VLAN-aware bridge, the current code does not clear other ports' matrix field bit. If the bridge is later set to VLAN-unaware mode, traffic in the bridge may leak to that port. Remove the VLAN filtering check in mt7530_port_bridge_leave. Fixes: 474a2ddaa192 ("net: dsa: mt7530: fix VLAN traffic leaks") Fixes: 83163f7dca56 ("net: dsa: mediatek: add VLAN support for MT7530") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12net: phy: nxp-tja11xx: log critical health stateOleksij Rempel
TJA1102 provides interrupt notification for the critical health states like overtemperature and undervoltage. The overtemperature bit is set if package temperature is beyond 155C°. This functionality was tested by heating the package up to 200C° The undervoltage bit is set if supply voltage drops beyond some critical threshold. Currently not tested. In a typical use case, both of this events should be logged and stored (or send to some remote system) for further investigations. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: hns3: add support for triggering reset by ethtoolYufeng Mo
Currently, four reset types are supported for the HNS3 ethernet driver: IMP reset, global reset, function reset, and FLR. Only FLR can now be triggered by the user. To restore the device when an exception occurs, add support for triggering reset by ethtool. Run the "ethtool --reset DEVNAME mgmt | all | dedicated" to trigger the IMP | global | function reset manually. In addition, VF can only trigger function reset. Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Link: https://lore.kernel.org/r/1628602128-15640-1-git-send-email-huangguangbin2@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11bonding: combine netlink and console error messagesJonathan Toppins
There seems to be no reason to have different error messages between netlink and printk. It also cleans up the function slightly. Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11net: mscc: Fix non-GPL export of regmap APIsMark Brown
The ocelot driver makes use of regmap, wrapping it with driver specific operations that are thin wrappers around the core regmap APIs. These are exported with EXPORT_SYMBOL, dropping the _GPL from the core regmap exports which is frowned upon. Add _GPL suffixes to at least the APIs that are doing register I/O. Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20210810123748.47871-1-broonie@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11net/mlx5e: Make use of netdev_warn()Cai Huoqing
to replace printk(KERN_WARNING ...) with netdev_warn() kindly Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Fix variable type to match 64bitEran Ben Elisha
Fix the following smatch warning: wait_func_handle_exec_timeout() warn: should '1 << ent->idx' be a 64 bit type? Use 1ULL, to have a 64 bit type variable. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Initialize numa node for all core devicesParav Pandit
Subsequent patches make use of numa node affinity for memory allocations. Initialize it for PCI PF, VF and SF devices. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Allocate individual capabilityParav Pandit
Currently mlx5_core_dev contains array of capabilities. It contains 19 valid capabilities of the device, 2 reserved entries and 12 holes. Due to this for 14 unused entries, mlx5_core_dev allocates 14 * 8K = 112K bytes of memory which is never used. Due to this mlx5_core_dev structure size is 270Kbytes odd. This allocation further aligns to next power of 2 to 512Kbytes. By skipping non-existent entries, (a) 112Kbyte is saved, (b) mlx5_core_dev reduces to 8KB with alignment (c) 350KB saved in alignment In future individual capability allocation can be used to skip its allocation when such capability is disabled at the device level. This patch prepares mlx5_core_dev to hold capability using a pointer instead of inline array. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Reorganize current and maximal capabilities to be per-typeParav Pandit
In the current code, the current and maximal capabilities are maintained in separate arrays which are both per type. In order to allow the creation of such a basic structure as a dynamically allocated array, we move curr and max fields to a unified structure so that specific capabilities can be allocated as one unit. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: SF, use recent sysfs apiParav Pandit
Use sysfs_emit() which is aware of PAGE_SIZE buffer. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Refcount mlx5_irq with integerShay Drory
Currently, all access to mlx5 IRQs are done undere a lock. Hance, there isn't a reason to have kref in struct mlx5_irq. Switch it to integer. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Change SF missing dedicated MSI-X err message to dbgShay Drory
When MSI-X vectors allocated are not enough for SFs to have dedicated, MSI-X, kernel log buffer has too many entries. Hence only enable such log with debug level. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Align mlx5_irq structureShay Drory
mlx5_irq structure have holes due to incorrect position of fields in it. Make them naturally align. pahole output after alignment: struct mlx5_irq { struct atomic_notifier_head nh; /* 0 72 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ cpumask_var_t mask; /* 72 8 */ char name[32]; /* 80 32 */ struct mlx5_irq_pool * pool; /* 112 8 */ struct kref kref; /* 120 4 */ u32 index; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ int irqn; /* 128 4 */ /* size: 136, cachelines: 3, members: 7 */ /* padding: 4 */ /* last cacheline: 8 bytes */ }; Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Delete impossible dev->state checksLeon Romanovsky
New mlx5_core device structure is allocated through devlink_alloc with\ kzalloc and that ensures that all fields are equal to zero and it includes ->state too. That means that checks of that field in the mlx5_init_one() is completely redundant, because that function is called only once in the begging of mlx5_core_dev lifetime. PCI: .probe() -> probe_one() -> mlx5_init_one() The recovery flow can't run at that time or before it, because relevant work initialized later in mlx5_init_once(). Such initialization flow ensures that dev->state can't be MLX5_DEVICE_STATE_UNINITIALIZED at all, so remove such impossible checks. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Fix inner TTC table creationMaor Gottlieb
Fix typo of the cited commit that calls to mlx5_create_ttc_table, instead of mlx5_create_inner_ttc_table. Fixes: f4b45940e9b9 ("net/mlx5: Embed mlx5_ttc_table") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Fix typo in commentsCai Huoqing
Fix typo: *vectores ==> vectors *realeased ==> released *erros ==> errors *namepsace ==> namespace *trafic ==> traffic *proccessed ==> processed *retore ==> restore *Currenlty ==> Currently *crated ==> created *chane ==> change *cannnot ==> cannot *usuallly ==> usually *failes ==> fails *importent ==> important *reenabled ==> re-enabled *alocation ==> allocation *recived ==> received *tanslation ==> translation Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Support enable_vnet devlink dev paramParav Pandit
Enable user to disable VDPA net auxiliary device so that when it is not required, user can disable it. For example, $ devlink dev param set pci/0000:06:00.0 \ name enable_vnet value false cmode driverinit $ devlink dev reload pci/0000:06:00.0 At this point devlink instance do not create auxiliary device mlx5_core.vnet.2 for the VDPA net functionality. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net/mlx5: Support enable_rdma devlink dev paramParav Pandit
Enable user to disable RDMA auxiliary device so that when it is not required, user can disable it. For example, $ devlink dev param set pci/0000:06:00.0 \ name enable_rdma value false cmode driverinit $ devlink dev reload pci/0000:06:00.0 At this point devlink instance do not create auxiliary device mlx5_core.rdma.2 for the RDMA functionality. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net/mlx5: Support enable_eth devlink dev paramParav Pandit
Enable user to disable Ethernet auxiliary device so that when it is not required, user can disable it. For example, $ devlink dev param set pci/0000:06:00.0 \ name enable_eth value false cmode driverinit $ devlink dev reload pci/0000:06:00.0 At this point devlink instance do not create mlx5_core.eth.2 auxiliary device for the Ethernet functionality. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net/mlx5: Fix unpublish devlink parametersParav Pandit
Cleanup routine missed to unpublish the parameters. Add it. Fixes: e890acd5ff18 ("net/mlx5: Add devlink flow_steering_mode parameter") Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: ipa: kill ipa_clock_get_additional()Alex Elder
Now that ipa_clock_get_additional() is a trivial wrapper around pm_runtime_get_if_active(), just open-code it in its only caller and delete the function. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: ipa: kill IPA clock reference countAlex Elder
The runtime power management core code maintains a usage count. This count mirrors the IPA clock reference count, and there's no need to maintain both. So get rid of the IPA clock reference count and just rely on the runtime PM usage count to determine when the hardware should be suspended or resumed. Use pm_runtime_get_if_active() in ipa_clock_get_additional(). We care whether power is active, regardless of whether it's in use, so pass true for its ign_usage_count argument. The IPA clock mutex is just used to make enabling/disabling the clock and updating the reference count occur atomically. Without the reference count, there's no need for the mutex, so get rid of that too. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: ipa: get rid of extra clock referenceAlex Elder
Suspending the IPA hardware is now managed by the runtime PM core code. The ->runtime_idle callback returns a non-zero value, so it will never suspend except when forced. As a result, there's no need to take an extra "do not suspend" clock reference. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: ipa: use runtime PM coreAlex Elder
Use the runtime power management core to cause hardware suspend and resume to occur. Enable it in ipa_clock_init() (without autosuspend), and disable it in ipa_clock_exit(). Use ipa_runtime_suspend() as the ->runtime_suspend power operation, and arrange for it to be called by having ipa_clock_get() call pm_runtime_get_sync() when the first clock reference is taken. Similarly, use ipa_runtime_resume() as the ->runtime_resume power operation, and pm_runtime_put() when the last IPA clock reference is dropped. Introduce ipa_runtime_idle() as the ->runtime_idle power operation, and have it return a non-zero value; this way suspend will never occur except when forced. Use pm_runtime_force_suspend() and pm_runtime_force_resume() as the system suspend and resume callbacks, and remove ipa_suspend() and ipa_resume(). Store a pointer to the device structure passed to ipa_clock_init(), so it can be used by ipa_clock_exit() to disable runtime power management. For now we preserve IPA clock reference counting. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: ipa: resume in ipa_clock_get()Alex Elder
Introduce ipa_runtime_suspend() and ipa_runtime_resume(), which encapsulate the activities necessary for suspending and resuming the IPA hardware. Call these functions from ipa_clock_get() and ipa_clock_put() when the first reference is taken or last one is dropped. When the very first clock reference is taken (for ipa_config()), setup isn't complete yet, so (as before) only the core clock gets enabled. When the last clock reference is dropped (after ipa_deconfig()), ipa_teardown() will have made the setup_complete flag false, so there too, the core clock will be stopped without affecting GSI or the endpoints. Otherwise these new functions will perform the desired suspend and resume actions once setup is complete. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: ipa: disable clock in suspendAlex Elder
Disable the IPA clock rather than dropping a reference to it in the system suspend callback. This forces the suspend to occur without affecting existing references. Similarly, enable the clock rather than taking a reference in ipa_resume(), forcing a resume without changing the reference count. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11net: ipa: have ipa_clock_get() return a valueAlex Elder
We currently assume no errors occur when enabling or disabling the IPA core clock and interconnects. And although this commit exposes errors that could occur, we generally assume this won't happen in practice. This commit changes ipa_clock_get() and ipa_clock_put() so each returns a value. The values returned are meant to mimic what the runtime power management functions return, so we can set up error handling here before we make the switch. Have ipa_clock_get() increment the reference count even if it returns an error, to match the behavior of pm_runtime_get(). More details follow. When taking a reference in ipa_clock_get(), return 0 for the first reference, 1 for subsequent references, or a negative error code if an error occurs. Note that if ipa_clock_get() returns an error, we must not touch hardware; in some cases such errors now cause entire blocks of code to be skipped. When dropping a reference in ipa_clock_put(), we return 0 or an error code. The error would come from ipa_clock_disable(), which now returns what ipa_interconnect_disable() returns (either 0 or a negative error code). For now, callers ignore the return value; if an error occurs, a message will have already been logged, and little more can actually be done to improve the situation. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-10Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== pull-request: mlx5-next 2020-08-9 This pulls mlx5-next branch which includes patches already reviewed on net-next and rdma mailing lists. 1) mlx5 single E-Switch FDB for lag 2) IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq 3) Add DCS caps & fields support [1] https://patchwork.kernel.org/project/netdevbpf/cover/20210803231959.26513-1-saeed@kernel.org/ [2] https://patchwork.kernel.org/project/netdevbpf/patch/0e3364dab7e0e4eea5423878b01aa42470be8d36.1626609184.git.leonro@nvidia.com/ [3] https://patchwork.kernel.org/project/netdevbpf/patch/55e1d69bef1fbfa5cf195c0bfcbe35c8019de35e.1624258894.git.leonro@nvidia.com/ * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Lag, Create shared FDB when in switchdev mode net/mlx5: E-Switch, add logic to enable shared FDB net/mlx5: Lag, move lag destruction to a workqueue net/mlx5: Lag, properly lock eswitch if needed net/mlx5: Add send to vport rules on paired device net/mlx5: E-Switch, Add event callback for representors net/mlx5e: Use shared mappings for restoring from metadata net/mlx5e: Add an option to create a shared mapping net/mlx5: E-Switch, set flow source for send to uplink rule RDMA/mlx5: Add shared FDB support {net, RDMA}/mlx5: Extend send to vport rules RDMA/mlx5: Fill port info based on the relevant eswitch net/mlx5: Lag, add initial logic for shared FDB net/mlx5: Return mdev from eswitch IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq net/mlx5: Add DCS caps & fields support ==================== Link: https://lore.kernel.org/r/20210809202522.316930-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-10Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2021-08-10 We've added 31 non-merge commits during the last 8 day(s) which contain a total of 28 files changed, 3644 insertions(+), 519 deletions(-). 1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki. 2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from 32-bit MIPS JIT development, from Johan Almbladh. 3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev. 4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko. 5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang. 6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet. 7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover, and Muhammad Falak R Wani. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits) bpf, tests: Add tail call test suite bpf, tests: Add tests for BPF_CMPXCHG bpf, tests: Add tests for atomic operations bpf, tests: Add test for 32-bit context pointer argument passing bpf, tests: Add branch conversion JIT test bpf, tests: Add word-order tests for load/store of double words bpf, tests: Add tests for ALU operations implemented with function calls bpf, tests: Add more ALU64 BPF_MUL tests bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64 bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations bpf, tests: Fix typos in test case descriptions bpf, tests: Add BPF_MOV tests for zero and sign extension bpf, tests: Add BPF_JMP32 test cases samples, bpf: Add an explict comment to handle nested vlan tagging. selftests/bpf: Add tests for XDP bonding selftests/bpf: Fix xdp_tx.c prog section name net, core: Allow netdev_lower_get_next_private_rcu in bh context bpf, devmap: Exclude XDP broadcast to master device net, bonding: Add XDP support to the bonding driver ... ==================== Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09net: hns3: support skb's frag page recycling based on page poolYunsheng Lin
This patch adds skb's frag page recycling support based on the frag page support in page pool. The performance improves above 10~20% for single thread iperf TCP flow with IOMMU disabled when iperf server and irq/NAPI have a different CPU. The performance improves about 135%(14Gbit to 33Gbit) for single thread iperf TCP flow when IOMMU is in strict mode and iperf server shares the same cpu with irq/NAPI. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09page_pool: keep pp info as long as page pool owns the pageYunsheng Lin
Currently, page->pp is cleared and set everytime the page is recycled, which is unnecessary. So only set the page->pp when the page is added to the page pool and only clear it when the page is released from the page pool. This is also a preparation to support allocating frag page in page pool. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-09net, bonding: Add XDP support to the bonding driverJussi Maki
XDP is implemented in the bonding driver by transparently delegating the XDP program loading, removal and xmit operations to the bonding slave devices. The overall goal of this work is that XDP programs can be attached to a bond device *without* any further changes (or awareness) necessary to the program itself, meaning the same XDP program can be attached to a native device but also a bonding device. Semantics of XDP_TX when attached to a bond are equivalent in such setting to the case when a tc/BPF program would be attached to the bond, meaning transmitting the packet out of the bond itself using one of the bond's configured xmit methods to select a slave device (rather than XDP_TX on the slave itself). Handling of XDP_TX to transmit using the configured bonding mechanism is therefore implemented by rewriting the BPF program return value in bpf_prog_run_xdp. To avoid performance impact this check is guarded by a static key, which is incremented when a XDP program is loaded onto a bond device. This approach was chosen to avoid changes to drivers implementing XDP. If the slave device does not match the receive device, then XDP_REDIRECT is transparently used to perform the redirection in order to have the network driver release the packet from its RX ring. The bonding driver hashing functions have been refactored to allow reuse with xdp_buff's to avoid code duplication. The motivation for this change is to enable use of bonding (and 802.3ad) in hairpinning L4 load-balancers such as [1] implemented with XDP and also to transparently support bond devices for projects that use XDP given most modern NICs have dual port adapters. An alternative to this approach would be to implement 802.3ad in user-space and implement the bonding load-balancing in the XDP program itself, but is rather a cumbersome endeavor in terms of slave device management (e.g. by watching netlink) and requires separate programs for native vs bond cases for the orchestrator. A native in-kernel implementation overcomes these issues and provides more flexibility. Below are benchmark results done on two machines with 100Gbit Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and 16-core 3950X on receiving machine. 64 byte packets were sent with pktgen-dpdk at full rate. Two issues [2, 3] were identified with the ice driver, so the tests were performed with iommu=off and patch [2] applied. Additionally the bonding round robin algorithm was modified to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate of cache misses were caused by the shared rr_tx_counter (see patch 2/3). The statistics were collected using "sar -n dev -u 1 10". On top of that, for ice, further work is in progress on improving the XDP_TX numbers [4]. -----------------------| CPU |--| rxpck/s |--| txpck/s |---- without patch (1 dev): XDP_DROP: 3.15% 48.6Mpps XDP_TX: 3.12% 18.3Mpps 18.3Mpps XDP_DROP (RSS): 9.47% 116.5Mpps XDP_TX (RSS): 9.67% 25.3Mpps 24.2Mpps ----------------------- with patch, bond (1 dev): XDP_DROP: 3.14% 46.7Mpps XDP_TX: 3.15% 13.9Mpps 13.9Mpps XDP_DROP (RSS): 10.33% 117.2Mpps XDP_TX (RSS): 10.64% 25.1Mpps 24.0Mpps ----------------------- with patch, bond (2 devs): XDP_DROP: 6.27% 92.7Mpps XDP_TX: 6.26% 17.6Mpps 17.5Mpps XDP_DROP (RSS): 11.38% 117.2Mpps XDP_TX (RSS): 14.30% 28.7Mpps 27.4Mpps -------------------------------------------------------------- RSS: Receive Side Scaling, e.g. the packets were sent to a range of destination IPs. [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/ [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
2021-08-09net, bonding: Refactor bond_xmit_hash for use with xdp_buffJussi Maki
In preparation for adding XDP support to the bonding driver refactor the packet hashing functions to be able to work with any linear data buffer without an skb. Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Link: https://lore.kernel.org/bpf/20210731055738.16820-2-joamaki@gmail.com
2021-08-09net: fec: fix build error for ARCH m68kJoakim Zhang
reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross make.cross ARCH=m68k m5272c3_defconfig make.cross ARCH=m68k drivers/net/ethernet/freescale/fec_main.c: In function 'fec_enet_eee_mode_set': >> drivers/net/ethernet/freescale/fec_main.c:2758:33: error: 'FEC_LPI_SLEEP' undeclared (first use in this function); did you mean 'FEC_ECR_SLEEP'? 2758 | writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP); | ^~~~~~~~~~~~~ arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel' 25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b)) | ^~~~ drivers/net/ethernet/freescale/fec_main.c:2758:2: note: in expansion of macro 'writel' 2758 | writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP); | ^~~~~~ drivers/net/ethernet/freescale/fec_main.c:2758:33: note: each undeclared identifier is reported only once for each function it appears in 2758 | writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP); | ^~~~~~~~~~~~~ arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel' 25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b)) | ^~~~ drivers/net/ethernet/freescale/fec_main.c:2758:2: note: in expansion of macro 'writel' 2758 | writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP); | ^~~~~~ >> drivers/net/ethernet/freescale/fec_main.c:2759:32: error: 'FEC_LPI_WAKE' undeclared (first use in this function) 2759 | writel(wake_cycle, fep->hwp + FEC_LPI_WAKE); | ^~~~~~~~~~~~ arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel' 25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b)) | ^~~~ drivers/net/ethernet/freescale/fec_main.c:2759:2: note: in expansion of macro 'writel' 2759 | writel(wake_cycle, fep->hwp + FEC_LPI_WAKE); | ^~~~~~ This patch adds register definition for M5272 platform to pass build. Fixes: b82f8c3f1409 ("net: fec: add eee mode tx lpi support") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-09devlink: Set device as early as possibleLeon Romanovsky
All kernel devlink implementations call to devlink_alloc() during initialization routine for specific device which is used later as a parent device for devlink_register(). Such late device assignment causes to the situation which requires us to call to device_register() before setting other parameters, but that call opens devlink to the world and makes accessible for the netlink users. Any attempt to move devlink_register() to be the last call generates the following error due to access to the devlink->dev pointer. [ 8.758862] devlink_nl_param_fill+0x2e8/0xe50 [ 8.760305] devlink_param_notify+0x6d/0x180 [ 8.760435] __devlink_params_register+0x2f1/0x670 [ 8.760558] devlink_params_register+0x1e/0x20 The simple change of API to set devlink device in the devlink_alloc() instead of devlink_register() fixes all this above and ensures that prior to call to devlink_register() everything already set. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-09wwan: mhi: Fix missing spin_lock_init() in mhi_mbim_probe()Wei Yongjun
The driver allocates the spinlock but not initialize it. Use spin_lock_init() on it to initialize it correctly. Fixes: aa730a9905b7 ("net: wwan: Add MHI MBIM network driver") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-08net: dsa: sja1105: add FDB fast ageing supportVladimir Oltean
Delete the dynamically learned FDB entries when the STP state changes and when address learning is disabled. On sja1105 there is no shorthand SPI command for this, so we need to walk through the entire FDB to delete. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-08net: dsa: sja1105: rely on DSA core tracking of port learning stateVladimir Oltean
Now that DSA keeps track of the port learning state, it becomes superfluous to keep an additional variable with this information in the sja1105 driver. Remove it. The DSA core's learning state is present in struct dsa_port *dp. To avoid the antipattern where we iterate through a DSA switch's ports and then call dsa_to_port to obtain the "dp" reference (which is bad because dsa_to_port iterates through the DSA switch tree once again), just iterate through the dst->ports and operate on those directly. The sja1105 had an extra use of priv->learn_ena on non-user ports. DSA does not touch the learning state of those ports - drivers are free to do what they wish on them. Mark that information with a comment in struct dsa_port and let sja1105 set dp->learning for cascade ports. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-08net: dsa: centralize fast ageing when address learning is turned offVladimir Oltean
Currently DSA leaves it down to device drivers to fast age the FDB on a port when address learning is disabled on it. There are 2 reasons for doing that in the first place: - when address learning is disabled by user space, through IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user space typically wants to achieve is to operate in a mode with no dynamic FDB entry on that port. But if the port is already up, some addresses might have been already learned on it, and it seems silly to wait for 5 minutes for them to expire until something useful can be done. - when a port leaves a bridge and becomes standalone, DSA turns off address learning on it. This also has the nice side effect of flushing the dynamically learned bridge FDB entries on it, which is a good idea because standalone ports should not have bridge FDB entries on them. We let drivers manage fast ageing under this condition because if DSA were to do it, it would need to track each port's learning state, and act upon the transition, which it currently doesn't. But there are 2 reasons why doing it is better after all: - drivers might get it wrong and not do it (see b53_port_set_learning) - we would like to flush the dynamic entries from the software bridge too, and letting drivers do that would be another pain point So track the port learning state and trigger a fast age process automatically within DSA. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-08devlink: Simplify devlink port API callsLeon Romanovsky
Devlink port already has pointer to the devlink instance and all API calls that forward these devlink ports to the drivers perform same "devlink_port->devlink" assignment before actual call. This patch removes useless parameter and allows us in the future to create specific devlink_port_ops to manage user space access with reliable ops assignment. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-07net: bonding: bond_alb: Remove the dependency on ipx network layerCai Huoqing
commit <47595e32869f> ("<MAINTAINERS: Mark some staging directories>") indicated the ipx network layer as obsolete in Jan 2018, updated in the MAINTAINERS file now, after being exposed for 3 years to refactoring, so to delete the ipx net layer related code for good. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-07net: ethernet: stmmac: Do not use unreachable() in ipq806x_gmac_probe()Nathan Chancellor
When compiling with clang in certain configurations, an objtool warning appears: drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.o: warning: objtool: ipq806x_gmac_probe() falls through to next function phy_modes() This happens because the unreachable annotation in the third switch statement is not eliminated. The compiler should know that the first default case would prevent the second and third from being reached as the comment notes but sanitizer options can make it harder for the compiler to reason this out. Help the compiler out by eliminating the unreachable() annotation and unifying the default case error handling so that there is no objtool warning, the meaning of the code stays the same, and there is less duplication. Reported-by: Sami Tolvanen <samitolvanen@google.com> Tested-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-07tulip: Remove deadcode on startup true conditionColin Ian King
The true check on the variable startable in the ternary operator is always false because the previous if statement handles the true condition for startable. Hence the ternary check is dead code and can be removed. Addresses-Coverity: ("Logically dead code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-06net: ethernet: ti: davinci_cpdma: revert "drop frame padding"Grygorii Strashko
This reverts commit 9ffc513f95ee ("net: ethernet: ti: davinci_cpdma: drop frame padding") which has depndency from not yet merged patch [1] and so breaks cpsw_new driver. [1] https://patchwork.kernel.org/project/netdevbpf/patch/20210805145511.12016-1-grygorii.strashko@ti.com/ Fixes: 9ffc513f95ee ("net: ethernet: ti: davinci_cpdma: drop frame padding") Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Link: https://lore.kernel.org/r/20210806142809.15069-1-grygorii.strashko@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-06vrf: fix NULL dereference in vrf_finish_output()Dan Carpenter
The "skb" pointer is NULL on this error path so we can't dereference it. Use "dev" instead. Fixes: 14ee70ca89e6 ("vrf: use skb_expand_head in vrf_finish_output") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20210806150435.GB15586@kili Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-06net: dsa: mt7530: drop untagged frames on VLAN-aware ports without PVIDDENG Qingfang
The driver currently still accepts untagged frames on VLAN-aware ports without PVID. Use PVC.ACC_FRM to drop untagged frames in that case. Signed-off-by: DENG Qingfang <dqfext@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-06net: dsa: don't disable multicast flooding to the CPU even without an IGMP ↵Vladimir Oltean
querier Commit 08cc83cc7fd8 ("net: dsa: add support for BRIDGE_MROUTER attribute") added an option for users to turn off multicast flooding towards the CPU if they turn off the IGMP querier on a bridge which already has enslaved ports (echo 0 > /sys/class/net/br0/bridge/multicast_router). And commit a8b659e7ff75 ("net: dsa: act as passthrough for bridge port flags") simply papered over that issue, because it moved the decision to flood the CPU with multicast (or not) from the DSA core down to individual drivers, instead of taking a more radical position then. The truth is that disabling multicast flooding to the CPU is simply something we are not prepared to do now, if at all. Some reasons: - ICMP6 neighbor solicitation messages are unregistered multicast packets as far as the bridge is concerned. So if we stop flooding multicast, the outside world cannot ping the bridge device's IPv6 link-local address. - There might be foreign interfaces bridged with our DSA switch ports (sending a packet towards the host does not necessarily equal termination, but maybe software forwarding). So if there is no one interested in that multicast traffic in the local network stack, that doesn't mean nobody is. - PTP over L4 (IPv4, IPv6) is multicast, but is unregistered as far as the bridge is concerned. This should reach the CPU port. - The switch driver might not do FDB partitioning. And since we don't even bother to do more fine-grained flood disabling (such as "disable flooding _from_port_N_ towards the CPU port" as opposed to "disable flooding _from_any_port_ towards the CPU port"), this breaks standalone ports, or even multiple bridges where one has an IGMP querier and one doesn't. Reverting the logic makes all of the above work. Fixes: a8b659e7ff75 ("net: dsa: act as passthrough for bridge port flags") Fixes: 08cc83cc7fd8 ("net: dsa: add support for BRIDGE_MROUTER attribute") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>