summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-12-06net/mlx5: Return ready to use ASO WQELeon Romanovsky
There is no need in hiding returned ASO WQE type by providing void*, use the real type instead. Do it together with zeroing that memory, so ASO WQE will be ready to use immediately. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-06Merge branch 'Extend XFRM core to allow packet offload configuration'Steffen Klassert
Leon Romanovsky says: ============ The following series extends XFRM core code to handle a new type of IPsec offload - packet offload. In this mode, the HW is going to be responsible for the whole data path, so both policy and state should be offloaded. IPsec packet offload is an improved version of IPsec crypto mode, In packet mode, HW is responsible to trim/add headers in addition to decrypt/encrypt. In this mode, the packet arrives to the stack as already decrypted and vice versa for TX (exits to HW as not-encrypted). Devices that implement IPsec packet offload mode offload policies too. In the RX path, it causes the situation that HW can't effectively handle mixed SW and HW priorities unless users make sure that HW offloaded policies have higher priorities. It means that we don't need to perform any search of inexact policies and/or priority checks if HW policy was discovered. In such situation, the HW will catch the packets anyway and HW can still implement inexact lookups. In case specific policy is not found, we will continue with packet lookup and check for existence of HW policies in inexact list. HW policies are added to the head of SPD to ensure fast lookup, as XFRM iterates over all policies in the loop. This simple solution allows us to achieve same benefits of separate HW/SW policies databases without over-engineering the code to iterate and manage two databases at the same path. To not over-engineer the code, HW policies are treated as SW ones and don't take into account netdev to allow reuse of the same priorities for policies databases without over-engineering the code to iterate and manage two databases at the same path. To not over-engineer the code, HW policies are treated as SW ones and don't take into account netdev to allow reuse of the same priorities for different devices. * No software fallback * Fragments are dropped, both in RX and TX * No sockets policies * Only IPsec transport mode is implemented ================================================================================ Rekeying: In order to support rekeying, as XFRM core is skipped, the HW/driver should do the following: * Count the handled packets * Raise event that limits are reached * Drop packets once hard limit is occurred. The XFRM core calls to newly introduced xfrm_dev_state_update_curlft() function in order to perform sync between device statistics and internal structures. On HW limit event, driver calls to xfrm_state_check_expire() to allow XFRM core take relevant decisions. This separation between control logic (in XFRM) and data plane allows us to packet reuse SW stack. ================================================================================ Configuration: iproute2: https://lore.kernel.org/netdev/cover.1652179360.git.leonro@nvidia.com/ Packet offload mode: ip xfrm state offload packet dev <if-name> dir <in|out> ip xfrm policy .... offload packet dev <if-name> Crypto offload mode: ip xfrm state offload crypto dev <if-name> dir <in|out> or (backward compatibility) ip xfrm state offload dev <if-name> dir <in|out> ================================================================================ Performance results: TCP multi-stream, using iperf3 instance per-CPU. +----------------------+--------+--------+--------+--------+---------+---------+ | | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | | +--------+--------+--------+--------+---------+---------+ | | BW (Gbps) | +----------------------+--------+--------+-------+---------+---------+---------+ | Baseline | 27.9 | 59 | 93.1 | 92.8 | 93.7 | 94.4 | +----------------------+--------+--------+-------+---------+---------+---------+ | Software IPsec | 6 | 11.9 | 23.3 | 45.9 | 83.8 | 91.8 | +----------------------+--------+--------+-------+---------+---------+---------+ | IPsec crypto offload | 15 | 29.7 | 58.5 | 89.6 | 90.4 | 90.8 | +----------------------+--------+--------+-------+---------+---------+---------+ | IPsec packet offload | 28 | 57 | 90.7 | 91 | 91.3 | 91.9 | +----------------------+--------+--------+-------+---------+---------+---------+ IPsec packet offload mode behaves as baseline and reaches linerate with same amount of CPUs. Setups details (similar for both sides): * NIC: ConnectX6-DX dual port, 100 Gbps each. Single port used in the tests. * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz ================================================================================ Series together with mlx5 part: https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=xfrm-next ================================================================================ Changelog: v10: * Added forgotten xdo_dev_state_del. Patch #4. * Moved changelog in cover letter to the end. * Added "if (xs->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {" line to newly added netronome IPsec support. Patch #2. v9: https://lore.kernel.org/all/cover.1669547603.git.leonro@nvidia.com * Added acquire support v8: https://lore.kernel.org/all/cover.1668753030.git.leonro@nvidia.com * Removed not-related blank line * Fixed typos in documentation v7: https://lore.kernel.org/all/cover.1667997522.git.leonro@nvidia.com As was discussed in IPsec workshop: * Renamed "full offload" to be "packet offload". * Added check that offloaded SA and policy have same device while sending packet * Added to SAD same optimization as was done for SPD to speed-up lookups. v6: https://lore.kernel.org/all/cover.1666692948.git.leonro@nvidia.com * Fixed misplaced "!" in sixth patch. v5: https://lore.kernel.org/all/cover.1666525321.git.leonro@nvidia.com * Rebased to latest ipsec-next. * Replaced HW priority patch with solution which mimics separated SPDs for SW and HW. See more description in this cover letter. * Dropped RFC tag, usecase, API and implementation are clear. v4: https://lore.kernel.org/all/cover.1662295929.git.leonro@nvidia.com * Changed title from "PATCH" to "PATCH RFC" per-request. * Added two new patches: one to update hard/soft limits and another initial take on documentation. * Added more info about lifetime/rekeying flow to cover letter, see relevant section. * perf traces for crypto mode will come later. v3: https://lore.kernel.org/all/cover.1661260787.git.leonro@nvidia.com * I didn't hear any suggestion what term to use instead of "packet offload", so left it as is. It is used in commit messages and documentation only and easy to rename. * Added performance data and background info to cover letter * Reused xfrm_output_resume() function to support multiple XFRM transformations * Add PMTU check in addition to driver .xdo_dev_offload_ok validation * Documentation is in progress, but not part of this series yet. v2: https://lore.kernel.org/all/cover.1660639789.git.leonro@nvidia.com * Rebased to latest 6.0-rc1 * Add an extra check in TX datapath patch to validate packets before forwarding to HW. * Added policy cleanup logic in case of netdev down event v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com * Moved comment to be before if (...) in third patch. v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com ----------------------------------------------------------------------- ============ Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-06Merge branch 'net-lan966x-enable-ptp-on-bridge-interfaces'Paolo Abeni
Horatiu Vultur says: ==================== net: lan966x: Enable PTP on bridge interfaces Before it was not allowed to run ptp on ports that are part of a bridge because in case of transparent clock the HW will still forward the frames so there would be duplicate frames. Now that there is VCAP support, it is possible to add entries in the VCAP to trap frames to the CPU and the CPU will forward these frames. The first part of the patch series, extends the VCAP support to be able to modify and get the rule, while the last patch uses the VCAP to trap the ptp frames. ==================== Link: https://lore.kernel.org/r/20221203104348.1749811-1-horatiu.vultur@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: lan966x: Add ptp trap rulesHoratiu Vultur
Currently lan966x, doesn't allow to run PTP over interfaces that are part of the bridge. The reason is when the lan966x was receiving a PTP frame (regardless if L2/IPv4/IPv6) the HW it would flood this frame. Now that it is possible to add VCAP rules to the HW, such to trap these frames to the CPU, it is possible to run PTP also over interfaces that are part of the bridge. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: microchip: vcap: Add vcap_rule_get_key_u32Horatiu Vultur
Add the function vcap_rule_get_key_u32 which allows to get the value and the mask of a key that exist on the rule. If the key doesn't exist, it would return error. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: microchip: vcap: Add vcap_mod_ruleHoratiu Vultur
Add the function vcap_mod_rule which allows to update an existing rule in the vcap. It is required for the rule to exist in the vcap to be able to modify it. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: microchip: vcap: Add vcap_get_ruleHoratiu Vultur
Add function vcap_get_rule which returns a rule based on the internal rule id. The entire functionality of reading and decoding the rule from the VCAP was inside vcap_api_debugfs file. So move the entire implementation in vcap_api as this is used also by vcap_get_rule. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06tipc: Fix potential OOB in tipc_link_proto_rcv()YueHaibing
Fix the potential risk of OOB if skb_linearize() fails in tipc_link_proto_rcv(). Fixes: 5cbb28a4bf65 ("tipc: linearize arriving NAME_DISTR and LINK_PROTO buffers") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20221203094635.29024-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: hisilicon: Fix potential use-after-free in hix5hd2_rx()Liu Jian
The skb is delivered to napi_gro_receive() which may free it, after calling this, dereferencing skb may trigger use-after-free. Fixes: 57c5bc9ad7d7 ("net: hisilicon: add hix5hd2 mac driver") Signed-off-by: Liu Jian <liujian56@huawei.com> Link: https://lore.kernel.org/r/20221203094240.1240211-2-liujian56@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: mdio: fix unbalanced fwnode reference count in mdio_device_release()Zeng Heng
There is warning report about of_node refcount leak while probing mdio device: OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /spi/soc@0/mdio@710700c0/ethernet@4 In of_mdiobus_register_device(), we increase fwnode refcount by fwnode_handle_get() before associating the of_node with mdio device, but it has never been decreased in normal path. Since that, in mdio_device_release(), it needs to call fwnode_handle_put() in addition instead of calling kfree() directly. After above, just calling mdio_device_free() in the error handle path of of_mdiobus_register_device() is enough to keep the refcount balanced. Fixes: a9049e0c513c ("mdio: Add support for mdio drivers.") Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20221203073441.3885317-1-zengheng4@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: hisilicon: Fix potential use-after-free in hisi_femac_rx()Liu Jian
The skb is delivered to napi_gro_receive() which may free it, after calling this, dereferencing skb may trigger use-after-free. Fixes: 542ae60af24f ("net: hisilicon: Add Fast Ethernet MAC driver") Signed-off-by: Liu Jian <liujian56@huawei.com> Link: https://lore.kernel.org/r/20221203094240.1240211-1-liujian56@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: thunderx: Fix missing destroy_workqueue of nicvf_rx_mode_wqYongqiang Liu
The nicvf_probe() won't destroy workqueue when register_netdev() failed. Add destroy_workqueue err handle case to fix this issue. Fixes: 2ecbe4f4a027 ("net: thunderx: replace global nicvf_rx_mode_wq work queue for all VFs to private for each of them.") Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://lore.kernel.org/r/20221203094125.602812-1-liuyongqiang13@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06ravb: Fix potential use-after-free in ravb_rx_gbeth()YueHaibing
The skb is delivered to napi_gro_receive() which may free it, after calling this, dereferencing skb may trigger use-after-free. Fixes: 1c59eb678cbd ("ravb: Fillup ravb_rx_gbeth() stub") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20221203092941.10880-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: microchip: sparx5: Fix missing destroy_workqueue of mact_queueQiheng Lin
The mchp_sparx5_probe() won't destroy workqueue created by create_singlethread_workqueue() in sparx5_start() when later inits failed. Add destroy_workqueue in the cleanup_ports case, also add it in mchp_sparx5_remove() Fixes: b37a1bae742f ("net: sparx5: add mactable support") Signed-off-by: Qiheng Lin <linqiheng@huawei.com> Link: https://lore.kernel.org/r/20221203070259.19560-1-linqiheng@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06ip_gre: do not report erspan version on GRE interfaceHangbin Liu
Although the type I ERSPAN is based on the barebones IP + GRE encapsulation and no extra ERSPAN header. Report erspan version on GRE interface looks unreasonable. Fix this by separating the erspan and gre fill info. IPv6 GRE does not have this info as IPv6 only supports erspan version 1 and 2. Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: f989d546a2d5 ("erspan: Add type I version 0 support.") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Link: https://lore.kernel.org/r/20221203032858.3130339-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: wwan: iosm: fix memory leak in ipc_mux_init()Zhengchao Shao
When failed to alloc ipc_mux->ul_adb.pp_qlt in ipc_mux_init(), ipc_mux is not released. Fixes: 1f52d7b62285 ("net: wwan: iosm: Enable M.2 7360 WWAN card support") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: M Chetan Kumar <m.chetan.kumar@intel.com> Link: https://lore.kernel.org/r/20221203020903.383235-1-shaozhengchao@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: mana: Fix race on per-CQ variable napi work_doneHaiyang Zhang
After calling napi_complete_done(), the NAPIF_STATE_SCHED bit may be cleared, and another CPU can start napi thread and access per-CQ variable, cq->work_done. If the other thread (for example, from busy_poll) sets it to a value >= budget, this thread will continue to run when it should stop, and cause memory corruption and panic. To fix this issue, save the per-CQ work_done variable in a local variable before napi_complete_done(), so it won't be corrupted by a possible concurrent thread after napi_complete_done(). Also, add a flag bit to advertise to the NIC firmware: the NAPI work_done variable race is fixed, so the driver is able to reliably support features like busy_poll. Cc: stable@vger.kernel.org Fixes: e1b5683ff62e ("net: mana: Move NAPI from EQ to CQ") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://lore.kernel.org/r/1670010190-28595-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06net: stmmac: fix "snps,axi-config" node property parsingJisheng Zhang
In dt-binding snps,dwmac.yaml, some properties under "snps,axi-config" node are named without "axi_" prefix, but the driver expects the prefix. Since the dt-binding has been there for a long time, we'd better make driver match the binding for compatibility. Fixes: afea03656add ("stmmac: rework DMA bus setting and introduce new platform AXI structure") Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Link: https://lore.kernel.org/r/20221202161739.2203-1-jszhang@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-06gpio/rockchip: fix refcount leak in rockchip_gpiolib_register()Wang Yufen
The node returned by of_get_parent() with refcount incremented, of_node_put() needs be called when finish using it. So add it in the end of of_pinctrl_get(). Fixes: 936ee2675eee ("gpio/rockchip: add driver for rockchip gpio") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2022-12-05Merge branch 'xfrm: interface: Add unstable helpers for XFRM metadata'Martin KaFai Lau
Eyal Birger says: ==================== This patch series adds xfrm metadata helpers using the unstable kfunc call interface for the TC-BPF hooks. This allows steering traffic towards different IPsec connections based on logic implemented in bpf programs. The helpers are integrated into the xfrm_interface module. For this purpose the main functionality of this module is moved to xfrm_interface_core.c. --- changes in v6: fix sparse warning in patch 2 changes in v5: - avoid cleanup of percpu dsts as detailed in patch 2 changes in v3: - tag bpf-next tree instead of ipsec-next - add IFLA_XFRM_COLLECT_METADATA sync patch ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05selftests/bpf: add xfrm_info testsEyal Birger
Test the xfrm_info kfunc helpers. The test setup creates three name spaces - NS0, NS1, NS2. XFRM tunnels are setup between NS0 and the two other NSs. The kfunc helpers are used to steer traffic from NS0 to the other NSs based on a userspace populated bpf global variable and validate that the return traffic had arrived from the desired NS. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20221203084659.1837829-5-eyal.birger@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05tools: add IFLA_XFRM_COLLECT_METADATA to uapi/linux/if_link.hEyal Birger
Needed for XFRM metadata tests. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20221203084659.1837829-4-eyal.birger@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05xfrm: interface: Add unstable helpers for setting/getting XFRM metadata from ↵Eyal Birger
TC-BPF This change adds xfrm metadata helpers using the unstable kfunc call interface for the TC-BPF hooks. This allows steering traffic towards different IPsec connections based on logic implemented in bpf programs. This object is built based on the availability of BTF debug info. When setting the xfrm metadata, percpu metadata dsts are used in order to avoid allocating a metadata dst per packet. In order to guarantee safe module unload, the percpu dsts are allocated on first use and never freed. The percpu pointer is stored in net/core/filter.c so that it can be reused on module reload. The metadata percpu dsts take ownership of the original skb dsts so that they may be used as part of the xfrm transmission logic - e.g. for MTU calculations. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20221203084659.1837829-3-eyal.birger@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05xfrm: interface: rename xfrm_interface.c to xfrm_interface_core.cEyal Birger
This change allows adding additional files to the xfrm_interface module. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20221203084659.1837829-2-eyal.birger@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05net: mtk_eth_soc: enable flow offload support for MT7986 SoCLorenzo Bianconi
Since Wireless Ethernet Dispatcher is now available for mt7986 in mt76, enable hw flow support for MT7986 SoC. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/fdcaacd827938e6a8c4aa1ac2c13e46d2c08c821.1670072898.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-05NFC: nci: Bounds check struct nfc_target arraysKees Cook
While running under CONFIG_FORTIFY_SOURCE=y, syzkaller reported: memcpy: detected field-spanning write (size 129) of single field "target->sensf_res" at net/nfc/nci/ntf.c:260 (size 18) This appears to be a legitimate lack of bounds checking in nci_add_new_protocol(). Add the missing checks. Reported-by: syzbot+210e196cef4711b65139@syzkaller.appspotmail.com Link: https://lore.kernel.org/lkml/0000000000001c590f05ee7b3ff4@google.com Fixes: 019c4fbaa790 ("NFC: Add NCI multiple targets support") Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221202214410.never.693-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-05ethtool: add netlink based get rss supportSudheer Mogilappagari
Add netlink based support for "ethtool -x <dev> [context x]" command by implementing ETHTOOL_MSG_RSS_GET netlink message. This is equivalent to functionality provided via ETHTOOL_GRSSH in ioctl path. It sends RSS table, hash key and hash function of an interface to user space. This patch implements existing functionality available in ioctl path and enables addition of new RSS context based parameters in future. Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Link: https://lore.kernel.org/r/20221202002555.241580-1-sudheer.mogilappagari@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-05proc: proc_skip_spaces() shouldn't think it is working on C stringsLinus Torvalds
proc_skip_spaces() seems to think it is working on C strings, and ends up being just a wrapper around skip_spaces() with a really odd calling convention. Instead of basing it on skip_spaces(), it should have looked more like proc_skip_char(), which really is the exact same function (except it skips a particular character, rather than whitespace). So use that as inspiration, odd coding and all. Now the calling convention actually makes sense and works for the intended purpose. Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05proc: avoid integer type confusion in get_proc_longLinus Torvalds
proc_get_long() is passed a size_t, but then assigns it to an 'int' variable for the length. Let's not do that, even if our IO paths are limited to MAX_RW_COUNT (exactly because of these kinds of type errors). So do the proper test in the rigth type. Reported-by: Kyle Zeng <zengyhkyle@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05ipc/sem: Fix dangling sem_array access in semtimedop raceJann Horn
When __do_semtimedop() goes to sleep because it has to wait for a semaphore value becoming zero or becoming bigger than some threshold, it links the on-stack sem_queue to the sem_array, then goes to sleep without holding a reference on the sem_array. When __do_semtimedop() comes back out of sleep, one of two things must happen: a) We prove that the on-stack sem_queue has been disconnected from the (possibly freed) sem_array, making it safe to return from the stack frame that the sem_queue exists in. b) We stabilize our reference to the sem_array, lock the sem_array, and detach the sem_queue from the sem_array ourselves. sem_array has RCU lifetime, so for case (b), the reference can be stabilized inside an RCU read-side critical section by locklessly checking whether the sem_queue is still connected to the sem_array. However, the current code does the lockless check on sem_queue before starting an RCU read-side critical section, so the result of the lockless check immediately becomes useless. Fix it by doing rcu_read_lock() before the lockless check. Now RCU ensures that if we observe the object being on our queue, the object can't be freed until rcu_read_unlock(). This bug is only hittable on kernel builds with full preemption support (either CONFIG_PREEMPT or PREEMPT_DYNAMIC with preempt=full). Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05i40e: Disallow ip4 and ip6 l4_4_bytesPrzemyslaw Patynowski
Return -EOPNOTSUPP, when user requests l4_4_bytes for raw IP4 or IP6 flow director filters. Flow director does not support filtering on l4 bytes for PCTYPEs used by IP4 and IP6 filters. Without this patch, user could create filters with l4_4_bytes fields, which did not do any filtering on L4, but only on L3 fields. Fixes: 36777d9fa24c ("i40e: check current configured input set when adding ntuple filters") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Kamil Maziarz <kamil.maziarz@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-12-05i40e: Fix for VF MAC address 0Sylwester Dziedziuch
After spawning max VFs on a PF, some VFs were not getting resources and their MAC addresses were 0. This was caused by PF sleeping before flushing HW registers which caused VIRTCHNL_VFR_VFACTIVE to not be set in time for VF. Fix by adding a sleep after hw flush. Fixes: e4b433f4a741 ("i40e: reset all VFs in parallel when rebuilding PF") Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com> Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-12-05i40e: Fix not setting default xps_cpus after resetMichal Jaron
During tx rings configuration default XPS queue config is set and __I40E_TX_XPS_INIT_DONE is locked. __I40E_TX_XPS_INIT_DONE state is cleared and set again with default mapping only during queues build, it means after first setup or reset with queues rebuild. (i.e. ethtool -L <interface> combined <number>) After other resets (i.e. ethtool -t <interface>) XPS_INIT_DONE is not cleared and those default maps cannot be set again. It results in cleared xps_cpus mapping until queues are not rebuild or mapping is not set by user. Add clearing __I40E_TX_XPS_INIT_DONE state during reset to let the driver set xps_cpus to defaults again after it was cleared. Fixes: 6f853d4f8e93 ("i40e: allow XPS with QoS enabled") Signed-off-by: Michal Jaron <michalx.jaron@intel.com> Signed-off-by: Kamil Maziarz <kamil.maziarz@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-12-05net: phy: mxl-gpy: rename MMD_VEND1 macros to match datasheetMichael Walle
Rename the temperature sensors macros to match the names in the datasheet. Signed-off-by: Michael Walle <michael@walle.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: mvneta: Prevent out of bounds read in mvneta_config_rss()Dan Carpenter
The pp->indir[0] value comes from the user. It is passed to: if (cpu_online(pp->rxq_def)) inside the mvneta_percpu_elect() function. It needs bounds checkeding to ensure that it is not beyond the end of the cpu bitmap. Fixes: cad5d847a093 ("net: mvneta: Fix the CPU choice in mvneta_percpu_elect") Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05nfp: add support for multicast filterDiana Wang
Rewrite nfp_net_set_rx_mode() to implement interface to delivery mc address and operations to firmware by using general mailbox for filtering multicast packets. The operations include add mc address and delete mc address. And the limitation of mc addresses number is 1024 for each net device. User triggers adding mc address by using command below: ip maddress add <mc address> dev <interface name> User triggers deleting mc address by using command below: ip maddress del <mc address> dev <interface name> Signed-off-by: Diana Wang <na.wang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05xen-netfront: Fix NULL sring after live migrationLin Liu
A NAPI is setup for each network sring to poll data to kernel The sring with source host is destroyed before live migration and new sring with target host is setup after live migration. The NAPI for the old sring is not deleted until setup new sring with target host after migration. With busy_poll/busy_read enabled, the NAPI can be polled before got deleted when resume VM. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: xennet_poll+0xae/0xd20 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI Call Trace: finish_task_switch+0x71/0x230 timerqueue_del+0x1d/0x40 hrtimer_try_to_cancel+0xb5/0x110 xennet_alloc_rx_buffers+0x2a0/0x2a0 napi_busy_loop+0xdb/0x270 sock_poll+0x87/0x90 do_sys_poll+0x26f/0x580 tracing_map_insert+0x1d4/0x2f0 event_hist_trigger+0x14a/0x260 finish_task_switch+0x71/0x230 __schedule+0x256/0x890 recalc_sigpending+0x1b/0x50 xen_sched_clock+0x15/0x20 __rb_reserve_next+0x12d/0x140 ring_buffer_lock_reserve+0x123/0x3d0 event_triggers_call+0x87/0xb0 trace_event_buffer_commit+0x1c4/0x210 xen_clocksource_get_cycles+0x15/0x20 ktime_get_ts64+0x51/0xf0 SyS_ppoll+0x160/0x1a0 SyS_ppoll+0x160/0x1a0 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x41/0xa6 ... RIP: xennet_poll+0xae/0xd20 RSP: ffffb4f041933900 CR2: 0000000000000008 ---[ end trace f8601785b354351c ]--- xen frontend should remove the NAPIs for the old srings before live migration as the bond srings are destroyed There is a tiny window between the srings are set to NULL and the NAPIs are disabled, It is safe as the NAPI threads are still frozen at that time Signed-off-by: Lin Liu <lin.liu@citrix.com> Fixes: 4ec2411980d0 ([NET]: Do not check netif_running() and carrier state in ->poll()) Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: microchip: sparx5: correctly free skb in xmitCasper Andersson
consume_skb on transmitted, kfree_skb on dropped, do not free on TX_BUSY. Previously the xmit function could return -EBUSY without freeing, which supposedly is interpreted as a drop. And was using kfree on successfully transmitted packets. sparx5_fdma_xmit and sparx5_inject returns error code, where -EBUSY indicates TX_BUSY and any other error code indicates dropped. Fixes: f3cad2611a77 ("net: sparx5: add hostmode with phylink support") Signed-off-by: Casper Andersson <casper.casan@gmail.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05octeontx2-pf: Fix potential memory leak in otx2_init_tc()Ziyang Xuan
In otx2_init_tc(), if rhashtable_init() failed, it does not free tc->tc_entries_bitmap which is allocated in otx2_tc_alloc_ent_bitmap(). Fixes: 2e2a8126ffac ("octeontx2-pf: Unify flow management variables") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: ipa: use sysfs_emit() to instead of scnprintf()ye xingchen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: mdiobus: fix double put fwnode in the error pathYang Yingliang
If phy_device_register() or fwnode_mdiobus_phy_device_register() fail, phy_device_free() is called, the device refcount is decreased to 0, then fwnode_handle_put() will be called in phy_device_release(), but in the error path, fwnode_handle_put() has already been called, so set fwnode to NULL after fwnode_handle_put() in the error path to avoid double put. Fixes: cdde1560118f ("net: mdiobus: fix unbalanced node reference count") Reported-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05Merge tag 'rxrpc-next-20221201-b' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Increasing SACK size and moving away from softirq, parts 2 & 3 Here are the second and third parts of patches in the process of moving rxrpc from doing a lot of its stuff in softirq context to doing it in an I/O thread in process context and thereby making it easier to support a larger SACK table. The full description is in the description for the first part[1] which is already in net-next. The second part includes some cleanups, adds some testing and overhauls some tracing: (1) Remove declaration of rxrpc_kernel_call_is_complete() as the definition is no longer present. (2) Remove the knet() and kproto() macros in favour of using tracepoints. (3) Remove handling of duplicate packets from recvmsg. The input side isn't now going to insert overlapping/duplicate packets into the recvmsg queue. (4) Don't use the rxrpc_conn_parameters struct in the rxrpc_connection or rxrpc_bundle structs - rather put the members in directly. (5) Extract the abort code from a received abort packet right up front rather than doing it in multiple places later. (6) Use enums and symbol lists rather than __builtin_return_address() to indicate where a tracepoint was triggered for local, peer, conn, call and skbuff tracing. (7) Add a refcount tracepoint for the rxrpc_bundle struct. (8) Implement an in-kernel server for the AFS rxperf testing program to talk to (enabled by a Kconfig option). This is tagged as rxrpc-next-20221201-a. The third part introduces the I/O thread and switches various bits over to running there: (1) Fix call timers and call and connection workqueues to not hold refs on the rxrpc_call and rxrpc_connection structs to thereby avoid messy cleanup when the last ref is put in softirq mode. (2) Split input.c so that the call packet processing bits are separate from the received packet distribution bits. Call packet processing gets bumped over to the call event handler. (3) Create a per-local endpoint I/O thread. Barring some tiny bits that still get done in softirq context, all packet reception, processing and transmission is done in this thread. That will allow a load of locking to be removed. (4) Perform packet processing and error processing from the I/O thread. (5) Provide a mechanism to process call event notifications in the I/O thread rather than queuing a work item for that call. (6) Move data and ACK transmission into the I/O thread. ACKs can then be transmitted at the point they're generated rather than getting delegated from softirq context to some process context somewhere. (7) Move call and local processor event handling into the I/O thread. (8) Move cwnd degradation to after packets have been transmitted so that they don't shorten the window too quickly. A bunch of simplifications can then be done: (1) The input_lock is no longer necessary as exclusion is achieved by running the code in the I/O thread only. (2) Don't need to use sk->sk_receive_queue.lock to guard socket state changes as the socket mutex should suffice. (3) Don't take spinlocks in RCU callback functions as they get run in softirq context and thus need _bh annotations. (4) RCU is then no longer needed for the peer's error_targets list. (5) Simplify the skbuff handling in the receive path by dropping the ref in the basic I/O thread loop and getting an extra ref as and when we need to queue the packet for recvmsg or another context. (6) Get the peer address earlier in the input process and pass it to the users so that we only do it once. This is tagged as rxrpc-next-20221201-b. Changes: ======== ver #2) - Added a patch to change four assertions into warnings in rxrpc_read() and fixed a checker warning from a __user annotation that should have been removed.. - Change a min() to min_t() in rxperf as PAGE_SIZE doesn't seem to match type size_t on i386. - Three error handling issues in rxrpc_new_incoming_call(): - If not DATA or not seq #1, should drop the packet, not abort. - Fix a goto that went to the wrong place, dropping a non-held lock. - Fix an rcu_read_lock that should've been an unlock. Tested-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: kafs-testing+fedora36_64checkkafs-build-144@auristor.com Link: https://lore.kernel.org/r/166794587113.2389296.16484814996876530222.stgit@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/166982725699.621383.2358362793992993374.stgit@warthog.procyon.org.uk/ # v1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: encx24j600: Fix invalid logic in reading of MISTAT registerValentina Goncharenko
A loop for reading MISTAT register continues while regmap_read() fails and (mistat & BUSY), but if regmap_read() fails a value of mistat is undefined. The patch proposes to check for BUSY flag only when regmap_read() succeed. Compile test only. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: d70e53262f5c ("net: Microchip encx24j600 driver") Signed-off-by: Valentina Goncharenko <goncharenko.vp@ispras.ru> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: encx24j600: Add parentheses to fix precedenceValentina Goncharenko
In functions regmap_encx24j600_phy_reg_read() and regmap_encx24j600_phy_reg_write() in the conditions of the waiting cycles for filling the variable 'ret' it is necessary to add parentheses to prevent wrong assignment due to logical operations precedence. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: d70e53262f5c ("net: Microchip encx24j600 driver") Signed-off-by: Valentina Goncharenko <goncharenko.vp@ispras.ru> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: stmmac: tegra: Add MGBE supportBhadram Varka
Add support for the Multi-Gigabit Ethernet (MGBE/XPCS) IP found on NVIDIA Tegra234 SoCs. Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Bhadram Varka <vbhadram@nvidia.com> Co-developed-by: Revanth Kumar Uppala <ruppala@nvidia.com> Signed-off-by: Revanth Kumar Uppala <ruppala@nvidia.com> Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: stmmac: Power up SERDES after the PHY linkRevanth Kumar Uppala
The Tegra MGBE ethernet controller requires that the SERDES link is powered-up after the PHY link is up, otherwise the link fails to become ready following a resume from suspend. Add a variable to indicate that the SERDES link must be powered-up after the PHY link. Signed-off-by: Revanth Kumar Uppala <ruppala@nvidia.com> Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05xfrm: document IPsec packet offload modeLeon Romanovsky
Extend XFRM device offload API description with newly added packet offload mode. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: add support to HW update soft and hard limitsLeon Romanovsky
Both in RX and TX, the traffic that performs IPsec packet offload transformation is accounted by HW. It is needed to properly handle hard limits that require to drop the packet. It means that XFRM core needs to update internal counters with the one that accounted by the HW, so new callbacks are introduced in this patch. In case of soft or hard limit is occurred, the driver should call to xfrm_state_check_expire() that will perform key rekeying exactly as done by XFRM core. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: speed-up lookup of HW policiesLeon Romanovsky
Devices that implement IPsec packet offload mode should offload SA and policies too. In RX path, it causes to the situation that HW will always have higher priority over any SW policies. It means that we don't need to perform any search of inexact policies and/or priority checks if HW policy was discovered. In such situation, the HW will catch the packets anyway and HW can still implement inexact lookups. In case specific policy is not found, we will continue with packet lookup and check for existence of HW policies in inexact list. HW policies are added to the head of SPD to ensure fast lookup, as XFRM iterates over all policies in the loop. The same solution of adding HW SAs at the begging of the list is applied to SA database too. However, we don't need to change lookups as they are sorted by insertion order and not priority. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: add RX datapath protection for IPsec packet offload modeLeon Romanovsky
Traffic received by device with enabled IPsec packet offload should be forwarded to the stack only after decryption, packet headers and trailers removed. Such packets are expected to be seen as normal (non-XFRM) ones, while not-supported packets should be dropped by the HW. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>