summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-12-06gpio/rockchip: fix refcount leak in rockchip_gpiolib_register()Wang Yufen
The node returned by of_get_parent() with refcount incremented, of_node_put() needs be called when finish using it. So add it in the end of of_pinctrl_get(). Fixes: 936ee2675eee ("gpio/rockchip: add driver for rockchip gpio") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2022-12-05Merge branch 'xfrm: interface: Add unstable helpers for XFRM metadata'Martin KaFai Lau
Eyal Birger says: ==================== This patch series adds xfrm metadata helpers using the unstable kfunc call interface for the TC-BPF hooks. This allows steering traffic towards different IPsec connections based on logic implemented in bpf programs. The helpers are integrated into the xfrm_interface module. For this purpose the main functionality of this module is moved to xfrm_interface_core.c. --- changes in v6: fix sparse warning in patch 2 changes in v5: - avoid cleanup of percpu dsts as detailed in patch 2 changes in v3: - tag bpf-next tree instead of ipsec-next - add IFLA_XFRM_COLLECT_METADATA sync patch ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05selftests/bpf: add xfrm_info testsEyal Birger
Test the xfrm_info kfunc helpers. The test setup creates three name spaces - NS0, NS1, NS2. XFRM tunnels are setup between NS0 and the two other NSs. The kfunc helpers are used to steer traffic from NS0 to the other NSs based on a userspace populated bpf global variable and validate that the return traffic had arrived from the desired NS. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20221203084659.1837829-5-eyal.birger@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05tools: add IFLA_XFRM_COLLECT_METADATA to uapi/linux/if_link.hEyal Birger
Needed for XFRM metadata tests. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20221203084659.1837829-4-eyal.birger@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05xfrm: interface: Add unstable helpers for setting/getting XFRM metadata from ↵Eyal Birger
TC-BPF This change adds xfrm metadata helpers using the unstable kfunc call interface for the TC-BPF hooks. This allows steering traffic towards different IPsec connections based on logic implemented in bpf programs. This object is built based on the availability of BTF debug info. When setting the xfrm metadata, percpu metadata dsts are used in order to avoid allocating a metadata dst per packet. In order to guarantee safe module unload, the percpu dsts are allocated on first use and never freed. The percpu pointer is stored in net/core/filter.c so that it can be reused on module reload. The metadata percpu dsts take ownership of the original skb dsts so that they may be used as part of the xfrm transmission logic - e.g. for MTU calculations. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20221203084659.1837829-3-eyal.birger@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05xfrm: interface: rename xfrm_interface.c to xfrm_interface_core.cEyal Birger
This change allows adding additional files to the xfrm_interface module. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20221203084659.1837829-2-eyal.birger@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05net: mtk_eth_soc: enable flow offload support for MT7986 SoCLorenzo Bianconi
Since Wireless Ethernet Dispatcher is now available for mt7986 in mt76, enable hw flow support for MT7986 SoC. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/fdcaacd827938e6a8c4aa1ac2c13e46d2c08c821.1670072898.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-05NFC: nci: Bounds check struct nfc_target arraysKees Cook
While running under CONFIG_FORTIFY_SOURCE=y, syzkaller reported: memcpy: detected field-spanning write (size 129) of single field "target->sensf_res" at net/nfc/nci/ntf.c:260 (size 18) This appears to be a legitimate lack of bounds checking in nci_add_new_protocol(). Add the missing checks. Reported-by: syzbot+210e196cef4711b65139@syzkaller.appspotmail.com Link: https://lore.kernel.org/lkml/0000000000001c590f05ee7b3ff4@google.com Fixes: 019c4fbaa790 ("NFC: Add NCI multiple targets support") Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20221202214410.never.693-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-05ethtool: add netlink based get rss supportSudheer Mogilappagari
Add netlink based support for "ethtool -x <dev> [context x]" command by implementing ETHTOOL_MSG_RSS_GET netlink message. This is equivalent to functionality provided via ETHTOOL_GRSSH in ioctl path. It sends RSS table, hash key and hash function of an interface to user space. This patch implements existing functionality available in ioctl path and enables addition of new RSS context based parameters in future. Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Link: https://lore.kernel.org/r/20221202002555.241580-1-sudheer.mogilappagari@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-05proc: proc_skip_spaces() shouldn't think it is working on C stringsLinus Torvalds
proc_skip_spaces() seems to think it is working on C strings, and ends up being just a wrapper around skip_spaces() with a really odd calling convention. Instead of basing it on skip_spaces(), it should have looked more like proc_skip_char(), which really is the exact same function (except it skips a particular character, rather than whitespace). So use that as inspiration, odd coding and all. Now the calling convention actually makes sense and works for the intended purpose. Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05proc: avoid integer type confusion in get_proc_longLinus Torvalds
proc_get_long() is passed a size_t, but then assigns it to an 'int' variable for the length. Let's not do that, even if our IO paths are limited to MAX_RW_COUNT (exactly because of these kinds of type errors). So do the proper test in the rigth type. Reported-by: Kyle Zeng <zengyhkyle@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05ipc/sem: Fix dangling sem_array access in semtimedop raceJann Horn
When __do_semtimedop() goes to sleep because it has to wait for a semaphore value becoming zero or becoming bigger than some threshold, it links the on-stack sem_queue to the sem_array, then goes to sleep without holding a reference on the sem_array. When __do_semtimedop() comes back out of sleep, one of two things must happen: a) We prove that the on-stack sem_queue has been disconnected from the (possibly freed) sem_array, making it safe to return from the stack frame that the sem_queue exists in. b) We stabilize our reference to the sem_array, lock the sem_array, and detach the sem_queue from the sem_array ourselves. sem_array has RCU lifetime, so for case (b), the reference can be stabilized inside an RCU read-side critical section by locklessly checking whether the sem_queue is still connected to the sem_array. However, the current code does the lockless check on sem_queue before starting an RCU read-side critical section, so the result of the lockless check immediately becomes useless. Fix it by doing rcu_read_lock() before the lockless check. Now RCU ensures that if we observe the object being on our queue, the object can't be freed until rcu_read_unlock(). This bug is only hittable on kernel builds with full preemption support (either CONFIG_PREEMPT or PREEMPT_DYNAMIC with preempt=full). Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05i40e: Disallow ip4 and ip6 l4_4_bytesPrzemyslaw Patynowski
Return -EOPNOTSUPP, when user requests l4_4_bytes for raw IP4 or IP6 flow director filters. Flow director does not support filtering on l4 bytes for PCTYPEs used by IP4 and IP6 filters. Without this patch, user could create filters with l4_4_bytes fields, which did not do any filtering on L4, but only on L3 fields. Fixes: 36777d9fa24c ("i40e: check current configured input set when adding ntuple filters") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Kamil Maziarz <kamil.maziarz@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-12-05i40e: Fix for VF MAC address 0Sylwester Dziedziuch
After spawning max VFs on a PF, some VFs were not getting resources and their MAC addresses were 0. This was caused by PF sleeping before flushing HW registers which caused VIRTCHNL_VFR_VFACTIVE to not be set in time for VF. Fix by adding a sleep after hw flush. Fixes: e4b433f4a741 ("i40e: reset all VFs in parallel when rebuilding PF") Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com> Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-12-05i40e: Fix not setting default xps_cpus after resetMichal Jaron
During tx rings configuration default XPS queue config is set and __I40E_TX_XPS_INIT_DONE is locked. __I40E_TX_XPS_INIT_DONE state is cleared and set again with default mapping only during queues build, it means after first setup or reset with queues rebuild. (i.e. ethtool -L <interface> combined <number>) After other resets (i.e. ethtool -t <interface>) XPS_INIT_DONE is not cleared and those default maps cannot be set again. It results in cleared xps_cpus mapping until queues are not rebuild or mapping is not set by user. Add clearing __I40E_TX_XPS_INIT_DONE state during reset to let the driver set xps_cpus to defaults again after it was cleared. Fixes: 6f853d4f8e93 ("i40e: allow XPS with QoS enabled") Signed-off-by: Michal Jaron <michalx.jaron@intel.com> Signed-off-by: Kamil Maziarz <kamil.maziarz@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-12-05net: phy: mxl-gpy: rename MMD_VEND1 macros to match datasheetMichael Walle
Rename the temperature sensors macros to match the names in the datasheet. Signed-off-by: Michael Walle <michael@walle.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: mvneta: Prevent out of bounds read in mvneta_config_rss()Dan Carpenter
The pp->indir[0] value comes from the user. It is passed to: if (cpu_online(pp->rxq_def)) inside the mvneta_percpu_elect() function. It needs bounds checkeding to ensure that it is not beyond the end of the cpu bitmap. Fixes: cad5d847a093 ("net: mvneta: Fix the CPU choice in mvneta_percpu_elect") Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05nfp: add support for multicast filterDiana Wang
Rewrite nfp_net_set_rx_mode() to implement interface to delivery mc address and operations to firmware by using general mailbox for filtering multicast packets. The operations include add mc address and delete mc address. And the limitation of mc addresses number is 1024 for each net device. User triggers adding mc address by using command below: ip maddress add <mc address> dev <interface name> User triggers deleting mc address by using command below: ip maddress del <mc address> dev <interface name> Signed-off-by: Diana Wang <na.wang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05xen-netfront: Fix NULL sring after live migrationLin Liu
A NAPI is setup for each network sring to poll data to kernel The sring with source host is destroyed before live migration and new sring with target host is setup after live migration. The NAPI for the old sring is not deleted until setup new sring with target host after migration. With busy_poll/busy_read enabled, the NAPI can be polled before got deleted when resume VM. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: xennet_poll+0xae/0xd20 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI Call Trace: finish_task_switch+0x71/0x230 timerqueue_del+0x1d/0x40 hrtimer_try_to_cancel+0xb5/0x110 xennet_alloc_rx_buffers+0x2a0/0x2a0 napi_busy_loop+0xdb/0x270 sock_poll+0x87/0x90 do_sys_poll+0x26f/0x580 tracing_map_insert+0x1d4/0x2f0 event_hist_trigger+0x14a/0x260 finish_task_switch+0x71/0x230 __schedule+0x256/0x890 recalc_sigpending+0x1b/0x50 xen_sched_clock+0x15/0x20 __rb_reserve_next+0x12d/0x140 ring_buffer_lock_reserve+0x123/0x3d0 event_triggers_call+0x87/0xb0 trace_event_buffer_commit+0x1c4/0x210 xen_clocksource_get_cycles+0x15/0x20 ktime_get_ts64+0x51/0xf0 SyS_ppoll+0x160/0x1a0 SyS_ppoll+0x160/0x1a0 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x41/0xa6 ... RIP: xennet_poll+0xae/0xd20 RSP: ffffb4f041933900 CR2: 0000000000000008 ---[ end trace f8601785b354351c ]--- xen frontend should remove the NAPIs for the old srings before live migration as the bond srings are destroyed There is a tiny window between the srings are set to NULL and the NAPIs are disabled, It is safe as the NAPI threads are still frozen at that time Signed-off-by: Lin Liu <lin.liu@citrix.com> Fixes: 4ec2411980d0 ([NET]: Do not check netif_running() and carrier state in ->poll()) Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: microchip: sparx5: correctly free skb in xmitCasper Andersson
consume_skb on transmitted, kfree_skb on dropped, do not free on TX_BUSY. Previously the xmit function could return -EBUSY without freeing, which supposedly is interpreted as a drop. And was using kfree on successfully transmitted packets. sparx5_fdma_xmit and sparx5_inject returns error code, where -EBUSY indicates TX_BUSY and any other error code indicates dropped. Fixes: f3cad2611a77 ("net: sparx5: add hostmode with phylink support") Signed-off-by: Casper Andersson <casper.casan@gmail.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05octeontx2-pf: Fix potential memory leak in otx2_init_tc()Ziyang Xuan
In otx2_init_tc(), if rhashtable_init() failed, it does not free tc->tc_entries_bitmap which is allocated in otx2_tc_alloc_ent_bitmap(). Fixes: 2e2a8126ffac ("octeontx2-pf: Unify flow management variables") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: ipa: use sysfs_emit() to instead of scnprintf()ye xingchen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: mdiobus: fix double put fwnode in the error pathYang Yingliang
If phy_device_register() or fwnode_mdiobus_phy_device_register() fail, phy_device_free() is called, the device refcount is decreased to 0, then fwnode_handle_put() will be called in phy_device_release(), but in the error path, fwnode_handle_put() has already been called, so set fwnode to NULL after fwnode_handle_put() in the error path to avoid double put. Fixes: cdde1560118f ("net: mdiobus: fix unbalanced node reference count") Reported-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05Merge tag 'rxrpc-next-20221201-b' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Increasing SACK size and moving away from softirq, parts 2 & 3 Here are the second and third parts of patches in the process of moving rxrpc from doing a lot of its stuff in softirq context to doing it in an I/O thread in process context and thereby making it easier to support a larger SACK table. The full description is in the description for the first part[1] which is already in net-next. The second part includes some cleanups, adds some testing and overhauls some tracing: (1) Remove declaration of rxrpc_kernel_call_is_complete() as the definition is no longer present. (2) Remove the knet() and kproto() macros in favour of using tracepoints. (3) Remove handling of duplicate packets from recvmsg. The input side isn't now going to insert overlapping/duplicate packets into the recvmsg queue. (4) Don't use the rxrpc_conn_parameters struct in the rxrpc_connection or rxrpc_bundle structs - rather put the members in directly. (5) Extract the abort code from a received abort packet right up front rather than doing it in multiple places later. (6) Use enums and symbol lists rather than __builtin_return_address() to indicate where a tracepoint was triggered for local, peer, conn, call and skbuff tracing. (7) Add a refcount tracepoint for the rxrpc_bundle struct. (8) Implement an in-kernel server for the AFS rxperf testing program to talk to (enabled by a Kconfig option). This is tagged as rxrpc-next-20221201-a. The third part introduces the I/O thread and switches various bits over to running there: (1) Fix call timers and call and connection workqueues to not hold refs on the rxrpc_call and rxrpc_connection structs to thereby avoid messy cleanup when the last ref is put in softirq mode. (2) Split input.c so that the call packet processing bits are separate from the received packet distribution bits. Call packet processing gets bumped over to the call event handler. (3) Create a per-local endpoint I/O thread. Barring some tiny bits that still get done in softirq context, all packet reception, processing and transmission is done in this thread. That will allow a load of locking to be removed. (4) Perform packet processing and error processing from the I/O thread. (5) Provide a mechanism to process call event notifications in the I/O thread rather than queuing a work item for that call. (6) Move data and ACK transmission into the I/O thread. ACKs can then be transmitted at the point they're generated rather than getting delegated from softirq context to some process context somewhere. (7) Move call and local processor event handling into the I/O thread. (8) Move cwnd degradation to after packets have been transmitted so that they don't shorten the window too quickly. A bunch of simplifications can then be done: (1) The input_lock is no longer necessary as exclusion is achieved by running the code in the I/O thread only. (2) Don't need to use sk->sk_receive_queue.lock to guard socket state changes as the socket mutex should suffice. (3) Don't take spinlocks in RCU callback functions as they get run in softirq context and thus need _bh annotations. (4) RCU is then no longer needed for the peer's error_targets list. (5) Simplify the skbuff handling in the receive path by dropping the ref in the basic I/O thread loop and getting an extra ref as and when we need to queue the packet for recvmsg or another context. (6) Get the peer address earlier in the input process and pass it to the users so that we only do it once. This is tagged as rxrpc-next-20221201-b. Changes: ======== ver #2) - Added a patch to change four assertions into warnings in rxrpc_read() and fixed a checker warning from a __user annotation that should have been removed.. - Change a min() to min_t() in rxperf as PAGE_SIZE doesn't seem to match type size_t on i386. - Three error handling issues in rxrpc_new_incoming_call(): - If not DATA or not seq #1, should drop the packet, not abort. - Fix a goto that went to the wrong place, dropping a non-held lock. - Fix an rcu_read_lock that should've been an unlock. Tested-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: kafs-testing+fedora36_64checkkafs-build-144@auristor.com Link: https://lore.kernel.org/r/166794587113.2389296.16484814996876530222.stgit@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/166982725699.621383.2358362793992993374.stgit@warthog.procyon.org.uk/ # v1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: encx24j600: Fix invalid logic in reading of MISTAT registerValentina Goncharenko
A loop for reading MISTAT register continues while regmap_read() fails and (mistat & BUSY), but if regmap_read() fails a value of mistat is undefined. The patch proposes to check for BUSY flag only when regmap_read() succeed. Compile test only. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: d70e53262f5c ("net: Microchip encx24j600 driver") Signed-off-by: Valentina Goncharenko <goncharenko.vp@ispras.ru> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: encx24j600: Add parentheses to fix precedenceValentina Goncharenko
In functions regmap_encx24j600_phy_reg_read() and regmap_encx24j600_phy_reg_write() in the conditions of the waiting cycles for filling the variable 'ret' it is necessary to add parentheses to prevent wrong assignment due to logical operations precedence. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: d70e53262f5c ("net: Microchip encx24j600 driver") Signed-off-by: Valentina Goncharenko <goncharenko.vp@ispras.ru> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: stmmac: tegra: Add MGBE supportBhadram Varka
Add support for the Multi-Gigabit Ethernet (MGBE/XPCS) IP found on NVIDIA Tegra234 SoCs. Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Bhadram Varka <vbhadram@nvidia.com> Co-developed-by: Revanth Kumar Uppala <ruppala@nvidia.com> Signed-off-by: Revanth Kumar Uppala <ruppala@nvidia.com> Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05net: stmmac: Power up SERDES after the PHY linkRevanth Kumar Uppala
The Tegra MGBE ethernet controller requires that the SERDES link is powered-up after the PHY link is up, otherwise the link fails to become ready following a resume from suspend. Add a variable to indicate that the SERDES link must be powered-up after the PHY link. Signed-off-by: Revanth Kumar Uppala <ruppala@nvidia.com> Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-05xfrm: document IPsec packet offload modeLeon Romanovsky
Extend XFRM device offload API description with newly added packet offload mode. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: add support to HW update soft and hard limitsLeon Romanovsky
Both in RX and TX, the traffic that performs IPsec packet offload transformation is accounted by HW. It is needed to properly handle hard limits that require to drop the packet. It means that XFRM core needs to update internal counters with the one that accounted by the HW, so new callbacks are introduced in this patch. In case of soft or hard limit is occurred, the driver should call to xfrm_state_check_expire() that will perform key rekeying exactly as done by XFRM core. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: speed-up lookup of HW policiesLeon Romanovsky
Devices that implement IPsec packet offload mode should offload SA and policies too. In RX path, it causes to the situation that HW will always have higher priority over any SW policies. It means that we don't need to perform any search of inexact policies and/or priority checks if HW policy was discovered. In such situation, the HW will catch the packets anyway and HW can still implement inexact lookups. In case specific policy is not found, we will continue with packet lookup and check for existence of HW policies in inexact list. HW policies are added to the head of SPD to ensure fast lookup, as XFRM iterates over all policies in the loop. The same solution of adding HW SAs at the begging of the list is applied to SA database too. However, we don't need to change lookups as they are sorted by insertion order and not priority. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: add RX datapath protection for IPsec packet offload modeLeon Romanovsky
Traffic received by device with enabled IPsec packet offload should be forwarded to the stack only after decryption, packet headers and trailers removed. Such packets are expected to be seen as normal (non-XFRM) ones, while not-supported packets should be dropped by the HW. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: add TX datapath support for IPsec packet offload modeLeon Romanovsky
In IPsec packet mode, the device is going to encrypt and encapsulate packets that are associated with offloaded policy. After successful policy lookup to indicate if packets should be offloaded or not, the stack forwards packets to the device to do the magic. Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Huy Nguyen <huyn@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: add an interface to offload policyLeon Romanovsky
Extend netlink interface to add and delete XFRM policy from the device. This functionality is a first step to implement packet IPsec offload solution. Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: allow state packet offload modeLeon Romanovsky
Allow users to configure xfrm states with packet offload mode. The packet mode must be requested both for policy and state, and such requires us to do not implement fallback. We explicitly return an error if requested packet mode can't be configured. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05xfrm: add new packet offload flagLeon Romanovsky
In the next patches, the xfrm core code will be extended to support new type of offload - packet offload. In that mode, both policy and state should be specially configured in order to perform whole offloaded data path. Full offload takes care of encryption, decryption, encapsulation and other operations with headers. As this mode is new for XFRM policy flow, we can "start fresh" with flag bits and release first and second bit for future use. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-05mac802154: fix missing INIT_LIST_HEAD in ieee802154_if_add()Wei Yongjun
Kernel fault injection test reports null-ptr-deref as follows: BUG: kernel NULL pointer dereference, address: 0000000000000008 RIP: 0010:cfg802154_netdev_notifier_call+0x120/0x310 include/linux/list.h:114 Call Trace: <TASK> raw_notifier_call_chain+0x6d/0xa0 kernel/notifier.c:87 call_netdevice_notifiers_info+0x6e/0xc0 net/core/dev.c:1944 unregister_netdevice_many_notify+0x60d/0xcb0 net/core/dev.c:1982 unregister_netdevice_queue+0x154/0x1a0 net/core/dev.c:10879 register_netdevice+0x9a8/0xb90 net/core/dev.c:10083 ieee802154_if_add+0x6ed/0x7e0 net/mac802154/iface.c:659 ieee802154_register_hw+0x29c/0x330 net/mac802154/main.c:229 mcr20a_probe+0xaaa/0xcb1 drivers/net/ieee802154/mcr20a.c:1316 ieee802154_if_add() allocates wpan_dev as netdev's private data, but not init the list in struct wpan_dev. cfg802154_netdev_notifier_call() manage the list when device register/unregister, and may lead to null-ptr-deref. Use INIT_LIST_HEAD() on it to initialize it correctly. Fixes: fcf39e6e88e9 ("ieee802154: add wpan_dev_list") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20221130091705.1831140-1-weiyongjun@huaweicloud.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-12-04selftests/bpf: Fix conflicts with built-in functions in bpf_iter_ksymJames Hilliard
Both tolower and toupper are built in c functions, we should not redefine them as this can result in a build error. Fixes the following errors: progs/bpf_iter_ksym.c:10:20: error: conflicting types for built-in function 'tolower'; expected 'int(int)' [-Werror=builtin-declaration-mismatch] 10 | static inline char tolower(char c) | ^~~~~~~ progs/bpf_iter_ksym.c:5:1: note: 'tolower' is declared in header '<ctype.h>' 4 | #include <bpf/bpf_helpers.h> +++ |+#include <ctype.h> 5 | progs/bpf_iter_ksym.c:17:20: error: conflicting types for built-in function 'toupper'; expected 'int(int)' [-Werror=builtin-declaration-mismatch] 17 | static inline char toupper(char c) | ^~~~~~~ progs/bpf_iter_ksym.c:17:20: note: 'toupper' is declared in header '<ctype.h>' See background on this sort of issue: https://stackoverflow.com/a/20582607 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12213 (C99, 7.1.3p1) "All identifiers with external linkage in any of the following subclauses (including the future library directions) are always reserved for use as identifiers with external linkage." This is documented behavior in GCC: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-std-2 Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221203010847.2191265-1-james.hilliard1@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04bpf, sockmap: fix race in sock_map_free()Eric Dumazet
sock_map_free() calls release_sock(sk) without owning a reference on the socket. This can cause use-after-free as syzbot found [1] Jakub Sitnicki already took care of a similar issue in sock_hash_free() in commit 75e68e5bf2c7 ("bpf, sockhash: Synchronize delete from bucket list on map free") [1] refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 3785 at lib/refcount.c:31 refcount_warn_saturate+0x17c/0x1a0 lib/refcount.c:31 Modules linked in: CPU: 0 PID: 3785 Comm: kworker/u4:6 Not tainted 6.1.0-rc7-syzkaller-00103-gef4d3ea40565 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:refcount_warn_saturate+0x17c/0x1a0 lib/refcount.c:31 Code: 68 8b 31 c0 e8 75 71 15 fd 0f 0b e9 64 ff ff ff e8 d9 6e 4e fd c6 05 62 9c 3d 0a 01 48 c7 c7 80 bb 68 8b 31 c0 e8 54 71 15 fd <0f> 0b e9 43 ff ff ff 89 d9 80 e1 07 80 c1 03 38 c1 0f 8c a2 fe ff RSP: 0018:ffffc9000456fb60 EFLAGS: 00010246 RAX: eae59bab72dcd700 RBX: 0000000000000004 RCX: ffff8880207057c0 RDX: 0000000000000000 RSI: 0000000000000201 RDI: 0000000000000000 RBP: 0000000000000004 R08: ffffffff816fdabd R09: fffff520008adee5 R10: fffff520008adee5 R11: 1ffff920008adee4 R12: 0000000000000004 R13: dffffc0000000000 R14: ffff88807b1c6c00 R15: 1ffff1100f638dcf FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b30c30000 CR3: 000000000d08e000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_dec include/linux/refcount.h:344 [inline] refcount_dec include/linux/refcount.h:359 [inline] __sock_put include/net/sock.h:779 [inline] tcp_release_cb+0x2d0/0x360 net/ipv4/tcp_output.c:1092 release_sock+0xaf/0x1c0 net/core/sock.c:3468 sock_map_free+0x219/0x2c0 net/core/sock_map.c:356 process_one_work+0x81c/0xd10 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> Fixes: 7e81a3530206 ("bpf: Sockmap, ensure sock lock held during tear down") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Song Liu <songliubraving@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20221202111640.2745533-1-edumazet@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04bpf: Add dummy type reference to nf_conn___init to fix type deduplicationToke Høiland-Jørgensen
The bpf_ct_set_nat_info() kfunc is defined in the nf_nat.ko module, and takes as a parameter the nf_conn___init struct, which is allocated through the bpf_xdp_ct_alloc() helper defined in the nf_conntrack.ko module. However, because kernel modules can't deduplicate BTF types between each other, and the nf_conn___init struct is not referenced anywhere in vmlinux BTF, this leads to two distinct BTF IDs for the same type (one in each module). This confuses the verifier, as described here: https://lore.kernel.org/all/87leoh372s.fsf@toke.dk/ As a workaround, add an explicit BTF_TYPE_EMIT for the type in net/filter.c, so the type definition gets included in vmlinux BTF. This way, both modules can refer to the same type ID (as they both build on top of vmlinux BTF), and the verifier is no longer confused. v2: - Use BTF_TYPE_EMIT (which is a statement so it has to be inside a function definition; use xdp_func_proto() for this, since this is mostly xdp-related). Fixes: 820dc0523e05 ("net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221201123939.696558-1-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04bpf: Add sleepable prog tests for cgrp local storageYonghong Song
Add three tests for cgrp local storage support for sleepable progs. Two tests can load and run properly, one for cgroup_iter, another for passing current->cgroups->dfl_cgrp to bpf_cgrp_storage_get() helper. One test has bpf_rcu_read_lock() and failed to load. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221201050449.2785613-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04bpf: Enable sleeptable support for cgrp local storageYonghong Song
Similar to sk/inode/task local storage, enable sleepable support for cgrp local storage. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221201050444.2785007-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04bpf, docs: BPF Iterator DocumentSreevani Sreejith
Document that describes how BPF iterators work, how to use iterators, and how to pass parameters in BPF iterators. Acked-by: David Vernet <void@manifault.com> Signed-off-by: Sreevani Sreejith <psreep@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221202221710.320810-2-ssreevani@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04nfp: correct desc type when header dma len is 4096Yinjun Zhang
When there's only one buffer to dma and its length is 4096, then only one data descriptor is needed to carry it according to current descriptor definition. So the descriptor type should be `simple` instead of `gather`, the latter requires more than one descriptor, otherwise it'll be dropped by application firmware. Fixes: c10d12e3dce8 ("nfp: add support for NFDK data path") Fixes: d9d950490a0a ("nfp: nfdk: implement xdp tx path for NFDK") Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Reviewed-by: Richard Donkin <richard.donkin@corigine.com> Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20221202134646.311108-1-simon.horman@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-04Linux 6.1-rc8Linus Torvalds
2022-12-04bpf: Do not mark certain LSM hook arguments as trustedYonghong Song
Martin mentioned that the verifier cannot assume arguments from LSM hook sk_alloc_security being trusted since after the hook is called, the sk ref_count is set to 1. This will overwrite the ref_count changed by the bpf program and may cause ref_count underflow later on. I then further checked some other hooks. For example, for bpf_lsm_file_alloc() hook in fs/file_table.c, f->f_cred = get_cred(cred); error = security_file_alloc(f); if (unlikely(error)) { file_free_rcu(&f->f_rcuhead); return ERR_PTR(error); } atomic_long_set(&f->f_count, 1); The input parameter 'f' to security_file_alloc() cannot be trusted as well. Specifically, I investiaged bpf_map/bpf_prog/file/sk/task alloc/free lsm hooks. Except bpf_map_alloc and task_alloc, arguments for all other hooks should not be considered as trusted. This may not be a complete list, but it covers common usage for sk and task. Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221203204954.2043348-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04Merge branch 'bpf: Handle MEM_RCU type properly'Alexei Starovoitov
Yonghong Song says: ==================== Patch set [1] added rcu support for bpf programs. In [1], a rcu pointer is considered to be trusted and not null. This is actually not true in some cases. The rcu pointer could be null, and for non-null rcu pointer, it may have reference count of 0. This small patch set fixed this problem. Patch 1 is the kernel fix. Patch 2 adjusted selftests properly. Patch 3 added documentation for newly-introduced KF_RCU flag. [1] https://lore.kernel.org/all/20221124053201.2372298-1-yhs@fb.com/ Changelogs: v1 -> v2: - rcu ptr could be NULL. - non_null_rcu_ptr->rcu_field can be marked as MEM_RCU as well. - Adjust the code to avoid existing error message change. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04docs/bpf: Add KF_RCU documentationYonghong Song
Add proper KF_RCU documentation in kfuncs.rst. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221203184613.478967-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04selftests/bpf: Fix rcu_read_lock test with new MEM_RCU semanticsYonghong Song
Add MEM_RCU pointer null checking for related tests. Also modified task_acquire test so it takes a rcu ptr 'ptr' where 'ptr = rcu_ptr->rcu_field'. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221203184607.478314-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-04bpf: Handle MEM_RCU type properlyYonghong Song
Commit 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") introduced MEM_RCU and bpf_rcu_read_lock/unlock() support. In that commit, a rcu pointer is tagged with both MEM_RCU and PTR_TRUSTED so that it can be passed into kfuncs or helpers as an argument. Martin raised a good question in [1] such that the rcu pointer, although being able to accessing the object, might have reference count of 0. This might cause a problem if the rcu pointer is passed to a kfunc which expects trusted arguments where ref count should be greater than 0. This patch makes the following changes related to MEM_RCU pointer: - MEM_RCU pointer might be NULL (PTR_MAYBE_NULL). - Introduce KF_RCU so MEM_RCU ptr can be acquired with a KF_RCU tagged kfunc which assumes ref count of rcu ptr could be zero. - For mem access 'b = ptr->a', say 'ptr' is a MEM_RCU ptr, and 'a' is tagged with __rcu as well. Let us mark 'b' as MEM_RCU | PTR_MAYBE_NULL. [1] https://lore.kernel.org/bpf/ac70f574-4023-664e-b711-e0d3b18117fd@linux.dev/ Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221203184602.477272-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>