path: root/net/xdp/xsk_buff_pool.c
2025-05-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.15-rc5). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-24  xsk: Fix race condition in AF_XDP generic RX path  (e.kubanski)
Move rx_lock from xsk_socket to xsk_buff_pool. Fix synchronization for shared umem mode in the generic RX path, where multiple sockets share a single xsk_buff_pool. The RX queue is exclusive to an xsk_socket, while the FILL queue can be shared between multiple sockets. This could result in a race condition where two CPU cores access the RX path of two different sockets sharing the same umem. Protect both queues by acquiring the spinlock in the shared xsk_buff_pool. Lock contention may be minimized in the future by some per-thread FQ buffering.

It's safe and necessary to move spin_lock_bh(rx_lock) after xsk_rcv_check():
* xs->pool and spinlock_init is synchronized by xsk_bind() -> xsk_is_bound() memory barriers.
* xsk_rcv_check() may return true at the moment of xsk_release() or xsk_unbind_dev(), however this will not cause any data races or race conditions. xsk_unbind_dev() removes the xdp socket from all maps and waits for the completion of all outstanding rx operations. Packets in the RX path will either complete safely or be dropped.

Signed-off-by: Eryk Kubanski <e.kubanski@partner.samsung.com> Fixes: bf0bdd1343efb ("xdp: fix race on generic receive path") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://patch.msgid.link/20250416101908.10919-1-e.kubanski@partner.samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
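A simplified sketch of the resulting locking scheme in the generic RX path (function shapes abbreviated from net/xdp/xsk.c; treat the exact signatures as approximate):

    /* rx_lock now lives in the shared xsk_buff_pool, so two sockets
     * sharing one umem serialize their generic RX paths. */
    int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
    {
            u32 len = xdp_get_buff_len(xdp);
            int err;

            err = xsk_rcv_check(xs, xdp, len);      /* safe before the lock */
            if (err)
                    return err;

            spin_lock_bh(&xs->pool->rx_lock);       /* was xs->rx_lock before */
            err = __xsk_rcv(xs, xdp, len);
            xsk_flush(xs);
            spin_unlock_bh(&xs->pool->rx_lock);
            return err;
    }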
2025-04-09  net: designate XSK pool pointers in queues as "ops protected"  (Jakub Kicinski)
Read accesses go via xsk_get_pool_from_qid(); the calls coming from the core and gve look safe (other "ops locked" drivers don't support XSK). Write accesses go via xsk_reg_pool_at_qid() and xsk_clear_pool_at_qid(). The former is already under the ops lock; the latter is not (its callers come from the workqueue via xp_clear_dev() and from NETDEV_UNREGISTER via xsk_notifier()). Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250408195956.412733-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-20  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Paolo Abeni)
Cross-merge networking fixes after downstream PR (net-6.14-rc8).

Conflict:

tools/testing/selftests/net/Makefile
  03544faad761 ("selftest: net: add proc_net_pktgen")
  3ed61b8938c6 ("selftests: net: test for lwtunnel dst ref loops")

tools/testing/selftests/net/config:
  85cb3711acb8 ("selftests: net: Add test cases for link and peer netns")
  3ed61b8938c6 ("selftests: net: test for lwtunnel dst ref loops")

Adjacent commits:

tools/testing/selftests/net/Makefile
  c935af429ec2 ("selftests: net: add support for testing SO_RCVMARK and SO_RCVPRIORITY")
  355d940f4d5a ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19  xsk: fix an integer overflow in xp_create_and_assign_umem()  (Gavrilov Ilia)
Since the i and pool->chunk_size variables are of type 'u32', their product can wrap around and then be cast to 'u64'. This can lead to two different XDP buffers pointing to the same memory area. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 94033cd8e73b ("xsk: Optimize for aligned case") Cc: stable@vger.kernel.org Signed-off-by: Ilia Gavrilov <Ilia.Gavrilov@infotecs.ru> Link: https://patch.msgid.link/20250313085007.3116044-1-Ilia.Gavrilov@infotecs.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
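For illustration, the overflow pattern the fix addresses (a standalone sketch with hypothetical values, not the patched line itself):

    /* Both operands are u32, so the multiplication is performed in
     * 32 bits and can wrap before the implicit widening to u64. */
    u32 i = 0x100000, chunk_size = 0x1000;   /* product is 2^32 */
    u64 addr;

    addr = i * chunk_size;        /* buggy: wraps to 0 in 32-bit math */
    addr = (u64)i * chunk_size;   /* fixed: promote first, yields 2^32 */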
2025-03-08  net: move misc netdev_lock flavors to a separate header  (Jakub Kicinski)
Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06  net: hold netdev instance lock during ndo_bpf  (Stanislav Fomichev)
Cover the paths that come via bpf system call and XSK bind. Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-10-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10  xsk: add helper to get &xdp_desc's DMA and meta pointer in one go  (Alexander Lobakin)
Currently, when your driver supports XSk Tx metadata and you want to send an XSk frame, you need to do the following:

* call external xsk_buff_raw_get_dma();
* call inline xsk_buff_get_metadata(), which calls external xsk_buff_raw_get_data() and then does some inline checks.

This effectively means that the following piece:

    addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;

is done twice per frame, plus you have 2 external calls per frame, plus this:

    meta = pool->addrs + addr - pool->tx_metadata_len;
    if (unlikely(!xsk_buff_valid_tx_metadata(meta)))

is always inlined, even if there's no meta or it's invalid. Add xsk_buff_raw_get_ctx() (xp_raw_get_ctx() to be precise) to do that in one go. It returns a small structure with 2 fields: the DMA address, filled unconditionally, and the metadata pointer, non-NULL only if it's present and valid. The address correction is performed only once and you also have only 1 external call per XSk frame, which does all the calculations and checks outside of your hotpath. You only need to check `if (ctx.meta)` for the metadata presence. To not copy any existing code, split address correction and getting the virtual and DMA addresses into small helpers. bloat-o-meter reports no object code changes for the existing functionality. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20250206182630.3914318-5-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
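A hypothetical driver Tx hot path using the new helper (the context struct name follows this series; hw_desc, my_tx_metadata_ops and priv are placeholders):

    /* One external call resolves the DMA address and, when present
     * and valid, the Tx metadata pointer. */
    struct xdp_desc_ctx ctx;

    ctx = xsk_buff_raw_get_ctx(pool, desc->addr);
    hw_desc->addr = ctx.dma;        /* always filled */
    if (ctx.meta)                   /* non-NULL only if present and valid */
            xsk_tx_metadata_request(ctx.meta, &my_tx_metadata_ops, priv);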
2024-11-25  xsk: always clear DMA mapping information when unmapping the pool  (Larysa Zaremba)
When the umem is shared, the DMA mapping is also shared between the xsk pools, therefore it should stay valid as long as at least 1 user remains. However, the pool also keeps the copies of DMA-related information that are initialized in the same way in xp_init_dma_info(), but cleared by xp_dma_unmap() only for the last remaining pool, this causes the problems below. The first one is that the commit adbf5a42341f ("ice: remove af_xdp_zc_qps bitmap") relies on pool->dev to determine the presence of a ZC pool on a given queue, avoiding internal bookkeeping. This works perfectly fine if the UMEM is not shared, but reliably fails otherwise as stated in the linked report. The second one is pool->dma_pages which is dynamically allocated and only freed in xp_dma_unmap(), this leads to a small memory leak. kmemleak does not catch it, but by printing the allocation results after terminating the userspace program it is possible to see that all addresses except the one belonging to the last detached pool are still accessible through the kmemleak dump functionality. Always clear the DMA mapping information from the pool and free pool->dma_pages when unmapping the pool, so that the only difference between results of the last remaining user's call and the ones before would be the destruction of the DMA mapping. Fixes: adbf5a42341f ("ice: remove af_xdp_zc_qps bitmap") Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings") Reported-by: Alasdair McWilliam <alasdair.mcwilliam@outlook.com> Closes: https://lore.kernel.org/PA4P194MB10056F208AF221D043F57A3D86512@PA4P194MB1005.EURP194.PROD.OUTLOOK.COM Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20241122112912.89881-1-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
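Roughly, the unmap path after the fix (a sketch; the refcounted dma_map lookup and error handling are reduced to their essence):

    void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
    {
            struct xsk_dma_map *dma_map;

            if (!pool->dma_pages)
                    return;

            dma_map = xp_find_dma_map(pool);
            if (refcount_dec_and_test(&dma_map->users))
                    __xp_dma_unmap(dma_map, attrs);  /* last user only */

            /* Always reset the pool's private copies, fixing both the
             * stale pool->dev check and the dma_pages leak. */
            kvfree(pool->dma_pages);
            pool->dma_pages = NULL;
            pool->dma_pages_cnt = 0;
            pool->dev = NULL;
    }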
2024-10-14  xsk: Wrap duplicated code to function  (Maciej Fijalkowski)
Both allocation paths have exactly the same code responsible for getting and initializing an xskb. Pull it out to a common function. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-6-maciej.fijalkowski@intel.com
2024-10-14  xsk: Carry a copy of xdp_zc_max_segs within xsk_buff_pool  (Maciej Fijalkowski)
This is so we avoid dereferencing struct net_device within the hot path. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-5-maciej.fijalkowski@intel.com
2024-10-14  xsk: Get rid of xdp_buff_xsk::orig_addr  (Maciej Fijalkowski)
Continue the process of dieting xdp_buff_xsk by removing the orig_addr member. It can be calculated from xdp->data_hard_start where it was previously used, so it is not anything that has to be carried around in a struct used widely in the hot path. This has been used for initializing xdp_buff_xsk::frame_dma during pool setup and as a shortcut in xp_get_handle() to retrieve the address provided to the xsk Rx queue. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-4-maciej.fijalkowski@intel.com
2024-10-14  xsk: s/free_list_node/list_node/  (Maciej Fijalkowski)
Now that free_list_node's purpose is twofold, make it just a 'list_node'. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-3-maciej.fijalkowski@intel.com
2024-10-14  xsk: Get rid of xdp_buff_xsk::xskb_list_node  (Maciej Fijalkowski)
Let's bring xdp_buff_xsk back to occupying 2 cachelines by removing xskb_list_node - for the purpose of gathering the xskb frags free_list_node can be used, head of the list (xsk_buff_pool::xskb_list) stays as-is, just reuse the node ptr. It is safe to do as a single xdp_buff_xsk can never reside in two pool's lists simultaneously. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-2-maciej.fijalkowski@intel.com
2024-09-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Merge in late fixes to prepare for the 6.12 net-next PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-13  xsk: fix batch alloc API on non-coherent systems  (Maciej Fijalkowski)
In cases when synchronizing DMA operations is necessary, xsk_buff_alloc_batch() returns a single buffer instead of the requested count. This puts pressure on drivers that use the batch API as they have to check for this corner case on their side and take care of allocations by themselves, which feels counterproductive. Let us improve the core by looping over xp_alloc() @max times when the slow path needs to be taken. Another issue with the current interface, as spotted and fixed by Dries, was that when a driver called xsk_buff_alloc_batch() with @max == 0, for the slow path case it still allocated and returned a single buffer, which should not happen. By introducing the logic from the first paragraph we kill two birds with one stone and address this problem as well. Fixes: 47e4075df300 ("xsk: Batched buffer allocation for the pool") Reported-and-tested-by: Dries De Winter <ddewinter@synamedia.com> Co-developed-by: Dries De Winter <ddewinter@synamedia.com> Signed-off-by: Dries De Winter <ddewinter@synamedia.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://patch.msgid.link/20240911191019.296480-1-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
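The shape of the fixed slow path, roughly (a sketch of the loop described above):

    /* Loop xp_alloc() up to @max times instead of returning a single
     * buffer, so batch semantics hold on non-coherent systems too,
     * and @max == 0 correctly yields zero buffers. */
    static u32 xp_alloc_slow(struct xsk_buff_pool *pool,
                             struct xdp_buff **xdp, u32 max)
    {
            u32 i;

            for (i = 0; i < max; i++) {
                    struct xdp_buff_xsk *xskb = xp_alloc(pool);

                    if (!xskb)
                            break;
                    *xdp = &xskb->xdp;
                    xdp++;
            }
            return i;
    }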
2024-09-12  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-09-11

We've added 12 non-merge commits during the last 16 day(s) which contain a total of 20 files changed, 228 insertions(+), 30 deletions(-).

There's a minor merge conflict in drivers/net/netkit.c:
  00d066a4d4ed ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
  d96608794889 ("netkit: Disable netpoll support")

The main changes are:

1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used to easily parse skbs in BPF programs attached to tracepoints, from Philo Lu.

2) Add a cond_resched() point in BPF's sock_hash_free() as there have been several syzbot soft lockup reports recently, from Eric Dumazet.

3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which got noticed when zero copy ice driver started to use it, from Maciej Fijalkowski.

4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs up via netif_receive_skb_list() to better measure latencies, from Daniel Xu.

5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann.

6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but instead gather the actual value via /proc/sys/net/core/max_skb_frags, also from Maciej Fijalkowski.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  sock_map: Add a cond_resched() in sock_hash_free()
  selftests/bpf: Expand skb dynptr selftests for tp_btf
  bpf: Allow bpf_dynptr_from_skb() for tp_btf
  tcp: Use skb__nullable in trace_tcp_send_reset
  selftests/bpf: Add test for __nullable suffix in tp_btf
  bpf: Support __nullable argument suffix for tp_btf
  bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
  selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob
  xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()
  tcp_bpf: Remove an unused parameter for bpf_tcp_ingress()
  bpf, sockmap: Correct spelling skmsg.c
  netkit: Disable netpoll support
====================

Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-11  netdev: support binding dma-buf to netdevice  (Mina Almasry)
Add a netdev_dmabuf_binding struct which represents the dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to rx queues on the netdevice. On the binding, the dma_buf_attach & dma_buf_map_attachment will occur. The entries in the sg_table from mapping will be inserted into a genpool to make it ready for allocation. The chunks in the genpool are owned by a dmabuf_chunk_owner struct which holds the dma-buf offset of the base of the chunk and the dma_addr of the chunk. Both are needed to use allocations that come from this chunk. We create a new type that represents an allocation from the genpool: net_iov. We setup the net_iov allocation size in the genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally allocated by the page pool and given to the drivers. The user can unbind the dmabuf from the netdevice by closing the netlink socket that established the binding. We do this so that the binding is automatically unbound even if the userspace process crashes. The binding and unbinding leaves an indicator in struct netdev_rx_queue that the given queue is bound, and the binding is actuated by resetting the rx queue using the queue API. The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # excluding netlink Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240910171458.219195-4-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05  xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()  (Maciej Fijalkowski)
We have STAT_FILL_EMPTY test case in xskxceiver that tries to process traffic with fill queue being empty which currently fails for zero copy ice driver after it started to use xsk_buff_can_alloc() API. That is because xsk_queue::queue_empty_descs is currently only increased from alloc APIs and right now if driver sees that xsk_buff_pool will be unable to provide the requested count of buffers, it bails out early, skipping calls to xsk_buff_alloc{_batch}(). Mentioned statistic should be handled in xsk_buff_can_alloc() from the very beginning, so let's add this logic now. Do it by open coding xskq_cons_has_entries() and bumping queue_empty_descs in the middle when fill queue currently has no entries. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20240904162808.249160-1-maciej.fijalkowski@intel.com
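After the change, xp_can_alloc() looks roughly like this (the open-coded check with the stat bump in the middle):

    bool xp_can_alloc(struct xsk_buff_pool *pool, u32 count)
    {
            u32 req_count, avail_count;

            if (pool->free_list_cnt >= count)
                    return true;

            req_count = count - pool->free_list_cnt;
            avail_count = xskq_cons_nb_entries(pool->fq, req_count);
            if (!avail_count)
                    pool->fq->queue_empty_descs++;  /* fill queue empty */

            return avail_count >= req_count;
    }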
2024-05-08  xsk: use generic DMA sync shortcut instead of a custom one  (Alexander Lobakin)
The XSk infra has been using its own DMA sync shortcut to try to avoid redundant function calls. Now that there is a generic one, remove the custom implementation and rely on the generic helpers. xsk_buff_dma_sync_for_cpu() doesn't need the second argument anymore, remove it. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
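The driver-visible effect, per the message above (rx_ring->xsk_pool is a placeholder name):

    /* Before: the pool had to be passed so XSK could decide on its
     * own whether the sync was needed. */
    xsk_buff_dma_sync_for_cpu(xdp, rx_ring->xsk_pool);

    /* After: the generic DMA layer tracks the shortcut itself. */
    xsk_buff_dma_sync_for_cpu(xdp);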
2024-01-24  xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags  (Maciej Fijalkowski)
XDP multi-buffer support introduced the XDP_FLAGS_HAS_FRAGS flag that is used by drivers to notify the data path whether an xdp_buff contains fragments or not. The data path looks up the mentioned flag on the first buffer that occupies the linear part of the xdp_buff, so drivers only modify it there. This is sufficient for SKB and XDP_DRV modes, as usually the xdp_buff is allocated on the stack or resides within a struct representing the driver's queue, and fragments are carried via skb_frag_t structs. IOW, we are dealing with only one xdp_buff. ZC mode though relies on a list of xdp_buff structs that is carried via xsk_buff_pool::xskb_list, so the ZC data path has to make sure that fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise, xsk_buff_free() could misbehave if it were executed against an xdp_buff that carries a frag with the XDP_FLAGS_HAS_FRAGS flag set. Such a scenario can take place when, within the supplied XDP program, bpf_xdp_adjust_tail() is used with a negative offset that would in turn release the tail fragment from the multi-buffer frame. Calling xsk_buff_free() on the tail fragment with XDP_FLAGS_HAS_FRAGS would result in releasing all the nodes from xskb_list that were produced by the driver before XDP program execution, which is not what is intended - only the tail fragment should be deleted from xskb_list and then put onto xsk_buff_pool::free_list. Such a multi-buffer frame will never make it up to user space, so from the AF_XDP application's POV there would be no traffic running; however, due to the free_list constantly getting new nodes, the driver will be able to feed the HW Rx queue with recycled buffers. Bottom line is that instead of traffic being redirected to user space, it would be continuously dropped. To fix this, let us clear the mentioned flag on the xsk_buff_pool side during xdp_buff initialization, which is what should have been done right from the start of XSK multi-buffer support. Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support") Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support") Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
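The fix itself is conceptually one line on the pool side (a sketch of the xdp_buff hand-out):

    /* Clear the flags whenever the pool initializes an xskb for use,
     * so a tail frag released by bpf_xdp_adjust_tail() can never reach
     * xsk_buff_free() with a stale XDP_FLAGS_HAS_FRAGS bit. */
    xskb->xdp.flags = 0;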
2023-12-13  xsk: add functions to fill control buffer  (Maciej Fijalkowski)
Commit 94ecc5ca4dbf ("xsk: Add cb area to struct xdp_buff_xsk") has added a buffer for custom data to xdp_buff_xsk. Particularly, this memory is used for data consumed by XDP hints kfuncs. It does not always change on a per-packet basis and some parts can be set, for example, at the same time as RX queue info. Add functions to fill all cbs in xsk_buff_pool with the same metadata. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20231205210847.28460-8-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-29  xsk: Add option to calculate TX checksum in SW  (Stanislav Fomichev)
For XDP_COPY mode, add a UMEM option XDP_UMEM_TX_SW_CSUM to call skb_checksum_help in the transmit path. Might be useful for debugging issues with real hardware. I also use this mode in the selftests. Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231127190319.1190813-9-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-29  xsk: Support tx_metadata_len  (Stanislav Fomichev)
For zerocopy mode, tx_desc->addr can point to an arbitrary offset and carry some TX metadata in the headroom. For copy mode, there is no way currently to populate skb metadata. Introduce new tx_metadata_len umem config option that indicates how many bytes to treat as metadata. Metadata bytes come prior to tx_desc address (same as in RX case). The size of the metadata has mostly the same constraints as XDP: - less than 256 bytes - 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte timestamp in the completion) - non-zero This data is not interpreted in any way right now. Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231127190319.1190813-2-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
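A hypothetical userspace registration combining this option with XDP_UMEM_TX_SW_CSUM from the commit above (umem_area, umem_size and xsk_fd are placeholders):

    struct xdp_umem_reg reg = {
            .addr            = (__u64)(uintptr_t)umem_area,
            .len             = umem_size,
            .chunk_size      = 4096,
            .headroom        = 0,
            .flags           = XDP_UMEM_TX_SW_CSUM, /* copy-mode SW csum */
            .tx_metadata_len = 16,  /* 8-byte aligned, non-zero, < 256 */
    };

    setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg));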
2023-09-15  xsk: add multi-buffer support for sockets sharing umem  (Tirthendu Sarkar)
Userspace applications indicate their multi-buffer capability to xsk using the XSK_USE_SG socket bind flag. For sockets using shared umem the bind flag may contain XSK_USE_SG only for the first socket. For any subsequent socket the only option supported is XDP_SHARED_UMEM. Add option XDP_UMEM_SG_FLAG in umem config flags to store the multi-buffer handling capability when indicated by the XSK_USE_SG option in the bind flag by the first socket. Use this to derive the multi-buffer capability for subsequent sockets in the xsk core. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Fixes: 81470b5c3c66 ("xsk: introduce XSK_USE_SG bind flag for xsk socket") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230907035032.2627879-1-tirthendu.sarkar@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19  xsk: support mbuf on ZC RX  (Maciej Fijalkowski)
Given that skb_shared_info relies on skb_frag_t, in order to support xskb chaining, introduce xdp_buff_xsk::xskb_list_node and xsk_buff_pool::xskb_list. This is needed so ZC drivers can add frags as xskb nodes, which will make it possible to handle it both when producing AF_XDP Rx descriptors as well as when freeing/recycling all the frags that a single frame carries. Speaking of the latter, update xsk_buff_free() to take care of list nodes. For the former (adding as frags), introduce xsk_buff_add_frag() for ZC drivers' usage, which is going to be used to add a frag to the xskb list from the pool. xsk_buff_get_frag() will be utilized by XDP_TX and, in contrast, will return an xdp_buff. One of the previous patches added a wrapper for ZC Rx, so implement the xskb list walk and production of Rx descriptors there. On the bind() path, bail out if the socket wants to use ZC multi-buffer but the underlying netdev does not support it. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-12-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-27  xsk: Use pool->dma_pages to check for DMA  (Kal Conley)
Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an active DMA mapping. pool->dma_pages needs to be read anyway to access the map so this compiles to more efficient code. Signed-off-by: Kal Conley <kal.conley@dectris.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20230423180157.93559-1-kal.conley@dectris.com
2023-02-02  xsk: add usage of XDP features flags  (Marek Majtyka)
Change the necessary condition check for XSK from ndo functions to xdp features flags. Signed-off-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/45a98ec67b4556a6a22dfd85df3eb8276beeeb74.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-22  xsk: Inherit need_wakeup flag for shared sockets  (Jalal Mostafa)
The need_wakeup flag is not set for xsks bound with the `XDP_SHARED_UMEM` flag to different queue ids and/or devices. They should inherit the flag from the first socket's buffer pool, since no flags can be specified once `XDP_SHARED_UMEM` is specified. Fixes: b5aea28dca134 ("xsk: Add shared umem support between queue ids") Signed-off-by: Jalal Mostafa <jalal.a.mostapha@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220921135701.10199-1-jalal.a.mostapha@gmail.com
2022-08-15  xsk: Fix corrupted packets for XDP_SHARED_UMEM  (Magnus Karlsson)
Fix an issue in XDP_SHARED_UMEM mode together with aligned mode where packets are corrupted for the second and any further sockets bound to the same umem. In other words, this does not affect the first socket bound to the umem. The culprit for this bug is that the initialization of the DMA addresses for the pre-populated xsk buffer pool entries was not performed for any socket but the first one bound to the umem. Only the linear array of DMA addresses was populated. Fix this by populating the DMA addresses in the xsk buffer pool for every socket bound to the same umem. Fixes: 94033cd8e73b8 ("xsk: Optimize for aligned case") Reported-by: Alasdair McWilliam <alasdair.mcwilliam@outlook.com> Reported-by: Intrusion Shield Team <dnevil@intrusion.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alasdair McWilliam <alasdair.mcwilliam@outlook.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/xdp-newbies/6205E10C-292E-4995-9D10-409649354226@outlook.com/ Link: https://lore.kernel.org/bpf/20220812113259.531-1-magnus.karlsson@gmail.com
2022-06-28  xsk: Clear page contiguity bit when unmapping pool  (Ivan Malov)
When an XSK pool gets mapped, xp_check_dma_contiguity() adds bit 0x1 to pages' DMA addresses that go in ascending order and at 4K stride. The problem is that the bit does not get cleared before doing the unmap. As a result, a lot of warnings from iommu_dma_unmap_page() are seen in dmesg, which indicates that lookups by iommu_iova_to_phys() fail. Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API") Signed-off-by: Ivan Malov <ivan.malov@oktetlabs.ru> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220628091848.534803-1-ivan.malov@oktetlabs.ru
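Conceptually, the fix masks the marker bit off before the unmap (a sketch of the loop in the unmap path; XSK_NEXT_PG_CONTIG_MASK is the bit set by xp_check_dma_contiguity()):

    for (i = 0; i < dma_map->dma_pages_cnt; i++) {
            dma_addr_t *dma = &dma_map->dma_pages[i];

            if (*dma) {
                    /* Bit 0 only marks "next page is contiguous"; it
                     * must not leak into the address handed back to
                     * the IOMMU. */
                    *dma &= ~XSK_NEXT_PG_CONTIG_MASK;
                    dma_unmap_page_attrs(dma_map->dev, *dma, PAGE_SIZE,
                                         DMA_BIDIRECTIONAL, attrs);
                    *dma = 0;
            }
    }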
2022-04-26  xsk: Fix possible crash when multiple sockets are created  (Maciej Fijalkowski)
Fix a crash that happens if an Rx only socket is created first, then a second socket is created that is Tx only and bound to the same umem as the first socket and also the same netdev and queue_id together with the XDP_SHARED_UMEM flag. In this specific case, the tx_descs array of the shared buffer pool was not created by the first socket as it was an Rx only socket. When the second socket is bound it needs this tx_descs array of this shared pool as it has a Tx component, but unfortunately it was never allocated, leading to a crash. Note that this array is only used for zero-copy drivers using the batched Tx APIs, currently only ice and i40e.

[ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 5511.158419] #PF: supervisor write access in kernel mode
[ 5511.164472] #PF: error_code(0x0002) - not-present page
[ 5511.170416] PGD 0 P4D 0
[ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI
[ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 5.18.0-rc1+ #97
[ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
[ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310
[ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 <48> 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b
[ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246
[ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000
[ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0
[ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48
[ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000
[ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100
[ 5511.274387] FS: 0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000
[ 5511.283746] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0
[ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5511.315142] Call Trace:
[ 5511.317972] <IRQ>
[ 5511.320301] ice_xmit_zc+0x68/0x2f0 [ice]
[ 5511.324977] ? ktime_get+0x38/0xa0
[ 5511.328913] ice_napi_poll+0x7a/0x6a0 [ice]
[ 5511.333784] __napi_poll+0x2c/0x160
[ 5511.337821] net_rx_action+0xdd/0x200
[ 5511.342058] __do_softirq+0xe6/0x2dd
[ 5511.346198] irq_exit_rcu+0xb5/0x100
[ 5511.350339] common_interrupt+0xa4/0xc0
[ 5511.354777] </IRQ>
[ 5511.357201] <TASK>
[ 5511.359625] asm_common_interrupt+0x1e/0x40
[ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360
[ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 <0f> 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
[ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202
[ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f
[ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046
[ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008
[ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700
[ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000
[ 5511.480300] cpuidle_enter+0x29/0x40
[ 5511.494329] do_idle+0x1c7/0x250
[ 5511.507610] cpu_startup_entry+0x19/0x20
[ 5511.521394] start_kernel+0x649/0x66e
[ 5511.534626] secondary_startup_64_no_verify+0xc3/0xcb
[ 5511.549230] </TASK>

Detect such a case during bind() and allocate this memory region via the newly introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as for other buffer pool allocations, so that it matches the kvfree() from xp_destroy(). Fixes: d1bc532e99be ("i40e: xsk: Move tmp desc array from driver to pool") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com
2022-03-28  xsk: Do not write NULL in SW ring at allocation failure  (Magnus Karlsson)
For the case when xp_alloc_batch() is used but the batched allocation cannot be used, there is a slow path that uses the non-batched xp_alloc(). When it fails to allocate an entry, it returns NULL. The current code wrote this NULL into the entry of the provided results array (pointer to the driver SW ring usually) and returned. This might not be what the driver expects, so to make things simpler, just write successfully allocated xdp_buffs into the SW ring. The driver might have information in there that is still important after an allocation failure. Note that at this point in time, there are no drivers using xp_alloc_batch() that could trigger this slow path. But one might get added. Fixes: 47e4075df300 ("xsk: Batched buffer allocation for the pool") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220328142123.170157-2-maciej.fijalkowski@intel.com
2022-01-27  i40e: xsk: Move tmp desc array from driver to pool  (Magnus Karlsson)
Move desc_array from the driver to the pool. The reason behind this is that we can then reuse this array as a temporary storage for descriptors in all zero-copy drivers that use the batched interface. This will make it easier to add batching to more drivers. i40e is the only driver that has a batched Tx zero-copy implementation, so no need to touch any other driver. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com
2021-12-29  xsk: Initialise xskb free_list_node  (Ciara Loftus)
This commit initialises the xskb's free_list_node when the xskb is allocated. This prevents a potential false negative returned from a call to list_empty for that node, such as the one introduced in commit 199d983bc015 ("xsk: Fix crash on double free in buffer pool") In my environment this issue caused packets to not be received by the xdpsock application if the traffic was running prior to application launch. This happened when the first batch of packets failed the xskmap lookup and XDP_PASS was returned from the bpf program. This action is handled in the i40e zc driver (and others) by allocating an skbuff, freeing the xdp_buff and adding the associated xskb to the xsk_buff_pool's free_list if it hadn't been added already. Without this fix, the xskb is not added to the free_list because the check to determine if it was added already returns an invalid positive result. Later, this caused allocation errors in the driver and the failure to receive packets. Fixes: 199d983bc015 ("xsk: Fix crash on double free in buffer pool") Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API") Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20211220155250.2746-1-ciara.loftus@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
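The fix is a one-liner at xskb allocation time (sketch):

    /* Give the node a valid "not on any list" state up front, so a
     * later list_empty() check cannot return a false result. */
    INIT_LIST_HEAD(&xskb->free_list_node);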
2021-11-12  xsk: Fix crash on double free in buffer pool  (Magnus Karlsson)
Fix a crash in the buffer pool allocator when a buffer is double freed. It is possible to trigger this behavior not only from a faulty driver, but also from user space like this: Create a zero-copy AF_XDP socket. Load an XDP program that will issue XDP_DROP for all packets. Put the same umem buffer into the fill ring multiple times, then bind the socket and send some traffic. This will crash the kernel as the XDP_DROP action triggers one call to xsk_buff_free()/xp_free() for every packet dropped. Each call will add the corresponding buffer entry to the free_list and increase the free_list_cnt. Some entries will have been added multiple times due to the same buffer being freed. The buffer allocation code will then traverse this broken list and since the same buffer is in the list multiple times, it will try to delete the same buffer twice from the list leading to a crash. The fix for this is just to test that the buffer has not been added before in xp_free(). If it has been, just return from the function and do not put it in the free_list a second time. Note that this bug was not present in the code before the commit referenced in the Fixes tag. That code used one list entry per allocated buffer, so multiple frees did not have any side effects. But the commit below optimized the usage of the pool and only uses a single entry per buffer in the umem, meaning that multiple allocations/frees of the same buffer will also only use one entry, thus leading to the problem. Fixes: 47e4075df300 ("xsk: Batched buffer allocation for the pool") Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/bpf/20211111075707.21922-1-magnus.karlsson@gmail.com
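With the fix, xp_free() roughly becomes:

    void xp_free(struct xdp_buff_xsk *xskb)
    {
            /* Already on the free_list: a double free; bail out
             * instead of corrupting the list. */
            if (!list_empty(&xskb->free_list_node))
                    return;

            xskb->pool->free_list_cnt++;
            list_add(&xskb->free_list_node, &xskb->pool->free_list);
    }

Note that this guard is also what made the free_list_node initialisation fix above (2021-12-29) necessary.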
2021-09-29  xsk: Fix clang build error in __xp_alloc  (Magnus Karlsson)
Fix a build error with clang in __xp_alloc():

[...]
net/xdp/xsk_buff_pool.c:465:15: error: variable 'xskb' is uninitialized when used here [-Werror,-Wuninitialized]
                xp_release(xskb);
                           ^~~~

This is correctly detected by clang, but not gcc. In fact, the xp_release() statement should not be there at all in the refactored code; just remove it. Fixes: 94033cd8e73b ("xsk: Optimize for aligned case") Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210929061403.8587-1-magnus.karlsson@gmail.com
2021-09-28  xsk: Optimize for aligned case  (Magnus Karlsson)
Optimize for the aligned case by precomputing the parameter values of the xdp_buff_xsk and xdp_buff structures in the heads array. We can do this as the heads array size is equal to the number of chunks in the umem for the aligned case. Then every entry in this array will reflect a certain chunk/frame and can therefore be prepopulated with the correct values and we can drop the use of the free_heads stack. Note that it is not possible to allocate more buffers than what has been allocated in the aligned case since each chunk can only contain a single buffer. We can unfortunately not do this in the unaligned case as one chunk might contain multiple buffers. In this case, we keep the old scheme of populating a heads entry every time it is used and using the free_heads stack. Also move xp_release() and xp_get_handle() to xsk_buff_pool.h. They were for some reason in xsk.c even though they are buffer pool operations. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210922075613.12186-7-magnus.karlsson@gmail.com
2021-09-28  xsk: Batched buffer allocation for the pool  (Magnus Karlsson)
Add a new driver interface xsk_buff_alloc_batch() offering batched buffer allocations to improve performance. The new interface takes three arguments: the buffer pool to allocate from, a pointer to an array of struct xdp_buff pointers which will contain pointers to the allocated xdp_buffs, and an unsigned integer specifying the max number of buffers to allocate. The return value is the actual number of buffers that the allocator managed to allocate and it will be in the range 0 <= N <= max, where max is the third parameter to the function.

    u32 xsk_buff_alloc_batch(struct xsk_buff_pool *pool, struct xdp_buff **xdp, u32 max);

A second driver interface is also introduced that needs to be used in conjunction with xsk_buff_alloc_batch(). It is a helper that sets the size of struct xdp_buff and is used by the NIC Rx irq routine when receiving a packet. This helper sets the three struct members data, data_meta, and data_end. The first two are, in the xsk_buff_alloc() case, set in the allocation routine, and data_end is set when a packet is received in the receive irq function. This unfortunately leads to worse performance since the xdp_buff is touched twice with a long time period in between, leading to an extra cache miss. Instead, we fill out the xdp_buff with all 3 fields at one single point in time in the driver, when the size of the packet is known. Hence this helper. Note that the driver has to use this helper (or set all three fields itself) when using xsk_buff_alloc_batch(). xsk_buff_alloc() works as before and does not require this.

    void xsk_buff_set_size(struct xdp_buff *xdp, u32 size);

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210922075613.12186-3-magnus.karlsson@gmail.com
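A hypothetical NIC driver flow using the two interfaces together (ring bookkeeping elided):

    /* Refill: grab up to 64 buffers in one call; n may be less. */
    struct xdp_buff *bufs[64];
    u32 i, n;

    n = xsk_buff_alloc_batch(pool, bufs, 64);       /* 0 <= n <= 64 */
    for (i = 0; i < n; i++) {
            dma_addr_t dma = xsk_buff_xdp_get_dma(bufs[i]);
            /* post dma to the HW Rx descriptor ring here */
    }

    /* Later, in the Rx irq routine, once the frame length is known: */
    xsk_buff_set_size(xdp, size);   /* sets data, data_meta, data_end */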
2021-01-25  xsk: Fold xp_assign_dev and __xp_assign_dev  (Björn Töpel)
Fold xp_assign_dev and __xp_assign_dev. The former directly calls the latter. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210122105351.11751-3-bjorn.topel@gmail.com
2020-12-18  xsk: Fix race in SKB mode transmit with shared cq  (Magnus Karlsson)
Fix a race when multiple sockets are simultaneously calling sendto() when the completion ring is shared in the SKB case. This is the case when you share the same netdev and queue id through the XDP_SHARED_UMEM bind flag. The problem is that multiple processes can be in xsk_generic_xmit() and call the backpressure mechanism in xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this specific scenario, a race might occur since the rings are single-producer single-consumer. Fix this by moving the tx_completion_lock from the socket to the pool as the pool is shared between the sockets that share the completion ring. (The pool is not shared when this is not the case.) And then protect the accesses to xskq_prod_reserve() with this lock. The tx_completion_lock is renamed cq_lock to better reflect that it protects accesses to the potentially shared completion ring. Fixes: 35fcde7f8deb ("xsk: support for Tx") Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
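A sketch of the protected reservation in the transmit path after the change (error handling elided):

    /* cq_lock lives in the pool shared by all XDP_SHARED_UMEM sockets
     * on the same netdev/queue, so concurrent sendto() callers
     * serialize access to the single-producer completion ring. */
    unsigned long flags;
    int err;

    spin_lock_irqsave(&xs->pool->cq_lock, flags);
    err = xskq_prod_reserve(xs->pool->cq);
    spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
    if (err) {
            /* completion ring full: apply back-pressure and stop */
    }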
2020-12-17  xsk: Fix memory leak for failed bind  (Magnus Karlsson)
Fix a possible memory leak when a bind of an AF_XDP socket fails. When the fill and completion rings are created, they are tied to the socket. But when the buffer pool is later created at bind time, the ownership of these two rings is transferred to the buffer pool as they might be shared between sockets (and the buffer pool cannot be created until we know what we are binding to). So, before the buffer pool is created, these two rings are cleaned up with the socket, and after they have been transferred they are cleaned up together with the buffer pool. The problem is that ownership was transferred before it was absolutely certain that the buffer pool could be created and initialized correctly, and when one of these errors occurred, the fill and completion rings belonged neither to the socket nor the pool and were therefore leaked. Solve this by moving the ownership transfer to the point where the buffer pool has been completely set up and there is no way it can fail. Fixes: 7361f9c3d719 ("xsk: Move fill and completion rings to buffer pool") Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com
2020-12-11  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
xdp_return_frame_bulk() needs to pass an xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04  xsk: Return error code if force_zc is set  (Zhang Changzhong)
If force_zc is set, we should exit out with an error, not fall back to copy mode. Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/1607077277-41995-1-git-send-email-zhangchangzhong@huawei.com
2020-12-04  Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-12-03

The main changes are:

1) Support BTF in kernel modules, from Andrii.

2) Introduce preferred busy-polling, from Björn.

3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

4) Memcg-based memory accounting for bpf objects, from Roman.

5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
  selftests/bpf: Fix invalid use of strncat in test_sockmap
  libbpf: Use memcpy instead of strncpy to please GCC
  selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
  selftests/bpf: Add tp_btf CO-RE reloc test for modules
  libbpf: Support attachment of BPF tracing programs to kernel modules
  libbpf: Factor out low-level BPF program loading helper
  bpf: Allow to specify kernel module BTFs when attaching BPF programs
  bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
  selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
  selftests/bpf: Add support for marking sub-tests as skipped
  selftests/bpf: Add bpf_testmod kernel module for testing
  libbpf: Add kernel module BTF support for CO-RE relocations
  libbpf: Refactor CO-RE relocs to not assume a single BTF object
  libbpf: Add internal helper to load BTF data by FD
  bpf: Keep module's btf_data_size intact after load
  bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
  selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
  bpf: Adds support for setting window clamp
  samples/bpf: Fix spelling mistake "recieving" -> "receiving"
  bpf: Fix cold build of test_progs-no_alu32
  ...
====================

Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01  xsk: Check need wakeup flag in sendmsg()  (Björn Töpel)
Add a check for need wake up in sendmsg(), so that if a user calls sendmsg() when no wakeup is needed, do not trigger a wakeup. To simplify the need wakeup check in the syscall, unconditionally enable the need wakeup flag for Tx. This has a side-effect for poll(); If poll() is called for a socket without enabled need wakeup, a Tx wakeup is unconditionally performed. The wakeup matrix for AF_XDP now looks like:

need wakeup | poll()       | sendmsg()   | recvmsg()
------------+--------------+-------------+------------
disabled    | wake Tx      | wake Tx     | nop
enabled     | check flag;  | check flag; | check flag;
            | wake Tx/Rx   | wake Tx     | wake Rx

Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-5-bjorn.topel@gmail.com
2020-11-23  xsk: Fix incorrect netdev reference count  (Marek Majtyka)
Fix an incorrect netdev reference count in the xsk_bind operation. The incorrect reference count of the device appears when a user calls bind with the XDP_ZEROCOPY flag on an interface which does not support zero-copy. In such a case, an error is returned but the reference count is not decreased. This change fixes the fault by decreasing the reference count in case of such an error. The problem being corrected appeared in '162c820ed896' for the first time, and the code was moved to a new file location over time with commit 'c2d3d6a47462'. This specific patch applies to all versions starting from 'c2d3d6a47462'. The same solution should be applied, but on a different file (net/xdp/xdp_umem.c) and function (xdp_umem_assign_dev), for versions from '162c820ed896' up to, but excluding, 'c2d3d6a47462'. Fixes: 162c820ed896 ("xdp: hold device for umem regardless of zero-copy mode") Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201120151443.105903-1-marekx.majtyka@intel.com
2020-11-20  xsk: Fix umem cleanup bug at socket destruct  (Magnus Karlsson)
Fix a bug that is triggered when a partially setup socket is destroyed. For a fully setup socket, a socket that has been bound to a device, the cleanup of the umem is performed at the end of the buffer pool's cleanup work queue item. This has to be performed in a work queue, and not in RCU cleanup, as it is doing a vunmap that cannot execute in interrupt context. However, when a socket has only been partially set up so that a umem has been created but the buffer pool has not, the code erroneously directly calls the umem cleanup function instead of using a work queue, and this leads to a BUG_ON() in vunmap(). As there in this case is no buffer pool, we cannot use its work queue, so we need to introduce a work queue for the umem and schedule this for the cleanup. So in the case there is no pool, we are going to use the umem's own work queue to schedule the cleanup. But if there is a pool, the cleanup of the umem is still being performed by the pool's work queue, as it is important that the umem is cleaned up after the pool. Fixes: e5e1a4bc916d ("xsk: Fix possible memory leak at socket close") Reported-by: Marek Majtyka <marekx.majtyka@intel.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Marek Majtyka <marekx.majtyka@intel.com> Link: https://lore.kernel.org/bpf/1605873219-21629-1-git-send-email-magnus.karlsson@gmail.com
2020-10-29  xsk: Fix possible memory leak at socket close  (Magnus Karlsson)
Fix a possible memory leak at xsk socket close that is caused by the refcounting of the umem object being wrong. The reference count of the umem was decremented only after the pool had been freed. Note that if the buffer pool is destroyed, it is important that the umem is destroyed after the pool, otherwise the umem would disappear while the driver is still running. And as the buffer pool needs to be destroyed in a work queue, the umem is also (if its refcount reaches zero) destroyed after the buffer pool in that same work queue. What was missing is that the refcount also needs to be decremented when the pool is not freed and when the pool has not even been created. The first case happens when the refcount of the pool is higher than 1, i.e. it is still being used by some other socket using the same device and queue id. In this case, it is safe to decrement the refcount of the umem outside of the work queue as the umem will never be freed because the refcount of the umem is always greater than or equal to the refcount of the buffer pool. The second case is if the buffer pool has not been created yet, i.e. the socket was closed before it was bound but after the umem was created. In this case, it is safe to destroy the umem outside of the work queue, since there is no pool that can use it by definition. Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem") Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com
2020-10-05  xsk: Remove internal DMA headers  (Björn Töpel)
Christoph Hellwig correctly pointed out [1] that the AF_XDP core was pointlessly including internal headers. Let us remove those includes. [1] https://lore.kernel.org/bpf/20201005084341.GA3224@infradead.org/ Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem") Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20201005090525.116689-1-bjorn.topel@gmail.com