summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-01-25selftests: tc-testing: adjust fq test to latest iproute2Pedro Tammela
Adjust the fq verify regex to the latest iproute2 Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/20240124181933.75724-4-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25selftests: tc-testing: check if 'jq' is available in taprio testsPedro Tammela
If 'jq' is not available the taprio tests might enter an infinite loop, use the "dependsOn" feature from tdc to check if jq is present. If it's not the test is skipped. Suggested-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/20240124181933.75724-3-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25selftests: tc-testing: add missing netfilter configPedro Tammela
On a default config + tc-testing config build, tdc will miss all the netfilter related tests because it's missing: CONFIG_NETFILTER=y Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/20240124181933.75724-2-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25Merge tag 'net-6.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf, netfilter and WiFi. Jakub is doing a lot of work to include the self-tests in our CI, as a result a significant amount of self-tests related fixes is flowing in (and will likely continue in the next few weeks). Current release - regressions: - bpf: fix a kernel crash for the riscv 64 JIT - bnxt_en: fix memory leak in bnxt_hwrm_get_rings() - revert "net: macsec: use skb_ensure_writable_head_tail to expand the skb" Previous releases - regressions: - core: fix removing a namespace with conflicting altnames - tc/flower: fix chain template offload memory leak - tcp: - make sure init the accept_queue's spinlocks once - fix autocork on CPUs with weak memory model - udp: fix busy polling - mlx5e: - fix out-of-bound read in port timestamping - fix peer flow lists corruption - iwlwifi: fix a memory corruption Previous releases - always broken: - netfilter: - nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain - nft_limit: reject configurations that cause integer overflow - bpf: fix bpf_xdp_adjust_tail() with XSK zero-copy mbuf, avoiding a NULL pointer dereference upon shrinking - llc: make llc_ui_sendmsg() more robust against bonding changes - smc: fix illegal rmb_desc access in SMC-D connection dump - dpll: fix pin dump crash for rebound module - bnxt_en: fix possible crash after creating sw mqprio TCs - hv_netvsc: calculate correct ring size when PAGE_SIZE is not 4kB Misc: - several self-tests fixes for better integration with the netdev CI - added several missing modules descriptions" * tag 'net-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring tsnep: Remove FCS for XDP data path net: fec: fix the unhandled context fault from smmu selftests: bonding: do not test arp/ns target with mode balance-alb/tlb fjes: fix memleaks in fjes_hw_setup i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue i40e: set xdp_rxq_info::frag_size xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers ice: remove redundant xdp_rxq_info registration i40e: handle multi-buffer packets that are shrunk by xdp prog ice: work on pre-XDP prog frag count xsk: fix usage of multi-buffer BPF helpers for ZC XDP xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags xsk: recycle buffer in case Rx queue was full net: fill in MODULE_DESCRIPTION()s for rvu_mbox net: fill in MODULE_DESCRIPTION()s for litex net: fill in MODULE_DESCRIPTION()s for fsl_pq_mdio net: fill in MODULE_DESCRIPTION()s for fec ...
2024-01-25Merge tag 'ovl-fixes-6.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs fix from Amir Goldstein: "Change the on-disk format for the new "xwhiteouts" feature introduced in v6.7 The change reduces unneeded overhead of an extra getxattr per readdir. The only user of the "xwhiteout" feature is the external composefs tool, which has been updated to support the new on-disk format. This change is also designated for 6.7.y" * tag 'ovl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: mark xwhiteouts directory with overlay.opaque='x'
2024-01-25Merge tag 'vfs-6.8-rc2.netfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull netfs fixes from Christian Brauner: "This contains various fixes for the netfs work merged earlier this cycle: afs: - Fix locking imbalance in afs_proc_addr_prefs_show() - Remove afs_dynroot_d_revalidate() which is redundant - Fix error handling during lookup - Hide sillyrenames from userspace. This fixes a race between silly-rename files being created/removed and userspace iterating over directory entries - Don't use unnecessary folio_*() functions cifs: - Don't use unnecessary folio_*() functions cachefiles: - erofs: Fix Null dereference when cachefiles are not doing ondemand-mode - Update mailing list netfs library: - Add Jeff Layton as reviewer - Update mailing list - Fix a error checking in netfs_perform_write() - fscache: Check error before dereferencing - Don't use unnecessary folio_*() functions" * tag 'vfs-6.8-rc2.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: afs: Fix missing/incorrect unlocking of RCU read lock afs: Remove afs_dynroot_d_revalidate() as it is redundant afs: Fix error handling with lookup via FS.InlineBulkStatus afs: Hide silly-rename files from userspace cachefiles, erofs: Fix NULL deref in when cachefiles is not doing ondemand-mode netfs: Fix a NULL vs IS_ERR() check in netfs_perform_write() netfs, fscache: Prevent Oops in fscache_put_cache() cifs: Don't use certain unnecessary folio_*() functions afs: Don't use certain unnecessary folio_*() functions netfs: Don't use certain unnecessary folio_*() functions netfs: Add Jeff Layton as reviewer netfs, cachefiles: Change mailing list
2024-01-25Merge tag 'nfsd-6.8-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix in-kernel RPC UDP transport - Fix NFSv4.0 RELEASE_LOCKOWNER * tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix RELEASE_LOCKOWNER SUNRPC: use request size to initialize bio_vec in svc_udp_sendto()
2024-01-25Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linuxLinus Torvalds
Pull RCU fix from Neeraj Upadhyay: "This fixes RCU grace period stalls, which are observed when an outgoing CPU's quiescent state reporting results in wakeup of one of the grace period kthreads, to complete the grace period. If those kthreads have SCHED_FIFO policy, the wake up can indirectly arm the RT bandwith timer to the local offline CPU. Earlier migration of the hrtimers from the CPU introduced in commit 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") results in this timer getting ignored. If the RCU grace period kthreads are waiting for RT bandwidth to be available, they may never be actually scheduled, resulting in RCU stall warnings" * tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux: rcu: Defer RCU kthreads wakeup when CPU is dying
2024-01-25net: dsa: mt7530: support OF-based registration of switch MDIO busArınç ÜNAL
Currently the MDIO bus of the switches the MT7530 DSA subdriver controls can only be registered as non-OF-based. Bring support for registering the bus OF-based. The subdrivers that control switches [with MDIO bus] probed on OF must follow this logic to support all cases properly: No switch MDIO bus defined: Populate ds->user_mii_bus, register the MDIO bus, set the interrupts for PHYs if "interrupt-controller" is defined at the switch node. This case should only be covered for the switches which their dt-bindings documentation didn't document the MDIO bus from the start. This is to keep supporting the device trees that do not describe the MDIO bus on the device tree but the MDIO bus is being used nonetheless. Switch MDIO bus defined: Don't populate ds->user_mii_bus, register the MDIO bus, set the interrupts for PHYs if ["interrupt-controller" is defined at the switch node and "interrupts" is defined at the PHY nodes under the switch MDIO bus node]. Switch MDIO bus defined but explicitly disabled: If the device tree says status = "disabled" for the MDIO bus, we shouldn't need an MDIO bus at all. Instead, just exit as early as possible and do not call any MDIO API. The use of ds->user_mii_bus is inappropriate when the MDIO bus of the switch is described on the device tree [1], which is why we don't populate ds->user_mii_bus in that case. Link: https://lore.kernel.org/netdev/20231213120656.x46fyad6ls7sqyzv@skbuf/ [1] Suggested-by: David Bauer <mail@david-bauer.net> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20240122053431.7751-1-arinc.unal@arinc9.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-25Merge branch 'tsnep-xdp-fixes'Paolo Abeni
Gerhard Engleder says: ==================== tsnep: XDP fixes Found two driver specific problems during XDP and XSK testing. ==================== Link: https://lore.kernel.org/r/20240123200918.61219-1-gerhard@engleder-embedded.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-25tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ringGerhard Engleder
The fill ring of the XDP socket may contain not enough buffers to completey fill the RX queue during socket creation. In this case the flag XDP_RING_NEED_WAKEUP is not set as this flag is only set if the RX queue is not completely filled during polling. Set XDP_RING_NEED_WAKEUP flag also if RX queue is not completely filled during XDP socket creation. Fixes: 3fc2333933fd ("tsnep: Add XDP socket zero-copy RX support") Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-25tsnep: Remove FCS for XDP data pathGerhard Engleder
The RX data buffer includes the FCS. The FCS is already stripped for the normal data path. But for the XDP data path the FCS is included and acts like additional/useless data. Remove the FCS from the RX data buffer also for XDP. Fixes: 65b28c810035 ("tsnep: Add XDP RX support") Fixes: 3fc2333933fd ("tsnep: Add XDP socket zero-copy RX support") Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-25Merge tag 'mlx5-fixes-2024-01-24' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2024-01-24 This series provides bug fixes to mlx5 driver. Please pull and let me know if there is any problem. * tag 'mlx5-fixes-2024-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: fix a potential double-free in fs_any_create_groups net/mlx5e: fix a double-free in arfs_create_groups net/mlx5e: Ignore IPsec replay window values on sender side net/mlx5e: Allow software parsing when IPsec crypto is enabled net/mlx5: Use mlx5 device constant for selecting CQ period mode for ASO net/mlx5: DR, Can't go to uplink vport on RX rule net/mlx5: DR, Use the right GVMI number for drop action net/mlx5: Bridge, fix multicast packets sent to uplink net/mlx5: Fix a WARN upon a callback command failure net/mlx5e: Fix peer flow lists handling net/mlx5e: Fix inconsistent hairpin RQT sizes net/mlx5e: Fix operation precedence bug in port timestamping napi_poll context net/mlx5: Fix query of sd_group field net/mlx5e: Use the correct lag ports number when creating TISes ==================== Link: https://lore.kernel.org/r/20240124081855.115410-1-saeed@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-25Merge tag 'for-netdev' of ↵Paolo Abeni
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-01-25 The following pull-request contains BPF updates for your *net* tree. We've added 12 non-merge commits during the last 2 day(s) which contain a total of 13 files changed, 190 insertions(+), 91 deletions(-). The main changes are: 1) Fix bpf_xdp_adjust_tail() in context of XSK zero-copy drivers which support XDP multi-buffer. The former triggered a NULL pointer dereference upon shrinking, from Maciej Fijalkowski & Tirthendu Sarkar. 2) Fix a bug in riscv64 BPF JIT which emitted a wrong prologue and epilogue for struct_ops programs, from Pu Lehui. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue i40e: set xdp_rxq_info::frag_size xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers ice: remove redundant xdp_rxq_info registration i40e: handle multi-buffer packets that are shrunk by xdp prog ice: work on pre-XDP prog frag count xsk: fix usage of multi-buffer BPF helpers for ZC XDP xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags xsk: recycle buffer in case Rx queue was full riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops ==================== Link: https://lore.kernel.org/r/20240125084416.10876-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-25net: fec: fix the unhandled context fault from smmuShenwei Wang
When repeatedly changing the interface link speed using the command below: ethtool -s eth0 speed 100 duplex full ethtool -s eth0 speed 1000 duplex full The following errors may sometimes be reported by the ARM SMMU driver: [ 5395.035364] fec 5b040000.ethernet eth0: Link is Down [ 5395.039255] arm-smmu 51400000.iommu: Unhandled context fault: fsr=0x402, iova=0x00000000, fsynr=0x100001, cbfrsynra=0x852, cb=2 [ 5398.108460] fec 5b040000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off It is identified that the FEC driver does not properly stop the TX queue during the link speed transitions, and this results in the invalid virtual I/O address translations from the SMMU and causes the context faults. Fixes: dbc64a8ea231 ("net: fec: move calls to quiesce/resume packet processing out of fec_restart()") Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com> Link: https://lore.kernel.org/r/20240123165141.2008104-1-shenwei.wang@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-25selftests: bonding: do not test arp/ns target with mode balance-alb/tlbHangbin Liu
The prio_arp/ns tests hard code the mode to active-backup. At the same time, The balance-alb/tlb modes do not support arp/ns target. So remove the prio_arp/ns tests from the loop and only test active-backup mode. Fixes: 481b56e0391e ("selftests: bonding: re-format bond option tests") Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com> Closes: https://lore.kernel.org/netdev/17415.1705965957@famine/ Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://lore.kernel.org/r/20240123075917.1576360-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-24Merge tag 'nf-24-01-24' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Update nf_tables kdoc to keep it in sync with the code, from George Guo. 2) Handle NETDEV_UNREGISTER event for inet/ingress basechain. 3) Reject configuration that cause nft_limit to overflow, from Florian Westphal. 4) Restrict anonymous set/map names to 16 bytes, from Florian Westphal. 5) Disallow to encode queue number and error in verdicts. This reverts a patch which seems to have introduced an early attempt to support for nfqueue maps, which is these days supported via nft_queue expression. 6) Sanitize family via .validate for expressions that explicitly refer to NF_INET_* hooks. * tag 'nf-24-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: validate NFPROTO_* family netfilter: nf_tables: reject QUEUE/DROP verdict parameters netfilter: nf_tables: restrict anonymous set and map names to 16 bytes netfilter: nft_limit: reject configurations that cause integer overflow netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain netfilter: nf_tables: cleanup documentation ==================== Link: https://lore.kernel.org/r/20240124191248.75463-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24fjes: fix memleaks in fjes_hw_setupZhipeng Lu
In fjes_hw_setup, it allocates several memory and delay the deallocation to the fjes_hw_exit in fjes_probe through the following call chain: fjes_probe |-> fjes_hw_init |-> fjes_hw_setup |-> fjes_hw_exit However, when fjes_hw_setup fails, fjes_hw_exit won't be called and thus all the resources allocated in fjes_hw_setup will be leaked. In this patch, we free those resources in fjes_hw_setup and prevents such leaks. Fixes: 2fcbca687702 ("fjes: platform_driver's .probe and .remove routine") Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240122172445.3841883-1-alexious@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24tipc: node: remove Excess struct member kernel-doc warningsRandy Dunlap
Remove 2 kernel-doc descriptions to squelch warnings: node.c:150: warning: Excess struct member 'inputq' description in 'tipc_node' node.c:150: warning: Excess struct member 'namedq' description in 'tipc_node' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jon Maloy <jmaloy@redhat.com> Cc: Ying Xue <ying.xue@windriver.com> Cc: Jonathan Corbet <corbet@lwn.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240123051152.23684-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24tipc: socket: remove Excess struct member kernel-doc warningRandy Dunlap
Remove a kernel-doc description to squelch a warning: socket.c:143: warning: Excess struct member 'blocking_link' description in 'tipc_sock' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jon Maloy <jmaloy@redhat.com> Cc: Ying Xue <ying.xue@windriver.com> Cc: Jonathan Corbet <corbet@lwn.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240123051201.24701-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24vsock/test: add '--peer-port' input argumentArseniy Krasnov
Implement port for given CID as input argument instead of using hardcoded value '1234'. This allows to run different test instances on a single CID. Port argument is not required parameter and if it is not set, then default value will be '1234' - thus we preserve previous behaviour. Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20240123072750.4084181-1-avkrasnov@salutedevices.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-24Merge tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A fix to avoid triggering an assert in some cases where RBD exclusive mappings are involved and a deprecated API cleanup" * tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-client: rbd: don't move requests to the running list on errors rbd: remove usage of the deprecated ida_simple_*() API
2024-01-24Merge tag 'integrity-v6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity fix from Mimi Zohar: "Revert patch that required user-provided key data, since keys can be created from kernel-generated random numbers" * tag 'integrity-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: Revert "KEYS: encrypted: Add check for strsep"
2024-01-24Merge branch 'net-bpf_xdp_adjust_tail-and-intel-mbuf-fixes'Alexei Starovoitov
Maciej Fijalkowski says: ==================== net: bpf_xdp_adjust_tail() and Intel mbuf fixes Hey, after a break followed by dealing with sickness, here is a v6 that makes bpf_xdp_adjust_tail() actually usable for ZC drivers that support XDP multi-buffer. Since v4 I tried also using bpf_xdp_adjust_tail() with positive offset which exposed yet another issues, which can be observed by increased commit count when compared to v3. John, in the end I think we should remove handling MEM_TYPE_XSK_BUFF_POOL from __xdp_return(), but it is out of the scope for fixes set, IMHO. Thanks, Maciej v6: - add acks [Magnus] - fix spelling mistakes [Magnus] - avoid touching xdp_buff in xp_alloc_{reused,new_from_fq}() [Magnus] - s/shrink_data/bpf_xdp_shrink_data [Jakub] - remove __shrink_data() [Jakub] - check retvals from __xdp_rxq_info_reg() [Magnus] v5: - pick correct version of patch 5 [Simon] - elaborate a bit more on what patch 2 fixes v4: - do not clear frags flag when deleting tail; xsk_buff_pool now does that - skip some NULL tests for xsk_buff_get_tail [Martin, John] - address problems around registering xdp_rxq_info - fix bpf_xdp_frags_increase_tail() for ZC mbuf v3: - add acks - s/xsk_buff_tail_del/xsk_buff_del_tail - address i40e as well (thanks Tirthendu) v2: - fix !CONFIG_XDP_SOCKETS builds - add reviewed-by tag to patch 3 ==================== Link: https://lore.kernel.org/r/20240124191602.566724-1-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queueMaciej Fijalkowski
Now that i40e driver correctly sets up frag_size in xdp_rxq_info, let us make it work for ZC multi-buffer as well. i40e_ring::rx_buf_len for ZC is being set via xsk_pool_get_rx_frame_size() and this needs to be propagated up to xdp_rxq_info. Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-12-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24i40e: set xdp_rxq_info::frag_sizeMaciej Fijalkowski
i40e support XDP multi-buffer so it is supposed to use __xdp_rxq_info_reg() instead of xdp_rxq_info_reg() and set the frag_size. It can not be simply converted at existing callsite because rx_buf_len could be un-initialized, so let us register xdp_rxq_info within i40e_configure_rx_ring(), which happen to be called with already initialized rx_buf_len value. Commit 5180ff1364bc ("i40e: use int for i40e_status") converted 'err' to int, so two variables to deal with return codes are not needed within i40e_configure_rx_ring(). Remove 'ret' and use 'err' to handle status from xdp_rxq_info registration. Fixes: e213ced19bef ("i40e: add support for XDP multi-buffer Rx") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-11-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOLMaciej Fijalkowski
XSK ZC Rx path calculates the size of data that will be posted to XSK Rx queue via subtracting xdp_buff::data_end from xdp_buff::data. In bpf_xdp_frags_increase_tail(), when underlying memory type of xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add offset to data_end in tail fragment, so that later on user space will be able to take into account the amount of bytes added by XDP program. Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-10-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24ice: update xdp_rxq_info::frag_size for ZC enabled Rx queueMaciej Fijalkowski
Now that ice driver correctly sets up frag_size in xdp_rxq_info, let us make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for ZC is being set via xsk_pool_get_rx_frame_size() and this needs to be propagated up to xdp_rxq_info. Use a bigger hammer and instead of unregistering only xdp_rxq_info's memory model, unregister it altogether and register it again and have xdp_rxq_info with correct frag_size value. Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-9-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24intel: xsk: initialize skb_frag_t::bv_offset in ZC driversMaciej Fijalkowski
Ice and i40e ZC drivers currently set offset of a frag within skb_shared_info to 0, which is incorrect. xdp_buffs that come from xsk_buff_pool always have 256 bytes of a headroom, so they need to be taken into account to retrieve xdp_buff::data via skb_frag_address(). Otherwise, bpf_xdp_frags_increase_tail() would be starting its job from xdp_buff::data_hard_start which would result in overwriting existing payload. Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support") Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-8-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24ice: remove redundant xdp_rxq_info registrationMaciej Fijalkowski
xdp_rxq_info struct can be registered by drivers via two functions - xdp_rxq_info_reg() and __xdp_rxq_info_reg(). The latter one allows drivers that support XDP multi-buffer to set up xdp_rxq_info::frag_size which in turn will make it possible to grow the packet via bpf_xdp_adjust_tail() BPF helper. Currently, ice registers xdp_rxq_info in two spots: 1) ice_setup_rx_ring() // via xdp_rxq_info_reg(), BUG 2) ice_vsi_cfg_rxq() // via __xdp_rxq_info_reg(), OK Cited commit under fixes tag took care of setting up frag_size and updated registration scheme in 2) but it did not help as 1) is called before 2) and as shown above it uses old registration function. This means that 2) sees that xdp_rxq_info is already registered and never calls __xdp_rxq_info_reg() which leaves us with xdp_rxq_info::frag_size being set to 0. To fix this misbehavior, simply remove xdp_rxq_info_reg() call from ice_setup_rx_ring(). Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-7-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24i40e: handle multi-buffer packets that are shrunk by xdp progTirthendu Sarkar
XDP programs can shrink packets by calling the bpf_xdp_adjust_tail() helper function. For multi-buffer packets this may lead to reduction of frag count stored in skb_shared_info area of the xdp_buff struct. This results in issues with the current handling of XDP_PASS and XDP_DROP cases. For XDP_PASS, currently skb is being built using frag count of xdp_buffer before it was processed by XDP prog and thus will result in an inconsistent skb when frag count gets reduced by XDP prog. To fix this, get correct frag count while building the skb instead of using pre-obtained frag count. For XDP_DROP, current page recycling logic will not reuse the page but instead will adjust the pagecnt_bias so that the page can be freed. This again results in inconsistent behavior as the page refcnt has already been changed by the helper while freeing the frag(s) as part of shrinking the packet. To fix this, only adjust pagecnt_bias for buffers that are stillpart of the packet post-xdp prog run. Fixes: e213ced19bef ("i40e: add support for XDP multi-buffer Rx") Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-6-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24ice: work on pre-XDP prog frag countMaciej Fijalkowski
Fix an OOM panic in XDP_DRV mode when a XDP program shrinks a multi-buffer packet by 4k bytes and then redirects it to an AF_XDP socket. Since support for handling multi-buffer frames was added to XDP, usage of bpf_xdp_adjust_tail() helper within XDP program can free the page that given fragment occupies and in turn decrease the fragment count within skb_shared_info that is embedded in xdp_buff struct. In current ice driver codebase, it can become problematic when page recycling logic decides not to reuse the page. In such case, __page_frag_cache_drain() is used with ice_rx_buf::pagecnt_bias that was not adjusted after refcount of page was changed by XDP prog which in turn does not drain the refcount to 0 and page is never freed. To address this, let us store the count of frags before the XDP program was executed on Rx ring struct. This will be used to compare with current frag count from skb_shared_info embedded in xdp_buff. A smaller value in the latter indicates that XDP prog freed frag(s). Then, for given delta decrement pagecnt_bias for XDP_DROP verdict. While at it, let us also handle the EOP frag within ice_set_rx_bufs_act() to make our life easier, so all of the adjustments needed to be applied against freed frags are performed in the single place. Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-5-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24xsk: fix usage of multi-buffer BPF helpers for ZC XDPMaciej Fijalkowski
Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens: [1136314.192256] BUG: kernel NULL pointer dereference, address: 0000000000000034 [1136314.203943] #PF: supervisor read access in kernel mode [1136314.213768] #PF: error_code(0x0000) - not-present page [1136314.223550] PGD 0 P4D 0 [1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI [1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257 [1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210 [1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86 [1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246 [1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX: 0000000000000000 [1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9003168c000 [1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09: 0000000000010000 [1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12: 0000000000000001 [1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15: 0000000000000001 [1136314.373298] FS: 00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000) knlGS:0000000000000000 [1136314.386105] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4: 00000000007706f0 [1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1136314.431890] PKRU: 55555554 [1136314.439143] Call Trace: [1136314.446058] <IRQ> [1136314.452465] ? __die+0x20/0x70 [1136314.459881] ? page_fault_oops+0x15b/0x440 [1136314.468305] ? exc_page_fault+0x6a/0x150 [1136314.476491] ? asm_exc_page_fault+0x22/0x30 [1136314.484927] ? __xdp_return+0x6c/0x210 [1136314.492863] bpf_xdp_adjust_tail+0x155/0x1d0 [1136314.501269] bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60 [1136314.511263] ice_clean_rx_irq_zc+0x206/0xc60 [ice] [1136314.520222] ? ice_xmit_zc+0x6e/0x150 [ice] [1136314.528506] ice_napi_poll+0x467/0x670 [ice] [1136314.536858] ? ttwu_do_activate.constprop.0+0x8f/0x1a0 [1136314.546010] __napi_poll+0x29/0x1b0 [1136314.553462] net_rx_action+0x133/0x270 [1136314.561619] __do_softirq+0xbe/0x28e [1136314.569303] do_softirq+0x3f/0x60 This comes from __xdp_return() call with xdp_buff argument passed as NULL which is supposed to be consumed by xsk_buff_free() call. To address this properly, in ZC case, a node that represents the frag being removed has to be pulled out of xskb_list. Introduce appropriate xsk helpers to do such node operation and use them accordingly within bpf_xdp_adjust_tail(). Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-4-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24xsk: make xsk_buff_pool responsible for clearing xdp_buff::flagsMaciej Fijalkowski
XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is used by drivers to notify data path whether xdp_buff contains fragments or not. Data path looks up mentioned flag on first buffer that occupies the linear part of xdp_buff, so drivers only modify it there. This is sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on stack or it resides within struct representing driver's queue and fragments are carried via skb_frag_t structs. IOW, we are dealing with only one xdp_buff. ZC mode though relies on list of xdp_buff structs that is carried via xsk_buff_pool::xskb_list, so ZC data path has to make sure that fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise, xsk_buff_free() could misbehave if it would be executed against xdp_buff that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can take place when within supplied XDP program bpf_xdp_adjust_tail() is used with negative offset that would in turn release the tail fragment from multi-buffer frame. Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would result in releasing all the nodes from xskb_list that were produced by driver before XDP program execution, which is not what is intended - only tail fragment should be deleted from xskb_list and then it should be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never make it up to user space, so from AF_XDP application POV there would be no traffic running, however due to free_list getting constantly new nodes, driver will be able to feed HW Rx queue with recycled buffers. Bottom line is that instead of traffic being redirected to user space, it would be continuously dropped. To fix this, let us clear the mentioned flag on xsk_buff_pool side during xdp_buff initialization, which is what should have been done right from the start of XSK multi-buffer support. Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support") Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support") Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24xsk: recycle buffer in case Rx queue was fullMaciej Fijalkowski
Add missing xsk_buff_free() call when __xsk_rcv_zc() failed to produce descriptor to XSK Rx queue. Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-2-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24Merge branch 'bpf-token'Andrii Nakryiko
Andrii Nakryiko says: ==================== BPF token This patch set is a combination of three BPF token-related patch sets ([0], [1], [2]) with fixes ([3]) to kernel-side token_fd passing APIs incorporated into relevant patches, bpf_token_capable() changes requested by Christian Brauner, and necessary libbpf and BPF selftests side adjustments. This patch set introduces an ability to delegate a subset of BPF subsystem functionality from privileged system-wide daemon (e.g., systemd or any other container manager) through special mount options for userns-bound BPF FS to a *trusted* unprivileged application. Trust is the key here. This functionality is not about allowing unconditional unprivileged BPF usage. Establishing trust, though, is completely up to the discretion of respective privileged application that would create and mount a BPF FS instance with delegation enabled, as different production setups can and do achieve it through a combination of different means (signing, LSM, code reviews, etc), and it's undesirable and infeasible for kernel to enforce any particular way of validating trustworthiness of particular process. The main motivation for this work is a desire to enable containerized BPF applications to be used together with user namespaces. This is currently impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read arbitrary memory, and it's impossible to ensure that they only read memory of processes belonging to any given namespace. This means that it's impossible to have a mechanically verifiable namespace-aware CAP_BPF capability, and as such another mechanism to allow safe usage of BPF functionality is necessary. BPF FS delegation mount options and BPF token derived from such BPF FS instance is such a mechanism. Kernel makes no assumption about what "trusted" constitutes in any particular case, and it's up to specific privileged applications and their surrounding infrastructure to decide that. What kernel provides is a set of APIs to setup and mount special BPF FS instance and derive BPF tokens from it. BPF FS and BPF token are both bound to its owning userns and in such a way are constrained inside intended container. Users can then pass BPF token FD to privileged bpf() syscall commands, like BPF map creation and BPF program loading, to perform such operations without having init userns privileges. This version incorporates feedback and suggestions ([4]) received on earlier iterations of BPF token approach, and instead of allowing to create BPF tokens directly assuming capable(CAP_SYS_ADMIN), we instead enhance BPF FS to accept a few new delegation mount options. If these options are used and BPF FS itself is properly created, set up, and mounted inside the user namespaced container, user application is able to derive a BPF token object from BPF FS instance, and pass that token to bpf() syscall. As explained in patch #3, BPF token itself doesn't grant access to BPF functionality, but instead allows kernel to do namespaced capabilities checks (ns_capable() vs capable()) for CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, as applicable. So it forms one half of a puzzle and allows container managers and sys admins to have safe and flexible configuration options: determining which containers get delegation of BPF functionality through BPF FS, and then which applications within such containers are allowed to perform bpf() commands, based on namespaces capabilities. Previous attempt at addressing this very same problem ([5]) attempted to utilize authoritative LSM approach, but was conclusively rejected by upstream LSM maintainers. BPF token concept is not changing anything about LSM approach, but can be combined with LSM hooks for very fine-grained security policy. Some ideas about making BPF token more convenient to use with LSM (in particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF 2023 presentation ([6]). E.g., an ability to specify user-provided data (context), which in combination with BPF LSM would allow implementing a very dynamic and fine-granular custom security policies on top of BPF token. In the interest of minimizing API surface area and discussions this was relegated to follow up patches, as it's not essential to the fundamental concept of delegatable BPF token. It should be noted that BPF token is conceptually quite similar to the idea of /dev/bpf device file, proposed by Song a while ago ([7]). The biggest difference is the idea of using virtual anon_inode file to hold BPF token and allowing multiple independent instances of them, each (potentially) with its own set of restrictions. And also, crucially, BPF token approach is not using any special stateful task-scoped flags. Instead, bpf() syscall accepts token_fd parameters explicitly for each relevant BPF command. This addresses main concerns brought up during the /dev/bpf discussion, and fits better with overall BPF subsystem design. Second part of this patch set adds full support for BPF token in libbpf's BPF object high-level API. Good chunk of the changes rework libbpf feature detection internals, which are the most affected by BPF token presence. Besides internal refactorings, libbpf allows to pass location of BPF FS from which BPF token should be created by libbpf. This can be done explicitly though a new bpf_object_open_opts.bpf_token_path field. But we also add implicit BPF token creation logic to BPF object load step, even without any explicit involvement of the user. If the environment is setup properly, BPF token will be created transparently and used implicitly. This allows for all existing application to gain BPF token support by just linking with latest version of libbpf library. No source code modifications are required. All that under assumption that privileged container management agent properly set up default BPF FS instance at /sys/bpf/fs to allow BPF token creation. libbpf adds support to override default BPF FS location for BPF token creation through LIBBPF_BPF_TOKEN_PATH envvar knowledge. This allows admins or container managers to mount BPF token-enabled BPF FS at non-standard location without the need to coordinate with applications. LIBBPF_BPF_TOKEN_PATH can also be used to disable BPF token implicit creation by setting it to an empty value. [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=805707&state=* [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=810260&state=* [2] https://patchwork.kernel.org/project/netdevbpf/list/?series=809800&state=* [3] https://patchwork.kernel.org/project/netdevbpf/patch/20231219053150.336991-1-andrii@kernel.org/ [4] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/ [5] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/ [6] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf [7] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/ v1->v2: - disable BPF token creation in init userns, and simplify bpf_token_capable() logic (Christian); - use kzalloc/kfree instead of kvzalloc/kvfree (Linus); - few more selftest cases to validate LSM and BPF token interations. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> ==================== Link: https://lore.kernel.org/r/20240124022127.2379740-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24selftests/bpf: Incorporate LSM policy to token-based testsAndrii Nakryiko
Add tests for LSM interactions (both bpf_token_capable and bpf_token_cmd LSM hooks) with BPF token in bpf() subsystem. Now child process passes back token FD for parent to be able to do tests with token originating in "wrong" userns. But we also create token in initns and check that token LSMs don't accidentally reject BPF operations when capable() checks pass without BPF token. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-31-andrii@kernel.org
2024-01-24selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvarAndrii Nakryiko
Add new subtest validating LIBBPF_BPF_TOKEN_PATH envvar semantics. Extend existing test to validate that LIBBPF_BPF_TOKEN_PATH allows to disable implicit BPF token creation by setting envvar to empty string. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-30-andrii@kernel.org
2024-01-24libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvarAndrii Nakryiko
To allow external admin authority to override default BPF FS location (/sys/fs/bpf) for implicit BPF token creation, teach libbpf to recognize LIBBPF_BPF_TOKEN_PATH envvar. If it is specified and user application didn't explicitly specify bpf_token_path option, it will be treated exactly like bpf_token_path option, overriding default /sys/fs/bpf location and making BPF token mandatory. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-29-andrii@kernel.org
2024-01-24selftests/bpf: Add tests for BPF object load with implicit tokenAndrii Nakryiko
Add a test to validate libbpf's implicit BPF token creation from default BPF FS location (/sys/fs/bpf). Also validate that disabling this implicit BPF token creation works. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-28-andrii@kernel.org
2024-01-24selftests/bpf: Add BPF object loading tests with explicit token passingAndrii Nakryiko
Add a few tests that attempt to load BPF object containing privileged map, program, and the one requiring mandatory BTF uploading into the kernel (to validate token FD propagation to BPF_BTF_LOAD command). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-27-andrii@kernel.org
2024-01-24libbpf: Wire up BPF token support at BPF object levelAndrii Nakryiko
Add BPF token support to BPF object-level functionality. BPF token is supported by BPF object logic either as an explicitly provided BPF token from outside (through BPF FS path), or implicitly (unless prevented through bpf_object_open_opts). Implicit mode is assumed to be the most common one for user namespaced unprivileged workloads. The assumption is that privileged container manager sets up default BPF FS mount point at /sys/fs/bpf with BPF token delegation options (delegate_{cmds,maps,progs,attachs} mount options). BPF object during loading will attempt to create BPF token from /sys/fs/bpf location, and pass it for all relevant operations (currently, map creation, BTF load, and program load). In this implicit mode, if BPF token creation fails due to whatever reason (BPF FS is not mounted, or kernel doesn't support BPF token, etc), this is not considered an error. BPF object loading sequence will proceed with no BPF token. In explicit BPF token mode, user provides explicitly custom BPF FS mount point path. In such case, BPF object will attempt to create BPF token from provided BPF FS location. If BPF token creation fails, that is considered a critical error and BPF object load fails with an error. Libbpf provides a way to disable implicit BPF token creation, if it causes any troubles (BPF token is designed to be completely optional and shouldn't cause any problems even if provided, but in the world of BPF LSM, custom security logic can be installed that might change outcome depending on the presence of BPF token). To disable libbpf's default BPF token creation behavior user should provide either invalid BPF token FD (negative), or empty bpf_token_path option. BPF token presence can influence libbpf's feature probing, so if BPF object has associated BPF token, feature probing is instructed to use BPF object-specific feature detection cache and token FD. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-26-andrii@kernel.org
2024-01-24libbpf: Wire up token_fd into feature probing logicAndrii Nakryiko
Adjust feature probing callbacks to take into account optional token_fd. In unprivileged contexts, some feature detectors would fail to detect kernel support just because BPF program, BPF map, or BTF object can't be loaded due to privileged nature of those operations. So when BPF object is loaded with BPF token, this token should be used for feature probing. This patch is setting support for this scenario, but we don't yet pass non-zero token FD. This will be added in the next patch. We also switched BPF cookie detector from using kprobe program to tracepoint one, as tracepoint is somewhat less dangerous BPF program type and has higher likelihood of being allowed through BPF token in the future. This change has no effect on detection behavior. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-25-andrii@kernel.org
2024-01-24libbpf: Move feature detection code into its own fileAndrii Nakryiko
It's quite a lot of well isolated code, so it seems like a good candidate to move it out of libbpf.c to reduce its size. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-24-andrii@kernel.org
2024-01-24libbpf: Further decouple feature checking logic from bpf_objectAndrii Nakryiko
Add feat_supported() helper that accepts feature cache instead of bpf_object. This allows low-level code in bpf.c to not know or care about higher-level concept of bpf_object, yet it will be able to utilize custom feature checking in cases where BPF token might influence the outcome. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-23-andrii@kernel.org
2024-01-24libbpf: Split feature detectors definitions from cached resultsAndrii Nakryiko
Split a list of supported feature detectors with their corresponding callbacks from actual cached supported/missing values. This will allow to have more flexible per-token or per-object feature detectors in subsequent refactorings. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-22-andrii@kernel.org
2024-01-24selftests/bpf: Utilize string values for delegate_xxx mount optionsAndrii Nakryiko
Use both hex-based and string-based way to specify delegate mount options for BPF FS. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-21-andrii@kernel.org
2024-01-24bpf: Support symbolic BPF FS delegation mount optionsAndrii Nakryiko
Besides already supported special "any" value and hex bit mask, support string-based parsing of delegation masks based on exact enumerator names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`, `enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported symbolic names (ignoring __MAX_xxx guard values and stripping repetitive prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is normalized to lower case in mount option output. So "PROG_LOAD", "prog_load", and "MAP_create" are all valid values to specify for delegate_cmds options, "array" is among supported for map types, etc. Besides supporting string values, we also support multiple values specified at the same time, using colon (':') separator. There are corresponding changes on bpf_show_options side to use known values to print them in human-readable format, falling back to hex mask printing, if there are any unrecognized bits. This shouldn't be necessary when enum BTF information is present, but in general we should always be able to fall back to this even if kernel was built without BTF. As mentioned, emitted symbolic names are normalized to be all lower case. Example below shows various ways to specify delegate_cmds options through mount command and how mount options are printed back: 12/14 14:39:07.604 vmuser@archvm:~/local/linux/tools/testing/selftests/bpf $ mount | rg token $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp $ mount | grep token bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp) Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org
2024-01-24bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FSAndrii Nakryiko
It's quite confusing in practice when it's possible to successfully create a BPF token from BPF FS that didn't have any of delegate_xxx mount options set up. While it's not wrong, it's actually more meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT) to let user-space know that no token delegation is setup up. So, instead of creating empty BPF token that will be always ignored because it doesn't have any of the allow_xxx bits set, reject it with -ENOENT. If we ever need empty BPF token to be possible, we can support that with extra flag passed into BPF_TOKEN_CREATE. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-19-andrii@kernel.org