summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2024-01-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-25Merge tag 'net-6.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf, netfilter and WiFi. Jakub is doing a lot of work to include the self-tests in our CI, as a result a significant amount of self-tests related fixes is flowing in (and will likely continue in the next few weeks). Current release - regressions: - bpf: fix a kernel crash for the riscv 64 JIT - bnxt_en: fix memory leak in bnxt_hwrm_get_rings() - revert "net: macsec: use skb_ensure_writable_head_tail to expand the skb" Previous releases - regressions: - core: fix removing a namespace with conflicting altnames - tc/flower: fix chain template offload memory leak - tcp: - make sure init the accept_queue's spinlocks once - fix autocork on CPUs with weak memory model - udp: fix busy polling - mlx5e: - fix out-of-bound read in port timestamping - fix peer flow lists corruption - iwlwifi: fix a memory corruption Previous releases - always broken: - netfilter: - nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain - nft_limit: reject configurations that cause integer overflow - bpf: fix bpf_xdp_adjust_tail() with XSK zero-copy mbuf, avoiding a NULL pointer dereference upon shrinking - llc: make llc_ui_sendmsg() more robust against bonding changes - smc: fix illegal rmb_desc access in SMC-D connection dump - dpll: fix pin dump crash for rebound module - bnxt_en: fix possible crash after creating sw mqprio TCs - hv_netvsc: calculate correct ring size when PAGE_SIZE is not 4kB Misc: - several self-tests fixes for better integration with the netdev CI - added several missing modules descriptions" * tag 'net-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring tsnep: Remove FCS for XDP data path net: fec: fix the unhandled context fault from smmu selftests: bonding: do not test arp/ns target with mode balance-alb/tlb fjes: fix memleaks in fjes_hw_setup i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue i40e: set xdp_rxq_info::frag_size xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers ice: remove redundant xdp_rxq_info registration i40e: handle multi-buffer packets that are shrunk by xdp prog ice: work on pre-XDP prog frag count xsk: fix usage of multi-buffer BPF helpers for ZC XDP xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags xsk: recycle buffer in case Rx queue was full net: fill in MODULE_DESCRIPTION()s for rvu_mbox net: fill in MODULE_DESCRIPTION()s for litex net: fill in MODULE_DESCRIPTION()s for fsl_pq_mdio net: fill in MODULE_DESCRIPTION()s for fec ...
2024-01-25Merge tag 'vfs-6.8-rc2.netfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull netfs fixes from Christian Brauner: "This contains various fixes for the netfs work merged earlier this cycle: afs: - Fix locking imbalance in afs_proc_addr_prefs_show() - Remove afs_dynroot_d_revalidate() which is redundant - Fix error handling during lookup - Hide sillyrenames from userspace. This fixes a race between silly-rename files being created/removed and userspace iterating over directory entries - Don't use unnecessary folio_*() functions cifs: - Don't use unnecessary folio_*() functions cachefiles: - erofs: Fix Null dereference when cachefiles are not doing ondemand-mode - Update mailing list netfs library: - Add Jeff Layton as reviewer - Update mailing list - Fix a error checking in netfs_perform_write() - fscache: Check error before dereferencing - Don't use unnecessary folio_*() functions" * tag 'vfs-6.8-rc2.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: afs: Fix missing/incorrect unlocking of RCU read lock afs: Remove afs_dynroot_d_revalidate() as it is redundant afs: Fix error handling with lookup via FS.InlineBulkStatus afs: Hide silly-rename files from userspace cachefiles, erofs: Fix NULL deref in when cachefiles is not doing ondemand-mode netfs: Fix a NULL vs IS_ERR() check in netfs_perform_write() netfs, fscache: Prevent Oops in fscache_put_cache() cifs: Don't use certain unnecessary folio_*() functions afs: Don't use certain unnecessary folio_*() functions netfs: Don't use certain unnecessary folio_*() functions netfs: Add Jeff Layton as reviewer netfs, cachefiles: Change mailing list
2024-01-25Merge tag 'mlx5-fixes-2024-01-24' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2024-01-24 This series provides bug fixes to mlx5 driver. Please pull and let me know if there is any problem. * tag 'mlx5-fixes-2024-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: fix a potential double-free in fs_any_create_groups net/mlx5e: fix a double-free in arfs_create_groups net/mlx5e: Ignore IPsec replay window values on sender side net/mlx5e: Allow software parsing when IPsec crypto is enabled net/mlx5: Use mlx5 device constant for selecting CQ period mode for ASO net/mlx5: DR, Can't go to uplink vport on RX rule net/mlx5: DR, Use the right GVMI number for drop action net/mlx5: Bridge, fix multicast packets sent to uplink net/mlx5: Fix a WARN upon a callback command failure net/mlx5e: Fix peer flow lists handling net/mlx5e: Fix inconsistent hairpin RQT sizes net/mlx5e: Fix operation precedence bug in port timestamping napi_poll context net/mlx5: Fix query of sd_group field net/mlx5e: Use the correct lag ports number when creating TISes ==================== Link: https://lore.kernel.org/r/20240124081855.115410-1-saeed@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-25Merge tag 'for-netdev' of ↵Paolo Abeni
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-01-25 The following pull-request contains BPF updates for your *net* tree. We've added 12 non-merge commits during the last 2 day(s) which contain a total of 13 files changed, 190 insertions(+), 91 deletions(-). The main changes are: 1) Fix bpf_xdp_adjust_tail() in context of XSK zero-copy drivers which support XDP multi-buffer. The former triggered a NULL pointer dereference upon shrinking, from Maciej Fijalkowski & Tirthendu Sarkar. 2) Fix a bug in riscv64 BPF JIT which emitted a wrong prologue and epilogue for struct_ops programs, from Pu Lehui. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue i40e: set xdp_rxq_info::frag_size xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers ice: remove redundant xdp_rxq_info registration i40e: handle multi-buffer packets that are shrunk by xdp prog ice: work on pre-XDP prog frag count xsk: fix usage of multi-buffer BPF helpers for ZC XDP xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags xsk: recycle buffer in case Rx queue was full riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops ==================== Link: https://lore.kernel.org/r/20240125084416.10876-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-24xsk: fix usage of multi-buffer BPF helpers for ZC XDPMaciej Fijalkowski
Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens: [1136314.192256] BUG: kernel NULL pointer dereference, address: 0000000000000034 [1136314.203943] #PF: supervisor read access in kernel mode [1136314.213768] #PF: error_code(0x0000) - not-present page [1136314.223550] PGD 0 P4D 0 [1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI [1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257 [1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210 [1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86 [1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246 [1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX: 0000000000000000 [1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9003168c000 [1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09: 0000000000010000 [1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12: 0000000000000001 [1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15: 0000000000000001 [1136314.373298] FS: 00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000) knlGS:0000000000000000 [1136314.386105] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4: 00000000007706f0 [1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1136314.431890] PKRU: 55555554 [1136314.439143] Call Trace: [1136314.446058] <IRQ> [1136314.452465] ? __die+0x20/0x70 [1136314.459881] ? page_fault_oops+0x15b/0x440 [1136314.468305] ? exc_page_fault+0x6a/0x150 [1136314.476491] ? asm_exc_page_fault+0x22/0x30 [1136314.484927] ? __xdp_return+0x6c/0x210 [1136314.492863] bpf_xdp_adjust_tail+0x155/0x1d0 [1136314.501269] bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60 [1136314.511263] ice_clean_rx_irq_zc+0x206/0xc60 [ice] [1136314.520222] ? ice_xmit_zc+0x6e/0x150 [ice] [1136314.528506] ice_napi_poll+0x467/0x670 [ice] [1136314.536858] ? ttwu_do_activate.constprop.0+0x8f/0x1a0 [1136314.546010] __napi_poll+0x29/0x1b0 [1136314.553462] net_rx_action+0x133/0x270 [1136314.561619] __do_softirq+0xbe/0x28e [1136314.569303] do_softirq+0x3f/0x60 This comes from __xdp_return() call with xdp_buff argument passed as NULL which is supposed to be consumed by xsk_buff_free() call. To address this properly, in ZC case, a node that represents the frag being removed has to be pulled out of xskb_list. Introduce appropriate xsk helpers to do such node operation and use them accordingly within bpf_xdp_adjust_tail(). Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-4-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24xsk: make xsk_buff_pool responsible for clearing xdp_buff::flagsMaciej Fijalkowski
XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is used by drivers to notify data path whether xdp_buff contains fragments or not. Data path looks up mentioned flag on first buffer that occupies the linear part of xdp_buff, so drivers only modify it there. This is sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on stack or it resides within struct representing driver's queue and fragments are carried via skb_frag_t structs. IOW, we are dealing with only one xdp_buff. ZC mode though relies on list of xdp_buff structs that is carried via xsk_buff_pool::xskb_list, so ZC data path has to make sure that fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise, xsk_buff_free() could misbehave if it would be executed against xdp_buff that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can take place when within supplied XDP program bpf_xdp_adjust_tail() is used with negative offset that would in turn release the tail fragment from multi-buffer frame. Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would result in releasing all the nodes from xskb_list that were produced by driver before XDP program execution, which is not what is intended - only tail fragment should be deleted from xskb_list and then it should be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never make it up to user space, so from AF_XDP application POV there would be no traffic running, however due to free_list getting constantly new nodes, driver will be able to feed HW Rx queue with recycled buffers. Bottom line is that instead of traffic being redirected to user space, it would be continuously dropped. To fix this, let us clear the mentioned flag on xsk_buff_pool side during xdp_buff initialization, which is what should have been done right from the start of XSK multi-buffer support. Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support") Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support") Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-24bpf,lsm: Add BPF token LSM hooksAndrii Nakryiko
Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to allocate LSM security blob (we add `void *security` field to struct bpf_token for that), but also control who can instantiate BPF token. This follows existing pattern for BPF map and BPF prog. Also add security_bpf_token_allow_cmd() and security_bpf_token_capable() LSM hooks that allow LSM implementation to control and negate (if necessary) BPF token's delegation of a specific bpf_cmd and capability, respectively. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-12-andrii@kernel.org
2024-01-24bpf,lsm: Refactor bpf_map_alloc/bpf_map_free LSM hooksAndrii Nakryiko
Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc hook into bpf_map_create, taking not just struct bpf_map, but also bpf_attr and bpf_token, to give a fuller context to LSMs. Unlike bpf_prog_alloc, there is no need to move the hook around, as it currently is firing right before allocating BPF map ID and FD, which seems to be a sweet spot. But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free LSM hook is called even if bpf_map_create hook returned error, as if few LSMs are combined together it could be that one LSM successfully allocated security blob for its needs, while subsequent LSM rejected BPF map creation. The former LSM would still need to free up LSM blob, so we need to ensure security_bpf_map_free() is called regardless of the outcome. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-11-andrii@kernel.org
2024-01-24bpf,lsm: Refactor bpf_prog_alloc/bpf_prog_free LSM hooksAndrii Nakryiko
Based on upstream discussion ([0]), rework existing bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF program struct. Also, we pass bpf_attr union with all the user-provided arguments for BPF_PROG_LOAD command. This will give LSMs as much information as we can basically provide. The hook is also BPF token-aware now, and optional bpf_token struct is passed as a third argument. bpf_prog_load LSM hook is called after a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were allocated and filled out, but right before performing full-fledged BPF verification step. bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for consistency. SELinux code is adjusted to all new names, types, and signatures. Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be used by some LSMs to allocate extra security blob, but also by other LSMs to reject BPF program loading, we need to make sure that bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one *even* if the hook itself returned error. If we don't do that, we run the risk of leaking memory. This seems to be possible today when combining SELinux and BPF LSM, as one example, depending on their relative ordering. Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to sleepable LSM hooks list, as they are both executed in sleepable context. Also drop bpf_prog_load hook from untrusted, as there is no issue with refcount or anything else anymore, that originally forced us to add it to untrusted list in c0c852dd1876 ("bpf: Do not mark certain LSM hook arguments as trusted"). We now trigger this hook much later and it should not be an issue anymore. [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/bpf/20240124022127.2379740-10-andrii@kernel.org
2024-01-24bpf: Consistently use BPF token throughout BPF verifier logicAndrii Nakryiko
Remove remaining direct queries to perfmon_capable() and bpf_capable() in BPF verifier logic and instead use BPF token (if available) to make decisions about privileges. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-9-andrii@kernel.org
2024-01-24bpf: Take into account BPF token when fetching helper protosAndrii Nakryiko
Instead of performing unconditional system-wide bpf_capable() and perfmon_capable() calls inside bpf_base_func_proto() function (and other similar ones) to determine eligibility of a given BPF helper for a given program, use previously recorded BPF token during BPF_PROG_LOAD command handling to inform the decision. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
2024-01-24bpf: Add BPF token support to BPF_PROG_LOAD commandAndrii Nakryiko
Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag should be set in prog_flags field when providing prog_token_fd. Wire through a set of allowed BPF program types and attach types, derived from BPF FS at BPF token creation time. Then make sure we perform bpf_token_capable() checks everywhere where it's relevant. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org
2024-01-24bpf: Add BPF token support to BPF_BTF_LOAD commandAndrii Nakryiko
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading through delegated BPF token. BPF_F_TOKEN_FD flag has to be specified when passing BPF token FD. Given BPF_BTF_LOAD command didn't have flags field before, we also add btf_flags field. BTF loading is a pretty straightforward operation, so as long as BPF token is created with allow_cmds granting BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating BTF object. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-6-andrii@kernel.org
2024-01-24bpf: Add BPF token support to BPF_MAP_CREATE commandAndrii Nakryiko
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled BPF map creation from unprivileged process through delegated BPF token. New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD for BPF_MAP_CREATE command. Wire through a set of allowed BPF map types to BPF token, derived from BPF FS at BPF token creation time. This, in combination with allowed_cmds allows to create a narrowly-focused BPF token (controlled by privileged agent) with a restrictive set of BPF maps that application can attempt to create. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org
2024-01-24bpf: Introduce BPF token objectAndrii Nakryiko
Add new kind of BPF kernel object, BPF token. BPF token is meant to allow delegating privileged BPF functionality, like loading a BPF program or creating a BPF map, from privileged process to a *trusted* unprivileged process, all while having a good amount of control over which privileged operations could be performed using provided BPF token. This is achieved through mounting BPF FS instance with extra delegation mount options, which determine what operations are delegatable, and also constraining it to the owning user namespace (as mentioned in the previous patch). BPF token itself is just a derivative from BPF FS and can be created through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF FS FD, which can be attained through open() API by opening BPF FS mount point. Currently, BPF token "inherits" delegated command, map types, prog type, and attach type bit sets from BPF FS as is. In the future, having an BPF token as a separate object with its own FD, we can allow to further restrict BPF token's allowable set of things either at the creation time or after the fact, allowing the process to guard itself further from unintentionally trying to load undesired kind of BPF programs. But for now we keep things simple and just copy bit sets as is. When BPF token is created from BPF FS mount, we take reference to the BPF super block's owning user namespace, and then use that namespace for checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} capabilities that are normally only checked against init userns (using capable()), but now we check them using ns_capable() instead (if BPF token is provided). See bpf_token_capable() for details. Such setup means that BPF token in itself is not sufficient to grant BPF functionality. User namespaced process has to *also* have necessary combination of capabilities inside that user namespace. So while previously CAP_BPF was useless when granted within user namespace, now it gains a meaning and allows container managers and sys admins to have a flexible control over which processes can and need to use BPF functionality within the user namespace (i.e., container in practice). And BPF FS delegation mount options and derived BPF tokens serve as a per-container "flag" to grant overall ability to use bpf() (plus further restrict on which parts of bpf() syscalls are treated as namespaced). Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF) within the BPF FS owning user namespace, rounding up the ns_capable() story of BPF token. Also creating BPF token in init user namespace is currently not supported, given BPF token doesn't have any effect in init user namespace anyways. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org
2024-01-24bpf: Add BPF token delegation mount options to BPF FSAndrii Nakryiko
Add few new mount options to BPF FS that allow to specify that a given BPF FS instance allows creation of BPF token (added in the next patch), and what sort of operations are allowed under BPF token. As such, we get 4 new mount options, each is a bit mask - `delegate_cmds` allow to specify which bpf() syscall commands are allowed with BPF token derived from this BPF FS instance; - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies a set of allowable BPF map types that could be created with BPF token; - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies a set of allowable BPF program types that could be loaded with BPF token; - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies a set of allowable BPF program attach types that could be loaded with BPF token; delegate_progs and delegate_attachs are meant to be used together, as full BPF program type is, in general, determined through both program type and program attach type. Currently, these mount options accept the following forms of values: - a special value "any", that enables all possible values of a given bit set; - numeric value (decimal or hexadecimal, determined by kernel automatically) that specifies a bit mask value directly; - all the values for a given mount option are combined, if specified multiple times. E.g., `mount -t bpf nodev /path/to/mount -o delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3 mask. Ideally, more convenient (for humans) symbolic form derived from corresponding UAPI enums would be accepted (e.g., `-o delegate_progs=kprobe|tracepoint`) and I intend to implement this, but it requires a bunch of UAPI header churn, so I postponed it until this feature lands upstream or at least there is a definite consensus that this feature is acceptable and is going to make it, just to minimize amount of wasted effort and not increase amount of non-essential code to be reviewed. Attentive reader will notice that BPF FS is now marked as FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init user namespace as long as the process has sufficient *namespaced* capabilities within that user namespace. But in reality we still restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added to allow creating BPF FS context object (i.e., fsopen("bpf")) from inside unprivileged process inside non-init userns, to capture that userns as the owning userns. It will still be required to pass this context object back to privileged process to instantiate and mount it. This manipulation is important, because capturing non-init userns as the owning userns of BPF FS instance (super block) allows to use that userns to constraint BPF token to that userns later on (see next patch). So creating BPF FS with delegation inside unprivileged userns will restrict derived BPF token objects to only "work" inside that intended userns, making it scoped to a intended "container". Also, setting these delegation options requires capable(CAP_SYS_ADMIN), so unprivileged process cannot set this up without involvement of a privileged process. There is a set of selftests at the end of the patch set that simulates this sequence of steps and validates that everything works as intended. But careful review is requested to make sure there are no missed gaps in the implementation and testing. This somewhat subtle set of aspects is the result of previous discussions ([0]) about various user namespace implications and interactions with BPF token functionality and is necessary to contain BPF token inside intended user namespace. [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org
2024-01-24Merge tag 'execve-v6.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fixes from Kees Cook: - Fix error handling in begin_new_exec() (Bernd Edlinger) - MAINTAINERS: specifically mention ELF (Alexey Dobriyan) - Various cleanups related to earlier open() (Askar Safin, Kees Cook) * tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Distinguish in_execve from in_exec exec: Fix error handling in begin_new_exec() exec: Add do_close_execat() helper exec: remove useless comment ELF, MAINTAINERS: specifically mention ELF
2024-01-24exec: Distinguish in_execve from in_execKees Cook
Just to help distinguish the fs->in_exec flag from the current->in_execve flag, add comments in check_unsafe_exec() and copy_fs() for more context. Also note that in_execve is only used by TOMOYO now. Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-24netfilter: nf_tables: cleanup documentationGeorge Guo
- Correct comments for nlpid, family, udlen and udata in struct nft_table, and afinfo is no longer a member of enum nft_set_class. - Add comment for data in struct nft_set_elem. - Add comment for flags in struct nft_ctx. - Add comments for timeout in struct nft_set_iter, and flags is not a member of struct nft_set_iter, remove the comment for it. - Add comments for commit, abort, estimate and gc_init in struct nft_set_ops. - Add comments for pending_update, num_exprs, exprs and catchall_list in struct nft_set. - Add comment for ext_len in struct nft_set_ext_tmpl. - Add comment for inner_ops in struct nft_expr_type. - Add comments for clone, destroy_clone, reduce, gc, offload, offload_action, offload_stats in struct nft_expr_ops. - Add comments for blob_gen_0, blob_gen_1, bound, genmask, udlen, udata, blob_next in struct nft_chain. - Add comment for flags in struct nft_base_chain. - Add comments for udlen, udata in struct nft_object. - Add comment for type in struct nft_object_ops. - Add comment for hook_list in struct nft_flowtable, and remove comments for dev_name and ops which are not members of struct nft_flowtable. Signed-off-by: George Guo <guodongtai@kylinos.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24spi: Raise limit on number of chip selectsMark Brown
As reported by Guenter the limit we've got on the number of chip selects is set too low for some systems, raise the limit. We should really remove the hard coded limit but this is needed as a fix so let's do the simple thing and raise the limit for now. Fixes: 4d8ff6b0991d ("spi: Add multi-cs memories support in SPI core") Reported-by: Guenter Roeck <linux@roeck-us.net> Suggested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/r/20240124-spi-multi-cs-max-v2-1-df6fc5ab1abc@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2024-01-24x86/entry/ia32: Ensure s32 is sign extended to s64Richard Palethorpe
Presently ia32 registers stored in ptregs are unconditionally cast to unsigned int by the ia32 stub. They are then cast to long when passed to __se_sys*, but will not be sign extended. This takes the sign of the syscall argument into account in the ia32 stub. It still casts to unsigned int to avoid implementation specific behavior. However then casts to int or unsigned int as necessary. So that the following cast to long sign extends the value. This fixes the io_pgetevents02 LTP test when compiled with -m32. Presently the systemcall io_pgetevents_time64() unexpectedly accepts -1 for the maximum number of events. It doesn't appear other systemcalls with signed arguments are effected because they all have compat variants defined and wired up. Fixes: ebeb8c82ffaf ("syscalls/x86: Use 'struct pt_regs' based syscall calling for IA32_EMULATION and x32") Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com> Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240110130122.3836513-1-nik.borisov@suse.com Link: https://lore.kernel.org/ltp/20210921130127.24131-1-rpalethorpe@suse.com/
2024-01-24net/mlx5: Bridge, fix multicast packets sent to uplinkMoshe Shemesh
To enable multicast packets which are offloaded in bridge multicast offload mode to be sent also to uplink, FTE bit uplink_hairpin_en should be set. Add this bit to FTE for the bridge multicast offload rules. Fixes: 18c2916cee12 ("net/mlx5: Bridge, snoop igmp/mld packets") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-01-24net/mlx5: Fix query of sd_group fieldTariq Toukan
The sd_group field moved in the HW spec from the MPIR register to the vport context. Align the query accordingly. Fixes: f5e956329960 ("net/mlx5: Expose Management PCIe Index Register (MPIR)") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-01-24net/mlx5e: Use the correct lag ports number when creating TISesSaeed Mahameed
The cited commit moved the code of mlx5e_create_tises() and changed the loop to create TISes over MLX5_MAX_PORTS constant value, instead of getting the correct lag ports supported by the device, which can cause FW errors on devices with less than MLX5_MAX_PORTS ports. Change that back to mlx5e_get_num_lag_ports(mdev). Also IPoIB interfaces create there own TISes, they don't use the eth TISes, pass a flag to indicate that. This fixes the following errors that might appear in kernel log: mlx5_cmd_out_err:808:(pid 650): CREATE_TIS(0x912) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x595b5d), err(-22) mlx5e_create_mdev_resources:174:(pid 650): alloc tises failed, -22 Fixes: b25bd37c859f ("net/mlx5: Move TISes from priv to mdev HW resources") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-01-24net/sched: flower: Fix chain template offloadIdo Schimmel
When a qdisc is deleted from a net device the stack instructs the underlying driver to remove its flow offload callback from the associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack then continues to replay the removal of the filters in the block for this driver by iterating over the chains in the block and invoking the 'reoffload' operation of the classifier being used. In turn, the classifier in its 'reoffload' operation prepares and emits a 'FLOW_CLS_DESTROY' command for each filter. However, the stack does not do the same for chain templates and the underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when a qdisc is deleted. This results in a memory leak [1] which can be reproduced using [2]. Fix by introducing a 'tmplt_reoffload' operation and have the stack invoke it with the appropriate arguments as part of the replay. Implement the operation in the sole classifier that supports chain templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}' command based on whether a flow offload callback is being bound to a filter block or being unbound from one. As far as I can tell, the issue happens since cited commit which reordered tcf_block_offload_unbind() before tcf_block_flush_all_chains() in __tcf_block_put(). The order cannot be reversed as the filter block is expected to be freed after flushing all the chains. [1] unreferenced object 0xffff888107e28800 (size 2048): comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s) hex dump (first 32 bytes): b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff ..|......[...... 01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff ................ backtrace: [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff81ab374e>] __kmalloc+0x4e/0x90 [<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0 [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180 [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280 [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340 [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0 [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170 [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0 [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440 [<ffffffff83ac6270>] netlink_unicast+0x540/0x820 [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0 [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80 [<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0 [<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0 [<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0 unreferenced object 0xffff88816d2c0400 (size 1024): comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s) hex dump (first 32 bytes): 40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00 @.......W.8..... 10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff ..,m......,m.... backtrace: [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90 [<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0 [<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460 [<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0 [<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0 [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180 [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280 [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340 [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0 [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170 [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0 [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440 [<ffffffff83ac6270>] netlink_unicast+0x540/0x820 [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0 [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80 [2] # tc qdisc add dev swp1 clsact # tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32 # tc qdisc del dev swp1 clsact # devlink dev reload pci/0000:06:00.0 Fixes: bbf73830cd48 ("net: sched: traverse chains in block with tcf_get_next_chain()") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-23net/ipv6: resolve warning in ip6_fib.cBreno Leitao
In some configurations, the 'iter' variable in function fib6_repair_tree() is unused, resulting the following warning when compiled with W=1. net/ipv6/ip6_fib.c:1781:6: warning: variable 'iter' set but not used [-Wunused-but-set-variable] 1781 | int iter = 0; | ^ It is unclear what is the advantage of this RT6_TRACE() macro[1], since users can control pr_debug() in runtime, which is better than at compilation time. pr_debug() has no overhead when disabled. Remove the RT6_TRACE() in favor of simple pr_debug() helpers. [1] Link: https://lore.kernel.org/all/ZZwSEJv2HgI0cD4J@gmail.com/ Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240122181955.2391676-2-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-23bpf, net: switch to dynamic registrationKui-Feng Lee
Replace the static list of struct_ops types with per-btf struct_ops_tab to enable dynamic registration. Both bpf_dummy_ops and bpf_tcp_ca now utilize the registration function instead of being listed in bpf_struct_ops_types.h. Cc: netdev@vger.kernel.org Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-12-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23bpf: validate value_typeKui-Feng Lee
A value_type should consist of three components: refcnt, state, and data. refcnt and state has been move to struct bpf_struct_ops_common_value to make it easier to check the value type. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-11-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23bpf: hold module refcnt in bpf_struct_ops map creation and prog verification.Kui-Feng Lee
To ensure that a module remains accessible whenever a struct_ops object of a struct_ops type provided by the module is still in use. struct bpf_struct_ops_map doesn't hold a refcnt to btf anymore since a module will hold a refcnt to it's btf already. But, struct_ops programs are different. They hold their associated btf, not the module since they need only btf to assure their types (signatures). However, verifier holds the refcnt of the associated module of a struct_ops type temporarily when verify a struct_ops prog. Verifier needs the help from the verifier operators (struct bpf_verifier_ops) provided by the owner module to verify data access of a prog, provide information, and generate code. This patch also add a count of links (links_cnt) to bpf_struct_ops_map. It avoids bpf_struct_ops_map_put_progs() from accessing btf after calling module_put() in bpf_struct_ops_map_free(). Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-10-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23bpf: pass attached BTF to the bpf_struct_ops subsystemKui-Feng Lee
Pass the fd of a btf from the userspace to the bpf() syscall, and then convert the fd into a btf. The btf is generated from the module that defines the target BPF struct_ops type. In order to inform the kernel about the module that defines the target struct_ops type, the userspace program needs to provide a btf fd for the respective module's btf. This btf contains essential information on the types defined within the module, including the target struct_ops type. A btf fd must be provided to the kernel for struct_ops maps and for the bpf programs attached to those maps. In the case of the bpf programs, the attach_btf_obj_fd parameter is passed as part of the bpf_attr and is converted into a btf. This btf is then stored in the prog->aux->attach_btf field. Here, it just let the verifier access attach_btf directly. In the case of struct_ops maps, a btf fd is passed as value_type_btf_obj_fd of bpf_attr. The bpf_struct_ops_map_alloc() function converts the fd to a btf and stores it as st_map->btf. A flag BPF_F_VTYPE_BTF_OBJ_FD is added for map_flags to indicate that the value of value_type_btf_obj_fd is set. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-9-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23bpf: lookup struct_ops types from a given module BTF.Kui-Feng Lee
This is a preparation for searching for struct_ops types from a specified module. BTF is always btf_vmlinux now. This patch passes a pointer of BTF to bpf_struct_ops_find_value() and bpf_struct_ops_find(). Once the new registration API of struct_ops types is used, other BTFs besides btf_vmlinux can also be passed to them. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-8-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23bpf: pass btf object id in bpf_map_info.Kui-Feng Lee
Include btf object id (btf_obj_id) in bpf_map_info so that tools (ex: bpftools struct_ops dump) know the correct btf from the kernel to look up type information of struct_ops types. Since struct_ops types can be defined and registered in a module. The type information of a struct_ops type are defined in the btf of the module defining it. The userspace tools need to know which btf is for the module defining a struct_ops type. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-7-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23bpf, net: introduce bpf_struct_ops_desc.Kui-Feng Lee
Move some of members of bpf_struct_ops to bpf_struct_ops_desc. type_id is unavailabe in bpf_struct_ops anymore. Modules should get it from the btf received by kmod's init function. Cc: netdev@vger.kernel.org Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-4-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23bpf: refactory struct_ops type initialization to a function.Kui-Feng Lee
Move the majority of the code to bpf_struct_ops_init_one(), which can then be utilized for the initialization of newly registered dynamically allocated struct_ops types in the following patches. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-2-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-01-23bpf: Store cookies in kprobe_multi bpf_link_info dataJiri Olsa
Storing cookies in kprobe_multi bpf_link_info data. The cookies field is optional and if provided it needs to be an array of __u64 with kprobe_multi.count length. Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240119110505.400573-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf: Add cookie to perf_event bpf_link_info recordsJiri Olsa
At the moment we don't store cookie for perf_event probes, while we do that for the rest of the probes. Adding cookie fields to struct bpf_link_info perf event probe records: perf_event.uprobe perf_event.kprobe perf_event.tracepoint perf_event.perf_event And the code to store that in bpf_link_info struct. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20240119110505.400573-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf: Define struct bpf_tcp_req_attrs when CONFIG_SYN_COOKIES=n.Kuniyuki Iwashima
kernel test robot reported the warning below: >> net/core/filter.c:11842:13: warning: declaration of 'struct bpf_tcp_req_attrs' will not be visible outside of this function [-Wvisibility] 11842 | struct bpf_tcp_req_attrs *attrs, int attrs__sz) | ^ 1 warning generated. struct bpf_tcp_req_attrs is defined under CONFIG_SYN_COOKIES but used in kfunc without the config. Let's move struct bpf_tcp_req_attrs definition outside of CONFIG_SYN_COOKIES guard. Fixes: e472f88891ab ("bpf: tcp: Support arbitrary SYN Cookie.") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401180418.CUVc0hxF-lkp@intel.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240118211751.25790-1-kuniyu@amazon.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf, docs: Fix bpf_redirect_peer header docVictor Stewart
Amend the bpf_redirect_peer() header documentation to also mention support for the netkit device type. Signed-off-by: Victor Stewart <v@nametag.social> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240116202952.241009-1-v@nametag.social Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf: tcp: Support arbitrary SYN Cookie.Kuniyuki Iwashima
This patch adds a new kfunc available at TC hook to support arbitrary SYN Cookie. The basic usage is as follows: struct bpf_tcp_req_attrs attrs = { .mss = mss, .wscale_ok = wscale_ok, .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */ .snd_wscale = snd_wscale, /* Client's WScale < 15 */ .tstamp_ok = tstamp_ok, .rcv_tsval = tsval, .rcv_tsecr = tsecr, /* Server's Initial TSval */ .usec_ts_ok = usec_ts_ok, .sack_ok = sack_ok, .ecn_ok = ecn_ok, } skc = bpf_skc_lookup_tcp(...); sk = (struct sock *)bpf_skc_to_tcp_sock(skc); bpf_sk_assign_tcp_reqsk(skb, sk, attrs, sizeof(attrs)); bpf_sk_release(skc); bpf_sk_assign_tcp_reqsk() takes skb, a listener sk, and struct bpf_tcp_req_attrs and allocates reqsk and configures it. Then, bpf_sk_assign_tcp_reqsk() links reqsk with skb and the listener. The notable thing here is that we do not hold refcnt for both reqsk and listener. To differentiate that, we mark reqsk->syncookie, which is only used in TX for now. So, if reqsk->syncookie is 1 in RX, it means that the reqsk is allocated by kfunc. When skb is freed, sock_pfree() checks if reqsk->syncookie is 1, and in that case, we set NULL to reqsk->rsk_listener before calling reqsk_free() as reqsk does not hold a refcnt of the listener. When the TCP stack looks up a socket from the skb, we steal the listener from the reqsk in skb_steal_sock() and create a full sk in cookie_v[46]_check(). The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock() after creating a full sk. Note that we can extend struct bpf_tcp_req_attrs in the future when we add a new attribute that is determined in 3WHS. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240115205514.68364-6-kuniyu@amazon.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().Kuniyuki Iwashima
We will support arbitrary SYN Cookie with BPF in the following patch. If BPF prog validates ACK and kfunc allocates a reqsk, it will be carried to cookie_[46]_check() as skb->sk. If skb->sk is not NULL, we call cookie_bpf_check(). Then, we clear skb->sk and skb->destructor, which are needed not to hold refcnt for reqsk and the listener. See the following patch for details. After that, we finish initialisation for the remaining fields with cookie_tcp_reqsk_init(). Note that the server side WScale is set only for non-BPF SYN Cookie. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240115205514.68364-5-kuniyu@amazon.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf: tcp: Handle BPF SYN Cookie in skb_steal_sock().Kuniyuki Iwashima
We will support arbitrary SYN Cookie with BPF. If BPF prog validates ACK and kfunc allocates a reqsk, it will be carried to TCP stack as skb->sk with req->syncookie 1. Also, the reqsk has its listener as req->rsk_listener with no refcnt taken. When the TCP stack looks up a socket from the skb, we steal inet_reqsk(skb->sk)->rsk_listener in skb_steal_sock() so that the skb will be processed in cookie_v[46]_check() with the listener. Note that we do not clear skb->sk and skb->destructor so that we can carry the reqsk to cookie_v[46]_check(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240115205514.68364-4-kuniyu@amazon.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23tcp: Move skb_steal_sock() to request_sock.hKuniyuki Iwashima
We will support arbitrary SYN Cookie with BPF. If BPF prog validates ACK and kfunc allocates a reqsk, it will be carried to TCP stack as skb->sk with req->syncookie 1. In skb_steal_sock(), we need to check inet_reqsk(sk)->syncookie to see if the reqsk is created by kfunc. However, inet_reqsk() is not available in sock.h. Let's move skb_steal_sock() to request_sock.h. While at it, we refactor skb_steal_sock() so it returns early if skb->sk is NULL to minimise the following patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240115205514.68364-3-kuniyu@amazon.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23tcp: Move tcp_ns_to_ts() to tcp.hKuniyuki Iwashima
We will support arbitrary SYN Cookie with BPF. When BPF prog validates ACK and kfunc allocates a reqsk, we need to call tcp_ns_to_ts() to calculate an offset of TSval for later use: time t0 : Send SYN+ACK -> tsval = Initial TSval (Random Number) t1 : Recv ACK of 3WHS -> tsoff = TSecr - tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns()) = Initial TSval - t1 t2 : Send ACK -> tsval = t2 + tsoff = Initial TSval + (t2 - t1) = Initial TSval + Time Delta (x) (x) Note that the time delta does not include the initial RTT from t0 to t1. Let's move tcp_ns_to_ts() to tcp.h. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240115205514.68364-2-kuniyu@amazon.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf: Make bpf_for_each_spilled_reg consider narrow spillsMaxim Mikityanskiy
Adjust the check in bpf_get_spilled_reg to take into account spilled registers narrower than 64 bits. That allows find_equal_scalars to properly adjust the range of all spilled registers that have the same ID. Before this change, it was possible for a register and a spilled register to have the same IDs but different ranges if the spill was narrower than 64 bits and a range check was performed on the register. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-5-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf: support multiple tags per argumentAndrii Nakryiko
Add ability to iterate multiple decl_tag types pointed to the same function argument. Use this to support multiple __arg_xxx tags per global subprog argument. We leave btf_find_decl_tag_value() intact, but change its implementation to use a new btf_find_next_decl_tag() which can be straightforwardly used to find next BTF type ID of a matching btf_decl_tag type. btf_prepare_func_args() is switched from btf_find_decl_tag_value() to btf_find_next_decl_tag() to gain multiple tags per argument support. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240105000909.2818934-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23bpf: Support inlining bpf_kptr_xchg() helperHou Tao
The motivation of inlining bpf_kptr_xchg() comes from the performance profiling of bpf memory allocator benchmark. The benchmark uses bpf_kptr_xchg() to stash the allocated objects and to pop the stashed objects for free. After inling bpf_kptr_xchg(), the performance for object free on 8-CPUs VM increases about 2%~10%. The inline also has downside: both the kasan and kcsan checks on the pointer will be unavailable. bpf_kptr_xchg() can be inlined by converting the calling of bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion depends on two conditions: 1) JIT backend supports atomic_xchg() on pointer-sized word 2) For the specific arch, the implementation of xchg is the same as atomic_xchg() on pointer-sized words. It seems most 64-bit JIT backends satisfies these two conditions. But as a precaution, defining a weak function bpf_jit_supports_ptr_xchg() to state whether such conversion is safe and only supporting inline for 64-bit host. For x86-64, it supports BPF_XCHG atomic operation and both xchg() and atomic_xchg() use arch_xchg() to implement the exchange, so enabling the inline of bpf_kptr_xchg() on x86-64 first. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-23nvmet: unify aer type enumGuixin Liu
The host and target use two definition of aer type, unify them into a single one. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-23sock_diag: allow concurrent operation in sock_diag_rcv_msg()Eric Dumazet
TCPDIAG_GETSOCK and DCCPDIAG_GETSOCK diag are serialized on sock_diag_table_mutex. This is to make sure inet_diag module is not unloaded while diag was ongoing. It is time to get rid of this mutex and use RCU protection, allowing full parallelism. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23sock_diag: add module pointer to "struct sock_diag_handler"Eric Dumazet
Following patch is going to use RCU instead of sock_diag_table_mutex acquisition. This patch is a preparation, no change of behavior yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>