path: root/include
2022-10-31rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_linkHangbin Liu
This patch uses the new helper unregister_netdevice_many_notify() for rtnl_delete_link(), so that the kernel can reply with a unicast notification when userspace sets the NLM_F_ECHO flag to request info about the deleted interface. At the same time, the parameters of rtnl_delete_link() need to be updated since we need the nlmsghdr and portid info. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31rtnetlink: pass netlink message header and portid to rtnl_configure_link()Hangbin Liu
This patch passes the netlink message header and portid to rtnl_configure_link(). All the functions in this call chain need the new parameters so we can use them in the final rtnl_notify() call and notify userspace about the new link info if the NLM_F_ECHO flag is set: rtnl_configure_link() -> __dev_notify_flags() -> rtmsg_ifinfo() -> rtmsg_ifinfo_event() -> rtmsg_ifinfo_build_skb() -> rtmsg_ifinfo_send() -> rtnl_notify(). Also move the __dev_notify_flags() declaration to net/core/dev.h, as Jakub suggested. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-31ptp: introduce helpers to adjust by scaled parts per millionJacob Keller
Many drivers implement the .adjfreq or .adjfine PTP op function with the same basic logic: 1. Determine a base frequency value 2. Multiply this by the abs() of the requested adjustment, then divide by the appropriate divisor (1 billion for parts per billion, or 65.536 billion for scaled parts per million). 3. Add or subtract this difference from the base frequency to calculate a new adjustment. A few drivers need the difference and direction rather than the combined new increment value. I recently converted the Intel drivers to .adjfine and the scaled parts per million (one part in 65.536 billion per unit) logic. To avoid overflow with minimal loss of precision, mul_u64_u64_div_u64 was used. The basic logic used by all of these drivers is very similar, and leads to a lot of duplicate code to perform the same task. Rather than keep this duplicate code, introduce diff_by_scaled_ppm and adjust_by_scaled_ppm. These helper functions calculate the difference or adjustment necessary based on the scaled parts per million input. The diff_by_scaled_ppm function returns true if the difference should be subtracted, and false otherwise. Update the Intel drivers to use the new helper functions. Other vendor drivers will be converted to .adjfine and this helper function in the following changes. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
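(A minimal sketch of what the two helpers described above amount to, using the 10^6 * 2^16 scaled-ppm divisor; the authoritative definitions live in include/linux/ptp_clock_kernel.h and may differ in detail.)

	static inline bool diff_by_scaled_ppm(u64 base, long scaled_ppm, u64 *diff)
	{
		bool negative = false;

		if (scaled_ppm < 0) {
			negative = true;
			scaled_ppm = -scaled_ppm;
		}

		/* diff = base * scaled_ppm / (10^6 * 2^16), without intermediate overflow */
		*diff = mul_u64_u64_div_u64(base, (u64)scaled_ppm, 1000000ULL << 16);

		return negative;	/* true: caller subtracts the diff */
	}

	static inline u64 adjust_by_scaled_ppm(u64 base, long scaled_ppm)
	{
		u64 diff;

		if (diff_by_scaled_ppm(base, scaled_ppm, &diff))
			return base - diff;

		return base + diff;
	}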
2022-10-31ptp: add missing documentation for parametersJacob Keller
The ptp_find_pin_unlocked function and the ptp_system_timestamp structure didn't document their parameters and fields. Fix this. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-31net: geneve: fix array of flexible structures warningsJakub Kicinski
New compilers don't like flexible array of flexible structs: include/net/geneve.h:62:34: warning: array of flexible structures Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-31netlink: split up copies in the ack constructionJakub Kicinski
Clean up the use of unsafe_memcpy() by adding a flexible array at the end of netlink message header and splitting up the header and data copies. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-30net: remove unused netdev_unregistering()Juhee Kang
Currently, dev->reg_state == NETREG_UNREGISTERING is checked directly rather than via the netdev_unregistering() helper, so the helper in netdevice.h is no longer used. Thus, remove netdev_unregistering() from netdevice.h. Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-28net/packet: add PACKET_FANOUT_FLAG_IGNORE_OUTGOINGWillem de Bruijn
Extend packet socket option PACKET_IGNORE_OUTGOING to fanout groups. The socket option sets ptype.ignore_outgoing, which makes dev_queue_xmit_nit skip the socket. When the socket joins a fanout group, the option is not reflected in the struct ptype of the group. dev_queue_xmit_nit only tests the fanout ptype, so the flag is ignored once a socket joins a fanout group. Inheriting the option from a socket would change established behavior. Different sockets in the group can set different flags, and can also change them at runtime. Testing in packet_rcv_fanout defeats the purpose of the original patch, which is to avoid skb_clone in dev_queue_xmit_nit (esp. for MSG_ZEROCOPY packets). Instead, introduce a new fanout group flag with the same behavior. Tested with https://github.com/wdebruij/kerneltools/blob/master/tests/test_psock_fanout_ignore_outgoing.c Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221027211014.3581513-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
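(For illustration, a hedged userspace sketch of joining a fanout group with the new flag; the constant comes from <linux/if_packet.h>, and the setsockopt argument follows the fanout convention of group id in the low 16 bits, mode and flags in the high 16 bits.)

	#include <sys/socket.h>
	#include <linux/if_packet.h>

	/* Join fanout group @group_id and skip outgoing (TX) packets. */
	static int join_fanout_ignore_outgoing(int fd, unsigned int group_id)
	{
		unsigned int arg = (group_id & 0xffff) |
				   ((PACKET_FANOUT_HASH |
				     PACKET_FANOUT_FLAG_IGNORE_OUTGOING) << 16);

		return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
	}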
2022-10-28net: sfp: move field definitions along side register indexRussell King (Oracle)
Just as we do for the A2h enum, arrange the A0h enum to have the field definitions next to their corresponding register index. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-28net: sfp: convert register indexes from hex to decimalRussell King (Oracle)
The register indexes in the standards are in decimal rather than hex, so let's specify them in decimal in the header file so we can easily cross-reference without converting between hex and decimal. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-28net: phylink: add phylink_get_link_timer_ns() helperRussell King (Oracle)
Add a helper to convert the PHY interface mode to the required link timer setting as stated by the appropriate standard. Inappropriate interface modes return an error. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
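(A sketch of what such a helper plausibly looks like, assuming the usual 1.6 ms link timer for the SGMII family and 10 ms for clause 37 1000BASE-X/2500BASE-X; see include/linux/phylink.h for the authoritative version.)

	static inline int phylink_get_link_timer_ns(phy_interface_t interface)
	{
		switch (interface) {
		case PHY_INTERFACE_MODE_SGMII:
		case PHY_INTERFACE_MODE_QSGMII:
		case PHY_INTERFACE_MODE_USXGMII:
			return 1600000;		/* 1.6 ms */

		case PHY_INTERFACE_MODE_1000BASEX:
		case PHY_INTERFACE_MODE_2500BASEX:
			return 10000000;	/* 10 ms */

		default:
			return -EINVAL;		/* no link timer for this mode */
		}
	}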
2022-10-28Kalle Valo says:Jakub Kicinski
==================== pull-request: wireless-next-2022-10-28 First set of patches for v6.2. mac80211 refactoring continues for Wi-Fi 7. All mac80211 drivers are now converted to use internal TX queues; this might cause some regressions, so we wanted to do this early in the cycle. Note: wireless tree was merged[1] to wireless-next to avoid some conflicts with mac80211 patches between the trees. Unfortunately there are still two smaller conflicts in net/mac80211/util.c which Stephen also reported[2]. In the first conflict initialise scratch_len to "params->scratch_len ?: 3 * params->len" (note number 3, not 2!) and in the second conflict take the version which uses elems->scratch_pos. [1] https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git/commit/?id=dfd2d876b3fda1790bc0239ba4c6967e25d16e91 [2] https://lore.kernel.org/all/20221020032340.5cf101c0@canb.auug.org.au/ mac80211 - preparation for Wi-Fi 7 Multi-Link Operation (MLO) continues - add API to show the link STAs in debugfs - all mac80211 drivers are now using mac80211 internal TX queues (iTXQs) rtw89 - support 8852BE rtl8xxxu - support RTL8188FU brmfmac - support two station interfaces concurrently bcma - support SPROM rev 11 ==================== Link: https://lore.kernel.org/r/20221028132943.304ECC433B5@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-28tcp: add rcv_wnd and plb_rehash to TCP_INFOMubashir Adnan Qureshi
rcv_wnd can be useful to diagnose TCP performance where receiver window becomes the bottleneck. rehash reports the PLB and timeout triggered rehash attempts by the TCP connection. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
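(A hedged userspace sketch of reading the new fields via TCP_INFO; the field names tcpi_rcv_wnd and tcpi_rehash are assumed from the description above, so check the updated include/uapi/linux/tcp.h.)

	#include <stdio.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <linux/tcp.h>

	static void dump_plb_stats(int fd)
	{
		struct tcp_info info;
		socklen_t len = sizeof(info);

		if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
			printf("rcv_wnd=%u rehash=%u\n",
			       info.tcpi_rcv_wnd, info.tcpi_rehash);
	}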
2022-10-28tcp: add u32 counter in tcp_sock and an SNMP counter for PLBMubashir Adnan Qureshi
A u32 counter is added to tcp_sock for counting the number of PLB triggered rehashes for a TCP connection. An SNMP counter is also added to count overall PLB triggered rehash events for a host. These counters are hooked up to PLB implementation for DCTCP. TCP_NLA_REHASH is added to SCM_TIMESTAMPING_OPT_STATS that reports the rehash attempts triggered due to PLB or timeouts. This gives a historical view of sustained congestion or timeouts experienced by the TCP connection. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-28tcp: add PLB functionality for TCPMubashir Adnan Qureshi
Congestion control algorithms track PLB state and cause the connection to trigger a path change when either of the 2 conditions is satisfied: - No packets are in flight and (# consecutive congested rounds >= sysctl_tcp_plb_idle_rehash_rounds) - (# consecutive congested rounds >= sysctl_tcp_plb_rehash_rounds) A round (RTT) is marked as congested when congestion signal (ECN ce_ratio) over an RTT is greater than sysctl_tcp_plb_cong_thresh. In the event of RTO, PLB (via tcp_write_timeout()) triggers a path change and disables congestion-triggered path changes for random time between (sysctl_tcp_plb_suspend_rto_sec, 2*sysctl_tcp_plb_suspend_rto_sec) to avoid hopping onto the "connectivity blackhole". RTO-triggered path changes can still happen during this cool-off period. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
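(In rough pseudocode, the trigger described above amounts to the following per-round check; struct and helper names mirror the sysctls and are illustrative only, not the exact in-tree implementation.)

	/* Called once per RTT round with the ECN ce_ratio seen in that round. */
	static void plb_round_done(struct sock *sk, struct plb_state *plb, u32 ce_ratio)
	{
		if (ce_ratio >= sysctl_tcp_plb_cong_thresh)
			plb->consec_cong_rounds++;
		else
			plb->consec_cong_rounds = 0;

		if ((nothing_in_flight(sk) &&	/* illustrative helper */
		     plb->consec_cong_rounds >= sysctl_tcp_plb_idle_rehash_rounds) ||
		    plb->consec_cong_rounds >= sysctl_tcp_plb_rehash_rounds) {
			sk_rethink_txhash(sk);	/* pick a new path via a new flow label */
			plb->consec_cong_rounds = 0;
		}
	}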
2022-10-28tcp: add sysctls for TCP PLB parametersMubashir Adnan Qureshi
PLB (Protective Load Balancing) is a host based mechanism for load balancing across switch links. It leverages congestion signals(e.g. ECN) from transport layer to randomly change the path of the connection experiencing congestion. PLB changes the path of the connection by changing the outgoing IPv6 flow label for IPv6 connections (implemented in Linux by calling sk_rethink_txhash()). Because of this implementation mechanism, PLB can currently only work for IPv6 traffic. For more information, see the SIGCOMM 2022 paper: https://doi.org/10.1145/3544216.3544226 This commit adds new sysctl knobs and sets their default values for TCP PLB. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next 1) Move struct nft_payload_set definition to .c file where it is only used. 2) Shrink transport and inner header offset fields in the nft_pktinfo structure to 16 bits, from Florian Westphal. 3) Get rid of the nft_objref Kbuild toggle, make it built-in into nf_tables. This expression is used to instantiate conntrack helpers in nftables. After the removal of the conntrack helper auto-assignment toggle, this feature became more important, so move it to the nf_tables core module. Also from Florian. 4) Extend the existing function to calculate the payload inner header offset to deal with the GRE and IPIP transport protocols. 6) Add inner expression support for nf_tables. This new expression provides a packet parser for tunneled packets which uses a userspace description of the expected inner headers. The inner expression invokes the payload expression (via direct call) to match on the inner header protocol fields using the inner link, network and transport header offsets. An example of the bytecode generated from userspace to match on IP source encapsulated in a VxLAN packet: # nft --debug=netlink add rule netdev x y udp dport 4789 vxlan ip saddr 1.2.3.4 netdev x y [ meta load l4proto => reg 1 ] [ cmp eq reg 1 0x00000011 ] [ payload load 2b @ transport header + 2 => reg 1 ] [ cmp eq reg 1 0x0000b512 ] [ inner type vxlan hdrsize 8 flags f [ meta load protocol => reg 1 ] ] [ cmp eq reg 1 0x00000008 ] [ inner type vxlan hdrsize 8 flags f [ payload load 4b @ network header + 12 => reg 1 ] ] [ cmp eq reg 1 0x04030201 ] 7) Store inner link, network and transport header offsets in a percpu area to parse the inner packet header once only. Matching on a different tunnel type invalidates the existing offsets in the percpu area and invokes the inner tunnel parser again. 8) Add support for inner meta matching. This adds support for NFTA_META_PROTOCOL, which specifies the inner ethertype, and NFT_META_L4PROTO, which specifies the inner transport protocol. 9) Extend nft_inner to parse GENEVE optional fields to calculate the link layer offset. 10) Update the inner expression so the tunnel offset points to the GRE header, to normalize tunnel header handling. This also allows different interpretations of the GRE header from userspace. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_inner: set tunnel offset to GRE header offset netfilter: nft_inner: add geneve support netfilter: nft_meta: add inner match support netfilter: nft_inner: add percpu inner context netfilter: nft_inner: support for inner tunnel header matching netfilter: nft_payload: access ipip payload for inner offset netfilter: nft_payload: access GRE payload via inner offset netfilter: nft_objref: make it builtin netfilter: nf_tables: reduce nft_pktinfo by 8 bytes netfilter: nft_payload: move struct nft_payload_set definition where it belongs ==================== Link: https://lore.kernel.org/r/20221026132227.3287-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c 2871edb32f46 ("can: kvaser_usb: Fix possible completions during init_completion") abb8670938b2 ("can: kvaser_usb_leaf: Ignore stale bus-off after start") 8d21f5927ae6 ("can: kvaser_usb_leaf: Fix improved state not being reported") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27Merge tag 'net-6.1-rc3-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from 802.15.4 (Zigbee et al). Current release - regressions: - ipa: fix bugs in the register conversion for IPA v3.1 and v3.5.1 Current release - new code bugs: - mptcp: fix abba deadlock on fastopen - eth: stmmac: rk3588: allow multiple gmac controllers in one system Previous releases - regressions: - ip: rework the fix for dflt addr selection for connected nexthop - net: couple more fixes for misinterpreting bits in struct page after the signature was added Previous releases - always broken: - ipv6: ensure sane device mtu in tunnels - openvswitch: switch from WARN to pr_warn on a user-triggerable path - ethtool: eeprom: fix null-deref on genl_info in dump - ieee802154: more return code fixes for corner cases in dgram_sendmsg - mac802154: fix link-quality-indicator recording - eth: mlx5: fixes for IPsec, PTP timestamps, OvS and conntrack offload - eth: fec: limit register access on i.MX6UL - eth: bcm4908_enet: update TX stats after actual transmission - can: rcar_canfd: improve IRQ handling for RZ/G2L Misc: - genetlink: piggy back on the newly added resv_op_start to enforce more sanity checks on new commands" * tag 'net-6.1-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits) net: enetc: survive memory pressure without crashing kcm: do not sense pfmemalloc status in kcm_sendpage() net: do not sense pfmemalloc status in skb_append_pagefrags() net/mlx5e: Fix macsec sci endianness at rx sa update net/mlx5e: Fix wrong bitwise comparison usage in macsec_fs_rx_add_rule function net/mlx5e: Fix macsec rx security association (SA) update/delete net/mlx5e: Fix macsec coverity issue at rx sa update net/mlx5: Fix crash during sync firmware reset net/mlx5: Update fw fatal reporter state on PCI handlers successful recover net/mlx5e: TC, Fix cloned flow attr instance dests are not zeroed net/mlx5e: TC, Reject forwarding from internal port to internal port net/mlx5: Fix possible use-after-free in async command interface net/mlx5: ASO, Create the ASO SQ with the correct timestamp format net/mlx5e: Update restore chain id for slow path packets net/mlx5e: Extend SKB room check to include PTP-SQ net/mlx5: DR, Fix matcher disconnect error flow net/mlx5: Wait for firmware to enable CRS before pci_restore_state net/mlx5e: Do not increment ESN when updating IPsec ESN state netdevsim: remove dir in nsim_dev_debugfs_init() when creating ports dir failed netdevsim: fix memory leak in nsim_drv_probe() when nsim_dev_resources_register() failed ...
2022-10-27Merge tag 'hardening-v6.1-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Fix older Clang vs recent overflow KUnit test additions (Nick Desaulniers, Kees Cook) - Fix kern-doc visibility for overflow helpers (Kees Cook) * tag 'hardening-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: overflow: Refactor test skips for Clang-specific issues overflow: disable failing tests for older clang versions overflow: Fix kern-doc markup for functions
2022-10-27Merge tag 'media/v6.1-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: "A bunch of patches addressing issues in the vivid driver and adding new checks in V4L2 to validate the input parameters from some ioctls" * tag 'media/v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: media: vivid.rst: loop_video is set on the capture devnode media: vivid: set num_in/outputs to 0 if not supported media: vivid: drop GFP_DMA32 media: vivid: fix control handler mutex deadlock media: videodev2.h: V4L2_DV_BT_BLANKING_HEIGHT should check 'interlaced' media: v4l2-dv-timings: add sanity checks for blanking values media: vivid: dev->bitmap_cap wasn't freed in all cases media: vivid: s_fbuf: add more sanity checks
2022-10-27Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds
Pull fscrypt fix from Eric Biggers: "Fix a memory leak that was introduced by a change that went into -rc1" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: fix keyring memory leak on mount failure
2022-10-27net/mlx5: Fix possible use-after-free in async command interfaceTariq Toukan
mlx5_cmd_cleanup_async_ctx should return only after all its callback handlers were completed. Before this patch, the below race between mlx5_cmd_cleanup_async_ctx and mlx5_cmd_exec_cb_handler was possible and lead to a use-after-free: 1. mlx5_cmd_cleanup_async_ctx is called while num_inflight is 2 (i.e. elevated by 1, a single inflight callback). 2. mlx5_cmd_cleanup_async_ctx decreases num_inflight to 1. 3. mlx5_cmd_exec_cb_handler is called, decreases num_inflight to 0 and is about to call wake_up(). 4. mlx5_cmd_cleanup_async_ctx calls wait_event, which returns immediately as the condition (num_inflight == 0) holds. 5. mlx5_cmd_cleanup_async_ctx returns. 6. The caller of mlx5_cmd_cleanup_async_ctx frees the mlx5_async_ctx object. 7. mlx5_cmd_exec_cb_handler goes on and calls wake_up() on the freed object. Fix it by syncing using a completion object. Mark it completed when num_inflight reaches 0. Trace: BUG: KASAN: use-after-free in do_raw_spin_lock+0x23d/0x270 Read of size 4 at addr ffff888139cd12f4 by task swapper/5/0 CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0x57/0x7d print_report.cold+0x2d5/0x684 ? do_raw_spin_lock+0x23d/0x270 kasan_report+0xb1/0x1a0 ? do_raw_spin_lock+0x23d/0x270 do_raw_spin_lock+0x23d/0x270 ? rwlock_bug.part.0+0x90/0x90 ? __delete_object+0xb8/0x100 ? lock_downgrade+0x6e0/0x6e0 _raw_spin_lock_irqsave+0x43/0x60 ? __wake_up_common_lock+0xb9/0x140 __wake_up_common_lock+0xb9/0x140 ? __wake_up_common+0x650/0x650 ? destroy_tis_callback+0x53/0x70 [mlx5_core] ? kasan_set_track+0x21/0x30 ? destroy_tis_callback+0x53/0x70 [mlx5_core] ? kfree+0x1ba/0x520 ? do_raw_spin_unlock+0x54/0x220 mlx5_cmd_exec_cb_handler+0x136/0x1a0 [mlx5_core] ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core] ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core] mlx5_cmd_comp_handler+0x65a/0x12b0 [mlx5_core] ? dump_command+0xcc0/0xcc0 [mlx5_core] ? lockdep_hardirqs_on_prepare+0x400/0x400 ? cmd_comp_notifier+0x7e/0xb0 [mlx5_core] cmd_comp_notifier+0x7e/0xb0 [mlx5_core] atomic_notifier_call_chain+0xd7/0x1d0 mlx5_eq_async_int+0x3ce/0xa20 [mlx5_core] atomic_notifier_call_chain+0xd7/0x1d0 ? irq_release+0x140/0x140 [mlx5_core] irq_int_handler+0x19/0x30 [mlx5_core] __handle_irq_event_percpu+0x1f2/0x620 handle_irq_event+0xb2/0x1d0 handle_edge_irq+0x21e/0xb00 __common_interrupt+0x79/0x1a0 common_interrupt+0x78/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 RIP: 0010:default_idle+0x42/0x60 Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 0f b6 14 11 38 d0 7c 04 84 d2 75 14 8b 05 eb 47 22 02 85 c0 7e 07 0f 00 2d e0 9f 48 00 fb f4 <c3> 48 c7 c7 80 08 7f 85 e8 d1 d3 3e fe eb de 66 66 2e 0f 1f 84 00 RSP: 0018:ffff888100dbfdf0 EFLAGS: 00000242 RAX: 0000000000000001 RBX: ffffffff84ecbd48 RCX: 1ffffffff0afe110 RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff835cc9bc RBP: 0000000000000005 R08: 0000000000000001 R09: ffff88881dec4ac3 R10: ffffed1103bd8958 R11: 0000017d0ca571c9 R12: 0000000000000005 R13: ffffffff84f024e0 R14: 0000000000000000 R15: dffffc0000000000 ? default_idle_call+0xcc/0x450 default_idle_call+0xec/0x450 do_idle+0x394/0x450 ? arch_cpu_idle_exit+0x40/0x40 ? do_idle+0x17/0x450 cpu_startup_entry+0x19/0x20 start_secondary+0x221/0x2b0 ? 
set_cpu_sibling_map+0x2070/0x2070 secondary_startup_64_no_verify+0xcd/0xdb </TASK> Allocated by task 49502: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 kvmalloc_node+0x48/0xe0 mlx5e_bulk_async_init+0x35/0x110 [mlx5_core] mlx5e_tls_priv_tx_list_cleanup+0x84/0x3e0 [mlx5_core] mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core] mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core] mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core] mlx5e_suspend+0xdb/0x140 [mlx5_core] mlx5e_remove+0x89/0x190 [mlx5_core] auxiliary_bus_remove+0x52/0x70 device_release_driver_internal+0x40f/0x650 driver_detach+0xc1/0x180 bus_remove_driver+0x125/0x2f0 auxiliary_driver_unregister+0x16/0x50 mlx5e_cleanup+0x26/0x30 [mlx5_core] cleanup+0xc/0x4e [mlx5_core] __x64_sys_delete_module+0x2b5/0x450 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Freed by task 49502: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 ____kasan_slab_free+0x11d/0x1b0 kfree+0x1ba/0x520 mlx5e_tls_priv_tx_list_cleanup+0x2e7/0x3e0 [mlx5_core] mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core] mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core] mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core] mlx5e_suspend+0xdb/0x140 [mlx5_core] mlx5e_remove+0x89/0x190 [mlx5_core] auxiliary_bus_remove+0x52/0x70 device_release_driver_internal+0x40f/0x650 driver_detach+0xc1/0x180 bus_remove_driver+0x125/0x2f0 auxiliary_driver_unregister+0x16/0x50 mlx5e_cleanup+0x26/0x30 [mlx5_core] cleanup+0xc/0x4e [mlx5_core] __x64_sys_delete_module+0x2b5/0x450 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: e355477ed9e4 ("net/mlx5: Make mlx5_cmd_exec_cb() a safe API") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-8-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
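(A minimal sketch of the completion-based synchronization described in the fix; names are illustrative and not the exact mlx5 definitions.)

	struct async_ctx {
		atomic_t num_inflight;		/* 1 + number of pending callbacks */
		struct completion inflight_done;
	};

	static void async_cb_done(struct async_ctx *ctx)
	{
		/* The last pending callback wakes the waiter via the completion,
		 * so it never touches the context after the waiter has returned.
		 */
		if (atomic_dec_and_test(&ctx->num_inflight))
			complete(&ctx->inflight_done);
	}

	static void cleanup_async_ctx(struct async_ctx *ctx)
	{
		/* Drop the initial reference; wait only if callbacks are still pending. */
		if (!atomic_dec_and_test(&ctx->num_inflight))
			wait_for_completion(&ctx->inflight_done);
	}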
2022-10-27ice: Add support Flex RXDMichal Jaron
Add new VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC flag, opcode VIRTCHNL_OP_GET_SUPPORTED_RXDIDS and add member rxdid in struct virtchnl_rxq_info to support AVF Flex RXD extension. Add support to allow VF to query flexible descriptor RXDIDs supported by DDP package and configure Rx queues with selected RXDID for IAVF. Add code to allow VIRTCHNL_OP_GET_SUPPORTED_RXDIDS message to be processed. Add necessary macros for registers. Signed-off-by: Leyi Rong <leyi.rong@intel.com> Signed-off-by: Xu Ting <ting.xu@intel.com> Signed-off-by: Michal Jaron <michalx.jaron@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20221025161252.1952939-1-jacob.e.keller@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27bond: Disable TLS features indicationTariq Toukan
Bond agnostically interacts with TLS device-offload requests via the .ndo_sk_get_lower_dev operation. Return value is true iff bond guarantees fixed mapping between the TLS connection and a lower netdev. Due to this nature, the bond TLS device offload features are not explicitly controllable in the bond layer. As of today, these are read-only values based on the evaluation of bond_sk_check(). However, this indication might be incorrect and misleading, when the feature bits are "fixed" by some dependency features. For example, NETIF_F_HW_TLS_TX/RX are forcefully cleared in case the corresponding checksum offload is disabled. But in fact the bond ability to still offload TLS connections to the lower device is not hurt. This means that these bits can not be trusted, and hence better become unused. This patch revives some old discussion [1] and proposes a much simpler solution: Clear the bond's TLS features bits. Everyone should stop reading them. [1] https://lore.kernel.org/netdev/20210526095747.22446-1-tariqt@nvidia.com/ Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20221025105300.4718-1-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-26Merge tag 'spi-fix-v6.1-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "A collection of mostly unremarkable fixes for SPI that have built up since the merge window, all driver specific. The change to the qup adding support for GPIO chip selects is fixing a regression due to the removal of legacy GPIO handling, the driver had previously been silently relying on the legacy GPIO support in a slightly broken way which worked well enough on some systems. Fixing it is simply a case of setting a couple of bits of information in the driver description" * tag 'spi-fix-v6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: aspeed: Fix window offset of CE1 spi: qup: support using GPIO as chip select line spi: intel: Fix the offset to get the 64K erase opcode spi: aspeed: Fix typo in mode_bits field for AST2600 platform spi: mpc52xx: Replace NO_IRQ by 0 spi: spi-mem: Fix typo (of -> or) spi: spi-gxp: fix typo in SPDX identifier line spi: tegra210-quad: Fix combined sequence
2022-10-26Merge tag 'ieee802154-for-net-next-2022-10-25' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next Stefan Schmidt says: ==================== == One of the biggest cycles for ieee802154 in a long time. We are landing the first pieces of a big enhancements in managing PAN's. We might have another pull request ready for this cycle later on, but I want to get this one out first. Miquel Raynal added support for sending frames synchronously as a dependency to handle MLME commands. Also introducing more filtering levels to match with the needs of a device when scanning or operating as a pan coordinator. To support development and testing the hwsim driver for ieee802154 was also enhanced for the new filtering levels and to update the PIB attributes. Alexander Aring fixed quite a few bugs spotted during reviewing changes. He also added support for TRAC in the atusb driver to have better failure handling if the firmware provides the needed information. Jilin Yuan fixed a comment with a repeated word in it. ================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-25overflow: Fix kern-doc markup for functionsKees Cook
Fix the kern-doc markings for several of the overflow helpers and move their location into the core kernel API documentation, where it belongs (it's not driver-specific). Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Cc: linux-hardening@vger.kernel.org Reviewed-by: Akira Yokosawa <akiyks@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2022-10-25net: dev: Convert sa_data to flexible array in struct sockaddrKees Cook
One of the worst offenders of "fake flexible arrays" is struct sockaddr, as it is the classic example of why GCC and Clang have been traditionally forced to treat all trailing arrays as fake flexible arrays: in the distant misty past, sa_data became too small, and code started just treating it as a flexible array, even though it was fixed-size. The special case by the compiler is specifically that sizeof(sa->sa_data) and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1)) do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat it as a flexible array. However, the coming -fstrict-flex-arrays compiler flag will remove these special cases so that FORTIFY_SOURCE can gain coverage over all the trailing arrays in the kernel that are _not_ supposed to be treated as a flexible array. To deal with this change, convert sa_data to a true flexible array. To keep the structure size the same, move sa_data into a union with a newly introduced sa_data_min with the original size. The result is that FORTIFY_SOURCE can continue to have no idea how large sa_data may actually be, but anything using sizeof(sa->sa_data) must switch to sizeof(sa->sa_data_min). Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: Dylan Yudaken <dylany@fb.com> Cc: Yajun Deng <yajun.deng@linux.dev> Cc: Petr Machata <petrm@nvidia.com> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: syzbot <syzkaller@googlegroups.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221018095503.never.671-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
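(The resulting layout is roughly the following; a sketch of the change described above, with the authoritative definition in include/linux/socket.h.)

	struct sockaddr {
		sa_family_t	sa_family;		/* address family, AF_xxx */
		union {
			char	sa_data_min[14];	/* keeps the historic size */
			DECLARE_FLEX_ARRAY(char, sa_data);
		};
	};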
2022-10-25media: videodev2.h: V4L2_DV_BT_BLANKING_HEIGHT should check 'interlaced'Hans Verkuil
If it is a progressive (non-interlaced) format, then ignore the interlaced timing values. Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Fixes: 7f68127fa11f ([media] videodev2.h: defines to calculate blanking and frame sizes) Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
2022-10-25netfilter: nft_inner: add geneve supportPablo Neira Ayuso
Geneve tunnel header may contain options, parse geneve header and update offset to point to the link layer header according to the opt_len field. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_meta: add inner match supportPablo Neira Ayuso
Add support for inner meta matching on: - NFT_META_PROTOCOL: to match on the ethertype; this can be used even if the tunnel protocol provides no link layer header, in which case nft_inner sets the ethertype based on the IP header version field. - NFT_META_L4PROTO: to match on the layer 4 protocol. These meta expressions are usually autogenerated as dependencies by userspace nftables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_inner: add percpu inner contextPablo Neira Ayuso
Add NFT_PKTINFO_INNER_FULL flag to annotate that inner offsets are available. Store nft_inner_tun_ctx object in percpu area to cache existing inner offsets for this skbuff. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_inner: support for inner tunnel header matchingPablo Neira Ayuso
This new expression allows you to match on the inner headers that are encapsulated by any of the existing tunneling protocols. This expression parses the inner packet to set the link, network and transport offsets, so the existing expressions (with a few updates) can be reused to match on the inner headers. The inner expression supports different tunnel combinations such as: - ethernet frame over IPv4/IPv6 packet, eg. VxLAN. - IPv4/IPv6 packet over IPv4/IPv6 packet, eg. IPIP. - IPv4/IPv6 packet over IPv4/IPv6 + transport header, eg. GRE. - transport header (ESP or SCTP) over transport header (usually UDP) The following fields are used to describe the tunnel protocol: - flags, which describe how to parse the inner headers: NFT_PAYLOAD_CTX_INNER_TUN, the tunnel provides its own header. NFT_PAYLOAD_CTX_INNER_ETHER, the ethernet frame is available as inner header. NFT_PAYLOAD_CTX_INNER_NH, the network header is available as inner header. NFT_PAYLOAD_CTX_INNER_TH, the transport header is available as inner header. For example, VxLAN sets all of these flags, while GRE only sets NFT_PAYLOAD_CTX_INNER_NH and NFT_PAYLOAD_CTX_INNER_TH, and ESP over UDP only sets NFT_PAYLOAD_CTX_INNER_TH. The tunnel description is composed of the following attributes: - header size: in case the tunnel comes with its own header, eg. VxLAN. - type: this provides a hint to userspace on how to delinearize the rule. This is useful for VxLAN and Geneve since they run over UDP, where the transport header does not provide a hint. This is also useful in case hardware offload is ever supported. The type is not currently interpreted by the kernel. - expression: currently only payload is supported. A follow-up patch also adds inner meta support, which is required by autogenerated dependencies. The exthdr expression should be supported too at some point. There is a new inner_ops operation that needs to be set to allow an existing expression to be used from the inner expression. This patch adds a new NFT_PAYLOAD_TUN_HEADER base which allows matching on the tunnel header fields, eg. vxlan vni. The payload expression is embedded into the nft_inner private area and this private data area is passed to the payload inner eval function via direct call. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_objref: make it builtinFlorian Westphal
nft_objref is needed to reference named objects, it makes no sense to disable it. Before: text data bss dec filename 4014 424 0 4438 nft_objref.o 4174 1128 0 5302 nft_objref.ko 359351 15276 864 375491 nf_tables.ko After: text data bss dec filename 3815 408 0 4223 nft_objref.o 363161 15692 864 379717 nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nf_tables: reduce nft_pktinfo by 8 bytesFlorian Westphal
structure is reduced from 32 to 24 bytes. While at it, also check that iphdrlen is sane, this is guaranteed for NFPROTO_IPV4 but not for ingress or bridge, so add checks for this. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25netfilter: nft_payload: move struct nft_payload_set definition where it belongsPablo Neira Ayuso
Not required to expose this header in nf_tables_core.h, move it to where it is used, ie. nft_payload. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-10-25soreuseport: Fix socket selection for SO_INCOMING_CPU.Kuniyuki Iwashima
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work with setsockopt(SO_REUSEPORT) since v4.6. With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could build a highly efficient server application. setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener or UDP socket, and then incoming packets processed on the CPU will likely be distributed to the socket. Technically, a socket could even receive packets handled on another CPU if no sockets in the reuseport group have the same CPU receiving the flow. The logic exists in compute_score() so that a socket will get a higher score if it has the same CPU with the flow. However, the score gets ignored after the blamed two commits, which introduced a faster socket selection algorithm for SO_REUSEPORT. This patch introduces a counter of sockets with SO_INCOMING_CPU in a reuseport group to check if we should iterate all sockets to find a proper one. We increment the counter when * calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT * enabling SO_INCOMING_CPU if the socket is in a reuseport group Also, we decrement it when * detaching a socket out of the group to apply SO_INCOMING_CPU to migrated TCP requests * disabling SO_INCOMING_CPU if the socket is in a reuseport group When the counter reaches 0, we can get back to the O(1) selection algorithm. The overall changes are negligible for the non-SO_INCOMING_CPU case, and the only notable thing is that we have to update sk_incomnig_cpu under reuseport_lock. Otherwise, the race prevents transitioning to the O(n) algorithm and results in the wrong socket selection. cpu1 (setsockopt) cpu2 (listen) +-----------------+ +-------------+ lock_sock(sk1) lock_sock(sk2) reuseport_update_incoming_cpu(sk1, val) . | /* set CPU as 0 */ |- WRITE_ONCE(sk1->incoming_cpu, val) | | spin_lock_bh(&reuseport_lock) | reuseport_grow(sk2, reuse) | . | |- more_socks_size = reuse->max_socks * 2U; | |- if (more_socks_size > U16_MAX && | | reuse->num_closed_socks) | | . | | |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL); | | `- __reuseport_detach_closed_sock(sk1, reuse) | | . | | `- reuseport_put_incoming_cpu(sk1, reuse) | | . | | | /* Read shutdown()ed sk1's sk_incoming_cpu | | | * without lock_sock(). | | | */ | | `- if (sk1->sk_incoming_cpu >= 0) | | . | | | /* decrement not-yet-incremented | | | * count, which is never incremented. | | | */ | | `- __reuseport_put_incoming_cpu(reuse); | | | `- spin_lock_bh(&reuseport_lock) | |- spin_lock_bh(&reuseport_lock) | |- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...) |- if (!reuse) | . | | /* Cannot increment reuse->incoming_cpu. */ | `- goto out; | `- spin_unlock_bh(&reuseport_lock) Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection") Reported-by: Kazuho Oku <kazuhooku@gmail.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
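(For reference, a hedged sketch of the userspace pattern whose socket selection this patch fixes; error handling omitted.)

	#include <sys/socket.h>
	#include <netinet/in.h>

	static int listener_for_cpu(int cpu, const struct sockaddr_in *addr)
	{
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		int one = 1;

		setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
		/* Prefer delivering flows processed on @cpu to this listener. */
		setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));

		bind(fd, (const struct sockaddr *)addr, sizeof(*addr));
		listen(fd, 128);
		return fd;
	}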
2022-10-25act_skbedit: skbedit queue mapping for receive queueAmritha Nambiar
Add support for skbedit queue mapping action on receive side. This is supported only in hardware, so the skip_sw flag is enforced. This enables offloading filters for receive queue selection in the hardware using the skbedit action. Traffic arrives on the Rx queue requested in the skbedit action parameter. A new tc action flag TCA_ACT_FLAGS_AT_INGRESS is introduced to identify the traffic direction the action queue_mapping is requested on during filter addition. This is used to disallow offloading the skbedit queue mapping action on transmit side. Example: $tc filter add dev $IFACE ingress protocol ip flower dst_ip $DST_IP\ action skbedit queue_mapping $rxq_id skip_sw Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-24net: sfp: provide a definition for the power level select bitRussell King (Oracle)
Provide a named definition for the power level select bit in the extended status register, rather than using BIT(0) in the code. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24genetlink: piggy back on resv_op to default to a reject policyJakub Kicinski
To keep backward compatibility we used to leave attribute parsing to the family if no policy is specified. This becomes tedious as we move to more strict validation. Families must define reject all policies if they don't want any attributes accepted. Piggy back on the resv_start_op field as the switchover point. AFAICT only ethtool has added new commands since the resv_start_op was defined, and it has per-op policies so this should be a no-op. Nonetheless the patch should still go into v6.1 for consistency. Link: https://lore.kernel.org/all/20221019125745.3f2e7659@kernel.org/ Link: https://lore.kernel.org/r/20221021193532.1511293-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
include/linux/net.h a5ef058dc4d9 ("net: introduce and use custom sockopt socket flag") e993ffe3da4b ("net: flag sockets supporting msghdr originated zerocopy") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24Merge tag 'net-6.1-rc3-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf. The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe I'm biased. Current release - regressions: - eth: fman: re-expose location of the MAC address to userspace, apparently some udev scripts depended on the exact value Current release - new code bugs: - bpf: - wait for busy refill_work when destroying bpf memory allocator - allow bpf_user_ringbuf_drain() callbacks to return 1 - fix dispatcher patchable function entry to 5 bytes nop Previous releases - regressions: - net-memcg: avoid stalls when under memory pressure - tcp: fix indefinite deferral of RTO with SACK reneging - tipc: fix a null-ptr-deref in tipc_topsrv_accept - eth: macb: specify PHY PM management done by MAC - tcp: fix a signed-integer-overflow bug in tcp_add_backlog() Previous releases - always broken: - eth: amd-xgbe: SFP fixes and compatibility improvements Misc: - docs: netdev: offer performance feedback to contributors" * tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits) net-memcg: avoid stalls when under memory pressure tcp: fix indefinite deferral of RTO with SACK reneging tcp: fix a signed-integer-overflow bug in tcp_add_backlog() net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed docs: netdev: offer performance feedback to contributors kcm: annotate data-races around kcm->rx_wait kcm: annotate data-races around kcm->rx_psock net: fman: Use physical address for userspace interfaces net/mlx5e: Cleanup MACsec uninitialization routine atlantic: fix deadlock at aq_nic_stop nfp: only clean `sp_indiff` when application firmware is unloaded amd-xgbe: add the bit rate quirk for Molex cables amd-xgbe: fix the SFP compliance codes check for DAC cables amd-xgbe: enable PLL_CTL for fixed PHY modes only amd-xgbe: use enums for mailbox cmd and sub_cmds amd-xgbe: Yellow carp devices do not need rrc bpf: Use __llist_del_all() whenever possbile during memory draining bpf: Wait for busy refill_work when destroying bpf memory allocator MAINTAINERS: add keyword match on PTP ...
2022-10-24net-memcg: avoid stalls when under memory pressureJakub Kicinski
As Shakeel explains, the commit under Fixes had the unintended side-effect of no longer pre-loading the cached memory allowance. Even though we previously dropped the first packet received when over the memory limit, the consecutive ones would get through by using the cache. The charging was happening in batches of 128kB, so we'd let in 128kB (truesize) worth of packets per one drop. After the change we no longer force charge, so there will be no cache filling side effects. This causes significant drops and connection stalls for workloads which use a lot of page cache, since we can't reclaim page cache under GFP_NOWAIT. Some of the latency can be recovered by improving SACK reneg handling but nowhere near enough to get back to the pre-5.15 performance (the application I'm experimenting with still sees 5-10x worse latency). Apply the suggested workaround of using GFP_ATOMIC. We will now be more permissive than previously as we'll drop _no_ packets in softirq when under pressure. But I can't think of any good and simple way to address that within networking. Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/ Suggested-by: Shakeel Butt <shakeelb@google.com> Fixes: 4b1327be9fe5 ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()") Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Alexei Starovoitov says: ==================== pull-request: bpf 2022-10-23 We've added 7 non-merge commits during the last 18 day(s) which contain a total of 8 files changed, 69 insertions(+), 5 deletions(-). The main changes are: 1) Wait for busy refill_work when destroying bpf memory allocator, from Hou. 2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David. 3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri. 4) Prevent decl_tag from being referenced in func_proto, from Stanislav. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Use __llist_del_all() whenever possbile during memory draining bpf: Wait for busy refill_work when destroying bpf memory allocator bpf: Fix dispatcher patchable function entry to 5 bytes nop bpf: prevent decl_tag from being referenced in func_proto selftests/bpf: Add reproducer for decl_tag in func_proto return type selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1 bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1 ==================== Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24net: skb: move skb_pp_recycle() to skbuff.cYunsheng Lin
skb_pp_recycle() is only used by skb_free_head() in skbuff.c, so move it to skbuff.c. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24net: remove useless parameter of __sock_cmsg_sendxu xin
The parameter 'msg' has never been used by __sock_cmsg_send, so we can remove it safely. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Zhang Yunkai <zhang.yunkai@zte.com.cn> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24net: add a refcount tracker for kernel socketsEric Dumazet
Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock") added a tracker to sockets, but did not track kernel sockets. We still have syzbot reports hinting about netns being destroyed while some kernel TCP sockets had not been dismantled. This patch tracks kernel sockets, and adds a ref_tracker_dir_print() call to net_free() right before the netns is freed. Normally, each layer is responsible for properly releasing its kernel sockets before last call to net_free(). This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24udp: track the forward memory release threshold in an hot cachelinePaolo Abeni
When the receiver process and the BH run on different cores, udp_rmem_release() experiences a cache miss while accessing sk_rcvbuf, as the latter shares the same cacheline with sk_forward_alloc, written by the BH. With this patch, UDP tracks the rcvbuf value and its update via custom SOL_SOCKET socket options, and copies the forward memory threshold value used by udp_rmem_release() into a different cacheline, already accessed by the above function and uncontended. Since the UDP socket init operation has grown a bit, factor out the common code between v4 and v6 into a shared helper. Overall, the above gives a 10% peak throughput increase under UDP flood. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24net: introduce and use custom sockopt socket flagPaolo Abeni
We will soon introduce custom setsockopt for UDP sockets, too. Instead of doing even more complex arbitrary checks inside sock_use_custom_sol_socket(), add a new socket flag and set it for the relevant socket types (currently only MPTCP). Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>