summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-10-01net: lockless SO_{TYPE|PROTOCOL|DOMAIN|ERROR } setsockopt()Eric Dumazet
This options can not be set and return -ENOPROTOOPT, no need to acqure socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: lockless SO_PASSCRED, SO_PASSPIDFD and SO_PASSSECEric Dumazet
sock->flags are atomic, no need to hold the socket lock in sk_setsockopt() for SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: implement lockless SO_PRIORITYEric Dumazet
This is a followup of 8bf43be799d4 ("net: annotate data-races around sk->sk_priority"). sk->sk_priority can be read and written without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01openvswitch: reduce stack usage in do_execute_actionsIlya Maximets
do_execute_actions() function can be called recursively multiple times while executing actions that require pipeline forking or recirculations. It may also be re-entered multiple times if the packet leaves openvswitch module and re-enters it through a different port. Currently, there is a 256-byte array allocated on stack in this function that is supposed to hold NSH header. Compilers tend to pre-allocate that space right at the beginning of the function: a88: 48 81 ec b0 01 00 00 sub $0x1b0,%rsp NSH is not a very common protocol, but the space is allocated on every recursive call or re-entry multiplying the wasted stack space. Move the stack allocation to push_nsh() function that is only used if NSH actions are actually present. push_nsh() is also a simple function without a possibility for re-entry, so the stack is returned right away. With this change the preallocated space is reduced by 256 B per call: b18: 48 81 ec b0 00 00 00 sub $0xb0,%rsp Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eelco Chaudron echaudro@redhat.com Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data()David Howells
Including the transhdrlen in length is a problem when the packet is partially filled (e.g. something like send(MSG_MORE) happened previously) when appending to an IPv4 or IPv6 packet as we don't want to repeat the transport header or account for it twice. This can happen under some circumstances, such as splicing into an L2TP socket. The symptom observed is a warning in __ip6_append_data(): WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800 that occurs when MSG_SPLICE_PAGES is used to append more data to an already partially occupied skbuff. The warning occurs when 'copy' is larger than the amount of data in the message iterator. This is because the requested length includes the transport header length when it shouldn't. This can be triggered by, for example: sfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP); bind(sfd, ...); // ::1 connect(sfd, ...); // ::1 port 7 send(sfd, buffer, 4100, MSG_MORE); sendfile(sfd, dfd, NULL, 1024); Fix this by only adding transhdrlen into the length if the write queue is empty in l2tp_ip6_sendmsg(), analogously to how UDP does things. l2tp_ip_sendmsg() looks like it won't suffer from this problem as it builds the UDP packet itself. Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6") Reported-by: syzbot+62cbf263225ae13ff153@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@google.com/ Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Dumazet <edumazet@google.com> cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> cc: "David S. Miller" <davem@davemloft.net> cc: David Ahern <dsahern@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: Jakub Kicinski <kuba@kernel.org> cc: netdev@vger.kernel.org cc: bpf@vger.kernel.org cc: syzkaller-bugs@googlegroups.com Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01Merge tag 'x86-urgent-2023-10-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes: a kerneldoc build warning fix, add SRSO mitigation for AMD-derived Hygon processors, and fix a SGX kernel crash in the page fault handler that can trigger when ksgxd races to reclaim the SECS special page, by making the SECS page unswappable" * tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race x86/srso: Add SRSO mitigation for Hygon processors x86/kgdb: Fix a kerneldoc warning when build with W=1
2023-10-01Merge tag 'timers-urgent-2023-10-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a spurious kernel warning during CPU hotplug events that may trigger when timer/hrtimer softirqs are pending, which are otherwise hotplug-safe and don't merit a warning" * tag 'timers-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Tag (hr)timer softirq as hotplug safe
2023-10-01Merge tag 'sched-urgent-2023-10-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a RT tasks related lockup/live-lock during CPU offlining" * tag 'sched-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/rt: Fix live lock between select_fallback_rq() and RT push
2023-10-01Merge tag 'perf-urgent-2023-10-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fixes from Ingo Molnar: "Misc fixes: work around an AMD microcode bug on certain models, and fix kexec kernel PMI handlers on AMD systems that get loaded on older kernels that have an unexpected register state" * tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/amd: Do not WARN() on every IRQ perf/x86/amd/core: Fix overflow reset on hotplug
2023-10-01neighbour: fix data-races around n->outputEric Dumazet
n->output field can be read locklessly, while a writer might change the pointer concurrently. Add missing annotations to prevent load-store tearing. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01Merge branch 'dev-stats-virtio-l2tp_eth'David S. Miller
Eric Dumazet says: ==================== net: use DEV_STATS_xxx() helpers in virtio_net and l2tp_eth Inspired by another (minor) KCSAN syzbot report. Both virtio_net and l2tp_eth can use DEV_STATS_xxx() helpers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: l2tp_eth: use generic dev->stats fieldsEric Dumazet
Core networking has opt-in atomic variant of dev->stats, simply use DEV_STATS_INC(), DEV_STATS_ADD() and DEV_STATS_READ(). v2: removed @priv local var in l2tp_eth_dev_recv() (Simon) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01virtio_net: avoid data-races on dev->stats fieldsEric Dumazet
Use DEV_STATS_INC() and DEV_STATS_READ() which provide atomicity on paths that can be used concurrently. Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: add DEV_STATS_READ() helperEric Dumazet
Companion of DEV_STATS_INC() & DEV_STATS_ADD(). This is going to be used in the series. Use it in macsec_get_stats64(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01octeontx2-pf: Tc flower offload support for MPLSHariprasad Kelam
This patch extends flower offload support for MPLS protocol. Due to hardware limitation, currently driver supports lse depth up to 4. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: fix possible store tearing in neigh_periodic_work()Eric Dumazet
While looking at a related syzbot report involving neigh_periodic_work(), I found that I forgot to add an annotation when deleting an RCU protected item from a list. Readers use rcu_deference(*np), we need to use either rcu_assign_pointer() or WRITE_ONCE() on writer side to prevent store tearing. I use rcu_assign_pointer() to have lockdep support, this was the choice made in neigh_flush_dev(). Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01kbuild: remove stale code for 'source' symlink in packaging scriptsMasahiro Yamada
Since commit d8131c2965d5 ("kbuild: remove $(MODLIB)/source symlink"), modules_install does not create the 'source' symlink. Remove the stale code from builddeb and kernel.spec. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-01Merge tag 'for-net-2023-09-20' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth bluetooth pull request for net: - Fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER - Fix handling of listen for ISO unicast - Fix build warnings - Fix leaking content of local_codecs - Add shutdown function for QCA6174 - Delete unused hci_req_prepare_suspend() declaration - Fix hci_link_tx_to RCU lock usage - Avoid redundant authentication Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: ethernet: xilinx: Drop kernel doc comment about return valueUwe Kleine-König
During review of the patch that became 2e0ec0afa902 ("net: ethernet: xilinx: Convert to platform remove callback returning void") in net-next, Radhey Shyam Pandey pointed out that the change makes the documentation about the return value obsolete. The patch was applied without addressing this feedback, so here comes a fix in a separate patch. Fixes: 2e0ec0afa902 ("net: ethernet: xilinx: Convert to platform remove callback returning void") Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: stmmac: platform: fix the incorrect parameterClark Wang
The second parameter of stmmac_pltfr_init() needs the pointer of "struct plat_stmmacenet_data". So, correct the parameter typo when calling the function. Otherwise, it may cause this alignment exception when doing suspend/resume. [ 49.067201] CPU1 is up [ 49.135258] Internal error: SP/PC alignment exception: 000000008a000000 [#1] PREEMPT SMP [ 49.143346] Modules linked in: soc_imx9 crct10dif_ce polyval_ce nvmem_imx_ocotp_fsb_s400 polyval_generic layerscape_edac_mod snd_soc_fsl_asoc_card snd_soc_imx_audmux snd_soc_imx_card snd_soc_wm8962 el_enclave snd_soc_fsl_micfil rtc_pcf2127 rtc_pcf2131 flexcan can_dev snd_soc_fsl_xcvr snd_soc_fsl_sai imx8_media_dev(C) snd_soc_fsl_utils fuse [ 49.173393] CPU: 0 PID: 565 Comm: sh Tainted: G C 6.5.0-rc4-next-20230804-05047-g5781a6249dae #677 [ 49.183721] Hardware name: NXP i.MX93 11X11 EVK board (DT) [ 49.189190] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 49.196140] pc : 0x80800052 [ 49.198931] lr : stmmac_pltfr_resume+0x34/0x50 [ 49.203368] sp : ffff800082f8bab0 [ 49.206670] x29: ffff800082f8bab0 x28: ffff0000047d0ec0 x27: ffff80008186c170 [ 49.213794] x26: 0000000b5e4ff1ba x25: ffff800081e5fa74 x24: 0000000000000010 [ 49.220918] x23: ffff800081fe0000 x22: 0000000000000000 x21: 0000000000000000 [ 49.228042] x20: ffff0000001b4010 x19: ffff0000001b4010 x18: 0000000000000006 [ 49.235166] x17: ffff7ffffe007000 x16: ffff800080000000 x15: 0000000000000000 [ 49.242290] x14: 00000000000000fc x13: 0000000000000000 x12: 0000000000000000 [ 49.249414] x11: 0000000000000001 x10: 0000000000000a60 x9 : ffff800082f8b8c0 [ 49.256538] x8 : 0000000000000008 x7 : 0000000000000001 x6 : 000000005f54a200 [ 49.263662] x5 : 0000000001000000 x4 : ffff800081b93680 x3 : ffff800081519be0 [ 49.270786] x2 : 0000000080800052 x1 : 0000000000000000 x0 : ffff0000001b4000 [ 49.277911] Call trace: [ 49.280346] 0x80800052 [ 49.282781] platform_pm_resume+0x2c/0x68 [ 49.286785] dpm_run_callback.constprop.0+0x74/0x134 [ 49.291742] device_resume+0x88/0x194 [ 49.295391] dpm_resume+0x10c/0x230 [ 49.298866] dpm_resume_end+0x18/0x30 [ 49.302515] suspend_devices_and_enter+0x2b8/0x624 [ 49.307299] pm_suspend+0x1fc/0x348 [ 49.310774] state_store+0x80/0x104 [ 49.314258] kobj_attr_store+0x18/0x2c [ 49.318002] sysfs_kf_write+0x44/0x54 [ 49.321659] kernfs_fop_write_iter+0x120/0x1ec [ 49.326088] vfs_write+0x1bc/0x300 [ 49.329485] ksys_write+0x70/0x104 [ 49.332874] __arm64_sys_write+0x1c/0x28 [ 49.336783] invoke_syscall+0x48/0x114 [ 49.340527] el0_svc_common.constprop.0+0xc4/0xe4 [ 49.345224] do_el0_svc+0x38/0x98 [ 49.348526] el0_svc+0x2c/0x84 [ 49.351568] el0t_64_sync_handler+0x100/0x12c [ 49.355910] el0t_64_sync+0x190/0x194 [ 49.359567] Code: ???????? ???????? ???????? ???????? (????????) [ 49.365644] ---[ end trace 0000000000000000 ]--- Fixes: 97117eb51ec8 ("net: stmmac: platform: provide stmmac_pltfr_init()") Signed-off-by: Clark Wang <xiaoning.wang@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Serge Semin <fancer.lancer@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: atl1c: switch to napi_consume_skb()Sieng-Piaw Liew
Switch to napi_consume_skb() to take advantage of bulk free, and skb reuse through skb cache in conjunction with napi_build_skb(). When parameter 'budget' = 0, indicating non-NAPI context, dev_consume_skb_any() is called internally. Signed-off-by: Sieng-Piaw Liew <liew.s.piaw@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01Merge branch 'sch_fq-improvements'David S. Miller
Eric Dumazet says: ==================== net_sched: sch_fq: round of improvements For FQ tenth anniversary, it was time for making it faster. The FQ part (as in Fair Queue) is rather expensive, because we have to classify packets and store them in a per-flow structure, and add this per-flow structure in a hash table. Then the RR lists also add cache line misses. Most fq qdisc are almost idle. Trying to share NIC bandwidth has no benefits, thus the qdisc could behave like a FIFO. This series brings a 5 % throughput increase in intensive tcp_rr workload, and 13 % increase for (unpaced) UDP packets. v2: removed an extra label (build bot). Fix an accidental increase of stat_internal_packets counter in fast path. Added "constify qdisc_priv()" patch to allow fq_fastpath_check() first parameter to be const. typo on 'eligible' (Willem) ==================== Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net_sched: sch_fq: always garbage collectEric Dumazet
FQ performs garbage collection at enqueue time, and only if number of flows is above a given threshold, which is hit after the qdisc has been used a bit. Since an RB-tree traversal is needed to locate a flow, it makes sense to perform gc all the time, to keep rb-trees smaller. This reduces by 50 % average storage costs in FQ, and avoids 1 cache line miss at enqueue time when fast path added in prior patch can not be used. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net_sched: sch_fq: add fast path for mostly idle qdiscEric Dumazet
TCQ_F_CAN_BYPASS can be used by few qdiscs. Idea is that if we queue a packet to an empty qdisc, following dequeue() would pick it immediately. FQ can not use the generic TCQ_F_CAN_BYPASS code, because some additional checks need to be performed. This patch adds a similar fast path to FQ. Most of the time, qdisc is not throttled, and many packets can avoid bringing/touching at least four cache lines, and consuming 128bytes of memory to store the state of a flow. After this patch, netperf can send UDP packets about 13 % faster, and pktgen goes 30 % faster (when FQ is in the way), on a fast NIC. TCP traffic is also improved, thanks to a reduction of cache line misses. I have measured a 5 % increase of throughput on a tcp_rr intensive workload. tc -s -d qd sh dev eth1 ... qdisc fq 8004: parent 1:2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop Sent 5646784384 bytes 1985161 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 flows 122 (inactive 122 throttled 0) gc 0 highprio 0 fastpath 659990 throttled 27762 latency 8.57us Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net_sched: sch_fq: change how @inactive is trackedEric Dumazet
Currently, when one fq qdisc has no more packets to send, it can still have some flows stored in its RR lists (q->new_flows & q->old_flows) This was a design choice, but what is a bit disturbing is that the inactive_flows counter does not include the count of empty flows in RR lists. As next patch needs to know better if there are active flows, this change makes inactive_flows exact. Before the patch, following command on an empty qdisc could have returned: lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive flows 1322 (inactive 1316 throttled 0) flows 1330 (inactive 1325 throttled 0) flows 1193 (inactive 1190 throttled 0) flows 1208 (inactive 1202 throttled 0) After the patch, we now have: lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive flows 1322 (inactive 1322 throttled 0) flows 1330 (inactive 1330 throttled 0) flows 1193 (inactive 1193 throttled 0) flows 1208 (inactive 1208 throttled 0) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net_sched: sch_fq: struct sched_data reorgEric Dumazet
q->flows can be often modified, and q->timer_slack is read mostly. Exchange the two fields, so that cache line countaining quantum, initial_quantum, and other critical parameters stay clean (read-mostly). Move q->watchdog next to q->stat_throttled Add comments explaining how the structure is split in three different parts. pahole output before the patch: struct fq_sched_data { struct fq_flow_head new_flows; /* 0 0x10 */ struct fq_flow_head old_flows; /* 0x10 0x10 */ struct rb_root delayed; /* 0x20 0x8 */ u64 time_next_delayed_flow; /* 0x28 0x8 */ u64 ktime_cache; /* 0x30 0x8 */ unsigned long unthrottle_latency_ns; /* 0x38 0x8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct fq_flow internal __attribute__((__aligned__(64))); /* 0x40 0x80 */ /* XXX last struct has 16 bytes of padding */ /* --- cacheline 3 boundary (192 bytes) --- */ u32 quantum; /* 0xc0 0x4 */ u32 initial_quantum; /* 0xc4 0x4 */ u32 flow_refill_delay; /* 0xc8 0x4 */ u32 flow_plimit; /* 0xcc 0x4 */ unsigned long flow_max_rate; /* 0xd0 0x8 */ u64 ce_threshold; /* 0xd8 0x8 */ u64 horizon; /* 0xe0 0x8 */ u32 orphan_mask; /* 0xe8 0x4 */ u32 low_rate_threshold; /* 0xec 0x4 */ struct rb_root * fq_root; /* 0xf0 0x8 */ u8 rate_enable; /* 0xf8 0x1 */ u8 fq_trees_log; /* 0xf9 0x1 */ u8 horizon_drop; /* 0xfa 0x1 */ /* XXX 1 byte hole, try to pack */ <bad> u32 flows; /* 0xfc 0x4 */ /* --- cacheline 4 boundary (256 bytes) --- */ u32 inactive_flows; /* 0x100 0x4 */ u32 throttled_flows; /* 0x104 0x4 */ u64 stat_gc_flows; /* 0x108 0x8 */ u64 stat_internal_packets; /* 0x110 0x8 */ u64 stat_throttled; /* 0x118 0x8 */ u64 stat_ce_mark; /* 0x120 0x8 */ u64 stat_horizon_drops; /* 0x128 0x8 */ u64 stat_horizon_caps; /* 0x130 0x8 */ u64 stat_flows_plimit; /* 0x138 0x8 */ /* --- cacheline 5 boundary (320 bytes) --- */ u64 stat_pkts_too_long; /* 0x140 0x8 */ u64 stat_allocation_errors; /* 0x148 0x8 */ <bad> u32 timer_slack; /* 0x150 0x4 */ /* XXX 4 bytes hole, try to pack */ struct qdisc_watchdog watchdog; /* 0x158 0x48 */ /* size: 448, cachelines: 7, members: 34 */ /* sum members: 411, holes: 2, sum holes: 5 */ /* padding: 32 */ /* paddings: 1, sum paddings: 16 */ /* forced alignments: 1 */ }; pahole output after the patch: struct fq_sched_data { struct fq_flow_head new_flows; /* 0 0x10 */ struct fq_flow_head old_flows; /* 0x10 0x10 */ struct rb_root delayed; /* 0x20 0x8 */ u64 time_next_delayed_flow; /* 0x28 0x8 */ u64 ktime_cache; /* 0x30 0x8 */ unsigned long unthrottle_latency_ns; /* 0x38 0x8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct fq_flow internal __attribute__((__aligned__(64))); /* 0x40 0x80 */ /* XXX last struct has 16 bytes of padding */ /* --- cacheline 3 boundary (192 bytes) --- */ u32 quantum; /* 0xc0 0x4 */ u32 initial_quantum; /* 0xc4 0x4 */ u32 flow_refill_delay; /* 0xc8 0x4 */ u32 flow_plimit; /* 0xcc 0x4 */ unsigned long flow_max_rate; /* 0xd0 0x8 */ u64 ce_threshold; /* 0xd8 0x8 */ u64 horizon; /* 0xe0 0x8 */ u32 orphan_mask; /* 0xe8 0x4 */ u32 low_rate_threshold; /* 0xec 0x4 */ struct rb_root * fq_root; /* 0xf0 0x8 */ u8 rate_enable; /* 0xf8 0x1 */ u8 fq_trees_log; /* 0xf9 0x1 */ u8 horizon_drop; /* 0xfa 0x1 */ /* XXX 1 byte hole, try to pack */ <good> u32 timer_slack; /* 0xfc 0x4 */ /* --- cacheline 4 boundary (256 bytes) --- */ <good> u32 flows; /* 0x100 0x4 */ u32 inactive_flows; /* 0x104 0x4 */ u32 throttled_flows; /* 0x108 0x4 */ /* XXX 4 bytes hole, try to pack */ u64 stat_throttled; /* 0x110 0x8 */ <better> struct qdisc_watchdog watchdog; /* 0x118 0x48 */ /* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */ u64 stat_gc_flows; /* 0x160 0x8 */ u64 stat_internal_packets; /* 0x168 0x8 */ u64 stat_ce_mark; /* 0x170 0x8 */ u64 stat_horizon_drops; /* 0x178 0x8 */ /* --- cacheline 6 boundary (384 bytes) --- */ u64 stat_horizon_caps; /* 0x180 0x8 */ u64 stat_flows_plimit; /* 0x188 0x8 */ u64 stat_pkts_too_long; /* 0x190 0x8 */ u64 stat_allocation_errors; /* 0x198 0x8 */ /* Force padding: */ u64 :64; u64 :64; u64 :64; u64 :64; /* size: 448, cachelines: 7, members: 34 */ /* sum members: 411, holes: 2, sum holes: 5 */ /* padding: 32 */ /* paddings: 1, sum paddings: 16 */ /* forced alignments: 1 */ }; Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net_sched: constify qdisc_priv()Eric Dumazet
In order to propagate const qualifiers, we change qdisc_priv() to accept a possibly const argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01Merge branch 'tcp_delack_max'David S. Miller
Eric Dumazet says: ==================== tcp: add tcp_delack_max() First patches are adding const qualifiers to four existing helpers. Third patch adds a much needed companion feature to RTAX_RTO_MIN. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01tcp: derive delack_max from rto_minEric Dumazet
While BPF allows to set icsk->->icsk_delack_max and/or icsk->icsk_rto_min, we have an ip route attribute (RTAX_RTO_MIN) to be able to tune rto_min, but nothing to consequently adjust max delayed ack, which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}). This makes RTAX_RTO_MIN of almost no practical use, unless customers are in big trouble. Modern days datacenter communications want to set rto_min to ~5 ms, and the max delayed ack one jiffie smaller to avoid spurious retransmits. After this patch, an "rto_min 5" route attribute will effectively lower max delayed ack timers to 4 ms. Note in the following ss output, "rto:6 ... ato:4" $ ss -temoi dst XXXXXX State Recv-Q Send-Q Local Address:Port Peer Address:Port Process ESTAB 0 0 [2002:a05:6608:295::]:52950 [2002:a05:6608:297::]:41597 ino:255134 sk:1001 <-> skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500 rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121 bytes_received:54823120 segs_out:1370582 segs_in:1370580 data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579 busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920 rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536 While we could argue this patch fixes a bug with RTAX_RTO_MIN, I do not add a Fixes: tag, so that we can soak it a bit before asking backports to stable branches. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01tcp: constify tcp_rto_min() and tcp_rto_min_us() argumentEric Dumazet
Make clear these functions do not change any field from TCP socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: constify sk_dst_get() and __sk_dst_get() argumentEric Dumazet
Both helpers only read fields from their socket argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01modpost: Don't let "driver"s reference .exit.*Uwe Kleine-König
Drivers must not reference functions marked with __exit as these likely are not available when the code is built-in. There are few creative offenders uncovered for example in ARCH=amd64 allmodconfig builds. So only trigger the section mismatch warning for W=1 builds. The dual rule that drivers must not reference .init.* is implemented since commit 0db252452378 ("modpost: don't allow *driver to reference .init.*") which however missed that .exit.* should be handled in the same way. Thanks to Masahiro Yamada and Arnd Bergmann who gave valuable hints to find this improvement. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-10-01vmlinux.lds.h: remove unused CPU_KEEP and CPU_DISCARD macrosMasahiro Yamada
Remove the left-over of commit e24f6628811e ("modpost: remove all traces of cpuinit/cpuexit sections"). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2023-10-01modpost: add missing else to the "of" checkMauricio Faria de Oliveira
Without this 'else' statement, an "usb" name goes into two handlers: the first/previous 'if' statement _AND_ the for-loop over 'devtable', but the latter is useless as it has no 'usb' device_id entry anyway. Tested with allmodconfig before/after patch; no changes to *.mod.c: git checkout v6.6-rc3 make -j$(nproc) allmodconfig make -j$(nproc) olddefconfig make -j$(nproc) find . -name '*.mod.c' | cpio -pd /tmp/before # apply patch make -j$(nproc) find . -name '*.mod.c' | cpio -pd /tmp/after diff -r /tmp/before/ /tmp/after/ # no difference Fixes: acbef7b76629 ("modpost: fix module autoloading for OF devices with generic compatible property") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-09-30Merge tag 'soc-fixes-6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "These are the latest bug fixes that have come up in the soc tree. Most of these are fairly minor. Most notably, the majority of changes this time are not for dts files as usual. - Updates to the addresses of the broadcom and aspeed entries in the MAINTAINERS file. - Defconfig updates to address a regression on samsung and a build warning from an unknown Kconfig symbol - Build fixes for the StrongARM and Uniphier platforms - Code fixes for SCMI and FF-A firmware drivers, both of which had a simple bug that resulted in invalid data, and a lesser fix for the optee firmware driver - Multiple fixes for the recently added loongson/loongarch "guts" soc driver - Devicetree fixes for RISC-V on the startfive platform, addressing issues with NOR flash, usb and uart. - Multiple fixes for NXP i.MX8/i.MX9 dts files, fixing problems with clock, gpio, hdmi settings and the Makefile - Bug fixes for i.MX firmware code and the OCOTP soc driver - Multiple fixes for the TI sysc bus driver - Minor dts updates for TI omap dts files, to address boot time warnings and errors" * tag 'soc-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (35 commits) MAINTAINERS: Fix Florian Fainelli's email address arm64: defconfig: enable syscon-poweroff driver ARM: locomo: fix locomolcd_power declaration soc: loongson: loongson2_guts: Remove unneeded semicolon soc: loongson: loongson2_guts: Convert to devm_platform_ioremap_resource() soc: loongson: loongson_pm2: Populate children syscon nodes dt-bindings: soc: loongson,ls2k-pmc: Allow syscon-reboot/syscon-poweroff as child soc: loongson: loongson_pm2: Drop useless of_device_id compatible dt-bindings: soc: loongson,ls2k-pmc: Use fallbacks for ls2k-pmc compatible soc: loongson: loongson_pm2: Add dependency for INPUT arm64: defconfig: remove CONFIG_COMMON_CLK_NPCM8XX=y ARM: uniphier: fix cache kernel-doc warnings MAINTAINERS: aspeed: Update Andrew's email address MAINTAINERS: aspeed: Update git tree URL firmware: arm_ffa: Don't set the memory region attributes for MEM_LEND arm64: dts: imx: Add imx8mm-prt8mm.dtb to build arm64: dts: imx8mm-evk: Fix hdmi@3d node soc: imx8m: Enable OCOTP clock for imx8mm before reading registers arm64: dts: imx8mp-beacon-kit: Fix audio_pll2 clock arm64: dts: imx8mp: Fix SDMA2/3 clocks ...
2023-09-30Merge tag 'trace-v6.6-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Make sure 32-bit applications using user events have aligned access when running on a 64-bit kernel. - Add cond_resched in the loop that handles converting enums in print_fmt string is trace events. - Fix premature wake ups of polling processes in the tracing ring buffer. When a task polls waiting for a percentage of the ring buffer to be filled, the writer still will wake it up at every event. Add the polling's percentage to the "shortest_full" list to tell the writer when to wake it up. - For eventfs dir lookups on dynamic events, an event system's only event could be removed, leaving its dentry with no children. This is totally legitimate. But in eventfs_release() it must not access the children array, as it is only allocated when the dentry has children. * tag 'trace-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Test for dentries array allocated in eventfs_release() tracing/user_events: Align set_bit() address for all archs tracing: relax trace_event_eval_update() execution with cond_resched() ring-buffer: Update "shortest_full" in polling
2023-09-30eventfs: Test for dentries array allocated in eventfs_release()Steven Rostedt (Google)
The dcache_dir_open_wrapper() could be called when a dynamic event is being deleted leaving a dentry with no children. In this case the dlist->dentries array will never be allocated. This needs to be checked for in eventfs_release(), otherwise it will trigger a NULL pointer dereference. Link: https://lore.kernel.org/linux-trace-kernel/20230930090106.1c3164e9@rorschach.local.home Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: ef36b4f92868 ("eventfs: Remember what dentries were created on dir open") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30tracing/user_events: Align set_bit() address for all archsBeau Belgrave
All architectures should use a long aligned address passed to set_bit(). User processes can pass either a 32-bit or 64-bit sized value to be updated when tracing is enabled when on a 64-bit kernel. Both cases are ensured to be naturally aligned, however, that is not enough. The address must be long aligned without affecting checks on the value within the user process which require different adjustments for the bit for little and big endian CPUs. Add a compat flag to user_event_enabler that indicates when a 32-bit value is being used on a 64-bit kernel. Long align addresses and correct the bit to be used by set_bit() to account for this alignment. Ensure compat flags are copied during forks and used during deletion clears. Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/ Cc: stable@vger.kernel.org Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement") Reported-by: Clément Léger <cleger@rivosinc.com> Suggested-by: Clément Léger <cleger@rivosinc.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30tracing: relax trace_event_eval_update() execution with cond_resched()Clément Léger
When kernel is compiled without preemption, the eval_map_work_func() (which calls trace_event_eval_update()) will not be preempted up to its complete execution. This can actually cause a problem since if another CPU call stop_machine(), the call will have to wait for the eval_map_work_func() function to finish executing in the workqueue before being able to be scheduled. This problem was observe on a SMP system at boot time, when the CPU calling the initcalls executed clocksource_done_booting() which in the end calls stop_machine(). We observed a 1 second delay because one CPU was executing eval_map_work_func() and was not preempted by the stop_machine() task. Adding a call to cond_resched() in trace_event_eval_update() allows other tasks to be executed and thus continue working asynchronously like before without blocking any pending task at boot time. Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Clément Léger <cleger@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30ring-buffer: Update "shortest_full" in pollingSteven Rostedt (Google)
It was discovered that the ring buffer polling was incorrectly stating that read would not block, but that's because polling did not take into account that reads will block if the "buffer-percent" was set. Instead, the ring buffer polling would say reads would not block if there was any data in the ring buffer. This was incorrect behavior from a user space point of view. This was fixed by commit 42fb0a1e84ff by having the polling code check if the ring buffer had more data than what the user specified "buffer percent" had. The problem now is that the polling code did not register itself to the writer that it wanted to wait for a specific "full" value of the ring buffer. The result was that the writer would wake the polling waiter whenever there was a new event. The polling waiter would then wake up, see that there's not enough data in the ring buffer to notify user space and then go back to sleep. The next event would wake it up again. Before the polling fix was added, the code would wake up around 100 times for a hackbench 30 benchmark. After the "fix", due to the constant waking of the writer, it would wake up over 11,0000 times! It would never leave the kernel, so the user space behavior was still "correct", but this definitely is not the desired effect. To fix this, have the polling code add what it's waiting for to the "shortest_full" variable, to tell the writer not to wake it up if the buffer is not as full as it expects to be. Note, after this fix, it appears that the waiter is now woken up around 2x the times it was before (~200). This is a tremendous improvement from the 11,000 times, but I will need to spend some time to see why polling is more aggressive in its wakeups than the read blocking code. Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark") Reported-by: Julia Lawall <julia.lawall@inria.fr> Tested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30wifi: mt76: Annotate struct mt76_rx_tid with __counted_byKees Cook
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct mt76_rx_tid. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Felix Fietkau <nbd@nbd.name> Cc: Lorenzo Bianconi <lorenzo@kernel.org> Cc: Ryder Lee <ryder.lee@mediatek.com> Cc: Shayne Chen <shayne.chen@mediatek.com> Cc: Sean Wang <sean.wang@mediatek.com> Cc: Kalle Valo <kvalo@kernel.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: linux-wireless@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mediatek@lists.infradead.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: mt7921: update the channel usage when the regd domain changedMing Yen Hsieh
The 5.9/6GHz channel license of a certain platform device has been regulated in various countries. That may be difference with standard Liunx regulatory domain settings. In this case, when .reg_notifier() called for regulatory change, mt792x chipset should update the channel usage based on clc or dts configurations. Channel would be disabled by following cases. * clc report the particular UNII-x is disabled. * dts enabled and the channel is not configured. Signed-off-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com> Co-developed-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: mt7921: get regulatory information from the clc eventMing Yen Hsieh
The clc event can report the radio configuration for the corresponding country and the driver would take it as regulatory information of a certain platform device. This patch would change the clc commnad from no-waiting to waiting for event. For backward compatible, we also add a new nic capability tag to indicate the firmware did support this new clc event from now on. Signed-off-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com> Co-developed-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: mt7921: add 6GHz power type support for clcMing Yen Hsieh
There are several power type should be supported in 6GHz band. mt7921 apply 6GHz power type from AP settings and clc will setup the corresponding regulatory power. Signed-off-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com> Co-developed-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: mt7921: enable set txpower for UNII-4Ming Yen Hsieh
support power config for channel 165/173/177 Signed-off-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: mt7921: move connac nic capability handling to mt7921Deren Wu
mt76_connac_mcu_get_nic_capability() is used by mt7921 only. It would be better to put the code in chip folder. And we can provide more chip capability information in mt792x_phy without making mt76_phy much bigger. The three functions would be moved to mt7921 folder and renamed. mt76_connac_mcu_parse_tx_resource() mt76_connac_mcu_parse_phy_cap() mt76_connac_mcu_get_nic_capability() Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: reduce spin_lock_bh held up in mt76_dma_rx_cleanupSean Wang
mt76_dma_rx_cleanup would be frequenetly called up to reset the dma rings to be freshed as new ones when switching back from the deep sleep mode to the active mode on mt7921 and mt7922. Shrink the scope of spin_lock_bh in mt76_dma_rx_cleanup being held up to allow the kernel scheduler to be able to switch other tasks in time by reducing the latency. Signed-off-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: mt7996: remove periodic MPDU TXS requestBenjamin Lin
Remove periodic MPDU TXS request. Get TID and FrameType from SKB instead of TXWI, which is empty for Data Frame after MPDU TXS request is removed, hence prohibiting the establishment of TX BA session. Signed-off-by: Benjamin Lin <benjamin-jw.lin@mediatek.com> Signed-off-by: Yi-Chia Hsieh <yi-chia.hsieh@mediatek.com> Signed-off-by: Money Wang <Money.Wang@mediatek.com> Signed-off-by: Peter Chiu <chui-hao.chiu@mediatek.com> Signed-off-by: Evelyn Tsai <evelyn.tsai@mediatek.com> Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: mt7996: enable PPDU-TxS to hostYi-Chia Hsieh
Enable PPDU TxS by default. This makes the driver able to get Tx rate information from TxS. The driver will also refresh BA session timer on receive of PPDU TxS when WED is on. Signed-off-by: Yi-Chia Hsieh <yi-chia.hsieh@mediatek.com> Signed-off-by: Money Wang <Money.Wang@mediatek.com> Signed-off-by: Peter Chiu <chui-hao.chiu@mediatek.com> Signed-off-by: Evelyn Tsai <evelyn.tsai@mediatek.com> Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>
2023-09-30wifi: mt76: mt7996: Add mcu commands for getting sta tx statisticYi-Chia Hsieh
Per peer Tx/Rx statistic can only be obtained by querying WM when WED is on. This patch switches to periodic event reporting in the case of WED being enabled. Signed-off-by: Yi-Chia Hsieh <yi-chia.hsieh@mediatek.com> Signed-off-by: Money Wang <Money.Wang@mediatek.com> Signed-off-by: Peter Chiu <chui-hao.chiu@mediatek.com> Signed-off-by: Evelyn Tsai <evelyn.tsai@mediatek.com> Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name>