summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-09-30ibmveth: Implement multi queue on xmitNick Child
The `ndo_start_xmit` function is protected by a spinlock on the tx queue being used to transmit the skb. Allow concurrent calls to `ndo_start_xmit` by using more than one tx queue. This allows for greater throughput when several jobs are trying to transmit data. Introduce 16 tx queues (leave single rx queue as is) which each correspond to one DMA mapped long term buffer. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30ibmveth: Copy tx skbs into a premapped bufferNick Child
Rather than DMA mapping and unmapping every outgoing skb, copy the skb into a buffer that was mapped during the drivers open function. Copying the skb and its frags have proven to be more time efficient than mapping and unmapping. As an effect, performance increases by 3-5 Gbits/s. Allocate and DMA map one continuous 64KB buffer at `ndo_open`. This buffer is maintained until `ibmveth_close` is called. This buffer is large enough to hold the largest possible linnear skb. During `ndo_start_xmit`, copy the skb and all of it's frags into the continuous buffer. By manually linnearizing all the socket buffers, time is saved during memcpy as well as more efficient handling in FW. As a result, we no longer need to worry about the firmware limitation of handling a max of 6 frags. So, we only need to maintain 1 descriptor instead of 6 and can hardcode 0 for the other 5 descriptors during h_send_logical_lan. Since, DMA allocation/mapping issues can no longer arise in xmit functions, we can further reduce code size by removing the need for a bounce buffer on DMA errors. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30bnx2: Fix spelling mistake "bufferred" -> "buffered"Colin Ian King
There are spelling mistakes in two literal strings. Fix these. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limitedNeal Cardwell
This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be. The following event sequence is an example that triggers the bug: (a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number. (b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false. This commit fixes that bug. One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number. Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized: (1) If the connection is cwnd-limited then we remember that fact for the current window. (2) If the connection not cwnd-limited then we track the maximum number of outstanding packets in the current window. In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false. This first showed up in a testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC. Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30sctp: handle the error returned from sctp_auth_asoc_init_active_keyXin Long
When it returns an error from sctp_auth_asoc_init_active_key(), the active_key is actually not updated. The old sh_key will be freeed while it's still used as active key in asoc. Then an use-after-free will be triggered when sending patckets, as found by syzbot: sctp_auth_shkey_hold+0x22/0xa0 net/sctp/auth.c:112 sctp_set_owner_w net/sctp/socket.c:132 [inline] sctp_sendmsg_to_asoc+0xbd5/0x1a20 net/sctp/socket.c:1863 sctp_sendmsg+0x1053/0x1d50 net/sctp/socket.c:2025 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 This patch is to fix it by not replacing the sh_key when it returns errors from sctp_auth_asoc_init_active_key() in sctp_auth_set_key(). For sctp_auth_set_active_key(), old active_key_id will be set back to asoc->active_key_id when the same thing happens. Fixes: 58acd1009226 ("sctp: update active_key for asoc when old key is being replaced") Reported-by: syzbot+a236dd8e9622ed8954a3@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net: bridge: assign path_cost for 2.5G and 5G link speedSteven Hsieh
As 2.5G, 5G ethernet ports are more common and affordable, these ports are being used in LAN bridge devices. STP port_cost() is missing path_cost assignment for these link speeds, causes highest cost 100 being used. This result in lower speed port being picked when there is loop between 5G and 1G ports. Original path_cost: 10G=2, 1G=4, 100m=19, 10m=100 Adjusted path_cost: 10G=2, 5G=3, 2.5G=4, 1G=5, 100m=19, 10m=100 speed greater than 10G = 1 Signed-off-by: Steven Hsieh <steven.hsieh@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net: lan966x: Fix spelling mistake "tarffic" -> "traffic"Colin Ian King
There is a spelling mistake in a netdev_err message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30mISDN: fix use-after-free bugs in l1oip timer handlersDuoming Zhou
The l1oip_cleanup() traverses the l1oip_ilist and calls release_card() to cleanup module and stack. However, release_card() calls del_timer() to delete the timers such as keep_tl and timeout_tl. If the timer handler is running, the del_timer() will not stop it and result in UAF bugs. One of the processes is shown below: (cleanup routine) | (timer handler) release_card() | l1oip_timeout() ... | del_timer() | ... ... | kfree(hc) //FREE | | hc->timeout_on = 0 //USE Fix by calling del_timer_sync() in release_card(), which makes sure the timer handlers have finished before the resources, such as l1oip and so on, have been deallocated. What's more, the hc->workq and hc->socket_thread can kick those timers right back in. We add a bool flag to show if card is released. Then, check this flag in hc->workq and hc->socket_thread. Fixes: 3712b42d4b1b ("Add layer1 over IP support") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net-next: skbuff: refactor pskb_pullRichard Gobert
pskb_may_pull already contains all of the checks performed by pskb_pull. Use pskb_may_pull for validation in pskb_pull, eliminating the duplication and making __pskb_pull obsolete. Replace __pskb_pull with pskb_pull where applicable. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net: bonding: Convert to use sysfs_emit()/sysfs_emit_at() APIsWang Yufen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net-sysfs: Convert to use sysfs_emit() APIsWang Yufen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net: tun: Convert to use sysfs_emit() APIsWang Yufen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30KVM: selftests: Compare insn opcodes directly in fix_hypercall_testSean Christopherson
Directly compare the expected versus observed hypercall instructions when verifying that KVM patched in the native hypercall (FIX_HYPERCALL_INSN quirk enabled). gcc rightly complains that doing a 4-byte memcpy() with an "unsigned char" as the source generates an out-of-bounds accesses. Alternatively, "exp" and "obs" could be declared as 3-byte arrays, but there's no known reason to copy locally instead of comparing directly. In function ‘assert_hypercall_insn’, inlined from ‘guest_main’ at x86_64/fix_hypercall_test.c:91:2: x86_64/fix_hypercall_test.c:63:9: error: array subscript ‘unsigned int[0]’ is partly outside array bounds of ‘unsigned char[1]’ [-Werror=array-bounds] 63 | memcpy(&exp, exp_insn, sizeof(exp)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ x86_64/fix_hypercall_test.c: In function ‘guest_main’: x86_64/fix_hypercall_test.c:42:22: note: object ‘vmx_hypercall_insn’ of size 1 42 | extern unsigned char vmx_hypercall_insn; | ^~~~~~~~~~~~~~~~~~ x86_64/fix_hypercall_test.c:25:22: note: object ‘svm_hypercall_insn’ of size 1 25 | extern unsigned char svm_hypercall_insn; | ^~~~~~~~~~~~~~~~~~ In function ‘assert_hypercall_insn’, inlined from ‘guest_main’ at x86_64/fix_hypercall_test.c:91:2: x86_64/fix_hypercall_test.c:64:9: error: array subscript ‘unsigned int[0]’ is partly outside array bounds of ‘unsigned char[1]’ [-Werror=array-bounds] 64 | memcpy(&obs, obs_insn, sizeof(obs)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ x86_64/fix_hypercall_test.c: In function ‘guest_main’: x86_64/fix_hypercall_test.c:25:22: note: object ‘svm_hypercall_insn’ of size 1 25 | extern unsigned char svm_hypercall_insn; | ^~~~~~~~~~~~~~~~~~ x86_64/fix_hypercall_test.c:42:22: note: object ‘vmx_hypercall_insn’ of size 1 42 | extern unsigned char vmx_hypercall_insn; | ^~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors make: *** [../lib.mk:135: tools/testing/selftests/kvm/x86_64/fix_hypercall_test] Error 1 Fixes: 6c2fa8b20d0c ("selftests: KVM: Test KVM_X86_QUIRK_FIX_HYPERCALL_INSN") Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Message-Id: <20220928233652.783504-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-30KVM: selftests: Implement memcmp(), memcpy(), and memset() for guest useSean Christopherson
Implement memcmp(), memcpy(), and memset() to override the compiler's built-in versions in order to guarantee that the compiler won't generate out-of-line calls to external functions via the PLT. This allows the helpers to be safely used in guest code, as KVM selftests don't support dynamic loading of guest code. Steal the implementations from the kernel's generic versions, sans the optimizations in memcmp() for unaligned accesses. Put the utilities in a separate compilation unit and build with -ffreestanding to fudge around a gcc "feature" where it will optimize memset(), memcpy(), etc... by generating a recursive call. I.e. the compiler optimizes itself into infinite recursion. Alternatively, the individual functions could be tagged with optimize("no-tree-loop-distribute-patterns"), but using "optimize" for anything but debug is discouraged, and Linus NAK'd the use of the flag in the kernel proper[*]. https://lore.kernel.org/lkml/CAHk-=wik-oXnUpfZ6Hw37uLykc-_P0Apyn2XuX-odh-3Nzop8w@mail.gmail.com Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Anup Patel <anup@brainfault.org> Cc: Atish Patra <atishp@atishpatra.org> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220928233652.783504-2-seanjc@google.com> Reviewed-by: Andrew Jones <andrew.jones@linux.dev> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-30KVM: x86: Hide IA32_PLATFORM_DCA_CAP[31:0] from the guestJim Mattson
The only thing reported by CPUID.9 is the value of IA32_PLATFORM_DCA_CAP[31:0] in EAX. This MSR doesn't even exist in the guest, since CPUID.1:ECX.DCA[bit 18] is clear in the guest. Clear CPUID.9 in KVM_GET_SUPPORTED_CPUID. Fixes: 24c82e576b78 ("KVM: Sanitize cpuid") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20220922231854.249383-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-30KVM: selftests: Gracefully handle empty stack tracesDavid Matlack
Bail out of test_dump_stack() if the stack trace is empty rather than invoking addr2line with zero addresses. The problem with the latter is that addr2line will block waiting for addresses to be passed in via stdin, e.g. if running a selftest from an interactive terminal. Opportunistically fix up the comment that mentions skipping 3 frames since only 2 are skipped in the code. Cc: Vipin Sharma <vipinsh@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220922231724.3560211-1-dmatlack@google.com> [Small tweak to keep backtrace() call close to if(). - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-30KVM: selftests: replace assertion with warning in access_tracking_perf_testEmanuele Giuseppe Esposito
Page_idle uses {ptep/pmdp}_clear_young_notify which in turn calls the mmu notifier callback ->clear_young(), which purposefully does not flush the TLB. When running the test in a nested guest, point 1. of the test doc header is violated, because KVM TLB is unbounded by size and since no flush is forced, KVM does not update the sptes accessed/idle bits resulting in guest assertion failure. More precisely, only the first ACCESS_WRITE in run_test() actually makes visible changes, because sptes are created and the accessed bit is set to 1 (or idle bit is 0). Then the first mark_memory_idle() passes since access bit is still one, and sets all pages as idle (or not accessed). When the next write is performed, the update is not flushed therefore idle is still 1 and next mark_memory_idle() fails. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-Id: <20220926082923.299554-1-eesposit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-30Merge branch 'net-tsnep-multiqueue'David S. Miller
Gerhard Engleder says: ==================== tsnep: multi queue support and some other improvements Add support for additional TX/RX queues along with RX flow classification support. Binding is extended to allow additional interrupts for additional TX/RX queues. Also dma-coherent is allowed as minor improvement. RX path optimisation is done by using page pool as preparations for future XDP support. v4: - rework dma-coherent commit message (Krzysztof Kozlowski) - fixed order of interrupt-names in binding (Krzysztof Kozlowski) - add line break between examples in binding (Krzysztof Kozlowski) - add RX_CLS_LOC_ANY support to RX flow classification v3: - now with changes in cover letter v2: - use netdev_name() (Jakub Kicinski) - use ENOENT if RX flow rule is not found (Jakub Kicinski) - eliminate return code of tsnep_add_rule() (Jakub Kicinski) - remove commit with lazy refill due to depletion problem (Jakub Kicinski) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30tsnep: Use page pool for RXGerhard Engleder
Use page pool for RX buffer handling. Makes RX path more efficient and is required prework for future XDP support. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30tsnep: Add EtherType RX flow classification supportGerhard Engleder
Received Ethernet frames are assigned to first RX queue per default. Based on EtherType Ethernet frames can be assigned to other RX queues. This enables processing of real-time Ethernet protocols on dedicated RX queues. Add RX flow classification interface for EtherType based RX queue assignment. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30tsnep: Support multiple TX/RX queue pairsGerhard Engleder
Support additional TX/RX queue pairs if dedicated interrupt is available. Interrupts are detected by name in device tree. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30tsnep: Move interrupt from device to queueGerhard Engleder
For multiple queues multiple interrupts shall be used. Therefore, rework global interrupt to per queue interrupt. Every interrupt name shall contain interface name and queue information. To get a valid interface name, the interrupt request needs to by done during open like in other drivers. Additionally, this allows the removal of some initialisation checks in the interrupt handler. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30dt-bindings: net: tsnep: Allow additional interruptsGerhard Engleder
Additional TX/RX queue pairs require dedicated interrupts. Extend binding with additional interrupts. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30dt-bindings: net: tsnep: Allow dma-coherentGerhard Engleder
Within SoCs like ZynqMP, FPGA logic can be connected to different kinds of AXI master ports. Also cache coherent AXI master ports are available. The property "dma-coherent" is used to signal that DMA is cache coherent. Add "dma-coherent" property to allow the configuration of cache coherent DMA. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30Merge branch 'xfrm: add netlink extack to all the ->init_stat'Steffen Klassert
Sabrina Dubroca says: ============ This series completes extack support for state creation. ============ Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-09-29hardening: Remove Clang's enable flag for -ftrivial-auto-var-init=zeroKees Cook
Now that Clang's -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang option is no longer required, remove it from the command line. Clang 16 and later will warn when it is used, which will cause Kconfig to think it can't use -ftrivial-auto-var-init=zero at all. Check for whether it is required and only use it when so. Cc: Nathan Chancellor <nathan@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: linux-kbuild@vger.kernel.org Cc: llvm@lists.linux.dev Cc: stable@vger.kernel.org Fixes: f02003c860d9 ("hardening: Avoid harmless Clang option under CONFIG_INIT_STACK_ALL_ZERO") Signed-off-by: Kees Cook <keescook@chromium.org>
2022-09-29Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2022-09-28 (ice) Arkadiusz implements a single pin initialization function, checking feature bits, instead of having separate device functions and updates sub-device IDs for recognizing E810T devices. Martyna adds support for switchdev filters on VLAN priority field. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: Add support for VLAN priority filters in switchdev ice: support features on new E810T variants ice: Merge pin initialization of E810 and E810T adapters ==================== Link: https://lore.kernel.org/r/20220928203217.411078-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29eth: alx: take rtnl_lock on resumeJakub Kicinski
Zbynek reports that alx trips an rtnl assertion on resume: RTNL: assertion failed at net/core/dev.c (2891) RIP: 0010:netif_set_real_num_tx_queues+0x1ac/0x1c0 Call Trace: <TASK> __alx_open+0x230/0x570 [alx] alx_resume+0x54/0x80 [alx] ? pci_legacy_resume+0x80/0x80 dpm_run_callback+0x4a/0x150 device_resume+0x8b/0x190 async_resume+0x19/0x30 async_run_entry_fn+0x30/0x130 process_one_work+0x1e5/0x3b0 indeed the driver does not hold rtnl_lock during its internal close and re-open functions during suspend/resume. Note that this is not a huge bug as the driver implements its own locking, and does not implement changing the number of queues, but we need to silence the splat. Fixes: 4a5fe57e7751 ("alx: use fine-grained locking instead of RTNL") Reported-and-tested-by: Zbynek Michl <zbynek.michl@gmail.com> Reviewed-by: Niels Dossche <dossche.niels@gmail.com> Link: https://lore.kernel.org/r/20220928181236.1053043-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29sparc: Unbreak the buildBart Van Assche
Fix the following build errors: arch/sparc/mm/srmmu.c: In function ‘smp_flush_page_for_dma’: arch/sparc/mm/srmmu.c:1639:13: error: cast between incompatible function types from ‘void (*)(long unsigned int)’ to ‘void (*)(long unsigned int, long unsigned int, long unsigned int, long unsigned int, long unsigned int)’ [-Werror=cast-function-type] 1639 | xc1((smpfunc_t) local_ops->page_for_dma, page); | ^ arch/sparc/mm/srmmu.c: In function ‘smp_flush_cache_mm’: arch/sparc/mm/srmmu.c:1662:29: error: cast between incompatible function types from ‘void (*)(struct mm_struct *)’ to ‘void (*)(long unsigned int, long unsigned int, long unsigned int, long unsigned int, long unsigned int)’ [-Werror=cast-function-type] 1662 | xc1((smpfunc_t) local_ops->cache_mm, (unsigned long) mm); | [ ... ] Compile-tested only. Fixes: 552a23a0e5d0 ("Makefile: Enable -Wcast-function-type") Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220830205854.1918026-1-bvanassche@acm.org
2022-09-29net: phy: Convert to use sysfs_emit() APIsWang Yufen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/1664364860-29153-1-git-send-email-wangyufen@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29Merge branch 'add-tc-taprio-support-for-queuemaxsdu'Jakub Kicinski
Vladimir Oltean says: ==================== Add tc-taprio support for queueMaxSDU The tc-taprio offload mode supported by the Felix DSA driver has limitations surrounding its guard bands. The initial discussion was at: https://lore.kernel.org/netdev/c7618025da6723418c56a54fe4683bd7@walle.cc/ with the latest status being that we now have a vsc9959_tas_guard_bands_update() method which makes a best-guess attempt at how much useful space to reserve for packet scheduling in a taprio interval, and how much to reserve for guard bands. IEEE 802.1Q actually does offer a tunable variable (queueMaxSDU) which can determine the max MTU supported per traffic class. In turn we can determine the size we need for the guard bands, depending on the queueMaxSDU. This way we can make the guard band of small taprio intervals smaller than one full MTU worth of transmission time, if we know that said traffic class will transport only smaller packets. As discussed with Gerhard Engleder, the queueMaxSDU may also be useful in limiting the latency on an endpoint, if some of the TX queues are outside of the control of the Linux driver. https://patchwork.kernel.org/project/netdevbpf/patch/20220914153303.1792444-11-vladimir.oltean@nxp.com/ Allow input of queueMaxSDU through netlink into tc-taprio, offload it to the hardware I have access to (LS1028A), and (implicitly) deny non-default values to everyone else. Kurt Kanzenbach has also kindly tested and shared a patch to offload this to hellcreek. v3 at: https://patchwork.kernel.org/project/netdevbpf/cover/20220927234746.1823648-1-vladimir.oltean@nxp.com/ v2 at: https://patchwork.kernel.org/project/netdevbpf/list/?series=679954&state=* v1 at: https://patchwork.kernel.org/project/netdevbpf/cover/20220914153303.1792444-1-vladimir.oltean@nxp.com/ ==================== Link: https://lore.kernel.org/r/20220928095204.2093716-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: enetc: offload per-tc max SDU from tc-taprioVladimir Oltean
The driver currently sets the PTCMSDUR register statically to the max MTU supported by the interface. Keep this logic if tc-taprio is absent or if the max_sdu for a traffic class is 0, and follow the requested max SDU size otherwise. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: enetc: use common naming scheme for PTGCR and PTGCAPR registersVladimir Oltean
The Port Time Gating Control Register (PTGCR) and Port Time Gating Capability Register (PTGCAPR) have definitions in the driver which aren't in line with the other registers. Rename these. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: enetc: cache accesses to &priv->si->hwVladimir Oltean
The &priv->si->hw construct dereferences 2 pointers and makes lines longer than they need to be, in turn making the code harder to read. Replace &priv->si->hw accesses with a "hw" variable when there are 2 or more accesses within a function that dereference this. This includes loops, since &priv->si->hw is a loop invariant. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: dsa: hellcreek: Offload per-tc max SDU from tc-taprioKurt Kanzenbach
Add support for configuring the max SDU per priority and per port. If not specified, keep the default. Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: dsa: hellcreek: refactor hellcreek_port_setup_tc() to use switch/caseVladimir Oltean
The following patch will need to make this function also respond to TC_QUERY_BASE, so make the processing more structured around the tc_setup_type. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: dsa: felix: offload per-tc max SDU from tc-taprioVladimir Oltean
Our current vsc9959_tas_guard_bands_update() algorithm has a limitation imposed by the hardware design. To avoid packet overruns between one gate interval and the next (which would add jitter for scheduled traffic in the next gate), we configure the switch to use guard bands. These are as large as the largest packet which is possible to be transmitted. The problem is that at tc-taprio intervals of sizes comparable to a guard band, there isn't an obvious place in which to split the interval between the useful portion (for scheduling) and the guard band portion (where scheduling is blocked). For example, a 10 us interval at 1Gbps allows 1225 octets to be transmitted. We currently split the interval between the bare minimum of 33 ns useful time (required to schedule a single packet) and the rest as guard band. But 33 ns of useful scheduling time will only allow a single packet to be sent, be that packet 1200 octets in size, or 60 octets in size. It is impossible to send 2 60 octets frames in the 10 us window. Except that if we reduced the guard band (and therefore the maximum allowable SDU size) to 5 us, the useful time for scheduling is now also 5 us, so more packets could be scheduled. The hardware inflexibility of not scheduling according to individual packet lengths must unfortunately propagate to the user, who needs to tune the queueMaxSDU values if he wants to fit more small packets into a 10 us interval, rather than one large packet. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net/sched: taprio: allow user input of per-tc max SDUVladimir Oltean
IEEE 802.1Q clause 12.29.1.1 "The queueMaxSDUTable structure and data types" and 8.6.8.4 "Enhancements for scheduled traffic" talk about the existence of a per traffic class limitation of maximum frame sizes, with a fallback on the port-based MTU. As far as I am able to understand, the 802.1Q Service Data Unit (SDU) represents the MAC Service Data Unit (MSDU, i.e. L2 payload), excluding any number of prepended VLAN headers which may be otherwise present in the MSDU. Therefore, the queueMaxSDU is directly comparable to the device MTU (1500 means L2 payload sizes are accepted, or frame sizes of 1518 octets, or 1522 plus one VLAN header). Drivers which offload this are directly responsible of translating into other units of measurement. To keep the fast path checks optimized, we keep 2 arrays in the qdisc, one for max_sdu translated into frame length (so that it's comparable to skb->len), and another for offloading and for dumping back to the user. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net/sched: query offload capabilities through ndo_setup_tc()Vladimir Oltean
When adding optional new features to Qdisc offloads, existing drivers must reject the new configuration until they are coded up to act on it. Since modifying all drivers in lockstep with the changes in the Qdisc can create problems of its own, it would be nice if there existed an automatic opt-in mechanism for offloading optional features. Jakub proposes that we multiplex one more kind of call through ndo_setup_tc(): one where the driver populates a Qdisc-specific capability structure. First user will be taprio in further changes. Here we are introducing the definitions for the base functionality. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net/tipc: Remove unused struct distr_queue_itemYuan Can
After commit 09b5678c778f("tipc: remove dead code in tipc_net and relatives"), struct distr_queue_item is not used any more and can be removed as well. Signed-off-by: Yuan Can <yuancan@huawei.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/20220928085636.71749-1-yuancan@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: skb: introduce and use a single page frag cachePaolo Abeni
After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: sched: cls_u32: Avoid memcpy() false-positive warningKees Cook
To work around a misbehavior of the compiler's ability to see into composite flexible array structs (as detailed in the coming memcpy() hardening series[1]), use unsafe_memcpy(), as the sizing, bounds-checking, and allocation are all very tightly coupled here. This silences the false-positive reported by syzbot: memcpy: detected field-spanning write (size 80) of single field "&n->sel" at net/sched/cls_u32.c:1043 (size 16) [1] https://lore.kernel.org/linux-hardening/20220901065914.1417829-2-keescook@chromium.org Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Reported-by: syzbot+a2c4601efc75848ba321@syzkaller.appspotmail.com Link: https://lore.kernel.org/lkml/000000000000a96c0b05e97f0444@google.com/ Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20220927153700.3071688-1-keescook@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29dt-bindings: net: snps,dwmac: Document stmmac-axi-config subnodeMarek Vasut
The stmmac-axi-config subnode is present in multiple dwmac instance DTs, document its content per snps,axi-config property description which is a phandle to this subnode. Signed-off-by: Marek Vasut <marex@denx.de> Reviewed-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220927012449.698915-1-marex@denx.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29docs: netlink: clarify the historical baggage of Netlink flagsJakub Kicinski
nlmsg_flags are full of historical baggage, inconsistencies and strangeness. Try to document it more thoroughly. Explain the meaning of the ECHO flag (and while at it clarify the comment in the uAPI). Handwave a little about the NEW request flags and how they make sense on the surface but cater to really old paradigm before commands were a thing. I will add more notes on how to make use of ECHO and discouragement for reuse of flags to the kernel-side documentation. Link: https://lore.kernel.org/r/20220927212306.823862-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29vhost/vsock: Use kvmalloc/kvfree for larger packets.Junichi Uekawa
When copying a large file over sftp over vsock, data size is usually 32kB, and kmalloc seems to fail to try to allocate 32 32kB regions. vhost-5837: page allocation failure: order:4, mode:0x24040c0 Call Trace: [<ffffffffb6a0df64>] dump_stack+0x97/0xdb [<ffffffffb68d6aed>] warn_alloc_failed+0x10f/0x138 [<ffffffffb68d868a>] ? __alloc_pages_direct_compact+0x38/0xc8 [<ffffffffb664619f>] __alloc_pages_nodemask+0x84c/0x90d [<ffffffffb6646e56>] alloc_kmem_pages+0x17/0x19 [<ffffffffb6653a26>] kmalloc_order_trace+0x2b/0xdb [<ffffffffb66682f3>] __kmalloc+0x177/0x1f7 [<ffffffffb66e0d94>] ? copy_from_iter+0x8d/0x31d [<ffffffffc0689ab7>] vhost_vsock_handle_tx_kick+0x1fa/0x301 [vhost_vsock] [<ffffffffc06828d9>] vhost_worker+0xf7/0x157 [vhost] [<ffffffffb683ddce>] kthread+0xfd/0x105 [<ffffffffc06827e2>] ? vhost_dev_set_owner+0x22e/0x22e [vhost] [<ffffffffb683dcd1>] ? flush_kthread_worker+0xf3/0xf3 [<ffffffffb6eb332e>] ret_from_fork+0x4e/0x80 [<ffffffffb683dcd1>] ? flush_kthread_worker+0xf3/0xf3 Work around by doing kvmalloc instead. Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko") Signed-off-by: Junichi Uekawa <uekawa@chromium.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20220928064538.667678-1-uekawa@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29binfmt: remove taso from linux_binprm structLukas Bulwahn
With commit 987f20a9dcce ("a.out: Remove the a.out implementation"), the use of the special taso flag for alpha architectures in the linux_binprm struct is gone. Remove the definition of taso in the linux_binprm struct. No functional change. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220929203903.9475-1-lukas.bulwahn@gmail.com
2022-09-30Merge tag 'drm-intel-fixes-2022-09-29' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-intel into drm-fixes - Restrict forced preemption to the active context (Chris) - Restrict perf_limit_reasons to the supported platforms - gen11+ (Ashutosh) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/YzXAkH1a32pYJD33@intel.com
2022-09-30Merge tag 'amd-drm-fixes-6.0-2022-09-29' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-6.0-2022-09-29: amdgpu: - GC 11.x fixes - SMU 13.x fixes - DCN 3.1.4 fixes - DCN 3.2.x fixes - GC 9.x fix - Fence fix - SR-IOV supend/resume fix - PSR regression fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220929144003.8363-1-alexander.deucher@amd.com
2022-09-30Merge tag 'drm-misc-fixes-2022-09-29' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes Short summary of fixes pull: * bridge/analogix: Revert earlier suspend fix * bridge/lt8912b: Fix corrupt display output Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/YzWvHhaqHhYirn4L@linux-uq9g
2022-09-29Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull coredump fix from Al Viro: "Fix for breakage in dump_user_range()" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [coredump] don't use __kernel_write() on kmap_local_page()