summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/microsoft
AgeCommit message (Collapse)Author
2024-05-23Merge tag 'net-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Quite smaller than usual. Notably it includes the fix for the unix regression from the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled - eth: ixgbe: fix link breakage vs cisco switches - eth: ice: prevent ethtool from corrupting the channels Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support - tcp: Fix shift-out-of-bounds in dctcp_update_alpha() Misc: - a bunch of selftests stabilization patches" * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits) r8169: Fix possible ring buffer corruption on fragmented Tx packets. idpf: Interpret .set_channels() input differently ice: Interpret .set_channels() input differently nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() net: relax socket state check at accept time. tcp: remove 64 KByte limit for initial tp->rcv_wnd value net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() tls: fix missing memory barrier in tls_init net: fec: avoid lock evasion when reading pps_enable Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" testing: net-drv: use stats64 for testing net: mana: Fix the extra HZ in mana_hwc_send_request net: lan966x: Remove ptp traps in case the ptp is not enabled. openvswitch: Set the skbuff pkt_type for proper pmtud support. selftest: af_unix: Make SCM_RIGHTS into OOB data. af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). selftests/net: use tc rule to filter the na packet ipv6: sr: fix memleak in seg6_hmac_init_algo af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. ...
2024-05-22net: mana: Fix the extra HZ in mana_hwc_send_requestSouradeep Chakrabarti
Commit 62c1bff593b7 added an extra HZ along with msecs_to_jiffies. This patch fixes that. Cc: stable@vger.kernel.org Fixes: 62c1bff593b7 ("net: mana: Configure hwc timeout from hardware") Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Link: https://lore.kernel.org/r/1716185104-31658-1-git-send-email-schakrabarti@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-13net: mana: Enable MANA driver on ARM64 with 4K page sizeHaiyang Zhang
Change the Kconfig dependency, so this driver can be built and run on ARM64 with 4K page size. 16/64K page sizes are not supported yet. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://lore.kernel.org/r/1715632141-8089-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07net: annotate writes on dev->mtu from ndo_change_mtu()Eric Dumazet
Simon reported that ndo_change_mtu() methods were never updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted in commit 501a90c94510 ("inet: protect against too small mtu values.") We read dev->mtu without holding RTNL in many places, with READ_ONCE() annotations. It is time to take care of ndo_change_mtu() methods to use corresponding WRITE_ONCE() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/ Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25fix missing vmalloc.h includesKent Overstreet
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-11Merge branch mana-ib-flex of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git Erick Archer says: ==================== mana: Add flex array to struct mana_cfg_rx_steer_req_v2 (part) The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of trailing elements. Specifically, it uses a "mana_handle_t" array. So, use the preferred way in the kernel declaring a flexible array [1]. At the same time, prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). Also, avoid the open-coded arithmetic in the memory allocator functions [2] using the "struct_size" macro. Moreover, use the "offsetof" helper to get the indirect table offset instead of the "sizeof" operator and avoid the open-coded arithmetic in pointers using the new flex member. This new structure member also allow us to remove the "req_indir_tab" variable since it is no longer needed. Now, it is also possible to use the "flex_array_size" helper to compute the size of these trailing elements in the "memcpy" function. Specifically, the first commit adds the flex member and the patches 2 and 3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2". This code was detected with the help of Coccinelle, and audited and modified manually. The Coccinelle script used to detect this code pattern is the following: virtual report @rule1@ type t1; type t2; identifier i0; identifier i1; identifier i2; identifier ALLOC =~ "kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc"; position p1; @@ i0 = sizeof(t1) + sizeof(t2) * i1; ... i2 = ALLOC@p1(..., i0, ...); @script:python depends on report@ p1 << rule1.p1; @@ msg = "WARNING: verify allocation on line %s" % (p1[0].line) coccilib.report.print_report(p1[0],msg) Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1] Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2] v1: https://lore.kernel.org/linux-hardening/AS8PR02MB7237974EF1B9BAFA618166C38B382@AS8PR02MB7237.eurprd02.prod.outlook.com/ v2: https://lore.kernel.org/linux-hardening/AS8PR02MB723729C5A63F24C312FC9CD18B3F2@AS8PR02MB7237.eurprd02.prod.outlook.com/ ==================== Link: https://lore.kernel.org/r/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11net: mana: Avoid open coded arithmeticErick Archer
This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows [1][2]. As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2" and this structure ends in a flexible array: struct mana_cfg_rx_steer_req_v2 { [...] mana_handle_t indir_tab[] __counted_by(num_indir_entries); }; the preferred way in the kernel is to use the struct_size() helper to do the arithmetic instead of the calculation "size + size * count" in the kzalloc() function. Moreover, use the "offsetof" helper to get the indirect table offset instead of the "sizeof" operator and avoid the open-coded arithmetic in pointers using the new flex member. This new structure member also allow us to remove the "req_indir_tab" variable since it is no longer needed. Now, it is also possible to use the "flex_array_size" helper to compute the size of these trailing elements in the "memcpy" function. This way, the code is more readable and safer. This code was detected with the help of Coccinelle, and audited and modified manually. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1] Link: https://github.com/KSPP/linux/issues/160 [2] Signed-off-by: Erick Archer <erick.archer@outlook.com> Link: https://lore.kernel.org/r/AS8PR02MB7237A21355C86EC0DCC0D83B8B022@AS8PR02MB7237.eurprd02.prod.outlook.com Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-03net: mana: Fix Rx DMA datasize and skb_over_panicHaiyang Zhang
mana_get_rxbuf_cfg() aligns the RX buffer's DMA datasize to be multiple of 64. So a packet slightly bigger than mtu+14, say 1536, can be received and cause skb_over_panic. Sample dmesg: [ 5325.237162] skbuff: skb_over_panic: text:ffffffffc043277a len:1536 put:1536 head:ff1100018b517000 data:ff1100018b517100 tail:0x700 end:0x6ea dev:<NULL> [ 5325.243689] ------------[ cut here ]------------ [ 5325.245748] kernel BUG at net/core/skbuff.c:192! [ 5325.247838] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 5325.258374] RIP: 0010:skb_panic+0x4f/0x60 [ 5325.302941] Call Trace: [ 5325.304389] <IRQ> [ 5325.315794] ? skb_panic+0x4f/0x60 [ 5325.317457] ? asm_exc_invalid_op+0x1f/0x30 [ 5325.319490] ? skb_panic+0x4f/0x60 [ 5325.321161] skb_put+0x4e/0x50 [ 5325.322670] mana_poll+0x6fa/0xb50 [mana] [ 5325.324578] __napi_poll+0x33/0x1e0 [ 5325.326328] net_rx_action+0x12e/0x280 As discussed internally, this alignment is not necessary. To fix this bug, remove it from the code. So oversized packets will be marked as CQE_RX_TRUNCATED by NIC, and dropped. Cc: stable@vger.kernel.org Fixes: 2fbbd712baf1 ("net: mana: Enable RX path to handle various MTU sizes") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Link: https://lore.kernel.org/r/1712087316-20886-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-01net: mana: Assigning IRQ affinity on HT coresSouradeep Chakrabarti
Existing MANA design assigns IRQ to every CPU, including sibling hyper-threads. This may cause multiple IRQs to be active simultaneously in the same core and may reduce the network performance. Improve the performance by assigning IRQ to non sibling CPUs in local NUMA node. The performance improvement we are getting using ntttcp with following patch is around 15 percent against existing design and approximately 11 percent, when trying to assign one IRQ in each core across NUMA nodes, if enough cores are present. The change will improve the performance for the system with high number of CPU, where number of CPUs in a node is more than 64 CPUs. Nodes with 64 CPUs or less than 64 CPUs will not be affected by this change. The performance study was done using ntttcp tool in Azure. The node had 2 nodes with 32 cores each, total 128 vCPU and number of channels were 32 for 32 RX rings. The below table shows a comparison between existing design and new design: IRQ node-num core-num CPU performance(%) 1 0 | 0 0 | 0 0 | 0-1 0 2 0 | 0 0 | 1 1 | 2-3 3 3 0 | 0 1 | 2 2 | 4-5 10 4 0 | 0 1 | 3 3 | 6-7 15 5 0 | 0 2 | 4 4 | 8-9 15 ... ... 25 0 | 0 12| 24 24| 48-49 12 ... 32 0 | 0 15| 31 31| 62-63 12 33 0 | 0 16| 0 32| 0-1 10 ... 64 0 | 0 31| 31 63| 62-63 0 Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-01net: mana: add a function to spread IRQs per CPUsYury Norov
Souradeep investigated that the driver performs faster if IRQs are spread on CPUs with the following heuristics: 1. No more than one IRQ per CPU, if possible; 2. NUMA locality is the second priority; 3. Sibling dislocality is the last priority. Let's consider this topology: Node 0 1 Core 0 1 2 3 CPU 0 1 2 3 4 5 6 7 The most performant IRQ distribution based on the above topology and heuristics may look like this: IRQ Nodes Cores CPUs 0 1 0 0-1 1 1 1 2-3 2 1 0 0-1 3 1 1 2-3 4 2 2 4-5 5 2 3 6-7 6 2 2 4-5 7 2 3 6-7 The irq_setup() routine introduced in this patch leverages the for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups as described above. According to [1], for NUMA-aware but sibling-ignorant IRQ distribution based on cpumask_local_spread() performance test results look like this: ./ntttcp -r -m 16 NTTTCP for Linux 1.4.0 --------------------------------------------------------- 08:05:20 INFO: 17 threads created 08:05:28 INFO: Network activity progressing... 08:06:28 INFO: Test run completed. 08:06:28 INFO: Test cycle finished. 08:06:28 INFO: ##### Totals: ##### 08:06:28 INFO: test duration :60.00 seconds 08:06:28 INFO: total bytes :630292053310 08:06:28 INFO: throughput :84.04Gbps 08:06:28 INFO: retrans segs :4 08:06:28 INFO: cpu cores :192 08:06:28 INFO: cpu speed :3799.725MHz 08:06:28 INFO: user :0.05% 08:06:28 INFO: system :1.60% 08:06:28 INFO: idle :96.41% 08:06:28 INFO: iowait :0.00% 08:06:28 INFO: softirq :1.94% 08:06:28 INFO: cycles/byte :2.50 08:06:28 INFO: cpu busy (all) :534.41% For NUMA- and sibling-aware IRQ distribution, the same test works 15% faster: ./ntttcp -r -m 16 NTTTCP for Linux 1.4.0 --------------------------------------------------------- 08:08:51 INFO: 17 threads created 08:08:56 INFO: Network activity progressing... 08:09:56 INFO: Test run completed. 08:09:56 INFO: Test cycle finished. 08:09:56 INFO: ##### Totals: ##### 08:09:56 INFO: test duration :60.00 seconds 08:09:56 INFO: total bytes :741966608384 08:09:56 INFO: throughput :98.93Gbps 08:09:56 INFO: retrans segs :6 08:09:56 INFO: cpu cores :192 08:09:56 INFO: cpu speed :3799.791MHz 08:09:56 INFO: user :0.06% 08:09:56 INFO: system :1.81% 08:09:56 INFO: idle :96.18% 08:09:56 INFO: iowait :0.00% 08:09:56 INFO: softirq :1.95% 08:09:56 INFO: cycles/byte :2.25 08:09:56 INFO: cpu busy (all) :569.22% [1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/ Signed-off-by: Yury Norov <yury.norov@gmail.com> Co-developed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Small cycle, with some typical driver updates: - General code tidying in siw, hfi1, idrdma, usnic, hns rtrs and bnxt_re - Many small siw cleanups without an overeaching theme - Debugfs stats for hns - Fix a TX queue timeout in IPoIB and missed locking of the mcast list - Support more features of P7 devices in bnxt_re including a new work submission protocol - CQ interrupts for MANA - netlink stats for erdma - EFA multipath PCI support - Fix Incorrect MR invalidation in iser" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits) RDMA/bnxt_re: Fix error code in bnxt_re_create_cq() RDMA/efa: Add EFA query MR support IB/iser: Prevent invalidating wrong MR RDMA/erdma: Add hardware statistics support RDMA/erdma: Introduce dma pool for hardware responses of CMDQ requests IB/iser: iscsi_iser.h: fix kernel-doc warning and spellos RDMA/mana_ib: Add CQ interrupt support for RAW QP RDMA/mana_ib: query device capabilities RDMA/mana_ib: register RDMA device with GDMA RDMA/bnxt_re: Fix the sparse warnings RDMA/bnxt_re: Fix the offset for GenP7 adapters for user applications RDMA/bnxt_re: Share a page to expose per CQ info with userspace RDMA/bnxt_re: Add UAPI to share a page with user space IB/ipoib: Fix mcast list locking RDMA/mlx5: Expose register c0 for RDMA device net/mlx5: E-Switch, expose eswitch manager vport net/mlx5: Manage ICM type of SW encap RDMA/mlx5: Support handling of SW encap ICM area net/mlx5: Introduce indirect-sw-encap ICM properties RDMA/bnxt_re: Adds MSN table capability for Gen P7 adapters ...
2023-12-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR. Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c 23c93c3b6275 ("bnxt_en: do not map packet buffers twice") 6d1add95536b ("bnxt_en: Modify TX ring indexing logic.") tools/testing/selftests/net/Makefile 2258b666482d ("selftests: add vlan hw filter tests") a0bc96c0cd6e ("selftests: net: verify fq per-band packet limit") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-12-20RDMA/mana_ib: register RDMA device with GDMALong Li
Software client needs to register with the RDMA management interface on the SoC to access more features, including querying device capabilities and RC queue pair. Signed-off-by: Long Li <longli@microsoft.com> Link: https://lore.kernel.org/r/1702692255-23640-2-git-send-email-longli@linuxonhyperv.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-12-19net: mana: select PAGE_POOLYury Norov
Mana uses PAGE_POOL API. x86_64 defconfig doesn't select it: ld: vmlinux.o: in function `mana_create_page_pool.isra.0': mana_en.c:(.text+0x9ae36f): undefined reference to `page_pool_create' ld: vmlinux.o: in function `mana_get_rxfrag': mana_en.c:(.text+0x9afed1): undefined reference to `page_pool_alloc_pages' make[3]: *** [/home/yury/work/linux/scripts/Makefile.vmlinux:37: vmlinux] Error 1 make[2]: *** [/home/yury/work/linux/Makefile:1154: vmlinux] Error 2 make[1]: *** [/home/yury/work/linux/Makefile:234: __sub-make] Error 2 make[1]: Leaving directory '/home/yury/work/build-linux-x86_64' make: *** [Makefile:234: __sub-make] Error 2 So we need to select it explicitly. Signed-off-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Simon Horman <horms@kernel.org> # build-tested Fixes: ca9c54d2 ("net: mana: Add a driver for Microsoft Azure Network Adapter") Link: https://lore.kernel.org/r/20231215203353.635379-1-yury.norov@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-12-15net: mana: add msix index sharing between EQsKonstantin Taranov
This patch allows to assign and poll more than one EQ on the same msix index. It is achieved by introducing a list of attached EQs in each IRQ context. It also removes the existing msix_index map that tried to ensure that there is only one EQ at each msix_index. This patch exports symbols for creating EQs from other MANA kernel modules. Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-13net: ethtool: pass a pointer to parameters to get/set_rxfh ethtool opsAhmed Zaki
The get/set_rxfh ethtool ops currently takes the rxfh (RSS) parameters as direct function arguments. This will force us to change the API (and all drivers' functions) every time some new parameters are added. This is part 1/2 of the fix, as suggested in [1]: - First simplify the code by always providing a pointer to all params (indir, key and func); the fact that some of them may be NULL seems like a weird historic thing or a premature optimization. It will simplify the drivers if all pointers are always present. - Then make the functions take a dev pointer, and a pointer to a single struct wrapping all arguments. The set_* should also take an extack. Link: https://lore.kernel.org/netdev/20231121152906.2dd5f487@kernel.org/ [1] Suggested-by: Jakub Kicinski <kuba@kernel.org> Suggested-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Link: https://lore.kernel.org/r/20231213003321.605376-2-ahmed.zaki@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-29net: mana: Fix spelling mistake "enforecement" -> "enforcement"Colin Ian King
There is a spelling mistake in struct field hc_tx_err_sqpdid_enforecement. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Link: https://lore.kernel.org/r/20231128095304.515492-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-28eth: link netdev to page_pools in driversJakub Kicinski
Link page pool instances to netdev for the drivers which already link to NAPI. Unless the driver is doing something very weird per-NAPI should imply per-netdev. Add netsec as well, Ilias indicates that it fits the mold. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-27net :mana :Add remaining GDMA stats for MANA to ethtoolShradha Gupta
Extend performance counter stats in 'ethtool -S <interface>' for MANA VF to include all GDMA stat counter. Tested-on: Ubuntu22 Testcases: 1. LISA testcase: PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic 2. LISA testcase: PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Link: https://lore.kernel.org/r/1700830950-803-1-git-send-email-shradhagupta@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-27net: mana: Use xdp_set_features_flag instead of direct assignmentKonstantin Taranov
This patch uses a helper function for assignment of xdp_features. This change simplifies backports. Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://lore.kernel.org/r/1698430011-21562-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05net: mana: Fix oversized sge0 for GSO packetsHaiyang Zhang
Handle the case when GSO SKB linear length is too large. MANA NIC requires GSO packets to put only the header part to SGE0, otherwise the TX queue may stop at the HW level. So, use 2 SGEs for the skb linear part which contains more than the packet header. Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-05net: mana: Fix the tso_bytes calculationHaiyang Zhang
sizeof(struct hop_jumbo_hdr) is not part of tso_bytes, so remove the subtraction from header size. Cc: stable@vger.kernel.org Fixes: bd7fc6e1957c ("net: mana: Add new MANA VF performance counters for easier troubleshooting") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-05net: mana: Fix TX CQE error handlingHaiyang Zhang
For an unknown TX CQE error type (probably from a newer hardware), still free the SKB, update the queue tail, etc., otherwise the accounting will be wrong. Also, TX errors can be triggered by injecting corrupted packets, so replace the WARN_ONCE to ratelimited error logging. Cc: stable@vger.kernel.org Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-11net: mana: Add gdma stats to ethtool output for manaShradha Gupta
Extended performance counter stats in 'ethtool -S <interface>' for MANA VF to include GDMA tx LSO packets and bytes count. Tested-on: Ubuntu22 Testcases: 1. LISA testcase: PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic 2. LISA testcase: PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV 3. Validated the GDMA stat packets and byte counters Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/intel/igc/igc_main.c 06b412589eef ("igc: Add lock to safeguard global Qbv variables") d3750076d464 ("igc: Add TransmissionOverrun counter") drivers/net/ethernet/microsoft/mana/mana_en.c a7dfeda6fdec ("net: mana: Fix MANA VF unload when hardware is unresponsive") a9ca9f9ceff3 ("page_pool: split types and declarations from page_pool.h") 92272ec4107e ("eth: add missing xdp.h includes in drivers") net/mptcp/protocol.h 511b90e39250 ("mptcp: fix disconnect vs accept race") b8dc6d6ce931 ("mptcp: fix rcv buffer auto-tuning") tools/testing/selftests/net/mptcp/mptcp_join.sh c8c101ae390a ("selftests: mptcp: join: fix 'implicit EP' test") 03668c65d153 ("selftests: mptcp: join: rework detailed report") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10net: mana: Fix MANA VF unload when hardware is unresponsiveSouradeep Chakrabarti
When unloading the MANA driver, mana_dealloc_queues() waits for the MANA hardware to complete any inflight packets and set the pending send count to zero. But if the hardware has failed, mana_dealloc_queues() could wait forever. Fix this by adding a timeout to the wait. Set the timeout to 120 seconds, which is a somewhat arbitrary value that is more than long enough for functional hardware to complete any sends. Cc: stable@vger.kernel.org Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Link: https://lore.kernel.org/r/1691576525-24271-1-git-send-email-schakrabarti@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07page_pool: split types and declarations from page_pool.hYunsheng Lin
Split types and pure function declarations from page_pool.h and add them in page_page/types.h, so that C sources can include page_pool.h and headers should generally only include page_pool/types.h as suggested by jakub. Rename page_pool.h to page_pool/helpers.h to have both in one place. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230804180529.2483231-2-aleksander.lobakin@intel.com [Jakub: change microsoft/mana, fix kdoc paths in Documentation] Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-06net: mana: Add page pool for RX buffersHaiyang Zhang
Add page pool for RX buffers for faster buffer cycle and reduce CPU usage. The standard page pool API is used. With iperf and 128 threads test, this patch improved the throughput by 12-15%, and decreased the IRQ associated CPU's usage from 99-100% to 10-50%. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-04net: mana: Configure hwc timeout from hardwareSouradeep Chakrabarti
At present hwc timeout value is a fixed value. This patch sets the hwc timeout from the hardware. It now uses a new hardware capability GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG to query and set the value in hwc_timeout. Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-03eth: add missing xdp.h includes in driversJakub Kicinski
Handful of drivers currently expect to get xdp.h by virtue of including netdevice.h. This will soon no longer be the case so add explicit includes. Reviewed-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-2-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-18net: mana: Use the correct WQE count for ringing RQ doorbellLong Li
The hardware specification specifies that WQE_COUNT should set to 0 for the Receive Queue. Although currently the hardware doesn't enforce the check, in the future releases it may check on this value. Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Long Li <longli@microsoft.com> Link: https://lore.kernel.org/r/1689622539-5334-3-git-send-email-longli@linuxonhyperv.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18net: mana: Batch ringing RX queue doorbell on receiving packetsLong Li
It's inefficient to ring the doorbell page every time a WQE is posted to the received queue. Excessive MMIO writes result in CPU spending more time waiting on LOCK instructions (atomic operations), resulting in poor scaling performance. Move the code for ringing doorbell page to where after we have posted all WQEs to the receive queue during a callback from napi_poll(). With this change, tests showed an improvement from 120G/s to 160G/s on a 200G physical link, with 16 or 32 hardware queues. Tests showed no regression in network latency benchmarks on single connection. Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Long Li <longli@microsoft.com> Link: https://lore.kernel.org/r/1689622539-5334-2-git-send-email-longli@linuxonhyperv.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "This cycle saw a focus on rxe and bnxt_re drivers: - Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma - rxe uses workqueues instead of tasklets - rxe has better compliance around access checks for MRs and rereg_mr - mana supportst he 'v2' FW interface for RX coalescing - hfi1 bug fix for stale cache entries in its MR cache - mlx5 buf fix to handle FW failures when destroying QPs - erdma HW has a new doorbell allocation mechanism for uverbs that is secure - Lots of small cleanups and rework in bnxt_re: - Use the common mmap functions - Support disassociation - Improve FW command flow - support for 'low latency push'" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits) RDMA/bnxt_re: Fix an IS_ERR() vs NULL check RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged" RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc() RDMA/bnxt_re: Remove incorrect return check from slow path RDMA/bnxt_re: Enable low latency push RDMA/bnxt_re: Reorg the bar mapping RDMA/bnxt_re: Move the interface version to chip context structure RDMA/bnxt_re: Query function capabilities from firmware RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage RDMA/bnxt_re: Add disassociate ucontext support RDMA/bnxt_re: Use the common mmap helper functions RDMA/bnxt_re: Initialize opcode while sending message RDMA/cma: Remove NULL check before dev_{put, hold} RDMA/rxe: Simplify cq->notify code RDMA/rxe: Fixes mr access supported list RDMA/bnxt_re: optimize the parameters passed to helper functions RDMA/bnxt_re: remove redundant cmdq_bitmap RDMA/bnxt_re: use firmware provided max request timeout RDMA/bnxt_re: cancel all control path command waiters upon error ...
2023-06-27Merge tag 'v6.4' into rdma.git for-nextJason Gunthorpe
Linux 6.4 Resolve conflicts between rdma rc and next in rxe_cq matching linux-next: drivers/infiniband/sw/rxe/rxe_cq.c: https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-27net: mana: use vmalloc_array and vcallocJulia Lawall
Use vmalloc_array and vcalloc to protect against multiplication overflows. The changes were done using the following Coccinelle semantic patch: // <smpl> @initialize:ocaml@ @@ let rename alloc = match alloc with "vmalloc" -> "vmalloc_array" | "vzalloc" -> "vcalloc" | _ -> failwith "unknown" @@ size_t e1,e2; constant C1, C2; expression E1, E2, COUNT, x1, x2, x3; typedef u8; typedef __u8; type t = {u8,__u8,char,unsigned char}; identifier alloc = {vmalloc,vzalloc}; fresh identifier realloc = script:ocaml(alloc) { rename alloc }; @@ ( alloc(x1*x2*x3) | alloc(C1 * C2) | alloc((sizeof(t)) * (COUNT), ...) | - alloc((e1) * (e2)) + realloc(e1, e2) | - alloc((e1) * (COUNT)) + realloc(COUNT, e1) | - alloc((E1) * (E2)) + realloc(E1, E2) ) // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Link: https://lore.kernel.org/r/20230627144339.144478-23-Julia.Lawall@inria.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-12net: mana: Add support for vlan taggingHaiyang Zhang
To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX skb. Then extract the vlan tag from the skb struct, and save it to tx_oob for the NIC to transmit. For vlan tags on the payload, they are accepted by the NIC too. For RX, extract the vlan tag from CQE and put it into skb. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-01RDMA/mana_ib: Use v2 version of cfg_rx_steer_req to enable RX coalescingLong Li
With RX coalescing, one CQE entry can be used to indicate multiple packets on the receive queue. This saves processing time and PCI bandwidth over the CQ. The MANA Ethernet driver also uses the v2 version of the protocol. It doesn't use RX coalescing and its behavior is not changed. Link: https://lore.kernel.org/r/1684045095-31228-1-git-send-email-longli@linuxonhyperv.com Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-30net: mana: Fix perf regression: remove rx_cqes, tx_cqes countersHaiyang Zhang
The apc->eth_stats.rx_cqes is one per NIC (vport), and it's on the frequent and parallel code path of all queues. So, r/w into this single shared variable by many threads on different CPUs creates a lot caching and memory overhead, hence perf regression. And, it's not accurate due to the high volume concurrent r/w. For example, a workload is iperf with 128 threads, and with RPS enabled. We saw perf regression of 25% with the previous patch adding the counters. And this patch eliminates the regression. Since the error path of mana_poll_rx_cq() already has warnings, so keeping the counter and convert it to a per-queue variable is not necessary. So, just remove this counter from this high frequency code path. Also, remove the tx_cqes counter for the same reason. We have warnings & other counters for errors on that path, and don't need to count every normal cqe processing. Cc: stable@vger.kernel.org Fixes: bd7fc6e1957c ("net: mana: Add new MANA VF performance counters for easier troubleshooting") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1685115537-31675-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-24net: mana: Check if netdev/napi_alloc_frag returns single pageHaiyang Zhang
netdev/napi_alloc_frag() may fall back to single page which is smaller than the requested size. Add error checking to avoid memory overwritten. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24net: mana: Rename mana_refill_rxoob and remove some empty linesHaiyang Zhang
Rename mana_refill_rxoob for naming consistency. And remove some empty lines between function call and error checking. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14net: mana: Add support for jumbo frameHaiyang Zhang
During probe, get the hardware-allowed max MTU by querying the device configuration. Users can select MTU up to the device limit. When XDP is in use, limit MTU settings so the buffer size is within one page. And, when MTU is set to a too large value, XDP is not allowed to run. Also, to prevent changing MTU fails, and leaves the NIC in a bad state, pre-allocate all buffers before starting the change. So in low memory condition, it will return error, without affecting the NIC. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14net: mana: Enable RX path to handle various MTU sizesHaiyang Zhang
Update RX data path to allocate and use RX queue DMA buffers with proper size based on potentially various MTU sizes. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14net: mana: Refactor RX buffer allocation code to prepare for various MTUHaiyang Zhang
Move out common buffer allocation code from mana_process_rx_cqe() and mana_alloc_rx_wqe() to helper functions. Refactor related variables so they can be changed in one place, and buffer sizes are in sync. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14net: mana: Use napi_build_skb in RX pathHaiyang Zhang
Use napi_build_skb() instead of build_skb() to take advantage of the NAPI percpu caches to obtain skbuff_head. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-24net: mana: Remove redundant pci_clear_masterCai Huoqing
Remove pci_clear_master to simplify the code, the bus-mastering is also cleared in do_pci_disable_device, like this: ./drivers/pci/pci.c:2197 static void do_pci_disable_device(struct pci_dev *dev) { u16 pci_command; pci_read_config_word(dev, PCI_COMMAND, &pci_command); if (pci_command & PCI_COMMAND_MASTER) { pci_command &= ~PCI_COMMAND_MASTER; pci_write_config_word(dev, PCI_COMMAND, pci_command); } pcibios_disable_device(dev); }. And dev->is_busmaster is set to 0 in pci_disable_device. Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17net: mana: Add new MANA VF performance counters for easier troubleshootingShradha Gupta
Extended performance counter stats in 'ethtool -S <interface>' output for MANA VF to facilitate troubleshooting. Tested-on: Ubuntu22 Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-10Daniel Borkmann says:Jakub Kicinski
==================== pull-request: bpf-next 2023-02-11 We've added 96 non-merge commits during the last 14 day(s) which contain a total of 152 files changed, 4884 insertions(+), 962 deletions(-). There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c between commit 5b246e533d01 ("ice: split probe into smaller functions") from the net-next tree and commit 66c0e13ad236 ("drivers: net: turn on XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev() is otherwise there a 2nd time, and add XDP features to the existing ice_cfg_netdev() one: [...] ice_set_netdev_features(netdev); netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT | NETDEV_XDP_ACT_XSK_ZEROCOPY; ice_set_ops(netdev); [...] Stephen's merge conflict mail: https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/ The main changes are: 1) Add support for BPF trampoline on s390x which finally allows to remove many test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich. 2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski. 3) Add capability to export the XDP features supported by the NIC. Along with that, add a XDP compliance test tool, from Lorenzo Bianconi & Marek Majtyka. 4) Add __bpf_kfunc tag for marking kernel functions as kfuncs, from David Vernet. 5) Add a deep dive documentation about the verifier's register liveness tracking algorithm, from Eduard Zingerman. 6) Fix and follow-up cleanups for resolve_btfids to be compiled as a host program to avoid cross compile issues, from Jiri Olsa & Ian Rogers. 7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted when testing on different NICs, from Jesper Dangaard Brouer. 8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang. 9) Extend libbpf to add an option for when the perf buffer should wake up, from Jon Doron. 10) Follow-up fix on xdp_metadata selftest to just consume on TX completion, from Stanislav Fomichev. 11) Extend the kfuncs.rst document with description on kfunc lifecycle & stability expectations, from David Vernet. 12) Fix bpftool prog profile to skip attaching to offline CPUs, from Tonghao Zhang. ==================== Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-07net: mana: Fix accessing freed irq affinity_hintHaiyang Zhang
After calling irq_set_affinity_and_hint(), the cpumask pointer is saved in desc->affinity_hint, and will be used later when reading /proc/irq/<num>/affinity_hint. So the cpumask variable needs to be persistent. Otherwise, we are accessing freed memory when reading the affinity_hint file. Also, need to clear affinity_hint before free_irq(), otherwise there is a one-time warning and stack trace during module unloading: [ 243.948687] WARNING: CPU: 10 PID: 1589 at kernel/irq/manage.c:1913 free_irq+0x318/0x360 ... [ 243.948753] Call Trace: [ 243.948754] <TASK> [ 243.948760] mana_gd_remove_irqs+0x78/0xc0 [mana] [ 243.948767] mana_gd_remove+0x3e/0x80 [mana] [ 243.948773] pci_device_remove+0x3d/0xb0 [ 243.948778] device_remove+0x46/0x70 [ 243.948782] device_release_driver_internal+0x1fe/0x280 [ 243.948785] driver_detach+0x4e/0xa0 [ 243.948787] bus_remove_driver+0x70/0xf0 [ 243.948789] driver_unregister+0x35/0x60 [ 243.948792] pci_unregister_driver+0x44/0x90 [ 243.948794] mana_driver_exit+0x14/0x3fe [mana] [ 243.948800] __do_sys_delete_module.constprop.0+0x185/0x2f0 To fix the bug, use the persistent mask, cpumask_of(cpu#), and set affinity_hint to NULL before freeing the IRQ, as required by free_irq(). Cc: stable@vger.kernel.org Fixes: 71fa6887eeca ("net: mana: Assign interrupts to CPUs based on NUMA nodes") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/1675718929-19565-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02drivers: net: turn on XDP featuresMarek Majtyka
A summary of the flags being set for various drivers is given below. Note that XDP_F_REDIRECT_TARGET and XDP_F_FRAG_TARGET are features that can be turned off and on at runtime. This means that these flags may be set and unset under RTNL lock protection by the driver. Hence, READ_ONCE must be used by code loading the flag value. Also, these flags are not used for synchronization against the availability of XDP resources on a device. It is merely a hint, and hence the read may race with the actual teardown of XDP resources on the device. This may change in the future, e.g. operations taking a reference on the XDP resources of the driver, and in turn inhibiting turning off this flag. However, for now, it can only be used as a hint to check whether device supports becoming a redirection target. Turn 'hw-offload' feature flag on for: - netronome (nfp) - netdevsim. Turn 'native' and 'zerocopy' features flags on for: - intel (i40e, ice, ixgbe, igc) - mellanox (mlx5). - stmmac - netronome (nfp) Turn 'native' features flags on for: - amazon (ena) - broadcom (bnxt) - freescale (dpaa, dpaa2, enetc) - funeth - intel (igb) - marvell (mvneta, mvpp2, octeontx2) - mellanox (mlx4) - mtk_eth_soc - qlogic (qede) - sfc - socionext (netsec) - ti (cpsw) - tap - tsnep - veth - xen - virtio_net. Turn 'basic' (tx, pass, aborted and drop) features flags on for: - netronome (nfp) - cavium (thunder) - hyperv. Turn 'redirect_target' feature flag on for: - amanzon (ena) - broadcom (bnxt) - freescale (dpaa, dpaa2) - intel (i40e, ice, igb, ixgbe) - ti (cpsw) - marvell (mvneta, mvpp2) - sfc - socionext (netsec) - qlogic (qede) - mellanox (mlx5) - tap - veth - virtio_net - xen Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Marek Majtyka <alardam@gmail.com> Link: https://lore.kernel.org/r/3eca9fafb308462f7edb1f58e451d59209aa07eb.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>