summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-03-29vsock: support sockmapBobby Eshleman
This patch adds sockmap support for vsock sockets. It is intended to be usable by all transports, but only the virtio and loopback transports are implemented. SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported. Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-28testing/vsock: add vsock_perf to gitignoreBobby Eshleman
This adds the vsock_perf binary to the gitignore file. Fixes: 8abbffd27ced ("test/vsock: vsock_perf utility") Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Reviewed-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20230327-vsock-add-vsock-perf-to-ignore-v1-1-f28a84f3606b@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28Merge branch 'ynl-add-support-for-user-headers-and-struct-attrs'Jakub Kicinski
Donald Hunter says: ==================== ynl: add support for user headers and struct attrs Add support for user headers and struct attrs to YNL. This patchset adds features to ynl and add a partial spec for openvswitch that demonstrates use of the features. Patch 1-4 add features to ynl Patch 5 adds partial openvswitch specs that demonstrate the new features Patch 6-7 add documentation for legacy structs and for sub-type ==================== Link: https://lore.kernel.org/r/20230327083138.96044-1-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28docs: netlink: document the sub-type attribute propertyDonald Hunter
Add a definition for sub-type to the protocol spec doc and a description of its usage for C arrays in genetlink-legacy. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28docs: netlink: document struct support for genetlink-legacyDonald Hunter
Describe the genetlink-legacy support for using struct definitions for fixed headers and for binary attributes. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28netlink: specs: add partial specification for openvswitchDonald Hunter
The openvswitch family has a fixed header, uses struct attrs and has array values. This partial spec demonstrates these features in the YNL CLI. These specs are sufficient to create, delete and dump datapaths and to dump vports: $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ovs_datapath.yaml \ --do dp-new --json '{ "dp-ifindex": 0, "name": "demo", "upcall-pid": 0}' None $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ovs_datapath.yaml \ --dump dp-get --json '{ "dp-ifindex": 0 }' [{'dp-ifindex': 3, 'masks-cache-size': 256, 'megaflow-stats': {'cache-hits': 0, 'mask-hit': 0, 'masks': 0, 'pad1': 0, 'padding': 0}, 'name': 'test', 'stats': {'flows': 0, 'hit': 0, 'lost': 0, 'missed': 0}, 'user-features': {'dispatch-upcall-per-cpu', 'tc-recirc-sharing', 'unaligned'}}, {'dp-ifindex': 48, 'masks-cache-size': 256, 'megaflow-stats': {'cache-hits': 0, 'mask-hit': 0, 'masks': 0, 'pad1': 0, 'padding': 0}, 'name': 'demo', 'stats': {'flows': 0, 'hit': 0, 'lost': 0, 'missed': 0}, 'user-features': set()}] $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ovs_datapath.yaml \ --do dp-del --json '{ "dp-ifindex": 0, "name": "demo"}' None $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ovs_vport.yaml \ --dump vport-get --json '{ "dp-ifindex": 3 }' [{'dp-ifindex': 3, 'ifindex': 3, 'name': 'test', 'port-no': 0, 'stats': {'rx-bytes': 0, 'rx-dropped': 0, 'rx-errors': 0, 'rx-packets': 0, 'tx-bytes': 0, 'tx-dropped': 0, 'tx-errors': 0, 'tx-packets': 0}, 'type': 'internal', 'upcall-pid': [0], 'upcall-stats': {'fail': 0, 'success': 0}}] Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28tools: ynl: Add fixed-header support to ynlDonald Hunter
Add support for netlink families that add an optional fixed header structure after the genetlink header and before any attributes. The fixed-header can be specified on a per op basis, or once for all operations, which serves as a default value that can be overridden. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28tools: ynl: Add struct attr decoding to ynlDonald Hunter
Add support for decoding attributes that contain C structs. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28tools: ynl: Add C array attribute decoding to ynlDonald Hunter
Add support for decoding C arrays from binay blobs in genetlink-legacy messages. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28tools: ynl: Add struct parsing to nlspecDonald Hunter
Add python classes for struct definitions to nlspec Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28Merge tag 'mlx5-updates-2023-03-20' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-03-20 mlx5 dynamic msix This patch series adds support for dynamic msix vectors allocation in mlx5. Eli Cohen Says: ================ The following series of patches modifies mlx5_core to work with the dynamic MSIX API. Currently, mlx5_core allocates all the interrupt vectors it needs and distributes them amongst the consumers. With the introduction of dynamic MSIX support, which allows for allocation of interrupts more than once, we now allocate vectors as we need them. This allows other drivers running on top of mlx5_core to allocate interrupt vectors for their own use. An example for this is mlx5_vdpa, which uses these vectors to propagate interrupts directly from the hardware to the vCPU [1]. As a preparation for using this series, a use after free issue is fixed in lib/cpu_rmap.c and the allocator for rmap entries has been modified. A complementary API for irq_cpu_rmap_add() has also been introduced. [1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e ================ * tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Provide external API for allocating vectors net/mlx5: Use one completion vector if eth is disabled net/mlx5: Refactor calculation of required completion vectors net/mlx5: Move devlink registration before mlx5_load net/mlx5: Use dynamic msix vectors allocation net/mlx5: Refactor completion irq request/release code net/mlx5: Improve naming of pci function vectors net/mlx5: Use newer affinity descriptor net/mlx5: Modify struct mlx5_irq to use struct msi_map net/mlx5: Fix wrong comment net/mlx5e: Coding style fix, add empty line lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add lib: cpu_rmap: Use allocator for rmap entries lib: cpu_rmap: Avoid use after free on rmap->obj array entries ==================== Link: https://lore.kernel.org/r/20230324231341.29808-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28docs: netdev: clarify the need to sending reverts as patchesJakub Kicinski
We don't state explicitly that reverts need to be submitted as a patch. It occasionally comes up. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20230327172646.2622943-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28net: ethernet: 8390: axnet_cs: remove unused xfer_count variableTom Rix
clang with W=1 reports drivers/net/ethernet/8390/axnet_cs.c:653:9: error: variable 'xfer_count' set but not used [-Werror,-Wunused-but-set-variable] int xfer_count = count; ^ This variable is not used so remove it. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230327235423.1777590-1-trix@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28Revert "sh_eth: remove open coded netif_running()"Wolfram Sang
This reverts commit ce1fdb065695f49ef6f126d35c1abbfe645d62d5. It turned out this actually introduces a race condition. netif_running() is not a suitable check for get_stats. Reported-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230327152112.15635-1-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28Merge branch 'net-refcount-address-dst_entry-reference-count-scalability-issues'Jakub Kicinski
Thomas Gleixner says: ==================== net, refcount: Address dst_entry reference count scalability issues This is version 3 of this series. Version 2 can be found here: https://lore.kernel.org/lkml/20230307125358.772287565@linutronix.de Wangyang and Arjan reported a bottleneck in the networking code related to struct dst_entry::__refcnt. Performance tanks massively when concurrency on a dst_entry increases. This happens when there are a large amount of connections to or from the same IP address. The memtier benchmark when run on the same host as memcached amplifies this massively. But even over real network connections this issue can be observed at an obviously smaller scale (due to the network bandwith limitations in my setup, i.e. 1Gb). How to reproduce: Run memcached with -t $N and memtier_benchmark with -t $M and --ratio=1:100 on the same machine. localhost connections amplify the problem. Start with the defaults for $N and $M and increase them. Depending on your machine this will tank at some point. But even in reasonably small $N, $M scenarios the refcount operations and the resulting false sharing fallout becomes visible in perf top. At some point it becomes the dominating issue. There are two factors which make this reference count a scalability issue: 1) False sharing dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into a seperate cacheline vs. the read mostly members located at the beginning of the struct. That prevents false sharing vs. the struct members in the first 64 bytes of the structure, but there is also dst_entry::lwtstate which is located after the reference count and in the same cache line. This member is read after a reference count has been acquired. The other problem is struct rtable, which embeds a struct dst_entry at offset 0. struct dst_entry has a size of 112 bytes, which means that the struct members of rtable which follow the dst member share the same cache line as dst_entry::__refcnt. Especially rtable::rt_genid is also read by the contexts which have a reference count acquired already. When dst_entry:__refcnt is incremented or decremented via an atomic operation these read accesses stall and contribute to the performance problem. 2) atomic_inc_not_zero() A reference on dst_entry:__refcnt is acquired via atomic_inc_not_zero() and released via atomic_dec_return(). atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop, which exposes O(N^2) behaviour under contention with N concurrent operations. Contention scalability is degrading with even a small amount of contenders and gets worse from there. Lightweight instrumentation exposed an average of 8!! retry loops per atomic_inc_not_zero() invocation in a inc()/dec() loop running concurrently on 112 CPUs. There is nothing which can be done to make atomic_inc_not_zero() more scalable. The following series addresses these issues: 1) Reorder and pad struct dst_entry to prevent the false sharing. 2) Implement and use a reference count implementation which avoids the atomic_inc_not_zero() problem. It is slightly less performant in the case of the final 0 -> -1 transition, but the deconstruction of these objects is a low frequency event. get()/put() pairs are in the hotpath and that's what this implementation optimizes for. The algorithm of this reference count is only suitable for RCU managed objects. Therefore it cannot replace the refcount_t algorithm, which is also based on atomic_inc_not_zero(), due to a subtle race condition related to the 0 -> -1 transition and the final verdict to mark the reference count dead. See details in patch 2/3. It might be just my lack of imagination which declares this to be impossible and I'd be happy to be proven wrong. As a bonus the new rcuref implementation provides underflow/overflow detection and mitigation while being performance wise on par with open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in the non-contended case. The combination of these two changes results in performance gains in micro benchmarks and also localhost and networked memtier benchmarks talking to memcached. It's hard to quantify the benchmark results as they depend heavily on the micro-architecture and the number of concurrent operations. The overall gain of both changes for localhost memtier ranges from 1.2X to 3.2X and from +2% to %5% range for networked operations on a 1Gb connection. A micro benchmark which enforces maximized concurrency shows a gain between 1.2X and 4.7X!!! Obviously this is focussed on a particular problem and therefore needs to be discussed in detail. It also requires wider testing outside of the cases which this is focussed on. Though the false sharing issue is obvious and should be addressed independent of the more focussed reference count changes. The series is also available from git: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rcuref Changes vs. V2: - Rename __refcnt to __rcuref (Linus) - Fix comments and changelogs (Mark, Qiuxu) - Fixup kernel doc of generated atomic_add_negative() variants I want to say thanks to Wangyang who analyzed the issue and provided the initial fix for the false sharing problem. Further thanks go to Arjan Peter, Marc, Will and Borislav for valuable input and providing test results on machines which I do not have access to, and to Linus and Eric, Qiuxu and Mark for helpful feedback. ==================== Link: https://lore.kernel.org/r/20230323102649.764958589@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28net: dst: Switch to rcuref_t reference countingThomas Gleixner
Under high contention dst_entry::__refcnt becomes a significant bottleneck. atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into high retry rates on contention. Switch the reference count to rcuref_t which results in a significant performance gain. Rename the reference count member to __rcuref to reflect the change. The gain depends on the micro-architecture and the number of concurrent operations and has been measured in the range of +25% to +130% with a localhost memtier/memcached benchmark which amplifies the problem massively. Running the memtier/memcached benchmark over a real (1Gb) network connection the conversion on top of the false sharing fix for struct dst_entry::__refcnt results in a total gain in the 2%-5% range over the upstream baseline. Reported-by: Wangyang Guo <wangyang.guo@intel.com> Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28net: dst: Prevent false sharing vs. dst_entry:: __refcntWangyang Guo
dst_entry::__refcnt is highly contended in scenarios where many connections happen from and to the same IP. The reference count is an atomic_t, so the reference count operations have to take the cache-line exclusive. Aside of the unavoidable reference count contention there is another significant problem which is caused by that: False sharing. perf top identified two affected read accesses. dst_entry::lwtstate and rtable::rt_genid. dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into a seperate cacheline vs. the read mostly members located at the beginning of the struct. That prevents false sharing vs. the struct members in the first 64 bytes of the structure, but there is also dst_entry::lwtstate which is located after the reference count and in the same cache line. This member is read after a reference count has been acquired. struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a size of 112 bytes, which means that the struct members of rtable which follow the dst member share the same cache line as dst_entry::__refcnt. Especially rtable::rt_genid is also read by the contexts which have a reference count acquired already. When dst_entry:__refcnt is incremented or decremented via an atomic operation these read accesses stall. This was found when analysing the memtier benchmark in 1:100 mode, which amplifies the problem extremly. Move the rt[6i]_uncached[_list] members out of struct rtable and struct rt6_info into struct dst_entry to provide padding and move the lwtstate member after that so it ends up in the same cache line. The resulting improvement depends on the micro-architecture and the number of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached benchmark. [ tglx: Rearrange struct ] Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230323102800.042297517@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28Merge branch 'locking/rcuref' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pulling rcurefs from Peter for tglx's work. Link: https://lore.kernel.org/all/20230328084534.GE4253@hirez.programming.kicks-ass.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28net/mlx5e: Fix build break on 32bitSaeed Mahameed
The cited commit caused the following build break in mlx5 due to a change in size of MAX_SKB_FRAGS. error: format '%lu' expects argument of type 'long unsigned int', but argument 7 has type 'unsigned int' [-Werror=format=] Fix this by explicit casting. Fixes: 3948b05950fd ("net: introduce a config option to tweak MAX_SKB_FRAGS") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230328200723.125122-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28net: ethernet: ti: am65-cpsw: enable p0 host port rx_vlan_remapGrygorii Strashko
By default, the tagged ingress packets to the switch from the host port P0 get internal switch priority assigned equal to the DMA CPPI channel number they came from, unless CPSW_P0_CONTROL_REG.RX_REMAP_VLAN is enabled. This causes issues with applying QoS policies and mapping packets on external port fifos, because the default configuration is vlan_aware and DMA CPPI channels are shared between all external ports. Hence enable CPSW_P0_CONTROL_REG.RX_REMAP_VLAN so packet will preserve internal switch priority assigned following the VLAN(priority) tag no matter through which DMA CPPI Channels packets enter the switch. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Link: https://lore.kernel.org/r/20230327092103.3256118-1-s-vadapalli@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-28net: ethernet: ti: am65-cpsw: add .ndo to set dma per-queue rateGrygorii Strashko
Enable rate limiting TX DMA queues for CPSW interface by configuring the rate in absolute Mb/s units per TX queue. Example: ethtool -L eth0 tx 4 echo 100 > /sys/class/net/eth0/queues/tx-0/tx_maxrate echo 200 > /sys/class/net/eth0/queues/tx-1/tx_maxrate echo 50 > /sys/class/net/eth0/queues/tx-2/tx_maxrate echo 30 > /sys/class/net/eth0/queues/tx-3/tx_maxrate # disable echo 0 > /sys/class/net/eth0/queues/tx-0/tx_maxrate Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Link: https://lore.kernel.org/r/20230327085758.3237155-1-s-vadapalli@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-28Merge branch 'allocate-multiple-skbuffs-on-tx'Paolo Abeni
Arseniy Krasnov says: ==================== allocate multiple skbuffs on tx This adds small optimization for tx path: instead of allocating single skbuff on every call to transport, allocate multiple skbuff's until credit space allows, thus trying to send as much as possible data without return to af_vsock.c. Also this patchset includes second patch which adds check and return from 'virtio_transport_get_credit()' and 'virtio_transport_put_credit()' when these functions are called with 0 argument. This is needed, because zero argument makes both functions to behave as no-effect, but both of them always tries to acquire spinlock. Moreover, first patch always calls function 'virtio_transport_put_credit()' with zero argument in case of successful packet transmission. ==================== Link: https://lore.kernel.org/r/b0d15942-65ba-3a32-ba8d-fed64332d8f6@sberdevices.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-28virtio/vsock: check argument to avoid no effect callArseniy Krasnov
Both of these functions have no effect when input argument is 0, so to avoid useless spinlock access, check argument before it. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-28virtio/vsock: allocate multiple skbuffs on txArseniy Krasnov
This adds small optimization for tx path: instead of allocating single skbuff on every call to transport, allocate multiple skbuff's until credit space allows, thus trying to send as much as possible data without return to af_vsock.c. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-28atomics: Provide rcuref - scalable reference countingThomas Gleixner
atomic_t based reference counting, including refcount_t, uses atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is implemented with a atomic_try_cmpxchg() loop. High contention of the reference count leads to retry loops and scales badly. There is nothing to improve on this implementation as the semantics have to be preserved. Provide rcuref as a scalable alternative solution which is suitable for RCU managed objects. Similar to refcount_t it comes with overflow and underflow detection and mitigation. rcuref treats the underlying atomic_t as an unsigned integer and partitions this space into zones: 0x00000000 - 0x7FFFFFFF valid zone (1 .. (INT_MAX + 1) references) 0x80000000 - 0xBFFFFFFF saturation zone 0xC0000000 - 0xFFFFFFFE dead zone 0xFFFFFFFF no reference rcuref_get() unconditionally increments the reference count with atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the reference count with atomic_add_negative_release(). This unconditional increment avoids the inc_not_zero() problem, but requires a more complex implementation on the put() side when the count drops from 0 to -1. When this transition is detected then it is attempted to mark the reference count dead, by setting it to the midpoint of the dead zone with a single atomic_cmpxchg_release() operation. This operation can fail due to a concurrent rcuref_get() elevating the reference count from -1 to 0 again. If the unconditional increment in rcuref_get() hits a reference count which is marked dead (or saturated) it will detect it after the fact and bring back the reference count to the midpoint of the respective zone. The zones provide enough tolerance which makes it practically impossible to escape from a zone. The racy implementation of rcuref_put() requires to protect rcuref_put() against a grace period ending in order to prevent a subtle use after free. As RCU is the only mechanism which allows to protect against that, it is not possible to fully replace the atomic_inc_not_zero() based implementation of refcount_t with this scheme. The final drop is slightly more expensive than the atomic_dec_return() counterpart, but that's not the case which this is optimized for. The optimization is on the high frequeunt get()/put() pairs and their scalability. The performance of an uncontended rcuref_get()/put() pair where the put() is not dropping the last reference is still on par with the plain atomic operations, while at the same time providing overflow and underflow detection and mitigation. The performance of rcuref compared to plain atomic_inc_not_zero() and atomic_dec_return() based reference counting under contention: - Micro benchmark: All CPUs running a increment/decrement loop on an elevated reference count, which means the 0 to -1 transition never happens. The performance gain depends on microarchitecture and the number of CPUs and has been observed in the range of 1.3X to 4.7X - Conversion of dst_entry::__refcnt to rcuref and testing with the localhost memtier/memcached benchmark. That benchmark shows the reference count contention prominently. The performance gain depends on microarchitecture and the number of CPUs and has been observed in the range of 1.1X to 2.6X over the previous fix for the false sharing issue vs. struct dst_entry::__refcnt. When memtier is run over a real 1Gb network connection, there is a small gain on top of the false sharing fix. The two changes combined result in a 2%-5% total gain for that networked test. Reported-by: Wangyang Guo <wangyang.guo@intel.com> Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230323102800.158429195@linutronix.de
2023-03-28atomics: Provide atomic_add_negative() variantsThomas Gleixner
atomic_add_negative() does not provide the relaxed/acquire/release variants. Provide them in preparation for a new scalable reference count algorithm. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230323102800.101763813@linutronix.de
2023-03-27Merge tag 'linux-can-next-for-6.4-20230327' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2023-03-27 The first 2 patches by Geert Uytterhoeven add transceiver support and improve the error messages in the rcar_canfd driver. Cai Huoqing contributes 3 patches which remove a redundant call to pci_clear_master() in the c_can, ctucanfd and kvaser_pciefd driver. Frank Jungclaus's patch replaces the struct esd_usb_msg with a union in the esd_usb driver to improve readability. Markus Schneider-Pargmann contributes 5 patches to improve the performance in the m_can driver, especially for SPI attached controllers like the tcan4x5x. * tag 'linux-can-next-for-6.4-20230327' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: m_can: Keep interrupts enabled during peripheral read can: m_can: Disable unused interrupts can: m_can: Remove double interrupt enable can: m_can: Always acknowledge all interrupts can: m_can: Remove repeated check for is_peripheral can: esd_usb: Improve code readability by means of replacing struct esd_usb_msg with a union can: kvaser_pciefd: Remove redundant pci_clear_master can: ctucanfd: Remove redundant pci_clear_master can: c_can: Remove redundant pci_clear_master can: rcar_canfd: Improve error messages can: rcar_canfd: Add transceiver support ==================== Link: https://lore.kernel.org/r/20230327073354.1003134-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27Merge branch 'add-tx-push-buf-len-param-to-ethtool'Jakub Kicinski
Shay Agroskin says: ==================== Add tx push buf len param to ethtool This patchset adds a new sub-configuration to ethtool get/set queue params (ethtool -g) called 'tx-push-buf-len'. This configuration specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device ('push' mode). The motivation for pushing some of the bytes to the device has the advantages of - Allowing a smart device to take fast actions based on the packet's header - Reducing latency for small packets that can be copied completely into the device This new param is practically similar to tx-copybreak value that can be set using ethtool's tunable but conceptually serves a different purpose. While tx-copybreak is used to reduce the overhead of DMA mapping and makes no sense to use if less than the whole segment gets copied, tx-push-buf-len allows to improve performance by analyzing the packet's data (usually headers) before performing the DMA operation. The configuration can be queried and set using the commands: $ ethtool -g [interface] # ethtool -G [interface] tx-push-buf-len [number of bytes] This patchset also adds support for the new configuration in ENA driver for which this parameter ensures efficient resources management on the device side. ==================== Link: https://lore.kernel.org/r/20230323163610.1281468-1-shayagr@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27net: ena: Advertise TX push supportShay Agroskin
LLQ is auto enabled by the device and disabling it isn't supported on new ENA generations while on old ones causes sub-optimal performance. This patch adds advertisement of push-mode when LLQ mode is used, but rejects an attempt to modify it. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27net: ena: Add support to changing tx_push_buf_lenShay Agroskin
The ENA driver allows for two distinct values for the number of bytes of the packet's payload that can be written directly to the device. For a value of 224 the driver turns on Large LLQ Header mode in which the first 224 of the packet's payload are written to the LLQ. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27net: ena: Recalculate TX state variables every device resetShay Agroskin
With the ability to modify LLQ entry size, the size of packet's payload that can be written directly to the device changes. This patch makes the driver recalculate this information every device negotiation (also called device reset). Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27net: ena: Add an option to configure large LLQ headersDavid Arinzon
Allow configuring the device with large LLQ headers. The Low Latency Queue (LLQ) allows the driver to write the first N bytes of the packet, along with the rest of the TX descriptors directly into device (N can be either 96 or 224 for large LLQ headers configuration). Having L4 TCP/UDP headers contained in the first 96 bytes of the packet is required to get maximum performance from the device. Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27net: ena: Make few cosmetic preparations to support large LLQShay Agroskin
Move ena_calc_io_queue_size() implementation closer to the file's beginning so that it can be later called from ena_device_init() function without adding a function declaration. Also add an empty line at some spots to separate logical blocks in funcitons. Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27ethtool: Add support for configuring tx_push_buf_lenShay Agroskin
This attribute, which is part of ethtool's ring param configuration allows the user to specify the maximum number of the packet's payload that can be written directly to the device. Example usage: # ethtool -G [interface] tx-push-buf-len [number of bytes] Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27netlink: Add a macro to set policy message with format stringShay Agroskin
Similar to NL_SET_ERR_MSG_FMT, add a macro which sets netlink policy error message with a format string. Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27qed: remove unused num_ooo_add_to_peninsula variableTom Rix
clang with W=1 reports drivers/net/ethernet/qlogic/qed/qed_ll2.c:649:6: error: variable 'num_ooo_add_to_peninsula' set but not used [-Werror,-Wunused-but-set-variable] u32 num_ooo_add_to_peninsula = 0, cid; ^ This variable is not used so remove it. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230326001733.1343274-1-trix@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27net: introduce a config option to tweak MAX_SKB_FRAGSEric Dumazet
Currently, MAX_SKB_FRAGS value is 17. For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg() attempts order-3 allocations, stuffing 32768 bytes per frag. But with zero copy, we use order-0 pages. For BIG TCP to show its full potential, we add a config option to be able to fit up to 45 segments per skb. This is also needed for BIG TCP rx zerocopy, as zerocopy currently does not support skbs with frag list. We have used MAX_SKB_FRAGS=45 value for years at Google before we deployed 4K MTU, with no adverse effect, other than a recent issue in mlx4, fixed in commit 26782aad00cc ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS") Back then, goal was to be able to receive full size (64KB) GRO packets without the frag_list overhead. Note that /proc/sys/net/core/max_skb_frags can also be used to limit the number of fragments TCP can use in tx packets. By default we keep the old/legacy value of 17 until we get more coverage for the updated values. Sizes of struct skb_shared_info on 64bit arches MAX_SKB_FRAGS | sizeof(struct skb_shared_info): ============================================== 17 320 21 320+64 = 384 25 320+128 = 448 29 320+192 = 512 33 320+256 = 576 37 320+320 = 640 41 320+384 = 704 45 320+448 = 768 This inflation might cause problems for drivers assuming they could pack both the incoming packet (for MTU=1500) and skb_shared_info in half a page, using build_skb(). v3: fix build error when CONFIG_NET=n v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long" Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20230323162842.1935061-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-27dev_ioctl: fix a W=1 warningHeiner Kallweit
This fixes the following warning when compiled with GCC 12.2.0 and W=1. net/core/dev_ioctl.c:475: warning: Function parameter or member 'data' not described in 'dev_ioctl' Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27net: phy: bcm7xxx: use devm_clk_get_optional_enabled to simplify the codeHeiner Kallweit
Use devm_clk_get_optional_enabled to simplify the code. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27tools: ynl: default to treating enums as flags for mask generationJakub Kicinski
I was a bit too optimistic in commit bf51d27704c9 ("tools: ynl: fix get_mask utility routine"), not every mask we use is necessarily coming from an enum of type "flags". We also allow flipping an enum into flags on per-attribute basis. That's done by the 'enum-as-flags' property of an attribute. Restore this functionality, it's not currently used by any in-tree family. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27selftests: tls: add a test for queuing data before setting the ULPJakub Kicinski
Other tests set up the connection fully on both ends before communicating any data. Add a test which will queue up TLS records to TCP before the TLS ULP is installed. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27tools: ynl: Add missing types to encode/decodeMichal Michalik
While testing the tool I noticed we miss the u16 type on payload create. On the code inspection it turned out we miss also u64 - add them. We also miss the decoding of u16 despite the fact `NlAttr` class supports it - add it. Signed-off-by: Michal Michalik <michal.michalik@intel.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27Merge branch 'sunhme-cleanups'David S. Miller
Sean Anderson says: ==================== net: sunhme: Probe/IRQ cleanups Well, I've had these patches kicking around in my tree since last October, so I guess I had better get around to posting them. This series is mainly a cleanup/consolidation of the probe process, with some interrupt changes as well. Some of these changes are SBUS- (AKA SPARC-) specific, so this should really get some testing there as well to ensure nothing breaks. I've CC'd a few SPARC mailing lists in hopes that someone there can try this out. I also have an SBUS card I ordered by mistake if anyone has a SPARC computer but lacks this card. Changes in v4: - Tweak variable order for yuletide - Move uninitialized return to its own commit - Use correct SBUS/PCI accessors - Rework hme_version to set the default in pci/sbus_probe and override it (if necessary) in common_probe Changes in v3: - Incorperate a fix from another series into this commit Changes in v2: - Move happy_meal_begin_auto_negotiation earlier and remove forward declaration - Make some more includes common - Clean up mac address init - Inline error returns ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27net: sunhme: Consolidate common probe tasksSean Anderson
Most of the second half of the PCI/SBUS probe functions are the same. Consolidate them into a common function. Signed-off-by: Sean Anderson <seanga2@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27net: sunhme: Inline error returnsSean Anderson
The err_out label used to have cleanup. Now that it just returns, inline it everywhere. Signed-off-by: Sean Anderson <seanga2@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27net: sunhme: Clean up mac address initSean Anderson
Clean up some oddities suggested during review. Signed-off-by: Sean Anderson <seanga2@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27net: sunhme: Consolidate mac address initializationSean Anderson
The mac address initialization is braodly the same between PCI and SBUS, and one was clearly copied from the other. Consolidate them. We still have to have some ifdefs because pci_(un)map_rom is only implemented for PCI, and idprom is only implemented for SPARC. Signed-off-by: Sean Anderson <seanga2@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27net: sunhme: Switch SBUS to devresSean Anderson
The PCI half of this driver was converted in commit 914d9b2711dd ("sunhme: switch to devres"). Do the same for the SBUS half. Signed-off-by: Sean Anderson <seanga2@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27net: sunhme: Alphabetize includesSean Anderson
Alphabetize includes to make it clearer where to add new ones. Signed-off-by: Sean Anderson <seanga2@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-27net: sunhme: Unify IRQ requestingSean Anderson
Instead of registering one interrupt handler for all four SBUS Quattro HMEs, let each HME register its own handler. To make this work, we don't handle the IRQ if none of the status bits are set. This reduces the complexity of the driver, and makes it easier to ensure things happen before/after enabling IRQs. I'm not really sure why we request IRQs in two different places (and leave them running after removing the driver!). A lot of things in this driver seem to just be crusty, and not necessarily intentional. I'm assuming that's the case here as well. This really needs to be tested by someone with an SBUS Quattro card. Signed-off-by: Sean Anderson <seanga2@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>