Age | Commit message (Collapse) | Author |
|
This commit shrinks inet_connection_sock by 4 bytes, by shrinking
icsk_mtup.enabled from 32 bits to 1 bit, and shrinking
icsk_mtup.probe_size from s32 to an unsuigned 31 bit field.
This is to save space to compensate for the recent introduction of a
new u32 in inet_connection_sock, icsk_probes_tstamp, in the recent bug
fix commit 9d9b1ee0b2d1 ("tcp: fix TCP_USER_TIMEOUT with zero window").
This should not change functionality, since icsk_mtup.enabled is only
ever set to 0 or 1, and icsk_mtup.probe_size can only be either 0
or a positive MTU value returned by tcp_mss_to_mtu()
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210129185438.1813237-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are use cases for which the existing tagger, based on the NPI
(Node Processor Interface) functionality, is insufficient.
Namely:
- Frames injected through the NPI port bypass the frame analyzer, so no
source address learning is performed, no TSN stream classification,
etc.
- Flow control is not functional over an NPI port (PAUSE frames are
encapsulated in the same Extraction Frame Header as all other frames)
- There can be at most one NPI port configured for an Ocelot switch. But
in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI
port is currently either disabled, or operated as a plain user port
(albeit an internally-facing one). Having the ability to configure the
two CPU ports symmetrically could pave the way for e.g. creating a LAG
between them, to increase bandwidth seamlessly for the system.
So there is a desire to have an alternative to the NPI mode. This change
keeps the default tagger for the Seville and Felix switches as "ocelot",
but it can be changed via the following device attribute:
echo ocelot-8021q > /sys/class/<dsa-master>/dsa/tagging
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch is to add csum offload support for gre header:
On the TX path in gre_build_header(), when CHECKSUM_PARTIAL's set
for inner proto, it will calculate the csum for outer proto, and
inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL
and csum_start/offset will be set for outer proto, and the outer
csum will be offloaded later.
On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is
not set for inner proto and the hardware supports csum offload,
CHECKSUM_PARTIAL and csum_start/offset will be set for outer
proto, and outer csum will be offloaded later. Otherwise, it
will do csum for outer proto by calling gso_make_checksum().
Note that SCTP has to do the csum by itself for non GSO path in
sctp_packet_pack(), as gre_build_header() can't handle the csum
with CHECKSUM_PARTIAL set for SCTP CRC csum offload.
v1->v2:
- remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM
and it will always do checksum for SCTP in sctp_packet_pack()
when it's not a GSO packet.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Give offloading drivers the direction of the offloaded ct flow,
this will be used for matches on direction (ct_state +/-rpl).
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some devices, e.g. the RTL8723BS bluetooth part, some USB attached devices,
completely drop from the bus on a system-suspend. These devices will
have their driver unbound and rebound on resume (when the dropping of
the bus gets detected) and will show up as a new HCI after resume.
These devices do not benefit from the suspend / resume handling work done
by the hci_suspend_notifier. At best this unnecessarily adds some time to
the suspend/resume time. But this may also actually cause problems, if the
code doing the driver unbinding runs after the pm-notifier then the
hci_suspend_notifier code will try to talk to a device which is now in
an uninitialized state.
This commit adds a new HCI_QUIRK_NO_SUSPEND_NOTIFIER quirk which allows
drivers to opt-out of the hci_suspend_notifier when they know beforehand
that their device will be fully re-initialized / reprobed on resume.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Currently there are only two types of in-kernel nexthop notification.
The two are distinguished by the 'is_grp' boolean field in 'struct
nh_notifier_info'.
As more notification types are introduced for more next-hop group types, a
boolean is not an easily extensible interface. Instead, convert it to an
enum.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The values that a next-hop group needs to keep track of depend on the group
type. Introduce a union to separate fields specific to the mpath groups
from fields specific to other group types.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Stop maintaining the skb_send_q list for TRANS_HIPER sockets.
Not only is it extra overhead, but keeping around a list of skb clones
means that we later also have to match the ->sk_txnotify() calls
against these clones and free them accordingly.
The current matching logic (comparing the skbs' shinfo location) is
frustratingly fragile, and breaks if the skb's head is mangled in any
sort of way while passing from dev_queue_xmit() to the device's
HW queue.
Also adjust the interface for ->sk_txnotify(), to make clear that we
don't actually care about any skb internals.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The TX code keeps track of all skbs that are in-flight but haven't
actually been sent out yet. For native IUCV sockets that's not a huge
deal, but with TRANS_HIPER sockets it would be much better if we
didn't need to maintain a list of skb clones.
Note that we actually only care about the _count_ of skbs in this stage
of the TX pipeline. So as prep work for removing the skb tracking on
TRANS_HIPER sockets, keep track of the skb count in a separate variable
and pair any list {enqueue, unlink} with a count {increment, decrement}.
Then replace all occurences where we currently look at the skb list's
fill level.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
drivers/net/can/dev.c
b552766c872f ("can: dev: prevent potential information leak in can_fill_info()")
3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir")
0a042c6ec991 ("can: dev: move netlink related code into seperate file")
Code move.
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
57ac4a31c483 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down")
214baf22870c ("net/mlx5e: Support HTB offload")
Adjacent code changes
net/switchdev/switchdev.c
20776b465c0c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP")
ffb68fc58e96 ("net: switchdev: remove the transaction structure from port object notifiers")
bae33f2b5afe ("net: switchdev: remove the transaction structure from port attributes")
Transaction parameter gets dropped otherwise keep the fix.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 subfunction support
Parav Pandit says:
This patchset introduces support for mlx5 subfunction (SF).
A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. mlx5 subfunction has its own function capabilities
and its own resources. This means a subfunction has its own dedicated
queues(txq, rxq, cq, eq). These queues are neither shared nor stolen from
the parent PCI function.
When subfunction is RDMA capable, it has its own QP1, GID table and rdma
resources neither shared nor stolen from the parent PCI function.
A subfunction has dedicated window in PCI BAR space that is not shared
with the other subfunctions or parent PCI function. This ensures that all
class devices of the subfunction accesses only assigned PCI BAR space.
A Subfunction supports eswitch representation through which it supports tc
offloads. User must configure eswitch to send/receive packets from/to
subfunction port.
Subfunctions share PCI level resources such as PCI MSI-X IRQs with
their other subfunctions and/or with its parent PCI function.
Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and extension [2] describes requirements, design and proposed
plumbing using devlink, auxiliary bus and sysfs for systemd/udev
support. Functionality of this patchset is best explained using real
examples further below.
overview:
--------
A subfunction can be created and deleted by a user using devlink port
add/delete interface.
A subfunction can be configured using devlink port function attribute
before its activated.
When a subfunction is activated, it results in an auxiliary device on
the host PCI device where it is deployed. A driver binds to the
auxiliary device that further creates supported class devices.
example subfunction usage sequence:
-----------------------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active
Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4
$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff
$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
After use inactivate the function:
$ devlink port function set ens2f0npf0sf88 state inactive
Now delete the subfunction port:
$ devlink port del ens2f0npf0sf88
[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2
=================
* tag 'mlx5-updates-2021-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Add devlink subfunction port documentation
devlink: Extend devlink port documentation for subfunctions
devlink: Add devlink port documentation
net/mlx5: SF, Port function state change support
net/mlx5: SF, Add port add delete functionality
net/mlx5: E-switch, Add eswitch helpers for SF vport
net/mlx5: E-switch, Prepare eswitch to handle SF vport
net/mlx5: SF, Add auxiliary device driver
net/mlx5: SF, Add auxiliary device support
net/mlx5: Introduce vhca state event notifier
devlink: Support get and set state of port function
devlink: Support add and delete devlink port
devlink: Introduce PCI SF port flavour and port attribute
devlink: Prepare code to fill multiple port function attributes
====================
Link: https://lore.kernel.org/r/20210122193658.282884-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add packet trap that can report packets that were dropped due to
destination MAC filtering.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
More updates:
* many minstrel improvements, including removal of the old
minstrel in favour of minstrel_ht
* speed improvements on FQ
* support for RX decapsulation (header conversion) offload
* RTNL reduction: limit RTNL usage in the wireless stack
mostly to where really needed (regulatory not yet) to
reduce contention on it
* tag 'mac80211-next-for-net-next-2021-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next: (24 commits)
mac80211: minstrel_ht: fix regression in the max_prob_rate fix
virt_wifi: fix deadlock on RTNL
cfg80211: avoid holding the RTNL when calling the driver
cfg80211: change netdev registration/unregistration semantics
mac80211: minstrel_ht: fix rounding error in throughput calculation
mac80211: minstrel_ht: increase stats update interval
mac80211: minstrel_ht: fix max probability rate selection
mac80211: minstrel_ht: improve sample rate selection
mac80211: minstrel_ht: improve ampdu length estimation
mac80211: minstrel_ht: remove old ewma based rate average code
mac80211: remove legacy minstrel rate control
mac80211: minstrel_ht: add support for OFDM rates on non-HT clients
mac80211: minstrel_ht: clean up CCK code
mac80211: introduce aql_enable node in debugfs
cfg80211: Add phyrate conversion support for extended MCS in 60GHz band
cfg80211: add VHT rate entries for MCS-10 and MCS-11
mac80211: reduce peer HE MCS/NSS to own capabilities
mac80211: remove NSS number of 160MHz if not support 160MHz for HE
mac80211_hwsim: add 6GHz channels
mac80211: add LDPC encoding to ieee80211_parse_tx_radiotap
...
====================
Link: https://lore.kernel.org/r/20210127210915.135550-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
to the privileged ones (< ip_unprivileged_port_start), but it will
be rejected later on in the __inet_bind or __inet6_bind.
Let's add another return value to indicate that CAP_NET_BIND_SERVICE
check should be ignored. Use the same idea as we currently use
in cgroup/egress where bit #1 indicates CN. Instead, for
cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should
be bypassed.
v5:
- rename flags to be less confusing (Andrey Ignatov)
- rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags
and accept BPF_RET_SET_CN (no behavioral changes)
v4:
- Add missing IPv6 support (Martin KaFai Lau)
v3:
- Update description (Martin KaFai Lau)
- Fix capability restore in selftest (Martin KaFai Lau)
v2:
- Switch to explicit return code (Martin KaFai Lau)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20210127193140.3170382-1-sdf@google.com
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Honor stateful expressions defined in the set from the dynset
extension. The set definition provides a stateful expression
that must be used by the dynset expression in case it is specified.
2) Missing timeout extension in the set element in the dynset
extension leads to inconsistent ruleset listing, not allowing
the user to restore timeout and expiration on ruleset reload.
3) Do not dump the stateful expression from the dynset extension
if it coming from the set definition.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nft_dynset: dump expressions when set definition contains no expressions
netfilter: nft_dynset: add timeout extension to template
netfilter: nft_dynset: honor stateful expressions in set definition
====================
Link: https://lore.kernel.org/r/20210127132512.5472-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This function is not used anymore by any extension, statify it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This new function combines the netlink register attribute parser
and the store validation function.
This update requires to replace:
enum nft_registers dreg:8;
in many of the expression private areas otherwise compiler complains
with:
error: cannot take address of bit-field ‘dreg’
when passing the register field as reference.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This new function combines the netlink register attribute parser
and the load validation function.
This update requires to replace:
enum nft_registers sreg:8;
in many of the expression private areas otherwise compiler complains
with:
error: cannot take address of bit-field ‘sreg’
when passing the register field as reference.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
For IPv4, default route is learned via DHCPv4 and user is allowed to change
metric using config etc/network/interfaces. But for IPv6, default route can
be learned via RA, for which, currently a fixed metric value 1024 is used.
Ideally, user should be able to configure metric on default route for IPv6
similar to IPv4. This patch adds sysctl for the same.
Logs:
For IPv4:
Config in etc/network/interfaces:
auto eth0
iface eth0 inet dhcp
metric 4261413864
IPv4 Kernel Route Table:
$ ip route list
default via 172.21.47.1 dev eth0 metric 4261413864
FRR Table, if a static route is configured:
[In real scenario, it is useful to prefer BGP learned default route over DHCPv4 default route.]
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
> - selected route, * - FIB route
S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03
K 0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m
i.e. User can prefer Default Router learned via Routing Protocol in IPv4.
Similar behavior is not possible for IPv6, without this fix.
After fix [for IPv6]:
sudo sysctl -w net.ipv6.conf.eth0.net.ipv6.conf.eth0.ra_defrtr_metric=1996489705
IP monitor: [When IPv6 RA is received]
default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 pref high
Kernel IPv6 routing table
$ ip -6 route list
default via fe80::be16:65ff:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 21sec hoplimit 64 pref high
FRR Table, if a static route is configured:
[In real scenario, it is useful to prefer BGP learned default route over IPv6 RA default route.]
Codes: K - kernel route, C - connected, S - static, R - RIPng,
O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
> - selected route, * - FIB route
S>* ::/0 [20/0] is directly connected, eth0, 00:00:06
K ::/0 [119/1001] via fe80::xx16:xxxx:feb3:ce8e, eth0, 6d07h43m
If the metric is changed later, the effect will be seen only when next IPv6
RA is received, because the default route must be fully controlled by RA msg.
Below metric is changed from 1996489705 to 1996489704.
$ sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489704
net.ipv6.conf.eth0.ra_defrtr_metric = 1996489704
IP monitor:
[On next IPv6 RA msg, Kernel deletes prev route and installs new route with updated metric]
Deleted default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high
default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489704 pref high
Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com>
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210125214430.24079-1-pchaudhary@linkedin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the lapb module, the timers may run concurrently with other code in
this module, and there is currently no locking to prevent the code from
racing on "struct lapb_cb". This patch adds locking to prevent racing.
1. Add "spinlock_t lock" to "struct lapb_cb"; Add "spin_lock_bh" and
"spin_unlock_bh" to APIs, timer functions and notifier functions.
2. Add "bool t1timer_stop, t2timer_stop" to "struct lapb_cb" to make us
able to ask running timers to abort; Modify "lapb_stop_t1timer" and
"lapb_stop_t2timer" to make them able to abort running timers;
Modify "lapb_t2timer_expiry" and "lapb_t1timer_expiry" to make them
abort after they are stopped by "lapb_stop_t1timer", "lapb_stop_t2timer",
and "lapb_start_t1timer", "lapb_start_t2timer".
3. Let lapb_unregister wait for other API functions and running timers
to stop.
4. The lapb_device_event function calls lapb_disconnect_request. In
order to avoid trying to hold the lock twice, add a new function named
"__lapb_disconnect_request" which assumes the lock is held, and make
it called by lapb_disconnect_request and lapb_device_event.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20210126040939.69995-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, _everything_ in cfg80211 holds the RTNL, and if you
have a slow USB device (or a few) you can get some bad lock
contention on that.
Fix that by re-adding a mutex to each wiphy/rdev as we had at
some point, so we have locking for the wireless_dev lists and
all the other things in there, and also so that drivers still
don't have to worry too much about it (they still won't get
parallel calls for a single device).
Then, we can restrict the RTNL to a few cases where we add or
remove interfaces and really need the added protection. Some
of the global list management still also uses the RTNL, since
we need to have it anyway for netdev management, but we only
hold the RTNL for very short periods of time here.
Link: https://lore.kernel.org/r/20210122161942.81df9f5e047a.I4a8e1a60b18863ea8c5e6d3a0faeafb2d45b2f40@changeid
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> [marvell driver issues]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add a placeholder field to calculate hash tuple offset. Similar to
2c407aca6497 ("netfilter: conntrack: avoid gcc-10 zero-length-bounds
warning").
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Bluetooth Core Specification v5.2, Vol. 3, Part A, section 1.4, table
1.1:
'Start Fragments always either begin with the first octet of the Basic
L2CAP header of a PDU or they have a length of zero (see [Vol 2] Part
B, Section 6.6.2).'
Apparently this was changed by the following errata:
https://www.bluetooth.org/tse/errata_view.cfm?errata_id=10216
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
This adds logic to disable and reenable advertisement filters during
suspend and resume. After this patch, we would only receive packets from
devices in allow list during suspend.
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Implements the monitor removal functionality for advertising monitor
offloading to MSFT controllers. Supply handle = 0 to remove all
monitors.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Enables advertising monitor offloading to the controller, if MSFT
extension is supported. The kernel won't adjust the monitor parameters
to match what the controller supports - that is the user space's
responsibility.
This patch only manages the addition of monitors. Monitor removal is
going to be handled by another patch.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
MSFT needs rssi parameter for monitoring advertisement packet,
therefore we should supply them from mgmt. This adds a new opcode
to add advertisement monitor with rssi parameters.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Upon receiving a cumulative ACK that changes the congestion state from
Disorder to Open, the TLP timer is not set. If the sender is app-limited,
it can only wait for the RTO timer to expire and retransmit.
The reason for this is that the TLP timer is set before the congestion
state changes in tcp_ack(), so we delay the time point of calling
tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and
remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer
is set.
This commit has two additional benefits:
1) Make sure to reset RTO according to RFC6298 when receiving ACK, to
avoid spurious RTO caused by RTO timer early expires.
2) Reduce the xmit timer reschedule once per ACK when the RACK reorder
timer is set.
Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the
timer has backoff with a max interval of about two minutes, the
actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes.
In this patch the TCP_USER_TIMEOUT is made more accurate by taking it
into account when computing the timer value for the 0-window probes.
This patch is similar to and builds on top of the one that made
TCP_USER_TIMEOUT accurate for RTOs in commit b701a99e431d ("tcp: Add
tcp_clamp_rto_to_user_timeout() helper to improve accuracy").
Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210122191306.GA99540@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The switch ASIC has a limited capacity of physical ('flavour physical'
in devlink terminology) ports that it can support. While each system is
brought up with a different number of ports, this number can be
increased via splitting up to the ASIC's limit.
Expose physical ports as a devlink resource so that user space will have
visibility to the maximum number of ports that can be supported and the
current occupancy.
In addition, add a "Generic Resources" section in devlink-resource
documentation so the different drivers will be aligned by the same resource
name when exposing to user space.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.
In the offload mode, HTB passes control commands to the driver using
ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
and their settings (rate, ceil) in the NIC. Every modification of the
HTB tree caused by the admin results in ndo_setup_tc being called.
After this setup, the HTB algorithm is done completely in the NIC. An SQ
(send queue) is created for every leaf class and attached to the
hierarchy, so that the NIC can calculate and obey aggregated rate
limits, too. In the future, it can be changed, so that multiple SQs will
back a single leaf class.
ndo_select_queue is responsible for selecting the right queue that
serves the traffic class of each packet.
The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.
This solution addresses two main problems of scaling HTB:
1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:
# tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
classid 1:10
It's possible to move classification to clsact egress hook, which is
thread-safe and lock-free:
# tc filter add dev eth0 egress protocol ip flower dst_port 80
action skbedit priority 1:10
This way classification still happens in software, but the lock
contention is eliminated, and it happens before selecting the TX queue,
allowing the driver to translate the class to the corresponding hardware
queue in ndo_select_queue.
Note that this is already compatible with non-offloaded HTB and doesn't
require changes to the kernel nor iproute2.
2. Contention by handling packets. HTB is not multi-queue, it attaches
to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, it registers itself as a multi-queue qdisc,
similarly to mq: HTB is attached to the netdev, and each queue has its
own qdisc.
Some features of HTB may be not supported by some particular hardware,
for example, the maximum number of classes may be limited, the
granularity of rate and ceil parameters may be different, etc. - so, the
offload is not enabled by default, a new parameter is used to enable it:
# tc qdisc replace dev eth0 root handle 1: htb offload
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In a following commit, sch_htb will start using extack in the delete
class operation to pass hardware errors in offload mode. This commit
prepares for that by adding the extack parameter to this callback and
converting usage of the existing qdiscs.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The existing qdiscs that set TCQ_F_MQROOT don't use sch_tree_lock.
However, hardware-offloaded HTB will start setting this flag while also
using sch_tree_lock.
The current implementation of sch_tree_lock basically locks on
qdisc->dev_queue->qdisc, and it works fine when the tree is attached to
some queue. However, it's not the case for MQROOT qdiscs: such a qdisc
is the root itself, and its dev_queue just points to queue 0, while not
actually being used, because there are real per-queue qdiscs.
This patch changes the logic of sch_tree_lock and sch_tree_unlock to
lock the qdisc itself if it's the MQROOT.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since the early retransmit has been removed by
commit bec41a11dd3d ("tcp: remove early retransmit"),
we also remove the unused ICSK_TIME_EARLY_RETRANS macro.
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611239473-27304-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
devlink port function can be in active or inactive state.
Allow users to get and set port function's state.
When the port function it activated, its operational state may change
after a while when the device is created and driver binds to it.
Similarly on deactivation flow.
To clearly describe the state of the port function and its device's
operational state in the host system, define state and opstate
attributes.
Example of a PCI SF port which supports a port function:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state inactive opstate detached
$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active
$ devlink port show pci/0000:06:00.0/32768 -jp
{
"port": {
"pci/0000:06:00.0/32768": {
"type": "eth",
"netdev": "ens2f0npf0sf88",
"flavour": "pcisf",
"controller": 0,
"pfnum": 0,
"sfnum": 88,
"external": false,
"splittable": false,
"function": {
"hw_addr": "00:00:00:00:88:88",
"state": "active",
"opstate": "attached"
}
}
}
}
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Extended devlink interface for the user to add and delete a port.
Extend devlink to connect user requests to driver to add/delete
a port in the device.
Driver routines are invoked without holding devlink instance lock.
This enables driver to perform several devlink objects registration,
unregistration such as (port, health reporter, resource etc) by using
existing devlink APIs.
This also helps to uniformly use the code for port unregistration
during driver unload and during port deletion initiated by user.
Examples of add, show and delete commands:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
$ udevadm test-builtin net_id /sys/class/net/eth6
Load module index
Parsed configuration file /usr/lib/systemd/network/99-default.link
Created link configuration context.
Using default interface naming scheme 'v245'.
ID_NET_NAMING_SCHEME=v245
ID_NET_NAME_PATH=enp6s0f0npf0sf88
ID_NET_NAME_SLOT=ens2f0npf0sf88
Unload module index
Unloaded link configuration context.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
A PCI sub-function (SF) represents a portion of the device similar
to PCI VF.
In an eswitch, PCI SF may have port which is normally represented
using a representor netdevice.
To have better visibility of eswitch port, its association with SF,
and its representor netdevice, introduce a PCI SF port flavour.
When devlink port flavour is PCI SF, fill up PCI SF attributes of the
port.
Extend port name creation using PCI PF and SF number scheme on best
effort basis, so that vendor drivers can skip defining their own
scheme.
This is done as cApfNSfM, where A, N and M are controller, PCI PF and
PCI SF number respectively.
This is similar to existing naming for PCI PF and PCI VF ports.
An example view of a PCI SF port:
$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
function:
hw_addr 00:00:00:00:88:88 state active opstate attached
$ devlink port show pci/0000:06:00.0/32768 -jp
{
"port": {
"pci/0000:06:00.0/32768": {
"type": "eth",
"netdev": "ens2f0npf0sf88",
"flavour": "pcisf",
"controller": 0,
"pfnum": 0,
"sfnum": 88,
"splittable": false,
"function": {
"hw_addr": "00:00:00:00:88:88",
"state": "active",
"opstate": "attached"
}
}
}
}
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
We used to not require anything in terms of registering netdevs
with cfg80211, using a netdev notifier instead. However, in the
next patch reducing RTNL locking, this causes big problems, and
the simplest way is to just require drivers to do things better.
Change the registration/unregistration semantics to require the
drivers to call cfg80211_(un)register_netdevice() when this is
happening due to a cfg80211 request, i.e. add_virtual_intf() or
del_virtual_intf() (or if it somehow has to happen in any other
cfg80211 callback).
Otherwise, in other contexts, drivers may continue to use the
normal netdev (un)registration functions as usual.
Internally, we still use the netdev notifier and track (by the
new wdev->registered bool) if the wdev had already been added
to cfg80211 or not.
Link: https://lore.kernel.org/r/20210122161942.cf2f4b65e4e9.Ida8234e50da13eb675b557bac52a713ad4eddf71@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The current phyrate conversion does not include extended MCS and provides
incorrect rates. Add a flag for extended MCS in DMG and add corresponding
phyrate table for the correct conversions using base MCS in DMG specs.
Signed-off-by: Max Chen <mxchen@codeaurora.org>
Link: https://lore.kernel.org/r/1609977050-7089-2-git-send-email-mxchen@codeaurora.org
[reduce data size, make a single WARN]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This allows drivers to pass 802.3 frames to mac80211, with some restrictions:
- the skb must be passed with a valid sta
- fast-rx needs to be active for the sta
- monitor mode needs to be disabled
mac80211 will tell the driver when it is safe to enable rx decap offload for
a particular station.
In order to implement support, a driver must:
- call ieee80211_hw_set(hw, SUPPORTS_RX_DECAP_OFFLOAD)
- implement ops->sta_set_decap_offload
- mark 802.3 frames with RX_FLAG_8023
If it doesn't want to enable offload for some vif types, it can mask out
IEEE80211_OFFLOAD_DECAP_ENABLED in vif->offload_flags from within the
.add_interface or .update_vif_offload driver ops
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-6-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
A sorted flow list is only needed to drop packets in the biggest flow when
hitting the overmemory condition.
By scanning flows only when needed, we can avoid paying the cost of
maintaining the list under normal conditions
In order to avoid scanning lots of empty flows and touching too many cold
cache lines, a bitmap of flows with backlog is maintained
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Simplifies the code and prepares for a rework of scanning for flows on
overmemory drop.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This is similar to what sch_fq_codel does. It also amortizes the worst
case cost of a follow-up patch that changes the selection of the biggest
flow for dropping packets
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This patch add the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to
match the ct_state with invalid for conntrack.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.
Without this patch:
3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt
|
--3.30%--__cgroup_bpf_run_filter_getsockopt
|
--0.81%--__kmalloc
With the patch applied:
0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern
Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have confliciting
definitions.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
|
|
Introduce xdp_build_skb_from_frame utility routine to build the skb
from xdp_frame. Respect to __xdp_build_skb_from_frame,
xdp_build_skb_from_frame will allocate the skb object. Rely on
xdp_build_skb_from_frame in veth driver.
Introduce missing xdp metadata support in veth_xdp_rcv_one routine.
Add missing metadata support in veth_xdp_rcv_one().
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce __xdp_build_skb_from_frame utility routine to build
the skb from xdp_frame. Rely on __xdp_build_skb_from_frame in
cpumap code.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/4f9f4c6b3dd3933770c617eb6689dbc0c6e25863.1610475660.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Conflicts:
drivers/net/can/dev.c
commit 03f16c5075b2 ("can: dev: can_restart: fix use after free bug")
commit 3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir")
Code move.
drivers/net/dsa/b53/b53_common.c
commit 8e4052c32d6b ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
commit b7a9e0da2d1c ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")
Field rename.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|