Age | Commit message (Collapse) | Author |
|
The current implementation of L2CAP options negotiation will continue
the negotiation when a device responds with L2CAP_CONF_UNACCEPT ("unaccepted
options"), but not when the device replies with L2CAP_CONF_UNKNOWN ("unknown
options").
Trying to continue the negotiation without ERTM support will allow
Bluetooth-capable XBox One controllers (notably models 1708 and 1797)
to connect.
btmon before patch:
> ACL Data RX: Handle 256 flags 0x02 dlen 16 #64 [hci0] 59.182702
L2CAP: Connection Response (0x03) ident 2 len 8
Destination CID: 64
Source CID: 64
Result: Connection successful (0x0000)
Status: No further information available (0x0000)
< ACL Data TX: Handle 256 flags 0x00 dlen 23 #65 [hci0] 59.182744
L2CAP: Configure Request (0x04) ident 3 len 15
Destination CID: 64
Flags: 0x0000
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Basic (0x00)
TX window size: 0
Max transmit: 0
Retransmission timeout: 0
Monitor timeout: 0
Maximum PDU size: 0
> ACL Data RX: Handle 256 flags 0x02 dlen 16 #66 [hci0] 59.183948
L2CAP: Configure Request (0x04) ident 1 len 8
Destination CID: 64
Flags: 0x0000
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 1480
< ACL Data TX: Handle 256 flags 0x00 dlen 18 #67 [hci0] 59.183994
L2CAP: Configure Response (0x05) ident 1 len 10
Source CID: 64
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 1480
> ACL Data RX: Handle 256 flags 0x02 dlen 15 #69 [hci0] 59.187676
L2CAP: Configure Response (0x05) ident 3 len 7
Source CID: 64
Flags: 0x0000
Result: Failure - unknown options (0x0003)
04 .
< ACL Data TX: Handle 256 flags 0x00 dlen 12 #70 [hci0] 59.187722
L2CAP: Disconnection Request (0x06) ident 4 len 4
Destination CID: 64
Source CID: 64
> ACL Data RX: Handle 256 flags 0x02 dlen 12 #73 [hci0] 59.192714
L2CAP: Disconnection Response (0x07) ident 4 len 4
Destination CID: 64
Source CID: 64
btmon after patch:
> ACL Data RX: Handle 256 flags 0x02 dlen 16 #248 [hci0] 103.502970
L2CAP: Connection Response (0x03) ident 5 len 8
Destination CID: 65
Source CID: 65
Result: Connection pending (0x0001)
Status: No further information available (0x0000)
> ACL Data RX: Handle 256 flags 0x02 dlen 16 #249 [hci0] 103.504184
L2CAP: Connection Response (0x03) ident 5 len 8
Destination CID: 65
Source CID: 65
Result: Connection successful (0x0000)
Status: No further information available (0x0000)
< ACL Data TX: Handle 256 flags 0x00 dlen 23 #250 [hci0] 103.504398
L2CAP: Configure Request (0x04) ident 6 len 15
Destination CID: 65
Flags: 0x0000
Option: Retransmission and Flow Control (0x04) [mandatory]
Mode: Basic (0x00)
TX window size: 0
Max transmit: 0
Retransmission timeout: 0
Monitor timeout: 0
Maximum PDU size: 0
> ACL Data RX: Handle 256 flags 0x02 dlen 16 #251 [hci0] 103.505472
L2CAP: Configure Request (0x04) ident 3 len 8
Destination CID: 65
Flags: 0x0000
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 1480
< ACL Data TX: Handle 256 flags 0x00 dlen 18 #252 [hci0] 103.505689
L2CAP: Configure Response (0x05) ident 3 len 10
Source CID: 65
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 1480
> ACL Data RX: Handle 256 flags 0x02 dlen 15 #254 [hci0] 103.509165
L2CAP: Configure Response (0x05) ident 6 len 7
Source CID: 65
Flags: 0x0000
Result: Failure - unknown options (0x0003)
04 .
< ACL Data TX: Handle 256 flags 0x00 dlen 12 #255 [hci0] 103.509426
L2CAP: Configure Request (0x04) ident 7 len 4
Destination CID: 65
Flags: 0x0000
< ACL Data TX: Handle 256 flags 0x00 dlen 12 #257 [hci0] 103.511870
L2CAP: Connection Request (0x02) ident 8 len 4
PSM: 1 (0x0001)
Source CID: 66
> ACL Data RX: Handle 256 flags 0x02 dlen 14 #259 [hci0] 103.514121
L2CAP: Configure Response (0x05) ident 7 len 6
Source CID: 65
Flags: 0x0000
Result: Success (0x0000)
Signed-off-by: Florian Dollinger <dollinger.florian@gmx.de>
Co-developed-by: Florian Dollinger <dollinger.florian@gmx.de>
Reviewed-by: Luiz Augusto Von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Bluetooth Core Specification v5.2, Vol. 3, Part A, section 1.4, table
1.1:
'Start Fragments always either begin with the first octet of the Basic
L2CAP header of a PDU or they have a length of zero (see [Vol 2] Part
B, Section 6.6.2).'
Apparently this was changed by the following errata:
https://www.bluetooth.org/tse/errata_view.cfm?errata_id=10216
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
kmemleak report:
unreferenced object 0xffff9b1127f00500 (size 208):
comm "kworker/u17:2", pid 500, jiffies 4294937470 (age 580.136s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 60 ed 05 11 9b ff ff 00 00 00 00 00 00 00 00 .`..............
backtrace:
[<000000006ab3fd59>] kmem_cache_alloc_node+0x17a/0x480
[<0000000051a5f6f9>] __alloc_skb+0x5b/0x1d0
[<0000000037e2d252>] hci_prepare_cmd+0x32/0xc0 [bluetooth]
[<0000000010b586d5>] hci_req_add_ev+0x84/0xe0 [bluetooth]
[<00000000d2deb520>] hci_req_clear_event_filter+0x42/0x70 [bluetooth]
[<00000000f864bd8c>] hci_req_prepare_suspend+0x84/0x470 [bluetooth]
[<000000001deb2cc4>] hci_prepare_suspend+0x31/0x40 [bluetooth]
[<000000002677dd79>] process_one_work+0x209/0x3b0
[<00000000aaa62b07>] worker_thread+0x34/0x400
[<00000000826d176c>] kthread+0x126/0x140
[<000000002305e558>] ret_from_fork+0x22/0x30
unreferenced object 0xffff9b1125c6ee00 (size 512):
comm "kworker/u17:2", pid 500, jiffies 4294937470 (age 580.136s)
hex dump (first 32 bytes):
04 00 00 00 0d 00 00 00 05 0c 01 00 11 9b ff ff ................
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
backtrace:
[<000000009f07c0cc>] slab_post_alloc_hook+0x59/0x270
[<0000000049431dc2>] __kmalloc_node_track_caller+0x15f/0x330
[<00000000027a42f6>] __kmalloc_reserve.isra.70+0x31/0x90
[<00000000e8e3e76a>] __alloc_skb+0x87/0x1d0
[<0000000037e2d252>] hci_prepare_cmd+0x32/0xc0 [bluetooth]
[<0000000010b586d5>] hci_req_add_ev+0x84/0xe0 [bluetooth]
[<00000000d2deb520>] hci_req_clear_event_filter+0x42/0x70 [bluetooth]
[<00000000f864bd8c>] hci_req_prepare_suspend+0x84/0x470 [bluetooth]
[<000000001deb2cc4>] hci_prepare_suspend+0x31/0x40 [bluetooth]
[<000000002677dd79>] process_one_work+0x209/0x3b0
[<00000000aaa62b07>] worker_thread+0x34/0x400
[<00000000826d176c>] kthread+0x126/0x140
[<000000002305e558>] ret_from_fork+0x22/0x30
unreferenced object 0xffff9b112b395788 (size 8):
comm "kworker/u17:2", pid 500, jiffies 4294937470 (age 580.136s)
hex dump (first 8 bytes):
20 00 00 00 00 00 04 00 .......
backtrace:
[<0000000052dc28d2>] kmem_cache_alloc_trace+0x15e/0x460
[<0000000046147591>] alloc_ctrl_urb+0x52/0xe0 [btusb]
[<00000000a2ed3e9e>] btusb_send_frame+0x91/0x100 [btusb]
[<000000001e66030e>] hci_send_frame+0x7e/0xf0 [bluetooth]
[<00000000bf6b7269>] hci_cmd_work+0xc5/0x130 [bluetooth]
[<000000002677dd79>] process_one_work+0x209/0x3b0
[<00000000aaa62b07>] worker_thread+0x34/0x400
[<00000000826d176c>] kthread+0x126/0x140
[<000000002305e558>] ret_from_fork+0x22/0x30
In pm sleep-resume context, while the btusb device rebinds, it enters
hci_unregister_dev(), whilst there is a possibility of hdev receiving
PM_POST_SUSPEND suspend_notifier event, leading to generation of msg
frames. When hci_unregister_dev() completes, i.e. hdev context is
destroyed/freed, those intermittently sent msg frames cause memory
leak.
BUG details:
Below is stack trace of thread that enters hci_unregister_dev(), marks
the hdev flag HCI_UNREGISTER to 1, and then goes onto to wait on notifier
lock - refer unregister_pm_notifier().
hci_unregister_dev+0xa5/0x320 [bluetoot]
btusb_disconnect+0x68/0x150 [btusb]
usb_unbind_interface+0x77/0x250
? kernfs_remove_by_name_ns+0x75/0xa0
device_release_driver_internal+0xfe/0x1
device_release_driver+0x12/0x20
bus_remove_device+0xe1/0x150
device_del+0x192/0x3e0
? usb_remove_ep_devs+0x1f/0x30
usb_disable_device+0x92/0x1b0
usb_disconnect+0xc2/0x270
hub_event+0x9f6/0x15d0
? rpm_idle+0x23/0x360
? rpm_idle+0x26b/0x360
process_one_work+0x209/0x3b0
worker_thread+0x34/0x400
? process_one_work+0x3b0/0x3b0
kthread+0x126/0x140
? kthread_park+0x90/0x90
ret_from_fork+0x22/0x30
Below is stack trace of thread executing hci_suspend_notifier() which
processes the PM_POST_SUSPEND event, while the unbinding thread is
waiting on lock.
hci_suspend_notifier.cold.39+0x5/0x2b [bluetooth]
blocking_notifier_call_chain+0x69/0x90
pm_notifier_call_chain+0x1a/0x20
pm_suspend.cold.9+0x334/0x352
state_store+0x84/0xf0
kobj_attr_store+0x12/0x20
sysfs_kf_write+0x3b/0x40
kernfs_fop_write+0xda/0x1c0
vfs_write+0xbb/0x250
ksys_write+0x61/0xe0
__x64_sys_write+0x1a/0x20
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix hci_suspend_notifer(), not to act on events when flag HCI_UNREGISTER
is set.
Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Jump to the label done to decrement the reference count of HCI device
hdev on path that the Inquiry procedure is interrupted.
Fixes: 3e13fa1e1fab ("Bluetooth: Fix hci_inquiry ioctl usage")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Call hci_dev_put() to decrement reference count of HCI device hdev if
fails to duplicate memory.
Fixes: 0b26ab9dce74 ("Bluetooth: AMP: Handle Accept phylink command status evt")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
This adds logic to disable and reenable advertisement filters during
suspend and resume. After this patch, we would only receive packets from
devices in allow list during suspend.
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
When MSFT extension is supported, we don't have to interleave the scan
as we could just do allowlist scan.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Implements the feature to disable/enable the filter used for
advertising monitor on MSFT controller, effectively have the same
effect as "remove all monitors" and "add all previously removed
monitors".
This feature would be needed when suspending, where we would not want
to get packets from anything outside the allowlist. Note that the
integration with the suspending part is not included in this patch.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
When the controller is powered off, the registered advertising monitor
is removed from the controller. This patch handles the re-registration
of those monitors when the power is on.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Implements the monitor removal functionality for advertising monitor
offloading to MSFT controllers. Supply handle = 0 to remove all
monitors.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Enables advertising monitor offloading to the controller, if MSFT
extension is supported. The kernel won't adjust the monitor parameters
to match what the controller supports - that is the user space's
responsibility.
This patch only manages the addition of monitors. Monitor removal is
going to be handled by another patch.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
MSFT needs rssi parameter for monitoring advertisement packet,
therefore we should supply them from mgmt. This adds a new opcode
to add advertisement monitor with rssi parameters.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Clean up: The rq_argpages field was removed from struct svc_rqst in
the pre-git era.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The Receive completion handler doesn't look at the contents of the
Receive buffer. The DMA sync isn't terribly expensive but it's one
less thing that needs to be done by the Receive completion handler,
which is single-threaded (per svc_xprt). This helps scalability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
This is similar to commit e340c2d6ef2a ("xprtrdma: Reduce the
doorbell rate (Receive)") which added Receive batching to the
client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Clean up. We are not permitted to remove old proc files. Instead,
convert these variables to stubs that are only ever allowed to
display a value of zero.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Now that we have an efficient mechanism to update these two stats,
let's start maintaining them again.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Avoid the overhead of a memory bus lock cycle for counting a value
that is hardly every used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Receives are frequent events. Avoid the overhead of a memory bus
lock cycle for counting a value that is hardly every used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Setting up the proc variables is about to get more complicated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
We need the fixes in here and this resolves a merge issue in
drivers/tty/tty_io.c
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Upon receiving a cumulative ACK that changes the congestion state from
Disorder to Open, the TLP timer is not set. If the sender is app-limited,
it can only wait for the RTO timer to expire and retransmit.
The reason for this is that the TLP timer is set before the congestion
state changes in tcp_ack(), so we delay the time point of calling
tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and
remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer
is set.
This commit has two additional benefits:
1) Make sure to reset RTO according to RFC6298 when receiving ACK, to
avoid spurious RTO caused by RTO timer early expires.
2) Reduce the xmit timer reschedule once per ACK when the RACK reorder
timer is set.
Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") actually
not only added a support for fraglisted UDP GRO, but also tweaked
some logics the way that non-fraglisted UDP GRO started to work for
forwarding too.
Commit 2e4ef10f5850 ("net: add GSO UDP L4 and GSO fraglists to the
list of software-backed types") added GSO UDP L4 to the list of
software GSO to allow virtual netdevs to forward them as is up to
the real drivers.
Tests showed that currently forwarding and NATing of plain UDP GRO
packets are performed fully correctly, regardless if the target
netdevice has a support for hardware/driver GSO UDP L4 or not.
Add the last element and allow to form plain UDP GRO packets if
we are on forwarding path, and the new NETIF_F_GRO_UDP_FWD is
enabled on a receiving netdevice.
If both NETIF_F_GRO_FRAGLIST and NETIF_F_GRO_UDP_FWD are set,
fraglisted GRO takes precedence. This keeps the current behaviour
and is generally more optimal for now, as the number of NICs with
hardware USO offload is relatively small.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce a new netdev feature, NETIF_F_GRO_UDP_FWD, to allow user
to turn UDP GRO on and off for forwarding.
Defaults to off to not change current datapath.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the
timer has backoff with a max interval of about two minutes, the
actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes.
In this patch the TCP_USER_TIMEOUT is made more accurate by taking it
into account when computing the timer value for the 0-window probes.
This patch is similar to and builds on top of the one that made
TCP_USER_TIMEOUT accurate for RTOs in commit b701a99e431d ("tcp: Add
tcp_clamp_rto_to_user_timeout() helper to improve accuracy").
Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210122191306.GA99540@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Goto to the label put_dev instead of the label error to fix potential
resource leak on path that the target index is invalid.
Fixes: c4fbb6515a4d ("NFC: The core part should generate the target index")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Link: https://lore.kernel.org/r/20210121152748.98409-1-bianpan2016@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Put the device to avoid resource leak on path that the polling flag is
invalid.
Fixes: a831b9132065 ("NFC: Do not return EBUSY when stopping a poll that's already stopped")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Link: https://lore.kernel.org/r/20210121153745.122184-1-bianpan2016@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
None of these are actually used in the kernel/userspace interface -
there's a userspace component of implementing MRP, and userspace will
need to construct certain frames to put on the wire, but there's no
reason the kernel should provide the relevant definitions in a UAPI
header.
In fact, some of those definitions were broken until previous commit,
so only keep the few that are actually referenced in the kernel code,
and move them to the br_private_mrp.h header.
Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The switch ASIC has a limited capacity of physical ('flavour physical'
in devlink terminology) ports that it can support. While each system is
brought up with a different number of ports, this number can be
increased via splitting up to the ASIC's limit.
Expose physical ports as a devlink resource so that user space will have
visibility to the maximum number of ports that can be supported and the
current occupancy.
In addition, add a "Generic Resources" section in devlink-resource
documentation so the different drivers will be aligned by the same resource
name when exposing to user space.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This commit adds support for statistics of offloaded HTB. Bytes and
packets counters for leaf and inner nodes are supported, the values are
taken from per-queue qdiscs, and the numbers that the user sees should
have the same behavior as the software (non-offloaded) HTB.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.
In the offload mode, HTB passes control commands to the driver using
ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
and their settings (rate, ceil) in the NIC. Every modification of the
HTB tree caused by the admin results in ndo_setup_tc being called.
After this setup, the HTB algorithm is done completely in the NIC. An SQ
(send queue) is created for every leaf class and attached to the
hierarchy, so that the NIC can calculate and obey aggregated rate
limits, too. In the future, it can be changed, so that multiple SQs will
back a single leaf class.
ndo_select_queue is responsible for selecting the right queue that
serves the traffic class of each packet.
The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.
This solution addresses two main problems of scaling HTB:
1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:
# tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
classid 1:10
It's possible to move classification to clsact egress hook, which is
thread-safe and lock-free:
# tc filter add dev eth0 egress protocol ip flower dst_port 80
action skbedit priority 1:10
This way classification still happens in software, but the lock
contention is eliminated, and it happens before selecting the TX queue,
allowing the driver to translate the class to the corresponding hardware
queue in ndo_select_queue.
Note that this is already compatible with non-offloaded HTB and doesn't
require changes to the kernel nor iproute2.
2. Contention by handling packets. HTB is not multi-queue, it attaches
to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, it registers itself as a multi-queue qdisc,
similarly to mq: HTB is attached to the netdev, and each queue has its
own qdisc.
Some features of HTB may be not supported by some particular hardware,
for example, the maximum number of classes may be limited, the
granularity of rate and ceil parameters may be different, etc. - so, the
offload is not enabled by default, a new parameter is used to enable it:
# tc qdisc replace dev eth0 root handle 1: htb offload
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In a following commit, sch_htb will start using extack in the delete
class operation to pass hardware errors in offload mode. This commit
prepares for that by adding the extack parameter to this callback and
converting usage of the existing qdiscs.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_recvmsg() uses the CMSG mechanism to receive control information
like packet receive timestamps. This patch adds CMSG fields to
struct tcp_zerocopy_receive, and provides receive timestamps
if available to the user.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
At present, tcp_recvmsg() uses flags to track if any CMSGs are pending
and what those CMSGs are. These flags are currently magic numbers,
used only within tcp_recvmsg().
To prepare for receive timestamp support in tcp receive zerocopy,
gently refactor these magic numbers into enums.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Mark groups which were deleted due to fast leave/EHT.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A block report can result in empty source and host sets for both include
and exclude groups so if there are no hosts left we can safely remove
the group. Pull the block group handling so it can cover both cases and
add a check if EHT requires the delete.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We should be able to handle host filter mode changing. For exclude mode
we must create a zero-src entry so the group will be kept even without
any S,G entries (non-zero source sets). That entry doesn't count to the
entry limit and can always be created, its timer is refreshed on new
exclude reports and if we change the host filter mode to include then it
gets removed and we rely only on the non-zero source sets.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is an optimization specifically for TO_INCLUDE which sends queries
for the older entries and thus lowers the S,G timers to LMQT. If we have
the following situation for a group in either include or exclude mode:
- host A was interested in srcs X and Y, but is timing out
- host B sends TO_INCLUDE src Z, the bridge lowers X and Y's timeouts
to LMQT
- host B sends BLOCK src Z after LMQT time has passed
=> since host B is the last host we can delete the group, but if we
still have host A's EHT entries for X and Y (i.e. if they weren't
lowered to LMQT previously) then we'll have to wait another LMQT
time before deleting the group, with this optimization we can
directly remove it regardless of the group mode as there are no more
interested hosts
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for IGMPv3/MLDv2 include and exclude EHT handling. Similar to
how the reports are processed we have 2 cases when the group is in include
or exclude mode, these are processed as follows:
- group include
- is_include: create missing entries
- to_include: flush existing entries and create a new set from the
report, obviously if the src set is empty then we delete the group
- group exclude
- is_exclude: create missing entries
- to_exclude: flush existing entries and create a new set from the
report, any empty source set entries are removed
If the group is in a different mode then we just flush all entries reported
by the host and we create a new set with the new mode entries created from
the report. If the report is include type, the source list is empty and
the group has empty sources' set then we remove it. Any source set entries
which are empty are removed as well. If the group is in exclude mode it
can exist without any S,G entries (allowing for all traffic to pass).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for IGMPv3/MLDv2 allow/block EHT handling. Similar to how
the reports are processed we have 2 cases when the group is in include
or exclude mode, these are processed as follows:
- group include
- allow: create missing entries
- block: remove existing matching entries and remove the corresponding
S,G entries if there are no more set host entries, then possibly
delete the whole group if there are no more S,G entries
- group exclude
- allow
- host include: create missing entries
- host exclude: remove existing matching entries and remove the
corresponding S,G entries if there are no more set host entries
- block
- host include: remove existing matching entries and remove the
corresponding S,G entries if there are no more set host entries,
then possibly delete the whole group if there are no more S,G entries
- host exclude: create missing entries
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that we can delete set entries, we can use that to remove EHT hosts.
Since the group's host set entries exist only when there are related
source set entries we just have to flush all source set entries
joined by the host set entry and it will be automatically removed.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add EHT source set and set-entry create, delete and lookup functions.
These allow to manipulate source sets which contain their own host sets
with entries which joined that S,G. We're limiting the maximum number of
tracked S,G entries per host to PG_SRC_ENT_LIMIT (currently 32) which is
the current maximum of S,G entries for a group. There's a per-set timer
which will be used to destroy the whole set later.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add functions to create, destroy and lookup an EHT host. These are
per-host entries contained in the eht_host_tree in net_bridge_port_group
which are used to store a list of all sources (S,G) entries joined for that
group by each host, the host's current filter mode and total number of
joined entries.
No functional changes yet, these would be used in later patches.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add EHT structures for tracking hosts and sources per group. We keep one
set for each host which has all of the host's S,G entries, and one set for
each multicast source which has all hosts that have joined that S,G. For
each host, source entry we record the filter_mode and we keep an expiry
timer. There is also one global expiry timer per source set, it is
updated with each set entry update, it will be later used to lower the
set's timer instead of lowering each entry's timer separately.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We need to preserve the srcs pointer since we'll be passing it for EHT
handling later.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Prepare __grp_src_block_incl() for being able to cause a notification
due to changes. Currently it cannot happen, but EHT would change that
since we'll be deleting sources immediately. Make sure that if the pg is
deleted we don't return true as that would cause the caller to access
freed pg. This patch shouldn't cause any functional change.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We need to pass the host address so later it can be used for explicit
host tracking. No functional change.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rename src_size argument to addr_size in preparation for passing host
address as an argument to IGMPv3/MLDv2 functions.
No functional change.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
On MPTCP-level ack reception, the packet scheduler
may select a subflow other then the current one.
Prior to this commit we rely on the workqueue to trigger
action on such subflow.
This changeset introduces an infrastructure that allows
any MPTCP subflow to schedule actions (MPTCP xmit) on
others subflows without resorting to (multiple) process
reschedule.
A dummy NAPI instance is used instead. When MPTCP needs to
trigger action an a different subflow, it enqueues the target
subflow on the NAPI backlog and schedule such instance as needed.
The dummy NAPI poll method walks the sockets backlog and tries
to acquire the (BH) socket lock on each of them. If the socket
is owned by the user space, the action will be completed by
the sock release cb, otherwise push is started.
This change leverages the delegated action infrastructure
to avoid invoking the MPTCP worker to spool the pending data,
when the packet scheduler picks a subflow other then the one
currently processing the incoming MPTCP-level ack.
Additionally we further refine the subflow selection
invoking the packet scheduler for each chunk of data
even inside __mptcp_subflow_push_pending().
v1 -> v2:
- fix possible UaF at shutdown time, resetting sock ops
after removing the ulp context
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Otherwise the packet scheduler policy will not be
enforced when pushing pending data at MPTCP-level
ack reception time.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|