Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
|
|
SO_REUSEPORT per EUID.
If there is no TCP_LISTEN socket on a ephemeral port, we can bind multiple
sockets having SO_REUSEADDR to the same port. Then if all sockets bound to
the port have also SO_REUSEPORT enabled and have the same EUID, all of them
can be listened. This is not safe.
Let's say, an application has root privilege and binds sockets to an
ephemeral port with both of SO_REUSEADDR and SO_REUSEPORT. When none of
sockets is not listened yet, a malicious user can use sudo, exhaust
ephemeral ports, and bind sockets to the same ephemeral port, so he or she
can call listen and steal the port.
To prevent this issue, we must not bind more than one sockets that have the
same EUID and both of SO_REUSEADDR and SO_REUSEPORT.
On the other hand, if the sockets have different EUIDs, the issue above does
not occur. After sockets with different EUIDs are bound to the same port and
one of them is listened, no more socket can be listened. This is because the
condition below is evaluated true and listen() for the second socket fails.
} else if (!reuseport_ok ||
!reuseport || !sk2->sk_reuseport ||
rcu_access_pointer(sk->sk_reuseport_cb) ||
(sk2->sk_state != TCP_TIME_WAIT &&
!uid_eq(uid, sock_i_uid(sk2)))) {
if (inet_rcv_saddr_equal(sk, sk2, true))
break;
}
Therefore, on the same port, we cannot do listen() for multiple sockets with
different EUIDs and any other listen syscalls fail, so the problem does not
happen. In this case, we can still call connect() for other sockets that
cannot be listened, so we have to succeed to call bind() in order to fully
utilize 4-tuples.
Summarizing the above, we should be able to bind only one socket having
SO_REUSEADDR and SO_REUSEPORT per EUID.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
exhausted.
Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.
The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.
You can reproduce this issue by following instructions below.
1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
2. set SO_REUSEADDR to two sockets.
3. bind two sockets to (localhost, 0) and the latter fails.
Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.
This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.
1. setsockopt(sk1, SO_REUSEADDR)
2. setsockopt(sk2, SO_REUSEADDR)
3. bind(sk1, saddr, 0)
4. bind(sk2, saddr, 0)
5. connect(sk1, daddr)
6. listen(sk2)
If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.
The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When we get an ephemeral port, the relax is false, so the SO_REUSEADDR
conditions may be evaluated twice. We do not need to check the conditions
again.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There was a bug that was causing packets to be sent to the driver
without first calling dequeue() on the "child" qdisc. And the KASAN
report below shows that sending a packet without calling dequeue()
leads to bad results.
The problem is that when checking the last qdisc "child" we do not set
the returned skb to NULL, which can cause it to be sent to the driver,
and so after the skb is sent, it may be freed, and in some situations a
reference to it may still be in the child qdisc, because it was never
dequeued.
The crash log looks like this:
[ 19.937538] ==================================================================
[ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
[ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
[ 19.939612]
[ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
[ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
[ 19.941523] Call Trace:
[ 19.941774] <IRQ>
[ 19.941985] dump_stack+0x97/0xe0
[ 19.942323] print_address_description.constprop.0+0x3b/0x60
[ 19.942884] ? taprio_dequeue_soft+0x620/0x780
[ 19.943325] ? taprio_dequeue_soft+0x620/0x780
[ 19.943767] __kasan_report.cold+0x1a/0x32
[ 19.944173] ? taprio_dequeue_soft+0x620/0x780
[ 19.944612] kasan_report+0xe/0x20
[ 19.944954] taprio_dequeue_soft+0x620/0x780
[ 19.945380] __qdisc_run+0x164/0x18d0
[ 19.945749] net_tx_action+0x2c4/0x730
[ 19.946124] __do_softirq+0x268/0x7bc
[ 19.946491] irq_exit+0x17d/0x1b0
[ 19.946824] smp_apic_timer_interrupt+0xeb/0x380
[ 19.947280] apic_timer_interrupt+0xf/0x20
[ 19.947687] </IRQ>
[ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0
[ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
[ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
[ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
[ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
[ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
[ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 19.954408] ? default_idle_call+0x2e/0x70
[ 19.954816] ? default_idle+0x1f/0x2d0
[ 19.955192] default_idle_call+0x5e/0x70
[ 19.955584] do_idle+0x3d4/0x500
[ 19.955909] ? arch_cpu_idle_exit+0x40/0x40
[ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30
[ 19.956829] ? trace_hardirqs_on+0x30/0x160
[ 19.957242] cpu_startup_entry+0x19/0x20
[ 19.957633] start_secondary+0x2a6/0x380
[ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0
[ 19.958486] secondary_startup_64+0xa4/0xb0
[ 19.958921]
[ 19.959078] Allocated by task 33:
[ 19.959412] save_stack+0x1b/0x80
[ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 19.960222] kmem_cache_alloc+0xe4/0x230
[ 19.960617] __alloc_skb+0x91/0x510
[ 19.960967] ndisc_alloc_skb+0x133/0x330
[ 19.961358] ndisc_send_ns+0x134/0x810
[ 19.961735] addrconf_dad_work+0xad5/0xf80
[ 19.962144] process_one_work+0x78e/0x13a0
[ 19.962551] worker_thread+0x8f/0xfa0
[ 19.962919] kthread+0x2ba/0x3b0
[ 19.963242] ret_from_fork+0x3a/0x50
[ 19.963596]
[ 19.963753] Freed by task 33:
[ 19.964055] save_stack+0x1b/0x80
[ 19.964386] __kasan_slab_free+0x12f/0x180
[ 19.964830] kmem_cache_free+0x80/0x290
[ 19.965231] ip6_mc_input+0x38a/0x4d0
[ 19.965617] ipv6_rcv+0x1a4/0x1d0
[ 19.965948] __netif_receive_skb_one_core+0xf2/0x180
[ 19.966437] netif_receive_skb+0x8c/0x3c0
[ 19.966846] br_handle_frame_finish+0x779/0x1310
[ 19.967302] br_handle_frame+0x42a/0x830
[ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90
[ 19.968167] __netif_receive_skb_one_core+0x96/0x180
[ 19.968658] process_backlog+0x198/0x650
[ 19.969047] net_rx_action+0x2fa/0xaa0
[ 19.969420] __do_softirq+0x268/0x7bc
[ 19.969785]
[ 19.969940] The buggy address belongs to the object at ffff888112862840
[ 19.969940] which belongs to the cache skbuff_head_cache of size 224
[ 19.971202] The buggy address is located 140 bytes inside of
[ 19.971202] 224-byte region [ffff888112862840, ffff888112862920)
[ 19.972344] The buggy address belongs to the page:
[ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
[ 19.973930] flags: 0x8000000000010200(slab|head)
[ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
[ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
[ 19.975915] page dumped because: kasan: bad access detected
[ 19.976461] page_owner tracks the page as allocated
[ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
[ 19.978332] prep_new_page+0x24b/0x330
[ 19.978707] get_page_from_freelist+0x2057/0x2c90
[ 19.979170] __alloc_pages_nodemask+0x218/0x590
[ 19.979619] new_slab+0x9d/0x300
[ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0
[ 19.980421] __slab_alloc.constprop.0+0x30/0x60
[ 19.980870] kmem_cache_alloc+0x201/0x230
[ 19.981269] __alloc_skb+0x91/0x510
[ 19.981620] alloc_skb_with_frags+0x78/0x4a0
[ 19.982043] sock_alloc_send_pskb+0x5eb/0x750
[ 19.982476] unix_stream_sendmsg+0x399/0x7f0
[ 19.982904] sock_sendmsg+0xe2/0x110
[ 19.983262] ____sys_sendmsg+0x4de/0x6d0
[ 19.983660] ___sys_sendmsg+0xe4/0x160
[ 19.984032] __sys_sendmsg+0xab/0x130
[ 19.984396] do_syscall_64+0xe7/0xae0
[ 19.984761] page last free stack trace:
[ 19.985142] __free_pages_ok+0x432/0xbc0
[ 19.985533] qlist_free_all+0x56/0xc0
[ 19.985907] quarantine_reduce+0x149/0x170
[ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0
[ 19.986791] kmem_cache_alloc+0xe4/0x230
[ 19.987182] prepare_creds+0x24/0x440
[ 19.987548] do_faccessat+0x80/0x590
[ 19.987906] do_syscall_64+0xe7/0xae0
[ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 19.988775]
[ 19.988930] Memory state around the buggy address:
[ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 19.991529] ^
[ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
[ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Reported-by: Michael Schmidt <michael.schmidt@eti.uni-siegen.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 4cda75275f9f89f9485b0ca4d6950c95258a9bce
from net-next.
Brown bag time.
Michal noticed that this change doesn't work at all when
netif_set_real_num_tx_queues() gets called prior to an initial
dev_activate(), as for instance igb does.
Doing so dies with:
[ 40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
[ 40.586922] #PF: supervisor read access in kernel mode
[ 40.592668] #PF: error_code(0x0000) - not-present page
[ 40.598405] PGD 0 P4D 0
[ 40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
[ 40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G E 5.6.0-rc3-ethnl.50-default #1
[ 40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
[ 40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
[ 40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
[ 40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
[ 40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
[ 40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
[ 40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
[ 40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
[ 40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
[ 40.699750] FS: 00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
[ 40.708782] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
[ 40.723156] Call Trace:
[ 40.725888] dev_qdisc_set_real_num_tx_queues+0x61/0x90
[ 40.731725] netif_set_real_num_tx_queues+0x94/0x1d0
[ 40.737286] __igb_open+0x19a/0x5d0 [igb]
[ 40.741767] __dev_open+0xbb/0x150
[ 40.745567] __dev_change_flags+0x157/0x1a0
[ 40.750240] dev_change_flags+0x23/0x60
[...]
Fixes: 4cda75275f9f ("net: sched: make newly activated qdiscs visible")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Smatch complains about the indenting:
net/bluetooth/l2cap_sock.c:1027 l2cap_sock_recvmsg()
warn: inconsistent indenting
It looks like this is supposed to be an "else if" condition.
Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
According to Core Spec Version 5.2 | Vol 3, Part A 6.1.5,
the incoming L2CAP_ConfigReq should be handled during
OPEN state.
The section below shows the btmon trace when running
L2CAP/COS/CFD/BV-12-C before and after this change.
=== Before ===
...
> ACL Data RX: Handle 256 flags 0x02 dlen 12 #22
L2CAP: Connection Request (0x02) ident 2 len 4
PSM: 1 (0x0001)
Source CID: 65
< ACL Data TX: Handle 256 flags 0x00 dlen 16 #23
L2CAP: Connection Response (0x03) ident 2 len 8
Destination CID: 64
Source CID: 65
Result: Connection successful (0x0000)
Status: No further information available (0x0000)
< ACL Data TX: Handle 256 flags 0x00 dlen 12 #24
L2CAP: Configure Request (0x04) ident 2 len 4
Destination CID: 65
Flags: 0x0000
> HCI Event: Number of Completed Packets (0x13) plen 5 #25
Num handles: 1
Handle: 256
Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5 #26
Num handles: 1
Handle: 256
Count: 1
> ACL Data RX: Handle 256 flags 0x02 dlen 16 #27
L2CAP: Configure Request (0x04) ident 3 len 8
Destination CID: 64
Flags: 0x0000
Option: Unknown (0x10) [hint]
01 00 ..
< ACL Data TX: Handle 256 flags 0x00 dlen 18 #28
L2CAP: Configure Response (0x05) ident 3 len 10
Source CID: 65
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 672
> HCI Event: Number of Completed Packets (0x13) plen 5 #29
Num handles: 1
Handle: 256
Count: 1
> ACL Data RX: Handle 256 flags 0x02 dlen 14 #30
L2CAP: Configure Response (0x05) ident 2 len 6
Source CID: 64
Flags: 0x0000
Result: Success (0x0000)
> ACL Data RX: Handle 256 flags 0x02 dlen 20 #31
L2CAP: Configure Request (0x04) ident 3 len 12
Destination CID: 64
Flags: 0x0000
Option: Unknown (0x10) [hint]
01 00 91 02 11 11 ......
< ACL Data TX: Handle 256 flags 0x00 dlen 14 #32
L2CAP: Command Reject (0x01) ident 3 len 6
Reason: Invalid CID in request (0x0002)
Destination CID: 64
Source CID: 65
> HCI Event: Number of Completed Packets (0x13) plen 5 #33
Num handles: 1
Handle: 256
Count: 1
...
=== After ===
...
> ACL Data RX: Handle 256 flags 0x02 dlen 12 #22
L2CAP: Connection Request (0x02) ident 2 len 4
PSM: 1 (0x0001)
Source CID: 65
< ACL Data TX: Handle 256 flags 0x00 dlen 16 #23
L2CAP: Connection Response (0x03) ident 2 len 8
Destination CID: 64
Source CID: 65
Result: Connection successful (0x0000)
Status: No further information available (0x0000)
< ACL Data TX: Handle 256 flags 0x00 dlen 12 #24
L2CAP: Configure Request (0x04) ident 2 len 4
Destination CID: 65
Flags: 0x0000
> HCI Event: Number of Completed Packets (0x13) plen 5 #25
Num handles: 1
Handle: 256
Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5 #26
Num handles: 1
Handle: 256
Count: 1
> ACL Data RX: Handle 256 flags 0x02 dlen 16 #27
L2CAP: Configure Request (0x04) ident 3 len 8
Destination CID: 64
Flags: 0x0000
Option: Unknown (0x10) [hint]
01 00 ..
< ACL Data TX: Handle 256 flags 0x00 dlen 18 #28
L2CAP: Configure Response (0x05) ident 3 len 10
Source CID: 65
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 672
> HCI Event: Number of Completed Packets (0x13) plen 5 #29
Num handles: 1
Handle: 256
Count: 1
> ACL Data RX: Handle 256 flags 0x02 dlen 14 #30
L2CAP: Configure Response (0x05) ident 2 len 6
Source CID: 64
Flags: 0x0000
Result: Success (0x0000)
> ACL Data RX: Handle 256 flags 0x02 dlen 20 #31
L2CAP: Configure Request (0x04) ident 3 len 12
Destination CID: 64
Flags: 0x0000
Option: Unknown (0x10) [hint]
01 00 91 02 11 11 .....
< ACL Data TX: Handle 256 flags 0x00 dlen 18 #32
L2CAP: Configure Response (0x05) ident 3 len 10
Source CID: 65
Flags: 0x0000
Result: Success (0x0000)
Option: Maximum Transmission Unit (0x01) [mandatory]
MTU: 672
< ACL Data TX: Handle 256 flags 0x00 dlen 12 #33
L2CAP: Configure Request (0x04) ident 3 len 4
Destination CID: 65
Flags: 0x0000
> HCI Event: Number of Completed Packets (0x13) plen 5 #34
Num handles: 1
Handle: 256
Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5 #35
Num handles: 1
Handle: 256
Count: 1
...
Signed-off-by: Howard Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
In bluetooth core specification 4.2,
Vol 2, Part E, 7.8.9 LE Set Advertise Enable Command, it says
The Controller shall continue advertising until ...
or until a connection is created or ...
In these cases, advertising is then disabled.
Hence, advertising would be disabled before a connection is
established. In current kernel implementation, advertising would
be re-enabled when all connections are terminated.
The correct disconnection flow looks like
< HCI Command: Disconnect
> HCI Event: Command Status
Status: Success
> HCI Event: Disconnect Complete
Status: Success
Specifically, the last Disconnect Complete Event would trigger a
callback function hci_event.c:hci_disconn_complete_evt() to
cleanup the connection and re-enable advertising when proper.
However, sometimes, there might occur an exception in the controller
when disconnection is being executed. The disconnection flow might
then look like
< HCI Command: Disconnect
> HCI Event: Command Status
Status: Unknown Connection Identifier
Note that "> HCI Event: Disconnect Complete" is missing when such an
exception occurs. This would result in advertising staying disabled
forever since the connection in question is not cleaned up correctly.
To fix the controller exception issue, we need to do some connection
cleanup when the disconnect command status indicates an error.
Signed-off-by: Joseph Hwang <josephsih@chromium.org>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
the following packetdrill script
socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3
fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 flags[flag_h] nokey>
< S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscale 8,mpcapable v1 flags[flag_h] key[skey=2]>
> . 1:1(0) ack 1 win 256 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h] key[ckey,skey]>
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
fcntl(3, F_SETFL, O_RDWR) = 0
write(3, ..., 1000) = 1000
doesn't transmit 1KB data packet after a successful three-way-handshake,
using mp_capable with data as required by protocol v1, and write() hangs
forever:
PID: 973 TASK: ffff97dd399cae80 CPU: 1 COMMAND: "packetdrill"
#0 [ffffa9b94062fb78] __schedule at ffffffff9c90a000
#1 [ffffa9b94062fc08] schedule at ffffffff9c90a4a0
#2 [ffffa9b94062fc18] schedule_timeout at ffffffff9c90e00d
#3 [ffffa9b94062fc90] wait_woken at ffffffff9c120184
#4 [ffffa9b94062fcb0] sk_stream_wait_connect at ffffffff9c75b064
#5 [ffffa9b94062fd20] mptcp_sendmsg at ffffffff9c8e801c
#6 [ffffa9b94062fdc0] sock_sendmsg at ffffffff9c747324
#7 [ffffa9b94062fdd8] sock_write_iter at ffffffff9c7473c7
#8 [ffffa9b94062fe48] new_sync_write at ffffffff9c302976
#9 [ffffa9b94062fed0] vfs_write at ffffffff9c305685
#10 [ffffa9b94062ff00] ksys_write at ffffffff9c305985
#11 [ffffa9b94062ff38] do_syscall_64 at ffffffff9c004475
#12 [ffffa9b94062ff50] entry_SYSCALL_64_after_hwframe at ffffffff9ca0008c
RIP: 00007f959407eaf7 RSP: 00007ffe9e95a910 RFLAGS: 00000293
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f959407eaf7
RDX: 00000000000003e8 RSI: 0000000001785fe0 RDI: 0000000000000008
RBP: 0000000001785fe0 R8: 0000000000000000 R9: 0000000000000003
R10: 0000000000000007 R11: 0000000000000293 R12: 00000000000003e8
R13: 00007ffe9e95ae30 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
Fix it ensuring that socket state is TCP_ESTABLISHED on reception of the
third ack.
Fixes: 1954b86016cf ("mptcp: Check connection state before attempting send")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Locking newsk while still holding the listener lock triggered
a lockdep splat [1]
We can simply move the memcg code after we release the listener lock,
as this can also help if multiple threads are sharing a common listener.
Also fix a typo while reading socket sk_rmem_alloc.
[1]
WARNING: possible recursive locking detected
5.6.0-rc3-syzkaller #0 Not tainted
--------------------------------------------
syz-executor598/9524 is trying to acquire lock:
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
but task is already holding lock:
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sk_lock-AF_INET6);
lock(sk_lock-AF_INET6);
*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by syz-executor598/9524:
#0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
#0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445
stack backtrace:
CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x188/0x20d lib/dump_stack.c:118
print_deadlock_bug kernel/locking/lockdep.c:2370 [inline]
check_deadlock kernel/locking/lockdep.c:2411 [inline]
validate_chain kernel/locking/lockdep.c:2954 [inline]
__lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954
lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484
lock_sock_nested+0xc5/0x110 net/core/sock.c:2947
lock_sock include/net/sock.h:1541 [inline]
inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734
__sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758
__sys_accept4+0x53/0x90 net/socket.c:1809
__do_sys_accept4 net/socket.c:1821 [inline]
__se_sys_accept4 net/socket.c:1818 [inline]
__x64_sys_accept4+0x93/0xf0 net/socket.c:1818
do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4445c9
Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000
Fixes: d752a4986532 ("net: memcg: late association of sock to memcg")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The Internet Assigned Numbers Authority (IANA) has recently assigned
a protocol number value of 143 for Ethernet [1].
Before this assignment, encapsulation mechanisms such as Segment Routing
used the IPv6-NoNxt protocol number (59) to indicate that the encapsulated
payload is an Ethernet frame.
In this patch, we add the definition of the Ethernet protocol number to the
kernel headers and update the SRv6 L2 tunnels to use it.
[1] https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml
Signed-off-by: Paolo Lungaroni <paolo.lungaroni@cnit.it>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Acked-by: Ahmed Abdelsalam <ahmed.abdelsalam@gssi.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
By default, DSA drivers should configure CPU and DSA ports to their
maximum speed. In many configurations this is sufficient to make the
link work.
In some cases it is necessary to configure the link to run slower,
e.g. because of limitations of the SoC it is connected to. Or back to
back PHYs are used and the PHY needs to be driven in order to
establish link. In this case, phylink is used.
Only instantiate phylink if it is required. If there is no PHY, or no
fixed link properties, phylink can upset a link which works in the
default configuration.
Fixes: 0e27921816ad ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sparse reports a warning at netlink_seq_start()
warning: context imbalance in netlink_seq_start() - wrong count at exit
The root cause is the missing annotation at netlink_seq_start()
Add the missing __acquires(RCU) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sparse reports warning at tcp_child_process()
warning: context imbalance in tcp_child_process() - unexpected unlock
The root cause is the missing annotation at tcp_child_process()
Add the missing __releases(&((child)->sk_lock.slock)) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sparse reports warnings at raw_seq_start() and raw_seq_stop()
warning: context imbalance in raw_seq_start() - wrong count at exit
warning: context imbalance in raw_seq_stop() - unexpected unlock
The root cause is the missing annotations at raw_seq_start()
and raw_seq_stop()
Add the missing __acquires(&h->lock) annotation
Add the missing __releases(&h->lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In their .attach callback, mq[prio] only add the qdiscs of the currently
active TX queues to the device's qdisc hash list.
If a user later increases the number of active TX queues, their qdiscs
are not visible via eg. 'tc qdisc show'.
Add a hook to netif_set_real_num_tx_queues() that walks all active
TX queues and adds those which are missing to the hash list.
CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In one error case, tpacket_rcv drops packets after incrementing the
ring producer index.
If this happens, it does not update tp_status to TP_STATUS_USER and
thus the reader is stalled for an iteration of the ring, causing out
of order arrival.
The only such error path is when virtio_net_hdr_from_skb fails due
to encountering an unknown GSO type.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
caifdevs->list is traversed using list_for_each_entry_rcu()
outside an RCU read-side critical section but under the
protection of rtnl_mutex. Hence, add the corresponding lockdep
expression to silence the following false-positive warning:
[ 10.868467] =============================
[ 10.869082] WARNING: suspicious RCU usage
[ 10.869817] 5.6.0-rc1-00177-g06ec0a154aae4 #1 Not tainted
[ 10.870804] -----------------------------
[ 10.871557] net/caif/caif_dev.c:115 RCU-list traversed in non-reader section!!
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To prevent spurious wake ups, we disable any discovery or advertising
when we enter suspend and restore it when we exit suspend. While paused,
we disable any management requests to modify discovery or advertising.
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
To handle LE devices, we must first disable passive scanning and
disconnect all connected devices. Once that is complete, we update the
whitelist and re-enable scanning
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
To handle BR/EDR devices, we first disable page scan and disconnect all
connected devices. Once that is complete, we add event filters (for
devices that can wake the system) and re-enable page scan.
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Register for PM_SUSPEND_PREPARE and PM_POST_SUSPEND to make sure the
Bluetooth controller is prepared correctly for suspend/resume. Implement
the registration, scheduling and task handling portions only in this
patch.
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
If an error occurs during request building in add_advertising(),
remember to send MGMT_STATUS_FAILED command status back to bluetoothd.
Signed-off-by: Joseph Hwang <josephsih@chromium.org>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
This change fixes the off by one error in the erroneous command bit
masks which can lead to the erroneous data commands being sent to a
controller that doesn't support them.
Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
The bucket->lock is not needed in the sock_hash_free and sock_map_free
calls, in fact it is causing a splat due to being inside rcu block.
| BUG: sleeping function called from invalid context at net/core/sock.c:2935
| in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
| 3 locks held by kworker/0:1/62:
| #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
| #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
| #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
| CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
| dump_stack+0x71/0xa0
| ___might_sleep.cold+0xa6/0xb6
| lock_sock_nested+0x28/0x90
| sock_map_free+0x5f/0x180
| bpf_map_free_deferred+0x58/0x80
| process_one_work+0x260/0x5e0
| worker_thread+0x4d/0x3e0
| kthread+0x108/0x140
| ? process_one_work+0x5e0/0x5e0
| ? kthread_park+0x90/0x90
| ret_from_fork+0x3a/0x50
The reason we have stab->lock and bucket->locks in sockmap code is to
handle checking EEXIST in update/delete cases. We need to be careful during
an update operation that we check for EEXIST and we need to ensure that the
psock object is not in some partial state of removal/insertion while we do
this. So both map_update_common and sock_map_delete need to guard from being
run together potentially deleting an entry we are checking, etc. But by the
time we get to the tear-down code in sock_{ma[|hash}_free we have already
disconnected the map and we just did synchronize_rcu() in the line above so
no updates/deletes should be in flight. Because of this we can drop the
bucket locks from the map free'ing code, noting no update/deletes can be
in-flight.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/158385850787.30597.8346421465837046618.stgit@john-Precision-5820-Tower
|
|
When the RPA generation fails, indicate the error with a device specifc
error message.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
|
|
When trying to transmit to an unknown destination, the mesh code would
unconditionally transmit a HWMP PREQ even if HWMP is not the current
path selection algorithm.
Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Link: https://lore.kernel.org/r/20200305140409.12204-1-cavallar@lri.fr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add missing attribute validation for NL80211_ATTR_OPER_CLASS
to the netlink policy.
Fixes: 1057d35ede5d ("cfg80211: introduce TDLS channel switch commands")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-4-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add missing attribute validation for beacon report scanning
to the netlink policy.
Fixes: 1d76250bd34a ("nl80211: support beacon report scanning")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-3-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add missing attribute validation for critical protocol fields
to the netlink policy.
Fixes: 5de17984898c ("cfg80211: introduce critical protocol indication from user-space")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-2-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When pktgen is used to measure the performance of dev_queue_xmit()
packet handling in the core, it is preferable to not hand down
packets to a low-level Ethernet driver as it would distort the
measurements.
Allow using pktgen on the loopback device, thus constraining
measurements to core code.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
During IB device removal, cancel the event worker before the device
structure is freed.
Fixes: a4cf0443c414 ("smc: introduce SMC as an IB-client")
Reported-by: syzbot+b297c6825752e7a07272@syzkaller.appspotmail.com
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rafał found an issue that for non-Ethernet interface, if we down and up
frequently, the memory will be consumed slowly.
The reason is we add allnodes/allrouters addressed in multicast list in
ipv6_add_dev(). When link down, we call ipv6_mc_down(), store all multicast
addresses via mld_add_delrec(). But when link up, we don't call ipv6_mc_up()
for non-Ethernet interface to remove the addresses. This makes idev->mc_tomb
getting bigger and bigger. The call stack looks like:
addrconf_notify(NETDEV_REGISTER)
ipv6_add_dev
ipv6_dev_mc_inc(ff01::1)
ipv6_dev_mc_inc(ff02::1)
ipv6_dev_mc_inc(ff02::2)
addrconf_notify(NETDEV_UP)
addrconf_dev_config
/* Alas, we support only Ethernet autoconfiguration. */
return;
addrconf_notify(NETDEV_DOWN)
addrconf_ifdown
ipv6_mc_down
igmp6_group_dropped(ff02::2)
mld_add_delrec(ff02::2)
igmp6_group_dropped(ff02::1)
igmp6_group_dropped(ff01::1)
After investigating, I can't found a rule to disable multicast on
non-Ethernet interface. In RFC2460, the link could be Ethernet, PPP, ATM,
tunnels, etc. In IPv4, it doesn't check the dev type when calls ip_mc_up()
in inetdev_event(). Even for IPv6, we don't check the dev type and call
ipv6_add_dev(), ipv6_dev_mc_inc() after register device.
So I think it's OK to fix this memory consumer by calling ipv6_mc_up() for
non-Ethernet interface.
v2: Also check IFF_MULTICAST flag to make sure the interface supports
multicast
Reported-by: Rafał Miłecki <zajec5@gmail.com>
Tested-by: Rafał Miłecki <zajec5@gmail.com>
Fixes: 74235a25c673 ("[IPV6] addrconf: Fix IPv6 on tuntap tunnels")
Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when set link down")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If a TCP socket is allocated in IRQ context or cloned from unassociated
(i.e. not associated to a memcg) in IRQ context then it will remain
unassociated for its whole life. Almost half of the TCPs created on the
system are created in IRQ context, so, memory used by such sockets will
not be accounted by the memcg.
This issue is more widespread in cgroup v1 where network memory
accounting is opt-in but it can happen in cgroup v2 if the source socket
for the cloning was created in root memcg.
To fix the issue, just do the association of the sockets at the accept()
time in the process context and then force charge the memory buffer
already used and reserved by the socket.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This splits it into two parts, one that imports the message, and one
that imports the iovec. This allows a caller to only do the first part,
and import the iovec manually afterwards.
No functional changes in this patch.
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The mptcp rcvbuf size is adjusted according to the subflow rcvbuf size.
This should not be done if userspace did set a fixed value.
Fixes: 600911ff5f72bae ("mptcp: add rmem queue accounting")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- Avoid RCU list-traversal in spinlock, by Sven Eckelmann
- Replace zero-length array with flexible-array member,
by Gustavo A. R. Silva
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Don't schedule OGM for disabled interface, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In our production environment we have faced with problem that updating
classid in cgroup with heavy tasks cause long freeze of the file tables
in this tasks. By heavy tasks we understand tasks with many threads and
opened sockets (e.g. balancers). This freeze leads to an increase number
of client timeouts.
This patch implements following logic to fix this issue:
аfter iterating 1000 file descriptors file table lock will be released
thus providing a time gap for socket creation/deletion.
Now update is non atomic and socket may be skipped using calls:
dup2(oldfd, newfd);
close(oldfd);
But this case is not typical. Moreover before this patch skip is possible
too by hiding socket fd in unix socket buffer.
New sockets will be allocated with updated classid because cgroup state
is updated before start of the file descriptors iteration.
So in common cases this patch has no side effects.
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 105e808c1da2 ("pie: remove pie_vars->accu_prob_overflows")
changes the scale of probability values in PIE from (2^64 - 1) to
(2^56 - 1). This affects the precision of tc_pie_xstats->prob in
user space.
This patch ensures user space is unaffected.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add TCP_NLA_BYTES_NOTSENT to SCM_TIMESTAMPING_OPT_STATS that reports
bytes in the write queue but not sent. This is the same metric as
what is exported with tcp_info.tcpi_notsent_bytes.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Allow adding hashed UDP sockets to sockmaps.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-9-lmb@cloudflare.com
|
|
Add basic psock hooks for UDP sockets. This allows adding and
removing sockets, as well as automatic removal on unhash and close.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-8-lmb@cloudflare.com
|
|
We can take advantage of the fact that both callers of
sock_map_init_proto are holding a RCU read lock, and
have verified that psock is valid.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-7-lmb@cloudflare.com
|
|
The init, close and unhash handlers from TCP sockmap are generic,
and can be reused by UDP sockmap. Move the helpers into the sockmap code
base and expose them. This requires tcp_bpf_get_proto and tcp_bpf_clone to
be conditional on BPF_STREAM_PARSER.
The moved functions are unmodified, except that sk_psock_unlink is
renamed to sock_map_unlink to better match its behaviour.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-6-lmb@cloudflare.com
|
|
We need to ensure that sk->sk_prot uses certain callbacks, so that
code that directly calls e.g. tcp_sendmsg in certain corner cases
works. To avoid spurious asserts, we must to do this only if
sk_psock_update_proto has not yet been called. The same invariants
apply for tcp_bpf_check_v6_needs_rebuild, so move the call as well.
Doing so allows us to merge tcp_bpf_init and tcp_bpf_reinit.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-4-lmb@cloudflare.com
|
|
Only update psock->saved_* if psock->sk_proto has not been initialized
yet. This allows us to get rid of tcp_bpf_reinit_sk_prot.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-3-lmb@cloudflare.com
|
|
The sock map code checks that a socket does not have an active upper
layer protocol before inserting it into the map. This requires casting
via inet_csk, which isn't valid for UDP sockets.
Guard checks for ULP by checking inet_sk(sk)->is_icsk first.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-2-lmb@cloudflare.com
|
|
In commit 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and
fallback to priority") croup classid reporting was fixed. But this works
only for TCP sockets because for other socket types icsk parameter can
be NULL and classid code path is skipped. This change moves classid
handling to inet_diag_msg_attrs_fill() function.
Also inet_diag_msg_attrs_size() helper was added and addends in
nlmsg_new() were reordered to save order from inet_sk_diag_fill().
Fixes: 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and fallback to priority")
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|