summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-03-03rtnetlink: using dev_base_seq from target netzhang kai
Signed-off-by: zhang kai <zhangkaiheb@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-03net: 9p: advance iov on empty readJisheng Zhang
I met below warning when cating a small size(about 80bytes) txt file on 9pfs(msize=2097152 is passed to 9p mount option), the reason is we miss iov_iter_advance() if the read count is 0 for zerocopy case, so we didn't truncate the pipe, then iov_iter_pipe() thinks the pipe is full. Fix it by removing the exception for 0 to ensure to call iov_iter_advance() even on empty read for zerocopy case. [ 8.279568] WARNING: CPU: 0 PID: 39 at lib/iov_iter.c:1203 iov_iter_pipe+0x31/0x40 [ 8.280028] Modules linked in: [ 8.280561] CPU: 0 PID: 39 Comm: cat Not tainted 5.11.0+ #6 [ 8.281260] RIP: 0010:iov_iter_pipe+0x31/0x40 [ 8.281974] Code: 2b 42 54 39 42 5c 76 22 c7 07 20 00 00 00 48 89 57 18 8b 42 50 48 c7 47 08 b [ 8.283169] RSP: 0018:ffff888000cbbd80 EFLAGS: 00000246 [ 8.283512] RAX: 0000000000000010 RBX: ffff888000117d00 RCX: 0000000000000000 [ 8.283876] RDX: ffff88800031d600 RSI: 0000000000000000 RDI: ffff888000cbbd90 [ 8.284244] RBP: ffff888000cbbe38 R08: 0000000000000000 R09: ffff8880008d2058 [ 8.284605] R10: 0000000000000002 R11: ffff888000375510 R12: 0000000000000050 [ 8.284964] R13: ffff888000cbbe80 R14: 0000000000000050 R15: ffff88800031d600 [ 8.285439] FS: 00007f24fd8af600(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000 [ 8.285844] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8.286150] CR2: 00007f24fd7d7b90 CR3: 0000000000c97000 CR4: 00000000000406b0 [ 8.286710] Call Trace: [ 8.288279] generic_file_splice_read+0x31/0x1a0 [ 8.289273] ? do_splice_to+0x2f/0x90 [ 8.289511] splice_direct_to_actor+0xcc/0x220 [ 8.289788] ? pipe_to_sendpage+0xa0/0xa0 [ 8.290052] do_splice_direct+0x8b/0xd0 [ 8.290314] do_sendfile+0x1ad/0x470 [ 8.290576] do_syscall_64+0x2d/0x40 [ 8.290818] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 8.291409] RIP: 0033:0x7f24fd7dca0a [ 8.292511] Code: c3 0f 1f 80 00 00 00 00 4c 89 d2 4c 89 c6 e9 bd fd ff ff 0f 1f 44 00 00 31 8 [ 8.293360] RSP: 002b:00007ffc20932818 EFLAGS: 00000206 ORIG_RAX: 0000000000000028 [ 8.293800] RAX: ffffffffffffffda RBX: 0000000001000000 RCX: 00007f24fd7dca0a [ 8.294153] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 [ 8.294504] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 [ 8.294867] R10: 0000000001000000 R11: 0000000000000206 R12: 0000000000000003 [ 8.295217] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000 [ 8.295782] ---[ end trace 63317af81b3ca24b ]--- Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-03net: l2tp: reduce log level of messages in receive path, add counter insteadMatthias Schiffer
Commit 5ee759cda51b ("l2tp: use standard API for warning log messages") changed a number of warnings about invalid packets in the receive path so that they are always shown, instead of only when a special L2TP debug flag is set. Even with rate limiting these warnings can easily cause significant log spam - potentially triggered by a malicious party sending invalid packets on purpose. In addition these warnings were noticed by projects like Tunneldigger [1], which uses L2TP for its data path, but implements its own control protocol (which is sufficiently different from L2TP data packets that it would always be passed up to userspace even with future extensions of L2TP). Some of the warnings were already redundant, as l2tp_stats has a counter for these packets. This commit adds one additional counter for invalid packets that are passed up to userspace. Packets with unknown session are not counted as invalid, as there is nothing wrong with the format of these packets. With the additional counter, all of these messages are either redundant or benign, so we reduce them to pr_debug_ratelimited(). [1] https://github.com/wlanslovenija/tunneldigger/issues/160 Fixes: 5ee759cda51b ("l2tp: use standard API for warning log messages") Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-03Bluetooth: Allow scannable adv with extended MGMT APIsDaniel Winkler
An issue was found, where if a bluetooth client requests a broadcast advertisement with scan response data, it will not be properly registered with the controller. This is because at the time that the hci_cp_le_set_scan_param structure is created, the scan response will not yet have been received since it comes in a second MGMT call. With empty scan response, the request defaults to a non-scannable PDU type. On some controllers, the subsequent scan response request will fail due to incorrect PDU type, and others will succeed and not use the scan response. This fix allows the advertising parameters MGMT call to include a flag to let the kernel know whether a scan response will be coming, so that the correct PDU type is used in the first place. A bluetoothd change is also incoming to take advantage of it. To test this, I created a broadcast advertisement with scan response data and registered it on the hatch chromebook. Without this change, the request fails, and with it will succeed. Reviewed-by: Alain Michaud <alainm@chromium.org> Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Daniel Winkler <danielwinkler@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-03-03Bluetooth: Remove unneeded commands for suspendAbhishek Pandit-Subedi
During suspend, there are a few scan enable and set event filter commands that don't need to be sent unless there are actual BR/EDR devices capable of waking the system. Check the HCI_PSCAN bit before writing scan enable and use a new dev flag, HCI_EVENT_FILTER_CONFIGURED to control whether to clear the event filter. Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Reviewed-by: Archie Pusaka <apusaka@chromium.org> Reviewed-by: Alain Michaud <alainm@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-03-03xfrm: Use actual socket sk instead of skb socket for xfrm_output_resumeEvan Nimmo
A situation can occur where the interface bound to the sk is different to the interface bound to the sk attached to the skb. The interface bound to the sk is the correct one however this information is lost inside xfrm_output2 and instead the sk on the skb is used in xfrm_output_resume instead. This assumes that the sk bound interface and the bound interface attached to the sk within the skb are the same which can lead to lookup failures inside ip_route_me_harder resulting in the packet being dropped. We have an l2tp v3 tunnel with ipsec protection. The tunnel is in the global VRF however we have an encapsulated dot1q tunnel interface that is within a different VRF. We also have a mangle rule that marks the packets causing them to be processed inside ip_route_me_harder. Prior to commit 31c70d5956fc ("l2tp: keep original skb ownership") this worked fine as the sk attached to the skb was changed from the dot1q encapsulated interface to the sk for the tunnel which meant the interface bound to the sk and the interface bound to the skb were identical. Commit 46d6c5ae953c ("netfilter: use actual socket sk rather than skb sk when routing harder") fixed some of these issues however a similar problem existed in the xfrm code. Fixes: 31c70d5956fc ("l2tp: keep original skb ownership") Signed-off-by: Evan Nimmo <evan.nimmo@alliedtelesis.co.nz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-03vti6: fix ipv4 pmtu check to honor ip header dfEyal Birger
Frag needed should only be sent if the header enables DF. This fix allows IPv4 packets larger than MTU to pass the vti6 interface and be fragmented after encapsulation, aligning behavior with non-vti6 xfrm. Fixes: ccd740cbc6e0 ("vti6: Add pmtu handling to vti6_xmit.") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-03vti: fix ipv4 pmtu check to honor ip header dfEyal Birger
Frag needed should only be sent if the header enables DF. This fix allows packets larger than MTU to pass the vti interface and be fragmented after encapsulation, aligning behavior with non-vti xfrm. Fixes: d6af1a31cc72 ("vti: Add pmtu handling to vti_xmit.") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-03-02net: ieee802154: nl-mac: fix check on panidAlexander Aring
This patch fixes a null pointer derefence for panid handle by move the check for the netlink variable directly before accessing them. Reported-by: syzbot+d4c07de0144f6f63be3a@syzkaller.appspotmail.com Signed-off-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20210228151817.95700-4-aahringo@redhat.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2021-03-02netfilter: nftables: disallow updates on table ownershipPablo Neira Ayuso
Disallow updating the ownership bit on an existing table: Do not allow to grab ownership on an existing table. Do not allow to drop ownership on an existing table. Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-02Bluetooth: Notify suspend on le conn failedAbhishek Pandit-Subedi
When suspending, Bluetooth disconnects all connected peers devices. If an LE connection is started but isn't completed, we will see an LE Create Connection Cancel instead of an HCI disconnect. This just adds a check to see if an LE cancel was the last disconnected device and wake the suspend thread when that is the case. Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Reviewed-by: Archie Pusaka <apusaka@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-03-01tcp: add sanity tests to TCP_QUEUE_SEQEric Dumazet
Qingyu Li reported a syzkaller bug where the repro changes RCV SEQ _after_ restoring data in the receive queue. mprotect(0x4aa000, 12288, PROT_READ) = 0 mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000 mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000 mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000 socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3 setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0 connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0000000000000003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20 setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0 recvfrom(3, NULL, 20, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer) syslog shows: [ 111.205099] TCP recvmsg seq # bug 2: copied 80, seq 0, rcvnxt 80, fl 0 [ 111.207894] WARNING: CPU: 1 PID: 356 at net/ipv4/tcp.c:2343 tcp_recvmsg_locked+0x90e/0x29a0 This should not be allowed. TCP_QUEUE_SEQ should only be used when queues are empty. This patch fixes this case, and the tx path as well. Fixes: ee9952831cfd ("tcp: Initial repair mode") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=212005 Reported-by: Qingyu Li <ieatmuttonchuan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01net: dsa: tag_mtk: fix 802.1ad VLAN egressDENG Qingfang
A different TPID bit is used for 802.1ad VLAN frames. Reported-by: Ilario Gelmetti <iochesonome@gmail.com> Fixes: f0af34317f4b ("net: dsa: mediatek: combine MediaTek tag with VLAN tag") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01net: expand textsearch ts_state to fit skb_seq_stateWillem de Bruijn
The referenced commit expands the skb_seq_state used by skb_find_text with a 4B frag_off field, growing it to 48B. This exceeds container ts_state->cb, causing a stack corruption: [ 73.238353] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: skb_find_text+0xc5/0xd0 [ 73.247384] CPU: 1 PID: 376 Comm: nping Not tainted 5.11.0+ #4 [ 73.252613] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 73.260078] Call Trace: [ 73.264677] dump_stack+0x57/0x6a [ 73.267866] panic+0xf6/0x2b7 [ 73.270578] ? skb_find_text+0xc5/0xd0 [ 73.273964] __stack_chk_fail+0x10/0x10 [ 73.277491] skb_find_text+0xc5/0xd0 [ 73.280727] string_mt+0x1f/0x30 [ 73.283639] ipt_do_table+0x214/0x410 The struct is passed between skb_find_text and its callbacks skb_prepare_seq_read, skb_seq_read and skb_abort_seq read through the textsearch interface using TS_SKB_CB. I assumed that this mapped to skb->cb like other .._SKB_CB wrappers. skb->cb is 48B. But it maps to ts_state->cb, which is only 40B. skb->cb was increased from 40B to 48B after ts_state was introduced, in commit 3e3850e989c5 ("[NETFILTER]: Fix xfrm lookup in ip_route_me_harder/ip6_route_me_harder"). Increase ts_state.cb[] to 48 to fit the struct. Also add a BUILD_BUG_ON to avoid a repeat. The alternative is to directly add a dependency from textsearch onto linux/skbuff.h, but I think the intent is textsearch to have no such dependencies on its callers. Link: https://bugzilla.kernel.org/show_bug.cgi?id=211911 Fixes: 97550f6fa592 ("net: compound page support in skb_seq_read") Reported-by: Kris Karas <bugs-a17@moonlit-rail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01inetpeer: use div64_ul() and clamp_val() calculate inet_peer_thresholdYejune Deng
In inet_initpeers(), struct inet_peer on IA32 uses 128 bytes in nowdays. Get rid of the cascade and use div64_ul() and clamp_val() calculate that will not need to be adjusted in the future as suggested by Eric Dumazet. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01net/qrtr: fix __netdev_alloc_skb callPavel Skripkin
syzbot found WARNING in __alloc_pages_nodemask()[1] when order >= MAX_ORDER. It was caused by a huge length value passed from userspace to qrtr_tun_write_iter(), which tries to allocate skb. Since the value comes from the untrusted source there is no need to raise a warning in __alloc_pages_nodemask(). [1] WARNING in __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5014 Call Trace: __alloc_pages include/linux/gfp.h:511 [inline] __alloc_pages_node include/linux/gfp.h:524 [inline] alloc_pages_node include/linux/gfp.h:538 [inline] kmalloc_large_node+0x60/0x110 mm/slub.c:3999 __kmalloc_node_track_caller+0x319/0x3f0 mm/slub.c:4496 __kmalloc_reserve net/core/skbuff.c:150 [inline] __alloc_skb+0x4e4/0x5a0 net/core/skbuff.c:210 __netdev_alloc_skb+0x70/0x400 net/core/skbuff.c:446 netdev_alloc_skb include/linux/skbuff.h:2832 [inline] qrtr_endpoint_post+0x84/0x11b0 net/qrtr/qrtr.c:442 qrtr_tun_write_iter+0x11f/0x1a0 net/qrtr/tun.c:98 call_write_iter include/linux/fs.h:1901 [inline] new_sync_write+0x426/0x650 fs/read_write.c:518 vfs_write+0x791/0xa30 fs/read_write.c:605 ksys_write+0x12d/0x250 fs/read_write.c:658 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+80dccaee7c6630fa9dcf@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Acked-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01net: always use icmp{,v6}_ndo_send from ndo_start_xmitJason A. Donenfeld
There were a few remaining tunnel drivers that didn't receive the prior conversion to icmp{,v6}_ndo_send. Knowing now that this could lead to memory corrution (see ee576c47db60 ("net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending") for details), there's even more imperative to have these all converted. So this commit goes through the remaining cases that I could find and does a boring translation to the ndo variety. The Fixes: line below is the merge that originally added icmp{,v6}_ ndo_send and converted the first batch of icmp{,v6}_send users. The rationale then for the change applies equally to this patch. It's just that these drivers were left out of the initial conversion because these network devices are hiding in net/ rather than in drivers/net/. Cc: Florian Westphal <fw@strlen.de> Cc: Willem de Bruijn <willemb@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: David Ahern <dsahern@kernel.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Fixes: 803381f9f117 ("Merge branch 'icmp-account-for-NAT-when-sending-icmps-from-ndo-layer'") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-01net: dsa: tag_rtl4_a: fix egress tagsDENG Qingfang
Commit 86dd9868b878 has several issues, but was accepted too soon before anyone could take a look. - Double free. dsa_slave_xmit() will free the skb if the xmit function returns NULL, but the skb is already freed by eth_skb_pad(). Use __skb_put_padto() to avoid that. - Unnecessary allocation. It has been done by DSA core since commit a3b0b6479700. - A u16 pointer points to skb data. It should be __be16 for network byte order. - Typo in comments. "numer" -> "number". Fixes: 86dd9868b878 ("net: dsa: tag_rtl4_a: Support also egress tags") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-28net: Fix gro aggregation for udp encaps with zero csumDaniel Borkmann
We noticed a GRO issue for UDP-based encaps such as vxlan/geneve when the csum for the UDP header itself is 0. In that case, GRO aggregation does not take place on the phys dev, but instead is deferred to the vxlan/geneve driver (see trace below). The reason is essentially that GRO aggregation bails out in udp_gro_receive() for such case when drivers marked the skb with CHECKSUM_UNNECESSARY (ice, i40e, others) where for non-zero csums 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion") promotes those skbs to CHECKSUM_COMPLETE and napi context has csum_valid set. This is however not the case for zero UDP csum (here: csum_cnt is still 0 and csum_valid continues to be false). At the same time 57c67ff4bd92 ("udp: additional GRO support") added matches on !uh->check ^ !uh2->check as part to determine candidates for aggregation, so it certainly is expected to handle zero csums in udp_gro_receive(). The purpose of the check added via 662880f44203 ("net: Allow GRO to use and set levels of checksum unnecessary") seems to catch bad csum and stop aggregation right away. One way to fix aggregation in the zero case is to only perform the !csum_valid check in udp_gro_receive() if uh->check is infact non-zero. Before: [...] swapper 0 [008] 731.946506: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100400 len=1500 (1) swapper 0 [008] 731.946507: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100200 len=1500 swapper 0 [008] 731.946507: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101100 len=1500 swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101700 len=1500 swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101b00 len=1500 swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100600 len=1500 swapper 0 [008] 731.946508: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100f00 len=1500 swapper 0 [008] 731.946509: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100a00 len=1500 swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100500 len=1500 swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100700 len=1500 swapper 0 [008] 731.946516: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101d00 len=1500 (2) swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101000 len=1500 swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101c00 len=1500 swapper 0 [008] 731.946517: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101400 len=1500 swapper 0 [008] 731.946518: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100e00 len=1500 swapper 0 [008] 731.946518: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497101600 len=1500 swapper 0 [008] 731.946521: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff966497100800 len=774 swapper 0 [008] 731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497100400 len=14032 (1) swapper 0 [008] 731.946530: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff966497101d00 len=9112 (2) [...] # netperf -H 10.55.10.4 -t TCP_STREAM -l 20 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 20.01 13129.24 After: [...] swapper 0 [026] 521.862641: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d479000 len=11286 (1) swapper 0 [026] 521.862643: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479000 len=11236 (1) swapper 0 [026] 521.862650: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d478500 len=2898 (2) swapper 0 [026] 521.862650: net:netif_receive_skb: dev=enp10s0f0 skbaddr=0xffff93ab0d479f00 len=8490 (3) swapper 0 [026] 521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d478500 len=2848 (2) swapper 0 [026] 521.862653: net:netif_receive_skb: dev=test_vxlan skbaddr=0xffff93ab0d479f00 len=8440 (3) [...] # netperf -H 10.55.10.4 -t TCP_STREAM -l 20 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 20.01 24576.53 Fixes: 57c67ff4bd92 ("udp: additional GRO support") Fixes: 662880f44203 ("net: Allow GRO to use and set levels of checksum unnecessary") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tom Herbert <tom@herbertland.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20210226212248.8300-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28ethtool: fix the check logic of at least one channel for RX/TXYinjun Zhang
The command "ethtool -L <intf> combined 0" may clean the RX/TX channel count and skip the error path, since the attrs tb[ETHTOOL_A_CHANNELS_RX_COUNT] and tb[ETHTOOL_A_CHANNELS_TX_COUNT] are NULL in this case when recent ethtool is used. Tested using ethtool v5.10. Fixes: 7be92514b99c ("ethtool: check if there is at least one channel for TX/RX in the core") Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Link: https://lore.kernel.org/r/20210225125102.23989-1-simon.horman@netronome.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-28netfilter: x_tables: gpf inside xt_find_revision()Vasily Averin
nested target/match_revfn() calls work with xt[NFPROTO_UNSPEC] lists without taking xt[NFPROTO_UNSPEC].mutex. This can race with module unload and cause host to crash: general protection fault: 0000 [#1] Modules linked in: ... [last unloaded: xt_cluster] CPU: 0 PID: 542455 Comm: iptables RIP: 0010:[<ffffffff8ffbd518>] [<ffffffff8ffbd518>] strcmp+0x18/0x40 RDX: 0000000000000003 RSI: ffff9a5a5d9abe10 RDI: dead000000000111 R13: ffff9a5a5d9abe10 R14: ffff9a5a5d9abd8c R15: dead000000000100 (VvS: %R15 -- &xt_match, %RDI -- &xt_match.name, xt_cluster unregister match in xt[NFPROTO_UNSPEC].match list) Call Trace: [<ffffffff902ccf44>] match_revfn+0x54/0xc0 [<ffffffff902ccf9f>] match_revfn+0xaf/0xc0 [<ffffffff902cd01e>] xt_find_revision+0x6e/0xf0 [<ffffffffc05a5be0>] do_ipt_get_ctl+0x100/0x420 [ip_tables] [<ffffffff902cc6bf>] nf_getsockopt+0x4f/0x70 [<ffffffff902dd99e>] ip_getsockopt+0xde/0x100 [<ffffffff903039b5>] raw_getsockopt+0x25/0x50 [<ffffffff9026c5da>] sock_common_getsockopt+0x1a/0x20 [<ffffffff9026b89d>] SyS_getsockopt+0x7d/0xf0 [<ffffffff903cbf92>] system_call_fastpath+0x25/0x2a Fixes: 656caff20e1 ("netfilter 04/09: x_tables: fix match/target revision lookup") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-02-28netfilter: conntrack: avoid misleading 'invalid' in log messageFlorian Westphal
The packet is not flagged as invalid: conntrack will accept it and its associated with the conntrack entry. This happens e.g. when receiving a retransmitted SYN in SYN_RECV state. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-02-28netfilter: nf_nat: undo erroneous tcp edemux lookupFlorian Westphal
Under extremely rare conditions TCP early demux will retrieve the wrong socket. 1. local machine establishes a connection to a remote server, S, on port p. This gives: laddr:lport -> S:p ... both in tcp and conntrack. 2. local machine establishes a connection to host H, on port p2. 2a. TCP stack choses same laddr:lport, so we have laddr:lport -> H:p2 from TCP point of view. 2b). There is a destination NAT rewrite in place, translating H:p2 to S:p. This results in following conntrack entries: I) laddr:lport -> S:p (origin) S:p -> laddr:lport (reply) II) laddr:lport -> H:p2 (origin) S:p -> laddr:lport2 (reply) NAT engine has rewritten laddr:lport to laddr:lport2 to map the reply packet to the correct origin. When server sends SYN/ACK to laddr:lport2, the PREROUTING hook will undo-the SNAT transformation, rewriting IP header to S:p -> laddr:lport This causes TCP early demux to associate the skb with the TCP socket of the first connection. The INPUT hook will then reverse the DNAT transformation, rewriting the IP header to H:p2 -> laddr:lport. Because packet ends up with the wrong socket, the new connection never completes: originator stays in SYN_SENT and conntrack entry remains in SYN_RECV until timeout, and responder retransmits SYN/ACK until it gives up. To resolve this, orphan the skb after the input rewrite: Because the source IP address changed, the socket must be incorrect. We can't move the DNAT undo to prerouting due to backwards compatibility, doing so will make iptables/nftables rules to no longer match the way they did. After orphan, the packet will be handed to the next protocol layer (tcp, udp, ...) and that will repeat the socket lookup just like as if early demux was disabled. Fixes: 41063e9dd1195 ("ipv4: Early TCP socket demux.") Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1427 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-02-28netfilter: conntrack: Remove a double space in a log messageKlemen Košir
Removed an extra space in a log message and an extra blank line in code. Signed-off-by: Klemen Košir <klemen.kosir@kream.io> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-02-27Merge tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring thread rewrite from Jens Axboe: "This converts the io-wq workers to be forked off the tasks in question instead of being kernel threads that assume various bits of the original task identity. This kills > 400 lines of code from io_uring/io-wq, and it's the worst part of the code. We've had several bugs in this area, and the worry is always that we could be missing some pieces for file types doing unusual things (recent /dev/tty example comes to mind, userfaultfd reads installing file descriptors is another fun one... - both of which need special handling, and I bet it's not the last weird oddity we'll find). With these identical workers, we can have full confidence that we're never missing anything. That, in itself, is a huge win. Outside of that, it's also more efficient since we're not wasting space and code on tracking state, or switching between different states. I'm sure we're going to find little things to patch up after this series, but testing has been pretty thorough, from the usual regression suite to production. Any issue that may crop up should be manageable. There's also a nice series of further reductions we can do on top of this, but I wanted to get the meat of it out sooner rather than later. The general worry here isn't that it's fundamentally broken. Most of the little issues we've found over the last week have been related to just changes in how thread startup/exit is done, since that's the main difference between using kthreads and these kinds of threads. In fact, if all goes according to plan, I want to get this into the 5.10 and 5.11 stable branches as well. That said, the changes outside of io_uring/io-wq are: - arch setup, simple one-liner to each arch copy_thread() implementation. - Removal of net and proc restrictions for io_uring, they are no longer needed or useful" * tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block: (30 commits) io-wq: remove now unused IO_WQ_BIT_ERROR io_uring: fix SQPOLL thread handling over exec io-wq: improve manager/worker handling over exec io_uring: ensure SQPOLL startup is triggered before error shutdown io-wq: make buffered file write hashed work map per-ctx io-wq: fix race around io_worker grabbing io-wq: fix races around manager/worker creation and task exit io_uring: ensure io-wq context is always destroyed for tasks arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread() io_uring: cleanup ->user usage io-wq: remove nr_process accounting io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS net: remove cmsg restriction from io_uring based send/recvmsg calls Revert "proc: don't allow async path resolution of /proc/self components" Revert "proc: don't allow async path resolution of /proc/thread-self components" io_uring: move SQPOLL thread io-wq forked worker io-wq: make io_wq_fork_thread() available to other users io-wq: only remove worker from free_list, if it was there io_uring: remove io_identity io_uring: remove any grabbing of context ...
2021-02-26tcp: Fix sign comparison bug in getsockopt(TCP_ZEROCOPY_RECEIVE)Arjun Roy
getsockopt(TCP_ZEROCOPY_RECEIVE) has a bug where we read a user-provided "len" field of type signed int, and then compare the value to the result of an "offsetofend" operation, which is unsigned. Negative values provided by the user will be promoted to large positive numbers; thus checking that len < offsetofend() will return false when the intention was that it return true. Note that while len is originally checked for negative values earlier on in do_tcp_getsockopt(), subsequent calls to get_user() re-read the value from userspace which may have changed in the meantime. Therefore, re-add the check for negative values after the call to get_user in the handler code for TCP_ZEROCOPY_RECEIVE. Fixes: c8856c051454 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Arjun Roy <arjunroy@google.com> Link: https://lore.kernel.org/r/20210225232628.4033281-1-arjunroy.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26net: dsa: tag_ocelot_8021q: fix driver dependencyArnd Bergmann
When the ocelot driver code is in a library, the dsa tag code cannot be built-in: ld.lld: error: undefined symbol: ocelot_can_inject >>> referenced by tag_ocelot_8021q.c >>> dsa/tag_ocelot_8021q.o:(ocelot_xmit) in archive net/built-in.a ld.lld: error: undefined symbol: ocelot_port_inject_frame >>> referenced by tag_ocelot_8021q.c >>> dsa/tag_ocelot_8021q.o:(ocelot_xmit) in archive net/built-in.a Building the tag support only really makes sense for compile-testing when the driver is available, so add a Kconfig dependency that prevents the broken configuration while allowing COMPILE_TEST alternative when MSCC_OCELOT_SWITCH_LIB is disabled entirely. This case is handled through the #ifdef check in include/soc/mscc/ocelot.h. Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20210225143910.3964364-2-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-26skmsg: Get rid of sk_psock_bpf_run()Cong Wang
It is now nearly identical to bpf_prog_run_pin_on_cpu() and it has an unused parameter 'psock', so we can just get rid of it and call bpf_prog_run_pin_on_cpu() directly. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-9-xiyou.wangcong@gmail.com
2021-02-26skmsg: Make __sk_psock_purge_ingress_msg() staticCong Wang
It is only used within skmsg.c so can become static. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-8-xiyou.wangcong@gmail.com
2021-02-26sock_map: Make sock_map_prog_update() staticCong Wang
It is only used within sock_map.c so can become static. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-7-xiyou.wangcong@gmail.com
2021-02-26sock_map: Rename skb_parser and skb_verdictCong Wang
These two eBPF programs are tied to BPF_SK_SKB_STREAM_PARSER and BPF_SK_SKB_STREAM_VERDICT, rename them to reflect the fact they are only used for TCP. And save the name 'skb_verdict' for general use later. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-6-xiyou.wangcong@gmail.com
2021-02-26skmsg: Move sk_redir from TCP_SKB_CB to skbCong Wang
Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly does not work for any other non-TCP protocols. We can move them to skb ext, but it introduces a memory allocation on fast path. Fortunately, we only need to a word-size to store all the information, because the flags actually only contains 1 bit so can be just packed into the lowest bit of the "pointer", which is stored as unsigned long. Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is no longer needed after ->sk_data_ready() so we can just drop it. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-5-xiyou.wangcong@gmail.com
2021-02-26bpf: Compute data_end dynamically with JIT codeCong Wang
Currently, we compute ->data_end with a compile-time constant offset of skb. But as Jakub pointed out, we can actually compute it in eBPF JIT code at run-time, so that we can competely get rid of ->data_end. This is similar to skb_shinfo(skb) computation in bpf_convert_shinfo_access(). Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-4-xiyou.wangcong@gmail.com
2021-02-26skmsg: Get rid of struct sk_psock_parserCong Wang
struct sk_psock_parser is embedded in sk_psock, it is unnecessary as skb verdict also uses ->saved_data_ready. We can simply fold these fields into sk_psock, and get rid of ->enabled. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-3-xiyou.wangcong@gmail.com
2021-02-26bpf: Clean up sockmap related KconfigsCong Wang
As suggested by John, clean up sockmap related Kconfigs: Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream parser, to reflect its name. Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL and CONFIG_INET, the latter is still needed at this point because of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched, as it is used by non-sockmap cases. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com
2021-02-26bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]Song Liu
BPF helpers bpf_task_storage_[get|delete] could hold two locks: bpf_local_storage_map_bucket->lock and bpf_local_storage->lock. Calling these helpers from fentry/fexit programs on functions in bpf_*_storage.c may cause deadlock on either locks. Prevent such deadlock with a per cpu counter, bpf_task_storage_busy. We need this counter to be global, because the two locks here belong to two different objects: bpf_local_storage_map and bpf_local_storage. If we pick one of them as the owner of the counter, it is still possible to trigger deadlock on the other lock. For example, if bpf_local_storage_map owns the counters, it cannot prevent deadlock on bpf_local_storage->lock when two maps are used. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210225234319.336131-3-songliubraving@fb.com
2021-02-26Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS Client Updates from Anna Schumaker: "New Features: - Support for eager writes, and the write=eager and write=wait mount options - Other Bugfixes and Cleanups: - Fix typos in some comments - Fix up fall-through warnings for Clang - Cleanups to the NFS readpage codepath - Remove FMR support in rpcrdma_convert_iovs() - Various other cleanups to xprtrdma - Fix xprtrdma pad optimization for servers that don't support RFC 8797 - Improvements to rpcrdma tracepoints - Fix up nfs4_bitmask_adjust() - Optimize sparse writes past the end of files" * tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits) NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo NFS: Set the stable writes flag when initialising the super block NFS: Add mount options supporting eager writes NFS: Add support for eager writes NFS: 'flags' field should be unsigned in struct nfs_server NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache NFS: Always clear an invalid mapping when attempting a buffered write NFS: Optimise sparse writes past the end of file NFS: Fix documenting comment for nfs_revalidate_file_size() NFSv4: Fixes for nfs4_bitmask_adjust() xprtrdma: Clean up rpcrdma_prepare_readch() rpcrdma: Capture bytes received in Receive completion tracepoints xprtrdma: Pad optimization, revisited rpcrdma: Fix comments about reverse-direction operation xprtrdma: Refactor invocations of offset_in_page() xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map() xprtrdma: Remove FMR support in rpcrdma_convert_iovs() NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async() NFS: Call readpage_async_filler() from nfs_readpage_async() NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc ...
2021-02-25Merge tag 'net-5.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Rather small batch this time. Current release - regressions: - bcm63xx_enet: fix sporadic kernel panic due to queue length mis-accounting Current release - new code bugs: - bcm4908_enet: fix RX path possible mem leak - bcm4908_enet: fix NAPI poll returned value - stmmac: fix missing spin_lock_init in visconti_eth_dwmac_probe() - sched: cls_flower: validate ct_state for invalid and reply flags Previous releases - regressions: - net: introduce CAN specific pointer in the struct net_device to prevent mis-interpreting memory - phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081 - psample: fix netlink skb length with tunnel info Previous releases - always broken: - icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending - wireguard: device: do not generate ICMP for non-IP packets - mptcp: provide subflow aware release function to avoid a mem leak - hsr: add support for EntryForgetTime - r8169: fix jumbo packet handling on RTL8168e - octeontx2-af: fix an off by one in rvu_dbg_qsize_write() - i40e: fix flow for IPv6 next header (extension header) - phy: icplus: call phy_restore_page() when phy_select_page() fails - dpaa_eth: fix the access method for the dpaa_napi_portal" * tag 'net-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (55 commits) r8169: fix jumbo packet handling on RTL8168e net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081 net: psample: Fix netlink skb length with tunnel info net: broadcom: bcm4908_enet: fix NAPI poll returned value net: broadcom: bcm4908_enet: fix RX path possible mem leak net: hsr: add support for EntryForgetTime net: dsa: sja1105: Remove unneeded cast in sja1105_crc32() ibmvnic: fix a race between open and reset net: stmmac: Fix missing spin_lock_init in visconti_eth_dwmac_probe() net: introduce CAN specific pointer in the struct net_device net: usb: qmi_wwan: support ZTE P685M modem wireguard: kconfig: use arm chacha even with no neon wireguard: queueing: get rid of per-peer ring buffers wireguard: device: do not generate ICMP for non-IP packets wireguard: peer: put frequently used members above cache lines wireguard: selftests: test multiple parallel streams wireguard: socket: remove bogus __be32 annotation wireguard: avoid double unlikely() notation when using IS_ERR() net: qrtr: Fix memory leak in qrtr_tun_open vxlan: move debug check after netdev unregister ...
2021-02-25net: psample: Fix netlink skb length with tunnel infoChris Mi
Currently, the psample netlink skb is allocated with a size that does not account for the nested 'PSAMPLE_ATTR_TUNNEL' attribute and the padding required for the 64-bit attribute 'PSAMPLE_TUNNEL_KEY_ATTR_ID'. This can result in failure to add attributes to the netlink skb due to insufficient tail room. The following error message is printed to the kernel log: "Could not create psample log message". Fix this by adjusting the allocation size to take into account the nested attribute and the padding. Fixes: d8bed686ab96 ("net: psample: Add tunnel support") CC: Yotam Gigi <yotam.gi@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Chris Mi <cmi@nvidia.com> Link: https://lore.kernel.org/r/20210225075145.184314-1-cmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-25net: hsr: add support for EntryForgetTimeMarco Wenzel
In IEC 62439-3 EntryForgetTime is defined with a value of 400 ms. When a node does not send any frame within this time, the sequence number check for can be ignored. This solves communication issues with Cisco IE 2000 in Redbox mode. Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Marco Wenzel <marco.wenzel@a-eberle.de> Reviewed-by: George McCollister <george.mccollister@gmail.com> Tested-by: George McCollister <george.mccollister@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20210224094653.1440-1-marco.wenzel@a-eberle.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-25xsk: Build skb by page (aka generic zerocopy xmit)Xuan Zhuo
This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. ---------------- Performance Testing ------------ The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s <msg size> ``` Test result data: size 64 512 1024 1500 copy 1916747 1775988 1600203 1440054 page 1974058 1953655 1945463 1904478 percent 3.0% 10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-6-alobakin@pm.me
2021-02-25xsk: Respect device's headroom and tailroom on generic xmit pathAlexander Lobakin
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The size of new skb's headroom is desc->len, so it comes to the driver/device with no reserved headroom and/or tailroom. Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and if case of no available space skb_cow_head() will reallocate the skb. Reallocations are unwanted on fast-path, especially when it comes to XDP, so generic XSK xmit should reserve the spaces declared in dev->needed_headroom and dev->needed tailroom to avoid them. Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)): Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16. However, on XSK xmit hard header is already here in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to cacheline, while reserving no less than driver requests for headroom. NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact need it (not so rare case). Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-5-alobakin@pm.me
2021-02-24net: introduce CAN specific pointer in the struct net_deviceOleksij Rempel
Since 20dd3850bcf8 ("can: Speed up CAN frame receiption by using ml_priv") the CAN framework uses per device specific data in the AF_CAN protocol. For this purpose the struct net_device->ml_priv is used. Later the ml_priv usage in CAN was extended for other users, one of them being CAN_J1939. Later in the kernel ml_priv was converted to an union, used by other drivers. E.g. the tun driver started storing it's stats pointer. Since tun devices can claim to be a CAN device, CAN specific protocols will wrongly interpret this pointer, which will cause system crashes. Mostly this issue is visible in the CAN_J1939 stack. To fix this issue, we request a dedicated CAN pointer within the net_device struct. Reported-by: syzbot+5138c4dd15a0401bec7b@syzkaller.appspotmail.com Fixes: 20dd3850bcf8 ("can: Speed up CAN frame receiption by using ml_priv") Fixes: ffd956eef69b ("can: introduce CAN midlayer private and allocate it automatically") Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Fixes: 497a5757ce4e ("tun: switch to net core provided statistics counters") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20210223070127.4538-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-24net: ieee802154: fix nl802154 del llsec devkeyAlexander Aring
This patch fixes a nullpointer dereference if NL802154_ATTR_SEC_DEVKEY is not set by the user. If this is the case nl802154 will return -EINVAL. Reported-by: syzbot+368672e0da240db53b5f@syzkaller.appspotmail.com Signed-off-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20210221174321.14210-4-aahringo@redhat.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2021-02-24net: ieee802154: fix nl802154 add llsec keyAlexander Aring
This patch fixes a nullpointer dereference if NL802154_ATTR_SEC_KEY is not set by the user. If this is the case nl802154 will return -EINVAL. Reported-by: syzbot+ce4e062c2d51977ddc50@syzkaller.appspotmail.com Signed-off-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20210221174321.14210-3-aahringo@redhat.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2021-02-24net: ieee802154: fix nl802154 del llsec devAlexander Aring
This patch fixes a nullpointer dereference if NL802154_ATTR_SEC_DEVICE is not set by the user. If this is the case nl802154 will return -EINVAL. Reported-by: syzbot+d946223c2e751d136c94@syzkaller.appspotmail.com Signed-off-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20210221174321.14210-2-aahringo@redhat.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2021-02-24net: ieee802154: fix nl802154 del llsec keyAlexander Aring
This patch fixes a nullpointer dereference if NL802154_ATTR_SEC_KEY is not set by the user. If this is the case nl802154 will return -EINVAL. Reported-by: syzbot+ac5c11d2959a8b3c4806@syzkaller.appspotmail.com Signed-off-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20210221174321.14210-1-aahringo@redhat.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2021-02-24Merge remote-tracking branch 'net/master'Stefan Schmidt
2021-02-23net: remove cmsg restriction from io_uring based send/recvmsg callsJens Axboe
No need to restrict these anymore, as the worker threads are direct clones of the original task. Hence we know for a fact that we can support anything that the regular task can. Since the only user of proto_ops->flags was to flag PROTO_CMSG_DATA_ONLY, kill the member and the flag definition too. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23net: qrtr: Fix memory leak in qrtr_tun_openTakeshi Misawa
If qrtr_endpoint_register() failed, tun is leaked. Fix this, by freeing tun in error path. syzbot report: BUG: memory leak unreferenced object 0xffff88811848d680 (size 64): comm "syz-executor684", pid 10171, jiffies 4294951561 (age 26.070s) hex dump (first 32 bytes): 80 dd 0a 84 ff ff ff ff 00 00 00 00 00 00 00 00 ................ 90 d6 48 18 81 88 ff ff 90 d6 48 18 81 88 ff ff ..H.......H..... backtrace: [<0000000018992a50>] kmalloc include/linux/slab.h:552 [inline] [<0000000018992a50>] kzalloc include/linux/slab.h:682 [inline] [<0000000018992a50>] qrtr_tun_open+0x22/0x90 net/qrtr/tun.c:35 [<0000000003a453ef>] misc_open+0x19c/0x1e0 drivers/char/misc.c:141 [<00000000dec38ac8>] chrdev_open+0x10d/0x340 fs/char_dev.c:414 [<0000000079094996>] do_dentry_open+0x1e6/0x620 fs/open.c:817 [<000000004096d290>] do_open fs/namei.c:3252 [inline] [<000000004096d290>] path_openat+0x74a/0x1b00 fs/namei.c:3369 [<00000000b8e64241>] do_filp_open+0xa0/0x190 fs/namei.c:3396 [<00000000a3299422>] do_sys_openat2+0xed/0x230 fs/open.c:1172 [<000000002c1bdcef>] do_sys_open fs/open.c:1188 [inline] [<000000002c1bdcef>] __do_sys_openat fs/open.c:1204 [inline] [<000000002c1bdcef>] __se_sys_openat fs/open.c:1199 [inline] [<000000002c1bdcef>] __x64_sys_openat+0x7f/0xe0 fs/open.c:1199 [<00000000f3a5728f>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 [<000000004b38b7ec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 28fb4e59a47d ("net: qrtr: Expose tunneling endpoint to user space") Reported-by: syzbot+5d6e4af21385f5cfc56a@syzkaller.appspotmail.com Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com> Link: https://lore.kernel.org/r/20210221234427.GA2140@DESKTOP Signed-off-by: Jakub Kicinski <kuba@kernel.org>