summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix deadlock in bpf_send_signal() from Yonghong Song. 2) Fix off by one in kTLS offload of mlx5, from Tariq Toukan. 3) Add missing locking in iwlwifi mvm code, from Avraham Stern. 4) Fix MSG_WAITALL handling in rxrpc, from David Howells. 5) Need to hold RTNL mutex in tcindex_partial_destroy_work(), from Cong Wang. 6) Fix producer race condition in AF_PACKET, from Willem de Bruijn. 7) cls_route removes the wrong filter during change operations, from Cong Wang. 8) Reject unrecognized request flags in ethtool netlink code, from Michal Kubecek. 9) Need to keep MAC in reset until PHY is up in bcmgenet driver, from Doug Berger. 10) Don't leak ct zone template in act_ct during replace, from Paul Blakey. 11) Fix flushing of offloaded netfilter flowtable flows, also from Paul Blakey. 12) Fix throughput drop during tx backpressure in cxgb4, from Rahul Lakkireddy. 13) Don't let a non-NULL skb->dev leave the TCP stack, from Eric Dumazet. 14) TCP_QUEUE_SEQ socket option has to update tp->copied_seq as well, also from Eric Dumazet. 15) Restrict macsec to ethernet devices, from Willem de Bruijn. 16) Fix reference leak in some ethtool *_SET handlers, from Michal Kubecek. 17) Fix accidental disabling of MSI for some r8169 chips, from Heiner Kallweit. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits) net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build net: ena: Add PCI shutdown handler to allow safe kexec selftests/net/forwarding: define libs as TEST_PROGS_EXTENDED selftests/net: add missing tests to Makefile r8169: re-enable MSI on RTL8168c net: phy: mdio-bcm-unimac: Fix clock handling cxgb4/ptp: pass the sign of offset delta in FW CMD net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop net: cbs: Fix software cbs to consider packet sending time net/mlx5e: Do not recover from a non-fatal syndrome net/mlx5e: Fix ICOSQ recovery flow with Striding RQ net/mlx5e: Fix missing reset of SW metadata in Striding RQ reset net/mlx5e: Enhance ICOSQ WQE info fields net/mlx5_core: Set IB capability mask1 to fix ib_srpt connection failure selftests: netfilter: add nfqueue test case netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress netfilter: nft_fwd_netdev: validate family and chain type netfilter: nft_set_rbtree: Detect partial overlaps on insertion netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start() netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion ...
2020-03-25net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} buildPablo Neira Ayuso
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’: net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’ pkt->skb->tc_redirected = 1; ^~ net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’ pkt->skb->tc_from_ingress = 1; ^~ To avoid a direct dependency with tc actions from netfilter, wrap the redirect bits around CONFIG_NET_REDIRECT and move helpers to include/linux/skbuff.h. Turn on this toggle from the ifb driver, the only existing client of these bits in the tree. This patch adds skb_set_redirected() that sets on the redirected bit on the skbuff, it specifies if the packet was redirect from ingress and resets the timestamp (timestamp reset was originally missing in the netfilter bugfix). Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress") Reported-by: noreply@ellerman.id.au Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) A new selftest for nf_queue, from Florian Westphal. This test covers two recent fixes: 07f8e4d0fddb ("tcp: also NULL skb->dev when copy was needed") and b738a185beaa ("tcp: ensure skb->dev is NULL before leaving TCP stack"). 2) The fwd action breaks with ifb. For safety in next extensions, make sure the fwd action only runs from ingress until it is extended to be used from a different hook. 3) The pipapo set type now reports EEXIST in case of subrange overlaps. Update the rbtree set to validate range overlaps, so far this validation is only done only from userspace. From Stefano Brivio. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_popVladimir Oltean
Not only did this wheel did not need reinventing, but there is also an issue with it: It doesn't remove the VLAN header in a way that preserves the L2 payload checksum when that is being provided by the DSA master hw. It should recalculate checksum both for the push, before removing the header, and for the pull afterwards. But the current implementation is quite dizzying, with pulls followed immediately afterwards by pushes, the memmove is done before the push, etc. This makes a DSA master with RX checksumming offload to print stack traces with the infamous 'hw csum failure' message. So remove the dsa_8021q_remove_header function and replace it with something that actually works with inet checksumming. Fixes: d461933638ae ("net: dsa: tag_8021q: Create helper function for removing VLAN header") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24net: cbs: Fix software cbs to consider packet sending timeZh-yuan Ye
Currently the software CBS does not consider the packet sending time when depleting the credits. It caused the throughput to be Idleslope[kbps] * (Port transmit rate[kbps] / |Sendslope[kbps]|) where Idleslope * (Port transmit rate / (Idleslope + |Sendslope|)) = Idleslope is expected. In order to fix the issue above, this patch takes the time when the packet sending completes into account by moving the anchor time variable "last" ahead to the send completion time upon transmission and adding wait when the next dequeue request comes before the send completion time of the previous packet. changelog: V2->V3: - remove unnecessary whitespace cleanup - add the checks if port_rate is 0 before division V1->V2: - combine variable "send_completed" into "last" - add the comment for estimate of the packet sending Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc") Signed-off-by: Zh-yuan Ye <ye.zh-yuan@socionext.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24netfilter: nft_fwd_netdev: allow to redirect to ifb via ingressPablo Neira Ayuso
Set skb->tc_redirected to 1, otherwise the ifb driver drops the packet. Set skb->tc_from_ingress to 1 to reinject the packet back to the ingress path after leaving the ifb egress path. This patch inconditionally sets on these two skb fields that are meaningful to the ifb driver. The existing forward action is guaranteed to run from ingress path. Fixes: 39e6dea28adc ("netfilter: nf_tables: add forward expression to the netdev family") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nft_fwd_netdev: validate family and chain typePablo Neira Ayuso
Make sure the forward action is only used from ingress. Fixes: 39e6dea28adc ("netfilter: nf_tables: add forward expression to the netdev family") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nft_set_rbtree: Detect partial overlaps on insertionStefano Brivio
...and return -ENOTEMPTY to the front-end in this case, instead of proceeding. Currently, nft takes care of checking for these cases and not sending them to the kernel, but if we drop the set_overlap() call in nft we can end up in situations like: # nft add table t # nft add set t s '{ type inet_service ; flags interval ; }' # nft add element t s '{ 1 - 5 }' # nft add element t s '{ 6 - 10 }' # nft add element t s '{ 4 - 7 }' # nft list set t s table ip t { set s { type inet_service flags interval elements = { 1-3, 4-5, 6-7 } } } This change has the primary purpose of making the behaviour consistent with nft_set_pipapo, but is also functional to avoid inconsistent behaviour if userspace sends overlapping elements for any reason. v2: When we meet the same key data in the tree, as start element while inserting an end element, or as end element while inserting a start element, actually check that the existing element is active, before resetting the overlap flag (Pablo Neira Ayuso) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start()Stefano Brivio
Replace negations of nft_rbtree_interval_end() with a new helper, nft_rbtree_interval_start(), wherever this helps to visualise the problem at hand, that is, for all the occurrences except for the comparison against given flags in __nft_rbtree_get(). This gets especially useful in the next patch. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nft_set_pipapo: Separate partial and complete overlap cases on ↵Stefano Brivio
insertion ...and return -ENOTEMPTY to the front-end on collision, -EEXIST if an identical element already exists. Together with the previous patch, element collision will now be returned to the user as -EEXIST. Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24netfilter: nf_tables: Allow set back-ends to report partial overlaps on ↵Pablo Neira Ayuso
insertion Currently, the -EEXIST return code of ->insert() callbacks is ambiguous: it might indicate that a given element (including intervals) already exists as such, or that the new element would clash with existing ones. If identical elements already exist, the front-end is ignoring this without returning error, in case NLM_F_EXCL is not set. However, if the new element can't be inserted due an overlap, we should report this to the user. To this purpose, allow set back-ends to return -ENOTEMPTY on collision with existing elements, translate that to -EEXIST, and return that to userspace, no matter if NLM_F_EXCL was set. Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-23ethtool: fix reference leak in some *_SET handlersMichal Kubecek
Andrew noticed that some handlers for *_SET commands leak a netdev reference if required ethtool_ops callbacks do not exist. A simple reproducer would be e.g. ip link add veth1 type veth peer name veth2 ethtool -s veth1 wol g ip link del veth1 Make sure dev_put() is called when ethtool_ops check fails. v2: add Fixes tags Fixes: a53f3d41e4d3 ("ethtool: set link settings with LINKINFO_SET request") Fixes: bfbcfe2032e7 ("ethtool: set link modes related data with LINKMODES_SET request") Fixes: e54d04e3afea ("ethtool: set message mask with DEBUG_SET request") Fixes: 8d425b19b305 ("ethtool: set wake-on-lan settings with WOL_SET request") Reported-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23net: dsa: Fix duplicate frames flooded by learningFlorian Fainelli
When both the switch and the bridge are learning about new addresses, switch ports attached to the bridge would see duplicate ARP frames because both entities would attempt to send them. Fixes: 5037d532b83d ("net: dsa: add Broadcom tag RX/TX handler") Reported-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23ipv4: fix a RCU-list lock in inet_dump_fib()Qian Cai
There is a place, inet_dump_fib() fib_table_dump fn_trie_dump_leaf() hlist_for_each_entry_rcu() without rcu_read_lock() will trigger a warning, WARNING: suspicious RCU usage ----------------------------- net/ipv4/fib_trie.c:2216 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by ip/1923: #0: ffffffff8ce76e40 (rtnl_mutex){+.+.}, at: netlink_dump+0xd6/0x840 Call Trace: dump_stack+0xa1/0xea lockdep_rcu_suspicious+0x103/0x10d fn_trie_dump_leaf+0x581/0x590 fib_table_dump+0x15f/0x220 inet_dump_fib+0x4ad/0x5d0 netlink_dump+0x350/0x840 __netlink_dump_start+0x315/0x3e0 rtnetlink_rcv_msg+0x4d1/0x720 netlink_rcv_skb+0xf0/0x220 rtnetlink_rcv+0x15/0x20 netlink_unicast+0x306/0x460 netlink_sendmsg+0x44b/0x770 __sys_sendto+0x259/0x270 __x64_sys_sendto+0x80/0xa0 do_syscall_64+0x69/0xf4 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Fixes: 18a8021a7be3 ("net/ipv4: Plumb support for filtering route dumps") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23tcp: repair: fix TCP_QUEUE_SEQ implementationEric Dumazet
When application uses TCP_QUEUE_SEQ socket option to change tp->rcv_next, we must also update tp->copied_seq. Otherwise, stuff relying on tcp_inq() being precise can eventually be confused. For example, tcp_zerocopy_receive() might crash because it does not expect tcp_recv_skb() to return NULL. We could add tests in various places to fix the issue, or simply make sure tcp_inq() wont return a random value, and leave fast path as it is. Note that this fixes ioctl(fd, SIOCINQ, &val) at the same time. Fixes: ee9952831cfd ("tcp: Initial repair mode") Fixes: 05255b823a61 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-21hsr: fix general protection fault in hsr_addr_is_self()Taehee Yoo
The port->hsr is used in the hsr_handle_frame(), which is a callback of rx_handler. hsr master and slaves are initialized in hsr_add_port(). This function initializes several pointers, which includes port->hsr after registering rx_handler. So, in the rx_handler routine, un-initialized pointer would be used. In order to fix this, pointers should be initialized before registering rx_handler. Test commands: ip netns del left ip netns del right modprobe -rv veth modprobe -rv hsr killall ping modprobe hsr ip netns add left ip netns add right ip link add veth0 type veth peer name veth1 ip link add veth2 type veth peer name veth3 ip link add veth4 type veth peer name veth5 ip link set veth1 netns left ip link set veth3 netns right ip link set veth4 netns left ip link set veth5 netns right ip link set veth0 up ip link set veth2 up ip link set veth0 address fc:00:00:00:00:01 ip link set veth2 address fc:00:00:00:00:02 ip netns exec left ip link set veth1 up ip netns exec left ip link set veth4 up ip netns exec right ip link set veth3 up ip netns exec right ip link set veth5 up ip link add hsr0 type hsr slave1 veth0 slave2 veth2 ip a a 192.168.100.1/24 dev hsr0 ip link set hsr0 up ip netns exec left ip link add hsr1 type hsr slave1 veth1 slave2 veth4 ip netns exec left ip a a 192.168.100.2/24 dev hsr1 ip netns exec left ip link set hsr1 up ip netns exec left ip n a 192.168.100.1 dev hsr1 lladdr \ fc:00:00:00:00:01 nud permanent ip netns exec left ip n r 192.168.100.1 dev hsr1 lladdr \ fc:00:00:00:00:01 nud permanent for i in {1..100} do ip netns exec left ping 192.168.100.1 & done ip netns exec left hping3 192.168.100.1 -2 --flood & ip netns exec right ip link add hsr2 type hsr slave1 veth3 slave2 veth5 ip netns exec right ip a a 192.168.100.3/24 dev hsr2 ip netns exec right ip link set hsr2 up ip netns exec right ip n a 192.168.100.1 dev hsr2 lladdr \ fc:00:00:00:00:02 nud permanent ip netns exec right ip n r 192.168.100.1 dev hsr2 lladdr \ fc:00:00:00:00:02 nud permanent for i in {1..100} do ip netns exec right ping 192.168.100.1 & done ip netns exec right hping3 192.168.100.1 -2 --flood & while : do ip link add hsr0 type hsr slave1 veth0 slave2 veth2 ip a a 192.168.100.1/24 dev hsr0 ip link set hsr0 up ip link del hsr0 done Splat looks like: [ 120.954938][ C0] general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1]I [ 120.957761][ C0] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [ 120.959064][ C0] CPU: 0 PID: 1511 Comm: hping3 Not tainted 5.6.0-rc5+ #460 [ 120.960054][ C0] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 120.962261][ C0] RIP: 0010:hsr_addr_is_self+0x65/0x2a0 [hsr] [ 120.963149][ C0] Code: 44 24 18 70 73 2f c0 48 c1 eb 03 48 8d 04 13 c7 00 f1 f1 f1 f1 c7 40 04 00 f2 f2 f2 4 [ 120.966277][ C0] RSP: 0018:ffff8880d9c09af0 EFLAGS: 00010206 [ 120.967293][ C0] RAX: 0000000000000006 RBX: 1ffff1101b38135f RCX: 0000000000000000 [ 120.968516][ C0] RDX: dffffc0000000000 RSI: ffff8880d17cb208 RDI: 0000000000000000 [ 120.969718][ C0] RBP: 0000000000000030 R08: ffffed101b3c0e3c R09: 0000000000000001 [ 120.972203][ C0] R10: 0000000000000001 R11: ffffed101b3c0e3b R12: 0000000000000000 [ 120.973379][ C0] R13: ffff8880aaf80100 R14: ffff8880aaf800f2 R15: ffff8880aaf80040 [ 120.974410][ C0] FS: 00007f58e693f740(0000) GS:ffff8880d9c00000(0000) knlGS:0000000000000000 [ 120.979794][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 120.980773][ C0] CR2: 00007ffcb8b38f29 CR3: 00000000afe8e001 CR4: 00000000000606f0 [ 120.981945][ C0] Call Trace: [ 120.982411][ C0] <IRQ> [ 120.982848][ C0] ? hsr_add_node+0x8c0/0x8c0 [hsr] [ 120.983522][ C0] ? rcu_read_lock_held+0x90/0xa0 [ 120.984159][ C0] ? rcu_read_lock_sched_held+0xc0/0xc0 [ 120.984944][ C0] hsr_handle_frame+0x1db/0x4e0 [hsr] [ 120.985597][ C0] ? hsr_nl_nodedown+0x2b0/0x2b0 [hsr] [ 120.986289][ C0] __netif_receive_skb_core+0x6bf/0x3170 [ 120.992513][ C0] ? check_chain_key+0x236/0x5d0 [ 120.993223][ C0] ? do_xdp_generic+0x1460/0x1460 [ 120.993875][ C0] ? register_lock_class+0x14d0/0x14d0 [ 120.994609][ C0] ? __netif_receive_skb_one_core+0x8d/0x160 [ 120.995377][ C0] __netif_receive_skb_one_core+0x8d/0x160 [ 120.996204][ C0] ? __netif_receive_skb_core+0x3170/0x3170 [ ... ] Reported-by: syzbot+fcf5dd39282ceb27108d@syzkaller.appspotmail.com Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-21Merge tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Two different fixes in here: - Fix for a potential NULL pointer deref for links with async or drain marked (Pavel) - Fix for not properly checking RLIMIT_NOFILE for async punted operations. This affects openat/openat2, which were added this cycle, and accept4. I did a full audit of other cases where we might check current->signal->rlim[] and found only RLIMIT_FSIZE for buffered writes and fallocate. That one is fixed and queued for 5.7 and marked stable" * tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block: io_uring: make sure accept honor rlimit nofile io_uring: make sure openat/openat2 honor rlimit nofile io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}
2020-03-20tcp: also NULL skb->dev when copy was neededFlorian Westphal
In rare cases retransmit logic will make a full skb copy, which will not trigger the zeroing added in recent change b738a185beaa ("tcp: ensure skb->dev is NULL before leaving TCP stack"). Cc: Eric Dumazet <edumazet@google.com> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Fixes: 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Refetch IP header pointer after pskb_may_pull() in flowtable, from Haishuang Yan. 2) Fix memleak in flowtable offload in nf_flow_table_free(), from Paul Blakey. 3) Set control.addr_type mask in flowtable offload, from Edward Cree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20io_uring: make sure accept honor rlimit nofileJens Axboe
Just like commit 4022e7af86be, this fixes the fact that IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks current->signal->rlim[] for limits. Add an extra argument to __sys_accept4_file() that allows us to pass in the proper nofile limit, and grab it at request prep time. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-19tcp: ensure skb->dev is NULL before leaving TCP stackEric Dumazet
skb->rbnode is sharing three skb fields : next, prev, dev When a packet is sent, TCP keeps the original skb (master) in a rtx queue, which was converted to rbtree a while back. __tcp_transmit_skb() is responsible to clone the master skb, and add the TCP header to the clone before sending it to network layer. skb_clone() already clears skb->next and skb->prev, but copies the master oskb->dev into the clone. We need to clear skb->dev, otherwise lower layers could interpret the value as a pointer to a netdev. This old bug surfaced recently when commit 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING") was merged. Before this netfilter commit, skb->dev value was ignored and changed before reaching dev_queue_xmit() Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Fixes: 28f8bfd1ac94 ("netfilter: Support iif matches in POSTROUTING") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Martin Zaharinov <micron10@gmail.com> Cc: Florian Westphal <fw@strlen.de> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19Merge tag 'rxrpc-fixes-20200319' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc, afs: Interruptibility fixes Here are a number of fixes for AF_RXRPC and AFS that make AFS system calls less interruptible and so less likely to leave the filesystem in an uncertain state. There's also a miscellaneous patch to make tracing consistent. (1) Firstly, abstract out the Tx space calculation in sendmsg. Much the same code is replicated in a number of places that subsequent patches are going to alter, including adding another copy. (2) Fix Tx interruptibility by allowing a kernel service, such as AFS, to request that a call be interruptible only when waiting for a call slot to become available (ie. the call has not taken place yet) or that a call be not interruptible at all (e.g. when we want to do writeback and don't want a signal interrupting a VM-induced writeback). (3) Increase the minimum delay on MSG_WAITALL for userspace sendmsg() when waiting for Tx buffer space as a 2*RTT delay is really small over 10G ethernet and a 1 jiffy timeout might be essentially 0 if at the end of the jiffy period. (4) Fix some tracing output in AFS to make it consistent with rxrpc. (5) Make sure aborted asynchronous AFS operations are tidied up properly so we don't end up with stuck rxrpc calls. (6) Make AFS client calls uninterruptible in the Rx phase. If we don't wait for the reply to be fully gathered, we can't update the local VFS state and we end up in an indeterminate state with respect to the server. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19netfilter: flowtable: populate addr_type maskEdward Cree
nf_flow_rule_match() sets control.addr_type in key, so needs to also set the corresponding mask. An exact match is wanted, so mask is all ones. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19netfilter: flowtable: Fix flushing of offloaded flows on freePaul Blakey
Freeing a flowtable with offloaded flows, the flow are deleted from hardware but are not deleted from the flow table, leaking them, and leaving their offload bit on. Add a second pass of the disabled gc to delete the these flows from the flow table before freeing it. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19netfilter: flowtable: reload ip{v6}h in nf_flow_tuple_ip{v6}Haishuang Yan
Since pskb_may_pull may change skb->data, so we need to reload ip{v6}h at the right place. Fixes: a908fdec3dda ("netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table") Fixes: 7d2086871762 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19netfilter: flowtable: reload ip{v6}h in nf_flow_nat_ip{v6}Haishuang Yan
Since nf_flow_snat_port and nf_flow_snat_ip{v6} call pskb_may_pull() which may change skb->data, so we need to reload ip{v6}h at the right place. Fixes: a908fdec3dda ("netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table") Fixes: 7d2086871762 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-18net/sched: act_ct: Fix leak of ct zone template on replacePaul Blakey
Currently, on replace, the previous action instance params is swapped with a newly allocated params. The old params is only freed (via kfree_rcu), without releasing the allocated ct zone template related to it. Call tcf_ct_params_free (via call_rcu) for the old params, so it will release it. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: core: dev.c: fix a documentation warningMauro Carvalho Chehab
There's a markup for link with is "foo_". On this kernel-doc comment, we don't want this, but instead, place a literal reference. So, escape the literal with ``foo``, in order to avoid this warning: ./net/core/dev.c:5195: WARNING: Unknown target name: "page_is". Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16net: ip_gre: Accept IFLA_INFO_DATA-less configurationPetr Machata
The fix referenced below causes a crash when an ERSPAN tunnel is created without passing IFLA_INFO_DATA. Fix by validating passed-in data in the same way as ipgre does. Fixes: e1f8f78ffe98 ("net: ip_gre: Separate ERSPAN newlink / changelink callbacks") Reported-by: syzbot+1b4ebf4dae4e510dd219@syzkaller.appspotmail.com Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16ethtool: reject unrecognized request flagsMichal Kubecek
As pointed out by Jakub Kicinski, we ethtool netlink code should respond with an error if request head has flags set which are not recognized by kernel, either as a mistake or because it expects functionality introduced in later kernel versions. To avoid unnecessary roundtrips, use extack cookie to provide the information about supported request flags. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16netlink: allow extack cookie also for error messagesMichal Kubecek
Commit ba0dc5f6e0ba ("netlink: allow sending extended ACK with cookie on success") introduced a cookie which can be sent to userspace as part of extended ack message in the form of NLMSGERR_ATTR_COOKIE attribute. Currently the cookie is ignored if error code is non-zero but there is no technical reason for such limitation and it can be useful to provide machine parseable information as part of an error message. Include NLMSGERR_ATTR_COOKIE whenever the cookie has been set, regardless of error code. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16net_sched: cls_route: remove the right filter from hashtableCong Wang
route4_change() allocates a new filter and copies values from the old one. After the new filter is inserted into the hash table, the old filter should be removed and freed, as the final step of the update. However, the current code mistakenly removes the new one. This looks apparently wrong to me, and it causes double "free" and use-after-free too, as reported by syzbot. Reported-and-tested-by: syzbot+f9b32aaacd60305d9687@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+2f8c233f131943d6056d@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+9c2df9fd5e9445b74e01@syzkaller.appspotmail.com Fixes: 1109c00547fc ("net: sched: RCU cls_route") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16hsr: set .netnsok flagTaehee Yoo
The hsr module has been supporting the list and status command. (HSR_C_GET_NODE_LIST and HSR_C_GET_NODE_STATUS) These commands send node information to the user-space via generic netlink. But, in the non-init_net namespace, these commands are not allowed because .netnsok flag is false. So, there is no way to get node information in the non-init_net namespace. Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16hsr: add restart routine into hsr_get_node_list()Taehee Yoo
The hsr_get_node_list() is to send node addresses to the userspace. If there are so many nodes, it could fail because of buffer size. In order to avoid this failure, the restart routine is added. Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16hsr: use rcu_read_lock() in hsr_get_node_{list/status}()Taehee Yoo
hsr_get_node_{list/status}() are not under rtnl_lock() because they are callback functions of generic netlink. But they use __dev_get_by_index() without rtnl_lock(). So, it would use unsafe data. In order to fix it, rcu_read_lock() and dev_get_by_index_rcu() are used instead of __dev_get_by_index(). Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net/packet: tpacket_rcv: avoid a producer race conditionWillem de Bruijn
PACKET_RX_RING can cause multiple writers to access the same slot if a fast writer wraps the ring while a slow writer is still copying. This is particularly likely with few, large, slots (e.g., GSO packets). Synchronize kernel thread ownership of rx ring slots with a bitmap. Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL while holding the sk receive queue lock. They release this lock before copying and set tp_status to TP_STATUS_USER to release to userspace when done. During copying, another writer may take the lock, also see TP_STATUS_KERNEL, and start writing to the same slot. Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a slot, test and set with the lock held. To release race-free, update tp_status and owner bit as a transaction, so take the lock again. This is the one of a variety of discussed options (see Link below): * instead of a shadow ring, embed the data in the slot itself, such as in tp_padding. But any test for this field may match a value left by userspace, causing deadlock. * avoid the lock on release. This leaves a small race if releasing the shadow slot before setting TP_STATUS_USER. The below reproducer showed that this race is not academic. If releasing the slot after tp_status, the race is more subtle. See the first link for details. * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store of two fields. But, legacy applications may interpret all non-zero tp_status as owned by the user. As libpcap does. So this is possible only opt-in by newer processes. It can be added as an optional mode. * embed the struct at the tail of pg_vec to avoid extra allocation. The implementation proved no less complex than a separate field. The additional locking cost on release adds contention, no different than scaling on multicore or multiqueue h/w. In practice, below reproducer nor small packet tcpdump showed a noticeable change in perf report in cycles spent in spinlock. Where contention is problematic, packet sockets support mitigation through PACKET_FANOUT. And we can consider adding opt-in state TP_KERNEL_OWNED. Easy to reproduce by running multiple netperf or similar TCP_STREAM flows concurrently with `tcpdump -B 129 -n greater 60000`. Based on an earlier patchset by Jon Rosen. See links below. I believe this issue goes back to the introduction of tpacket_rcv, which predates git history. Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html Suggested-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: ip_gre: Separate ERSPAN newlink / changelink callbacksPetr Machata
ERSPAN shares most of the code path with GRE and gretap code. While that helps keep the code compact, it is also error prone. Currently a broken userspace can turn a gretap tunnel into a de facto ERSPAN one by passing IFLA_GRE_ERSPAN_VER. There has been a similar issue in ip6gretap in the past. To prevent these problems in future, split the newlink and changelink code paths. Split the ERSPAN code out of ipgre_netlink_parms() into a new function erspan_netlink_parms(). Extract a piece of common logic from ipgre_newlink() and ipgre_changelink() into ipgre_newlink_encap_setup(). Add erspan_newlink() and erspan_changelink(). Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14net/bpfilter: fix dprintf usage for /dev/kmsgBruno Meneguele
The bpfilter UMH code was recently changed to log its informative messages to /dev/kmsg, however this interface doesn't support SEEK_CUR yet, used by dprintf(). As result dprintf() returns -EINVAL and doesn't log anything. However there already had some discussions about supporting SEEK_CUR into /dev/kmsg interface in the past it wasn't concluded. Since the only user of that from userspace perspective inside the kernel is the bpfilter UMH (userspace) module it's better to correct it here instead waiting a conclusion on the interface. Fixes: 36c4357c63f3 ("net: bpfilter: print umh messages to /dev/kmsg") Signed-off-by: Bruno Meneguele <bmeneg@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14net_sched: keep alloc_hash updated after hash allocationCong Wang
In commit 599be01ee567 ("net_sched: fix an OOB access in cls_tcindex") I moved cp->hash calculation before the first tcindex_alloc_perfect_hash(), but cp->alloc_hash is left untouched. This difference could lead to another out of bound access. cp->alloc_hash should always be the size allocated, we should update it after this tcindex_alloc_perfect_hash(). Reported-and-tested-by: syzbot+dcc34d54d68ef7d2d53d@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+c72da7b9ed57cde6fca2@syzkaller.appspotmail.com Fixes: 599be01ee567 ("net_sched: fix an OOB access in cls_tcindex") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14net_sched: hold rtnl lock in tcindex_partial_destroy_work()Cong Wang
syzbot reported a use-after-free in tcindex_dump(). This is due to the lack of RTNL in the deferred rcu work. We queue this work with RTNL in tcindex_change(), later, tcindex_dump() is called: fh = tp->ops->get(tp, t->tcm_handle); ... err = tp->ops->change(..., &fh, ...); tfilter_notify(..., fh, ...); but there is nothing to serialize the pending tcindex_partial_destroy_work() with tcindex_dump(). Fix this by simply holding RTNL in tcindex_partial_destroy_work(), so that it won't be called until RTNL is released after tc_new_tfilter() is completed. Reported-and-tested-by: syzbot+653090db2562495901dc@syzkaller.appspotmail.com Fixes: 3d210534cc93 ("net_sched: fix a race condition in tcindex_destroy()") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13afs: Fix client call Rx-phase signal handlingDavid Howells
Fix the handling of signals in client rxrpc calls made by the afs filesystem. Ignore signals completely, leaving call abandonment or connection loss to be detected by timeouts inside AF_RXRPC. Allowing a filesystem call to be interrupted after the entire request has been transmitted and an abort sent means that the server may or may not have done the action - and we don't know. It may even be worse than that for older servers. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13rxrpc: Fix sendmsg(MSG_WAITALL) handlingDavid Howells
Fix the handling of sendmsg() with MSG_WAITALL for userspace to round the timeout for when a signal occurs up to at least two jiffies as a 1 jiffy timeout may end up being effectively 0 if jiffies wraps at the wrong time. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13rxrpc: Fix call interruptibility handlingDavid Howells
Fix the interruptibility of kernel-initiated client calls so that they're either only interruptible when they're waiting for a call slot to come available or they're not interruptible at all. Either way, they're not interruptible during transmission. This should help prevent StoreData calls from being interrupted when writeback is in progress. It doesn't, however, handle interruption during the receive phase. Userspace-initiated calls are still interruptable. After the signal has been handled, sendmsg() will return the amount of data copied out of the buffer and userspace can perform another sendmsg() call to continue transmission. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13rxrpc: Abstract out the calculation of whether there's Tx spaceDavid Howells
Abstract out the calculation of there being sufficient Tx buffer space. This is reproduced several times in the rxrpc sendmsg code. Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2020-03-12 The following pull-request contains BPF updates for your *net* tree. We've added 12 non-merge commits during the last 8 day(s) which contain a total of 12 files changed, 161 insertions(+), 15 deletions(-). The main changes are: 1) Andrii fixed two bugs in cgroup-bpf. 2) John fixed sockmap. 3) Luke fixed x32 jit. 4) Martin fixed two issues in struct_ops. 5) Yonghong fixed bpf_send_signal. 6) Yoshiki fixed BTF enum. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12taprio: Fix sending packets without dequeueing themVinicius Costa Gomes
There was a bug that was causing packets to be sent to the driver without first calling dequeue() on the "child" qdisc. And the KASAN report below shows that sending a packet without calling dequeue() leads to bad results. The problem is that when checking the last qdisc "child" we do not set the returned skb to NULL, which can cause it to be sent to the driver, and so after the skb is sent, it may be freed, and in some situations a reference to it may still be in the child qdisc, because it was never dequeued. The crash log looks like this: [ 19.937538] ================================================================== [ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780 [ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0 [ 19.939612] [ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97 [ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4 [ 19.941523] Call Trace: [ 19.941774] <IRQ> [ 19.941985] dump_stack+0x97/0xe0 [ 19.942323] print_address_description.constprop.0+0x3b/0x60 [ 19.942884] ? taprio_dequeue_soft+0x620/0x780 [ 19.943325] ? taprio_dequeue_soft+0x620/0x780 [ 19.943767] __kasan_report.cold+0x1a/0x32 [ 19.944173] ? taprio_dequeue_soft+0x620/0x780 [ 19.944612] kasan_report+0xe/0x20 [ 19.944954] taprio_dequeue_soft+0x620/0x780 [ 19.945380] __qdisc_run+0x164/0x18d0 [ 19.945749] net_tx_action+0x2c4/0x730 [ 19.946124] __do_softirq+0x268/0x7bc [ 19.946491] irq_exit+0x17d/0x1b0 [ 19.946824] smp_apic_timer_interrupt+0xeb/0x380 [ 19.947280] apic_timer_interrupt+0xf/0x20 [ 19.947687] </IRQ> [ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0 [ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3 [ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13 [ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e [ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f [ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1 [ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001 [ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 [ 19.954408] ? default_idle_call+0x2e/0x70 [ 19.954816] ? default_idle+0x1f/0x2d0 [ 19.955192] default_idle_call+0x5e/0x70 [ 19.955584] do_idle+0x3d4/0x500 [ 19.955909] ? arch_cpu_idle_exit+0x40/0x40 [ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30 [ 19.956829] ? trace_hardirqs_on+0x30/0x160 [ 19.957242] cpu_startup_entry+0x19/0x20 [ 19.957633] start_secondary+0x2a6/0x380 [ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0 [ 19.958486] secondary_startup_64+0xa4/0xb0 [ 19.958921] [ 19.959078] Allocated by task 33: [ 19.959412] save_stack+0x1b/0x80 [ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 19.960222] kmem_cache_alloc+0xe4/0x230 [ 19.960617] __alloc_skb+0x91/0x510 [ 19.960967] ndisc_alloc_skb+0x133/0x330 [ 19.961358] ndisc_send_ns+0x134/0x810 [ 19.961735] addrconf_dad_work+0xad5/0xf80 [ 19.962144] process_one_work+0x78e/0x13a0 [ 19.962551] worker_thread+0x8f/0xfa0 [ 19.962919] kthread+0x2ba/0x3b0 [ 19.963242] ret_from_fork+0x3a/0x50 [ 19.963596] [ 19.963753] Freed by task 33: [ 19.964055] save_stack+0x1b/0x80 [ 19.964386] __kasan_slab_free+0x12f/0x180 [ 19.964830] kmem_cache_free+0x80/0x290 [ 19.965231] ip6_mc_input+0x38a/0x4d0 [ 19.965617] ipv6_rcv+0x1a4/0x1d0 [ 19.965948] __netif_receive_skb_one_core+0xf2/0x180 [ 19.966437] netif_receive_skb+0x8c/0x3c0 [ 19.966846] br_handle_frame_finish+0x779/0x1310 [ 19.967302] br_handle_frame+0x42a/0x830 [ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90 [ 19.968167] __netif_receive_skb_one_core+0x96/0x180 [ 19.968658] process_backlog+0x198/0x650 [ 19.969047] net_rx_action+0x2fa/0xaa0 [ 19.969420] __do_softirq+0x268/0x7bc [ 19.969785] [ 19.969940] The buggy address belongs to the object at ffff888112862840 [ 19.969940] which belongs to the cache skbuff_head_cache of size 224 [ 19.971202] The buggy address is located 140 bytes inside of [ 19.971202] 224-byte region [ffff888112862840, ffff888112862920) [ 19.972344] The buggy address belongs to the page: [ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0 [ 19.973930] flags: 0x8000000000010200(slab|head) [ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0 [ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000 [ 19.975915] page dumped because: kasan: bad access detected [ 19.976461] page_owner tracks the page as allocated [ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO) [ 19.978332] prep_new_page+0x24b/0x330 [ 19.978707] get_page_from_freelist+0x2057/0x2c90 [ 19.979170] __alloc_pages_nodemask+0x218/0x590 [ 19.979619] new_slab+0x9d/0x300 [ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0 [ 19.980421] __slab_alloc.constprop.0+0x30/0x60 [ 19.980870] kmem_cache_alloc+0x201/0x230 [ 19.981269] __alloc_skb+0x91/0x510 [ 19.981620] alloc_skb_with_frags+0x78/0x4a0 [ 19.982043] sock_alloc_send_pskb+0x5eb/0x750 [ 19.982476] unix_stream_sendmsg+0x399/0x7f0 [ 19.982904] sock_sendmsg+0xe2/0x110 [ 19.983262] ____sys_sendmsg+0x4de/0x6d0 [ 19.983660] ___sys_sendmsg+0xe4/0x160 [ 19.984032] __sys_sendmsg+0xab/0x130 [ 19.984396] do_syscall_64+0xe7/0xae0 [ 19.984761] page last free stack trace: [ 19.985142] __free_pages_ok+0x432/0xbc0 [ 19.985533] qlist_free_all+0x56/0xc0 [ 19.985907] quarantine_reduce+0x149/0x170 [ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0 [ 19.986791] kmem_cache_alloc+0xe4/0x230 [ 19.987182] prepare_creds+0x24/0x440 [ 19.987548] do_faccessat+0x80/0x590 [ 19.987906] do_syscall_64+0xe7/0xae0 [ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 19.988775] [ 19.988930] Memory state around the buggy address: [ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 19.991529] ^ [ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc [ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler") Reported-by: Michael Schmidt <michael.schmidt@eti.uni-siegen.de> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Acked-by: Andre Guedes <andre.guedes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11net: memcg: fix lockdep splat in inet_csk_accept()Eric Dumazet
Locking newsk while still holding the listener lock triggered a lockdep splat [1] We can simply move the memcg code after we release the listener lock, as this can also help if multiple threads are sharing a common listener. Also fix a typo while reading socket sk_rmem_alloc. [1] WARNING: possible recursive locking detected 5.6.0-rc3-syzkaller #0 Not tainted -------------------------------------------- syz-executor598/9524 is trying to acquire lock: ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492 but task is already holding lock: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sk_lock-AF_INET6); lock(sk_lock-AF_INET6); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by syz-executor598/9524: #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445 stack backtrace: CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 print_deadlock_bug kernel/locking/lockdep.c:2370 [inline] check_deadlock kernel/locking/lockdep.c:2411 [inline] validate_chain kernel/locking/lockdep.c:2954 [inline] __lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954 lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484 lock_sock_nested+0xc5/0x110 net/core/sock.c:2947 lock_sock include/net/sock.h:1541 [inline] inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492 inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734 __sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758 __sys_accept4+0x53/0x90 net/socket.c:1809 __do_sys_accept4 net/socket.c:1821 [inline] __se_sys_accept4 net/socket.c:1818 [inline] __x64_sys_accept4+0x93/0xf0 net/socket.c:1818 do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4445c9 Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000 Fixes: d752a4986532 ("net: memcg: late association of sock to memcg") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol numberPaolo Lungaroni
The Internet Assigned Numbers Authority (IANA) has recently assigned a protocol number value of 143 for Ethernet [1]. Before this assignment, encapsulation mechanisms such as Segment Routing used the IPv6-NoNxt protocol number (59) to indicate that the encapsulated payload is an Ethernet frame. In this patch, we add the definition of the Ethernet protocol number to the kernel headers and update the SRv6 L2 tunnels to use it. [1] https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml Signed-off-by: Paolo Lungaroni <paolo.lungaroni@cnit.it> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Acked-by: Ahmed Abdelsalam <ahmed.abdelsalam@gssi.it> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11net: dsa: Don't instantiate phylink for CPU/DSA ports unless neededAndrew Lunn
By default, DSA drivers should configure CPU and DSA ports to their maximum speed. In many configurations this is sufficient to make the link work. In some cases it is necessary to configure the link to run slower, e.g. because of limitations of the SoC it is connected to. Or back to back PHYs are used and the PHY needs to be driven in order to establish link. In this case, phylink is used. Only instantiate phylink if it is required. If there is no PHY, or no fixed link properties, phylink can upset a link which works in the default configuration. Fixes: 0e27921816ad ("net: dsa: Use PHYLINK for the CPU/DSA ports") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11net/packet: tpacket_rcv: do not increment ring index on dropWillem de Bruijn
In one error case, tpacket_rcv drops packets after incrementing the ring producer index. If this happens, it does not update tp_status to TP_STATUS_USER and thus the reader is stalled for an iteration of the ring, causing out of order arrival. The only such error path is when virtio_net_hdr_from_skb fails due to encountering an unknown GSO type. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>