summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2018-04-16netfilter: nf_tables: free set name in error pathFlorian Westphal
set->name must be free'd here in case ops->init fails. Fixes: 387454901bd6 ("netfilter: nf_tables: Allow set names of up to 255 chars") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-16netfilter: nf_tables: can't fail after linking rule into active rule listFlorian Westphal
rules in nftables a free'd using kfree, but protected by rcu, i.e. we must wait for a grace period to elapse. Normal removal patch does this, but nf_tables_newrule() doesn't obey this rule during error handling. It calls nft_trans_rule_add() *after* linking rule, and, if that fails to allocate memory, it unlinks the rule and then kfree() it -- this is unsafe. Switch order -- first add rule to transaction list, THEN link it to public list. Note: nft_trans_rule_add() uses GFP_KERNEL; it will not fail so this is not a problem in practice (spotted only during code review). Fixes: 0628b123c96d12 ("netfilter: nfnetlink: add batch support and use it from nf_tables") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-16netfilter: fix CONFIG_NF_REJECT_IPV6=m link errorArnd Bergmann
We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m after larger parts of the nftables modules are linked together: net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval': nft_reject_inet.c:(.text+0x17c): undefined reference to `nf_send_unreach6' nft_reject_inet.c:(.text+0x190): undefined reference to `nf_send_reset6' The problem is that with NF_TABLES_INET set, we implicitly try to use the ipv6 version as well for NFT_REJECT, but when CONFIG_IPV6 is set to a loadable module, it's impossible to reach that. The best workaround I found is to express the above as a Kconfig dependency, forcing NFT_REJECT itself to be 'm' in that particular configuration. Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-16netfilter: conntrack: silent a memory leak warningCong Wang
The following memory leak is false postive: unreferenced object 0xffff8f37f156fb38 (size 128): comm "softirq", pid 0, jiffies 4294899665 (age 11.292s) hex dump (first 32 bytes): 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 00 00 00 00 30 00 20 00 48 6b 6b 6b 6b 6b 6b 6b ....0. .Hkkkkkkk backtrace: [<000000004fda266a>] __kmalloc_track_caller+0x10d/0x141 [<000000007b0a7e3c>] __krealloc+0x45/0x62 [<00000000d08e0bfb>] nf_ct_ext_add+0xdc/0x133 [<0000000099b47fd8>] init_conntrack+0x1b1/0x392 [<0000000086dc36ec>] nf_conntrack_in+0x1ee/0x34b [<00000000940592de>] nf_hook_slow+0x36/0x95 [<00000000d1bd4da7>] nf_hook.constprop.43+0x1c3/0x1dd [<00000000c3673266>] __ip_local_out+0xae/0xb4 [<000000003e4192a6>] ip_local_out+0x17/0x33 [<00000000b64356de>] igmp_ifc_timer_expire+0x23e/0x26f [<000000006a8f3032>] call_timer_fn+0x14c/0x2a5 [<00000000650c1725>] __run_timers.part.34+0x150/0x182 [<0000000090e6946e>] run_timer_softirq+0x2a/0x4c [<000000004d1e7293>] __do_softirq+0x1d1/0x3c2 [<000000004643557d>] irq_exit+0x53/0xa2 [<0000000029ddee8f>] smp_apic_timer_interrupt+0x22a/0x235 because __krealloc() is not supposed to release the old memory and it is released later via kfree_rcu(). Since this is the only external user of __krealloc(), just mark it as not leak here. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-16net: af_packet: fix race in PACKET_{R|T}X_RINGEric Dumazet
In order to remove the race caught by syzbot [1], we need to lock the socket before using po->tp_version as this could change under us otherwise. This means lock_sock() and release_sock() must be done by packet_set_ring() callers. [1] : BUG: KMSAN: uninit-value in packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249 CPU: 0 PID: 20195 Comm: syzkaller707632 Not tainted 4.16.0+ #83 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249 packet_setsockopt+0x12c6/0x5a90 net/packet/af_packet.c:3662 SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849 SyS_setsockopt+0x76/0xa0 net/socket.c:1828 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x449099 RSP: 002b:00007f42b5307ce8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 000000000070003c RCX: 0000000000449099 RDX: 0000000000000005 RSI: 0000000000000107 RDI: 0000000000000003 RBP: 0000000000700038 R08: 000000000000001c R09: 0000000000000000 R10: 00000000200000c0 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000080eecf R14: 00007f42b53089c0 R15: 0000000000000001 Local variable description: ----req_u@packet_setsockopt Variable was created at: packet_setsockopt+0x13f/0x5a90 net/packet/af_packet.c:3612 SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849 Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16tcp: clear tp->packets_out when purging write queueSoheil Hassas Yeganeh
Clear tp->packets_out when purging the write queue, otherwise tcp_rearm_rto() mistakenly assumes TCP write queue is not empty. This results in NULL pointer dereference. Also, remove the redundant `tp->packets_out = 0` from tcp_disconnect(), since tcp_disconnect() calls tcp_write_queue_purge(). Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST) Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Reported-by: Sami Farin <hvtaifwkbgefbaei@gmail.com> Tested-by: Sami Farin <hvtaifwkbgefbaei@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16xfrm: Fix warning in xfrm6_tunnel_net_exit.Steffen Klassert
We need to make sure that all states are really deleted before we check that the state lists are empty. Otherwise we trigger a warning. Fixes: baeb0dbbb5659 ("xfrm6_tunnel: exit_net cleanup check added") Reported-and-tested-by:syzbot+777bf170a89e7b326405@syzkaller.appspotmail.com Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-15rpc_pipefs: fix double-dput()Al Viro
if we ever hit rpc_gssd_dummy_depopulate() dentry passed to it has refcount equal to 1. __rpc_rmpipe() drops it and dput() done after that hits an already freed dentry. Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15Merge tag 'kbuild-v4.17-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - pass HOSTLDFLAGS when compiling single .c host programs - build genksyms lexer and parser files instead of using shipped versions - rename *-asn1.[ch] to *.asn1.[ch] for suffix consistency - let the top .gitignore globally ignore artifacts generated by flex, bison, and asn1_compiler - let the top Makefile globally clean artifacts generated by flex, bison, and asn1_compiler - use safer .SECONDARY marker instead of .PRECIOUS to prevent intermediate files from being removed - support -fmacro-prefix-map option to make __FILE__ a relative path - fix # escaping to prepare for the future GNU Make release - clean up deb-pkg by using debian tools instead of handrolled source/changes generation - improve rpm-pkg portability by supporting kernel-install as a fallback of new-kernel-pkg - extend Kconfig listnewconfig target to provide more information * tag 'kbuild-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kconfig: extend output of 'listnewconfig' kbuild: rpm-pkg: use kernel-install as a fallback for new-kernel-pkg Kbuild: fix # escaping in .cmd files for future Make kbuild: deb-pkg: split generating packaging and build kbuild: use -fmacro-prefix-map to make __FILE__ a relative path kbuild: mark $(targets) as .SECONDARY and remove .PRECIOUS markers kbuild: rename *-asn1.[ch] to *.asn1.[ch] kbuild: clean up *-asn1.[ch] patterns from top-level Makefile .gitignore: move *-asn1.[ch] patterns to the top-level .gitignore kbuild: add %.dtb.S and %.dtb to 'targets' automatically kbuild: add %.lex.c and %.tab.[ch] to 'targets' automatically genksyms: generate lexer and parser during build instead of shipping kbuild: clean up *.lex.c and *.tab.[ch] patterns from top-level Makefile .gitignore: move *.lex.c *.tab.[ch] patterns to the top-level .gitignore kbuild: use HOSTLDFLAGS for single .c executables
2018-04-13l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs fileGuillaume Nault
Use l2tp_tunnel_get_nth() instead of l2tp_tunnel_find_nth(), to be safe against concurrent tunnel deletion. Use the same mechanism as in l2tp_ppp.c for dropping the reference taken by l2tp_tunnel_get_nth(). That is, drop the reference just before looking up the next tunnel. In case of error, drop the last accessed tunnel in l2tp_dfs_seq_stop(). That was the last use of l2tp_tunnel_find_nth(). Fixes: 0ad6614048cf ("l2tp: Add debugfs files for dumping l2tp debug info") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-13l2tp: hold reference on tunnels printed in pppol2tp proc fileGuillaume Nault
Use l2tp_tunnel_get_nth() instead of l2tp_tunnel_find_nth(), to be safe against concurrent tunnel deletion. Unlike sessions, we can't drop the reference held on tunnels in pppol2tp_seq_show(). Tunnels are reused across several calls to pppol2tp_seq_start() when iterating over sessions. These iterations need the tunnel for accessing the next session. Therefore the only safe moment for dropping the reference is just before searching for the next tunnel. Normally, the last invocation of pppol2tp_next_tunnel() doesn't find any new tunnel, so it drops the last tunnel without taking any new reference. However, in case of error, pppol2tp_seq_stop() is called directly, so we have to drop the reference there. Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-13l2tp: hold reference on tunnels in netlink dumpsGuillaume Nault
l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned tunnel, therefore it can be freed whenever the caller uses it. This patch defines l2tp_tunnel_get_nth() which works similarly, but also takes a reference on the returned tunnel. The caller then has to drop it after it stops using the tunnel. Convert netlink dumps to make them safe against concurrent tunnel deletion. Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12net: fix deadlock while clearing neighbor proxy tableWolfgang Bumiller
When coming from ndisc_netdev_event() in net/ipv6/ndisc.c, neigh_ifdown() is called with &nd_tbl, locking this while clearing the proxy neighbor entries when eg. deleting an interface. Calling the table's pndisc_destructor() with the lock still held, however, can cause a deadlock: When a multicast listener is available an IGMP packet of type ICMPV6_MGM_REDUCTION may be sent out. When reaching ip6_finish_output2(), if no neighbor entry for the target address is found, __neigh_create() is called with &nd_tbl, which it'll want to lock. Move the elements into their own list, then unlock the table and perform the destruction. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199289 Fixes: 6fd6ce2056de ("ipv6: Do not depend on rt->n in ip6_finish_output2().") Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12sctp: do not check port in sctp_inet6_cmp_addrXin Long
pf->cmp_addr() is called before binding a v6 address to the sock. It should not check ports, like in sctp_inet_cmp_addr. But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr, sctp_v6_cmp_addr where it also compares the ports. This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr"). This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr, but do the proper check for both v6 addrs and v4mapped addrs. v1->v2: - define __sctp_v6_cmp_addr to do the common address comparison used for both pf and af v6 cmp_addr. Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr") Reported-by: Jianwen Ji <jiji@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12tipc: fix missing initializer in tipc_sendmsg()Jon Maloy
The stack variable 'dnode' in __tipc_sendmsg() may theoretically end up tipc_node_get_mtu() as an unitilalized variable. We fix this by intializing the variable at declaration. We also add a default else clause to the two conditional ones already there, so that we never end up in the named function if the given address type is illegal. Reported-by: syzbot+b0975ce9355b347c1546@syzkaller.appspotmail.com Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12strparser: Fix incorrect strp->need_bytes value.Doron Roberts-Kedes
strp_data_ready resets strp->need_bytes to 0 if strp_peek_len indicates that the remainder of the message has been received. However, do_strp_work does not reset strp->need_bytes to 0. If do_strp_work completes a partial message, the value of strp->need_bytes will continue to reflect the needed bytes of the previous message, causing future invocations of strp_data_ready to return early if strp->need_bytes is less than strp_peek_len. Resetting strp->need_bytes to 0 in __strp_recv on handing a full message to the upper layer solves this problem. __strp_recv also calculates strp->need_bytes using stm->accum_len before stm->accum_len has been incremented by cand_len. This can cause strp->need_bytes to be equal to the full length of the message instead of the full length minus the accumulated length. This, in turn, causes strp_data_ready to return early, even when there is sufficient data to complete the partial message. Incrementing stm->accum_len before using it to calculate strp->need_bytes solves this problem. Found while testing net/tls_sw recv path. Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages") Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12net: validate attribute sizes in neigh_dump_table()Eric Dumazet
Since neigh_dump_table() calls nlmsg_parse() without giving policy constraints, attributes can have arbirary size that we must validate Reported by syzbot/KMSAN : BUG: KMSAN: uninit-value in neigh_master_filtered net/core/neighbour.c:2292 [inline] BUG: KMSAN: uninit-value in neigh_dump_table net/core/neighbour.c:2348 [inline] BUG: KMSAN: uninit-value in neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438 CPU: 1 PID: 3575 Comm: syzkaller268891 Not tainted 4.16.0+ #83 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 neigh_master_filtered net/core/neighbour.c:2292 [inline] neigh_dump_table net/core/neighbour.c:2348 [inline] neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438 netlink_dump+0x9ad/0x1540 net/netlink/af_netlink.c:2225 __netlink_dump_start+0x1167/0x12a0 net/netlink/af_netlink.c:2322 netlink_dump_start include/linux/netlink.h:214 [inline] rtnetlink_rcv_msg+0x1435/0x1560 net/core/rtnetlink.c:4598 netlink_rcv_skb+0x355/0x5f0 net/netlink/af_netlink.c:2447 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4653 netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline] netlink_unicast+0x1672/0x1750 net/netlink/af_netlink.c:1337 netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046 __sys_sendmsg net/socket.c:2080 [inline] SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091 SyS_sendmsg+0x54/0x80 net/socket.c:2087 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x43fed9 RSP: 002b:00007ffddbee2798 EFLAGS: 00000213 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fed9 RDX: 0000000000000000 RSI: 0000000020005000 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8 R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401800 R13: 0000000000401890 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321 slab_post_alloc_hook mm/slab.h:445 [inline] slab_alloc_node mm/slub.c:2737 [inline] __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369 __kmalloc_reserve net/core/skbuff.c:138 [inline] __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206 alloc_skb include/linux/skbuff.h:984 [inline] netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline] netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046 __sys_sendmsg net/socket.c:2080 [inline] SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091 SyS_sendmsg+0x54/0x80 net/socket.c:2087 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 21fdd092acc7 ("net: Add support for filtering neigh dump by master device") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsa@cumulusnetworks.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established socketsEric Dumazet
syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1] I believe this was caused by a TCP_MD5SIG being set on live flow. This is highly unexpected, since TCP option space is limited. For instance, presence of TCP MD5 option automatically disables TCP TimeStamp option at SYN/SYNACK time, which we can not do once flow has been established. Really, adding/deleting an MD5 key only makes sense on sockets in CLOSE or LISTEN state. [1] BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720 CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720 tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline] tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184 tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747 SyS_sendto+0x8a/0xb0 net/socket.c:1715 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x448fe9 RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9 RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004 RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010 R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000 R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321 slab_post_alloc_hook mm/slab.h:445 [inline] slab_alloc_node mm/slub.c:2737 [inline] __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369 __kmalloc_reserve net/core/skbuff.c:138 [inline] __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206 alloc_skb include/linux/skbuff.h:984 [inline] tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624 __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline] tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline] tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747 SyS_sendto+0x8a/0xb0 net/socket.c:1715 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12tipc: fix unbalanced reference counterJon Maloy
When a topology subscription is created, we may encounter (or KASAN may provoke) a failure to create a corresponding service instance in the binding table. Instead of letting the tipc_nametbl_subscribe() report the failure back to the caller, the function just makes a warning printout and returns, without incrementing the subscription reference counter as expected by the caller. This makes the caller believe that the subscription was successful, so it will at a later moment try to unsubscribe the item. This involves a sub_put() call. Since the reference counter never was incremented in the first place, we get a premature delete of the subscription item, followed by a "use-after-free" warning. We fix this by adding a return value to tipc_nametbl_subscribe() and make the caller aware of the failure to subscribe. This bug seems to always have been around, but this fix only applies back to the commit shown below. Given the low risk of this happening we believe this to be sufficient. Fixes: commit 218527fe27ad ("tipc: replace name table service range array with rb tree") Reported-by: syzbot+aa245f26d42b8305d157@syzkaller.appspotmail.com Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12net/tls: Remove VLA usageKees Cook
In the quest to remove VLAs from the kernel[1], this replaces the VLA size with the only possible size used in the code, and adds a mechanism to double-check future IV sizes. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-12Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "Stable bugfixes: - xprtrdma: Fix corner cases when handling device removal # v4.12+ - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+ Features: - New sunrpc tracepoint for RPC pings - Finer grained NFSv4 attribute checking - Don't unnecessarily return NFS v4 delegations Other bugfixes and cleanups: - Several other small NFSoRDMA cleanups - Improvements to the sunrpc RTT measurements - A few sunrpc tracepoint cleanups - Various fixes for NFS v4 lock notifications - Various sunrpc and NFS v4 XDR encoding cleanups - Switch to the ida_simple API - Fix NFSv4.1 exclusive create - Forget acl cache after setattr operation - Don't advance the nfs_entry readdir cookie if xdr decoding fails" * tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits) NFS: advance nfs_entry cookie only after decoding completes successfully NFSv3/acl: forget acl cache after setattr NFSv4.1: Fix exclusive create NFSv4: Declare the size up to date after it was set. nfs: Use ida_simple API NFSv4: Fix the nfs_inode_set_delegation() arguments NFSv4: Clean up CB_GETATTR encoding NFSv4: Don't ask for attributes when ACCESS is protected by a delegation NFSv4: Add a helper to encode/decode struct timespec NFSv4: Clean up encode_attrs NFSv4; Clean up XDR encoding of type bitmap4 NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group SUNRPC: Add a helper for encoding opaque data inline SUNRPC: Add helpers for decoding opaque and string types NFSv4: Ignore change attribute invalidations if we hold a delegation NFS: More fine grained attribute tracking NFS: Don't force unnecessary cache invalidation in nfs_update_inode() NFS: Don't redirty the attribute cache in nfs_wcc_update_inode() NFS: Don't force a revalidation of all attributes if change is missing NFS: Convert NFS_INO_INVALID flags to unsigned long ...
2018-04-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) In ip_gre tunnel, handle the conflict between TUNNEL_{SEQ,CSUM} and GSO/LLTX properly. From Sabrina Dubroca. 2) Stop properly on error in lan78xx_read_otp(), from Phil Elwell. 3) Don't uncompress in slip before rstate is initialized, from Tejaswi Tanikella. 4) When using 1.x firmware on aquantia, issue a deinit before we hardware reset the chip, otherwise we break dirty wake WOL. From Igor Russkikh. 5) Correct log check in vhost_vq_access_ok(), from Stefan Hajnoczi. 6) Fix ethtool -x crashes in bnxt_en, from Michael Chan. 7) Fix races in l2tp tunnel creation and duplicate tunnel detection, from Guillaume Nault. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits) l2tp: fix race in duplicate tunnel detection l2tp: fix races in tunnel creation tun: send netlink notification when the device is modified tun: set the flags before registering the netdevice lan78xx: Don't reset the interface on open bnxt_en: Fix NULL pointer dereference at bnxt_free_irq(). bnxt_en: Need to include RDMA rings in bnxt_check_rings(). bnxt_en: Support max-mtu with VF-reps bnxt_en: Ignore src port field in decap filter nodes bnxt_en: do not allow wildcard matches for L2 flows bnxt_en: Fix ethtool -x crash when device is down. vhost: return bool from *_access_ok() functions vhost: fix vhost_vq_access_ok() log check vhost: Fix vhost_copy_to_user() net: aquantia: oops when shutdown on already stopped device net: aquantia: Regression on reset with 1.x firmware cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN slip: Check if rstate is initialized before uncompressing lan78xx: Avoid spurious kevent 4 "error" lan78xx: Correctly indicate invalid OTP ...
2018-04-11l2tp: fix race in duplicate tunnel detectionGuillaume Nault
We can't use l2tp_tunnel_find() to prevent l2tp_nl_cmd_tunnel_create() from creating a duplicate tunnel. A tunnel can be concurrently registered after l2tp_tunnel_find() returns. Therefore, searching for duplicates must be done at registration time. Finally, remove l2tp_tunnel_find() entirely as it isn't use anywhere anymore. Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-11l2tp: fix races in tunnel creationGuillaume Nault
l2tp_tunnel_create() inserts the new tunnel into the namespace's tunnel list and sets the socket's ->sk_user_data field, before returning it to the caller. Therefore, there are two ways the tunnel can be accessed and freed, before the caller even had the opportunity to take a reference. In practice, syzbot could crash the module by closing the socket right after a new tunnel was returned to pppol2tp_create(). This patch moves tunnel registration out of l2tp_tunnel_create(), so that the caller can safely hold a reference before publishing the tunnel. This second step is done with the new l2tp_tunnel_register() function, which is now responsible for associating the tunnel to its socket and for inserting it into the namespace's list. While moving the code to l2tp_tunnel_register(), a few modifications have been done. First, the socket validation tests are done in a helper function, for clarity. Also, modifying the socket is now done after having inserted the tunnel to the namespace's tunnels list. This will allow insertion to fail, without having to revert theses modifications in the error path (a followup patch will check for duplicate tunnels before insertion). Either the socket is a kernel socket which we control, or it is a user-space socket for which we have a reference on the file descriptor. In any case, the socket isn't going to be closed from under us. Reported-by: syzbot+fbeeb5c3b538e8545644@syzkaller.appspotmail.com Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-11rds: MP-RDS may use an invalid c_pathKa-Cheong Poon
rds_sendmsg() calls rds_send_mprds_hash() to find a c_path to use to send a message. Suppose the RDS connection is not yet up. In rds_send_mprds_hash(), it does if (conn->c_npaths == 0) wait_event_interruptible(conn->c_hs_waitq, (conn->c_npaths != 0)); If it is interrupted before the connection is set up, rds_send_mprds_hash() will return a non-zero hash value. Hence rds_sendmsg() will use a non-zero c_path to send the message. But if the RDS connection ends up to be non-MP capable, the message will be lost as only the zero c_path can be used. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-11netfilter: xt_connmark: Add bit mapping for bit-shift operation.Jack Ma
With the addition of bit-shift operations, we are able to shift ct/skbmark based on user requirements. However, this change might also cause the most left/right hand- side mark to be accidentially lost during shift operations. This patch adds the ability to 'grep' certain bits based on ctmask or nfmask out of the original mark. Then, apply shift operations to achieve a new mapping between ctmark and skb->mark. For example: If someone would like save the fourth F bits of ctmark 0xFFF(F)000F into the seventh hexadecimal (0) skb->mark 0xABC000(0)E. new_targetmark = (ctmark & ctmask) >> 12; (new) skb->mark = (skb->mark &~nfmask) ^ new_targetmark; This will preserve the other bits that are not related to this operation. Fixes: 472a73e00757 ("netfilter: xt_conntrack: Support bit-shifting for CONNMARK & MARK targets.") Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jack Ma <jack.ma@alliedtelesis.co.nz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-10SUNRPC: Add helpers for decoding opaque and string typesTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Fix corner cases when handling device removalChuck Lever
Michal Kalderon has found some corner cases around device unload with active NFS mounts that I didn't have the imagination to test when xprtrdma device removal was added last year. - The ULP device removal handler is responsible for deallocating the PD. That wasn't clear to me initially, and my own testing suggested it was not necessary, but that is incorrect. - The transport destruction path can no longer assume that there is a valid ID. - When destroying a transport, ensure that ib_free_cq() is not invoked on a CQ that was already released. Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com> Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10sunrpc: Add static trace point to report result of RPC pingChuck Lever
This information can help track down local misconfiguration issues as well as network partitions and unresponsive servers. There are several ways to send a ping, and with transport multi- plexing, the exact rpc_xprt that is used is sometimes not known by the upper layer. The rpc_xprt pointer passed to the trace point call also has to be RCU-safe. I found a spot inside the client FSM where an rpc_xprt pointer is always available and safe to use. Suggested-by: Bill Baker <Bill.Baker@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10sunrpc: Add static trace point to report RPC latency statsChuck Lever
Introduce a low-overhead mechanism to report information about latencies of individual RPCs. The goal is to enable user space to filter the trace record for latency outliers, or build histograms, etc. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10sunrpc: Simplify synopsis of some trace pointsChuck Lever
Clean up: struct rpc_task carries a pointer to a struct rpc_clnt, and in fact task->tk_client is always what is passed into trace points that are already passing @task. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10SUNRPC: Make num_reqs a non-atomic integerChuck Lever
If recording xprt->stat.max_slots is moved into xprt_alloc_slot, then xprt->num_reqs is never manipulated outside xprt->reserve_lock. There's no longer a need for xprt->num_reqs to be atomic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10SUNRPC: Make RTT measurement more precise (Send)Chuck Lever
Some RPC transports have more overhead in their send_request callouts than others. For example, for RPC-over-RDMA: - Marshaling an RPC often has to DMA map the RPC arguments - Registration methods perform memory registration as part of marshaling To capture just server and network latencies more precisely: when sending a Call, capture the rq_xtime timestamp _after_ the transport header has been marshaled. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10SUNRPC: Make RTT measurement more precise (Receive)Chuck Lever
Some RPC transports have more overhead in their reply handlers than others. For example, for RPC-over-RDMA: - RPC completion has to wait for memory invalidation, which is not a part of the server/network round trip - Recently a context switch was introduced into the reply handler, which further artificially inflates the measure of RPC RTT To capture just server and network latencies more precisely: when receiving a reply, compute the RTT as soon as the XID is recognized rather than at RPC completion time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10SUNRPC: Move xprt_update_rtt callsiteChuck Lever
Since commit 33849792cbcd ("xprtrdma: Detect unreachable NFS/RDMA servers more reliably"), the xprtrdma transport now has a ->timer callout. But xprtrdma does not need to compute RTT data, only UDP needs that. Move the xprt_update_rtt call into the UDP transport implementation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Move creation of rl_rdmabuf to rpcrdma_create_reqChuck Lever
Refactor: Both rpcrdma_create_req call sites have to allocate the buffer where the transport header is built, so just move that allocation into rpcrdma_create_req. This buffer is a fixed size. There's no needed information available in call_allocate that is not also available when the transport is created. The original purpose for allocating these buffers on demand was to reduce the possibility that an allocation failure during transport creation will hork the mount operation during low memory scenarios. Some relief for this rare possibility is coming up in the next few patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Chain Send to FastReg WRsChuck Lever
With FRWR, the client transport can perform memory registration and post a Send with just a single ib_post_send. This reduces contention between the send_request path and the Send Completion handlers, and reduces the overhead of registering a chunk that has multiple segments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: "Support" call-only RPCsChuck Lever
RPC-over-RDMA version 1 credit accounting relies on there being a response message for every RPC Call. This means that RPC procedures that have no reply will disrupt credit accounting, just in the same way as a retransmit would (since it is sent because no reply has arrived). Deal with the "no reply" case the same way. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Reduce number of MRs created by rpcrdma_mrs_createChuck Lever
Create fewer MRs on average. Many workloads don't need as many as 32 MRs, and the transport can now quickly restock the MR free list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: ->send_request returns -EAGAIN when there are no free MRsChuck Lever
Currently, when the MR free list is exhausted during marshaling, the RPC/RDMA transport places the RPC task on the delayq, which forces a wait for HZ >> 2 before the marshal and send is retried. With this change, the transport now places such an RPC task on the pending queue, and wakes it just as soon as more MRs have been created. Creating more MRs typically takes less than a millisecond, and this waking mechanism is less deadlock-prone. Moreover, the waiting RPC task is holding the transport's write lock, which blocks the transport from sending RPCs. Therefore faster recovery from MR exhaustion is desirable. This is the same mechanism that the TCP transport utilizes when handling write buffer space exhaustion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Remove xprt-specific connect cookieChuck Lever
Clean up: The generic rq_connect_cookie is sufficient to detect RPC Call retransmission. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Remove arbitrary limit on initiator depthChuck Lever
Clean up: We need to check only that the value does not exceed the range of the u8 field it's going into. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Fix latency regression on NUMA NFS/RDMA clientsChuck Lever
With v4.15, on one of my NFS/RDMA clients I measured a nearly doubling in the latency of small read and write system calls. There was no change in server round trip time. The extra latency appears in the whole RPC execution path. "git bisect" settled on commit ccede7598588 ("xprtrdma: Spread reply processing over more CPUs") . After some experimentation, I found that leaving the WQ bound and allowing the scheduler to pick the dispatch CPU seems to eliminate the long latencies, and it does not introduce any new regressions. The fix is implemented by reverting only the part of commit ccede7598588 ("xprtrdma: Spread reply processing over more CPUs") that dispatches RPC replies specifically on the CPU where the matching RPC call was made. Interestingly, saving the CPU number and later queuing reply processing there was effective _only_ for a NFS READ and WRITE request. On my NUMA client, in-kernel RPC reply processing for asynchronous RPCs was dispatched on the same CPU where the RPC call was made, as expected. However synchronous RPCs seem to get their reply dispatched on some other CPU than where the call was placed, every time. Fixes: ccede7598588 ("xprtrdma: Spread reply processing over ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "The big ticket items are: - support for rbd "fancy" striping (myself). The striping feature bit is now fully implemented, allowing mapping v2 images with non-default striping patterns. This completes support for --image-format 2. - CephFS quota support (Luis Henriques and Zheng Yan). This set is based on the new SnapRealm code in the upcoming v13.y.z ("Mimic") release. Quota handling will be rejected on older filesystems. - memory usage improvements in CephFS (Chengguang Xu). Directory specific bits have been split out of ceph_file_info and some effort went into improving cap reservation code to avoid OOM crashes. Also included a bunch of assorted fixes all over the place from Chengguang and others" * tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client: (67 commits) ceph: quota: report root dir quota usage in statfs ceph: quota: add counter for snaprealms with quota ceph: quota: cache inode pointer in ceph_snap_realm ceph: fix root quota realm check ceph: don't check quota for snap inode ceph: quota: update MDS when max_bytes is approaching ceph: quota: support for ceph.quota.max_bytes ceph: quota: don't allow cross-quota renames ceph: quota: support for ceph.quota.max_files ceph: quota: add initial infrastructure to support cephfs quotas rbd: remove VLA usage rbd: fix spelling mistake: "reregisteration" -> "reregistration" ceph: rename function drop_leases() to a more descriptive name ceph: fix invalid point dereference for error case in mdsc destroy ceph: return proper bool type to caller instead of pointer ceph: optimize memory usage ceph: optimize mds session register libceph, ceph: add __init attribution to init funcitons ceph: filter out used flags when printing unused open flags ceph: don't wait on writeback when there is no more dirty pages ...
2018-04-10ip_gre: clear feature flags when incompatible o_flags are setSabrina Dubroca
Commit dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink") added the ability to change o_flags, but missed that the GSO/LLTX features are disabled by default, and only enabled some gre features are unused. Thus we also need to disable the GSO/LLTX features on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set. These two examples should result in the same features being set: ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0 ip link set gre_none type gre seq ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq Fixes: dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) The sockmap code has to free socket memory on close if there is corked data, from John Fastabend. 2) Tunnel names coming from userspace need to be length validated. From Eric Dumazet. 3) arp_filter() has to take VRFs properly into account, from Miguel Fadon Perlines. 4) Fix oops in error path of tcf_bpf_init(), from Davide Caratti. 5) Missing idr_remove() in u32_delete_key(), from Cong Wang. 6) More syzbot stuff. Several use of uninitialized value fixes all over, from Eric Dumazet. 7) Do not leak kernel memory to userspace in sctp, also from Eric Dumazet. 8) Discard frames from unused ports in DSA, from Andrew Lunn. 9) Fix DMA mapping and reset/failover problems in ibmvnic, from Thomas Falcon. 10) Do not access dp83640 PHY registers prematurely after reset, from Esben Haabendal. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits) vhost-net: set packet weight of tx polling to 2 * vq size net: thunderx: rework mac addresses list to u64 array inetpeer: fix uninit-value in inet_getpeer dp83640: Ensure against premature access to PHY registers after reset devlink: convert occ_get op to separate registration ARM: dts: ls1021a: Specify TBIPA register address net/fsl_pq_mdio: Allow explicit speficition of TBIPA address ibmvnic: Do not reset CRQ for Mobility driver resets ibmvnic: Fix failover case for non-redundant configuration ibmvnic: Fix reset scheduler error handling ibmvnic: Zero used TX descriptor counter on reset ibmvnic: Fix DMA mapping mistakes tipc: use the right skb in tipc_sk_fill_sock_diag() sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6 net: dsa: Discard frames from unused ports sctp: do not leak kernel memory to user space soreuseport: initialise timewait reuseport field ipv4: fix uninit-value in ip_route_output_key_hash_rcu() dccp: initialize ireq->ir_mark net: fix uninit-value in __hw_addr_add_ex() ...
2018-04-09netfilter: ebtables: don't attempt to allocate 0-sized compat arrayFlorian Westphal
Dmitry reports 32bit ebtables on 64bit kernel got broken by a recent change that returns -EINVAL when ruleset has no entries. ebtables however only counts user-defined chains, so for the initial table nentries will be 0. Don't try to allocate the compat array in this case, as no user defined rules exist no rule will need 64bit translation. Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: 7d7d7e02111e9 ("netfilter: compat: reject huge allocation requests") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-09ipvs: fix rtnl_lock lockups caused by start_sync_threadJulian Anastasov
syzkaller reports for wrong rtnl_lock usage in sync code [1] and [2] We have 2 problems in start_sync_thread if error path is taken, eg. on memory allocation error or failure to configure sockets for mcast group or addr/port binding: 1. recursive locking: holding rtnl_lock while calling sock_release which in turn calls again rtnl_lock in ip_mc_drop_socket to leave the mcast group, as noticed by Florian Westphal. Additionally, sock_release can not be called while holding sync_mutex (ABBA deadlock). 2. task hung: holding rtnl_lock while calling kthread_stop to stop the running kthreads. As the kthreads do the same to leave the mcast group (sock_release -> ip_mc_drop_socket -> rtnl_lock) they hang. Fix the problems by calling rtnl_unlock early in the error path, now sock_release is called after unlocking both mutexes. Problem 3 (task hung reported by syzkaller [2]) is variant of problem 2: use _trylock to prevent one user to call rtnl_lock and then while waiting for sync_mutex to block kthreads that execute sock_release when they are stopped by stop_sync_thread. [1] IPVS: stopping backup sync thread 4500 ... WARNING: possible recursive locking detected 4.16.0-rc7+ #3 Not tainted -------------------------------------------- syzkaller688027/4497 is trying to acquire lock: (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 but task is already holding lock: IPVS: stopping backup sync thread 4495 ... (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(rtnl_mutex); lock(rtnl_mutex); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by syzkaller688027/4497: #0: (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 #1: (ipvs->sync_mutex){+.+.}, at: [<00000000703f78e3>] do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388 stack backtrace: CPU: 1 PID: 4497 Comm: syzkaller688027 Not tainted 4.16.0-rc7+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x24d lib/dump_stack.c:53 print_deadlock_bug kernel/locking/lockdep.c:1761 [inline] check_deadlock kernel/locking/lockdep.c:1805 [inline] validate_chain kernel/locking/lockdep.c:2401 [inline] __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __mutex_lock_common kernel/locking/mutex.c:756 [inline] __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643 inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413 sock_release+0x8d/0x1e0 net/socket.c:595 start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924 do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1261 udp_setsockopt+0x45/0x80 net/ipv4/udp.c:2406 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2975 SYSC_setsockopt net/socket.c:1849 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1828 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x446a69 RSP: 002b:00007fa1c3a64da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000446a69 RDX: 000000000000048b RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000006e29fc R08: 0000000000000018 R09: 0000000000000000 R10: 00000000200000c0 R11: 0000000000000246 R12: 00000000006e29f8 R13: 00676e697279656b R14: 00007fa1c3a659c0 R15: 00000000006e2b60 [2] IPVS: sync thread started: state = BACKUP, mcast_ifn = syz_tun, syncid = 4, id = 0 IPVS: stopping backup sync thread 25415 ... INFO: task syz-executor7:25421 blocked for more than 120 seconds. Not tainted 4.16.0-rc6+ #284 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor7 D23688 25421 4408 0x00000004 Call Trace: context_switch kernel/sched/core.c:2862 [inline] __schedule+0x8fb/0x1ec0 kernel/sched/core.c:3440 schedule+0xf5/0x430 kernel/sched/core.c:3499 schedule_timeout+0x1a3/0x230 kernel/time/timer.c:1777 do_wait_for_common kernel/sched/completion.c:86 [inline] __wait_for_common kernel/sched/completion.c:107 [inline] wait_for_common kernel/sched/completion.c:118 [inline] wait_for_completion+0x415/0x770 kernel/sched/completion.c:139 kthread_stop+0x14a/0x7a0 kernel/kthread.c:530 stop_sync_thread+0x3d9/0x740 net/netfilter/ipvs/ip_vs_sync.c:1996 do_ip_vs_set_ctl+0x2b1/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2394 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1253 sctp_setsockopt+0x2ca/0x63e0 net/sctp/socket.c:4154 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:3039 SYSC_setsockopt net/socket.c:1850 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1829 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x454889 RSP: 002b:00007fc927626c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007fc9276276d4 RCX: 0000000000454889 RDX: 000000000000048c RSI: 0000000000000000 RDI: 0000000000000017 RBP: 000000000072bf58 R08: 0000000000000018 R09: 0000000000000000 R10: 0000000020000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 000000000000051c R14: 00000000006f9b40 R15: 0000000000000001 Showing all locks held in the system: 2 locks held by khungtaskd/868: #0: (rcu_read_lock){....}, at: [<00000000a1a8f002>] check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline] #0: (rcu_read_lock){....}, at: [<00000000a1a8f002>] watchdog+0x1c5/0xd60 kernel/hung_task.c:249 #1: (tasklist_lock){.+.+}, at: [<0000000037c2f8f9>] debug_show_all_locks+0xd3/0x3d0 kernel/locking/lockdep.c:4470 1 lock held by rsyslogd/4247: #0: (&f->f_pos_lock){+.+.}, at: [<000000000d8d6983>] __fdget_pos+0x12b/0x190 fs/file.c:765 2 locks held by getty/4338: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4339: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4340: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4341: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4342: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4343: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4344: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 3 locks held by kworker/0:5/6494: #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] work_static include/linux/workqueue.h:198 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] set_work_data kernel/workqueue.c:619 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084 #1: ((addr_chk_work).work){+.+.}, at: [<00000000278427d5>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088 #2: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 1 lock held by syz-executor7/25421: #0: (ipvs->sync_mutex){+.+.}, at: [<00000000d414a689>] do_ip_vs_set_ctl+0x277/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2393 2 locks held by syz-executor7/25427: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 #1: (ipvs->sync_mutex){+.+.}, at: [<00000000e6d48489>] do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388 1 lock held by syz-executor7/25435: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 1 lock held by ipvs-b:2:0/25415: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 Reported-and-tested-by: syzbot+a46d6abf9d56b1365a72@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+5fe074c01b2032ce9618@syzkaller.appspotmail.com Fixes: e0b26cc997d5 ("ipvs: call rtnl_lock early") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-09netfilter: nf_conntrack_sip: allow duplicate SDP expectationsFlorian Westphal
Callum Sinclair reported SIP IP Phone errors that he tracked down to such phones sending session descriptions for different media types but with same port numbers. The expect core will only 'refresh' existing expectation if it is from same master AND same expectation class (media type). As expectation class is different, we get an error. The SIP connection tracking code will then 1). drop the SDP packet 2). if an rtp expectation was already installed successfully, error on rtcp expectation will cancel the rtp one. Make the expect core report back to caller when the conflict is due to different expectation class and have SIP tracker ignore soft-error. Reported-by: Callum Sinclair <Callum.Sinclair@alliedtelesis.co.nz> Tested-by: Callum Sinclair <Callum.Sinclair@alliedtelesis.co.nz> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-09inetpeer: fix uninit-value in inet_getpeerEric Dumazet
syzbot/KMSAN reported that p->dtime was read while it was not yet initialized in : delta = (__u32)jiffies - p->dtime; if (delta < ttl || !refcount_dec_if_one(&p->refcnt)) gc_stack[i] = NULL; This is a false positive, because the inetpeer wont be erased from rb-tree if the refcount_dec_if_one(&p->refcnt) does not succeed. And this wont happen before first inet_putpeer() call for this inetpeer has been done, and ->dtime field is written exactly before the refcount_dec_and_test(&p->refcnt). The KMSAN report was : BUG: KMSAN: uninit-value in inet_peer_gc net/ipv4/inetpeer.c:163 [inline] BUG: KMSAN: uninit-value in inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228 CPU: 0 PID: 9494 Comm: syz-executor5 Not tainted 4.16.0+ #82 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 inet_peer_gc net/ipv4/inetpeer.c:163 [inline] inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228 inet_getpeer_v4 include/net/inetpeer.h:110 [inline] icmpv4_xrlim_allow net/ipv4/icmp.c:330 [inline] icmp_send+0x2b44/0x3050 net/ipv4/icmp.c:725 ip_options_compile+0x237c/0x29f0 net/ipv4/ip_options.c:472 ip_rcv_options net/ipv4/ip_input.c:284 [inline] ip_rcv_finish+0xda8/0x16d0 net/ipv4/ip_input.c:365 NF_HOOK include/linux/netfilter.h:288 [inline] ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493 __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562 __netif_receive_skb net/core/dev.c:4627 [inline] netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701 netif_receive_skb+0x230/0x240 net/core/dev.c:4725 tun_rx_batched drivers/net/tun.c:1555 [inline] tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x455111 RSP: 002b:00007fae0365cba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014 RAX: ffffffffffffffda RBX: 000000000000002e RCX: 0000000000455111 RDX: 0000000000000001 RSI: 00007fae0365cbf0 RDI: 00000000000000fc RBP: 0000000020000040 R08: 00000000000000fc R09: 0000000000000000 R10: 000000000000002e R11: 0000000000000293 R12: 00000000ffffffff R13: 0000000000000658 R14: 00000000006fc8e0 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756 inet_getpeer+0xed8/0x1e70 net/ipv4/inetpeer.c:210 inet_getpeer_v4 include/net/inetpeer.h:110 [inline] ip4_frag_init+0x4d1/0x740 net/ipv4/ip_fragment.c:153 inet_frag_alloc net/ipv4/inet_fragment.c:369 [inline] inet_frag_create net/ipv4/inet_fragment.c:385 [inline] inet_frag_find+0x7da/0x1610 net/ipv4/inet_fragment.c:418 ip_find net/ipv4/ip_fragment.c:275 [inline] ip_defrag+0x448/0x67a0 net/ipv4/ip_fragment.c:676 ip_check_defrag+0x775/0xda0 net/ipv4/ip_fragment.c:724 packet_rcv_fanout+0x2a8/0x8d0 net/packet/af_packet.c:1447 deliver_skb net/core/dev.c:1897 [inline] deliver_ptype_list_skb net/core/dev.c:1912 [inline] __netif_receive_skb_core+0x314a/0x4a80 net/core/dev.c:4545 __netif_receive_skb net/core/dev.c:4627 [inline] netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701 netif_receive_skb+0x230/0x240 net/core/dev.c:4725 tun_rx_batched drivers/net/tun.c:1555 [inline] tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>