summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-06-21netfilter: nf_tables: add support for matching IPv4 optionsStephen Suryaputra
This is the kernel change for the overall changes with this description: Add capability to have rules matching IPv4 options. This is developed mainly to support dropping of IP packets with loose and/or strict source route route options. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21ipvs: defer hook registration to avoid leaksJulian Anastasov
syzkaller reports for memory leak when registering hooks [1] As we moved the nf_unregister_net_hooks() call into __ip_vs_dev_cleanup(), defer the nf_register_net_hooks() call, so that hooks are allocated and freed from same pernet_operations (ipvs_core_dev_ops). [1] BUG: memory leak unreferenced object 0xffff88810acd8a80 (size 96): comm "syz-executor073", pid 7254, jiffies 4294950560 (age 22.250s) hex dump (first 32 bytes): 02 00 00 00 00 00 00 00 50 8b bb 82 ff ff ff ff ........P....... 00 00 00 00 00 00 00 00 00 77 bb 82 ff ff ff ff .........w...... backtrace: [<0000000013db61f1>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<0000000013db61f1>] slab_post_alloc_hook mm/slab.h:439 [inline] [<0000000013db61f1>] slab_alloc_node mm/slab.c:3269 [inline] [<0000000013db61f1>] kmem_cache_alloc_node_trace+0x15b/0x2a0 mm/slab.c:3597 [<000000001a27307d>] __do_kmalloc_node mm/slab.c:3619 [inline] [<000000001a27307d>] __kmalloc_node+0x38/0x50 mm/slab.c:3627 [<0000000025054add>] kmalloc_node include/linux/slab.h:590 [inline] [<0000000025054add>] kvmalloc_node+0x4a/0xd0 mm/util.c:431 [<0000000050d1bc00>] kvmalloc include/linux/mm.h:637 [inline] [<0000000050d1bc00>] kvzalloc include/linux/mm.h:645 [inline] [<0000000050d1bc00>] allocate_hook_entries_size+0x3b/0x60 net/netfilter/core.c:61 [<00000000e8abe142>] nf_hook_entries_grow+0xae/0x270 net/netfilter/core.c:128 [<000000004b94797c>] __nf_register_net_hook+0x9a/0x170 net/netfilter/core.c:337 [<00000000d1545cbc>] nf_register_net_hook+0x34/0xc0 net/netfilter/core.c:464 [<00000000876c9b55>] nf_register_net_hooks+0x53/0xc0 net/netfilter/core.c:480 [<000000002ea868e0>] __ip_vs_init+0xe8/0x170 net/netfilter/ipvs/ip_vs_core.c:2280 [<000000002eb2d451>] ops_init+0x4c/0x140 net/core/net_namespace.c:130 [<000000000284ec48>] setup_net+0xde/0x230 net/core/net_namespace.c:316 [<00000000a70600fa>] copy_net_ns+0xf0/0x1e0 net/core/net_namespace.c:439 [<00000000ff26c15e>] create_new_namespaces+0x141/0x2a0 kernel/nsproxy.c:107 [<00000000b103dc79>] copy_namespaces+0xa1/0xe0 kernel/nsproxy.c:165 [<000000007cc008a2>] copy_process.part.0+0x11fd/0x2150 kernel/fork.c:2035 [<00000000c344af7c>] copy_process kernel/fork.c:1800 [inline] [<00000000c344af7c>] _do_fork+0x121/0x4f0 kernel/fork.c:2369 Reported-by: syzbot+722da59ccb264bc19910@syzkaller.appspotmail.com Fixes: 719c7d563c17 ("ipvs: Fix use-after-free in ip_vs_in") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21netfilter: synproxy: fix manual bump of the reference counterFernando Fernandez Mancera
This operation is handled by nf_synproxy_ipv4_init() now. Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21netfilter: bridge: Fix non-untagged fragment packetwenxu
ip netns exec ns1 ip a a dev eth0 10.0.0.7/24 ip netns exec ns2 ip link a link eth0 name vlan type vlan id 200 ip netns exec ns2 ip a a dev vlan 10.0.0.8/24 ip l add dev br0 type bridge vlan_filtering 1 brctl addif br0 veth1 brctl addif br0 veth2 bridge vlan add dev veth1 vid 200 pvid untagged bridge vlan add dev veth2 vid 200 A two fragment packet sent from ns2 contains the vlan tag 200. In the bridge conntrack, this packet will defrag to one skb with fraglist. When the packet is forwarded to ns1 through veth1, the first skb vlan tag will be cleared by the "untagged" flags. But the vlan tag in the second skb is still tagged, so the second fragment ends up with tag 200 to ns1. So if the first fragment packet doesn't contain the vlan tag, all of the remain should not contain vlan tag. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-216lowpan: no need to check return value of debugfs_create functionsGreg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Because we don't care if debugfs works or not, this trickles back a bit so we can clean things up by making some functions return void instead of an error value that is never going to fail. Cc: Alexander Aring <alex.aring@gmail.com> Cc: Jukka Rissanen <jukka.rissanen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-bluetooth@vger.kernel.org Cc: linux-wpan@vger.kernel.org Cc: netdev@vger.kernel.org Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-21ipsec: select crypto ciphers for xfrm_algoArnd Bergmann
kernelci.org reports failed builds on arc because of what looks like an old missed 'select' statement: net/xfrm/xfrm_algo.o: In function `xfrm_probe_algs': xfrm_algo.c:(.text+0x1e8): undefined reference to `crypto_has_ahash' I don't see this in randconfig builds on other architectures, but it's fairly clear we want to select the hash code for it, like we do for all its other users. As Herbert points out, CRYPTO_BLKCIPHER is also required even though it has not popped up in build tests. Fixes: 17bc19702221 ("ipsec: Use skcipher and ahash when probing algorithms") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-20scsi: lib/sg_pool.c: improve APIs for allocating sg poolMing Lei
sg_alloc_table_chained() currently allows the caller to provide one preallocated SGL and returns if the requested number isn't bigger than size of that SGL. This is used to inline an SGL for an IO request. However, scattergather code only allows that size of the 1st preallocated SGL to be SG_CHUNK_SIZE(128). This means a substantial amount of memory (4KB) is claimed for the SGL for each IO request. If the I/O is small, it would be prudent to allocate a smaller SGL. Introduce an extra parameter to sg_alloc_table_chained() and sg_free_table_chained() for specifying size of the preallocated SGL. Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the same size except for the last one. Change the code to allow both functions to accept a variable size for the 1st preallocated SGL. [mkp: attempted to clarify commit desc] Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: netdev@vger.kernel.org Cc: linux-nvme@lists.infradead.org Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-06-20rpc_pipefs: call fsnotify_{unlink,rmdir}() hooksAmir Goldstein
This will allow generating fsnotify delete events after the fsnotify_nameremove() hook is removed from d_delete(). Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-06-20netfilter: bridge: prevent UAF in brnf_exit_net()Christian Brauner
Prevent a UAF in brnf_exit_net(). When unregister_net_sysctl_table() is called the ctl_hdr pointer will obviously be freed and so accessing it righter after is invalid. Fix this by stashing a pointer to the table we want to free before we unregister the sysctl header. Note that syzkaller falsely chased this down to the drm tree so the Fixes tag that syzkaller requested would be wrong. This commit uses a different but the correct Fixes tag. /* Splat */ BUG: KASAN: use-after-free in br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline] BUG: KASAN: use-after-free in brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141 Read of size 8 at addr ffff8880a4078d60 by task kworker/u4:4/8749 CPU: 0 PID: 8749 Comm: kworker/u4:4 Not tainted 5.2.0-rc5-next-20190618 #17 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351 __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline] brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141 ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154 cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553 process_one_work+0x989/0x1790 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x354/0x420 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Allocated by task 11374: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503 __do_kmalloc mm/slab.c:3645 [inline] __kmalloc+0x15c/0x740 mm/slab.c:3654 kmalloc include/linux/slab.h:552 [inline] kzalloc include/linux/slab.h:743 [inline] __register_sysctl_table+0xc7/0xef0 fs/proc/proc_sysctl.c:1327 register_net_sysctl+0x29/0x30 net/sysctl_net.c:121 br_netfilter_sysctl_init_net net/bridge/br_netfilter_hooks.c:1105 [inline] brnf_init_net+0x379/0x6a0 net/bridge/br_netfilter_hooks.c:1126 ops_init+0xb3/0x410 net/core/net_namespace.c:130 setup_net+0x2d3/0x740 net/core/net_namespace.c:316 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:103 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:202 ksys_unshare+0x444/0x980 kernel/fork.c:2822 __do_sys_unshare kernel/fork.c:2890 [inline] __se_sys_unshare kernel/fork.c:2888 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:2888 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3417 [inline] kfree+0x10a/0x2c0 mm/slab.c:3746 __rcu_reclaim kernel/rcu/rcu.h:215 [inline] rcu_do_batch kernel/rcu/tree.c:2092 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline] rcu_core+0xcc7/0x1500 kernel/rcu/tree.c:2291 __do_softirq+0x25c/0x94c kernel/softirq.c:292 The buggy address belongs to the object at ffff8880a4078d40 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 32 bytes inside of 512-byte region [ffff8880a4078d40, ffff8880a4078f40) The buggy address belongs to the page: page:ffffea0002901e00 refcount:1 mapcount:0 mapping:ffff8880aa400a80 index:0xffff8880a40785c0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea0001d636c8 ffffea0001b07308 ffff8880aa400a80 raw: ffff8880a40785c0 ffff8880a40780c0 0000000100000004 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880a4078c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a4078c80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc > ffff8880a4078d00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff8880a4078d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a4078e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-by: syzbot+43a3fa52c0d9c5c94f41@syzkaller.appspotmail.com Fixes: 22567590b2e6 ("netfilter: bridge: namespace bridge netfilter sysctls") Signed-off-by: Christian Brauner <christian@brauner.io> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20netfilter: synproxy: use nf_cookie_v6_check() from corePablo Neira Ayuso
This helper function is never used and it is intended to avoid a direct dependency with the ipv6 module. Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20netfilter: synproxy: fix building syncookie callsArnd Bergmann
When either CONFIG_IPV6 or CONFIG_SYN_COOKIES are disabled, the kernel fails to build: include/linux/netfilter_ipv6.h:180:9: error: implicit declaration of function '__cookie_v6_init_sequence' [-Werror,-Wimplicit-function-declaration] return __cookie_v6_init_sequence(iph, th, mssp); include/linux/netfilter_ipv6.h:194:9: error: implicit declaration of function '__cookie_v6_check' [-Werror,-Wimplicit-function-declaration] return __cookie_v6_check(iph, th, cookie); net/ipv6/netfilter.c:237:26: error: use of undeclared identifier '__cookie_v6_init_sequence'; did you mean 'cookie_init_sequence'? net/ipv6/netfilter.c:238:21: error: use of undeclared identifier '__cookie_v6_check'; did you mean '__cookie_v4_check'? Fix the IS_ENABLED() checks to match the function declaration and definitions for these. Fixes: 3006a5224f15 ("netfilter: synproxy: remove module dependency on IPv6 SYNPROXY") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20net/lib80211: move TKIP handling to ARC4 library codeArd Biesheuvel
The crypto API abstraction is not very useful for invoking ciphers directly, especially in the case of arc4, which only has a generic implementation in C. So let's invoke the library code directly. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-20net/lib80211: move WEP handling to ARC4 library codeArd Biesheuvel
The crypto API abstraction is not very useful for invoking ciphers directly, especially in the case of arc4, which only has a generic implementation in C. So let's invoke the library code directly. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-20net/mac80211: move WEP handling to ARC4 library interfaceArd Biesheuvel
The WEP code in the mac80211 subsystem currently uses the crypto API to access the arc4 (RC4) cipher, which is overly complicated, and doesn't really have an upside in this particular case, since ciphers are always synchronous and therefore always implemented in software. Given that we have no accelerated software implementations either, it is much more straightforward to invoke a generic library interface directly. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-06-19 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin. 2) BTF based map definition, from Andrii. 3) support bpf_map_lookup_elem for xskmap, from Jonathan. 4) bounded loops and scalar precision logic in the verifier, from Alexei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19inet: clear num_timeout reqsk_alloc()Eric Dumazet
KMSAN caught uninit-value in tcp_create_openreq_child() [1] This is caused by a recent change, combined by the fact that TCP cleared num_timeout, num_retrans and sk fields only when a request socket was about to be queued. Under syncookie mode, a temporary request socket is used, and req->num_timeout could contain garbage. Lets clear these three fields sooner, there is really no point trying to defer this and risk other bugs. [1] BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304 tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152 tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209 cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 </IRQ> do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 RIP: 0033:0x401d50 Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00 RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50 RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0 R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline] kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160 kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177 kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781 reqsk_alloc include/net/request_sock.h:84 [inline] inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384 cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 Fixes: 336c39a03151 ("tcp: undo init congestion window on false SYNACK timeout") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net/ipv4: fib_trie: Avoid cryptic ternary expressionsMatthias Kaehlcke
empty_child_inc/dec() use the ternary operator for conditional operations. The conditions involve the post/pre in/decrement operator and the operation is only performed when the condition is *not* true. This is hard to parse for humans, use a regular 'if' construct instead and perform the in/decrement separately. This also fixes two warnings that are emitted about the value of the ternary expression being unused, when building the kernel with clang + "kbuild: Remove unnecessary -Wno-unused-value" (https://lore.kernel.org/patchwork/patch/1089869/): CC net/ipv4/fib_trie.o net/ipv4/fib_trie.c:351:2: error: expression result unused [-Werror,-Wunused-value] ++tn_info(n)->empty_children ? : ++tn_info(n)->full_children; Fixes: 95f60ea3e99a ("fib_trie: Add collapse() and should_collapse() to resize") Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19ipv6: Default fib6_type to RTN_UNICAST when not setDavid Ahern
A user reported that routes are getting installed with type 0 (RTN_UNSPEC) where before the routes were RTN_UNICAST. One example is from accel-ppp which apparently still uses the ioctl interface and does not set rtmsg_type. Another is the netlink interface where ipv6 does not require rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior. Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19svcrdma: Ignore source port when computing DRC hashChuck Lever
The DRC appears to be effectively empty after an RPC/RDMA transport reconnect. The problem is that each connection uses a different source port, which defeats the DRC hash. Clients always have to disconnect before they send retransmissions to reset the connection's credit accounting, thus every retransmit on NFS/RDMA will miss the DRC. An NFS/RDMA client's IP source port is meaningless for RDMA transports. The transport layer typically sets the source port value on the connection to a random ephemeral port. The server already ignores it for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore client's source port on RDMA transports"). The Linux NFS server's DRC resolves XID collisions from the same source IP address by using the checksum of the first 200 bytes of the RPC call header. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-06-19net/af_iucv: always register net_device notifierJulian Wiedmann
Even when running as VM guest (ie pr_iucv != NULL), af_iucv can still open HiperTransport-based connections. For robust operation these connections require the af_iucv_netdev_notifier, so register it unconditionally. Also handle any error that register_netdevice_notifier() returns. Fixes: 9fbd87d41392 ("af_iucv: handle netdev events") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net/af_iucv: build proper skbs for HiperTransportJulian Wiedmann
The HiperSockets-based transport path in af_iucv is still too closely entangled with qeth. With commit a647a02512ca ("s390/qeth: speed-up L3 IQD xmit"), the relevant xmit code in qeth has begun to use skb_cow_head(). So to avoid unnecessary skb head expansions, af_iucv must learn to 1) respect dev->needed_headroom when allocating skbs, and 2) drop the header reference before cloning the skb. While at it, also stop hard-coding the LL-header creation stage and just use the appropriate helper. Fixes: a647a02512ca ("s390/qeth: speed-up L3 IQD xmit") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net/af_iucv: remove GFP_DMA restriction for HiperTransportJulian Wiedmann
af_iucv sockets over z/VM IUCV require that their skbs are allocated in DMA memory. This restriction doesn't apply to connections over HiperSockets. So only set this limit for z/VM IUCV sockets, thereby increasing the likelihood that the large (and linear!) allocations for HiperTransport messages succeed. Fixes: 3881ac441f64 ("af_iucv: add HiperSockets transport") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19netfilter: nf_tables: enable set expiration time for set elementsLaura Garcia Liebana
Currently, the expiration of every element in a set or map is a read-only parameter generated at kernel side. This change will permit to set a certain expiration date per element that will be required, for example, during stateful replication among several nodes. This patch handles the NFTA_SET_ELEM_EXPIRATION in order to configure the expiration parameter per element, or will use the timeout in the case that the expiration is not set. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19netfilter: nft_ct: fix null pointer in ct expectations supportStéphane Veyret
nf_ct_helper_ext_add may return null, which must then be checked. Fixes: 857b46027d6f ("netfilter: nft_ct: add ct expectations support") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Stéphane Veyret <sveyret@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19netfilter: synproxy: ensure zero is returned on non-error return pathColin Ian King
Currently functions nf_synproxy_{ipc4|ipv6}_init return an uninitialized garbage value in variable ret on a successful return. Fix this by returning zero on success. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19inet: fix various use-after-free in defrags unitsEric Dumazet
syzbot reported another issue caused by my recent patches. [1] The issue here is that fqdir_exit() is initiating a work queue and immediately returns. A bit later cleanup_net() was able to free the MIB (percpu data) and the whole struct net was freed, but we had active frag timers that fired and triggered use-after-free. We need to make sure that timers can catch fqdir->dead being set, to bailout. Since RCU is used for the reader side, this means we want to respect an RCU grace period between these operations : 1) qfdir->dead = 1; 2) netns dismantle (freeing of various data structure) This patch uses new new (struct pernet_operations)->pre_exit infrastructure to ensures a full RCU grace period happens between fqdir_pre_exit() and fqdir_exit() This also means we can use a regular work queue, we no longer need rcu_work. Tested: $ time for i in {1..1000}; do unshare -n /bin/false;done real 0m2.585s user 0m0.160s sys 0m2.214s [1] BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152 Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860 CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152 call_timer_fn+0x193/0x720 kernel/time/timer.c:1322 expire_timers kernel/time/timer.c:1366 [inline] __run_timers kernel/time/timer.c:1685 [inline] __run_timers kernel/time/timer.c:1653 [inline] run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698 __do_softirq+0x25c/0x94c kernel/softirq.c:293 invoke_softirq kernel/softirq.c:374 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:414 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806 </IRQ> RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035 Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000 RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398 RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380 R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000 tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087 tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline] tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734 tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335 security_file_ioctl+0x77/0xc0 security/security.c:1370 ksys_ioctl+0x57/0xd0 fs/ioctl.c:711 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4592c9 Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9 RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006 RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4 R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff Allocated by task 9047: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc mm/slab.c:3326 [inline] kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488 kmem_cache_zalloc include/linux/slab.h:732 [inline] net_alloc net/core/net_namespace.c:386 [inline] copy_net_ns+0xed/0x340 net/core/net_namespace.c:426 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206 ksys_unshare+0x440/0x980 kernel/fork.c:2692 __do_sys_unshare kernel/fork.c:2760 [inline] __se_sys_unshare kernel/fork.c:2758 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 2541: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3698 net_free net/core/net_namespace.c:402 [inline] net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409 net_drop_ns net/core/net_namespace.c:408 [inline] cleanup_net+0x538/0x960 net/core/net_namespace.c:571 process_one_work+0x989/0x1790 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x354/0x420 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 The buggy address belongs to the object at ffff88808b9fe100 which belongs to the cache net_namespace of size 6784 The buggy address is located 560 bytes inside of 6784-byte region [ffff88808b9fe100, ffff88808b9ffb80) The buggy address belongs to the page: page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0 flags: 0x1fffc0000010200(slab|head) raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0 raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 3c8fc8782044 ("inet: frags: rework rhashtable dismantle") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19netns: add pre_exit method to struct pernet_operationsEric Dumazet
Current struct pernet_operations exit() handlers are highly discouraged to call synchronize_rcu(). There are cases where we need them, and exit_batch() does not help the common case where a single netns is dismantled. This patch leverages the existing synchronize_rcu() call in cleanup_net() Calling optional ->pre_exit() method before ->exit() or ->exit_batch() allows to benefit from a single synchronize_rcu() call. Note that the synchronize_rcu() calls added in this patch are only in error paths or slow paths. Tested: $ time for i in {1..1000}; do unshare -n /bin/false;done real 0m2.612s user 0m0.171s sys 0m2.216s Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19page_pool: make sure struct device is stableJesper Dangaard Brouer
For DMA mapping use-case the page_pool keeps a pointer to the struct device, which is used in DMA map/unmap calls. For our in-flight handling, we also need to make sure that the struct device have not disappeared. This is assured via using get_device/put_device API. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reported-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19page_pool: add tracepoints for page_pool with details need by XDPJesper Dangaard Brouer
The xdp tracepoints for mem id disconnect don't carry information about, why it was not safe_to_remove. The tracepoint page_pool:page_pool_inflight in this patch can be used for extract this info for further debugging. This patchset also adds tracepoint for the pages_state_* release/hold transitions, including a pointer to the page. This can be used for stats about in-flight pages, or used to debug page leakage via keeping track of page pointer and combining this with kprobe for __put_page(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: add tracepoints for XDP memJesper Dangaard Brouer
These tracepoints make it easier to troubleshoot XDP mem id disconnect. The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is placed at the last stable place for the pointer to struct xdp_mem_allocator, just before it's scheduled for RCU removal. It also extract info on 'safe_to_remove' and 'force'. Detailed info about in-flight pages is not available at this layer. The next patch will added tracepoints needed at the page_pool layer for this. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: force mem allocator removal and periodic warningJesper Dangaard Brouer
If bugs exists or are introduced later e.g. by drivers misusing the API, then we want to warn about the issue, such that developer notice. This patch will generate a bit of noise in form of periodic pr_warn every 30 seconds. It is not nice to have this stall warning running forever. Thus, this patch will (after 120 attempts) force disconnect the mem id (from the rhashtable) and free the page_pool object. This will cause fallback to the put_page() as before, which only potentially leak DMA-mappings, if objects are really stuck for this long. In that unlikely case, a WARN_ONCE should show us the call stack. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: tracking page_pool resources and safe removalJesper Dangaard Brouer
This patch is needed before we can allow drivers to use page_pool for DMA-mappings. Today with page_pool and XDP return API, it is possible to remove the page_pool object (from rhashtable), while there are still in-flight packet-pages. This is safely handled via RCU and failed lookups in __xdp_return() fallback to call put_page(), when page_pool object is gone. In-case page is still DMA mapped, this will result in page note getting correctly DMA unmapped. To solve this, the page_pool is extended with tracking in-flight pages. And XDP disconnect system queries page_pool and waits, via workqueue, for all in-flight pages to be returned. To avoid killing performance when tracking in-flight pages, the implement use two (unsigned) counters, that in placed on different cache-lines, and can be used to deduct in-flight packets. This is done by mapping the unsigned "sequence" counters onto signed Two's complement arithmetic operations. This is e.g. used by kernel's time_after macros, described in kernel commit 1ba3aab3033b and 5a581b367b5, and also explained in RFC1982. The trick is these two incrementing counters only need to be read and compared, when checking if it's safe to free the page_pool structure. Which will only happen when driver have disconnected RX/alloc side. Thus, on a non-fast-path. It is chosen that page_pool tracking is also enabled for the non-DMA use-case, as this can be used for statistics later. After this patch, using page_pool requires more strict resource "release", e.g. via page_pool_release_page() that was introduced in this patchset, and previous patches implement/fix this more strict requirement. Drivers no-longer call page_pool_destroy(). Drivers already call xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will attempt to disconnect the mem id, and if attempt fails schedule the disconnect for later via delayed workqueue. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19page_pool: introduce page_pool_free and use in mlx5Jesper Dangaard Brouer
In case driver fails to register the page_pool with XDP return API (via xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool resources more directly than calling page_pool_destroy(), which does a unnecessarily RCU free procedure. This patch is preparing for removing page_pool_destroy(), from driver invocation. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: page_pool related fix to cpumapJesper Dangaard Brouer
When converting an xdp_frame into an SKB, and sending this into the network stack, then the underlying XDP memory model need to release associated resources, because the network stack don't have callbacks for XDP memory models. The only memory model that needs this is page_pool, when a driver use the DMA-mapping feature. Introduce page_pool_release_page(), which basically does the same as page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to save the function call overhead for others. Have cpumap call xdp_release_frame() before xdp_scrub_frame(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19xdp: fix leak of IDA cyclic id if rhashtable_insert_slow failsJesper Dangaard Brouer
Fix error handling case, where inserting ID with rhashtable_insert_slow fails in xdp_rxq_info_reg_mem_model, which leads to never releasing the IDA ID, as the lookup in xdp_rxq_info_unreg_mem_model fails and thus ida_simple_remove() is never called. Fix by releasing ID via ida_simple_remove(), and mark xdp_rxq->mem.id with zero, which is already checked in xdp_rxq_info_unreg_mem_model(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net: page_pool: add helper function to unmap dma addressesIlias Apalodimas
On a previous patch dma addr was stored in 'struct page'. Use that to unmap DMA addresses used by network drivers Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505Thomas Gleixner
Based on 1 normalized pattern(s): gplv2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 58 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081207.556988620@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation see readme and copying for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 9 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081207.060259192@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500Thomas Gleixner
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 484Thomas Gleixner
Based on 1 normalized pattern(s): this source code is licensed under general public license version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 5 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Enrico Weigelt <info@metux.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081204.871734026@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482Thomas Gleixner
Based on 1 normalized pattern(s): this work is licensed under the terms of the gnu gpl version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 48 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Enrico Weigelt <info@metux.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 235Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 53 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190602204653.904365654@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 234Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 503 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Enrico Weigelt <info@metux.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190602204653.811534538@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 230Thomas Gleixner
Based on 2 normalized pattern(s): this source code is licensed under the gnu general public license version 2 see the file copying for more details this source code is licensed under general public license version 2 see extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 52 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190602204653.449021192@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19net: flow_offload: implement support for meta keyJiri Pirko
Implement support for previously added flow dissector meta key. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net: sched: cls_flower: use flow_dissector for ingress ifindexJiri Pirko
Use previously introduced infra to obtain and store ingress ifindex instead doing it locally. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19flow_dissector: add support for ingress ifindex dissectionJiri Pirko
Add new key meta that contains ingress ifindex value and add a function to dissect this from skb. The key and function is prepared to cover other potential skb metadata values dissection. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18Merge remote-tracking branch 'mlx5-next/mlx5-next' into HEADDoug Ledford
Take mlx5-next so we can take a dependent two patch series next. Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Module autoload for masquerade and redirection does not work. 2) Leak in unqueued packets in nf_ct_frag6_queue(). Ignore duplicated fragments, pretend they are placed into the queue. Patches from Guillaume Nault. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18hvsock: fix epollout hang from race conditionSunil Muthuswamy
Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will not return even when the hvsock socket is writable, under some race condition. This can happen under the following sequence: - fd = socket(hvsocket) - fd_out = dup(fd) - fd_in = dup(fd) - start a writer thread that writes data to fd_out with a combination of epoll_wait(fd_out, EPOLLOUT) and - start a reader thread that reads data from fd_in with a combination of epoll_wait(fd_in, EPOLLIN) - On the host, there are two threads that are reading/writing data to the hvsocket stack: hvs_stream_has_space hvs_notify_poll_out vsock_poll sock_poll ep_poll Race condition: check for epollout from ep_poll(): assume no writable space in the socket hvs_stream_has_space() returns 0 check for epollin from ep_poll(): assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE) hvs_stream_has_space() will clear the channel pending send size host will not notify the guest because the pending send size has been cleared and so the hvsocket will never mark the socket writable Now, the EPOLLOUT will never return even if the socket write buffer is empty. The fix is to set the pending size to the default size and never change it. This way the host will always notify the guest whenever the writable space is bigger than the pending size. The host is already optimized to *only* notify the guest when the pending size threshold boundary is crossed and not everytime. This change also reduces the cpu usage somewhat since hv_stream_has_space() is in the hotpath of send: vsock_stream_sendmsg()->hv_stream_has_space() Earlier hv_stream_has_space was setting/clearing the pending size on every call. Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>