From 7a1ca97269ee197ea967de2c9412d8e7e2274ee6 Mon Sep 17 00:00:00 2001 From: Jakub Sitnicki Date: Thu, 2 Apr 2020 14:55:24 +0200 Subject: net, sk_msg: Don't use RCU_INIT_POINTER on sk_user_data sparse reports an error due to use of RCU_INIT_POINTER helper to assign to sk_user_data pointer, which is not tagged with __rcu: net/core/sock.c:1875:25: error: incompatible types in comparison expression (different address spaces): net/core/sock.c:1875:25: void [noderef] * net/core/sock.c:1875:25: void * ... and rightfully so. sk_user_data is not always treated as a pointer to an RCU-protected data. When it is used to point at an RCU-protected object, we access it with __sk_user_data to inform sparse about it. In this case, when the child socket does not inherit sk_user_data from the parent, there is no reason to treat it as an RCU-protected pointer. Use a regular assignment to clear the pointer value. Fixes: f1ff5ce2cd5e ("net, sk_msg: Clear sk_user_data pointer on clone if tagged") Reported-by: kbuild test robot Signed-off-by: Jakub Sitnicki Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200402125524.851439-1-jakub@cloudflare.com --- net/core/sock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/core/sock.c b/net/core/sock.c index da32d9b6d09f..0510826bf860 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1872,7 +1872,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority) * as not suitable for copying when cloning. */ if (sk_user_data_is_nocopy(newsk)) - RCU_INIT_POINTER(newsk->sk_user_data, NULL); + newsk->sk_user_data = NULL; newsk->sk_err = 0; newsk->sk_err_soft = 0; -- cgit From 72239f2795fab9a58633bd0399698ff7581534a3 Mon Sep 17 00:00:00 2001 From: Stefano Brivio Date: Wed, 1 Apr 2020 17:14:38 +0200 Subject: netfilter: nft_set_rbtree: Drop spurious condition for overlap detection on insertion Case a1. for overlap detection in __nft_rbtree_insert() is not a valid one: start-after-start is not needed to detect any type of interval overlap and it actually results in a false positive if, while descending the tree, this is the only step we hit after starting from the root. This introduced a regression, as reported by Pablo, in Python tests cases ip/ip.t and ip/numgen.t: ip/ip.t: ERROR: line 124: add rule ip test-ip4 input ip hdrlength vmap { 0-4 : drop, 5 : accept, 6 : continue } counter: This rule should not have failed. ip/numgen.t: ERROR: line 7: add rule ip test-ip4 pre dnat to numgen inc mod 10 map { 0-5 : 192.168.10.100, 6-9 : 192.168.20.200}: This rule should not have failed. Drop case a1. and renumber others, so that they are a bit clearer. In order for these diagrams to be readily understandable, a bigger rework is probably needed, such as an ASCII art of the actual rbtree (instead of a flattened version). Shell script test sets/0044interval_overlap_0 should cover all possible cases for false negatives, so I consider that test case still sufficient after this change. v2: Fix comments for cases a3. and b3. Reported-by: Pablo Neira Ayuso Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion") Signed-off-by: Stefano Brivio Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_set_rbtree.c | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) (limited to 'net') diff --git a/net/netfilter/nft_set_rbtree.c b/net/netfilter/nft_set_rbtree.c index 3a5552e14f75..3ffef454d469 100644 --- a/net/netfilter/nft_set_rbtree.c +++ b/net/netfilter/nft_set_rbtree.c @@ -218,27 +218,26 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, /* Detect overlaps as we descend the tree. Set the flag in these cases: * - * a1. |__ _ _? >|__ _ _ (insert start after existing start) - * a2. _ _ __>| ?_ _ __| (insert end before existing end) - * a3. _ _ ___| ?_ _ _>| (insert end after existing end) - * a4. >|__ _ _ _ _ __| (insert start before existing end) + * a1. _ _ __>| ?_ _ __| (insert end before existing end) + * a2. _ _ ___| ?_ _ _>| (insert end after existing end) + * a3. _ _ ___? >|_ _ __| (insert start before existing end) * * and clear it later on, as we eventually reach the points indicated by * '?' above, in the cases described below. We'll always meet these * later, locally, due to tree ordering, and overlaps for the intervals * that are the closest together are always evaluated last. * - * b1. |__ _ _! >|__ _ _ (insert start after existing end) - * b2. _ _ __>| !_ _ __| (insert end before existing start) - * b3. !_____>| (insert end after existing start) + * b1. _ _ __>| !_ _ __| (insert end before existing start) + * b2. _ _ ___| !_ _ _>| (insert end after existing start) + * b3. _ _ ___! >|_ _ __| (insert start after existing end) * - * Case a4. resolves to b1.: + * Case a3. resolves to b3.: * - if the inserted start element is the leftmost, because the '0' * element in the tree serves as end element * - otherwise, if an existing end is found. Note that end elements are * always inserted after corresponding start elements. * - * For a new, rightmost pair of elements, we'll hit cases b1. and b3., + * For a new, rightmost pair of elements, we'll hit cases b3. and b2., * in that order. * * The flag is also cleared in two special cases: @@ -262,9 +261,9 @@ static int __nft_rbtree_insert(const struct net *net, const struct nft_set *set, p = &parent->rb_left; if (nft_rbtree_interval_start(new)) { - overlap = nft_rbtree_interval_start(rbe) && - nft_set_elem_active(&rbe->ext, - genmask); + if (nft_rbtree_interval_end(rbe) && + nft_set_elem_active(&rbe->ext, genmask)) + overlap = false; } else { overlap = nft_rbtree_interval_end(rbe) && nft_set_elem_active(&rbe->ext, -- cgit From a26c1e49c8e97922edc8d7e23683384729d09f77 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Tue, 31 Mar 2020 23:02:59 +0200 Subject: netfilter: nf_tables: do not update stateful expressions if lookup is inverted Initialize set lookup matching element to NULL. Otherwise, the NFT_LOOKUP_F_INV flag reverses the matching logic and it leads to deference an uninitialized pointer to the matching element. Make sure element data area and stateful expression are accessed if there is a matching set element. This patch undoes 24791b9aa1ab ("netfilter: nft_set_bitmap: initialize set element extension in lookups") which is not required anymore. Fixes: 339706bc21c1 ("netfilter: nft_lookup: update element stateful expression") Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_lookup.c | 12 +++++++----- net/netfilter/nft_set_bitmap.c | 1 - 2 files changed, 7 insertions(+), 6 deletions(-) (limited to 'net') diff --git a/net/netfilter/nft_lookup.c b/net/netfilter/nft_lookup.c index 1e70359d633c..f1363b8aabba 100644 --- a/net/netfilter/nft_lookup.c +++ b/net/netfilter/nft_lookup.c @@ -29,7 +29,7 @@ void nft_lookup_eval(const struct nft_expr *expr, { const struct nft_lookup *priv = nft_expr_priv(expr); const struct nft_set *set = priv->set; - const struct nft_set_ext *ext; + const struct nft_set_ext *ext = NULL; bool found; found = set->ops->lookup(nft_net(pkt), set, ®s->data[priv->sreg], @@ -39,11 +39,13 @@ void nft_lookup_eval(const struct nft_expr *expr, return; } - if (set->flags & NFT_SET_MAP) - nft_data_copy(®s->data[priv->dreg], - nft_set_ext_data(ext), set->dlen); + if (ext) { + if (set->flags & NFT_SET_MAP) + nft_data_copy(®s->data[priv->dreg], + nft_set_ext_data(ext), set->dlen); - nft_set_elem_update_expr(ext, regs, pkt); + nft_set_elem_update_expr(ext, regs, pkt); + } } static const struct nla_policy nft_lookup_policy[NFTA_LOOKUP_MAX + 1] = { diff --git a/net/netfilter/nft_set_bitmap.c b/net/netfilter/nft_set_bitmap.c index 32f0fc8be3a4..2a81ea421819 100644 --- a/net/netfilter/nft_set_bitmap.c +++ b/net/netfilter/nft_set_bitmap.c @@ -81,7 +81,6 @@ static bool nft_bitmap_lookup(const struct net *net, const struct nft_set *set, u32 idx, off; nft_bitmap_location(set, key, &idx, &off); - *ext = NULL; return nft_bitmap_active(priv->bitmap, idx, off, genmask); } -- cgit From bc9fe6143de5df8fb36cf1532b48fecf35868571 Mon Sep 17 00:00:00 2001 From: Maciej Żenczykowski Date: Tue, 31 Mar 2020 09:35:59 -0700 Subject: netfilter: xt_IDLETIMER: target v1 - match Android layout MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Android has long had an extension to IDLETIMER to send netlink messages to userspace, see: https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/include/uapi/linux/netfilter/xt_IDLETIMER.h#42 Note: this is idletimer target rev 1, there is no rev 0 in the Android common kernel sources, see registration at: https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/net/netfilter/xt_IDLETIMER.c#483 When we compare that to upstream's new idletimer target rev 1: https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git/tree/include/uapi/linux/netfilter/xt_IDLETIMER.h#n46 We immediately notice that these two rev 1 structs are the same size and layout, and that while timer_type and send_nl_msg are differently named and serve a different purpose, they're at the same offset. This makes them impossible to tell apart - and thus one cannot know in a mixed Android/vanilla environment whether one means timer_type or send_nl_msg. Since this is iptables/netfilter uapi it introduces a problem between iptables (vanilla vs Android) userspace and kernel (vanilla vs Android) if the two don't match each other. Additionally when at some point in the future Android picks up 5.7+ it's not at all clear how to resolve the resulting merge conflict. Furthermore, since upgrading the kernel on old Android phones is pretty much impossible there does not seem to be an easy way out of this predicament. The only thing I've been able to come up with is some super disgusting kernel version >= 5.7 check in the iptables binary to flip between different struct layouts. By adding a dummy field to the vanilla Linux kernel header file we can force the two structs to be compatible with each other. Long term I think I would like to deprecate send_nl_msg out of Android entirely, but I haven't quite been able to figure out exactly how we depend on it. It seems to be very similar to sysfs notifications but with some extra info. Currently it's actually always enabled whenever Android uses the IDLETIMER target, so we could also probably entirely remove it from the uapi in favour of just always enabling it, but again we can't upgrade old kernels already in the field. (Also note that this doesn't change the structure's size, as it is simply fitting into the pre-existing padding, and that since 5.7 hasn't been released yet, there's still time to make this uapi visible change) Cc: Manoj Basapathi Cc: Subash Abhinov Kasiviswanathan Signed-off-by: Maciej Żenczykowski Signed-off-by: Pablo Neira Ayuso --- net/netfilter/xt_IDLETIMER.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'net') diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c index 75bd0e5dd312..7b2f359bfce4 100644 --- a/net/netfilter/xt_IDLETIMER.c +++ b/net/netfilter/xt_IDLETIMER.c @@ -346,6 +346,9 @@ static int idletimer_tg_checkentry_v1(const struct xt_tgchk_param *par) pr_debug("checkentry targinfo%s\n", info->label); + if (info->send_nl_msg) + return -EOPNOTSUPP; + ret = idletimer_tg_helper((struct idletimer_tg_info *)info); if(ret < 0) { -- cgit From 7fb6f78df7003234d8df4f90aeecc432d7d0c804 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 1 Apr 2020 10:37:16 -0700 Subject: netfilter: nf_tables: do not leave dangling pointer in nf_tables_set_alloc_name If nf_tables_set_alloc_name() frees set->name, we better clear set->name to avoid a future use-after-free or invalid-free. BUG: KASAN: double-free or invalid-free in nf_tables_newset+0x1ed6/0x2560 net/netfilter/nf_tables_api.c:4148 CPU: 0 PID: 28233 Comm: syz-executor.0 Not tainted 5.6.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:374 kasan_report_invalid_free+0x61/0xa0 mm/kasan/report.c:468 __kasan_slab_free+0x129/0x140 mm/kasan/common.c:455 __cache_free mm/slab.c:3426 [inline] kfree+0x109/0x2b0 mm/slab.c:3757 nf_tables_newset+0x1ed6/0x2560 net/netfilter/nf_tables_api.c:4148 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345 ___sys_sendmsg+0x100/0x170 net/socket.c:2399 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x45c849 Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fe5ca21dc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fe5ca21e6d4 RCX: 000000000045c849 RDX: 0000000000000000 RSI: 0000000020000c40 RDI: 0000000000000003 RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 000000000000095b R14: 00000000004cc0e9 R15: 000000000076bf0c Allocated by task 28233: save_stack+0x1b/0x80 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] __kasan_kmalloc mm/kasan/common.c:515 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:488 __do_kmalloc mm/slab.c:3656 [inline] __kmalloc_track_caller+0x159/0x790 mm/slab.c:3671 kvasprintf+0xb5/0x150 lib/kasprintf.c:25 kasprintf+0xbb/0xf0 lib/kasprintf.c:59 nf_tables_set_alloc_name net/netfilter/nf_tables_api.c:3536 [inline] nf_tables_newset+0x1543/0x2560 net/netfilter/nf_tables_api.c:4088 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345 ___sys_sendmsg+0x100/0x170 net/socket.c:2399 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 28233: save_stack+0x1b/0x80 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] kasan_set_free_info mm/kasan/common.c:337 [inline] __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:476 __cache_free mm/slab.c:3426 [inline] kfree+0x109/0x2b0 mm/slab.c:3757 nf_tables_set_alloc_name net/netfilter/nf_tables_api.c:3544 [inline] nf_tables_newset+0x1f73/0x2560 net/netfilter/nf_tables_api.c:4088 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345 ___sys_sendmsg+0x100/0x170 net/socket.c:2399 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8880a6032d00 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 0 bytes inside of 32-byte region [ffff8880a6032d00, ffff8880a6032d20) The buggy address belongs to the page: page:ffffea0002980c80 refcount:1 mapcount:0 mapping:ffff8880aa0001c0 index:0xffff8880a6032fc1 flags: 0xfffe0000000200(slab) raw: 00fffe0000000200 ffffea0002a3be88 ffffea00029b1908 ffff8880aa0001c0 raw: ffff8880a6032fc1 ffff8880a6032000 000000010000003e 0000000000000000 page dumped because: kasan: bad access detected Fixes: 65038428b2c6 ("netfilter: nf_tables: allow to specify stateful expression in set definition") Signed-off-by: Eric Dumazet Reported-by: syzbot Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 1 + 1 file changed, 1 insertion(+) (limited to 'net') diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index d0ab5ffa1e2c..f91e96d8de05 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3542,6 +3542,7 @@ cont: continue; if (!strcmp(set->name, i->name)) { kfree(set->name); + set->name = NULL; return -ENFILE; } } -- cgit From b135fc0801b671c50de103572b819bcd41603613 Mon Sep 17 00:00:00 2001 From: Amol Grover Date: Sun, 16 Feb 2020 22:56:54 +0530 Subject: netfilter: ipset: Pass lockdep expression to RCU lists ip_set_type_list is traversed using list_for_each_entry_rcu outside an RCU read-side critical section but under the protection of ip_set_type_mutex. Hence, add corresponding lockdep expression to silence false-positive warnings, and harden RCU lists. Signed-off-by: Amol Grover Signed-off-by: Pablo Neira Ayuso --- net/netfilter/ipset/ip_set_core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c index 8dd17589217d..340cb955af25 100644 --- a/net/netfilter/ipset/ip_set_core.c +++ b/net/netfilter/ipset/ip_set_core.c @@ -86,7 +86,8 @@ find_set_type(const char *name, u8 family, u8 revision) { struct ip_set_type *type; - list_for_each_entry_rcu(type, &ip_set_type_list, list) + list_for_each_entry_rcu(type, &ip_set_type_list, list, + lockdep_is_held(&ip_set_type_mutex)) if (STRNCMP(type->name, name) && (type->family == family || type->family == NFPROTO_UNSPEC) && -- cgit From db5c97f02373917efe2c218ebf8e3d8b19e343b6 Mon Sep 17 00:00:00 2001 From: Li RongQing Date: Thu, 2 Apr 2020 15:52:10 +0800 Subject: xsk: Fix out of boundary write in __xsk_rcv_memcpy MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit first_len is the remainder of the first page we're copying. If this size is larger, then out of page boundary write will otherwise happen. Fixes: c05cd3645814 ("xsk: add support to allow unaligned chunk placement") Signed-off-by: Li RongQing Signed-off-by: Daniel Borkmann Acked-by: Jonathan Lemon Acked-by: Björn Töpel Link: https://lore.kernel.org/bpf/1585813930-19712-1-git-send-email-lirongqing@baidu.com --- net/xdp/xsk.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 356f90e4522b..c350108aa38d 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -131,8 +131,9 @@ static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void *from_buf, u64 page_start = addr & ~(PAGE_SIZE - 1); u64 first_len = PAGE_SIZE - (addr - page_start); - memcpy(to_buf, from_buf, first_len + metalen); - memcpy(next_pg_addr, from_buf + first_len, len - first_len); + memcpy(to_buf, from_buf, first_len); + memcpy(next_pg_addr, from_buf + first_len, + len + metalen - first_len); return; } -- cgit From d9583cdf2f38d0f526d9a8c8564dd2e35e649bc7 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Tue, 7 Apr 2020 14:10:11 +0200 Subject: netfilter: nf_tables: report EOPNOTSUPP on unsupported flags/object type EINVAL should be used for malformed netlink messages. New userspace utility and old kernels might easily result in EINVAL when exercising new set features, which is misleading. Fixes: 8aeff920dcc9 ("netfilter: nf_tables: add stateful object reference to set elements") Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index f91e96d8de05..21cbde6ecee3 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3963,7 +3963,7 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk, NFT_SET_INTERVAL | NFT_SET_TIMEOUT | NFT_SET_MAP | NFT_SET_EVAL | NFT_SET_OBJECT)) - return -EINVAL; + return -EOPNOTSUPP; /* Only one of these operations is supported */ if ((flags & (NFT_SET_MAP | NFT_SET_OBJECT)) == (NFT_SET_MAP | NFT_SET_OBJECT)) @@ -4001,7 +4001,7 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk, objtype = ntohl(nla_get_be32(nla[NFTA_SET_OBJ_TYPE])); if (objtype == NFT_OBJECT_UNSPEC || objtype > NFT_OBJECT_MAX) - return -EINVAL; + return -EOPNOTSUPP; } else if (flags & NFT_SET_OBJECT) return -EINVAL; else -- cgit From ef516e8625ddea90b3a0313f3a0b0baa83db7ac2 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Tue, 7 Apr 2020 14:10:38 +0200 Subject: netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flag Stefano originally proposed to introduce this flag, users hit EOPNOTSUPP in new binaries with old kernels when defining a set with ranges in a concatenation. Fixes: f3a2181e16f1 ("netfilter: nf_tables: Support for sets with multiple ranged fields") Reviewed-by: Stefano Brivio Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 21cbde6ecee3..9adfbc7e8ae7 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3962,7 +3962,7 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk, if (flags & ~(NFT_SET_ANONYMOUS | NFT_SET_CONSTANT | NFT_SET_INTERVAL | NFT_SET_TIMEOUT | NFT_SET_MAP | NFT_SET_EVAL | - NFT_SET_OBJECT)) + NFT_SET_OBJECT | NFT_SET_CONCAT)) return -EOPNOTSUPP; /* Only one of these operations is supported */ if ((flags & (NFT_SET_MAP | NFT_SET_OBJECT)) == -- cgit From b93cfb9cd3af3adc9ba4854f178d5300f7544d3e Mon Sep 17 00:00:00 2001 From: Tim Stallard Date: Fri, 3 Apr 2020 21:22:57 +0100 Subject: net: icmp6: do not select saddr from iif when route has prefsrc set Since commit fac6fce9bdb5 ("net: icmp6: provide input address for traceroute6") ICMPv6 errors have source addresses from the ingress interface. However, this overrides when source address selection is influenced by setting preferred source addresses on routes. This can result in ICMP errors being lost to upstream BCP38 filters when the wrong source addresses are used, breaking path MTU discovery and traceroute. This patch sets the modified source address selection to only take place when the route used has no prefsrc set. It can be tested with: ip link add v1 type veth peer name v2 ip netns add test ip netns exec test ip link set lo up ip link set v2 netns test ip link set v1 up ip netns exec test ip link set v2 up ip addr add 2001:db8::1/64 dev v1 nodad ip addr add 2001:db8::3 dev v1 nodad ip netns exec test ip addr add 2001:db8::2/64 dev v2 nodad ip netns exec test ip route add unreachable 2001:db8:1::1 ip netns exec test ip addr add 2001:db8:100::1 dev lo ip netns exec test ip route add 2001:db8::1 dev v2 src 2001:db8:100::1 ip route add 2001:db8:1000::1 via 2001:db8::2 traceroute6 -s 2001:db8::1 2001:db8:1000::1 traceroute6 -s 2001:db8::3 2001:db8:1000::1 ip netns delete test Output before: $ traceroute6 -s 2001:db8::1 2001:db8:1000::1 traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets 1 2001:db8::2 (2001:db8::2) 0.843 ms !N 0.396 ms !N 0.257 ms !N $ traceroute6 -s 2001:db8::3 2001:db8:1000::1 traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets 1 2001:db8::2 (2001:db8::2) 0.772 ms !N 0.257 ms !N 0.357 ms !N After: $ traceroute6 -s 2001:db8::1 2001:db8:1000::1 traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets 1 2001:db8:100::1 (2001:db8:100::1) 8.885 ms !N 0.310 ms !N 0.174 ms !N $ traceroute6 -s 2001:db8::3 2001:db8:1000::1 traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets 1 2001:db8::2 (2001:db8::2) 1.403 ms !N 0.205 ms !N 0.313 ms !N Fixes: fac6fce9bdb5 ("net: icmp6: provide input address for traceroute6") Signed-off-by: Tim Stallard Signed-off-by: David S. Miller --- net/ipv6/icmp.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index 2688f3e82165..fc5000370030 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -229,6 +229,25 @@ static bool icmpv6_xrlim_allow(struct sock *sk, u8 type, return res; } +static bool icmpv6_rt_has_prefsrc(struct sock *sk, u8 type, + struct flowi6 *fl6) +{ + struct net *net = sock_net(sk); + struct dst_entry *dst; + bool res = false; + + dst = ip6_route_output(net, sk, fl6); + if (!dst->error) { + struct rt6_info *rt = (struct rt6_info *)dst; + struct in6_addr prefsrc; + + rt6_get_prefsrc(rt, &prefsrc); + res = !ipv6_addr_any(&prefsrc); + } + dst_release(dst); + return res; +} + /* * an inline helper for the "simple" if statement below * checks if parameter problem report is caused by an @@ -527,7 +546,7 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, saddr = force_saddr; if (saddr) { fl6.saddr = *saddr; - } else { + } else if (!icmpv6_rt_has_prefsrc(sk, type, &fl6)) { /* select a more meaningful saddr from input if */ struct net_device *in_netdev; -- cgit From a4837980fd9fa4c70a821d11831698901baef56b Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Mon, 6 Apr 2020 14:39:32 +0300 Subject: net: revert default NAPI poll timeout to 2 jiffies For HZ < 1000 timeout 2000us rounds up to 1 jiffy but expires randomly because next timer interrupt could come shortly after starting softirq. For commonly used CONFIG_HZ=1000 nothing changes. Fixes: 7acf8a1e8a28 ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning") Reported-by: Dmitry Yakunin Signed-off-by: Konstantin Khlebnikov Signed-off-by: David S. Miller --- net/core/dev.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/core/dev.c b/net/core/dev.c index 9c9e763bfe0e..df8097b8e286 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4140,7 +4140,8 @@ EXPORT_SYMBOL(netdev_max_backlog); int netdev_tstamp_prequeue __read_mostly = 1; int netdev_budget __read_mostly = 300; -unsigned int __read_mostly netdev_budget_usecs = 2000; +/* Must be at least 2 jiffes to guarantee 1 jiffy timeout */ +unsigned int __read_mostly netdev_budget_usecs = 2 * USEC_PER_SEC / HZ; int weight_p __read_mostly = 64; /* old backlog weight */ int dev_weight_rx_bias __read_mostly = 1; /* bias for backlog weight */ int dev_weight_tx_bias __read_mostly = 1; /* bias for output_queue quota */ -- cgit From a080da6ac7fa13282f1be8705cc67ceacd999ac3 Mon Sep 17 00:00:00 2001 From: Paul Blakey Date: Mon, 6 Apr 2020 18:36:56 +0300 Subject: net: sched: Fix setting last executed chain on skb extension After driver sets the missed chain on the tc skb extension it is consumed (deleted) by tc_classify_ingress and tc jumps to that chain. If tc now misses on this chain (either no match, or no goto action), then last executed chain remains 0, and the skb extension is not re-added, and the next datapath (ovs) will start from 0. Fix that by setting last executed chain to the chain read from the skb extension, so if there is a miss, we set it back. Fixes: af699626ee26 ("net: sched: Support specifying a starting chain via tc skb ext") Reviewed-by: Oz Shlomo Reviewed-by: Jiri Pirko Signed-off-by: Paul Blakey Signed-off-by: Saeed Mahameed Signed-off-by: David S. Miller --- net/sched/cls_api.c | 1 + 1 file changed, 1 insertion(+) (limited to 'net') diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index f6a3b969ead0..55bd1429678f 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -1667,6 +1667,7 @@ int tcf_classify_ingress(struct sk_buff *skb, skb_ext_del(skb, TC_SKB_EXT); tp = rcu_dereference_bh(fchain->filter_chain); + last_executed_chain = fchain->index; } ret = __tcf_classify(skb, tp, orig_tp, res, compat_mode, -- cgit From 4faab8c446def7667adf1f722456c2f4c304069c Mon Sep 17 00:00:00 2001 From: Taehee Yoo Date: Tue, 7 Apr 2020 13:23:21 +0000 Subject: hsr: check protocol version in hsr_newlink() In the current hsr code, only 0 and 1 protocol versions are valid. But current hsr code doesn't check the version, which is received by userspace. Test commands: ip link add dummy0 type dummy ip link add dummy1 type dummy ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1 version 4 In the test commands, version 4 is invalid. So, the command should be failed. After this patch, following error will occur. "Error: hsr: Only versions 0..1 are supported." Fixes: ee1c27977284 ("net/hsr: Added support for HSR v1") Signed-off-by: Taehee Yoo Signed-off-by: David S. Miller --- net/hsr/hsr_netlink.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/hsr/hsr_netlink.c b/net/hsr/hsr_netlink.c index 5465a395da04..1decb25f6764 100644 --- a/net/hsr/hsr_netlink.c +++ b/net/hsr/hsr_netlink.c @@ -69,10 +69,16 @@ static int hsr_newlink(struct net *src_net, struct net_device *dev, else multicast_spec = nla_get_u8(data[IFLA_HSR_MULTICAST_SPEC]); - if (!data[IFLA_HSR_VERSION]) + if (!data[IFLA_HSR_VERSION]) { hsr_version = 0; - else + } else { hsr_version = nla_get_u8(data[IFLA_HSR_VERSION]); + if (hsr_version > 1) { + NL_SET_ERR_MSG_MOD(extack, + "Only versions 0..1 are supported"); + return -EINVAL; + } + } return hsr_dev_finalize(dev, link, multicast_spec, hsr_version, extack); } -- cgit From 2abe05234f2e892728c388169631e4b99f354c86 Mon Sep 17 00:00:00 2001 From: Michael Weiß Date: Tue, 7 Apr 2020 13:11:48 +0200 Subject: l2tp: Allow management of tunnels and session in user namespace MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Creation and management of L2TPv3 tunnels and session through netlink requires CAP_NET_ADMIN. However, a process with CAP_NET_ADMIN in a non-initial user namespace gets an EPERM due to the use of the genetlink GENL_ADMIN_PERM flag. Thus, management of L2TP VPNs inside an unprivileged container won't work. We replaced the GENL_ADMIN_PERM by the GENL_UNS_ADMIN_PERM flag similar to other network modules which also had this problem, e.g., openvswitch (commit 4a92602aa1cd "openvswitch: allow management from inside user namespaces") and nl80211 (commit 5617c6cd6f844 "nl80211: Allow privileged operations from user namespaces"). I tested this in the container runtime trustm3 (trustm3.github.io) and was able to create l2tp tunnels and sessions in unpriviliged (user namespaced) containers using a private network namespace. For other runtimes such as docker or lxc this should work, too. Signed-off-by: Michael Weiß Signed-off-by: David S. Miller --- net/l2tp/l2tp_netlink.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'net') diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c index f5a9bdc4980c..ebb381c3f1b9 100644 --- a/net/l2tp/l2tp_netlink.c +++ b/net/l2tp/l2tp_netlink.c @@ -920,51 +920,51 @@ static const struct genl_ops l2tp_nl_ops[] = { .cmd = L2TP_CMD_TUNNEL_CREATE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = l2tp_nl_cmd_tunnel_create, - .flags = GENL_ADMIN_PERM, + .flags = GENL_UNS_ADMIN_PERM, }, { .cmd = L2TP_CMD_TUNNEL_DELETE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = l2tp_nl_cmd_tunnel_delete, - .flags = GENL_ADMIN_PERM, + .flags = GENL_UNS_ADMIN_PERM, }, { .cmd = L2TP_CMD_TUNNEL_MODIFY, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = l2tp_nl_cmd_tunnel_modify, - .flags = GENL_ADMIN_PERM, + .flags = GENL_UNS_ADMIN_PERM, }, { .cmd = L2TP_CMD_TUNNEL_GET, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = l2tp_nl_cmd_tunnel_get, .dumpit = l2tp_nl_cmd_tunnel_dump, - .flags = GENL_ADMIN_PERM, + .flags = GENL_UNS_ADMIN_PERM, }, { .cmd = L2TP_CMD_SESSION_CREATE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = l2tp_nl_cmd_session_create, - .flags = GENL_ADMIN_PERM, + .flags = GENL_UNS_ADMIN_PERM, }, { .cmd = L2TP_CMD_SESSION_DELETE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = l2tp_nl_cmd_session_delete, - .flags = GENL_ADMIN_PERM, + .flags = GENL_UNS_ADMIN_PERM, }, { .cmd = L2TP_CMD_SESSION_MODIFY, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = l2tp_nl_cmd_session_modify, - .flags = GENL_ADMIN_PERM, + .flags = GENL_UNS_ADMIN_PERM, }, { .cmd = L2TP_CMD_SESSION_GET, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .doit = l2tp_nl_cmd_session_get, .dumpit = l2tp_nl_cmd_session_dump, - .flags = GENL_ADMIN_PERM, + .flags = GENL_UNS_ADMIN_PERM, }, }; -- cgit From f691a25ce5e5e405156ad4091c8e660b2622b7ad Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Wed, 8 Apr 2020 20:54:43 +0200 Subject: net/tls: fix const assignment warning Building with some experimental patches, I came across a warning in the tls code: include/linux/compiler.h:215:30: warning: assignment discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers] 215 | *(volatile typeof(x) *)&(x) = (val); \ | ^ net/tls/tls_main.c:650:4: note: in expansion of macro 'smp_store_release' 650 | smp_store_release(&saved_tcpv4_prot, prot); This appears to be a legitimate warning about assigning a const pointer into the non-const 'saved_tcpv4_prot' global. Annotate both the ipv4 and ipv6 pointers 'const' to make the code internally consistent. Fixes: 5bb4c45d466c ("net/tls: Read sk_prot once when building tls proto ops") Signed-off-by: Arnd Bergmann Signed-off-by: David S. Miller --- net/tls/tls_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net') diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c index 156efce50dbd..0e989005bdc2 100644 --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -56,9 +56,9 @@ enum { TLS_NUM_PROTS, }; -static struct proto *saved_tcpv6_prot; +static const struct proto *saved_tcpv6_prot; static DEFINE_MUTEX(tcpv6_prot_mutex); -static struct proto *saved_tcpv4_prot; +static const struct proto *saved_tcpv4_prot; static DEFINE_MUTEX(tcpv4_prot_mutex); static struct proto tls_prots[TLS_NUM_PROTS][TLS_NUM_CONFIG][TLS_NUM_CONFIG]; static struct proto_ops tls_sw_proto_ops; -- cgit From 8e368dc72e86ad1e1a612416f32d5ad22dca88bc Mon Sep 17 00:00:00 2001 From: Joe Stringer Date: Tue, 7 Apr 2020 20:35:40 -0700 Subject: bpf: Fix use of sk->sk_reuseport from sk_assign In testing, we found that for request sockets the sk->sk_reuseport field may yet be uninitialized, which caused bpf_sk_assign() to randomly succeed or return -ESOCKTNOSUPPORT when handling the forward ACK in a three-way handshake. Fix it by only applying the reuseport check for full sockets. Fixes: cf7fbe660f2d ("bpf: Add socket assign support") Signed-off-by: Joe Stringer Signed-off-by: Daniel Borkmann Acked-by: Martin KaFai Lau Link: https://lore.kernel.org/bpf/20200408033540.10339-1-joe@wand.net.nz --- net/core/filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/core/filter.c b/net/core/filter.c index 7628b947dbc3..7d6ceaa54d21 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5925,7 +5925,7 @@ BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags) return -EOPNOTSUPP; if (unlikely(dev_net(skb->dev) != sock_net(sk))) return -ENETUNREACH; - if (unlikely(sk->sk_reuseport)) + if (unlikely(sk_fullsock(sk) && sk->sk_reuseport)) return -ESOCKTNOSUPPORT; if (sk_is_refcounted(sk) && unlikely(!refcount_inc_not_zero(&sk->sk_refcnt))) -- cgit From 6dbf02acef69b0742c238574583b3068afbd227c Mon Sep 17 00:00:00 2001 From: Wang Wenhu Date: Wed, 8 Apr 2020 19:53:53 -0700 Subject: net: qrtr: send msgs from local of same id as broadcast If the local node id(qrtr_local_nid) is not modified after its initialization, it equals to the broadcast node id(QRTR_NODE_BCAST). So the messages from local node should not be taken as broadcast and keep the process going to send them out anyway. The definitions are as follow: static unsigned int qrtr_local_nid = NUMA_NO_NODE; Fixes: fdf5fd397566 ("net: qrtr: Broadcast messages only from control port") Signed-off-by: Wang Wenhu Signed-off-by: David S. Miller --- net/qrtr/qrtr.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'net') diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c index e22092e4a783..7ed31b5e77e4 100644 --- a/net/qrtr/qrtr.c +++ b/net/qrtr/qrtr.c @@ -906,20 +906,21 @@ static int qrtr_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) node = NULL; if (addr->sq_node == QRTR_NODE_BCAST) { - enqueue_fn = qrtr_bcast_enqueue; - if (addr->sq_port != QRTR_PORT_CTRL) { + if (addr->sq_port != QRTR_PORT_CTRL && + qrtr_local_nid != QRTR_NODE_BCAST) { release_sock(sk); return -ENOTCONN; } + enqueue_fn = qrtr_bcast_enqueue; } else if (addr->sq_node == ipc->us.sq_node) { enqueue_fn = qrtr_local_enqueue; } else { - enqueue_fn = qrtr_node_enqueue; node = qrtr_node_lookup(addr->sq_node); if (!node) { release_sock(sk); return -ECONNRESET; } + enqueue_fn = qrtr_node_enqueue; } plen = (len + 3) & ~3; -- cgit From 5f0224a6ddc3101ab9664a5f7a6287047934a079 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Thu, 9 Apr 2020 14:41:26 +0100 Subject: net-sysfs: remove redundant assignment to variable ret The variable ret is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King Signed-off-by: David S. Miller --- net/core/net-sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index cf0215734ceb..4773ad6ec111 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -80,7 +80,7 @@ static ssize_t netdev_store(struct device *dev, struct device_attribute *attr, struct net_device *netdev = to_net_dev(dev); struct net *net = dev_net(netdev); unsigned long new; - int ret = -EINVAL; + int ret; if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) return -EPERM; -- cgit From e228a5d05e9ee25878e9a40de96e7ceb579d4893 Mon Sep 17 00:00:00 2001 From: Ka-Cheong Poon Date: Wed, 8 Apr 2020 03:21:01 -0700 Subject: net/rds: Replace struct rds_mr's r_refcount with struct kref And removed rds_mr_put(). Signed-off-by: Ka-Cheong Poon Acked-by: Santosh Shilimkar Signed-off-by: David S. Miller --- net/rds/message.c | 6 +++--- net/rds/rdma.c | 28 +++++++++++++++------------- net/rds/rds.h | 9 ++------- 3 files changed, 20 insertions(+), 23 deletions(-) (limited to 'net') diff --git a/net/rds/message.c b/net/rds/message.c index 50f13f1d4ae0..bbecb8cb873e 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 Oracle. All rights reserved. + * Copyright (c) 2006, 2020 Oracle and/or its affiliates. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -162,12 +162,12 @@ static void rds_message_purge(struct rds_message *rm) if (rm->rdma.op_active) rds_rdma_free_op(&rm->rdma); if (rm->rdma.op_rdma_mr) - rds_mr_put(rm->rdma.op_rdma_mr); + kref_put(&rm->rdma.op_rdma_mr->r_kref, __rds_put_mr_final); if (rm->atomic.op_active) rds_atomic_free_op(&rm->atomic); if (rm->atomic.op_rdma_mr) - rds_mr_put(rm->atomic.op_rdma_mr); + kref_put(&rm->atomic.op_rdma_mr->r_kref, __rds_put_mr_final); } void rds_message_put(struct rds_message *rm) diff --git a/net/rds/rdma.c b/net/rds/rdma.c index 585e6b3b69ce..f828b66978e4 100644 --- a/net/rds/rdma.c +++ b/net/rds/rdma.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2007, 2017 Oracle and/or its affiliates. All rights reserved. + * Copyright (c) 2007, 2020 Oracle and/or its affiliates. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -84,7 +84,7 @@ static struct rds_mr *rds_mr_tree_walk(struct rb_root *root, u64 key, if (insert) { rb_link_node(&insert->r_rb_node, parent, p); rb_insert_color(&insert->r_rb_node, root); - refcount_inc(&insert->r_refcount); + kref_get(&insert->r_kref); } return NULL; } @@ -99,7 +99,7 @@ static void rds_destroy_mr(struct rds_mr *mr) unsigned long flags; rdsdebug("RDS: destroy mr key is %x refcnt %u\n", - mr->r_key, refcount_read(&mr->r_refcount)); + mr->r_key, kref_read(&mr->r_kref)); if (test_and_set_bit(RDS_MR_DEAD, &mr->r_state)) return; @@ -115,8 +115,10 @@ static void rds_destroy_mr(struct rds_mr *mr) mr->r_trans->free_mr(trans_private, mr->r_invalidate); } -void __rds_put_mr_final(struct rds_mr *mr) +void __rds_put_mr_final(struct kref *kref) { + struct rds_mr *mr = container_of(kref, struct rds_mr, r_kref); + rds_destroy_mr(mr); kfree(mr); } @@ -141,7 +143,7 @@ void rds_rdma_drop_keys(struct rds_sock *rs) RB_CLEAR_NODE(&mr->r_rb_node); spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); rds_destroy_mr(mr); - rds_mr_put(mr); + kref_put(&mr->r_kref, __rds_put_mr_final); spin_lock_irqsave(&rs->rs_rdma_lock, flags); } spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); @@ -242,7 +244,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, goto out; } - refcount_set(&mr->r_refcount, 1); + kref_init(&mr->r_kref); RB_CLEAR_NODE(&mr->r_rb_node); mr->r_trans = rs->rs_transport; mr->r_sock = rs; @@ -343,7 +345,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, rdsdebug("RDS: get_mr key is %x\n", mr->r_key); if (mr_ret) { - refcount_inc(&mr->r_refcount); + kref_get(&mr->r_kref); *mr_ret = mr; } @@ -351,7 +353,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, out: kfree(pages); if (mr) - rds_mr_put(mr); + kref_put(&mr->r_kref, __rds_put_mr_final); return ret; } @@ -440,7 +442,7 @@ int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen) * someone else drops their ref. */ rds_destroy_mr(mr); - rds_mr_put(mr); + kref_put(&mr->r_kref, __rds_put_mr_final); return 0; } @@ -481,7 +483,7 @@ void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force) * trigger an async flush. */ if (zot_me) { rds_destroy_mr(mr); - rds_mr_put(mr); + kref_put(&mr->r_kref, __rds_put_mr_final); } } @@ -490,7 +492,7 @@ void rds_rdma_free_op(struct rm_rdma_op *ro) unsigned int i; if (ro->op_odp_mr) { - rds_mr_put(ro->op_odp_mr); + kref_put(&ro->op_odp_mr->r_kref, __rds_put_mr_final); } else { for (i = 0; i < ro->op_nents; i++) { struct page *page = sg_page(&ro->op_sg[i]); @@ -730,7 +732,7 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, goto out_pages; } RB_CLEAR_NODE(&local_odp_mr->r_rb_node); - refcount_set(&local_odp_mr->r_refcount, 1); + kref_init(&local_odp_mr->r_kref); local_odp_mr->r_trans = rs->rs_transport; local_odp_mr->r_sock = rs; local_odp_mr->r_trans_private = @@ -827,7 +829,7 @@ int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm, if (!mr) err = -EINVAL; /* invalid r_key */ else - refcount_inc(&mr->r_refcount); + kref_get(&mr->r_kref); spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); if (mr) { diff --git a/net/rds/rds.h b/net/rds/rds.h index e4a603523083..3cda01cfaa56 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -291,7 +291,7 @@ struct rds_incoming { struct rds_mr { struct rb_node r_rb_node; - refcount_t r_refcount; + struct kref r_kref; u32 r_key; /* A copy of the creation flags */ @@ -946,12 +946,7 @@ void rds_atomic_send_complete(struct rds_message *rm, int wc_status); int rds_cmsg_atomic(struct rds_sock *rs, struct rds_message *rm, struct cmsghdr *cmsg); -void __rds_put_mr_final(struct rds_mr *mr); -static inline void rds_mr_put(struct rds_mr *mr) -{ - if (refcount_dec_and_test(&mr->r_refcount)) - __rds_put_mr_final(mr); -} +void __rds_put_mr_final(struct kref *kref); static inline bool rds_destroy_pending(struct rds_connection *conn) { -- cgit From 2fabef4f65b46b261434a27ecdce291b63de8522 Mon Sep 17 00:00:00 2001 From: Ka-Cheong Poon Date: Wed, 8 Apr 2020 03:21:02 -0700 Subject: net/rds: Fix MR reference counting problem In rds_free_mr(), it calls rds_destroy_mr(mr) directly. But this defeats the purpose of reference counting and makes MR free handling impossible. It means that holding a reference does not guarantee that it is safe to access some fields. For example, In rds_cmsg_rdma_dest(), it increases the ref count, unlocks and then calls mr->r_trans->sync_mr(). But if rds_free_mr() (and rds_destroy_mr()) is called in between (there is no lock preventing this to happen), r_trans_private is set to NULL, causing a panic. Similar issue is in rds_rdma_unuse(). Reported-by: zerons Signed-off-by: Ka-Cheong Poon Acked-by: Santosh Shilimkar Signed-off-by: David S. Miller --- net/rds/rdma.c | 25 ++++++++++++------------- net/rds/rds.h | 8 -------- 2 files changed, 12 insertions(+), 21 deletions(-) (limited to 'net') diff --git a/net/rds/rdma.c b/net/rds/rdma.c index f828b66978e4..113e442101ce 100644 --- a/net/rds/rdma.c +++ b/net/rds/rdma.c @@ -101,9 +101,6 @@ static void rds_destroy_mr(struct rds_mr *mr) rdsdebug("RDS: destroy mr key is %x refcnt %u\n", mr->r_key, kref_read(&mr->r_kref)); - if (test_and_set_bit(RDS_MR_DEAD, &mr->r_state)) - return; - spin_lock_irqsave(&rs->rs_rdma_lock, flags); if (!RB_EMPTY_NODE(&mr->r_rb_node)) rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); @@ -142,7 +139,6 @@ void rds_rdma_drop_keys(struct rds_sock *rs) rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); RB_CLEAR_NODE(&mr->r_rb_node); spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); - rds_destroy_mr(mr); kref_put(&mr->r_kref, __rds_put_mr_final); spin_lock_irqsave(&rs->rs_rdma_lock, flags); } @@ -436,12 +432,6 @@ int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen) if (!mr) return -EINVAL; - /* - * call rds_destroy_mr() ourselves so that we're sure it's done by the time - * we return. If we let rds_mr_put() do it it might not happen until - * someone else drops their ref. - */ - rds_destroy_mr(mr); kref_put(&mr->r_kref, __rds_put_mr_final); return 0; } @@ -466,6 +456,14 @@ void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force) return; } + /* Get a reference so that the MR won't go away before calling + * sync_mr() below. + */ + kref_get(&mr->r_kref); + + /* If it is going to be freed, remove it from the tree now so + * that no other thread can find it and free it. + */ if (mr->r_use_once || force) { rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); RB_CLEAR_NODE(&mr->r_rb_node); @@ -479,12 +477,13 @@ void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force) if (mr->r_trans->sync_mr) mr->r_trans->sync_mr(mr->r_trans_private, DMA_FROM_DEVICE); + /* Release the reference held above. */ + kref_put(&mr->r_kref, __rds_put_mr_final); + /* If the MR was marked as invalidate, this will * trigger an async flush. */ - if (zot_me) { - rds_destroy_mr(mr); + if (zot_me) kref_put(&mr->r_kref, __rds_put_mr_final); - } } void rds_rdma_free_op(struct rm_rdma_op *ro) diff --git a/net/rds/rds.h b/net/rds/rds.h index 3cda01cfaa56..8e18cd2aec51 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -299,19 +299,11 @@ struct rds_mr { unsigned int r_invalidate:1; unsigned int r_write:1; - /* This is for RDS_MR_DEAD. - * It would be nice & consistent to make this part of the above - * bit field here, but we need to use test_and_set_bit. - */ - unsigned long r_state; struct rds_sock *r_sock; /* back pointer to the socket that owns us */ struct rds_transport *r_trans; void *r_trans_private; }; -/* Flags for mr->r_state */ -#define RDS_MR_DEAD 0 - static inline rds_rdma_cookie_t rds_rdma_make_cookie(u32 r_key, u32 offset) { return r_key | (((u64) offset) << 32); -- cgit From 690cc86321eb9bcee371710252742fb16fe96824 Mon Sep 17 00:00:00 2001 From: Taras Chornyi Date: Thu, 9 Apr 2020 20:25:24 +0300 Subject: net: ipv4: devinet: Fix crash when add/del multicast IP with autojoin When CONFIG_IP_MULTICAST is not set and multicast ip is added to the device with autojoin flag or when multicast ip is deleted kernel will crash. steps to reproduce: ip addr add 224.0.0.0/32 dev eth0 ip addr del 224.0.0.0/32 dev eth0 or ip addr add 224.0.0.0/32 dev eth0 autojoin Unable to handle kernel NULL pointer dereference at virtual address 0000000000000088 pc : _raw_write_lock_irqsave+0x1e0/0x2ac lr : lock_sock_nested+0x1c/0x60 Call trace: _raw_write_lock_irqsave+0x1e0/0x2ac lock_sock_nested+0x1c/0x60 ip_mc_config.isra.28+0x50/0xe0 inet_rtm_deladdr+0x1a8/0x1f0 rtnetlink_rcv_msg+0x120/0x350 netlink_rcv_skb+0x58/0x120 rtnetlink_rcv+0x14/0x20 netlink_unicast+0x1b8/0x270 netlink_sendmsg+0x1a0/0x3b0 ____sys_sendmsg+0x248/0x290 ___sys_sendmsg+0x80/0xc0 __sys_sendmsg+0x68/0xc0 __arm64_sys_sendmsg+0x20/0x30 el0_svc_common.constprop.2+0x88/0x150 do_el0_svc+0x20/0x80 el0_sync_handler+0x118/0x190 el0_sync+0x140/0x180 Fixes: 93a714d6b53d ("multicast: Extend ip address command to enable multicast group join/leave on") Signed-off-by: Taras Chornyi Signed-off-by: Vadym Kochan Signed-off-by: David S. Miller --- net/ipv4/devinet.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'net') diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index 30fa42f5997d..c0dd561aa190 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -614,12 +614,15 @@ struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, __be32 prefix, return NULL; } -static int ip_mc_config(struct sock *sk, bool join, const struct in_ifaddr *ifa) +static int ip_mc_autojoin_config(struct net *net, bool join, + const struct in_ifaddr *ifa) { +#if defined(CONFIG_IP_MULTICAST) struct ip_mreqn mreq = { .imr_multiaddr.s_addr = ifa->ifa_address, .imr_ifindex = ifa->ifa_dev->dev->ifindex, }; + struct sock *sk = net->ipv4.mc_autojoin_sk; int ret; ASSERT_RTNL(); @@ -632,6 +635,9 @@ static int ip_mc_config(struct sock *sk, bool join, const struct in_ifaddr *ifa) release_sock(sk); return ret; +#else + return -EOPNOTSUPP; +#endif } static int inet_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, @@ -675,7 +681,7 @@ static int inet_rtm_deladdr(struct sk_buff *skb, struct nlmsghdr *nlh, continue; if (ipv4_is_multicast(ifa->ifa_address)) - ip_mc_config(net->ipv4.mc_autojoin_sk, false, ifa); + ip_mc_autojoin_config(net, false, ifa); __inet_del_ifa(in_dev, ifap, 1, nlh, NETLINK_CB(skb).portid); return 0; } @@ -940,8 +946,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh, */ set_ifa_lifetime(ifa, valid_lft, prefered_lft); if (ifa->ifa_flags & IFA_F_MCAUTOJOIN) { - int ret = ip_mc_config(net->ipv4.mc_autojoin_sk, - true, ifa); + int ret = ip_mc_autojoin_config(net, true, ifa); if (ret < 0) { inet_free_ifa(ifa); -- cgit From e154659ba39a1c2be576aaa0a5bda8088d707950 Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Sat, 11 Apr 2020 21:05:01 +0200 Subject: mptcp: fix double-unlock in mptcp_poll mptcp_connect/28740 is trying to release lock (sk_lock-AF_INET) at: [] mptcp_poll+0xb9/0x550 but there are no more locks to release! Call Trace: lock_release+0x50f/0x750 release_sock+0x171/0x1b0 mptcp_poll+0xb9/0x550 sock_poll+0x157/0x470 ? get_net_ns+0xb0/0xb0 do_sys_poll+0x63c/0xdd0 Problem is that __mptcp_tcp_fallback() releases the mptcp socket lock, but after recent change it doesn't do this in all of its return paths. To fix this, remove the unlock from __mptcp_tcp_fallback() and always do the unlock in the caller. Also add a small comment as to why we have this __mptcp_needs_tcp_fallback(). Fixes: 0b4f33def7bbde ("mptcp: fix tcp fallback crash") Reported-by: syzbot+e56606435b7bfeea8cf5@syzkaller.appspotmail.com Signed-off-by: Florian Westphal Signed-off-by: David S. Miller --- net/mptcp/protocol.c | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) (limited to 'net') diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 939a5045181a..9936e33ac351 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -97,12 +97,7 @@ static struct socket *__mptcp_tcp_fallback(struct mptcp_sock *msk) if (likely(!__mptcp_needs_tcp_fallback(msk))) return NULL; - if (msk->subflow) { - release_sock((struct sock *)msk); - return msk->subflow; - } - - return NULL; + return msk->subflow; } static bool __mptcp_can_create_subflow(const struct mptcp_sock *msk) @@ -734,9 +729,10 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) goto out; } +fallback: ssock = __mptcp_tcp_fallback(msk); if (unlikely(ssock)) { -fallback: + release_sock(sk); pr_debug("fallback passthrough"); ret = sock_sendmsg(ssock, msg); return ret >= 0 ? ret + copied : (copied ? copied : ret); @@ -769,8 +765,14 @@ fallback: if (ret < 0) break; if (ret == 0 && unlikely(__mptcp_needs_tcp_fallback(msk))) { + /* Can happen for passive sockets: + * 3WHS negotiated MPTCP, but first packet after is + * plain TCP (e.g. due to middlebox filtering unknown + * options). + * + * Fall back to TCP. + */ release_sock(ssk); - ssock = __mptcp_tcp_fallback(msk); goto fallback; } @@ -883,6 +885,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, ssock = __mptcp_tcp_fallback(msk); if (unlikely(ssock)) { fallback: + release_sock(sk); pr_debug("fallback-read subflow=%p", mptcp_subflow_ctx(ssock->sk)); copied = sock_recvmsg(ssock, msg, flags); @@ -1467,12 +1470,11 @@ static int mptcp_setsockopt(struct sock *sk, int level, int optname, */ lock_sock(sk); ssock = __mptcp_tcp_fallback(msk); + release_sock(sk); if (ssock) return tcp_setsockopt(ssock->sk, level, optname, optval, optlen); - release_sock(sk); - return -EOPNOTSUPP; } @@ -1492,12 +1494,11 @@ static int mptcp_getsockopt(struct sock *sk, int level, int optname, */ lock_sock(sk); ssock = __mptcp_tcp_fallback(msk); + release_sock(sk); if (ssock) return tcp_getsockopt(ssock->sk, level, optname, optval, option); - release_sock(sk); - return -EOPNOTSUPP; } -- cgit From 0e012b4e4b5ec8e064be3502382579dd0bb43269 Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Sun, 12 Apr 2020 00:40:30 +0200 Subject: nl80211: fix NL80211_ATTR_FTM_RESPONDER policy The nested policy here should be established using the NLA_POLICY_NESTED() macro so the length is properly filled in. Cc: stable@vger.kernel.org Fixes: 81e54d08d9d8 ("cfg80211: support FTM responder configuration/statistics") Link: https://lore.kernel.org/r/20200412004029.9d0722bb56c8.Ie690bfcc4a1a61ff8d8ca7e475d59fcaa52fb2da@changeid Signed-off-by: Johannes Berg --- net/wireless/nl80211.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) (limited to 'net') diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c index 5fa402144cda..692bcd35f809 100644 --- a/net/wireless/nl80211.c +++ b/net/wireless/nl80211.c @@ -644,10 +644,8 @@ const struct nla_policy nl80211_policy[NUM_NL80211_ATTR] = { [NL80211_ATTR_HE_CAPABILITY] = { .type = NLA_BINARY, .len = NL80211_HE_MAX_CAPABILITY_LEN }, - [NL80211_ATTR_FTM_RESPONDER] = { - .type = NLA_NESTED, - .validation_data = nl80211_ftm_responder_policy, - }, + [NL80211_ATTR_FTM_RESPONDER] = + NLA_POLICY_NESTED(nl80211_ftm_responder_policy), [NL80211_ATTR_TIMEOUT] = NLA_POLICY_MIN(NLA_U32, 1), [NL80211_ATTR_PEER_MEASUREMENTS] = NLA_POLICY_NESTED(nl80211_pmsr_attr_policy), -- cgit From dfa74909cb6b846cbdabfc2c3c7de1d507fca075 Mon Sep 17 00:00:00 2001 From: David Ahern Date: Sun, 12 Apr 2020 07:32:04 -0600 Subject: xdp: Reset prog in dev_change_xdp_fd when fd is negative MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The commit mentioned in the Fixes tag reuses the local prog variable when looking up an expected_fd. The variable is not reset when fd < 0 causing a detach with the expected_fd set to actually call dev_xdp_install for the existing program. The end result is that the detach does not happen. Fixes: 92234c8f15c8 ("xdp: Support specifying expected existing program when attaching XDP") Signed-off-by: David Ahern Signed-off-by: Daniel Borkmann Reviewed-by: Jakub Kicinski Reviewed-by: Toke Høiland-Jørgensen Link: https://lore.kernel.org/bpf/20200412133204.43847-1-dsahern@kernel.org --- net/core/dev.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/core/dev.c b/net/core/dev.c index df8097b8e286..522288177bbd 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -8667,8 +8667,8 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, const struct net_device_ops *ops = dev->netdev_ops; enum bpf_netdev_command query; u32 prog_id, expected_id = 0; - struct bpf_prog *prog = NULL; bpf_op_t bpf_op, bpf_chk; + struct bpf_prog *prog; bool offload; int err; @@ -8734,6 +8734,7 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, } else { if (!prog_id) return 0; + prog = NULL; } err = dev_xdp_install(dev, bpf_op, extack, flags, prog); -- cgit From 0e631eee17dcea576ab922fa70e4fdbd596ee452 Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 13 Apr 2020 13:57:14 +0100 Subject: rxrpc: Fix DATA Tx to disable nofrag for UDP on AF_INET6 socket Fix the DATA packet transmission to disable nofrag for UDPv4 on an AF_INET6 socket as well as UDPv6 when trying to transmit fragmentably. Without this, packets filled to the normal size used by the kernel AFS client of 1412 bytes be rejected by udp_sendmsg() with EMSGSIZE immediately. The ->sk_error_report() notification hook is called, but rxrpc doesn't generate a trace for it. This is a temporary fix; a more permanent solution needs to involve changing the size of the packets being filled in accordance with the MTU, which isn't currently done in AF_RXRPC. The reason for not doing so was that, barring the last packet in an rx jumbo packet, jumbos can only be assembled out of 1412-byte packets - and the plan was to construct jumbos on the fly at transmission time. Also, there's no point turning on IPV6_MTU_DISCOVER, since IPv6 has to engage in this anyway since fragmentation is only done by the sender. We can then condense the switch-statement in rxrpc_send_data_packet(). Fixes: 75b54cb57ca3 ("rxrpc: Add IPv6 support") Signed-off-by: David Howells Signed-off-by: David S. Miller --- net/rxrpc/local_object.c | 9 --------- net/rxrpc/output.c | 44 ++++++++++++-------------------------------- 2 files changed, 12 insertions(+), 41 deletions(-) (limited to 'net') diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c index a6c1349e965d..01135e54d95d 100644 --- a/net/rxrpc/local_object.c +++ b/net/rxrpc/local_object.c @@ -165,15 +165,6 @@ static int rxrpc_open_socket(struct rxrpc_local *local, struct net *net) goto error; } - /* we want to set the don't fragment bit */ - opt = IPV6_PMTUDISC_DO; - ret = kernel_setsockopt(local->socket, SOL_IPV6, IPV6_MTU_DISCOVER, - (char *) &opt, sizeof(opt)); - if (ret < 0) { - _debug("setsockopt failed"); - goto error; - } - /* Fall through and set IPv4 options too otherwise we don't get * errors from IPv4 packets sent through the IPv6 socket. */ diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c index bad3d2420344..90e263c6aa69 100644 --- a/net/rxrpc/output.c +++ b/net/rxrpc/output.c @@ -474,41 +474,21 @@ send_fragmentable: skb->tstamp = ktime_get_real(); switch (conn->params.local->srx.transport.family) { + case AF_INET6: case AF_INET: opt = IP_PMTUDISC_DONT; - ret = kernel_setsockopt(conn->params.local->socket, - SOL_IP, IP_MTU_DISCOVER, - (char *)&opt, sizeof(opt)); - if (ret == 0) { - ret = kernel_sendmsg(conn->params.local->socket, &msg, - iov, 2, len); - conn->params.peer->last_tx_at = ktime_get_seconds(); - - opt = IP_PMTUDISC_DO; - kernel_setsockopt(conn->params.local->socket, SOL_IP, - IP_MTU_DISCOVER, - (char *)&opt, sizeof(opt)); - } - break; - -#ifdef CONFIG_AF_RXRPC_IPV6 - case AF_INET6: - opt = IPV6_PMTUDISC_DONT; - ret = kernel_setsockopt(conn->params.local->socket, - SOL_IPV6, IPV6_MTU_DISCOVER, - (char *)&opt, sizeof(opt)); - if (ret == 0) { - ret = kernel_sendmsg(conn->params.local->socket, &msg, - iov, 2, len); - conn->params.peer->last_tx_at = ktime_get_seconds(); - - opt = IPV6_PMTUDISC_DO; - kernel_setsockopt(conn->params.local->socket, - SOL_IPV6, IPV6_MTU_DISCOVER, - (char *)&opt, sizeof(opt)); - } + kernel_setsockopt(conn->params.local->socket, + SOL_IP, IP_MTU_DISCOVER, + (char *)&opt, sizeof(opt)); + ret = kernel_sendmsg(conn->params.local->socket, &msg, + iov, 2, len); + conn->params.peer->last_tx_at = ktime_get_seconds(); + + opt = IP_PMTUDISC_DO; + kernel_setsockopt(conn->params.local->socket, + SOL_IP, IP_MTU_DISCOVER, + (char *)&opt, sizeof(opt)); break; -#endif default: BUG(); -- cgit From 3be98b2d5fbca3da7c4df0477eed95bfb5b83d64 Mon Sep 17 00:00:00 2001 From: Andrew Lunn Date: Tue, 14 Apr 2020 02:34:39 +0200 Subject: net: dsa: Down cpu/dsa ports phylink will control DSA and CPU ports can be configured in two ways. By default, the driver should configure such ports to there maximum bandwidth. For most use cases, this is sufficient. When this default is insufficient, a phylink instance can be bound to such ports, and phylink will configure the port, e.g. based on fixed-link properties. phylink assumes the port is initially down. Given that the driver should have already configured it to its maximum speed, ask the driver to down the port before instantiating the phylink instance. Fixes: 30c4a5b0aad8 ("net: mv88e6xxx: use resolved link config in mac_link_up()") Signed-off-by: Andrew Lunn Signed-off-by: David S. Miller --- net/dsa/port.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) (limited to 'net') diff --git a/net/dsa/port.c b/net/dsa/port.c index 231b2d494f1c..a58fdd362574 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -670,11 +670,16 @@ int dsa_port_link_register_of(struct dsa_port *dp) { struct dsa_switch *ds = dp->ds; struct device_node *phy_np; + int port = dp->index; if (!ds->ops->adjust_link) { phy_np = of_parse_phandle(dp->dn, "phy-handle", 0); - if (of_phy_is_fixed_link(dp->dn) || phy_np) + if (of_phy_is_fixed_link(dp->dn) || phy_np) { + if (ds->ops->phylink_mac_link_down) + ds->ops->phylink_mac_link_down(ds, port, + MLO_AN_FIXED, PHY_INTERFACE_MODE_NA); return dsa_port_phylink_register(dp); + } return 0; } -- cgit From 52e04b4ce5d03775b6a78f3ed1097480faacc9fd Mon Sep 17 00:00:00 2001 From: Sumit Garg Date: Tue, 7 Apr 2020 15:40:55 +0530 Subject: mac80211: fix race in ieee80211_register_hw() A race condition leading to a kernel crash is observed during invocation of ieee80211_register_hw() on a dragonboard410c device having wcn36xx driver built as a loadable module along with a wifi manager in user-space waiting for a wifi device (wlanX) to be active. Sequence diagram for a particular kernel crash scenario: user-space ieee80211_register_hw() ieee80211_tasklet_handler() ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | | | |<---phy0----wiphy_register() | |-----iwd if_add---->| | | |<---IRQ----(RX packet) | Kernel crash | | due to unallocated | | workqueue. | | | | | alloc_ordered_workqueue() | | | | | Misc wiphy init. | | | | | ieee80211_if_add() | | | | As evident from above sequence diagram, this race condition isn't specific to a particular wifi driver but rather the initialization sequence in ieee80211_register_hw() needs to be fixed. So re-order the initialization sequence and the updated sequence diagram would look like: user-space ieee80211_register_hw() ieee80211_tasklet_handler() ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | | | | alloc_ordered_workqueue() | | | | | Misc wiphy init. | | | | |<---phy0----wiphy_register() | |-----iwd if_add---->| | | |<---IRQ----(RX packet) | | | | ieee80211_if_add() | | | | Cc: stable@vger.kernel.org Signed-off-by: Sumit Garg Link: https://lore.kernel.org/r/1586254255-28713-1-git-send-email-sumit.garg@linaro.org [Johannes: fix rtnl imbalances] Signed-off-by: Johannes Berg --- net/mac80211/main.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) (limited to 'net') diff --git a/net/mac80211/main.c b/net/mac80211/main.c index 8345926193de..0e9ad60fb2b3 100644 --- a/net/mac80211/main.c +++ b/net/mac80211/main.c @@ -1069,7 +1069,7 @@ int ieee80211_register_hw(struct ieee80211_hw *hw) local->hw.wiphy->signal_type = CFG80211_SIGNAL_TYPE_UNSPEC; if (hw->max_signal <= 0) { result = -EINVAL; - goto fail_wiphy_register; + goto fail_workqueue; } } @@ -1135,7 +1135,7 @@ int ieee80211_register_hw(struct ieee80211_hw *hw) result = ieee80211_init_cipher_suites(local); if (result < 0) - goto fail_wiphy_register; + goto fail_workqueue; if (!local->ops->remain_on_channel) local->hw.wiphy->max_remain_on_channel_duration = 5000; @@ -1161,10 +1161,6 @@ int ieee80211_register_hw(struct ieee80211_hw *hw) local->hw.wiphy->max_num_csa_counters = IEEE80211_MAX_CSA_COUNTERS_NUM; - result = wiphy_register(local->hw.wiphy); - if (result < 0) - goto fail_wiphy_register; - /* * We use the number of queues for feature tests (QoS, HT) internally * so restrict them appropriately. @@ -1217,9 +1213,9 @@ int ieee80211_register_hw(struct ieee80211_hw *hw) goto fail_flows; rtnl_lock(); - result = ieee80211_init_rate_ctrl_alg(local, hw->rate_control_algorithm); + rtnl_unlock(); if (result < 0) { wiphy_debug(local->hw.wiphy, "Failed to initialize rate control algorithm\n"); @@ -1273,6 +1269,12 @@ int ieee80211_register_hw(struct ieee80211_hw *hw) local->sband_allocated |= BIT(band); } + result = wiphy_register(local->hw.wiphy); + if (result < 0) + goto fail_wiphy_register; + + rtnl_lock(); + /* add one default STA interface if supported */ if (local->hw.wiphy->interface_modes & BIT(NL80211_IFTYPE_STATION) && !ieee80211_hw_check(hw, NO_AUTO_VIF)) { @@ -1312,17 +1314,17 @@ int ieee80211_register_hw(struct ieee80211_hw *hw) #if defined(CONFIG_INET) || defined(CONFIG_IPV6) fail_ifa: #endif + wiphy_unregister(local->hw.wiphy); + fail_wiphy_register: rtnl_lock(); rate_control_deinitialize(local); ieee80211_remove_interfaces(local); - fail_rate: rtnl_unlock(); + fail_rate: fail_flows: ieee80211_led_exit(local); destroy_workqueue(local->workqueue); fail_workqueue: - wiphy_unregister(local->hw.wiphy); - fail_wiphy_register: if (local->wiphy_ciphers_allocated) kfree(local->hw.wiphy->cipher_suites); kfree(local->int_scan_req); @@ -1372,8 +1374,8 @@ void ieee80211_unregister_hw(struct ieee80211_hw *hw) skb_queue_purge(&local->skb_queue_unreliable); skb_queue_purge(&local->skb_queue_tdls_chsw); - destroy_workqueue(local->workqueue); wiphy_unregister(local->hw.wiphy); + destroy_workqueue(local->workqueue); ieee80211_led_exit(local); kfree(local->int_scan_req); } -- cgit From 93e2d04a1888668183f3fb48666e90b9b31d29e6 Mon Sep 17 00:00:00 2001 From: Tamizh chelvam Date: Sat, 28 Mar 2020 19:23:24 +0530 Subject: mac80211: fix channel switch trigger from unknown mesh peer Previously mesh channel switch happens if beacon contains CSA IE without checking the mesh peer info. Due to that channel switch happens even if the beacon is not from its own mesh peer. Fixing that by checking if the CSA originated from the same mesh network before proceeding for channel switch. Signed-off-by: Tamizh chelvam Link: https://lore.kernel.org/r/1585403604-29274-1-git-send-email-tamizhr@codeaurora.org Signed-off-by: Johannes Berg --- net/mac80211/mesh.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'net') diff --git a/net/mac80211/mesh.c b/net/mac80211/mesh.c index d09b3c789314..36978a0e5000 100644 --- a/net/mac80211/mesh.c +++ b/net/mac80211/mesh.c @@ -1257,15 +1257,15 @@ static void ieee80211_mesh_rx_bcn_presp(struct ieee80211_sub_if_data *sdata, sdata->u.mesh.mshcfg.rssi_threshold < rx_status->signal) mesh_neighbour_update(sdata, mgmt->sa, &elems, rx_status); + + if (ifmsh->csa_role != IEEE80211_MESH_CSA_ROLE_INIT && + !sdata->vif.csa_active) + ieee80211_mesh_process_chnswitch(sdata, &elems, true); } if (ifmsh->sync_ops) ifmsh->sync_ops->rx_bcn_presp(sdata, stype, mgmt, &elems, rx_status); - - if (ifmsh->csa_role != IEEE80211_MESH_CSA_ROLE_INIT && - !sdata->vif.csa_active) - ieee80211_mesh_process_chnswitch(sdata, &elems, true); } int ieee80211_mesh_finish_csa(struct ieee80211_sub_if_data *sdata) @@ -1373,6 +1373,9 @@ static void mesh_rx_csa_frame(struct ieee80211_sub_if_data *sdata, ieee802_11_parse_elems(pos, len - baselen, true, &elems, mgmt->bssid, NULL); + if (!mesh_matches_local(sdata, &elems)) + return; + ifmsh->chsw_ttl = elems.mesh_chansw_params_ie->mesh_ttl; if (!--ifmsh->chsw_ttl) fwd_csa = false; -- cgit From 99e3a236dd43d06c65af0a2ef9cb44306aef6e02 Mon Sep 17 00:00:00 2001 From: Magnus Karlsson Date: Tue, 14 Apr 2020 09:35:15 +0200 Subject: xsk: Add missing check on user supplied headroom size Add a check that the headroom cannot be larger than the available space in the chunk. In the current code, a malicious user can set the headroom to a value larger than the chunk size minus the fixed XDP headroom. That way packets with a length larger than the supported size in the umem could get accepted and result in an out-of-bounds write. Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt") Reported-by: Bui Quang Minh Signed-off-by: Magnus Karlsson Signed-off-by: Daniel Borkmann Link: https://bugzilla.kernel.org/show_bug.cgi?id=207225 Link: https://lore.kernel.org/bpf/1586849715-23490-1-git-send-email-magnus.karlsson@intel.com --- net/xdp/xdp_umem.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'net') diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index fa7bb5e060d0..ed7a6060f73c 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -343,7 +343,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) u32 chunk_size = mr->chunk_size, headroom = mr->headroom; unsigned int chunks, chunks_per_page; u64 addr = mr->addr, size = mr->len; - int size_chk, err; + int err; if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) { /* Strictly speaking we could support this, if: @@ -382,8 +382,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) return -EINVAL; } - size_chk = chunk_size - headroom - XDP_PACKET_HEADROOM; - if (size_chk < 0) + if (headroom >= chunk_size - XDP_PACKET_HEADROOM) return -EINVAL; umem->address = (unsigned long)addr; -- cgit From 7dba92037baf3fa00b4880a31fd532542264994c Mon Sep 17 00:00:00 2001 From: Jason Gunthorpe Date: Tue, 14 Apr 2020 20:02:07 -0300 Subject: net/rds: Use ERR_PTR for rds_message_alloc_sgs() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Returning the error code via a 'int *ret' when the function returns a pointer is very un-kernely and causes gcc 10's static analysis to choke: net/rds/message.c: In function ‘rds_message_map_pages’: net/rds/message.c:358:10: warning: ‘ret’ may be used uninitialized in this function [-Wmaybe-uninitialized] 358 | return ERR_PTR(ret); Use a typical ERR_PTR return instead. Signed-off-by: Jason Gunthorpe Acked-by: Santosh Shilimkar Signed-off-by: David S. Miller --- net/rds/message.c | 19 ++++++------------- net/rds/rdma.c | 12 ++++++++---- net/rds/rds.h | 3 +-- net/rds/send.c | 6 ++++-- 4 files changed, 19 insertions(+), 21 deletions(-) (limited to 'net') diff --git a/net/rds/message.c b/net/rds/message.c index bbecb8cb873e..071a261fdaab 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -308,26 +308,20 @@ out: /* * RDS ops use this to grab SG entries from the rm's sg pool. */ -struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents, - int *ret) +struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents) { struct scatterlist *sg_first = (struct scatterlist *) &rm[1]; struct scatterlist *sg_ret; - if (WARN_ON(!ret)) - return NULL; - if (nents <= 0) { pr_warn("rds: alloc sgs failed! nents <= 0\n"); - *ret = -EINVAL; - return NULL; + return ERR_PTR(-EINVAL); } if (rm->m_used_sgs + nents > rm->m_total_sgs) { pr_warn("rds: alloc sgs failed! total %d used %d nents %d\n", rm->m_total_sgs, rm->m_used_sgs, nents); - *ret = -ENOMEM; - return NULL; + return ERR_PTR(-ENOMEM); } sg_ret = &sg_first[rm->m_used_sgs]; @@ -343,7 +337,6 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in unsigned int i; int num_sgs = DIV_ROUND_UP(total_len, PAGE_SIZE); int extra_bytes = num_sgs * sizeof(struct scatterlist); - int ret; rm = rds_message_alloc(extra_bytes, GFP_NOWAIT); if (!rm) @@ -352,10 +345,10 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in set_bit(RDS_MSG_PAGEVEC, &rm->m_flags); rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len); rm->data.op_nents = DIV_ROUND_UP(total_len, PAGE_SIZE); - rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs, &ret); - if (!rm->data.op_sg) { + rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs); + if (IS_ERR(rm->data.op_sg)) { rds_message_put(rm); - return ERR_PTR(ret); + return ERR_CAST(rm->data.op_sg); } for (i = 0; i < rm->data.op_nents; ++i) { diff --git a/net/rds/rdma.c b/net/rds/rdma.c index 113e442101ce..a7ae11846cd7 100644 --- a/net/rds/rdma.c +++ b/net/rds/rdma.c @@ -665,9 +665,11 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, op->op_odp_mr = NULL; WARN_ON(!nr_pages); - op->op_sg = rds_message_alloc_sgs(rm, nr_pages, &ret); - if (!op->op_sg) + op->op_sg = rds_message_alloc_sgs(rm, nr_pages); + if (IS_ERR(op->op_sg)) { + ret = PTR_ERR(op->op_sg); goto out_pages; + } if (op->op_notify || op->op_recverr) { /* We allocate an uninitialized notifier here, because @@ -906,9 +908,11 @@ int rds_cmsg_atomic(struct rds_sock *rs, struct rds_message *rm, rm->atomic.op_silent = !!(args->flags & RDS_RDMA_SILENT); rm->atomic.op_active = 1; rm->atomic.op_recverr = rs->rs_recverr; - rm->atomic.op_sg = rds_message_alloc_sgs(rm, 1, &ret); - if (!rm->atomic.op_sg) + rm->atomic.op_sg = rds_message_alloc_sgs(rm, 1); + if (IS_ERR(rm->atomic.op_sg)) { + ret = PTR_ERR(rm->atomic.op_sg); goto err; + } /* verify 8 byte-aligned */ if (args->local_addr & 0x7) { diff --git a/net/rds/rds.h b/net/rds/rds.h index 8e18cd2aec51..6019b0c004a9 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -844,8 +844,7 @@ rds_conn_connecting(struct rds_connection *conn) /* message.c */ struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp); -struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents, - int *ret); +struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents); int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from, bool zcopy); struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len); diff --git a/net/rds/send.c b/net/rds/send.c index 82dcd8b84fe7..68e2bdb08fd0 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -1274,9 +1274,11 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len) /* Attach data to the rm */ if (payload_len) { - rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs, &ret); - if (!rm->data.op_sg) + rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs); + if (IS_ERR(rm->data.op_sg)) { + ret = PTR_ERR(rm->data.op_sg); goto out; + } ret = rds_message_copy_from_user(rm, &msg->msg_iter, zcopy); if (ret) goto out; -- cgit From 672e24772aeb45293c86f6176520d98b19cd48a1 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Thu, 16 Apr 2020 00:16:30 +0100 Subject: ipv6: remove redundant assignment to variable err The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King Signed-off-by: David S. Miller --- net/ipv6/seg6.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/ipv6/seg6.c b/net/ipv6/seg6.c index 75421a472d25..4c7e0a27fa9c 100644 --- a/net/ipv6/seg6.c +++ b/net/ipv6/seg6.c @@ -434,7 +434,7 @@ static struct genl_family seg6_genl_family __ro_after_init = { int __init seg6_init(void) { - int err = -ENOMEM; + int err; err = genl_register_family(&seg6_genl_family); if (err) -- cgit From edadedf1c5b4e4404192a0a4c3c0c05e3b7672ab Mon Sep 17 00:00:00 2001 From: Tuong Lien Date: Wed, 15 Apr 2020 18:34:49 +0700 Subject: tipc: fix incorrect increasing of link window In commit 16ad3f4022bb ("tipc: introduce variable window congestion control"), we allow link window to change with the congestion avoidance algorithm. However, there is a bug that during the slow-start if packet retransmission occurs, the link will enter the fast-recovery phase, set its window to the 'ssthresh' which is never less than 300, so the link window suddenly increases to that limit instead of decreasing. Consequently, two issues have been observed: - For broadcast-link: it can leave a gap between the link queues that a new packet will be inserted and sent before the previous ones, i.e. not in-order. - For unicast: the algorithm does not work as expected, the link window jumps to the slow-start threshold whereas packet retransmission occurs. This commit fixes the issues by avoiding such the link window increase, but still decreasing if the 'ssthresh' is lowered. Fixes: 16ad3f4022bb ("tipc: introduce variable window congestion control") Acked-by: Jon Maloy Signed-off-by: Tuong Lien Signed-off-by: David S. Miller --- net/tipc/link.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net') diff --git a/net/tipc/link.c b/net/tipc/link.c index 467c53a1fb5c..d4675e922a8f 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -1065,7 +1065,7 @@ static void tipc_link_update_cwin(struct tipc_link *l, int released, /* Enter fast recovery */ if (unlikely(retransmitted)) { l->ssthresh = max_t(u16, l->window / 2, 300); - l->window = l->ssthresh; + l->window = min_t(u16, l->ssthresh, l->window); return; } /* Enter slow start */ -- cgit