summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-05-25tcp: allow traceroute -Mtcp for unpriv usersEric Dumazet
Unpriv users can use traceroute over plain UDP sockets, but not TCP ones. $ traceroute -Mtcp 8.8.8.8 You do not have enough privileges to use this traceroute method. $ traceroute -n -Mudp 8.8.8.8 traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets 1 192.168.86.1 3.631 ms 3.512 ms 3.405 ms 2 10.1.10.1 4.183 ms 4.125 ms 4.072 ms 3 96.120.88.125 20.621 ms 19.462 ms 20.553 ms 4 96.110.177.65 24.271 ms 25.351 ms 25.250 ms 5 69.139.199.197 44.492 ms 43.075 ms 44.346 ms 6 68.86.143.93 27.969 ms 25.184 ms 25.092 ms 7 96.112.146.18 25.323 ms 96.112.146.22 25.583 ms 96.112.146.26 24.502 ms 8 72.14.239.204 24.405 ms 74.125.37.224 16.326 ms 17.194 ms 9 209.85.251.9 18.154 ms 209.85.247.55 14.449 ms 209.85.251.9 26.296 ms^C We can easily support traceroute over TCP, by queueing an error message into socket error queue. Note that applications need to set IP_RECVERR/IPV6_RECVERR option to enable this feature, and that the error message is only queued while in SYN_SNT state. socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3 setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0 setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0 connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host) recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}], msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}}, {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}], msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144 Suggested-by: Maciej Żenczykowski <maze@google.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25ipv4: potential underflow in compat_ip_setsockopt()Dan Carpenter
The value of "n" is capped at 0x1ffffff but it checked for negative values. I don't think this causes a problem but I'm not certain and it's harmless to prevent it. Fixes: 2e04172875c9 ("ipv4: do compat setsockopt for MCAST_MSFILTER directly") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25net/tls: fix race condition causing kernel panicVinay Kumar Yadav
tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently. // tls_sw_recvmsg() if (atomic_read(&ctx->decrypt_pending)) crypto_wait_req(-EINPROGRESS, &ctx->async_wait); else reinit_completion(&ctx->async_wait.completion); //tls_decrypt_done() pending = atomic_dec_return(&ctx->decrypt_pending); if (!pending && READ_ONCE(ctx->async_notify)) complete(&ctx->async_wait.completion); Consider the scenario tls_decrypt_done() is about to run complete() if (!pending && READ_ONCE(ctx->async_notify)) and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(), then tls_decrypt_done() runs complete(). This sequence of execution results in wrong completion. Consequently, for next decrypt request, it will not wait for completion, eventually on connection close, crypto resources freed, there is no way to handle pending decrypt response. This race condition can be avoided by having atomic_read() mutually exclusive with atomic_dec_return(),complete().Intoduced spin lock to ensure the mutual exclution. Addressed similar problem in tx direction. v1->v2: - More readable commit message. - Corrected the lock to fix new race scenario. - Removed barrier which is not needed now. Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26xsk: Add overflow check for u64 division, stored into u32Björn Töpel
The npgs member of struct xdp_umem is an u32 entity, and stores the number of pages the UMEM consumes. The calculation of npgs npgs = size / PAGE_SIZE can overflow. To avoid overflow scenarios, the division is now first stored in a u64, and the result is verified to fit into 32b. An alternative would be storing the npgs as a u64, however, this wastes memory and is an unrealisticly large packet area. Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt") Reported-by: "Minh Bùi Quang" <minhquangbui99@gmail.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/bpf/CACtPs=GGvV-_Yj6rbpzTVnopgi5nhMoCcTkSkYrJHGQHJWFZMQ@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20200525080400.13195-1-bjorn.topel@gmail.com
2020-05-25netfilter: nfnetlink_cthelper: unbreak userspace helper supportPablo Neira Ayuso
Restore helper data size initialization and fix memcopy of the helper data size. Fixes: 157ffffeb5dc ("netfilter: nfnetlink_cthelper: reject too large userspace allocation requests") Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25netfilter: conntrack: make conntrack userspace helpers work againPablo Neira Ayuso
Florian Westphal says: "Problem is that after the helper hook was merged back into the confirm one, the queueing itself occurs from the confirm hook, i.e. we queue from the last netfilter callback in the hook-list. Therefore, on return, the packet bypasses the confirm action and the connection is never committed to the main conntrack table. To fix this there are several ways: 1. revert the 'Fixes' commit and have a extra helper hook again. Works, but has the drawback of adding another indirect call for everyone. 2. Special case this: split the hooks only when userspace helper gets added, so queueing occurs at a lower priority again, and normal enqueue reinject would eventually call the last hook. 3. Extend the existing nf_queue ct update hook to allow a forced confirmation (plus run the seqadj code). This goes for 3)." Fixes: 827318feb69cb ("netfilter: conntrack: remove helper hook again") Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25netfilter: nf_conntrack_pptp: prevent buffer overflows in debug codePablo Neira Ayuso
Dan Carpenter says: "Smatch complains that the value for "cmd" comes from the network and can't be trusted." Add pptp_msg_name() helper function that checks for the array boundary. Fixes: f09943fefe6b ("[NETFILTER]: nf_conntrack/nf_nat: add PPTP helper port") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25netfilter: ipset: Fix subcounter update skipPhil Sutter
If IPSET_FLAG_SKIP_SUBCOUNTER_UPDATE is set, user requested to not update counters in sub sets. Therefore IPSET_FLAG_SKIP_COUNTER_UPDATE must be set, not unset. Fixes: 6e01781d1c80e ("netfilter: ipset: set match: add support to match the counters") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25netfilter: nft_reject_bridge: enable reject with bridge vlanMichael Braun
Currently, using the bridge reject target with tagged packets results in untagged packets being sent back. Fix this by mirroring the vlan id as well. Fixes: 85f5b3086a04 ("netfilter: bridge: add reject support") Signed-off-by: Michael Braun <michael-dev@fami-braun.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-26bpfilter: document build requirements for bpfilter_umhValdis Kl ē tnieks
It's not intuitively obvious that bpfilter_umh is a statically linked binary. Mention the toolchain requirement in the Kconfig help, so people have an easier time figuring out what's needed. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-05-25xfrm: fix a warning in xfrm_policy_insert_listXin Long
This waring can be triggered simply by: # ip xfrm policy update src 192.168.1.1/24 dst 192.168.1.2/24 dir in \ priority 1 mark 0 mask 0x10 #[1] # ip xfrm policy update src 192.168.1.1/24 dst 192.168.1.2/24 dir in \ priority 2 mark 0 mask 0x1 #[2] # ip xfrm policy update src 192.168.1.1/24 dst 192.168.1.2/24 dir in \ priority 2 mark 0 mask 0x10 #[3] Then dmesg shows: [ ] WARNING: CPU: 1 PID: 7265 at net/xfrm/xfrm_policy.c:1548 [ ] RIP: 0010:xfrm_policy_insert_list+0x2f2/0x1030 [ ] Call Trace: [ ] xfrm_policy_inexact_insert+0x85/0xe50 [ ] xfrm_policy_insert+0x4ba/0x680 [ ] xfrm_add_policy+0x246/0x4d0 [ ] xfrm_user_rcv_msg+0x331/0x5c0 [ ] netlink_rcv_skb+0x121/0x350 [ ] xfrm_netlink_rcv+0x66/0x80 [ ] netlink_unicast+0x439/0x630 [ ] netlink_sendmsg+0x714/0xbf0 [ ] sock_sendmsg+0xe2/0x110 The issue was introduced by Commit 7cb8a93968e3 ("xfrm: Allow inserting policies with matching mark and different priorities"). After that, the policies [1] and [2] would be able to be added with different priorities. However, policy [3] will actually match both [1] and [2]. Policy [1] was matched due to the 1st 'return true' in xfrm_policy_mark_match(), and policy [2] was matched due to the 2nd 'return true' in there. It caused WARN_ON() in xfrm_policy_insert_list(). This patch is to fix it by only (the same value and priority) as the same policy in xfrm_policy_mark_match(). Thanks to Yuehaibing, we could make this fix better. v1->v2: - check policy->mark.v == pol->mark.v only without mask. Fixes: 7cb8a93968e3 ("xfrm: Allow inserting policies with matching mark and different priorities") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-25cfg80211: fix debugfs rename crashJohannes Berg
Removing the "if (IS_ERR(dir)) dir = NULL;" check only works if we adjust the remaining code to not rely on it being NULL. Check IS_ERR_OR_NULL() before attempting to dereference it. I'm not actually entirely sure this fixes the syzbot crash as the kernel config indicates that they do have DEBUG_FS in the kernel, but this is what I found when looking there. Cc: stable@vger.kernel.org Fixes: d82574a8e5a4 ("cfg80211: no need to check return value of debugfs_create functions") Reported-by: syzbot+fd5332e429401bf42d18@syzkaller.appspotmail.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20200525113816.fc4da3ec3d4b.Ica63a110679819eaa9fb3bc1b7437d96b1fd187d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-25mac80211: mesh: fix discovery timer re-arming issue / crashLinus Lüssing
On a non-forwarding 802.11s link between two fairly busy neighboring nodes (iperf with -P 16 at ~850MBit/s TCP; 1733.3 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 4), so with frequent PREQ retries, usually after around 30-40 seconds the following crash would occur: [ 1110.822428] Unable to handle kernel read from unreadable memory at virtual address 00000000 [ 1110.830786] Mem abort info: [ 1110.833573] Exception class = IABT (current EL), IL = 32 bits [ 1110.839494] SET = 0, FnV = 0 [ 1110.842546] EA = 0, S1PTW = 0 [ 1110.845678] user pgtable: 4k pages, 48-bit VAs, pgd = ffff800076386000 [ 1110.852204] [0000000000000000] *pgd=00000000f6322003, *pud=00000000f62de003, *pmd=0000000000000000 [ 1110.861167] Internal error: Oops: 86000004 [#1] PREEMPT SMP [ 1110.866730] Modules linked in: pppoe ppp_async batman_adv ath10k_pci ath10k_core ath pppox ppp_generic nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 usb_storage xhci_plat_hcd xhci_pci xhci_hcd dwc3 usbcore usb_common [ 1110.932190] Process swapper/3 (pid: 0, stack limit = 0xffff0000090c8000) [ 1110.938884] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.162 #0 [ 1110.944965] Hardware name: LS1043A RGW Board (DT) [ 1110.949658] task: ffff8000787a81c0 task.stack: ffff0000090c8000 [ 1110.955568] PC is at 0x0 [ 1110.958097] LR is at call_timer_fn.isra.27+0x24/0x78 [ 1110.963055] pc : [<0000000000000000>] lr : [<ffff0000080ff29c>] pstate: 00400145 [ 1110.970440] sp : ffff00000801be10 [ 1110.973744] x29: ffff00000801be10 x28: ffff000008bf7018 [ 1110.979047] x27: ffff000008bf87c8 x26: ffff000008c160c0 [ 1110.984352] x25: 0000000000000000 x24: 0000000000000000 [ 1110.989657] x23: dead000000000200 x22: 0000000000000000 [ 1110.994959] x21: 0000000000000000 x20: 0000000000000101 [ 1111.000262] x19: ffff8000787a81c0 x18: 0000000000000000 [ 1111.005565] x17: ffff0000089167b0 x16: 0000000000000058 [ 1111.010868] x15: ffff0000089167b0 x14: 0000000000000000 [ 1111.016172] x13: ffff000008916788 x12: 0000000000000040 [ 1111.021475] x11: ffff80007fda9af0 x10: 0000000000000001 [ 1111.026777] x9 : ffff00000801bea0 x8 : 0000000000000004 [ 1111.032080] x7 : 0000000000000000 x6 : ffff80007fda9aa8 [ 1111.037383] x5 : ffff00000801bea0 x4 : 0000000000000010 [ 1111.042685] x3 : ffff00000801be98 x2 : 0000000000000614 [ 1111.047988] x1 : 0000000000000000 x0 : 0000000000000000 [ 1111.053290] Call trace: [ 1111.055728] Exception stack(0xffff00000801bcd0 to 0xffff00000801be10) [ 1111.062158] bcc0: 0000000000000000 0000000000000000 [ 1111.069978] bce0: 0000000000000614 ffff00000801be98 0000000000000010 ffff00000801bea0 [ 1111.077798] bd00: ffff80007fda9aa8 0000000000000000 0000000000000004 ffff00000801bea0 [ 1111.085618] bd20: 0000000000000001 ffff80007fda9af0 0000000000000040 ffff000008916788 [ 1111.093437] bd40: 0000000000000000 ffff0000089167b0 0000000000000058 ffff0000089167b0 [ 1111.101256] bd60: 0000000000000000 ffff8000787a81c0 0000000000000101 0000000000000000 [ 1111.109075] bd80: 0000000000000000 dead000000000200 0000000000000000 0000000000000000 [ 1111.116895] bda0: ffff000008c160c0 ffff000008bf87c8 ffff000008bf7018 ffff00000801be10 [ 1111.124715] bdc0: ffff0000080ff29c ffff00000801be10 0000000000000000 0000000000400145 [ 1111.132534] bde0: ffff8000787a81c0 ffff00000801bde8 0000ffffffffffff 000001029eb19be8 [ 1111.140353] be00: ffff00000801be10 0000000000000000 [ 1111.145220] [< (null)>] (null) [ 1111.149917] [<ffff0000080ff77c>] run_timer_softirq+0x184/0x398 [ 1111.155741] [<ffff000008081938>] __do_softirq+0x100/0x1fc [ 1111.161130] [<ffff0000080a2e28>] irq_exit+0x80/0xd8 [ 1111.166002] [<ffff0000080ea708>] __handle_domain_irq+0x88/0xb0 [ 1111.171825] [<ffff000008081678>] gic_handle_irq+0x68/0xb0 [ 1111.177213] Exception stack(0xffff0000090cbe30 to 0xffff0000090cbf70) [ 1111.183642] be20: 0000000000000020 0000000000000000 [ 1111.191461] be40: 0000000000000001 0000000000000000 00008000771af000 0000000000000000 [ 1111.199281] be60: ffff000008c95180 0000000000000000 ffff000008c19360 ffff0000090cbef0 [ 1111.207101] be80: 0000000000000810 0000000000000400 0000000000000098 ffff000000000000 [ 1111.214920] bea0: 0000000000000001 ffff0000089167b0 0000000000000000 ffff0000089167b0 [ 1111.222740] bec0: 0000000000000000 ffff000008c198e8 ffff000008bf7018 ffff000008c19000 [ 1111.230559] bee0: 0000000000000000 0000000000000000 ffff8000787a81c0 ffff000008018000 [ 1111.238380] bf00: ffff00000801c000 ffff00000913ba34 ffff8000787a81c0 ffff0000090cbf70 [ 1111.246199] bf20: ffff0000080857cc ffff0000090cbf70 ffff0000080857d0 0000000000400145 [ 1111.254020] bf40: ffff000008018000 ffff00000801c000 ffffffffffffffff ffff0000080fa574 [ 1111.261838] bf60: ffff0000090cbf70 ffff0000080857d0 [ 1111.266706] [<ffff0000080832e8>] el1_irq+0xe8/0x18c [ 1111.271576] [<ffff0000080857d0>] arch_cpu_idle+0x10/0x18 [ 1111.276880] [<ffff0000080d7de4>] do_idle+0xec/0x1b8 [ 1111.281748] [<ffff0000080d8020>] cpu_startup_entry+0x20/0x28 [ 1111.287399] [<ffff00000808f81c>] secondary_start_kernel+0x104/0x110 [ 1111.293662] Code: bad PC value [ 1111.296710] ---[ end trace 555b6ca4363c3edd ]--- [ 1111.301318] Kernel panic - not syncing: Fatal exception in interrupt [ 1111.307661] SMP: stopping secondary CPUs [ 1111.311574] Kernel Offset: disabled [ 1111.315053] CPU features: 0x0002000 [ 1111.318530] Memory Limit: none [ 1111.321575] Rebooting in 3 seconds.. With some added debug output / delays we were able to push the crash from the timer callback runner into the callback function and by that shedding some light on which object holding the timer gets corrupted: [ 401.720899] Unable to handle kernel read from unreadable memory at virtual address 00000868 [...] [ 402.335836] [<ffff0000088fafa4>] _raw_spin_lock_bh+0x14/0x48 [ 402.341548] [<ffff000000dbe684>] mesh_path_timer+0x10c/0x248 [mac80211] [ 402.348154] [<ffff0000080ff29c>] call_timer_fn.isra.27+0x24/0x78 [ 402.354150] [<ffff0000080ff77c>] run_timer_softirq+0x184/0x398 [ 402.359974] [<ffff000008081938>] __do_softirq+0x100/0x1fc [ 402.365362] [<ffff0000080a2e28>] irq_exit+0x80/0xd8 [ 402.370231] [<ffff0000080ea708>] __handle_domain_irq+0x88/0xb0 [ 402.376053] [<ffff000008081678>] gic_handle_irq+0x68/0xb0 The issue happens due to the following sequence of events: 1) mesh_path_start_discovery(): -> spin_unlock_bh(&mpath->state_lock) before mesh_path_sel_frame_tx() 2) mesh_path_free_rcu() -> del_timer_sync(&mpath->timer) [...] -> kfree_rcu(mpath) 3) mesh_path_start_discovery(): -> mod_timer(&mpath->timer, ...) [...] -> rcu_read_unlock() 4) mesh_path_free_rcu()'s kfree_rcu(): -> kfree(mpath) 5) mesh_path_timer() starts after timeout, using freed mpath object So a use-after-free issue due to a timer re-arming bug caused by an early spin-unlocking. This patch fixes this issue by re-checking if mpath is about to be free'd and if so bails out of re-arming the timer. Cc: stable@vger.kernel.org Fixes: 050ac52cbe1f ("mac80211: code for on-demand Hybrid Wireless Mesh Protocol") Cc: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Linus Lüssing <ll@simonwunderlich.de> Link: https://lore.kernel.org/r/20200522170413.14973-1-linus.luessing@c0d3.blue Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The MSCC bug fix in 'net' had to be slightly adjusted because the register accesses are done slightly differently in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix RCU warnings in ipv6 multicast router code, from Madhuparna Bhowmik. 2) Nexthop attributes aren't being checked properly because of mis-initialized iterator, from David Ahern. 3) Revert iop_idents_reserve() change as it caused performance regressions and was just working around what is really a UBSAN bug in the compiler. From Yuqi Jin. 4) Read MAC address properly from ROM in bmac driver (double iteration proceeds past end of address array), from Jeremy Kerr. 5) Add Microsoft Surface device IDs to r8152, from Marc Payne. 6) Prevent reference to freed SKB in __netif_receive_skb_core(), from Boris Sukholitko. 7) Fix ACK discard behavior in rxrpc, from David Howells. 8) Preserve flow hash across packet scrubbing in wireguard, from Jason A. Donenfeld. 9) Cap option length properly for SO_BINDTODEVICE in AX25, from Eric Dumazet. 10) Fix encryption error checking in kTLS code, from Vadim Fedorenko. 11) Missing BPF prog ref release in flow dissector, from Jakub Sitnicki. 12) dst_cache must be used with BH disabled in tipc, from Eric Dumazet. 13) Fix use after free in mlxsw driver, from Jiri Pirko. 14) Order kTLS key destruction properly in mlx5 driver, from Tariq Toukan. 15) Check devm_platform_ioremap_resource() return value properly in several drivers, from Tiezhu Yang. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (71 commits) net: smsc911x: Fix runtime PM imbalance on error net/mlx4_core: fix a memory leak bug. net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during suspend net: phy: mscc: fix initialization of the MACsec protocol mode net: stmmac: don't attach interface until resume finishes net: Fix return value about devm_platform_ioremap_resource() net/mlx5: Fix error flow in case of function_setup failure net/mlx5e: CT: Correctly get flow rule net/mlx5e: Update netdev txq on completions during closure net/mlx5: Annotate mutex destroy for root ns net/mlx5: Don't maintain a case of del_sw_func being null net/mlx5: Fix cleaning unmanaged flow tables net/mlx5: Fix memory leak in mlx5_events_init net/mlx5e: Fix inner tirs handling net/mlx5e: kTLS, Destroy key object after destroying the TIS net/mlx5e: Fix allowed tc redirect merged eswitch offload cases net/mlx5: Avoid processing commands before cmdif is ready net/mlx5: Fix a race when moving command interface to events mode net/mlx5: Add command entry handling completion rxrpc: Fix a memory leak in rxkad_verify_response() ...
2020-05-23ethtool: propagate get_coalesce return valueHeiner Kallweit
get_coalesce returns 0 or ERRNO, but the return value isn't checked. The returned coalesce data may be invalid if an ERRNO is set, therefore better check and propagate the return value. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23net: devres: provide devm_register_netdev()Bartosz Golaszewski
Provide devm_register_netdev() - a device resource managed variant of register_netdev(). This new helper will only work for net_device structs that are also already managed by devres. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23net: devres: define a separate devres structure for devm_alloc_etherdev()Bartosz Golaszewski
Not using a proxy structure to store struct net_device doesn't save anything in terms of compiled code size or memory usage but significantly decreases the readability of the code with all the pointer casting. Define struct net_device_devres and use it in devm_alloc_etherdev_mqs(). Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23net: move devres helpers into a separate source fileBartosz Golaszewski
There's currently only a single devres helper in net/ - devm variant of alloc_etherdev. Let's move it to net/devres.c with the intention of assing a second one: devm_register_netdev(). This new routine will need to know the address of the release function of devm_alloc_etherdev() so that it can verify (using devres_find()) that the struct net_device that's being passed to it is also resource managed. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23net: psample: fix build error when CONFIG_INET is not enabledRandy Dunlap
Fix psample build error when CONFIG_INET is not set/enabled by bracketing the tunnel code in #ifdef CONFIG_NET / #endif. ../net/psample/psample.c: In function ‘__psample_ip_tun_to_nlattr’: ../net/psample/psample.c:216:25: error: implicit declaration of function ‘ip_tunnel_info_opts’; did you mean ‘ip_tunnel_info_opts_set’? [-Werror=implicit-function-declaration] Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Yotam Gigi <yotam.gi@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-05-23 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 8 day(s) which contain a total of 109 files changed, 2776 insertions(+), 2887 deletions(-). The main changes are: 1) Add a new AF_XDP buffer allocation API to the core in order to help lowering the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe as well as mlx5 have been moved over to the new API and also gained a small improvement in performance, from Björn Töpel and Magnus Karlsson. 2) Add getpeername()/getsockname() attach types for BPF sock_addr programs in order to allow for e.g. reverse translation of load-balancer backend to service address/port tuple from a connected peer, from Daniel Borkmann. 3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers being non-NULL, e.g. if after an initial test another non-NULL test on that pointer follows in a given path, then it can be pruned right away, from John Fastabend. 4) Larger rework of BPF sockmap selftests to make output easier to understand and to reduce overall runtime as well as adding new BPF kTLS selftests that run in combination with sockmap, also from John Fastabend. 5) Batch of misc updates to BPF selftests including fixing up test_align to match verifier output again and moving it under test_progs, allowing bpf_iter selftest to compile on machines with older vmlinux.h, and updating config options for lirc and v6 segment routing helpers, from Stanislav Fomichev, Andrii Nakryiko and Alan Maguire. 6) Conversion of BPF tracing samples outdated internal BPF loader to use libbpf API instead, from Daniel T. Lee. 7) Follow-up to BPF kernel test infrastructure in order to fix a flake in the XDP selftests, from Jesper Dangaard Brouer. 8) Minor improvements to libbpf's internal hashmap implementation, from Ian Rogers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23rxrpc: Fix a memory leak in rxkad_verify_response()Qiushi Wu
A ticket was not released after a call of the function "rxkad_decrypt_ticket" failed. Thus replace the jump target "temporary_error_free_resp" by "temporary_error_free_ticket". Fixes: 8c2f826dc3631 ("rxrpc: Don't put crypto buffers on the stack") Signed-off-by: Qiushi Wu <wu000273@umn.edu> Signed-off-by: David Howells <dhowells@redhat.com> cc: Markus Elfring <Markus.Elfring@web.de>
2020-05-22bridge: mrp: Restore port state when deleting MRP instanceHoratiu Vultur
When a MRP instance is deleted, then restore the port according to the bridge state. If the bridge is up then the ports will be in forwarding state otherwise will be in disabled state. Fixes: 9a9f26e8f7ea ("bridge: mrp: Connect MRP API with the switchdev API") Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22bridge: mrp: Add br_mrp_unique_ifindex functionHoratiu Vultur
It is not allow to have the same net bridge port part of multiple MRP rings. Therefore add a check if the port is used already in a different MRP. In that case return failure. Fixes: 9a9f26e8f7ea ("bridge: mrp: Connect MRP API with the switchdev API") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22Merge tag 'rxrpc-fixes-20200520' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fix retransmission timeout and ACK discard Here are a couple of fixes and an extra tracepoint for AF_RXRPC: (1) Calculate the RTO pretty much as TCP does, rather than making something up, including an initial 4s timeout (which causes return probes from the fileserver to fail if a packet goes missing), and add backoff. (2) Fix the discarding of out-of-order received ACKs. We mustn't let the hard-ACK point regress, nor do we want to do unnecessary retransmission because the soft-ACK list regresses. This is not trivial, however, due to some loose wording in various old protocol specs, the ACK field that should be used for this sometimes has the wrong information in it. (3) Add a tracepoint to log a discarded ACK. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22net: flow_offload: simplify hw stats check handlingEdward Cree
Make FLOW_ACTION_HW_STATS_DONT_CARE be all bits, rather than none, so that drivers and __flow_action_hw_stats_check can use simple bitwise checks. Pre-fill all actions with DONT_CARE in flow_rule_alloc(), rather than relying on implicit semantics of zero from kzalloc, so that callers which don't configure action stats themselves (i.e. netfilter) get the correct behaviour by default. Only the kernel's internal API semantics change; the TC uAPI is unaffected. v4: move DONT_CARE setting to flow_rule_alloc() for robustness and simplicity. v3: set DONT_CARE in nft and ct offload. v2: rebased on net-next, removed RFC tags. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22mpls: Add support for IPv6 tunnelsVadim Fedorenko
Add support for IPv6 tunnel devices in AF_MPLS. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22ip6_tunnel: add generic MPLS receive supportVadim Fedorenko
Add support for MPLS in receive side. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22tunnel6: support for IPPROTO_MPLSVadim Fedorenko
This patch is just preparation for MPLS support in ip6_tunnel Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22ip6_tunnel: add MPLS transmit supportVadim Fedorenko
Add ETH_P_MPLS_UC as supported protocol. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22ip6_tunnel: simplify transmit pathVadim Fedorenko
Merge ip{4,6}ip6_tnl_xmit functions into one universal ipxip6_tnl_xmit in preparation for adding MPLS support. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22sctp: Start shutdown on association restart if in SHUTDOWN-SENT state and ↵Jere Leppänen
socket is closed Commit bdf6fa52f01b ("sctp: handle association restarts when the socket is closed.") starts shutdown when an association is restarted, if in SHUTDOWN-PENDING state and the socket is closed. However, the rationale stated in that commit applies also when in SHUTDOWN-SENT state - we don't want to move an association to ESTABLISHED state when the socket has been closed, because that results in an association that is unreachable from user space. The problem scenario: 1. Client crashes and/or restarts. 2. Server (using one-to-one socket) calls close(). SHUTDOWN is lost. 3. Client reconnects using the same addresses and ports. 4. Server's association is restarted. The association and the socket move to ESTABLISHED state, even though the server process has closed its descriptor. Also, after step 4 when the server process exits, some resources are leaked in an attempt to release the underlying inet sock structure in ESTABLISHED state: IPv4: Attempt to release TCP socket in state 1 00000000377288c7 Fix by acting the same way as in SHUTDOWN-PENDING state. That is, if an association is restarted in SHUTDOWN-SENT state and the socket is closed, then start shutdown and don't move the association or the socket to ESTABLISHED state. Fixes: bdf6fa52f01b ("sctp: handle association restarts when the socket is closed.") Signed-off-by: Jere Leppänen <jere.leppanen@nokia.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22tipc: block BH before using dst_cacheEric Dumazet
dst_cache_get() documents it must be used with BH disabled. sysbot reported : BUG: using smp_processor_id() in preemptible [00000000] code: /21697 caller is dst_cache_get+0x3a/0xb0 net/core/dst_cache.c:68 CPU: 0 PID: 21697 Comm: Not tainted 5.7.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 check_preemption_disabled lib/smp_processor_id.c:47 [inline] debug_smp_processor_id.cold+0x88/0x9b lib/smp_processor_id.c:57 dst_cache_get+0x3a/0xb0 net/core/dst_cache.c:68 tipc_udp_xmit.isra.0+0xb9/0xad0 net/tipc/udp_media.c:164 tipc_udp_send_msg+0x3e6/0x490 net/tipc/udp_media.c:244 tipc_bearer_xmit_skb+0x1de/0x3f0 net/tipc/bearer.c:526 tipc_enable_bearer+0xb2f/0xd60 net/tipc/bearer.c:331 __tipc_nl_bearer_enable+0x2bf/0x390 net/tipc/bearer.c:995 tipc_nl_bearer_enable+0x1e/0x30 net/tipc/bearer.c:1003 genl_family_rcv_msg_doit net/netlink/genetlink.c:673 [inline] genl_family_rcv_msg net/netlink/genetlink.c:718 [inline] genl_rcv_msg+0x627/0xdf0 net/netlink/genetlink.c:735 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469 genl_rcv+0x24/0x40 net/netlink/genetlink.c:746 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362 ___sys_sendmsg+0x100/0x170 net/socket.c:2416 __sys_sendmsg+0xec/0x1b0 net/socket.c:2449 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x45ca29 Fixes: e9c1a793210f ("tipc: add dst_cache support for udp media") Cc: Xin Long <lucien.xin@gmail.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2020-05-22 The following pull-request contains BPF updates for your *net* tree. We've added 3 non-merge commits during the last 3 day(s) which contain a total of 5 files changed, 69 insertions(+), 11 deletions(-). The main changes are: 1) Fix to reject mmap()'ing read-only array maps as writable since BPF verifier relies on such map content to be frozen, from Andrii Nakryiko. 2) Fix breaking audit from secid_to_secctx() LSM hook by avoiding to use call_int_hook() since this hook is not stackable, from KP Singh. 3) Fix BPF flow dissector program ref leak on netns cleanup, from Jakub Sitnicki. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22mptcp: use untruncated hash in ADD_ADDR HMACTodd Malsbary
There is some ambiguity in the RFC as to whether the ADD_ADDR HMAC is the rightmost 64 bits of the entire hash or of the leftmost 160 bits of the hash. The intention, as clarified with the author of the RFC, is the entire hash. This change returns the entire hash from mptcp_crypto_hmac_sha (instead of only the first 160 bits), and moves any truncation/selection operation on the hash to the caller. Fixes: 12555a2d97e5 ("mptcp: use rightmost 64 bits in ADD_ADDR HMAC") Reviewed-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Todd Malsbary <todd.malsbary@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22nexthop: add support for notifiersRoopa Prabhu
This patch adds nexthop add/del notifiers. To be used by vxlan driver in a later patch. Could possibly be used by switchdev drivers in the future. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22vxlan: ecmp support for mac fdb entriesRoopa Prabhu
Todays vxlan mac fdb entries can point to multiple remote ips (rdsts) with the sole purpose of replicating broadcast-multicast and unknown unicast packets to those remote ips. E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be load balanced to remote switches (vteps) belonging to the same multi-homed ethernet segment (E-VPN multihoming is analogous to multi-homed LAG implementations, but with the inter-switch peerlink replaced with a vxlan tunnel). In other words it needs support for mac ecmp. Furthermore, for faster convergence, E-VPN multihoming needs the ability to update fdb ecmp nexthops independent of the fdb entries. New route nexthop API is perfect for this usecase. This patch extends the vxlan fdb code to take a nexthop id pointing to an ecmp nexthop group. Changes include: - New NDA_NH_ID attribute for fdbs - Use the newly added fdb nexthop groups - makes vxlan rdsts and nexthop handling code mutually exclusive - since this is a new use-case and the requirement is for ecmp nexthop groups, the fdb add and update path checks that the nexthop is really an ecmp nexthop group. This check can be relaxed in the future, if we want to introduce replication fdb nexthop groups and allow its use in lieu of current rdst lists. - fdb update requests with nexthop id's only allowed for existing fdb's that have nexthop id's - learning will not override an existing fdb entry with nexthop group - I have wrapped the switchdev offload code around the presence of rdst [1] E-VPN RFC https://tools.ietf.org/html/rfc7432 [2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365 [3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf Includes a null check fix in vxlan_xmit from Nikolay v2 - Fixed build issue: Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22nexthop: support for fdb ecmp nexthopsRoopa Prabhu
This patch introduces ecmp nexthops and nexthop groups for mac fdb entries. In subsequent patches this is used by the vxlan driver fdb entries. The use case is E-VPN multihoming [1,2,3] which requires bridged vxlan traffic to be load balanced to remote switches (vteps) belonging to the same multi-homed ethernet segment (This is analogous to a multi-homed LAG but over vxlan). Changes include new nexthop flag NHA_FDB for nexthops referenced by fdb entries. These nexthops only have ip. This patch includes appropriate checks to avoid routes referencing such nexthops. example: $ip nexthop add id 12 via 172.16.1.2 fdb $ip nexthop add id 13 via 172.16.1.3 fdb $ip nexthop add id 102 group 12/13 fdb $bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self [1] E-VPN https://tools.ietf.org/html/rfc7432 [2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365 [3] LPC talk with mention of nexthop groups for L2 ecmp http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf v4 - fixed uninitialized variable reported by kernel test robot Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22batman-adv: use rcu_replace_pointer() where appropriateAntonio Quartulli
In commit a63fc6b75cca ("rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()") a new helper macro named rcu_replace_pointer() was introduced to simplify code requiring to switch an rcu pointer to a new value while extracting the old one. Use rcu_replace_pointer() where appropriate to make code slimer. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-05-22batman-adv: Revert "Drop lockdep.h include for soft-interface.c"Sven Eckelmann
The commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key changes") reverts the commit ab92d68fc22f ("net: core: add generic lockdep keys"). But it forgot to also revert the commit 5759af0682b3 ("batman-adv: Drop lockdep.h include for soft-interface.c") which depends on the latter. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-05-21flow_dissector: Drop BPF flow dissector prog ref on netns cleanupJakub Sitnicki
When attaching a flow dissector program to a network namespace with bpf(BPF_PROG_ATTACH, ...) we grab a reference to bpf_prog. If netns gets destroyed while a flow dissector is still attached, and there are no other references to the prog, we leak the reference and the program remains loaded. Leak can be reproduced by running flow dissector tests from selftests/bpf: # bpftool prog list # ./test_flow_dissector.sh ... selftests: test_flow_dissector [PASS] # bpftool prog list 4: flow_dissector name _dissect tag e314084d332a5338 gpl loaded_at 2020-05-20T18:50:53+0200 uid 0 xlated 552B jited 355B memlock 4096B map_ids 3,4 btf_id 4 # Fix it by detaching the flow dissector program when netns is going away. Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20200521083435.560256-1-jakub@cloudflare.com
2020-05-21xsk: Explicitly inline functions and move definitionsBjörn Töpel
In order to reduce the number of function calls, the struct xsk_buff_pool definition is moved to xsk_buff_pool.h. The functions xp_get_dma(), xp_dma_sync_for_cpu(), xp_dma_sync_for_device(), xp_validate_desc() and various helper functions are explicitly inlined. Further, move xp_get_handle() and xp_release() to xsk.c, to allow for the compiler to perform inlining. rfc->v1: Make sure xp_validate_desc() is inlined for Tx perf. (Maxim) Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-15-bjorn.topel@gmail.com
2020-05-21xdp: Simplify xdp_return_{frame, frame_rx_napi, buff}Björn Töpel
The xdp_return_{frame,frame_rx_napi,buff} function are never used, except in xdp_convert_zc_to_xdp_frame(), by the MEM_TYPE_XSK_BUFF_POOL memory type. To simplify and reduce code, change so that xdp_convert_zc_to_xdp_frame() calls xsk_buff_free() directly since the type is know, and remove MEM_TYPE_XSK_BUFF_POOL from the switch statement in __xdp_return() function. Suggested-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-14-bjorn.topel@gmail.com
2020-05-21xsk: Remove MEM_TYPE_ZERO_COPY and corresponding codeBjörn Töpel
There are no users of MEM_TYPE_ZERO_COPY. Remove all corresponding code, including the "handle" member of struct xdp_buff. rfc->v1: Fixed spelling in commit message. (Björn) Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-13-bjorn.topel@gmail.com
2020-05-21xsk: Introduce AF_XDP buffer allocation APIBjörn Töpel
In order to simplify AF_XDP zero-copy enablement for NIC driver developers, a new AF_XDP buffer allocation API is added. The implementation is based on a single core (single producer/consumer) buffer pool for the AF_XDP UMEM. A buffer is allocated using the xsk_buff_alloc() function, and returned using xsk_buff_free(). If a buffer is disassociated with the pool, e.g. when a buffer is passed to an AF_XDP socket, a buffer is said to be released. Currently, the release function is only used by the AF_XDP internals and not visible to the driver. Drivers using this API should register the XDP memory model with the new MEM_TYPE_XSK_BUFF_POOL type. The API is defined in net/xdp_sock_drv.h. The buffer type is struct xdp_buff, and follows the lifetime of regular xdp_buffs, i.e. the lifetime of an xdp_buff is restricted to a NAPI context. In other words, the API is not replacing xdp_frames. In addition to introducing the API and implementations, the AF_XDP core is migrated to use the new APIs. rfc->v1: Fixed build errors/warnings for m68k and riscv. (kbuild test robot) Added headroom/chunk size getter. (Maxim/Björn) v1->v2: Swapped SoBs. (Maxim) v2->v3: Initialize struct xdp_buff member frame_sz. (Björn) Add API to query the DMA address of a frame. (Maxim) Do DMA sync for CPU till the end of the frame to handle possible growth (frame_sz). (Maxim) Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-6-bjorn.topel@gmail.com
2020-05-21xsk: Move defines only used by AF_XDP internals to xsk.hBjörn Töpel
Move the XSK_NEXT_PG_CONTIG_{MASK,SHIFT}, and XDP_UMEM_USES_NEED_WAKEUP defines from xdp_sock.h to the AF_XDP internal xsk.h file. Also, start using the BIT{,_ULL} macro instead of explicit shifts. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-5-bjorn.topel@gmail.com
2020-05-21xsk: Move driver interface to xdp_sock_drv.hMagnus Karlsson
Move the AF_XDP zero-copy driver interface to its own include file called xdp_sock_drv.h. This, hopefully, will make it more clear for NIC driver implementors to know what functions to use for zero-copy support. v4->v5: Fix -Wmissing-prototypes by include header file. (Jakub) Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-4-bjorn.topel@gmail.com
2020-05-21xsk: Move xskmap.c to net/xdp/Björn Töpel
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related code. Also, move AF_XDP struct definitions, and function declarations only used by AF_XDP internals into net/xdp/xsk.h. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
2020-05-21net: don't return invalid table id error when we fall back to PF_UNSPECSabrina Dubroca
In case we can't find a ->dumpit callback for the requested (family,type) pair, we fall back to (PF_UNSPEC,type). In effect, we're in the same situation as if userspace had requested a PF_UNSPEC dump. For RTM_GETROUTE, that handler is rtnl_dump_all, which calls all the registered RTM_GETROUTE handlers. The requested table id may or may not exist for all of those families. commit ae677bbb4441 ("net: Don't return invalid table id error when dumping all families") fixed the problem when userspace explicitly requests a PF_UNSPEC dump, but missed the fallback case. For example, when we pass ipv6.disable=1 to a kernel with CONFIG_IP_MROUTE=y and CONFIG_IP_MROUTE_MULTIPLE_TABLES=y, the (PF_INET6, RTM_GETROUTE) handler isn't registered, so we end up in rtnl_dump_all, and listing IPv6 routes will unexpectedly print: # ip -6 r Error: ipv4: MR table does not exist. Dump terminated commit ae677bbb4441 introduced the dump_all_families variable, which gets set when userspace requests a PF_UNSPEC dump. However, we can't simply set the family to PF_UNSPEC in rtnetlink_rcv_msg in the fallback case to get dump_all_families == true, because some messages types (for example RTM_GETRULE and RTM_GETNEIGH) only register the PF_UNSPEC handler and use the family to filter in the kernel what is dumped to userspace. We would then export more entries, that userspace would have to filter. iproute does that, but other programs may not. Instead, this patch removes dump_all_families and updates the RTM_GETROUTE handlers to check if the family that is being dumped is their own. When it's not, which covers both the intentional PF_UNSPEC dumps (as dump_all_families did) and the fallback case, ignore the missing table id error. Fixes: cb167893f41e ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21net: ipip: fix wrong address family in init error pathVadim Fedorenko
In case of error with MPLS support the code is misusing AF_INET instead of AF_MPLS. Fixes: 1b69e7e6c4da ("ipip: support MPLS over IPv4") Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>