path: root/net
Age  Commit message  (Author)
2020-09-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller)

Daniel Borkmann says:

====================
pull-request: bpf-next 2020-09-01

The following pull-request contains BPF updates for your *net-next* tree.

There are two small conflicts when pulling, resolve as follows:

1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120282b ("libbpf: Factor out common ELF operations and improve logging") in bpf-next and 1e891e513e16 ("libbpf: Fix map index used in error message") in net-next. Resolve by taking the hunk in bpf-next:

   [...]
   scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
   data = elf_sec_data(obj, scn);
   if (!scn || !data) {
           pr_warn("elf: failed to get %s map definitions for %s\n",
                   MAPS_ELF_SEC, obj->path);
           return -EINVAL;
   }
   [...]

2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between 9647c57b11e5 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for better performance") in bpf-next and e20f0dbf204f ("net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:

   [...]
   xdp_set_data_meta_invalid(xdp);
   xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
   net_prefetch(xdp->data);
   [...]

We've added 133 non-merge commits during the last 14 day(s) which contain a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).

The main changes are:

1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper for tracing to reliably access user memory, from Alexei Starovoitov.

2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.

3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.

4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.

5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.

6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.

7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.

8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.

9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.

10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.

11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.

12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.

13) Various smaller cleanups and improvements all over the place.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

2020-09-01  net/tls: Implement getsockopt SOL_TLS TLS_RX  (Yutaro Hayakawa)

Implement the getsockopt SOL_TLS TLS_RX, which is currently missing. The primary use case is to use it in conjunction with TCP_REPAIR to checkpoint/restore the TLS record layer state.

TLS connection state usually exists in the user-space library, so it can easily be extracted from there, but when TLS connections are delegated to kTLS that is no longer the case. We need a way to extract the TLS state from the kernel for both the TX and RX sides.

The new TLS_RX getsockopt copies the crypto_info to user space in the same way as TLS_TX does.

We have described the use cases in our research work at the Netdev 0x14 Transport Workshop [1]. Also, there is a TLS implementation called tlse [2] which supports TLS connection migration. It supports kTLS, and its code shows that it anticipates future support for this option.

[1] https://speakerdeck.com/yutarohayakawa/prism-proxies-without-the-pain
[2] https://github.com/eduardsui/tlse

Signed-off-by: Yutaro Hayakawa <yhayakawa3720@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

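As a rough illustration of the new call, this user-space sketch reads back the RX crypto state from a socket that already has kTLS RX installed; it assumes the AES-GCM-128 cipher, and dump_tls_rx_state() is a hypothetical helper name:

  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/tls.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282   /* value from the kernel headers */
  #endif

  /* Hypothetical helper: fetch the kernel's RX record-layer state. */
  int dump_tls_rx_state(int fd)
  {
          struct tls12_crypto_info_aes_gcm_128 info;
          socklen_t len = sizeof(info);

          if (getsockopt(fd, SOL_TLS, TLS_RX, &info, &len) < 0) {
                  perror("getsockopt(SOL_TLS, TLS_RX)");
                  return -1;
          }
          printf("version 0x%x, cipher %u\n",
                 info.info.version, info.info.cipher_type);
          return 0;
  }
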
2020-09-01  pktgen: fix error message with wrong function name  (Leesoo Ahn)

The error message printed when kthread_create_on_node() fails names the wrong function, kernel_thread.

Fixes: 94dcf29a11b3 ("kthread: use kthread_create_on_node()")
Signed-off-by: Leesoo Ahn <dev@ooseel.net>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-09-01net: openvswitch: remove unused keep_flowsTonghao Zhang
keep_flows was introduced by [1], which used as flag to delete flows or not. When rehashing or expanding the table instance, we will not flush the flows. Now don't use it anymore, remove it. [1] - https://github.com/openvswitch/ovs/commit/acd051f1761569205827dc9b037e15568a8d59f8 Cc: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01  net: openvswitch: refactor flow free function  (Tonghao Zhang)

Decrease table->count and ufid_count unconditionally, because flushing the flows is the only case where the counters are not used. To simplify the code, remove the "count" argument of table_instance_flow_free(). To catch future bugs when deleting flows, add a WARN_ON in the flow-flush function.

Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-09-01  net: openvswitch: improve the coding style  (Tonghao Zhang)

No change to the logic; this just improves the coding style.

Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-09-01  Bluetooth: Clear suspend tasks on unregister  (Abhishek Pandit-Subedi)

While unregistering, make sure to clear the suspend tasks before cancelling the work. If the unregister is called during resume from suspend, this would otherwise unnecessarily add 2s to the resume time.

Fixes: 4e8c36c3b0d73d ("Bluetooth: Fix suspend notifier race")
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

2020-08-31  ipvs: remove dependency on ip6_tables  (Yaroslav Bolyukin)

This dependency was added because ipv6_find_hdr() lived in iptables-specific code, but it is no longer required.

Fixes: f8f626754ebe ("ipv6: Move ipv6_find_hdr() out of Netfilter code.")
Fixes: 63dca2c0b0e7 ("ipvs: Fix faulty IPv6 extension header handling in IPVS")
Signed-off-by: Yaroslav Bolyukin <iam@lach.pw>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-31  net: ipv4: remove unused arg exact_dif in compute_score  (Miaohe Lin)

The arg exact_dif is not used anymore; remove it. inet_exact_dif_match() is no longer needed after that removal, so remove it too.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-31  net: ipv6: remove unused arg exact_dif in compute_score  (Miaohe Lin)

The arg exact_dif is not used anymore; remove it. inet6_exact_dif_match() is no longer needed after that removal, so remove it too.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-31  tipc: Remove unused macro TIPC_NACK_INTV  (YueHaibing)

There is no caller in tree any more.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-31  tipc: Remove unused macro TIPC_FWD_MSG  (YueHaibing)

There is no caller in tree any more.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-31  mptcp: Remove unused macro MPTCP_SAME_STATE  (YueHaibing)

There is no caller in tree any more.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-31  net: clean up codestyle  (Miaohe Lin)

This is a pure codestyle cleanup patch. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-31  net: Use helper macro IP_MAX_MTU in __ip_append_data()  (Miaohe Lin)

What 0xFFFF means here is actually the maximum MTU of an IP packet, so use the helper macro IP_MAX_MTU instead.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

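A sketch of what such a substitution looks like (illustrative, not necessarily the exact hunk); IP_MAX_MTU names the 0xFFFF limit that an IPv4 packet length can never exceed:

  /* before: magic number; after: named limit */
  maxnonfragsize = ip_sk_ignore_df(sk) ? IP_MAX_MTU : mtu;
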
2020-08-31  openvswitch: using ip6_fragment in ipv6_stub  (wenxu)

Use ipv6_stub->ipv6_fragment to avoid the netfilter dependency.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-31  ipv6: add ipv6_fragment hook in ipv6_stub  (wenxu)

Add ipv6_fragment to ipv6_stub to avoid calling into netfilter when accessing ip6_fragment.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

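A minimal sketch of the pattern, assuming the hook mirrors the ip6_fragment() signature (the call site and output_cb below are hypothetical):

  /* Callers outside the IPv6 core reach ip6_fragment() via the stub,
   * with no dependency on netfilter symbols. */
  int err = ipv6_stub->ipv6_fragment(net, sk, skb, output_cb);
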
2020-08-31  xsk: Add shared umem support between devices  (Magnus Karlsson)

Add support to share a umem between different devices. This mode can be invoked with the XDP_SHARED_UMEM bind flag. Previously, sharing was only supported within the same device.

Note that when sharing a umem between devices, just as when sharing a umem between queue ids, you need to create a fill ring and a completion ring and tie them to the socket (with two setsockopts, one for each ring) before you do the bind with the XDP_SHARED_UMEM flag. This is so that the single-producer single-consumer semantics of the rings can be upheld.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-13-git-send-email-magnus.karlsson@intel.com

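A user-space sketch of that flow, assuming umem_fd is the socket that originally registered the umem; bind_shared() is a hypothetical helper, and ring sizes, RX/TX ring setup, and error handling are elided:

  #include <sys/socket.h>
  #include <linux/if_xdp.h>

  #ifndef AF_XDP
  #define AF_XDP 44
  #endif
  #ifndef SOL_XDP
  #define SOL_XDP 283
  #endif

  int bind_shared(int umem_fd, int ifindex, __u32 queue_id)
  {
          int ring_size = 2048;
          int fd = socket(AF_XDP, SOCK_RAW, 0);
          struct sockaddr_xdp sxdp = {
                  .sxdp_family = AF_XDP,
                  .sxdp_ifindex = ifindex,
                  .sxdp_queue_id = queue_id,
                  .sxdp_flags = XDP_SHARED_UMEM,
                  .sxdp_shared_umem_fd = umem_fd,
          };

          /* Each socket sharing the umem needs its own fill and
           * completion ring, tied to it before bind(). */
          setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING,
                     &ring_size, sizeof(ring_size));
          setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
                     &ring_size, sizeof(ring_size));

          return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) ? -1 : fd;
  }
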
2020-08-31  xsk: Add shared umem support between queue ids  (Magnus Karlsson)

Add support to share a umem between queue ids on the same device. This mode can be invoked with the XDP_SHARED_UMEM bind flag. Previously, sharing was only supported within the same queue id and device, and you shared one set of fill and completion rings. However, note that when sharing a umem between queue ids, you need to create a fill ring and a completion ring and tie them to the socket before you do the bind with the XDP_SHARED_UMEM flag. This is so that the single-producer single-consumer semantics can be upheld.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-12-git-send-email-magnus.karlsson@intel.com

2020-08-31  xsk: Enable sharing of dma mappings  (Magnus Karlsson)

Enable the sharing of DMA mappings by moving them out from the buffer pool. Instead, we put each DMA-mapped umem region in a list in the umem structure. If DMA has already been mapped for this umem and device, it is not mapped again and the existing DMA mappings are reused.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-9-git-send-email-magnus.karlsson@intel.com

2020-08-31  xsk: Move addrs from buffer pool to umem  (Magnus Karlsson)

Replicate the addrs pointer in the buffer pool to the umem. This mapping will be the same for all buffer pools sharing the same umem. In the buffer pool we leave the addrs pointer for performance reasons.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-8-git-send-email-magnus.karlsson@intel.com

2020-08-31  xsk: Move xsk_tx_list and its lock to buffer pool  (Magnus Karlsson)

Move the xsk_tx_list and the xsk_tx_list_lock from the umem to the buffer pool. This is so that a later commit can share the umem between multiple HW queues. There is one xsk_tx_list per device and queue id, so it should be located in the buffer pool.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-7-git-send-email-magnus.karlsson@intel.com

2020-08-31  xsk: Move queue_id, dev and need_wakeup to buffer pool  (Magnus Karlsson)

Move queue_id, dev, and need_wakeup from the umem to the buffer pool. This is so that a later commit can share the umem between multiple HW queues. There is one buffer pool per dev and queue id, so these variables should belong to the buffer pool, not the umem. need_wakeup is also set at a per-napi level, so there is usually one per device and queue id; move it to the buffer pool too.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-6-git-send-email-magnus.karlsson@intel.com

2020-08-31  xsk: Move fill and completion rings to buffer pool  (Magnus Karlsson)

Move the fill and completion rings from the umem to the buffer pool. This is so that a later commit can share the umem between multiple HW queue ids. In this case, we need one fill and completion ring per queue id. As the buffer pool is per queue id and napi id, it is a natural place for them, and one umem structure can be shared between these buffer pools.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-5-git-send-email-magnus.karlsson@intel.com

2020-08-31  xsk: Create and free buffer pool independently from umem  (Magnus Karlsson)

Create and free the buffer pool independently from the umem. Move these operations, which are performed on the buffer pool, from the umem create and destroy functions to new create and destroy functions just for the buffer pool. This is so that later commits can instantiate multiple buffer pools per umem when sharing a umem between HW queues and/or devices. We also eradicate the back pointer from the umem to the buffer pool, as this will not work once there can be multiple buffer pools per umem.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-4-git-send-email-magnus.karlsson@intel.com

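Taken together, the series leaves a layout roughly like the sketch below; the struct and field names here are illustrative only, not the exact kernel definitions:

  /* Shared, reference-counted piece: one per registered memory area. */
  struct umem_sketch {
          void *addrs;                    /* address mapping, shared by all pools */
          struct list_head dma_list;      /* per-device DMA mappings */
  };

  /* Per device + queue id piece: many pools may point at one umem. */
  struct buff_pool_sketch {
          struct umem_sketch *umem;
          struct net_device *netdev;
          u16 queue_id;
          bool uses_need_wakeup;
          /* fill ring, completion ring, xsk_tx_list + lock, ... */
  };
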
2020-08-31  xsk: i40e: ice: ixgbe: mlx5: Rename xsk zero-copy driver interfaces  (Magnus Karlsson)

Rename the AF_XDP zero-copy driver interface functions to better reflect what they do after the replacement of umems with buffer pools in the previous commit. Mostly this means replacing the umem name in the function names with xsk_buff and having them take a buffer pool pointer instead of a umem. The various ring functions have also been renamed in the process so that they follow the same naming convention as the internal functions in xsk_queue.h, both for clarity and for consistency.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-3-git-send-email-magnus.karlsson@intel.com

2020-08-31  xsk: i40e: ice: ixgbe: mlx5: Pass buffer pool to driver instead of umem  (Magnus Karlsson)

Replace the explicit umem reference passed to the driver in AF_XDP zero-copy mode with the buffer pool instead. This is in preparation for extending the functionality of the zero-copy mode so that umems can be shared between queues on the same netdev and also between netdevs. In this commit, only a umem reference has been added to the buffer pool struct. Later commits will add other entities to it: entities that differ between queue ids and netdevs even though the umem is shared between them.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com

2020-08-31  netlink: policy: correct validation type check  (Johannes Berg)

In the policy export for binary attributes I erroneously used a != NLA_VALIDATE_NONE comparison instead of checking for the two possible values, which meant that if a validation function pointer ended up aliasing the min/max as negatives, we'd hit a warning in nla_get_range_unsigned().

Fix this to correctly check for only the two types that should be handled here, i.e. range with or without warn-too-long.

Reported-by: syzbot+353df1490da781637624@syzkaller.appspotmail.com
Fixes: 8aa26c575fb3 ("netlink: make NLA_BINARY validation more flexible")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

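In outline, the corrected check narrows from a catch-all to the two range types, along these lines (the surrounding code is assumed from the commit text, not quoted from the patch):

  /* Only range validation (with or without warn-too-long) carries a
   * real min/max; a validation function pointer aliases those fields. */
  if (pt->validation_type == NLA_VALIDATE_RANGE ||
      pt->validation_type == NLA_VALIDATE_RANGE_WARN_TOO_LONG)
          nla_get_range_unsigned(pt, &range);
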
2020-08-31  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf  (David S. Miller)

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Do not delete clash entries on reply, let them expire instead, from Florian Westphal.

2) Do not report EAGAIN to nfnetlink, otherwise this enters a busy loop. Update nfnetlink_unicast() to translate EAGAIN to ENOBUFS.

3) Remove repeated words in code comments, from Randy Dunlap.

4) Several patches for the flowtable selftests, from Fabian Frederick.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-30  tipc: fix using smp_processor_id() in preemptible  (Tuong Lien)

'this_cpu_ptr()' is used to obtain the AEAD key's TFM on the current CPU for encryption, but the code can be preempted because it actually runs in user-space context, so 'BUG: using smp_processor_id() in preemptible' has been observed. Fix the issue by using the 'get/put_cpu_ptr()' API, which includes a 'preempt_disable()', instead.

Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

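The pattern looks roughly like the sketch below; key->tfm_percpu is a hypothetical field name standing in for TIPC's per-CPU TFM storage:

  struct crypto_aead **tfm;

  tfm = get_cpu_ptr(key->tfm_percpu);   /* disables preemption */
  /* ... use *tfm for encryption; the CPU cannot change under us ... */
  put_cpu_ptr(key->tfm_percpu);         /* re-enables preemption */
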
2020-08-29  netfilter: nft_socket: add wildcard support  (Balazs Scheidler)

Add NFT_SOCKET_WILDCARD to match on wildcard socket listeners.

Signed-off-by: Balazs Scheidler <bazsi77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-29  netfilter: conntrack: do not auto-delete clash entries on reply  (Florian Westphal)

It's possible that we have more than one packet with the same ct tuple simultaneously, e.g. when an application emits n packets on the same UDP socket from multiple threads. NAT rules might be applied to those packets. With the right set of rules, n packets will be mapped to m destinations, where at least two packets end up with the same destination.

When this happens, the existing clash resolution may merge the skb that is processed after the first has been received with the identical tuple already in the hash table. However, it's possible that this identical tuple is a NAT_CLASH tuple. In that case the second skb will be sent, but no reply can be received, since the reply that is processed first removes the NAT_CLASH tuple.

Do not auto-delete; this gives a 1 second window for replies to be passed back to the originator. Packets that come later (UDP stream case) will not be affected: they match the original ct entry, not a NAT_CLASH one. Also prevent NAT_CLASH entries from getting offloaded.

Fixes: 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netfilter: nfnetlink: nfnetlink_unicast() reports EAGAIN instead of ENOBUFS  (Pablo Neira Ayuso)

Frontend callbacks report EAGAIN to nfnetlink to retry a command; this is used to signal that module autoloading is required. Unfortunately, nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets full, so it enters a busy loop.

This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast(), since it is always MSG_DONTWAIT in the existing code, which is exactly what nlmsg_unicast() passes to netlink_unicast() as parameter.

Fixes: 96518518cc41 ("netfilter: add nftables")
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netfilter: delete repeated words  (Randy Dunlap)

Drop duplicated words in net/netfilter/ and net/ipv4/netfilter/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netfilter: xt_HMARK: Use ip_is_fragment() helper  (YueHaibing)

Use ip_is_fragment() to simplify the code.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

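ip_is_fragment() wraps the usual frag_off test; the replaced open-coded check is along these lines (illustrative, not the exact hunk):

  const struct iphdr *iph = ip_hdr(skb);

  /* before: open-coded fragment test */
  if (iph->frag_off & htons(IP_MF | IP_OFFSET))
          return true;

  /* after: same check via the helper from include/net/ip.h */
  if (ip_is_fragment(iph))
          return true;
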
2020-08-28  netfilter: conntrack: remove unneeded nf_ct_put  (Florian Westphal)

We can delay refcount increment until we reassign the existing entry to the current skb. A 0 refcount can't happen while the nf_conn object is still in the hash table and parallel mutations are impossible because we hold the bucket lock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netfilter: conntrack: add clash resolution stat counter  (Florian Westphal)

There is a misconception about what "insert_failed" means. We increment this even when a clash got resolved, so it might not indicate a problem.

Add a dedicated counter for clash resolution and only increment insert_failed if a clash cannot be resolved. For the old /proc interface, export this in place of an older stat that got removed a while back. For ctnetlink, export this with a new attribute.

Also correct an outdated comment that implies we add a duplicate tuple -- we only add the (unique) reply direction.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netfilter: conntrack: remove ignore stats  (Florian Westphal)

This counter increments when nf_conntrack_in sees a packet that already has a conntrack attached or when the packet is marked as UNTRACKED. Neither is an error.

The former is normal for loopback traffic. The second happens for certain ICMPv6 packets or when nftables/ip(6)tables rules are in place. In case someone needs to count UNTRACKED packets, or packets that are marked as untracked before conntrack_in, this can be done with both nftables and ip(6)tables rules.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netfilter: conntrack: do not increment two error counters at same time  (Florian Westphal)

The /proc interface for nf_conntrack displays the "error" counter as "icmp_error". It makes sense not to increment "invalid" when failing to handle an icmp packet, since those are special. For example, it's possible for conntrack to see partial and/or fragmented packets inside icmp errors. This should be a separate event and not get mixed with the "invalid" counter.

Likewise, remove the "error" increment for errors from get_l4proto(). After this, the error counter will only increment for errors coming from icmp(v6) packet handling.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netfilter: nf_tables: add userdata attributes to nft_table  (Jose M. Guisado Gomez)

Enables storing userdata for nft_table. The udata field points to the user data and udlen stores its length. Adds a new attribute flag, NFTA_TABLE_USERDATA.

Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  ipvs: Fix uninit-value in do_ip_vs_set_ctl()  (Peilin Ye)

do_ip_vs_set_ctl() references an uninitialized stack value when `len` is zero. Fix it.

Reported-by: syzbot+23b5f9e7caf61d9a3898@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=46ebfb92a8a812621a001ef04d90dfa459520fe2
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netfilter: ip6t_NPT: rewrite addresses in ICMPv6 original packet  (Michael Zhou)

Detect and rewrite a prefix embedded in an ICMPv6 original packet that was rewritten by a corresponding DNPT/SNPT rule, so it will be recognised by the host that sent the original packet.

Example:

Rules in effect on the 1:2:3:4::/64 + 5:6:7:8::/64 side router:
* SNPT src-pfx 1:2:3:4::/64 dst-pfx 5:6:7:8::/64
* DNPT src-pfx 5:6:7:8::/64 dst-pfx 1:2:3:4::/64

No rules on the 9:a:b:c::/64 side.

1. 1:2:3:4::1 sends a UDP packet to 9:a:b:c::1
2. Router applies SNPT, changing src to 5:6:7:8::ffef::1
3. 9:a:b:c::1 receives the packet with (src 5:6:7:8::ffef::1 dst 9:a:b:c::1) and replies with ICMPv6 port unreachable to 5:6:7:8::ffef::1, including the original packet (src 5:6:7:8::ffef::1 dst 9:a:b:c::1)
4. Router forwards the ICMPv6 packet with (src 9:a:b:c::1 dst 5:6:7:8::ffef::1) including the original packet (src 5:6:7:8::ffef::1 dst 9:a:b:c::1), and applies DNPT changing dst to 1:2:3:4::1
5. 1:2:3:4::1 receives the ICMPv6 packet with (src 9:a:b:c::1 dst 1:2:3:4::1) including the original packet (src 5:6:7:8::ffef::1 dst 9:a:b:c::1). It doesn't recognise the original packet, as the src doesn't match anything it originally sent.

With this change, at step 4, DNPT will also rewrite the original packet src to 1:2:3:4::1, so at step 5, 1:2:3:4::1 will recognise the ICMPv6 error and provide feedback to the application properly.

Conversely, SNPT will help when ICMPv6 errors are sent from the translated network:

1. 9:a:b:c::1 sends a UDP packet to 5:6:7:8::ffef::1
2. Router applies DNPT, changing dst to 1:2:3:4::1
3. 1:2:3:4::1 receives the packet with (src 9:a:b:c::1 dst 1:2:3:4::1) and replies with ICMPv6 port unreachable to 9:a:b:c::1, including the original packet (src 9:a:b:c::1 dst 1:2:3:4::1)
4. Router forwards the ICMPv6 packet with (src 1:2:3:4::1 dst 9:a:b:c::1) including the original packet (src 9:a:b:c::1 dst 1:2:3:4::1), and applies SNPT changing src to 5:6:7:8::ffef::1
5. 9:a:b:c::1 receives the ICMPv6 packet with (src 5:6:7:8::ffef::1 dst 9:a:b:c::1) including the original packet (src 9:a:b:c::1 dst 1:2:3:4::1). It doesn't recognise the original packet, as the dst doesn't match anything it already sent.

The change to SNPT means the ICMPv6 original packet dst will be rewritten to 5:6:7:8::ffef::1 in step 4, allowing the error to be properly recognised in step 5.

Signed-off-by: Michael Zhou <mzhou@cse.unsw.edu.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2020-08-28  netlabel: remove unused param from audit_log_format()  (Alex Dewar)

Commit d3b990b7f327 ("netlabel: fix problems with mapping removal") added a check to return an error if ret_val != 0, before ret_val is later used in a log message. Now it will unconditionally print "... res=1". So just drop the check.

Addresses-Coverity: ("Dead code")
Fixes: d3b990b7f327 ("netlabel: fix problems with mapping removal")
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-28  net_sched: fix error path in red_init()  (Cong Wang)

When ->init() fails, ->destroy() is called to clean up. So it is unnecessary to clean up in red_init(), and doing so would cause a refcount underflow.

Fixes: aee9caa03fc3 ("net: sched: sch_red: Add qevents "early_drop" and "mark"")
Reported-and-tested-by: syzbot+b33c1cb0a30ebdc8a5f9@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+e5ea5f8a3ecfd4427a1c@syzkaller.appspotmail.com
Cc: Petr Machata <petrm@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-28  net: add option to not create fall-back tunnels in root-ns as well  (Mahesh Bandewar)

The sysctl added earlier by commit 79134e6ce2c ("net: do not create fallback tunnels for non-default namespaces") restricts fallback tunnel creation to the root netns. This patch extends that behavior with an option to not create fallback tunnels in the root netns either. Since the modules that create fallback tunnels can be built in, and setting the sysctl value after booting would be pointless, a kernel cmdline option is added to change this default. The default setting is preserved for backward compatibility.

The kernel command line option fb_tunnels=initns sets the sysctl value to 1 and creates fallback tunnels only in the initns, while fb_tunnels=none sets the sysctl value to 2 so fallback tunnels are skipped in every netns.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Maciej Zenczykowski <maze@google.com>
Cc: Jian Yang <jianyang@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-28  netlink: fix a data race in netlink_rcv_wake()  (zhudi)

The data race was reported by KCSAN:

  BUG: KCSAN: data-race in netlink_recvmsg / skb_queue_tail

  write (marked) to 0xffff8c0986e5a8c8 of 8 bytes by interrupt on cpu 3:
   skb_queue_tail+0xcc/0x120
   __netlink_sendskb+0x55/0x80
   netlink_broadcast_filtered+0x465/0x7e0
   nlmsg_notify+0x8f/0x120
   rtnl_notify+0x8e/0xb0
   __neigh_notify+0xf2/0x120
   neigh_update+0x927/0xde0
   arp_process+0x8a3/0xf50
   arp_rcv+0x27c/0x3b0
   __netif_receive_skb_core+0x181c/0x1840
   __netif_receive_skb+0x38/0xf0
   netif_receive_skb_internal+0x77/0x1c0
   napi_gro_receive+0x1bd/0x1f0
   e1000_clean_rx_irq+0x538/0xb20 [e1000]
   e1000_clean+0x5e4/0x1340 [e1000]
   net_rx_action+0x310/0x9d0
   __do_softirq+0xe8/0x308
   irq_exit+0x109/0x110
   do_IRQ+0x7f/0xe0
   ret_from_intr+0x0/0x1d
   0xffffffffffffffff

  read to 0xffff8c0986e5a8c8 of 8 bytes by task 1463 on cpu 0:
   netlink_recvmsg+0x40b/0x820
   sock_recvmsg+0xc9/0xd0
   ___sys_recvmsg+0x1a4/0x3b0
   __sys_recvmsg+0x86/0x120
   __x64_sys_recvmsg+0x52/0x70
   do_syscall_64+0xb5/0x360
   entry_SYSCALL_64_after_hwframe+0x65/0xca
   0xffffffffffffffff

The write is done under sk_receive_queue->lock but the read is lockless, so fix it by using skb_queue_empty_lockless() instead of skb_queue_empty() for the read in netlink_rcv_wake().

Signed-off-by: zhudi <zhudi21@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

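The resulting reader then looks roughly like this sketch (based on netlink_rcv_wake() as described, not quoted from the patch):

  static void netlink_rcv_wake(struct sock *sk)
  {
          struct netlink_sock *nlk = nlk_sk(sk);

          /* Lockless peek: skb_queue_empty_lockless() uses READ_ONCE(),
           * pairing with the locked writer in skb_queue_tail(). */
          if (skb_queue_empty_lockless(&sk->sk_receive_queue))
                  clear_bit(NETLINK_S_CONGESTED, &nlk->state);
          if (!test_bit(NETLINK_S_CONGESTED, &nlk->state))
                  wake_up_interruptible(&nlk->wait);
  }
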
2020-08-28  bpf: Relax max_entries check for most of the inner map types  (Martin KaFai Lau)

Most of the maps do not use max_entries during verification time. Thus, their map_meta_equal() does not need to enforce max_entries when a map is inserted as an inner map at runtime. The max_entries check is removed from the default implementation bpf_map_meta_equal().

The prog_array_map and xsk_map are exceptions. Their map_gen_lookup uses max_entries to generate inline lookup code, so they implement their own map_meta_equal() to enforce max_entries. Since there are only two such cases now, the max_entries check is not refactored and stays in each map's own .c file.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011813.1970516-1-kafai@fb.com

2020-08-28  bpf: Add map_meta_equal map ops  (Martin KaFai Lau)

Some properties of the inner map are used at verification time. When an inner map is inserted into an outer map at runtime, bpf_map_meta_equal() is currently used to ensure those properties of the inserted inner map stay the same as at verification time.

In particular, the current bpf_map_meta_equal() checks max_entries, which turns out to be too restrictive for most of the maps that do not use max_entries during verification. It limits the use case of replacing a smaller inner map with a larger inner map. Some maps do use max_entries during verification, though. For example, the map_gen_lookup in array_map_ops uses max_entries to generate the inline lookup code.

To accommodate differences between maps, map_meta_equal is added to bpf_map_ops. Each map type can decide what to check when its map is used as an inner map at runtime.

Also, some map types cannot be used as an inner map, and they are currently blacklisted in bpf_map_meta_alloc() in map_in_map.c. It is not unusual that new map types are unaware that such a blacklist exists. This patch enforces an explicit opt-in and only allows a map to be used as an inner map if it has implemented the map_meta_equal ops. It is based on the discussion in [1].

All maps that support inner maps have their map_meta_equal pointing to bpf_map_meta_equal in this patch. A later patch will relax the max_entries check for most maps. bpf_types.h counts 28 map types. This patch adds 23 ".map_meta_equal" entries using coccinelle. The -5 are:
  BPF_MAP_TYPE_PROG_ARRAY
  BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
  BPF_MAP_TYPE_STRUCT_OPS
  BPF_MAP_TYPE_ARRAY_OF_MAPS
  BPF_MAP_TYPE_HASH_OF_MAPS

The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc() is moved such that the same error is returned.

[1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com

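A sketch of the new ops hook as the two commits describe it (field choices shown here follow the commit text; treat the exact body as an assumption): each map type opts in by pointing .map_meta_equal at the default helper, or at its own stricter check.

  const struct bpf_map_ops array_of_maps_map_ops = {
          .map_meta_equal = bpf_map_meta_equal,
          /* ... */
  };

  /* Default check: after the follow-up patch, it no longer compares
   * max_entries; prog_array/xsk_map keep their own stricter version. */
  bool bpf_map_meta_equal(const struct bpf_map *meta0,
                          const struct bpf_map *meta1)
  {
          return meta0->map_type == meta1->map_type &&
                 meta0->key_size == meta1->key_size &&
                 meta0->value_size == meta1->value_size &&
                 meta0->map_flags == meta1->map_flags;
  }
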
2020-08-28  Merge tag 'mac80211-next-for-davem-2020-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next  (David S. Miller)

Johannes Berg says:

====================
This time we have:

* some code to support SAE (WPA3) offload in AP mode
* many documentation (wording) fixes/updates
* netlink policy updates, including the use of NLA_RANGE with binary attributes
* regulatory improvements for adjacent frequency bands
* and a few other small additions/refactorings/cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

2020-08-28  Merge tag 'mac80211-for-davem-2020-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211  (David S. Miller)

Johannes Berg says:

====================
We have:

* fixes for AQL (airtime queue limits)
* reduce packet loss detection false positives
* a small channel number fix for the 6 GHz band
* a fix for 80+80/160 MHz negotiation
* an nl80211 attribute (NL80211_ATTR_HE_6GHZ_CAPABILITY) fix
* add a missing sanity check for the regulatory code
====================

Signed-off-by: David S. Miller <davem@davemloft.net>