Age  Commit message  Author
2020-05-17  rds: convert get_user_pages() --> pin_user_pages()  John Hubbard
This code was using get_user_pages_fast(), in a "Case 2" scenario (DMA/RDMA), using the categorization from [1]. That means that it's time to convert the get_user_pages_fast() + put_page() calls to pin_user_pages_fast() + unpin_user_pages() calls. There is some helpful background in [2]: basically, this is a small part of fixing a long-standing disconnect between pinning pages, and file systems' use of those pages. [1] Documentation/core-api/pin_user_pages.rst [2] "Explicit pinning of user-space pages": https://lwn.net/Articles/807108/ Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: rds-devel@oss.oracle.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17  net: allow __skb_ext_alloc to sleep  Florian Westphal
mptcp calls this from the transmit side, from process context. Allow a sleeping allocation instead of unconditional GFP_ATOMIC. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17  mptcp: remove inner wait loop from mptcp_sendmsg_frag  Florian Westphal
previous patches made sure we only call into this function when these prerequisites are met, so no need to wait on the subflow socket anymore. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/7 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17  mptcp: fill skb page frag cache outside of mptcp_sendmsg_frag  Florian Westphal
The mptcp_sendmsg_frag helper contains a loop that will wait on the subflow sk.

It seems preferable to only wait in mptcp_sendmsg() when blocking io is requested. mptcp_sendmsg already has such a wait loop that is used when no subflow socket is available for transmission.

This is another preparation patch that makes sure we call mptcp_sendmsg_frag only if the page frag cache has been refilled.

A follow-up patch will remove the wait loop from mptcp_sendmsg_frag().

The retransmit worker doesn't need to do this refill as it won't transmit new mptcp-level data.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17  mptcp: fill skb extension cache outside of mptcp_sendmsg_frag  Florian Westphal
The mptcp_sendmsg_frag helper contains a loop that will wait on the subflow sk.

It seems preferable to only wait in mptcp_sendmsg() when blocking io is requested. mptcp_sendmsg already has such a wait loop that is used when no subflow socket is available for transmission.

This is a preparation patch that makes sure we call mptcp_sendmsg_frag only if an skb extension has been allocated.

Moreover, such allocation currently uses GFP_ATOMIC while it could use a sleeping allocation instead.

Follow-up patches will remove the wait loop from mptcp_sendmsg_frag() and will allow a sleeping allocation for the extension.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17  mptcp: avoid blocking in tcp_sendpages  Florian Westphal
The transmit loop continues to xmit new data until an error is returned or all data was transmitted. For the blocking i/o case, this means that tcp_sendpages() may block on the subflow until more space becomes available, i.e. we end up sleeping with the mptcp socket lock held. Instead we should check if a different subflow is ready to be used. This restarts the subflow sk lookup when the tx operation succeeded and the tcp subflow can't accept more data or if tcp_sendpages indicates -EAGAIN on a blocking mptcp socket. In that case we also need to set the NOSPACE bit to make sure we get notified once memory becomes available. In case all subflows are busy, the existing logic will wait until a subflow is ready, releasing the mptcp socket lock while doing so. The mptcp worker already sets DONTWAIT, so no need to make changes there. v2: * set NOSPACE bit * add a comment to clarify that mptcp-sk sndbuf limits need to be checked as well. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17  mptcp: break and restart in case mptcp sndbuf is full  Florian Westphal
It's not enough to check for available tcp send space.

We also hold on to transmitted data for mptcp-level retransmits. Right now we will send more and more data if the peer can ack data at the tcp level fast enough, since that frees up tcp send buffer space.

But we also need to check that data was acked and reclaimed at the mptcp level.

Therefore add the needed check in mptcp_sendmsg, flush tcp data and wait until more mptcp snd space becomes available if we are over the limit. Before we wait for more data, also make sure we start the retransmit timer if we ran out of sndbuf space.

Otherwise there is a very small chance that we wait forever:

 * receiver is waiting for data
 * sender is blocked because mptcp socket buffer is full
 * at tcp level, all data was acked
 * mptcp-level snd_una was not updated, because the last ack that acknowledged the last data packet carried an older MPTCP-ack.

Restarting the retransmit timer avoids this problem: if the TCP subflow is idle, data is retransmitted from the RTX queue.

New data will make the peer send a new, updated MPTCP-Ack.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17  mptcp: move common nospace-pattern to a helper  Florian Westphal
Paolo noticed that ssk_check_wmem() has same pattern, so add/use common helper for both places. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17  net: revert "net: get rid of an signed integer overflow in ip_idents_reserve()"  Yuqi Jin
Commit adb03115f459 ("net: get rid of an signed integer overflow in ip_idents_reserve()") used atomic_cmpxchg to replace "atomic_add_return" inside the function "ip_idents_reserve". The reason was to avoid a UBSAN warning. However, this change has caused a performance degradation, and in GCC-8, -fno-strict-overflow is now mapped to -fwrapv -fwrapv-pointer and signed integer overflow is now undefined by default at all optimization levels [1]. Moreover, the warning was due to a bug in UBSAN vs. -fwrapv/-fno-strict-overflow, so let's revert it safely.

[1] https://gcc.gnu.org/gcc-8/changes.html

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jiong Wang <jiongwang@huawei.com>
Signed-off-by: Yuqi Jin <jinyuqi@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
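The difference between the two reservation strategies named in this message can be sketched in user space. This is only an illustration using C11 <stdatomic.h>, not the kernel's atomic_t API; the function names reserve_add() and reserve_cmpxchg() are hypothetical, and unsigned arithmetic is used so that wraparound is well defined in portable C.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One atomic read-modify-write: advance the counter by segs + delta and
 * return the first ID of the block that was just reserved. */
uint32_t reserve_add(_Atomic uint32_t *id, uint32_t segs, uint32_t delta)
{
    uint32_t old = atomic_fetch_add(id, segs + delta);

    return old + delta;
}

/* Compare-and-exchange loop: same result, but it may have to retry
 * whenever another thread updates the counter in between. */
uint32_t reserve_cmpxchg(_Atomic uint32_t *id, uint32_t segs, uint32_t delta)
{
    uint32_t old = atomic_load(id);
    uint32_t next;

    do {
        next = old + delta + segs;
        /* On failure, old is refreshed with the current counter value. */
    } while (!atomic_compare_exchange_weak(id, &old, next));

    return next - segs;
}
```

Both helpers return the same ID for the same starting state; the retry loop in the second one is the likely source of extra cost under contention.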
2020-05-17  nexthop: Fix attribute checking for groups  David Ahern
For nexthop groups, attributes after NHA_GROUP_TYPE are invalid, but nh_check_attr_group starts checking at NHA_GROUP. The group type defaults to multipath and the NHA_GROUP_TYPE is currently optional so this has slipped through so far. Fix the attribute checking to handle support of new group types. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: ASSOGBA Emery <assogba.emery@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
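A rough sketch of the check described here: scan only the attributes that come after the group type and reject any that are set. The enum below is an illustrative subset with assumed names and ordering, not the UAPI definition, and check_group_attrs() is a hypothetical stand-in for the kernel function.

```c
#include <stddef.h>

/* Illustrative subset of attribute types; names and order are assumptions. */
enum attr {
    ATTR_UNSPEC,
    ATTR_ID,
    ATTR_GROUP,
    ATTR_GROUP_TYPE,
    ATTR_BLACKHOLE,
    ATTR_OIF,
    ATTR_COUNT
};

/* tb[i] is non-NULL when attribute i was supplied by the caller.
 * Returns 0 if the attribute set is acceptable for a group, -1 otherwise. */
int check_group_attrs(const void *const tb[ATTR_COUNT])
{
    if (!tb[ATTR_GROUP])
        return -1;

    /* Start after the group type, so that the type itself is accepted. */
    for (int i = ATTR_GROUP_TYPE + 1; i < ATTR_COUNT; i++) {
        if (tb[i])
            return -1;
    }
    return 0;
}
```

Starting the loop one slot earlier would reject a request that sets the group type explicitly, which is the situation the fix addresses.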
2020-05-17  bpfilter: use 'userprogs' syntax to build bpfilter_umh  Masahiro Yamada
The user mode helper should be compiled for the same architecture as the kernel. This Makefile reused the 'hostprogs' syntax by overriding HOSTCC with CC. Use the new syntax 'userprogs' to fix the Makefile mess. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Sam Ravnborg <sam@ravnborg.org>
2020-05-17  bpfilter: check if $(CC) can link static libc in Kconfig  Masahiro Yamada
On Fedora, linking static glibc requires the glibc-static RPM package, which is not part of the glibc-devel package.

CONFIG_CC_CAN_LINK does not check the capability of static linking, so you can enable CONFIG_BPFILTER_UMH, then fail to build:

    HOSTLD  net/bpfilter/bpfilter_umh
    /usr/bin/ld: cannot find -lc
    collect2: error: ld returned 1 exit status

Add CONFIG_CC_CAN_LINK_STATIC, and make CONFIG_BPFILTER_UMH depend on it.

Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2020-05-17  bpfilter: match bit size of bpfilter_umh to that of the kernel  Masahiro Yamada
bpfilter_umh is built for the default machine bit of the compiler, which may not match to the bit size of the kernel. This happens in the scenario below: You can use biarch GCC that defaults to 64-bit for building the 32-bit kernel. In this case, Kbuild passes -m32 to teach the compiler to produce 32-bit kernel space objects. However, it is missing when building bpfilter_umh. It is built as a 64-bit ELF, and then embedded into the 32-bit kernel. The 32-bit kernel and 64-bit umh is a bad combination. In theory, we can have 32-bit umh running on 64-bit kernel, but we do not have a good reason to support such a usecase. The best is to match the bit size between them. Pass -m32 or -m64 to the umh build command if it is found in $(KBUILD_CFLAGS). Evaluate CC_CAN_LINK against the kernel bit-size. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-05-16  ethtool: don't call set_channels in drivers if config didn't change  Jakub Kicinski
Don't call drivers if nothing changed. Netlink code already contains this logic. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16  ethtool: check if there is at least one channel for TX/RX in the core  Jakub Kicinski
Having a channel config with no ability to RX or TX traffic is clearly wrong. Check for this in the core so the drivers don't have to. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
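A minimal sketch of such a core-level sanity check, assuming a simplified channel description. The struct and function names are hypothetical; combined channels are treated as serving both directions.

```c
#include <stdbool.h>

/* Simplified, hypothetical view of a requested channel configuration. */
struct channel_cfg {
    unsigned int rx;        /* RX-only channels */
    unsigned int tx;        /* TX-only channels */
    unsigned int combined;  /* channels that handle both RX and TX */
};

/* A config is usable only if at least one channel can receive and at
 * least one channel can transmit. */
bool channel_cfg_valid(const struct channel_cfg *c)
{
    bool can_rx = c->combined || c->rx;
    bool can_tx = c->combined || c->tx;

    return can_rx && can_tx;
}
```

For example, a request with only RX channels and no combined channels would be rejected before any driver callback runs.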
2020-05-16  mptcp: Use 32-bit DATA_ACK when possible  Christoph Paasch
RFC 8684 allows sending 32-bit DATA_ACKs as long as the peer is not sending 64-bit data-sequence numbers. The 64-bit DSN is only there for extreme scenarios when a very high throughput subflow is combined with a long-RTT subflow such that the high-throughput subflow wraps around the 32-bit sequence number space within an RTT of the high-RTT subflow.

It is thus a rare scenario and we should try to use the 32-bit DATA_ACK instead as long as possible. This reduces the TCP-option overhead by 4 bytes and thus makes space for an additional SACK block. It also makes tcpdumps much easier to read when the DSN and DATA_ACK are both either 32 or 64-bit.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
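To make the size trade-off concrete, here is a small sketch of how a sender could pick the DATA_ACK width and how a receiver could widen a 32-bit value again. Both function names are hypothetical, and the expansion assumes acknowledgements never move backwards, which is a simplification rather than the kernel's exact logic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes used by the DATA_ACK field: 8 only if the peer uses 64-bit DSNs. */
size_t data_ack_len(bool peer_uses_dsn64)
{
    return peer_uses_dsn64 ? 8 : 4;
}

/* Rebuild a 64-bit ack from its low 32 bits and the last known 64-bit ack.
 * Assumption: the new ack is never older than last_ack64. */
uint64_t expand_ack32(uint64_t last_ack64, uint32_t ack32)
{
    uint64_t ack = (last_ack64 & 0xFFFFFFFF00000000ULL) | ack32;

    /* Low half went backwards, so the 32-bit space wrapped once. */
    if (ack32 < (uint32_t)last_ack64)
        ack += 1ULL << 32;
    return ack;
}
```

The 4-byte difference returned by data_ack_len() is the option space the message refers to.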
2020-05-16  net: dsa: mt7530: fix roaming from DSA user ports  DENG Qingfang
When a client moves from a DSA user port to a software port in a bridge, it cannot reach any other clients that connected to the DSA user ports. That is because SA learning on the CPU port is disabled, so the switch ignores the client's frames from the CPU port and still thinks it is at the user port. Fix it by enabling SA learning on the CPU port. To prevent the switch from learning from flooding frames from the CPU port, set skb->offload_fwd_mark to 1 for unicast and broadcast frames, and let the switch flood them instead of trapping to the CPU port. Multicast frames still need to be trapped to the CPU port for snooping, so set the SA_DIS bit of the MTK tag to 1 when transmitting those frames to disable SA learning. Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16  netns: enable to inherit devconf from current netns  Nicolas Dichtel
The goal is to be able to inherit the initial devconf parameters from the current netns, i.e. the netns where this new netns has been created.

This is useful in a container environment where /proc/sys is read-only. For example, if a pod is created with specific devconf parameters and has the capability to create netns, the user expects to get the same parameters as in his 'init_net', which is not the real init_net in this case.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16  ipv6: Fix suspicious RCU usage warning in ip6mr  Madhuparna Bhowmik
This patch fixes the following warning:

    =============================
    WARNING: suspicious RCU usage
    5.7.0-rc4-next-20200507-syzkaller #0 Not tainted
    -----------------------------
    net/ipv6/ip6mr.c:124 RCU-list traversed in non-reader section!!

ipmr_new_table() returns an existing table, but there is no table at init. Therefore the condition used is: either holding rtnl or the list is empty.

Fixes: d1db275dd3f6e ("ipv6: ip6mr: support multiple tables")
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  Merge tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs  Linus Torvalds
Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - nfs: fix NULL deference in nfs4_get_valid_delegation

  Bugfixes:
   - Fix corruption of the return value in cachefiles_read_or_alloc_pages()
   - Fix several fscache cookie issues
   - Fix a fscache queuing race that can trigger a BUG_ON
   - NFS: Fix two use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
   - SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
   - SUNRPC: Fix a race when tearing down the rpc client debugfs directory
   - SUNRPC: Signalled ASYNC tasks need to exit
   - NFSv3: fix rpc receive buffer size for MOUNT call"

* tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv3: fix rpc receive buffer size for MOUNT call
  SUNRPC: 'Directory with parent 'rpc_clnt' already present!'
  NFS/pnfs: Don't use RPC_TASK_CRED_NOREF with pnfs
  NFS: Don't use RPC_TASK_CRED_NOREF with delegreturn
  SUNRPC: Signalled ASYNC tasks need to exit
  nfs: fix NULL deference in nfs4_get_valid_delegation
  SUNRPC: fix use-after-free in rpc_free_client_work()
  cachefiles: Fix race between read_waiter and read_copier involving op->to_do
  NFSv4: Fix fscache cookie aux_data to ensure change_attr is included
  NFS: Fix fscache super_cookie allocation
  NFS: Fix fscache super_cookie index_key from changing after umount
  cachefiles: Fix corruption of the return value in cachefiles_read_or_alloc_pages()
2020-05-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  David S. Miller
Move the bpf verifier trace check into the new switch statement in HEAD. Resolve the overlapping changes in hinic, where bug fixes overlap the addition of VF support. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  Linus Torvalds
Pull networking fixes from David Miller:

 1) Fix sk_psock reference count leak on receive, from Xiyu Yang.

 2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.

 3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this, from Maciej Żenczykowski.

 4) ipv4 route redirect backoff wasn't actually enforced, from Paolo Abeni.

 5) Fix netprio cgroup v2 leak, from Zefan Li.

 6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.

 7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.

 8) Various bpf probe handling fixes, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
  selftests: mptcp: pm: rm the right tmp file
  dpaa2-eth: properly handle buffer size restrictions
  bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
  bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
  bpf: Restrict bpf_probe_read{, str}() only to archs where they work
  MAINTAINERS: Mark networking drivers as Maintained.
  ipmr: Add lockdep expression to ipmr_for_each_table macro
  ipmr: Fix RCU list debugging warning
  drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
  net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
  tcp: fix error recovery in tcp_zerocopy_receive()
  MAINTAINERS: Add Jakub to networking drivers.
  MAINTAINERS: another add of Karsten Graul for S390 networking
  drivers: ipa: fix typos for ipa_smp2p structure doc
  pppoe: only process PADT targeted at local interfaces
  selftests/bpf: Enforce returning 0 for fentry/fexit programs
  bpf: Enforce returning 0 for fentry/fexit progs
  net: stmmac: fix num_por initialization
  security: Fix the default value of secid_to_secctx hook
  libbpf: Fix register naming in PT_REGS s390 macros
  ...
2020-05-15  mptcp: cope better with MP_JOIN failure  Paolo Abeni
Currently, on MP_JOIN failure we reset the child socket, but leave the request socket untouched. tcp_check_req will deal with it according to the 'tcp_abort_on_overflow' sysctl value - by default the req socket will stay alive. The above leads to inconsistent behavior on MP JOIN failure, and bad listener overflow accounting. This patch addresses the issue leveraging the infrastructure just introduced to ask the TCP stack to drop the req on failure. The child socket is not freed anymore by subflow_syn_recv_sock(), instead it's moved to a dead state and will be disposed by the next sock_put done by the TCP stack, so that listener overflow accounting is not affected by MP JOIN failure. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  inet_connection_sock: factor out destroy helper.  Paolo Abeni
Move the steps to prepare an inet_connection_sock for forced disposal inside a separate helper. No functional changes intended; this will just simplify the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  mptcp: add new sock flag to deal with join subflows  Paolo Abeni
MP_JOIN subflows must not land in the accept queue. Currently tcp_check_req() calls an mptcp-specific helper to detect such a scenario.

That helper leverages the subflow context to check for MP_JOIN subflows. We also need to deal with MP JOIN failures, even when the subflow context is not available due to an allocation failure.

A possible solution would be changing the syn_recv_sock() signature to allow returning a more descriptive action/error code and dealing with that in tcp_check_req().

Since the above need is MPTCP specific, this patch instead uses a TCP request socket hole to add an MPTCP-specific flag. This flag is used by the MPTCP syn_recv_sock() to tell tcp_check_req() how to deal with the request socket.

This change is a no-op for !MPTCP builds, and makes the MPTCP code simpler. It also allows the next patch to deal correctly with MP JOIN failure.

v1 -> v2:
 - be more conservative on drop_req initialization (Mat)

RFC -> v1:
 - move the drop_req bit inside tcp_request_sock (Eric)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  David S. Miller
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-15

The following pull-request contains BPF updates for your *net-next* tree.

We've added 37 non-merge commits during the last 1 day(s) which contain a total of 67 files changed, 741 insertions(+), 252 deletions(-).

The main changes are:

1) bpf_xdp_adjust_tail() now allows to grow the tail as well, from Jesper.

2) bpftool can probe CONFIG_HZ, from Daniel.

3) CAP_BPF is introduced to isolate user processes that use BPF infra and to secure BPF networking services by dropping CAP_SYS_ADMIN requirement in certain cases, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  net: sched: cls_flower: implement terse dump support  Vlad Buslov
Implement tcf_proto_ops->terse_dump() callback for flower classifier. Only dump handle, flags and action data in terse mode. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  net: sched: implement terse dump support in act  Vlad Buslov
Extend tcf_action_dump() with boolean argument 'terse' that is used to request terse-mode action dump. In terse mode only essential data needed to identify particular action (action kind, cookie, etc.) and its stats is put to resulting skb and everything else is omitted. Implement tcf_exts_terse_dump() helper in cls API that is intended to be used to request terse dump of all exts (actions) attached to the filter. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  net: sched: introduce terse dump flag  Vlad Buslov
Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option is intended to be used to improve performance of TC filter dump when userland only needs to obtain stats and not the whole classifier/action data. Extend struct tcf_proto_ops with new terse_dump() callback that must be defined by supporting classifier implementations. Support of the options in specific classifiers and actions is implemented in following patches in the series. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15  net: core: recursively find netdev by device node  Tobias Waldekranz
The assumption that a device node is associated either with the netdev's device, or the parent of that device, does not hold for all drivers. E.g. Freescale's DPAA has two layers of platform devices above the netdev. Instead, recursively walk up the tree from the netdev, allowing any parent to match against the sought after node. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
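The walk up the device hierarchy can be sketched with a stripped-down device structure. This is a user-space illustration; struct dev and dev_matches_node() are hypothetical stand-ins, and the walk is written as a loop over parent pointers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal device: an optional firmware node and a parent. */
struct dev {
    const void *of_node;
    const struct dev *parent;
};

/* Return true if dev or any of its ancestors is bound to the given node. */
bool dev_matches_node(const struct dev *dev, const void *node)
{
    for (; dev; dev = dev->parent) {
        if (dev->of_node == node)
            return true;
    }
    return false;
}
```

With two intermediate platform devices between the netdev and the node owner, a check limited to the device and its direct parent would miss the match, while the full walk finds it.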
2020-05-15  bpf: Implement CAP_BPF  Alexei Starovoitov
Implement permissions as stated in uapi/linux/capability.h.

In order to do that, the verifier allow_ptr_leaks flag is split into four flags, which are set as:

    env->allow_ptr_leaks = bpf_allow_ptr_leaks();
    env->bypass_spec_v1 = bpf_bypass_spec_v1();
    env->bypass_spec_v4 = bpf_bypass_spec_v4();
    env->bpf_capable = bpf_capable();

The first three are currently equivalent to perfmon_capable(), since leaking kernel pointers and reading kernel memory via side channel attacks is roughly equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions, subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the verifier, run time mitigations in bpf array, and enables indirect variable access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code by the verifier.

That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will have speculative checks done by the verifier and other spectre mitigations applied. Such a networking BPF program will not be able to leak kernel pointers and will not be able to access arbitrary kernel memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
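How the four flags might be derived from a task's capabilities can be sketched as follows. The capability bits are hypothetical values chosen for this example, not the kernel's CAP_* numbers, and the sketch assumes that CAP_SYS_ADMIN continues to imply both the BPF and the perfmon permission.

```c
#include <stdbool.h>

/* Hypothetical capability bits for this sketch only. */
enum {
    SK_CAP_SYS_ADMIN = 1 << 0,
    SK_CAP_BPF       = 1 << 1,
    SK_CAP_PERFMON   = 1 << 2,
    SK_CAP_NET_ADMIN = 1 << 3
};

struct verifier_flags {
    bool allow_ptr_leaks;
    bool bypass_spec_v1;
    bool bypass_spec_v4;
    bool bpf_capable;
};

static bool has_cap(unsigned int caps, unsigned int bit)
{
    return (caps & bit) != 0;
}

/* Derive the verifier flags; the first three follow the perfmon check. */
struct verifier_flags derive_flags(unsigned int caps)
{
    bool perfmon = has_cap(caps, SK_CAP_PERFMON) ||
                   has_cap(caps, SK_CAP_SYS_ADMIN);
    bool bpf = has_cap(caps, SK_CAP_BPF) ||
               has_cap(caps, SK_CAP_SYS_ADMIN);
    struct verifier_flags f = { perfmon, perfmon, perfmon, bpf };

    return f;
}
```

Under these assumptions, a loader holding only the BPF and NET_ADMIN bits gets bpf_capable set while all three leak and bypass flags stay off, which matches the networking case described in the message.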
2020-05-14  bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp().  Jesper Dangaard Brouer
Update the memory requirements when adding xdp.frame_sz in the BPF test_run function bpf_prog_test_run_xdp(), which is used e.g. by XDP selftests.

Specifically, add the expected reserved tailroom, and also allocate a larger memory area to reflect that XDP frames usually come in this format. Limit the provided packet data size to 4096 minus (headroom + tailroom), as this also reflects a common 3520-byte MTU limit with XDP.

Note that bpf_test_init already uses a memory allocation method that clears memory. Thus, this already guards against leaking uninitialized kernel memory.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945349549.97035.15316291762482444006.stgit@firesoul
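The size limit works out as a simple subtraction. In the sketch below, the 256-byte headroom and 320-byte tailroom are assumed values (typical for x86-64) chosen because they reproduce the 3520-byte figure quoted above; the function name is hypothetical.

```c
#include <stdint.h>

/* Assumed example values; actual sizes depend on architecture and config. */
enum {
    EX_FRAME_SZ = 4096,
    EX_HEADROOM = 256,
    EX_TAILROOM = 320
};

/* Largest packet payload that still leaves room for headroom and tailroom.
 * Returns 0 if the reserved areas alone do not fit in the frame. */
uint32_t max_packet_data(uint32_t frame_sz, uint32_t headroom,
                         uint32_t tailroom)
{
    if (headroom > frame_sz || tailroom > frame_sz - headroom)
        return 0;
    return frame_sz - headroom - tailroom;
}
```

With the example values, max_packet_data(EX_FRAME_SZ, EX_HEADROOM, EX_TAILROOM) evaluates to 4096 - 256 - 320 = 3520.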
2020-05-14  xdp: Clear grow memory in bpf_xdp_adjust_tail()  Jesper Dangaard Brouer
Clear the tail memory when a grow happens, because it is too easy to write an XDP_PASS program that extends the tail, which would expose this memory to users that can run tcpdump.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945349039.97035.5262100484553494.stgit@firesoul
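The grow-then-clear behavior can be modeled in user space with a small buffer. This is a simplified sketch, not the kernel helper: struct frame, adjust_tail() and the 14-byte minimum length are assumptions made for the example, and offsets are plain indices into the buffer.

```c
#include <stdint.h>
#include <string.h>

#define MIN_PKT_LEN 14  /* assumed minimum: one Ethernet header */

/* Hypothetical frame model using offsets into a fixed buffer. */
struct frame {
    uint8_t buf[256];
    int data;      /* offset of the first packet byte */
    int data_end;  /* offset one past the last packet byte */
    int hard_end;  /* frame size minus the reserved tailroom */
};

/* Move the packet tail by offset bytes (positive grows, negative shrinks).
 * Returns 0 on success, -1 if the new tail would be out of bounds. */
int adjust_tail(struct frame *f, int offset)
{
    long long new_end = (long long)f->data_end + offset;

    if (new_end > f->hard_end)
        return -1;  /* would reach into the reserved tailroom */
    if (new_end < (long long)f->data + MIN_PKT_LEN)
        return -1;  /* packet would become too short */

    /* Zero the newly exposed bytes so stale memory is never visible. */
    if (offset > 0)
        memset(f->buf + f->data_end, 0, (size_t)offset);

    f->data_end = (int)new_end;
    return 0;
}
```

The zeroing step is what keeps a program that grows the tail and then passes the packet up from exposing old buffer contents to packet capture tools.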
2020-05-14  xdp: Allow bpf_xdp_adjust_tail() to grow packet size  Jesper Dangaard Brouer
Finally, after all drivers have a frame size, allow the BPF helper bpf_xdp_adjust_tail() to grow or extend the packet size at the frame tail.

Remember that the helper/macro xdp_data_hard_end has reserved some tailroom. Thus, this helper makes sure that the BPF prog doesn't have access to this tailroom area.

V2: Remove one chicken check and use WARN_ONCE for the other

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945348530.97035.12577148209134239291.stgit@firesoul
2020-05-14  xdp: Xdp_frame add member frame_sz and handle in convert_to_xdp_frame  Jesper Dangaard Brouer
Use a hole in struct xdp_frame when adding the member frame_sz, which keeps the same sizeof struct (32 bytes).

Drivers ixgbe and sfc had bug cases where the necessary/expected tailroom was not reserved. This can lead to some hard-to-catch memory corruption issues. With the driver's frame_sz available, this can be detected when the packet length/end via xdp->data_end exceeds the xdp_data_hard_end pointer, which accounts for the reserved tailroom.

When detecting this driver issue, simply fail the conversion with NULL, which results in feedback to the driver (failing xdp_do_redirect()), causing the driver to drop the packet. Given the lack of consistent XDP stats, this can be hard to troubleshoot. And given this is a driver bug, we want to generate some more noise in the form of a WARN stack dump (to ID the driver code that inlined convert_to_xdp_frame).

Inlining the WARN macro is problematic, because it adds an asm instruction (ud2 on Intel CPUs) that influences instruction cache prefetching. Thus, introduce xdp_warn and the macro XDP_WARN to avoid this and at the same time make identifying the function and line of this inlined function easier.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945337313.97035.10015729316710496600.stgit@firesoul
2020-05-14  net: XDP-generic determining XDP frame size  Jesper Dangaard Brouer
The SKB "head" pointer points to the data area that contains skb_shared_info, which can be found via skb_end_pointer(). Given that xdp->data_hard_start has been established (basically pointing to skb->head), the frame size is the distance between skb_end_pointer() and data_hard_start, plus the size reserved for skb_shared_info.

Change the bpf_xdp_adjust_tail offset adjustment of skb->len to be a positive offset number on grow and a negative number on shrink, as this seems more natural when reading the code.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945336804.97035.7164852191163722056.stgit@firesoul
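The frame size computation described here is two terms added together. The sketch below models the pointers as byte offsets from the start of the buffer; the function name and the numbers used in the usage note are illustrative assumptions.

```c
#include <stdint.h>

/* Frame size for the generic path: the span from the hard start of the
 * data area to the end pointer, plus the area reserved behind it for the
 * shared-info structure. All arguments are byte counts or offsets. */
uint32_t generic_frame_sz(uint32_t hard_start_off, uint32_t end_off,
                          uint32_t shared_info_sz)
{
    return (end_off - hard_start_off) + shared_info_sz;
}
```

For example, with a hard start at offset 0, an end pointer at offset 1728 and an assumed 320 bytes reserved for the shared info, the frame size comes to 2048 bytes.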
2020-05-14  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  David S. Miller
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-14

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.

2) support for narrow loads in bpf_sock_addr progs and additional helpers in cg-skb progs, from Andrey.

3) bpf benchmark runner, from Andrii.

4) arm and riscv JIT optimizations, from Luke.

5) bpf iterator infrastructure, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14  bpf: Introduce bpf_sk_{, ancestor_}cgroup_id helpers  Andrey Ignatov
With the ability to look up sockets in cgroup skb programs, it becomes useful to access the cgroup id of retrieved sockets so that policies can be implemented based on the origin cgroup of such a socket.

For example, a container running in a cgroup can have a cgroup skb ingress program that can look up the peer socket that is sending packets to a process inside the container and decide whether those packets should be allowed or denied based on the cgroup id of the peer.

More specifically, such an ingress program can implement the intra-host policy "allow incoming packets only from this same container and not from any other container on same host" without relying on source IP addresses, since quite often containers share the same IP address on the host.

Introduce two new helpers for this use case: bpf_sk_cgroup_id() and bpf_sk_ancestor_cgroup_id().

These helpers are similar to the existing bpf_skb_{,ancestor_}cgroup_id helpers, with the only difference that sk is used to get the cgroup id instead of skb, and they share code with them.

See documentation in UAPI for more details.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/f5884981249ce911f63e9b57ecd5d7d19154ff39.1589486450.git.rdna@fb.com
2020-05-14  bpf: Allow skb_ancestor_cgroup_id helper in cgroup skb  Andrey Ignatov
cgroup skb programs already can use bpf_skb_cgroup_id. Allow bpf_skb_ancestor_cgroup_id as well so that container policies can be implemented for a container that can have sub-cgroups dynamically created, but policies should still be implemented based on cgroup id of container itself not on an id of a sub-cgroup. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/8874194d6041eba190356453ea9f6071edf5f658.1589486450.git.rdna@fb.com
2020-05-14  bpf: Allow sk lookup helpers in cgroup skb  Andrey Ignatov
Currently sk lookup helpers are allowed in tc, xdp, sk skb, and cgroup sock_addr programs.

But they would be useful in cgroup skb as well so that, for example, a cgroup skb ingress program can look up the peer socket a packet comes from on the same host and make a decision whether to allow or deny this packet based on the properties of that socket, e.g. the cgroup that the peer socket belongs to.

Allow the following sk lookup helpers in cgroup skb:
* bpf_sk_lookup_tcp;
* bpf_sk_lookup_udp;
* bpf_sk_release;
* bpf_skc_lookup_tcp.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/f8c7ee280f1582b586629436d777b6db00597d63.1589486450.git.rdna@fb.com
2020-05-14bpf: Support narrow loads from bpf_sock_addr.user_portAndrey Ignatov
bpf_sock_addr.user_port supports only 4-byte loads, which leads to ugly code in BPF programs, like: volatile __u32 user_port = ctx->user_port; __u16 port = bpf_ntohs(user_port); Without the volatile, clang may optimize the load to be 2-byte, which the verifier rejects. Add support for 1- and 2-byte loads, the same way as it is supported for other fields in bpf_sock_addr like user_ip4, msg_src_ip4, etc. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/c1e983f4c17573032601d0b2b1f9d1274f24bc16.1589420814.git.rdna@fb.com
2020-05-14ipmr: Add lockdep expression to ipmr_for_each_table macroAmol Grover
During the initialization process, ipmr_new_table() is called to create new tables, which in turn calls ipmr_get_table(), which traverses net->ipv4.mr_tables without holding the writer lock. This is safe, however, as no tables exist at this time. Hence add a suitable lockdep expression to silence the following false-positive warning: ============================= WARNING: suspicious RCU usage 5.7.0-rc3-next-20200428-syzkaller #0 Not tainted ----------------------------- net/ipv4/ipmr.c:136 RCU-list traversed in non-reader section!! ipmr_get_table+0x130/0x160 net/ipv4/ipmr.c:136 ipmr_new_table net/ipv4/ipmr.c:403 [inline] ipmr_rules_init net/ipv4/ipmr.c:248 [inline] ipmr_net_init+0x133/0x430 net/ipv4/ipmr.c:3089 Fixes: f0ad0860d01e ("ipv4: ipmr: support multiple tables") Reported-by: syzbot+1519f497f2f9f08183c6@syzkaller.appspotmail.com Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14ipmr: Fix RCU list debugging warningAmol Grover
ipmr_for_each_table() macro uses list_for_each_entry_rcu() for traversing outside of an RCU read side critical section but under the protection of rtnl_mutex. Hence, add the corresponding lockdep expression to silence the following false-positive warning at boot: [ 4.319347] ============================= [ 4.319349] WARNING: suspicious RCU usage [ 4.319351] 5.5.4-stable #17 Tainted: G E [ 4.319352] ----------------------------- [ 4.319354] net/ipv4/ipmr.c:1757 RCU-list traversed in non-reader section!! Fixes: f0ad0860d01e ("ipv4: ipmr: support multiple tables") Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14devlink: refactor end checks in devlink_nl_cmd_region_read_dumpitJakub Kicinski
Clean up after recent fixes, move address calculations around and change the variable init, so that we can have just one start_offset == end_offset check. Make the check a little stricter to preserve the -EINVAL error if requested start offset is larger than the region itself. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14tcp: fix error recovery in tcp_zerocopy_receive()Eric Dumazet
If user provides wrong virtual address in TCP_ZEROCOPY_RECEIVE operation we want to return -EINVAL error. But depending on zc->recv_skip_hint content, we might return -EIO error if the socket has SOCK_DONE set. Make sure to return -EINVAL in this case. BUG: KMSAN: uninit-value in tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline] BUG: KMSAN: uninit-value in do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685 CPU: 1 PID: 625 Comm: syz-executor.0 Not tainted 5.7.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x220 lib/dump_stack.c:118 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215 tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline] do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685 tcp_getsockopt+0xf8/0x1f0 net/ipv4/tcp.c:3728 sock_common_getsockopt+0x13f/0x180 net/core/sock.c:3131 __sys_getsockopt+0x533/0x7b0 net/socket.c:2177 __do_sys_getsockopt net/socket.c:2192 [inline] __se_sys_getsockopt+0xe1/0x100 net/socket.c:2189 __x64_sys_getsockopt+0x62/0x80 net/socket.c:2189 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45c829 Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f1deeb72c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000037 RAX: ffffffffffffffda RBX: 00000000004e01e0 RCX: 000000000045c829 RDX: 0000000000000023 RSI: 0000000000000006 RDI: 0000000000000009 RBP: 000000000078bf00 R08: 0000000020000200 R09: 0000000000000000 R10: 00000000200001c0 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000000001d8 R14: 00000000004d3038 R15: 00007f1deeb736d4 Local variable ----zc@do_tcp_getsockopt created at: do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670 
do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670 Fixes: 05255b823a61 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14SUNRPC: 'Directory with parent 'rpc_clnt' already present!'J. Bruce Fields
Each rpc_client has a cl_clid which is allocated from a global ida, and a debugfs directory which is named after cl_clid. We're releasing the cl_clid before we free the debugfs directory named after it. As soon as the cl_clid is released, that value is available for another newly created client. That leaves a window where another client may attempt to create a new debugfs directory with the same name as the not-yet-deleted debugfs directory from the dying client. Symptoms are log messages like Directory 4 with parent 'rpc_clnt' already present! Fixes: 7c4310ff5642 "SUNRPC: defer slow parts of rpc_free_client() to a workqueue." Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
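The ordering problem described above can be reproduced in a small user-space model. This is a sketch under stated assumptions, not the SUNRPC code: the two arrays stand in for the global ida and for the debugfs directories named after each id, and all function names (id_alloc, dir_create, dir_remove, id_release, client_free) are hypothetical.

```c
#include <stdbool.h>

#define MAX_IDS 8 /* hypothetical size for this model only */

static bool id_used[MAX_IDS];  /* models the global id allocator */
static bool dir_used[MAX_IDS]; /* models directories named after ids */

/* Hand out the lowest free id, or -1 if none is left. */
static int id_alloc(void)
{
	for (int i = 0; i < MAX_IDS; i++) {
		if (!id_used[i]) {
			id_used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Create the directory named after id; fails if it is already present. */
static bool dir_create(int id)
{
	if (dir_used[id])
		return false;
	dir_used[id] = true;
	return true;
}

static void dir_remove(int id) { dir_used[id] = false; }
static void id_release(int id) { id_used[id] = false; }

/* Safe teardown order: remove the directory first, and only then make
 * the id available to newly created clients. */
static void client_free(int id)
{
	dir_remove(id);
	id_release(id);
}
```

Calling id_release() before dir_remove() lets a new client obtain the same id while the old directory still exists, so its dir_create() fails. That is the window the commit message describes.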
2020-05-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix gcc-10 compilation warning in nf_conntrack, from Arnd Bergmann. 2) Add NF_FLOW_HW_PENDING to avoid races between stats and deletion commands, from Paul Blakey. 3) Remove WQ_MEM_RECLAIM from the offload workqueue, from Roi Dayan. 4) Infinite loop when removing nf_conntrack module, from Florian Westphal. 5) Set NF_FLOW_TEARDOWN bit on expiration to avoid races when refreshing the timeout from the software path. 6) Missing nft_set_elem_expired() check in the rbtree, from Phil Sutter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14esp6: calculate transport_header correctly when sel.family != AF_INET6Xin Long
In esp6_init_state() for beet mode when x->sel.family != AF_INET6: x->props.header_len = sizeof(struct ip_esp_hdr) + crypto_aead_ivsize(aead) + IPV4_BEET_PHMAXLEN + (sizeof(struct ipv6hdr) - sizeof(struct iphdr)) In xfrm6_beet_gso_segment() skb->transport_header is supposed to move to the end of the ph header for IPPROTO_BEETPH, so if x->sel.family != AF_INET6 and it's IPPROTO_BEETPH, it should do: skb->transport_header -= (sizeof(struct ipv6hdr) - sizeof(struct iphdr)); skb->transport_header += ph->hdrlen * 8; And IPV4_BEET_PHMAXLEN is only reserved for PH header, so if x->sel.family != AF_INET6 and it's not IPPROTO_BEETPH, it should do: skb->transport_header -= (sizeof(struct ipv6hdr) - sizeof(struct iphdr)); skb->transport_header -= IPV4_BEET_PHMAXLEN; Thanks Sabrina for looking deep into this issue. Fixes: 7f9e40eb18a9 ("esp6: add gso_segment for esp6 beet mode") Reported-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
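The two adjustments above can be checked with plain arithmetic. The following is a user-space sketch, not the kernel code: the enum constants are local stand-ins for sizeof(struct iphdr) (20), sizeof(struct ipv6hdr) (40) and IPV4_BEET_PHMAXLEN (assumed here to be 8), and beet_transport_offset() is a hypothetical helper that works on a plain integer offset instead of an skb.

```c
/* Local stand-ins for the kernel sizes; the value used for
 * IPV4_BEET_PHMAXLEN is an assumption of this sketch. */
enum {
	IPV4_HDR_LEN = 20,  /* sizeof(struct iphdr) */
	IPV6_HDR_LEN = 40,  /* sizeof(struct ipv6hdr) */
	BEET_PH_MAXLEN = 8, /* IPV4_BEET_PHMAXLEN (assumed) */
};

/* Hypothetical helper for the x->sel.family != AF_INET6 case: return the
 * corrected transport header offset. ph_hdrlen is in units of 8 bytes and
 * is only used when the next protocol is IPPROTO_BEETPH. */
static int beet_transport_offset(int transport_header, int is_beetph,
				 int ph_hdrlen)
{
	/* Undo the IPv6-vs-IPv4 header size difference in both cases. */
	transport_header -= IPV6_HDR_LEN - IPV4_HDR_LEN;

	if (is_beetph)
		transport_header += ph_hdrlen * 8; /* skip the PH header */
	else
		transport_header -= BEET_PH_MAXLEN; /* no PH header present */

	return transport_header;
}
```

Starting from an offset of 100, a packet with a one-unit PH header ends up at 100 - 20 + 8 = 88, while a packet without a PH header ends up at 100 - 20 - 8 = 72.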
2020-05-13ipv6: set msg_control_is_user in do_ipv6_getsockoptChristoph Hellwig
While do_ipv6_getsockopt does not call the high-level recvmsg helper, the msghdr eventually ends up being passed to put_cmsg anyway, and thus needs msg_control_is_user set to the proper value. Fixes: 1f466e1f15cf ("net: cleanly handle kernel vs user buffers for ->msg_control") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13tipc: fix failed service subscription deletionTuong Lien
When a service subscription expires or is canceled by the user, it needs to be deleted from the subscription list so that new subscriptions can be registered (max = 65535 per net). However, there are two issues in the code that can cause such an unused subscription to persist: 1) 'tipc_conn_delete_sub()' loops over the subscription list but breaks out early when the first subscription differs from the one specified, so the subscription will not be deleted. 2) When a subscription is canceled, the code that removes the 'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it is a local subscription (i.e. little endian isn't involved). So there will be no match when looking for the subscription to delete later. The subscription(s) will be removed eventually when the user terminates its topology connection, but that could be a long time later. Meanwhile, the number of available subscriptions may be exhausted. This commit fixes the two issues above, so that a subscription can be deleted correctly when needed. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
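The first issue, a search loop that gives up on the first non-matching entry, can be illustrated with a minimal user-space list. This is a sketch only: struct sub, its key field and sub_list_delete() are hypothetical stand-ins and do not mirror the TIPC data structures.

```c
#include <stddef.h>

/* Hypothetical stand-in for a subscription on a singly linked list. */
struct sub {
	unsigned int key; /* stands in for the subscription request */
	struct sub *next;
};

/* Unlink the first entry whose key matches. Returns 1 if an entry was
 * removed, 0 if no entry matched. */
static int sub_list_delete(struct sub **head, unsigned int key)
{
	struct sub **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if ((*pp)->key != key)
			continue; /* keep searching; stopping here is the bug */
		*pp = (*pp)->next;
		return 1;
	}
	return 0;
}
```

Replacing the continue with a break gives the failing behavior: any subscription that is not first on the list is never found, so it stays registered and keeps counting against the limit.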