summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-12-04seg6: add VRF support for SRv6 End.DT6 behaviorAndrea Mayer
SRv6 End.DT6 is defined in the SRv6 Network Programming [1]. The Linux kernel already offers an implementation of the SRv6 End.DT6 behavior which permits IPv6 L3 VPNs over SRv6 networks. This implementation is not particularly suitable in contexts where we need to deploy IPv6 L3 VPNs among different tenants which share the same network address schemes. The underlying problem lies in the fact that the current version of DT6 (called legacy DT6 from now on) needs a complex configuration to be applied on routers which requires ad-hoc routes and routing policy rules to ensure the correct isolation of tenants. Consequently, a new implementation of DT6 has been introduced with the aim of simplifying the construction of IPv6 L3 VPN services in the multi-tenant environment using SRv6 networks. To accomplish this task, we reused the same VRF infrastructure and SRv6 core components already exploited for implementing the SRv6 End.DT4 behavior. Currently the two End.DT6 implementations coexist seamlessly and can be used depending on the context and the user preferences. So, in order to support both versions of DT6 a new attribute (vrftable) has been introduced which allows us to differentiate the implementation of the behavior to be used. A SRv6 End.DT6 legacy behavior is still instantiated using a command like the following one: $ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 table 100 dev eth0 While to instantiate the SRv6 End.DT6 in VRF mode, the command is still pretty straight forward: $ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 vrftable 100 dev eth0. Obviously as in the case of SRv6 End.DT4, the VRF strict_mode parameter must be set (net.vrf.strict_mode=1) and the VRF associated with table 100 must exist. Please note that the instances of SRv6 End.DT6 legacy and End.DT6 VRF mode can coexist in the same system/configuration without problems. [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04seg6: add support for the SRv6 End.DT4 behaviorAndrea Mayer
SRv6 End.DT4 is defined in the SRv6 Network Programming [1]. The SRv6 End.DT4 is used to implement IPv4 L3VPN use-cases in multi-tenants environments. It decapsulates the received packets and it performs IPv4 routing lookup in the routing table of the tenant. The SRv6 End.DT4 Linux implementation leverages a VRF device in order to force the routing lookup into the associated routing table. To make the End.DT4 work properly, it must be guaranteed that the routing table used for routing lookup operations is bound to one and only one VRF during the tunnel creation. Such constraint has to be enforced by enabling the VRF strict_mode sysctl parameter, i.e: $ sysctl -wq net.vrf.strict_mode=1. At JANOG44, LINE corporation presented their multi-tenant DC architecture using SRv6 [2]. In the slides, they reported that the Linux kernel is missing the support of SRv6 End.DT4 behavior. The SRv6 End.DT4 behavior can be instantiated using a command similar to the following: $ ip route add 2001:db8::1 encap seg6local action End.DT4 vrftable 100 dev eth0 We introduce the "vrftable" extension in iproute2 in a following patch. [1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming [2] https://speakerdeck.com/line_developers/line-data-center-networking-with-srv6 Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04seg6: add callbacks for customizing the creation/destruction of a behaviorAndrea Mayer
We introduce two callbacks used for customizing the creation/destruction of a SRv6 behavior. Such callbacks are defined in the new struct seg6_local_lwtunnel_ops and hereafter we provide a brief description of them: - build_state(...): used for calling the custom constructor of the behavior during its initialization phase and after all the attributes have been parsed successfully; - destroy_state(...): used for calling the custom destructor of the behavior before it is completely destroyed. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04seg6: add support for optional attributes in SRv6 behaviorsAndrea Mayer
Before this patch, each SRv6 behavior specifies a set of required attributes that must be provided by the userspace application when such behavior is going to be instantiated. If at least one of the required attributes is not provided, the creation of the behavior fails. The SRv6 behavior framework lacks a way to manage optional attributes. By definition, an optional attribute for a SRv6 behavior consists of an attribute which may or may not be provided by the userspace. Therefore, if an optional attribute is missing (and thus not supplied by the user) the creation of the behavior goes ahead without any issue. This patch explicitly differentiates the required attributes from the optional attributes. In particular, each behavior can declare a set of required attributes and a set of optional ones. The semantic of the required attributes remains *totally* unaffected by this patch. The introduction of the optional attributes does NOT impact on the backward compatibility of the existing SRv6 behaviors. It is essential to note that if an (optional or required) attribute is supplied to a SRv6 behavior which does not expect it, the behavior simply discards such attribute without generating any error or warning. This operating mode remained unchanged both before and after the introduction of the optional attributes extension. The optional attributes are one of the key components used to implement the SRv6 End.DT6 behavior based on the Virtual Routing and Forwarding (VRF) framework. The optional attributes make possible the coexistence of the already existing SRv6 End.DT6 implementation with the new SRv6 End.DT6 VRF-based implementation without breaking any backward compatibility. Further details on the SRv6 End.DT6 behavior (VRF mode) are reported in subsequent patches. From the userspace point of view, the support for optional attributes DO NOT require any changes to the userspace applications, i.e: iproute2 unless new attributes (required or optional) are needed. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04seg6: improve management of behavior attributesAndrea Mayer
Depending on the attribute (i.e.: SEG6_LOCAL_SRH, SEG6_LOCAL_TABLE, etc), the parse() callback performs some validity checks on the provided input and updates the tunnel state (slwt) with the result of the parsing operation. However, an attribute may also need to reserve some additional resources (i.e.: memory or setting up an eBPF program) in the parse() callback to complete the parsing operation. The parse() callbacks are invoked by the parse_nla_action() for each attribute belonging to a specific behavior. Given a behavior with N attributes, if the parsing of the i-th attribute fails, the parse_nla_action() returns immediately with an error. Nonetheless, the resources acquired during the parsing of the i-1 attributes are not freed by the parse_nla_action(). Attributes which acquire resources must release them *in an explicit way* in both the seg6_local_{build/destroy}_state(). However, adding a new attribute of this type requires changes to seg6_local_{build/destroy}_state() to release the resources correctly. The seg6local infrastructure still lacks a simple and structured way to release the resources acquired in the parse() operations. We introduced a new callback in the struct seg6_action_param named destroy(). This callback releases any resource which may have been acquired in the parse() counterpart. Each attribute may or may not implement the destroy() callback depending on whether it needs to free some acquired resources. The destroy() callback comes with several of advantages: 1) we can have many attributes as we want for a given behavior with no need to explicitly free the taken resources; 2) As in case of the seg6_local_build_state(), the seg6_local_destroy_state() does not need to handle the release of resources directly. Indeed, it calls the destroy_attrs() function which is in charge of calling the destroy() callback for every set attribute. We do not need to patch seg6_local_{build/destroy}_state() anymore as we add new attributes; 3) the code is more readable and better structured. Indeed, all the information needed to handle a given attribute are contained in only one place; 4) it facilitates the integration with new features introduced in further patches. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04Merge tag 'wireless-drivers-next-2020-12-03' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== wireless-drivers-next patches for v5.11 First set of patches for v5.11. rtw88 getting improvements to work better with Bluetooth and other driver also getting some new features. mhi-ath11k-immutable branch was pulled from mhi tree to avoid conflicts with mhi tree. Major changes: rtw88 * major bluetooth co-existance improvements wilc1000 * Wi-Fi Multimedia (WMM) support ath11k * Fast Initial Link Setup (FILS) discovery and unsolicited broadcast probe response support * qcom,ath11k-calibration-variant Device Tree setting * cold boot calibration support * new DFS region: JP wnc36xx * enable connection monitoring and keepalive in firmware ath10k * firmware IRAM recovery feature mhi * merge mhi-ath11k-immutable branch to make MHI API change go smoothly * tag 'wireless-drivers-next-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next: (180 commits) wl1251: remove trailing semicolon in macro definition airo: remove trailing semicolon in macro definition wilc1000: added queue support for WMM wilc1000: call complete() for failure in wilc_wlan_txq_add_cfg_pkt() wilc1000: free resource in wilc_wlan_txq_add_mgmt_pkt() for failure path wilc1000: free resource in wilc_wlan_txq_add_net_pkt() for failure path wilc1000: added 'ndo_set_mac_address' callback support brcmfmac: expose firmware config files through modinfo wlcore: Switch to using the new API kobj_to_dev() rtw88: coex: add feature to enhance HID coexistence performance rtw88: coex: upgrade coexistence A2DP mechanism rtw88: coex: add action for coexistence in hardware initial rtw88: coex: add function to avoid cck lock rtw88: coex: change the coexistence mechanism for WLAN connected rtw88: coex: change the coexistence mechanism for HID rtw88: coex: update AFH information while in free-run mode rtw88: coex: update the mechanism for A2DP + PAN rtw88: coex: add debug message rtw88: coex: run coexistence when WLAN entering/leaving LPS Revert "rtl8xxxu: Add Buffalo WI-U3-866D to list of supported devices" ... ==================== Link: https://lore.kernel.org/r/20201203185732.9CFA5C433ED@smtp.codeaurora.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04xsk: Return error code if force_zc is setZhang Changzhong
If force_zc is set, we should exit out with an error, not fall back to copy mode. Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/1607077277-41995-1-git-send-email-zhangchangzhong@huawei.com
2020-12-04Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-12-03 The main changes are: 1) Support BTF in kernel modules, from Andrii. 2) Introduce preferred busy-polling, from Björn. 3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh. 4) Memcg-based memory accounting for bpf objects, from Roman. 5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits) selftests/bpf: Fix invalid use of strncat in test_sockmap libbpf: Use memcpy instead of strncpy to please GCC selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module selftests/bpf: Add tp_btf CO-RE reloc test for modules libbpf: Support attachment of BPF tracing programs to kernel modules libbpf: Factor out low-level BPF program loading helper bpf: Allow to specify kernel module BTFs when attaching BPF programs bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF selftests/bpf: Add support for marking sub-tests as skipped selftests/bpf: Add bpf_testmod kernel module for testing libbpf: Add kernel module BTF support for CO-RE relocations libbpf: Refactor CO-RE relocs to not assume a single BTF object libbpf: Add internal helper to load BTF data by FD bpf: Keep module's btf_data_size intact after load bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address() selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP bpf: Adds support for setting window clamp samples/bpf: Fix spelling mistake "recieving" -> "receiving" bpf: Fix cold build of test_progs-no_alu32 ... ==================== Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04mac80211: set SDATA_STATE_RUNNING for monitor interfacesBorwankar, Antara
During restarrt, mac80211 is supposed to reconfigure the driver. When there's a monitor interface, the interface is added and the channel context for it was created, but not assigned to it as it was not considered running during the restart. Fix this by setting SDATA_STATE_RUNNING while adding monitor interfaces. Signed-off-by: Borwankar, Antara <antara.borwankar@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.e1df99693a4c.I494579f28018c2d0b9d4083a664cf872c28405ae@changeid [reword commit log] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04cfg80211: initialize rekey_dataSara Sharon
In case we have old supplicant, the akm field is uninitialized. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.930f0ab7ebee.Ic546e384efab3f4a89f318eafddc3eb7d556aecb@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04mac80211: fix return value of ieee80211_chandef_he_6ghz_operWen Gong
ieee80211_chandef_he_6ghz_oper() needs to return true if it determined a value 6 GHz chandef, fix that. Fixes: 1d00ce807efa ("mac80211: support S1G association") Signed-off-by: Wen Gong <wgong@codeaurora.org> Link: https://lore.kernel.org/r/1606121152-3452-1-git-send-email-wgong@codeaurora.org [rewrite commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04batman-adv: Drop unused soft-interface.h include in fragmentation.cSimon Wunderlich
The commit 992b03b88e36 ("batman-adv: Don't always reallocate the fragmentation skb head") removed the last user of functions from soft-interface.h. Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04batman-adv: Drop legacy code for auto deleting mesh interfacesSven Eckelmann
The only way to automatically drop batadv mesh interfaces when all soft interfaces were removed was dropped with the sysfs support. It is no longer needed to have them handled by kernel anymore. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04batman-adv: Drop deprecated debugfs supportSven Eckelmann
The debugfs support in batman-adv was marked as deprecated by the commit 00caf6a2b318 ("batman-adv: Mark debugfs functionality as deprecated") and scheduled for removal in 2021. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04batman-adv: Drop deprecated sysfs supportSven Eckelmann
The sysfs in batman-adv support was marked as deprecated by the commit 42cdd521487f ("batman-adv: ABI: Mark sysfs files as deprecated") and scheduled for removal in 2021. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04batman-adv: Allow selection of routing algorithm over rtnetlinkSven Eckelmann
A batadv net_device is associated to a B.A.T.M.A.N. routing algorithm. This algorithm has to be selected before the interface is initialized and cannot be changed after that. The only way to select this algorithm was a module parameter which specifies the default algorithm used during the creation of the net_device. This module parameter is writeable over /sys/module/batman_adv/parameters/routing_algo and thus allows switching of the routing algorithm: 1. change routing_algo parameter 2. create new batadv net_device But this is not race free because another process can be scheduled between 1 + 2 and in that time frame change the routing_algo parameter again. It is much cleaner to directly provide this information inside the rtnetlink's RTM_NEWLINK message. The two processes would be (in regards of the creation parameter of their batadv interfaces) be isolated. This also eases the integration of batadv devices inside tools like network-manager or systemd-networkd which are not expecting to operate on /sys before a new net_device is created. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04batman-adv: Prepare infrastructure for newlink settingsSven Eckelmann
The batadv generic netlink family can be used to retrieve the current state and set various configuration settings. But there are also settings which must be set before the actual interface is created. The rtnetlink already uses IFLA_INFO_DATA to allow net_device families to transfer such configurations. The minimal required functionality for this is now available for the batadv rtnl_link_ops. Also a new IFLA class of attributes will be attached to it because rtnetlink only allows 51 different attributes but batadv_nl_attrs already contains 62 attributes. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04batman-adv: Add new include for min/max helpersSven Eckelmann
The commit b296a6d53339 ("kernel.h: split out min()/max() et al. helpers") moved the min/max helper functionality from kernel.h to minmax.h. Adjust the kernel code accordingly to avoid fragile indirect includes. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04batman-adv: Start new development cycleSimon Wunderlich
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-03bpf: Remove hard-coded btf_vmlinux assumption from BPF verifierAndrii Nakryiko
Remove a permeating assumption thoughout BPF verifier of vmlinux BTF. Instead, wherever BTF type IDs are involved, also track the instance of struct btf that goes along with the type ID. This allows to gradually add support for kernel module BTFs and using/tracking module types across BPF helper calls and registers. This patch also renames btf_id() function to btf_obj_id() to minimize naming clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID. Also, altough btf_vmlinux can't get destructed and thus doesn't need refcounting, module BTFs need that, so apply BTF refcounting universally when BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This makes for simpler clean up code. Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF trampoline key to include BTF object ID. To differentiate that from target program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are not allowed to take full 32 bits, so there is no danger of confusing that bit with a valid BTF type ID. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
2020-12-03bpf: Adds support for setting window clampPrankur gupta
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP, which sets the maximum receiver window size. It will be useful for limiting receiver window based on RTT. Signed-off-by: Prankur gupta <prankgup@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
2020-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/ibm/ibmvnic.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03mptcp: emit tcp reset when a join request failsFlorian Westphal
RFC 8684 says: If the token is unknown or the host wants to refuse subflow establishment (for example, due to a limit on the number of subflows it will permit), the receiver will send back a reset (RST) signal, analogous to an unknown port in TCP, containing an MP_TCPRST option (Section 3.6) with an "MPTCP specific error" reason code. mptcp-next doesn't support MP_TCPRST yet, this can be added in another change. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03tcp: merge 'init_req' and 'route_req' functionsFlorian Westphal
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send a TCP reset if the token in a MP_JOIN request is unknown. At this time we don't do this, the 3whs completes and the 'new subflow' is reset afterwards. There are two ways to allow MPTCP to send the reset. 1. override 'send_synack' callback and emit the rst from there. The drawback is that the request socket gets inserted into the listeners queue just to get removed again right away. 2. Send the reset from the 'route_req' function instead. This avoids the 'add&remove request socket', but route_req lacks the skb that is required to send the TCP reset. Instead of just adding the skb to that function for MPTCP sake alone, Paolo suggested to merge init_req and route_req functions. This saves one indirection from syn processing path and provides the skb to the merged function at the same time. 'send reset on unknown mptcp join token' is added in next patch. Suggested-by: Paolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net/sched: act_mpls: ensure LSE is pullable before reading itDavide Caratti
when 'act_mpls' is used to mangle the LSE, the current value is read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is contained in the skb "linear" area. Found by code inspection. v2: - use MPLS_HLEN instead of sizeof(new_lse), thanks to Jakub Kicinski Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/3243506cba43d14858f3bd21ee0994160e44d64a.1606987058.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net: openvswitch: ensure LSE is pullable before reading itDavide Caratti
when openvswitch is configured to mangle the LSE, the current value is read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is contained in the skb "linear" area. Found by code inspection. Fixes: d27cf5c59a12 ("net: core: add MPLS update core helper and use in OvS") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/aa099f245d93218b84b5c056b67b6058ccf81a66.1606987185.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03net: skbuff: ensure LSE is pullable before decrementing the MPLS ttlDavide Caratti
skb_mpls_dec_ttl() reads the LSE without ensuring that it is contained in the skb "linear" area. Fix this calling pskb_may_pull() before reading the current ttl. Found by code inspection. Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/53659f28be8bc336c113b5254dc637cc76bbae91.1606987074.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02bpf: Eliminate rlimit-based memory accounting for xskmap mapsRoman Gushchin
Do not use rlimit-based memory accounting for xskmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com
2020-12-02bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash mapsRoman Gushchin
Do not use rlimit-based memory accounting for sockmap and sockhash maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-29-guro@fb.com
2020-12-02bpf: Refine memcg-based memory accounting for xskmap mapsRoman Gushchin
Extend xskmap memory accounting to include the memory taken by the xsk_map_node structure. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-18-guro@fb.com
2020-12-02bpf: Refine memcg-based memory accounting for sockmap and sockhash mapsRoman Gushchin
Include internal metadata into the memcg-based memory accounting. Also include the memory allocated on updating an element. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-17-guro@fb.com
2020-12-02net/x25: prevent a couple of overflowsDan Carpenter
The .x25_addr[] address comes from the user and is not necessarily NUL terminated. This leads to a couple problems. The first problem is that the strlen() in x25_bind() can read beyond the end of the buffer. The second problem is more subtle and could result in memory corruption. The call tree is: x25_connect() --> x25_write_internal() --> x25_addr_aton() The .x25_addr[] buffers are copied to the "addresses" buffer from x25_write_internal() so it will lead to stack corruption. Verify that the strings are NUL terminated and return -EINVAL if they are not. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Fixes: a9288525d2ae ("X25: Dont let x25_bind use addresses containing characters") Reported-by: "kiyin(尹亮)" <kiyin@tencent.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Martin Schiller <ms@dev.tdt.de> Link: https://lore.kernel.org/r/X8ZeAKm8FnFpN//B@mwanda Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03xsk: Change the tx writeable conditionXuan Zhuo
Modify the tx writeable condition from the queue is not full to the number of present tx queues is less than the half of the total number of queues. Because the tx queue not full is a very short time, this will cause a large number of EPOLLOUT events, and cause a large number of process wake up. Fixes: 35fcde7f8deb ("xsk: support for Tx") Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/508fef55188d4e1160747ead64c6dcda36735880.1606555939.git.xuanzhuo@linux.alibaba.com
2020-12-03xsk: Replace datagram_poll by sock_poll_waitXuan Zhuo
datagram_poll will judge the current socket status (EPOLLIN, EPOLLOUT) based on the traditional socket information (eg: sk_wmem_alloc), but this does not apply to xsk. So this patch uses sock_poll_wait instead of datagram_poll, and the mask is calculated by xsk_poll. Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support") Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/e82f4697438cd63edbf271ebe1918db8261b7c09.1606555939.git.xuanzhuo@linux.alibaba.com
2020-12-02bpf: Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooksStanislav Fomichev
I have to now lock/unlock socket for the bind hook execution. That shouldn't cause any overhead because the socket is unbound and shouldn't receive any traffic. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com
2020-12-02mptcp: avoid potential infinite loop in mptcp_recvmsg()Eric Dumazet
If a packet is ready in receive queue, and application isssues a recvmsg()/recvfrom()/recvmmsg() request asking for zero bytes, we hang in mptcp_recvmsg(). Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling") Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20201202171657.1185108-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02net: ip6_gre: set dev->hard_header_len when using header_opsAntoine Tenart
syzkaller managed to crash the kernel using an NBMA ip6gre interface. I could reproduce it creating an NBMA ip6gre interface and forwarding traffic to it: skbuff: skb_under_panic: text:ffffffff8250e927 len:148 put:44 head:ffff8c03c7a33 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:109! Call Trace: skb_push+0x10/0x10 ip6gre_header+0x47/0x1b0 neigh_connected_output+0xae/0xf0 ip6gre tunnel provides its own header_ops->create, and sets it conditionally when initializing the tunnel in NBMA mode. When header_ops->create is used, dev->hard_header_len should reflect the length of the header created. Otherwise, when not used, dev->needed_headroom should be used. Fixes: eb95f52fc72d ("net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap") Cc: Maria Pasechnik <mariap@mellanox.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://lore.kernel.org/r/20201130161911.464106-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02net: sunrpc: Fix 'snprintf' return value check in 'do_xprt_debugfs'Fedor Tokarev
'snprintf' returns the number of characters which would have been written if enough space had been available, excluding the terminating null byte. Thus, the return value of 'sizeof(buf)' means that the last character has been dropped. Signed-off-by: Fedor Tokarev <ftokarev@gmail.com> Fixes: 2f34b8bfae19 ("SUNRPC: add links for all client xprts to debugfs") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Fix open coded xdr_stream_remaining()Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Fix up xdr_set_page()Trond Myklebust
While we always want to align to the next page and/or the beginning of the tail buffer when we call xdr_set_next_page(), the functions xdr_align_data() and xdr_expand_hole() really want to align to the next object in that next page or tail. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()Trond Myklebust
rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to contain the length of the data that we expect to want placed in the head kvec plus a count of 1 word of padding that is placed after the page data. This is very confusing when trying to read the code, and sometimes leads to callers adding an arbitrary value of '1' just in order to satisfy the requirement (whether or not the page data actually needs such padding). This patch aims to clarify the code by changing the 'hdrsize' argument to remove that 1 word of padding. This means we need to subtract the padding from all the existing callers. Fixes: 02ef04e432ba ("NFS: Account for XDR pad of buf->pages") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Fix up xdr_read_pages() to take arbitrary object lengthsTrond Myklebust
Fix up xdr_read_pages() so that it can handle object lengths that are larger than the page length, by simply aligning to the next object in the buffer tail. The function will continue to return the length of the truncate object data that actually fit into the pages. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Clean up helpers xdr_set_iov() and xdr_set_page_base()Trond Myklebust
Allow xdr_set_iov() to set a base so that we can use it to set the cursor to a specific position in the kvec buffer. If the new base overflows the kvec/pages buffer in either xdr_set_iov() or xdr_set_page_base(), then truncate it so that we point to the end of the buffer. Finally, change both function to return the number of bytes remaining to read in their buffers. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Fix up typo in xdr_init_decode()Trond Myklebust
We already know that the head buffer and page are empty, so if there is any data, it is in the tail. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Fix up open coded kmemdup_nul()Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Remove unused function xprt_load_transport()Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Add a helper to return the transport identifier given a netidTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: Close a race with transport setup and module putTrond Myklebust
After we've looked up the transport module, we need to ensure it can't go away until we've finished running the transport setup code. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: xprt_load_transport() needs to support the netid "rdma6"Trond Myklebust
According to RFC5666, the correct netid for an IPv6 addressed RDMA transport is "rdma6", which we've supported as a mount option since Linux-4.7. The problem is when we try to load the module "xprtrdma6", that will fail, since there is no modulealias of that name. Fixes: 181342c5ebe8 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02SUNRPC: rpc_wake_up() should wake up tasks in the correct orderTrond Myklebust
Currently, we wake up the tasks by priority queue ordering, which means that we ignore the batching that is supposed to help with QoS issues. Fixes: c049f8ea9a0d ("SUNRPC: Remove the bh-safe lock requirement on the rpc_wait_queue->lock") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>