summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-06-25bpfilter: Specify the log level for the kmsg messageGary Lin
Per the kmsg document [0], if we don't specify the log level with a prefix "<N>" in the message string, the default log level will be applied to the message. Since the default level could be warning(4), this would make the log utility such as journalctl treat the message, "Started bpfilter", as a warning. To avoid confusion, this commit adds the prefix "<5>" to make the message always a notice. [0] https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg Fixes: 36c4357c63f3 ("net: bpfilter: print umh messages to /dev/kmsg") Reported-by: Martin Loviska <mloviska@suse.com> Signed-off-by: Gary Lin <glin@suse.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Dmitrii Banshchikov <me@ubique.spb.ru> Link: https://lore.kernel.org/bpf/20210623040918.8683-1-glin@suse.com
2021-06-24ipv6: delete useless dst check in ip6_dst_lookup_tailzhang kai
parameter dst always points to null. Signed-off-by: zhang kai <zhangkaiheb@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24sctp: send the next probe immediately once the last one is ackedXin Long
These is no need to wait for 'interval' period for the next probe if the last probe is already acked in search state. The 'interval' period waiting should be only for probe failure timeout and the current pmtu check when it's in search complete state. This change will shorten the probe time a lot in search state, and also fix the document accordingly. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24sctp: do black hole detection in search complete stateXin Long
Currently the PLPMUTD probe will stop for a long period (interval * 30) after it enters search complete state. If there's a pmtu change on the route path, it takes a long time to be aware if the ICMP TooBig packet is lost or filtered. As it says in rfc8899#section-4.3: "A DPLPMTUD method MUST NOT rely solely on this method." (ICMP PTB message). This patch is to enable the other method for search complete state: "A PL can use the DPLPMTUD probing mechanism to periodically generate probe packets of the size of the current PLPMTU." With this patch, the probe will continue with the current pmtu every 'interval' until the PMTU_RAISE_TIMER 'timeout', which we implement by adding raise_count to raise the probe size when it counts to 30 and removing the SCTP_PL_COMPLETE check for PMTU_RAISE_TIMER. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24ipv6: fix out-of-bound access in ip6_parse_tlv()Eric Dumazet
First problem is that optlen is fetched without checking there is more than one byte to parse. Fix this by taking care of IPV6_TLV_PAD1 before fetching optlen (under appropriate sanity checks against len) Second problem is that IPV6_TLV_PADN checks of zero padding are performed before the check of remaining length. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Fixes: c1412fce7ecc ("net/ipv6/exthdrs.c: Strict PadN option checking") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24libceph: set global_id as soon as we get an auth ticketIlya Dryomov
Commit 61ca49a9105f ("libceph: don't set global_id until we get an auth ticket") delayed the setting of global_id too much. It is set only after all tickets are received, but in pre-nautilus clusters an auth ticket and the service tickets are obtained in separate steps (for a total of three MAuth replies). When the service tickets are requested, global_id is used to build an authorizer; if global_id is still 0 we never get them and fail to establish the session. Moving the setting of global_id into protocol implementations. This way global_id can be set exactly when an auth ticket is received, not sooner nor later. Fixes: 61ca49a9105f ("libceph: don't set global_id until we get an auth ticket") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2021-06-24libceph: don't pass result into ac->ops->handle_reply()Ilya Dryomov
There is no result to pass in msgr2 case because authentication failures are reported through auth_bad_method frame and in MAuth case an error is returned immediately. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2021-06-24net: ip: avoid OOM kills with large UDP sends over loopbackJakub Kicinski
Dave observed number of machines hitting OOM on the UDP send path. The workload seems to be sending large UDP packets over loopback. Since loopback has MTU of 64k kernel will try to allocate an skb with up to 64k of head space. This has a good chance of failing under memory pressure. What's worse if the message length is <32k the allocation may trigger an OOM killer. This is entirely avoidable, we can use an skb with page frags. af_unix solves a similar problem by limiting the head length to SKB_MAX_ALLOC. This seems like a good and simple approach. It means that UDP messages > 16kB will now use fragments if underlying device supports SG, if extra allocator pressure causes regressions in real workloads we can switch to trying the large allocation first and falling back. v4: pre-calculate all the additions to alloclen so we can be sure it won't go over order-2 Reported-by: Dave Jones <dsj@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24net: retrieve netns cookie via getsocketoptMartynas Pumputis
It's getting more common to run nested container environments for testing cloud software. One of such examples is Kind [1] which runs a Kubernetes cluster in Docker containers on a single host. Each container acts as a Kubernetes node, and thus can run any Pod (aka container) inside the former. This approach simplifies testing a lot, as it eliminates complicated VM setups. Unfortunately, such a setup breaks some functionality when cgroupv2 BPF programs are used for load-balancing. The load-balancer BPF program needs to detect whether a request originates from the host netns or a container netns in order to allow some access, e.g. to a service via a loopback IP address. Typically, the programs detect this by comparing netns cookies with the one of the init ns via a call to bpf_get_netns_cookie(NULL). However, in nested environments the latter cannot be used given the Kubernetes node's netns is outside the init ns. To fix this, we need to pass the Kubernetes node netns cookie to the program in a different way: by extending getsockopt() with a SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes node netns can retrieve the cookie and pass it to the program instead. Thus, this is following up on Eric's commit 3d368ab87cf6 ("net: initialize net->net_cookie at netns setup") to allow retrieval via SO_NETNS_COOKIE. This is also in line in how we retrieve socket cookie via SO_COOKIE. [1] https://kind.sigs.k8s.io/ Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Martynas Pumputis <m@lambda.lt> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24bpf, sched: Remove unneeded rcu_read_lock() around BPF program invocationToke Høiland-Jørgensen
The rcu_read_lock() call in cls_bpf and act_bpf are redundant: on the TX side, there's already a call to rcu_read_lock_bh() in __dev_queue_xmit(), and on RX there's a covering rcu_read_lock() in netif_receive_skb{,_list}_internal(). With the previous patches we also amended the lockdep checks in the map code to not require any particular RCU flavour, so we can just get rid of the rcu_read_lock()s. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210624160609.292325-7-toke@redhat.com
2021-06-24xdp: Add proper __rcu annotations to redirect map entriesToke Høiland-Jørgensen
XDP_REDIRECT works by a three-step process: the bpf_redirect() and bpf_redirect_map() helpers will lookup the target of the redirect and store it (along with some other metadata) in a per-CPU struct bpf_redirect_info. Next, when the program returns the XDP_REDIRECT return code, the driver will call xdp_do_redirect() which will use the information thus stored to actually enqueue the frame into a bulk queue structure (that differs slightly by map type, but shares the same principle). Finally, before exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will flush all the different bulk queues, thus completing the redirect. Pointers to the map entries will be kept around for this whole sequence of steps, protected by RCU. However, there is no top-level rcu_read_lock() in the core code; instead drivers add their own rcu_read_lock() around the XDP portions of the code, but somewhat inconsistently as Martin discovered[0]. However, things still work because everything happens inside a single NAPI poll sequence, which means it's between a pair of calls to local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could document this intention by using rcu_dereference_check() with rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and lockdep to verify that everything is done correctly. This patch does just that: we add an __rcu annotation to the map entry pointers and remove the various comments explaining the NAPI poll assurance strewn through devmap.c in favour of a longer explanation in filter.c. The goal is to have one coherent documentation of the entire flow, and rely on the RCU annotations as a "standard" way of communicating the flow in the map code (which can additionally be understood by sparse and lockdep). The RCU annotation replacements result in a fairly straight-forward replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE() becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the proper constructs to cast the pointer back and forth between __rcu and __kernel address space (for the benefit of sparse). The one complication is that xskmap has a few constructions where double-pointers are passed back and forth; these simply all gain __rcu annotations, and only the final reference/dereference to the inner-most pointer gets changed. With this, everything can be run through sparse without eliciting complaints, and lockdep can verify correctness even without the use of rcu_read_lock() in the drivers. Subsequent patches will clean these up from the drivers. [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/ [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/ Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
2021-06-24bpf: Support all gso types in bpf_skb_change_proto()Maciej Żenczykowski
Since we no longer modify gso_size, it is now theoretically safe to not set SKB_GSO_DODGY and reset gso_segs to zero. This also means the skb_is_gso_tcp() check should no longer be necessary. Unfortunately we cannot remove the skb_{decrease,increase}_gso_size() helpers, as they are still used elsewhere: bpf_skb_net_grow() without BPF_F_ADJ_ROOM_FIXED_GSO bpf_skb_net_shrink() without BPF_F_ADJ_ROOM_FIXED_GSO net/core/lwt_bpf.c's handle_gso_type() Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Dongseok Yi <dseok.yi@samsung.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/bpf/20210617000953.2787453-3-zenczykowski@gmail.com
2021-06-24bpf: Do not change gso_size during bpf_skb_change_proto()Maciej Żenczykowski
This is technically a backwards incompatible change in behaviour, but I'm going to argue that it is very unlikely to break things, and likely to fix *far* more then it breaks. In no particular order, various reasons follow: (a) I've long had a bug assigned to myself to debug a super rare kernel crash on Android Pixel phones which can (per stacktrace) be traced back to BPF clat IPv6 to IPv4 protocol conversion causing some sort of ugly failure much later on during transmit deep in the GSO engine, AFAICT precisely because of this change to gso_size, though I've never been able to manually reproduce it. I believe it may be related to the particular network offload support of attached USB ethernet dongle being used for tethering off of an IPv6-only cellular connection. The reason might be we end up with more segments than max permitted, or with a GSO packet with only one segment... (either way we break some assumption and hit a BUG_ON) (b) There is no check that the gso_size is > 20 when reducing it by 20, so we might end up with a negative (or underflowing) gso_size or a gso_size of 0. This can't possibly be good. Indeed this is probably somehow exploitable (or at least can result in a kernel crash) by delivering crafted packets and perhaps triggering an infinite loop or a divide by zero... As a reminder: gso_size (MSS) is related to MTU, but not directly derived from it: gso_size/MSS may be significantly smaller then one would get by deriving from local MTU. And on some NICs (which do loose MTU checking on receive, it may even potentially be larger, for example my work pc with 1500 MTU can receive 1520 byte frames [and sometimes does due to bugs in a vendor plat46 implementation]). Indeed even just going from 21 to 1 is potentially problematic because it increases the number of segments by a factor of 21 (think DoS, or some other crash due to too many segments). (c) It's always safe to not increase the gso_size, because it doesn't result in the max packet size increasing. So the skb_increase_gso_size() call was always unnecessary for correctness (and outright undesirable, see later). As such the only part which is potentially dangerous (ie. could cause backwards compatibility issues) is the removal of the skb_decrease_gso_size() call. (d) If the packets are ultimately destined to the local device, then there is absolutely no benefit to playing around with gso_size. It only matters if the packets will egress the device. ie. we're either forwarding, or transmitting from the device. (e) This logic only triggers for packets which are GSO. It does not trigger for skbs which are not GSO. It will not convert a non-GSO MTU sized packet into a GSO packet (and you don't even know what the MTU is, so you can't even fix it). As such your transmit path must *already* be able to handle an MTU 20 bytes larger then your receive path (for IPv4 to IPv6 translation) - and indeed 28 bytes larger due to IPv4 fragments. Thus removing the skb_decrease_gso_size() call doesn't actually increase the size of the packets your transmit side must be able to handle. ie. to handle non-GSO max-MTU packets, the IPv4/IPv6 device/ route MTUs must already be set correctly. Since for example with an IPv4 egress MTU of 1500, IPv4 to IPv6 translation will already build 1520 byte IPv6 frames, so you need a 1520 byte device MTU. This means if your IPv6 device's egress MTU is 1280, your IPv4 route must be 1260 (and actually 1252, because of the need to handle fragments). This is to handle normal non-GSO packets. Thus the reduction is simply not needed for GSO packets, because when they're correctly built, they will already be the right size. (f) TSO/GSO should be able to exactly undo GRO: the number of packets (TCP segments) should not be modified, so that TCP's MSS counting works correctly (this matters for congestion control). If protocol conversion changes the gso_size, then the number of TCP segments may increase or decrease. Packet loss after protocol conversion can result in partial loss of MSS segments that the sender sent. How's the sending TCP stack going to react to receiving ACKs/SACKs in the middle of the segments it sent? (g) skb_{decrease,increase}_gso_size() are already no-ops for GSO_BY_FRAGS case (besides triggering WARN_ON_ONCE). This means you already cannot guarantee that gso_size (and thus resulting packet MTU) is changed. ie. you must assume it won't be changed. (h) changing gso_size is outright buggy for UDP GSO packets, where framing matters (I believe that's also the case for SCTP, but it's already excluded by [g]). So the only remaining case is TCP, which also doesn't want it (see [f]). (i) see also the reasoning on the previous attempt at fixing this (commit fa7b83bf3b156c767f3e4a25bbf3817b08f3ff8e) which shows that the current behaviour causes TCP packet loss: In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the coalesced packet payload can be > MSS, but < MSS + 20. bpf_skb_proto_6_to_4() will upgrade the MSS and it can be > the payload length. After then tcp_gso_segment checks for the payload length if it is <= MSS. The condition is causing the packet to be dropped. tcp_gso_segment(): [...] mss = skb_shinfo(skb)->gso_size; if (unlikely(skb->len <= mss)) goto out; [...] Thus changing the gso_size is simply a very bad idea. Increasing is unnecessary and buggy, and decreasing can go negative. Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper") Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Dongseok Yi <dseok.yi@samsung.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/bpf/CANP3RGfjLikQ6dg=YpBU0OeHvyv7JOki7CyOUS9modaXAi-9vQ@mail.gmail.com Link: https://lore.kernel.org/bpf/20210617000953.2787453-2-zenczykowski@gmail.com
2021-06-24Revert "bpf: Check for BPF_F_ADJ_ROOM_FIXED_GSO when bpf_skb_change_proto"Maciej Żenczykowski
This reverts commit fa7b83bf3b156c767f3e4a25bbf3817b08f3ff8e. See the followup commit for the reasoning why I believe the appropriate approach is to simply make this change without a flag, but it can basically be summarized as using this helper without the flag is bug-prone or outright buggy, and thus the default should be this new behaviour. As this commit has only made it into net-next/master, but not into any real release, such a backwards incompatible change is still ok. Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Dongseok Yi <dseok.yi@samsung.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/bpf/20210617000953.2787453-1-zenczykowski@gmail.com
2021-06-24can: j1939: j1939_sk_setsockopt(): prevent allocation of j1939 filter for ↵Norbert Slusarek
optlen == 0 If optval != NULL and optlen == 0 are specified for SO_J1939_FILTER in j1939_sk_setsockopt(), memdup_sockptr() will return ZERO_PTR for 0 size allocation. The new filter will be mistakenly assigned ZERO_PTR. This patch checks for optlen != 0 and filter will be assigned NULL in case of optlen == 0. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/r/20210620123842.117975-1-nslusarek@gmx.net Signed-off-by: Norbert Slusarek <nslusarek@gmx.net> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-06-23devlink: Protect rate list with lock while switching modesDmytro Linkin
Devlink eswitch set command doesn't hold devlink->lock, which makes possible race condition between rate list traversing and others devlink rate KAPI calls, like devlink_rate_nodes_destroy(). Hold devlink lock while traversing the list. Fixes: a8ecb93ef03d ("devlink: Introduce rate nodes") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23devlink: Remove eswitch mode check for mode set callDmytro Linkin
When eswitch is disabled, querying its current mode results in error. Due to this when trying to set the eswitch mode for mlx5 devices, it fails to set the eswitch switchdev mode. Hence remove such check. Fixes: a8ecb93ef03d ("devlink: Introduce rate nodes") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23devlink: Decrease refcnt of parent rate object on leaf destroyDmytro Linkin
Port functions, like SFs, can be deleted by the user when its leaf rate object has parent node. In such case node refcnt won't be decreased which blocks the node from deletion later. Do simple refcnt decrease, since driver in cleanup stage. This: 1) assumes that driver took proper internal parent unset action; 2) allows to avoid nested callbacks call and deadlock. Fixes: d75559845078 ("devlink: Allow setting parent node of rate objects") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-06-23 The following pull-request contains BPF updates for your *net* tree. We've added 14 non-merge commits during the last 6 day(s) which contain a total of 13 files changed, 137 insertions(+), 64 deletions(-). Note that when you merge net into net-next, there is a small merge conflict between 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") from bpf with c49661aa6f70 ("skmsg: Remove unused parameters of sk_msg_wait_data()") from net-next. Resolution is to: i) net/ipv4/udp_bpf.c: take udp_msg_wait_data() and remove err parameter from the function, ii) net/ipv4/tcp_bpf.c: take tcp_msg_wait_data() and remove err parameter from the function, iii) for net/core/skmsg.c and include/linux/skmsg.h: remove the sk_msg_wait_data() implementation and its prototype in header. The main changes are: 1) Fix BPF poke descriptor adjustments after insn rewrite, from John Fastabend. 2) Fix regression when using BPF_OBJ_GET with non-O_RDWR flags, from Maciej Żenczykowski. 3) Various bug and error handling fixes for UDP-related sock_map, from Cong Wang. 4) Fix patching of vmlinux BTF IDs with correct endianness, from Tony Ambardar. 5) Two fixes for TX descriptor validation in AF_XDP, from Magnus Karlsson. 6) Fix overflow in size calculation for bpf_map_area_alloc(), from Bui Quang Minh. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23ipv6: exthdrs: do not blindly use init_netEric Dumazet
I see no reason why max_dst_opts_cnt and max_hbh_opts_cnt are fetched from the initial net namespace. The other sysctls (max_dst_opts_len & max_hbh_opts_len) are in fact already using the current ns. Note: it is not clear why ipv6_destopt_rcv() use two ways to get to the netns : 1) dev_net(dst->dev) Originally used to increment IPSTATS_MIB_INHDRERRORS 2) dev_net(skb->dev) Tom used this variant in his patch. Maybe this calls to use ipv6_skb_net() instead ? Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@quantonium.net> Cc: Coco Li <lixiaoyan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23tcp: Add stats for socket migration.Kuniyuki Iwashima
This commit adds two stats for the socket migration feature to evaluate the effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE). If the migration fails because of the own_req race in receiving ACK and sending SYN+ACK paths, we do not increment the failure stat. Then another CPU is responsible for the req. Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/ Suggested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23Merge tag 'mlx5-net-next-2021-06-22' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-net-next-2021-06-22 1) Various minor cleanups and fixes from net-next branch 2) Optimize mlx5 feature check on tx and a fix to allow Vxlan with Ipsec offloads ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2021-06-23 1) Don't return a mtu smaller than 1280 on IPv6 pmtu discovery. From Sabrina Dubroca 2) Fix seqcount rcu-read side in xfrm_policy_lookup_bytype for the PREEMPT_RT case. From Varad Gautam. 3) Remove a repeated declaration of xfrm_parse_spi. From Shaokun Zhang. 4) IPv4 beet mode can't handle fragments, but IPv6 does. commit 68dc022d04eb ("xfrm: BEET mode doesn't support fragments for inner packets") handled IPv4 and IPv6 the same way. Relax the check for IPv6 because fragments are possible here. From Xin Long. 5) Memory allocation failures are not reported for XFRMA_ENCAP and XFRMA_COADDR in xfrm_state_construct. Fix this by moving both cases in front of the function. 6) Fix a missing initialization in the xfrm offload fallback fail case for bonding devices. From Ayush Sawal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr, from Phil Sutter. 2) Simplify TCP option sanity check for TCP packets, also from Phil. 3) Add a new expression to store when the rule has been used last time. 4) Pass the hook state object to log function, from Florian Westphal. 5) Document the new sysctl knobs to tune the flowtable timeouts, from Oz Shlomo. 6) Fix snprintf error check in the new nfnetlink_hook infrastructure, from Dan Carpenter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23net: sched: remove qdisc->empty for lockless qdiscYunsheng Lin
As MISSED and DRAINING state are used to indicate a non-empty qdisc, qdisc->empty is not longer needed, so remove it. Acked-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23net: sched: implement TCQ_F_CAN_BYPASS for lockless qdiscYunsheng Lin
Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK flag set, but queue discipline by-pass does not work for lockless qdisc because skb is always enqueued to qdisc even when the qdisc is empty, see __dev_xmit_skb(). This patch calls sch_direct_xmit() to transmit the skb directly to the driver for empty lockless qdisc, which aviod enqueuing and dequeuing operation. As qdisc->empty is not reliable to indicate a empty qdisc because there is a time window between enqueuing and setting qdisc->empty. So we use the MISSED state added in commit a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc"), which indicate there is lock contention, suggesting that it is better not to do the qdisc bypass in order to avoid packet out of order problem. In order to make MISSED state reliable to indicate a empty qdisc, we need to ensure that testing and clearing of MISSED state is within the protection of qdisc->seqlock, only setting MISSED state can be done without the protection of qdisc->seqlock. A MISSED state testing is added without the protection of qdisc->seqlock to aviod doing unnecessary spin_trylock() for contention case. As the enqueuing is not within the protection of qdisc->seqlock, there is still a potential data race as mentioned by Jakub [1]: thread1 thread2 thread3 qdisc_run_begin() # true qdisc_run_begin(q) set(MISSED) pfifo_fast_dequeue clear(MISSED) # recheck the queue qdisc_run_end() enqueue skb1 qdisc empty # true qdisc_run_begin() # true sch_direct_xmit() # skb2 qdisc_run_begin() set(MISSED) When above happens, skb1 enqueued by thread2 is transmited after skb2 is transmited by thread3 because MISSED state setting and enqueuing is not under the qdisc->seqlock. If qdisc bypass is disabled, skb1 has better chance to be transmited quicker than skb2. This patch does not take care of the above data race, because we view this as similar as below: Even at the same time CPU1 and CPU2 write the skb to two socket which both heading to the same qdisc, there is no guarantee that which skb will hit the qdisc first, because there is a lot of factor like interrupt/softirq/cache miss/scheduling afffecting that. There are below cases that need special handling: 1. When MISSED state is cleared before another round of dequeuing in pfifo_fast_dequeue(), and __qdisc_run() might not be able to dequeue all skb in one round and call __netif_schedule(), which might result in a non-empty qdisc without MISSED set. In order to avoid this, the MISSED state is set for lockless qdisc and __netif_schedule() will be called at the end of qdisc_run_end. 2. The MISSED state also need to be set for lockless qdisc instead of calling __netif_schedule() directly when requeuing a skb for a similar reason. 3. For netdev queue stopped case, the MISSED case need clearing while the netdev queue is stopped, otherwise there may be unnecessary __netif_schedule() calling. So a new DRAINING state is added to indicate this case, which also indicate a non-empty qdisc. 4. As there is already netif_xmit_frozen_or_stopped() checking in dequeue_skb() and sch_direct_xmit(), which are both within the protection of qdisc->seqlock, but the same checking in __dev_xmit_skb() is without the protection, which might cause empty indication of a lockless qdisc to be not reliable. So remove the checking in __dev_xmit_skb(), and the checking in the protection of qdisc->seqlock seems enough to avoid the cpu consumption problem for netdev queue stopped case. 1. https://lkml.org/lkml/2021/5/29/215 Acked-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23mac80211: Switch to a virtual time-based airtime schedulerToke Høiland-Jørgensen
This switches the airtime scheduler in mac80211 to use a virtual time-based scheduler instead of the round-robin scheduler used before. This has a couple of advantages: - No need to sync up the round-robin scheduler in firmware/hardware with the round-robin airtime scheduler. - If several stations are eligible for transmission we can schedule both of them; no need to hard-block the scheduling rotation until the head of the queue has used up its quantum. - The check of whether a station is eligible for transmission becomes simpler (in ieee80211_txq_may_transmit()). The drawback is that scheduling becomes slightly more expensive, as we need to maintain an rbtree of TXQs sorted by virtual time. This means that ieee80211_register_airtime() becomes O(logN) in the number of currently scheduled TXQs because it can change the order of the scheduled stations. We mitigate this overhead by only resorting when a station changes position in the tree, and hopefully N rarely grows too big (it's only TXQs currently backlogged, not all associated stations), so it shouldn't be too big of an issue. To prevent divisions in the fast path, we maintain both station sums and pre-computed reciprocals of the sums. This turns the fast-path operation into a multiplication, with divisions only happening as the number of active stations change (to re-compute the current sum of all active station weights). To prevent this re-computation of the reciprocal from happening too frequently, we use a time-based notion of station activity, instead of updating the weight every time a station gets scheduled or de-scheduled. As queues can oscillate between empty and occupied quite frequently, this can significantly cut down on the number of re-computations. It also has the added benefit of making the station airtime calculation independent on whether the queue happened to have drained at the time an airtime value was accounted. Co-developed-by: Yibo Zhao <yiboz@codeaurora.org> Signed-off-by: Yibo Zhao <yiboz@codeaurora.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20210623134755.235545-1-toke@redhat.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23Revert "mac80211: HE STA disassoc due to QOS NULL not sent"Ping-Ke Shih
This reverts commit f39b07fdfb68 ("mac80211: HE STA disassoc due to QOS NULL not sent") Since iwlwifi specific workaround, which blocks to send NDP, is removed, we can revert this commit. Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20210623134826.10318-2-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: remove iwlwifi specific workaround NDPs of null_responsePing-Ke Shih
Remove the remaining workaround that is not removed by the commit e41eb3e408de ("mac80211: remove iwlwifi specific workaround that broke sta NDP tx") Fixes: 41cbb0f5a295 ("mac80211: add support for HE") Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://lore.kernel.org/r/20210623134826.10318-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: fix NULL ptr dereference during mesh peer connection for non HE ↵Abinaya Kalaiselvan
devices "sband->iftype_data" is not assigned with any value for non HE supported devices, which causes NULL pointer access during mesh peer connection in those devices. Fix this by accessing the pointer after HE capabilities condition check. Cc: stable@vger.kernel.org Fixes: 7f7aa94bcaf0 (mac80211: reduce peer HE MCS/NSS to own capabilities) Signed-off-by: Abinaya Kalaiselvan <akalaise@codeaurora.org> Link: https://lore.kernel.org/r/1624459244-4497-1-git-send-email-akalaise@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: Enable power save after receiving NULL packet ACKBassem Dawood
Trigger dynamic_ps_timer to re-evaluate power saving once a null function packet (with PM = 1) is ACKed, otherwise dynamic PS is not enabled at that point. Signed-off-by: Bassem Dawood <bassem@morsemicro.com> Link: https://lore.kernel.org/r/20210227055815.14838-1-bassem@morsemicro.com [reformatting] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: add HE 6 GHz capability only if supportedJohannes Berg
The HE 6 GHz capability should only be included if there are actually available channels on 6 GHz, and if that's the case we need to get it from the 6 GHz band data, not whatever other band we're on now. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.290bf5c87030.I178aff1c3a6e32456d4ac9238e4a2eb47d209ccd@changeid Link: https://lore.kernel.org/r/iwlwifi.20210618133832.05e935e8dd98.I83ff7eb2ae8ebdf2e30c4fa2461344d9e569f599@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: notify driver on mgd TX completionJohannes Berg
We have mgd_prepare_tx(), but sometimes drivers may want/need to take action when the exchange finishes, whether successfully or not. Add a notification to the driver on completion, i.e. call the new method mgd_complete_tx(). To unify the two scenarios, and to add more information, make both of them take a struct that has the duration (prepare only), subtype (both) and success (complete only). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.5d94e78f6230.I6dc979606b6f28701b740d7aab725f7853a5a155@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: always include HE 6GHz capability in probe requestJohannes Berg
If HE/6GHz is available (thus we consider dot11HE6GOptionImplemented to be true), then always include the corresponding capability in the probe request as required by the spec. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.25ee4a54a7d0.I8cebd799c85524c8123a11941a104dbdefc03762@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: Support hidden AP discovery over 6GHz bandIlan Peer
To discover a hidden AP on the 6GHz band, the probe request sent to the AP needs to include the AP's SSID, as some APs would not respond with a probe response based only on short SSID match. To support hidden AP discovery over the 6GHz band, when constructing the specific 6GHz band scan also include SSIDs that were part of the original scan request, so these can be used in the probe requests transmitted during scan. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.218df9d3203c.Ice0f7a2f6a65f1f9710b7898591481baeefaf490@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: add vendor-specific capabilities to assoc requestJohannes Berg
When sending an association request, add any vendor specific capabilities at the end of the frame. This way, mac80211 is still completely in charge of building the frame, but drivers can determine what should be added depending on the band and interface type. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.80d716d69a5f.I28097ff19be6b22aebdc33a72795d2662755d41f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: allow advertising vendor-specific capabilitiesJohannes Berg
There may be cases where vendor-specific elements need to be used over the air. Rather than have driver or firmware add them and possibly cause problems that way, add them to the iftype-data band capabilities. This way we can advertise to userspace first, and use them in mac80211 next. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.e8c4f0347276.Iee5964682b3e9ec51fc1cd57a7c62383eaf6ddd7@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: set custom regdomain after wiphy registrationMiri Korenblit
We used to set regulatory info before the registration of the device and then the regulatory info didn't get set, because the device isn't registered so there isn't a device to set the regulatory info for. So set the regulatory info after the device registration. Call reg_process_self_managed_hints() once again after the device registration because it does nothing before it. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.c96eadcffe80.I86799c2c866b5610b4cf91115c21d8ceb525c5aa@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: conditionally advertise HE in probe requestsJohannes Berg
While building probe requests, only enable HE capability if there are actually any channels in the band with HE enabled, otherwise we're not really capable. We're doing the same in association requests, so doing it here makes it consistent. This also makes HE not appear available if it isn't due to regulatory constraints. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.b5513f2af335.Ic01862678712ae4238cea43ad2185928865efad2@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: add cfg80211_any_usable_channels()Johannes Berg
This helper function checks if there are any usable channels on any of the given bands with the given properties (as expressed by disallowed channel flags). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.2b613addaa85.Idaf8b859089490537878a7de5c7453a873a3f638@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: reg: improve bad regulatory warningJohannes Berg
There's a WARN_ON here but it says nothing, and the later dump of the regdomain aren't usually printed. As a first step, include the regdomain code in the WARN_ON message, just like in other similar instances. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.853ffdd6c62b.I63e37b2ab184ee3653686e4df4dd23eb303687d2@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23nl80211: Fix typo pmsr->pmsrSosthène Guédon
This was mis-spelled in the policy, fix that. Signed-off-by: Sosthène Guédon <sosthene@guedon.gdn> Link: https://lore.kernel.org/r/YLkT27RG0DaWLUot@arch.localdomain Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: fix some spelling mistakesZheng Yongjun
Fix some spelling mistakes in comments: freeed ==> freed addreses ==> addresses containging ==> containing capablity ==> capability sucess ==> success atleast ==> at least Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Link: https://lore.kernel.org/r/20210607150047.2855962-1-zhengyongjun3@huawei.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: remove use of ieee80211_get_he_sta_cap()Johannes Berg
All uses of ieee80211_get_he_sta_cap() were actually wrong, in net/mac80211/mlme.c they were wrong because that code is also used for P2P (which is a different interface type), in net/mac80211/main.c that should check all interface types. Fix all that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.ede114bc8b46.Ibcd9a5d98430e936344eb6d242ef8a65c2f59b74@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23cfg80211: trace more information in assoc trace eventJohannes Berg
Add more information to the assoc trace event so we can see more precisely what's going on and what options were used. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.86c58fca486d.Iabd8f036d2ef1d770fd20ed3ccd149f32154f430@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: improve AP disconnect messageJohannes Berg
If the AP changes capability/bandwidth in some fashion, the message might be somewhat misleading and we don't know what really changed. Modify the message to speak about "caps/bw" instead of just "bandwidth", and print out the flags. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.dc22c48985fa.I4bf5fbc17ec783c21d4b50c8c35b1de390896ccd@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: rearrange struct txq_info for fewer holesJohannes Berg
We can slightly decrease the size of struct txq_info by rearranging some fields for fewer holes, so do that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.1bf019a1fe2e.Ib54622b8d6dc1a9a7dc484e573c073119450538b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: Properly WARN on HW scan before restartIlan Peer
The following race was possible: 1. The device driver requests HW restart. 2. A scan is requested from user space and is propagated to the driver. During this flow HW_SCANNING flag is set. 3. The thread that handles the HW restart is scheduled, and before starting the actual reconfiguration it checks that HW_SCANNING is not set. The flow does so without acquiring any lock, and thus the WARN fires. Fix this by checking that HW_SCANNING is on only after RTNL is acquired, i.e., user space scan request handling is no longer in transit. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.8238ab3e19ab.I2693c581c70251472b4f9089e37e06fb2c18268f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23nl80211/cfg80211: add BSS color to NDP ranging parametersAvraham Stern
In NDP ranging, the initiator need to set the BSS color in the NDP to the BSS color of the responder. Add the BSS color as a parameter for NDP ranging. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.f097a6144b59.I27dec8b994df52e691925ea61be4dd4fa6d396c0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-23mac80211: add to bss_conf if broadcast TWT is supportedShaul Triebitz
Add to struct ieee80211_bss_conf a twt_broadcast field. Set it to true if both STA and AP support broadcast TWT. Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210618133832.f7c105237541.I50b302044e2b35e5ed4d3fb8bc7bd3d8bb89b1e1@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>