summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-06-18net/udp_gso: Allow TX timestamp with UDP GSOFred Klassen
Fixes an issue where TX Timestamps are not arriving on the error queue when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING. This can be illustrated with an updated updgso_bench_tx program which includes the '-T' option to test for this condition. It also introduces the '-P' option which will call poll() before reading the error queue. ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18 poll timeout udp tx: 0 MB/s 1 calls/s 1 msg/s The "poll timeout" message above indicates that TX timestamp never arrived. This patch preserves tx_flags for the first UDP GSO segment. Only the first segment is timestamped, even though in some cases there may be benefital in timestamping both the first and last segment. Factors in deciding on first segment timestamp only: - Timestamping both first and last segmented is not feasible. Hardware can only have one outstanding TS request at a time. - Timestamping last segment may under report network latency of the previous segments. Even though the doorbell is suppressed, the ring producer counter has been incremented. - Timestamping the first segment has the upside in that it reports timestamps from the application's view, e.g. RTT. - Timestamping the first segment has the downside that it may underreport tx host network latency. It appears that we have to pick one or the other. And possibly follow-up with a config flag to choose behavior. v2: Remove tests as noted by Willem de Bruijn <willemb@google.com> Moving tests from net to net-next v3: Update only relevant tx_flag bits as per Willem de Bruijn <willemb@google.com> v4: Update comments and commit message as per Willem de Bruijn <willemb@google.com> Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Fred Klassen <fklassen@appneta.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18net: netem: fix use after free and double free with packet corruptionJakub Kicinski
Brendan reports that the use of netem's packet corruption capability leads to strange crashes. This seems to be caused by commit d66280b12bd7 ("net: netem: use a list in addition to rbtree") which uses skb->next pointer to construct a fast-path queue of in-order skbs. Packet corruption code has to invoke skb_gso_segment() in case of skbs in need of GSO. skb_gso_segment() returns a list of skbs. If next pointers of the skbs on that list do not get cleared fast path list may point to freed skbs or skbs which are also on the RB tree. Let's say skb gets segmented into 3 frames: A -> B -> C A gets hooked to the t_head t_tail list by tfifo_enqueue(), but it's next pointer didn't get cleared so we have: h t |/ A -> B -> C Now if B and C get also get enqueued successfully all is fine, because tfifo_enqueue() will overwrite the list in order. IOW: Enqueue B: h t | | A -> B C Enqueue C: h t | | A -> B -> C But if B and C get reordered we may end up with: h t RB tree |/ | A -> B -> C B \ C Or if they get dropped just: h t |/ A -> B -> C where A and B are already freed. To reproduce either limit has to be set low to cause freeing of segs or reorders have to happen (due to delay jitter). Note that we only have to mark the first segment as not on the list, "finish_segs" handling of other frags already does that. Another caveat is that qdisc_drop_all() still has to free all segments correctly in case of drop of first segment, therefore we re-link segs before calling it. v2: - re-link before drop, v1 was leaking non-first segs if limit was hit at the first seg - better commit message which lead to discovering the above :) Reported-by: Brendan Galloway <brendan.galloway@netronome.com> Fixes: d66280b12bd7 ("net: netem: use a list in addition to rbtree") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18net: netem: fix backlog accounting for corrupted GSO framesJakub Kicinski
When GSO frame has to be corrupted netem uses skb_gso_segment() to produce the list of frames, and re-enqueues the segments one by one. The backlog length has to be adjusted to account for new frames. The current calculation is incorrect, leading to wrong backlog lengths in the parent qdisc (both bytes and packets), and incorrect packet backlog count in netem itself. Parent backlog goes negative, netem's packet backlog counts all non-first segments twice (thus remaining non-zero even after qdisc is emptied). Move the variables used to count the adjustment into local scope to make 100% sure they aren't used at any stage in backports. Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skbXin Long
udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device to count packets on dev->tstats, a perpcu variable. However, TIPC is using udp tunnel with no tunnel device, and pass the lower dev, like veth device that only initializes dev->lstats(a perpcu variable) when creating it. Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each pointer points to a bigger struct than lstats, so when tstats->tx_bytes is increased, other percpu variable's members could be overwritten. syzbot has reported quite a few crashes due to fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten, call traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... The issue exists since tunnel stats update is moved to iptunnel_xmit by Commit 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()"), and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb so that the packets counting won't happen on dev->tstats. Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULLXin Long
iptunnel_xmit() works as a common function, also used by a udp tunnel which doesn't have to have a tunnel device, like how TIPC works with udp media. In these cases, we should allow not to count pkts on dev's tstats, so that udp tunnel can work with no tunnel device safely. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ipoib: show VF broadcast addressDenis Kirjanov
in IPoIB case we can't see a VF broadcast address for but can see for PF Before: 11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP mode DEFAULT group default qlen 256 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vf 0 MAC 14:80:00:00:66:fe, spoof checking off, link-state disable, trust off, query_rss off ... After: 11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP mode DEFAULT group default qlen 256 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vf 0 link/infiniband 80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off v1->v2: add the IFLA_VF_BROADCAST constant v2->v3: put IFLA_VF_BROADCAST at the end to avoid KABI breakage and set NLA_REJECT dev_setlink Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18net: remove duplicate fetch in sock_getsockoptJingYi Hou
In sock_getsockopt(), 'optlen' is fetched the first time from userspace. 'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is fetched the second time from userspace. If change it between two fetches may cause security problems or unexpected behaivor, and there is no reason to fetch it a second time. To fix this, we need to remove the second fetch. Signed-off-by: JingYi Hou <houjingyi647@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18tipc: fix issues with early FAILOVER_MSG from peerTuong Lien
It appears that a FAILOVER_MSG can come from peer even when the failure link is resetting (i.e. just after the 'node_write_unlock()'...). This means the failover procedure on the node has not been started yet. The situation is as follows: node1 node2 linkb linka linka linkb | | | | | | x failure | | | RESETTING | | | | | | x failure RESET | | RESETTING FAILINGOVER | | | (FAILOVER_MSG) | | |<-------------------------------------------------| | *FAILINGOVER | | | | | (dummy FAILOVER_MSG) | | |------------------------------------------------->| | RESET | | FAILOVER_END | FAILINGOVER RESET | . . . . . . . . . . . . Once this happens, the link failover procedure will be triggered wrongly on the receiving node since the node isn't in FAILINGOVER state but then another link failover will be carried out. The consequences are: 1) A peer might get stuck in FAILINGOVER state because the 'sync_point' was set, reset and set incorrectly, the criteria to end the failover would not be met, it could keep waiting for a message that has already received. 2) The early FAILOVER_MSG(s) could be queued in the link failover deferdq but would be purged or not pulled out because the 'drop_point' was not set correctly. 3) The early FAILOVER_MSG(s) could be dropped too. 4) The dummy FAILOVER_MSG could make the peer leaving FAILINGOVER state shortly, but later on it would be restarted. The same situation can also happen when the link is in PEER_RESET state and a FAILOVER_MSG arrives. The commit resolves the issues by forcing the link down immediately, so the failover procedure will be started normally (which is the same as when receiving a FAILOVER_MSG and the link is in up state). Also, the function "tipc_node_link_failover()" is toughen to avoid such a situation from happening. Acked-by: Jon Maloy <jon.maloy@ericsson.se> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ipv6: Stop sending in-kernel notifications for each nexthopIdo Schimmel
Both listeners - mlxsw and netdevsim - of IPv6 FIB notifications are now ready to handle IPv6 multipath notifications. Therefore, stop ignoring such notifications in both drivers and stop sending notification for each added / deleted nexthop. v2: * Remove 'multipath_rt' from 'struct fib6_entry_notifier_info' Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ipv6: Add IPv6 multipath notification for route deleteIdo Schimmel
If all the nexthops of a multipath route are being deleted, send one notification for the entire route, instead of one per-nexthop. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ipv6: Add IPv6 multipath notifications for add / replaceIdo Schimmel
Emit a notification when a multipath routes is added or replace. Note that unlike the replace notifications sent from fib6_add_rt2node(), it is possible we are sending a 'FIB_EVENT_ENTRY_REPLACE' when a route was merely added and not replaced. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ipv6: Extend notifier info for multipath routesIdo Schimmel
Extend the IPv6 FIB notifier info with number of sibling routes being notified. This will later allow listeners to process one notification for a multipath routes instead of N, where N is the number of nexthops. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Honestly all the conflicts were simple overlapping changes, nothing really interesting to report. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17net: ipv4: remove erroneous advancement of list pointerFlorian Westphal
Causes crash when lifetime expires on an adress as garbage is dereferenced soon after. This used to look like this: for (ifap = &ifa->ifa_dev->ifa_list; *ifap != NULL; ifap = &(*ifap)->ifa_next) { if (*ifap == ifa) ... but this was changed to: struct in_ifaddr *tmp; ifap = &ifa->ifa_dev->ifa_list; tmp = rtnl_dereference(*ifap); while (tmp) { tmp = rtnl_dereference(tmp->ifa_next); // Bogus if (rtnl_dereference(*ifap) == ifa) { ... ifap = &tmp->ifa_next; // Can be NULL tmp = rtnl_dereference(*ifap); // Dereference } } Remove the bogus assigment/list entry skip. Fixes: 2638eb8b50cf ("net: ipv4: provide __rcu annotation for ifa_list") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Lots of bug fixes here: 1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer. 2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John Crispin. 3) Use after free in psock backlog workqueue, from John Fastabend. 4) Fix source port matching in fdb peer flow rule of mlx5, from Raed Salem. 5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet. 6) Network header needs to be set for packet redirect in nfp, from John Hurley. 7) Fix udp zerocopy refcnt, from Willem de Bruijn. 8) Don't assume linear buffers in vxlan and geneve error handlers, from Stefano Brivio. 9) Fix TOS matching in mlxsw, from Jiri Pirko. 10) More SCTP cookie memory leak fixes, from Neil Horman. 11) Fix VLAN filtering in rtl8366, from Linus Walluij. 12) Various TCP SACK payload size and fragmentation memory limit fixes from Eric Dumazet. 13) Use after free in pneigh_get_next(), also from Eric Dumazet. 14) LAPB control block leak fix from Jeremy Sowden" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits) lapb: fixed leak of control-blocks. tipc: purge deferredq list for each grp member in tipc_group_delete ax25: fix inconsistent lock state in ax25_destroy_timer neigh: fix use-after-free read in pneigh_get_next tcp: fix compile error if !CONFIG_SYSCTL hv_sock: Suppress bogus "may be used uninitialized" warnings be2net: Fix number of Rx queues used for flow hashing net: handle 802.1P vlan 0 packets properly tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() tcp: add tcp_min_snd_mss sysctl tcp: tcp_fragment() should apply sane memory limits tcp: limit payload size of sacked skbs Revert "net: phylink: set the autoneg state in phylink_phy_change" bpf: fix nested bpf tracepoints with per-cpu data bpf: Fix out of bounds memory access in bpf_sk_storage vsock/virtio: set SOCK_DONE on peer shutdown net: dsa: rtl8366: Fix up VLAN filtering net: phylink: set the autoneg state in phylink_phy_change net: add high_order_alloc_disable sysctl/static key tcp: add tcp_tx_skb_cache sysctl ...
2019-06-17bpf: fix the check that forwarding is enabled in bpf_ipv6_fib_lookupAnton Protopopov
The bpf_ipv6_fib_lookup function should return BPF_FIB_LKUP_RET_FWD_DISABLED when forwarding is disabled for the input device. However instead of checking if forwarding is enabled on the input device, it checked the global net->ipv6.devconf_all->forwarding flag. Change it to behave as expected. Fixes: 87f5fc7e48dd ("bpf: Provide helper to do forwarding lookups in kernel FIB table") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-17net: sched: cls_matchall: allow to delete filterJiri Pirko
Currently user is unable to delete the filter. See following example: $ tc filter add dev ens16np1 ingress pref 1 handle 1 matchall action drop $ tc filter show dev ens16np1 ingress filter protocol all pref 1 matchall chain 0 filter protocol all pref 1 matchall chain 0 handle 0x1 in_hw action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 $ tc filter del dev ens16np1 ingress pref 1 handle 1 matchall action drop RTNETLINK answers: Operation not supported Implement tcf_proto_ops->delete() op and allow user to delete the filter. Reported-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17net: sched: act_ctinfo: fix policy validationKevin Darbyshire-Bryant
Fix nla_policy definition by specifying an exact length type attribute to CTINFO action paraneter block structure. Without this change, netlink parsing will fail validation and the action will not be instantiated. 8cb081746c03 ("netlink: make validation more configurable for future") introduced much stricter checking to attributes being passed via netlink. Existing actions were updated to use less restrictive deprecated versions of nla_parse_nested. As a new module, act_ctinfo should be designed to use the strict checking model otherwise, well, what was the point of implementing it. Confession time: Until very recently, development of this module has been done on 'net-next' tree to 'clean compile' level with run-time testing on backports to 4.14 & 4.19 kernels under openwrt. This is how I managed to miss the run-time impacts of the new strict nla_parse_nested function. I hopefully have learned something from this (glances toward laptop running a net-next kernel) There is however a still outstanding implication on iproute2 user space in that it needs to be told to pass nested netlink messages with the nested attribute actually set. So even with this kernel fix to do things correctly you still cannot instantiate a new 'strict' nla_parse_nested based action such as act_ctinfo with iproute2's tc. Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17net: sched: act_ctinfo: fix action creationKevin Darbyshire-Bryant
Use correct return value on action creation: ACT_P_CREATED. The use of incorrect return value could result in a situation where the system thought a ctinfo module was listening but actually wasn't instantiated correctly leading to an OOPS in tcf_generic_walker(). Confession time: Until very recently, development of this module has been done on 'net-next' tree to 'clean compile' level with run-time testing on backports to 4.14 & 4.19 kernels under openwrt. During the back & forward porting during development & testing, the critical ACT_P_CREATED return code got missed despite being in the 4.14 & 4.19 backports. I have now gone through the init functions, using act_csum as reference with a fine toothed comb. Bonus, no more OOPSes. I managed to also miss this issue till now due to the new strict nla_parse_nested function failing validation before action creation. As an inexperienced developer I've learned that copy/pasting/backporting/forward porting code correctly is hard. If I ever get to a developer conference I shall don the cone of shame. Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17net: ipv4: move tcp_fastopen server side code to SipHash libraryArd Biesheuvel
Using a bare block cipher in non-crypto code is almost always a bad idea, not only for security reasons (and we've seen some examples of this in the kernel in the past), but also for performance reasons. In the TCP fastopen case, we call into the bare AES block cipher one or two times (depending on whether the connection is IPv4 or IPv6). On most systems, this results in a call chain such as crypto_cipher_encrypt_one(ctx, dst, src) crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...); aesni_encrypt kernel_fpu_begin(); aesni_enc(ctx, dst, src); // asm routine kernel_fpu_end(); It is highly unlikely that the use of special AES instructions has a benefit in this case, especially since we are doing the above twice for IPv6 connections, instead of using a transform which can process the entire input in one go. We could switch to the cbcmac(aes) shash, which would at least get rid of the duplicated overhead in *some* cases (i.e., today, only arm64 has an accelerated implementation of cbcmac(aes), while x86 will end up using the generic cbcmac template wrapping the AES-NI cipher, which basically ends up doing exactly the above). However, in the given context, it makes more sense to use a light-weight MAC algorithm that is more suitable for the purpose at hand, such as SipHash. Since the output size of SipHash already matches our chosen value for TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input sizes, this greatly simplifies the code as well. NOTE: Server farms backing a single server IP for load balancing purposes and sharing a single fastopen key will be adversely affected by this change unless all systems in the pool receive their kernel upgrades at the same time. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17tipc: include retrans failure detection for unicastTuong Lien
In patch series, commit 9195948fbf34 ("tipc: improve TIPC throughput by Gap ACK blocks"), as for simplicity, the repeated retransmit failures' detection in the function - "tipc_link_retrans()" was kept there for broadcast retransmissions only. This commit now reapplies this feature for link unicast retransmissions that has been done via the function - "tipc_link_advance_transmq()". Also, the "tipc_link_retrans()" is renamed to "tipc_link_bc_retrans()" as it is used only for broadcast. Acked-by: Jon Maloy <jon.maloy@ericsson.se> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge branch 'tcp-fixes'David S. Miller
Eric Dumazet says: ==================== tcp: make sack processing more robust Jonathan Looney brought to our attention multiple problems in TCP stack at the sender side. SACK processing can be abused by malicious peers to either cause overflows, or increase of memory usage. First two patches fix the immediate problems. Since the malicious peers abuse senders by advertizing a very small MSS in their SYN or SYNACK packet, the last two patches add a new sysctl so that admins can chose a higher limit for MSS clamping. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXYFernando Fernandez Mancera
Add common functions into nf_synproxy_core.c to prepare for nftables support. The prototypes of the functions used by {ipt, ip6t}_SYNPROXY are in the new file nf_synproxy.h Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17netfilter: synproxy: remove module dependency on IPv6 SYNPROXYFernando Fernandez Mancera
This is a prerequisite for the infrastructure module NETFILTER_SYNPROXY. The new module is needed to avoid duplicated code for the SYNPROXY nftables support. Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17Merge branch 'master' of git://blackhole.kfki.hu/nf-nextPablo Neira Ayuso
Jozsef Kadlecsik says: ==================== ipset patches for nf-next - Remove useless memset() calls, nla_parse_nested/nla_parse erase the tb array properly, from Florent Fourcot. - Merge the uadd and udel functions, the code is nicer this way, also from Florent Fourcot. - Add a missing check for the return value of a nla_parse[_deprecated] call, from Aditya Pakki. - Add the last missing check for the return value of nla_parse[_deprecated] call. - Fix error path and release the references properly in set_target_v3_checkentry(). - Fix memory accounting which is reported to userspace for hash types on resize, from Stefano Brivio. - Update my email address to kadlec@netfilter.org. The patch covers all places in the source tree where my kadlec@blackhole.kfki.hu address could be found. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17netfilter: bridge: namespace bridge netfilter sysctlsChristian Brauner
Currently, the /proc/sys/net/bridge folder is only created in the initial network namespace. This patch ensures that the /proc/sys/net/bridge folder is available in each network namespace if the module is loaded and disappears from all network namespaces when the module is unloaded. In doing so the patch makes the sysctls: bridge-nf-call-arptables bridge-nf-call-ip6tables bridge-nf-call-iptables bridge-nf-filter-pppoe-tagged bridge-nf-filter-vlan-tagged bridge-nf-pass-vlan-input-dev apply per network namespace. This unblocks some use-cases where users would like to e.g. not do bridge filtering for bridges in a specific network namespace while doing so for bridges located in another network namespace. The netfilter rules are afaict already per network namespace so it should be safe for users to specify whether bridge devices inside a network namespace are supposed to go through iptables et al. or not. Also, this can already be done per-bridge by setting an option for each individual bridge via Netlink. It should also be possible to do this for all bridges in a network namespace via sysctls. Cc: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17netfilter: bridge: port sysctls to use brnf_netChristian Brauner
This ports the sysctls to use struct brnf_net. With this patch we make it possible to namespace the br_netfilter module in the following patch. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17netfilter: xt_owner: bail out with EINVAL in case of unsupported flagsPablo Neira Ayuso
Reject flags that are not supported with EINVAL. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17netfilter: conntrack: small conntrack lookup optimizationFlorian Westphal
____nf_conntrack_find() performs checks on the conntrack objects in this order: 1. if (nf_ct_is_expired(ct)) This fetches ct->timeout, in third cache line. The hnnode that is used to store the list pointers resides in the first (origin) or second (reply tuple) cache lines. This test rarely passes, but its necessary to reap obsolete entries. 2. if (nf_ct_is_dying(ct)) This fetches ct->status, also in third cache line. The test is useless, and can be removed: Consider: cpu0 cpu1 ct = ____nf_conntrack_find() atomic_inc_not_zero(ct) -> ok nf_ct_key_equal -> ok is_dying -> DYING bit not set, ok set_bit(ct, DYING); ... unhash ... etc. return ct -> returning a ct with dying bit set, despite having a test for it. This (unlikely) case is fine - refcount prevents ct from getting free'd. 3. if (nf_ct_key_equal(h, tuple, zone, net)) nf_ct_key_equal checks in following order: 1. Tuple equal (first or second cacheline) 2. Zone equal (third cacheline) 3. confirmed bit set (->status, third cacheline) 4. net namespace match (third cacheline). Swapping "timeout" and "cpu" places timeout in the first cacheline. This has two advantages: 1. For a conntrack that won't even match the original tuple, we will now only fetch the first and maybe the second cacheline instead of always accessing the 3rd one as well. 2. in case of TCP ct->timeout changes frequently because we reduce/increase it when there are packets outstanding in the network. The first cacheline contains both the reference count and the ct spinlock, i.e. moving timeout there avoids writes to 3rd cacheline. The restart sequence in __nf_conntrack_find() is removed, if we found a candidate, but then fail to increment the refcount or discover the tuple has changed (object recycling), just pretend we did not find an entry. A second lookup won't find anything until another CPU adds a new conntrack with identical tuple into the hash table, which is very unlikely. We have the confirmation-time checks (when we hold hash lock) that deal with identical entries and even perform clash resolution in some cases. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17netfilter: nft_ct: add ct expectations supportStéphane Veyret
This patch allows to add, list and delete expectations via nft objref infrastructure and assigning these expectations via nft rule. This allows manual port triggering when no helper is defined to manage a specific protocol. For example, if I have an online game which protocol is based on initial connection to TCP port 9753 of the server, and where the server opens a connection to port 9876, I can set rules as follow: table ip filter { ct expectation mygame { protocol udp; dport 9876; timeout 2m; size 1; } chain input { type filter hook input priority 0; policy drop; tcp dport 9753 ct expectation set "mygame"; } chain output { type filter hook output priority 0; policy drop; udp dport 9876 ct status expected accept; } } Signed-off-by: Stéphane Veyret <sveyret@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17xfrm: fix sa selector validationNicolas Dichtel
After commit b38ff4075a80, the following command does not work anymore: $ ip xfrm state add src 10.125.0.2 dst 10.125.0.1 proto esp spi 34 reqid 1 \ mode tunnel enc 'cbc(aes)' 0xb0abdba8b782ad9d364ec81e3a7d82a1 auth-trunc \ 'hmac(sha1)' 0xe26609ebd00acb6a4d51fca13e49ea78a72c73e6 96 flag align4 In fact, the selector is not mandatory, allow the user to provide an empty selector. Fixes: b38ff4075a80 ("xfrm: Fix xfrm sel prefix length validation") CC: Anirudh Gupta <anirudh.gupta@sophos.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-16lapb: fixed leak of control-blocks.Jeremy Sowden
lapb_register calls lapb_create_cb, which initializes the control- block's ref-count to one, and __lapb_insert_cb, which increments it when adding the new block to the list of blocks. lapb_unregister calls __lapb_remove_cb, which decrements the ref-count when removing control-block from the list of blocks, and calls lapb_put itself to decrement the ref-count before returning. However, lapb_unregister also calls __lapb_devtostruct to look up the right control-block for the given net_device, and __lapb_devtostruct also bumps the ref-count, which means that when lapb_unregister returns the ref-count is still 1 and the control-block is leaked. Call lapb_put after __lapb_devtostruct to fix leak. Reported-by: syzbot+afb980676c836b4a0afa@syzkaller.appspotmail.com Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16tipc: purge deferredq list for each grp member in tipc_group_deleteXin Long
Syzbot reported a memleak caused by grp members' deferredq list not purged when the grp is be deleted. The issue occurs when more(msg_grp_bc_seqno(hdr), m->bc_rcv_nxt) in tipc_group_filter_msg() and the skb will stay in deferredq. So fix it by calling __skb_queue_purge for each member's deferredq in tipc_group_delete() when a tipc sk leaves the grp. Fixes: b87a5ea31c93 ("tipc: guarantee group unicast doesn't bypass group broadcast") Reported-by: syzbot+78fbe679c8ca8d264a8d@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16lapb: moved export of lapb_register.Jeremy Sowden
The EXPORT_SYMBOL for lapb_register was next to a different function. Moved it to the right place. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16ax25: fix inconsistent lock state in ax25_destroy_timerEric Dumazet
Before thread in process context uses bh_lock_sock() we must disable bh. sysbot reported : WARNING: inconsistent lock state 5.2.0-rc3+ #32 Not tainted inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. blkid/26581 [HC0[0]:SC1[1]:HE1:SE0] takes: 00000000e0da85ee (slock-AF_AX25){+.?.}, at: spin_lock include/linux/spinlock.h:338 [inline] 00000000e0da85ee (slock-AF_AX25){+.?.}, at: ax25_destroy_timer+0x53/0xc0 net/ax25/af_ax25.c:275 {SOFTIRQ-ON-W} state was registered at: lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4303 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:338 [inline] ax25_rt_autobind+0x3ca/0x720 net/ax25/ax25_route.c:429 ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1221 __sys_connect+0x264/0x330 net/socket.c:1834 __do_sys_connect net/socket.c:1845 [inline] __se_sys_connect net/socket.c:1842 [inline] __x64_sys_connect+0x73/0xb0 net/socket.c:1842 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe irq event stamp: 2272 hardirqs last enabled at (2272): [<ffffffff810065f3>] trace_hardirqs_on_thunk+0x1a/0x1c hardirqs last disabled at (2271): [<ffffffff8100660f>] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (1522): [<ffffffff87400654>] __do_softirq+0x654/0x94c kernel/softirq.c:320 softirqs last disabled at (2267): [<ffffffff81449010>] invoke_softirq kernel/softirq.c:374 [inline] softirqs last disabled at (2267): [<ffffffff81449010>] irq_exit+0x180/0x1d0 kernel/softirq.c:414 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(slock-AF_AX25); <Interrupt> lock(slock-AF_AX25); *** DEADLOCK *** 1 lock held by blkid/26581: #0: 0000000010fd154d ((&ax25->dtimer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:175 [inline] #0: 0000000010fd154d ((&ax25->dtimer)){+.-.}, at: call_timer_fn+0xe0/0x720 kernel/time/timer.c:1312 stack backtrace: CPU: 1 PID: 26581 Comm: blkid Not tainted 5.2.0-rc3+ #32 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_usage_bug.cold+0x393/0x4a2 kernel/locking/lockdep.c:2935 valid_state kernel/locking/lockdep.c:2948 [inline] mark_lock_irq kernel/locking/lockdep.c:3138 [inline] mark_lock+0xd46/0x1370 kernel/locking/lockdep.c:3513 mark_irqflags kernel/locking/lockdep.c:3391 [inline] __lock_acquire+0x159f/0x5490 kernel/locking/lockdep.c:3745 lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4303 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:338 [inline] ax25_destroy_timer+0x53/0xc0 net/ax25/af_ax25.c:275 call_timer_fn+0x193/0x720 kernel/time/timer.c:1322 expire_timers kernel/time/timer.c:1366 [inline] __run_timers kernel/time/timer.c:1685 [inline] __run_timers kernel/time/timer.c:1653 [inline] run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698 __do_softirq+0x25c/0x94c kernel/softirq.c:293 invoke_softirq kernel/softirq.c:374 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:414 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806 </IRQ> RIP: 0033:0x7f858d5c3232 Code: 8b 61 08 48 8b 84 24 d8 00 00 00 4c 89 44 24 28 48 8b ac 24 d0 00 00 00 4c 8b b4 24 e8 00 00 00 48 89 7c 24 68 48 89 4c 24 78 <48> 89 44 24 58 8b 84 24 e0 00 00 00 89 84 24 84 00 00 00 8b 84 24 RSP: 002b:00007ffcaf0cf5c0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 RAX: 00007f858d7d27a8 RBX: 00007f858d7d8820 RCX: 00007f858d3940d8 RDX: 00007ffcaf0cf798 RSI: 00000000f5e616f3 RDI: 00007f858d394fee RBP: 0000000000000000 R08: 00007ffcaf0cf780 R09: 00007f858d7db480 R10: 0000000000000000 R11: 0000000009691a75 R12: 0000000000000005 R13: 00000000f5e616f3 R14: 0000000000000000 R15: 00007ffcaf0cf798 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16neigh: fix use-after-free read in pneigh_get_nextEric Dumazet
Nine years ago, I added RCU handling to neighbours, not pneighbours. (pneigh are not commonly used) Unfortunately I missed that /proc dump operations would use a common entry and exit point : neigh_seq_start() and neigh_seq_stop() We need to read_lock(tbl->lock) or risk use-after-free while iterating the pneigh structures. We might later convert pneigh to RCU and revert this patch. sysbot reported : BUG: KASAN: use-after-free in pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158 Read of size 8 at addr ffff888097f2a700 by task syz-executor.0/9825 CPU: 1 PID: 9825 Comm: syz-executor.0 Not tainted 5.2.0-rc4+ #32 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158 neigh_seq_next+0xdb/0x210 net/core/neighbour.c:3240 seq_read+0x9cf/0x1110 fs/seq_file.c:258 proc_reg_read+0x1fc/0x2c0 fs/proc/inode.c:221 do_loop_readv_writev fs/read_write.c:714 [inline] do_loop_readv_writev fs/read_write.c:701 [inline] do_iter_read+0x4a4/0x660 fs/read_write.c:935 vfs_readv+0xf0/0x160 fs/read_write.c:997 kernel_readv fs/splice.c:359 [inline] default_file_splice_read+0x475/0x890 fs/splice.c:414 do_splice_to+0x127/0x180 fs/splice.c:877 splice_direct_to_actor+0x2d2/0x970 fs/splice.c:954 do_splice_direct+0x1da/0x2a0 fs/splice.c:1063 do_sendfile+0x597/0xd00 fs/read_write.c:1464 __do_sys_sendfile64 fs/read_write.c:1525 [inline] __se_sys_sendfile64 fs/read_write.c:1511 [inline] __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4592c9 Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f4aab51dc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000004592c9 RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005 RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000080000000 R11: 0000000000000246 R12: 00007f4aab51e6d4 R13: 00000000004c689d R14: 00000000004db828 R15: 00000000ffffffff Allocated by task 9827: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503 __do_kmalloc mm/slab.c:3660 [inline] __kmalloc+0x15c/0x740 mm/slab.c:3669 kmalloc include/linux/slab.h:552 [inline] pneigh_lookup+0x19c/0x4a0 net/core/neighbour.c:731 arp_req_set_public net/ipv4/arp.c:1010 [inline] arp_req_set+0x613/0x720 net/ipv4/arp.c:1026 arp_ioctl+0x652/0x7f0 net/ipv4/arp.c:1226 inet_ioctl+0x2a0/0x340 net/ipv4/af_inet.c:926 sock_do_ioctl+0xd8/0x2f0 net/socket.c:1043 sock_ioctl+0x3ed/0x780 net/socket.c:1194 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9824: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kfree+0xcf/0x220 mm/slab.c:3755 pneigh_ifdown_and_unlock net/core/neighbour.c:812 [inline] __neigh_ifdown+0x236/0x2f0 net/core/neighbour.c:356 neigh_ifdown+0x20/0x30 net/core/neighbour.c:372 arp_ifdown+0x1d/0x21 net/ipv4/arp.c:1274 inetdev_destroy net/ipv4/devinet.c:319 [inline] inetdev_event+0xa14/0x11f0 net/ipv4/devinet.c:1544 notifier_call_chain+0xc2/0x230 kernel/notifier.c:95 __raw_notifier_call_chain kernel/notifier.c:396 [inline] raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:403 call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1749 call_netdevice_notifiers_extack net/core/dev.c:1761 [inline] call_netdevice_notifiers net/core/dev.c:1775 [inline] rollback_registered_many+0x9b9/0xfc0 net/core/dev.c:8178 rollback_registered+0x109/0x1d0 net/core/dev.c:8220 unregister_netdevice_queue net/core/dev.c:9267 [inline] unregister_netdevice_queue+0x1ee/0x2c0 net/core/dev.c:9260 unregister_netdevice include/linux/netdevice.h:2631 [inline] __tun_detach+0xd8a/0x1040 drivers/net/tun.c:724 tun_detach drivers/net/tun.c:741 [inline] tun_chr_close+0xe0/0x180 drivers/net/tun.c:3451 __fput+0x2ff/0x890 fs/file_table.c:280 ____fput+0x16/0x20 fs/file_table.c:313 task_work_run+0x145/0x1c0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:168 prepare_exit_to_usermode arch/x86/entry/common.c:199 [inline] syscall_return_slowpath arch/x86/entry/common.c:279 [inline] do_syscall_64+0x58e/0x680 arch/x86/entry/common.c:304 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888097f2a700 which belongs to the cache kmalloc-64 of size 64 The buggy address is located 0 bytes inside of 64-byte region [ffff888097f2a700, ffff888097f2a740) The buggy address belongs to the page: page:ffffea00025fca80 refcount:1 mapcount:0 mapping:ffff8880aa400340 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea000250d548 ffffea00025726c8 ffff8880aa400340 raw: 0000000000000000 ffff888097f2a000 0000000100000020 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888097f2a600: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc ffff888097f2a680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff888097f2a700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ^ ffff888097f2a780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff888097f2a800: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16tcp: fix compile error if !CONFIG_SYSCTLEric Dumazet
tcp_tx_skb_cache_key and tcp_rx_skb_cache_key must be available even if CONFIG_SYSCTL is not set. Fixes: 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl") Fixes: ede61ca474a0 ("tcp: add tcp_rx_skb_cache sysctl") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16hv_sock: Suppress bogus "may be used uninitialized" warningsDexuan Cui
gcc 8.2.0 may report these bogus warnings under some condition: warning: ‘vnew’ may be used uninitialized in this function warning: ‘hvs_new’ may be used uninitialized in this function Actually, the 2 pointers are only initialized and used if the variable "conn_from_host" is true. The code is not buggy here. Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16net: handle 802.1P vlan 0 packets properlyGovindarajulu Varadarajan
When stack receives pkt: [802.1P vlan 0][802.1AD vlan 100][IPv4], vlan_do_receive() returns false if it does not find vlan_dev. Later __netif_receive_skb_core() fails to find packet type handler for skb->protocol 801.1AD and drops the packet. 801.1P header with vlan id 0 should be handled as untagged packets. This patch fixes it by checking if vlan_id is 0 and processes next vlan header. Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16net/mlx5: Declare more strictly devlink encap modeLeon Romanovsky
Devlink has UAPI declaration for encap mode, so there is no need to be loose on the data get/set by drivers. Update call sites to use enum devlink_eswitch_encap_mode instead of plain u8. Suggested-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz>
2019-06-15tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()Eric Dumazet
If mtu probing is enabled tcp_mtu_probing() could very well end up with a too small MSS. Use the new sysctl tcp_min_snd_mss to make sure MSS search is performed in an acceptable range. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15tcp: add tcp_min_snd_mss sysctlEric Dumazet
Some TCP peers announce a very small MSS option in their SYN and/or SYN/ACK messages. This forces the stack to send packets with a very high network/cpu overhead. Linux has enforced a minimal value of 48. Since this value includes the size of TCP options, and that the options can consume up to 40 bytes, this means that each segment can include only 8 bytes of payload. In some cases, it can be useful to increase the minimal value to a saner value. We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility reasons. Note that TCP_MAXSEG socket option enforces a minimal value of (TCP_MIN_MSS). David Miller increased this minimal value in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.") from 64 to 88. We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15tcp: tcp_fragment() should apply sane memory limitsEric Dumazet
Jonathan Looney reported that a malicious peer can force a sender to fragment its retransmit queue into tiny skbs, inflating memory usage and/or overflow 32bit counters. TCP allows an application to queue up to sk_sndbuf bytes, so we need to give some allowance for non malicious splitting of retransmit queue. A new SNMP counter is added to monitor how many times TCP did not allow to split an skb if the allowance was exceeded. Note that this counter might increase in the case applications use SO_SNDBUF socket option to lower sk_sndbuf. CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the socket is already using more than half the allowed space Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15tcp: limit payload size of sacked skbsEric Dumazet
Jonathan Looney reported that TCP can trigger the following crash in tcp_shifted_skb() : BUG_ON(tcp_skb_pcount(skb) < pcount); This can happen if the remote peer has advertized the smallest MSS that linux TCP accepts : 48 An skb can hold 17 fragments, and each fragment can hold 32KB on x86, or 64KB on PowerPC. This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs can overflow. Note that tcp_sendmsg() builds skbs with less than 64KB of payload, so this problem needs SACK to be enabled. SACK blocks allow TCP to coalesce multiple skbs in the retransmit queue, thus filling the 17 fragments to maximal capacity. CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2019-06-15 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) fix stack layout of JITed x64 bpf code, from Alexei. 2) fix out of bounds memory access in bpf_sk_storage, from Arthur. 3) fix lpm trie walk, from Jonathan. 4) fix nested bpf_perf_event_output, from Matt. 5) and several other fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15bpf: Fix out of bounds memory access in bpf_sk_storageArthur Fabre
bpf_sk_storage maps use multiple spin locks to reduce contention. The number of locks to use is determined by the number of possible CPUs. With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used. When updating elements, the correct lock is determined with hash_ptr(). Calling hash_ptr() with 0 bits is undefined behavior, as it does: x >> (64 - bits) Using the value results in an out of bounds memory access. In my case, this manifested itself as a page fault when raw_spin_lock_bh() is called later, when running the self tests: ./tools/testing/selftests/bpf/test_verifier 773 775 [ 16.366342] BUG: unable to handle page fault for address: ffff8fe7a66f93f8 Force the minimum number of locks to two. Signed-off-by: Arthur Fabre <afabre@cloudflare.com> Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage") Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-15net: sched: remove NET_CLS_IND config optionJiri Pirko
This config option makes only couple of lines optional. Two small helpers and an int in couple of cls structs. Remove the config option and always compile this in. This saves the user from unexpected surprises when he adds a filter with ingress device match which is silently ignored in case the config option is not set. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15vsock/virtio: set SOCK_DONE on peer shutdownStephen Barber
Set the SOCK_DONE flag to match the TCP_CLOSING state when a peer has shut down and there is nothing left to read. This fixes the following bug: 1) Peer sends SHUTDOWN(RDWR). 2) Socket enters TCP_CLOSING but SOCK_DONE is not set. 3) read() returns -ENOTCONN until close() is called, then returns 0. Signed-off-by: Stephen Barber <smbarber@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: dsa: use switchdev handle helpersVivien Didelot
Get rid of the dsa_slave_switchdev_port_{attr_set,obj}_event functions in favor of the switchdev_handle_port_{attr_set,obj_add,obj_del} helpers which recurse into the lower devices of the target interface. This has the benefit of being aware of the operations made on the bridge device itself, where orig_dev is the bridge, and dev is the slave. This can be used later to configure the hardware switches. Only VLAN and (port) MDB objects not directly targeting the slave device are unsupported at the moment, so skip this case in their respective case statements. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14net: dsa: make dsa_slave_dev_check use constVivien Didelot
The switchdev handle helpers make use of a device checking helper requiring a const net_device. Make dsa_slave_dev_check compliant to this. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>