summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-04-17net: core: introduce build_skb_aroundJesper Dangaard Brouer
The function build_skb() also have the responsibility to allocate and clear the SKB structure. Introduce a new function build_skb_around(), that moves the responsibility of allocation and clearing to the caller. This allows caller to use kmem_cache (slab/slub) bulk allocation API. Next patch use this function combined with kmem_cache_alloc_bulk. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-17ipv4: set the tcp_min_rtt_wlen range from 0 to one dayZhangXiaoxu
There is a UBSAN report as below: UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56 signed integer overflow: 2147483647 * 1000 cannot be represented in type 'int' CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e #1 Call Trace: <IRQ> dump_stack+0x8c/0xba ubsan_epilogue+0x11/0x60 handle_overflow+0x12d/0x170 ? ttwu_do_wakeup+0x21/0x320 __ubsan_handle_mul_overflow+0x12/0x20 tcp_ack_update_rtt+0x76c/0x780 tcp_clean_rtx_queue+0x499/0x14d0 tcp_ack+0x69e/0x1240 ? __wake_up_sync_key+0x2c/0x50 ? update_group_capacity+0x50/0x680 tcp_rcv_established+0x4e2/0xe10 tcp_v4_do_rcv+0x22b/0x420 tcp_v4_rcv+0xfe8/0x1190 ip_protocol_deliver_rcu+0x36/0x180 ip_local_deliver+0x15b/0x1a0 ip_rcv+0xac/0xd0 __netif_receive_skb_one_core+0x7f/0xb0 __netif_receive_skb+0x33/0xc0 netif_receive_skb_internal+0x84/0x1c0 napi_gro_receive+0x2a0/0x300 receive_buf+0x3d4/0x2350 ? detach_buf_split+0x159/0x390 virtnet_poll+0x198/0x840 ? reweight_entity+0x243/0x4b0 net_rx_action+0x25c/0x770 __do_softirq+0x19b/0x66d irq_exit+0x1eb/0x230 do_IRQ+0x7a/0x150 common_interrupt+0xf/0xf </IRQ> It can be reproduced by: echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen Fixes: f672258391b42 ("tcp: track min RTT using windowed min-filter") Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17SUNRPC: Ignore queue transmission errors on successful transmissionTrond Myklebust
If a request transmission fails due to write space or slot unavailability errors, but the queued task then gets transmitted before it has time to process the error in call_transmit_status() or call_bc_transmit_status(), we need to suppress the transmission error code to prevent it from leaking out of the RPC layer. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Tested-by: Chuck Lever <chuck.lever@oracle.com>
2019-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflict resolution of af_smc.c from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Handle init flow failures properly in iwlwifi driver, from Shahar S Matityahu. 2) mac80211 TXQs need to be unscheduled on powersave start, from Felix Fietkau. 3) SKB memory accounting fix in A-MDSU aggregation, from Felix Fietkau. 4) Increase RCU lock hold time in mlx5 FPGA code, from Saeed Mahameed. 5) Avoid checksum complete with XDP in mlx5, also from Saeed. 6) Fix netdev feature clobbering in ibmvnic driver, from Thomas Falcon. 7) Partial sent TLS record leak fix from Jakub Kicinski. 8) Reject zero size iova range in vhost, from Jason Wang. 9) Allow pending work to complete before clcsock release from Karsten Graul. 10) Fix XDP handling max MTU in thunderx, from Matteo Croce. 11) A lot of protocols look at the sa_family field of a sockaddr before validating it's length is large enough, from Tetsuo Handa. 12) Don't write to free'd pointer in qede ptp error path, from Colin Ian King. 13) Have to recompile IP options in ipv4_link_failure because it can be invoked from ARP, from Stephen Suryaputra. 14) Doorbell handling fixes in qed from Denis Bolotin. 15) Revert net-sysfs kobject register leak fix, it causes new problems. From Wang Hai. 16) Spectre v1 fix in ATM code, from Gustavo A. R. Silva. 17) Fix put of BROPT_VLAN_STATS_PER_PORT in bridging code, from Nikolay Aleksandrov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (111 commits) socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW tcp: tcp_grow_window() needs to respect tcp_space() ocelot: Clean up stats update deferred work ocelot: Don't sleep in atomic context (irqs_disabled()) net: bridge: fix netlink export of vlan_stats_per_port option qed: fix spelling mistake "faspath" -> "fastpath" tipc: set sysctl_tipc_rmem and named_timeout right range tipc: fix link established but not in session net: Fix missing meta data in skb with vlan packet net: atm: Fix potential Spectre v1 vulnerabilities net/core: work around section mismatch warning for ptp_classifier net: bridge: fix per-port af_packet sockets bnx2x: fix spelling mistake "dicline" -> "decline" route: Avoid crash from dereferencing NULL rt->from MAINTAINERS: normalize Woojung Huh's email address bonding: fix event handling for stacked bonds Revert "net-sysfs: Fix memory leak in netdev_register_kobject" rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check qed: Fix the DORQ's attentions handling qed: Fix missing DORQ attentions ...
2019-04-16socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEWArnd Bergmann
It looks like the new socket options only work correctly for native execution, but in case of compat mode fall back to the old behavior as we ignore the 'old_timeval' flag. Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the same way in compat and native 32-bit mode. Cc: Deepa Dinamani <deepa.kernel@gmail.com> Fixes: a9beb86ae6e5 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16tcp: tcp_grow_window() needs to respect tcp_space()Eric Dumazet
For some reason, tcp_grow_window() correctly tests if enough room is present before attempting to increase tp->rcv_ssthresh, but does not prevent it to grow past tcp_space() This is causing hard to debug issues, like failing the (__tcp_select_window(sk) >= tp->rcv_wnd) test in __tcp_ack_snd_check(), causing ACK delays and possibly slow flows. Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio, we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000" after about 60 round trips, when the active side no longer sends immediate acks. This bug predates git history. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net: bridge: fix netlink export of vlan_stats_per_port optionNikolay Aleksandrov
Since the introduction of the vlan_stats_per_port option the netlink export of it has been broken since I made a typo and used the ifla attribute instead of the bridge option to retrieve its state. Sysfs export is fine, only netlink export has been affected. Fixes: 9163a0fc1f0c0 ("net: bridge: add support for per-port vlan stats") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16tipc: set sysctl_tipc_rmem and named_timeout right rangeJie Liu
We find that sysctl_tipc_rmem and named_timeout do not have the right minimum setting. sysctl_tipc_rmem should be larger than zero, like sysctl_tcp_rmem. And named_timeout as a timeout setting should be not less than zero. Fixes: cc79dd1ba9c10 ("tipc: change socket buffer overflow control to respect sk_rcvbuf") Fixes: a5325ae5b8bff ("tipc: add name distributor resiliency queue") Signed-off-by: Jie Liu <liujie165@huawei.com> Reported-by: Qiang Ning <ningqiang1@huawei.com> Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16tipc: fix link established but not in sessionTuong Lien
According to the link FSM, when a link endpoint got RESET_MSG (- a traditional one without the stopping bit) from its peer, it moves to PEER_RESET state and raises a LINK_DOWN event which then resets the link itself. Its state will become ESTABLISHING after the reset event and the link will be re-established soon after this endpoint starts to send ACTIVATE_MSG to the peer. There is no problem with this mechanism, however the link resetting has cleared the link 'in_session' flag (along with the other important link data such as: the link 'mtu') that was correctly set up at the 1st step (i.e. when this endpoint received the peer RESET_MSG). As a result, the link will become ESTABLISHED, but the 'in_session' flag is not set, and all STATE_MSG from its peer will be dropped at the link_validate_msg(). It means the link not synced and will sooner or later face a failure. Since the link reset action is obviously needed for a new link session (this is also true in the other situations), the problem here is that the link is re-established a bit too early when the link endpoints are not really in-sync yet. The commit forces a resync as already done in the previous commit 91986ee166cf ("tipc: fix link session and re-establish issues") by simply varying the link 'peer_session' value at the link_reset(). Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net: Fix missing meta data in skb with vlan packetYuya Kusakabe
skb_reorder_vlan_header() should move XDP meta data with ethernet header if XDP meta data exists. Fixes: de8f3a83b0a0 ("bpf: add meta pointer for direct access") Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com> Signed-off-by: Takeru Hayasaka <taketarou2@gmail.com> Co-developed-by: Takeru Hayasaka <taketarou2@gmail.com> Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net: atm: Fix potential Spectre v1 vulnerabilitiesGustavo A. R. Silva
arg is controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: net/atm/lec.c:715 lec_mcast_attach() warn: potential spectre issue 'dev_lec' [r] (local cap) Fix this by sanitizing arg before using it to index dev_lec. Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://lore.kernel.org/lkml/20180423164740.GY17484@dhcp22.suse.cz/ Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net/core: work around section mismatch warning for ptp_classifierArd Biesheuvel
The routine ptp_classifier_init() uses an initializer for an automatic struct type variable which refers to an __initdata symbol. This is perfectly legal, but may trigger a section mismatch warning when running the compiler in -fpic mode, due to the fact that the initializer may be emitted into an anonymous .data section thats lack the __init annotation. So work around it by using assignments instead. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net: bridge: fix per-port af_packet socketsNikolay Aleksandrov
When the commit below was introduced it changed two visible things: - the skb was no longer passed through the protocol handlers with the original device - the skb was passed up the stack with skb->dev = bridge The first change broke af_packet sockets on bridge ports. For example we use them for hostapd which listens for ETH_P_PAE packets on the ports. We discussed two possible fixes: - create a clone and pass it through NF_HOOK(), act on the original skb based on the result - somehow signal to the caller from the okfn() that it was called, meaning the skb is ok to be passed, which this patch is trying to implement via returning 1 from the bridge link-local okfn() Note that we rely on the fact that NF_QUEUE/STOLEN would return 0 and drop/error would return < 0 thus the okfn() is called only when the return was 1, so we signal to the caller that it was called by preserving the return value from nf_hook(). Fixes: 8626c56c8279 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16xsk: fix XDP socket ring buffer memory orderingMagnus Karlsson
The ring buffer code of XDP sockets is missing a memory barrier on the consumer side between the load of the data and the write that signals that it is ok for the producer to put new data into the buffer. On architectures that does not guarantee that stores are not reordered with older loads, the producer might put data into the ring before the consumer had the chance to read it. As IA does guarantee this ordering, it would only need a compiler barrier here, but there are no primitives in Linux for this specific case (hinder writes to be ordered before older reads) so I had to add a smp_mb() here which will translate into a run-time synch operation on IA. Added a longish comment in the code explaining what each barrier in the ring implementation accomplishes and what would happen if we removed one of them. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-16bpf: allow clearing all sock_ops callback flagsViet Hoang Tran
The helper function bpf_sock_ops_cb_flags_set() can be used to both set and clear the sock_ops callback flags. However, its current behavior is not consistent. BPF program may clear a flag if more than one were set, or replace a flag with another one, but cannot clear all flags. This patch also updates the documentation to clarify the ability to clear flags of this helper function. Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-16bpf: reserve flags in bpf_skb_net_shrinkWillem de Bruijn
The ENCAP flags in bpf_skb_adjust_room are ignored on decap with bpf_skb_net_shrink. Reserve these bits for future use. Fixes: 868d523535c2d ("bpf: add bpf_skb_adjust_room encap flags") Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-15net: hsr: add tx stats for master interfaceMurali Karicheri
Add tx stats to hsr interface. Without this ifconfig for hsr interface doesn't show tx packet stats. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15net: hsr: fix debugfs path to support multiple interfacesMurali Karicheri
Fix the path of hsr debugfs root directory to use the net device name so that it can work with multiple interfaces. While at it, also fix some typos. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15net: hsr: fix naming of file and functionsMurali Karicheri
Fix the file name and functions to match with existing implementation. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15sctp: implement memory accounting on rx pathXin Long
sk_forward_alloc's updating is also done on rx path, but to be consistent we change to use sk_mem_charge() in sctp_skb_set_owner_r(). In sctp_eat_data(), it's not enough to check sctp_memory_pressure only, which doesn't work for mem_cgroup_sockets_enabled, so we change to use sk_under_memory_pressure(). When it's under memory pressure, sk_mem_reclaim() and sk_rmem_schedule() should be called on both RENEGE or CHUNK DELIVERY path exit the memory pressure status as soon as possible. Note that sk_rmem_schedule() is using datalen to make things easy there. Reported-by: Matteo Croce <mcroce@redhat.com> Tested-by: Matteo Croce <mcroce@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15sctp: implement memory accounting on tx pathXin Long
Now when sending packets, sk_mem_charge() and sk_mem_uncharge() have been used to set sk_forward_alloc. We just need to call sk_wmem_schedule() to check if the allocated should be raised, and call sk_mem_reclaim() to check if the allocated should be reduced when it's under memory pressure. If sk_wmem_schedule() returns false, which means no memory is allowed to allocate, it will block and wait for memory to become available. Note different from tcp, sctp wait_for_buf happens before allocating any skb, so memory accounting check is done with the whole msg_len before it too. Reported-by: Matteo Croce <mcroce@redhat.com> Tested-by: Matteo Croce <mcroce@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15route: Avoid crash from dereferencing NULL rt->fromJonathan Lemon
When __ip6_rt_update_pmtu() is called, rt->from is RCU dereferenced, but is never checked for null - rt6_flush_exceptions() may have removed the entry. [ 1913.989004] RIP: 0010:ip6_rt_cache_alloc+0x13/0x170 [ 1914.209410] Call Trace: [ 1914.214798] <IRQ> [ 1914.219226] __ip6_rt_update_pmtu+0xb0/0x190 [ 1914.228649] ip6_tnl_xmit+0x2c2/0x970 [ip6_tunnel] [ 1914.239223] ? ip6_tnl_parse_tlv_enc_lim+0x32/0x1a0 [ip6_tunnel] [ 1914.252489] ? __gre6_xmit+0x148/0x530 [ip6_gre] [ 1914.262678] ip6gre_tunnel_xmit+0x17e/0x3c7 [ip6_gre] [ 1914.273831] dev_hard_start_xmit+0x8d/0x1f0 [ 1914.283061] sch_direct_xmit+0xfa/0x230 [ 1914.291521] __qdisc_run+0x154/0x4b0 [ 1914.299407] net_tx_action+0x10e/0x1f0 [ 1914.307678] __do_softirq+0xca/0x297 [ 1914.315567] irq_exit+0x96/0xa0 [ 1914.322494] smp_apic_timer_interrupt+0x68/0x130 [ 1914.332683] apic_timer_interrupt+0xf/0x20 [ 1914.341721] </IRQ> Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected") Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15Revert "net-sysfs: Fix memory leak in netdev_register_kobject"Wang Hai
This reverts commit 6b70fc94afd165342876e53fc4b2f7d085009945. The reverted bugfix will cause another issue. Reported by syzbot+6024817a931b2830bc93@syzkaller.appspotmail.com. See https://syzkaller.appspot.com/x/log.txt?x=1737671b200000 for details. Signed-off-by: Wang Hai <wanghai26@huawei.com> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for net-next: 1) Remove the broute pseudo hook, implement this from the bridge prerouting hook instead. Now broute becomes real table in ebtables, from Florian Westphal. This also includes a size reduction patch for the bridge control buffer area via squashing boolean into bitfields and a selftest. 2) Add OS passive fingerprint version matching, from Fernando Fernandez. 3) Support for gue encapsulation for IPVS, from Jacky Hu. 4) Add support for NAT to the inet family, from Florian Westphal. This includes support for masquerade, redirect and nat extensions. 5) Skip interface lookup in flowtable, use device in the dst object. 6) Add jiffies64_to_msecs() and use it, from Li RongQing. 7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King. 8) Statify several functions, patches from YueHaibing and Florian Westphal. 9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing. 10) Merge route extension to core, also from Florian. 11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian. 12) Merge ip/ip6 masquerade extensions, from Florian. This includes netdevice notifier unification. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15bridge: only include nf_queue.h if neededStephen Rothwell
After merging the netfilter-next tree, today's linux-next build (powerpc ppc44x_defconfig) failed like this: In file included from net/bridge/br_input.c:19: include/net/netfilter/nf_queue.h:16:23: error: field 'state' has incomplete type struct nf_hook_state state; ^~~~~ Fixes: 971502d77faa ("bridge: netfilter: unroll NF_HOOK helper in bridge input path") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15xfrm: kconfig: make xfrm depend on inetFlorian Westphal
when CONFIG_INET is not enabled: net/xfrm/xfrm_output.c: In function ‘xfrm4_tunnel_encap_add’: net/xfrm/xfrm_output.c:234:2: error: implicit declaration of function ‘ip_select_ident’ [-Werror=implicit-function-declaration] ip_select_ident(dev_net(dst->dev), skb, NULL); XFRM only supports ipv4 and ipv6 so change dependency to INET and place user-visible options (pfkey sockets, migrate support and the like) under 'if INET' guard as well. Fixes: 1de70830066b7 ("xfrm: remove output2 indirection from xfrm_mode") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-15netfilter: nat: fix icmp id randomizationFlorian Westphal
Sven Auhagen reported that a 2nd ping request will fail if 'fully-random' mode is used. Reason is that if no proto information is given, min/max are both 0, so we set the icmp id to 0 instead of chosing a random value between 0 and 65535. Update test case as well to catch this, without fix this yields: [..] ERROR: cannot ping ns1 from ns2 with ip masquerade fully-random (attempt 2) ERROR: cannot ping ns1 from ns2 with ipv6 masquerade fully-random (attempt 2) ... becaus 2nd ping clashes with existing 'id 0' icmp conntrack and gets dropped. Fixes: 203f2e78200c27e ("netfilter: nat: remove l4proto->unique_tuple") Reported-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15netfilter: nf_tables: prevent shift wrap in nft_chain_parse_hook()Dan Carpenter
I believe that "hook->num" can be up to UINT_MAX. Shifting more than 31 bits would is undefined in C but in practice it would lead to shift wrapping. That would lead to an array overflow in nf_tables_addchain(): ops->hook = hook.type->hooks[ops->hooknum]; Fixes: fe19c04ca137 ("netfilter: nf_tables: remove nhooks field from struct nft_af_info") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15netfilter: ctnetlink: don't use conntrack/expect object addresses as idFlorian Westphal
else, we leak the addresses to userspace via ctnetlink events and dumps. Compute an ID on demand based on the immutable parts of nf_conn struct. Another advantage compared to using an address is that there is no immediate re-use of the same ID in case the conntrack entry is freed and reallocated again immediately. Fixes: 3583240249ef ("[NETFILTER]: nf_conntrack_expect: kill unique ID") Fixes: 7f85f914721f ("[NETFILTER]: nf_conntrack: kill unique ID") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-14rtnetlink: fix rtnl_valid_stats_req() nlmsg_len checkEric Dumazet
Jakub forgot to either use nlmsg_len() or nlmsg_msg_size(), allowing KMSAN to detect a possible uninit-value in rtnl_stats_get BUG: KMSAN: uninit-value in rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997 CPU: 0 PID: 10428 Comm: syz-executor034 Not tainted 5.1.0-rc2+ #24 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x173/0x1d0 lib/dump_stack.c:113 kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:619 __msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310 rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997 rtnetlink_rcv_msg+0x115b/0x1550 net/core/rtnetlink.c:5192 netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2485 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5210 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline] netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336 netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1925 sock_sendmsg_nosec net/socket.c:622 [inline] sock_sendmsg net/socket.c:632 [inline] ___sys_sendmsg+0xdb3/0x1220 net/socket.c:2137 __sys_sendmsg net/socket.c:2175 [inline] __do_sys_sendmsg net/socket.c:2184 [inline] __se_sys_sendmsg+0x305/0x460 net/socket.c:2182 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2182 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x63/0xe7 Fixes: 51bc860d4a99 ("rtnetlink: stats: validate attributes in get as well as dumps") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-14ipv4: ensure rcu_read_lock() in ipv4_link_failure()Eric Dumazet
fib_compute_spec_dst() needs to be called under rcu protection. syzbot reported : WARNING: suspicious RCU usage 5.1.0-rc4+ #165 Not tainted include/linux/inetdevice.h:220 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by swapper/0/0: #0: 0000000051b67925 ((&n->timer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:170 [inline] #0: 0000000051b67925 ((&n->timer)){+.-.}, at: call_timer_fn+0xda/0x720 kernel/time/timer.c:1315 stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.0-rc4+ #165 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5162 __in_dev_get_rcu include/linux/inetdevice.h:220 [inline] fib_compute_spec_dst+0xbbd/0x1030 net/ipv4/fib_frontend.c:294 spec_dst_fill net/ipv4/ip_options.c:245 [inline] __ip_options_compile+0x15a7/0x1a10 net/ipv4/ip_options.c:343 ipv4_link_failure+0x172/0x400 net/ipv4/route.c:1195 dst_link_failure include/net/dst.h:427 [inline] arp_error_report+0xd1/0x1c0 net/ipv4/arp.c:297 neigh_invalidate+0x24b/0x570 net/core/neighbour.c:995 neigh_timer_handler+0xc35/0xf30 net/core/neighbour.c:1081 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325 expire_timers kernel/time/timer.c:1362 [inline] __run_timers kernel/time/timer.c:1681 [inline] __run_timers kernel/time/timer.c:1649 [inline] run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694 __do_softirq+0x266/0x95a kernel/softirq.c:293 invoke_softirq kernel/softirq.c:374 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:414 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807 Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-13ipvs: do not schedule icmp errors from tunnelsJulian Anastasov
We can receive ICMP errors from client or from tunneling real server. While the former can be scheduled to real server, the latter should not be scheduled, they are decapsulated only when existing connection is found. Fixes: 6044eeffafbe ("ipvs: attempt to schedule icmp packets") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13netfilter: conntrack: initialize ct->timeoutAlexander Potapenko
KMSAN started reporting an error when accessing ct->timeout for the first time without initialization: BUG: KMSAN: uninit-value in __nf_ct_refresh_acct+0x1ae/0x470 net/netfilter/nf_conntrack_core.c:1765 ... dump_stack+0x173/0x1d0 lib/dump_stack.c:113 kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:624 __msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310 __nf_ct_refresh_acct+0x1ae/0x470 net/netfilter/nf_conntrack_core.c:1765 nf_ct_refresh_acct ./include/net/netfilter/nf_conntrack.h:201 nf_conntrack_udp_packet+0xb44/0x1040 net/netfilter/nf_conntrack_proto_udp.c:122 nf_conntrack_handle_packet net/netfilter/nf_conntrack_core.c:1605 nf_conntrack_in+0x1250/0x26c9 net/netfilter/nf_conntrack_core.c:1696 ... Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205 kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159 kmsan_kmalloc+0xa9/0x130 mm/kmsan/kmsan_hooks.c:173 kmem_cache_alloc+0x554/0xb10 mm/slub.c:2789 __nf_conntrack_alloc+0x16f/0x690 net/netfilter/nf_conntrack_core.c:1342 init_conntrack+0x6cb/0x2490 net/netfilter/nf_conntrack_core.c:1421 Signed-off-by: Alexander Potapenko <glider@google.com> Fixes: cc16921351d8ba1 ("netfilter: conntrack: avoid same-timeout update") Cc: Florian Westphal <fw@strlen.de> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13netfilter: conntrack: don't set related state for different outer addressFlorian Westphal
Luca Moro says: ------ The issue lies in the filtering of ICMP and ICMPv6 errors that include an inner IP datagram. For these packets, icmp_error_message() extract the ICMP error and inner layer to search of a known state. If a state is found the packet is tagged as related (IP_CT_RELATED). The problem is that there is no correlation check between the inner and outer layer of the packet. So one can encapsulate an error with an inner layer matching a known state, while its outer layer is directed to a filtered host. In this case the whole packet will be tagged as related. This has various implications from a rule bypass (if a rule to related trafic is allow), to a known state oracle. Unfortunately, we could not find a real statement in a RFC on how this case should be filtered. The closest we found is RFC5927 (Section 4.3) but it is not very clear. A possible fix would be to check that the inner IP source is the same than the outer destination. We believed this kind of attack was not documented yet, so we started to write a blog post about it. You can find it attached to this mail (sorry for the extract quality). It contains more technical details, PoC and discussion about the identified behavior. We discovered later that https://www.gont.com.ar/papers/filtering-of-icmp-error-messages.pdf described a similar attack concept in 2004 but without the stateful filtering in mind. ----- This implements above suggested fix: In icmp(v6) error handler, take outer destination address, then pass that into the common function that does the "related" association. After obtaining the nf_conn of the matching inner-headers connection, check that the destination address of the opposite direction tuple is the same as the outer address and only set RELATED if thats the case. Reported-by: Luca Moro <luca.moro@synacktiv.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12ipv4: recompile ip options in ipv4_link_failureStephen Suryaputra
Recompile IP options since IPCB may not be valid anymore when ipv4_link_failure is called from arp_error_report. Refer to the commit 3da1ed7ac398 ("net: avoid use IPCB in cipso_v4_error") and the commit before that (9ef6b42ad6fd) for a similar issue. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12ipv6: Remove flowi6_oif compare from __ip6_route_redirectDavid Ahern
In the review of 0b34eb004347 ("ipv6: Refactor __ip6_route_redirect"), Martin noted that the flowi6_oif compare is moved to the new helper and should be removed from __ip6_route_redirect. Fix the oversight. Fixes: 0b34eb004347 ("ipv6: Refactor __ip6_route_redirect") Reported-by: Martin Lau <kafai@fb.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rxrpc: Fix detection of out of order acksJeffrey Altman
The rxrpc packet serial number cannot be safely used to compute out of order ack packets for several reasons: 1. The allocation of serial numbers cannot be assumed to imply the order by which acks are populated and transmitted. In some rxrpc implementations, delayed acks and ping acks are transmitted asynchronously to the receipt of data packets and so may be transmitted out of order. As a result, they can race with idle acks. 2. Serial numbers are allocated by the rxrpc connection and not the call and as such may wrap independently if multiple channels are in use. In any case, what matters is whether the ack packet provides new information relating to the bounds of the window (the firstPacket and previousPacket in the ACK data). Fix this by discarding packets that appear to wind back the window bounds rather than on serial number procession. Fixes: 298bc15b2079 ("rxrpc: Only take the rwind and mtu values from latest ACK") Signed-off-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rxrpc: Trace received connection abortsDavid Howells
Trace received calls that are aborted due to a connection abort, typically because of authentication failure. Without this, connection aborts don't show up in the trace log. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rxrpc: Allow errors to be returned from rxrpc_queue_packet()Marc Dionne
Change rxrpc_queue_packet()'s signature so that it can return any error code it may encounter when trying to send the packet. This allows the caller to eventually do something in case of error - though it should be noted that the packet has been queued and a resend is scheduled. Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rxrpc: Make rxrpc_kernel_check_life() indicate if call completedMarc Dionne
Make rxrpc_kernel_check_life() pass back the life counter through the argument list and return true if the call has not yet completed. Suggested-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12rxrpc: Clear socket errorMarc Dionne
When an ICMP or ICMPV6 error is received, the error will be attached to the socket (sk_err) and the report function will get called. Clear any pending error here by calling sock_error(). This would cause the following attempt to use the socket to fail with the error code stored by the ICMP error, resulting in unexpected errors with various side effects depending on the context. Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jonathan Billings <jsbillin@umich.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/smc: improve smc_conn_create reason codesKarsten Graul
Rework smc_conn_create() to always return a valid DECLINE reason code. This removes the need to translate the return codes on 4 different places and allows to easily add more detailed return codes by changing smc_conn_create() only. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/smc: improve smc_listen_work reason codesKarsten Graul
Rework smc_listen_work() to provide improved reason codes when an SMC connection is declined. This allows better debugging on user side. This also adds 3 more detailed reason codes in smc_clc.h to indicate what type of device was not found (ism or rdma or both), or if ism cannot talk to the peer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/smc: code cleanup smc_listen_workKarsten Graul
In smc_listen_work() the variables rc and reason_code are defined which have the same meaning. Eliminate reason_code in favor of the shorter name rc. No functional changes. Rename the functions smc_check_ism() and smc_check_rdma() into smc_find_ism_device() and smc_find_rdma_device() to make there purpose more clear. No functional changes. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/smc: cleanup of get vlan idKarsten Graul
The vlan_id of the underlying CLC socket was retrieved two times during processing of the listen handshaking. Change this to get the vlan id one time in connect and in listen processing, and reuse the id. And add a new CLC DECLINE return code for the case when the retrieval of the vlan id failed. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/smc: consolidate function parametersKarsten Graul
During initialization of an SMC socket a lot of function parameters need to get passed down the function call path. Consolidate the parameters in a helper struct so there are less enough parameters to get all passed by register. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/smc: check for ip prefix and subnetKarsten Graul
The check for a matching ip prefix and subnet was only done for SMC-R in smc_listen_rdma_check() but not when an SMC-D connection was possible. Rename the function into smc_listen_prfx_check() and move its call to a place where it is called for both SMC variants. And add a new CLC DECLINE reason for the case when the IP prefix or subnet check fails so the reason for the failing SMC connection can be found out more easily. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/smc: fallback to TCP after connect problemsKarsten Graul
Correct the CLC decline reason codes for internal problems to not have the sign bit set, negative reason codes are interpreted as not eligible for TCP fallback. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/smc: nonblocking connect reworkUrsula Braun
For nonblocking sockets move the kernel_connect() from the connect worker into the initial smc_connect part to return kernel_connect() errors other than -EINPROGRESS to user space. Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>