summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2015-10-07net: Fix vti use case with oif in dst lookups for IPv6David Ahern
It occurred to me yesterday that 741a11d9e4103 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set") means that xfrm6_dst_lookup needs the FLOWI_FLAG_SKIP_NH_OIF flag set. This latest commit causes the oif to be considered in lookups which is known to break vti. This explains why 58189ca7b274 did not the IPv6 change at the time it was submitted. Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07openvswitch: netlink attributes for IPv6 tunnelingJiri Benc
Add netlink attributes for IPv6 tunnel addresses. This enables IPv6 support for tunnels. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07openvswitch: add tunnel protocol to sw_flow_keyJiri Benc
Store tunnel protocol (AF_INET or AF_INET6) in sw_flow_key. This field now also acts as an indicator whether the flow contains tunnel data (this was previously indicated by tun_key.u.ipv4.dst being set but with IPv6 addresses in an union with IPv4 ones this won't work anymore). The new field was added to a hole in sw_flow_key. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07bridge: netlink: make br_fill_info's frame size smallerNikolay Aleksandrov
When KASAN is enabled the frame size grows > 2048 bytes and we get a warning, so make it smaller. net/bridge/br_netlink.c: In function 'br_fill_info': >> net/bridge/br_netlink.c:1110:1: warning: the frame size of 2160 bytes >> is larger than 2048 bytes [-Wframe-larger-than=] Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07net: Add support for filtering neigh dump by device indexDavid Ahern
Add support for filtering neighbor dumps by device by adding the NDA_IFINDEX attribute to the dump request. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07net: dsa: better error reportingRussell King
Add additional error reporting to the generic DSA code, so it's easier to debug when things go wrong. This was useful when initially bringing up 88e6176 on a new board. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07ipvs: Remove possibly unused variables from ip_vs_conn_net_{init,cleanup}Simon Horman
If CONFIG_PROC_FS is undefined then the arguments of proc_create() and remove_proc_entry() are unused. As a result the net variables of ip_vs_conn_net_{init,cleanup} are unused. net/netfilter/ipvs//ip_vs_conn.c: In function ‘ip_vs_conn_net_init’: net/netfilter/ipvs//ip_vs_conn.c:1350:14: warning: unused variable ‘net’ [-Wunused-variable] net/netfilter/ipvs//ip_vs_conn.c: In function ‘ip_vs_conn_net_cleanup’: net/netfilter/ipvs//ip_vs_conn.c:1361:14: warning: unused variable ‘net’ [-Wunused-variable] ... Resolve this by dereferencing net as needed rather than storing it in a variable. Fixes: 3d99376689ee ("ipvs: Pass ipvs not net into ip_vs_control_net_(init|cleanup)") Signed-off-by: Simon Horman <horms@verge.net.au> Acked-by: Julian Anastasov <ja@ssi.bg>
2015-10-07ipvs: Remove possibly unused variable from ip_vs_outDavid Ahern
Eric's net namespace changes in 1b75097dd7a26 leaves net unreferenced if CONFIG_IP_VS_IPV6 is not enabled: ../net/netfilter/ipvs/ip_vs_core.c: In function ‘ip_vs_out’: ../net/netfilter/ipvs/ip_vs_core.c:1177:14: warning: unused variable ‘net’ [-Wunused-variable] After the net refactoring there is only 1 user; push the reference to the 1 user. While the line length slightly exceeds 80 it seems to be the best change. Fixes: 1b75097dd7a26("ipvs: Pass ipvs into ip_vs_out") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Julian Anastasov <ja@ssi.bg> [horms: updated subject] Signed-off-by: Simon Horman <horms@verge.net.au>
2015-10-06xprtrdma: Don't require LOCAL_DMA_LKEY support for fastregSagi Grimberg
There is no need to require LOCAL_DMA_LKEY support as the PD allocation makes sure that there is a local_dma_lkey. Also correctly set a return value in error path. This caused a NULL pointer dereference in mlx5 which removed the support for LOCAL_DMA_LKEY. Fixes: bb6c96d72879 ("xprtrdma: Replace global lkey with lkey local to PD") Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-05ipv4: Fix compilation errors in fib_rebalancePeter Nørlund
This fixes net/built-in.o: In function `fib_rebalance': fib_semantics.c:(.text+0x9df14): undefined reference to `__divdi3' and net/built-in.o: In function `fib_rebalance': net/ipv4/fib_semantics.c:572: undefined reference to `__aeabi_ldivmod' Fixes: 0e884c78ee19 ("ipv4: L3 hash-based multipath") Signed-off-by: Peter Nørlund <pch@ordbogen.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05RDS: IB: split mr pool to improve 8K messages performanceSantosh Shilimkar
8K message sizes are pretty important usecase for RDS current workloads so we make provison to have 8K mrs available from the pool. Based on number of SG's in the RDS message, we pick a pool to use. Also to make sure that we don't under utlise mrs when say 8k messages are dominating which could lead to 8k pull being exhausted, we fall-back to 1m pool till 8k pool recovers for use. This helps to at least push ~55 kB/s bidirectional data which is a nice improvement. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: use max_mr from HCA caps than max_fmrSantosh Shilimkar
All HCA drivers seems to popullate max_mr caps and few of them do both max_mr and max_fmr. Hence update RDS code to make use of max_mr. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: mark rds_ib_fmr_wq staticSantosh Shilimkar
Fix below warning by marking rds_ib_fmr_wq static net/rds/ib_rdma.c:87:25: warning: symbol 'rds_ib_fmr_wq' was not declared. Should it be static? Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: use already available pool handle from ibmrSantosh Shilimkar
rds_ib_mr already keeps the pool handle which it associates with. Lets use that instead of round about way of fetching it from rds_ib_device. No functional change. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: fix the rds_ib_fmr_wq kick callSantosh Shilimkar
RDS IB mr pool has its own workqueue 'rds_ib_fmr_wq', so we need to use queue_delayed_work() to kick the work. This was hurting the performance since pool maintenance was less often triggered from other path. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: handle rds_ibdev release case instead of crashing the kernelSantosh Shilimkar
Just in case we are still handling the QP receive completion while the rds_ibdev is released, drop the connection instead of crashing the kernel. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: split send completion handling and do batch ackSantosh Shilimkar
Similar to what we did with receive CQ completion handling, we split the transmit completion handler so that it lets us implement batched work completion handling. We re-use the cq_poll routine and makes use of RDS_IB_SEND_OP to identify the send vs receive completion event handler invocation. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: IB: ack more receive completions to improve performanceSantosh Shilimkar
For better performance, we split the receive completion IRQ handler. That lets us acknowledge several WCE events in one call. We also limit the WC to max 32 to avoid latency. Acknowledging several completions in one call instead of several calls each time will provide better performance since less mutual exclusion locks are being performed. In next patch, send completion is also split which re-uses the poll_cq() and hence the code is moved to ib_cm.c Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULLSantosh Shilimkar
In Transport indepedent rds_sendmsg(), we shouldn't make decisions based on RDS_LL_SEND_FULL which is used to manage the ring for RDMA based transports. We can safely issue rds_send_xmit() and the using its return value take decision on deferred work. This will also fix the scenario where at times we are seeing connections stuck with the LL_SEND_FULL bit getting set and never cleared. We kick krdsd after any time we see -ENOMEM or -EAGAIN from the ring allocation code. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05RDS: defer the over_batch work to send workerSantosh Shilimkar
Current process gives up if its send work over the batch limit. The work queue will get kicked to finish off any other requests. This fixes remainder condition from commit 443be0e5affe ("RDS: make sure not to loop forever inside rds_send_xmit"). The restart condition is only for the case where we reached to over_batch code for some other reason so just retrying again before giving up. While at it, make sure we use already available 'send_batch_count' parameter instead of magic value. The batch count threshold value of 1024 came via commit 443be0e5affe ("RDS: make sure not to loop forever inside rds_send_xmit"). The idea is to process as big a batch as we can but at the same time we don't hold other waiting processes for send. Hence back-off after the send_batch_count limit (1024) to avoid soft-lock ups. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05mac80211: make ieee80211_new_mesh_header return unsignedAndrzej Hajda
The function returns always non-negative values. The problem has been detected using proposed semantic patch scripts/coccinelle/tests/assign_signed_to_unsigned.cocci [1]. [1]: http://permalink.gmane.org/gmane.linux.kernel/2046107 Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-05netfilter: nfnetlink_log: allow to attach conntrackKen-ichirou MATSUZAWA
This patch enables to include the conntrack information together with the packet that is sent to user-space via NFLOG, then a user-space program can acquire NATed information by this NFULA_CT attribute. Including the conntrack information is optional, you can set it via NFULNL_CFG_F_CONNTRACK flag with the NFULA_CFG_FLAGS attribute like NFQUEUE. Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05netfilter: ctnetlink: add const qualifier to nfnl_hook.get_ctKen-ichirou MATSUZAWA
get_ct as is and will not update its skb argument, and users of nfnl_ct_hook is currently only nfqueue, we can add const qualifier. Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
2015-10-05netfilter: Kconfig rename QUEUE_CT to GLUE_CTKen-ichirou MATSUZAWA
Conntrack information attaching infrastructure is now generic and update it's name to use `glue' in previous patch. This patch updates Kconfig symbol name and adding NF_CT_NETLINK dependency. Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack infoKen-ichirou MATSUZAWA
The idea of this series of patch is to attach conntrack information to nflog like nfqueue has already done. nfqueue conntrack info attaching basis is generic, rename those names to generic one, glue. Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05netfilter: nfnetlink_queue: use y2038 safe timestampPablo Neira Ayuso
The __build_packet_message function fills a nfulnl_msg_packet_timestamp structure that uses 64-bit seconds and is therefore y2038 safe, but it uses an intermediate 'struct timespec' which is not. This trivially changes the code to use 'struct timespec64' instead, to correct the result on 32-bit architectures. This is a copy and paste of Arnd's original patch for nfnetlink_log. Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05Merge tag 'ipvs3-for-v4.4' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== Third Round of IPVS Updates for v4.4 please consider this build fix from Eric Biederman which resolves a build problem introduced in is excellent work to cleanup IPVS which you recently pulled: its queued up for v4.4 so no need to worry about earlier kernel versions. I have another minor cleanup, to fix a build warning, pending. However, I wanted to send this one to you now as its hit nf-next, net-next and in turn next, and a slow trickle of bug reports are appearing. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05openvswitch: Fix ovs_vport_get_stats()Pravin B Shelar
Not every device has dev->tstats set. So when OVS tries to calculate vport stats it causes kernel panic. Following patch fixes it by using standard API to get net-device stats. ---8<--- Unable to handle kernel paging request at virtual address 766b4008 Internal error: Oops: 96000005 [#1] PREEMPT SMP Modules linked in: vport_vxlan vxlan ip6_udp_tunnel udp_tunnel tun bridge stp llc openvswitch ipv6 CPU: 7 PID: 1108 Comm: ovs-vswitchd Not tainted 4.3.0-rc3+ #82 PC is at ovs_vport_get_stats+0x150/0x1f8 [openvswitch] <snip> Call trace: [<ffffffbffc0859f8>] ovs_vport_get_stats+0x150/0x1f8 [openvswitch] [<ffffffbffc07cdb0>] ovs_vport_cmd_fill_info+0x140/0x1e0 [openvswitch] [<ffffffbffc07cf0c>] ovs_vport_cmd_dump+0xbc/0x138 [openvswitch] [<ffffffc00045a5ac>] netlink_dump+0xb8/0x258 [<ffffffc00045ace0>] __netlink_dump_start+0x120/0x178 [<ffffffc00045dd9c>] genl_family_rcv_msg+0x2d4/0x308 [<ffffffc00045de58>] genl_rcv_msg+0x88/0xc4 [<ffffffc00045cf24>] netlink_rcv_skb+0xd4/0x100 [<ffffffc00045dab0>] genl_rcv+0x30/0x48 [<ffffffc00045c830>] netlink_unicast+0x154/0x200 [<ffffffc00045cc9c>] netlink_sendmsg+0x308/0x364 [<ffffffc00041e10c>] sock_sendmsg+0x14/0x2c [<ffffffc000420d58>] SyS_sendto+0xbc/0xf0 Code: aa1603e1 f94037a4 aa1303e2 aa1703e0 (f9400465) Reported-by: Tomasz Sawicki <tomasz.sawicki@objectiveintegration.uk> Fixes: 8c876639c98 ("openvswitch: Remove vport stats.") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05bpf, seccomp: prepare for upcoming criu supportDaniel Borkmann
The current ongoing effort to dump existing cBPF seccomp filters back to user space requires to hold the pre-transformed instructions like we do in case of socket filters from sk_attach_filter() side, so they can be reloaded in original form at a later point in time by utilities such as criu. To prepare for this, simply extend the bpf_prog_create_from_user() API to hold a flag that tells whether we should store the original or not. Also, fanout filters could make use of that in future for things like diag. While fanout filters already use bpf_prog_destroy(), move seccomp over to them as well to handle original programs when present. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Tycho Andersen <tycho.andersen@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Tested-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05ovs: do not allocate memory from offline numa nodeKonstantin Khlebnikov
When openvswitch tries allocate memory from offline numa node 0: stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0) It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)) [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h This patch disables numa affinity in this case. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05bpf: fix panic in SO_GET_FILTER with native ebpf programsDaniel Borkmann
When sockets have a native eBPF program attached through setsockopt(sk, SOL_SOCKET, SO_ATTACH_BPF, ...), and then try to dump these over getsockopt(sk, SOL_SOCKET, SO_GET_FILTER, ...), the following panic appears: [49904.178642] BUG: unable to handle kernel NULL pointer dereference at (null) [49904.178762] IP: [<ffffffff81610fd9>] sk_get_filter+0x39/0x90 [49904.182000] PGD 86fc9067 PUD 531a1067 PMD 0 [49904.185196] Oops: 0000 [#1] SMP [...] [49904.224677] Call Trace: [49904.226090] [<ffffffff815e3d49>] sock_getsockopt+0x319/0x740 [49904.227535] [<ffffffff812f59e3>] ? sock_has_perm+0x63/0x70 [49904.228953] [<ffffffff815e2fc8>] ? release_sock+0x108/0x150 [49904.230380] [<ffffffff812f5a43>] ? selinux_socket_getsockopt+0x23/0x30 [49904.231788] [<ffffffff815dff36>] SyS_getsockopt+0xa6/0xc0 [49904.233267] [<ffffffff8171b9ae>] entry_SYSCALL_64_fastpath+0x12/0x71 The underlying issue is the very same as in commit b382c0865600 ("sock, diag: fix panic in sock_diag_put_filterinfo"), that is, native eBPF programs don't store an original program since this is only needed in cBPF ones. However, sk_get_filter() wasn't updated to test for this at the time when eBPF could be attached. Just throw an error to the user to indicate that eBPF cannot be dumped over this interface. That way, it can also be known that a program _is_ attached (as opposed to just return 0), and a different (future) method needs to be consulted for a dump. Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05openvswitch: Rename LABEL->LABELSJoe Stringer
Conntrack LABELS (plural) are exposed by conntrack; rename the OVS name for these to be consistent with conntrack. Fixes: c2ac667 "openvswitch: Allow matching on conntrack label" Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05net/unix: fix logic about sk_peek_offsetAndrey Vagin
Now send with MSG_PEEK can return data from multiple SKBs. Unfortunately we take into account the peek offset for each skb, that is wrong. We need to apply the peek offset only once. In addition, the peek offset should be used only if MSG_PEEK is set. Cc: "David S. Miller" <davem@davemloft.net> (maintainer:NETWORKING Cc: Eric Dumazet <edumazet@google.com> (commit_signer:1/14=7%) Cc: Aaron Conole <aconole@bytheb.org> Fixes: 9f389e35674f ("af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag") Signed-off-by: Andrey Vagin <avagin@openvz.org> Tested-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05act_mirred: always release tcf hashWANG Cong
Align with other tc actions. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05act_mirred: fix a race condition on mirred_listWANG Cong
After commit 1ce87720d456 ("net: sched: make cls_u32 lockless") we began to release tc actions in a RCU callback. However, mirred action relies on RTNL lock to protect the global mirred_list, therefore we could have a race condition between RCU callback and netdevice event, which caused a list corruption as reported by Vinson. Instead of relying on RTNL lock, introduce a spinlock to protect this list. Note, in non-bind case, it is still called with RTNL lock, therefore should disable BH too. Reported-by: Vinson Lee <vlee@twopensource.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05ipv4: fix reply_dst leakage on arp replyJiri Benc
There are cases when the created metadata reply is not used. Ensure the allocated memory is freed also in such cases. Fixes: 63d008a4e9ee ("ipv4: send arp replies to the correct tunnel") Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05inet: fix race in reqsk_queue_unlink()Eric Dumazet
reqsk_timer_handler() tests if icsk_accept_queue.listen_opt is NULL at its beginning. By the time it calls inet_csk_reqsk_queue_drop() and reqsk_queue_unlink(), listener might have been closed and inet_csk_listen_stop() had called reqsk_queue_yank_acceptq() which sets icsk_accept_queue.listen_opt to NULL We therefore need to correctly check listen_opt being NULL after holding syn_wait_lock for proper synchronization. Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer") Fixes: b357a364c57c ("inet: fix possible panic in reqsk_queue_unlink()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next Eric W. Biederman says: ==================== net: Pass net through ip fragmention This is the next installment of my work to pass struct net through the output path so the code does not need to guess how to figure out which network namespace it is in, and ultimately routes can have output devices in another network namespace. This round focuses on passing net through ip fragmentation which we seem to call from about everywhere. That is the main ip output paths, the bridge netfilter code, and openvswitch. This has to happend at once accross the tree as function pointers are involved. First some prep work is done, then ipv4 and ipv6 are converted and then temporary helper functions are removed. ==================== Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmitSowmini Varadhan
For the same reasons as commit 2f5338442425 ("tcp: allow splice() to build full TSO packets") and commit 35f9c09fe9c7 ("tcp: tcp_sendpages() should call tcp_push() once"), rds_tcp_xmit may have multiple pages to send, so use the MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to tcp_sendpage() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tuneSowmini Varadhan
Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K) clobbers efficient use of TSO because it inflates the size_goal that is computed in tcp_sendmsg/tcp_sendpage and skews packet latency, and the default values for these parameters actually results in significantly better performance. In request-response tests using rds-stress with a packet size of 100K with 16 threads (test parameters -q 100000 -a 256 -t16 -d16) between a single pair of IP addresses achieves a throughput of 6-8 Gbps. Without this patch, throughput maxes at 2-3 Gbps under equivalent conditions on these platforms. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05RDS: Use a single TCP socket for both send and receive.Sowmini Varadhan
Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection.") modified rds-tcp so that an incoming SYN would ignore an existing "client" TCP connection which had the local port set to the transient port. The motivation for ignoring the existing "client" connection in f711a6ae was to avoid race conditions and an endless duel of reconnect attempts triggered by a restart/abort of one of the nodes in the TCP connection. However, having separate sockets for active and passive sides is avoidable, and the simpler model of a single TCP socket for both send and receives of all RDS connections associated with that tcp socket makes for easier observability. We avoid the race conditions from f711a6ae by attempting reconnects in rds_conn_shutdown if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP. The c_outgoing bit is initialized in __rds_conn_create(). A side-effect of re-using the client rds_connection for an incoming SYN is the potential of encountering duelling SYNs, i.e., we have an outgoing RDS_CONN_CONNECTING socket when we get the incoming SYN. The logic to arbitrate this criss-crossing SYN exchange in rds_tcp_accept_one() has been modified to emulate the BGP state machine: the smaller IP address should back off from the connection attempt. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05tcp: restore fastopen operationsEric Dumazet
I accidentally cleared fastopenq.max_qlen in reqsk_queue_alloc() while max_qlen can be set before listen() is called, using TCP_FASTOPEN socket option for example. Fixes: 0536fcc039a8 ("tcp: prepare fastopen code for upcoming listener changes") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05net: sctp: avoid incorrect time_t useArnd Bergmann
We want to avoid using time_t in the kernel because of the y2038 overflow problem. The use in sctp is not for storing seconds at all, but instead uses microseconds and is passed as 32-bit on all machines. This patch changes the type to u32, which better fits the use. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: linux-sctp@vger.kernel.org Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05ipv6: use ktime_t for internal timestampsArnd Bergmann
The ipv6 mip6 implementation is one of only a few users of the skb_get_timestamp() function in the kernel, which is both unsafe on 32-bit architectures because of the 2038 overflow, and slightly less efficient than the skb_get_ktime() based approach. This converts the function call and the mip6_report_rate_limiter structure that stores the time stamp, eliminating all uses of timeval in the ipv6 code. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05nfnetlink: use y2038 safe timestampArnd Bergmann
The __build_packet_message function fills a nfulnl_msg_packet_timestamp structure that uses 64-bit seconds and is therefore y2038 safe, but it uses an intermediate 'struct timespec' which is not. This trivially changes the code to use 'struct timespec64' instead, to correct the result on 32-bit architectures. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: netfilter-devel@vger.kernel.org Cc: coreteam@netfilter.org Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05mac80211: use ktime_get_secondsArnd Bergmann
The mac80211 code uses ktime_get_ts to measure the connected time. As this uses monotonic time, it is y2038 safe on 32-bit systems, but we still want to deprecate the use of 'timespec' because most other users are broken. This changes the code to use ktime_get_seconds() instead, which avoids the timespec structure and is slightly more efficient. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: linux-wireless@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05ipv4: ICMP packet inspection for multipathPeter Nørlund
ICMP packets are inspected to let them route together with the flow they belong to, minimizing the chance that a problematic path will affect flows on other paths, and so that anycast environments can work with ECMP. Signed-off-by: Peter Nørlund <pch@ordbogen.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05ipv4: L3 hash-based multipathPeter Nørlund
Replaces the per-packet multipath with a hash-based multipath using source and destination address. Signed-off-by: Peter Nørlund <pch@ordbogen.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05tcp: avoid two atomic ops for syncookiesEric Dumazet
inet_reqsk_alloc() is used to allocate a temporary request in order to generate a SYNACK with a cookie. Then later, syncookie validation also uses a temporary request. These paths already took a reference on listener refcount, we can avoid a couple of atomic operations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05net: use sk_fullsock() in __netdev_pick_tx()Eric Dumazet
SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a sk_dst_cache pointer. Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>