summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2015-05-19nl802154: cleanup invalid argument handlingAlexander Aring
This patch cleanups the -EINVAL cases by combining them in one condition. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19bridge_netfilter: No ICMP packet on IPv4 fragmentation errorAndy Zhou
When bridge netfilter re-fragments an IP packet for output, all packets that can not be re-fragmented to their original input size should be silently discarded. However, current bridge netfilter output path generates an ICMP packet with 'size exceeded MTU' message for such packets, this is a bug. This patch refactors the ip_fragment() API to allow two separate use cases. The bridge netfilter user case will not send ICMP, the routing output will, as before. Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19IPv4: skip ICMP for bridge contrack users when defrag expiresAndy Zhou
users in [IP_DEFRAG_CONNTRACK_BRIDGE_IN, __IP_DEFRAG_CONNTRACK_BR_IN] should not ICMP message also. Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19ipv4: introduce frag_expire_skip_icmp()Andy Zhou
Improve readability of skip ICMP for de-fragmentation expiration logic. This change will also make the logic easier to maintain when the following patches in this series are applied. Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== pull request: bluetooth 2015-05-17 A couple more Bluetooth updates for 4.1: - New USB IDs for ath3k & btusb - Fix for remote name resolving during device discovery Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. Briefly speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and Serget Popovich, more incremental updates to make br_netfilter a better place from Florian Westphal, ARP support to the x_tables mark match / target from and context Zhang Chunyu and the addition of context to know that the x_tables runs through nft_compat. More specifically, they are: 1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik. 2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef. 3) Use skb->network_header to calculate the transport offset in ip_set_get_ip{4,6}_port(). From Alexander Drozdov. 4) Reduce memory consumption per element due to size miscalculation, this patch and follow up patches from Sergey Popovich. 5) Expand nomatch field from 1 bit to 8 bits to allow to simplify mtype_data_reset_flags(), also from Sergey. 6) Small clean for ipset macro trickery. 7) Fix error reporting when both ip_set_get_hostipaddr4() and ip_set_get_extensions() from per-set uadt functions. 8) Simplify IPSET_ATTR_PORT netlink attribute validation. 9) Introduce HOST_MASK instead of hardcoded 32 in ipset. 10) Return true/false instead of 0/1 in functions that return boolean in the ipset code. 11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute. 12) Allow to dereference from ext_*() ipset macros. 13) Get rid of incorrect definitions of HKEY_DATALEN. 14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match. 15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal. 16) Release nf_bridge_info after POSTROUTING since this is only needed from the physdev match, also from Florian. 17) Reduce size of ipset code by deinlining ip_set_put_extensions(), from Denys Vlasenko. 18) Oneliner to add ARP support to the x_tables mark match/target, from Zhang Chunyu. 19) Add context to know if the x_tables extension runs from nft_compat, to address minor problems with three existing extensions. 20) Correct return value in several seqfile *_show() functions in the netfilter tree, from Joe Perches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18RDS: Switch to generic logging helpersSagi Grimberg
Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-18xprtrdma, svcrdma: Switch to generic logging helpersSagi Grimberg
Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-18IB/Verbs: Use management helper rdma_cap_read_multi_sge()Michael Wang
Introduce helper rdma_cap_read_multi_sge() to help us check if the port of an IB device support RDMA Read Multiple Scatter-Gather Entries. Signed-off-by: Michael Wang <yun.wang@profitbricks.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Tested-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-18IB/Verbs: Reform IB-ulp xprtrdmaMichael Wang
Use raw management helpers to reform IB-ulp xprtrdma. Signed-off-by: Michael Wang <yun.wang@profitbricks.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Tested-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-18xfrm: optimise to search the inexact policy listLi RongQing
The policies are organized into list by priority ascent of policy, so it is unnecessary to continue to loop the policy if the priority of current looped police is larger than or equal priority which is from the policy_bydst list. This allows to match policy with ~0U priority in inexact list too. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-17netlink: Use random autobind roverHerbert Xu
Currently we use a global rover to select a port ID that is unique. This used to work consistently when it was protected with a global lock. However as we're now lockless, the global rover can exhibit pathological behaviour should multiple threads all stomp on it at the same time. Granted this will eventually resolve itself but the process is suboptimal. This patch replaces the global rover with a pseudorandom starting point to avoid this issue. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17tcp/ipv6: fix flow label setting in TIME_WAIT stateFlorent Fourcot
commit 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT") added the flow label in the last TCP packets. Unfortunately, it was not casted properly. This patch replace the buggy shift with be32_to_cpu/cpu_to_be32. Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17netns: make nsid_lock per netWANG Cong
The spinlock is used to protect netns_ids which is per net, so there is no need to use a global spinlock. Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17flow_dissector: remove bogus return in tipc sectionJiri Pirko
Fixes: 06635a35d13d42b9 ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.Samudrala, Sridhar
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump() - use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops. - add support for fdb obj in switchdev_port_obj_add/del/dump() - switch rocker to implement fdb ops via switchdev_ops v3: updated to sync with named union changes. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17tcp: halves tcp_mem[] limitsEric Dumazet
Allowing tcp to use ~19% of physical memory is way too much, and allowed bugs to be hidden. Add to this that some drivers use a full page per incoming frame, so real cost can be twice the advertized one. Reduce tcp_mem by 50 % as a first step to sanity. tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17tcp: allow one skb to be received per socket under memory pressureEric Dumazet
While testing tight tcp_mem settings, I found tcp sessions could be stuck because we do not allow even one skb to be received on them. By allowing one skb to be received, we introduce fairness and eventuallu force memory hogs to release their allocation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17tcp: fix behavior for epoll edge triggerEric Dumazet
Under memory pressure, tcp_sendmsg() can fail to queue a packet while no packet is present in write queue. If we return -EAGAIN with no packet in write queue, no ACK packet will ever come to raise EPOLLOUT. We need to allow one skb per TCP socket, and make sure that tcp sockets can release their forward allocations under pressure. This is a followup to commit 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory pressure") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17tcp: introduce tcp_under_memory_pressure()Eric Dumazet
Introduce an optimized version of sk_under_memory_pressure() for TCP. Our intent is to use it in fast paths. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17tcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule()Eric Dumazet
We plan to use sk_forced_wmem_schedule() in input path as well, so make it non static and rename it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17net: fix sk_mem_reclaim_partial()Eric Dumazet
sk_mem_reclaim_partial() goal is to ensure each socket has one SK_MEM_QUANTUM forward allocation. This is needed both for performance and better handling of memory pressure situations in follow up patches. SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes as some arches have 64KB pages. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17rtnl/bond: don't send rtnl msg for unregistered ifaceNicolas Dichtel
Before the patch, the command 'ip link add bond2 type bond mode 802.3ad' causes the kernel to send a rtnl message for the bond2 interface, with an ifindex 0. 'ip monitor' shows: 0: bond2: <BROADCAST,MULTICAST,MASTER> mtu 1500 state DOWN group default link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff 9: bond2@NONE: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN group default link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff [snip] The patch fixes the spotted bug by checking in bond driver if the interface is registered before calling the notifier chain. It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the future. Fixes: d4261e565000 ("bonding: create netlink event when bonding option is changed") CC: Jiri Pirko <jiri@resnulli.us> Reported-by: Julien Meunier <julien.meunier@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17net-packet: fix null pointer exception in rollover modeWillem de Bruijn
Rollover can be enabled as flag or mode. Allocate state in both cases. This solves a NULL pointer exception in fanout_demux_rollover on referencing po->rollover if using mode rollover. Also make sure that in rollover mode each silo is tried (contrary to rollover flag, where the main socket is excluded after an initial try_self). Tested: Passes tools/testing/net/psock_fanout.c, which tests both modes and flag. My previous tests were limited to bench_rollover, which only stresses the flag. The test now completes safely. it still gives an error for mode rollover, because it does not expect the new headroom (ROOM_NORMAL) requirement. I will send a separate patch to the test. Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state") Signed-off-by: Willem de Bruijn <willemb@google.com> ---- I should have run this test and caught this before submission, of course. Apologies for the oversight. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17netfilter: synproxy: fix sparse errorsEric Dumazet
Fix verbose sparse errors : make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/ipt_SYNPROXY.o Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17ipip: fix one sparse errorEric Dumazet
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/ipip.o CHECK net/ipv4/ipip.c net/ipv4/ipip.c:254:27: warning: incorrect type in assignment (different base types) net/ipv4/ipip.c:254:27: expected restricted __be32 [addressable] [usertype] o_key net/ipv4/ipip.c:254:27: got restricted __be16 [addressable] [usertype] i_flags Fixes: 3b7b514f44bf ("ipip: fix a regression in ioctl") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17netfilter: Use correct return for seq_show functionsJoe Perches
Using seq_has_overflowed doesn't produce the right return value. Either 0 or -1 is, but 0 is much more common and works well when seq allocation retries. I believe this doesn't matter as the initial allocation is always sufficient, this is just a correctness patch. Miscellanea: o Don't use strlen, use *ptr to determine if a string should be emitted like all the other tests here o Delete unnecessary return statements Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-16netlink: Reset portid after netlink_insert failureHerbert Xu
The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate nl_sk_hash_lock") breaks the autobind retry mechanism because it doesn't reset portid after a failed netlink_insert. This means that should autobind fail the first time around, then the socket will be stuck in limbo as it can never be bound again since it already has a non-zero portid. Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix a leak in IPVS, the sysctl table is not released accordingly when destroying a netns, patch from Tommi Rantala. 2) Fix a build error when TPROXY and socket are built-in but IPv6 defrag is compiled as module, from Florian Westphal. 3) Fix TCP tracket wrt. RFC5961 challenge ACK when in LAST_ACK state, patch from Jesper Dangaard Brouer. 4) Fix a bogus WARN_ON() in nf_tables when deleting a set element that stores a map, from Mirek Kratochvil. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-15netfilter: nf_tables: fix bogus warning in nft_data_uninit()Mirek Kratochvil
The values 0x00000000-0xfffffeff are reserved for userspace datatype. When, deleting set elements with maps, a bogus warning is triggered. WARNING: CPU: 0 PID: 11133 at net/netfilter/nf_tables_api.c:4481 nft_data_uninit+0x35/0x40 [nf_tables]() This fixes the check accordingly to enum definition in include/linux/netfilter/nf_tables.h Fixes: https://bugzilla.netfilter.org/show_bug.cgi?id=1013 Signed-off-by: Mirek Kratochvil <exa.exa@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-15conntrack: RFC5961 challenge ACK confuse conntrack LAST-ACK transitionJesper Dangaard Brouer
In compliance with RFC5961, the network stack send challenge ACK in response to spurious SYN packets, since commit 0c228e833c88 ("tcp: Restore RFC5961-compliant behavior for SYN packets"). This pose a problem for netfilter conntrack in state LAST_ACK, because this challenge ACK is (falsely) seen as ACKing last FIN, causing a false state transition (into TIME_WAIT). The challenge ACK is hard to distinguish from real last ACK. Thus, solution introduce a flag that tracks the potential for seeing a challenge ACK, in case a SYN packet is let through and current state is LAST_ACK. When conntrack transition LAST_ACK to TIME_WAIT happens, this flag is used for determining if we are expecting a challenge ACK. Scapy based reproducer script avail here: https://github.com/netoptimizer/network-testing/blob/master/scapy/tcp_hacks_3WHS_LAST_ACK.py Fixes: 0c228e833c88 ("tcp: Restore RFC5961-compliant behavior for SYN packets") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-15netfilter: avoid build error if TPROXY/SOCKET=y && NF_DEFRAG_IPV6=mFlorian Westphal
With TPROXY=y but DEFRAG_IPV6=m we get build failure: net/built-in.o: In function `tproxy_tg_init': net/netfilter/xt_TPROXY.c:588: undefined reference to `nf_defrag_ipv6_enable' If DEFRAG_IPV6 is modular, TPROXY must be too. (or both must be builtin). This enforces =m for both. Reported-and-tested-by: Liu Hua <liusdu@126.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-15netfilter: x_tables: add context to know if extension runs from nft_compatPablo Neira Ayuso
Currently, we have four xtables extensions that cannot be used from the xt over nft compat layer. The problem is that they need real access to the full blown xt_entry to validate that the rule comes with the right dependencies. This check was introduced to overcome the lack of sufficient userspace dependency validation in iptables. To resolve this problem, this patch introduces a new field to the xt_tgchk_param structure that tell us if the extension is run from nft_compat context. The three affected extensions are: 1) CLUSTERIP, this target has been superseded by xt_cluster. So just bail out by returning -EINVAL. 2) TCPMSS. Relax the checking when used from nft_compat. If used with the wrong configuration, it will corrupt !syn packets by adding TCP MSS option. 3) ebt_stp. Relax the check to make sure it uses the reserved destination MAC address for STP. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
2015-05-15Bluetooth: Fix calls to __hci_cmd_sync()Frederic Danis
Remove test of command reply status as it is already performed by __hci_cmd_sync(). __hci_cmd_sync_ev() function already returns an error if it got a non-zero status either through a Command Complete or a Command Status event. For both of these events the status is collected up in the event handlers called by hci_event_packet() and then passed as the second parameter to req_complete_skb(). The req_complete_skb() callback in turn is hci_req_sync_complete() for __hci_cmd_sync_ev() which stores the status in hdev->req_result. The hdev->req_result is then further converted through bt_to_errno() back in __hci_cmd_sync_ev(). Signed-off-by: Frederic Danis <frederic.danis@linux.intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-14rename RTNH_F_EXTERNAL to RTNH_F_OFFLOADRoopa Prabhu
RTNH_F_EXTERNAL today is printed as "offload" in iproute2 output. This patch renames the flag to be consistent with what the user sees. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14net: core: set qdisc pkt len before tc_classifyFlorian Westphal
commit d2788d34885d4ce5ba ("net: sched: further simplify handle_ing") removed the call to qdisc_enqueue_root(). However, after this removal we no longer set qdisc pkt length. This breaks traffic policing on ingress. This is the minimum fix: set qdisc pkt length before tc_classify. Only setting the length does remove support for 'stab' on ingress, but as Alexei pointed out: "Though it was allowed to add qdisc_size_table to ingress, it's useless. Nothing takes advantage of recomputed qdisc_pkt_len". Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned that would result in qdisc_pkt_len_init to no longer get inlined due to the additional 2nd call site. ingress policing is rare and GRO doesn't really work that well with police on ingress, as we see packets > mtu and drop skbs that -- without aggregation -- would still have fitted the policier budget. Thus to have reliable/smooth ingress policing GRO has to be turned off. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Fixes: d2788d34885d ("net: sched: further simplify handle_ing") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14netns: fix unbalanced spin_lock on errorNicolas Dichtel
Unlock was missing on error path. Fixes: 95f38411df05 ("netns: use a spin_lock to protect nsid management") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14ip_tunnel: Report Rx dropped in ip_tunnel_get_stats64Alexander Duyck
The rx_dropped stat wasn't being reported when ip_tunnel_get_stats64 was called. This was leading to some confusing results in my debug as I was seeing rx_errors increment but no other value which pointed me toward the type of error being seen. This change corrects that by using netdev_stats_to_stats64 to copy all available dev stats instead of just the few that were hand picked. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14ipv6: Fix udp checksums with raw socketsVlad Yasevich
It was reported that trancerout6 would cause a kernel to crash when trying to compute checksums on raw UDP packets. The cause was the check in __ip6_append_data that would attempt to use partial checksums on the packet. However, raw sockets do not initialize partial checksum fields so partial checksums can't be used. Solve this the same way IPv4 does it. raw sockets pass transhdrlen value of 0 to ip_append_data which causes the checksum to be computed in software. Use the same check in ip6_append_data (check transhdrlen). Reported-by: Wolfgang Walter <linux@stwm.de> CC: Wolfgang Walter <linux@stwm.de> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14netlink: move nl_table in read_mostly sectionEric Dumazet
netlink sockets creation and deletion heavily modify nl_table_users and nl_table_lock. If nl_table is sharing one cache line with one of them, netlink performance is really bad on SMP. ffffffff81ff5f00 B nl_table ffffffff81ff5f0c b nl_table_users Putting nl_table in read_mostly section increased performance of my open/delete netlink sockets test by about 80 % This came up while diagnosing a getaddrinfo() problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14packet: fix warnings in rollover lock contentionWillem de Bruijn
Avoid two xchg calls whose return values were unused, causing a warning on some architectures. The relevant variable is a hint and read without mutual exclusion. This fix makes all writers hold the receive_queue lock. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14tipc: use sock_create_kern interface to create kernel socketYing Xue
After commit eeb1bd5c40ed ("net: Add a struct net parameter to sock_create_kern"), we should use sock_create_kern() to create kernel socket as the interface doesn't reference count struct net any more. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14cls_flower: Fix compile errorBrian Haley
Fix compile error in net/sched/cls_flower.c net/sched/cls_flower.c: In function ‘fl_set_key’: net/sched/cls_flower.c:240:3: error: implicit declaration of function ‘tcf_change_indev’ [-Werror=implicit-function-declaration] err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]); Introduced in 77b9900ef53ae Fixes: 77b9900ef53ae ("tc: introduce Flower classifier") Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14tipc: add packet sequence number at instant of transmissionJon Paul Maloy
Currently, the packet sequence number is updated and added to each packet at the moment a packet is added to the link backlog queue. This is wasteful, since it forces the code to traverse the send packet list packet by packet when adding them to the backlog queue. It would be better to just splice the whole packet list into the backlog queue when that is the right action to do. In this commit, we do this change. Also, since the sequence numbers cannot now be assigned to the packets at the moment they are added the backlog queue, we do instead calculate and add them at the moment of transmission, when the backlog queue has to be traversed anyway. We do this in the function tipc_link_push_packet(). Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14tipc: improve link congestion algorithmJon Paul Maloy
The link congestion algorithm used until now implies two problems. - It is too generous towards lower-level messages in situations of high load by giving "absolute" bandwidth guarantees to the different priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted 20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the available bandwidth. But, in the absence of higher level traffic, the ratio between two distinct levels becomes unreasonable. E.g. if there is only LOW and MEDIUM traffic on a system, the former is guaranteed 1/3 of the bandwidth, and the latter 2/3. This again means that if there is e.g. one LOW user and 10 MEDIUM users, the former will have 33.3% of the bandwidth, and the others will have to compete for the remainder, i.e. each will end up with 6.7% of the capacity. - Packets of type MSG_BUNDLER are created at SYSTEM importance level, but only after the packets bundled into it have passed the congestion test for their own respective levels. Since bundled packets don't result in incrementing the level counter for their own importance, only occasionally for the SYSTEM level counter, they do in practice obtain SYSTEM level importance. Hence, the current implementation provides a gap in the congestion algorithm that in the worst case may lead to a link reset. We now refine the congestion algorithm as follows: - A message is accepted to the link backlog only if its own level counter, and all superior level counters, permit it. - The importance of a created bundle packet is set according to its contents. A bundle packet created from messges at levels LOW to CRITICAL is given importance level CRITICAL, while a bundle created from a SYSTEM level message is given importance SYSTEM. In the latter case only subsequent SYSTEM level messages are allowed to be bundled into it. This solves the first problem described above, by making the bandwidth guarantee relative to the total number of users at all levels; only the upper limit for each level remains absolute. In the example described above, the single LOW user would use 1/11th of the bandwidth, the same as each of the ten MEDIUM users, but he still has the same guarantee against starvation as the latter ones. The fix also solves the second problem. If the CRITICAL level is filled up by bundle packets of that level, no lower level packets will be accepted any more. Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14tipc: simplify link supervision checkpointingJon Paul Maloy
We change the sequence number checkpointing that is performed by the timer in order to discover if the peer is active. Currently, we store a checkpoint of the next expected sequence number "rcv_nxt" at each timer expiration, and compare it to the current expected number at next timeout expiration. Instead, we now use the already existing field "silent_intv_cnt" for this task. We step the counter at each timeout expiration, and zero it at each valid received packet. If no valid packet has been received from the peer after "abort_limit" number of silent timer intervals, the link is declared faulty and reset. We also remove the multiple instances of timer activation from inside the FSM function "link_state_event()", and now do it at only one place; at the end of the timer function itself. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14tipc: rename fields in struct tipc_linkJon Paul Maloy
We rename some fields in struct tipc_link, in order to give them more descriptive names: next_in_no -> rcv_nxt next_out_no-> snd_nxt fsm_msg_cnt-> silent_intv_cnt cont_intv -> keepalive_intv last_retransmitted -> last_retransm There are no functional changes in this commit. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14tipc: simplify packet sequence number handlingJon Paul Maloy
Although the sequence number in the TIPC protocol is 16 bits, we have until now stored it internally as an unsigned 32 bits integer. We got around this by always doing explicit modulo-65535 operations whenever we need to access a sequence number. We now make the incoming and outgoing sequence numbers to unsigned 16-bit integers, and remove the modulo operations where applicable. We also move the arithmetic inline functions for 16 bit integers to core.h, and the function buf_seqno() to msg.h, so they can easily be accessed from anywhere in the code. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14tipc: simplify include dependenciesJon Paul Maloy
When we try to add new inline functions in the code, we sometimes run into circular include dependencies. The main problem is that the file core.h, which really should be at the root of the dependency chain, instead is a leaf. I.e., core.h includes a number of header files that themselves should be allowed to include core.h. In reality this is unnecessary, because core.h does not need to know the full signature of any of the structs it refers to, only their type declaration. In this commit, we remove all dependencies from core.h towards any other tipc header file. As a consequence of this change, we can now move the function tipc_own_addr(net) from addr.c to addr.h, and make it inline. There are no functional changes in this commit. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14tipc: simplify link timer handlingJon Paul Maloy
Prior to this commit, the link timer has been running at a "continuity interval" of configured link tolerance/4. When a timer wakes up and discovers that there has been no sign of life from the peer during the previous interval, it divides its own timer interval by another factor four, and starts sending one probe per new interval. When the configured link tolerance time has passed without answer, i.e. after 16 unacked probes, the link is declared faulty and reset. This is unnecessary complex. It is sufficient to continue with the original continuity interval, and instead reset the link after four missed probe responses. This makes the timer handling in the link simpler, and opens up for some planned later changes in this area. This commit implements this change. Reviewed-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>