summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2013-12-19sch_cbq: remove unnecessary null pointer checkYang Yingliang
It already has a NULL pointer check of rtab in qdisc_put_rtab(). Remove the check outside of qdisc_put_rtab(). Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19act_police: remove unnecessary null pointer checkYang Yingliang
It already has a NULL pointer check of rtab in qdisc_put_rtab(). Remove the check outside of qdisc_put_rtab(). Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19net: inet_diag: zero out uninitialized idiag_{src,dst} fieldsDaniel Borkmann
Jakub reported while working with nlmon netlink sniffer that parts of the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6. That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3]. In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab] memory through this. At least, in udp_dump_one(), we allocate a skb in ... rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL); ... and then pass that to inet_sk_diag_fill() that puts the whole struct inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0], r->id.idiag_dst[0] and leave the rest untouched: r->id.idiag_src[0] = inet->inet_rcv_saddr; r->id.idiag_dst[0] = inet->inet_daddr; struct inet_diag_msg embeds struct inet_diag_sockid that is correctly / fully filled out in IPv6 case, but for IPv4 not. So just zero them out by using plain memset (for this little amount of bytes it's probably not worth the extra check for idiag_family == AF_INET). Similarly, fix also other places where we fill that out. Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdiscTerry Lam
This patch implements the first size-based qdisc that attempts to differentiate between small flows and heavy-hitters. The goal is to catch the heavy-hitters and move them to a separate queue with less priority so that bulk traffic does not affect the latency of critical traffic. Currently "less priority" means less weight (2:1 in particular) in a Weighted Deficit Round Robin (WDRR) scheduler. In essence, this patch addresses the "delay-bloat" problem due to bloated buffers. In some systems, large queues may be necessary for obtaining CPU efficiency, or due to the presence of unresponsive traffic like UDP, or just a large number of connections with each having a small amount of outstanding traffic. In these circumstances, HHF aims to reduce the HoL blocking for latency sensitive traffic, while not impacting the queues built up by bulk traffic. HHF can also be used in conjunction with other AQM mechanisms such as CoDel. To capture heavy-hitters, we implement the "multi-stage filter" design in the following paper: C. Estan and G. Varghese, "New Directions in Traffic Measurement and Accounting", in ACM SIGCOMM, 2002. Some configurable qdisc settings through 'tc': - hhf_reset_timeout: period to reset counter values in the multi-stage filter (default 40ms) - hhf_admit_bytes: threshold to classify heavy-hitters (default 128KB) - hhf_evict_timeout: threshold to evict idle heavy-hitters (default 1s) - hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for non-heavy-hitters (default 2) - hh_flows_limit: max number of heavy-hitter flow entries (default 2048) Note that the ratio between hhf_admit_bytes and hhf_reset_timeout reflects the bandwidth of heavy-hitters that we attempt to capture (25Mbps with the above default settings). The false negative rate (heavy-hitter flows getting away unclassified) is zero by the design of the multi-stage filter algorithm. With 100 heavy-hitter flows, using four hashes and 4000 counters yields a false positive rate (non-heavy-hitters mistakenly classified as heavy-hitters) of less than 1e-4. Signed-off-by: Terry Lam <vtlam@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19mac80211: Add support for QoS mappingKyeyoon Park
Implement set_qos_map() handler for mac80211 to enable QoS mapping functionality. Signed-off-by: Kyeyoon Park <kyeyoonp@qca.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-19cfg80211: Add support for QoS mappingKyeyoon Park
This allows QoS mapping from external networks to be implemented as defined in IEEE Std 802.11-2012, 10.24.9. APs can use this to advertise DSCP ranges and exceptions for mapping frames to a specific UP over Wi-Fi. The payload of the QoS Map Set element (IEEE Std 802.11-2012, 8.4.2.97) is sent to the driver through the new NL80211_ATTR_QOS_MAP attribute to configure the local behavior either on the AP (based on local configuration) or on a station (based on information received from the AP). Signed-off-by: Kyeyoon Park <kyeyoonp@qca.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-19treewide: Fix typos in printkMasanari Iida
Correct spelling typo in various part of kernel Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-12-19Merge branch 'master' into for-nextJiri Kosina
Sync with Linus' tree to be able to apply fixes on top of newer things in tree (efi-stub). Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-12-19nl80211: support vendor-specific eventsJohannes Berg
In addition to vendor-specific commands, also support vendor-specific events. These must be registered with cfg80211 before they can be used. They're also advertised in nl80211 in the wiphy information so that userspace knows can be expected. The events themselves are sent on a new multicast group called "vendor". Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-19mac80211: add helper functions for tracking P2P NoA stateFelix Fietkau
Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-19mac80211: fix iflist_mtx/mtx locking in radar detectionJohannes Berg
The scan code creates an iflist_mtx -> mtx locking dependency, and a few other places, notably radar detection, were creating the opposite dependency, causing lockdep to complain. As scan and radar detection are mutually exclusive, the deadlock can't really happen in practice, but it's still bad form. A similar issue exists in the monitor mode code, but this is only used by channel-context drivers right now and those have to have hardware scan, so that also can't happen. Still, fix these issues by making some of the channel context code require the mtx to be held rather than acquiring it, thus allowing the monitor/radar callers to keep the iflist_mtx->mtx lock ordering. While at it, also fix access to the local->scanning variable in the radar code, and document that radar_detect_enabled is now properly protected by the mtx. All this would now introduce an ABBA deadlock between the DFS work cancelling and local->mtx, so change the locking there a bit to not need to use cancel_delayed_work_sync() but be able to just use cancel_delayed_work(). The work is also safely stopped/removed when the interface is stopped, so no extra changes are needed. Reported-by: Kalle Valo <kvalo@qca.qualcomm.com> Tested-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-19mac80211: remove unnecessary iflist_mtx lockingJohannes Berg
The radar detection code changed a few times, and due to the changes some iflist_mtx locking stayed in that isn't actually necessary - remove it. One version of the code needed it because an AP interface's VLAN list was changed to use this, but then we moved the list handling outside of the chanctx handling and thus the locking was no longer needed. Tested-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-19mac80211: align struct ps_data.tim to unsigned longJoe Perches
Its address is used as an unsigned long *, so make sure that the tim u8 array is properly aligned. Signed-off-by: Joe Perches <joe@perches.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-18ipv6: sit: update mtu check to take care of gso packetsEric Dumazet
While testing my changes for TSO support in SIT devices, I was using sit0 tunnel which appears to include nopmtudisc flag. But using : ip tun add sittun mode sit remote $REMOTE_IPV4 local $LOCAL_IPV4 \ dev $IFACE We get a tunnel which rejects too long packets because of the mtu check which is not yet GSO aware. erd:~# ip tunnel sittun: ipv6/ip remote 10.246.17.84 local 10.246.17.83 ttl inherit 6rd-prefix 2002::/16 sit0: ipv6/ip remote any local any ttl 64 nopmtudisc 6rd-prefix 2002::/16 This patch is based on an excellent report from Michal Shmidt. In the future, we probably want to extend the MTU check to do the right thing for GSO packets... Fixes: ("61c1db7fae21 ipv6: sit: add GSO/TSO support") Reported-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18bridge: spelling fixestanxiaojun
Fix spelling errors in bridge driver. Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18ipv6: pmtudisc setting not respected with UFO/CORKHannes Frederic Sowa
Sockets marked with IPV6_PMTUDISC_PROBE (or later IPV6_PMTUDISC_INTERFACE) don't respect this setting when the outgoing interface supports UFO. We had the same problem in IPv4, which was fixed in commit daba287b299ec7a2c61ae3a714920e90e8396ad5 ("ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK"). Also IPV6_DONTFRAG mode did not care about already corked data, thus it may generate a fragmented frame even if this socket option was specified. It also did not care about the length of the ipv6 header and possible options. In the error path allow the user to receive the pmtu notifications via both, rxpmtu method or error queue. The user may opted in for both, so deliver the notification to both error handlers (the handlers check if the error needs to be enqueued). Also report back consistent pmtu values when sending on an already cork-appended socket. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18ip_gre: fix msg_name parsing for recvfrom/recvmsgTimo Teräs
ipgre_header_parse() needs to parse the tunnel's ip header and it uses mac_header to locate the iphdr. This got broken when gre tunneling was refactored as mac_header is no longer updated to point to iphdr. Introduce skb_pop_mac_header() helper to do the mac_header assignment and use it in ipgre_rcv() to fix msg_name parsing. Bug introduced in commit c54419321455 (GRE: Refactor GRE tunneling code.) Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18ipv6: support IPV6_PMTU_INTERFACE on socketsHannes Frederic Sowa
IPV6_PMTU_INTERFACE is the same as IPV6_PMTU_PROBE for ipv6. Add it nontheless for symmetry with IPv4 sockets. Also drop incoming MTU information if this mode is enabled. The additional bit in ipv6_pinfo just eats in the padding behind the bitfield. There are no changes to the layout of the struct at all. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18ipv4: new ip_no_pmtu_disc mode to always discard incoming frag needed msgsHannes Frederic Sowa
This new mode discards all incoming fragmentation-needed notifications as I guess was originally intended with this knob. To not break backward compatibility too much, I only added a special case for mode 2 in the receiving path. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18inet: make no_pmtu_disc per namespace and kill ipv4_configHannes Frederic Sowa
The other field in ipv4_config, log_martians, was converted to a per-interface setting, so we can just remove the whole structure. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/intel/i40e/i40e_main.c drivers/net/macvtap.c Both minor merge hassles, simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18Merge branch 'for-upstream' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
2013-12-18net_sched: convert tcf_proto_ops to use struct list_headWANG Cong
We don't need to maintain our own singly linked list code. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18net_sched: convert tc_action_ops to use struct list_headWANG Cong
We don't need to maintain our own singly linked list code. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18net_sched: convert tcf_hashinfo to hlist and use spinlockWANG Cong
So that we don't need to play with singly linked list, and since the code is not on hot path, we can use spinlock instead of rwlock. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18net_sched: init struct tcf_hashinfo at register timeWANG Cong
It looks weird to store the lock out of the struct but still points to a static variable. Just move them into the struct. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18net_sched: cls: refactor out struct tcf_ext_mapWANG Cong
These information can be saved in tcf_exts, and this will simplify the code. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18net_sched: act: use standard struct list_headWANG Cong
Currently actions are chained by a singly linked list, therefore it is a bit hard to add and remove a specific entry. Convert it to struct list_head so that in the latter patch we can remove an action without finding its head. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18net_sched: remove get_stats from tc_action_opsWANG Cong
It is not used. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18mac80211: make ieee80211_recalc_radar_chanctx staticJohannes Berg
The function is only used in one file, so move it up a bit to avoid forward declarations and make it static. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-18mac80211: fix checkpatch errorsWeilong Chen
Fix a number of different checkpatch errors. Signed-off-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-12-18packet: deliver VLAN TPID to userspaceAtzm Watanabe
This enables userspace to get VLAN TPID as well as the VLAN TCI. Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18packet: fill the gap of TPACKET_ALIGNMENT with zerosAtzm Watanabe
struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT. Explicitly defining and zeroing the gap of this makes additional changes easier. Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18packet: make aligned size of struct tpacket{2,3}_hdr clearAtzm Watanabe
struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT. We may add members to them until current aligned size without forcing userspace to call getsockopt(..., PACKET_HDRLEN, ...). Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: allow netdev_all_upper_get_next_dev_rcu with rtnl lock heldJohn Fastabend
It is useful to be able to walk all upper devices when bringing a device online where the RTNL lock is held. In this case it is safe to walk the all_adj_list because the RTNL lock is used to protect the write side as well. This patch adds a check to see if the rtnl lock is held before throwing a warning in netdev_all_upper_get_next_dev_rcu(). Also because we now have a call site for lockdep_rtnl_is_held() outside COFIG_LOCK_PROVING an inline definition returning 1 is needed. Similar to the rcu_read_lock_is_held(). Fixes: 2a47fa45d4df ("ixgbe: enable l2 forwarding acceleration for macvlans") CC: Veaceslav Falico <vfalico@redhat.com> Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-12-17netfilter: nfnetlink_log: unset nf_loggers for netns when unloading moduleGao feng
Steven Rostedt and Arnaldo Carvalho de Melo reported a panic when access the files /proc/sys/net/netfilter/nf_log/*. This problem will occur when we do: echo nfnetlink_log > /proc/sys/net/netfilter/nf_log/any_file rmmod nfnetlink_log and then access the files. Since the nf_loggers of netns hasn't been unset, it will point to the memory that has been freed. This bug is introduced by commit 9368a53c ("netfilter: nfnetlink_log: add net namespace support for nfnetlink_log"). [17261.822047] BUG: unable to handle kernel paging request at ffffffffa0d49090 [17261.822056] IP: [<ffffffff8157aba0>] nf_log_proc_dostring+0xf0/0x1d0 [...] [17261.822226] Call Trace: [17261.822235] [<ffffffff81297b98>] ? security_capable+0x18/0x20 [17261.822240] [<ffffffff8106fa09>] ? ns_capable+0x29/0x50 [17261.822247] [<ffffffff8163d25f>] ? net_ctl_permissions+0x1f/0x90 [17261.822254] [<ffffffff81216613>] proc_sys_call_handler+0xb3/0xc0 [17261.822258] [<ffffffff81216651>] proc_sys_read+0x11/0x20 [17261.822265] [<ffffffff811a80de>] vfs_read+0x9e/0x170 [17261.822270] [<ffffffff811a8c09>] SyS_read+0x49/0xa0 [17261.822276] [<ffffffff810e6496>] ? __audit_syscall_exit+0x1f6/0x2a0 [17261.822283] [<ffffffff81656e99>] system_call_fastpath+0x16/0x1b [17261.822285] Code: cc 81 4d 63 e4 4c 89 45 88 48 89 4d 90 e8 19 03 0d 00 4b 8b 84 e5 28 08 00 00 48 8b 4d 90 4c 8b 45 88 48 85 c0 0f 84 a8 00 00 00 <48> 8b 40 10 48 89 43 08 48 89 df 4c 89 f2 31 f6 e8 4b 35 af ff [17261.822329] RIP [<ffffffff8157aba0>] nf_log_proc_dostring+0xf0/0x1d0 [17261.822334] RSP <ffff880274d3fe28> [17261.822336] CR2: ffffffffa0d49090 [17261.822340] ---[ end trace a14ce54c0897a90d ]--- Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-12-17net: Add utility function to copy skb hashTom Herbert
Adds skb_copy_hash to copy rxhash and l4_rxhash from one skb to another. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: Add utility functions to clear rxhashTom Herbert
In several places 'skb->rxhash = 0' is being done to clear the rxhash value in an skb. This does not clear l4_rxhash which could still be set so that the rxhash wouldn't be recalculated on subsequent call to skb_get_rxhash. This patch adds an explict function to clear all the rxhash related information in the skb properly. skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as l4_rxhash. Fixed up places where 'skb->rxhash = 0' was being called. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: Change skb_get_rxhash to skb_get_hashTom Herbert
Changing name of function as part of making the hash in skbuff to be generic property, not just for receive path. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net/hsr: using kfree_rcu() to simplify the codeWei Yongjun
The callback function of call_rcu() just calls a kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Arvid Brodin <arvid.brodin@alten.se> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17neigh: Netlink notification for administrative NUD state changeBob Gilligan
The neighbour code sends up an RTM_NEWNEIGH netlink notification if the NUD state of a neighbour cache entry is changed by a timer (e.g. from REACHABLE to STALE), even if the lladdr of the entry has not changed. But an administrative change to the the NUD state of a neighbour cache entry that does not change the lladdr (e.g. via "ip -4 neigh change ... nud ...") does not trigger a netlink notification. This means that netlink listeners will not hear about administrative NUD state changes such as from a resolved state to PERMANENT. This patch changes the neighbor code to generate an RTM_NEWNEIGH message when the NUD state of an entry is changed administratively. Signed-off-by: Bob Gilligan <gilligan@aristanetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17pkt_sched: fq: more robust memory allocationEric Dumazet
This patch brings NUMA support and automatic fallback to vmalloc() in case kmalloc() failed to allocate FQ hash table. NUMA support depends on XPS being setup for the device before qdisc allocation. After a XPS change, it might be worth creating qdisc hierarchy again. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17tcp: refine TSO splitsEric Dumazet
While investigating performance problems on small RPC workloads, I noticed linux TCP stack was always splitting the last TSO skb into two parts (skbs). One being a multiple of MSS, and a small one with the Push flag. This split is done even if TCP_NODELAY is set, or if no small packet is in flight. Example with request/response of 4K/4K IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001> IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001> IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001> IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593> IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593> IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593> IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001> IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001> IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001> We can avoid this split by including the Nagle tests at the right place. Note : If some NIC had trouble sending TSO packets with a partial last segment, we would have hit the problem in GRO/forwarding workload already. tcp_minshall_update() is moved to tcp_output.c and is updated as we might feed a TSO packet with a partial last segment. This patch tremendously improves performance, as the traffic now looks like : IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685> IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685> IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277> IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277> IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686> IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686> IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279> IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279> IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687> IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687> IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280> IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280> Before : lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K 280774 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K': 205719.049006 task-clock # 9.278 CPUs utilized 8,449,968 context-switches # 0.041 M/sec 1,935,997 CPU-migrations # 0.009 M/sec 160,541 page-faults # 0.780 K/sec 548,478,722,290 cycles # 2.666 GHz [83.20%] 455,240,670,857 stalled-cycles-frontend # 83.00% frontend cycles idle [83.48%] 272,881,454,275 stalled-cycles-backend # 49.75% backend cycles idle [66.73%] 166,091,460,030 instructions # 0.30 insns per cycle # 2.74 stalled cycles per insn [83.39%] 29,150,229,399 branches # 141.699 M/sec [83.30%] 1,943,814,026 branch-misses # 6.67% of all branches [83.32%] 22.173517844 seconds time elapsed lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets" IpOutRequests 16851063 0.0 IpExtOutOctets 23878580777 0.0 After patch : lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K 280877 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K': 107496.071918 task-clock # 4.847 CPUs utilized 5,635,458 context-switches # 0.052 M/sec 1,374,707 CPU-migrations # 0.013 M/sec 160,920 page-faults # 0.001 M/sec 281,500,010,924 cycles # 2.619 GHz [83.28%] 228,865,069,307 stalled-cycles-frontend # 81.30% frontend cycles idle [83.38%] 142,462,742,658 stalled-cycles-backend # 50.61% backend cycles idle [66.81%] 95,227,712,566 instructions # 0.34 insns per cycle # 2.40 stalled cycles per insn [83.43%] 16,209,868,171 branches # 150.795 M/sec [83.20%] 874,252,952 branch-misses # 5.39% of all branches [83.37%] 22.175821286 seconds time elapsed lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets" IpOutRequests 11239428 0.0 IpExtOutOctets 23595191035 0.0 Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher : 2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet scheduler. Many thanks to Neal for review and ideas. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Van Jacobson <vanj@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: remove dead code for add/del multiplestephen hemminger
These function to manipulate multiple addresses are not used anywhere in current net-next tree. Some out of tree code maybe using these but too bad; they should submit their code upstream.. Also, make __hw_addr_flush local since only used by dev_addr_lists.c Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17Merge branch 'for-davem' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== Please pull this batch of updates for the 3.14 stream... For the Bluetooth bits, Gustavo says: "This is the first batch of patches intended for 3.14. There is nothing big here. Most of the code are refactors, clean up, small fixes, plus some new device id support." And... "More patches to 3.14. Here we have the support for Low Energy Connection Oriented Channels (LE CoC). Basically, as the name says, this adds supports for connection oriented channels in the same way we already have them for BR/EDR connections so profiles/protocols that work on top of BR/EDR can now work on LE plus a plenty of new possibilities for LE." For the ath10k bits, Kalle says: "Janusz and Marek implemented DFS support to ath10k, but the code is not enabled yet due to missing cfg80211/mac80211 patches (it will be enabled in the next pull request). Michal did some device reset fixes and made it possible for ath10k to share an interrupt with another device. And lots of smaller fixes from different people." For the iwlwifi bits, Emmanuel says: "I have here a big rework of the rate control by Eyal. This is obviously the biggest part of this batch. I also have enhancement of protection flags by Avri and a few bits for WoWLAN by Eliad and Luca. Johannes cleans up the debugfs plus a few fixes. I provided a few things for Bluetooth coexistence. Besides this we have an implementation for low priority scan." Along with all that, there are big batches of updates to mwifiex and ath9k, Jeff Kirsher's FSF address fix patches, and a handful of other bits here and there. Please let me know if there are problems! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains two Netfilter fixes for your net tree, they are: * Fix endianness in nft_reject, the NFTA_REJECT_TYPE netlink attributes was not converted to network byte order as needed by all nfnetlink subsystems, from Eric Leblond. * Restrict SYNPROXY target to INPUT and FORWARD chains, this avoid a possible crash due to misconfigurations, from Patrick McHardy. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: unix: allow bind to fail on mutex lockSasha Levin
This is similar to the set_peek_off patch where calling bind while the socket is stuck in unix_dgram_recvmsg() will block and cause a hung task spew after a while. This is also the last place that did a straightforward mutex_lock(), so there shouldn't be any more of these patches. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17udp: ipv4: do not use sk_dst_lock from softirq contextEric Dumazet
Using sk_dst_lock from softirq context is not supported right now. Instead of adding BH protection everywhere, udp_sk_rx_dst_set() can instead use xchg(), as suggested by David. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Fixes: 975022310233 ("udp: ipv4: must add synchronization in udp_sk_rx_dst_set()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: ovs: use CRC32 accelerated flow hash if availableFrancesco Fusco
Currently OVS uses jhash2() for calculating flow hashes in its internal flow_hash() function. The performance of the flow_hash() function is critical, as the input data can be hundreds of bytes long. OVS is largely deployed in x86_64 based datacenters. Therefore, we argue that the performance critical fast path of OVS should exploit underlying CPU features in order to reduce the per packet processing costs. We replace jhash2 with the hash implementation provided by the kernel hash lib, which exploits the crc32l instruction to achieve high performance Our patch greatly reduces the hash footprint from ~200 cycles of jhash2() to around ~90 cycles in case of ovs_flow_hash_crc() (measured with rdtsc over maximum length flow keys on an i7 Intel CPU). Additionally, we wrote a microbenchmark to stress the flow table performance. The benchmark inserts random flows into the flow hash and then performs lookups. Our hash deployed on a CRC32 capable CPU reduces the lookup for 1000 flows, 100 masks from ~10,100us to ~6,700us, for example. Thus, simply use the newly introduced arch_fast_hash2() as a drop-in replacement. Signed-off-by: Francesco Fusco <ffusco@redhat.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Thomas Graf <tgraf@redhat.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17Merge branch 'for-john' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211