summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2014-05-16Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch Jesse Gross says: ==================== A set of OVS changes for net-next/3.16. The major change here is a switch from per-CPU to per-NUMA flow statistics. This improves scalability by reducing kernel overhead in flow setup and maintenance. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16bridge: Automatically manage port promiscuous mode.Vlad Yasevich
There exist configurations where the administrator or another management entity has the foreknowledge of all the mac addresses of end systems that are being bridged together. In these environments, the administrator can statically configure known addresses in the bridge FDB and disable flooding and learning on ports. This makes it possible to turn off promiscuous mode on the interfaces connected to the bridge. Here is why disabling flooding and learning allows us to control promiscuity: Consider port X. All traffic coming into this port from outside the bridge (ingress) will be either forwarded through other ports of the bridge (egress) or dropped. Forwarding (egress) is defined by FDB entries and by flooding in the event that no FDB entry exists. In the event that flooding is disabled, only FDB entries define the egress. Once learning is disabled, only static FDB entries provided by a management entity define the egress. If we provide information from these static FDBs to the ingress port X, then we'll be able to accept all traffic that can be successfully forwarded and drop all the other traffic sooner without spending CPU cycles to process it. Another way to define the above is as following equations: ingress = egress + drop expanding egress ingress = static FDB + learned FDB + flooding + drop disabling flooding and learning we a left with ingress = static FDB + drop By adding addresses from the static FDB entries to the MAC address filter of an ingress port X, we fully define what the bridge can process without dropping and can thus turn off promiscuous mode, thus dropping packets sooner. There have been suggestions that we may want to allow learning and update the filters with learned addresses as well. This would require mac-level authentication similar to 802.1x to prevent attacks against the hw filters as they are limited resource. Additionally, if the user places the bridge device in promiscuous mode, all ports are placed in promiscuous mode regardless of the changes to flooding and learning. Since the above functionality depends on full static configuration, we have also require that vlan filtering be enabled to take advantage of this. The reason is that the bridge has to be able to receive and process VLAN-tagged frames and the there are only 2 ways to accomplish this right now: promiscuous mode or vlan filtering. Suggested-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16bridge: Add addresses from static fdbs to non-promisc portsVlad Yasevich
When a static fdb entry is created, add the mac address from this fdb entry to any ports that are currently running in non-promiscuous mode. These ports need this data so that they can receive traffic destined to these addresses. By default ports start in promiscuous mode, so this feature is disabled. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16bridge: Introduce BR_PROMISC flagVlad Yasevich
Introduce a BR_PROMISC per-port flag that will help us track if the current port is supposed to be in promiscuous mode or not. For now, always start in promiscuous mode. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16bridge: Add functionality to sync static fdb entries to hwVlad Yasevich
Add code that allows static fdb entires to be synced to the hw list for a specified port. This will be used later to program ports that can function in non-promiscuous mode. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16bridge: Keep track of ports capable of automatic discovery.Vlad Yasevich
By default, ports on the bridge are capable of automatic discovery of nodes located behind the port. This is accomplished via flooding of unknown traffic (BR_FLOOD) and learning the mac addresses from these packets (BR_LEARNING). If the above functionality is disabled by turning off these flags, the port requires static configuration in the form of static FDB entries to function properly. This patch adds functionality to keep track of all ports capable of automatic discovery. This will later be used to control promiscuity settings. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16bridge: Turn flag change macro into a function.Vlad Yasevich
Turn the flag change macro into a function to allow easier updates and to reduce space. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16ipv4: ip_tunnels: disable cache for nbma gre tunnelsTimo Teräs
The connected check fails to check for ip_gre nbma mode tunnels properly. ip_gre creates temporary tnl_params with daddr specified to pass-in the actual target on per-packet basis from neighbor layer. Detect these tunnels by inspecting the actual tunnel configuration. Minimal test case: ip route add 192.168.1.1/32 via 10.0.0.1 ip route add 192.168.1.2/32 via 10.0.0.2 ip tunnel add nbma0 mode gre key 1 tos c0 ip addr add 172.17.0.0/16 dev nbma0 ip link set nbma0 up ip neigh add 172.17.0.1 lladdr 192.168.1.1 dev nbma0 ip neigh add 172.17.0.2 lladdr 192.168.1.2 dev nbma0 ping 172.17.0.1 ping 172.17.0.2 The second ping should be going to 192.168.1.2 and head 10.0.0.2; but cached gre tunnel level route is used and it's actually going to 192.168.1.1 via 10.0.0.1. The lladdr's need to go to separate dst for the bug to trigger. Test case uses separate route entries, but this can also happen when the route entry is same: if there is a nexthop exception or the GRE tunnel is IPsec'ed in which case the dst points to xfrm bundle unique to the gre lladdr. Fixes: 7d442fab0a67 ("ipv4: Cache dst in tunnels") Signed-off-by: Timo Teräs <timo.teras@iki.fi> Cc: Tom Herbert <therbert@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16ip_tunnel: don't add tunnel twiceDuan Jiong
When using command "ip tunnel add" to add a tunnel, the tunnel will be added twice, through ip_tunnel_create() and ip_tunnel_update(). Because the second is unnecessary, so we can just break after adding tunnel through ip_tunnel_create(). Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16net/dsa/dsa.c: increment chip_index during of_node handling on dsa_of_probe()Fabian Godehardt
Adding more than one chip on device-tree currently causes the probing routine to always use the first chips data pointer. Signed-off-by: Fabian Godehardt <fg@emlix.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16net: ipv6: make "ip -6 route get mark xyz" work.Lorenzo Colitti
Currently, "ip -6 route get mark xyz" ignores the mark passed in by userspace. Make it honour the mark, just like IPv4 does. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16net/openvswitch: Use with RCU_INIT_POINTER(x, NULL) in vport-gre.cMonam Agarwal
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL) Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Use TCP flags in the flow key for stats.Jarno Rajahalme
We already extract the TCP flags for the key, might as well use that for stats. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Fix output of SCTP mask.Jarno Rajahalme
The 'output' argument of the ovs_nla_put_flow() is the one from which the bits are written to the netlink attributes. For SCTP we accidentally used the bits from the 'swkey' instead. This caused the mask attributes to include the bits from the actual flow key instead of the mask. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Per NUMA node flow stats.Jarno Rajahalme
Keep kernel flow stats for each NUMA node rather than each (logical) CPU. This avoids using the per-CPU allocator and removes most of the kernel-side OVS locking overhead otherwise on the top of perf reports and allows OVS to scale better with higher number of threads. With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup rate doubles on a server with two hyper-threaded physical CPUs (16 logical cores each) compared to the current OVS master. Tested with non-trivial flow table with a TCP port match rule forcing all new connections with unique port numbers to OVS userspace. The IP addresses are still wildcarded, so the kernel flows are not considered as exact match 5-tuple flows. This type of flows can be expected to appear in large numbers as the result of more effective wildcarding made possible by improvements in OVS userspace flow classifier. Perf results for this test (master): Events: 305K cycles + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area + 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range + 2.03% swapper [kernel.kallsyms] [k] intel_idle + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6 + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock ... And after this patch: Events: 356K cycles + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock + 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner + 1.47% swapper [kernel.kallsyms] [k] intel_idle + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock ... There is a small increase in kernel spinlock overhead due to the same spinlock being shared between multiple cores of the same physical CPU, but that is barely visible in the netperf TCP_CRR test performance (maybe ~1% performance drop, hard to tell exactly due to variance in the test results), when testing for kernel module throughput (with no userspace activity, handful of kernel flows). On flow setup, a single stats instance is allocated (for the NUMA node 0). As CPUs from multiple NUMA nodes start updating stats, new NUMA-node specific stats instances are allocated. This allocation on the packet processing code path is made to never block or look for emergency memory pools, minimizing the allocation latency. If the allocation fails, the existing preallocated stats instance is used. Also, if only CPUs from one NUMA-node are updating the preallocated stats instance, no additional stats instances are allocated. This eliminates the need to pre-allocate stats instances that will not be used, also relieving the stats reader from the burden of reading stats that are never used. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Remove 5-tuple optimization.Jarno Rajahalme
The 5-tuple optimization becomes unnecessary with a later per-NUMA node stats patch. Remove it first to make the changes easier to grasp. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Use ether_addr_copyJoe Perches
It's slightly smaller/faster for some architectures. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: flow_netlink: Use pr_fmt to OVS_NLERR outputJoe Perches
Add "openvswitch: " prefix to OVS_NLERR output to match the other OVS_NLERR output of datapath.c Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Use net_ratelimit in OVS_NLERRJoe Perches
Each use of pr_<level>_once has a per-site flag. Some of the OVS_NLERR messages look as if seeing them multiple times could be useful, so use net_ratelimit() instead of pr_info_once. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: Added (unsigned long long) cast in printfDaniele Di Proietto
This is necessary, since u64 is not unsigned long long in all architectures: u64 could be also uint64_t. Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: avoid cast-qual warning in vport_privDaniele Di Proietto
This function must cast a const value to a non const value. By adding an uintptr_t cast the warning is suppressed. To avoid the cast (proper solution) several function signatures must be changed. Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: avoid warnings in vport_from_privDaniele Di Proietto
This change, firstly, avoids declaring the formal parameter const, since it is treated as non const. (to avoid -Wcast-qual) Secondly, it cast the pointer from void* to u8*, since it is used in arithmetic (to avoid -Wpointer-arith) Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16openvswitch: use const in some local vars and castsDaniele Di Proietto
In few functions, const formal parameters are assigned or cast to non-const. These changes suppress warnings if compiled with -Wcast-qual. Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-16Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Include changes: - fix NULL dereference in batadv_orig_hardif_seq_print_text() - fix reference counting imbalance when using fragmentation - avoid access to orig_node objects after they have been free'd - fix local TT check for outgoing arp requests in DAT
2014-05-16net: unix: Align send data_len up to PAGE_SIZEKirill Tkhai
Using whole of allocated pages reduces requested skb->data size. This is just a little more thriftily allocation. netperf does not show difference with the current performance. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16Merge branch 'for-davem' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless John W. Linville says: ==================== pull request: wireless 2014-05-15 Please pull this batch of fixes for the 3.15 stream... For the mac80211 bits, Johannes says: "One fix is to get better VHT performance and the other fixes tracing garbage or other potential issues with the interface name tracing." And... "This has a fix from Emmanuel for a problem I failed to fix - when association is in progress then it needs to be cancelled while suspending (I had fixed the same for authentication). Also included a fix from myself for a userspace API problem that hit the iw tool and a fix to the remain-on-channel framework." For the iwlwifi bits, Emmanuel says: "Alex fixes the scan by disabling the fragmented scan. David prevents scan offload while associated, the firmware seems not to like it. I fix a stupid bug I made in BT Coex, and fix a bad #ifdef clause in rate scaling. Along with that there is a fix for a NULL pointer exception that can happen if we load the driver and our ISR gets called because the interrupt line is shared. The fix has been tested by the reporter." And... "We have here a fix from David Spinadel that makes a previous fix more complete, and an off-by-one issue fixed by Eliad in the same area. I fix the monitor that broke on the way." Beyond that... Daniel Kim's one-liner fixes a brcmfmac regression caused by a typo in an earlier commit.. Rajkumar Manoharan fixes an ath9k oops reported by David Herrmann. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16af_rxrpc: Fix XDR length check in rxrpc key demarshalling.Nathaniel W Filardo
There may be padding on the ticket contained in the key payload, so just ensure that the claimed token length is large enough, rather than exactly the right size. Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-16crush: decode and initialize chooseleaf_vary_rIlya Dryomov
Commit e2b149cc4ba0 ("crush: add chooseleaf_vary_r tunable") added the crush_map::chooseleaf_vary_r field but missed the decode part. This lead to misdirected requests caused by incorrect raw crush mapping sets. Fixes: http://tracker.ceph.com/issues/8226 Reported-and-Tested-by: Dmitry Smirnov <onlyjob@member.fsf.org> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2014-05-16libceph: fix corruption when using page_count 0 page in rbdChunwei Chen
It has been reported that using ZFSonLinux on rbd will result in memory corruption. The bug report can be found here: https://github.com/zfsonlinux/spl/issues/241 http://tracker.ceph.com/issues/7790 The reason is that ZFS will send pages with page_count 0 into rbd, which in turns send them to tcp_sendpage. However, tcp_sendpage cannot deal with page_count 0, as it will do get_page and put_page, and erroneously free the page. This type of issue has been noted before, and handled in iscsi, drbd, etc. So, rbd should also handle this. This fix address this issue by fall back to slower sendmsg when page_count 0 detected. Cc: Sage Weil <sage@inktank.com> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: stable@vger.kernel.org Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-05-16cgroup: remove css_parent()Tejun Heo
cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-05-16Bluetooth: Add missing msecs to jiffies conversionAndrzej Kaczmarek
conn_info_age value is calculated in ms, so need to be converted to jiffies. Signed-off-by: Andrzej Kaczmarek <andrzej.kaczmarek@tieto.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-05-15Bluetooth: Add support for max_tx_power in Get Conn InfoAndrzej Kaczmarek
This patch adds support for max_tx_power in Get Connection Information request. Value is read only once for given connection and then always returned in response as parameter. Signed-off-by: Andrzej Kaczmarek <andrzej.kaczmarek@tieto.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-05-15Bluetooth: Store max TX power level for connectionAndrzej Kaczmarek
This patch adds support to store local maximum TX power level for connection when reply for HCI_Read_Transmit_Power_Level is received. Signed-off-by: Andrzej Kaczmarek <andrzej.kaczmarek@tieto.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-05-15Bluetooth: Avoid polling TX power for LE linksAndrzej Kaczmarek
TX power for LE links is immutable thus we do not need to query for it if already have value. Signed-off-by: Andrzej Kaczmarek <andrzej.kaczmarek@tieto.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-05-15Bluetooth: Add support to get connection informationAndrzej Kaczmarek
This patch adds support for Get Connection Information mgmt command which can be used to query for information about connection, i.e. RSSI and local TX power level. In general values cached in hci_conn are returned as long as they are considered valid, i.e. do not exceed age limit set in hdev. This limit is calculated as random value between min/max values to avoid client trying to guess when to poll for updated information. Signed-off-by: Andrzej Kaczmarek <andrzej.kaczmarek@tieto.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-05-15Bluetooth: Add conn info lifetime parameters to debugfsAndrzej Kaczmarek
This patch adds conn_info_min_age and conn_info_max_age parameters to debugfs which determine lifetime of connection information. Actual lifetime will be random value between min and max age. Default values for min and max age are 1000ms and 3000ms respectively. Signed-off-by: Andrzej Kaczmarek <andrzej.kaczmarek@tieto.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-05-15ipv6: update Destination Cache entries when gateway turn into hostDuan Jiong
RFC 4861 states in 7.2.5: The IsRouter flag in the cache entry MUST be set based on the Router flag in the received advertisement. In those cases where the IsRouter flag changes from TRUE to FALSE as a result of this update, the node MUST remove that router from the Default Router List and update the Destination Cache entries for all destinations using that neighbor as a router as specified in Section 7.3.3. This is needed to detect when a node that is used as a router stops forwarding packets due to being configured as a host. Currently, when dealing with NA Message which IsRouter flag changes from TRUE to FALSE, the kernel only removes router from the Default Router List, and don't update the Destination Cache entries. Now in order to update those Destination Cache entries, i introduce function rt6_clean_tohost(). Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Conflicts: net/ipv4/ip_vti.c Steffen Klassert says: ==================== pull request (net): ipsec 2014-05-15 This pull request has a merge conflict in net/ipv4/ip_vti.c between commit 8d89dcdf80d8 ("vti: don't allow to add the same tunnel twice") and commit a32452366b72 ("vti4:Don't count header length twice"). It can be solved like it is done in linux-next. 1) Fix a ipv6 xfrm output crash when a packet is rerouted by netfilter to not use IPsec. 2) vti4 counts some header lengths twice leading to an incorrect device mtu. Fix this by counting these headers only once. 3) We don't catch the case if an unsupported protocol is submitted to the xfrm protocol handlers, this can lead to NULL pointer dereferences. Fix this by adding the appropriate checks. 4) vti6 may unregister pernet ops twice on init errors. Fix this by removing one of the calls to do it only once. From Mathias Krause. 5) Set the vti tunnel mark before doing a lookup in the error handlers. Otherwise we don't find the correct xfrm state. ==================== The conflict in ip_vti.c was simple, 'net' had a commit removing a line from vti_tunnel_init() and this tree being merged had a commit adding a line to the same location. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15vti6: delete unneeded call to netdev_privJulia Lawall
Netdev_priv is an accessor function, and has no purpose if its result is not used. A simplified version of the semantic match that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ local idexpression x; @@ -x = netdev_priv(...); ... when != x // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15ip_tunnel: delete unneeded call to netdev_privJulia Lawall
Netdev_priv is an accessor function, and has no purpose if its result is not used. A simplified version of the semantic match that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ local idexpression x; @@ -x = netdev_priv(...); ... when != x // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15net: filter: x86: internal BPF JITAlexei Starovoitov
Maps all internal BPF instructions into x86_64 instructions. This patch replaces original BPF x64 JIT with internal BPF x64 JIT. sysctl net.core.bpf_jit_enable is reused as on/off switch. Performance: 1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code. No performance difference is observed for filters that were JIT-able before Example assembler code for BPF filter "tcpdump port 22" original BPF -> old JIT: original BPF -> internal BPF -> new JIT: 0: push %rbp 0: push %rbp 1: mov %rsp,%rbp 1: mov %rsp,%rbp 4: sub $0x60,%rsp 4: sub $0x228,%rsp 8: mov %rbx,-0x8(%rbp) b: mov %rbx,-0x228(%rbp) // prologue 12: mov %r13,-0x220(%rbp) 19: mov %r14,-0x218(%rbp) 20: mov %r15,-0x210(%rbp) 27: xor %eax,%eax // clear A c: xor %ebx,%ebx 29: xor %r13,%r13 // clear X e: mov 0x68(%rdi),%r9d 2c: mov 0x68(%rdi),%r9d 12: sub 0x6c(%rdi),%r9d 30: sub 0x6c(%rdi),%r9d 16: mov 0xd8(%rdi),%r8 34: mov 0xd8(%rdi),%r10 3b: mov %rdi,%rbx 1d: mov $0xc,%esi 3e: mov $0xc,%esi 22: callq 0xffffffffe1021e15 43: callq 0xffffffffe102bd75 27: cmp $0x86dd,%eax 48: cmp $0x86dd,%rax 2c: jne 0x0000000000000069 4f: jne 0x000000000000009a 2e: mov $0x14,%esi 51: mov $0x14,%esi 33: callq 0xffffffffe1021e31 56: callq 0xffffffffe102bd91 38: cmp $0x84,%eax 5b: cmp $0x84,%rax 3d: je 0x0000000000000049 62: je 0x0000000000000074 3f: cmp $0x6,%eax 64: cmp $0x6,%rax 42: je 0x0000000000000049 68: je 0x0000000000000074 44: cmp $0x11,%eax 6a: cmp $0x11,%rax 47: jne 0x00000000000000c6 6e: jne 0x0000000000000117 49: mov $0x36,%esi 74: mov $0x36,%esi 4e: callq 0xffffffffe1021e15 79: callq 0xffffffffe102bd75 53: cmp $0x16,%eax 7e: cmp $0x16,%rax 56: je 0x00000000000000bf 82: je 0x0000000000000110 58: mov $0x38,%esi 88: mov $0x38,%esi 5d: callq 0xffffffffe1021e15 8d: callq 0xffffffffe102bd75 62: cmp $0x16,%eax 92: cmp $0x16,%rax 65: je 0x00000000000000bf 96: je 0x0000000000000110 67: jmp 0x00000000000000c6 98: jmp 0x0000000000000117 69: cmp $0x800,%eax 9a: cmp $0x800,%rax 6e: jne 0x00000000000000c6 a1: jne 0x0000000000000117 70: mov $0x17,%esi a3: mov $0x17,%esi 75: callq 0xffffffffe1021e31 a8: callq 0xffffffffe102bd91 7a: cmp $0x84,%eax ad: cmp $0x84,%rax 7f: je 0x000000000000008b b4: je 0x00000000000000c2 81: cmp $0x6,%eax b6: cmp $0x6,%rax 84: je 0x000000000000008b ba: je 0x00000000000000c2 86: cmp $0x11,%eax bc: cmp $0x11,%rax 89: jne 0x00000000000000c6 c0: jne 0x0000000000000117 8b: mov $0x14,%esi c2: mov $0x14,%esi 90: callq 0xffffffffe1021e15 c7: callq 0xffffffffe102bd75 95: test $0x1fff,%ax cc: test $0x1fff,%rax 99: jne 0x00000000000000c6 d3: jne 0x0000000000000117 d5: mov %rax,%r14 9b: mov $0xe,%esi d8: mov $0xe,%esi a0: callq 0xffffffffe1021e44 dd: callq 0xffffffffe102bd91 // MSH e2: and $0xf,%eax e5: shl $0x2,%eax e8: mov %rax,%r13 eb: mov %r14,%rax ee: mov %r13,%rsi a5: lea 0xe(%rbx),%esi f1: add $0xe,%esi a8: callq 0xffffffffe1021e0d f4: callq 0xffffffffe102bd6d ad: cmp $0x16,%eax f9: cmp $0x16,%rax b0: je 0x00000000000000bf fd: je 0x0000000000000110 ff: mov %r13,%rsi b2: lea 0x10(%rbx),%esi 102: add $0x10,%esi b5: callq 0xffffffffe1021e0d 105: callq 0xffffffffe102bd6d ba: cmp $0x16,%eax 10a: cmp $0x16,%rax bd: jne 0x00000000000000c6 10e: jne 0x0000000000000117 bf: mov $0xffff,%eax 110: mov $0xffff,%eax c4: jmp 0x00000000000000c8 115: jmp 0x000000000000011c c6: xor %eax,%eax 117: mov $0x0,%eax c8: mov -0x8(%rbp),%rbx 11c: mov -0x228(%rbp),%rbx // epilogue cc: leaveq 123: mov -0x220(%rbp),%r13 cd: retq 12a: mov -0x218(%rbp),%r14 131: mov -0x210(%rbp),%r15 138: leaveq 139: retq On fully cached SKBs both JITed functions take 12 nsec to execute. BPF interpreter executes the program in 30 nsec. The difference in generated assembler is due to the following: Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function inside bpf_jit.S. New JIT removes the helper and does it explicitly, so ldx_msh cost is the same for both JITs, but generated code looks longer. New JIT has 4 registers to save, so prologue/epilogue are larger, but the cost is within noise on x64. Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'. New JIT clears %rax unconditionally. 2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM extensions. New JIT supports all BPF extensions. Performance of such filters improves 2-4 times depending on a filter. The longer the filter the higher performance gain. Synthetic benchmarks with many ancillary loads see 20x speedup which seems to be the maximum gain from JIT Notes: . net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional and can be used to see generated assembler . there are two jit_compile() functions and code flow for classic filters is: sk_attach_filter() - load classic BPF bpf_jit_compile() - try to JIT from classic BPF sk_convert_filter() - convert classic to internal bpf_int_jit_compile() - JIT from internal BPF seccomp and tracing filters will just call bpf_int_jit_compile() Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15mac802154: make mac802154_wpan_open staticPhoebe Buckheister
This function is only used within the same translation unit, so mark it static. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15ieee802154: fix dgram socket sendmsg()Phoebe Buckheister
802.15.4 datagram sockets do not currently have a compliant sendmsg(). The destination address supplied is always ignored, and in unconnected mode, packets are broadcast instead of dropped with -EDESTADDRREQ. This patch fixes 802.15.4 dgram sockets to be compliant, i.e. !conn && !msg_name => -EDESTADDRREQ !conn && msg_name => send to msg_name conn && !msg_name => send to connected conn && msg_name => -EISCONN Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-156lowpan: fix fragmentationPhoebe Buckheister
Currently, 6lowpan creates one 802.15.4 MAC header for the original packet the device was given by upper layers and reuses this header for all fragments, if fragmentation is required. This also reuses frame sequence numbers, which must not happen. 6lowpan also has issues with fragmentation in the presence of security headers, since those may imply the presence of trailing fields that are not accounted for by the fragmentation code right now. Fix both of these issues by properly allocating fragment skbs with headromm and tailroom as specified by the underlying device, create one header for each skb instead of reusing the original header, let the underlying device do the rest. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15ieee802154: change _cb handling slightlyPhoebe Buckheister
The current mac_cb handling of ieee802154 is rather awkward and limited. Decompose the single flags field into multiple fields with the meanings of each subfield of the flags field to make future extensions (for example, link-layer security) easier. Also don't set the frame sequence number in upper layers, since that's a thing the MAC is supposed to set on frame transmit - we set it on header creation, but assuming that upper layers do not blindly duplicate our headers, this is fine. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15mac802154: account for all header parts during wpan header creationgPhoebe Buckheister
The current WPAN header creation code checks for EMSGSIZE conditions, but does not account for the MIC field that link layer security may add at the end of the frame. Now that we can accurately calculate the maximum payload size of packets, use that to check for EMSGSIZE conditions. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15ieee802154: add definitions for link-layer security and header functionsPhoebe Buckheister
When dealing with 802.15.4, one often has to know the maximum payload size for a given packet. This depends on many factors, one of which is whether or not a security header is present in the frame. These definitions and functions provide an easy way for any upper layer to calculate the maximum payload size for a packet. The first obvious user for this is 6lowpan, which duplicates this calculation and gets it partially wrong because it ignores security headers. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15rtnetlink: wait for unregistering devices in rtnl_link_unregister()Cong Wang
From: Cong Wang <cwang@twopensource.com> commit 50624c934db18ab90 (net: Delay default_device_exit_batch until no devices are unregistering) introduced rtnl_lock_unregistering() for default_device_exit_batch(). Same race could happen we when rmmod a driver which calls rtnl_link_unregister() as we call dev->destructor without rtnl lock. For long term, I think we should clean up the mess of netdev_run_todo() and net namespce exit code. Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15batman-adv: fix local TT check for outgoing arp requests in DATAntonio Quartulli
Change introduced by 88e48d7b3340ef07b108eb8a8b3813dd093cc7f7 ("batman-adv: make DAT drop ARP requests targeting local clients") implements a check that prevents DAT from using the caching mechanism when the client that is supposed to provide a reply to an arp request is local. However change brought by be1db4f6615b5e6156c807ea8985171c215c2d57 ("batman-adv: make the Distributed ARP Table vlan aware") has not converted the above check into its vlan aware version thus making it useless when the local client is behind a vlan. Fix the behaviour by properly specifying the vlan when checking for a client being local or not. Reported-by: Simon Wunderlich <simon@open-mesh.com> Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-05-15batman-adv: increase orig refcount when storing ref in gw_nodeAntonio Quartulli
A pointer to the orig_node representing a bat-gateway is stored in the gw_node->orig_node member, but the refcount for such orig_node is never increased. This leads to memory faults when gw_node->orig_node is accessed and the originator has already been freed. Fix this by increasing the refcount on gw_node creation and decreasing it on gw_node free. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>