summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2015-05-11net: Add a struct net parameter to sock_create_kernEric W. Biederman
This is long overdue, and is part of cleaning up how we allocate kernel sockets that don't reference count struct net. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11tun: Utilize the normal socket network namespace refcounting.Eric W. Biederman
There is no need for tun to do the weird network namespace refcounting. The existing network namespace refcounting in tfile has almost exactly the same lifetime. So rewrite the code to use the struct sock network namespace refcounting and remove the unnecessary hand rolled network namespace refcounting and the unncesary tfile->net. This change allows the tun code to directly call sock_put bypassing sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary. Remove the now unncessary tun_release so that if anything tries to use the sock_release code path the kernel will oops, and let us know about the bug. The macvtap code already uses it's internal socket this way. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11mac80211: TDLS: use the BSS chandef for HT/VHT operation IEsArik Nemtsov
The chandef of the current channel context might be wider (though compatible). The TDLS link cares only about the channel of the BSS. In addition make sure to specify the VHT operation IE when VHT is supported on a non-2.4GHz band, as required by IEEE802.11ac-2013. This is not the same as HT-operation, to be specified only if the BSS doesn't support HT. Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-11mac80211: move WEP tailroom size checkJanusz Dziedzic
Remove checking tailroom when adding IV as it uses only headroom, and move the check to the ICV generation that actually needs the tailroom. In other case I hit such warning and datapath don't work, when testing: - IBSS + WEP - ath9k with hw crypt enabled - IPv6 data (ping6) WARNING: CPU: 3 PID: 13301 at net/mac80211/wep.c:102 ieee80211_wep_add_iv+0x129/0x190 [mac80211]() [...] Call Trace: [<ffffffff817bf491>] dump_stack+0x45/0x57 [<ffffffff8107746a>] warn_slowpath_common+0x8a/0xc0 [<ffffffff8107755a>] warn_slowpath_null+0x1a/0x20 [<ffffffffc09ae109>] ieee80211_wep_add_iv+0x129/0x190 [mac80211] [<ffffffffc09ae7ab>] ieee80211_crypto_wep_encrypt+0x6b/0xd0 [mac80211] [<ffffffffc09d3fb1>] invoke_tx_handlers+0xc51/0xf30 [mac80211] [...] Cc: stable@vger.kernel.org Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-11mac80211: check fast-xmit if IBSS STA QoS changedJohannes Berg
When an IBSS station gets QoS enabled after having been added, check fast-xmit to make sure the QoS header gets added to the cache properly and frames can go out with QoS and higher rates. Reported-by: Janusz Dziedzic <janusz.dziedzic@tieto.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-10codel: add ce_threshold attributeEric Dumazet
For DCTCP or similar ECN based deployments on fabrics with shallow buffers, hosts are responsible for a good part of the buffering. This patch adds an optional ce_threshold to codel & fq_codel qdiscs, so that DCTCP can have feedback from queuing in the host. A DCTCP enabled egress port simply have a queue occupancy threshold above which ECT packets get CE mark. In codel language this translates to a sojourn time, so that one doesn't have to worry about bytes or bandwidth but delays. This makes the host an active participant in the health of the whole network. This also helps experimenting DCTCP in a setup without DCTCP compliant fabric. On following example, ce_threshold is set to 1ms, and we can see from 'ldelay xxx us' that TCP is not trying to go around the 5ms codel target. Queue has more capacity to absorb inelastic bursts (say from UDP traffic), as queues are maintained to an optimal level. lpaa23:~# ./tc -s -d qd sh dev eth1 qdisc mq 1: dev eth1 root Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961) backlog 3108242b 364p requeues 42961 qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503) rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503 count 0 lastcount 0 ldelay 1.0ms drop_next 0us maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384 qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186) rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186 count 0 lastcount 0 ldelay 694us drop_next 0us maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873 qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554) rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554 count 0 lastcount 0 ldelay 889us drop_next 0us maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780 ... Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-10af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).Kretschmer, Mathias
This patch fixes an issue where the send(MSG_DONTWAIT) call on a TX_RING is not fully non-blocking in cases where the device's sndBuf is full. We pass nonblock=true to sock_alloc_send_skb() and return any possibly occuring error code (most likely EGAIN) to the caller. As the fast-path stays as it is, we keep the unlikely() around skb == NULL. Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09mpls: Change reserved label names to be consistent with netbsdTom Herbert
Since these are now visible to userspace it is nice to be consistent with BSD (sys/netmpls/mpls.h in netBSD). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09pktgen: introduce xmit_mode '<start_xmit|netif_receive>'Alexei Starovoitov
Introduce xmit_mode 'netif_receive' for pktgen which generates the packets using familiar pktgen commands, but feeds them into netif_receive_skb() instead of ndo_start_xmit(). Default mode is called 'start_xmit'. It is designed to test netif_receive_skb and ingress qdisc performace only. Make sure to understand how it works before using it for other rx benchmarking. Sample script 'pktgen.sh': \#!/bin/bash function pgset() { local result echo $1 > $PGDEV result=`cat $PGDEV | fgrep "Result: OK:"` if [ "$result" = "" ]; then cat $PGDEV | fgrep Result: fi } [ -z "$1" ] && echo "Usage: $0 DEV" && exit 1 ETH=$1 PGDEV=/proc/net/pktgen/kpktgend_0 pgset "rem_device_all" pgset "add_device $ETH" PGDEV=/proc/net/pktgen/$ETH pgset "xmit_mode netif_receive" pgset "pkt_size 60" pgset "dst 198.18.0.1" pgset "dst_mac 90:e2:ba:ff:ff:ff" pgset "count 10000000" pgset "burst 32" PGDEV=/proc/net/pktgen/pgctrl echo "Running... ctrl^C to stop" pgset "start" echo "Done" cat /proc/net/pktgen/$ETH Usage: $ sudo ./pktgen.sh eth2 ... Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags) 43033682pps 20656Mb/sec (20656167360bps) errors: 10000000 Raw netif_receive_skb speed should be ~43 million packet per second on 3.7Ghz x86 and 'perf report' should look like: 37.69% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core 25.81% kpktgend_0 [kernel.vmlinux] [k] kfree_skb 7.22% kpktgend_0 [kernel.vmlinux] [k] ip_rcv 5.68% kpktgend_0 [pktgen] [k] pktgen_thread_worker If fib_table_lookup is seen on top, it means skb was processed by the stack. To benchmark netif_receive_skb only make sure that 'dst_mac' of your pktgen script is different from receiving device mac and it will be dropped by ip_rcv Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09pktgen: adjust flag NO_TIMESTAMP to be more pktgen compliantJesper Dangaard Brouer
Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags, with a negation of the flag like !NO_TIMESTAMP. Also document the option flag NO_TIMESTAMP. Fixes: afb84b626184 ("pktgen: add flag NO_TIMESTAMP to disable timestamping") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netlink: allow to listen "all" netnsNicolas Dichtel
More accurately, listen all netns that have a nsid assigned into the netns where the netlink socket is opened. For this purpose, a netlink socket option is added: NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this socket will receive netlink notifications from all netns that have a nsid assigned into the netns where the socket has been opened. The nsid is sent to userland via an anscillary data. With this patch, a daemon needs only one socket to listen many netns. This is useful when the number of netns is high. Because 0 is a valid value for a nsid, the field nsid_is_set indicates if the field nsid is valid or not. skb->cb is initialized to 0 on skb allocation, thus we are sure that we will never send a nsid 0 by error to the userland. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netlink: rename private flags and statesNicolas Dichtel
These flags and states have the same prefix (NETLINK_) that netlink socket options. To avoid confusion and to be able to name a flag like a socket option, let's use an other prefix: NETLINK_[S|F]_. Note: a comment has been fixed, it was talking about NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: use a spin_lock to protect nsid managementNicolas Dichtel
Before this patch, nsid were protected by the rtnl lock. The goal of this patch is to be able to find a nsid without needing to hold the rtnl lock. The next patch will introduce a netlink socket option to listen to all netns that have a nsid assigned into the netns where the socket is opened. Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to avoid a recursive lock (nsid are notified via rtnl). This was the main reason of the previous patch. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: notify new nsid outside __peernet2id()Nicolas Dichtel
There is no functional change with this patch. It will ease the refactoring of the locking system that protects nsids and the support of the netlink socket option NETLINK_LISTEN_ALL_NSID. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: rename peernet2id() to peernet2id_alloc()Nicolas Dichtel
In a following commit, a new function will be introduced to only lookup for a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the existing function is renamed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: always provide the id to rtnl_net_fill()Nicolas Dichtel
The goal of this commit is to prepare the rework of the locking of nsnid protection. After this patch, rtnl_net_notifyid() will not call anymore __peernet2id(), ie no idr_* operation into this function. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: returns always an id in __peernet2id()Nicolas Dichtel
All callers of this function expect a nsid, not an error. Thus, returns NETNSA_NSID_NOT_ASSIGNED in case of error so that callers don't have to convert the error to NETNSA_NSID_NOT_ASSIGNED. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tcp: set SOCK_NOSPACE under memory pressureJason Baron
Under tcp memory pressure, calling epoll_wait() in edge triggered mode after -EAGAIN, can result in an indefinite hang in epoll_wait(), even when there is sufficient memory available to continue making progress. The problem is that when __sk_mem_schedule() returns 0 under memory pressure, we do not set the SOCK_NOSPACE flag in the tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since SOCK_NOSPACE is used to trigger wakeups when incoming acks create sufficient new space in the write queue, all outstanding packets are acked, but we never wake up with the the EPOLLOUT that we are expecting from epoll_wait(). This issue is currently limited to epoll() when used in edge trigger mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE. This is sufficient for poll()/select() and epoll() in level trigger mode. However, in edge trigger mode, epoll() is relying on the write path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we can only call epoll_wait() after read/write return -EAGAIN. Thus, in the case of the socket write, we are relying on the fact that tcp_sendmsg()/network write paths are going to issue a wakeup for us at some point in the future when we get -EAGAIN. Normally, epoll() edge trigger works fine when we've exceeded the sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when we return -EAGAIN from the write path b/c we are over the tcp memory limits and not b/c we are over the sndbuf, we are never going to get another wakeup. I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule() will return 0, or failure more readily with SO_SNDBUF: 1) create socket and set SO_SNDBUF to N 2) add socket as edge trigger 3) write to socket and block in epoll on -EAGAIN 4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory() when the socket is non-blocking. Note that SOCK_NOSPACE, in addition to waking up outstanding waiters is also used to expand the size of the sk->sndbuf. However, we will not expand it by setting it in this case because tcp_should_expand_sndbuf(), ensures that no expansion occurs when we are under tcp memory pressure. Note that we could still hang if sk->sk_wmem_queue is 0, when we get the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we are waiting for and event that will never happen. I believe that this case is harder to hit (and did not hit in my testing), in that over the tcp 'soft' memory limits, we continue to guarantee a minimum write buffer size. Perhaps, we could return -ENOSPC in this case, or maybe we simply issue a wakeup in this case, such that we keep retrying the write. Note that this case is not specific to epoll() ET, but rather would affect blocking sockets as well. So I view this patch as bringing epoll() edge-trigger into sync with the current poll()/select()/epoll() level trigger and blocking sockets behavior. Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09seccomp, filter: add and use bpf_prog_create_from_user from seccompDaniel Borkmann
Seccomp has always been a special candidate when it comes to preparation of its filters in seccomp_prepare_filter(). Due to the extra checks and filter rewrite it partially duplicates code and has BPF internals exposed. This patch adds a generic API inside the BPF code code that seccomp can use and thus keep it's filter preparation code minimal and better maintainable. The other side-effect is that now classic JITs can add seccomp support as well by only providing a BPF_LDX | BPF_W | BPF_ABS translation. Tested with seccomp and BPF test suites. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nicolas Schichan <nschichan@freebox.fr> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: filter: add __GFP_NOWARN flag for larger kmem allocsDaniel Borkmann
When seccomp BPF was added, it was discussed to add __GFP_NOWARN flag for their configuration path as f.e. up to 32K allocations are more prone to fail under stress. As we're going to reuse BPF API, add __GFP_NOWARN flags where larger kmalloc() and friends allocations could fail. It doesn't make much sense to pass around __GFP_NOWARN everywhere as an extra argument only for seccomp while we just as well could run into similar issues for socket filters, where it's not desired to have a user application throw a WARN() due to allocation failure. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nicolas Schichan <nschichan@freebox.fr> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09seccomp: simplify seccomp_prepare_filter and reuse bpf_prepare_filterNicolas Schichan
Remove the calls to bpf_check_classic(), bpf_convert_filter() and bpf_migrate_runtime() and let bpf_prepare_filter() take care of that instead. seccomp_check_filter() is passed to bpf_prepare_filter() so that it gets called from there, after bpf_check_classic(). We can now remove exposure of two internal classic BPF functions previously used by seccomp. The export of bpf_check_classic() symbol, previously known as sk_chk_filter(), was there since pre git times, and no in-tree module was using it, therefore remove it. Joint work with Daniel Borkmann. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: filter: add a callback to allow classic post-verifier transformationsNicolas Schichan
This is in preparation for use by the seccomp code, the rationale is not to duplicate additional code within the seccomp layer, but instead, have it abstracted and hidden within the classic BPF API. As an interim step, this now also makes bpf_prepare_filter() visible (not as exported symbol though), so that seccomp can reuse that code path instead of reimplementing it. Joint work with Daniel Borkmann. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge tag 'mac80211-next-for-davem-2015-05-06' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Lots of updates for net-next for this cycle. As usual, we have a lot of small fixes and cleanups, the bigger items are: * proper mac80211 rate control locking, to fix some random crashes (this required changing other locking as well) * mac80211 "fast-xmit", a mechanism to reduce, in most cases, the amount of code we execute while going from ndo_start_xmit() to the driver * this also clears the way for properly supporting S/G and checksum and segmentation offloads ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tcp: add TCPWinProbe and TCPKeepAlive SNMP countersEric Dumazet
Diagnosing problems related to Window Probes has been hard because we lack a counter. TCPWinProbe counts the number of ACK packets a sender has to send at regular intervals to make sure a reverse ACK packet opening back a window had not been lost. TCPKeepAlive counts the number of ACK packets sent to keep TCP flows alive (SO_KEEPALIVE) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tcp: adjust window probe timers to safer valuesEric Dumazet
With the advent of small rto timers in datacenter TCP, (ip route ... rto_min x), the following can happen : 1) Qdisc is full, transmit fails. TCP sets a timer based on icsk_rto to retry the transmit, without exponential backoff. With low icsk_rto, and lot of sockets, all cpus are servicing timer interrupts like crazy. Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN) and 500ms (TCP_RESOURCE_PROBE_INTERVAL) 2) Receivers can send zero windows if they don't drain their receive queue. TCP sends zero window probes, based on icsk_rto current value, with exponential backoff. With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in some cases), sender can abort in less than one or two minutes ! If receiver stops the sender, it obviously doesn't care of very tight rto. Probability of dropping the ACK reopening the window is not worth the risk. Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these events (but not normal RTO based retransmits) A followup patch adds a new SNMP counter, as it would have helped a lot diagnosing this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tipc: send explicit not supported error in nl compatRichard Alpe
The legacy netlink API treated EPERM (permission denied) as "operation not supported". Reported-by: Tomi Ollila <tomi.ollila@iki.fi> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tipc: add broadcast link window set/get to nl apiRichard Alpe
Add the ability to get or set the broadcast link window through the new netlink API. The functionality was unintentionally missing from the new netlink API. Adding this means that we also fix the breakage in the old API when coming through the compat layer. Fixes: 37e2d4843f9e (tipc: convert legacy nl link prop set to nl compat) Reported-by: Tomi Ollila <tomi.ollila@iki.fi> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tipc: fix default link prop regression in nl compatRichard Alpe
Default link properties can be set for media or bearer. This functionality was missed when introducing the NL compatibility layer. This patch implements this functionality in the compat netlink layer. It works the same way as it did in the old API. We search for media and bearers matching the "link name". If we find a matching media or bearer the link tolerance, priority or window is used as default for new links on that media or bearer. Fixes: 37e2d4843f9e (tipc: convert legacy nl link prop set to nl compat) Reported-by: Tomi Ollila <tomi.ollila@iki.fi> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net_sched: fix a use-after-free in tc_ctl_tfilter()WANG Cong
When tcf_destroy() returns true, tp could be already destroyed, we should not use tp->next after that. For long term, we probably should move tp list to list_head. Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: dsa: Add lockdep class to tx queues to avoid lockdep splatAndrew Lunn
DSA stacks an Ethernet device on top of an Ethernet device. This can cause false positive lockdep splats for the transmit queue: Acked-by: Florian Fainelli <f.fainelli@gmail.com> ============================================= [ INFO: possible recursive locking detected ] 4.0.0-rc7-01838-g70621a215fc7 #386 Not tainted --------------------------------------------- kworker/0:0/4 is trying to acquire lock: (_xmit_ETHER#2){+.-...}, at: [<c040e95c>] sch_direct_xmit+0xa8/0x1fc but task is already holding lock: (_xmit_ETHER#2){+.-...}, at: [<c03f4208>] __dev_queue_xmit+0x4d4/0x56c other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(_xmit_ETHER#2); lock(_xmit_ETHER#2); To avoid this, walk the tq queues of the dsa slaves and set a lockdep class. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net/rds: RDS-TCP: only initiate reconnect attempt on outgoing TCP socket.Sowmini Varadhan
When the peer of an RDS-TCP connection restarts, a reconnect attempt should only be made from the active side of the TCP connection, i.e. the side that has a transient TCP port number. Do not add the passive side of the TCP connection to the c_hash_node and thus avoid triggering rds_queue_reconnect() for passive rds connections. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection.Sowmini Varadhan
When running RDS over TCP, the active (client) side connects to the listening ("passive") side at the RDS_TCP_PORT. After the connection is established, if the client side reboots (potentially without even sending a FIN) the server still has a TCP socket in the esablished state. If the server-side now gets a new SYN comes from the client with a different client port, TCP will create a new socket-pair, but the RDS layer will incorrectly pull up the old rds_connection (which is still associated with the stale t_sock and RDS socket state). This patch corrects this behavior by having rds_tcp_accept_one() always create a new connection for an incoming TCP SYN. The rds and tcp state associated with the old socket-pair is cleaned up via the rds_tcp_state_change() callback which would typically be invoked in most cases when the client-TCP sends a FIN on TCP restart, triggering a transition to CLOSE_WAIT state. In the rarer event of client death without a FIN, TCP_KEEPALIVE probes on the socket will detect the stale socket, and the TCP transition to CLOSE state will trigger the RDS state cleanup. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09ipv6: Fixed source specific default route handling.Markus Stenberg
If there are only IPv6 source specific default routes present, the host gets -ENETUNREACH on e.g. connect() because ip6_dst_lookup_tail calls ip6_route_output first, and given source address any, it fails, and ip6_route_get_saddr is never called. The change is to use the ip6_route_get_saddr, even if the initial ip6_route_output fails, and then doing ip6_route_output _again_ after we have appropriate source address available. Note that this is '99% fix' to the problem; a correct fix would be to do route lookups only within addrconf.c when picking a source address, and never call ip6_route_output before source address has been populated. Signed-off-by: Markus Stenberg <markus.stenberg@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== Here are a couple of important Bluetooth & mac802154 fixes for 4.1: - mac802154 fix for crypto algorithm allocation failure checking - mac802154 wpan phy leak fix for error code path - Fix for not calling Bluetooth shutdown() if interface is not up Let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-08ipvs: fix memory leak in ip_vs_ctl.cTommi Rantala
Fix memory leak introduced in commit a0840e2e165a ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct."): unreferenced object 0xffff88005785b800 (size 2048): comm "(-localed)", pid 1434, jiffies 4294755650 (age 1421.089s) hex dump (first 32 bytes): bb 89 0b 83 ff ff ff ff b0 78 f0 4e 00 88 ff ff .........x.N.... 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8262ea8e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811fba74>] __kmalloc_track_caller+0x244/0x430 [<ffffffff811b88a0>] kmemdup+0x20/0x50 [<ffffffff823276b7>] ip_vs_control_net_init+0x1f7/0x510 [<ffffffff8231d630>] __ip_vs_init+0x100/0x250 [<ffffffff822363a1>] ops_init+0x41/0x190 [<ffffffff82236583>] setup_net+0x93/0x150 [<ffffffff82236cc2>] copy_net_ns+0x82/0x140 [<ffffffff810ab13d>] create_new_namespaces+0xfd/0x190 [<ffffffff810ab49a>] unshare_nsproxy_namespaces+0x5a/0xc0 [<ffffffff810833e3>] SyS_unshare+0x173/0x310 [<ffffffff8265cbd7>] system_call_fastpath+0x12/0x6f [<ffffffffffffffff>] 0xffffffffffffffff Fixes: a0840e2e165a ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.") Signed-off-by: Tommi Rantala <tt.rantala@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2015-05-07mac80211: adjust reserved chan_ctx when assigned to vifAndrei Otcheretianski
When a vif starts using a reserved channel context (during CSA, for example) the required chandef was recalculated, however it was never applied. This could result in using chanctx with narrower width than actually required. Fix this by calling ieee80211_change_chanctx with the recalculated chandef. This both changes the chanctx's width and recalcs min_def. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Reviewed-by: Luciano Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-06mac80211: add missing documentation for rate_ctrl_lockJohannes Berg
This was missed in the previous patch, add some documentation for rate_ctrl_lock to avoid docbook warnings. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-06cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STAArik Nemtsov
The GO_CONCURRENT regulatory definition can be extended to station interfaces requesting to IR as part of TDLS off-channel operations. Rename the GO_CONCURRENT flag to IR_CONCURRENT and allow the added use-case. Change internal users of GO_CONCURRENT to use the new definition. Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-06cfg80211: Allow GO concurrent relaxation after BSS disconnectionAvraham Stern
If a P2P GO was allowed on a channel because of the GO concurrent relaxation, i.e., another station interface was associated to an AP on the same channel or the same UNII band, and the station interface disconnected from the AP, allow the following use cases unless the channel is marked as indoor only and the device is not operating in an indoor environment: 1. Allow the P2P GO to stay on its current channel. The rationale behind this is that if the channel or UNII band were allowed by the AP they could still be used to continue the P2P GO operation, and avoid connection breakage. 2. Allow another P2P GO to start on the same channel or another channel that is in the same UNII band as the previous instantiated P2P GO. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-06mac80211: validate cipher scheme PN length betterJohannes Berg
Currently, a cipher scheme can advertise an arbitrarily long sequence counter, but mac80211 only supports up to 16 bytes and the initial value from userspace will be truncated. Fix two things: * don't allow the driver to register anything longer than the 16 bytes that mac80211 reserves space for * require userspace to specify a starting value with the correct length (or none at all) Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-06mac80211: extend get_key() to return PN for all ciphersJohannes Berg
For ciphers not supported by mac80211, the function currently doesn't return any PN data. Fix this by extending the driver's get_key_seq() a little more to allow moving arbitrary PN data. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-06mac80211: extend get_tkip_seq to all keysJohannes Berg
Extend the function to read the TKIP IV32/IV16 to read the IV/PN for all ciphers in order to allow drivers with full hardware crypto to properly support this. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-05tcp_westwood: fix tcp_westwood_info()Eric Dumazet
I forgot to update tcp_westwood when changing get_info() behavior, this patch should fix this. Fixes: 64f40ff5bbdb ("tcp: prepare CC get_info() access from getsockopt()") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05mpls: Move reserved label definitionsTom Herbert
Move to include/uapi/linux/mpls.h to be externally visibile. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05openvswitch: Use eth_proto_is_802_3Alexander Duyck
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto). Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05ipv4/ip_tunnel_core: Use eth_proto_is_802_3Alexander Duyck
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto). Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05ebtables: Use eth_proto_is_802_3Alexander Duyck
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto). Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05etherdev: Fix sparse error, make test usable by other functionsAlexander Duyck
This change does two things. First it fixes a sparse error for the fact that the __be16 degrades to an integer. Since that is actually what I am kind of doing I am simply working around that by forcing both sides of the comparison to u16. Also I realized on some compilers I was generating another instruction for big endian systems such as PowerPC since it was masking the value before doing the comparison. So to resolve that I have simply pulled the mask out and wrapped it in an #ifndef __BIG_ENDIAN. Lastly I pulled this all out into its own function. I notices there are similar checks in a number of other places so this function can be reused there to help reduce overhead in these paths as well. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05bridge: change BR_GROUPFWD_RESTRICTED to allow forwarding of LLDP framesBernhard Thaler
BR_GROUPFWD_RESTRICTED bitmask restricts users from setting values to /sys/class/net/brX/bridge/group_fwd_mask that allow forwarding of some IEEE 802.1D Table 7-10 Reserved addresses: (MAC Control) 802.3 01-80-C2-00-00-01 (Link Aggregation) 802.3 01-80-C2-00-00-02 802.1AB LLDP 01-80-C2-00-00-0E Change BR_GROUPFWD_RESTRICTED to allow to forward LLDP frames and document group_fwd_mask. e.g. echo 16384 > /sys/class/net/brX/bridge/group_fwd_mask allows to forward LLDP frames. This may be needed for bridge setups used for network troubleshooting or any other scenario where forwarding of LLDP frames is desired (e.g. bridge connecting a virtual machine to real switch transmitting LLDP frames that virtual machine needs to receive). Tested on a simple bridge setup with two interfaces and host transmitting LLDP frames on one side of this bridge (used lldpd). Setting group_fwd_mask as described above lets LLDP frames traverse bridge. Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05tcp: provide SYN headers for passive connectionsEric Dumazet
This patch allows a server application to get the TCP SYN headers for its passive connections. This is useful if the server is doing fingerprinting of clients based on SYN packet contents. Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN. The first is used on a socket to enable saving the SYN headers for child connections. This can be set before or after the listen() call. The latter is used to retrieve the SYN headers for passive connections, if the parent listener has enabled TCP_SAVE_SYN. TCP_SAVED_SYN is read once, it frees the saved SYN headers. The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP headers. Original patch was written by Tom Herbert, I changed it to not hold a full skb (and associated dst and conntracking reference). We have used such patch for about 3 years at Google. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>