linux.git - Linus' kernel tree

Age	Commit message	Author
2016-08-23	netfilter: nf_tables: reject hook configuration updates on existing chains	Pablo Neira Ayuso
	Currently, if you add a base chain whose name clashes with an existing non-base chain, nf_tables doesn't complain about this. The same applies if you update the chain type, the hook number, or the priority. With this patch, nf_tables bails out if any of these unsupported operations occur, returning EBUSY.
	# nft add table x
	# nft add chain x y
	# nft add chain x y { type nat hook input priority 0\; }
	<cmdline>:1:1-49: Error: Could not process rule: Device or resource busy
	add chain x y { type nat hook input priority 0; }
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-23	netfilter: nf_tables: introduce nft_chain_parse_hook()	Pablo Neira Ayuso
	Introduce a new function to wrap the code that parses the chain hook configuration so we can reuse this code to validate chain updates. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-23	rxrpc: Perform terminal call ACK/ABORT retransmission from conn processor	David Howells
	Perform terminal call ACK/ABORT retransmission in the connection processor rather than in the call processor. With this change, once last_call is set, no more incoming packets will be routed to the corresponding call or any earlier calls on that channel (call IDs must only increase on a channel on a connection). Further, if a packet's callNumber is before the last_call ID or a packet is aimed at a successfully completed service call, then that packet is discarded and ignored. Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-23	rxrpc: Calculate serial skew on packet reception	David Howells
	Calculate the serial number skew in the data_ready handler when a packet has been received and a connection looked up. The skew is cached in the sk_buff's priority field. The connection highest received serial number is updated at this time also. This can be done without locks or atomic instructions because, at this point, the code is serialised by the socket. This generates more accurate skew data because if the packet is offloaded to a work queue before this is determined, more packets may come in, bumping the highest serial number and thereby increasing the apparent skew. This also removes some unnecessary atomic ops. Signed-off-by: David Howells <dhowells@redhat.com>
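	The skew calculation described above can be sketched roughly as follows. This is a minimal standalone illustration, not the rxrpc code: `struct conn_state` and `serial_skew()` are hypothetical names, and treating a newer serial as skew 0 is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-connection state; not the kernel struct. */
struct conn_state {
	uint32_t hi_serial;	/* highest serial number received so far */
};

/*
 * Compute the skew of a newly received packet and update the highest
 * serial seen. The caller is assumed to be serialised already (the
 * commit says the socket serialises this path), so no locks or atomic
 * operations are used here.
 */
static uint32_t serial_skew(struct conn_state *conn, uint32_t serial)
{
	/* Signed difference so that serial-number wraparound is tolerated. */
	int32_t delta = (int32_t)(serial - conn->hi_serial);

	if (delta > 0) {
		conn->hi_serial = serial;	/* new high-water mark */
		return 0;
	}
	/* How far this packet lags behind the newest one seen. */
	return conn->hi_serial - serial;
}
```

	Doing this at reception time means later arrivals cannot inflate the value, which is the accuracy argument the commit makes.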
2016-08-23	rxrpc: Set connection expiry on idle, not put	David Howells
	Set the connection expiry time when a connection becomes idle rather than doing this in rxrpc_put_connection(). This makes the put path more efficient (it is likely to be called occasionally whilst a connection has outstanding calls because active workqueue items need to be given a ref). The time is also preset in the connection allocator in case the connection never gets used. Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-23	rxrpc: Use a tracepoint for skb accounting debugging	David Howells
	Use a tracepoint to log various skb accounting points to help in debugging refcounting errors. Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-23	rxrpc: Drop channel number field from rxrpc_call struct	David Howells
	Drop the channel number (channel) field from the rxrpc_call struct to reduce the size of the call struct. The field is redundant: if the call is attached to a connection, the channel can be obtained from there by AND'ing with RXRPC_CHANNELMASK. Signed-off-by: David Howells <dhowells@redhat.com>
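	A minimal sketch of the masking mentioned above. It assumes the value being masked is the connection ID (cid) and that a connection carries four call channels, so the mask is 3; both are assumptions, and `channel_of()` is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed values: four call channels per connection. */
#define RXRPC_MAXCALLS		4
#define RXRPC_CHANNELMASK	(RXRPC_MAXCALLS - 1)

/* Hypothetical helper: the low bits of the connection ID select the channel. */
static unsigned int channel_of(uint32_t cid)
{
	return cid & RXRPC_CHANNELMASK;
}
```

	With the channel recoverable this cheaply, caching it in every call struct buys nothing.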
2016-08-23	rxrpc: When clearing a socket, clear the call sets in the right order	David Howells
	When clearing a socket, we should clear the securing-in-progress list first, then the accept queue and last the main call tree because that's the order in which a call progresses. Not that a call should move from the accept queue to the main tree whilst we're shutting down a socket, but a call could possibly move from secureq to acceptq whilst we're clearing up. Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-23	rxrpc: Tidy up the rxrpc_call struct a bit	David Howells
	Do a little tidying of the rxrpc_call struct:
	(1) in_clientflag is no longer compared against the value that's in the packet, so keeping it in this form isn't necessary. Use a flag in flags instead and provide a pair of wrapper functions.
	(2) We don't read the epoch value, so that can go.
	(3) Move what remains of the data that were used for hashing up in the struct to be with the channel number.
	(4) Get rid of the local pointer. We can get at this via the socket struct and we only use this in the procfs viewer.
	Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-23	rxrpc: Remove RXRPC_CALL_PROC_BUSY	David Howells
	Remove RXRPC_CALL_PROC_BUSY as work queue items are now 100% non-reentrant. Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-22	net: strparser: fix strparser sk_user_data check	Dave Watson
	There is an sk_user_data mismatch between what kcm expects (psock) and what strparser expects (strparser). Queued rx_work (for example, calling strp_check_rcv after socket buffer changes) will never complete. sk_user_data is unused in strparser, so just remove the check. Signed-off-by: Dave Watson <davejwatson@fb.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22	net: dsa: Allow the DSA driver to indicate the tag protocol	Andrew Lunn
	DSA drivers may drive different families of switches which need different tag protocol. Rather than hard code the tag protocol in the driver structure, have a callback for the DSA core to call. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22	net: ipconfig: Fix NULL pointer dereference on RARP/BOOTP/DHCP timeout	Geert Uytterhoeven
	If no RARP, BOOTP, or DHCP response is received, ic_dev is never set, causing a NULL pointer dereference in ic_close_devs():
	Sending DHCP requests ...... timed out!
	Unable to handle kernel NULL pointer dereference at virtual address 00000004
	To fix this, add a check to avoid dereferencing ic_dev if it is still NULL.
	Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Fixes: 2647cffb2bc6fbed ("net: ipconfig: Support using "delayed" DHCP replies") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22	net sched: fix encoding to use real length	Jamal Hadi Salim
	Encoding of the metadata was using the padded length as opposed to the real length of the data, which is a bug per the specification. This has not been an issue to date because all metadata specified so far have been 32 bits wide, where the aligned and data lengths are the same. This also includes a bug fix for validating the length of a u16 field. But since there is no metadata of size u16 yet, we are fine to include it here. While at it, get rid of magic numbers. Fixes: ef6980b6becb ("net sched: introduce IFE action") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
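	To illustrate the padded-versus-real length distinction, here is a generic TLV encoder sketch. The layout (16-bit type, 16-bit length covering header plus data, 4-byte alignment) and the names `tlv_encode()`, `tlv_align()` and `TLV_HDRLEN` are assumptions for illustration, not the IFE implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_HDRLEN	4u	/* assumed: 16-bit type + 16-bit length */

/* Round a length up to the next 4-byte boundary. */
static size_t tlv_align(size_t len)
{
	return (len + 3u) & ~(size_t)3u;
}

/*
 * Encode one TLV into buf, which must have room for the padded size.
 * The length field records header + REAL data length; only the byte
 * count returned to the caller includes the alignment padding.
 */
static size_t tlv_encode(uint8_t *buf, uint16_t type,
			 const void *data, uint16_t dlen)
{
	uint16_t reallen = (uint16_t)(TLV_HDRLEN + dlen);
	size_t padded = tlv_align(reallen);

	memset(buf, 0, padded);			/* zero the padding bytes */
	buf[0] = (uint8_t)(type >> 8);		/* big-endian type */
	buf[1] = (uint8_t)(type & 0xff);
	buf[2] = (uint8_t)(reallen >> 8);	/* big-endian real length */
	buf[3] = (uint8_t)(reallen & 0xff);
	memcpy(buf + TLV_HDRLEN, data, dlen);
	return padded;
}
```

	For 32-bit values the real and padded lengths coincide, which is why the bug stayed hidden; a 16-bit value would make them differ (6 versus 8 in this sketch).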
2016-08-22	Merge tag 'batadv-next-for-davem-20160822' of git://git.open-mesh.org/linux-merge	David S. Miller
	Simon Wunderlich says:
	====================
	This feature patchset includes the following changes:
	- place kref_get near usage of referenced objects, separate patches for various used objects to improve readability and maintainability by Sven Eckelmann (18 patches)
	- Keep batadv net device when all hard interfaces disappear, to improve situations where tools currently use work arounds, by Sven Eckelmann
	- Add an option to disable debugfs support to minimize footprint when userspace uses netlink only, by Sven Eckelmann
	====================
	Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22	net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset	Shmulik Ladkani
	In b8247f095e, "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs", gso skbs arriving from an ingress interface that go through UDP tunneling are allowed to be fragmented if the resulting encapsulated segments exceed the dst mtu of the egress interface. This aligned the behavior of gso skbs to non-gso skbs going through the udp encapsulation path.
	However the non-gso vs gso anomaly is present also in the following cases of a GRE tunnel:
	- ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set (e.g. OvS vport-gre with df_default=false)
	- ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set
	In both of the above cases, the non-gso skbs get fragmented, whereas the gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped, as they don't go through the segment+fragment code path.
	Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set. Tunnels that do set IP_DF will not go to fragmentation of segments. This preserves behavior of ip_gre in (the default) pmtudisc mode.
	Fixes: b8247f095e ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs") Reported-by: wenxu <wenxu@ucloud.cn> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Tested-by: wenxu <wenxu@ucloud.cn> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22	net_sched: properly handle failure case of tcf_exts_init()	WANG Cong
	After commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array") we do dynamic allocation in tcf_exts_init(), therefore we need to handle the ENOMEM case properly. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22	net: ipv6: Remove addresses for failures with strict DAD	Mike Manning
	If DAD fails with accept_dad set to 2, global addresses and host routes are incorrectly left in place. Even though disable_ipv6 is set, contrary to documentation, the addresses are not dynamically deleted from the interface. It is only on a subsequent link down/up that these are removed. The fix is not only to set the disable_ipv6 flag, but also to call addrconf_ifdown(), which is the action to carry out when disabling IPv6. This results in the addresses and routes being deleted immediately. The DAD failure for the LL addr is determined as before via netlink, or by the absence of the LL addr (which also previously would have had to be checked for in case of an intervening link down and up). As the call to addrconf_ifdown() requires an rtnl lock, the logic to disable IPv6 when DAD fails is moved to addrconf_dad_work().
	Previous behavior:
	root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
	net.ipv6.conf.eth3.accept_dad = 2
	root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
	root@vm1:/# ip link set up eth3
	root@vm1:/# ip -6 addr show dev eth3
	5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
	    inet6 2000::10/64 scope global
	       valid_lft forever preferred_lft forever
	    inet6 fe80::5054:ff:fe43:dd5a/64 scope link tentative dadfailed
	       valid_lft forever preferred_lft forever
	root@vm1:/# ip -6 route show dev eth3
	2000::/64 proto kernel metric 256
	fe80::/64 proto kernel metric 256
	root@vm1:/# ip link set down eth3
	root@vm1:/# ip link set up eth3
	root@vm1:/# ip -6 addr show dev eth3
	root@vm1:/# ip -6 route show dev eth3
	root@vm1:/#
	New behavior:
	root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
	net.ipv6.conf.eth3.accept_dad = 2
	root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
	root@vm1:/# ip link set up eth3
	root@vm1:/# ip -6 addr show dev eth3
	root@vm1:/# ip -6 route show dev eth3
	root@vm1:/#
	Signed-off-by: Mike Manning <mmanning@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22	netfilter: nft_hash: fix non static symbol warning	Wei Yongjun
	Fixes the following sparse warning: net/netfilter/nft_hash.c:40:25: warning: symbol 'nft_hash_policy' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-22	netfilter: fix spelling mistake: "delimitter" -> "delimiter"	Colin Ian King
	trivial fix to spelling mistake in pr_debug message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-22	netfilter: nf_tables: add number generator expression	Laura Garcia Liebana
	This patch adds the numgen expression, which allows us to generate incremental and random numbers. The generator is bound to an upper limit that is specified by userspace. This expression is useful to distribute packets in a round-robin fashion as well as randomly. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
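	The two generator modes can be sketched in userspace C as below. This is only an illustration of the idea: the names are hypothetical, the sketch is not thread-safe, and it uses `rand()` where the kernel would use its own random source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical state for the incremental mode. */
struct numgen_inc {
	uint32_t counter;	/* next value to hand out */
	uint32_t until;		/* upper limit set by userspace, must be non-zero */
};

/* Incremental mode: yields 0, 1, ..., until-1 and then wraps to 0. */
static uint32_t numgen_inc_next(struct numgen_inc *g)
{
	uint32_t val = g->counter;

	g->counter = (g->counter + 1) % g->until;
	return val;
}

/* Random mode: yields a value in [0, until); until must be non-zero. */
static uint32_t numgen_random_next(uint32_t until)
{
	return (uint32_t)rand() % until;
}
```

	Using the incremental result as an index into a set of backends gives round-robin distribution; the random mode spreads packets without keeping any counter state.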
2016-08-22	netfilter: nf_tables: add quota expression	Pablo Neira Ayuso
	This patch adds the quota expression. This new stateful expression integrates easily into the dynset expression to build 'hashquota' flow tables. Arguably, we could use "counter bytes > 1000" instead, but this approach has several problems:
	1) We only support one single stateful expression in dynamic set definitions, and the expression above is a composite of two expressions: get counter + comparison.
	2) We would need to restore the packed counter representation (that we used to have) based on seqlock to synchronize this, since per-cpu is not suitable for this.
	So instead of bloating the counter expression back with the seqlock representation and extending the existing set infrastructure to make it more complex for the composite described above, let's follow the simpler approach of adding a quota expression that we can plug into our existing infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
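	A rough sketch of why a single quota expression is simpler than "counter + comparison": accounting and the limit check happen in one step on one piece of state. The struct and function names are hypothetical, and the sketch is single-threaded, whereas the kernel has to deal with concurrent packet paths.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical quota state: one limit, one running total. */
struct ex_quota {
	uint64_t limit;		/* bytes allowed */
	uint64_t consumed;	/* bytes seen so far */
};

/* Account len bytes and report whether the quota is now exceeded. */
static bool quota_over(struct ex_quota *q, uint64_t len)
{
	q->consumed += len;
	return q->consumed > q->limit;
}
```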
2016-08-22	xfrm: Only add l3mdev oif to dst lookups	David Ahern
	Subash reported that commit 42a7b32b73d6 ("xfrm: Add oif to dst lookups") broke a wifi use case that uses fib rules and xfrms. The intent of 42a7b32b73d6 was driven by VRFs with IPsec. As a compromise relax the use of oif in xfrm lookups to L3 master devices only (ie., oif is either an L3 master device or is enslaved to a master device). Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups") Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-08-21	Revert "l2tp: Refactor the codes with existing macros instead of literal number"	David S. Miller
	This reverts commit 5ab1fe72d5490978104fc493615ea29dd7238766. This change still has problems. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-21	l2tp: Refactor the codes with existing macros instead of literal number	Gao Feng
	Use PPP_ALLSTATIONS, PPP_UI, and SEND_SHUTDOWN instead of 0xff, 0x03, and 2 separately. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19	net/irda: remove pointless assignment/check	Vegard Nossum
	We've already set sk to sock->sk and dereferenced it, so if it's NULL we would have crashed already. Moreover, if it was NULL we would have crashed anyway when jumping to 'out' and trying to unlock the sock. Furthermore, if we had assigned a different value to 'sk' we would have been calling lock_sock() and release_sock() on different sockets. My conclusion is that these two lines are complete nonsense and only serve to confuse the reader. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19	l2tp: Fix the connect status check in pppol2tp_getname	Gao Feng
	sk->sk_state is a set of bit flags, so the check needs to use a bit operation instead of a value comparison. Signed-off-by: Gao Feng <fgao@ikuai8.com> Tested-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
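	The difference between a value check and a bit check on a flags field can be shown with a small example. The flag names and values below are made up for illustration and are not taken from the kernel headers.

```c
/* Illustrative state bits; hypothetical values. */
enum {
	EX_CONNECTED	= 0x1,
	EX_BOUND	= 0x2,
};

/* Value check: fails as soon as any other bit is also set. */
static int is_connected_by_value(int state)
{
	return state == EX_CONNECTED;
}

/* Bit check: only looks at the bit of interest. */
static int is_connected_by_bit(int state)
{
	return (state & EX_CONNECTED) != 0;
}
```

	For a state of `EX_CONNECTED | EX_BOUND`, the value check reports "not connected" while the bit check reports "connected".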
2016-08-19	net: dsa: bcm_sf2: Make it a real platform device driver	Florian Fainelli
	The Broadcom Starfighter 2 switch driver should be a proper platform driver. Now that the DSA code has been updated to allow that, register a switch device, feed it with the proper configuration data coming from Device Tree, and register our switch device with DSA. The bulk of the changes consists of moving what bcm_sf2_sw_setup() did into the platform driver probe function. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19	net: dsa: Export suspend/resume functions	Florian Fainelli
	In preparation for allowing switch drivers to implement system-wide suspend/resume functions, export dsa_switch_suspend() and dsa_switch_resume() such that these are callable from the appropriate driver specific suspend/resume functions. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19	sctp: linearize early if it's not GSO	Marcelo Ricardo Leitner
	Because otherwise when crc computation is still needed it's way more expensive than on a linear buffer to the point that it affects performance. It's so expensive that netperf test gives a perf output as below:
	Overhead  Command    Shared Object     Symbol
	  18,62%  netserver  [kernel.vmlinux]  [k] crc32_generic_shift
	   2,57%  netserver  [kernel.vmlinux]  [k] __pskb_pull_tail
	   1,94%  netserver  [kernel.vmlinux]  [k] fib_table_lookup
	   1,90%  netserver  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
	   1,66%  swapper    [kernel.vmlinux]  [k] intel_idle
	   1,63%  netserver  [kernel.vmlinux]  [k] _raw_spin_lock
	   1,59%  netserver  [sctp]            [k] sctp_packet_transmit
	   1,55%  netserver  [kernel.vmlinux]  [k] memcpy_erms
	   1,42%  netserver  [sctp]            [k] sctp_rcv
	# netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
	SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
	Recv   Send   Send                         Utilization     Service Demand
	Socket Socket Message Elapsed              Send    Recv    Send    Recv
	Size   Size   Size    Time     Throughput  local   remote  local   remote
	bytes  bytes  bytes   secs.    10^6bits/s  % S     % S     us/KB   us/KB
	212992 212992 12000   10.00    3016.42     2.88    3.78    1.874   2.462
	After patch:
	Overhead  Command    Shared Object     Symbol
	   2,75%  netserver  [kernel.vmlinux]  [k] memcpy_erms
	   2,63%  netserver  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
	   2,39%  netserver  [kernel.vmlinux]  [k] fib_table_lookup
	   2,04%  netserver  [kernel.vmlinux]  [k] __pskb_pull_tail
	   1,91%  netserver  [kernel.vmlinux]  [k] _raw_spin_lock
	   1,91%  netserver  [sctp]            [k] sctp_packet_transmit
	   1,72%  netserver  [mlx4_en]         [k] mlx4_en_process_rx_cq
	   1,68%  netserver  [sctp]            [k] sctp_rcv
	# netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
	SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
	Recv   Send   Send                         Utilization     Service Demand
	Socket Socket Message Elapsed              Send    Recv    Send    Recv
	Size   Size   Size    Time     Throughput  local   remote  local   remote
	bytes  bytes  bytes   secs.    10^6bits/s  % S     % S     us/KB   us/KB
	212992 212992 12000   10.00    3681.77     3.83    3.46    2.045   1.849
	Fixes: 3acb50c18d8d ("sctp: delay as much as possible skb_linearize") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19	net: ipv4: fix sparse error in fib_good_nh()	Eric Dumazet
	Fixes the following sparse errors:
	net/ipv4/fib_semantics.c:1579:61: warning: incorrect type in argument 2 (different base types)
	net/ipv4/fib_semantics.c:1579:61: expected unsigned int [unsigned] [usertype] key
	net/ipv4/fib_semantics.c:1579:61: got restricted __be32 const [usertype] nh_gw
	Fixes: a6db4494d218c ("net: ipv4: Consider failed nexthops in multipath routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19	udp: include addrconf.h	Eric Dumazet
	Include ipv4_rcv_saddr_equal() definition to avoid this sparse error : net/ipv4/udp.c:362:5: warning: symbol 'ipv4_rcv_saddr_equal' was not declared. Should it be static? Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19	tcp: md5: remove tcp_md5_hash_header()	Eric Dumazet
	After commit 19689e38eca5 ("tcp: md5: use kmalloc() backed scratch areas") this function is no longer used. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19	netlink: Use rhashtable walk interface in diag dump	Herbert Xu
	This patch converts the diag dumping code to use the rhashtable walk code instead of going through rhashtable by hand. The lock nl_table_lock is now only taken while we process the multicast list as it's not needed for the rhashtable walk. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	fib_trie: Fix the description of pos and bits	Xunlei Pang
	1) Fix one typo: s/tn/tp/ 2) Fix the description about the "u" bits. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	bpf: get rid of cgroup helper related ifdefs	Daniel Borkmann
	As recently discussed during the task_under_cgroup_hierarchy() addition, we should get rid of the ifdefs surrounding the bpf_skb_under_cgroup() helper. If related functionality is not built-in, the helper cannot be used anyway, which is also in line with what we do for all other helpers. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	bpf: enable event output helper also for xdp types	Daniel Borkmann
	Follow-up to 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output") for also adding the event output helper for XDP typed programs. The event output helper has been very useful in particular for debugging or event notification purposes, since it's much faster and more flexible than regular trace printk because meta data can be attached programmatically. The same flags structure applies as with tc BPF programs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	bpf: add bpf_skb_change_tail helper	Daniel Borkmann
	This work adds a bpf_skb_change_tail() helper for tc BPF programs. The basic idea is to expand or shrink the skb in a controlled manner. The eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(), bpf_lX_csum_replace() and others rather than passing a raw buffer for writing here. bpf_skb_change_tail() is really a slow path helper and intended for replies with, e.g., ICMP control messages. The concept is similar to other helpers like bpf_skb_change_proto(): keep the helper free of protocol specifics and let the BPF program mangle the remaining parts. A flags field has been added and is reserved for now, should we extend the helper in the future. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	bpf: use skb_pkt_type_ok helper in bpf_skb_change_type	Daniel Borkmann
	Since we have a skb_pkt_type_ok() helper for checking the type before mangling, make use of it instead of open coding. Follow-up to commit 8b10cab64c13 ("net: simplify and make pkt_type_ok() available for other users") that came in after d2485c4242a8 ("bpf: add bpf_skb_change_type helper"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	tipc: add peer removal functionality	Richard Alpe
	Add TIPC_NL_PEER_REMOVE netlink command. This command can remove an offline peer node from the internal data structures. This will be supported by the tipc user space tool in iproute2. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	tcp: refine tcp_prune_ofo_queue() to not drop all packets	Eric Dumazet
	Over the years, TCP BDP has increased a lot, and is typically in the order of ~10 Mbytes with help of clever Congestion Control modules. In presence of packet losses, TCP stores incoming packets into an out of order queue, and the number of skbs sitting there waiting for the missing packets to be received can match the BDP (~10 Mbytes). In some cases, TCP needs to make room for incoming skbs, and the current strategy can simply remove all skbs in the out of order queue as a last resort, incurring a huge penalty, both for receiver and sender. Unfortunately these 'last resort events' are quite frequent, forcing the sender to send all packets again, stalling the flow and wasting a lot of resources. This patch cleans only a part of the out of order queue in order to meet the memory constraints. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: C. Stephen Gun <csg@google.com> Cc: Van Jacobson <vanj@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
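	The idea of pruning only part of the queue can be sketched as below. The commit message does not say which segments are dropped first, so dropping from the highest sequence numbers is an assumption here, as are the names `struct seg` and `prune_ofo_tail()` and the byte-count goal.

```c
#include <stddef.h>

/* Hypothetical queued segment: only its memory cost matters here. */
struct seg {
	size_t truesize;	/* bytes charged for this segment */
};

/*
 * Drop segments from the tail of q (assumed sorted by sequence number,
 * lowest first) until at least `goal` bytes have been released.
 * Returns how many segments remain; *freed receives the bytes released.
 */
static size_t prune_ofo_tail(const struct seg *q, size_t n,
			     size_t goal, size_t *freed)
{
	*freed = 0;
	while (n > 0 && *freed < goal) {
		*freed += q[n - 1].truesize;	/* release the last segment */
		n--;
	}
	return n;
}
```

	Segments that survive do not need to be retransmitted by the sender, which is where the saving over flushing the whole queue comes from.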
2016-08-18	tcp: defer sacked assignment	Eric Dumazet
	While chasing a tcp_xmit_retransmit_queue() kasan issue, I found that we could avoid reading the sacked field of an skb that we won't send, possibly removing one cache line miss. Very minor change in slow path, but why not ? ;) Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	net: bridge: export vlan flags with the stats	Nikolay Aleksandrov
	Use one of the vlan xstats padding fields to export the vlan flags. This is needed in order to be able to distinguish between master (bridge) and port vlan entries in user-space when dumping the bridge vlan stats. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	net: bridge: consolidate bridge and port linkxstats calls	Nikolay Aleksandrov
	In the bridge driver we usually have the same function working for both port and bridge. In order to follow that logic and also avoid code duplication, consolidate the bridge_ and brport_ linkxstats calls into one since they share most of their code. As a side effect this allows us to dump the vlan stats also via the slave call which is in preparation for the upcoming per-port vlan stats and vlan flag dumping. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	net_sched: act_vlan: Add priority option	Hadar Hen Zion
	The current vlan push action supports only vid and protocol options. Add priority option. Example script that adds vlan push action with vid and priority:
	tc filter add dev veth0 protocol ip parent ffff: \
	   flower \
	   indev veth0 \
	   action vlan push id 100 priority 5
	Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	net_sched: flower: Add vlan support	Hadar Hen Zion
	Enhance flower to support 802.1Q vlan protocol classification. Currently, the supported fields are vlan_id and vlan_priority. Example:
	# add a flower filter with vlan id and priority classification
	tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
	   flower \
	   indev ens4f0 \
	   vlan_ethtype ipv4 \
	   vlan_id 100 \
	   vlan_prio 3 \
	   action vlan pop
	Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	net_sched: flower: Avoid dissection of unmasked keys	Hadar Hen Zion
	The current flower implementation checks the mask range and sets all the keys included in that range as "used_keys", even if a specific key in the range has a zero mask. This behavior can cause a false positive return value from the dissector_uses_key function and unnecessary dissection in __skb_flow_dissect. This patch explicitly checks the mask of each key and sets "used_keys" accordingly. Fixes: 77b9900ef53a ('tc: introduce Flower classifier') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
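	A per-key mask check along these lines might look like the sketch below. The key identifiers, the mask layout and the function names are all hypothetical; the point is only that a key is marked as used when its own mask has at least one bit set.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if any byte of the mask is non-zero. */
static int mask_is_set(const void *mask, size_t len)
{
	const uint8_t *p = mask;

	for (size_t i = 0; i < len; i++)
		if (p[i])
			return 1;
	return 0;
}

/* Hypothetical key ids and mask layout, for illustration only. */
enum { EX_KEY_IPV4_SRC, EX_KEY_IPV4_DST, EX_KEY_PORTS };

struct ex_mask {
	uint32_t ipv4_src;
	uint32_t ipv4_dst;
	uint16_t ports[2];
};

/* Build the used-keys bitmap from each key's own mask, not from a range. */
static unsigned int used_keys_from_mask(const struct ex_mask *m)
{
	unsigned int used = 0;

	if (mask_is_set(&m->ipv4_src, sizeof(m->ipv4_src)))
		used |= 1u << EX_KEY_IPV4_SRC;
	if (mask_is_set(&m->ipv4_dst, sizeof(m->ipv4_dst)))
		used |= 1u << EX_KEY_IPV4_DST;
	if (mask_is_set(m->ports, sizeof(m->ports)))
		used |= 1u << EX_KEY_PORTS;
	return used;
}
```

	With a range-based check, a zero `ipv4_dst` mask sitting between two masked keys would still be flagged as used; here it is skipped.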
2016-08-18	flow_dissector: Get vlan priority in addition to vlan id	Hadar Hen Zion
	Add vlan priority check to the flow dissector by adding a new flow dissector struct, flow_dissector_key_vlan, which includes the vlan tag fields. The vlan_id and flow_label fields were under the same struct (flow_dissector_key_tags). It was a convenient setting since struct flow_dissector_key_tags is used by struct flow_keys and by setting vlan_id and flow_label under the same struct, we get precisely 24 or 48 bytes in flow_keys from flow_dissector_key_basic. Now, when adding vlan priority support, the code will be cleaner if flow_label and vlan tag won't be under the same struct anymore. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
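	For reference, the vlan id and priority both live in the 16-bit 802.1Q tag control information (TCI): the low 12 bits are the VLAN ID and the top 3 bits are the priority. The sketch below shows the extraction; the `EX_` macros and helper names are local to this example.

```c
#include <stdint.h>

/* 802.1Q TCI layout: PCP (3 bits) | DEI (1 bit) | VID (12 bits). */
#define EX_VLAN_VID_MASK	0x0fffu
#define EX_VLAN_PRIO_MASK	0xe000u
#define EX_VLAN_PRIO_SHIFT	13

/* Extract the 12-bit VLAN ID from a host-order TCI. */
static uint16_t tci_vlan_id(uint16_t tci)
{
	return tci & EX_VLAN_VID_MASK;
}

/* Extract the 3-bit priority from a host-order TCI. */
static uint8_t tci_vlan_prio(uint16_t tci)
{
	return (uint8_t)((tci & EX_VLAN_PRIO_MASK) >> EX_VLAN_PRIO_SHIFT);
}
```

	A TCI of 0x6064, for example, decodes to VLAN ID 100 with priority 3.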
2016-08-18	flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci	Hadar Hen Zion
	Early in the datapath the skb_vlan_untag function is called; it strips the vlan from the skb and sets the skb->vlan_tci and skb->vlan_proto fields. The current dissection doesn't handle stripped vlan packets correctly. In some flows, the vlan doesn't exist in skb->data anymore when applying flow dissection on the skb; fix that. In case vlan info wasn't stripped before applying flow_dissector (RPS flow for example), or in case of an skb with multiple vlans (e.g. 802.1ad), get the vlan info from skb->data. The flow_dissector correctly skips any number of vlans and stores only the first level vlan. Fixes: 0744dd00c1b1 ('net: introduce skb_flow_dissect()') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18	net: sched: avoid duplicates in qdisc dump	Jiri Kosina
	tc_dump_qdisc() performs dumping of the per-device qdiscs in two phases; first, the "standard" dev->qdisc is being dumped. Second, if there is/are ingress queue(s), they are being dumped as well. After conversion of netdevice's qdisc linked-list into hashtable, these two sets are not in two disjoint sets/lists any more, but are both "reachable" directly from netdevice's hashtable. As a consequence, the "full-depth" dump of the ingress qdiscs results in immediately hitting the netdevice hashtable again, and duplicating the dump that has already been performed for dev->qdisc. What in fact needs to be dumped in case of ingress queue is "just" the top-level ingress qdisc, as everything else has been dumped already. Fix this by extending tc_dump_qdisc_root() in a way that it can be instructed whether it should (while performing the "full" per-netdev qdisc dump) perform the whole recursion, or just dump "additional" top-level (ingress) qdiscs without performing any kind of recursion. This fixes duplicate dumps such as:
	qdisc mq 0: root
	qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	qdisc clsact ffff: parent ffff:fff1
	qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>