path: root/net
Age  Commit message  Author
2016-08-21  Revert "l2tp: Refactor the codes with existing macros instead of literal number"  David S. Miller
This reverts commit 5ab1fe72d5490978104fc493615ea29dd7238766. This change still has problems. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-21  l2tp: Refactor the codes with existing macros instead of literal number  Gao Feng
Use PPP_ALLSTATIONS, PPP_UI, and SEND_SHUTDOWN instead of 0xff, 0x03, and 2 respectively. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19  net/irda: remove pointless assignment/check  Vegard Nossum
We've already set sk to sock->sk and dereferenced it, so if it's NULL we would have crashed already. Moreover, if it was NULL we would have crashed anyway when jumping to 'out' and trying to unlock the sock. Furthermore, if we had assigned a different value to 'sk' we would have been calling lock_sock() and release_sock() on different sockets. My conclusion is that these two lines are complete nonsense and only serve to confuse the reader. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19  l2tp: Fix the connect status check in pppol2tp_getname  Gao Feng
sk->sk_state is a set of bit flags, so the check needs a bitwise test instead of a value comparison. Signed-off-by: Gao Feng <fgao@ikuai8.com> Tested-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
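The difference between the two kinds of check can be sketched in plain C. This is a minimal illustration, not the l2tp code: the ST_* names and their values are hypothetical, and the only point is that a value comparison fails as soon as a second bit is set.

```c
#include <stdbool.h>

/* Hypothetical state bits, for illustration only; several may be set at once. */
enum {
	ST_CONNECTED = 1 << 0,
	ST_BOUND     = 1 << 1,
};

/* Value comparison: true only when CONNECTED is the sole bit set. */
bool connected_by_value(unsigned int state)
{
	return state == ST_CONNECTED;
}

/* Bitwise test: true whenever the CONNECTED bit is set, whatever else is set. */
bool connected_by_bit(unsigned int state)
{
	return (state & ST_CONNECTED) != 0;
}
```

With state set to ST_CONNECTED | ST_BOUND, connected_by_value() reports false while connected_by_bit() reports true, which is the kind of mismatch the commit message describes.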
2016-08-19  net: dsa: bcm_sf2: Make it a real platform device driver  Florian Fainelli
The Broadcom Starfighter 2 switch driver should be a proper platform driver. Now that the DSA code has been updated to allow that, register a switch device, feed it with the proper configuration data coming from Device Tree, and register our switch device with DSA. The bulk of the changes consists of moving what bcm_sf2_sw_setup() did into the platform driver probe function. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19  net: dsa: Export suspend/resume functions  Florian Fainelli
In preparation for allowing switch drivers to implement system-wide suspend/resume functions, export dsa_switch_suspend() and dsa_switch_resume() so that they are callable from the appropriate driver-specific suspend/resume functions. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19  sctp: linearize early if it's not GSO  Marcelo Ricardo Leitner
Because otherwise when crc computation is still needed it's way more expensive than on a linear buffer to the point that it affects performance.

It's so expensive that netperf test gives a perf output as below:

  Overhead  Command    Shared Object     Symbol
    18,62%  netserver  [kernel.vmlinux]  [k] crc32_generic_shift
     2,57%  netserver  [kernel.vmlinux]  [k] __pskb_pull_tail
     1,94%  netserver  [kernel.vmlinux]  [k] fib_table_lookup
     1,90%  netserver  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     1,66%  swapper    [kernel.vmlinux]  [k] intel_idle
     1,63%  netserver  [kernel.vmlinux]  [k] _raw_spin_lock
     1,59%  netserver  [sctp]            [k] sctp_packet_transmit
     1,55%  netserver  [kernel.vmlinux]  [k] memcpy_erms
     1,42%  netserver  [sctp]            [k] sctp_rcv

  # netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
  SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
  Recv   Send   Send                         Utilization     Service Demand
  Socket Socket Message Elapsed              Send    Recv    Send    Recv
  Size   Size   Size    Time     Throughput  local   remote  local   remote
  bytes  bytes  bytes   secs.    10^6bits/s  % S     % S     us/KB   us/KB

  212992 212992 12000   10.00    3016.42     2.88    3.78    1.874   2.462

After patch:

  Overhead  Command    Shared Object     Symbol
     2,75%  netserver  [kernel.vmlinux]  [k] memcpy_erms
     2,63%  netserver  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     2,39%  netserver  [kernel.vmlinux]  [k] fib_table_lookup
     2,04%  netserver  [kernel.vmlinux]  [k] __pskb_pull_tail
     1,91%  netserver  [kernel.vmlinux]  [k] _raw_spin_lock
     1,91%  netserver  [sctp]            [k] sctp_packet_transmit
     1,72%  netserver  [mlx4_en]         [k] mlx4_en_process_rx_cq
     1,68%  netserver  [sctp]            [k] sctp_rcv

  # netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
  SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
  Recv   Send   Send                         Utilization     Service Demand
  Socket Socket Message Elapsed              Send    Recv    Send    Recv
  Size   Size   Size    Time     Throughput  local   remote  local   remote
  bytes  bytes  bytes   secs.    10^6bits/s  % S     % S     us/KB   us/KB

  212992 212992 12000   10.00    3681.77     3.83    3.46    2.045   1.849

Fixes: 3acb50c18d8d ("sctp: delay as much as possible skb_linearize")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19  net: ipv4: fix sparse error in fib_good_nh()  Eric Dumazet
Fixes the following sparse errors:

  net/ipv4/fib_semantics.c:1579:61: warning: incorrect type in argument 2 (different base types)
  net/ipv4/fib_semantics.c:1579:61:    expected unsigned int [unsigned] [usertype] key
  net/ipv4/fib_semantics.c:1579:61:    got restricted __be32 const [usertype] nh_gw

Fixes: a6db4494d218c ("net: ipv4: Consider failed nexthops in multipath routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19  udp: include addrconf.h  Eric Dumazet
Include ipv4_rcv_saddr_equal() definition to avoid this sparse error : net/ipv4/udp.c:362:5: warning: symbol 'ipv4_rcv_saddr_equal' was not declared. Should it be static? Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19  tcp: md5: remove tcp_md5_hash_header()  Eric Dumazet
After commit 19689e38eca5 ("tcp: md5: use kmalloc() backed scratch areas") this function is no longer used. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19  netlink: Use rhashtable walk interface in diag dump  Herbert Xu
This patch converts the diag dumping code to use the rhashtable walk code instead of going through rhashtable by hand. The lock nl_table_lock is now only taken while we process the multicast list as it's not needed for the rhashtable walk. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  fib_trie: Fix the description of pos and bits  Xunlei Pang
1) Fix one typo: s/tn/tp/ 2) Fix the description about the "u" bits. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  bpf: get rid of cgroup helper related ifdefs  Daniel Borkmann
As recently discussed during the task_under_cgroup_hierarchy() addition, we should get rid of the ifdefs surrounding the bpf_skb_under_cgroup() helper. If related functionality is not built-in, the helper cannot be used anyway, which is also in line with what we do for all other helpers. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  bpf: enable event output helper also for xdp types  Daniel Borkmann
Follow-up to 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output") for also adding the event output helper for XDP typed programs. The event output helper has been very useful in particular for debugging or event notification purposes, since it's much faster and more flexible than regular trace printk due to programmatically being able to attach meta data. Same flags structure applies as with tc BPF programs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  bpf: add bpf_skb_change_tail helper  Daniel Borkmann
This work adds a bpf_skb_change_tail() helper for tc BPF programs. The basic idea is to expand or shrink the skb in a controlled manner. The eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(), bpf_lX_csum_replace() and others rather than passing a raw buffer for writing here. bpf_skb_change_tail() is really a slow path helper and intended for replies with, e.g., ICMP control messages. The concept is similar to other helpers like bpf_skb_change_proto(): keep the helper free of protocol specifics and let the BPF program mangle the remaining parts. A flags field has been added and is reserved for now, should we extend the helper in the future. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  bpf: use skb_pkt_type_ok helper in bpf_skb_change_type  Daniel Borkmann
Since we have a skb_pkt_type_ok() helper for checking the type before mangling, make use of it instead of open coding. Follow-up to commit 8b10cab64c13 ("net: simplify and make pkt_type_ok() available for other users") that came in after d2485c4242a8 ("bpf: add bpf_skb_change_type helper"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  tipc: add peer removal functionality  Richard Alpe
Add TIPC_NL_PEER_REMOVE netlink command. This command can remove an offline peer node from the internal data structures. This will be supported by the tipc user space tool in iproute2. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  tcp: refine tcp_prune_ofo_queue() to not drop all packets  Eric Dumazet
Over the years, TCP BDP has increased a lot, and is typically in the order of ~10 Mbytes with help of clever Congestion Control modules. In presence of packet losses, TCP stores incoming packets into an out of order queue, and the number of skbs sitting there waiting for the missing packets to be received can match the BDP (~10 Mbytes). In some cases, TCP needs to make room for incoming skbs, and the current strategy can simply remove all skbs in the out of order queue as a last resort, incurring a huge penalty, both for receiver and sender. Unfortunately these 'last resort events' are quite frequent, forcing the sender to send all packets again, stalling the flow and wasting a lot of resources. This patch cleans only a part of the out of order queue in order to meet the memory constraints. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: C. Stephen Gun <csg@google.com> Cc: Van Jacobson <vanj@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
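The idea of pruning only as much as the memory constraint requires can be sketched in user-space C. This is an illustration under stated assumptions, not the kernel implementation: struct seg, prune_tail() and the choice to drop from the tail of the queue (the segments furthest from the missing data) are hypothetical, and the real code works on skbs and socket memory accounting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one queued out-of-order segment. */
struct seg {
	unsigned int truesize;	/* memory charged for this segment */
};

/*
 * Drop segments from the tail of q[0..n) until *used fits within limit.
 * Assumes *used is at least the sum of all truesize values in q.
 * Returns the number of segments kept; *used is updated to the new total.
 */
size_t prune_tail(const struct seg *q, size_t n, unsigned int *used,
		  unsigned int limit)
{
	assert(q != NULL || n == 0);
	assert(used != NULL);

	while (n > 0 && *used > limit) {
		n--;			/* drop the last remaining segment */
		*used -= q[n].truesize;
	}
	return n;
}
```

Compared with emptying the whole queue, a loop of this shape keeps every segment that still fits, so the sender only has to retransmit what was actually dropped.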
2016-08-18  tcp: defer sacked assignment  Eric Dumazet
While chasing tcp_xmit_retransmit_queue() kasan issue, I found that we could avoid reading sacked field of skb that we won't send, possibly removing one cache line miss. Very minor change in slow path, but why not ? ;) Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  net: bridge: export vlan flags with the stats  Nikolay Aleksandrov
Use one of the vlan xstats padding fields to export the vlan flags. This is needed in order to be able to distinguish between master (bridge) and port vlan entries in user-space when dumping the bridge vlan stats. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  net: bridge: consolidate bridge and port linkxstats calls  Nikolay Aleksandrov
In the bridge driver we usually have the same function working for both port and bridge. In order to follow that logic and also avoid code duplication, consolidate the bridge_ and brport_ linkxstats calls into one since they share most of their code. As a side effect this allows us to dump the vlan stats also via the slave call which is in preparation for the upcoming per-port vlan stats and vlan flag dumping. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  net_sched: act_vlan: Add priority option  Hadar Hen Zion
The current vlan push action supports only vid and protocol options. Add priority option.

Example script that adds vlan push action with vid and priority:

  tc filter add dev veth0 protocol ip parent ffff: \
          flower \
          indev veth0 \
          action vlan push id 100 priority 5

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  net_sched: flower: Add vlan support  Hadar Hen Zion
Enhance flower to support 802.1Q vlan protocol classification. Currently, the supported fields are vlan_id and vlan_priority.

Example:

  # add a flower filter with vlan id and priority classification
  tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
          flower \
          indev ens4f0 \
          vlan_ethtype ipv4 \
          vlan_id 100 \
          vlan_prio 3 \
          action vlan pop

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  net_sched: flower: Avoid dissection of unmasked keys  Hadar Hen Zion
The current flower implementation checks the mask range and sets all the keys included in that range as "used_keys", even if a specific key in the range has a zero mask. This behavior can cause a false positive return value of the dissector_uses_key function and unnecessary dissection in __skb_flow_dissect. This patch explicitly checks the mask of each key, and "used_keys" is set accordingly. Fixes: 77b9900ef53a ('tc: introduce Flower classifier') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
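A per-key mask check of this kind might look like the following user-space sketch. The struct layout, the KEY_* identifiers and both function names are hypothetical; the sketch only shows the principle that a key counts as used when its own mask has at least one non-zero byte, rather than when it merely lies inside the masked range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if any byte of the mask is non-zero, i.e. the key is really matched on. */
bool key_is_masked(const void *mask, size_t len)
{
	const unsigned char *p = mask;

	assert(mask != NULL || len == 0);
	for (size_t i = 0; i < len; i++)
		if (p[i])
			return true;
	return false;
}

/* Hypothetical key identifiers and mask layout, for illustration only. */
enum { KEY_IPV4_SRC, KEY_IPV4_DST, KEY_PORTS };

struct flow_mask {
	unsigned int   ipv4_src;
	unsigned int   ipv4_dst;
	unsigned short ports[2];
};

/* Build the used-keys bitmap from each key's own mask, not from the range. */
unsigned int used_keys(const struct flow_mask *m)
{
	unsigned int used = 0;

	if (key_is_masked(&m->ipv4_src, sizeof(m->ipv4_src)))
		used |= 1u << KEY_IPV4_SRC;
	if (key_is_masked(&m->ipv4_dst, sizeof(m->ipv4_dst)))
		used |= 1u << KEY_IPV4_DST;
	if (key_is_masked(m->ports, sizeof(m->ports)))
		used |= 1u << KEY_PORTS;
	return used;
}
```

A mask that only sets ipv4_dst yields a bitmap with only the KEY_IPV4_DST bit, even though ipv4_src sits in front of it in the structure.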
2016-08-18  flow_dissector: Get vlan priority in addition to vlan id  Hadar Hen Zion
Add vlan priority check to the flow dissector by adding a new flow dissector struct, flow_dissector_key_vlan, which includes vlan tag fields. vlan_id and flow_label fields were under the same struct (flow_dissector_key_tags). It was a convenient setting since struct flow_dissector_key_tags is used by struct flow_keys and by setting vlan_id and flow_label under the same struct, we get precisely 24 or 48 bytes in flow_keys from flow_dissector_key_basic. Now, when adding vlan priority support, the code will be cleaner if flow_label and vlan tag won't be under the same struct anymore. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
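For reference, the VLAN ID and priority both come out of the 16-bit 802.1Q tag control information (TCI): the top 3 bits are the priority, the next bit is DEI, and the low 12 bits are the VLAN ID. The sketch below shows that split in plain C; struct vlan_key and vlan_key_from_tci() are hypothetical names, and the TCI is assumed to be in host byte order already.

```c
#include <stdint.h>

/* 802.1Q TCI layout: 3-bit priority, 1-bit DEI, 12-bit VLAN ID. */
#define TCI_VID_MASK   0x0fffu
#define TCI_PRIO_SHIFT 13

/* Hypothetical key holding only the vlan tag fields. */
struct vlan_key {
	uint16_t vlan_id;
	uint8_t  vlan_priority;
};

/* Split a host-order TCI into VLAN ID and priority. */
struct vlan_key vlan_key_from_tci(uint16_t tci)
{
	struct vlan_key k;

	k.vlan_id = (uint16_t)(tci & TCI_VID_MASK);
	k.vlan_priority = (uint8_t)(tci >> TCI_PRIO_SHIFT);
	return k;
}
```

For example, a TCI of 0x6064 decodes to VLAN ID 100 with priority 3.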
2016-08-18  flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci  Hadar Hen Zion
Early in the datapath, the skb_vlan_untag function is called; it strips the vlan from the skb and sets the skb->vlan_tci and skb->vlan_proto fields. The current dissection doesn't handle stripped vlan packets correctly. In some flows, the vlan doesn't exist in skb->data anymore when applying flow dissection on the skb; fix that. In case vlan info wasn't stripped before applying flow_dissector (RPS flow for example), or in case of skb with multiple vlans (e.g. 802.1ad), get the vlan info from skb->data. The flow_dissector correctly skips any number of vlans and stores only the first level vlan. Fixes: 0744dd00c1b1 ('net: introduce skb_flow_dissect()') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  net: sched: avoid duplicates in qdisc dump  Jiri Kosina
tc_dump_qdisc() performs dumping of the per-device qdiscs in two phases; first, the "standard" dev->qdisc is being dumped. Second, if there is/are ingress queue(s), they are being dumped as well.

After conversion of netdevice's qdisc linked-list into hashtable, these two sets are not in two disjoint sets/lists any more, but are both "reachable" directly from netdevice's hashtable. As a consequence, the "full-depth" dump of the ingress qdiscs results in immediately hitting the netdevice hashtable again, and duplicating the dump that has already been performed for dev->qdisc.

What in fact needs to be dumped in case of ingress queue is "just" the top-level ingress qdisc, as everything else has been dumped already.

Fix this by extending tc_dump_qdisc_root() in a way that it can be instructed whether it should (while performing the "full" per-netdev qdisc dump) perform the whole recursion, or just dump "additional" top-level (ingress) qdiscs without performing any kind of recursion.

This fixes duplicate dumps such as

  qdisc mq 0: root
  qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc clsact ffff: parent ffff:fff1
  qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  net: sched: fix handling of singleton qdiscs with qdisc_hash  Jiri Kosina
qdisc_match_from_root() is now iterating over per-netdevice qdisc hashtable instead of going through a linked-list of qdiscs (independently on the actual underlying netdev), which was the case before the switch to hashtable for qdiscs.

For singleton qdiscs, there is no underlying netdev associated though, and therefore dumping a singleton qdisc will panic, as qdisc_dev(root) will always be NULL.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000410
  IP: [<ffffffff8167efac>] qdisc_match_from_root+0x2c/0x70
  PGD 1aceba067 PUD 1aceb7067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP
  [ ... ]
  task: ffff8801ec996e00 task.stack: ffff8801ec934000
  RIP: 0010:[<ffffffff8167efac>] [<ffffffff8167efac>] qdisc_match_from_root+0x2c/0x70
  RSP: 0018:ffff8801ec937ab0 EFLAGS: 00010203
  RAX: 0000000000000408 RBX: ffff88025e612000 RCX: ffffffffffffffd8
  RDX: 0000000000000000 RSI: 00000000ffff0000 RDI: ffffffff81cf8100
  RBP: ffff8801ec937ab0 R08: 000000000001c160 R09: ffff8802668032c0
  R10: ffffffff81cf8100 R11: 0000000000000030 R12: 00000000ffff0000
  R13: ffff88025e612000 R14: ffffffff81cf3140 R15: 0000000000000000
  FS: 00007f24b9af6740(0000) GS:ffff88026f280000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000410 CR3: 00000001aceec000 CR4: 00000000001406e0
  Stack:
   ffff8801ec937ad0 ffffffff81681210 ffff88025dd51a00 00000000fffffff1
   ffff8801ec937b88 ffffffff81681e4e ffffffff81c42bc0 ffff880262431500
   ffffffff81cf3140 ffff88025dd51a10 ffff88025dd51a24 00000000ec937b38
  Call Trace:
   [<ffffffff81681210>] qdisc_lookup+0x40/0x50
   [<ffffffff81681e4e>] tc_modify_qdisc+0x21e/0x550
   [<ffffffff8166ae25>] rtnetlink_rcv_msg+0x95/0x220
   [<ffffffff81209602>] ? __kmalloc_track_caller+0x172/0x230
   [<ffffffff8166ad90>] ? rtnl_newlink+0x870/0x870
   [<ffffffff816897b7>] netlink_rcv_skb+0xa7/0xc0
   [<ffffffff816657c8>] rtnetlink_rcv+0x28/0x30
   [<ffffffff8168919b>] netlink_unicast+0x15b/0x210
   [<ffffffff81689569>] netlink_sendmsg+0x319/0x390
   [<ffffffff816379f8>] sock_sendmsg+0x38/0x50
   [<ffffffff81638296>] ___sys_sendmsg+0x256/0x260
   [<ffffffff811b1275>] ? __pagevec_lru_add_fn+0x135/0x280
   [<ffffffff811b1a90>] ? pagevec_lru_move_fn+0xd0/0xf0
   [<ffffffff811b1140>] ? trace_event_raw_event_mm_lru_insertion+0x180/0x180
   [<ffffffff811b1b85>] ? __lru_cache_add+0x75/0xb0
   [<ffffffff817708a6>] ? _raw_spin_unlock+0x16/0x40
   [<ffffffff811d8dff>] ? handle_mm_fault+0x39f/0x1160
   [<ffffffff81638b15>] __sys_sendmsg+0x45/0x80
   [<ffffffff81638b62>] SyS_sendmsg+0x12/0x20
   [<ffffffff810038e7>] do_syscall_64+0x57/0xb0

Fix this by special-casing singleton qdiscs (those that don't have underlying netdevice) and introduce immediate handling of those rather than trying to go over an underlying netdevice. We're in the same situation in tc_dump_qdisc_root() and tc_dump_tclass_root().

Ultimately, this will have to be slightly reworked so that we are actually able to show singleton qdiscs (noop) in the dump properly; but we're not currently doing that anyway, so no regression there, and better do this in a gradual manner.

Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  tipc: ensure that link congestion and wakeup use same criteria  Jon Paul Maloy
When an attempt is made to wake up a link after congestion, it uses different, more generous criteria than when it was originally declared congested. This has the effect that the link, and the sending process, sometimes will be woken up unnecessarily, just to immediately return to congestion when it turns out there is not enough space in its send queue to host the pending message. This is a waste of CPU cycles. We now change the function link_prepare_wakeup() to use exactly the same criteria as tipc_link_xmit(). However, since we are now excluding the window limit from the wakeup calculation, and the current backlog limit for the lowest level is too small to house even a single maximum-size message, we have to expand this limit. We do this by evaluating an alternative, minimum value during the setting of the importance limits. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  tipc: make bearer packet filtering generic  Jon Paul Maloy
In commit 5b7066c3dd24 ("tipc: stricter filtering of packets in bearer layer") we introduced a method of filtering out messages while a bearer is being reset, to prevent links from being re-created and coming back into working state while we are still in the process of shutting them down. This solution works well, but is limited to L2 media, which is insufficient with the increasing use of UDP as carrier media. We now replace this solution with a more generic one, by introducing a new flag "up" in the generic struct tipc_bearer. This field will be set and reset at the same locations as with the previous solution, while the packet filtering is moved to the generic code for the sending side. On the receiving side, the filtering is still done in media specific code, but now including the UDP bearer. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  net: atm: remove redundant null pointer check on dev->name  Colin Ian King
dev->name is a char array of IFNAMSIZ elements, hence can never be null, so the null pointer check is redundant. Remove it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf  David S. Miller
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter updates for your net tree, they are:

1) Dump only conntrack that belong to this namespace via /proc file. This is some fallout from the conversion to single conntrack table for all netns, patch from Liping Zhang.
2) Missing MODULE_ALIAS_NF_LOGGER() for the ARP family that prevents module autoloading, also from Liping Zhang.
3) Report overquota event to the right netnamespace, again from Liping.
4) Fix tproxy listener sk refcount that leads to crash, from Eric Dumazet.
5) Fix racy refcounting on object deletion from nfnetlink and rule removal both for nfacct and cttimeout, from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  netfilter: nf_conntrack: restore nf_conntrack_htable_size as exported symbol  Pablo Neira Ayuso
This is required to iterate over the hash table in cttimeout, ctnetlink and nf_conntrack_ipv4.

  >> ERROR: "nf_conntrack_htable_size" [net/netfilter/nfnetlink_cttimeout.ko] undefined!
     ERROR: "nf_conntrack_htable_size" [net/netfilter/nf_conntrack_netlink.ko] undefined!
     ERROR: "nf_conntrack_htable_size" [net/ipv4/netfilter/nf_conntrack_ipv4.ko] undefined!

Fixes: adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-18  netfilter: cttimeout: fix use after free error when delete netns  Liping Zhang
In general, when we want to delete a netns, cttimeout_net_exit will be called before ipt_unregister_table, i.e. before ctnl_timeout_put. But after calling kfree_rcu in cttimeout_net_exit, we will still decrease the timeout object's refcnt in ctnl_timeout_put. This is incorrect and will cause a use-after-free error.

It is easy to reproduce this problem:

  # while : ; do
  ip netns add xxx
  ip netns exec xxx nfct add timeout testx inet icmp timeout 200
  ip netns exec xxx iptables -t raw -p icmp -I OUTPUT -j CT --timeout testx
  ip netns del xxx
  done

  =======================================================================
  BUG kmalloc-96 (Tainted: G B E ): Poison overwritten
  -----------------------------------------------------------------------
  INFO: 0xffff88002b5161e8-0xffff88002b5161e8. First byte 0x6a instead of 0x6b
  INFO: Allocated in cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout] age=104 cpu=0 pid=3330
  ___slab_alloc+0x4da/0x540
  __slab_alloc+0x20/0x40
  __kmalloc+0x1c8/0x240
  cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
  nfnetlink_rcv_msg+0x21a/0x230 [nfnetlink]
  [ ... ]

So call kfree_rcu to free the timeout object only when the refcnt has dropped to 0. And, as nfnetlink_acct does, use atomic_cmpxchg to avoid a race between ctnl_timeout_try_del and ctnl_timeout_put.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-18  netfilter: nfnetlink_acct: fix race between nfacct del and xt_nfacct destroy  Liping Zhang
Suppose that we input the following commands at first:

  # nfacct add test
  # iptables -A INPUT -m nfacct --nfacct-name test

Now the "test" acct's refcnt is 2, but later, when we try to delete the "test" nfacct and the related iptables rule at the same time, a race may happen:

        CPU0                                     CPU1
  nfnl_acct_try_del                        nfnl_acct_put
  atomic_dec_and_test //ref=1,testfail           -
           -                               atomic_dec_and_test //ref=0,testok
           -                               kfree_rcu
  atomic_inc //ref=1                             -

So after the rcu grace period, nf_acct will be freed but it is still linked in the nfnl_acct_list, and we can access it later, then an oops will happen.

Convert the atomic_dec_and_test and atomic_inc combination into one atomic operation, atomic_cmpxchg, to fix this problem.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
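The single-step deletion described in the two netfilter fixes above can be sketched with C11 atomics. This is a user-space illustration, not the kernel code: struct obj, obj_try_del() and obj_put() are hypothetical names, and atomic_compare_exchange_strong() stands in for the kernel's atomic_cmpxchg().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; a refcnt of 1 means only the list holds it. */
struct obj {
	atomic_int refcnt;
};

/*
 * Deletion succeeds only if refcnt moves from 1 to 0 in one atomic step.
 * On failure the count is left untouched, so there is no window in which a
 * concurrent put can see a transient 0 and free an object that stays listed.
 */
bool obj_try_del(struct obj *o)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&o->refcnt, &expected, 0);
}

/* Drop one reference; returns true when the caller dropped the last one. */
bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refcnt, 1) == 1;
}
```

In this sketch a caller would unlink and free the object only after obj_try_del() returns true, and would report the object as busy otherwise.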
2016-08-18  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  David S. Miller
Minor overlapping changes for both merge conflicts. Resolution work done by Stephen Rothwell was used as a reference. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  Linus Torvalds
Pull networking fixes from David Miller:

 1) Buffers powersave frame test is reversed in cfg80211, fix from Felix Fietkau.
 2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.
 3) Fix some tg3 ethtool logic bugs, and one that would cause no interrupts to be generated when rx-coalescing is set to 0. From Satish Baddipadige and Siva Reddy Kallam.
 4) QLCNIC mailbox corruption and napi budget handling fix from Manish Chopra.
 5) Fix fib_trie logic when walking the trie during /proc/net/route output that can access a stale node pointer. From David Forster.
 6) Several sctp_diag fixes from Phil Sutter.
 7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.
 8) Checksum fixup fixes in bpf from Daniel Borkmann.
 9) Memory leaks in nfnetlink, from Liping Zhang.
10) Use after free in rxrpc, from David Howells.
11) Use after free in new skb_array code of macvtap driver, from Jason Wang.
12) Calipso resource leak, from Colin Ian King.
13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.
14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.
15) Fix lockdep splats in macsec, from Sabrina Dubroca.
16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF handling.
17) Various tc-action bug fixes, from CONG Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  net_sched: allow flushing tc police actions
  net_sched: unify the init logic for act_police
  net_sched: convert tcf_exts from list to pointer array
  net_sched: move tc offload macros to pkt_cls.h
  net_sched: fix a typo in tc_for_each_action()
  net_sched: remove an unnecessary list_del()
  net_sched: remove the leftover cleanup_a()
  mlxsw: spectrum: Allow packets to be trapped from any PG
  mlxsw: spectrum: Unmap 802.1Q FID before destroying it
  mlxsw: spectrum: Add missing rollbacks in error path
  mlxsw: reg: Fix missing op field fill-up
  mlxsw: spectrum: Trap loop-backed packets
  mlxsw: spectrum: Add missing packet traps
  mlxsw: spectrum: Mark port as active before registering it
  mlxsw: spectrum: Create PVID vPort before registering netdevice
  mlxsw: spectrum: Remove redundant errors from the code
  mlxsw: spectrum: Don't return upon error in removal path
  i40e: check for and deal with non-contiguous TCs
  ixgbe: Re-enable ability to toggle VLAN filtering
  ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
  ...
2016-08-17  kcm: Use stream parser  Tom Herbert
Adapt KCM to use the stream parser. This mostly involves removing the RX handling and setting up the strparser using the interface. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17strparser: Stream parser for messagesTom Herbert
This patch introduces a utility for parsing application layer protocol messages in a TCP stream. This is a generalization of the mechanism implemented in the Kernel Connection Multiplexor.

The API includes a context structure, a set of callbacks, utility functions, and a data ready function.

A stream parser instance is defined by a strparser structure that is bound to a TCP socket. The function to initialize the structure is:

    int strp_init(struct strparser *strp, struct sock *csk, struct strp_callbacks *cb);

csk is the TCP socket being bound to and cb are the parser callbacks.

The upper layer calls strp_tcp_data_ready when data is ready on the lower socket for strparser to process. This should be called from a data_ready callback that is set on the socket:

    void strp_tcp_data_ready(struct strparser *strp);

A parser is bound to a TCP socket by setting the data_ready function to strp_tcp_data_ready so that all receive indications on the socket go through the parser. This assumes that sk_user_data is set to the strparser structure.

There are four callbacks.
 - parse_msg is called to parse the message (returns length or error).
 - rcv_msg is called when a complete message has been received
 - read_sock_done is called when the data_ready function exits
 - abort_parser is called to abort the parser

The input to parse_msg is an skbuff which contains the next message under construction. The backend processing of parse_msg will parse the application layer protocol headers to determine the length of the message in the stream.

The possible return values are:

    >0 : indicates length of successfully parsed message
    0 : indicates more data must be received to parse the message
    -ESTRPIPE : current message should not be processed by the kernel, return control of the socket to userspace which can proceed to read the messages itself
    other < 0 : error in parsing, give control back to userspace assuming that synchronization is lost and the stream is unrecoverable (application expected to close TCP socket)

In the case of an error return (< 0) strparser will stop the parser and report an error to userspace. The application must deal with the error. To handle the error the strparser is unbound from the TCP socket. If the error indicates that the stream TCP socket is at a recoverable point (ESTRPIPE) then the application can read the TCP socket to process the stream. Once the application has dealt with the exceptions in the stream, it may again bind the socket to a strparser to continue data operations.

Note that ENODATA may be returned to the application. In this case parse_msg returned -ESTRPIPE, however strparser was unable to maintain synchronization of the stream (i.e. some of the message in question was already read by the parser).

strp_pause and strp_unpause are used to provide flow control. For instance, if rcv_msg is called but the upper layer can't immediately consume the message it can hold the message and pause strparser.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
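The parse_msg return convention described in the strparser commit above can be illustrated with a small userspace sketch. This is not the kernel callback: it works on a plain byte buffer instead of an skbuff, and the framing (a 2-byte big-endian length that counts the header itself), the function name demo_parse_msg, and the DEMO_* constants are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef ESTRPIPE
#define ESTRPIPE 86 /* Linux value; fallback for platforms that lack it */
#endif

/* Hypothetical framing: 2-byte big-endian total length, header included. */
#define DEMO_HDR_LEN     2
#define DEMO_MAX_MSG     4096
#define DEMO_PASSTHROUGH 0xFFFF /* hypothetical marker: let userspace read */

/*
 * Userspace analogue of a parse_msg-style callback (illustrative only).
 * Returns:
 *   > 0        total length of the message at the head of the stream
 *   0          not enough bytes yet to read the length header
 *   -ESTRPIPE  message should be handed back to userspace
 *   -EBADMSG   malformed header, stream treated as unrecoverable
 */
int demo_parse_msg(const uint8_t *buf, size_t avail)
{
    if (avail < DEMO_HDR_LEN)
        return 0;

    uint16_t len = (uint16_t)((buf[0] << 8) | buf[1]);

    if (len == DEMO_PASSTHROUGH)
        return -ESTRPIPE;
    if (len < DEMO_HDR_LEN || len > DEMO_MAX_MSG)
        return -EBADMSG;

    return (int)len;
}
```

As in the description, a positive return is the length of the whole message even when only the header has arrived so far; the caller is expected to keep accumulating data until that many bytes are available.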
2016-08-17net: ipconfig: Fix more use after freeThierry Reding
While commit 9c706a49d660 ("net: ipconfig: fix use after free") avoids the use after free, the resulting code still ends up calling both ic_setup_if() and ic_setup_routes() after calling ic_close_devs(), and both still require access to the device. Move the call to ic_close_devs() to the very end of the function. Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17net_sched: allow flushing tc police actionsRoman Mashak
act_police uses its own code to walk the action hashtable, so standalone tc police actions could not be flushed. Switch to tcf_generic_walker() like other actions. (Joint work from Roman and Cong.) Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17net_sched: unify the init logic for act_policeWANG Cong
Jamal reported a crash when we create a police action with a specific index. This is because the init logic is not correct: we should always create one for this case. Just unify the logic with other tc actions. Fixes: a03e6fe56971 ("act_police: fix a crash during removal") Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17net_sched: convert tcf_exts from list to pointer arrayWANG Cong
As pointed out by Jamal, an action can be shared by multiple filters, so we can no longer use a list to chain them after getting rid of the original tc_action. Instead, we can just save pointers to these actions in tcf_exts, since they are refcount'ed, so convert the list to an array of pointers. The "ugly" part is that the action API still accepts a list as a parameter; I just introduce a helper function to convert the array of pointers to a list, instead of relying on the C99 feature to iterate the array. Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common") Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
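The idea of keeping actions in a pointer array and building a temporary list only when a list-based API needs one can be sketched in plain C. This is a userspace illustration, not the kernel code: struct demo_action, struct demo_exts, DEMO_MAX_ACTIONS and demo_exts_to_list are hypothetical names, and a simple singly linked chain stands in for the kernel's list type.

```c
#include <stddef.h>

#define DEMO_MAX_ACTIONS 4 /* hypothetical capacity of the pointer array */

/* Hypothetical stand-in for a refcounted action. */
struct demo_action {
    int id;
    int refcnt;
    struct demo_action *next; /* valid only while chained for a bulk call */
};

/* Hypothetical stand-in for the per-filter container of action pointers. */
struct demo_exts {
    int nr_actions;
    struct demo_action *actions[DEMO_MAX_ACTIONS];
};

/*
 * Chain the actions referenced by exts, in array order, and return the
 * head of the temporary list (NULL if the array is empty). The array
 * itself is left untouched.
 */
struct demo_action *demo_exts_to_list(struct demo_exts *exts)
{
    struct demo_action *head = NULL;
    struct demo_action **tail = &head;

    for (int i = 0; i < exts->nr_actions; i++) {
        struct demo_action *a = exts->actions[i];

        a->next = NULL;
        *tail = a;
        tail = &a->next;
    }
    return head;
}
```

Because the link field lives inside the action, the chain built here is only meaningful for the duration of one bulk operation; the array of pointers remains the long-lived representation.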
2016-08-17net_sched: remove an unnecessary list_del()WANG Cong
This list_del() for tc action is not actually needed, because we only use this list to chain bulk operations, and it therefore should not be carried over to later operations. Fixes: ec0595cc4495 ("net_sched: get rid of struct tcf_common") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17net_sched: remove the leftover cleanup_a()WANG Cong
After refactoring tc_action into tcf_common, we no longer need to clean up temporary "actions" in the list; they are permanently stored in the hashtable. Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common") Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-17Merge tag 'batadv-next-for-davem-20160816' of ↵David S. Miller
git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
pull request for net-next: batman-adv 2016-08-16

This feature patchset is all about adding netlink support, which should supersede our debugfs configuration interface in the long run. It is especially necessary when batman-adv should be used in different namespaces, since debugfs can not differentiate between those.

More specifically, the following changes are included:

 - Two fixes for namespace handling by Andrew Lunn, also checking the namespaces of parent interfaces, and suppressing debugfs entries for non-default netns
 - Implement various netlink commands for the new interface, by Matthias Schiffer, Andrew Lunn, Sven Eckelmann and Simon Wunderlich (13 patches):
   * routing algorithm list
   * hardif list
   * translation tables (local and global)
   * TTVN for the translation tables
   * originator and neighbor tables for B.A.T.M.A.N. IV and B.A.T.M.A.N. V
   * gateway dump functionality for B.A.T.M.A.N. IV and B.A.T.M.A.N. V
   * Bridge Loop Avoidance claims, and corresponding BLA group
   * Bridge Loop Avoidance backbone tables
 - Finally, mark batman-adv as netns compatible, by Andrew Lunn (1 patch)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18netfilter: conntrack: simplify the code by using nf_conntrack_get_htLiping Zhang
Since commit 64b87639c9cb ("netfilter: conntrack: fix race between nf_conntrack proc read and hash resize") introduced nf_conntrack_get_ht, there is no need to check nf_conntrack_generation again and again to get the hash table and hash size. Also convert nf_conntrack_get_ht to an inline function here. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-18netfilter: tproxy: properly refcount tcp listenersEric Dumazet
inet_lookup_listener() and inet6_lookup_listener() no longer take a reference on the found listener. This minimal patch adds back the refcounting, but we might do this differently in net-next later. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Reported-and-tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-18netfilter: nfnetlink_acct: report overquota to the right netnsLiping Zhang
We should report the over quota message to the right net namespace instead of the init netns. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-17netfilter: nfnetlink_log: add "nf-logger-3-1" module alias nameLiping Zhang
Otherwise, if nfnetlink_log.ko is not loaded, we cannot add rules to log packets to the userspace when we specify it with arp family, such as:

    # nft add rule arp filter input log group 0
    <cmdline>:1:1-37: Error: Could not process rule: No such file or directory
    add rule arp filter input log group 0
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>