summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2017-01-20tipc: add function for checking broadcast support in bearerJon Paul Maloy
As a preparation for the 'replicast' functionality we are going to introduce in the next commits, we need the broadcast base structure to store whether bearer broadcast is available at all from the currently used bearer or bearers. We do this by adding a new function tipc_bearer_bcast_support() to the bearer layer, and letting the bearer selection function in bcast.c use this to give a new boolean field, 'bcast_support' the appropriate value. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20device: Implement a bus agnostic dev_num_vf routinePhil Sutter
Now that pci_bus_type has num_vf callback set, dev_num_vf can be implemented in a bus type independent way and the check for whether a PCI device is being handled in rtnetlink can be dropped. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20gre6: Clean up unused struct ipv6_tel_txoption definitionJakub Sitnicki
Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions") removed the ip6gre specific transmit function, but left the struct ipv6_tel_txoption definition. Clean it up. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20Revert "net: sctp: fix array overrun read on sctp_timer_tbl"David S. Miller
This reverts commit 0e73fc9a56f22f2eec4d2b2910c649f7af67b74d. This fix wasn't correct, a better one is coming right up. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20net: remove bh disabling around percpu_counter accessesEric Dumazet
Shaohua Li made percpu_counter irq safe in commit 098faf5805c8 ("percpu_counter: make APIs irq safe") We can safely remove BH disable/enable sections around various percpu_counter manipulations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20net: sctp: fix array overrun read on sctp_timer_tblColin Ian King
The comparison on the timeout can lead to an array overrun read on sctp_timer_tbl because of an off-by-one error. Fix this by using < instead of <= and also compare to the array size rather than SCTP_EVENT_TIMEOUT_MAX. Fixes CoverityScan CID#1397639 ("Out-of-bounds read") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20ipv6: seg6_genl_set_tunsrc() must check kmemdup() return valueEric Dumazet
seg6_genl_get_tunsrc() and set_tun_src() do not handle tun_src being possibly NULL, so we must check kmemdup() return value and abort if it is NULL Fixes: 915d7e5e5930 ("ipv6: sr: add code base for control plane support of SR-IPv6") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Lebrun <david.lebrun@uclouvain.be> Acked-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receivingJason Wang
Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too, fixing this by adding a hint (has_data_valid) and set it only on the receiving path. Cc: Rolf Neugebauer <rolf.neugebauer@docker.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Rolf Neugebauer <rolf.neugebauer@docker.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net: ipv6: Keep nexthop of multipath route on admin downDavid Ahern
IPv6 deletes route entries associated with multipath routes on an admin down where IPv4 does not. For example: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 metric 64 nexthop via 10.100.1.254 dev eth1 weight 1 nexthop via 10.100.2.254 dev eth2 weight 1 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.4 10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4 $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Set link down: $ ip li set eth1 down IPv4 retains the multihop route but flags eth1 route as dead: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 nexthop via 10.100.1.16 dev eth1 weight 1 dead linkdown nexthop via 10.100.2.16 dev eth2 weight 1 10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4 and IPv6 deletes the route as part of flushing all routes for the device: $ ip -6 ro ls vrf red 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Worse, on admin up of the device the multipath route has to be deleted to get this leg of the route re-added. This patch keeps routes that are part of a multipath route if ignore_routes_with_linkdown is set with the dead and linkdown flags enabling consistency between IPv4 and IPv6: $ ip -6 ro ls vrf red 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 dead linkdown pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net: caif: Remove unused stats member from struct chnl_netTobias Klauser
The stats member of struct chnl_net is used nowhere in the code, so it might as well be removed. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19tcp: initialize max window for a new fastopen socketAlexey Kodanev
Found that if we run LTP netstress test with large MSS (65K), the first attempt from server to send data comparable to this MSS on fastopen connection will be delayed by the probe timer. Here is an example: < S seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32 > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0 < . ack 1 win 342 length 0 Inside tcp_sendmsg(), tcp_send_mss() returns max MSS in 'mss_now', as well as in 'size_goal'. This results the segment not queued for transmition until all the data copied from user buffer. Then, inside __tcp_push_pending_frames(), it breaks on send window test and continues with the check probe timer. Fragmentation occurs in tcp_write_wakeup()... +0.2 > P. seq 1:43777 ack 1 win 342 length 43776 < . ack 43777, win 1365 length 0 > P. seq 43777:65001 ack 1 win 342 options [...] length 21224 ... This also contradicts with the fact that we should bound to the half of the window if it is large. Fix this flaw by correctly initializing max_window. Before that, it could have large values that affect further calculations of 'size_goal'. Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net/sched: cls_flower: reduce fl_change stack sizeArnd Bergmann
The new ARP support has pushed the stack size over the edge on ARM, as there are two large objects on the stack in this function (mask and tb) and both have now grown a bit more: net/sched/cls_flower.c: In function 'fl_change': net/sched/cls_flower.c:928:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] We can solve this by dynamically allocating one or both of them. I first tried to do it just for the mask, but that only saved 152 bytes on ARM, while this version just does it for the 'tb' array, bringing the stack size back down to 664 bytes. Fixes: 99d31326cbe6 ("net/sched: cls_flower: Support matching on ARP") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19ipv6: addrconf: Avoid addrconf_disable_change() using RCU read-side lockKefeng Wang
Just like commit 4acd4945cd1e ("ipv6: addrconf: Avoid calling netdevice notifiers with RCU read-side lock"), it is unnecessary to make addrconf_disable_change() use RCU iteration over the netdev list, since it already holds the RTNL lock, or we may meet Illegal context switch in RCU read-side critical section. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19netfilter: conntrack: refine gc worker heuristics, reduxFlorian Westphal
This further refines the changes made to conntrack gc_worker in commit e0df8cae6c16 ("netfilter: conntrack: refine gc worker heuristics"). The main idea of that change was to reduce the scan interval when evictions take place. However, on the reporters' setup, there are 1-2 million conntrack entries in total and roughly 8k new (and closing) connections per second. In this case we'll always evict at least one entry per gc cycle and scan interval is always at 1 jiffy because of this test: } else if (expired_count) { gc_work->next_gc_run /= 2U; next_run = msecs_to_jiffies(1); being true almost all the time. Given we scan ~10k entries per run its clearly wrong to reduce interval based on nonzero eviction count, it will only waste cpu cycles since a vast majorities of conntracks are not timed out. Thus only look at the ratio (scanned entries vs. evicted entries) to make a decision on whether to reduce or not. Because evictor is supposed to only kick in when system turns idle after a busy period, pick a high ratio -- this makes it 50%. We thus keep the idea of increasing scan rate when its likely that table contains many expired entries. In order to not let timed-out entries hang around for too long (important when using event logging, in which case we want to timely destroy events), we now scan the full table within at most GC_MAX_SCAN_JIFFIES (16 seconds) even in worst-case scenario where all timed-out entries sit in same slot. I tested this with a vm under synflood (with sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3). While flood is ongoing, interval now stays at its max rate (GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV -> 125ms). With feedback from Nicolas Dichtel. Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Fixes: b87a2f9199ea82eaadc ("netfilter: conntrack: add gc worker to remove timed-out entries") Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-19netfilter: conntrack: remove GC_MAX_EVICTS breakFlorian Westphal
Instead of breaking loop and instant resched, don't bother checking this in first place (the loop calls cond_resched for every bucket anyway). Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18net: Remove usage of net_device last_rx memberTobias Klauser
The network stack no longer uses the last_rx member of struct net_device since the bonding driver switched to use its own private last_rx in commit 9f242738376d ("bonding: use last_arp_rx in slave_last_rx()"). However, some drivers still (ab)use the field for their own purposes and some driver just update it without actually using it. Previously, there was an accompanying comment for the last_rx member added in commit 4dc89133f49b ("net: add a comment on netdev->last_rx") which asked drivers not to update is, unless really needed. However, this commend was removed in commit f8ff080dacec ("bonding: remove useless updating of slave->dev->last_rx"), so some drivers added later on still did update last_rx. Remove all usage of last_rx and switch three drivers (sky2, atp and smc91c92_cs) which actually read and write it to use their own private copy in netdev_priv. Compile-tested with allyesconfig and allmodconfig on x86 and arm. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Mirko Lindner <mlindner@marvell.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18lwtunnel: fix autoload of lwt modulesDavid Ahern
Trying to add an mpls encap route when the MPLS modules are not loaded hangs. For example: CONFIG_MPLS=y CONFIG_NET_MPLS_GSO=m CONFIG_MPLS_ROUTING=m CONFIG_MPLS_IPTUNNEL=m $ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2 The ip command hangs: root 880 826 0 21:25 pts/0 00:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2 $ cat /proc/880/stack [<ffffffff81065a9b>] call_usermodehelper_exec+0xd6/0x134 [<ffffffff81065efc>] __request_module+0x27b/0x30a [<ffffffff814542f6>] lwtunnel_build_state+0xe4/0x178 [<ffffffff814aa1e4>] fib_create_info+0x47f/0xdd4 [<ffffffff814ae451>] fib_table_insert+0x90/0x41f [<ffffffff814a8010>] inet_rtm_newroute+0x4b/0x52 ... modprobe is trying to load rtnl-lwt-MPLS: root 881 5 0 21:25 ? 00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS and it hangs after loading mpls_router: $ cat /proc/881/stack [<ffffffff81441537>] rtnl_lock+0x12/0x14 [<ffffffff8142ca2a>] register_netdevice_notifier+0x16/0x179 [<ffffffffa0033025>] mpls_init+0x25/0x1000 [mpls_router] [<ffffffff81000471>] do_one_initcall+0x8e/0x13f [<ffffffff81119961>] do_init_module+0x5a/0x1e5 [<ffffffff810bd070>] load_module+0x13bd/0x17d6 ... The problem is that lwtunnel_build_state is called with rtnl lock held preventing mpls_init from registering. Given the potential references held by the time lwtunnel_build_state it can not drop the rtnl lock to the load module. So, extract the module loading code from lwtunnel_build_state into a new function to validate the encap type. The new function is called while converting the user request into a fib_config which is well before any table, device or fib entries are examined. Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: dsa: use cpu_switch instead of ds[0]Vivien Didelot
Now that the DSA Ethernet switches are true Linux devices, the CPU switch is not necessarily the first one. If its address is higher than the second switch on the same MDIO bus, its index will be 1, not 0. Avoid any confusion by using dst->cpu_switch instead of dst->ds[0]. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: dsa: store CPU switch structure in the treeVivien Didelot
Store a dsa_switch pointer to the CPU switch in the tree instead of only its index. This avoids the need to initialize it to -1. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ipv6: remove prefix arg to rt6_fill_nodeDavid Ahern
The prefix arg to rt6_fill_node is non-0 in only 1 path - rt6_dump_route where a user is requesting a prefix only dump. Simplify rt6_fill_node by removing the prefix arg and moving the prefix check to rt6_dump_route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ipv6: remove nowait arg to rt6_fill_nodeDavid Ahern
All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and simplify rt6_fill_node accordingly. rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the nowait arg from it as well. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: fix harmonize_features() vs NETIF_F_HIGHDMAEric Dumazet
Ashizuka reported a highmem oddity and sent a patch for freescale fec driver. But the problem root cause is that core networking stack must ensure no skb with highmem fragment is ever sent through a device that does not assert NETIF_F_HIGHDMA in its features. We need to call illegal_highdma() from harmonize_features() regardless of CSUM checks. Fixes: ec5f06156423 ("net: Kill link between CSUM and SG features.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pravin Shelar <pshelar@ovn.org> Reported-by: "Ashizuka, Yuusuke" <ashiduka@jp.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18netfilter: nf_tables: eliminate useless condition checksGao Feng
The return value of nf_tables_table_lookup() is valid pointer or one pointer error. There are two cases: 1) IS_ERR(table) is true, it would return the error or reset the table as NULL, it is unnecessary to perform the latter check "table != NULL". 2) IS_ERR(obj) is false, the table is one valid pointer. It is also unnecessary to perform that check. The nf_tables_newset() and nf_tables_newobj() have same logic codes. In summary, we could move the block of condition check "table != NULL" in the else block to eliminate the original condition checks. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18audit: log 32-bit socketcallsRichard Guy Briggs
32-bit socketcalls were not being logged by audit on x86_64 systems. Log them. This is basically a duplicate of the call from net/socket.c:sys_socketcall(), but it addresses the impedance mismatch between 32-bit userspace process and 64-bit kernel audit. See: https://github.com/linux-audit/audit-kernel/issues/14 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-01-18netfilter: ipt_CLUSTERIP: fix build error without procfsArnd Bergmann
We can't access c->pde if CONFIG_PROC_FS is disabled: net/ipv4/netfilter/ipt_CLUSTERIP.c: In function 'clusterip_config_find_get': net/ipv4/netfilter/ipt_CLUSTERIP.c:147:9: error: 'struct clusterip_config' has no member named 'pde' This moves the check inside of another #ifdef. Fixes: 6c5d5cfbe3c5 ("netfilter: ipt_CLUSTERIP: check duplicate config when initializing") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18net: ethtool: Initialize buffer when querying device channel settingsEran Ben Elisha
Ethtool channels respond struct was uninitialized when querying device channel boundaries settings. As a result, unreported fields by the driver hold garbage. This may cause sending unsupported params to driver. Fixes: 8bf368620486 ('ethtool: ensure channel counts are within bounds ...') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> CC: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: implement sender-side procedures for SSN Reset Request ParameterXin Long
This patch is to implement sender-side procedures for the Outgoing and Incoming SSN Reset Request Parameter described in rfc6525 section 5.1.2 and 5.1.3. It is also add sockopt SCTP_RESET_STREAMS in rfc6525 section 6.3.2 for users. Note that the new asoc member strreset_outstanding is to make sure only one reconf request chunk on the fly as rfc6525 section 5.1.1 demands. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add sockopt SCTP_ENABLE_STREAM_RESETXin Long
This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set strreset_enable to indicate which reconf request type it supports, which is described in rfc6525 section 6.3.1. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add reconf_enable in asoc ep and netnsXin Long
This patch is to add reconf_enable field in all of asoc ep and netns to indicate if they support stream reset. When initializing, asoc reconf_enable get the default value from ep reconf_enable which is from netns netns reconf_enable by default. It is also to add reconf_capable in asoc peer part to know if peer supports reconf_enable, the value is set if ext params have reconf chunk support when processing init chunk, just as rfc6525 section 5.1.1 demands. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add stream reconf primitiveXin Long
This patch is to add a primitive based on sctp primitive frame for sending stream reconf request. It works as the other primitives, and create a SCTP_CMD_REPLY command to send the request chunk out. sctp_primitive_RECONF would be the api to send a reconf request chunk. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add stream reconf timerXin Long
This patch is to add a per transport timer based on sctp timer frame for stream reconf chunk retransmission. It would start after sending a reconf request chunk, and stop after receiving the response chunk. If the timer expires, besides retransmitting the reconf request chunk, it would also do the same thing with data RTO timer. like to increase the appropriate error counts, and perform threshold management, possibly destroying the asoc if sctp retransmission thresholds are exceeded, just as section 5.1.1 describes. This patch is also to add asoc strreset_chunk, it is used to save the reconf request chunk, so that it can be retransmitted, and to check if the response is really for this request by comparing the information inside with the response chunk as well. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add support for generating stream reconf ssn reset request chunkXin Long
This patch is to add asoc strreset_outseq and strreset_inseq for saving the reconf request sequence, initialize them when create assoc and process init, and also to define Incoming and Outgoing SSN Reset Request Parameter described in rfc6525 section 4.1 and 4.2, As they can be in one same chunk as section rfc6525 3.1-3 describes, it makes them in one function. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18netfilter: nft_meta: deal with PACKET_LOOPBACK in netdev familyLiping Zhang
After adding the following nft rule, then ping 224.0.0.1: # nft add rule netdev t c pkttype host counter The warning complain message will be printed out again and again: WARNING: CPU: 0 PID: 10182 at net/netfilter/nft_meta.c:163 \ nft_meta_get_eval+0x3fe/0x460 [nft_meta] [...] Call Trace: <IRQ> dump_stack+0x85/0xc2 __warn+0xcb/0xf0 warn_slowpath_null+0x1d/0x20 nft_meta_get_eval+0x3fe/0x460 [nft_meta] nft_do_chain+0xff/0x5e0 [nf_tables] So we should deal with PACKET_LOOPBACK in netdev family too. For ipv4, convert it to PACKET_BROADCAST/MULTICAST according to the destination address's type; For ipv6, convert it to PACKET_MULTICAST directly. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18netfilter: pkttype: unnecessary to check ipv6 multicast addressLiping Zhang
Since there's no broadcast address in IPV6, so in ipv6 family, the PACKET_LOOPBACK must be multicast packets, there's no need to check it again. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-18inet: reset tb->fastreuseport when adding a reuseport skJosef Bacik
If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and never set it again. Which means that in the future if we end up adding a bunch of reuseport sk's to that tb we'll have to do the expensive scan every time. Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family so we know what comparison to make, and the ipv6 only setting so we can make sure to compare with new sockets appropriately. Once one sk has made it onto the list we know that there are no potential bind conflicts on the owners list that match that sk's rcv_addr. So copy the sk's information into our bind bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do an extra check for subsequent reuseport sockets and skip the expensive bind conflict check. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: split inet_csk_get_port into two functionsJosef Bacik
inet_csk_get_port does two different things, it either scans for an open port, or it tries to see if the specified port is available for use. Since these two operations have different rules and are basically independent lets split them into two different functions to make them both more readable. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: don't check for bind conflicts twice when searching for a portJosef Bacik
This is just wasted time, we've already found a tb that doesn't have a bind conflict, and we don't drop the head lock so scanning again isn't going to give us a different answer. Instead move the tb->reuse setting logic outside of the found_tb path and put it in the success: path. Then make it so that we don't goto again if we find a bind conflict in the found_tb path as we won't reach this anymore when we are scanning for an ephemeral port. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: kill smallest_size and smallest_portJosef Bacik
In inet_csk_get_port we seem to be using smallest_port to figure out where the best place to look for a SO_REUSEPORT sk that matches with an existing set of SO_REUSEPORT's. However if we get to the logic if (smallest_size != -1) { port = smallest_port; goto have_port; } we will do a useless search, because we would have already done the inet_csk_bind_conflict for that port and it would have returned 1, otherwise we would have gone to found_tb and succeeded. Since this logic makes us do yet another trip through inet_csk_bind_conflict for a port we know won't work just delete this code and save us the time. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: drop ->bind_conflictJosef Bacik
The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict is how they check the rcv_saddr, so delete this call back and simply change inet_csk_bind_conflict to call inet_rcv_saddr_equal. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: collapse ipv4/v6 rcv_saddr_equal functions into oneJosef Bacik
We pass these per-protocol equal functions around in various places, but we can just have one function that checks the sk->sk_family and then do the right comparison function. I've also changed the ipv4 version to not cast to inet_sock since it is unneeded. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18libceph: make sure ceph_aes_crypt() IV is alignedIlya Dryomov
... otherwise the crypto stack will align it for us with a GFP_ATOMIC allocation and a memcpy() -- see skcipher_walk_first(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-01-18locking/atomic, kref: Implement kref_put_lock()Peter Zijlstra
Because home-rolling your own is _awesome_, stop doing it. Provide kref_put_lock(), just like kref_put_mutex() but for a spinlock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-17tcp: accept RST for rcv_nxt - 1 after receiving a FINJason Baron
Using a Mac OSX box as a client connecting to a Linux server, we have found that when certain applications (such as 'ab'), are abruptly terminated (via ^C), a FIN is sent followed by a RST packet on tcp connections. The FIN is accepted by the Linux stack but the RST is sent with the same sequence number as the FIN, and Linux responds with a challenge ACK per RFC 5961. The OSX client then sometimes (they are rate-limited) does not reply with any RST as would be expected on a closed socket. This results in sockets accumulating on the Linux server left mostly in the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible. This sequence of events can tie up a lot of resources on the Linux server since there may be a lot of data in write buffers at the time of the RST. Accepting a RST equal to rcv_nxt - 1, after we have already successfully processed a FIN, has made a significant difference for us in practice, by freeing up unneeded resources in a more expedient fashion. A packetdrill test demonstrating the behavior: // testing mac osx rst behavior // Establish a connection 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 0.000 bind(3, ..., ...) = 0 0.000 listen(3, 1) = 0 0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5> 0.200 < . 1:1(0) ack 1 win 32768 0.200 accept(3, ..., ...) = 4 // Client closes the connection 0.300 < F. 1:1(0) ack 1 win 32768 // now send rst with same sequence 0.300 < R. 1:1(0) ack 1 win 32768 // make sure we are in TCP_CLOSE 0.400 %{ assert tcpi_state == 7 }% Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17net: ping: Use right format specifier to avoid type castingGao Feng
The inet_num is u16, so use %hu instead of casting it to int. And the sk_bound_dev_if is int actually, so it needn't cast to int. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17bridge: sparse fixes in br_ip6_multicast_alloc_query()Lance Richardson
Changed type of csum field in struct igmpv3_query from __be16 to __sum16 to eliminate type warning, made same change in struct igmpv3_report for consistency. Fixed up an ntohs() where htons() should have been used instead. Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-01-17mpls: Packet statsRobert Shearman
Having MPLS packet stats is useful for observing network operation and for diagnosing network problems. In the absence of anything better, RFC2863 and RFC3813 are used for guidance for which stats to expose and the semantics of them. In particular rx_noroutes maps to in unknown protos in RFC2863. The stats are exposed to userspace via AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of RTM_GETSTATS messages. All the introduced fields are 64-bit, even error ones, to ensure no overflow with long uptimes. Per-CPU counters are used to avoid cache-line contention on the commonly used fields. The other fields have also been made per-CPU for code to avoid performance problems in error conditions on the assumption that on some platforms the cost of atomic operations could be more expensive than sending the packet (which is what would be done in the success case). If that's not the case, we could instead not use per-CPU counters for these fields. Only unicast and non-fragment are exposed at the moment, but other counters can be exposed in the future either by adding to the end of struct mpls_link_stats or by additional netlink attributes in the AF_MPLS IFLA_STATS_AF_SPEC nested attribute. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17net: AF-specific RTM_GETSTATS attributesRobert Shearman
Add the functionality for including address-family-specific per-link stats in RTM_GETSTATS messages. This is done through adding a new IFLA_STATS_AF_SPEC attribute under which address family attributes are nested and then the AF-specific attributes can be further nested. This follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the advantage of presenting an easily extended hierarchy. The rtnl_af_ops structure is extended to provide AFs with the opportunity to fill and provide the size of their stats attributes. One alternative would have been to provide AFs with the ability to add attributes directly into the RTM_GETSTATS message without a nested hierarchy. I discounted this approach as it increases the rate at which the 32 attribute number space is used up and it makes implementation a little more tricky for stats dump resuming (at the moment the order in which attributes are added to the message has to match the numeric order of the attributes). Another alternative would have been to register per-AF RTM_GETSTATS handlers. I discounted this approach as I perceived a common use-case to be getting all the stats for an interface and this approach would necessitate multiple requests/dumps to retrieve them all. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Handle multicast packets properly in fast-RX path of mac80211, from Johannes Berg. 2) Because of a logic bug, the user can't actually force SW checksumming on r8152 devices. This makes diagnosis of hw checksumming bugs really annoying. Fix from Hayes Wang. 3) VXLAN route lookup does not take the source and destination ports into account, which means IPSEC policies cannot be matched properly. Fix from Martynas Pumputis. 4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger. 5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky. 6) If lwtunnel_fill_encap() fails, we do not abort the netlink message construction properly in fib_dump_info(), from David Ahern. 7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan Schmidt. 8) Openvswitch conntack actions need to maintain a correct checksum, fix from Lance Richardson. 9) ax25_disconnect() is missing a check for ax25->sk being NULL, in fact it already checks this, but not in all of the necessary spots. Fix from Basil Gunn. 10) Action GET operations in the packet scheduler can erroneously bump the reference count of the entry, making it unreleasable. Fix from Jamal Hadi Salim. Jamal gives a great set of example command lines that trigger this in the commit message. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits) net sched actions: fix refcnt when GETing of action after bind net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions net/mlx4_core: Fix racy CQ (Completion Queue) free net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered net/mlx5e: Fix a -Wmaybe-uninitialized warning ax25: Fix segfault after sock connection timeout bpf: rework prog_digest into prog_tag tipc: allocate user memory with GFP_KERNEL flag net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types ip6_tunnel: Account for tunnel header in tunnel MTU mld: do not remove mld souce list info when set link down be2net: fix MAC addr setting on privileged BE3 VFs be2net: don't delete MAC on close on unprivileged BE3 VFs be2net: fix status check in be_cmd_pmac_add() cpmac: remove hopeless #warning ravb: do not use zero-length alignment DMA descriptor mlx4: do not call napi_schedule() without care openvswitch: maintain correct checksum state in conntrack actions tcp: fix tcp_fastopen unaligned access complaints on sparc ...
2017-01-17esp: Introduce a helper to setup the trailerSteffen Klassert
We need to setup the trailer in two different cases, so add a helper to avoid code duplication. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>