path: root/include/linux/netdevice.h
2016-09-25  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  (Pablo Neira Ayuso)
Conflicts:
  net/netfilter/core.c
  net/netfilter/nf_tables_netdev.c
Resolve two conflicts before pull request for David's net-next tree:
1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment") from the net tree and commit ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").
2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and Aaron Conole's patches to replace list_head with single linked list.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25  netfilter: replace list_head with single linked list  (Aaron Conole)
The netfilter hook list never uses the prev pointer, and so can be trimmed to a simple singly-linked list. In addition to having a more lightweight structure for hook traversal, struct net becomes 5568 bytes (down from 6400) and struct net_device becomes 2176 bytes (down from 2240). Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
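The gist of the change, as a minimal sketch (the names below are illustrative, not the actual netfilter definitions): a hook chain only ever needs forward traversal, so a single next pointer replaces the two-pointer list_head.

  /* Illustrative sketch only -- not the real netfilter structures. */
  struct sk_buff;

  struct hook_entry {
          struct hook_entry *next;                 /* forward-only traversal */
          unsigned int (*hook)(void *priv, struct sk_buff *skb);
          void *priv;
  };

  static inline unsigned int run_hooks(struct hook_entry *e, struct sk_buff *skb)
  {
          unsigned int verdict = 1;                /* e.g. "accept" */

          for (; e; e = e->next)
                  verdict = e->hook(e->priv, skb);
          return verdict;
  }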
2016-09-24  net: Update API for VF vlan protocol 802.1ad support  (Moshe Shemesh)
Introduce new rtnl UAPI that exposes a list of vlans per VF, giving the ability for user-space application to specify it for the VF, as an option to support 802.1ad. We adjusted IP Link tool to support this option. For future use cases, the new UAPI supports multiple vlans. For now we limit the list size to a single vlan in kernel. Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward compatibility with older versions of IP Link tool. Add a vlan protocol parameter to the ndo_set_vf_vlan callback. We kept 802.1Q as the drivers' default vlan protocol. Suitable ip link tool command examples:
  Set vf vlan protocol 802.1ad:
    ip link set eth0 vf 1 vlan 100 proto 802.1ad
  Set vf to VST (802.1Q) mode:
    ip link set eth0 vf 1 vlan 100 proto 802.1Q
  Or by omitting the new parameter:
    ip link set eth0 vf 1 vlan 100
Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
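Based on the description, the extended callback in struct net_device_ops presumably gains a protocol argument roughly like the sketch below (parameter names and the mydrv_* driver function are assumptions for illustration, not quoted from the header):

  /* Sketch of the extended ndo with the new vlan protocol argument. */
  int (*ndo_set_vf_vlan)(struct net_device *dev, int vf, u16 vlan,
                         u8 qos, __be16 vlan_proto);

  /* A driver that only supports the 802.1Q default might begin with: */
  static int mydrv_set_vf_vlan(struct net_device *dev, int vf, u16 vlan,
                               u8 qos, __be16 vlan_proto)
  {
          if (vlan_proto != htons(ETH_P_8021Q))
                  return -EPROTONOSUPPORT;
          /* ... program the VF vlan/qos as before ... */
          return 0;
  }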
2016-09-21  net: cls_bpf: add hardware offload  (Jakub Kicinski)
This patch adds hardware offload capability to the cls_bpf classifier, similar to what has been done with U32 and flower. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18  netdevice: Add offload statistics ndo  (Nogah Frankel)
Add a new ndo to return statistics for offloaded operations. Since there can be many different offloaded operations with many stats types, the ndo gets an attribute id by which it knows which stats are wanted. The ndo also gets a void pointer to be cast according to the attribute id. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
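A rough sketch of how such a callback pair might look and be driven (names and exact signatures are assumptions drawn from the description above, not quoted from the header):

  /* Sketch of the callback pair in struct net_device_ops. */
  bool (*ndo_has_offload_stats)(const struct net_device *dev, int attr_id);
  int  (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev,
                                void *attr_data);

  /* Illustrative caller: ask first, then let the driver fill a buffer
   * whose layout is defined by attr_id. */
  static int get_offload_stats(struct net_device *dev, int attr_id, void *buf)
  {
          const struct net_device_ops *ops = dev->netdev_ops;

          if (!ops->ndo_has_offload_stats || !ops->ndo_get_offload_stats)
                  return -EOPNOTSUPP;
          if (!ops->ndo_has_offload_stats(dev, attr_id))
                  return -ENODATA;
          return ops->ndo_get_offload_stats(attr_id, dev, buf);
  }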
2016-09-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Conflicts:
  drivers/net/ethernet/mediatek/mtk_eth_soc.c
  drivers/net/ethernet/qlogic/qed/qed_dcbx.c
  drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04  bonding: Fix bonding crash  (Mahesh Bandewar)
The following few steps will crash the kernel:
  (a) Create bonding master:
      modprobe bonding miimon=50
  (b) Create macvlan bridge on eth2:
      ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 type macvlan
  (c) Now try adding eth2 into the bond:
      echo +eth2 > /sys/class/net/bond0/bonding/slaves
      <crash>
Bonding does lots of things before checking if the device being enslaved is busy or not. In this case, when the notifier call-chain sends notifications, bond_netdev_event() assumes that the rx_handler / rx_handler_data is registered, while bond_enslave() hasn't progressed far enough to register the rx_handler for the new slave. This patch adds an rx_handler check that can be performed right at the beginning of the enslave code to avoid getting into this situation. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
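The netdevice.h side of such a fix would be a small predicate the enslave path can call early; roughly of this shape (a sketch -- the helper name netdev_is_rx_handler_busy is an assumption here, since the message does not spell it out):

  /* Sketch: "is an rx_handler already registered?", to be checked under
   * RTNL before bonding commits to enslaving the device. */
  static inline bool netdev_is_rx_handler_busy(struct net_device *dev)
  {
          ASSERT_RTNL();
          return dev && rtnl_dereference(dev->rx_handler);
  }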
2016-09-01  rtnetlink: fdb dump: optimize by saving last interface markers  (Roopa Prabhu)
fdb dumps spanning multiple skb's currently restart from the first interface again for every skb. This results in unnecessary iterations on the already visited interfaces and their fdb entries. In large scale setups, we have seen this slow down fdb dumps considerably. On a system with 30k macs we see fdb dumps spanning across more than 300 skbs. To fix the problem, this patch replaces the existing single fdb marker with three markers: netdev hash entries, netdevs and fdb index, to continue where we left off instead of restarting from the first netdev. This is consistent with link dumps. In the process of fixing the performance issue, this patch also re-implements the fix done by commit 472681d57a5d ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump") (with an internal fix from Wilson Kok) in the following ways:
  - change ndo_fdb_dump handlers to return an error code instead of the last fdb index
  - use cb->args strictly for dump frag markers and not error codes. This is consistent with other dump functions.
Below results were taken on a system with 1000 netdevs and 35085 fdb entries:
before patch:
  $ time bridge fdb show | wc -l
  15065
  real 1m11.791s
  user 0m0.070s
  sys  1m8.395s
  (existing code does not return all macs)
after patch:
  $ time bridge fdb show | wc -l
  35085
  real 0m2.017s
  user 0m0.113s
  sys  0m1.942s
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26  bridge: switchdev: Add forward mark support for stacked devices  (Ido Schimmel)
switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of port netdevs so that packets being flooded by the device won't be flooded twice. It works by assigning a unique identifier (the ifindex of the first bridge port) to bridge ports sharing the same parent ID. This prevents packets from being flooded twice by the same switch, but will flood packets through bridge ports belonging to a different switch. This method is problematic when stacked devices are taken into account, such as VLANs. In such cases, a physical port netdev can have upper devices being members in two different bridges, thus requiring two different 'offload_fwd_mark's to be configured on the port netdev, which is impossible. The main problem is that packet and netdev marking is performed at the physical netdev level, whereas flooding occurs between bridge ports, which are not necessarily port netdevs. Instead, packet and netdev marking should really be done in the bridge driver with the switch driver only telling it which packets it already forwarded. The bridge driver will mark such packets using the mark assigned to the ingress bridge port and will prevent the packet from being forwarded through any bridge port sharing the same mark (i.e. having the same parent ID). Remove the current switchdev 'offload_fwd_mark' implementation and instead implement the proposed method. In addition, make rocker - the sole user of the mark - use the proposed method. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Minor overlapping changes for both merge conflicts. Resolution work done by Stephen Rothwell was used as a reference. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-13  net: remove type_check from dev_get_nest_level()  (Sabrina Dubroca)
The idea for type_check in dev_get_nest_level() was to count the number of nested devices of the same type (currently, only macvlan or vlan devices). This prevented the false positive lockdep warning on configurations such as:
  eth0 <--- macvlan0 <--- vlan0 <--- macvlan1
However, this doesn't prevent a warning on a configuration such as:
  eth0 <--- macvlan0 <--- vlan0
  eth1 <--- vlan1 <--- macvlan1
In this case, all the locks end up with a nesting subclass of 1, so lockdep thinks that there is still a deadlock:
  - in the first case we have (macvlan_netdev_addr_lock_key, 1) and then take (vlan_netdev_xmit_lock_key, 1)
  - in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then take (macvlan_netdev_addr_lock_key, 1)
By removing the linktype check in dev_get_nest_level() and always incrementing the nesting depth, lockdep considers this configuration valid. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-10  net: sched: convert qdisc linked list to hashtable  (Jiri Kosina)
Convert the per-device linked list into a hashtable. The primary motivation for this change is that currently, we're not tracking all the qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup performed over the linked list by qdisc_match_from_root() is rather expensive. The ultimate goal is to get rid of hidden qdiscs completely, which will bring much more determinism in user experience. Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
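As a generic illustration of the list-to-hashtable conversion (not the actual qdisc code), the kernel hashtable API turns the expensive walk into a keyed lookup; entry names below are made up for the example:

  #include <linux/hashtable.h>

  struct entry {
          u32 handle;
          struct hlist_node hnode;
  };

  static DEFINE_HASHTABLE(entries, 6);          /* 64 buckets */

  static void entry_add(struct entry *e)
  {
          hash_add(entries, &e->hnode, e->handle);
  }

  static struct entry *entry_find(u32 handle)
  {
          struct entry *e;

          hash_for_each_possible(entries, e, hnode, handle)
                  if (e->handle == handle)
                          return e;
          return NULL;
  }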
2016-07-24  net/sched: Add match-all classifier hw offloading.  (Yotam Gigi)
Following the work that has been done on offloading classifiers like u32 and flower, the match-all classifier can now be offloaded to hardware, provided the interface supports tc offloading. To control the offloading, two tc flags have been introduced: skip_sw and skip_hw. Typical usage:
  tc filter add dev eth25 parent ffff: \
    matchall skip_sw \
    action mirred egress mirror \
    dev eth27
Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Just several instances of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19  net: add ndo to setup/query xdp prog in adapter rx  (Brenden Blanco)
Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP filter. The single op is used for both setup/query of the xdp program, modelled after ndo_setup_tc. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
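Modelled on ndo_setup_tc, the single op plausibly takes a small command structure; a sketch under that assumption (the enum and field names are illustrative, not quoted from the header):

  /* Sketch: one ndo multiplexed by a command enum. */
  enum xdp_netdev_command {
          XDP_SETUP_PROG,
          XDP_QUERY_PROG,
  };

  struct netdev_xdp {
          enum xdp_netdev_command command;
          union {
                  struct bpf_prog *prog;   /* XDP_SETUP_PROG: NULL detaches */
                  bool prog_attached;      /* XDP_QUERY_PROG: the answer */
          };
  };

  int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);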
2016-07-16  vlan: use a valid default mtu value for vlan over macsec  (Paolo Abeni)
macsec can't cope with mtu frames which need vlan tag insertion, and the vlan device sets its default mtu equal to the underlying dev's one. As a result, vlan over macsec devices use an invalid default mtu, dropping all the large packets. This patch adds a netif helper to check if an upper vlan device needs mtu reduction. The helper is used during vlan device initialization to set a valid default and during mtu updates to forbid invalid, too big, mtu values. The helper currently only checks if the lower dev is a macsec device; if we get more users, we only need to update the helper (possibly reserving an additional IFF bit). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05  net: introduce default neigh_construct/destroy ndo calls for L2 upper devices  (Jiri Pirko)
L2 upper device needs to propagate neigh_construct/destroy calls down to lower devices. Do this by defining default ndo functions and use them in team, bond, bridge and vlan. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05  net: add dev arg to ndo_neigh_construct/destroy  (Jiri Pirko)
As the following patch will allow upper devices to follow the call down to lower devices, we need to add dev here and not rely on n->dev. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04  mlxsw: spectrum: Add couple of lower device helper functions  (Jiri Pirko)
Add functions that iterate over lower devices and find a port device. As a dependency, add the netdev_for_each_all_lower_dev and netdev_for_each_all_lower_dev_rcu macros with the netdev_all_lower_get_next and netdev_all_lower_get_next_rcu helpers. Also, add functions to return the mlxsw struct according to the lower device found and the mlxsw_port struct with a reference to the lower device. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
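Usage of the new iteration macro presumably mirrors the existing netdev_for_each_lower_dev pattern; a hedged sketch of a driver walking all lower devices to find its own port netdev (my_port_dev_check is a hypothetical driver-specific predicate):

  /* Sketch: walk every lower device (at any depth) under 'dev'. */
  static struct net_device *find_port_dev(struct net_device *dev)
  {
          struct net_device *lower;
          struct list_head *iter;

          netdev_for_each_all_lower_dev(dev, lower, iter) {
                  if (my_port_dev_check(lower))
                          return lower;
          }
          return NULL;
  }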
2016-07-01  net: introduce NETDEV_CHANGE_TX_QUEUE_LEN  (Jason Wang)
This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN - which will be triggered when tx_queue_len is changed. It can be used by net devices that want to do some processing at that time. An example is tun, which may want to resize its tx array when tx_queue_len is changed. Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
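A driver would typically pick this event up through a netdevice notifier; a minimal sketch of that pattern (the handler body is illustrative only):

  #include <linux/netdevice.h>
  #include <linux/notifier.h>

  static int my_netdev_event(struct notifier_block *nb,
                             unsigned long event, void *ptr)
  {
          struct net_device *dev = netdev_notifier_info_to_dev(ptr);

          if (event == NETDEV_CHANGE_TX_QUEUE_LEN) {
                  /* e.g. resize a per-device tx ring to dev->tx_queue_len */
                  pr_debug("%s: tx_queue_len changed\n", dev->name);
          }
          return NOTIFY_DONE;
  }

  static struct notifier_block my_nb = {
          .notifier_call = my_netdev_event,
  };

  /* registered from module init: register_netdevice_notifier(&my_nb); */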
2016-06-17  net: Remove deprecated tunnel specific UDP offload functions  (Alexander Duyck)
Now that we have all the drivers using udp_tunnel_get_rx_ports, ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the function calls that were specific to VXLAN and GENEVE. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17  net: Merge VXLAN and GENEVE push notifiers into a single notifier  (Alexander Duyck)
This patch merges the notifiers for VXLAN and GENEVE into a single UDP tunnel notifier. The idea is that we will want to only have to make one notifier call to receive the list of ports for VXLAN and GENEVE tunnels that need to be offloaded. In addition we add a new set of ndo functions named ndo_udp_tunnel_add and ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data such as port and address family as tunnels are added and removed. The tunnel meta-data is now transported in a structure named udp_tunnel_info which for now carries the type, address family, and port number. In the future this could be updated so that we can include a tuple of values including things such as the destination IP address and other fields. I also ended up going with a naming scheme that consisted of using the prefix udp_tunnel on function names. I applied this to the notifier and ndo ops as well so that it hopefully points to the fact that these are primarily used in the udp_tunnel functions. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
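Based on that description, the meta-data structure and ndo pair are presumably along these lines (a sketch; field order and exact types are assumptions):

  /* Sketch of the udp_tunnel_info carried to drivers on add/del. */
  struct udp_tunnel_info {
          unsigned short type;       /* e.g. UDP_TUNNEL_TYPE_VXLAN / _GENEVE */
          sa_family_t sa_family;     /* AF_INET or AF_INET6 */
          __be16 port;               /* UDP destination port, network order */
  };

  void (*ndo_udp_tunnel_add)(struct net_device *dev,
                             struct udp_tunnel_info *ti);
  void (*ndo_udp_tunnel_del)(struct net_device *dev,
                             struct udp_tunnel_info *ti);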
2016-06-15  ipv6: introduce neighbour discovery ops  (Alexander Aring)
This patch introduces a neighbour discovery ops callback structure. The idea is to separate the handling for 6LoWPAN into the 6lowpan module. These callbacks offer 6LoWPAN different handling, such as 802.15.4 short address handling or RFC6775 (Neighbor Discovery Optimization for IPv6 over 6LoWPANs). Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15  6lowpan: add private neighbour data  (Alexander Aring)
This patch introduces 6lowpan neighbour private data. Like the interface private data, we handle private data for generic 6lowpan and for link-layer specific 6lowpan. The first use case is to save the short address for a 802.15.4 6lowpan neighbour. Cc: David S. Miller <davem@davemloft.net> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-12  sched: remove NET_XMIT_POLICED  (Florian Westphal)
sch_atm returns this when TC_ACT_SHOT classification occurs. But all other schedulers that use tc_classify (htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS in this case so just do that in atm. BATMAN uses it as an intermediate return value to signal forwarding vs. buffering, but it did not return POLICED to callers outside of BATMAN. Reviewed-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10  bpf: enforce recursion limit on redirects  (Daniel Borkmann)
Respect the stack's xmit_recursion limit for calls into dev_queue_xmit(). Currently, they are not handled by the limiter when attached to clsact's egress parent, for example, and a buggy program redirecting to the same device again could eventually run into a stack overflow. It would be good if we could notify an admin to give him a chance to react. We reuse xmit_recursion instead of having one private to eBPF, so that the stack's current recursion depth will be taken into account as well. Follow-up to commit 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper") and 27b29f63058d ("bpf: add bpf_redirect() helper"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09  net: add netdev_lockdep_set_classes() helper  (Eric Dumazet)
It is time to add netdev_lockdep_set_classes() helper so that lockdep annotations per device type are easier to manage. This removes a lot of copies and missing annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
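A hedged usage sketch: a stacked/virtual device driver would call the helper once while the device is being set up, so its per-device-type lockdep classes are assigned in one place (the init function and ops below are hypothetical):

  /* Illustrative only: assign lockdep classes with the single helper
   * instead of open-coded per-lock annotations. */
  static int myvirt_dev_init(struct net_device *dev)
  {
          netdev_lockdep_set_classes(dev);
          return 0;
  }

  static const struct net_device_ops myvirt_netdev_ops = {
          .ndo_init = myvirt_dev_init,
          /* ... */
  };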
2016-06-08  net: sched: fix missing doc annotations  (Eric Dumazet)
"make htmldocs" complains otherwise:
  .//net/core/gen_stats.c:168: warning: No description found for parameter 'running'
  .//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key'
Fixes: f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount") Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07  net_sched: transform qdisc running bit into a seqcount  (Eric Dumazet)
Instead of using a single bit (__QDISC___STATE_RUNNING) in sch->__state, use a seqcount. This adds lockdep support, but more importantly it will allow us to sample qdisc/class statistics without having to grab qdisc root lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
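The general seqcount pattern this relies on, as a generic illustration (not the actual qdisc code): the running path marks a write section around its updates, and a lockless stats reader retries if it raced with a writer.

  #include <linux/seqlock.h>

  static seqcount_t running_seq = SEQCNT_ZERO(running_seq);
  static u64 sample_counter;

  static void xmit_path_update(void)
  {
          write_seqcount_begin(&running_seq);
          sample_counter++;                /* stats updated under the seqcount */
          write_seqcount_end(&running_seq);
  }

  static u64 stats_read(void)
  {
          unsigned int seq;
          u64 val;

          do {
                  seq = read_seqcount_begin(&running_seq);
                  val = sample_counter;
          } while (read_seqcount_retry(&running_seq, seq));

          return val;
  }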
2016-06-03  sctp: Add GSO support  (Marcelo Ricardo Leitner)
SCTP has this peculiarity that its packets cannot be just segmented to (P)MTU. Its chunks must be contained in IP segments, padding respected. So we can't just generate a big skb, set gso_size to the fragmentation point and deliver it to the IP layer. This patch takes a different approach. SCTP will now build a skb as it would be if it was received using GRO. That is, there will be a cover skb with protocol headers and children ones containing the actual segments, already segmented in a way that respects SCTP RFCs. With that, we can tell skb_segment() to just split based on frag_list, trusting its sizes are already in accordance. This way SCTP can benefit from GSO and instead of passing several packets through the stack, it can pass a single large packet.
v2:
  - Added support for receiving GSO frames, as requested by Dave Miller.
  - Clear skb->cb if packet is GSO (otherwise it's not used by SCTP).
  - Added heuristics similar to what we have in TCP for not generating single GSO packets that fill cwnd.
v3:
  - consider sctphdr size in skb_gso_transport_seglen()
  - rebased due to 5c7cdf339af5 ("gso: Remove arbitrary checks for unsupported GSO")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20  net: define gso types for IPx over IPv4 and IPv6  (Tom Herbert)
This patch defines two new GSO definitions, SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6, along with corresponding NETIF_F_GSO_IPXIP4 and NETIF_F_GSO_IPXIP6. These are used to describe IP-in-IP tunnels and what the outer protocol is. The inner protocol can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and SKB_GSO_TCPV6). The GSO types SKB_GSO_IPIP and SKB_GSO_SIT are removed (these are both instances of SKB_GSO_IPXIP4). SKB_GSO_IPXIP6 will be used when support for GSO with IP encapsulation over IPv6 is added. Signed-off-by: Tom Herbert <tom@herbertland.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16  bpf: move bpf_jit_enable declaration  (Daniel Borkmann)
Move the bpf_jit_enable declaration to the filter.h file where most other core code is declared, also since we're going to add a second knob there. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11  net: l3mdev: Add hook in ip and ipv6  (David Ahern)
Currently the VRF driver uses the rx_handler to switch the skb device to the VRF device. Switching the dev prior to the ip / ipv6 layer means the VRF driver has to duplicate IP/IPv6 processing which adds overhead and makes features such as retaining the ingress device index more complicated than necessary. This patch moves the hook to the L3 layer just after the first NF_HOOK for PRE_ROUTING. This location makes exposing the original ingress device trivial (next patch) and allows adding other NF_HOOKs to the VRF driver in the future. dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb with the switched device through the packet taps to maintain current behavior (tcpdump can be used on either the vrf device or the enslaved devices). Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04  net: remove dev->trans_start  (Florian Westphal)
Previous patches removed all direct accesses to dev->trans_start, so change the netif_trans_update helper to update trans_start of netdev queue 0 instead, and then remove trans_start from struct net_device. AFAICS a lot of the netif_trans_update() invocations are now useless because they occur in ndo_start_xmit and the driver doesn't set LLTX (i.e. the stack already took care of the update). As I can't test any of them, it seems better to just leave them alone. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04  netdevice: add helper to update trans_start  (Florian Westphal)
trans_start exists twice:
  - as a member of net_device (legacy)
  - as a member of netdev_queue
In order to get rid of the legacy case, add a helper that updates dev->trans_start (this patch), then convert spots that do dev->trans_start = jiffies to use this helper (next patch). This will then allow us to change the helper so that it updates the trans_start of netdev queue 0 instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
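The helper added here is presumably just a thin wrapper; a sketch of the idea, not the verbatim header (the 2016-05-04 "net: remove dev->trans_start" entry above then repoints it at queue 0):

  /* Sketch: stage 1 of the conversion keeps writing the legacy field. */
  static inline void netif_trans_update(struct net_device *dev)
  {
          dev->trans_start = jiffies;
  }

  /* Stage 2 (after dev->trans_start is removed) would become roughly:
   *
   *     struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
   *
   *     if (txq->trans_start != jiffies)
   *             txq->trans_start = jiffies;
   */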
2016-05-04  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Conflicts: net/ipv4/ip_gre.c Minor conflicts between tunnel bug fixes in net and ipv6 tunnel cleanups in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02  netdevice: shrink size of struct netdev_queue  (Florian Westphal)
- trans_timeout is incremented when tx queue timed out (tx watchdog).
- tx_maxrate is set via sysfs
Moving tx_maxrate to read-mostly part shrinks the struct by 64 bytes. While at it, also move trans_timeout (it is out-of-place in the 'write-mostly' part). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-29  net: constify is_skb_forwardable's arguments  (Nikolay Aleksandrov)
is_skb_forwardable is not supposed to change anything, so constify its arguments. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28  net: fix net_gso_ok for new GSO types.  (Marcelo Ricardo Leitner)
Fix casting in net_gso_ok. Otherwise the shift in gso_type << NETIF_F_GSO_SHIFT may hit the 32nd bit and make it look like INT_MIN, which is then promoted from signed to uint64 as 0xffffffff80000000, resulting in wrong behavior when it is and'ed with the feature itself, as in this test app:
  #include <stdio.h>
  #include <stdint.h>

  int main(int argc, char **argv)
  {
          uint64_t feature1;
          uint64_t feature2;
          int gso_type = 1 << 15;

          feature1 = gso_type << 16;
          feature2 = (uint64_t)gso_type << 16;
          printf("%lx %lx\n", feature1, feature2);

          return 0;
  }
which gives:
  ffffffff80000000 80000000
So that this:
  return (features & feature) == feature;
actually works on more bits than expected, including invalid ones. The fix is to promote it earlier. Issue noted while rebasing the SCTP GSO patch but posting separately as someone else may experience this meanwhile. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27  net-rfs: fix false sharing accessing sd->input_queue_head  (Eric Dumazet)
sd->input_queue_head is incremented for each processed packet in process_backlog(), and read from other cpus performing Out Of Order avoidance in get_rps_cpu(). Moving this field to a separate cache line keeps it mostly hot for the cpu in process_backlog(), as other cpus will only read it. In a stress test, process_backlog() was consuming 6.80 % of cpu cycles, and the patch reduced the cost to 0.65 %. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26  net: remove NETDEV_TX_LOCKED support  (Florian Westphal)
No more users in the tree, remove NETDEV_TX_LOCKED support. Adds another hole in softnet_stats struct, but better than keeping the unused collision counter around. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21  geneve: break dependency with netdev drivers  (Hannes Frederic Sowa)
Equivalent to "vxlan: break dependency with netdev drivers", don't autoload geneve module in case the driver is loaded. Instead make the coupling weaker by using netdevice notifiers as proxy. Cc: Jesse Gross <jesse@kernel.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21  vxlan: break dependency with netdev drivers  (Hannes Frederic Sowa)
Currently all drivers depend on and autoload the vxlan module because of how vxlan_get_rx_port is linked into them. Remove this dependency: by using a new event type in the netdevice notifier call chain, we proxy the request from the drivers to flush and re-setup the vxlan ports not directly via function call but through the already existing netdevice notifier call chain. I added a separate new event type, NETDEV_OFFLOAD_PUSH_VXLAN, to do so. We don't need to save those ids, as the event type field is an unsigned long and using specialized event types for this purpose seemed to be a more elegant way. This also comes in handy if in future we want to add offloading knobs for vxlan. Cc: Jesse Gross <jesse@kernel.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14  GSO: Support partial segmentation offload  (Alexander Duyck)
This patch adds support for something I am referring to as GSO partial. The basic idea is that we can support a broader range of devices for segmentation if we use fixed outer headers and have the hardware only really deal with segmenting the inner header. The idea behind the naming is due to the fact that everything before csum_start will be fixed headers, and everything after will be the region that is handled by hardware. With the current implementation it allows us to add support for the following GSO types with an inner TSO_MANGLEID or TSO6 offload:
  NETIF_F_GSO_GRE
  NETIF_F_GSO_GRE_CSUM
  NETIF_F_GSO_IPIP
  NETIF_F_GSO_SIT
  NETIF_F_UDP_TUNNEL
  NETIF_F_UDP_TUNNEL_CSUM
In the case of hardware that already supports tunneling we may be able to extend this further to support TSO_TCPV4 without TSO_MANGLEID if the hardware can support updating inner IPv4 headers. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14  GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values  (Alexander Duyck)
This patch does two things. First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field. As a result we should now be able to aggregate flows that were converted from IPv6 to IPv4. In addition this allows us more flexibility for future implementations of segmentation as we may be able to use a fixed IP ID when segmenting the flow. The second thing this does is that it places limitations on the outer IPv4 ID header in the case of tunneled frames. Specifically it forces the IP ID to be incrementing by 1 unless the DF bit is set in the outer IPv4 header. This way we can avoid creating overlapping series of IP IDs that could possibly be fragmented if the frame goes through GRO and is then resegmented via GSO. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14  GSO: Add GSO type for fixed IPv4 ID  (Alexander Duyck)
This patch adds support for TSO using IPv4 headers with a fixed IP ID field. This is meant to allow us to do a lossless GRO in the case of TCP flows that use a fixed IP ID, such as those that convert IPv6 headers to IPv4 headers. In addition I am adding a feature that for now I am referring to as TSO with IP ID mangling. Basically when this flag is enabled the device has the option to either output the flow with incrementing IP IDs or with a fixed IP ID regardless of what the original IP ID ordering was. This is useful in cases where the DF bit is set and we do not care if the original IP ID value is maintained. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14  net: remove netdevice gso_min_segs  (Eric Dumazet)
After the introduction of ndo_features_check(), we believe that very specific checks for rare features should not be done in the core networking stack. No driver uses gso_min_segs yet, so we revert this feature and save a few instructions per tx packet in the fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13  net: force inlining of netif_tx_start/stop_queue, sock_hold, __sock_put  (Denys Vlasenko)
Sometimes gcc mysteriously doesn't inline very small functions we expect to be inlined. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122 Arguably, gcc should do better, but gcc people aren't willing to invest time into it, asking to use __always_inline instead. With this .config: http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os, the following functions get deinlined many times.
netif_tx_stop_queue: 207 copies, 590 calls:
  55                       push %rbp
  48 89 e5                 mov  %rsp,%rbp
  f0 80 8f e0 01 00 00 01  lock orb $0x1,0x1e0(%rdi)
  5d                       pop  %rbp
  c3                       retq
netif_tx_start_queue: 47 copies, 111 calls:
  55                       push %rbp
  48 89 e5                 mov  %rsp,%rbp
  f0 80 a7 e0 01 00 00 fe  lock andb $0xfe,0x1e0(%rdi)
  5d                       pop  %rbp
  c3                       retq
sock_hold: 39 copies, 124 calls:
  55                       push %rbp
  48 89 e5                 mov  %rsp,%rbp
  f0 ff 87 80 00 00 00     lock incl 0x80(%rdi)
  5d                       pop  %rbp
  c3                       retq
__sock_put: 6 copies, 13 calls:
  55                       push %rbp
  48 89 e5                 mov  %rsp,%rbp
  f0 ff 8f 80 00 00 00     lock decl 0x80(%rdi)
  5d                       pop  %rbp
  c3                       retq
This patch fixes this via s/inline/__always_inline/. Code size decrease after the patch is ~2.5k:
      text     data      bss       dec      hex  filename
  56719876 56364551 36196352 149280779 8e5d80b  vmlinux_before
  56717440 56364551 36196352 149278343 8e5ce87  vmlinux
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: David S. Miller <davem@davemloft.net> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-09  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
2016-04-07  GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU  (Alexander Duyck)
This patch fixes an issue I found in which we were dropping frames if we had enabled checksums on GRE headers that were encapsulated by either FOU or GUE. Without this patch I was barely able to get 1 Gb/s of throughput. With this patch applied I am now at least getting around 6 Gb/s. The issue is due to the fact that with FOU or GUE applied we do not provide a transport offset pointing to the GRE header, nor do we offload it in software as the GRE header is completely skipped by GSO and treated like a VXLAN or GENEVE type header. As such we need to prevent the stack from generating it and also prevent GRE from generating it via any interface we create. Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>