path: root/include/net
2019-04-08xfrm: store xfrm_mode directly, not its addressFlorian Westphal
This structure is now only 4 bytes, so it's more efficient to cache a copy rather than its address.
No significant size difference in allmodconfig vmlinux.
With a non-modular kernel that has all XFRM options enabled, this series reduces vmlinux image size by ~11kb. All xfrm_mode indirections are gone and all modes are built-in.
before (ipsec-next master):
        text    data      bss      dec filename
    21071494 7233140 11104324 39408958 vmlinux.master
after this series:
        text    data      bss      dec filename
    21066448 7226772 11104324 39397544 vmlinux.patched
With an allmodconfig kernel, the size increase is only 362 bytes, even though all the xfrm config options removed in this series are modular.
before:
        text    data     bss      dec filename
    15731286 6936912 4046908 26715106 vmlinux.master
after this series:
        text    data     bss      dec filename
    15731492 6937068 4046908 26715468 vmlinux
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08xfrm: make xfrm modes builtinFlorian Westphal
After previous changes, xfrm_mode contains no function pointers anymore, and all modules defining such a struct contain no code except init/exit functions to register the xfrm_mode struct with the xfrm core.
Just place the xfrm modes in the core and remove the modules; the run-time xfrm_mode register/unregister functionality is removed.
Before:
        text    data     bss      dec filename
        7523     200    2364    10087 net/xfrm/xfrm_input.o
       40003     628     440    41071 net/xfrm/xfrm_state.o
    15730338 6937080 4046908 26714326 vmlinux
After:
        text    data     bss      dec filename
        7389     200    2364     9953 net/xfrm/xfrm_input.o
       40574     656     440    41670 net/xfrm/xfrm_state.o
    15730084 6937068 4046908 26714060 vmlinux
The xfrm*_mode_{transport,tunnel,beet} modules are gone.
v2: replace CONFIG_INET6_XFRM_MODE_* IS_ENABLED guards with CONFIG_IPV6 ones rather than removing them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08xfrm: remove afinfo pointer from xfrm_modeFlorian Westphal
Adds an EXPORT_SYMBOL for afinfo_get_rcu, as it will now be called from ipv6 in case of CONFIG_IPV6=m. This change has virtually no effect on vmlinux size, but it reduces afinfo size and allows followup patch to make xfrm modes const. v2: mark if (afinfo) tests as likely (Sabrina) re-fetch afinfo according to inner_mode in xfrm_prepare_input(). Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08xfrm: remove output2 indirection from xfrm_modeFlorian Westphal
Similar to previous patch: no external module dependencies, so we can avoid the indirection by placing this in the core.
This change removes the last indirection from xfrm_mode and the xfrm4|6_mode_{beet,tunnel}.c modules contain (almost) no code anymore.
Before:
    text data bss  dec  hex filename
    3957  136   0 4093  ffd net/xfrm/xfrm_output.o
     587   44   0  631  277 net/ipv4/xfrm4_mode_beet.o
     649   32   0  681  2a9 net/ipv4/xfrm4_mode_tunnel.o
     625   44   0  669  29d net/ipv6/xfrm6_mode_beet.o
     599   32   0  631  277 net/ipv6/xfrm6_mode_tunnel.o
After:
    text data bss  dec  hex filename
    5359  184   0 5543 15a7 net/xfrm/xfrm_output.o
     171   24   0  195   c3 net/ipv4/xfrm4_mode_beet.o
     171   24   0  195   c3 net/ipv4/xfrm4_mode_tunnel.o
     172   24   0  196   c4 net/ipv6/xfrm6_mode_beet.o
     172   24   0  196   c4 net/ipv6/xfrm6_mode_tunnel.o
v2:
 - fold the *encap_add functions into xfrm*_prepare_output
 - preserve (move) output2 comment (Sabrina)
 - use x->outer_mode->encap, not inner
 - fix a build breakage on ppc (kbuild robot)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08xfrm: remove input2 indirection from xfrm_modeFlorian Westphal
No external dependencies on any module, place this in the core. The increase is about 1800 bytes for xfrm_input.o.
The beet helpers get added to an internal header, as they can be reused from xfrm_output.c in the next patch (kernel contains several copies of them in the xfrm{4,6}_mode_beet.c files).
Before:
    text data  bss  dec filename
    5578  176 2364 8118 net/xfrm/xfrm_input.o
    1180   64    0 1244 net/ipv4/xfrm4_mode_beet.o
     171   40    0  211 net/ipv4/xfrm4_mode_transport.o
    1163   40    0 1203 net/ipv4/xfrm4_mode_tunnel.o
    1083   52    0 1135 net/ipv6/xfrm6_mode_beet.o
     172   40    0  212 net/ipv6/xfrm6_mode_ro.o
     172   40    0  212 net/ipv6/xfrm6_mode_transport.o
    1056   40    0 1096 net/ipv6/xfrm6_mode_tunnel.o
After:
    text data  bss  dec filename
    7373  200 2364 9937 net/xfrm/xfrm_input.o
     587   44    0  631 net/ipv4/xfrm4_mode_beet.o
     171   32    0  203 net/ipv4/xfrm4_mode_transport.o
     649   32    0  681 net/ipv4/xfrm4_mode_tunnel.o
     625   44    0  669 net/ipv6/xfrm6_mode_beet.o
     172   32    0  204 net/ipv6/xfrm6_mode_ro.o
     172   32    0  204 net/ipv6/xfrm6_mode_transport.o
     599   32    0  631 net/ipv6/xfrm6_mode_tunnel.o
v2: pass inner_mode to xfrm_inner_mode_encap_remove to fix AF_UNSPEC selector breakage (bisected by Benedict Wong)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08xfrm: remove gso_segment indirection from xfrm_modeFlorian Westphal
These functions are small and we only have versions for tunnel and transport mode for ipv4 and ipv6 respectively.
Just place the 'transport or tunnel' conditional in the protocol specific function instead of using an indirection.
Before:
        text    data     bss      dec filename
        3226      12       0     3238 net/ipv4/esp4_offload.o
        7004     492       0     7496 net/ipv4/ip_vti.o
        3339      12       0     3351 net/ipv6/esp6_offload.o
       11294     460       0    11754 net/ipv6/ip6_vti.o
        1180      72       0     1252 net/ipv4/xfrm4_mode_beet.o
         428      48       0      476 net/ipv4/xfrm4_mode_transport.o
        1271      48       0     1319 net/ipv4/xfrm4_mode_tunnel.o
        1083      60       0     1143 net/ipv6/xfrm6_mode_beet.o
         172      48       0      220 net/ipv6/xfrm6_mode_ro.o
         429      48       0      477 net/ipv6/xfrm6_mode_transport.o
        1164      48       0     1212 net/ipv6/xfrm6_mode_tunnel.o
    15730428 6937008 4046908 26714344 vmlinux
After:
        text    data     bss      dec filename
        3461      12       0     3473 net/ipv4/esp4_offload.o
        7000     492       0     7492 net/ipv4/ip_vti.o
        3574      12       0     3586 net/ipv6/esp6_offload.o
       11295     460       0    11755 net/ipv6/ip6_vti.o
        1180      64       0     1244 net/ipv4/xfrm4_mode_beet.o
         171      40       0      211 net/ipv4/xfrm4_mode_transport.o
        1163      40       0     1203 net/ipv4/xfrm4_mode_tunnel.o
        1083      52       0     1135 net/ipv6/xfrm6_mode_beet.o
         172      40       0      212 net/ipv6/xfrm6_mode_ro.o
         172      40       0      212 net/ipv6/xfrm6_mode_transport.o
        1056      40       0     1096 net/ipv6/xfrm6_mode_tunnel.o
    15730424 6937008 4046908 26714340 vmlinux
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08xfrm: remove xmit indirection from xfrm_modeFlorian Westphal
There are only two versions (tunnel and transport). The ip/ipv6 versions differ only in sizeof(iphdr) vs ipv6hdr.
Place this in the core and use x->outer_mode->encap type to call the correct adjustment helper.
Before:
        text    data     bss      dec filename
    15730311 6937008 4046908 26714227 vmlinux
After:
        text    data     bss      dec filename
    15730428 6937008 4046908 26714344 vmlinux
(about 117 byte increase)
v2: use family from x->outer_mode, not inner
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
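The idea of dispatching on the encap type rather than calling through a per-mode function pointer can be sketched in plain C as follows. This is a minimal user-space illustration, not the kernel code: every `sketch_` name, the enum values, and the helper split are assumptions made for the example. Only the general facts are relied on: tunnel mode adds an outer IP header, transport mode does not, and the fixed IPv4 and IPv6 headers are 20 and 40 bytes.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
enum sketch_encap { SKETCH_ENCAP_TRANSPORT, SKETCH_ENCAP_TUNNEL };
enum sketch_family { SKETCH_FAMILY_IPV4, SKETCH_FAMILY_IPV6 };

#define SKETCH_IPV4_HDR_LEN 20	/* IPv4 header without options */
#define SKETCH_IPV6_HDR_LEN 40	/* fixed IPv6 header */

struct sketch_mode {
	unsigned char encap;
	unsigned char family;
};

/* The per-family variants differ only in the header size. */
static size_t sketch_ip_hdr_len(unsigned char family)
{
	return family == SKETCH_FAMILY_IPV6 ? SKETCH_IPV6_HDR_LEN
					    : SKETCH_IPV4_HDR_LEN;
}

/*
 * Bytes of outer IP header that the mode adds in front of the payload.
 * A switch on the encap type replaces a call through a function pointer.
 */
static size_t sketch_outer_hdr_len(const struct sketch_mode *mode)
{
	switch (mode->encap) {
	case SKETCH_ENCAP_TUNNEL:
		return sketch_ip_hdr_len(mode->family);
	case SKETCH_ENCAP_TRANSPORT:
	default:
		return 0;	/* transport mode keeps the original IP header */
	}
}
```

With only two modes and two families, the switch is small enough for the compiler to inline, which is the kind of trade the commit describes: slightly more code in the core in exchange for no indirect call.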
2019-04-08xfrm: remove output indirection from xfrm_modeFlorian Westphal
Same as input indirection. Only exception: we need to export xfrm_outer_mode_output for pktgen.
Increases size of vmlinux by about 163 bytes:
Before:
        text    data     bss      dec filename
    15730208 6936948 4046908 26714064 vmlinux
After:
        text    data     bss      dec filename
    15730311 6937008 4046908 26714227 vmlinux
xfrm_inner_extract_output has no more external callers, make it static.
v2:
 - add IS_ENABLED(IPV6) guard in xfrm6_prepare_output
 - add two missing breaks in xfrm_outer_mode_output (Sabrina Dubroca)
 - add WARN_ON_ONCE for 'call AF_INET6 related output function, but CONFIG_IPV6=n' case
 - make xfrm_inner_extract_output static
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08xfrm: remove input indirection from xfrm_modeFlorian Westphal
No need for any indirection or abstraction here, both functions are pretty much the same and quite small, they also have no external dependencies.
xfrm_prepare_input can then be made static.
With allmodconfig build, size increase of vmlinux is 25 bytes:
Before:
        text    data     bss      dec filename
    15730207 6936924 4046908 26714039 vmlinux
After:
        text    data     bss      dec filename
    15730208 6936948 4046908 26714064 vmlinux
v2:
 - Fix INET_XFRM_MODE_TRANSPORT name in is-enabled test (Sabrina Dubroca)
 - change copied comment to refer to transport and network header, not skb->{h,nh}, which don't exist anymore (Sabrina)
 - make xfrm_prepare_input static (Eyal Birger)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08xfrm: place af number into xfrm_mode structFlorian Westphal
This will be useful to know if we're supposed to decode ipv4 or ipv6. While at it, make the unregister function return void, all module_exit functions did just BUG(); there is never a point in doing error checks if there is no way to handle such error. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-06nfc: nci: Potential off by one in ->pipes[] arrayDan Carpenter
This is similar to commit e285d5bfb7e9 ("NFC: Fix the number of pipes") where we changed NFC_HCI_MAX_PIPES from 127 to 128.
As the comment next to the define explains, the pipe identifier is 7 bits long. The highest possible pipe is 127, but the number of possible pipes is 128. As the code stands, there is potential for an out-of-bounds array access:
    net/nfc/nci/hci.c:297 nci_hci_cmd_received() warn: array off by one? 'ndev->hci_dev->pipes[pipe]' '0-127 == 127'
Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
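The arithmetic behind the fix can be illustrated with a small standalone sketch. The names below (`SKETCH_MAX_PIPES`, `sketch_pipe_lookup`, and so on) are hypothetical and do not come from the NFC code; the only fact taken from the commit is that a 7-bit identifier ranges over 0..127 and therefore needs 128 array slots.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PIPE_ID_BITS 7
/* A 7-bit id covers 0..127, so the array needs 128 entries, not 127. */
#define SKETCH_MAX_PIPES (1u << SKETCH_PIPE_ID_BITS)

struct sketch_pipe {
	uint8_t gate;
	uint8_t host;
};

static struct sketch_pipe sketch_pipes[SKETCH_MAX_PIPES];

/* Return the slot for a pipe id, or NULL if the id is out of range. */
static struct sketch_pipe *sketch_pipe_lookup(unsigned int pipe)
{
	if (pipe >= SKETCH_MAX_PIPES)
		return NULL;
	return &sketch_pipes[pipe];
}
```

Sizing the array from the bit width, rather than from the largest identifier, avoids having to remember the "+1" at each declaration.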
2019-04-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor comment merge conflict in mlx5. Staging driver has a fixup due to the skb->xmit_more changes in 'net-next', but was removed in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04net: devlink: introduce devlink_compat_switch_id_get() helperJiri Pirko
Introduce devlink_compat_switch_id_get() helper which fills up switch_id according to passed netdev pointer. Call it directly from dev_get_port_parent_id() as a fallback when ndo_get_port_parent_id is not defined for given netdev. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04net: devlink: extend port attrs for switch IDJiri Pirko
Extend devlink_port_attrs_set() to pass switch ID for ports which are part of switch and store it in port attrs. For other ports, this is NULL. Note that this allows the driver to group devlink ports into one or more switches according to the actual topology. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04net: devlink: convert devlink_port_attrs bools to bitsJiri Pirko
In order to save space in the struct, convert bools to bits. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
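As a rough illustration of the saving, the sketch below compares a struct of three bool flags with the same flags packed as one-bit fields. The struct and field names are hypothetical, and bit-fields of type `uint8_t` rely on compiler support beyond what the C standard guarantees (GCC and Clang accept them), so the exact sizes are an assumption about the toolchain.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One byte per flag on common ABIs. */
struct sketch_attrs_bools {
	bool is_set;
	bool is_split;
	bool is_switch_port;
};

/* The same three flags share a single byte. */
struct sketch_attrs_bits {
	uint8_t is_set:1,
		is_split:1,
		is_switch_port:1;
};

/* Bytes saved by the bit-field layout on this compiler and target. */
static size_t sketch_bytes_saved(void)
{
	return sizeof(struct sketch_attrs_bools) -
	       sizeof(struct sketch_attrs_bits);
}
```

The flags are still read and written by name (`attrs.is_split = 1`), so callers usually need no changes beyond no longer taking a flag's address.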
2019-04-03ipv6: Flip to fib_nexthop_infoDavid Ahern
Export fib_nexthop_info and fib_add_nexthop for use by IPv6 code. Remove rt6_nexthop_info and rt6_add_nexthop in favor of the IPv4 versions. Update fib_nexthop_info for IPv6 linkdown check and RTA_GATEWAY for AF_INET6. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03ipv4: Add fib_nh_common to fib_resultDavid Ahern
Most of the ipv4 code only needs data from fib_nh_common. Add fib_nh_common selection to fib_result and update users to use it.
Right now, fib_nh_common in fib_result will point to a fib_nh struct that is embedded within a fib_info:
        fib_info  --> fib_nh
                      fib_nh
                      ...
                      fib_nh
                        ^
    fib_result->nhc ----+
Later, nhc can point to a fib_nh within a nexthop struct:
        fib_info --> nexthop --> fib_nh
                                   ^
    fib_result->nhc ---------------+
or for a nexthop group:
        fib_info --> nexthop --> nexthop --> fib_nh
                                 nexthop --> fib_nh
                                 ...
                                 nexthop --> fib_nh
                                               ^
    fib_result->nhc ---------------------------+
In all cases nhsel within fib_result will point to which leg in the multipath route is used.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
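The first case, where the result points at the common part of a nexthop embedded in the route info, might look roughly like this in standalone C. All `sketch_` types and fields are hypothetical simplifications and do not reproduce the kernel structures; the example only shows the pointer relationship and how the containing nexthop can be recovered from the common part.

```c
#include <stddef.h>

struct sketch_nh_common {
	int oif;			/* output interface index */
};

struct sketch_nh {
	struct sketch_nh_common common;	/* shared attributes */
	int classid;			/* family-specific extra */
};

struct sketch_info {
	int nhs;			/* number of legs in use */
	struct sketch_nh nh[2];		/* embedded nexthops */
};

struct sketch_result {
	const struct sketch_info *fi;
	int nhsel;				/* which leg is used */
	const struct sketch_nh_common *nhc;	/* common part of that leg */
};

/* Record the chosen leg and point nhc at its common part. */
static void sketch_select_path(struct sketch_result *res,
			       const struct sketch_info *fi, int nhsel)
{
	res->fi = fi;
	res->nhsel = nhsel;
	res->nhc = &fi->nh[nhsel].common;
}

/* Recover the containing nexthop from its common part (container_of style). */
static const struct sketch_nh *
sketch_nh_from_common(const struct sketch_nh_common *nhc)
{
	return (const struct sketch_nh *)
		((const char *)nhc - offsetof(struct sketch_nh, common));
}
```

Code that only reads common attributes can work from `res->nhc` alone, which is what lets the pointer later refer to a nexthop stored somewhere other than the route info.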
2019-04-01net: dsa: read mac address from DT for slave deviceXiaofei Shen
Before creating a slave netdevice, get the mac address from DTS and apply in case it is valid. Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org> Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01net: sched: introduce and use qdisc tree flush/purge helpersPaolo Abeni
The same code to flush qdisc tree and purge the qdisc queue is duplicated in many places and in most cases it does not respect NOLOCK qdisc: the global backlog len is used and the per CPU values are ignored. This change addresses the above, factoring-out the relevant code and using the helpers introduced by the previous patch to fetch the correct backlog len. Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01net: sched: introduce and use qstats read helpersPaolo Abeni
Classful qdiscs can't directly access the child qdisc's backlog length: if such a qdisc is NOLOCK, per CPU values should be accounted instead. Most qdiscs do not respect the above. As a result, qstats fetching for most classful qdiscs is currently incorrect: if the child qdisc is NOLOCK, it always reports 0 len backlog. This change introduces a pair of helpers to safely fetch both backlog and qlen and uses them in stats class dumping functions, fixing the above issue and cleaning up the code a bit. DRR also needs to access the child qdisc queue length, so it needs custom handling. Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
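A helper of the kind described, one that reads the global counters for a locked queue and sums the per-CPU counters otherwise, could be sketched like this. The struct layout, the fixed CPU count, and all `sketch_` names are assumptions for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS 4	/* fixed CPU count, for the sketch only */

struct sketch_qstats {
	unsigned int qlen;	/* packets queued */
	unsigned int backlog;	/* bytes queued */
};

struct sketch_qdisc {
	bool percpu;					/* true for a lockless queue */
	struct sketch_qstats global;			/* used when !percpu */
	struct sketch_qstats cpu[SKETCH_NR_CPUS];	/* used when percpu */
};

/* Fetch qlen and backlog from whichever counters the queue maintains. */
static void sketch_qlen_backlog(const struct sketch_qdisc *q,
				unsigned int *qlen, unsigned int *backlog)
{
	unsigned int i;

	if (!q->percpu) {
		*qlen = q->global.qlen;
		*backlog = q->global.backlog;
		return;
	}

	*qlen = 0;
	*backlog = 0;
	for (i = 0; i < SKETCH_NR_CPUS; i++) {
		*qlen += q->cpu[i].qlen;
		*backlog += q->cpu[i].backlog;
	}
}
```

A caller that read `q->global` directly on a per-CPU queue would see zeros, which matches the symptom the commit message describes.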
2019-04-01vrf: check accept_source_route on the original netdeviceStephen Suryaputra
Configuration check to accept source route IP options should be made on the incoming netdevice when the skb->dev is an l3mdev master. The route lookup for the source route next hop also needs the incoming netdev. v2->v3: - Simplify by passing the original netdevice down the stack (per David Ahern). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv6: Move ipv6 stubs to a separate header fileDavid Ahern
The number of stubs is growing and has nothing to do with addrconf. Move the definition of the stubs to a separate header file and update users. In the move, drop the vxlan specific comment before ipv6_stub. Code move only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29net: Use common nexthop init and release helpersDavid Ahern
With fib_nh_common in place, move common initialization and release code into helpers used by both ipv4 and ipv6. For the moment, the init is just the lwt encap and the release is both the netdev reference and the lwt state reference. More will be added later. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29net: Add fib_nh_common and update fib_nh and fib6_nhDavid Ahern
Add fib_nh_common struct with common nexthop attributes. Convert fib_nh and fib6_nh to use it. Use macros to move existing fib_nh_* references to the new nh_common.nhc_*. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
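The macro technique mentioned here, keeping old field names working after the fields move into an embedded common struct, can be shown with a small sketch. The struct and macro names below are hypothetical; they only mirror the pattern of mapping an old `x_field` name onto `nh_common.nhc_field`.

```c
/* Attributes shared by every nexthop flavour (hypothetical subset). */
struct sketch_nh_common {
	int nhc_oif;
	unsigned int nhc_flags;
};

struct sketch_nh {
	struct sketch_nh_common nh_common;
/* Old names keep compiling: they expand to the new location. */
#define sketch_nh_oif	nh_common.nhc_oif
#define sketch_nh_flags	nh_common.nhc_flags
	int nh_saddr;	/* family-specific field stays where it was */
};

/* Existing code written against the old name needs no change. */
static int sketch_get_oif(const struct sketch_nh *nh)
{
	return nh->sketch_nh_oif;
}
```

Because the macro expands to a member path, both `nh->sketch_nh_oif` and `nh.sketch_nh_oif` keep working, which lets a large conversion land without touching every user at once.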
2019-03-29ipv6: Rename fib6_nh entriesDavid Ahern
Rename fib6_nh entries that will be moved to a fib_nh_common struct. Specifically, the device, gateway, flags, and lwtstate are common with all nexthop definitions. In some places new temporary variables are declared or local variables renamed to maintain line lengths. Rename only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv4: Rename fib_nh entriesDavid Ahern
Rename fib_nh entries that will be moved to a fib_nh_common struct. Specifically, the device, oif, gateway, flags, scope, lwtstate, nh_weight and nh_upper_bound are common with all nexthop definitions. In the process shorten fib_nh_lwtstate to fib_nh_lws to avoid really long lines. Rename only; no functional change intended. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv6: Refactor fib6_ignore_linkdownDavid Ahern
fib6_ignore_linkdown takes a fib6_info but only looks at the net_device and its IPv6 config. Change it to take a net_device over a fib6_info as its input argument. In addition, move it to a header file to make the check inline and usable later with IPv4 code without going through the ipv6 stub, and rename to ip6_ignore_linkdown since it is only checking the setting based on the ipv6 struct on a device. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv6: Move gateway checks to a fib6_nh settingDavid Ahern
The gateway setting is not per fib6_info entry but per-fib6_nh. Add a new fib_nh_has_gw flag to fib6_nh and convert references to RTF_GATEWAY to the new flag. For IPv6 address the flag is cheaper than checking that nh_gw is non-0 like IPv4 does. While this increases fib6_nh by 8-bytes, the effective allocation size of a fib6_info is unchanged. The 8 bytes is recovered later with a fib_nh_common change. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv6: Create cleanup helper for fib6_nhDavid Ahern
Move the fib6_nh cleanup code to a new helper, fib6_nh_release. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv6: Create init helper for fib6_nhDavid Ahern
Similar to IPv4, consolidate the fib6_nh initialization into a helper. As a new standalone function, add a cleanup path to put lwtstate on error. To avoid modifying fib6_config flags, move the reject check to a helper that is invoked once by fib6_nh_init to reset the device and then again in ip6_route_info_create to set the fib6_flags. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv4: Create cleanup helper for fib_nhDavid Ahern
Move the fib_nh cleanup code from free_fib_info_rcu into a new helper, fib_nh_release. Move classid accounting into fib_nh_release which is called per fib_nh to make accounting symmetrical with fib_nh_init. Export the helper to allow for use with nexthop objects in the future. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29ipv4: Create init helper for fib_nhDavid Ahern
Consolidate the fib_nh initialization which is duplicated between fib_create_info for single path and fib_get_nhs for multipath. Export the helper to allow for use with nexthop objects in the future. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29mac80211: rework locking for txq scheduling / airtime fairnessFelix Fietkau
Holding the lock around the entire duration of tx scheduling can create some nasty lock contention, especially when processing airtime information from the tx status or the rx path. Improve locking by only holding the active_txq_lock for lookups / scheduling list modifications. Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-28netns: provide pure entropy for net_hash_mix()Eric Dumazet
net_hash_mix() currently uses the kernel address of a struct net, and is used in many places where it could reveal this address to a patient attacker, thus defeating KASLR in the typical case (initial net namespace, &init_net is not dynamically allocated). I believe the original implementation tried to avoid spending too many cycles in this function, but security comes first. Also provide entropy regardless of CONFIG_NET_NS. Fixes: 0b4419162aa6 ("netns: introduce the net_hash_mix "salt" for hashes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28netfilter: Export nf_ct_{set,destroy}_timeout()Yi-Hung Wei
This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout(). The two functions are derived from xt_ct_destroy_timeout() and xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c without any functional change. It would be useful for other users (e.g. OVS) that utilize the finer-grain conntrack timeout feature. CC: Pablo Neira Ayuso <pablo@netfilter.org> CC: Pravin Shelar <pshelar@ovn.org> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28net: devlink: remove unused devlink_port_get_phys_port_name() functionJiri Pirko
Now it is unused, remove it. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28net: devlink: introduce devlink_compat_phys_port_name_get()Jiri Pirko
Introduce devlink_compat_phys_port_name_get() helper that gets the physical port name for specified netdevice according to devlink port attributes. Call this helper from dev_get_phys_port_name() in case ndo_get_phys_port_name is not defined. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28net: replace ndo_get_devlink with ndo_get_devlink_portJiri Pirko
Follow-up patch is going to need a devlink port instance according to a netdev. Devlink port instance should be always available when devlink is used. So change the recently introduced ndo_get_devlink to ndo_get_devlink_port. With that, adjust the wrapper for the only user to get devlink pointer. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2019-03-27inet: switch IP ID generator to siphashEric Dumazet
According to Amit Klein and Benny Pinkas, IP ID generation is too weak and might be used by attackers. Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix()) having 64bit key and Jenkins hash is risky. It is time to switch to siphash and its 128bit keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27tcp: fix zerocopy and notsent_lowat issuesEric Dumazet
My recent patch had at least three problems:
1) TX zerocopy wants notification when skb is acknowledged, thus we need to call skb_zcopy_clear() if the skb is cached into sk->sk_tx_skb_cache
2) Some applications might expect precise EPOLLOUT notifications, so we need to update sk->sk_wmem_queued and call sk_mem_uncharge() from sk_wmem_free_skb() in all cases. The SOCK_QUEUE_SHRUNK flag must also be set.
3) Reuse of saved skb should have used skb_cloned() instead of simply checking if the fast clone has been freed.
Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27xfrm: Honor original L3 slave device in xfrmi policy lookupMartin Willi
If an xfrmi is associated to a vrf layer 3 master device, xfrm_policy_check() fails after traffic decapsulation. The input interface is replaced by the layer 3 master device, and hence xfrmi_decode_session() can't match the xfrmi anymore to satisfy policy checking. Extend ingress xfrmi lookup to honor the original layer 3 slave device, allowing xfrm interfaces to operate within a vrf domain. Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-26xfrm: clean up xfrm protocol checksCong Wang
In commit 6a53b7593233 ("xfrm: check id proto in validate_tmpl()") I introduced a check for xfrm protocol, but according to Herbert IPSEC_PROTO_ANY should only be used as a wildcard for lookup, so it should be removed from validate_tmpl(). Also, IPSEC_PROTO_ANY is expected to match only 3 IPsec-specific protocols, which is why xfrm_state_flush() could still miss IPPROTO_ROUTING, leaving those entries in net->xfrm.state_all before the net namespace exits. Fix this by replacing IPSEC_PROTO_ANY with zero. This patch also extracts the check from validate_tmpl() to xfrm_id_proto_valid() and uses it in parse_ipsecrequest(). With this, no other protocols should be added into xfrm. Fixes: 6a53b7593233 ("xfrm: check id proto in validate_tmpl()") Reported-by: syzbot+0bf0519d6e0de15914fe@syzkaller.appspotmail.com Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
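A protocol validity check of the kind this commit extracts could look roughly like the sketch below. It is not the kernel's xfrm_id_proto_valid(): the function name, the exact set of accepted protocols, and the unconditional acceptance of the IPv6 extension-header protocols are assumptions for illustration. The numeric values are the IANA protocol numbers, and 255 is used here for the wildcard.

```c
#include <stdbool.h>
#include <stdint.h>

/* IANA protocol numbers, spelled out so the sketch needs no system headers. */
enum {
	SKETCH_IPPROTO_ROUTING = 43,
	SKETCH_IPPROTO_ESP = 50,
	SKETCH_IPPROTO_AH = 51,
	SKETCH_IPPROTO_DSTOPTS = 60,
	SKETCH_IPPROTO_COMP = 108,
	SKETCH_IPSEC_PROTO_ANY = 255,	/* lookup wildcard, never a valid id */
};

/* Accept only concrete protocols; the wildcard is rejected. */
static bool sketch_id_proto_valid(uint8_t proto)
{
	switch (proto) {
	case SKETCH_IPPROTO_AH:
	case SKETCH_IPPROTO_ESP:
	case SKETCH_IPPROTO_COMP:
	case SKETCH_IPPROTO_ROUTING:	/* assumed set; see lead-in */
	case SKETCH_IPPROTO_DSTOPTS:
		return true;
	default:
		return false;
	}
}
```

Keeping the list in one helper means every entry point that parses a protocol id applies the same rule, instead of each carrying its own copy of the switch.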
2019-03-24net: devlink: select NET_DEVLINK from driversJiri Pirko
Some drivers are becoming more dependent on NET_DEVLINK being selected in configuration. With upcoming compat functions, the behavior would be wrong in case devlink was not compiled in. So make the drivers select NET_DEVLINK and rely on the functions being there, not just stubs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24net: devlink: add port type spinlockJiri Pirko
Add spinlock to protect port type and type_dev pointer consistency. Without that, userspace may see inconsistent type and type_dev combinations. Signed-off-by: Jiri Pirko <jiri@mellanox.com> v1->v2: - rebased Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller
Saeed Mahameed says:
====================
mlx5-updates-2019-03-20
This series includes updates to mlx5 driver,
1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disables
3) Gustavo A. R. Silva, Removes a redundant assignment
4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
5) Eli Britstein, Adds the Support for VLAN modify action and Replaces TC VLAN pop and push actions with VLAN modify
Note: This series includes two simple non-mlx5 patches,
1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h, and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h, and use it in mlx5 and nfp drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23tcp: add one skb cache for rxEric Dumazet
Often times, recvmsg() system calls and BH handling for a particular TCP socket are done on different cpus. This means the incoming skb had to be allocated on a cpu, but freed on another.
This incurs a high spinlock contention in slab layer for small rpc, but also a high number of cache line ping pongs for larger packets.
A full size GRO packet might use 45 page fragments, meaning that up to 45 put_page() can be involved.
Moreover, performing the __kfree_skb() in the recvmsg() context adds latency for user applications, and increases the probability of trapping them in backlog processing, since the BH handler might find the socket owned by the user.
This patch, combined with the prior one, increases the rpc performance by about 10 % on servers with a large number of cores. (tcp_rr workload with 10,000 flows and 112 threads reaches 9 Mpps instead of 8 Mpps)
This also increases single bulk flow performance on 40Gbit+ links, since in this case there are often two cpus working in tandem:
- CPU handling the NIC rx interrupts, feeding the receive queue, and (after this patch) freeing the skbs that were consumed.
- CPU in recvmsg() system call, essentially 100 % busy copying out data to user space.
Having at most one skb in a per-socket cache has very little risk of memory exhaustion, and since it is protected by socket lock, its management is essentially free.
Note that if rps/rfs is used, we do not enable this feature, because there is a high chance that the same cpu is handling both the recvmsg() system call and the TCP rx path, but that another cpu did the skb allocations in the device driver right before the RPS/RFS logic. To properly handle this case, it seems we would need to record on which cpu the skb was allocated, and use a different channel to give skbs back to this cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23tcp: add one skb cache for txEric Dumazet
On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.
    20.69%  [kernel]  [k] queued_spin_lock_slowpath
     5.64%  [kernel]  [k] _raw_spin_lock
     3.83%  [kernel]  [k] syscall_return_via_sysret
     3.48%  [kernel]  [k] __entry_text_start
     1.76%  [kernel]  [k] __netif_receive_skb_core
     1.64%  [kernel]  [k] __fget
For each sendmsg(), we allocate one skb, and free it at the time the ACK packet comes.
In many cases, ACK packets are handled by another cpu, and this unfortunately incurs heavy costs for the slab layer.
This patch uses an extra pointer in the socket structure, so that we try to reuse the same skb and avoid these expensive costs.
We cache at most one skb per socket so this should be safe as far as memory pressure is concerned.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
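The one-entry cache idea can be sketched in user space with malloc() standing in for the slab allocator. Everything named `sketch_` is hypothetical, and the sketch leaves out locking, accounting, and the checks a real socket buffer would need; it only shows the reuse pattern of a single cached slot per socket.

```c
#include <stdlib.h>

struct sketch_buf {
	char data[256];
};

struct sketch_sock {
	struct sketch_buf *tx_cache;	/* at most one cached buffer */
};

/* Prefer the cached buffer; fall back to the allocator. */
static struct sketch_buf *sketch_buf_alloc(struct sketch_sock *sk)
{
	struct sketch_buf *buf = sk->tx_cache;

	if (buf) {
		sk->tx_cache = NULL;
		return buf;
	}
	return malloc(sizeof(*buf));
}

/* Park the buffer in the empty slot; otherwise really free it. */
static void sketch_buf_free(struct sketch_sock *sk, struct sketch_buf *buf)
{
	if (!sk->tx_cache) {
		sk->tx_cache = buf;
		return;
	}
	free(buf);
}
```

Because the slot holds at most one buffer, the worst-case extra memory is bounded by one buffer per socket, which matches the memory-pressure argument in the commit message.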
2019-03-23net: convert rps_needed and rfs_needed to new static branch apiEric Dumazet
We prefer static_branch_unlikely() over static_key_false() these days. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23net: sched: add empty status flag for NOLOCK qdiscPaolo Abeni
The queue is marked not empty after acquiring the seqlock, and it's up to the NOLOCK qdisc to clear such flag on dequeue. Since the empty status lies on the same cache-line as the seqlock, it's always hot in cache during the updates. This makes the empty flag update a little bit imprecise. Given the lack of synchronization between enqueue and dequeue, this is unavoidable. v2 -> v3: - qdisc_is_empty() has a const argument (Eric) v1 -> v2: - really use an 'empty' flag instead of 'not_empty', as suggested by Eric Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
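A single-threaded sketch of such a flag is shown below. It is an illustration under stated assumptions, not the qdisc code: the names are hypothetical, there is no seqlock or concurrency, and the flag is cleared only when a dequeue attempt finds nothing, which is one plausible reading of "clearing such flag on dequeue". The point is that the flag can briefly report "not empty" for a queue that has already drained.

```c
#include <stdbool.h>

struct sketch_queue {
	bool empty;		/* hint only; may lag behind len */
	unsigned int len;	/* packets actually queued */
};

/* Producer side: mark the queue as possibly non-empty, then add. */
static void sketch_enqueue(struct sketch_queue *q)
{
	q->empty = false;
	q->len++;
}

/* Consumer side: only a failed dequeue sets the flag back to empty. */
static bool sketch_dequeue(struct sketch_queue *q)
{
	if (q->len == 0) {
		q->empty = true;
		return false;
	}
	q->len--;
	return true;
}

static bool sketch_is_empty(const struct sketch_queue *q)
{
	return q->empty;
}
```

A stale "not empty" reading costs only one extra dequeue attempt, so the imprecision in that direction is harmless for a hint used to skip work.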