summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2017-04-03flowcache: more "unsigned int"Alexey Dobriyan
Make ->hash_count, ->low_watermark and ->high_watermark unsigned int and propagate unsignedness to other variables. This change doesn't change code generation because these fields aren't used in 64-bit contexts but make it anyway: these fields can't be negative numbers. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03flowcache: make flow_key_size() return "unsigned int"Alexey Dobriyan
Flow keys aren't 4GB+ numbers so 64-bit arithmetic is excessive. Space savings (I'm not sure what CSWTCH is): add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-48 (-48) function old new delta flow_cache_lookup 1163 1159 -4 CSWTCH 75997 75953 -44 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03sctp: check for dst and pathmtu update in sctp_packet_configXin Long
This patch is to move sctp_transport_dst_check into sctp_packet_config from sctp_packet_transmit and add pathmtu check in sctp_packet_config. With this fix, sctp can update dst or pathmtu before appending chunks, which can void dropping packets in sctp_packet_transmit when dst is obsolete or dst's mtu is changed. This patch is also to improve some other codes in sctp_packet_config. It updates packet max_size with gso_max_size, checks for dst and pathmtu, and appends ecne chunk only when packet is empty and asoc is not NULL. It makes sctp flush work better, as we only need to set up them once for one flush schedule. It's also safe, since asoc is NULL only when the packet is created by sctp_ootb_pkt_new in which it just gets the new dst, no need to do more things for it other than set packet with transport's pathmtu. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-03sctp: add SCTP_PR_STREAM_STATUS sockopt for prsctpXin Long
Before when implementing sctp prsctp, SCTP_PR_STREAM_STATUS wasn't added, as it needs to save abandoned_(un)sent for every stream. After sctp stream reconf is added in sctp, assoc has structure sctp_stream_out to save per stream info. This patch is to add SCTP_PR_STREAM_STATUS by putting the prsctp per stream statistics into sctp_stream_out. v1->v2: fix an indent issue. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-02sock: correctly test SOCK_TIMESTAMP in sock_recv_ts_and_drops()Eric Dumazet
It seems the code does not match the intent. This broke packetdrill, and probably other programs. Fixes: 6c7c98bad488 ("sock: avoid dirtying sk_stamp, if possible") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01net: mpls: Increase max number of labels for lwt encapDavid Ahern
Alow users to push down more labels per MPLS encap. Similar to LSR case, move label array to the end of mpls_iptunnel_encap and allocate based on the number of labels for the route. For consistency with the LSR case, re-use the same maximum number of labels. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01net: dsa: add cross-chip bridging operationsVivien Didelot
Introduce crosschip_bridge_{join,leave} operations in the dsa_switch_ops structure, which can be used by switches supporting interconnection. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-31cfg80211: add documentation for cfg80211_get_bss()Johannes Berg
This was missing, but is referenced a lot in the documentation. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-31cfg80211: Add support for FILS shared key authentication offloadVidyullatha Kanchanapally
Enhance nl80211 and cfg80211 connect request and response APIs to support FILS shared key authentication offload. The new nl80211 attributes can be used to provide additional information to the driver to establish a FILS connection. Also enhance the set/del PMKSA to allow support for adding and deleting PMKSA based on FILS cache identifier. Add a new feature flag that drivers can use to advertize support for FILS shared key authentication and association in station mode when using their own SME. Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-31cfg80211: Use a structure to pass connect response paramsVidyullatha Kanchanapally
Currently the connect event from driver takes all the connection response parameters as arguments. With support for new features these response parameters can grow. Use a structure to pass these parameters rather than passing them as function arguments. Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> [add to documentation] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-30sock: avoid dirtying sk_stamp, if possiblePaolo Abeni
sock_recv_ts_and_drops() unconditionally set sk->sk_stamp for every packet, even if the SOCK_TIMESTAMP flag is not set in the related socket. If selinux is enabled, this cause a cache miss for every packet since sk->sk_stamp and sk->sk_security share the same cacheline. With this change sk_stamp is set only if the SOCK_TIMESTAMP flag is set, and is cleared for the first packet, so that the user perceived behavior is unchanged. This gives up to 5% speed-up under udp-flood with small packets. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30sctp: alloc stream info when initializing asocXin Long
When sending a msg without asoc established, sctp will send INIT packet first and then enqueue chunks. Before receiving INIT_ACK, stream info is not yet alloced. But enqueuing chunks needs to access stream info, like out stream state and out stream cnt. This patch is to fix it by allocing out stream info when initializing an asoc, allocing in stream and re-allocing out stream when processing init. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29Merge tag 'mlx5e-pedit' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Or Gerlitz says: ==================== mlx5e-pedit 2017-03-28 This series adds support for offloading modifications of packet headers using ConnectX-5 HW header re-write as an action applied during packet steering. The offloaded SW mechanism is TC's pedit action. The offloading is supported for E-Switch steering of VF traffic in the SRIOV switchdev mode and for NIC (non eswitch) RX. One use-case for this offload on virtual networks, is when the hypervisor implements flow based router such as Open-Stack's DVR, where L2 headers of guest packets re-written with routers' MAC addresses and the IP TTL is decremented. Another use case (which can be applied in parallel with routing) is stateless NAT where guest L3/L4 headers are re-written. The series is built as follows: the 1st six patches are preperations which don't yet add new functionality, patches 7-8 add the FW APIs (data-structures and commands) for header re-write, and patch nine allows offloading driver to access pedit keys. The 10th patch is somehow the core of the series, where we translate from the pedit way to represent set of header modification elements to the FW API for that same matter. Once a set of HW modification is established, we register it with the FW and get a modify header ID. When this ID is used with an action during packet steering, the HW applies the header modification on the packet. Patches 11 and 12 implement the above logic as an offload for pedit action for the NIC and E-Switch use-cases. I'd like to thanks Elijah Shakkour <elijahs@mellanox.com> for implementing and helping me testing this functionality on HW simulator, before it could be done with FW. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net: dsa: dsa2: Add basic support of devlinkAndrew Lunn
Register the switch and its ports with devlink. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net: break include loop netdevice.h, dsa.h, devlink.hAndrew Lunn
There is an include loop between netdevice.h, dsa.h, devlink.h because of NETDEV_ALIGN, making it impossible to use devlink structures in dsa.h. Break this loop by taking dsa.h out of netdevice.h, add a forward declaration of dsa_switch_tree and netdev_set_default_ethtool_ops() function, which is what netdevice.h requires. No longer having dsa.h in netdevice.h means the includes in dsa.h no longer get included. This breaks a few other files which depend on these includes. Add these directly in the affected file. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net: ipv6: Refactor inet6_netconf_notify_devconf to take eventDavid Ahern
Refactor inet6_netconf_notify_devconf to take the event as an input arg. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28ipv6: add support for NETDEV_RESEND_IGMP eventVlad Yasevich
This patch adds support for NETDEV_RESEND_IGMP event similar to how it works for IPv4. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28sctp: change to save MSG_MORE flag into assocXin Long
David Laight noticed the support for MSG_MORE with datamsg->force_delay didn't really work as we expected, as the first msg with MSG_MORE set would always block the following chunks' dequeuing. This Patch is to rewrite it by saving the MSG_MORE flag into assoc as David Laight suggested. asoc->force_delay is used to save MSG_MORE flag before a msg is sent. All chunks in queue would not be sent out if asoc->force_delay is set by the msg with MSG_MORE flag, until a new msg without MSG_MORE flag clears asoc->force_delay. Note that this change would not affect the flush is generated by other triggers, like asoc->state != ESTABLISHED, queue size > pmtu etc. v1->v2: Not clear asoc->force_delay after sending the msg with MSG_MORE flag. Fixes: 4ea0c32f5f42 ("sctp: add support for MSG_MORE") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Laight <david.laight@aculab.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28devlink: Support for pipeline debug (dpipe)Arkadi Sharshevsky
The pipeline debug is used to export the pipeline abstractions for the main objects - tables, headers and entries. The only support for set is for changing the counter parameter on specific table. The basic structures: Header - can represent a real protocol header information or internal metadata. Generic protocol headers like IPv4 can be shared between drivers. Each driver can add local headers. Field - part of a header. Can represent protocol field or specific ASIC metadata field. Hardware special metadata fields can be mapped to different resources, for example switch ASIC ports can have internal number which from the systems point of view is mapped to netdeivce ifindex. Match - represent specific match rule. Can describe match on specific field or header. The header index should be specified as well in order to support several header instances of the same type (tunneling). Action - represents specific action rule. Actions can describe operations on specific field values for example like set, increment, etc. And header operation like add and delete. Value - represents value which can be associated with specific match or action. Table - represents a hardware block which can be described with match/ action behavior. The match/action can be done on the packets data or on the internal metadata that it gathered along the packets traversal throw the pipeline which is vendor specific and should be exported in order to provide understanding of ASICs behavior. Entry - represents single record in a specific table. The entry is identified by specific combination of values for match/action. Prior to accessing the tables/entries the drivers provide the header/ field data base which is used by driver to user-space. The data base is split between the shared headers and unique headers. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net/sched: Add accessor functions to pedit keys for offloading driversOr Gerlitz
HW drivers will use the header-type and command fields from the extended keys, and some fields (e.g mask, val, offset) from the legacy keys. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-03-27bonding: split bond_set_slave_link_state into two partsMahesh Bandewar
Split the function into two (a) propose (b) commit phase without changing the semantics for the original API. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27xfrm: branchless addr4_match() on 64-bitAlexey Dobriyan
Current addr4_match() code has special test for /0 prefixes because of standard required undefined behaviour. However, it is possible to omit it on 64-bit because shifting can be done within a 64-bit register and then truncated to the expected value (which is 0 mask). Implicit truncation by htonl() fits nicely into R32-within-R64 model on x86-64. Space savings: none (coincidence) Branch savings: 1 Before: movzx eax,BYTE PTR [rdi+0x2a] # ->prefixlen_d test al,al jne xfrm_selector_match + 0x23f ... movzx eax,BYTE PTR [rbx+0x2b] # ->prefixlen_s test al,al je xfrm_selector_match + 0x1c7 After (no branches): mov r8d,0x20 mov rdx,0xffffffffffffffff mov esi,DWORD PTR [rsi+0x2c] mov ecx,r8d sub cl,BYTE PTR [rdi+0x2a] xor esi,DWORD PTR [rbx] mov rdi,rdx xor eax,eax shl rdi,cl bswap edi Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-03-24net: Commonize busy polling code to focus on napi_id instead of socketSridhar Samudrala
Move the core functionality in sk_busy_loop() to napi_busy_loop() and make it independent of sk. This enables re-using this function in epoll busy loop implementation. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Track start of busy loop instead of when it should endAlexander Duyck
This patch flips the logic we were using to determine if the busy polling has timed out. The main motivation for this is that we will need to support two different possible timeout values in the future and by recording the start time rather than when we would want to end we can focus on making the end_time specific to the task be it epoll or socket based polling. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Change return type of sk_busy_loop from bool to voidAlexander Duyck
checking the return value of sk_busy_loop. As there are only a few consumers of that data, and the data being checked for can be replaced with a check for !skb_queue_empty() we might as well just pull the code out of sk_busy_loop and place it in the spots that actually need it. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Only define skb_mark_napi_id in one spot instead of twoAlexander Duyck
Instead of defining two versions of skb_mark_napi_id I think it is more readable to just match the format of the sk_mark_napi_id functions and just wrap the contents of the function instead of defining two versions of the function. This way we can save a few lines of code since we only need 2 of the ifdef/endif but needed 5 for the extra function declaration. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Busy polling should ignore sender CPUsAlexander Duyck
This patch is a cleanup/fix for NAPI IDs following the changes that made it so that sender_cpu and napi_id were doing a better job of sharing the same location in the sk_buff. One issue I found is that we weren't validating the napi_id as being valid before we started trying to setup the busy polling. This change corrects that by using the MIN_NAPI_ID value that is now used in both allocating the NAPI IDs, as well as validating them. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24tcp: sysctl: Fix a race to avoid unexpected 0 window from spaceGao Feng
Because sysctl_tcp_adv_win_scale could be changed any time, so there is one race in tcp_win_from_space. For example, 1.sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now) 2.space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is postive now) As a result, tcp_win_from_space returns 0. It is unexpected. Certainly if the compiler put the sysctl_tcp_adv_win_scale into one register firstly, then use the register directly, it would be ok. But we could not depend on the compiler behavior. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Add sysctl to toggle early demux for tcp and udpsubashab@codeaurora.org
Certain system process significant unconnected UDP workload. It would be preferrable to disable UDP early demux for those systems and enable it for TCP only. By disabling UDP demux, we see these slight gains on an ARM64 system- 782 -> 788Mbps unconnected single stream UDPv4 633 -> 654Mbps unconnected UDPv4 different sources The performance impact can change based on CPU architecure and cache sizes. There will not much difference seen if entire UDP hash table is in cache. Both sysctls are enabled by default to preserve existing behavior. v1->v2: Change function pointer instead of adding conditional as suggested by Stephen. v2->v3: Read once in callers to avoid issues due to compiler optimizations. Also update commit message with the tests. v3->v4: Store and use read once result instead of querying pointer again incorrectly. v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n} Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Tom Herbert <tom@herbertland.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24xfrm: use "unsigned int" in addr_match()Alexey Dobriyan
x86_64 is zero-extending arch so "unsigned int" is preferred over "int" for address calculations and extending to size_t. Space savings: add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-24 (-24) function old new delta xfrm_state_walk 708 696 -12 xfrm_selector_match 918 906 -12 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-03-24xfrm: remove unused struct xfrm_mgr::idAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-03-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/broadcom/genet/bcmmii.c drivers/net/hyperv/netvsc.c kernel/bpf/hashtab.c Almost entirely overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22sock: introduce SO_MEMINFO getsockoptJosh Hunt
Allows reading of SK_MEMINFO_VARS via socket option. This way an application can get all meminfo related information in single socket option call instead of multiple calls. Adds helper function, sk_get_meminfo(), and uses that for both getsockopt and sock_diag_put_meminfo(). Suggested by Eric Dumazet. Signed-off-by: Josh Hunt <johunt@akamai.com> Reviewed-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22sctp: declare struct sctp_stream before using itXin Long
sctp_stream_free uses struct sctp_stream as a param, but struct sctp_stream is defined after it's declaration. This patch is to declare struct sctp_stream before sctp_stream_free. Fixes: a83863174a61 ("sctp: prepare asoc stream for stream reconf") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22neighbour: fix nlmsg_pid in notificationsRoopa Prabhu
neigh notifications today carry pid 0 for nlmsg_pid in all cases. This patch fixes it to carry calling process pid when available. Applications (eg. quagga) rely on nlmsg_pid to ignore notifications generated by their own netlink operations. This patch follows the routing subsystem which already sets this correctly. Reported-by: Vivek Venkatraman <vivek@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21sctp: define dst_pending_confirm as a bit in sctp_transportXin Long
As tp->dst_pending_confirm's value can only be set 0 or 1, this patch is to change to define it as a bit instead of __u32. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21net: ipv4: add support for ECMP hash policy choiceNikolay Aleksandrov
This patch adds support for ECMP hash policy choice via a new sysctl called fib_multipath_hash_policy and also adds support for L4 hashes. The current values for fib_multipath_hash_policy are: 0 - layer 3 (default) 1 - layer 4 If there's an skb hash already set and it matches the chosen policy then it will be used instead of being calculated (currently only for L4). In L3 mode we always calculate the hash due to the ICMP error special case, the flow dissector's field consistentification should handle the address order thus we can remove the address reversals. If the skb is provided we always use it for the hash calculation, otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21vhost-vsock: add pkt cancel capabilityPeng Tao
To allow canceling all packets of a connection. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: Peng Tao <bergwolf@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree. A couple of new features for nf_tables, and unsorted cleanups and incremental updates for the Netfilter tree. More specifically, they are: 1) Allow to check for TCP option presence via nft_exthdr, patch from Phil Sutter. 2) Add symmetric hash support to nft_hash, from Laura Garcia Liebana. 3) Use pr_cont() in ebt_log, from Joe Perches. 4) Remove some dead code in arp_tables reported via static analysis tool, from Colin Ian King. 5) Consolidate nf_tables expression validation, from Liping Zhang. 6) Consolidate set lookup via nft_set_lookup(). 7) Remove unnecessary rcu read lock side in bridge netfilter, from Florian Westphal. 8) Remove unused variable in nf_reject_ipv4, from Tahee Yoo. 9) Pass nft_ctx struct to object initialization indirections, from Florian Westphal. 10) Add code to integrate conntrack helper into nf_tables, also from Florian. 11) Allow to check if interface index or name exists via NFTA_FIB_F_PRESENT, from Phil Sutter. 12) Simplify resolve_normal_ct(), from Florian. 13) Use per-limit spinlock in nft_limit and xt_limit, from Liping Zhang. 14) Use rwlock in nft_set_rbtree set, also from Liping Zhang. 15) One patch to remove a useless printk at netns init path in ipvs, and several patches to document IPVS knobs. 16) Use refcount_t for reference counter in the Netfilter/IPVS code, from Elena Reshetova. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-17netfilter: refcounter conversionsReshetova, Elena
refcount_t type and corresponding API (see include/linux/refcount.h) should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-16tcp: remove tcp_tw_recycleSoheil Hassas Yeganeh
The tcp_tw_recycle was already broken for connections behind NAT, since the per-destination timestamp is not monotonically increasing for multiple machines behind a single destination address. After the randomization of TCP timestamp offsets in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection), the tcp_tw_recycle is broken for all types of connections for the same reason: the timestamps received from a single machine is not monotonically increasing, anymore. Remove tcp_tw_recycle, since it is not functional. Also, remove the PAWSPassive SNMP counter since it is only used for tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req since the strict argument is only set when tcp_tw_recycle is enabled. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16tcp: remove per-destination timestamp cacheSoheil Hassas Yeganeh
Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection) randomizes TCP timestamps per connection. After this commit, there is no guarantee that the timestamps received from the same destination are monotonically increasing. As a result, the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts in struct tcp_metrics_block) is broken and cannot be relied upon. Remove the per-destination timestamp cache and all related code paths. Note that this cache was already broken for caching timestamps of multiple machines behind a NAT sharing the same address. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16ipv4: fib_rules: Add notifier info to FIB rules notificationsIdo Schimmel
Whenever a FIB rule is added or removed, a notification is sent in the FIB notification chain. However, listeners don't have a way to tell which rule was added or removed. This is problematic as we would like to give listeners the ability to decide which action to execute based on the notified rule. Specifically, offloading drivers should be able to determine if they support the reflection of the notified FIB rule and flush their LPM tables in case they don't. Do that by adding a notifier info to these notifications and embed the common FIB rule struct in it. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16ipv4: fib_rules: Check if rule is a default ruleIdo Schimmel
Currently, when non-default (custom) FIB rules are used, devices capable of layer 3 offloading flush their tables and let the kernel do the forwarding instead. When these devices' drivers are loaded they register to the FIB notification chain, which lets them know about the existence of any custom FIB rules. This is done by sending a RULE_ADD notification based on the value of 'net->ipv4.fib_has_custom_rules'. This approach is problematic when VRF offload is taken into account, as upon the creation of the first VRF netdev, a l3mdev rule is programmed to direct skbs to the VRF's table. Instead of merely reading the above value and sending a single RULE_ADD notification, we should iterate over all the FIB rules and send a detailed notification for each, thereby allowing offloading drivers to sanitize the rules they don't support and potentially flush their tables. While l3mdev rules are uniquely marked, the default rules are not. Therefore, when they are being notified they might invoke offloading drivers to unnecessarily flush their tables. Solve this by adding an helper to check if a FIB rule is a default rule. Namely, its selector should match all packets and its action should point to the local, main or default tables. As noted by David Ahern, uniquely marking the default rules is insufficient. When using VRFs, it's common to avoid false hits by moving the rule for the local table to just before the main table: Default configuration: $ ip rule show 0: from all lookup local 32766: from all lookup main 32767: from all lookup default Common configuration with VRFs: $ ip rule show 1000: from all lookup [l3mdev-table] 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15net: dsa: check out-of-range ageing time valueVivien Didelot
If a DSA switch driver cannot program an ageing time value due to it being out-of-range, switchdev will raise a stack trace before failing. To fix this, add ageing_time_min and ageing_time_max members to the dsa_switch in order for the switch drivers to optionally specify their supported ageing time limits. The DSA core will now check for provided ageing time limits and return -ERANGE from the switchdev prepare phase if the value is out-of-range. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, a rather large batch of fixes targeted to nf_tables, conntrack and bridge netfilter. More specifically, they are: 1) Don't track fragmented packets if the socket option IP_NODEFRAG is set. From Florian Westphal. 2) SCTP protocol tracker assumes that ICMP error messages contain the checksum field, what results in packet drops. From Ying Xue. 3) Fix inconsistent handling of AH traffic from nf_tables. 4) Fix new bitmap set representation with big endian. Fix mismatches in nf_tables due to incorrect big endian handling too. Both patches from Liping Zhang. 5) Bridge netfilter doesn't honor maximum fragment size field, cap to largest fragment seen. From Florian Westphal. 6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB bits are now used to store the ctinfo. From Steven Rostedt. 7) Fix element comments with the bitmap set type. Revert the flush field in the nft_set_iter structure, not required anymore after fixing up element comments. 8) Missing error on invalid conntrack direction from nft_ct, also from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/broadcom/genet/bcmgenet.c net/core/sock.c Conflicts were overlapping changes in bcmgenet and the lockdep handling of sockets. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver, from Steffen Klassert. 2) Fix crashes when user tries to get_next_key on an LPM bpf map, from Alexei Starovoitov. 3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from Michal Schmidt. 4) We can get a divide by zero when TCP socket are morphed into listening state, fix from Eric Dumazet. 5) Fix socket refcounting bugs in skb_complete_wifi_ack() and skb_complete_tx_timestamp(). From Eric Dumazet. 6) Use after free in dccp_feat_activate_values(), also from Eric Dumazet. 7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from Jarod Wilson. 8) Fix use after free in vrf_xmit(), from David Ahern. 9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from Alexey Kodanev. 10) Properly check napi_complete_done() return value in order to decide whether to re-enable IRQs or not in amd-xgbe driver, from Thomas Lendacky. 11) Fix double free of hwmon device in marvell phy driver, from Andrew Lunn. 12) Don't crash on malformed netlink attributes in act_connmark, from Etienne Noss. 13) Don't remove routes with a higher metric in ipv6 ECMP route replace, from Sabrina Dubroca. 14) Don't write into a cloned SKB in ipv6 fragmentation handling, from Florian Westphal. 15) Fix routing redirect races in dccp and tcp, basically the ICMP handler can't modify the socket's cached route in it's locked by the user at this moment. From Jon Maxwell. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits) qed: Enable iSCSI Out-of-Order qed: Correct out-of-bound access in OOO history qed: Fix interrupt flags on Rx LL2 qed: Free previous connections when releasing iSCSI qed: Fix mapping leak on LL2 rx flow qed: Prevent creation of too-big u32-chains qed: Align CIDs according to DORQ requirement mlxsw: reg: Fix SPVMLR max record count mlxsw: reg: Fix SPVM max record count net: Resend IGMP memberships upon peer notification. dccp: fix memory leak during tear-down of unsuccessful connection request tun: fix premature POLLOUT notification on tun devices dccp/tcp: fix routing redirect race ucc/hdlc: fix two little issue vxlan: fix ovs support net: use net->count to check whether a netns is alive or not bridge: drop netfilter fake rtable unconditionally ipv6: avoid write to a possibly cloned skb net: wimax/i2400m: fix NULL-deref at probe isdn/gigaset: fix NULL-deref at probe ...
2017-03-13mpls: allow TTL propagation from IP packets to be configuredRobert Shearman
Allow TTL propagation from IP packets to MPLS packets to be configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which allows the TTL to be set in the resulting MPLS packet, with the value of 0 having the semantics of enabling propagation of the TTL from the IP header (i.e. non-zero values disable propagation). Also allow the configuration to be overridden globally by reusing the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the per-LWT attribute is set then it overrides the global configuration. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl, "net.mpls.default_ttl". This is kept separate from the configuration of whether IP TTL propagation is enabled as it can be used in the future when non-IP payloads are supported (i.e. where there is no payload TTL that can be propagated). Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13mpls: allow TTL propagation to IP packets to be configuredRobert Shearman
Provide the ability to control on a per-route basis whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped as per the theoretical model in RFC 3443 through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to mean disable propagation and 1 to mean enable propagation. In order to provide the ability to change the behaviour for packets arriving with IPv4/IPv6 Explicit Null labels and to provide an easy way for a user to change the behaviour for all existing routes without having to reprogram them, a global knob is provided. This is done through the addition of a new per-namespace sysctl, "net.mpls.ip_ttl_propagate", which defaults to enabled. If the per-route attribute is set (either enabled or disabled) then it overrides the global configuration. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>