summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2017-03-31cfg80211: Add support for FILS shared key authentication offloadVidyullatha Kanchanapally
Enhance nl80211 and cfg80211 connect request and response APIs to support FILS shared key authentication offload. The new nl80211 attributes can be used to provide additional information to the driver to establish a FILS connection. Also enhance the set/del PMKSA to allow support for adding and deleting PMKSA based on FILS cache identifier. Add a new feature flag that drivers can use to advertize support for FILS shared key authentication and association in station mode when using their own SME. Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-31cfg80211: Use a structure to pass connect response paramsVidyullatha Kanchanapally
Currently the connect event from driver takes all the connection response parameters as arguments. With support for new features these response parameters can grow. Use a structure to pass these parameters rather than passing them as function arguments. Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> [add to documentation] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-03-30sock: avoid dirtying sk_stamp, if possiblePaolo Abeni
sock_recv_ts_and_drops() unconditionally set sk->sk_stamp for every packet, even if the SOCK_TIMESTAMP flag is not set in the related socket. If selinux is enabled, this cause a cache miss for every packet since sk->sk_stamp and sk->sk_security share the same cacheline. With this change sk_stamp is set only if the SOCK_TIMESTAMP flag is set, and is cleared for the first packet, so that the user perceived behavior is unchanged. This gives up to 5% speed-up under udp-flood with small packets. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30sctp: alloc stream info when initializing asocXin Long
When sending a msg without asoc established, sctp will send INIT packet first and then enqueue chunks. Before receiving INIT_ACK, stream info is not yet alloced. But enqueuing chunks needs to access stream info, like out stream state and out stream cnt. This patch is to fix it by allocing out stream info when initializing an asoc, allocing in stream and re-allocing out stream when processing init. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29Merge tag 'mlx5e-pedit' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Or Gerlitz says: ==================== mlx5e-pedit 2017-03-28 This series adds support for offloading modifications of packet headers using ConnectX-5 HW header re-write as an action applied during packet steering. The offloaded SW mechanism is TC's pedit action. The offloading is supported for E-Switch steering of VF traffic in the SRIOV switchdev mode and for NIC (non eswitch) RX. One use-case for this offload on virtual networks, is when the hypervisor implements flow based router such as Open-Stack's DVR, where L2 headers of guest packets re-written with routers' MAC addresses and the IP TTL is decremented. Another use case (which can be applied in parallel with routing) is stateless NAT where guest L3/L4 headers are re-written. The series is built as follows: the 1st six patches are preperations which don't yet add new functionality, patches 7-8 add the FW APIs (data-structures and commands) for header re-write, and patch nine allows offloading driver to access pedit keys. The 10th patch is somehow the core of the series, where we translate from the pedit way to represent set of header modification elements to the FW API for that same matter. Once a set of HW modification is established, we register it with the FW and get a modify header ID. When this ID is used with an action during packet steering, the HW applies the header modification on the packet. Patches 11 and 12 implement the above logic as an offload for pedit action for the NIC and E-Switch use-cases. I'd like to thanks Elijah Shakkour <elijahs@mellanox.com> for implementing and helping me testing this functionality on HW simulator, before it could be done with FW. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net: dsa: dsa2: Add basic support of devlinkAndrew Lunn
Register the switch and its ports with devlink. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net: break include loop netdevice.h, dsa.h, devlink.hAndrew Lunn
There is an include loop between netdevice.h, dsa.h, devlink.h because of NETDEV_ALIGN, making it impossible to use devlink structures in dsa.h. Break this loop by taking dsa.h out of netdevice.h, add a forward declaration of dsa_switch_tree and netdev_set_default_ethtool_ops() function, which is what netdevice.h requires. No longer having dsa.h in netdevice.h means the includes in dsa.h no longer get included. This breaks a few other files which depend on these includes. Add these directly in the affected file. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net: ipv6: Refactor inet6_netconf_notify_devconf to take eventDavid Ahern
Refactor inet6_netconf_notify_devconf to take the event as an input arg. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28ipv6: add support for NETDEV_RESEND_IGMP eventVlad Yasevich
This patch adds support for NETDEV_RESEND_IGMP event similar to how it works for IPv4. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28sctp: change to save MSG_MORE flag into assocXin Long
David Laight noticed the support for MSG_MORE with datamsg->force_delay didn't really work as we expected, as the first msg with MSG_MORE set would always block the following chunks' dequeuing. This Patch is to rewrite it by saving the MSG_MORE flag into assoc as David Laight suggested. asoc->force_delay is used to save MSG_MORE flag before a msg is sent. All chunks in queue would not be sent out if asoc->force_delay is set by the msg with MSG_MORE flag, until a new msg without MSG_MORE flag clears asoc->force_delay. Note that this change would not affect the flush is generated by other triggers, like asoc->state != ESTABLISHED, queue size > pmtu etc. v1->v2: Not clear asoc->force_delay after sending the msg with MSG_MORE flag. Fixes: 4ea0c32f5f42 ("sctp: add support for MSG_MORE") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: David Laight <david.laight@aculab.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28devlink: Support for pipeline debug (dpipe)Arkadi Sharshevsky
The pipeline debug is used to export the pipeline abstractions for the main objects - tables, headers and entries. The only support for set is for changing the counter parameter on specific table. The basic structures: Header - can represent a real protocol header information or internal metadata. Generic protocol headers like IPv4 can be shared between drivers. Each driver can add local headers. Field - part of a header. Can represent protocol field or specific ASIC metadata field. Hardware special metadata fields can be mapped to different resources, for example switch ASIC ports can have internal number which from the systems point of view is mapped to netdeivce ifindex. Match - represent specific match rule. Can describe match on specific field or header. The header index should be specified as well in order to support several header instances of the same type (tunneling). Action - represents specific action rule. Actions can describe operations on specific field values for example like set, increment, etc. And header operation like add and delete. Value - represents value which can be associated with specific match or action. Table - represents a hardware block which can be described with match/ action behavior. The match/action can be done on the packets data or on the internal metadata that it gathered along the packets traversal throw the pipeline which is vendor specific and should be exported in order to provide understanding of ASICs behavior. Entry - represents single record in a specific table. The entry is identified by specific combination of values for match/action. Prior to accessing the tables/entries the drivers provide the header/ field data base which is used by driver to user-space. The data base is split between the shared headers and unique headers. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28net/sched: Add accessor functions to pedit keys for offloading driversOr Gerlitz
HW drivers will use the header-type and command fields from the extended keys, and some fields (e.g mask, val, offset) from the legacy keys. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-03-27bonding: split bond_set_slave_link_state into two partsMahesh Bandewar
Split the function into two (a) propose (b) commit phase without changing the semantics for the original API. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27xfrm: branchless addr4_match() on 64-bitAlexey Dobriyan
Current addr4_match() code has special test for /0 prefixes because of standard required undefined behaviour. However, it is possible to omit it on 64-bit because shifting can be done within a 64-bit register and then truncated to the expected value (which is 0 mask). Implicit truncation by htonl() fits nicely into R32-within-R64 model on x86-64. Space savings: none (coincidence) Branch savings: 1 Before: movzx eax,BYTE PTR [rdi+0x2a] # ->prefixlen_d test al,al jne xfrm_selector_match + 0x23f ... movzx eax,BYTE PTR [rbx+0x2b] # ->prefixlen_s test al,al je xfrm_selector_match + 0x1c7 After (no branches): mov r8d,0x20 mov rdx,0xffffffffffffffff mov esi,DWORD PTR [rsi+0x2c] mov ecx,r8d sub cl,BYTE PTR [rdi+0x2a] xor esi,DWORD PTR [rbx] mov rdi,rdx xor eax,eax shl rdi,cl bswap edi Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-03-24net: Commonize busy polling code to focus on napi_id instead of socketSridhar Samudrala
Move the core functionality in sk_busy_loop() to napi_busy_loop() and make it independent of sk. This enables re-using this function in epoll busy loop implementation. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Track start of busy loop instead of when it should endAlexander Duyck
This patch flips the logic we were using to determine if the busy polling has timed out. The main motivation for this is that we will need to support two different possible timeout values in the future and by recording the start time rather than when we would want to end we can focus on making the end_time specific to the task be it epoll or socket based polling. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Change return type of sk_busy_loop from bool to voidAlexander Duyck
checking the return value of sk_busy_loop. As there are only a few consumers of that data, and the data being checked for can be replaced with a check for !skb_queue_empty() we might as well just pull the code out of sk_busy_loop and place it in the spots that actually need it. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Only define skb_mark_napi_id in one spot instead of twoAlexander Duyck
Instead of defining two versions of skb_mark_napi_id I think it is more readable to just match the format of the sk_mark_napi_id functions and just wrap the contents of the function instead of defining two versions of the function. This way we can save a few lines of code since we only need 2 of the ifdef/endif but needed 5 for the extra function declaration. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Busy polling should ignore sender CPUsAlexander Duyck
This patch is a cleanup/fix for NAPI IDs following the changes that made it so that sender_cpu and napi_id were doing a better job of sharing the same location in the sk_buff. One issue I found is that we weren't validating the napi_id as being valid before we started trying to setup the busy polling. This change corrects that by using the MIN_NAPI_ID value that is now used in both allocating the NAPI IDs, as well as validating them. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24tcp: sysctl: Fix a race to avoid unexpected 0 window from spaceGao Feng
Because sysctl_tcp_adv_win_scale could be changed any time, so there is one race in tcp_win_from_space. For example, 1.sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now) 2.space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is postive now) As a result, tcp_win_from_space returns 0. It is unexpected. Certainly if the compiler put the sysctl_tcp_adv_win_scale into one register firstly, then use the register directly, it would be ok. But we could not depend on the compiler behavior. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24net: Add sysctl to toggle early demux for tcp and udpsubashab@codeaurora.org
Certain system process significant unconnected UDP workload. It would be preferrable to disable UDP early demux for those systems and enable it for TCP only. By disabling UDP demux, we see these slight gains on an ARM64 system- 782 -> 788Mbps unconnected single stream UDPv4 633 -> 654Mbps unconnected UDPv4 different sources The performance impact can change based on CPU architecure and cache sizes. There will not much difference seen if entire UDP hash table is in cache. Both sysctls are enabled by default to preserve existing behavior. v1->v2: Change function pointer instead of adding conditional as suggested by Stephen. v2->v3: Read once in callers to avoid issues due to compiler optimizations. Also update commit message with the tests. v3->v4: Store and use read once result instead of querying pointer again incorrectly. v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n} Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Tom Herbert <tom@herbertland.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24xfrm: use "unsigned int" in addr_match()Alexey Dobriyan
x86_64 is zero-extending arch so "unsigned int" is preferred over "int" for address calculations and extending to size_t. Space savings: add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-24 (-24) function old new delta xfrm_state_walk 708 696 -12 xfrm_selector_match 918 906 -12 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-03-24xfrm: remove unused struct xfrm_mgr::idAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-03-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/broadcom/genet/bcmmii.c drivers/net/hyperv/netvsc.c kernel/bpf/hashtab.c Almost entirely overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22sock: introduce SO_MEMINFO getsockoptJosh Hunt
Allows reading of SK_MEMINFO_VARS via socket option. This way an application can get all meminfo related information in single socket option call instead of multiple calls. Adds helper function, sk_get_meminfo(), and uses that for both getsockopt and sock_diag_put_meminfo(). Suggested by Eric Dumazet. Signed-off-by: Josh Hunt <johunt@akamai.com> Reviewed-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22sctp: declare struct sctp_stream before using itXin Long
sctp_stream_free uses struct sctp_stream as a param, but struct sctp_stream is defined after it's declaration. This patch is to declare struct sctp_stream before sctp_stream_free. Fixes: a83863174a61 ("sctp: prepare asoc stream for stream reconf") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22neighbour: fix nlmsg_pid in notificationsRoopa Prabhu
neigh notifications today carry pid 0 for nlmsg_pid in all cases. This patch fixes it to carry calling process pid when available. Applications (eg. quagga) rely on nlmsg_pid to ignore notifications generated by their own netlink operations. This patch follows the routing subsystem which already sets this correctly. Reported-by: Vivek Venkatraman <vivek@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21sctp: define dst_pending_confirm as a bit in sctp_transportXin Long
As tp->dst_pending_confirm's value can only be set 0 or 1, this patch is to change to define it as a bit instead of __u32. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21net: ipv4: add support for ECMP hash policy choiceNikolay Aleksandrov
This patch adds support for ECMP hash policy choice via a new sysctl called fib_multipath_hash_policy and also adds support for L4 hashes. The current values for fib_multipath_hash_policy are: 0 - layer 3 (default) 1 - layer 4 If there's an skb hash already set and it matches the chosen policy then it will be used instead of being calculated (currently only for L4). In L3 mode we always calculate the hash due to the ICMP error special case, the flow dissector's field consistentification should handle the address order thus we can remove the address reversals. If the skb is provided we always use it for the hash calculation, otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21vhost-vsock: add pkt cancel capabilityPeng Tao
To allow canceling all packets of a connection. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: Peng Tao <bergwolf@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree. A couple of new features for nf_tables, and unsorted cleanups and incremental updates for the Netfilter tree. More specifically, they are: 1) Allow to check for TCP option presence via nft_exthdr, patch from Phil Sutter. 2) Add symmetric hash support to nft_hash, from Laura Garcia Liebana. 3) Use pr_cont() in ebt_log, from Joe Perches. 4) Remove some dead code in arp_tables reported via static analysis tool, from Colin Ian King. 5) Consolidate nf_tables expression validation, from Liping Zhang. 6) Consolidate set lookup via nft_set_lookup(). 7) Remove unnecessary rcu read lock side in bridge netfilter, from Florian Westphal. 8) Remove unused variable in nf_reject_ipv4, from Tahee Yoo. 9) Pass nft_ctx struct to object initialization indirections, from Florian Westphal. 10) Add code to integrate conntrack helper into nf_tables, also from Florian. 11) Allow to check if interface index or name exists via NFTA_FIB_F_PRESENT, from Phil Sutter. 12) Simplify resolve_normal_ct(), from Florian. 13) Use per-limit spinlock in nft_limit and xt_limit, from Liping Zhang. 14) Use rwlock in nft_set_rbtree set, also from Liping Zhang. 15) One patch to remove a useless printk at netns init path in ipvs, and several patches to document IPVS knobs. 16) Use refcount_t for reference counter in the Netfilter/IPVS code, from Elena Reshetova. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-17netfilter: refcounter conversionsReshetova, Elena
refcount_t type and corresponding API (see include/linux/refcount.h) should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-16tcp: remove tcp_tw_recycleSoheil Hassas Yeganeh
The tcp_tw_recycle was already broken for connections behind NAT, since the per-destination timestamp is not monotonically increasing for multiple machines behind a single destination address. After the randomization of TCP timestamp offsets in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection), the tcp_tw_recycle is broken for all types of connections for the same reason: the timestamps received from a single machine is not monotonically increasing, anymore. Remove tcp_tw_recycle, since it is not functional. Also, remove the PAWSPassive SNMP counter since it is only used for tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req since the strict argument is only set when tcp_tw_recycle is enabled. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16tcp: remove per-destination timestamp cacheSoheil Hassas Yeganeh
Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection) randomizes TCP timestamps per connection. After this commit, there is no guarantee that the timestamps received from the same destination are monotonically increasing. As a result, the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts in struct tcp_metrics_block) is broken and cannot be relied upon. Remove the per-destination timestamp cache and all related code paths. Note that this cache was already broken for caching timestamps of multiple machines behind a NAT sharing the same address. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16ipv4: fib_rules: Add notifier info to FIB rules notificationsIdo Schimmel
Whenever a FIB rule is added or removed, a notification is sent in the FIB notification chain. However, listeners don't have a way to tell which rule was added or removed. This is problematic as we would like to give listeners the ability to decide which action to execute based on the notified rule. Specifically, offloading drivers should be able to determine if they support the reflection of the notified FIB rule and flush their LPM tables in case they don't. Do that by adding a notifier info to these notifications and embed the common FIB rule struct in it. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16ipv4: fib_rules: Check if rule is a default ruleIdo Schimmel
Currently, when non-default (custom) FIB rules are used, devices capable of layer 3 offloading flush their tables and let the kernel do the forwarding instead. When these devices' drivers are loaded they register to the FIB notification chain, which lets them know about the existence of any custom FIB rules. This is done by sending a RULE_ADD notification based on the value of 'net->ipv4.fib_has_custom_rules'. This approach is problematic when VRF offload is taken into account, as upon the creation of the first VRF netdev, a l3mdev rule is programmed to direct skbs to the VRF's table. Instead of merely reading the above value and sending a single RULE_ADD notification, we should iterate over all the FIB rules and send a detailed notification for each, thereby allowing offloading drivers to sanitize the rules they don't support and potentially flush their tables. While l3mdev rules are uniquely marked, the default rules are not. Therefore, when they are being notified they might invoke offloading drivers to unnecessarily flush their tables. Solve this by adding an helper to check if a FIB rule is a default rule. Namely, its selector should match all packets and its action should point to the local, main or default tables. As noted by David Ahern, uniquely marking the default rules is insufficient. When using VRFs, it's common to avoid false hits by moving the rule for the local table to just before the main table: Default configuration: $ ip rule show 0: from all lookup local 32766: from all lookup main 32767: from all lookup default Common configuration with VRFs: $ ip rule show 1000: from all lookup [l3mdev-table] 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15net: dsa: check out-of-range ageing time valueVivien Didelot
If a DSA switch driver cannot program an ageing time value due to it being out-of-range, switchdev will raise a stack trace before failing. To fix this, add ageing_time_min and ageing_time_max members to the dsa_switch in order for the switch drivers to optionally specify their supported ageing time limits. The DSA core will now check for provided ageing time limits and return -ERANGE from the switchdev prepare phase if the value is out-of-range. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, a rather large batch of fixes targeted to nf_tables, conntrack and bridge netfilter. More specifically, they are: 1) Don't track fragmented packets if the socket option IP_NODEFRAG is set. From Florian Westphal. 2) SCTP protocol tracker assumes that ICMP error messages contain the checksum field, what results in packet drops. From Ying Xue. 3) Fix inconsistent handling of AH traffic from nf_tables. 4) Fix new bitmap set representation with big endian. Fix mismatches in nf_tables due to incorrect big endian handling too. Both patches from Liping Zhang. 5) Bridge netfilter doesn't honor maximum fragment size field, cap to largest fragment seen. From Florian Westphal. 6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB bits are now used to store the ctinfo. From Steven Rostedt. 7) Fix element comments with the bitmap set type. Revert the flush field in the nft_set_iter structure, not required anymore after fixing up element comments. 8) Missing error on invalid conntrack direction from nft_ct, also from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/broadcom/genet/bcmgenet.c net/core/sock.c Conflicts were overlapping changes in bcmgenet and the lockdep handling of sockets. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver, from Steffen Klassert. 2) Fix crashes when user tries to get_next_key on an LPM bpf map, from Alexei Starovoitov. 3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from Michal Schmidt. 4) We can get a divide by zero when TCP socket are morphed into listening state, fix from Eric Dumazet. 5) Fix socket refcounting bugs in skb_complete_wifi_ack() and skb_complete_tx_timestamp(). From Eric Dumazet. 6) Use after free in dccp_feat_activate_values(), also from Eric Dumazet. 7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from Jarod Wilson. 8) Fix use after free in vrf_xmit(), from David Ahern. 9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from Alexey Kodanev. 10) Properly check napi_complete_done() return value in order to decide whether to re-enable IRQs or not in amd-xgbe driver, from Thomas Lendacky. 11) Fix double free of hwmon device in marvell phy driver, from Andrew Lunn. 12) Don't crash on malformed netlink attributes in act_connmark, from Etienne Noss. 13) Don't remove routes with a higher metric in ipv6 ECMP route replace, from Sabrina Dubroca. 14) Don't write into a cloned SKB in ipv6 fragmentation handling, from Florian Westphal. 15) Fix routing redirect races in dccp and tcp, basically the ICMP handler can't modify the socket's cached route in it's locked by the user at this moment. From Jon Maxwell. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits) qed: Enable iSCSI Out-of-Order qed: Correct out-of-bound access in OOO history qed: Fix interrupt flags on Rx LL2 qed: Free previous connections when releasing iSCSI qed: Fix mapping leak on LL2 rx flow qed: Prevent creation of too-big u32-chains qed: Align CIDs according to DORQ requirement mlxsw: reg: Fix SPVMLR max record count mlxsw: reg: Fix SPVM max record count net: Resend IGMP memberships upon peer notification. dccp: fix memory leak during tear-down of unsuccessful connection request tun: fix premature POLLOUT notification on tun devices dccp/tcp: fix routing redirect race ucc/hdlc: fix two little issue vxlan: fix ovs support net: use net->count to check whether a netns is alive or not bridge: drop netfilter fake rtable unconditionally ipv6: avoid write to a possibly cloned skb net: wimax/i2400m: fix NULL-deref at probe isdn/gigaset: fix NULL-deref at probe ...
2017-03-13mpls: allow TTL propagation from IP packets to be configuredRobert Shearman
Allow TTL propagation from IP packets to MPLS packets to be configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which allows the TTL to be set in the resulting MPLS packet, with the value of 0 having the semantics of enabling propagation of the TTL from the IP header (i.e. non-zero values disable propagation). Also allow the configuration to be overridden globally by reusing the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the per-LWT attribute is set then it overrides the global configuration. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl, "net.mpls.default_ttl". This is kept separate from the configuration of whether IP TTL propagation is enabled as it can be used in the future when non-IP payloads are supported (i.e. where there is no payload TTL that can be propagated). Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13mpls: allow TTL propagation to IP packets to be configuredRobert Shearman
Provide the ability to control on a per-route basis whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped as per the theoretical model in RFC 3443 through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to mean disable propagation and 1 to mean enable propagation. In order to provide the ability to change the behaviour for packets arriving with IPv4/IPv6 Explicit Null labels and to provide an easy way for a user to change the behaviour for all existing routes without having to reprogram them, a global knob is provided. This is done through the addition of a new per-namespace sysctl, "net.mpls.ip_ttl_propagate", which defaults to enabled. If the per-route attribute is set (either enabled or disabled) then it overrides the global configuration. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13Revert "netfilter: nf_tables: add flush field to struct nft_set_iter"Pablo Neira Ayuso
This reverts commit 1f48ff6c5393aa7fe290faf5d633164f105b0aa7. This patch is not required anymore now that we keep a dummy list of set elements in the bitmap set implementation, so revert this before we forget this code has no clients. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nft_fib: Support existence checkPhil Sutter
Instead of the actual interface index or name, set destination register to just 1 or 0 depending on whether the lookup succeeded or not if NFTA_FIB_F_PRESENT was set in userspace. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: provide nft_ctx in object init functionFlorian Westphal
this is needed by the upcoming ct helper object type -- we'd like to be able use the table family (ip, ip6, inet) to figure out which helper has to be requested. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: Force fake conntrack entry to be at least 8 bytes alignedSteven Rostedt (VMware)
Since the nfct and nfctinfo have been combined, the nf_conn structure must be at least 8 bytes aligned, as the 3 LSB bits are used for the nfctinfo. But there's a fake nf_conn structure to denote untracked connections, which is created by a PER_CPU construct. This does not guarantee that it will be 8 bytes aligned and can break the logic in determining the correct nfctinfo. I triggered this on a 32bit machine with the following error: BUG: unable to handle kernel NULL pointer dereference at 00000af4 IP: nf_ct_deliver_cached_events+0x1b/0xfb *pdpt = 0000000031962001 *pde = 0000000000000000 Oops: 0000 [#1] SMP [Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport OK ] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 task: c126ec00 task.stack: c1258000 EIP: nf_ct_deliver_cached_events+0x1b/0xfb EFLAGS: 00010202 CPU: 0 EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0 Call Trace: <SOFTIRQ> ? ipv6_skip_exthdr+0xac/0xcb ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6] nf_hook_slow+0x22/0xc7 nf_hook+0x9a/0xad [ipv6] ? ip6t_do_table+0x356/0x379 [ip6_tables] ? ip6_fragment+0x9e9/0x9e9 [ipv6] ip6_output+0xee/0x107 [ipv6] ? ip6_fragment+0x9e9/0x9e9 [ipv6] dst_output+0x36/0x4d [ipv6] NF_HOOK.constprop.37+0xb2/0xba [ipv6] ? icmp6_dst_alloc+0x2c/0xfd [ipv6] ? local_bh_enable+0x14/0x14 [ipv6] mld_sendpack+0x1c5/0x281 [ipv6] ? mark_held_locks+0x40/0x5c mld_ifc_timer_expire+0x1f6/0x21e [ipv6] call_timer_fn+0x135/0x283 ? detach_if_pending+0x55/0x55 ? mld_dad_timer_expire+0x3e/0x3e [ipv6] __run_timers+0x111/0x14b ? mld_dad_timer_expire+0x3e/0x3e [ipv6] run_timer_softirq+0x1c/0x36 __do_softirq+0x185/0x37c ? test_ti_thread_flag.constprop.19+0xd/0xd do_softirq_own_stack+0x22/0x28 </SOFTIRQ> irq_exit+0x5a/0xa4 smp_apic_timer_interrupt+0x2a/0x34 apic_timer_interrupt+0x37/0x3c By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte alignment as all cache line sizes are at least 8 bytes or more. Fixes: a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-13netfilter: nf_tables: fix mismatch in big-endian systemLiping Zhang
Currently, there are two different methods to store an u16 integer to the u32 data register. For example: u32 *dest = &regs->data[priv->dreg]; 1. *dest = 0; *(u16 *) dest = val_u16; 2. *dest = val_u16; For method 1, the u16 value will be stored like this, either in big-endian or little-endian system: 0 15 31 +-+-+-+-+-+-+-+-+-+-+-+-+ | Value | 0 | +-+-+-+-+-+-+-+-+-+-+-+-+ For method 2, in little-endian system, the u16 value will be the same as listed above. But in big-endian system, the u16 value will be stored like this: 0 15 31 +-+-+-+-+-+-+-+-+-+-+-+-+ | 0 | Value | +-+-+-+-+-+-+-+-+-+-+-+-+ So later we use "memcmp(&regs->data[priv->sreg], data, 2);" to do compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong result in big-endian system, as 0~15 bits will always be zero. For the similar reason, when loading an u16 value from the u32 data register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;", the 2nd method will get the wrong value in the big-endian system. So introduce some wrapper functions to store/load an u8 or u16 integer to/from the u32 data register, and use them in the right place. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-12net: dsa: add dsa_is_normal_port helperVivien Didelot
Introduce a dsa_is_normal_port helper to check if a given port is a normal user port as opposed to a CPU port or DSA link. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12sctp: implement receiver-side procedures for the Reconf Response ParameterXin Long
This patch is to implement Receiver-Side Procedures for the Re-configuration Response Parameter in rfc6525 section 5.2.7. sctp_process_strreset_resp would process the response for any kind of reconf request, and the stream reconf is applied only when the response result is success. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12sctp: implement receiver-side procedures for the Add Incoming Streams ↵Xin Long
Request Parameter This patch is to implement Receiver-Side Procedures for the Add Incoming Streams Request Parameter described in rfc6525 section 5.2.6. It is also to fix that it shouldn't have add streams when sending addstrm in request, as the process in peer will handle it by sending a addstrm out request back. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>