summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2018-01-12Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2018-01-11 1) Don't allow to change the encap type on state updates. The encap type is set on state initialization and should not change anymore. From Herbert Xu. 2) Skip dead policies when rehashing to fix a slab-out-of-bounds bug in xfrm_hash_rebuild. From Florian Westphal. 3) Two buffer overread fixes in pfkey. From Eric Biggers. 4) Fix rcu usage in xfrm_get_type_offload, request_module can sleep, so can't be used under rcu_read_lock. From Sabrina Dubroca. 5) Fix an uninitialized lock in xfrm_trans_queue. Use __skb_queue_tail instead of skb_queue_tail in xfrm_trans_queue as we don't need the lock. From Herbert Xu. 6) Currently it is possible to create an xfrm state with an unknown encap type in ESP IPv4. Fix this by returning an error on unknown encap types. Also from Herbert Xu. 7) Fix sleeping inside a spinlock in xfrm_policy_cache_flush. From Florian Westphal. 8) Fix ESP GRO when the headers not fully in the linear part of the skb. We need to pull before we can access them. 9) Fix a skb leak on error in key_notify_policy. 10) Fix a race in the xdst pcpu cache, we need to run the resolver routines with bottom halfes off like the old flowcache did. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
BPF alignment tests got a conflict because the registers are output as Rn_w instead of just Rn in net-next, and in net a fixup for a testcase prohibits logical operations on pointers before using them. Also, we should attempt to patch BPF call args if JIT always on is enabled. Instead, if we fail to JIT the subprogs we should pass an error back up and fail immediately. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-11netfilter: nf_defrag: Skip defrag if NOTRACK is setSubash Abhinov Kasiviswanathan
conntrack defrag is needed only if some module like CONNTRACK or NAT explicitly requests it. For plain forwarding scenarios, defrag is not needed and can be skipped if NOTRACK is set in a rule. Since conntrack defrag is currently higher priority than raw table, setting NOTRACK is not sufficient. We need to move raw to a higher priority for iptables only. This is achieved by introducing a module parameter "raw_before_defrag" which allows to change the priority of raw table to place it before defrag. By default, the parameter is disabled and the priority of raw table is NF_IP_PRI_RAW to support legacy behavior. If the module parameter is enabled, then the priority of the raw table is set to NF_IP_PRI_RAW_BEFORE_DEFRAG. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-11netfilter: clusterip: make sure arp hooks are availableFlorian Westphal
The clusterip target needs to register an arp mangling hook, so make sure NF_ARP hooks are available. Fixes: 2a95183a5e ("netfilter: don't allocate space for arp/bridge hooks unless needed") Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-10tcp: make local function tcp_recv_timestamp staticWei Yongjun
Fixes the following sparse warning: net/ipv4/tcp.c:1736:6: warning: symbol 'tcp_recv_timestamp' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10netfilter: improve flow table Kconfig dependenciesArnd Bergmann
The newly added NF_FLOW_TABLE options cause some build failures in randconfig kernels: - when CONFIG_NF_CONNTRACK is disabled, or is a loadable module but NF_FLOW_TABLE is built-in: In file included from net/netfilter/nf_flow_table.c:8:0: include/net/netfilter/nf_conntrack.h:59:22: error: field 'ct_general' has incomplete type struct nf_conntrack ct_general; include/net/netfilter/nf_conntrack.h: In function 'nf_ct_get': include/net/netfilter/nf_conntrack.h:148:15: error: 'const struct sk_buff' has no member named '_nfct' include/net/netfilter/nf_conntrack.h: In function 'nf_ct_put': include/net/netfilter/nf_conntrack.h:157:2: error: implicit declaration of function 'nf_conntrack_put'; did you mean 'nf_ct_put'? [-Werror=implicit-function-declaration] net/netfilter/nf_flow_table.o: In function `nf_flow_offload_work_gc': (.text+0x1540): undefined reference to `nf_ct_delete' - when CONFIG_NF_TABLES is disabled: In file included from net/ipv6/netfilter/nf_flow_table_ipv6.c:13:0: include/net/netfilter/nf_tables.h: In function 'nft_gencursor_next': include/net/netfilter/nf_tables.h:1189:14: error: 'const struct net' has no member named 'nft'; did you mean 'nf'? - when CONFIG_NF_FLOW_TABLE_INET is enabled, but NF_FLOW_TABLE_IPV4 or NF_FLOW_TABLE_IPV6 are not, or are loadable modules net/netfilter/nf_flow_table_inet.o: In function `nf_flow_offload_inet_hook': nf_flow_table_inet.c:(.text+0x94): undefined reference to `nf_flow_offload_ipv6_hook' nf_flow_table_inet.c:(.text+0x40): undefined reference to `nf_flow_offload_ip_hook' - when CONFIG_NF_FLOW_TABLES is disabled, but the other options are enabled: net/netfilter/nf_flow_table_inet.o: In function `nf_flow_offload_inet_hook': nf_flow_table_inet.c:(.text+0x6c): undefined reference to `nf_flow_offload_ipv6_hook' net/netfilter/nf_flow_table_inet.o: In function `nf_flow_inet_module_exit': nf_flow_table_inet.c:(.exit.text+0x8): undefined reference to `nft_unregister_flowtable_type' net/netfilter/nf_flow_table_inet.o: In function `nf_flow_inet_module_init': nf_flow_table_inet.c:(.init.text+0x8): undefined reference to `nft_register_flowtable_type' net/ipv4/netfilter/nf_flow_table_ipv4.o: In function `nf_flow_ipv4_module_exit': nf_flow_table_ipv4.c:(.exit.text+0x8): undefined reference to `nft_unregister_flowtable_type' net/ipv4/netfilter/nf_flow_table_ipv4.o: In function `nf_flow_ipv4_module_init': nf_flow_table_ipv4.c:(.init.text+0x8): undefined reference to `nft_register_flowtable_type' This adds additional Kconfig dependencies to ensure that NF_CONNTRACK and NF_TABLES are always visible from NF_FLOW_TABLE, and that the internal dependencies between the four new modules are met. Fixes: 7c23b629a808 ("netfilter: flow table support for the mixed IPv4/IPv6 family") Fixes: 0995210753a2 ("netfilter: flow table support for IPv6") Fixes: 97add9f0d66d ("netfilter: flow table support for IPv4") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-10netfilter: nf_tables: get rid of struct nft_af_info abstractionPablo Neira Ayuso
Remove the infrastructure to register/unregister nft_af_info structure, this structure stores no useful information anymore. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-10netfilter: nf_tables: get rid of pernet familiesPablo Neira Ayuso
Now that we have a single table list for each netns, we can get rid of one pointer per family and the global afinfo list, thus, shrinking struct netns for nftables that now becomes 64 bytes smaller. And call __nft_release_afinfo() from __net_exit path accordingly to release netnamespace objects on removal. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-10netfilter: nf_tables: remove nhooks field from struct nft_af_infoPablo Neira Ayuso
We already validate the hook through bitmask, so this check is superfluous. When removing this, this patch is also fixing a bug in the new flowtable codebase, since ctx->afi points to the table family instead of the netdev family which is where the flowtable is really hooked in. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-09net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()Nicolai Stange
Commit 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg") fixed the issue of possibly inconsistent ->hdrincl handling due to concurrent updates by reading this bit-field member into a local variable and using the thus stabilized value in subsequent tests. However, aforementioned commit also adds the (correct) comment that /* hdrincl should be READ_ONCE(inet->hdrincl) * but READ_ONCE() doesn't work with bit fields */ because as it stands, the compiler is free to shortcut or even eliminate the local variable at its will. Note that I have not seen anything like this happening in reality and thus, the concern is a theoretical one. However, in order to be on the safe side, emulate a READ_ONCE() on the bit-field by doing it on the local 'hdrincl' variable itself: int hdrincl = inet->hdrincl; hdrincl = READ_ONCE(hdrincl); This breaks the chain in the sense that the compiler is not allowed to replace subsequent reads from hdrincl with reloads from inet->hdrincl. Fixes: 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg") Signed-off-by: Nicolai Stange <nstange@suse.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09esp: Fix GRO when the headers not fully in the linear part of the skb.Steffen Klassert
The GRO layer does not necessarily pull the complete headers into the linear part of the skb, a part may remain on the first page fragment. This can lead to a crash if we try to pull the headers, so make sure we have them on the linear part before pulling. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Reported-by: syzbot+82bbd65569c49c6c0c4d@syzkaller.appspotmail.com Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-01-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree: 1) Free hooks via call_rcu to speed up netns release path, from Florian Westphal. 2) Reduce memory footprint of hook arrays, skip allocation if family is not present - useful in case decnet support is not compiled built-in. Patches from Florian Westphal. 3) Remove defensive check for malformed IPv4 - including ihl field - and IPv6 headers in x_tables and nf_tables. 4) Add generic flow table offload infrastructure for nf_tables, this includes the netlink control plane and support for IPv4, IPv6 and mixed IPv4/IPv6 dataplanes. This comes with NAT support too. This patchset adds the IPS_OFFLOAD conntrack status bit to indicate that this flow has been offloaded. 5) Add secpath matching support for nf_tables, from Florian. 6) Save some code bytes in the fast path for the nf_tables netdev, bridge and inet families. 7) Allow one single NAT hook per point and do not allow to register NAT hooks in nf_tables before the conntrack hook, patches from Florian. 8) Seven patches to remove the struct nf_af_info abstraction, instead we perform direct calls for IPv4 which is faster. IPv6 indirections are still needed to avoid dependencies with the 'ipv6' module, but these now reside in struct nf_ipv6_ops. 9) Seven patches to handle NFPROTO_INET from the Netfilter core, hence we can remove specific code in nf_tables to handle this pseudofamily. 10) No need for synchronize_net() call for nf_queue after conversion to hook arrays. Also from Florian. 11) Call cond_resched_rcu() when dumping large sets in ipset to avoid softlockup. Again from Florian. 12) Pass lockdep_nfnl_is_held() to rcu_dereference_protected(), patch from Florian Westphal. 13) Fix matching of counters in ipset, from Jozsef Kadlecsik. 14) Missing nfnl lock protection in the ip_set_net_exit path, also from Jozsef. 15) Move connlimit code that we can reuse from nf_tables into nf_conncount, from Florian Westhal. And asorted cleanups: 16) Get rid of nft_dereference(), it only has one single caller. 17) Add nft_set_is_anonymous() helper function. 18) Remove NF_ARP_FORWARD leftover chain definition in nf_tables_arp. 19) Remove unnecessary comments in nf_conntrack_h323_asn1.c From Varsha Rao. 20) Remove useless parameters in frag_safe_skb_hp(), from Gao Feng. 21) Constify layer 4 conntrack protocol definitions, function parameters to register/unregister these protocol trackers, and timeouts. Patches from Florian Westphal. 22) Remove nlattr_size indirection, from Florian Westphal. 23) Add fall-through comments as -Wimplicit-fallthrough needs this, from Gustavo A. R. Silva. 24) Use swap() macro to exchange values in ipset, patch from Gustavo A. R. Silva. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-08tcp: Split BUG_ON() in tcp_tso_should_defer() into two assertionsStefano Brivio
The two conditions triggering BUG_ON() are somewhat unrelated: the tcp_skb_pcount() check is meant to catch TSO flaws, the second one checks sanity of congestion window bookkeeping. Split them into two separate BUG_ON() assertions on two lines, so that we know which one actually triggers, when they do. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-08netfilter: flow table support for the mixed IPv4/IPv6 familyPablo Neira Ayuso
This patch adds the IPv6 flow table type, that implements the datapath flow table to forward IPv6 traffic. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: flow table support for IPv4Pablo Neira Ayuso
This patch adds the IPv4 flow table type, that implements the datapath flow table to forward IPv4 traffic. Rationale is: 1) Look up for the packet in the flow table, from the ingress hook. 2) If there's a hit, decrement ttl and pass it on to the neighbour layer for transmission. 3) If there's a miss, packet is passed up to the classic forwarding path. This patch also supports layer 3 source and destination NAT. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: remove defensive check on malformed packets from raw socketsPablo Neira Ayuso
Users cannot forge malformed IPv4/IPv6 headers via raw sockets that they can inject into the stack. Specifically, not for IPv4 since 55888dfb6ba7 ("AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl (v2)"). IPv6 raw sockets also ensure that packets have a well-formed IPv6 header available in the skbuff. At quick glance, br_netfilter also validates layer 3 headers and it drops malformed both IPv4 and IPv6 packets. Therefore, let's remove this defensive check all over the place. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: remove struct nf_afinfo and its helper functionsPablo Neira Ayuso
This abstraction has no clients anymore, remove it. This is what remains from previous authors, so correct copyright statement after recent modifications and code removal. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: remove route_key_size field in struct nf_afinfoPablo Neira Ayuso
This is only needed by nf_queue, place this code where it belongs. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: move reroute indirection to struct nf_ipv6_opsPablo Neira Ayuso
We cannot make a direct call to nf_ip6_reroute() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define reroute indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: move route indirection to struct nf_ipv6_opsPablo Neira Ayuso
We cannot make a direct call to nf_ip6_route() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define route indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: remove saveroute indirection in struct nf_afinfoPablo Neira Ayuso
This is only used by nf_queue.c and this function comes with no symbol dependencies with IPv6, it just refers to structure layouts. Therefore, we can replace it by a direct function call from where it belongs. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: move checksum_partial indirection to struct nf_ipv6_opsPablo Neira Ayuso
We cannot make a direct call to nf_ip6_checksum_partial() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define checksum_partial indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: move checksum indirection to struct nf_ipv6_opsPablo Neira Ayuso
We cannot make a direct call to nf_ip6_checksum() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define checksum indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables: remove hooks from family definitionPablo Neira Ayuso
They don't belong to the family definition, move them to the filter chain type definition instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables: remove multihook chains and familiesPablo Neira Ayuso
Since NFPROTO_INET is handled from the core, we don't need to maintain extra infrastructure in nf_tables to handle the double hook registration, one for IPv4 and another for IPv6. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables_inet: don't use multihook infrastructure anymorePablo Neira Ayuso
Use new native NFPROTO_INET support in netfilter core, this gets rid of ad-hoc code in the nf_tables API codebase. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables: explicit nft_set_pktinfo() call from hook pathPablo Neira Ayuso
Instead of calling this function from the family specific variant, this reduces the code size in the fast path for the netdev, bridge and inet families. After this change, we must call nft_set_pktinfo() upfront from the chain hook indirection. Before: text data bss dec hex filename 2145 208 0 2353 931 net/netfilter/nf_tables_netdev.o After: text data bss dec hex filename 2125 208 0 2333 91d net/netfilter/nf_tables_netdev.o Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables_arp: don't set forward chainPablo Neira Ayuso
46928a0b49f3 ("netfilter: nf_tables: remove multihook chains and families") already removed this, this is a leftover. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: core: only allow one nat hook per hook pointFlorian Westphal
The netfilter NAT core cannot deal with more than one NAT hook per hook location (prerouting, input ...), because the NAT hooks install a NAT null binding in case the iptables nat table (iptable_nat hooks) or the corresponding nftables chain (nft nat hooks) doesn't specify a nat transformation. Null bindings are needed to detect port collsisions between NAT-ed and non-NAT-ed connections. This causes nftables NAT rules to not work when iptable_nat module is loaded, and vice versa because nat binding has already been attached when the second nat hook is consulted. The netfilter core is not really the correct location to handle this (hooks are just hooks, the core has no notion of what kinds of side effects a hook implements), but its the only place where we can check for conflicts between both iptables hooks and nftables hooks without adding dependencies. So add nat annotation to hook_ops to describe those hooks that will add NAT bindings and then make core reject if such a hook already exists. The annotation fills a padding hole, in case further restrictions appar we might change this to a 'u8 type' instead of bool. iptables error if nft nat hook active: iptables -t nat -A POSTROUTING -j MASQUERADE iptables v1.4.21: can't initialize iptables table `nat': File exists Perhaps iptables or your kernel needs to be upgraded. nftables error if iptables nat table present: nft -f /etc/nftables/ipv4-nat /usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists table nat { ^^ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: xtables: add and use xt_request_find_table_lockFlorian Westphal
currently we always return -ENOENT to userspace if we can't find a particular table, or if the table initialization fails. Followup patch will make nat table init fail in case nftables already registered a nat hook so this change makes xt_find_table_lock return an ERR_PTR to return the errno value reported from the table init function. Add xt_request_find_table_lock as try_then_request_module replacement and use it where needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: don't allocate space for arp/bridge hooks unless neededFlorian Westphal
no need to define hook points if the family isn't supported. Because we need these hooks for either nftables, arp/ebtables or the 'call-iptables' hack we have in the bridge layer add two new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the users select them. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: conntrack: timeouts can be constFlorian Westphal
Nowadays this is just the default template that is used when setting up the net namespace, so nothing writes to these locations. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: conntrack: l4 protocol trackers can be constFlorian Westphal
previous patches removed all writes to these structs so we can now mark them as const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: conntrack: constify list of builtin trackersFlorian Westphal
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08xfrm: Return error on unknown encap_type in init_stateHerbert Xu
Currently esp will happily create an xfrm state with an unknown encap type for IPv4, without setting the necessary state parameters. This patch fixes it by returning -EINVAL. There is a similar problem in IPv6 where if the mode is unknown we will skip initialisation while returning zero. However, this is harmless as the mode has already been checked further up the stack. This patch removes this anomaly by aligning the IPv6 behaviour with IPv4 and treating unknown modes (which cannot actually happen) as transport mode. Fixes: 38320c70d282 ("[IPSEC]: Use crypto_aead and authenc in ESP") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-01-05net: revert "Update RFS target at poll for tcp/udp"Soheil Hassas Yeganeh
On multi-threaded processes, one common architecture is to have one (or a small number of) threads polling sockets, and a considerably larger pool of threads reading form and writing to the sockets. When we set RPS core on tcp_poll() or udp_poll() we essentially steer all packets of all the polled FDs to one (or small number of) cores, creaing a bottleneck and/or RPS misprediction. Another common architecture is to shard FDs among threads pinned to cores. In such a setting, setting RPS core in tcp_poll() and udp_poll() is redundant because the RFS core is correctly set in recvmsg and sendmsg. Thus, revert the following commit: c3f1dbaf6e28 ("net: Update RFS target at poll for tcp/udp"). Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-05ip: do not set RFS core on error queue readsSoheil Hassas Yeganeh
We should only record RPS on normal reads and writes. In single threaded processes, all calls record the same state. In multi-threaded processes where a separate thread processes errors, the RFS table mispredicts. Note that, when CONFIG_RPS is disabled, sock_rps_record_flow is a noop and no branch is added as a result of this patch. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-03Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Updates to use cond_resched() instead of cond_resched_rcu_qs() where feasible (currently everywhere except in kernel/rcu and in kernel/torture.c). Also a couple of fixes to avoid sending IPIs to offline CPUs. - Updates to simplify RCU's dyntick-idle handling. - Updates to remove almost all uses of smp_read_barrier_depends() and read_barrier_depends(). - Miscellaneous fixes. - Torture-test updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-02net: tcp: Remove TCP probe moduleMasami Hiramatsu
Remove TCP probe module since jprobe has been deprecated. That function is now replaced by tcp/tcp_probe trace-event. You can use it via ftrace or perftools. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02net: tcp: Add trace events for TCP congestion window tracingMasami Hiramatsu
This adds an event to trace TCP stat variables with slightly intrusive trace-event. This uses ftrace/perf event log buffer to trace those state, no needs to prepare own ring-buffer, nor custom user apps. User can use ftrace to trace this event as below; # cd /sys/kernel/debug/tracing # echo 1 > events/tcp/tcp_probe/enable (run workloads) # cat trace Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02inet_diag: Add equal-operator for portsKristian Evensen
inet_diag currently provides less/greater than or equal operators for comparing ports when filtering sockets. An equal comparison can be performed by combining the two existing operators, or a user can for example request a port range and then do the final filtering in userspace. However, these approaches both have drawbacks. Implementing equal using LE/GE causes the size and complexity of a filter to grow quickly as the number of ports increase, while it on busy machines would be great if the kernel only returns information about relevant sockets. This patch introduces source and destination port equal operators. INET_DIAG_BC_S_EQ is used to match a source port, INET_DIAG_BC_D_EQ a destination port, and usage is the same as for the existing port operators. I.e., the port to match is stored in the no-member of the next inet_diag_bc_op-struct in the filter. Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
net/ipv6/ip6_gre.c is a case of parallel adds. include/trace/events/tcp.h is a little bit more tricky. The removal of in-trace-macro ifdefs in 'net' paralleled with moving show_tcp_state_name and friends over to include/trace/events/sock.h in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27tcp: do not allocate linear memory for zerocopy skbsWillem de Bruijn
Zerocopy payload is now always stored in frags, and space for headers is reversed, so this memory is unused. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27tcp: place all zerocopy payload in fragsWillem de Bruijn
This avoids an unnecessary copy of 1-2KB and improves tso_fragment, which has to fall back to tcp_fragment if skb->len != skb_data_len. It also avoids a surprising inconsistency in notifications: Zerocopy packets sent over loopback have their frags copied, so set SO_EE_CODE_ZEROCOPY_COPIED in the notification. But this currently does not happen for small packets, because when all data fits in the linear fragment, data is not copied in skb_orphan_frags_rx. Reported-by: Tom Deseyn <tom.deseyn@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27tcp: push full zerocopy packetsWillem de Bruijn
Skbs that reach MAX_SKB_FRAGS cannot be extended further. Do the same for zerocopy frags as non-zerocopy frags and set the PSH bit. This improves GRO assembly. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-12-22 1) Separate ESP handling from segmentation for GRO packets. This unifies the IPsec GSO and non GSO codepath. 2) Add asynchronous callbacks for xfrm on layer 2. This adds the necessary infrastructure to core networking. 3) Allow to use the layer2 IPsec GSO codepath for software crypto, all infrastructure is there now. 4) Also allow IPsec GSO with software crypto for local sockets. 5) Don't require synchronous crypto fallback on IPsec offloading, it is not needed anymore. 6) Check for xdo_dev_state_free and only call it if implemented. From Shannon Nelson. 7) Check for the required add and delete functions when a driver registers xdo_dev_ops. From Shannon Nelson. 8) Define xfrmdev_ops only with offload config. From Shannon Nelson. 9) Update the xfrm stats documentation. From Shannon Nelson. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2017-12-22 1) Check for valid id proto in validate_tmpl(), otherwise we may trigger a warning in xfrm_state_fini(). From Cong Wang. 2) Fix a typo on XFRMA_OUTPUT_MARK policy attribute. From Michal Kubecek. 3) Verify the state is valid when encap_type < 0, otherwise we may crash on IPsec GRO . From Aviv Heller. 4) Fix stack-out-of-bounds read on socket policy lookup. We access the flowi of the wrong address family in the IPv4 mapped IPv6 case, fix this by catching address family missmatches before we do the lookup. 5) fix xfrm_do_migrate() with AEAD to copy the geniv field too. Otherwise the state is not fully initialized and migration fails. From Antony Antony. 6) Fix stack-out-of-bounds with misconfigured transport mode policies. Our policy template validation is not strict enough. It is possible to configure policies with transport mode template where the address family of the template does not match the selectors address family. Fix this by refusing such a configuration, address family can not change on transport mode. 7) Fix a policy reference leak when reusing pcpu xdst entry. From Florian Westphal. 8) Reinject transport-mode packets through tasklet, otherwise it is possible to reate a recursion loop. From Herbert Xu. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26net: erspan: remove md NULL checkWilliam Tu
The 'md' is allocated from 'tun_dst = ip_tun_rx_dst' and since we've checked 'tun_dst', 'md' will never be NULL. The patch removes it at both ipv4 and ipv6 erspan. Fixes: afb4c97d90e6 ("ip6_gre: fix potential memory leak in ip6erspan_rcv") Fixes: 50670b6ee9bc ("ip_gre: fix potential memory leak in erspan_rcv") Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26tcp: md5: Handle RCU dereference of md5sig_infoMat Martineau
Dereference tp->md5sig_info in tcp_v4_destroy_sock() the same way it is done in the adjacent call to tcp_clear_md5_list(). Resolves this sparse warning: net/ipv4/tcp_ipv4.c:1914:17: warning: incorrect type in argument 1 (different address spaces) net/ipv4/tcp_ipv4.c:1914:17: expected struct callback_head *head net/ipv4/tcp_ipv4.c:1914:17: got struct callback_head [noderef] <asn:4>*<noident> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Lots of overlapping changes. Also on the net-next side the XDP state management is handled more in the generic layers so undo the 'net' nfp fix which isn't applicable in net-next. Include a necessary change by Jakub Kicinski, with log message: ==================== cls_bpf no longer takes care of offload tracking. Make sure netdevsim performs necessary checks. This fixes a warning caused by TC trying to remove a filter it has not added. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>