summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2016-11-16lwtunnel: subtract tunnel headroom from mtu on output redirectDavid Lebrun
This patch changes the lwtunnel_headroom() function which is called in ipv4_mtu() and ip6_mtu(), to also return the correct headroom value when the lwtunnel state is OUTPUT_REDIRECT. This patch enables e.g. SR-IPv6 encapsulations to work without manually setting the route mtu. Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16net: busy-poll: remove need_resched() from sk_can_busy_loop()Eric Dumazet
Now sk_busy_loop() can schedule by itself, we can remove need_resched() check from sk_can_busy_loop() Also add a const to its struct sock parameter. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16ipv4: Restore fib_trie_flush_external function and fix call orderingAlexander Duyck
The patch that removed the FIB offload infrastructure was a bit too aggressive and also removed code needed to clean up us splitting the table if additional rules were added. Specifically the function fib_trie_flush_external was called at the end of a new rule being added to flush the foreign trie entries from the main trie. I updated the code so that we only call fib_trie_flush_external on the main table so that we flush the entries for local from main. This way we don't call it for every rule change which is what was happening previously. Fixes: 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure") Reported-by: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15gro_cells: mark napi struct as not busy poll candidatesEric Dumazet
Rolf Neugebauer reported very long delays at netns dismantle. Eric W. Biederman was kind enough to look at this problem and noticed synchronize_net() occurring from netif_napi_del() that was added in linux-4.5 Busy polling makes no sense for tunnels NAPI. If busy poll is used for sessions over tunnels, the poller will need to poll the physical device queue anyway. netif_tx_napi_add() could be used here, but function name is misleading, and renaming it is not stable material, so set NAPI_STATE_NO_BUSY_POLL bit directly. This will avoid inserting gro_cells napi structures in napi_hash[] and avoid the problematic synchronize_net() (per possible cpu) that Rolf reported. Fixes: 93d05d4a320c ("net: provide generic busy polling to all NAPI drivers") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Rolf Neugebauer <rolf.neugebauer@docker.com> Reported-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Tested-by: Rolf Neugebauer <rolf.neugebauer@docker.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15vxlan: remove unsed vxlan_dev_dst_port()pravin shelar
Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15udplite: fix NULL pointer dereferencePaolo Abeni
The commit 850cbaddb52d ("udp: use it's own memory accounting schema") assumes that the socket proto has memory accounting enabled, but this is not the case for UDPLITE. Fix it enabling memory accounting for UDPLITE and performing fwd allocated memory reclaiming on socket shutdown. UDP and UDPLITE share now the same memory accounting limits. Also drop the backlog receive operation, since is no more needed. Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema") Reported-by: Andrei Vagin <avagin@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several cases of bug fixes in 'net' overlapping other changes in 'net-next-. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14net: fix sleeping for sk_wait_event()WANG Cong
Similar to commit 14135f30e33c ("inet: fix sleeping inside inet_wait_for_connect()"), sk_wait_event() needs to fix too, because release_sock() is blocking, it changes the process state back to running after sleep, which breaks the previous prepare_to_wait(). Switch to the new wait API. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains a second batch of Netfilter updates for your net-next tree. This includes a rework of the core hook infrastructure that improves Netfilter performance by ~15% according to synthetic benchmarks. Then, a large batch with ipset updates, including a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a couple of assorted updates. Regarding the core hook infrastructure rework to improve performance, using this simple drop-all packets ruleset from ingress: nft add table netdev x nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; } nft add rule netdev x y drop And generating traffic through Jesper Brouer's samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using -i option. perf report shows nf_tables calls in its top 10: 17.30% kpktgend_0 [nf_tables] [k] nft_do_chain 15.75% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core 10.39% kpktgend_0 [nf_tables_netdev] [k] nft_do_chain_netdev I'm measuring here an improvement of ~15% in performance with this patchset, so we got +2.5Mpps more. I have used my old laptop Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz 4-cores. This rework contains more specifically, in strict order, these patches: 1) Remove compile-time debugging from core. 2) Remove obsolete comments that predate the rcu era. These days it is well known that a Netfilter hook always runs under rcu_read_lock(). 3) Remove threshold handling, this is only used by br_netfilter too. We already have specific code to handle this from br_netfilter, so remove this code from the core path. 4) Deprecate NF_STOP, as this is only used by br_netfilter. 5) Place nf_state_hook pointer into xt_action_param structure, so this structure fits into one single cacheline according to pahole. This also implicit affects nftables since it also relies on the xt_action_param structure. 6) Move state->hook_entries into nf_queue entry. The hook_entries pointer is only required by nf_queue(), so we can store this in the queue entry instead. 7) use switch() statement to handle verdict cases. 8) Remove hook_entries field from nf_hook_state structure, this is only required by nf_queue, so store it in nf_queue_entry structure. 9) Merge nf_iterate() into nf_hook_slow() that results in a much more simple and readable function. 10) Handle NF_REPEAT away from the core, so far the only client is nf_conntrack_in() and we can restart the packet processing using a simple goto to jump back there when the TCP requires it. This update required a second pass to fix fallout, fix from Arnd Bergmann. 11) Set random seed from nft_hash when no seed is specified from userspace. 12) Simplify nf_tables expression registration, in a much smarter way to save lots of boiler plate code, by Liping Zhang. 13) Simplify layer 4 protocol conntrack tracker registration, from Davide Caratti. 14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due to recent generalization of the socket infrastructure, from Arnd Bergmann. 15) Then, the ipset batch from Jozsef, he describes it as it follows: * Cleanup: Remove extra whitespaces in ip_set.h * Cleanup: Mark some of the helpers arguments as const in ip_set.h * Cleanup: Group counter helper functions together in ip_set.h * struct ip_set_skbinfo is introduced instead of open coded fields in skbinfo get/init helper funcions. * Use kmalloc() in comment extension helper instead of kzalloc() because it is unnecessary to zero out the area just before explicit initialization. * Cleanup: Split extensions into separate files. * Cleanup: Separate memsize calculation code into dedicated function. * Cleanup: group ip_set_put_extensions() and ip_set_get_extensions() together. * Add element count to hash headers by Eric B Munson. * Add element count to all set types header for uniform output across all set types. * Count non-static extension memory into memsize calculation for userspace. * Cleanup: Remove redundant mtype_expire() arguments, because they can be get from other parameters. * Cleanup: Simplify mtype_expire() for hash types by removing one level of intendation. * Make NLEN compile time constant for hash types. * Make sure element data size is a multiple of u32 for the hash set types. * Optimize hash creation routine, exit as early as possible. * Make struct htype per ipset family so nets array becomes fixed size and thus simplifies the struct htype allocation. * Collapse same condition body into a single one. * Fix reported memory size for hash:* types, base hash bucket structure was not taken into account. * hash:ipmac type support added to ipset by Tomasz Chilinski. * Use setup_timer() and mod_timer() instead of init_timer() by Muhammad Falak R Wani, individually for the set type families. 16) Remove useless connlabel field in struct netns_ct, patch from Florian Westphal. 17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify {ip,ip6,arp}tables code that uses this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13netfilter: conntrack: remove unused netns_ct memberFlorian Westphal
since 23014011ba420 ('netfilter: conntrack: support a fixed size of 128 distinct labels') this isn't needed anymore. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-13tcp: take care of truncations done by sk_filter()Eric Dumazet
With syzkaller help, Marco Grassi found a bug in TCP stack, crashing in tcp_collapse() Root cause is that sk_filter() can truncate the incoming skb, but TCP stack was not really expecting this to happen. It probably was expecting a simple DROP or ACCEPT behavior. We first need to make sure no part of TCP header could be removed. Then we need to adjust TCP_SKB_CB(skb)->end_seq Many thanks to syzkaller team and Marco for giving us a reproducer. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Marco Grassi <marco.gra@gmail.com> Reported-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13genetlink: Make family a signed integer.David S. Miller
The idr_alloc(), idr_remove(), et al. routines all expect IDs to be signed integers. Therefore make the genl_family member 'id' signed too. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09tc_act: Remove tcf_act macroYotam Gigi
tc_act macro addressed a non existing field, and was not used in the kernel source. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09ipv6: add source address argument for ipv6_push_nfrag_optsDavid Lebrun
This patch prepares for insertion of SRH through setsockopt(). The new source address argument is used when an HMAC field is present in the SRH, which must be filled. The HMAC signature process requires the source address as input text. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09ipv6: sr: add core files for SR HMAC supportDavid Lebrun
This patch adds the necessary functions to compute and check the HMAC signature of an SR-enabled packet. Two HMAC algorithms are supported: hmac(sha1) and hmac(sha256). In order to avoid dynamic memory allocation for each HMAC computation, a per-cpu ring buffer is allocated for this purpose. A new per-interface sysctl called seg6_require_hmac is added, allowing a user-defined policy for processing HMAC-signed SR-enabled packets. A value of -1 means that the HMAC field will always be ignored. A value of 0 means that if an HMAC field is present, its validity will be enforced (the packet is dropped is the signature is incorrect). Finally, a value of 1 means that any SR-enabled packet that does not contain an HMAC signature or whose signature is incorrect will be dropped. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09ipv6: sr: add support for SRH encapsulation and injection with lwtunnelsDavid Lebrun
This patch creates a new type of interfaceless lightweight tunnel (SEG6), enabling the encapsulation and injection of SRH within locally emitted packets and forwarded packets. >From a configuration viewpoint, a seg6 tunnel would be configured as follows: ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0 Any packet whose destination address is fc00::1 would thus be encapsulated within an outer IPv6 header containing the SRH with three segments, and would actually be routed to the first segment of the list. If `mode inline' was specified instead of `mode encap', then the SRH would be directly inserted after the IPv6 header without outer encapsulation. The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This feature was made configurable because direct header insertion may break several mechanisms such as PMTUD or IPSec AH. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09ipv6: sr: add code base for control plane support of SR-IPv6David Lebrun
This patch adds the necessary hooks and structures to provide support for SR-IPv6 control plane, essentially the Generic Netlink commands that will be used for userspace control over the Segment Routing kernel structures. The genetlink commands provide control over two different structures: tunnel source and HMAC data. The tunnel source is the source address that will be used by default when encapsulating packets into an outer IPv6 header + SRH. If the tunnel source is set to :: then an address of the outgoing interface will be selected as the source. The HMAC commands currently just return ENOTSUPP and will be implemented in a future patch. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)David Lebrun
Implement minimal support for processing of SR-enabled packets as described in https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02. This patch implements the following operations: - Intermediate segment endpoint: incrementation of active segment and rerouting. - Egress for SR-encapsulated packets: decapsulation of outer IPv6 header + SRH and routing of inner packet. - Cleanup flag support for SR-inlined packets: removal of SRH if we are the penultimate segment endpoint. A per-interface sysctl seg6_enabled is provided, to accept/deny SR-enabled packets. Default is deny. This patch does not provide support for HMAC-signed packets. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains a larger than usual batch of Netfilter fixes for your net tree. This series contains a mixture of old bugs and recently introduced bugs, they are: 1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't support the set element updates from the packet path. From Liping Zhang. 2) Fix leak when nft_expr_clone() fails, from Liping Zhang. 3) Fix a race when inserting new elements to the set hash from the packet path, also from Liping. 4) Handle segmented TCP SIP packets properly, basically avoid that the INVITE in the allow header create bogus expectations by performing stricter SIP message parsing, from Ulrich Weber. 5) nft_parse_u32_check() should return signed integer for errors, from John Linville. 6) Fix wrong allocation instead of connlabels, allocate 16 instead of 32 bytes, from Florian Westphal. 7) Fix compilation breakage when building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann. 8) Destroy the new set if the transaction object cannot be allocated, also from Liping Zhang. 9) Use device to route duplicated packets via nft_dup only when set by the user, otherwise packets may not follow the right route, again from Liping. 10) Fix wrong maximum genetlink attribute definition in IPVS, from WANG Cong. 11) Ignore untracked conntrack objects from xt_connmark, from Florian Westphal. 12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC via CT target, otherwise we cannot use the h.245 helper, from Florian. 13) Revisit garbage collection heuristic in the new workqueue-based timer approach for conntrack to evict objects earlier, again from Florian. 14) Fix crash in nf_tables when inserting an element into a verdict map, from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09netfilter: conntrack: simplify init/uninit of L4 protocol trackersDavide Caratti
modify registration and deregistration of layer-4 protocol trackers to facilitate inclusion of new elements into the current list of builtin protocols. Both builtin (TCP, UDP, ICMP) and non-builtin (DCCP, GRE, SCTP, UDPlite) layer-4 protocol trackers usually register/deregister themselves using consecutive calls to nf_ct_l4proto_{,pernet}_{,un}register(...). This sequence is interrupted and rolled back in case of error; in order to simplify addition of builtin protocols, the input of the above functions has been modified to allow registering/unregistering multiple protocols. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09net/flowcache: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. Use multi state support to avoid custom list handling for the multiple instances. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: netdev@vger.kernel.org Cc: rt@linutronix.de Cc: "David S. Miller" <davem@davemloft.net> Link: http://lkml.kernel.org/r/20161103145021.28528-10-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09netfilter: nf_tables: simplify the basic expressions' init routineLiping Zhang
Some basic expressions are built into nf_tables.ko, such as nft_cmp, nft_lookup, nft_range and so on. But these basic expressions' init routine is a little ugly, too many goto errX labels, and we forget to call nft_range_module_exit in the exit routine, although it is harmless. Acctually, the init and exit routines of these basic expressions are same, i.e. do nft_register_expr in the init routine and do nft_unregister_expr in the exit routine. So it's better to arrange them into an array and deal with them together. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-09net/dst: Add dst port to dst_metadata utility functionsHadar Hen Zion
Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst utility functions. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net/sched: cls_flower: Add UDP port to tunnel parametersHadar Hen Zion
The current IP tunneling classification supports only IP addresses and key. Enhance UDP based IP tunneling classification parameters by adding UDP src and dst port. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09flow_dissector: Add enums for encapsulation keysHadar Hen Zion
New encapsulation keys were added to the flower classifier, which allow classification according to outer (encapsulation) headers attributes such as key and IP addresses. In order to expose those attributes outside flower, add corresponding enums in the flow dissector. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net/sched: act_tunnel_key: add helper inlines to access tcf_tunnel_keyHadar Hen Zion
Needed for drivers to pick the relevant action when offloading tunnel key act. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07net: make default TX queue length a defined constantJesper Dangaard Brouer
The default TX queue length of Ethernet devices have been a magic constant of 1000, ever since the initial git import. Looking back in historical trees[1][2] the value used to be 100, with the same comment "Ethernet wants good queues". The commit[3] that changed this from 100 to 1000 didn't describe why, but from conversations with Robert Olsson it seems that it was changed when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the link speed increased x10 the queue size were also adjusted. This value later caused much heartache for the bufferbloat community. This patch merely moves the value into a defined constant. [1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/ [2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/ [3] https://git.kernel.org/tglx/history/c/98921832c232 Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07udp: do fwd memory scheduling on dequeuePaolo Abeni
A new argument is added to __skb_recv_datagram to provide an explicit skb destructor, invoked under the receive queue lock. The UDP protocol uses such argument to perform memory reclaiming on dequeue, so that the UDP protocol does not set anymore skb->desctructor. Instead explicit memory reclaiming is performed at close() time and when skbs are removed from the receive queue. The in kernel UDP protocol users now need to call a skb_recv_udp() variant instead of skb_recv_datagram() to properly perform memory accounting on dequeue. Overall, this allows acquiring only once the receive queue lock on dequeue. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr sinks vanilla patched 1 440 560 3 2150 2300 6 3650 3800 9 4450 4600 12 6250 6450 v1 -> v2: - do rmem and allocated memory scheduling under the receive lock - do bulk scheduling in first_packet_length() and in udp_destruct_sock() - avoid the typdef for the dequeue callback Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07net/sock: add an explicit sk argument for ip_cmsg_recv_offset()Paolo Abeni
So that we can use it even after orphaining the skbuff. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04net: inet: Support UID-based routing in IP protocols.Lorenzo Colitti
- Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04net: core: add UID to flows, rules, and routesLorenzo Colitti
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04net: core: Add a UID field to struct sock.Lorenzo Colitti
Protocol sockets (struct sock) don't have UIDs, but most of the time, they map 1:1 to userspace sockets (struct socket) which do. Various operations such as the iptables xt_owner match need access to the "UID of a socket", and do so by following the backpointer to the struct socket. This involves taking sk_callback_lock and doesn't work when there is no socket because userspace has already called close(). Simplify this by adding a sk_uid field to struct sock whose value matches the UID of the corresponding struct socket. The semantics are as follows: 1. Whenever sk_socket is non-null: sk_uid is the same as the UID in sk_socket, i.e., matches the return value of sock_i_uid. Specifically, the UID is set when userspace calls socket(), fchown(), or accept(). 2. When sk_socket is NULL, sk_uid is defined as follows: - For a socket that no longer has a sk_socket because userspace has called close(): the previous UID. - For a cloned socket (e.g., an incoming connection that is established but on which userspace has not yet called accept): the UID of the socket it was cloned from. - For a socket that has never had an sk_socket: UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. Kernel sockets created by sock_create_kern are a special case of #1 and sk_uid is the user that created them. For kernel sockets created at network namespace creation time, such as the per-processor ICMP and TCP sockets, this is the user that created the network namespace. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03dccp: do not release listeners too soonEric Dumazet
Andrey Konovalov reported following error while fuzzing with syzkaller : IPv4: Attempt to release alive inet socket ffff880068e98940 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006b9e0000 task.stack: ffff880068770000 RIP: 0010:[<ffffffff819ead5f>] [<ffffffff819ead5f>] selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639 RSP: 0018:ffff8800687771c8 EFLAGS: 00010202 RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010 RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002 R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940 R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000 FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0 Stack: ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0 Call Trace: [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257 [< inline >] dst_input ./include/net/dst.h:507 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332 [< inline >] new_sync_write fs/read_write.c:499 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560 [< inline >] SYSC_write fs/read_write.c:607 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2 It turns out DCCP calls __sk_receive_skb(), and this broke when lookups no longer took a reference on listeners. Fix this issue by adding a @refcounted parameter to __sk_receive_skb(), so that sock_put() is used only when needed. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv4: allow local fragmentation in ip_finish_output_gso()Lance Richardson
Some configurations (e.g. geneve interface with default MTU of 1500 over an ethernet interface with 1500 MTU) result in the transmission of packets that exceed the configured MTU. While this should be considered to be a "bad" configuration, it is still allowed and should not result in the sending of packets that exceed the configured MTU. Fix by dropping the assumption in ip_finish_output_gso() that locally originated gso packets will never need fragmentation. Basic testing using iperf (observing CPU usage and bandwidth) have shown no measurable performance impact for traffic not requiring fragmentation. Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack") Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03net: tcp: check skb is non-NULL for exact match on lookupsDavid Ahern
Andrey reported the following error report while running the syzkaller fuzzer: general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff8800398c4480 task.stack: ffff88003b468000 RIP: 0010:[<ffffffff83091106>] [< inline >] inet_exact_dif_match include/net/tcp.h:808 RIP: 0010:[<ffffffff83091106>] [<ffffffff83091106>] __inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219 RSP: 0018:ffff88003b46f270 EFLAGS: 00010202 RAX: 0000000000000004 RBX: 0000000000004242 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffc90000e3c000 RDI: 0000000000000054 RBP: ffff88003b46f2d8 R08: 0000000000004000 R09: ffffffff830910e7 R10: 0000000000000000 R11: 000000000000000a R12: ffffffff867fa0c0 R13: 0000000000004242 R14: 0000000000000003 R15: dffffc0000000000 FS: 00007fb135881700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020cc3000 CR3: 000000006d56a000 CR4: 00000000000006f0 Stack: 0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242 424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246 ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae Call Trace: [<ffffffff831100f4>] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643 [<ffffffff83115b1b>] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216 ... MD5 has a code path that calls __inet_lookup_listener with a null skb, so inet{6}_exact_dif_match needs to check skb against null before pulling the flag. Fixes: a04a480d4392 ("net: Require exact match for TCP socket lookups if dif is l3mdev") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv4: add IP_RECVFRAGSIZE cmsgWillem de Bruijn
The IP stack records the largest fragment of a reassembled packet in IPCB(skb)->frag_max_size. When reading a datagram or raw packet that arrived fragmented, expose the value to allow applications to estimate receive path MTU. Tested: Sent data over a veth pair of which the source has a small mtu. Sent data using netcat, received using a dedicated process. Verified that the cmsg IP_RECVFRAGSIZE is returned only when data arrives fragmented, and in that cases matches the veth mtu. ip link add veth0 type veth peer name veth1 ip netns add from ip netns add to ip link set dev veth1 netns to ip netns exec to ip addr add dev veth1 192.168.10.1/24 ip netns exec to ip link set dev veth1 up ip link set dev veth0 netns from ip netns exec from ip addr add dev veth0 192.168.10.2/24 ip netns exec from ip link set dev veth0 up ip netns exec from ip link set dev veth0 mtu 1300 ip netns exec from ethtool -K veth0 ufo off dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 & ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03netfilter: remove hook_entries field from nf_hook_statePablo Neira Ayuso
This field is only useful for nf_queue, so store it in the nf_queue_entry structure instead, away from the core path. Pass hook_head to nf_hook_slow(). Since we always have a valid entry on the first iteration in nf_iterate(), we can use 'do { ... } while (entry)' loop instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03netfilter: nf_tables: use hook state from xt_action_param structurePablo Neira Ayuso
Don't copy relevant fields from hook state structure, instead use the one that is already available in struct xt_action_param. This patch also adds a set of new wrapper functions to fetch relevant hook state structure fields. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-03netfilter: x_tables: move hook state into xt_action_param structurePablo Neira Ayuso
Place pointer to hook state in xt_action_param structure instead of copying the fields that we need. After this change xt_action_param fits into one cacheline. This patch also adds a set of new wrapper functions to fetch relevant hook state structure fields. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-02ip6_tunnel: Clear IP6CB in ip6tunnel_xmit()Eli Cooper
skb->cb may contain data from previous layers. In the observed scenario, the garbage data were misinterpreted as IP6CB(skb)->frag_max_size, so that small packets sent through the tunnel are mistakenly fragmented. This patch unconditionally clears the control buffer in ip6tunnel_xmit(), which affects ip6_tunnel, ip6_udp_tunnel and ip6_gre. Currently none of these tunnels set IP6CB(skb)->flags, otherwise it needs to be done earlier. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. This includes better integration with the routing subsystem for nf_tables, explicit notrack support and smaller updates. More specifically, they are: 1) Add fib lookup expression for nf_tables, from Florian Westphal. This new expression provides a native replacement for iptables addrtype and rp_filter matches. This is more flexible though, since we can populate the kernel flowi representation to inquire fib to accomodate new usecases, such as RTBH through skb mark. 2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This new expression allow you to access skbuff route metadata, more specifically nexthop and classid fields. 3) Add notrack support for nf_tables, to skip conntracking, requested by many users already. 4) Add boilerplate code to allow to use nf_log infrastructure from nf_tables ingress. 5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate the xtables cluster match, from Liping Zhang. 6) Move socket lookup code into generic nf_socket_* infrastructure so we can provide a native replacement for the xtables socket match. 7) Make sure nfnetlink_queue data that is updated on every packets is placed in a different cache from read-only data, from Florian Westphal. 8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal. 9) Start round robin number generation in nft_numgen from zero, instead of n-1, for consistency with xtables statistics match, patch from Liping Zhang. 10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log, given we retry with a smaller allocation on failure, from Calvin Owens. 11) Cleanup xt_multiport to use switch(), from Gao feng. 12) Remove superfluous check in nft_immediate and nft_cmp, from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.cPablo Neira Ayuso
We need this split to reuse existing codebase for the upcoming nf_tables socket expression. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01netfilter: nf_log: add packet logging for netdev familyPablo Neira Ayuso
Move layer 2 packet logging into nf_log_l2packet() that resides in nf_log_common.c, so this can be shared by both bridge and netdev families. This patch adds the boiler plate code to register the netdev logging family. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-01netfilter: nf_tables: add fib expressionFlorian Westphal
Add FIB expression, supported for ipv4, ipv6 and inet family (the latter just dispatches to ipv4 or ipv6 one based on nfproto). Currently supports fetching output interface index/name and the rtm_type associated with an address. This can be used for adding path filtering. rtm_type is useful to e.g. enforce a strong-end host model where packets are only accepted if daddr is configured on the interface the packet arrived on. The fib expression is a native nftables alternative to the xtables addrtype and rp_filter matches. FIB result order for oif/oifname retrieval is as follows: - if packet is local (skb has rtable, RTF_LOCAL set, this will also catch looped-back multicast packets), set oif to the loopback interface. - if fib lookup returns an error, or result points to local, store zero result. This means '--local' option of -m rpfilter is not supported. It is possible to use 'fib type local' or add explicit saddr/daddr matching rules to create exceptions if this is really needed. - store result in the destination register. In case of multiple routes, search set for desired oif in case strict matching is requested. ipv4 and ipv6 behave fib expressions are supposed to behave the same. [ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings") http://patchwork.ozlabs.org/patch/688615/ to address fallout from this patch after rebasing nf-next, that was posted to address compilation warnings. --pablo ] Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-10-31net: set SK_MEM_QUANTUM to 4096Eric Dumazet
Systems with large pages (64KB pages for example) do not always have huge quantity of memory. A big SK_MEM_QUANTUM value leads to fewer interactions with the global counters (like tcp_memory_allocated) but might trigger memory pressure much faster, giving suboptimal TCP performance since windows are lowered to ridiculous values. Note that sysctl_mem units being in pages and in ABI, we also need to change sk_prot_mem_limits() accordingly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31sctp: hold transport instead of assoc when lookup assoc in rx pathXin Long
Prior to this patch, in rx path, before calling lock_sock, it needed to hold assoc when got it by __sctp_lookup_association, in case other place would free/put assoc. But in __sctp_lookup_association, it lookup and hold transport, then got assoc by transport->assoc, then hold assoc and put transport. It means it didn't hold transport, yet it was returned and later on directly assigned to chunk->transport. Without the protection of sock lock, the transport may be freed/put by other places, which would cause a use-after-free issue. This patch is to fix this issue by holding transport instead of assoc. As holding transport can make sure to access assoc is also safe, and actually it looks up assoc by searching transport rhashtable, to hold transport here makes more sense. Note that the function will be renamed later on on another patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Lots of fixes, mostly drivers as is usually the case. 1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey Khoroshilov. 2) Fix element timeouts in netfilter's nft_dynset, from Anders K. Pedersen. 3) Don't put aead_req crypto struct on the stack in mac80211, from Ard Biesheuvel. 4) Several uninitialized variable warning fixes from Arnd Bergmann. 5) Fix memory leak in cxgb4, from Colin Ian King. 6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann. 7) Several VRF semantic fixes from David Ahern. 8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper. 9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet. 10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev. 11) Fix stale link state during failover in NCSCI driver, from Gavin Shan. 12) Fix netdev lower adjacency list traversal, from Ido Schimmel. 13) Propvide proper handle when emitting notifications of filter deletes, from Jamal Hadi Salim. 14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen. 15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac. 16) Several routing offload fixes in mlxsw driver, from Jiri Pirko. 17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy. 18) Validate chunk len before using it in SCTP, from Marcelo Ricardo Leitner. 19) Revert a netns locking change that causes regressions, from Paul Moore. 20) Add recursion limit to GRO handling, from Sabrina Dubroca. 21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon. 22) Avoid accessing stale vxlan/geneve socket in data path, from Pravin Shelar" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits) geneve: avoid using stale geneve socket. vxlan: avoid using stale vxlan socket. qede: Fix out-of-bound fastpath memory access net: phy: dp83848: add dp83822 PHY support enic: fix rq disable tipc: fix broadcast link synchronization problem ibmvnic: Fix missing brackets in init_sub_crq_irqs ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context" arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold net/mlx4_en: Save slave ethtool stats command net/mlx4_en: Fix potential deadlock in port statistics flow net/mlx4: Fix firmware command timeout during interrupt test net/mlx4_core: Do not access comm channel if it has not yet been initialized net/mlx4_en: Fix panic during reboot net/mlx4_en: Process all completions in RX rings after port goes up net/mlx4_en: Resolve dividing by zero in 32-bit system net/mlx4_core: Change the default value of enable_qos net/mlx4_core: Avoid setting ports to auto when only one port type is supported net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec ...
2016-10-29vxlan: avoid using stale vxlan socket.pravin shelar
When vxlan device is closed vxlan socket is freed. This operation can race with vxlan-xmit function which dereferences vxlan socket. Following patch uses RCU mechanism to avoid this situation. Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29Merge tag 'mac80211-next-for-davem-2016-10-28' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Among various cleanups and improvements, we have the following: * client FILS authentication support in mac80211 (Jouni) * AP/VLAN multicast improvements (Michael Braun) * config/advertising support for differing beacon intervals on multiple virtual interfaces (Purushottam Kushwaha, myself) * deprecate the old WDS mode for cfg80211-based drivers, the mode is hardly usable since it doesn't support any "modern" features like WPA encryption (2003), HT (2009) or VHT (2014), I'm not even sure WEP (introduced in 1997) could be done. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>