summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2016-09-19Bluetooth: Introduce helper to pack mgmt version informationMarcel Holtmann
The mgmt version information will be also needed for the control changell tracing feature. This provides a helper to pack them. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19Bluetooth: Store control socket cookie and comm informationMarcel Holtmann
To further allow unique identification and tracking of control socket, store cookie and comm information when binding the socket. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19Bluetooth: add printf format attribute to hci_set_[fh]w_info()Nicolas Iooss
Commit 5177a83827cd ("Bluetooth: Add debugfs fields for hardware and firmware info") introduced hci_set_hw_info() and hci_set_fw_info(). These functions use kvasprintf_const() but are not marked with a __printf attribute. Adding such an attribute helps detecting issues related to printf-formatting at build time. Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19Bluetooth: Add HCI device identifier for Qualcomm SMDBjorn Andersson
This patch assigns the next free HCI device identifier to Bluetooth devices based on the Qualcomm Shared Memory channels. Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19Bluetooth: Put led_trigger field behind CONFIG_BT_LEDSMarcel Holtmann
The led_trigger field in hci_dev should be conditional based on if CONFIG_BT_LEDS is set or not. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19sched: add and use qdisc_skb_head helpersFlorian Westphal
This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head. Its similar to the skb_buff_head api, but does not use skb->prev pointers. Qdiscs will commonly enqueue at the tail of a list and dequeue at head. While skb_buff_head works fine for this, enqueue/dequeue needs to also adjust the prev pointer of next element. The ->prev pointer is not required for qdiscs so we can just leave it undefined and avoid one cacheline write access for en/dequeue. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19sched: remove qdisc arg from __qdisc_dequeue_headFlorian Westphal
Moves qdisc stat accouting to qdisc_dequeue_head. The only direct caller of the __qdisc_dequeue_head version open-codes this now. This allows us to later use __qdisc_dequeue_head as a replacement of __skb_dequeue() (which operates on sk_buff_head list). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19ipv6: Export p6_route_input_lookup symbolMahesh Bandewar
Make ip6_route_input_lookup available outside of ipv6 the module similar to ip_route_input_noref in the IPv4 world. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18Merge tag 'mac80211-next-for-davem-2016-09-16' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== This time we have various things - all across the board: * MU-MIMO sniffer support in mac80211 * a create_singlethread_workqueue() cleanup * interface dump filtering that was documented but not implemented * support for the new radiotap timestamp field * send delBA in two unexpected conditions (as required by the spec) * connect keys cleanups - allow only WEP with index 0-3 * per-station aggregation limit to work around broken APs * debugfs improvement for the integrated codel algorithm and various other small improvements and cleanups. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18sctp: make sctp_outq_flush/tail/uncork return voidXin Long
sctp_outq_flush return value is meaningless now, this patch is to make sctp_outq_flush return void, as well as sctp_outq_fail and sctp_outq_uncork. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18sctp: free msg->chunks when sctp_primitive_SEND return errXin Long
Last patch "sctp: do not return the transmit err back to sctp_sendmsg" made sctp_primitive_SEND return err only when asoc state is unavailable. In this case, chunks are not enqueued, they have no chance to be freed if we don't take care of them later. This Patch is actually to revert commit 1cd4d5c4326a ("sctp: remove the unused sctp_datamsg_free()"), commit 69b5777f2e57 ("sctp: hold the chunks only after the chunk is enqueued in outq") and commit 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after sending a msg"), to use sctp_datamsg_free to free the chunks of current msg. Fixes: 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after sending a msg") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17ip6_tunnel: add collect_md mode to IPv6 tunnelsAlexei Starovoitov
Similar to gre, vxlan, geneve tunnels allow IPIP6 and IP6IP6 tunnels to operate in 'collect metadata' mode. Unlike ipv4 code here it's possible to reuse ip6_tnl_xmit() function for both collect_md and traditional tunnels. bpf_skb_[gs]et_tunnel_key() helpers and ovs (in the future) are the users. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17ip_tunnel: add collect_md mode to IPIP tunnelAlexei Starovoitov
Similar to gre, vxlan, geneve tunnels allow IPIP tunnels to operate in 'collect metadata' mode. bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away. ovs can use it as well in the future (once appropriate ovs-vport abstractions and user apis are added). Note that just like in other tunnels we cannot cache the dst, since tunnel_info metadata can be different for every packet. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17net: l3mdev: Remove netif_index_is_l3_masterDavid Ahern
No longer used after e0d56fdd73422 ("net: l3mdev: remove redundant calls") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17sctp: fix SSN comparisionMarcelo Ricardo Leitner
This function actually operates on u32 yet its paramteres were declared as u16, causing integer truncation upon calling. Note in patch context that ADDIP_SERIAL_SIGN_BIT is already 32 bits. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17net: avoid sk_forward_alloc overflowsEric Dumazet
A malicious TCP receiver, sending SACK, can force the sender to split skbs in write queue and increase its memory usage. Then, when socket is closed and its write queue purged, we might overflow sk_forward_alloc (It becomes negative) sk_mem_reclaim() does nothing in this case, and more than 2GB are leaked from TCP perspective (tcp_memory_allocated is not changed) Then warnings trigger from inet_sock_destruct() and sk_stream_kill_queues() seeing a not zero sk_forward_alloc All TCP stack can be stuck because TCP is under memory pressure. A simple fix is to preemptively reclaim from sk_mem_uncharge(). This makes sure a socket wont have more than 2 MB forward allocated, after burst and idle period. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16cfg80211: add helper to find an IE that matches a byte-arrayLuca Coelho
There are a few places where an IE that matches not only the EID, but also other bytes inside the element, needs to be found. To simplify that and reduce the amount of similar code, implement a new helper function to match the EID and an extra array of bytes. Additionally, simplify cfg80211_find_vendor_ie() by using the new match function. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-16net-next: dsa: add Qualcomm tag RX/TX handlerJohn Crispin
Add support for the 2-bytes Qualcomm tag that gigabit switches such as the QCA8337/N might insert when receiving packets, or that we need to insert while targeting specific switch ports. The tag is inserted directly behind the ethernet header. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15net_sched: Introduce skbmod actionJamal Hadi Salim
This action is intended to be an upgrade from a usability perspective from pedit (as well as operational debugability). Compare this: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action pedit munge offset -14 u8 set 0x02 \ munge offset -13 u8 set 0x15 \ munge offset -12 u8 set 0x15 \ munge offset -11 u8 set 0x15 \ munge offset -10 u16 set 0x1515 \ pipe to: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbmod dmac 02:15:15:15:15:15 Also try to do a MAC address swap with pedit or worse try to debug a policy with destination mac, source mac and etherype. Then make few rules out of those and you'll get my point. In the future common use cases on pedit can be migrated to this action (as an example different fields in ip v4/6, transports like tcp/udp/sctp etc). For this first cut, this allows modifying basic ethernet header. The most important ethernet use case at the moment is when redirecting or mirroring packets to a remote machine. The dst mac address needs a re-write so that it doesnt get dropped or confuse an interconnecting (learning) switch or dropped by a target machine (which looks at the dst mac). And at times when flipping back the packet a swap of the MAC addresses is needed. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15mac80211: allow driver to handle packet-loss mechanismRajkumar Manoharan
Based on consecutive msdu failures, mac80211 triggers CQM packet-loss mechanism. Drivers like ath10k that have its own connection monitoring algorithm, offloaded to firmware for triggering station kickout. In case of station kickout, driver will report low ack status by mac80211 API (ieee80211_report_low_ack). This flag will enable the driver to completely rely on firmware events for station kickout and bypass mac80211 packet loss mechanism. Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15mac80211: remove sta_remove_debugfs driver callbackJohannes Berg
No drivers implement this, relying either on the recursive directory removal to remove their debugfs, or not having any to start with. Remove the dead driver callback. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Endianess fix for the new nf_tables netlink trace infrastructure, NFTA_TRACE_POLICY endianess was not correct, patch from Liping Zhang. 2) Fix broken re-route after userspace queueing in nf_tables route chain. This patch is large but it is simple since it is just getting this code in sync with iptable_mangle. Also from Liping. 3) NAT mangling via ctnetlink lies to userspace when nf_nat_setup_info() fails to setup the NAT conntrack extension. This problem has been there since the beginning, but it can now show up after rhashtable conversion. 4) Fix possible NULL pointer dereference due to failures in allocating the synproxy and seqadj conntrack extensions, from Gao feng. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensionsGao Feng
When memory is exhausted, nfct_seqadj_ext_add may fail to add the synproxy and seqadj extensions. The function nf_ct_seqadj_init doesn't check if get valid seqadj pointer by the nfct_seqadj. Now drop the packet directly when fail to add seqadj extension to avoid dereference NULL pointer in nf_ct_seqadj_init from init_conntrack(). Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/mediatek/mtk_eth_soc.c drivers/net/ethernet/qlogic/qed/qed_dcbx.c drivers/net/phy/Kconfig All conflicts were cases of overlapping commits. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12netfilter: nf_queue: get rid of dependency on IP6_NF_IPTABLESLiping Zhang
hash_v6 is used by both nftables and ip6tables, so depend on IP6_NF_IPTABLES is not properly. Actually, it only parses ipv6hdr and computes a hash value, so even if IPV6 is disabled, there's no side effect too, remove it. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12netfilter: nf_tables: don't drop IPv6 packets that cannot parse transportPablo Neira Ayuso
This is overly conservative and not flexible at all, so better let them go through and let the filtering policy decide what to do with them. We use skb_header_pointer() all over the place so we would just fail to match when trying to access fields from malformed traffic. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12netfilter: nf_tables_bridge: use nft_set_pktinfo_ipv{4, 6}_validatePablo Neira Ayuso
Consolidate pktinfo setup and validation by using the new generic functions so we converge to the netdev family codebase. We only need a linear IPv4 and IPv6 header from the reject expression, so move nft_bridge_iphdr_validate() and nft_bridge_ip6hdr_validate() to net/bridge/netfilter/nft_reject_bridge.c. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()Pablo Neira Ayuso
These functions are extracted from the netdev family, they initialize the pktinfo structure and validate that the IPv4 and IPv6 headers are well-formed given that these functions are called from a path where layer 3 sanitization did not happen yet. These functions are placed in include/net/netfilter/nf_tables_ipv{4,6}.h so they can be reused by a follow up patch to use them from the bridge family too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12netfilter: nf_tables_ipv6: setup pktinfo transport field on failure to parsePablo Neira Ayuso
Make sure the pktinfo protocol fields are initialized if this fails to parse the transport header. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12netfilter: nf_tables: ensure proper initialization of nft_pktinfo fieldsPablo Neira Ayuso
This patch introduces nft_set_pktinfo_unspec() that ensures proper initialization all of pktinfo fields for non-IP traffic. This is used by the bridge, netdev and arp families. This new function relies on nft_set_pktinfo_proto_unspec() to set a new tprot_set field that indicates if transport protocol information is available. Remain fields are zeroed. The meta expression has been also updated to check to tprot_set in first place given that zero is a valid tprot value. Even a handcrafted packet may come with the IPPROTO_RAW (255) protocol number so we can't rely on this value as tprot unset. Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12mac80211: add support for radiotap timestamp fieldJohannes Berg
Use the existing device timestamp from the RX status information to add support for the new radiotap timestamp field. Currently only 32-bit counters are supported, but we also add the radiotap mactime where applicable. This new field allows more flexibility in where the timestamp is taken etc. The non-timestamp data in the field is taken from a new field in the hw struct. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12mac80211: RX BA support for sta max_rx_aggregation_subframesMaxim Altshul
The ability to change the max_rx_aggregation frames is useful in cases of IOP. There exist some devices (latest mobile phones and some AP's) that tend to not respect a BA sessions maximum size (in Kbps). These devices won't respect the AMPDU size that was negotiated during association (even though they do respect the maximal number of packets). This violation is characterized by a valid number of packets in a single AMPDU. Even so, the total size will exceed the size negotiated during association. Eventually, this will cause some undefined behavior, which in turn causes the hw to drop packets, causing the throughput to plummet. This patch will make the subframe limitation to be held by each station, instead of being held only by hw. Signed-off-by: Maxim Altshul <maxim.altshul@ti.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12cfg80211: clarify the requirements of .disconnect()Emmanuel Grumbach
cfg80211 expects the .disconnect() handler to call cfg80211_disconnect() when done. Make this requirement more explicit. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-10net: flow: Remove FLOWI_FLAG_L3MDEV_SRC flagDavid Ahern
No longer used Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: l3mdev: remove get_rtable methodDavid Ahern
No longer used Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: l3mdev: Remove l3mdev_fib_oifDavid Ahern
No longer used Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: ipv6: Remove l3mdev_get_saddr6David Ahern
No longer needed Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: ipv4: Remove l3mdev_get_saddrDavid Ahern
No longer needed Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: vrf: Flip IPv6 output path from FIB lookup hook to out hookDavid Ahern
Flip the IPv6 output path to use the l3mdev tx out hook. The VRF dst is not returned on the first FIB lookup. Instead, the dst on the skb is switched at the beginning of the IPv6 output processing to send the packet to the VRF driver on xmit. Link scope addresses (linklocal and multicast) need special handling: specifically the oif the flow struct can not be changed because we want the lookup tied to the enslaved interface. ie., the source address and the returned route MUST point to the interface scope passed in. Convert the existing vrf_get_rt6_dst to handle only link scope addresses. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: l3mdev: Allow the l3mdev to be a loopbackDavid Ahern
Allow an L3 master device to act as the loopback for that L3 domain. For IPv4 the device can also have the address 127.0.0.1. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: l3mdev: Add hook to output pathDavid Ahern
This patch adds the infrastructure to the output path to pass an skb to an l3mdev device if it has a hook registered. This is the Tx parallel to l3mdev_ip{6}_rcv in the receive path and is the basis for removing the existing hook that returns the vrf dst on the fib lookup. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: flow: Add l3mdev flow updateDavid Ahern
Add l3mdev hook to set FLOWI_FLAG_SKIP_NH_OIF flag and update oif/iif in flow struct if its oif or iif points to a device enslaved to an L3 Master device. Only 1 needs to be converted to match the l3mdev FIB rule. This moves the flow adjustment for l3mdev to a single point catching all lookups. It is redundant for existing hooks (those are removed in later patches) but is needed for missed lookups such as PMTU updates. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net/sched: Introduce act_tunnel_keyAmir Vadai
This action could be used before redirecting packets to a shared tunnel device, or when redirecting packets arriving from a such a device. The action will release the metadata created by the tunnel device (decap), or set the metadata with the specified values for encap operation. For example, the following flower filter will forward all ICMP packets destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before redirecting, a metadata for the vxlan tunnel is created using the tunnel_key action and it's arguments: $ tc filter add dev net0 protocol ip parent ffff: \ flower \ ip_proto 1 \ dst_ip 11.11.11.2 \ action tunnel_key set \ src_ip 11.11.0.1 \ dst_ip 11.11.0.2 \ id 11 \ action mirred egress redirect dev vxlan0 Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net/dst: Utility functions to build dst_metadata without supplying an skbAmir Vadai
Extract __ip_tun_set_dst() and __ipv6_tun_set_dst() out of ip_tun_rx_dst() and ipv6_tun_rx_dst(), to be used without supplying an skb. Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net/ip_tunnels: Introduce tunnel_id_to_key32() and key32_to_tunnel_id()Amir Vadai
Add utility functions to convert a 32 bits key into a 64 bits tunnel and vice versa. These functions will be used instead of cloning code in GRE and VXLAN, and in tc act_iptunnel which will be introduced in a following patch in this patchset. Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jiri Benc <jbenc@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09Merge tag 'rxrpc-rewrite-20160908' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Rewrite data and ack handling This patch set constitutes the main portion of the AF_RXRPC rewrite. It consists of five fix/helper patches: (1) Fix ASSERTCMP's and ASSERTIFCMP's handling of signed values. (2) Update some protocol definitions slightly. (3) Use of an hlist for RCU purposes. (4) Removal of per-call sk_buff accounting (not really needed when skbs aren't being queued on the main queue). (5) Addition of a tracepoint to log incoming packets in the data_ready callback and to log the end of the data_ready callback. And then there are two patches that form the main part: (6) Preallocation of resources for incoming calls so that in patch (7) the data_ready handler can be made to fully instantiate an incoming call and make it live. This extends through into AFS so that AFS can preallocate its own incoming call resources. The preallocation size is capped at the listen() backlog setting - and that is capped at a sysctl limit which can be set between 4 and 32. The preallocation is (re)charged either by accepting/rejecting pending calls or, in the case of AFS, manually. If insufficient preallocation resources exist, a BUSY packet will be transmitted. The advantage of using this preallocation is that once a call is set up in the data_ready handler, DATA packets can be queued on it immediately rather than the DATA packets being queued for a background work item to do all the allocation and then try and sort out the DATA packets whilst other DATA packets may still be coming in and going either to the background thread or the new call. (7) Rewrite the handling of DATA, ACK and ABORT packets. In the receive phase, DATA packets are now held in per-call circular buffers with deduplication, out of sequence detection and suchlike being done in data_ready. Since there is only one producer and only once consumer, no locks need be used on the receive queue. Received ACK and ABORT packets are now parsed and discarded in data_ready to recycle resources as fast as possible. sk_buffs are no longer pulled, trimmed or cloned, but rather the offset and size of the content is tracked. This particularly affects jumbo DATA packets which need insertion into the receive buffer in multiple places. Annotations are kept to track which bit is which. Packets are no longer queued on the socket receive queue; rather, calls are queued. Dummy packets to convey events therefore no longer need to be invented and metadata packets can be discarded as soon as parsed rather then being pushed onto the socket receive queue to indicate terminal events. The preallocation facility added in (6) is now used to set up incoming calls with very little locking required and no calls to the allocator in data_ready. Decryption and verification is now handled in recvmsg() rather than in a background thread. This allows for the future possibility of decrypting directly into the user buffer. With this patch, the code is a lot simpler and most of the mass of call event and state wangling code in call_event.c is gone. With this, the majority of the AF_RXRPC rewrite is complete. However, there are still things to be done, including: (*) Limit the number of active service calls to prevent an attacker from filling up a server's memory. (*) Limit the number of calls on the rebuff-with-BUSY queue. (*) Transmit delayed/deferred ACKs from recvmsg() if possible, rather than punting to the background thread. Ideally, the background thread shouldn't run at all, but data_ready can't call kernel_sendmsg() and we can't rely on recvmsg() attending to the call in a timely fashion. (*) Prevent the call at the front of the socket queue from hogging recvmsg()'s attention if there's a sufficiently continuous supply of data. (*) Distribute ICMP errors by connection rather than by call. Possibly parse the ICMP packet to try and pin down the exact connection and call. (*) Encrypt/decrypt directly between user buffers and socket buffers where possible. (*) IPv6. (*) Service ID upgrade. This is a facility whereby a special flag bit is set in the DATA packet header when making a call that tells the server that it is allowed to change the service ID to an upgraded one and reply with an equivalent call from the upgraded service. This is used, for example, to override certain AFS calls so that IPv6 addresses can be returned. (*) Allow userspace to preallocate call user IDs for incoming calls. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09netfilter: nf_conntrack: remove unused ctl_table_path member in ↵Liping Zhang
nf_conntrack_l3proto After commit adf0516845bc ("netfilter: remove ip_conntrack* sysctl compat code"), ctl_table_path member in struct nf_conntrack_l3proto{} is not used anymore, remove it. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-08tcp: use an RB tree for ooo receive queueYaogong Wang
Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering to reach the 2 Gbytes limit. Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000 MSS. In presence of packet losses (or reorders), TCP stores incoming packets into an out of order queue, and number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range. Most packets are appended to the tail of this queue, and when packets can finally be transferred to receive queue, we scan the queue from its head. However, in presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches. This patch converts it to a RB tree, to get bounded latencies. Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added ofo_last_skb cache, polishing and tests. Tested with network dropping between 1 and 10 % packets, with good success (about 30 % increase of throughput in stress tests) Next step would be to also use an RB tree for the write queue at sender side ;) Signed-off-by: Yaogong Wang <wygivan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next 2016-09-08 1) Constify the xfrm_replay structures. From Julia Lawall 2) Protect xfrm state hash tables with rcu, lookups can be done now without acquiring xfrm_state_lock. From Florian Westphal. 3) Protect xfrm policy hash tables with rcu, lookups can be done now without acquiring xfrm_policy_lock. From Florian Westphal. 4) We don't need to have a garbage collector list per namespace anymore, so use a global one instead. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08rxrpc: Rewrite the data and ack handling codeDavid Howells
Rewrite the data and ack handling code such that: (1) Parsing of received ACK and ABORT packets and the distribution and the filing of DATA packets happens entirely within the data_ready context called from the UDP socket. This allows us to process and discard ACK and ABORT packets much more quickly (they're no longer stashed on a queue for a background thread to process). (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead keep track of the offset and length of the content of each packet in the sk_buff metadata. This means we don't do any allocation in the receive path. (3) Jumbo DATA packet parsing is now done in data_ready context. Rather than cloning the packet once for each subpacket and pulling/trimming it, we file the packet multiple times with an annotation for each indicating which subpacket is there. From that we can directly calculate the offset and length. (4) A call's receive queue can be accessed without taking locks (memory barriers do have to be used, though). (5) Incoming calls are set up from preallocated resources and immediately made live. They can than have packets queued upon them and ACKs generated. If insufficient resources exist, DATA packet #1 is given a BUSY reply and other DATA packets are discarded). (6) sk_buffs no longer take a ref on their parent call. To make this work, the following changes are made: (1) Each call's receive buffer is now a circular buffer of sk_buff pointers (rxtx_buffer) rather than a number of sk_buff_heads spread between the call and the socket. This permits each sk_buff to be in the buffer multiple times. The receive buffer is reused for the transmit buffer. (2) A circular buffer of annotations (rxtx_annotations) is kept parallel to the data buffer. Transmission phase annotations indicate whether a buffered packet has been ACK'd or not and whether it needs retransmission. Receive phase annotations indicate whether a slot holds a whole packet or a jumbo subpacket and, if the latter, which subpacket. They also note whether the packet has been decrypted in place. (3) DATA packet window tracking is much simplified. Each phase has just two numbers representing the window (rx_hard_ack/rx_top and tx_hard_ack/tx_top). The hard_ack number is the sequence number before base of the window, representing the last packet the other side says it has consumed. hard_ack starts from 0 and the first packet is sequence number 1. The top number is the sequence number of the highest-numbered packet residing in the buffer. Packets between hard_ack+1 and top are soft-ACK'd to indicate they've been received, but not yet consumed. Four macros, before(), before_eq(), after() and after_eq() are added to compare sequence numbers within the window. This allows for the top of the window to wrap when the hard-ack sequence number gets close to the limit. Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also to indicate when rx_top and tx_top point at the packets with the LAST_PACKET bit set, indicating the end of the phase. (4) Calls are queued on the socket 'receive queue' rather than packets. This means that we don't need have to invent dummy packets to queue to indicate abnormal/terminal states and we don't have to keep metadata packets (such as ABORTs) around (5) The offset and length of a (sub)packet's content are now passed to the verify_packet security op. This is currently expected to decrypt the packet in place and validate it. However, there's now nowhere to store the revised offset and length of the actual data within the decrypted blob (there may be a header and padding to skip) because an sk_buff may represent multiple packets, so a locate_data security op is added to retrieve these details from the sk_buff content when needed. (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is individually secured and needs to be individually decrypted. The code to do this is broken out into rxrpc_recvmsg_data() and shared with the kernel API. It now iterates over the call's receive buffer rather than walking the socket receive queue. Additional changes: (1) The timers are condensed to a single timer that is set for the soonest of three timeouts (delayed ACK generation, DATA retransmission and call lifespan). (2) Transmission of ACK and ABORT packets is effected immediately from process-context socket ops/kernel API calls that cause them instead of them being punted off to a background work item. The data_ready handler still has to defer to the background, though. (3) A shutdown op is added to the AF_RXRPC socket so that the AFS filesystem can shut down the socket and flush its own work items before closing the socket to deal with any in-progress service calls. Future additional changes that will need to be considered: (1) Make sure that a call doesn't hog the front of the queue by receiving data from the network as fast as userspace is consuming it to the exclusion of other calls. (2) Transmit delayed ACKs from within recvmsg() when we've consumed sufficiently more packets to avoid the background work item needing to run. Signed-off-by: David Howells <dhowells@redhat.com>