summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2016-02-11ipv4: add option to drop gratuitous ARP packetsJohannes Berg
In certain 802.11 wireless deployments, there will be ARP proxies that use knowledge of the network to correctly answer requests. To prevent gratuitous ARP frames on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_gratuitous_arp". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11ipv4: add option to drop unicast encapsulated in L2 multicastJohannes Berg
In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv4 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Additionally, enabling this option provides compliance with a SHOULD clause of RFC 1122. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11soreuseport: fast reuseport TCP socket selectionCraig Gallek
This change extends the fast SO_REUSEPORT socket lookup implemented for UDP to TCP. Listener sockets with SO_REUSEPORT and the same receive address are additionally added to an array for faster random access. This means that only a single socket from the group must be found in the listener list before any socket in the group can be used to receive a packet. Previously, every socket in the group needed to be considered before handing off the incoming packet. This feature also exposes the ability to use a BPF program when selecting a socket from a reuseport group. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11inet: refactor inet[6]_lookup functions to take skbCraig Gallek
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT groups. Doing so with a BPF filter will require access to the skb in question. This change plumbs the skb (and offset to payload data) through the call stack to the listening socket lookup implementations where it will be used in a following patch. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11tcp: __tcp_hdrlen() helperCraig Gallek
tcp_hdrlen is wasteful if you already have a pointer to struct tcphdr. This splits the size calculation into a helper function that can be used if a struct tcphdr is already available. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11inet: create IPv6-equivalent inet_hash functionCraig Gallek
In order to support fast lookups for TCP sockets with SO_REUSEPORT, the function that adds sockets to the listening hash set needs to be able to check receive address equality. Since this equality check is different for IPv4 and IPv6, we will need two different socket hashing functions. This patch adds inet6_hash identical to the existing inet_hash function and updates the appropriate references. A following patch will differentiate the two by passing different comparison functions to __inet_hash. Additionally, in order to use the IPv6 address equality function from inet6_hashtables (which is compiled as a built-in object when IPv6 is enabled) it also needs to be in a built-in object file as well. This moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11sock: struct proto hash function may errorCraig Gallek
In order to support fast reuseport lookups in TCP, the hash function defined in struct proto must be capable of returning an error code. This patch changes the function signature of all related hash functions to return an integer and handles or propagates this return value at all call sites. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-10target/transport: add flag to indicate CPU Affinity is observedQuinn Tran
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Fixes: fb3269b ("qla2xxx: Add selective command queuing") Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2016-02-10Merge branch 'for-4.5-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata fixes from Tejun Heo: - PORTS_IMPL workaround for very early ahci controllers is misbehaving on new systems. Disabled on recent ahci versions. - Old-style PIO state machine had a horrible locking problem. Don't know how we've been getting away this far. Fixed. - Other device specific updates. * 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: ahci: Intel DNV device IDs SATA libata: fix sff host state machine locking while polling libata-sff: use WARN instead of BUG on illegal host state machine state libata: disable forced PORTS_IMPL for >= AHCI 1.3 libata: blacklist a Viking flash model for MWDMA corruption drivers: ata: wake port before DMA stop for ALPM
2016-02-10Merge branch 'for-4.5-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - The destruction path of cgroup objects are asynchronous and multi-staged and some of them ended up destroying parents before children leading to failures in cpu and memory controllers. Ensure that parents are always destroyed after children. - cpuset mm node migration was performed synchronously while holding threadgroup and cgroup mutexes and the recent threadgroup locking update resulted in a possible deadlock. The migration is best effort and shouldn't have been performed under those locks to begin with. Made asynchronous. - Minor documentation fix. * 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Documentation: cgroup: Fix 'cgroup-legacy' -> 'cgroup-v1' cgroup: make sure a parent css isn't freed before its children cgroup: make sure a parent css isn't offlined before its children cpuset: make mm migration asynchronous
2016-02-10Merge branch 'for-4.5-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Workqueue fixes for v4.5-rc3. - Remove a spurious triggering of flush dependency warning. - Officially break local execution guarantee of unbound work items and add a debug feature to flush out usages which depend on it. - Work around CPU -> NODE mapping becoming invalid on CPU offline. The branch is young but pushing out early as stable kernels are being affected" * 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup workqueue: implement "workqueue.debug_force_rr_cpu" debug feature workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs Revert "workqueue: make sure delayed work run in local cpu" workqueue: skip flush dependency checks for legacy workqueues
2016-02-10efi: Make efivarfs entries immutable by defaultPeter Jones
"rm -rf" is bricking some peoples' laptops because of variables being used to store non-reinitializable firmware driver data that's required to POST the hardware. These are 100% bugs, and they need to be fixed, but in the mean time it shouldn't be easy to *accidentally* brick machines. We have to have delete working, and picking which variables do and don't work for deletion is quite intractable, so instead make everything immutable by default (except for a whitelist), and make tools that aren't quite so broad-spectrum unset the immutable flag. Signed-off-by: Peter Jones <pjones@redhat.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-02-10efi: Make our variable validation list include the guidPeter Jones
All the variables in this list so far are defined to be in the global namespace in the UEFI spec, so this just further ensures we're validating the variables we think we are. Including the guid for entries will become more important in future patches when we decide whether or not to allow deletion of variables based on presence in this list. Signed-off-by: Peter Jones <pjones@redhat.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-02-10lib/ucs2_string: Add ucs2 -> utf8 helper functionsPeter Jones
This adds ucs2_utf8size(), which tells us how big our ucs2 string is in bytes, and ucs2_as_utf8, which translates from ucs2 to utf8.. Signed-off-by: Peter Jones <pjones@redhat.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
2016-02-10vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devicesDavid Wragg
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could transmit vxlan packets of any size, constrained only by the ability to send out the resulting packets. 4.3 introduced netdevs corresponding to tunnel vports. These netdevs have an MTU, which limits the size of a packet that can be successfully encapsulated. The default MTU values are low (1500 or less), which is awkwardly small in the context of physical networks supporting jumbo frames, and leads to a conspicuous change in behaviour for userspace. Instead, set the MTU on openvswitch-created netdevs to be the relevant maximum (i.e. the maximum IP packet size minus any relevant overhead), effectively restoring the behaviour prior to 4.3. Signed-off-by: David Wragg <david@weave.works> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09Merge tag 'fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module fixes from Rusty Russell: "Fix for async_probe module param added in 4.3 (clearly not widely used yet), and a much more interesting kallsyms race which has been around approximately forever. This fix is more invasive, and will require some care in backporting, but I hated all the bandaids I could think of, so... There are some more coming, which are only for breakages introduced this cycle (livepatch), but wanted these in now" * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: modules: fix longstanding /proc/kallsyms vs module insertion race. module: wrapper for symbol name. modules: fix modparam async_probe request
2016-02-09bonding: 3ad: apply ad_actor settings changes immediatelyNikolay Aleksandrov
Currently the bonding allows to set ad_actor_system and prio while the bond device is down, but these are actually applied only if there aren't any slaves yet (applied to bond device when first slave shows up, and to slaves at 3ad bind time). After this patch changes are applied immediately and the new values can be used/seen after the bond's upped so it's not necessary anymore to release all and enslave again to see the changes. CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09bridge: mdb: add support for offloaded mdb entriesElad Raz
Add new bitmask member 'flags' to br_mdb_entry structure. Adding MDB_FLAGS_OFFLOAD bit which indicates MDB entries is offloaded to hardware. Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09net:Add sysctl_max_skb_fragsHans Westgaard Ry
Devices may have limits on the number of fragments in an skb they support. Current codebase uses a constant as maximum for number of fragments one skb can hold and use. When enabling scatter/gather and running traffic with many small messages the codebase uses the maximum number of fragments and may thereby violate the max for certain devices. The patch introduces a global variable as max number of fragments. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09tcp: do not drop syn_recv on all icmp reportsEric Dumazet
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets were leading to RST. This is of course incorrect. A specific list of ICMP messages should be able to drop a SYN_RECV. For instance, a REDIRECT on SYN_RECV shall be ignored, as we do not hold a dst per SYN_RECV pseudo request. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751 Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Reported-by: Petr Novopashenniy <pety@rusnet.ru> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08unix: correctly track in-flight fds in sending process user_structHannes Frederic Sowa
The commit referenced in the Fixes tag incorrectly accounted the number of in-flight fds over a unix domain socket to the original opener of the file-descriptor. This allows another process to arbitrary deplete the original file-openers resource limit for the maximum of open files. Instead the sending processes and its struct cred should be credited. To do so, we add a reference counted struct user_struct pointer to the scm_fp_list and use it to account for the number of inflight unix fds. Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets") Reported-by: David Herrmann <dh.herrmann@gmail.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_notsent_lowat sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_fin_timeout sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_orphan_retries sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_retries2 sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_retries1 sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp reordering sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp syncookies sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp synack retries sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp syn retries sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ethtool: add speed/duplex validation functionsNikolay Aleksandrov
Add functions which check if the speed/duplex are defined. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07sunvnet: Add support for perf LDC event tracingSowmini Varadhan
Add perf event macros for support of tracing and instrumentation of LDC state machine Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07tcp: new delivery accountingYuchung Cheng
This patch changes the accounting of how many packets are newly acked or sacked when the sender receives an ACK. The current approach basically computes newly_acked_sacked = (prior_packets - prior_sacked) - (tp->packets_out - tp->sacked_out) where prior_packets and prior_sacked out are snapshot at the beginning of the ACK processing. The new approach tracks the delivery information via a new TCP state variable "delivered" which monotically increases as new packets are delivered in order or out-of-order. The reason for this change is that the current approach is brittle that produces negative or inaccurate estimate. 1) For non-SACK connections, an ACK that advances the SND.UNA could reset the DUPACK counters (tp->sacked_out) in tcp_process_loss() or tcp_fastretrans_alert(). This inflates the inflight suddenly and causes under-estimate or even negative estimate. Here is a real example: before after (processing ACK) packets_out 75 73 sacked_out 23 0 ca state Loss Open The old approach computes (75-23) - (73 - 0) = -21 delivered while the new approach computes 1 delivered since it considers the 2nd-24th packets are delivered OOO. 2) MSS change would re-count packets_out and sacked_out so the estimate is in-accurate and can even become negative. E.g., the inflight is doubled when MSS is halved. 3) Spurious retransmission signaled by DSACK is not accounted The new approach is simpler and more robust. For SACK connections, tp->delivered increments as packets are being acked or sacked in SACK and ACK processing. For non-sack connections, it's done in tcp_remove_reno_sacks() and tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered is incremented by the number of packets ACKed (less the current number of DUPACKs received plus one packet hole). Upon receiving a DUPACK, tp->delivered is incremented assuming one out-of-order packet is delivered. Upon receiving a DSACK, tp->delivered is incremtened assuming one retransmission is delivered in tcp_sacktag_write_queue(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07net: Add support for fill_slave_info to VRF deviceDavid Ahern
Allows userspace to have direct access to VRF table association versus looking up master device and its table. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07vxlan: restructure vxlan.h definitionsJiri Benc
RCO and GBP are VXLAN extensions, not specified in RFC 7348. Because of that, they need to be explicitly enabled when creating vxlan interface. By default, those extensions are not used and plain VXLAN header is sent and received. Reflect this in vxlan.h: first, the plain VXLAN header is defined. Following it, RCO is documented and defined, and likewise for GBP. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07vxlan: remove duplicated macrosJiri Benc
VNI_HASH_BITS and VNI_HASH_SIZE are defined twice. Remove the extra definitions. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07vxlan: cleanup typesJiri Benc
include/net/vxlan.h is a kernel header, no need to prefix fixed size types with double underscore. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06pty: make sure super_block is still valid in final /dev/tty closeHerton R. Krzesinski
Considering current pty code and multiple devpts instances, it's possible to umount a devpts file system while a program still has /dev/tty opened pointing to a previosuly closed pty pair in that instance. In the case all ptmx and pts/N files are closed, umount can be done. If the program closes /dev/tty after umount is done, devpts_kill_index will use now an invalid super_block, which was already destroyed in the umount operation after running ->kill_sb. This is another "use after free" type of issue, but now related to the allocated super_block instance. To avoid the problem (warning at ida_remove and potential crashes) for this specific case, I added two functions in devpts which grabs additional references to the super_block, which pty code now uses so it makes sure the super block structure is still valid until pty shutdown is done. I also moved the additional inode references to the same functions, which also covered similar case with inode being freed before /dev/tty final close/shutdown. Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Cc: stable@vger.kernel.org # 2.6.29+ Reviewed-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-06target: Drop legacy se_cmd->task_stop_comp + REQUEST_STOP usageNicholas Bellinger
With CMD_T_FABRIC_STOP + se_cmd->cmd_wait_set usage in place, go ahead and drop left-over CMD_T_REQUEST_STOP checks in target_complete_cmd() and unused target_stop_cmd(). Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Quinn Tran <quinn.tran@qlogic.com> Cc: Himanshu Madhani <himanshu.madhani@qlogic.com> Cc: Sagi Grimberg <sagig@mellanox.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Andy Grover <agrover@redhat.com> Cc: Mike Christie <mchristi@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2016-02-06tcp: fastopen: call tcp_fin() if FIN present in SYNACKEric Dumazet
When we acknowledge a FIN, it is not enough to ack the sequence number and queue the skb into receive queue. We also have to call tcp_fin() to properly update socket state and send proper poll() notifications. It seems we also had the problem if we received a SYN packet with the FIN flag set, but it does not seem an urgent issue, as no known implementation can do that. Fixes: 61d2bcae99f6 ("tcp: fastopen: accept data/FIN present in SYNACK message") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06bcma: add support for BCM47094Rafał Miłecki
It's another SoC with 32 GPIOs and simplified watchdog handling. It was tested on D-Link DIR-885L. Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2016-02-06bcma: support PMU present as separated bus coreRafał Miłecki
On recent Broadcom chipsets PMU is present as separated core and it can't be accessed using ChipCommon anymore as it fails with e.g.: [ 0.000577] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf1000604 Solve it by using a new (PMU) core pointer set to ChipCommon or PMU depending on the hardware capabilities. Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2016-02-06bcma: use _PMU_ in all names of PMU registersRafał Miłecki
PMU (Power Management Unit) seems to be a separated piece of hardware, just accessed using ChipCommon core registers. In recent Broadcom chipsets PMU is not bounded to CC but available as separated core. To make code cleaner & easier to review (for a correct R/W access) use clearer names. Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2016-02-06bcma: identify bus cores (devices) found on BCM47189Rafał Miłecki
Add missing defines and print proper names. Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2016-02-06bpf: add lookup/update support for per-cpu hash and array mapsAlexei Starovoitov
The functions bpf_map_lookup_elem(map, key, value) and bpf_map_update_elem(map, key, value, flags) need to get/set values from all-cpus for per-cpu hash and array maps, so that user space can aggregate/update them as necessary. Example of single counter aggregation in user space: unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF); long values[nr_cpus]; long value = 0; bpf_lookup_elem(fd, key, values); for (i = 0; i < nr_cpus; i++) value += values[i]; The user space must provide round_up(value_size, 8) * nr_cpus array to get/set values, since kernel will use 'long' copy of per-cpu values to try to copy good counters atomically. It's a best-effort, since bpf programs and user space are racing to access the same memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY mapAlexei Starovoitov
Primary use case is a histogram array of latency where bpf program computes the latency of block requests or other events and stores histogram of latency into array of 64 elements. All cpus are constantly running, so normal increment is not accurate, bpf_xadd causes cache ping-pong and this per-cpu approach allows fastest collision-free counters. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06bpf: introduce BPF_MAP_TYPE_PERCPU_HASH mapAlexei Starovoitov
Introduce BPF_MAP_TYPE_PERCPU_HASH map type which is used to do accurate counters without need to use BPF_XADD instruction which turned out to be too costly for high-performance network monitoring. In the typical use case the 'key' is the flow tuple or other long living object that sees a lot of events per second. bpf_map_lookup_elem() returns per-cpu area. Example: struct { u32 packets; u32 bytes; } * ptr = bpf_map_lookup_elem(&map, &key); /* ptr points to this_cpu area of the value, so the following * increments will not collide with other cpus */ ptr->packets ++; ptr->bytes += skb->len; bpf_update_elem() atomically creates a new element where all per-cpu values are zero initialized and this_cpu value is populated with given 'value'. Note that non-per-cpu hash map always allocates new element and then deletes old after rcu grace period to maintain atomicity of update. Per-cpu hash map updates element values in-place. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06ethtool: Declare netdev_rss_key as __read_mostly.Kim Jones
netdev_rss_key is written to once and thereafter is read by drivers when they are initialising. The fact that it is mostly read and not written to makes it a candidate for a __read_mostly declaration. Signed-off-by: Kim Jones <kim-marie.jones@intel.com> Signed-off-by: Alan Carey <alan.carey@intel.com> Acked-by: Rami Rosen <rami.rosen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06tcp: fastopen: accept data/FIN present in SYNACK messageEric Dumazet
RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message MAY include data and/or FIN This patch adds support for the client side : If we receive a SYNACK with payload or FIN, queue the skb instead of ignoring it. Since we already support the same for SYN, we refactor the existing code and reuse it. Note we need to clone the skb, so this operation might fail under memory pressure. Sara Dickinson pointed out FreeBSD server Fast Open implementation was planned to generate such SYNACK in the future. The server side might be implemented on linux later. Reported-by: Sara Dickinson <sara@sinodun.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06team: track sum of rx_nohandler for all slavesJarod Wilson
CC: Jiri Pirko <jiri@resnulli.us> CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>