Age  Commit message  Author
2016-02-11  net: macb: add wake-on-lan support via magic packet  Sergio Prado
Tested on Acqua A5 SoM (http://www.acmesystems.it/acqua). Signed-off-by: Sergio Prado <sergio.prado@e-labworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: hamradio: baycom_ser_fdx: Replace timeval with timespec64  Amitoj Kaur Chawla
32 bit systems using 'struct timeval' will break in the year 2038, so we replace the code appropriately. However, this driver is not broken in 2038 since we are only using microseconds portion of the time. This patch replaces 'struct timeval' with 'struct timespec64'. We only need to find elapsed microseconds rather than absolute time, so it's better to use monotonic time, so using ktime_get_ts64() makes the code more efficient and more robust against concurrent settimeofday() calls. Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Thomas Sailer <t.sailer@alumni.ethz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
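The elapsed-time idea in the message above can be sketched in user space. This is only an analogue, assuming POSIX `clock_gettime()` with `CLOCK_MONOTONIC` in place of the kernel's `ktime_get_ts64()`; the helper names `monotonic_now` and `elapsed_us` are hypothetical and do not come from the driver.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() under strict -std modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the monotonic clock; unlike wall-clock time it is not affected
 * by settimeofday(). Returns 0 on success, -1 on failure. */
static int monotonic_now(struct timespec *ts)
{
	assert(ts != NULL);
	return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Elapsed microseconds from *start to *end (hypothetical helper).
 * Working in 64-bit nanoseconds keeps the borrow between the seconds
 * and nanoseconds fields correct. */
static int64_t elapsed_us(const struct timespec *start,
			  const struct timespec *end)
{
	int64_t ns;

	assert(start != NULL && end != NULL);
	ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
	     + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);
	return ns / 1000;
}
```

Because only a difference of two readings is used, the absolute value of the clock never matters, which is why a monotonic source fits here.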
2016-02-11  openvswitch: allow management from inside user namespaces  Tycho Andersen
Operations with the GENL_ADMIN_PERM flag fail permissions checks because this flag means we call netlink_capable, which uses the init user ns. Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations which should be allowed inside a user namespace. The motivation for this is to be able to run openvswitch in unprivileged containers. I've tested this and it seems to work, but I really have no idea about the security consequences of this patch, so thoughts would be much appreciated. v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one massive one Reported-by: James Page <james.page@canonical.com> Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> CC: Eric Biederman <ebiederm@xmission.com> CC: Pravin Shelar <pshelar@ovn.org> CC: Justin Pettit <jpettit@nicira.com> CC: "David S. Miller" <davem@davemloft.net> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  ethtool: future-proof interface for speed extensions  Michael S. Tsirkin
Many virtual and not quite virtual devices allow any speed to be set through ethtool. In particular, this applies to the virtio-net devices. Document this fact to make sure people don't assume the enum lists all possible values. Reserve values greater than INT_MAX for future extension and to avoid conflict with SPEED_UNKNOWN. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  vrf: duplicate include of rtnetlink.h  stephen hemminger
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  vxlan: udp_tunnel duplicate include net/udp_tunnel.h  stephen hemminger
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  rds: duplicate include net/tcp.h  stephen hemminger
Duplicate include detected. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  bonding: Return correct error code  Amitoj Kaur Chawla
When the kzalloc allocation fails, the function should return -ENOMEM and not -1. Found using Coccinelle. A simplified version of the semantic patch used is:

//<smpl>
@@
expression *e;
@@

e = kzalloc(...);
if (e == NULL) {
...
return
- -1
+ -ENOMEM
;
}
//</smpl>

The single call site only checks that the return value is not 0, hence no change is required at the call site. Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
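The convention fixed above can be illustrated in user space, assuming `calloc()` as a stand-in for `kzalloc()`. The `slave_table` type and `slave_table_init` function are hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical table that needs zeroed storage. */
struct slave_table {
	size_t count;
	int *entries;
};

/* Returns 0 on success or a negative errno value on failure.
 * Returning -ENOMEM instead of a bare -1 tells the caller why
 * initialization failed. */
static int slave_table_init(struct slave_table *t, size_t count)
{
	assert(t != NULL && count > 0);
	t->entries = calloc(count, sizeof(*t->entries));
	if (!t->entries)
		return -ENOMEM;
	t->count = count;
	return 0;
}
```

A caller that only tests for a non-zero result keeps working unchanged, as the message notes for the single call site.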
2016-02-11  Merge branch 'gso-checksums'  David S. Miller
Alexander Duyck says: ==================== Add GSO support for outer checksum w/ inner checksum offloads This patch series updates the existing segmentation offload code for tunnels to make better use of existing and updated GSO checksum computation. This is done primarily through two mechanisms. First we maintain a separate checksum in the GSO context block of the sk_buff. This allows us to maintain two checksum values, one offloaded with values stored in csum_start and csum_offset, and one computed and tracked in SKB_GSO_CB(skb)->csum. By maintaining these two values we are able to take advantage of the same sort of math used in local checksum offload so that we can provide both inner and outer checksums with minimal overhead. Below is the performance for a netperf session between an ixgbe PF and VF on the same host but in different namespaces. As can be seen a significant gain in performance can be had from allowing the use of Tx checksum offload on the inner headers while performing a software offload on the outer header computation:

Recv   Send   Send                          Utilization      Service Demand
Socket Socket Message  Elapsed              Send    Recv     Send    Recv
Size   Size   Size     Time     Throughput  local   remote   local   remote
bytes  bytes  bytes    secs.    10^6bits/s  % S     % U      us/KB   us/KB

Before:
87380  16384  16384    10.00    12844.38    9.30    -1.00    0.712   -1.00
After:
87380  16384  16384    10.00    13216.63    6.78    -1.00    0.504   -1.000

Changes from v1:
* Dropped use of CHECKSUM_UNNECESSARY for remote checksum offload
* Left encap_hdr_csum as it will likely be needed in future for SCTP GSO
* Broke the changes out over many more patches
* Updated GRE segmentation to more closely match UDP tunnel segmentation
==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: Allow tunnels to use inner checksum offloads with outer checksums needed  Alexander Duyck
This patch enables us to use inner checksum offloads if provided by hardware with outer checksums computed by software. It basically reduces encap_hdr_csum to an advisory flag for now, but based on the fact that SCTP may be getting segmentation support before long I thought we may want to keep it as it is possible we may need to support CRC32c and 1's complement checksum in the same packet at some point in the future. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  udp: Use uh->len instead of skb->len to compute checksum in segmentation  Alexander Duyck
The segmentation code was having to do a bunch of work to pull the skb->len and strip the udp header offset before the value could be used to adjust the checksum. Instead of doing all this work we can just use the value that goes into uh->len since that is the correct value with the correct byte order that we need anyway. By using this value we can save ourselves a bunch of pain as there is no need to do multiple byte swaps. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  udp: Clean up the use of flags in UDP segmentation offload  Alexander Duyck
This patch goes through and cleans up the logic related to several of the control flags used in UDP segmentation. Specifically the use of dont_encap isn't really needed as we can just check the skb for CHECKSUM_PARTIAL and if it isn't set then we don't need to update the internal headers. As such we can just drop that value. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  gre: Use inner_proto to obtain inner header protocol  Alexander Duyck
Instead of parsing headers to determine the inner protocol we can just pull the value from inner_proto. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  gre: Use GSO flags to determine csum need instead of GRE flags  Alexander Duyck
This patch updates the gre checksum path to follow something much closer to the UDP checksum path. By doing this we can avoid needing to do as much header inspection and can just make use of the fields we were already reading in the sk_buff structure. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: Move skb_has_shared_frag check out of GRE code and into segmentation  Alexander Duyck
The call skb_has_shared_frag is used in the GRE path and skb_checksum_help to verify that no frags can be modified by an external entity. This check really doesn't belong in the GRE path but in the skb_segment function itself. This way any protocol that might be segmented will be performing this check before attempting to offload a checksum to software. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: Store checksum result for offloaded GSO checksums  Alexander Duyck
This patch makes it so that we can offload the checksums for a packet up to a certain point and then begin computing the checksums via software. Setting this up is fairly straightforward as all we need to do is reset the values stored in csum and csum_start for the GSO context block. One complication for this is remote checksum offload. In order to allow the inner checksums to be offloaded while computing the outer checksum manually we needed to have some way of indicating that the offload wasn't real. In order to do that I replaced CHECKSUM_PARTIAL with CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer header while skipping computing checksums for the inner headers. We clean up the ip_summed flag and set it to either CHECKSUM_PARTIAL or CHECKSUM_NONE once we hand the packet off to the next lower level. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: Update remote checksum segmentation to support use of GSO checksum  Alexander Duyck
This patch addresses two main issues. First in the case of remote checksum offload we were avoiding dealing with scatter-gather issues. As a result it would be possible to assemble a series of frames that used frags instead of being linearized as they should have if remote checksum offload was enabled. Second I have updated the code so that we now let GSO take care of doing the checksum on the data itself and drop the special case that was added for remote checksum offload. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: Move GSO csum into SKB_GSO_CB  Alexander Duyck
This patch moves the checksum maintained by GSO out of skb->csum and into the GSO context block in order to allow for us to work on outer checksums while maintaining the inner checksum offsets in the case of the inner checksum being offloaded, while the outer checksums will be computed. While updating the code I also did a minor cleanup on gso_make_checksum. The change is mostly to make it so that we store the values and compute the checksum instead of computing the checksum and then storing the values we needed to update. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
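Keeping a running checksum on the side works because the Internet checksum is a one's complement sum, so a partial sum can be stored and extended later. The following is a plain user-space sketch of that arithmetic, not the kernel's GSO code; `csum_add_bytes` and `csum_finish` are hypothetical names, and the sketch assumes packet-sized buffers split at even offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the 16-bit big-endian words of data[0..len) to a running sum.
 * The 32-bit accumulator is assumed not to overflow, which holds for
 * packet-sized inputs. An odd trailing byte is padded with zero. */
static uint32_t csum_add_bytes(const uint8_t *data, size_t len, uint32_t sum)
{
	size_t i;

	assert(data != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	return sum;
}

/* Fold the carries back into 16 bits and take the one's complement. */
static uint16_t csum_finish(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Summing a buffer in one pass or in two passes with the intermediate sum carried over gives the same result, which is the property that lets a stored partial checksum be reused for the outer header.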
2016-02-11  net: Drop unecessary enc_features variable from tunnel segmentation functions  Alexander Duyck
The enc_features variable isn't necessary since features isn't used anywhere after we create enc_features so instead just use a destructive AND on features itself and save ourselves the variable declaration. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  hv_netvsc: cleanup netdev feature flags for netvsc  sixiao@microsoft.com
1. Adding NETIF_F_TSO6 feature flag; 2. Adding NETIF_F_HW_CSUM. NETIF_F_IPV6_CSUM and NETIF_F_IP_CSUM are being deprecated; 3. Cleanup the coding style of flag assignment by using macro. Signed-off-by: Simon Xiao <sixiao@microsoft.com> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  Merge branch 'ethtool-nfc-ipv6'  David S. Miller
Edward Cree says: ==================== IPv6 NFC This series adds support for steering IPv6 flows using the ethtool NFC interface, and implements it for sfc devices. Tested using an in-development patch to the ethtool utility. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  sfc: implement IPv6 NFC (and IPV4_USER_FLOW)  Edward Cree
Signed-off-by: Edward Cree <ecree@solarflare.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  ethtool: add IPv6 to the NFC API  Edward Cree
Signed-off-by: Edward Cree <ecree@solarflare.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  Merge branch 'cxgb4-tos'  David S. Miller
Hariprasad Shenai says: ==================== Add TOS support and some cleanup This series adds TOS support for iWARP and also does some cleanup to make code more readable. Patch series is created against infiniband tree and includes patches on iw_cxgb4 and cxgb4 driver. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  cxgb4/iw_cxgb4: TOS support  Hariprasad Shenai
This series provides support for iWARP applications to specify a TOS value and have that map to a VLAN Priority for iw_cxgb4 iWARP connections. In iw_cxgb4, when allocating an L2T entry, pass the skb_priority based on the tos value in the cm_id. Also pass the correct tos value during connection setup so the passive side gets the client's desired tos. When sending the FLOWC work request to FW, if the egress device is in a vlan, then use the vlan priority bits as the scheduling class. This allows associating RDMA connections with scheduling classes to provide traffic shaping per flow. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  iw_cxgb4: remove false error log entry  Hariprasad Shenai
Don't log errors if a listening endpoint is going away when processing a PASS_ACCEPT_REQ message. This can happen. Change the error printk to a PDBG() debug log entry. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  iw_cxgb4: make queue allocation code more readable  Hariprasad Shenai
Rename local mm* variables to more meaningful names Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  Merge branch 'fec-next'  David S. Miller
Troy Kisky says: ==================== net: fec: cleanup/fixes V2 is a rebase on top of Johannes' endian-safe patch and is only the 1st eight patches. The testing for this series was done on a nitrogen6x. The base commit was commit b45efa30a626e915192a6c548cd8642379cd47cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Testing showed no change in performance. Testing used imx_v6_v7_defconfig + CONFIG_MICREL_PHY. The processor was running at 996 MHz. The following commands were used to get the transfer rates.

On an x86 Ubuntu system:
iperf -s -i.5 -u

On a nitrogen6x board, running via SD card, I first stopped some background processes:
stop cron
stop upstart-file-bridge
stop upstart-socket-bridge
stop upstart-udev-bridge
stop rsyslog
stop dbus
killall dhclient
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
taskset 0x2 iperf -c 192.168.0.201 -u -t60 -b500M -r

There is a branch available on github with this series, and the rest of my fec patches, for those who would like to test it. https://github.com:boundarydevices/linux-imx6.git branch net-next_master ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: fec: improve error handling  Troy Kisky
Unmap initial buffer on error. Don't free skb until it has been unmapped. Move cbd_bufaddr assignment closer to the mapping function. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: fec: don't transfer ownership until descriptor write is complete  Troy Kisky
If you don't own it, you shouldn't write to it. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: fec: don't disable FEC_ENET_TS_TIMER interrupt  Troy Kisky
Only the interrupt routine processes this condition. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: fec: add variable reg_desc_active to speed things up  Troy Kisky
There is no need for complex macros every time we need to activate a queue. Also, no need to call skb_get_queue_mapping when we already know which queue it is using. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: fec: add struct bufdesc_prop  Troy Kisky
This reduces code and gains speed. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: fec: fix fec_enet_get_free_txdesc_num  Troy Kisky
When first initialized, cur_tx points to the 1st entry in the queue, and dirty_tx points to the last. At this point, fec_enet_get_free_txdesc_num will return tx_ring_size -2. If tx_ring_size -2 entries are now queued, then fec_enet_get_free_txdesc_num should return 0, but it returns tx_ring_size instead. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
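The boundary case described above can be reproduced with plain ring indices. This is a sketch under the assumption that the free count is computed as `dirty_tx - cur_tx - 1` with wraparound; the driver works on descriptor pointers, and `free_txdesc_num` is a hypothetical name.

```c
#include <assert.h>

/* Number of free transmit descriptors in a ring of ring_size entries.
 * cur_tx is the next slot to fill, dirty_tx the last slot reclaimed.
 * The comparison must be ">= 0": with "> 0" an exactly full ring
 * (entries == 0) would wrap and be reported as ring_size free slots. */
static int free_txdesc_num(int cur_tx, int dirty_tx, int ring_size)
{
	int entries;

	assert(ring_size > 0);
	assert(cur_tx >= 0 && cur_tx < ring_size);
	assert(dirty_tx >= 0 && dirty_tx < ring_size);

	entries = dirty_tx - cur_tx - 1;
	return entries >= 0 ? entries : entries + ring_size;
}
```

With a ring of 8, the initial state (cur_tx at 0, dirty_tx at 7) yields 6 free slots, and after queuing 6 entries the function yields 0 rather than 8.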
2016-02-11  net: fec: fix rx error counts  Troy Kisky
On an overrun, the other flags are not valid, so don't check them. Also, don't pass bad frames up the stack. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: fec: stop the "rcv is not +last, " error messages  Troy Kisky
Setting the FTRL register will stop the fec from trying to use multiple receive buffers. Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  bonding: 3ad: allow to set ad_actor settings while the bond is up  Nikolay Aleksandrov
No need to require the bond down while changing these settings, the change will be reflected immediately and the 3ad mode will sort itself out. For faster convergence set port->ntt to true in order to generate new LACPDUs immediately. CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  ipv6: add option to drop unsolicited neighbor advertisements  Johannes Berg
In certain 802.11 wireless deployments, there will be NA proxies that use knowledge of the network to correctly answer requests. To prevent unsolicited advertisements on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_unsolicited_na". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  ipv6: add option to drop unicast encapsulated in L2 multicast  Johannes Berg
In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv6 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  ipv4: add option to drop gratuitous ARP packets  Johannes Berg
In certain 802.11 wireless deployments, there will be ARP proxies that use knowledge of the network to correctly answer requests. To prevent gratuitous ARP frames on the shared medium from being a problem, on such deployments wireless needs to drop them. Enable this by providing an option called "drop_gratuitous_arp". Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  ipv4: add option to drop unicast encapsulated in L2 multicast  Johannes Berg
In order to solve a problem with 802.11, the so-called hole-196 attack, add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if enabled, causes the stack to drop IPv4 unicast packets encapsulated in link-layer multi- or broadcast frames. Such frames can (as an attack) be created by any member of the same wireless network and transmitted as valid encrypted frames since the symmetric key for broadcast frames is shared between all stations. Additionally, enabling this option provides compliance with a SHOULD clause of RFC 1122. Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  MAINTAINERS: Update tg3 maintainer  Siva Reddy Kallam
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Acked-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  bpf_dbg: do not initialise statics to 0  Wei Tang
This patch fixes the checkpatch.pl error to bpf_dbg.c: ERROR: do not initialise statics to 0 Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  net: Add support for filtering link dump by master device and kind  David Ahern
Add support for filtering link dumps by master device and kind, similar to the filtering implemented for neighbor dumps. Each net_device that exists adds between 1196 bytes (eth) and 1556 bytes (bridge) to the link dump. As the number of interfaces increases so does the amount of data pushed to user space for a link list. If the user only wants to see a list of specific devices (e.g., interfaces enslaved to a specific bridge or a list of VRFs) most of that data is thrown away. Passing the filters to the kernel to have only relevant data returned makes the dump more efficient. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  Merge branch 'tcp-fast-so_reuseport'  David S. Miller
Craig Gallek says: ==================== Faster SO_REUSEPORT for TCP This patch series complements an earlier series (6a5ef90c58da) which added faster SO_REUSEPORT lookup for UDP sockets by extending the feature to TCP sockets. It uses the same array-based data structure which allows for socket selection after finding the first listening socket that matches an incoming packet. Prior to this feature, every socket in the reuseport group needed to be found and examined before a selection could be made. With this series the SO_ATTACH_REUSEPORT_CBPF and SO_ATTACH_REUSEPORT_EBPF socket options now work for TCP sockets as well. The test at the end of the series includes an example of how to use these options to select a reuseport socket based on the cpu core id handling the incoming packet. There are several refactoring patches that precede the feature implementation. Only the last two patches in this series should result in any behavioral changes.

v4:
- Fix build issue when compiling IPv6 as a module. This required moving the ipv6_rcv_saddr_equal into an object that is included as a built-in object. I included this change in the second patch which adds inet6_hash since that is where ipv6_rcv_saddr_equal will later be called from non-module code.
v3:
- Another warning in the first patch caught by a build bot. Return 0 in the no-op UDP hash function.
v2:
- In the first patch I missed a couple of hash functions that should now be returning int instead of void. I missed these the first time through as it only generated a warning and not an error :\
==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  soreuseport: BPF selection functional test for TCP  Craig Gallek
Unfortunately the existing test relied on packet payload in order to map incoming packets to sockets. In order to get this to work with TCP, TCP_FASTOPEN needed to be used. Since the fast open path is slightly different than the standard TCP path, I created a second test which sends to reuseport group members based on receiving cpu core id. This will probably serve as a better real-world example use as well. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  soreuseport: fast reuseport TCP socket selection  Craig Gallek
This change extends the fast SO_REUSEPORT socket lookup implemented for UDP to TCP. Listener sockets with SO_REUSEPORT and the same receive address are additionally added to an array for faster random access. This means that only a single socket from the group must be found in the listener list before any socket in the group can be used to receive a packet. Previously, every socket in the group needed to be considered before handing off the incoming packet. This feature also exposes the ability to use a BPF program when selecting a socket from a reuseport group. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
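Why an array helps can be shown with a small sketch: once any one member of the group is found, a flow hash can index directly into the array instead of walking every listener. The `reuseport_group` struct, its fixed capacity, and the function names are hypothetical; the multiply-and-shift scaling is similar in spirit to the kernel's `reciprocal_scale()` helper, but this is not the kernel's selection code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUP_MAX 8 /* hypothetical capacity for the sketch */

/* Hypothetical group: file descriptors stand in for sockets. */
struct reuseport_group {
	uint32_t num_socks;
	int socks[GROUP_MAX];
};

/* Map a 32-bit hash onto [0, n) with a multiply and shift, which
 * avoids a division and spreads hashes evenly over the range. */
static uint32_t scale_to_range(uint32_t hash, uint32_t n)
{
	return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/* Pick one member in constant time from the packet's flow hash. */
static int group_select(const struct reuseport_group *grp, uint32_t hash)
{
	assert(grp != NULL);
	assert(grp->num_socks > 0 && grp->num_socks <= GROUP_MAX);
	return grp->socks[scale_to_range(hash, grp->num_socks)];
}
```

A BPF program, as mentioned in the message, would replace the hash-based index with a value it computes from the packet.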
2016-02-11  soreuseport: Prep for fast reuseport TCP socket selection  Craig Gallek
Both of the lines in this patch probably should have been included in the initial implementation of this code for generic socket support, but weren't technically necessary since only UDP sockets were supported. First, the sk_reuseport_cb points to a structure which assumes each socket in the group has this pointer assigned at the same time it's added to the array in the structure. The sk_clone_lock function breaks this assumption. Since a child socket shouldn't implicitly be in a reuseport group, the simple fix is to clear the field in the clone. Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that SO_REUSEPORT also be set first. For UDP sockets, this is easily enforced at bind-time since that process both puts the socket in the appropriate receive hlist and updates the reuseport structures. Since these operations can happen at two different times for TCP sockets (bind and listen) it must be explicitly checked to enforce the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the setsockopt call. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  inet: refactor inet[6]_lookup functions to take skb  Craig Gallek
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT groups. Doing so with a BPF filter will require access to the skb in question. This change plumbs the skb (and offset to payload data) through the call stack to the listening socket lookup implementations where it will be used in a following patch. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11  tcp: __tcp_hdrlen() helper  Craig Gallek
tcp_hdrlen is wasteful if you already have a pointer to struct tcphdr. This splits the size calculation into a helper function that can be used if a struct tcphdr is already available. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
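The split described above can be sketched on raw bytes: one helper computes the length from a pointer to the TCP header, and a wrapper locates the header first and then calls it. Both function names are hypothetical, and the sketch reads the data offset field (upper four bits of byte 12, counted in 32-bit words) directly instead of using `struct tcphdr`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header length in bytes, given a pointer to the start of the TCP
 * header. The data offset field counts 32-bit words. */
static unsigned int hdrlen_from_tcp_header(const uint8_t *th)
{
	assert(th != NULL);
	return (unsigned int)(th[12] >> 4) * 4u;
}

/* Wrapper for callers that only have the packet and the offset of the
 * TCP header within it; it does the lookup, then reuses the helper. */
static unsigned int hdrlen_from_packet(const uint8_t *pkt, size_t tcp_offset)
{
	assert(pkt != NULL);
	return hdrlen_from_tcp_header(pkt + tcp_offset);
}
```

A caller that already holds the header pointer skips the lookup entirely, which is the saving the message refers to.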