summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-06-08net: Add l3mdev ruleDavid Ahern
Currently, VRFs require 1 oif and 1 iif rule per address family per VRF. As the number of VRF devices increases it brings scalability issues with the increasing rule list. All of the VRF rules have the same format with the exception of the specific table id to direct the lookup. Since the table id is available from the oif or iif in the loopup, the VRF rules can be consolidated to a single rule that pulls the table from the VRF device. This patch introduces a new rule attribute l3mdev. The l3mdev rule means the table id used for the lookup is pulled from the L3 master device (e.g., VRF) rather than being statically defined. With the l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev rule per address family (IPv4 and IPv6). If an admin wishes to insert higher priority rules for specific VRFs those rules will co-exist with the l3mdev rule. This capability means current VRF scripts will co-exist with this new simpler implementation. Currently, the rules list for both ipv4 and ipv6 look like this: $ ip ru ls 1000: from all oif vrf1 lookup 1001 1000: from all iif vrf1 lookup 1001 1000: from all oif vrf2 lookup 1002 1000: from all iif vrf2 lookup 1002 1000: from all oif vrf3 lookup 1003 1000: from all iif vrf3 lookup 1003 1000: from all oif vrf4 lookup 1004 1000: from all iif vrf4 lookup 1004 1000: from all oif vrf5 lookup 1005 1000: from all iif vrf5 lookup 1005 1000: from all oif vrf6 lookup 1006 1000: from all iif vrf6 lookup 1006 1000: from all oif vrf7 lookup 1007 1000: from all iif vrf7 lookup 1007 1000: from all oif vrf8 lookup 1008 1000: from all iif vrf8 lookup 1008 ... 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default With the l3mdev rule the list is just the following regardless of the number of VRFs: $ ip ru ls 1000: from all lookup [l3mdev table] 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default (Note: the above pretty print of the rule is based on an iproute2 prototype. Actual verbage may change) Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08tipc: change node timer unit from jiffies to msJon Paul Maloy
The node keepalive interval is recalculated at each timer expiration to catch any changes in the link tolerance, and stored in a field in struct tipc_node. We use jiffies as unit for the stored value. This is suboptimal, because it makes the calculation unnecessary complex, including two unit conversions. The conversions also lead to a rounding error that causes the link "abort limit" to be 3 in the normal case, instead of 4, as intended. This again leads to unnecessary link resets when the network is pushed close to its limit, e.g., in an environment with hundreds of nodes or namesapces. In this commit, we do instead let the keepalive value be calculated and stored in milliseconds, so that there is only one conversion and the rounding error is eliminated. We also remove a redundant "keepalive" field in struct tipc_link. This is remnant from the previous implementation. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08tipc: correct error in node fsmJon Paul Maloy
commit 88e8ac7000dc ("tipc: reduce transmission rate of reset messages when link is down") revealed a flaw in the node FSM, as defined in the log of commit 66996b6c47ed ("tipc: extend node FSM"). We see the following scenario: 1: Node B receives a RESET message from node A before its link endpoint is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This event will not change the node FSM state, but the (distinct) link FSM will move to state RESETTING. 2: As an effect of the previous event, the local endpoint on B will declare node A lost, and post the event SELF_DOWN to the its node FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning that no messages will be accepted from A until it receives another RESET message that confirms that A's endpoint has been reset. This is wasteful, since we know this as a fact already from the first received RESET, but worse is that the link instance's FSM has not wasted this information, but instead moved on to state ESTABLISHING, meaning that it repeatedly sends out ACTIVATE messages to the reset peer A. 3: Node A will receive one of the ACTIVATE messages, move its link FSM to state ESTABLISHED, and start repeatedly sending out STATE messages to node B. 4: Node B will consistently drop these messages, since it can only accept accept a RESET according to its node FSM. 5: After four lost STATE messages node A will reset its link and start repeatedly sending out RESET messages to B. 6: Because of the reduced send rate for RESET messages, it is very likely that A will receive an ACTIVATE (which is sent out at a much higher frequency) before it gets the chance to send a RESET, and A may hence quickly move back to state ESTABLISHED and continue sending out STATE messages, which will again be dropped by B. 7: GOTO 5. 8: After having repeated the cycle 5-7 a number of times, node A will by chance get in between with sending a RESET, and the situation is resolved. Unfortunately, we have seen that it may take a substantial amount of time before this vicious loop is broken, sometimes in the order of minutes. We correct this by making a small correction to the node FSM: When a node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't need to wait for RESET confirmation from of an endpoint that we alread know has been reset. It also means that node B in the scenario above will not be dropping incoming STATE messages, and the link can come up immediately. Finally, a symmetry comparison reveals that the FSM has a similar error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING. Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect of this logical error, we choose fix this one, too. The node FSM looks as follows after those changes: +----------------------------------------+ | PEER_DOWN_EVT| | | +------------------------+----------------+ | |SELF_DOWN_EVT | | | | | | | | +-----------+ +-----------+ | | |NODE_ | |NODE_ | | | +----------|FAILINGOVER|<---------|SYNCHING |-----------+ | | |SELF_ +-----------+ FAILOVER_+-----------+ PEER_ | | | |DOWN_EVT | A BEGIN_EVT A | DOWN_EVT| | | | | | | | | | | | | | | | | | | | |FAILOVER_ |FAILOVER_ |SYNCH_ |SYNCH_ | | | | |END_EVT |BEGIN_EVT |BEGIN_EVT|END_EVT | | | | | | | | | | | | | | | | | | | | | +--------------+ | | | | | +-------->| SELF_UP_ |<-------+ | | | | +-----------------| PEER_UP |----------------+ | | | | |SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT| | | | | | A A | | | | | | | | | | | | | | PEER_UP_EVT| |SELF_UP_EVT | | | | | | | | | | | V V V | | V V V +------------+ +-----------+ +-----------+ +------------+ |SELF_DOWN_ | |SELF_UP_ | |PEER_UP_ | |PEER_DOWN | |PEER_LEAVING| |PEER_COMING| |SELF_COMING| |SELF_LEAVING| +------------+ +-----------+ +-----------+ +------------+ | | A A | | | | | | | | | SELF_ | |SELF_ |PEER_ |PEER_ | | DOWN_EVT| |UP_EVT |UP_EVT |DOWN_EVT | | | | | | | | | | | | | | | +--------------+ | | |PEER_DOWN_EVT +--->| SELF_DOWN_ |<---+ SELF_DOWN_EVT| +------------------->| PEER_DOWN |<--------------------+ +--------------+ Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: dsa: Initialize CPU port ethtool ops per treeFlorian Fainelli
Now that we can properly support multiple distinct trees in the system, using a global variable: dsa_cpu_port_ethtool_ops is getting clobbered as soon as the second switch tree gets probed, and we don't want that. We need to move this to be dynamically allocated, and since we can't really be comparing addresses anymore to determine first time initialization versus any other times, just move this to dsa.c and dsa2.c where the remainder of the dst/ds initialization happens. The operations teardown restores the master netdev's ethtool_ops to its original ethtool_ops pointer (typically within the Ethernet driver) Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: dsa: Add initialization helper for CPU port ethtool_opsFlorian Fainelli
Add a helper function: dsa_cpu_port_ethtool_init() which initializes a custom ethtool_ops structure with custom DSA ethtool operations for CPU ports. This is a preliminary change to move the initialization outside of net/dsa/slave.c. Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: dsa: Provide a slave MII bus if neededFlorian Fainelli
Mimic what net/dsa/dsa.c does and provide a slave MII bus by default which will be created if the driver implements a phy_read method. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: dsa: Initialize ds->enabled_port_mask and ds->phys_mii_maskFlorian Fainelli
Some drivers rely on these two bitmasks to contain the correct values for them to successfully probe and initialize at drv->setup() time, calculate correct values to put in both masks as early as possible in dsa_get_ports_dn(). Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: dsa: Provide unique DSA slave MII bus namesFlorian Fainelli
In case we have multiples trees and switches with the same index, we need to add another discriminating id: the switch tree. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: sched: fix missing doc annotationsEric Dumazet
"make htmldocs" complains otherwise: .//net/core/gen_stats.c:168: warning: No description found for parameter 'running' .//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key' Fixes: f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount") Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net_sched: add missing paddattr descriptionEric Dumazet
"make htmldocs" complains otherwise: .//net/core/gen_stats.c:65: warning: No description found for parameter 'padattr' .//net/core/gen_stats.c:101: warning: No description found for parameter 'padattr' Fixes: 9854518ea04d ("sched: align nlattr properly when needed") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08ipv6: Skip XFRM lookup if dst_entry in socket cache is validJakub Sitnicki
At present we perform an xfrm_lookup() for each UDPv6 message we send. The lookup involves querying the flow cache (flow_cache_lookup) and, in case of a cache miss, creating an XFRM bundle. If we miss the flow cache, we can end up creating a new bundle and deriving the path MTU (xfrm_init_pmtu) from on an already transformed dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down to xfrm_lookup(). This can happen only if we're caching the dst_entry in the socket, that is when we're using a connected UDP socket. To put it another way, the path MTU shrinks each time we miss the flow cache, which later on leads to incorrectly fragmented payload. It can be observed with ESPv6 in transport mode: 1) Set up a transformation and lower the MTU to trigger fragmentation # ip xfrm policy add dir out src ::1 dst ::1 \ tmpl src ::1 dst ::1 proto esp spi 1 # ip xfrm state add src ::1 dst ::1 \ proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b # ip link set dev lo mtu 1500 2) Monitor the packet flow and set up an UDP sink # tcpdump -ni lo -ttt & # socat udp6-listen:12345,fork /dev/null & 3) Send a datagram that needs fragmentation with a connected socket # perl -e 'print "@" x 1470 | socat - udp6:[::1]:12345 2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error 00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448 00:00:00.000014 IP6 ::1 > ::1: frag (1448|32) 00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272 (^ ICMPv6 Parameter Problem) 00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136 4) Compare it to a non-connected socket # perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345 00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448 00:00:00.000010 IP6 ::1 > ::1: frag (1448|64) What happens in step (3) is: 1) when connecting the socket in __ip6_datagram_connect(), we perform an XFRM lookup, miss the flow cache, create an XFRM bundle, and cache the destination, 2) afterwards, when sending the datagram, we perform an XFRM lookup, again, miss the flow cache (due to mismatch of flowi6_iif and flowi6_oif, which is an issue of its own), and recreate an XFRM bundle based on the cached (and already transformed) destination. To prevent the recreation of an XFRM bundle, avoid an XFRM lookup altogether whenever we already have a destination entry cached in the socket. This prevents the path MTU shrinkage and brings us on par with UDPv4. The fix also benefits connected PINGv6 sockets, another user of ip6_sk_dst_lookup_flow(), who also suffer messages being transformed twice. Joint work with Hannes Frederic Sowa. Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: Reduce queue allocation to one in kdump kernelHariprasad Shenai
When in kdump kernel, reduce memory usage by only using a single Queue Set for multiqueue devices. So make netif_get_num_default_rss_queues() return one, when in kdump kernel. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08l2tp: fix configuration passed to setup_udp_tunnel_sock()Guillaume Nault
Unused fields of udp_cfg must be all zeros. Otherwise setup_udp_tunnel_sock() fills ->gro_receive and ->gro_complete callbacks with garbage, eventually resulting in panic when used by udp_gro_receive(). [ 72.694123] BUG: unable to handle kernel paging request at ffff880033f87d78 [ 72.695518] IP: [<ffff880033f87d78>] 0xffff880033f87d78 [ 72.696530] PGD 26e2067 PUD 26e3067 PMD 342ed063 PTE 8000000033f87163 [ 72.696530] Oops: 0011 [#1] SMP KASAN [ 72.696530] Modules linked in: l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pptp gre pppox ppp_generic slhc crc32c_intel ghash_clmulni_intel jitterentropy_rng sha256_generic hmac drbg ansi_cprng aesni_intel evdev aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper serio_raw acpi_cpufreq button proc\ essor ext4 crc16 jbd2 mbcache virtio_blk virtio_net virtio_pci virtio_ring virtio [ 72.696530] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.7.0-rc1 #1 [ 72.696530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014 [ 72.696530] task: ffff880035b59700 ti: ffff880035b70000 task.ti: ffff880035b70000 [ 72.696530] RIP: 0010:[<ffff880033f87d78>] [<ffff880033f87d78>] 0xffff880033f87d78 [ 72.696530] RSP: 0018:ffff880035f87bc0 EFLAGS: 00010246 [ 72.696530] RAX: ffffed000698f996 RBX: ffff88003326b840 RCX: ffffffff814cc823 [ 72.696530] RDX: ffff88003326b840 RSI: ffff880033e48038 RDI: ffff880034c7c780 [ 72.696530] RBP: ffff880035f87c18 R08: 000000000000a506 R09: 0000000000000000 [ 72.696530] R10: ffff880035f87b38 R11: ffff880034b9344d R12: 00000000ebfea715 [ 72.696530] R13: 0000000000000000 R14: ffff880034c7c780 R15: 0000000000000000 [ 72.696530] FS: 0000000000000000(0000) GS:ffff880035f80000(0000) knlGS:0000000000000000 [ 72.696530] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 72.696530] CR2: ffff880033f87d78 CR3: 0000000033c98000 CR4: 00000000000406a0 [ 72.696530] Stack: [ 72.696530] ffffffff814cc834 ffff880034b93468 0000001481416818 ffff88003326b874 [ 72.696530] ffff880034c7ccb0 ffff880033e48038 ffff88003326b840 ffff880034b93462 [ 72.696530] ffff88003326b88a ffff88003326b88c ffff880034b93468 ffff880035f87c70 [ 72.696530] Call Trace: [ 72.696530] <IRQ> [ 72.696530] [<ffffffff814cc834>] ? udp_gro_receive+0x1c6/0x1f9 [ 72.696530] [<ffffffff814ccb1c>] udp4_gro_receive+0x2b5/0x310 [ 72.696530] [<ffffffff814d989b>] inet_gro_receive+0x4a3/0x4cd [ 72.696530] [<ffffffff81431b32>] dev_gro_receive+0x584/0x7a3 [ 72.696530] [<ffffffff810adf7a>] ? __lock_is_held+0x29/0x64 [ 72.696530] [<ffffffff814321f7>] napi_gro_receive+0x124/0x21d [ 72.696530] [<ffffffffa000b145>] virtnet_receive+0x8df/0x8f6 [virtio_net] [ 72.696530] [<ffffffffa000b27e>] virtnet_poll+0x1d/0x8d [virtio_net] [ 72.696530] [<ffffffff81431350>] net_rx_action+0x15b/0x3b9 [ 72.696530] [<ffffffff815893d6>] __do_softirq+0x216/0x546 [ 72.696530] [<ffffffff81062392>] irq_exit+0x49/0xb6 [ 72.696530] [<ffffffff81588e9a>] do_IRQ+0xe2/0xfa [ 72.696530] [<ffffffff81587a49>] common_interrupt+0x89/0x89 [ 72.696530] <EOI> [ 72.696530] [<ffffffff810b05df>] ? trace_hardirqs_on_caller+0x229/0x270 [ 72.696530] [<ffffffff8102b3c7>] ? default_idle+0x1c/0x2d [ 72.696530] [<ffffffff8102b3c5>] ? default_idle+0x1a/0x2d [ 72.696530] [<ffffffff8102bb8c>] arch_cpu_idle+0xa/0xc [ 72.696530] [<ffffffff810a6c39>] default_idle_call+0x1a/0x1c [ 72.696530] [<ffffffff810a6d96>] cpu_startup_entry+0x15b/0x20f [ 72.696530] [<ffffffff81039a81>] start_secondary+0x12c/0x133 [ 72.696530] Code: ff ff ff ff ff ff ff ff ff ff 7f ff ff ff ff ff ff ff 7f 00 7e f8 33 00 88 ff ff 6d 61 58 81 ff ff ff ff 5e de 0a 81 ff ff ff ff <00> 5c e2 34 00 88 ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 72.696530] RIP [<ffff880033f87d78>] 0xffff880033f87d78 [ 72.696530] RSP <ffff880035f87bc0> [ 72.696530] CR2: ffff880033f87d78 [ 72.696530] ---[ end trace ad7758b9a1dccf99 ]--- [ 72.696530] Kernel panic - not syncing: Fatal exception in interrupt [ 72.696530] Kernel Offset: disabled [ 72.696530] ---[ end Kernel panic - not syncing: Fatal exception in interrupt v2: use empty initialiser instead of "{ NULL }" to avoid relying on first field's type. Fixes: 38fd2af24fcf ("udp: Add socket based GRO and config") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08ila: Perform only one translation in forwarding pathTom Herbert
When setting up ILA in a router we noticed that the the encapsulation is invoked twice: once in the route input path and again upon route output. To resolve this we add a flag set_csum_neutral for the ila_update_ipv6_locator. If this flag is set and the checksum neutral bit is also set we assume that checksum-neutral translation has already been performed and take no further action. The flag is set only in ila_output path. The flag is not set for ila_input and ila_xlat. Tested: Used 3 netns to set to emulate a router and two hosts. The router translates SIR addresses between the two destinations in other two netns. Verified ping and netperf are functional. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net-sysfs: fix missing <linux/of_net.h>Ben Dooks
The of_find_net_device_by_node() function is defined in <linux/of_net.h> but not included in the .c file that implements it. Fix the following warning by including the header: net/core/net-sysfs.c:1494:19: warning: symbol 'of_find_net_device_by_node' was not declared. Should it be static? Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08tcp: accept RST if SEQ matches right edge of right-most SACK blockPau Espin Pedrol
RFC 5961 advises to only accept RST packets containing a seq number matching the next expected seq number instead of the whole receive window in order to avoid spoofing attacks. However, this situation is not optimal in the case SACK is in use at the time the RST is sent. I recently run into a scenario in which packet losses were high while uploading data to a server, and userspace was willing to frequently terminate connections by sending a RST. In this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting for a lost packet retransmission and SACK blocks are used to let the client continue uploading data. At some point later on, the client sends the RST (snd_nxt), which matches the next expected seq number of the right-most SACK block on the receiver side which is going forward receiving data. In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the frozen main ACK at receiver side and thus gets dropped and a challenge ACK is sent, which gets usually lost due to network conditions. The main consequence is that the connection stays alive for a while even if it made sense to accept the RST. This can get really bad if lots of connections like this one are created in few seconds, allocating all the resources of the server easily. For security reasons, not all SACK blocks are checked (there could be a big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it wouldn't make sense to check for RST in blocks other than the right-most received one because the sender is not expected to be sending new data after the RST. For simplicity, only up to the 4 most recently updated SACK blocks (selective_acks[4] field) are compared to find the right-most block, as usually those are the ones with bigger probability to contain it. This patch was tested in a 3.18 kernel and probed to improve the situation in the scenario described above. Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08bridge: Don't insert unnecessary local fdb entry on changing mac addressToshiaki Makita
The missing br_vlan_should_use() test caused creation of an unneeded local fdb entry on changing mac address of a bridge device when there is a vlan which is configured on a bridge port but not on the bridge device. Fixes: 2594e9064a57 ("bridge: vlan: add per-vlan struct and move to rhashtables") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: vrf: ipv6 support for local traffic to local addressesDavid Ahern
Add support for locally originated traffic to VRF-local IPv6 addresses. Similar to IPv4 a local dst is set on the skb and the packet is reinserted with a call to netif_rx. With this patch, ping, tcp and udp packets to a local IPv6 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping6 -c1 -I red 2100:1::1 ping6: Warning: source address might be selected on device other than red. PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes 64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms ip6_input is exported so the VRF driver can use it for the dst input function. The dst_alloc function for IPv4 defaults to setting the input and output functions; IPv6's does not. VRF does not need to duplicate the Rx path so just export the ipv6 input function. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07gue: Implement direction IP encapsulationTom Herbert
This patch implements direct encapsulation of IPv4 and IPv6 packets in UDP. This is done a version "1" of GUE and as explained in I-D draft-ietf-nvo3-gue-03. Changes here are only in the receive path, fou with IPxIPx already supports the transmit side. Both the normal receive path and GRO path are modified to check for GUE version and check for IP version in the case that GUE version is "1". Tested: IPIP with direct GUE encap 1 TCP_STREAM 4530 Mbps 200 TCP_RR 1297625 tps 135/232/444 90/95/99% latencies IP4IP6 with direct GUE encap 1 TCP_STREAM 4903 Mbps 200 TCP_RR 1184481 tps 149/253/473 90/95/99% latencies IP6IP6 direct GUE encap 1 TCP_STREAM 5146 Mbps 200 TCP_RR 1202879 tps 146/251/472 90/95/99% latencies SIT with direct GUE encap 1 TCP_STREAM 6111 Mbps 200 TCP_RR 1250337 tps 139/241/467 90/95/99% latencies Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains two Netfilter/IPVS fixes for your net tree, they are: 1) Fix missing alignment in next offset calculation for standard targets, introduced in the previous merge window, patch from Florian Westphal. 2) Fix to correct the handling of outgoing connections which use the SIP-pe such that the binding of a real-server is updated when needed. This was an omission from changes introduced by Marco Angaroni in the previous merge window too, to allow handling of outgoing connections by the SIP-pe. Patch and report came via Simon Horman. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07tcp: record TLP and ER timer stats in v6 statsYuchung Cheng
The v6 tcp stats scan do not provide TLP and ER timer information correctly like the v4 version . This patch fixes that. Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)") Fixes: eed530b6c676 ("tcp: early retransmit") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: sched: fix tc_should_offload for specific clsact classesDaniel Borkmann
When offloading classifiers such as u32 or flower to hardware, and the qdisc is clsact (TC_H_CLSACT), then we need to differentiate its classes, since not all of them handle ingress, therefore we must leave those in software path. Add a .tcf_cl_offload() callback, so we can generically handle them, tested on ixgbe. Fixes: 10cbc6843446 ("net/sched: cls_flower: Hardware offloaded filters statistics support") Fixes: 5b33f48842fa ("net/flower: Introduce hardware offload support") Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07act_police: fix a crash during removalWANG Cong
The police action is using its own code to initialize tcf hash info, which makes us to forgot to initialize a->hinfo correctly. Fix this by calling the helper function tcf_hash_create() directly. This patch fixed the following crash: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 IP: [<ffffffff810c099f>] __lock_acquire+0xd3/0xf91 PGD d3c34067 PUD d3e18067 PMD 0 Oops: 0000 [#1] SMP CPU: 2 PID: 853 Comm: tc Not tainted 4.6.0+ #87 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800d3e28040 ti: ffff8800d3f6c000 task.ti: ffff8800d3f6c000 RIP: 0010:[<ffffffff810c099f>] [<ffffffff810c099f>] __lock_acquire+0xd3/0xf91 RSP: 0000:ffff88011b203c80 EFLAGS: 00010002 RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028 RBP: ffff88011b203d40 R08: 0000000000000001 R09: 0000000000000000 R10: ffff88011b203d58 R11: ffff88011b208000 R12: 0000000000000001 R13: ffff8800d3e28040 R14: 0000000000000028 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 00000000d4be1000 CR4: 00000000000006e0 Stack: ffff8800d3e289c0 0000000000000046 000000001b203d60 ffffffff00000000 0000000000000000 ffff880000000000 0000000000000000 ffffffff00000000 ffffffff8187142c ffff88011b203ce8 ffff88011b203ce8 ffffffff8101dbfc Call Trace: <IRQ> [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1 [<ffffffff8101dbfc>] ? native_sched_clock+0x1a/0x35 [<ffffffff8101dbfc>] ? native_sched_clock+0x1a/0x35 [<ffffffff810a9604>] ? sched_clock_local+0x11/0x78 [<ffffffff810bf6a1>] ? mark_lock+0x24/0x201 [<ffffffff810c1dbd>] lock_acquire+0x120/0x1b4 [<ffffffff810c1dbd>] ? lock_acquire+0x120/0x1b4 [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1 [<ffffffff81aad89f>] _raw_spin_lock_bh+0x3c/0x72 [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1 [<ffffffff8187142c>] __tcf_hash_release+0x77/0xd1 [<ffffffff81871a27>] tcf_action_destroy+0x49/0x7c [<ffffffff81870b1c>] tcf_exts_destroy+0x20/0x2d [<ffffffff8189273b>] u32_destroy_key+0x1b/0x4d [<ffffffff81892788>] u32_delete_key_freepf_rcu+0x1b/0x1d [<ffffffff810de3b8>] rcu_process_callbacks+0x610/0x82e [<ffffffff8189276d>] ? u32_destroy_key+0x4d/0x4d [<ffffffff81ab0bc1>] __do_softirq+0x191/0x3f4 Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc actions") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: sched: do not acquire qdisc spinlock in qdisc/class stats dumpEric Dumazet
Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host agent [1] are problematic at scale : For each qdisc/class found in the dump, we currently lock the root qdisc spinlock in order to get stats. Sampling stats every 5 seconds from thousands of HTB classes is a challenge when the root qdisc spinlock is under high pressure. Not only the dumps take time, they also slow down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases. An audit of existing qdiscs showed that sch_fq_codel is the only qdisc that might need the qdisc lock in fq_codel_dump_stats() and fq_codel_dump_class_stats() In v2 of this patch, I now use the Qdisc running seqcount to provide consistent reads of packets/bytes counters, regardless of 32/64 bit arches. I also changed rate estimators to use the same infrastructure so that they no longer need to lock root qdisc lock. [1] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kevin Athey <kda@google.com> Cc: Xiaotian Pei <xiaotian@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net_sched: transform qdisc running bit into a seqcountEric Dumazet
Instead of using a single bit (__QDISC___STATE_RUNNING) in sch->__state, use a seqcount. This adds lockdep support, but more importantly it will allow us to sample qdisc/class statistics without having to grab qdisc root lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07fq_codel: return non zero qlen in class dumpsEric Dumazet
We properly scan the flow list to count number of packets, but John passed 0 to gnet_stats_copy_queue() so we report a zero value to user space instead of the result. Fixes: 640158536632 ("net: sched: restrict use of qstats qlen") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.r.fastabend@intel.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: cls_u32: be more strict about skip-sw flagJakub Kicinski
Return an error if user requested skip-sw and the underlaying hardware cannot handle tc offloads (or offloads are disabled). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: cls_u32: fix error code for invalid flagsJakub Kicinski
'err' variable is not set in this test, we would return whatever previous test set 'err' to. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net sched: indentation and other OCD stylistic fixesJamal Hadi Salim
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
2016-06-07net sched actions: aggregate dumping of actions timeinfoJamal Hadi Salim
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net sched actions: introduce timestamp for firsttime useJamal Hadi Salim
Useful to know when the action was first used for accounting (and debugging) Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net sched: actions use tcf_lastuse_update for consistencyJamal Hadi Salim
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net/sched: cls_flower: Introduce support in SKIP SW flagAmir Vadai
In order to make a filter processed only by hardware, skip_sw flag should be supplied. This is an addition to the already existing skip_hw flag (filter will be processed by software only). If no flag is specified, filter will be processed by both software and hardware. If only hardware offloaded filters exist, fl_classify() will return without doing anything. A following userspace patch will be sent once kernel patch is accepted. Example: tc filter add dev enp0s9 protocol ip prio 20 parent ffff: \ flower \ ip_proto 6 \ indev enp0s9 \ skip_sw \ action skbedit mark 0x1234 Signed-off-by: Amir Vadai <amirva@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: get rid of spin_trylock() in net_tx_action()Eric Dumazet
Note: Tom Herbert posted almost same patch 3 months back, but for different reasons. The reasons we want to get rid of this spin_trylock() are : 1) Under high qdisc pressure, the spin_trylock() has almost no chance to succeed. 2) We loop multiple times in softirq handler, eventually reaching the max retry count (10), and we schedule ksoftirqd. Since we want to adhere more strictly to ksoftirqd being waked up in the future (https://lwn.net/Articles/687617/), better avoid spurious wakeups. 3) calls to __netif_reschedule() dirty the cache line containing q->next_sched, slowing down the owner of qdisc. 4) RT kernels can not use the spin_trylock() here. With help of busylock, we get the qdisc spinlock fast enough, and the trylock trick brings only performance penalty. Depending on qdisc setup, I observed a gain of up to 19 % in qdisc performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel) ("mpstat -I SCPU 1" is much happier now) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07rxrpc: fix ptr_ret.cocci warningsWu Fengguang
net/rxrpc/rxkad.c:1165:1-3: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci CC: David Howells <dhowells@redhat.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07RDS: TCP: fix race windows in send-path quiescence by rds_tcp_accept_one()Sowmini Varadhan
The send path needs to be quiesced before resetting callbacks from rds_tcp_accept_one(), and commit eb192840266f ("RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock") achieves this using the c_state and RDS_IN_XMIT bit following the pattern used by rds_conn_shutdown(). However this leaves the possibility of a race window as shown in the sequence below take t_conn_lock in rds_tcp_conn_connect send outgoing syn to peer drop t_conn_lock in rds_tcp_conn_connect incoming from peer triggers rds_tcp_accept_one, conn is marked CONNECTING wait for RDS_IN_XMIT to quiesce any rds_send_xmit threads call rds_tcp_reset_callbacks [.. race-window where incoming syn-ack can cause the conn to be marked UP from rds_tcp_state_change ..] lock_sock called from rds_tcp_reset_callbacks, and we set t_sock to null As soon as the conn is marked UP in the race-window above, rds_send_xmit() threads will proceed to rds_tcp_xmit and may encounter a null-pointer deref on the t_sock. Given that rds_tcp_state_change() is invoked in softirq context, whereas rds_tcp_reset_callbacks() is in workq context, and testing for RDS_IN_XMIT after lock_sock could result in a deadlock with tcp_sendmsg, this commit fixes the race by using a new c_state, RDS_TCP_RESETTING, which will prevent a transition to RDS_CONN_UP from rds_tcp_state_change(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07RDS: TCP: Retransmit half-sent datagrams when switching sockets in ↵Sowmini Varadhan
rds_tcp_reset_callbacks When we switch a connection's sockets in rds_tcp_rest_callbacks, any partially sent datagram must be retransmitted on the new socket so that the receiver can correctly reassmble the RDS datagram. Use rds_send_reset() which is designed for this purpose. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07RDS: TCP: Add/use rds_tcp_reset_callbacks to reset tcp socket safelySowmini Varadhan
When rds_tcp_accept_one() has to replace the existing tcp socket with a newer tcp socket (duelling-syn resolution), it must lock_sock() to suppress the rds_tcp_data_recv() path while callbacks are being changed. Also, existing RDS datagram reassembly state must be reset, so that the next datagram on the new socket does not have corrupted state. Similarly when resetting the newly accepted socket, appropriate locks and synchronization is needed. This commit ensures correct synchronization by invoking kernel_sock_shutdown to reset a newly accepted sock, and by taking appropriate lock_sock()s (for old and new sockets) when resetting existing callbacks. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07fq_codel: fix NET_XMIT_CN behaviorEric Dumazet
My prior attempt to fix the backlogs of parents failed. If we return NET_XMIT_CN, our parents wont increase their backlog, so our qdisc_tree_reduce_backlog() should take this into account. v2: Florian Westphal pointed out that we could drop the packet, so we need to save qdisc_pkt_len(skb) in a temp variable before calling fq_codel_drop() Fixes: 9d18562a2278 ("fq_codel: add batch ability to fq_codel_drop()") Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Reported-by: Stas Nichiporovich <stasn77@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07netfilter: helper: avoid extra expectation iterations on unregisterFlorian Westphal
The expectation table is not duplicated per net namespace anymore, so we can move the expectation table and conntrack table iteration out of the per-net loop. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-07bridge: netfilter: checkpatch data type fixesTobin C Harding
checkpatch produces data type 'checks'. This patch amends them by changing, for example: uint8_t -> u8 Signed-off-by: Tobin C Harding <me@tobin.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-07iucv: properly clone LSM attributes to newly created child socketsPaul Moore
Much like we had to do for AF_BLUETOOTH and AF_ALG, make sure we properly clone the parent socket's LSM attributes to newly created child sockets. Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-06net_sched: keep backlog updated with qlenWANG Cong
For gso_skb we only update qlen, backlog should be updated too. Note, it is correct to just update these stats at one layer, because the gso_skb is cached there. Reported-by: Stas Nichiporovich <stasn77@gmail.com> Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-06soreuseport: add compat case for setsockopt SO_ATTACH_REUSEPORT_CBPFHelge Deller
Commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") missed to add the compat case for the SO_ATTACH_REUSEPORT_CBPF option. Signed-off-by: Helge Deller <deller@gmx.de> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-06netlabel: add address family checks to netlbl_{sock,req}_delattr()Paul Moore
It seems risky to always rely on the caller to ensure the socket's address family is correct before passing it to the NetLabel kAPI, especially since we see at least one LSM which didn't. Add address family checks to the *_delattr() functions to help prevent future problems. Cc: <stable@vger.kernel.org> Reported-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-06-06net: Use ns_capable_noaudit() when determining net sysctl permissionsTyler Hicks
The capability check should not be audited since it is only being used to determine the inode permissions. A failed check does not indicate a violation of security policy but, when an LSM is enabled, a denial audit message was being generated. The denial audit message caused confusion for some application authors because root-running Go applications always triggered the denial. To prevent this confusion, the capability check in net_ctl_permissions() is switched to the noaudit variant. BugLink: https://launchpad.net/bugs/1465724 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2016-06-05net: disable fragment reassembly if high_thresh is zeroMichal Kubeček
Before commit 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting"), setting the reassembly high threshold to 0 prevented fragment reassembly as first fragment would be always evicted before second could be added to the queue. While inefficient, some users apparently relied on this method. Since the commit mentioned above, a percpu counter is used for reassembly memory accounting and high batch size avoids taking slow path in most common scenarios. As a result, a whole full sized packet can be reassembled without the percpu counter's main counter changing its value so that even with high_thresh set to 0, fragmented packets can be still reassembled and processed. Add explicit check preventing reassembly if high threshold is zero. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-06ipvs: update real-server binding of outgoing connections in SIP-peMarco Angaroni
Previous patch that introduced handling of outgoing packets in SIP persistent-engine did not call ip_vs_check_template() in case packet was matching a connection template. Assumption was that real-server was healthy, since it was sending a packet just in that moment. There are however real-server fault conditions requiring that association between call-id and real-server (represented by connection template) gets updated. Here is an example of the sequence of events: 1) RS1 is a back2back user agent that handled call-id1 and call-id2 2) RS1 is down and was marked as unavailable 3) new message from outside comes to IPVS with call-id1 4) IPVS reschedules the message to RS2, which becomes new call handler 5) RS2 forwards the message outside, translating call-id1 to call-id2 6) inside pe->conn_out() IPVS matches call-id2 with existing template 7) IPVS does not change association call-id2 <-> RS1 8) new message comes from client with call-id2 9) IPVS reschedules the message to a real-server potentially different from RS2, which is now the correct destination This patch introduces ip_vs_check_template() call in the handling of outgoing packets for SIP-pe. And also introduces a second optional argument for ip_vs_check_template() that allows to check if dest associated to a connection template is the same dest that was identified as the source of the packet. This is to change the real-server bound to a particular call-id independently from its availability status: the idea is that it's more reliable, for in->out direction (where internal network can be considered trusted), to always associate a call-id with the last real-server that used it in one of its messages. Think about above sequence of events where, just after step 5, RS1 returns instead to be available. Comparison of dests is done by simply comparing pointers to struct ip_vs_dest; there should be no cases where struct ip_vs_dest keeps its memory address, but represent a different real-server in terms of ip-address / port. Fixes: 39b972231536 ("ipvs: handle connections started by real-servers") Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2016-06-04net: dsa: Add new binding implementationAndrew Lunn
The existing DSA binding has a number of limitations and problems. The main problem is that it cannot represent a switch as a linux device, hanging off some bus. It is limited to one CPU port. The DSA platform device is artificial, and does not really represent hardware. Implement a new binding which can be embedded into any type of node on a bus to represent one switch device, and its links to other switches. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-04net: dsa: Make mdio bus optionalAndrew Lunn
The switch may want to instantiate its own MDIO bus. Only do it centrally if the switch has not already created one, and the read op is implemented. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>