git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2017-02-06	net/mlx5e: Tx, no inline copy on ConnectX-5	Saeed Mahameed
	ConnectX-5 and later HW generations will report min inline mode == MLX5_INLINE_MODE_NONE, which means driver is not required to copy packet headers to inline fields of TX WQE. When inline is not required, vlan insertion will be handled in the TX descriptor rather than copy to inline. For LSO case driver is still required to copy headers, for the HW to duplicate on wire. This will improve CPU utilization and boost TX performance. Tested with pktgen burst single flow: CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz HCA: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] Before: 15.1Mpps After: 17.2Mpps Improvement: 14% Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
2017-02-06	net/mlx5: TX WQE update	Saeed Mahameed
	Add new TX WQE fields for Connect-X5 vlan insertion support, type and vlan_tci, when type = MLX5_ETH_WQE_INSERT_VLAN the HW will insert the vlan and prio fields (vlan_tci) to the packet. Those bits and the inline header fields are mutually exclusive, and valid only when: MLX5_CAP_ETH(mdev, wqe_inline_mode) == MLX5_CAP_INLINE_MODE_NOT_REQUIRED and MLX5_CAP_ETH(mdev, wqe_vlan_insert), who will be set in ConnectX-5 and later HW generations. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
2017-02-06	net/mlx5: Configure cache line size for start and end padding	Daniel Jurgens
	There is a hardware feature that will pad the start or end of a DMA to be cache line aligned to avoid RMWs on the last cache line. The default cache line size setting for this feature is 64B. This change configures the hardware to use 128B alignment on systems with 128B cache lines. In addition we lower bound MPWRQ stride by HCA cacheline in mlx5e, MPWRQ stride should be at least the HCA cacheline, the current default is 64B and in case HCA_CAP.cach_line_128byte capability is set, MPWRQ RX stride will automatically be aligned to 128B. Signed-off-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-02-06	Merge tag 'linux-can-next-for-4.11-20170206' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2017-02-06 this is a pull request of 16 patches for net-next/master. The first two patches by David Jander and me add the rx-offload framework for CAN devices to the kernel. The remaining 14 patches convert the flexcan driver to make use of it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06	mlxsw: reg: Fix HTGT register length	Elad Raz
	HTGT register length is limited to 32 bytes and not 256 bytes. Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06	Merge tag 'mac80211-for-davem-2017-02-06' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A few simple fixes: * fix FILS AEAD cipher usage to use the correct AAD vectors and to use synchronous algorithms * fix using mesh HT operation data from userspace * fix adding mesh vendor elements to beacons & plink frames ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06	net: mvneta: implement .set_wol and .get_wol	Jingju Hou
	The mvneta itself does not support WOL, but the PHY might. So pass the calls to the PHY Signed-off-by: Jingju Hou <houjingj@marvell.com> Signed-off-by: Jisheng Zhang <jszhang@marvell.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06	ipv6: tcp: add a missing tcp_v6_restore_cb()	Eric Dumazet
	Dmitry reported use-after-free in ip6_datagram_recv_specific_ctl() A similar bug was fixed in commit 8ce48623f0cf ("ipv6: tcp: restore IP6CB for pktoptions skbs"), but I missed another spot. tcp_v6_syn_recv_sock() can indeed set np->pktoptions from ireq->pktopts Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06	can: flexcan: switch imx6 and vf610 to timestamp based offloading	Marc Kleine-Budde
	This patch switches the imx6 and vf610 based SoCs from the hardware FIFO to the timestamp based rx offloading. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: add support for timestamp based rx-offload	Marc Kleine-Budde
	The flexcan IP core has 64 mailboxes. For now they are configured for RX as a hardware FIFO. This FIFO has a fixed depth of 6 CAN frames. In some high load scenarios it turns out thas this buffer is too small. In order to have a buffer larger than the 6 frames FIFO, this patch adds support for timestamp based offloading via the generic rx-offload infrastructure. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: add quirk FLEXCAN_QUIRK_ENABLE_EACEN_RRS	Marc Kleine-Budde
	In order to receive RTR frames in the non HW FIFO mode the RSS and EACEN bits of the reg_ctrl2 have to be activated. As this has no side effect in the FIFO mode, we do this unconditionally on cores with the reg_ctrl2. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: activate individual RX masking and initialize reg_rximr	Marc Kleine-Budde
	Modern flexcan IP cores support two RX modes. One is using the 6 fames deep hardware FIFO, the other is using up to 64 mailboxes (in non FIFO mode). For now only the HW FIFO mode is activated. In order to make use of the RX mailboxes the individual RX masking feature has to be activated, otherwise matching mailboxes are overwritten during the reception process. This however switches on the individual RX masking, which uses reg_rximr registers for masking. This patch activates the individual RX masking feature unconditionally and initializes the mask registers (reg_rximr) with 0x0 == "don't care", which switches off any filtering. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: make use of rx-offload's irq_offload_fifo	Marc Kleine-Budde
	This patch converts the flexcan driver to make use of the rx-offload can_rx_offload_irq_offload_fifo() helper function. The idea is to read the CAN frames already in the interrupt context, as the depth of the flexcan HW FIFO is too shallow, resulting in too many missed frames. During a normal NAPI poll the frames are the pushed into the upper layers. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: make TX mailbox selectable during runtime	Marc Kleine-Budde
	This patch makes the TX mailbox selectable duing runtime. This is a preparation patch to use of the hardware FIFO selectable via runtime. As the TX mailbox number is different in HW FIFO and normal mode. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: calculate default value for imask1 during runtime	Marc Kleine-Budde
	This patch converts the define FLEXCAN_IFLAG_DEFAULT into the runtime calculated value priv->reg_imask1_default. This is a preparation patch to make the TX mailbox selectable during runtime, too. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: flexcan_irq(): don't unconditionally return IRQ_HANDLED	Marc Kleine-Budde
	This patch changes the flexcan_irq() function to only return IRQ_HANDLED, if the interrupt really has been handled, otherwise IRQ_NONE is returned. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: flexcan_poll_bus_err(): fold in do_bus_err()	Marc Kleine-Budde
	This patch folds in the do_bus_err() function into flexcan_poll_bus_err(). Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: flexcan_poll_state(): no need to initialize new_state, ↵	Marc Kleine-Budde
	rx_state, tx_state This patch removed the not needed initialisation from the new_state, rx_state, tx_state variabled. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: do_bus_err(): convert rx_,tx_errors into bool	Marc Kleine-Budde
	This patch converts the rx_errors and tx_errors from int into bool values, to reflect their actual meaning. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: make declaration of devtype_data const	Marc Kleine-Budde
	This patch changes the declaration of the devtype data to const. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: remove write-only member pdata of struct flexcan_priv	Marc Kleine-Budde
	This patch removes the write only member pdata from the struct flexcan_priv. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: flexcan: add missing register definitions	Marc Kleine-Budde
	This patch adds some missing register definitions, which are needed in an upcoming patch. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: rx-offload: Add support for timestamp based irq offloading	Marc Kleine-Budde
	Some CAN controllers don't implement a FIFO in hardware, but fill their mailboxes in a particular order (from lowest to highest or highest to lowest). This makes problems to read the frames in the correct order from the hardware, as new frames might be filled into just read (low) mailboxes. This gets worse, when following new frames are received into not read (higher) mailboxes. On the bright side some these CAN controllers put a timestamp on each received CAN frame. This patch adds support to offload CAN frames in interrupt context, order them by timestamp and then transmitted in a NAPI context. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	can: rx-offload: Add support for HW fifo based irq offloading	David Jander
	Some CAN controllers have a usable FIFO already but can still benefit from off-loading the CAN controller FIFO. The CAN frames of the FIFO are read and put into a skb queue during interrupt and then transmitted in a NAPI context. Signed-off-by: David Jander <david@protonic.nl> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-02-06	ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()	Takashi Iwai
	snd_seq_pool_done() syncs with closing of all opened threads, but it aborts the wait loop with a timeout, and proceeds to the release resource even if not all threads have been closed. The timeout was 5 seconds, and if you run a crazy stuff, it can exceed easily, and may result in the access of the invalid memory address -- this is what syzkaller detected in a bug report. As a fix, let the code graduate from naiveness, simply remove the loop timeout. BugLink: http://lkml.kernel.org/r/CACT4Y+YdhDV2H5LLzDTJDVF-qiYHUHhtRaW4rbb4gUhTCQB81w@mail.gmail.com Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-02-06	Merge branches 'pm-core-fixes' and 'pm-cpufreq-fixes'	Rafael J. Wysocki
	* pm-core-fixes: PM / runtime: Avoid false-positive warnings from might_sleep_if() * pm-cpufreq-fixes: cpufreq: intel_pstate: Disable energy efficiency optimization cpufreq: brcmstb-avs-cpufreq: properly retrieve P-state upon suspend cpufreq: brcmstb-avs-cpufreq: extend sysfs entry brcm_avs_pmap
2017-02-06	net/mlx5: Fix static checker warnings	Or Gerlitz
	For some reason, sparse doesn't like using an expression of type (!x) with a bitwise \| and &. In order to mitigate that, we use a local variable. This removes the following sparse complaints on the core driver (and similar ones on the IB driver too): drivers/net/ethernet/mellanox/mlx5/core/srq.c:83:9: warning: dubious: !x & y drivers/net/ethernet/mellanox/mlx5/core/srq.c:96:9: warning: dubious: !x & y drivers/net/ethernet/mellanox/mlx5/core/port.c:59:9: warning: dubious: !x & y drivers/net/ethernet/mellanox/mlx5/core/vport.c:561:9: warning: dubious: !x & y Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-02-06	nl80211: Fix mesh HT operation check	Masashi Honma
	A previous change to fix checks for NL80211_MESHCONF_HT_OPMODE missed setting the flag when replacing FILL_IN_MESH_PARAM_IF_SET with checking codes. This results in dropping the received HT operation value when called by nl80211_update_mesh_config(). Fix this by setting the flag properly. Fixes: 9757235f451c ("nl80211: correct checks for NL80211_MESHCONF_HT_OPMODE value") Signed-off-by: Masashi Honma <masashi.honma@gmail.com> [rewrite commit message to use Fixes: line] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-06	mac80211: Fix adding of mesh vendor IEs	Thorsten Horstmann
	The function ieee80211_ie_split_vendor doesn't return 0 on errors. Instead it returns any offset < ielen when WLAN_EID_VENDOR_SPECIFIC is found. The return value in mesh_add_vendor_ies must therefore be checked against ifmsh->ie_len and not 0. Otherwise all ifmsh->ie starting with WLAN_EID_VENDOR_SPECIFIC will be rejected. Fixes: 082ebb0c258d ("mac80211: fix mesh beacon format") Signed-off-by: Thorsten Horstmann <thorsten@defutech.de> Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fit.fraunhofer.de> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> [sven@narfation.org: Add commit message] Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-06	mac80211: Allocate a sync skcipher explicitly for FILS AEAD	Jouni Malinen
	The skcipher could have been of the async variant which may return from skcipher_encrypt() with -EINPROGRESS after having queued the request. The FILS AEAD implementation here does not have code for dealing with that possibility, so allocate a sync cipher explicitly to avoid potential issues with hardware accelerators. This is based on the patch sent out by Ard. Fixes: 39404feee691 ("mac80211: FILS AEAD protection for station mode association frames") Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-06	mac80211: Fix FILS AEAD protection in Association Request frame	Jouni Malinen
	Incorrect num_elem parameter value (1 vs. 5) was used in the aes_siv_encrypt() call. This resulted in only the first one of the five AAD vectors to SIV getting included in calculation. This does not protect all the contents correctly and would not interoperate with a standard compliant implementation. Fix this by using the correct number. A matching fix is needed in the AP side (hostapd) to get FILS authentication working properly. Fixes: 39404feee691 ("mac80211: FILS AEAD protection for station mode association frames") Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-02-05	Linux 4.10-rc7v4.10-rc7	Linus Torvalds

2017-02-05	ip6_gre: fix ip6gre_err() invalid reads	Eric Dumazet
	Andrey Konovalov reported out of bound accesses in ip6gre_err() If GRE flags contains GRE_KEY, the following expression (((__be32 )p) + (grehlen / 4) - 1) accesses data ~40 bytes after the expected point, since grehlen includes the size of IPv6 headers. Let's use a "struct gre_base_hdr *greh" pointer to make this code more readable. p[1] becomes greh->protocol. grhlen is the GRE header length. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	Merge branch 'remove-__napi_complete_done'	David S. Miller
	Eric Dumazet says: ==================== net: get rid of __napi_complete() This patch series removes __napi_complete() calls, in an effort to make NAPI API simpler and generalize GRO and napi_complete_done() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	net: remove __napi_complete()	Eric Dumazet
	All __napi_complete() callers have been converted to use the more standard napi_complete_done(), we can now remove this NAPI method for good. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	aeroflex/greth: use napi_complete_done()	Eric Dumazet
	We plan to remove __napi_complete() soon, this driver is the last user. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	ibm/emac: use napi_complete_done()	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() We plan to remove __napi_complete() to reduce NAPI complexity. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	qla3xxx: add GRO support	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	ks8695net: add GRO support	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API. Note that rx_lock seems to be useless, NAPI logic should not need this extra care. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	skge: use napi_complete_done()	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API and get rid of napi_gro_flush() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	ep93xx_eth: add GRO support	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API. 4) get rid of baroque code and ease maintenance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	pcnet32: use napi_complete_done()	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	amd8111e: add GRO support	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API. 4) get rid of baroque code and ease maintenance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	epic100: use napi_complete_done()	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API. 4) get rid of baroque code and ease maintenance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	8139cp: use napi_complete_done()	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API. 4) Eventually get rid of napi_gro_flush() in the future. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	8139too: use napi_complete_done()	Eric Dumazet
	Use napi_complete_done() instead of __napi_complete() to : 1) Get support of gro_flush_timeout if opt-in 2) Not rearm interrupts for busy-polling users. 3) use standard NAPI API. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-05	x86/CPU/AMD: Fix Zen SMT topology	Yazen Ghannam
	After: a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology") our SMT scheduling topology for Fam17h systems is broken, because the ThreadId is included in the ApicId when SMT is enabled. So, without further decoding cpu_core_id is unique for each thread rather than the same for threads on the same core. This didn't affect systems with SMT disabled. Make cpu_core_id be what it is defined to be. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # 4.9 Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170205105022.8705-2-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-05	x86/CPU/AMD: Bring back Compute Unit ID	Borislav Petkov
	Commit: a33d331761bc ("x86/CPU/AMD: Fix Bulldozer topology") restored the initial approach we had with the Fam15h topology of enumerating CU (Compute Unit) threads as cores. And this is still correct - they're beefier than HT threads but still have some shared functionality. Our current approach has a problem with the Mad Max Steam game, for example. Yves Dionne reported a certain "choppiness" while playing on v4.9.5. That problem stems most likely from the fact that the CU threads share resources within one CU and when we schedule to a thread of a different compute unit, this incurs latency due to migrating the working set to a different CU through the caches. When the thread siblings mask mirrors that aspect of the CUs and threads, the scheduler pays attention to it and tries to schedule within one CU first. Which takes care of the latency, of course. Reported-by: Yves Dionne <yves.dionne@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # 4.9 Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Link: http://lkml.kernel.org/r/20170205105022.8705-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-04	Merge branch 'ipv6-Improve-user-experience-with-multipath-routes'	David S. Miller
	David Ahern says: ==================== net: ipv6: Improve user experience with multipath routes This series closes a couple of gaps between IPv4 and IPv6 with respect to multipath routes: 1. IPv4 allows all nexthops of multipath routes to be deleted using just the prefix and length; IPv6 only deletes the first nexthop for the route if only the prefix and length are given. 2. IPv4 returns multipath routes encoded in the RTA_MULTIPATH attribute. IPv6 returns a series of routes with the same prefix and length - one for each nexthop. This happens for both dumps and notifications. IPv6 does accept RTA_MULTIPATH encoded routes, but installs them as a series of routes. Patch 1 addresses the first item by allowing IPv6 multipath routes to be deleted using just the prefix and length. Patch 2 addresses the second allowing IPv6 multipath routes to be returned encoded in the RTA_MULTIPATH. Patches 3 and 4 upate the RTM_{NEW,DEL}ROUTE notifications to generate 1 notification with RTA_MULTIPATH where applicable. Patch 5 prints IPv6 addresses in compressed format when showing route replace errors. This was noticed testing REPLACE failures. The end result for multipath routes: 1. Dump - RTA_MULTIPATH used for multipath routes $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:200::/120 metric 1024 nexthop via 2001:db8:1::2 dev eth1 weight 1 nexthop via 2001:db8:2::2 dev eth2 weight 1 ... 2. Route Add - one notification with RTA_MULTIPATH attribute $ ip -6 ro add vrf red 2001:db8:200::/120 nexthop via 2001:db8:1::2 nexthop via 2001:db8:2::2 $ ip mon route 2001:db8:200::/120 table red metric 1024 nexthop via 2001:db8:1::2 dev eth1 weight 1 nexthop via 2001:db8:2::2 dev eth2 weight 1 2. Route Replace - one notification with RTA_MULTIPATH attribute $ ip -6 ro replace vrf red 2001:db8:200::/120 nexthop via 2001:db8:1::16 nexthop via 2001:db8:2::16 $ ip mon route Replaced 2001:db8:200::/120 table red metric 1024 nexthop via 2001:db8:1::16 dev eth1 weight 1 nexthop via 2001:db8:2::16 dev eth2 weight 1 - on a failure after the insertion of the first nexthop (which means the original route has been replaced in the FIB), a notification is sent with the successful nexthops and then the nexthops are deleted with one notification per hop. This is consistent with how it works today except the successful additions are coalesced into 1 notification. 3. Route Delete - delete of entire multipath route using prefix/length only 1 notification is generated: $ ip -6 ro del vrf red 2001:db8:200::/120 $ ip mon route Deleted 2001:db8:200::/120 table red metric 1024 nexthop via 2001:db8:1::16 dev eth1 weight 1 nexthop via 2001:db8:2::16 dev eth2 weight 1 - if a delete request contains nexthops one notification is generated per nexthop deleted. This is unavoidable since IPv6 alllows a single nexthop to be deleted within a multipath route 4. Route Appends - IPv6 allows nexthops to be appended to an existing route. In this case one notification is sent for the new route with the append flag set. $ ip -6 ro append vrf red 2001:db8:200::/120 nexthop via 2001:db8:2::20 nexthop via 2001:db8:1::20 $ ip mon route Append 2001:db8:200::/120 table red metric 1024 nexthop via 2001:db8:1::2 dev eth1 weight 1 nexthop via 2001:db8:2::2 dev eth2 weight 1 nexthop via 2001:db8:2::20 dev eth2 weight 1 nexthop via 2001:db8:1::20 dev eth1 weight 1 - on failure of an append, a notification is sent with the route containing all of the nexthops successfully added, and it is followed by delete notifications as the hops are removed returning the route to its prior state. This is consistent with how it works today except the successful additions are coalesced into 1 notification. Addresses some of the inconsistencies also noted by Roopa at netdev0.1: https://www.netdev01.org/docs/prabhu-linux_ipv4_ipv6_inconsistencies_talk_slides.pdf v4 - changed series to do encoding in 1 patch and updating notificatons in separate patches to make it easier to review and understand - 1 notification for delete when using prefix/length; 1 notification for append - handle delete of a single nexthop without RTA_MULTIPATH in delete request - upated commit messages and cover letter v3 - removed the need for a user API to opt-in to change. Requiring an API just shifts the difference from same API with different behavior to different API to achieve equivalent behavior - route notifications changed to use RTA_MULTIPATH for add and replace - upated commit messages and cover letter v2 - fixed locking in patch 1 as noted by DaveM - changed user API for patch 2 to require an rtmsg with RTM_F_ALL_NEXTHOPS set in rtm_flags - revamped explanation of patch 2 and cover letter ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04	net: ipv6: Use compressed IPv6 addresses showing route replace error	David Ahern
	ip6_print_replace_route_err logs an error if a route replace fails with IPv6 addresses in the full format. e.g,: IPv6: IPV6: multipath route replace failed (check consistency of installed routes): 2001:0db8:0200:0000:0000:0000:0000:0000 nexthop 2001:0db8:0001:0000:0000:0000:0000:0016 ifi 0 Change the message to dump the addresses in the compressed format. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>